# Data Preparation
## Part 1 of the Workshop "Text Classification - From Zero to Hero", by Dr. Omri Allouche, Gong.io, Bar Ilan University
In this notebook, we prepare the data for analysis and save it as CSV files.  
We create separate files for training, validation and testing, named `train.csv`, `val.csv` and `test.csv`.  
The Fast-BERT package takes as input a folder containing these three files, plus a file listing the possible labels.

In the notebook [Bag of Words and Tf-Idf](bow_tfidf.ipynb) we perform basic Exploratory Data Analysis on the dataset.

For this exercise, we will use only four of the newsgroups: `rec.sport.baseball`, `rec.sport.hockey`, `talk.politics.guns` and `talk.politics.mideast`.  
We will also keep only the text of each post, removing its headers, footers and quoted replies.

In [1]:
from sklearn.datasets import fetch_20newsgroups

types_to_remove = ('headers', 'footers', 'quotes')
newsgroups_categories = ['rec.sport.baseball', 'rec.sport.hockey', 'talk.politics.guns', 'talk.politics.mideast']

newsgroups_data = fetch_20newsgroups(subset='all',
                                      categories=newsgroups_categories,
                                      remove=types_to_remove)

In [2]:
import pandas as pd
df = pd.DataFrame({'data': newsgroups_data.data,
                   'target': newsgroups_data.target,
                   'target_name': [newsgroups_data.target_names[x] for x in newsgroups_data.target]})

In [3]:
df.shape

(3843, 3)

Next, we'll perform basic preprocessing of the text.  

In [4]:
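def preprocess_text(txt):
    # Replace newlines, carriage returns and tabs with spaces
    txt = txt.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ')
    # Put a space before '?', '.' and ',' so they become separate tokens
    txt = txt.replace('?', ' ?').replace('.', ' .').replace(',', ' ,')
    txt = txt.lower().strip()
    # Collapse runs of spaces into a single space
    txt = " ".join([w for w in txt.split(' ') if w != ''])
    return txt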
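
To make the effect of `preprocess_text` concrete, here is a small self-contained sketch that repeats the function and applies it to a made-up sentence. The sample string is an assumption for illustration and is not taken from the dataset.

```python
def preprocess_text(txt):
    # Same logic as the notebook cell above, repeated so this sketch runs on its own
    txt = txt.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ').replace('?', ' ?').replace('.', ' .').replace(',', ' ,')
    txt = txt.lower().strip()
    txt = txt.split(' ')
    txt = " ".join([w for w in txt if w != ''])
    return txt

# Hypothetical sample text, not from the 20 Newsgroups data
sample = "Hello, World!\nIs this\tOK?  Yes."
processed_sample = preprocess_text(sample)
print(processed_sample)
# prints: hello , world! is this ok ? yes .
```

Note that only `?`, `.` and `,` are split off as separate tokens; other punctuation, such as `!`, stays attached to the preceding word.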

In [5]:
df['data_processed'] = df['data'].apply(preprocess_text)

In [6]:
# Let's filter out documents that are empty after our preprocessing
df['num_chars'] = df['data_processed'].apply(len)
df = df[ df['num_chars']>0 ]

In [7]:
print(df.iloc[1]['data'])
print(df.iloc[1]['data_processed'])

...

If we are indeed talking about CS, then this is not quite accurate. CS is
"just" tear gas--albeit the worst kind. It isn't a nausea gas, and doesn't
have direct CNS effects. However, it's quite bad--much worse than CN gas. I
was briefly exposed to it once (during an engagement in Berkeley circa 1968
8^) and it's not the kind of thing you forget. It seems to be
moisture-activated--it not only made my eyes sting and water, but attacked
my breathing passages and lungs. Breathing was painful, and my entire face
felt as if it was on fire. These effects persisted for hours after
exposure, and I was coughing for days afterwards.  If I was exposed to a
dense concentration of this stuff in a closed space for several hours, I
doubt whether I could find the exit. Indeed, I can't imagine living through
it.


. . . if we are indeed talking about cs , then this is not quite accurate . cs is "just" tear gas--albeit the worst kind . it isn't a nausea gas , and doesn't have direct cns effects . ho

In [19]:
# Keep only short documents: at most 300 characters after preprocessing
df = df[ df['num_chars']<=300 ]
df.shape

(1092, 6)
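
The 300-character cutoff leaves 1092 documents. As a rough sketch of how one might check what other cutoffs would retain, the snippet below counts documents per threshold on a small made-up DataFrame. The toy texts and the candidate thresholds are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame, with texts of known lengths
toy = pd.DataFrame({'data_processed': ['a' * 50, 'b' * 250, 'c' * 300, 'd' * 301, 'e' * 1200]})
toy['num_chars'] = toy['data_processed'].apply(len)

# Number of documents that survive each candidate maximum length
kept_per_threshold = {t: int((toy['num_chars'] <= t).sum()) for t in (100, 300, 1000)}
print(kept_per_threshold)
# prints: {100: 1, 300: 3, 1000: 4}
```

Running the same dictionary comprehension against the real `df['num_chars']` column, before the filter is applied, would show how much data each cutoff keeps.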

In [20]:
import os
os.makedirs('./data', exist_ok=True)

In [21]:
# Save the entire dataset to a single file
df.to_csv('data/20newsgroups.csv', index=False)

In [22]:
# Save files for train, test and validation
df = df.reset_index().rename({'data_processed': 'text', 'target_name': 'label'}, axis=1)
df_train = df.sample(frac=0.7, random_state=0)
df_test = df.drop(df_train.index)
df_val = df_test.sample(frac=0.5, random_state=0)
df_test = df_test.drop(df_val.index)
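
The split above takes 70% of the rows for training and divides the remainder equally between validation and test. The sketch below repeats the same steps on a toy DataFrame of 100 rows (an assumption for illustration) and checks that the three parts do not overlap and together cover every row.

```python
import pandas as pd

# Toy data with a unique default index, as df has after reset_index()
toy = pd.DataFrame({'text': [f'doc {i}' for i in range(100)]})

toy_train = toy.sample(frac=0.7, random_state=0)
toy_rest = toy.drop(toy_train.index)           # the 30% not used for training
toy_val = toy_rest.sample(frac=0.5, random_state=0)
toy_test = toy_rest.drop(toy_val.index)

train_idx, val_idx, test_idx = set(toy_train.index), set(toy_val.index), set(toy_test.index)
disjoint = not (train_idx & val_idx or train_idx & test_idx or val_idx & test_idx)
covered = (train_idx | val_idx | test_idx) == set(toy.index)

print(len(toy_train), len(toy_val), len(toy_test), disjoint, covered)
# prints: 70 15 15 True True
```

Dropping rows by index label only works as a split when the index values are unique, which is why the index is reset before sampling.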

In [23]:
df_test.head()

| | level_0 | index | data | target | label | text | num_chars |
|---|---|---|---|---|---|---|---|
| 11 | 17 | 35 | \n\nYeah Valentine, how many rings does Clemen... | 0 | rec.sport.baseball | yeah valentine , how many rings does clemens h... | 248 |
| 23 | 39 | 95 | I have read that there will be some concrete p... | 3 | talk.politics.mideast | i have read that there will be some concrete p... | 249 |
| 24 | 41 | 98 | DON MATTINGLY IS THE BEST FIRST BASEMAN IN THE... | 0 | rec.sport.baseball | don mattingly is the best first baseman in the... | 148 |
| 36 | 62 | 137 | KC(?) news was doing a report on that. They s... | 0 | rec.sport.baseball | kc( ?) news was doing a report on that . they ... | 229 |
| 41 | 71 | 156 | I saw a previous request for the Rules and Ins... | 1 | rec.sport.hockey | i saw a previous request for the rules and ins... | 212 |


In [24]:
df_train.to_csv('data/train.csv', index=False)
df_val.to_csv('data/val.csv', index=False)
df_test.to_csv('data/test.csv', index=False)

In [25]:
# Save a labels.csv file, needed for fast-bert: one label per line, without a header row
pd.Series(df['label'].unique()).to_csv('data/labels.csv', index=False, header=False)
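
As a final sanity check, one could confirm that the output folder holds the three data files and the labels file before passing it on. The sketch below is one possible way to do this. `missing_files` is a hypothetical helper, not part of Fast-BERT, and the temporary folder with only two files stands in for `data/` so that the example runs on its own.

```python
import os
import tempfile

EXPECTED_FILES = ['train.csv', 'val.csv', 'test.csv', 'labels.csv']

def missing_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder` (hypothetical helper)."""
    return [name for name in expected if not os.path.isfile(os.path.join(folder, name))]

# Demonstration on a temporary folder that deliberately lacks two of the files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ('train.csv', 'val.csv'):
        with open(os.path.join(tmp_dir, name), 'w') as f:
            f.write('text,label\n')
    missing = missing_files(tmp_dir)

print(missing)
# prints: ['test.csv', 'labels.csv']
```

In the notebook itself, `missing_files('data')` should return an empty list once all the cells above have run.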

  
