# Data Preparation
## Part 1 of the Workshop "Text Classification - From Zero to Hero", by Dr. Omri Allouche, Gong.io, Bar Ilan University
In this notebook, we prepare the data for analysis and save it as CSV files.  
We create separate files for the train, validation and test sets, named `train.csv`, `val.csv` and `test.csv`.  
The Fast-BERT package takes as input a folder containing these 3 files, plus a file listing the possible labels.

In the notebook [Bag of Words and Tf-Idf](bow_tfidf.ipynb) we perform basic Exploratory Data Analysis on the dataset.

For this exercise, we use only four of the groups: `rec.sport.baseball`, `rec.sport.hockey`, `talk.politics.guns` and `talk.politics.mideast`.  
We also keep only the post text, removing headers, footers and quoted replies.

In [48]:
from sklearn.datasets import fetch_20newsgroups

types_to_remove = ('headers', 'footers', 'quotes')
newsgroups_categories = ['rec.sport.baseball', 'rec.sport.hockey', 'talk.politics.guns', 'talk.politics.mideast']

newsgroups_data = fetch_20newsgroups(subset='all',
                                      categories=newsgroups_categories,
                                      remove=types_to_remove)

In [49]:
import pandas as pd
df = pd.DataFrame({'data': newsgroups_data.data, 
                   'target': newsgroups_data.target,
                  'target_name': [newsgroups_data.target_names[x] for x in newsgroups_data.target]})
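
The `target_name` column is built by looking up each integer target in the list of group names. A minimal sketch of that lookup on stand-in data (the `target_names` and `target` values below are made up for illustration, not taken from the dataset), with NumPy indexing shown as an equivalent alternative:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for newsgroups_data.target_names and newsgroups_data.target
target_names = ['rec.sport.baseball', 'rec.sport.hockey']
target = np.array([1, 0, 1])

# List-comprehension lookup, as in the cell above
names_loop = [target_names[x] for x in target]

# Equivalent vectorized lookup using NumPy integer-array indexing
names_vec = np.array(target_names)[target]

toy_df = pd.DataFrame({'target': target, 'target_name': names_vec})
print(toy_df)
```

Both variants produce the same names; the list comprehension is perfectly adequate for a few thousand rows.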

In [50]:
df.shape

(3843, 3)

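Next we'll perform basic preprocessing of the text: replace line breaks and tabs with spaces, separate `?`, `.` and `,` from the preceding word, lowercase everything, and collapse repeated spaces.  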

In [51]:
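def preprocess_text(txt):
    # Replace line breaks and tabs with spaces, and put a space before '?', '.' and ','
    # so that they become separate tokens
    txt = txt.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ').replace('?', ' ?').replace('.', ' .').replace(',', ' ,')
    # Lowercase and trim leading/trailing whitespace
    txt = txt.lower().strip()
    # Collapse runs of spaces by dropping the empty strings produced by split(' ')
    txt = txt.split(' ')
    txt = " ".join([w for w in txt if w != ''])
    return txt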
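
To see what this function does in practice, here is a small sketch that applies it to two made-up strings. The function is repeated so the block runs on its own; the sample strings are illustrative only.

```python
def preprocess_text(txt):
    txt = txt.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ').replace('?', ' ?').replace('.', ' .').replace(',', ' ,')
    txt = txt.lower().strip()
    txt = txt.split(' ')
    txt = " ".join([w for w in txt if w != ''])
    return txt

# Line breaks and tabs become single spaces; '?', '.' and ',' become separate tokens
result = preprocess_text("Hello, world.\nHow are\tyou?")
print(result)  # hello , world . how are you ?

# Other punctuation (such as '!') is left attached to the word
result_other = preprocess_text("Oops! Wait...")
print(result_other)  # oops! wait . . .
```

Note that only `?`, `.` and `,` are split off, so tokens such as `oops!` keep their punctuation.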

In [52]:
df['data_processed'] = df['data'].apply(preprocess_text)

In [53]:
# Let's filter out documents that are empty after our preprocessing
df['num_chars'] = df['data_processed'].apply(len)
df = df[ df['num_chars']>0 ]
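
A minimal sketch of the same filter on a toy DataFrame (the rows are made up), using the `.str.len()` accessor as an equivalent to `.apply(len)`:

```python
import pandas as pd

# Toy data: the second document is empty after preprocessing
toy = pd.DataFrame({'data_processed': ['some text', '', 'more']})

# Character count per document; equivalent to .apply(len) for string columns
toy['num_chars'] = toy['data_processed'].str.len()

# Keep only non-empty documents
toy = toy[toy['num_chars'] > 0]
print(toy)
```

The filtered frame keeps its original row labels (here 0 and 2), so the index has gaps until it is reset.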

In [54]:
print(df.iloc[1]['data'])
print(df.iloc[1]['data_processed'])

...

If we are indeed talking about CS, then this is not quite accurate. CS is
"just" tear gas--albeit the worst kind. It isn't a nausea gas, and doesn't
have direct CNS effects. However, it's quite bad--much worse than CN gas. I
was briefly exposed to it once (during an engagement in Berkeley circa 1968
8^) and it's not the kind of thing you forget. It seems to be
moisture-activated--it not only made my eyes sting and water, but attacked
my breathing passages and lungs. Breathing was painful, and my entire face
felt as if it was on fire. These effects persisted for hours after
exposure, and I was coughing for days afterwards.  If I was exposed to a
dense concentration of this stuff in a closed space for several hours, I
doubt whether I could find the exit. Indeed, I can't imagine living through
it.


. . . if we are indeed talking about cs , then this is not quite accurate . cs is "just" tear gas--albeit the worst kind . it isn't a nausea gas , and doesn't have direct cns effects . ho

In [55]:
import os
os.makedirs('./data', exist_ok=True)

In [56]:
# Save the entire dataset to a single file
df.to_csv('data/20newsgroups.csv', index=False)

In [57]:
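# Split into train (70%), validation (15%) and test (15%) sets
# reset_index() gives consecutive row labels after the filtering above (the old ones are kept in an `index` column)
df = df.reset_index().rename({'data_processed': 'text', 'target_name': 'label'}, axis=1)
df_train = df.sample(frac=0.7, random_state=0)
df_test = df.drop(df_train.index)  # the remaining 30%
df_val = df_test.sample(frac=0.5, random_state=0)  # half of the remainder
df_test = df_test.drop(df_val.index)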
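
The sample-then-drop pattern above yields three disjoint subsets. A small sketch on a toy DataFrame of 100 made-up rows that checks the resulting sizes and the absence of overlap:

```python
import pandas as pd

# Toy data with a clean 0..99 index
toy = pd.DataFrame({'text': [f'doc {i}' for i in range(100)]})

train = toy.sample(frac=0.7, random_state=0)   # 70% of the rows
rest = toy.drop(train.index)                   # the remaining 30%
val = rest.sample(frac=0.5, random_state=0)    # half of the remainder
test = rest.drop(val.index)                    # the other half

print(len(train), len(val), len(test))  # 70 15 15

# Every row lands in exactly one split
all_idx = set(train.index) | set(val.index) | set(test.index)
no_overlap = len(all_idx) == len(train) + len(val) + len(test)
print(no_overlap)  # True
```

This split is purely random. If class balance across splits matters, a stratified split would be an alternative design choice.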

In [58]:
df_test.head()

| | index | data | target | label | text | num_chars |
|---|---|---|---|---|---|---|
| 0 | 0 | Oops! I came across this file from last year.... | 0 | rec.sport.baseball | oops! i came across this file from last year .... | 5944 |
| 12 | 12 | I basically agree, the Tigers are my favorite ... | 0 | rec.sport.baseball | i basically agree , the tigers are my favorite... | 250 |
| 21 | 21 | \nlittle.\n\nI know what you mean! I glow eve... | 1 | rec.sport.hockey | little . i know what you mean! i glow everytim... | 1275 |
| 26 | 26 | \nDo you really have *that* much information o... | 0 | rec.sport.baseball | do you really have *that* much information on ... | 1607 |
| 62 | 62 | (good point about registration schemes being ... | 2 | talk.politics.guns | (good point about registration schemes being u... | 214 |


In [59]:
df_train.to_csv('data/train.csv', index=False)
df_val.to_csv('data/val.csv', index=False)
df_test.to_csv('data/test.csv', index=False)

In [60]:
# Save a labels.csv file (one label per line, no header row), needed for fast-bert
pd.Series(df['label'].unique()).to_csv('data/labels.csv', index=False, header=False)
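
As a sanity check, the saved files can be read back to confirm that every label used in a split also appears in the labels file. The sketch below does this on made-up rows in a temporary folder. It assumes the labels file holds one label per line with no header row, and the names `toy` and `data_dir` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for a real split
toy = pd.DataFrame({
    'text': ['a b', 'c d', 'e f'],
    'label': ['rec.sport.hockey', 'talk.politics.guns', 'rec.sport.hockey'],
})

with tempfile.TemporaryDirectory() as data_dir:
    train_path = os.path.join(data_dir, 'train.csv')
    labels_path = os.path.join(data_dir, 'labels.csv')

    toy.to_csv(train_path, index=False)
    # Assumption: one label per line, no header row
    pd.Series(toy['label'].unique()).to_csv(labels_path, index=False, header=False)

    # Read both files back and compare the label sets
    labels = pd.read_csv(labels_path, header=None)[0].tolist()
    train = pd.read_csv(train_path)
    unknown = set(train['label']) - set(labels)

print(labels)
print(unknown)  # an empty set means every label in the split is listed
```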

  
