## Preprocessing the data

In this notebook we walk through how the corpus of comments for the ToxPost project was preprocessed. The preprocessing phase can be divided into the following steps:

1. cleaning each comment: removing numbers, links, stopwords, hyphenation, non-ASCII characters etc.
2. using TfIdf to shrink each comment to a length of at most 60
3. embedding each word in a comment in 100d space using GloVe
4. applying PCA to embed each word, in turn, into 25d space
5. applying the necessary padding
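
The padding step (step 5) is not demonstrated further down in this notebook, so here is a minimal sketch of what it could look like. It assumes each comment is a list of word vectors (with `None` for words that have no embedding) and that short comments are filled up with zero vectors; the name `pad_comment` and both of these choices are hypothetical and not taken from the ToxPost code.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def pad_comment(vectors: Sequence[Optional[Vector]], max_len: int, dim: int) -> List[Vector]:
    """Return exactly max_len vectors of length dim (hypothetical helper)."""
    # Assumption: words without an embedding (None) become zero vectors.
    rows = [list(v) if v is not None else [0.0] * dim for v in vectors]
    # Truncate comments that are too long ...
    rows = rows[:max_len]
    # ... and append zero vectors to comments that are too short.
    rows.extend([0.0] * dim for _ in range(max_len - len(rows)))
    return rows


padded = pad_comment([[1.0, 2.0], None, [3.0, 4.0]], max_len=5, dim=2)
truncated = pad_comment([[1.0], [2.0], [3.0]], max_len=2, dim=1)
```

After padding, every comment has the same shape (`max_len` rows of `dim` values), which is what most sequence models expect as input.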

In [1]:
# We begin by changing the current working directory:
import os
os.chdir("..")
print("the current working directory is: {}".format(os.getcwd()))

the current working directory is: /Users/Louis/ml_projects/ToxPost


In [2]:
# import Python modules:
import random

In [3]:
# import the necessary ToxPost modules:
from src.data.make_dataset import load_data
from src.features.clean_corpus import clean_corpus
from src.features.shrink_corpus import shrink_corpus
from src.features.embed_corpus import embed_corpus
from resources.glove.load_embedding import load_embedding

We begin by importing the raw data (a corpus of roughly 160,000 YouTube comments) and splitting it into __features__ and __labels__, where each label consists of 6 binary values.

In [4]:
raw_data_path = "./data/raw/data.csv"

# the raw data:
raw_data = load_data(raw_data_path, header=True, id=True)
# the associated raw features:
raw_features = [datapoint[0] for datapoint in raw_data]
# the associated labels:
labels = [datapoint[1] for datapoint in raw_data]
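
The implementation of `load_data` is not shown in this notebook. As a rough sketch of what such a loader might do, the following assumes that each CSV row holds an id, the comment text and six 0/1 label columns, and that each comment is split on whitespace into a list of words (the later cells join comments with `" ".join(...)`, which suggests a list of tokens). The function name `load_rows`, the parameter `has_id` and the column names are all made up for illustration.

```python
import csv
import io


def load_rows(handle, header=True, has_id=True):
    """Read (words, labels) pairs from a CSV file handle (hypothetical loader)."""
    reader = csv.reader(handle)
    if header:
        next(reader, None)  # skip the header row
    data = []
    for row in reader:
        if not row:
            continue  # ignore empty lines
        if has_id:
            row = row[1:]  # drop the id column
        text, label_fields = row[0], row[1:]
        data.append((text.split(), [int(value) for value in label_fields]))
    return data


# A tiny in-memory stand-in for the real data file; the column names are assumptions.
sample = io.StringIO(
    "id,comment_text,l1,l2,l3,l4,l5,l6\n"
    'a1,"You are, frankly, wrong",0,0,0,0,0,0\n'
    "a2,some rude comment,1,0,1,0,1,0\n"
)
rows = load_rows(sample)
features = [row[0] for row in rows]
label_rows = [row[1] for row in rows]
```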

To illustrate the preprocessing, we first build a corpus of 10 examples:

In [6]:
indices = random.sample(range(len(raw_data)), 10)
example_corpus = [raw_features[index] for index in indices]

for comment in example_corpus:
    print(" ".join(comment))
    print("\n")

Mezzo's Biased editing in Controversy Mezzo stop Neither You can not delete all the articles from Barelwi page nor will be successful in inserting wahabi Propganda in this page .The news report You cited is lacking credentials and full report if you accept will not be digestable by You. (t • c)


actually== i get to do whatever i want to do. DO NOT SUGGEST OTHERWISE. when i do something, it is not vandalism. JUST BECAUSE I SAID SO. ==


If you actually looked at the end of the article, I cited a reference. (82.28.237.200 )


" BIASED/SELECTIVE EDITING OF WALMART ENTRY == User:Rollback Guy User:Salamuraiand User: Gscshoyru You are hereby charged with adding biased information to the [walmart] wikipedia entry. Though it is currently unknown exactly WHO added the link to a known corporate propaganda site, records reviewed so far indicate the following: On 04:29, 30 June 2007 at 0429hrs User:Rollback Guy changed link from ""public relations"" to ""walmart facts"" On 30 July 2007 at 0208hrs

First, we clean up each comment in the corpus by:  

* removing numbers
* removing links
* removing punctuation
* replacing certain words according to a custom replacement list (to handle typos)
* removing stopwords
* removing extra whitespace
* removing articles
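
The cleaning steps listed above can be sketched with the standard library alone. This is only an illustration, not the actual `clean_corpus` implementation: the stopword list and the replacement list below are tiny hypothetical stand-ins, and the order of the steps is an assumption.

```python
import re
import string

# Tiny illustrative stopword list (it also covers the articles); the real list is not shown here.
STOPWORDS = {"a", "an", "the", "i", "you", "if", "at", "of", "to", "do", "is", "it", "not", "in", "and", "so"}
# Hypothetical replacement list for common typos and abbreviations.
REPLACEMENTS = {"u": "you", "propganda": "propaganda"}


def clean_comment(text):
    """Return the cleaned comment as a list of lowercase words."""
    text = text.lower()
    # Remove links before punctuation is stripped, while URLs are still recognisable.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Drop non-ASCII characters.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Splitting on whitespace also removes extra whitespace.
    words = text.split()
    words = [REPLACEMENTS.get(word, word) for word in words]
    # Remove tokens that consist of digits only.
    words = [word for word in words if not word.isdigit()]
    # Remove stopwords (including articles).
    return [word for word in words if word not in STOPWORDS]


cleaned = clean_comment("If you actually looked at the end of the article, I cited a reference. (82.28.237.200 )")
print(" ".join(cleaned))
```

Only tokens made up entirely of digits are dropped here, so a mixed token such as `0429hrs` would survive.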

In [7]:
# clean the example corpus
cleaned_corpus = clean_corpus(example_corpus)
for i in range(0,10):
    print("the comment \n\n{}\n\nwas cleaned to \n\n{}\n\n\n".format(" ".join(example_corpus[i]), " ".join(cleaned_corpus[i])))

100%|██████████| 10/10 [00:00<00:00, 515.47it/s]

the comment 

Mezzo's Biased editing in Controversy Mezzo stop Neither You can not delete all the articles from Barelwi page nor will be successful in inserting wahabi Propganda in this page .The news report You cited is lacking credentials and full report if you accept will not be digestable by You. (t • c)

was cleaned to 

mezzos biased editing controversy mezzo stop neither delete articles barelwi page successful inserting wahabi propganda page news report cited lacking credentials full report accept digestable



the comment 

actually== i get to do whatever i want to do. DO NOT SUGGEST OTHERWISE. when i do something, it is not vandalism. JUST BECAUSE I SAID SO. ==

was cleaned to 

actually get whatever want suggest otherwise something vandalism said



the comment 

If you actually looked at the end of the article, I cited a reference. (82.28.237.200 )

was cleaned to 

actually looked end article cited reference



the comment 

" BIASED/SELECTIVE EDITING OF WALMART ENTRY == Us




Next, we shrink the length of each comment.  

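To do this, we first compute the TfIdf matrix of the corpus. Recall that this produces a matrix \\(M\\) whose \\((i,j)\\) entry grows with the number of times word \\(j\\) appears in comment \\(i\\) and shrinks with the number of comments word \\(j\\) appears in; in the standard formulation, the count is multiplied by \\(\log(N/n_j)\\), where \\(N\\) is the number of comments and \\(n_j\\) is the number of comments containing word \\(j\\).  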
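
To make the scoring concrete, here is a small sketch that computes one common TfIdf weighting, the raw count multiplied by \\(\log(N/n_j)\\), on a toy corpus. The exact weighting used inside `shrink_corpus` is not shown in this notebook, and library implementations often add smoothing and normalisation, so treat the function `tfidf_scores` as a hypothetical stand-in.

```python
import math
from collections import Counter


def tfidf_scores(corpus):
    """Return one {word: score} dict per comment, using count * log(N / n_j)."""
    n_docs = len(corpus)
    # n_j: the number of comments in which each word appears at least once.
    doc_freq = Counter()
    for comment in corpus:
        doc_freq.update(set(comment))
    scores = []
    for comment in corpus:
        counts = Counter(comment)
        scores.append(
            {word: count * math.log(n_docs / doc_freq[word]) for word, count in counts.items()}
        )
    return scores


toy_corpus = [
    ["biased", "editing", "page", "page"],
    ["biased", "walmart", "entry"],
    ["cited", "reference"],
]
toy_scores = tfidf_scores(toy_corpus)
```

In the first toy comment, `page` occurs twice and in no other comment, so it scores higher than `biased`, which also appears in the second comment.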

As a second step, for each comment, we rank the words by their TfIdf score and keep only the highest-scoring ones:

In [11]:
# shrink each comment to size 20:
shrunken_corpus = shrink_corpus(cleaned_corpus, 20)
for i in range(0,10):
    print("the comment \n\n{}\n\nwas shrunken to \n\n{}\n\n\n".format(" ".join(cleaned_corpus[i]), " ".join(shrunken_corpus[i])))

100%|██████████| 10/10 [00:00<00:00, 565.90it/s]

the comment 

mezzos biased editing controversy mezzo stop neither delete articles barelwi page successful inserting wahabi propganda page news report cited lacking credentials full report accept digestable

was shrunken to 

mezzos controversy mezzo stop neither delete articles barelwi page successful inserting wahabi propganda page news report lacking credentials full report accept digestable



the comment 

actually get whatever want suggest otherwise something vandalism said

was shrunken to 

actually get whatever want suggest otherwise something vandalism said



the comment 

actually looked end article cited reference

was shrunken to 

actually looked end article cited reference



the comment 

biasedselective editing walmart entry userrollback guy usersalamuraiand user gscshoyru hereby charged adding biased information walmart wikipedia entry though currently unknown exactly added link known corporate propaganda site records reviewed far indicate following june 0429hrs user
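
The shrinking step can be sketched as follows. The sketch assumes that a `{word: score}` dict is already available for the comment, keeps the `max_words` highest-scoring distinct words and preserves the original word order. This matches what the example output above suggests (word order is unchanged and repeated words such as `page` survive), but the real `shrink_corpus` may work differently; `shrink_comment` is a hypothetical name.

```python
def shrink_comment(comment, scores, max_words):
    """Keep only the max_words highest-scoring distinct words, in their original order."""
    # Rank the distinct words by score, highest first (ties keep their insertion order).
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = set(ranked[:max_words])
    return [word for word in comment if word in keep]


toy_comment = ["biased", "editing", "page", "report", "page", "cited"]
# Made-up scores for illustration only.
toy_word_scores = {"biased": 0.4, "editing": 0.4, "page": 2.2, "report": 1.1, "cited": 0.7}
shrunken = shrink_comment(toy_comment, toy_word_scores, max_words=3)
print(" ".join(shrunken))
```

Comments that already contain at most `max_words` distinct words pass through unchanged, as with the shorter examples above.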




Our next step is to embed each comment into a vector space.  
To do this, we load the pretrained [GloVe embedding](https://nlp.stanford.edu/pubs/glove.pdf),
which embeds each word in the vocabulary into \\(\mathbb{R}^{25}\\).  

After that, we apply PCA to reduce the embedding space to \\(\mathbb{R}^{10}\\). 

Finally, for each comment, we simply replace each word with its embedding:

In [12]:
# load the pretrained 25d GloVe embedding:
embedding_path = "./resources/glove/glove.twitter.27B.25d.txt"
embedding = load_embedding(embedding_path)
# use PCA to reduce the embedding space to 10d and embed the comments:
dim = 10
embedded_corpus = embed_corpus(shrunken_corpus, embedding, dim)
for i in range(0,10):
    print("the comment \n\n{}\n\nwas embedded to \n\n{}\n\n\n".format(" ".join(shrunken_corpus[i]), embedded_corpus[i]))

100%|██████████| 10/10 [00:00<00:00, 6583.43it/s]

the comment 

mezzos controversy mezzo stop neither delete articles barelwi page successful inserting wahabi propganda page news report lacking credentials full report accept digestable

was embedded to 

[None, array([ 1.69124459, -0.8723022 , -0.14628378, -0.77887184, -0.22494147,
       -0.73765962, -0.45450419,  1.05619925, -0.18320105, -0.42267778]), array([ 2.72900457,  1.79943535,  1.17120738,  2.93609566,  0.45825952,
       -2.92265122,  0.52847979,  0.82854934, -0.38328871,  1.3082093 ]), array([-2.07690939, -0.11835476, -0.06871179,  0.94776088,  0.71361946,
       -0.70717535,  0.26383244,  0.22055578, -0.33807923, -0.68188404]), array([-1.02113698, -1.60127231, -0.20378808, -0.12844892,  0.76647795,
       -0.04617622,  0.0861091 , -0.18659671, -0.12465711, -0.14640386]), array([-0.69431818,  0.20298202, -1.56193045,  2.48184006,  0.4591212 ,
       -1.00498675,  0.91148123, -0.98106872,  0.37117279, -0.71746993]), array([ 1.10985507,  0.91167081, -2.49847365, -0.03008372,
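
The reduction and lookup can be sketched with NumPy on a toy embedding. This is not the `embed_corpus` implementation: it assumes that PCA is fitted on all word vectors of the embedding, and that words missing from the vocabulary are mapped to `None` (which the leading `None` in the output above suggests). The names `reduce_embedding` and `embed_comment` are hypothetical.

```python
import numpy as np


def reduce_embedding(embedding, dim):
    """Project every word vector onto the first dim principal components."""
    words = list(embedding)
    matrix = np.array([embedding[word] for word in words], dtype=float)
    # PCA works on centred data.
    centered = matrix - matrix.mean(axis=0)
    # The rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:dim].T
    return {word: reduced[i] for i, word in enumerate(words)}


def embed_comment(comment, embedding):
    """Replace each word by its vector; unknown words become None (assumption)."""
    return [embedding.get(word) for word in comment]


# A random 5d toy embedding stands in for the GloVe vectors.
rng = np.random.default_rng(0)
toy_embedding = {
    word: rng.normal(size=5)
    for word in ["page", "news", "report", "stop", "delete", "accept"]
}
reduced_embedding = reduce_embedding(toy_embedding, dim=2)
toy_vectors = embed_comment(["mezzos", "page", "news"], reduced_embedding)
```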




Finally, we apply the above preprocessing procedures to the whole corpus and write the results to the appropriate data files:

In [33]:
# importing this module runs the full preprocessing pipeline:
import src.features.preprocess

..loading the data..


  0%|          | 63/159570 [00:00<04:14, 626.95it/s]

..using NLP to clean each comment in the corpus.. 


100%|██████████| 159570/159570 [02:39<00:00, 997.59it/s] 


..writing the results to the cleaned data file
..using TfIdf to shrink each comment in the corpus to size 100..



100%|██████████| 159570/159570 [03:35<00:00, 740.56it/s]


..writing the results to the shrunken data file..
..loading the Glove embedding..
..using PCA to reduce the embedding space of each word to size 20..


100%|██████████| 159570/159570 [00:10<00:00, 14929.70it/s]


..writing the results to the embedded data file. (this takes a while)..
..data preprocesssed..
