## Preprocessing the ToxPost data

This notebook walks through how the corpus of comments for the ToxPost project was preprocessed. The preprocessing phase can be divided into the following steps:
1. cleaning each comment: removing numbers, links, stopwords, hyphenation, non-ASCII characters, etc.
2. using TfIdf to shrink each comment to a length of at most 60 words
3. embedding each word in a comment in 100d space using GloVe
4. applying PCA to in turn embed each word into 25d space
5. applying the necessary padding
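
The padding step is not demonstrated further in this notebook. The following is a minimal sketch of what it could look like, assuming that padding means appending zero vectors until every comment has the same number of word vectors. The names `pad_comment`, `max_len` and `dim` are hypothetical and not taken from the ToxPost code.

```python
def pad_comment(vectors, max_len, dim):
    """Return exactly max_len vectors of length dim.

    Assumption: shorter comments are padded with zero vectors at the end and
    longer ones are truncated; the project's own padding may differ.
    """
    kept = [list(vector) for vector in vectors[:max_len]]
    padding = [[0.0] * dim for _ in range(max_len - len(kept))]
    return kept + padding


# Two toy "comments" with 2 and 1 word vectors in 3d space.
comments = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0]],
]
batch = [pad_comment(comment, max_len=4, dim=3) for comment in comments]
print([len(comment) for comment in batch])  # [4, 4]
```

After padding, every comment has the same shape, so the whole corpus can be stacked into a single array.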

In [1]:
# We begin by changing the current working directory:
import os
os.chdir("..")
print("the current working directory is: {}".format(os.getcwd()))

the current working directory is: /Users/Louis/ml_projects/ToxPost


In [2]:
# import Python modules:
import random

In [3]:
# import the necessary ToxPost modules:
from src.load import load_data
from src.features.clean_corpus import clean_corpus
from src.features.shrink_corpus import shrink_corpus
from src.features.embed_corpus import embed_corpus
from resources.glove.load import load_embedding

We begin by importing the raw data (a corpus of roughly 160,000 YouTube comments) and splitting it into __features__ (the comments themselves) and __labels__ (6 binary values per comment):

In [4]:
raw_data_path = "./data/raw/data.csv"

# the raw data:
raw_data = load_data(raw_data_path, header=True, id=True)
# the associated raw features:
raw_features = [datapoint[0] for datapoint in raw_data]
# the associated labels:
labels = [datapoint[1] for datapoint in raw_data]

To illustrate the preprocessing, we first build a corpus of 10 examples:

In [5]:
indices = random.sample(range(len(raw_data)), 10)
example_corpus = [raw_features[index] for index in indices]

for comment in example_corpus:
    print(" ".join(comment))
    print("\n")

It is a perfectly normal way to write. It should be god(s) with a small g though.


" baktun is NOT a period of ""years"" ... POD ... it is theorized or understood to be 400-tun or 20-katun. please verify content!!! ~M~ 67.164.145.1 "


*blushes* Thank you very much for the barnstar and your guidance. Your kind words are very much appreciated.-


" If ""assume good faith"" applies anywhere, it applies here."


The article was just fine as it was, stop changing it


wtf you temp block me for?


Why deletion that arictle?????


" — Preceding unsigned comment added by (talk • contribs) "


did not realize that he had only spoken with representatives of the Tokugawa Shogun


Thank you, I am working on getting the Town website corrected. Hopefully I can finally get this fixed after the election.




First, we clean up each comment in the corpus by:  

* removing numbers
* removing links
* removing punctuation
* replacing certain words according to a custom replacement list (to handle typos)
* removing stopwords
* removing extra whitespace
* removing articles
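
The cleaning steps above can be sketched in a few lines. This is only an illustration, not the implementation of `clean_corpus`: the function `clean_comment` and the lists `STOPWORDS` and `REPLACEMENTS` are hypothetical, and the project's real stopword and replacement lists are not shown in this notebook.

```python
import re

# Tiny illustrative lists (assumptions, not the project's real lists).
STOPWORDS = {"it", "is", "a", "an", "the", "to", "be", "with", "was", "as", "you", "me", "for"}
REPLACEMENTS = {"plz": "please"}  # hypothetical typo fix


def clean_comment(text):
    """Lowercase a comment and apply the cleaning steps listed above."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = re.sub(r"[^a-z\s]", "", text)                 # remove punctuation and other non-letters
    words = text.split()                                 # splitting also drops extra whitespace
    words = [REPLACEMENTS.get(word, word) for word in words]
    return [word for word in words if word not in STOPWORDS]  # stopwords include articles


print(clean_comment("wtf you temp block me for?"))
# ['wtf', 'temp', 'block']
```

Links are removed before punctuation so that the URL pattern can still match on `://` and `.`.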

In [6]:
# clean the example corpus
cleaned_corpus = clean_corpus(example_corpus)
for i in range(0,10):
    print("the comment \n\n{}\n\nwas cleaned to \n\n{}\n\n\n".format(" ".join(example_corpus[i]), " ".join(cleaned_corpus[i])))

100%|██████████| 10/10 [00:00<00:00, 658.85it/s]

the comment 

It is a perfectly normal way to write. It should be god(s) with a small g though.

was cleaned to 

perfectly normal way write gods small though



the comment 

" baktun is NOT a period of ""years"" ... POD ... it is theorized or understood to be 400-tun or 20-katun. please verify content!!! ~M~ 67.164.145.1 "

was cleaned to 

baktun period years pod theorized understood tun katun please verify content



the comment 

*blushes* Thank you very much for the barnstar and your guidance. Your kind words are very much appreciated.-

was cleaned to 

blushes thank much barnstar guidance kind words much appreciated



the comment 

" If ""assume good faith"" applies anywhere, it applies here."

was cleaned to 

assume good faith applies anywhere applies



the comment 

The article was just fine as it was, stop changing it

was cleaned to 

article fine stop changing



the comment 

wtf you temp block me for?

was cleaned to 

wtf temp block



the comment 

Why deletion that




Next, we shrink the length of each comment.  

To do this, we first compute the TfIdf matrix of the corpus. Recall that, in its simplest form, this produces a matrix \\(M\\) whose \\((i,j)\\) entry is the number of times word \\(j\\) appears in comment \\(i\\) divided by the number of comments in which word \\(j\\) appears (many implementations additionally log-scale the document-frequency term).  
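
As a small worked example of this definition, the sketch below builds the matrix for a toy corpus. It implements only the simplified ratio described above (term count divided by document count). The function name `simple_tfidf` and the third toy comment are made up for illustration, and `shrink_corpus` may compute its scores differently.

```python
from collections import Counter


def simple_tfidf(corpus):
    """Return (vocabulary, matrix) with matrix[i][j] = tf(i, j) / df(j)."""
    vocabulary = sorted({word for comment in corpus for word in comment})
    # df[word] = number of comments that contain the word at least once
    df = Counter(word for comment in corpus for word in set(comment))
    matrix = []
    for comment in corpus:
        tf = Counter(comment)  # tf[word] = occurrences of the word in this comment
        matrix.append([tf[word] / df[word] for word in vocabulary])
    return vocabulary, matrix


corpus = [
    ["assume", "good", "faith", "applies", "anywhere", "applies"],
    ["article", "fine", "stop", "changing"],
    ["good", "article"],  # invented toy comment
]
vocabulary, M = simple_tfidf(corpus)
print(M[0][vocabulary.index("applies")])  # 2.0: twice in comment 0, found in 1 comment
print(M[0][vocabulary.index("good")])     # 0.5: once in comment 0, found in 2 comments
```

Words that are frequent within a comment but rare across the corpus receive the highest scores.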

As a second step, for each comment, we rank the words by their TfIdf score and keep only the highest-scoring ones:
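
A minimal sketch of this selection step, under the assumption that the surviving words keep their original order and that comments already short enough are returned unchanged. The function `keep_top_words` and the scores are hypothetical; the behaviour of `shrink_corpus` itself is shown in the cell that follows.

```python
def keep_top_words(comment, scores, max_len):
    """Keep the max_len highest-scoring words of a comment, in original order."""
    if len(comment) <= max_len:
        return list(comment)
    # Rank word positions by score, highest first; sorted() is stable,
    # so words with equal scores keep their relative order.
    ranked = sorted(range(len(comment)), key=lambda pos: scores[comment[pos]], reverse=True)
    kept_positions = sorted(ranked[:max_len])
    return [comment[pos] for pos in kept_positions]


comment = ["please", "verify", "content", "baktun", "period"]
# Made-up scores for illustration only.
scores = {"please": 0.1, "verify": 0.4, "content": 0.3, "baktun": 1.0, "period": 0.5}
print(keep_top_words(comment, scores, 3))  # ['verify', 'baktun', 'period']
```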

In [7]:
# shrink each comment to size 20:
shrunken_corpus = shrink_corpus(cleaned_corpus, 20)
for i in range(0,10):
    print("the comment \n\n{}\n\nwas shrunken to \n\n{}\n\n\n".format(" ".join(cleaned_corpus[i]), " ".join(shrunken_corpus[i])))

100%|██████████| 10/10 [00:00<00:00, 772.26it/s]

the comment 

perfectly normal way write gods small though

was shrunken to 

perfectly normal way write gods small though



the comment 

baktun period years pod theorized understood tun katun please verify content

was shrunken to 

baktun period years pod theorized understood tun katun please verify content



the comment 

blushes thank much barnstar guidance kind words much appreciated

was shrunken to 

blushes thank much barnstar guidance kind words much appreciated



the comment 

assume good faith applies anywhere applies

was shrunken to 

assume good faith applies anywhere applies



the comment 

article fine stop changing

was shrunken to 

article fine stop changing



the comment 

wtf temp block

was shrunken to 

wtf temp block



the comment 

deletion arictle

was shrunken to 

deletion arictle



the comment 

— preceding unsigned comment added talk contribs

was shrunken to 

preceding unsigned comment added talk contribs



the comment 

realize spoken represent




Our next step is to embed each comment into a vector space.  
To do this, we load the pretrained [GloVe embedding](https://nlp.stanford.edu/pubs/glove.pdf),
which embeds each word in the vocabulary into \\(\mathbb{R}^{25}\\).  

After that, we apply PCA to reduce the embedding space to \\(\mathbb{R}^{10}\\) (the value of `dim` in the cell below).

Finally, for each comment, we replace each word with its embedding:
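
The PCA and lookup steps can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the code behind `embed_corpus`: the helper names `reduce_embedding` and `embed_comment` are hypothetical, the toy embedding is random, and skipping words that are missing from the vocabulary is an assumption.

```python
import numpy as np


def reduce_embedding(embedding, dim):
    """Project every word vector onto the top `dim` principal components."""
    words = list(embedding)
    X = np.array([embedding[word] for word in words], dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    reduced = X_centered @ Vt[:dim].T
    return {word: reduced[i] for i, word in enumerate(words)}


def embed_comment(comment, embedding):
    """Replace each word by its vector (assumption: unknown words are skipped)."""
    return [embedding[word] for word in comment if word in embedding]


rng = np.random.default_rng(0)
vocabulary = ["wtf", "temp", "block", "article", "fine", "stop"]
toy_embedding = {word: rng.normal(size=5) for word in vocabulary}  # random 5d stand-in for GloVe

reduced = reduce_embedding(toy_embedding, dim=2)
vectors = embed_comment(["wtf", "temp", "block", "unknownword"], reduced)
print(len(vectors), vectors[0].shape)  # 3 (2,)
```

Each comment becomes a list of low-dimensional vectors, one per known word, matching the list-of-arrays structure printed by the notebook.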

In [8]:
# embed the comments into 25d space using the GloVe embedding:
embedding_path = "./resources/glove/glove.twitter.27B.25d.txt"
embedding = load_embedding(embedding_path)
# apply PCA to reduce the embedding space to 10d
dim = 10
embedded_corpus = embed_corpus(shrunken_corpus, embedding, dim)
for i in range(0,10):
    print("the comment \n\n{}\n\nwas embedded to \n\n{}\n\n\n".format(" ".join(shrunken_corpus[i]), embedded_corpus[i]))

100%|██████████| 10/10 [00:00<00:00, 21642.44it/s]

the comment 

perfectly normal way write gods small though

was embedded to 

[array([-0.24193238, -0.03976884,  1.47509678,  0.97681112, -1.93304001,
        0.02535983,  0.69377276,  0.41232793,  0.43330005,  0.11392516]), array([-0.40167002,  0.97490409, -1.46951848,  1.54668204, -0.36695469,
        1.87608986,  0.26783316, -0.16639932, -0.57314258, -0.01347903]), array([-2.09550734,  0.65870088, -0.04530984, -0.51514199, -0.17846698,
        0.52045516,  0.06560927,  0.06956183,  0.53142837,  0.20438685]), array([-1.25724584, -1.60114076,  0.29954597,  0.09081673, -0.12156692,
       -0.0947948 , -0.56225194,  0.42476465, -0.01721878, -0.53428182]), array([ 0.37628384, -0.18310109,  1.05691994, -1.15116073,  0.44885865,
        1.05957712,  1.206314  ,  0.39845276,  0.26111729, -0.30816421]), array([-0.65253115,  0.41351668, -0.19612429, -0.50335689, -1.42096687,
        0.16513254,  0.48909643, -0.07045761,  1.16277152,  0.11439013]), array([-2.17196824,  0.73884306,  0.44868645,




Finally, we apply the above preprocessing procedures to the whole corpus and write the results to the appropriate data files:

In [9]:
# importing this module runs the full preprocessing pipeline on the whole corpus
import src.features.preprocess

..loading the data..


  0%|          | 70/159570 [00:00<04:26, 597.79it/s]

..using NLP to clean each comment in the corpus..


100%|██████████| 159570/159570 [03:20<00:00, 796.38it/s] 


..writing the results to the cleaned data file
..using TfIdf to shrink each comment in the corpus to size 100..



100%|██████████| 159570/159570 [03:57<00:00, 671.33it/s]


..writing the results to the shrunken data file..

..loading the Glove embedding..
..using PCA to reduce the embedding space of each word to size 20..


100%|██████████| 159570/159570 [00:02<00:00, 59550.16it/s]


..writing the results to the embedded data file. (this takes a while)..

..data preprocesssed..
