In [1]:
# change the current working directory:
import os
os.chdir("..")
print("the current working directory is: {}".format(os.getcwd()))

the current working directory is: /Users/Louis/ml_projects/ToxPost


In [2]:
# import Python modules:
import random

In [3]:
# import ToxPost modules:
from src.data.make_dataset import load_data
from src.features.clean_corpus import clean_corpus
from src.features.shrink_corpus import shrink_corpus
from src.features.embed_corpus import embed_corpus
from resources.glove.load_embedding import load_embedding

In [4]:
# import the raw data:
raw_data_path = "./data/raw/data.csv"

# the raw data:
raw_data = load_data(raw_data_path, header=True, id=True)
# the associated raw features:
raw_features = [datapoint[0] for datapoint in raw_data]
# the associated labels:
labels = [datapoint[1] for datapoint in raw_data]
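
The cell above relies on `load_data` returning one `(features, labels)` pair per row, where the features are a list of words (the later cells call `" ".join(comment)` on them). As a rough illustration only, a loader with that shape might look like the sketch below. The function name `load_rows`, the column names, and the meaning given to the `header` and `has_id` flags are assumptions made for this example, not the actual `src.data.make_dataset` code.

```python
import csv
import io


def load_rows(handle, header=True, has_id=True):
    """Read (tokens, labels) pairs from a CSV handle (hypothetical layout)."""
    reader = csv.reader(handle)
    if header:
        next(reader, None)  # skip the row of column names
    rows = []
    for row in reader:
        if has_id:
            row = row[1:]  # drop the leading id column
        text, label_fields = row[0], row[1:]
        # features are whitespace-separated words, labels are integer flags
        rows.append((text.split(), [int(value) for value in label_fields]))
    return rows


# A two-row stand-in for the raw file; the column names are made up.
sample = io.StringIO(
    "id,comment_text,toxic,insult\n"
    '1,"Yes of course, thanks",0,0\n'
    '2,"You are a fool",1,1\n'
)
data = load_rows(sample)
features = [tokens for tokens, _ in data]
labels = [flags for _, flags in data]
```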

In [7]:
# let's look at a few typical features:
indices = random.sample(range(len(raw_data)), 10)
example_corpus = [raw_features[index] for index in indices]

In [8]:
for comment in example_corpus:
    print(" ".join(comment))
    print("\n")

Pat Smear Please read WP:LEDE#Length. For an article of this length, the lede should be 1-2 paragraphs long, not three sentences. It is a fallacy that the lede of an article is meant to do the bare minimum to introduce the subject; it should serve as an acceptable overview of the entire article. Right now roughly 75% of the article has no representation at all in the lede. I'll be re-tagging on my next pass if the lede remains inadequate. Chris Cunningham (not at work) - talk


I really like your picture of Saguaro National Park. Was that taken with a polarized lense?


Yes of course - it was supposed to have the word motorcycle - it has now thanks


" Question I'm sorry, I should have called you ""Mr. Phelps"" from the start. To call you by your Christian first name was certainly uncivil. Now please, Mr. Phelps, tell us why you did it. Tell us why you did WTC. You receieved a testimony, didn't you? You either received a testimony or you just hate our freedom. Perhaps the testimony tol

In [9]:
# Next, we clean up the corpus:

# we tokenize each comment,
# remove numbers, links, punctuation and articles,
# remove stopwords as defined in nltk.corpus.stopwords,
# and replace words using a custom list:
cleaned_corpus = clean_corpus(example_corpus)

# and display the cleaned-up corpus:
for i in range(0,10):
    print("the comment \n\n{}\n\nwas cleaned to \n\n{}\n\n\n".format(" ".join(example_corpus[i]), " ".join(cleaned_corpus[i])))

100%|██████████| 10/10 [00:00<00:00, 438.50it/s]

the comment 

Pat Smear Please read WP:LEDE#Length. For an article of this length, the lede should be 1-2 paragraphs long, not three sentences. It is a fallacy that the lede of an article is meant to do the bare minimum to introduce the subject; it should serve as an acceptable overview of the entire article. Right now roughly 75% of the article has no representation at all in the lede. I'll be re-tagging on my next pass if the lede remains inadequate. Chris Cunningham (not at work) - talk

was cleaned to 

pat smear please read wpledelength article length lede paragraphs long three sentences fallacy lede article meant bare minimum introduce subject serve acceptable overview entire article right roughly article representation lede ill retagging next pass lede remains inadequate chris cunningham work talk



the comment 

I really like your picture of Saguaro National Park. Was that taken with a polarized lense?

was cleaned to 

really like picture saguaro national park taken polarized
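
The cleaning steps listed in the cell above (lowercasing, stripping links, numbers and punctuation, dropping stopwords, applying a replacement list) can be sketched with the standard library alone. This is an illustration under assumptions, not the `clean_corpus` implementation: the regular expressions, the tiny `STOPWORDS` set (a stand-in for the NLTK list) and the `REPLACEMENTS` table are all hypothetical.

```python
import re
import string

# Tiny stand-in stopword set; the notebook uses NLTK's stopword list instead.
STOPWORDS = {"for", "an", "of", "this", "the", "should", "be", "not"}
# Hypothetical custom replacement table.
REPLACEMENTS = {"thx": "thanks"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_comment(text):
    """Return the cleaned tokens of a single comment."""
    text = URL_RE.sub(" ", text.lower())   # drop links
    text = NUMBER_RE.sub(" ", text)        # drop numbers
    text = text.translate(PUNCT_TABLE)     # drop ASCII punctuation
    tokens = [REPLACEMENTS.get(token, token) for token in text.split()]
    return [token for token in tokens if token not in STOPWORDS]


cleaned = clean_comment(
    "Please read WP:LEDE#Length. For an article of this length, "
    "the lede should be 1-2 paragraphs long, not three sentences."
)
print(" ".join(cleaned))
```

Deleting punctuation without inserting a space is what fuses `WP:LEDE#Length` into `wpledelength` and `I'll` into `ill`, as in the cleaned output above.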




In [11]:
# Next, we shrink each comment, keeping only the 20 words with the highest TF-IDF scores:
shrunken_corpus = shrink_corpus(cleaned_corpus, 20)
# and display the shrunken comments:
for i in range(0,10):
    print("the comment \n\n{}\n\nwas shrunken to \n\n{}\n\n\n".format(" ".join(cleaned_corpus[i]), " ".join(shrunken_corpus[i])))

100%|██████████| 10/10 [00:00<00:00, 588.82it/s]

the comment 

pat smear please read wpledelength article length lede paragraphs long three sentences fallacy lede article meant bare minimum introduce subject serve acceptable overview entire article right roughly article representation lede ill retagging next pass lede remains inadequate chris cunningham work talk

was shrunken to 

pat smear read wpledelength article length lede paragraphs long three sentences fallacy lede article meant bare minimum introduce subject serve acceptable article article lede lede work



the comment 

really like picture saguaro national park taken polarized lense

was shrunken to 

really like picture saguaro national park taken polarized lense



the comment 

yes course supposed word motorcycle thanks

was shrunken to 

yes course supposed word motorcycle thanks



the comment 

question sorry called phelps start call christian first name certainly uncivil please phelps tell tell wtc receieved testimony didnt either received testimony hate freedom per




In [27]:
# display the shrunken comments:


the comment 

added highly reference help reader understand kind insurance best suited level insurance applicable types domicile adjective necessarily mean sentence longer neutral

was shrunken to 

added highly reference help reader understand kind insurance best suited level insurance applicable types domicile adjective necessarily mean sentence longer neutral



the comment 

calpez really need back assertions far tell none teams exist main problem none places nations kind cant possibly national teams examples australian indigenous national football team hits google wikipedia christmas island national football team hits wikipedia wikipedia mirrors cocos keeling islands national football team hits wikipedia wikipedia mirrors could wasting time want save section please start looking reliable sources back information please bear mind wikipedia policy original research cant claim teams exist wikipedia pages thats things work 桜ん坊

was shrunken to 

back none teams exist none places cant 
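
To make the shrinking step concrete, here is a minimal TF-IDF sketch in plain Python. It is not the `shrink_corpus` implementation: the smoothed IDF formula, the alphabetical tie-breaking and the name `shrink_corpus_sketch` are assumptions. Because the shrunken output above retains repeated words such as `lede` and `article`, the sketch ranks distinct words and keeps every occurrence of the winners in their original order.

```python
import math
from collections import Counter


def shrink_corpus_sketch(corpus, size):
    """Keep, per comment, the `size` distinct words with the highest TF-IDF."""
    n_docs = len(corpus)
    # document frequency: in how many comments does each word occur?
    doc_freq = Counter(word for tokens in corpus for word in set(tokens))
    shrunken = []
    for tokens in corpus:
        counts = Counter(tokens)
        scores = {
            # term frequency times a smoothed inverse document frequency
            word: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
        # highest score first, ties broken alphabetically
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        keep = set(ranked[:size])
        shrunken.append([word for word in tokens if word in keep])
    return shrunken


toy_corpus = [
    ["lede", "article", "lede", "talk", "article", "lede"],
    ["article", "talk", "picture"],
    ["talk", "picture", "park"],
]
shrunken_toy = shrink_corpus_sketch(toy_corpus, 2)
```

In this toy corpus `talk` occurs in every comment, so it scores lowest and is dropped first. Comments with no more than `size` distinct words pass through unchanged, which matches the short comments above.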

In [12]:
# we embed the comments into 25d space using the GloVe embedding:
embedding_path = "./resources/glove/glove.twitter.27B.25d.txt"
embedding = load_embedding(embedding_path)

# and use PCA to reduce the embedding space to 10d:
dim = 10
embedded_corpus = embed_corpus(shrunken_corpus, embedding, dim)

# we display the embedded corpus:
for i in range(0,10):
    print("the comment \n\n{}\n\nwas embedded to \n\n{}\n\n\n".format(" ".join(shrunken_corpus[i]), embedded_corpus[i]))

100%|██████████| 10/10 [00:00<00:00, 17112.62it/s]

the comment 

pat smear read wpledelength article length lede paragraphs long three sentences fallacy lede article meant bare minimum introduce subject serve acceptable article article lede lede work

was embedded to 

[array([ 0.69770049, -1.02777431,  1.15221836,  0.80133225, -0.27544395,
       -0.09901532, -0.92167517,  0.16526414, -0.59970558,  0.29849607]), array([ 2.91805186, -0.55279117, -0.23765049,  0.9006732 , -0.4458878 ,
        0.90998984,  0.41050187,  0.10279885, -0.30886601,  1.1982359 ]), array([-1.07440882,  1.64005859,  0.10285497,  0.88499192, -0.0304564 ,
        0.0314805 , -0.04783014, -0.18339021, -0.39516322, -0.39510616]), None, array([ 0.73820291,  2.44039438,  0.42246898,  0.63109435,  0.61082832,
       -0.63777845,  0.02107652, -1.19100108,  0.03657945, -0.06736897]), array([ 1.39503408, -0.45952399, -0.81580932,  0.0626284 ,  2.57891897,
       -1.36459264,  0.56733882,  0.12270991,  0.66397859,  0.7000497 ]), array([ 3.52138338, -0.04194904,  1.24687592
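
The embedding step can be sketched as follows. GloVe text files hold one word per line followed by its vector components, separated by spaces. The sketch parses that format, fits PCA via an SVD of the centered vectors, and projects each word of a comment, returning `None` for words missing from the vocabulary (as in the output above). The helper names, the toy 3d vectors, and the choice to fit PCA on the whole vocabulary are assumptions; `embed_corpus` may work differently.

```python
import io

import numpy as np


def parse_glove(handle):
    """Map each word to its vector, reading GloVe's 'word v1 v2 ...' lines."""
    table = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        table[parts[0]] = np.array([float(value) for value in parts[1:]])
    return table


def fit_pca(vectors, dim):
    """Return the mean and the top `dim` principal axes of the vectors."""
    matrix = np.stack(vectors)
    mean = matrix.mean(axis=0)
    # rows of `axes` are principal directions, ordered by explained variance
    _, _, axes = np.linalg.svd(matrix - mean, full_matrices=False)
    return mean, axes[:dim]


def embed_comment(tokens, table, mean, axes):
    """Project each known word; unknown words become None."""
    return [
        (table[token] - mean) @ axes.T if token in table else None
        for token in tokens
    ]


# Toy 3d vectors standing in for glove.twitter.27B.25d.txt.
toy_glove = io.StringIO(
    "article 0.1 0.2 0.3\n"
    "lede 0.0 0.5 0.1\n"
    "work 0.9 0.1 0.4\n"
    "read 0.3 0.3 0.8\n"
)
table = parse_glove(toy_glove)
mean, axes = fit_pca(list(table.values()), dim=2)
embedded = embed_comment(["read", "wpledelength", "article"], table, mean, axes)
```

Downstream code has to handle the `None` entries, for example by skipping them or substituting a zero vector.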




Finally, we apply the above preprocessing procedures to the whole corpus and write the results to the appropriate data files.

In [33]:
# importing this module runs the full preprocessing pipeline on the whole corpus:
import src.features.preprocess

..loading the data..


  0%|          | 63/159570 [00:00<04:14, 626.95it/s]

..using NLP to clean each comment in the corpus.. 


100%|██████████| 159570/159570 [02:39<00:00, 997.59it/s] 


..writing the results to the cleaned data file
..using TfIdf to shrink each comment in the corpus to size 100..



100%|██████████| 159570/159570 [03:35<00:00, 740.56it/s]


..writing the results to the shrunken data file..
..loading the Glove embedding..
..using PCA to reduce the embedding space of each word to size 20..


100%|██████████| 159570/159570 [00:10<00:00, 14929.70it/s]


..writing the results to the embedded data file. (this takes a while)..
..data preprocesssed..
