# 3. Preprocessing

In [19]:
#Import the necessary packages
import pandas as pd
import numpy as np
import nltk
import spacy
import contractions as cont
import gensim.downloader as api
import re
import unicodedata
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords

In [20]:
#Load the data
df = pd.read_csv('Corona_NLP.csv', index_col=0)

In its current form, the OriginalTweet column contains unstructured data in the form of Tweets. Most NLP classification algorithms, however, operate on word or document vectors, so it is necessary to vectorize each Tweet before proceeding to the modelling stage. To aid in the process of vectorization, I defined three functions and applied them in sequence, starting from the OriginalTweet column: preprocess, lemmatize_and_remove_ents, and vectorizer. The first two remove tokens that contribute little to sentiment (hashtags, handles, links, stopwords, and named entities such as proper names) and reduce the remaining words to their lemmas; the third converts the cleaned text into a document vector.

The first function performs several important tasks: it converts each Tweet to lower case, expands contractions, and removes a variety of non-essential information (hashtags, Twitter handles, links, diacritics, stopwords, and any remaining non-alphabetic characters). Expanding contractions proved necessary to preserve the word "not", which I also removed from the stopword list because it is often essential for distinguishing positive sentiments from negative ones. Consider, for example, the difference in meaning between 'happy' and 'not happy'.

In [21]:
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('not') 

def preprocess(text):
    text = text.lower() #lowercase
    text = " ".join([cont.fix(word) for word in text.split()]) #expand contractions
    text = re.sub(r"(#\S+)", '', text) #remove hashtags
    text = re.sub(r"(@\S+)", '', text) #remove handles
    text = re.sub(r"(http\S+)", '', text) #remove links
    text = unicodedata.normalize('NFKD', text) #remove diacritics
    text = text.encode('ascii', errors='ignore').decode('utf-8', errors='ignore') 
    tokens = [word.strip() for word in text.split() if word not in stopword_list] #remove stopwords
    text = " ".join(tokens)
    text = re.sub(r"[^a-z ]+", '', text) #remove special characters 
    return text

In [22]:
df['ProcessedTweet'] = df['OriginalTweet'].apply(preprocess)
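
As a quick sanity check of the cleaning steps, the sketch below applies the same sequence of operations to a single made-up Tweet. So that it runs without nltk or the contractions package, it uses a tiny hand-written contraction map and stopword set; `CONTRACTIONS`, `STOPWORDS`, `mini_preprocess`, and the sample text are all hypothetical stand-ins, not part of the pipeline above.

```python
import re
import unicodedata

# Hypothetical stand-ins so the sketch runs without nltk or contractions.
CONTRACTIONS = {"couldn't": "could not", "don't": "do not", "i'm": "i am"}
STOPWORDS = {"i", "am", "the", "a", "at", "to", "do", "so"}  # 'not' is deliberately absent

def mini_preprocess(text):
    text = text.lower()                                             # lowercase
    text = " ".join(CONTRACTIONS.get(w, w) for w in text.split())   # expand contractions
    text = re.sub(r"#\S+", "", text)                                # remove hashtags
    text = re.sub(r"@\S+", "", text)                                # remove handles
    text = re.sub(r"http\S+", "", text)                             # remove links
    text = unicodedata.normalize("NFKD", text)                      # split letters from accents
    text = text.encode("ascii", errors="ignore").decode("ascii")    # drop the accents
    tokens = [w for w in text.split() if w not in STOPWORDS]        # remove stopwords
    text = " ".join(tokens)
    return re.sub(r"[^a-z ]+", "", text)                            # remove special characters

sample = "I couldn't find café coffee at the store! #panic @shop https://t.co/x"
result = mini_preprocess(sample)
print(result)  # could not find cafe coffee store

# A Tweet made only of hashtags, handles, and links collapses to an empty string.
only_noise = mini_preprocess("#stayhome @user https://t.co/abc")
```

Note that "not" survives only because the contraction is expanded before stopwords are removed and because "not" is absent from the stopword set.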

In [23]:
nlp = spacy.load("en_core_web_md", exclude=['parser']) #component names are lowercase; the parser is not needed here

def lemmatize_and_remove_ents(text):
    text = nlp(text)
    text = " ".join([word.lemma_ if word.lemma_ != "-PRON-" else word.text for word in text if not word.ent_type_ ])
    return text

In [24]:
df['ProcessedTweet'] = df['ProcessedTweet'].apply(lemmatize_and_remove_ents)

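The third and final function, vectorizer, converts the processed Tweets into document vectors using a combination of a pretrained GloVe embedding and TF-IDF weighting. For each Tweet, it sums the GloVe vectors of the individual words, each multiplied by that word's IDF weight from the fitted TfidfVectorizer. Because a repeated word is added once per occurrence, the result amounts to a term-frequency times IDF weighting. Words missing from either the TF-IDF vocabulary or the GloVe vocabulary are skipped. Weighting the individual word vectors is meant to help ensure that the embeddings for long and short Tweets do not differ too drastically.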

Originally, I had planned to use a larger embedding, such as 'word2vec-google-news-300', but it proved too large for my computer to handle.

In [25]:
glove = api.load('glove-twitter-25')

tv = TfidfVectorizer(stop_words=stopword_list)
tv_transformed = tv.fit_transform(df['ProcessedTweet'])
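#get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it
tfidf_values = dict(zip(tv.get_feature_names_out(), tv.idf_))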

In [26]:
def vectorizer(text):
    vector = sum([glove[word]*tfidf_values[word] for word in text.split() if word in tfidf_values.keys() and word in glove.key_to_index])
    return vector
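
To make the weighted sum concrete, here is a minimal sketch using made-up 2-dimensional "embeddings" and IDF weights in place of GloVe and the fitted TfidfVectorizer. The names `toy_vectors`, `toy_idf`, and `toy_doc_vector` and all of the numbers are hypothetical and chosen only for illustration.

```python
# Hypothetical 2-D word vectors and IDF weights, for illustration only.
toy_vectors = {"happy": [1.0, 0.0], "not": [0.0, 1.0]}
toy_idf = {"happy": 2.0, "not": 3.0}

def toy_doc_vector(text):
    """Sum each known word's vector scaled by its IDF weight."""
    total = [0.0, 0.0]
    for word in text.split():
        if word in toy_vectors and word in toy_idf:
            weight = toy_idf[word]
            total = [t + weight * v for t, v in zip(total, toy_vectors[word])]
    return total

print(toy_doc_vector("not happy"))    # [2.0, 3.0]
print(toy_doc_vector("happy happy"))  # [4.0, 0.0], a repeated word counts once per occurrence

# Python's built-in sum over an empty list is the integer 0, not a vector.
empty_sum = sum([])
print(empty_sum)  # 0
```

The last line matters for vectorizer as written: when no word in a Tweet passes the filter, `sum([])` returns the scalar 0 rather than a 25-dimensional vector.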

In [27]:
df['DocVector'] = df['ProcessedTweet'].apply(vectorizer)

df.head()

Unnamed: 0,OriginalTweet,Sentiment,ProcessedTweet,DocVector
0,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative,trend new yorkers encounter empty supermarket ...,"[-31.063377, 4.315184, 3.272048, -18.336933, 2..."
1,When I couldn't find hand sanitizer at Fred Me...,Positive,could not find hand sanitizer fred meyer turn ...,"[-18.462826, 11.85384, -2.6134028, 6.3497987, ..."
2,Find out how you can protect yourself and love...,Extremely Positive,find protect love one,"[-2.7797158, 3.3976798, -5.5020814, 8.440672, ..."
3,#Panic buying hits #NewYork City as anxious sh...,Negative,buy hit city anxious shopper stock foodampmedi...,"[-40.63399, 35.210224, -24.598307, -49.416866,..."
4,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral,everyone buy baby milk powder next everyone bu...,"[-22.85473, 14.65009, 17.584171, 6.3308125, -3..."


Next, I need to transform DocVector from a column of arrays into a set of separate columns, one for each of the 25 vectorized features. This dataframe will contain all of the explanatory variables for the modelling step of the project.

In [28]:
X = df.DocVector.apply(pd.Series)
X.columns = ['Feature ' + str(i) for i in range(25)]

X.head()

Unnamed: 0,Feature 0,Feature 1,Feature 2,Feature 3,Feature 4,Feature 5,Feature 6,Feature 7,Feature 8,Feature 9,...,Feature 15,Feature 16,Feature 17,Feature 18,Feature 19,Feature 20,Feature 21,Feature 22,Feature 23,Feature 24
0,-31.063377,4.315184,3.272048,-18.336933,20.507353,-18.973701,39.690273,-63.8983,24.546728,-10.839511,...,12.867351,36.950283,-19.829515,22.515028,1.602518,-13.314288,-18.912729,-7.682436,-29.146791,-32.590302
1,-18.462826,11.85384,-2.613403,6.349799,-18.324095,-4.529284,22.41613,-37.670242,22.03301,-21.450129,...,-1.200512,8.932176,-8.927069,15.534106,-15.571704,-4.428163,11.713925,12.25454,5.701349,-26.367716
2,-2.779716,3.39768,-5.502081,8.440672,-8.699121,-5.226532,29.461939,2.573042,-7.642838,0.953844,...,4.212283,5.015493,-11.010517,5.61166,-7.700294,-2.587625,-0.576298,-5.781661,0.555235,-8.109362
3,-40.633991,35.210224,-24.598307,-49.416866,8.011915,7.370157,49.256386,-25.614607,34.770287,-19.868443,...,6.071897,28.622406,-23.586481,-16.194962,-31.503876,-27.649782,-26.42931,-12.657497,-22.968901,-21.575285
4,-22.854731,14.65009,17.584171,6.330812,-3.588906,6.745114,55.977005,-31.296013,4.496396,18.477859,...,-7.855762,31.614111,-41.765991,2.643752,8.316241,-27.605976,-7.106748,15.256669,24.210051,-8.261431


In [29]:
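#Attach the 25 feature columns to df, matching rows on the original index
X.reset_index(inplace=True)
df.reset_index(inplace=True)
df = df.merge(X, how='outer', on='index')
df.drop(columns=['index'], inplace=True) #'level_0' only exists if this cell is run twice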
df.head()

Unnamed: 0,index,OriginalTweet,Sentiment,ProcessedTweet,DocVector
0,0,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative,trend new yorkers encounter empty supermarket ...,"[-31.063377, 4.315184, 3.272048, -18.336933, 2..."
1,1,When I couldn't find hand sanitizer at Fred Me...,Positive,could not find hand sanitizer fred meyer turn ...,"[-18.462826, 11.85384, -2.6134028, 6.3497987, ..."
2,2,Find out how you can protect yourself and love...,Extremely Positive,find protect love one,"[-2.7797158, 3.3976798, -5.5020814, 8.440672, ..."
3,3,#Panic buying hits #NewYork City as anxious sh...,Negative,buy hit city anxious shopper stock foodampmedi...,"[-40.63399, 35.210224, -24.598307, -49.416866,..."
4,4,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral,everyone buy baby milk powder next everyone bu...,"[-22.85473, 14.65009, 17.584171, 6.3308125, -3..."


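Unfortunately, the preprocessing steps introduced some NaNs into the data wherever the original Tweet consisted solely of hashtags, handles, links, stopwords, and words not found in the glove-twitter-25 vocabulary. For such Tweets, vectorizer returns the scalar 0 rather than a 25-dimensional vector, so only Feature 0 is filled in and the remaining features are NaN. These entries need to be removed before modelling.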
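
The same condition could also be checked directly on the processed text before vectorizing. The sketch below is one possible way to do that, not what this notebook does; `has_usable_token` and the two toy vocabularies are hypothetical names standing in for the TF-IDF vocabulary and the GloVe vocabulary.

```python
def has_usable_token(text, idf_vocab, embedding_vocab):
    """Return True if at least one token appears in both vocabularies."""
    return any(w in idf_vocab and w in embedding_vocab for w in text.split())

# Hypothetical toy vocabularies, for illustration only.
idf_vocab = {"buy", "milk", "covidiot"}
embedding_vocab = {"buy", "milk", "store"}

print(has_usable_token("buy milk", idf_vocab, embedding_vocab))  # True
print(has_usable_token("", idf_vocab, embedding_vocab))          # False: nothing survived cleaning
print(has_usable_token("covidiot", idf_vocab, embedding_vocab))  # False: no embedding for the word
```

Rows for which such a check returns False are the ones that end up with NaN features.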

In [36]:
df[df['Feature 1'].isna()]

Unnamed: 0,OriginalTweet,Sentiment,ProcessedTweet,DocVector,Feature 0,Feature 1,Feature 2,Feature 3,Feature 4,Feature 5,...,Feature 15,Feature 16,Feature 17,Feature 18,Feature 19,Feature 20,Feature 21,Feature 22,Feature 23,Feature 24
1821,Il #coronavirus colpisce maggiormente i polmon...,Neutral,,0,0.0,,,,,,...,,,,,,,,,,
2598,I've #never #seen so may #men in a #supermarke...,Neutral,,0,0.0,,,,,,...,,,,,,,,,,
2729,#Coronavirus #preparation: What to #stock-up o...,Neutral,,0,0.0,,,,,,...,,,,,,,,,,
3063,@Janetb172 @denyessence @NoScienceDenial @west...,Neutral,,0,0.0,,,,,,...,,,,,,,,,,
3192,@KrampusFu @JackHer18731941 @Twistagirl1958 @W...,Neutral,,0,0.0,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42726,Nightcrawler #nightcrawler #thespot #scoop #po...,Neutral,,0,0.0,,,,,,...,,,,,,,,,,
43245,Adel and Karina xxx\r\r\n\r\r\nhttps://t.co/E3...,Neutral,,0,0.0,,,,,,...,,,,,,,,,,
44280,@jamisonglory @DailyMail This is what #antifa ...,Neutral,,0,0.0,,,,,,...,,,,,,,,,,
44323,It is a great week again \r\r\n#mondaymotivati...,Extremely Positive,,0,0.0,,,,,,...,,,,,,,,,,


In [55]:
#dropna() and reset_index() return new dataframes, so the result must be assigned back
df = df.dropna().reset_index(drop=True)
df.head()

Unnamed: 0,OriginalTweet,Sentiment,ProcessedTweet,DocVector,Feature 0,Feature 1,Feature 2,Feature 3,Feature 4,Feature 5,...,Feature 15,Feature 16,Feature 17,Feature 18,Feature 19,Feature 20,Feature 21,Feature 22,Feature 23,Feature 24
0,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative,trend new yorkers encounter empty supermarket ...,"[-31.063377, 4.315184, 3.272048, -18.336933, 2...",-31.063377,4.315184,3.272048,-18.336933,20.507353,-18.973701,...,12.867351,36.950283,-19.829515,22.515028,1.602518,-13.314288,-18.912729,-7.682436,-29.146791,-32.590302
1,When I couldn't find hand sanitizer at Fred Me...,Positive,could not find hand sanitizer fred meyer turn ...,"[-18.462826, 11.85384, -2.6134028, 6.3497987, ...",-18.462826,11.85384,-2.613403,6.349799,-18.324095,-4.529284,...,-1.200512,8.932176,-8.927069,15.534106,-15.571704,-4.428163,11.713925,12.25454,5.701349,-26.367716
2,Find out how you can protect yourself and love...,Extremely Positive,find protect love one,"[-2.7797158, 3.3976798, -5.5020814, 8.440672, ...",-2.779716,3.39768,-5.502081,8.440672,-8.699121,-5.226532,...,4.212283,5.015493,-11.010517,5.61166,-7.700294,-2.587625,-0.576298,-5.781661,0.555235,-8.109362
3,#Panic buying hits #NewYork City as anxious sh...,Negative,buy hit city anxious shopper stock foodampmedi...,"[-40.63399, 35.210224, -24.598307, -49.416866,...",-40.633991,35.210224,-24.598307,-49.416866,8.011915,7.370157,...,6.071897,28.622406,-23.586481,-16.194962,-31.503876,-27.649782,-26.42931,-12.657497,-22.968901,-21.575285
4,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral,everyone buy baby milk powder next everyone bu...,"[-22.85473, 14.65009, 17.584171, 6.3308125, -3...",-22.854731,14.65009,17.584171,6.330812,-3.588906,6.745114,...,-7.855762,31.614111,-41.765991,2.643752,8.316241,-27.605976,-7.106748,15.256669,24.210051,-8.261431


Next, I need to replace the sentiment labels with integer codes ordered from most negative (0) to most positive (4).

In [57]:
replacements = {'Extremely Negative': 0, 'Negative': 1, 'Neutral' : 2, 'Positive' : 3, 'Extremely Positive' : 4}
df['Sentiment'] = df['Sentiment'].map(replacements) #assign back rather than calling replace(inplace=True) on a column selection

df.head()

Unnamed: 0,OriginalTweet,Sentiment,ProcessedTweet,DocVector,Feature 0,Feature 1,Feature 2,Feature 3,Feature 4,Feature 5,...,Feature 15,Feature 16,Feature 17,Feature 18,Feature 19,Feature 20,Feature 21,Feature 22,Feature 23,Feature 24
0,TRENDING: New Yorkers encounter empty supermar...,0,trend new yorkers encounter empty supermarket ...,"[-31.063377, 4.315184, 3.272048, -18.336933, 2...",-31.063377,4.315184,3.272048,-18.336933,20.507353,-18.973701,...,12.867351,36.950283,-19.829515,22.515028,1.602518,-13.314288,-18.912729,-7.682436,-29.146791,-32.590302
1,When I couldn't find hand sanitizer at Fred Me...,3,could not find hand sanitizer fred meyer turn ...,"[-18.462826, 11.85384, -2.6134028, 6.3497987, ...",-18.462826,11.85384,-2.613403,6.349799,-18.324095,-4.529284,...,-1.200512,8.932176,-8.927069,15.534106,-15.571704,-4.428163,11.713925,12.25454,5.701349,-26.367716
2,Find out how you can protect yourself and love...,4,find protect love one,"[-2.7797158, 3.3976798, -5.5020814, 8.440672, ...",-2.779716,3.39768,-5.502081,8.440672,-8.699121,-5.226532,...,4.212283,5.015493,-11.010517,5.61166,-7.700294,-2.587625,-0.576298,-5.781661,0.555235,-8.109362
3,#Panic buying hits #NewYork City as anxious sh...,1,buy hit city anxious shopper stock foodampmedi...,"[-40.63399, 35.210224, -24.598307, -49.416866,...",-40.633991,35.210224,-24.598307,-49.416866,8.011915,7.370157,...,6.071897,28.622406,-23.586481,-16.194962,-31.503876,-27.649782,-26.42931,-12.657497,-22.968901,-21.575285
4,#toiletpaper #dunnypaper #coronavirus #coronav...,2,everyone buy baby milk powder next everyone bu...,"[-22.85473, 14.65009, 17.584171, 6.3308125, -3...",-22.854731,14.65009,17.584171,6.330812,-3.588906,6.745114,...,-7.855762,31.614111,-41.765991,2.643752,8.316241,-27.605976,-7.106748,15.256669,24.210051,-8.261431
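
Because the codes are plain integers, it can be handy to keep an inverse lookup for turning model predictions back into labels later. The sketch below shows the idea with the same mapping; `code_to_label` and the sample lists are hypothetical names introduced here for illustration.

```python
replacements = {'Extremely Negative': 0, 'Negative': 1, 'Neutral': 2,
                'Positive': 3, 'Extremely Positive': 4}

# Inverse lookup: integer code -> original label.
code_to_label = {code: label for label, code in replacements.items()}

labels = ['Positive', 'Neutral', 'Extremely Negative']
codes = [replacements[label] for label in labels]
decoded = [code_to_label[code] for code in codes]

print(codes)    # [3, 2, 0]
print(decoded)  # ['Positive', 'Neutral', 'Extremely Negative']
```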


The final step in preprocessing is to divide the data into test and training sets. 

In [58]:
y = df['Sentiment']
X = df.drop(columns=['OriginalTweet', 'Sentiment', 'ProcessedTweet', 'DocVector'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

In [59]:
df.to_csv('processed_Corona_NLP.csv')