## CSC 0620-01 Natural Language Technologies Spring 2021

Joseph Edradan <br>
03/16/2021 <br>
Source: https://www.kaggle.com/jayantawasthi/nlp-corona-tweet-with-random-forest-and-naivebayes <br>

#### Submission for hands-on workshop on Mar 11th

<ol>
  <li>Explore this labeled dataset: <a href="https://www.kaggle.com/datatattle/covid-19-nlp-text-classification">Coronavirus tweets NLP - Text Classification</a></li>
  <li>Understand the Naive Bayes based text classification implemented here: <a href="https://www.kaggle.com/jayantawasthi/nlp-corona-tweet-with-random-forest-and-naivebayes">nlp(corona tweet)with random forest and NaiveBayes</a></li>
  <li>Create a copy of the above Jupyter notebook (or export it as a Python program). In this new notebook (or program), add a detailed description, in your own words, of what is happening in each code block. (You can skip steps 27-32, which relate to the Random Forest classifier.)
</li>
</ol>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import re
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
import xgboost
from sklearn.model_selection import RandomizedSearchCV
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

from typing import Tuple, Any
from IPython.display import display
import sys
import concurrent.futures  # the futures submodule must be imported explicitly

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
"""
Load Corona_NLP_train.csv into a pandas DataFrame to use as the training dataset (the file is read with latin1 encoding)

"""
train = pd.read_csv("Corona_NLP_train.csv", encoding='latin1')

In [3]:
"""
Show first 5 rows of the training dataset

"""
train.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [4]:
"""
Create a function that drops the columns that are irrelevant to sentiment classification (the DataFrame is modified in place)

"""


def drop(p):
    p.drop(["UserName",
            "ScreenName",
            "Location",
            "TweetAt"], axis=1, inplace=True)

In [5]:
"""
Call drop() on the training dataset to remove the irrelevant columns


"""
drop(train)

In [6]:
"""
Show first 5 rows of the training dataset


"""
train.head()

Unnamed: 0,OriginalTweet,Sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive
3,My food stock is not the only one which is emp...,Positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [7]:
"""
Count the number of rows for each sentiment label

"""
train["Sentiment"].value_counts()

Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
Name: Sentiment, dtype: int64
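
These counts also give a reference point for the accuracies reported later in the notebook. A minimal sketch, assuming only the counts printed above: a hypothetical classifier that always predicts the most common label would be right about 28% of the time on the full dataset (the share inside a test split may differ slightly).

```python
# Label counts copied from the value_counts() output above.
sentiment_counts = {
    "Positive": 11422,
    "Negative": 9917,
    "Neutral": 7713,
    "Extremely Positive": 6624,
    "Extremely Negative": 5481,
}

total = sum(sentiment_counts.values())
majority_label = max(sentiment_counts, key=sentiment_counts.get)

# Accuracy of a hypothetical classifier that always predicts the majority label.
baseline_accuracy = sentiment_counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 3))
```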

In [8]:
"""
Total number of rows

"""
len(train.index)

41157

In [9]:
"""
Function that takes a pandas DataFrame and replaces the Sentiment column's string labels with integer codes (in place).

"""


def rep(t):
    d = {"Sentiment": {'Positive': 0,
                       'Negative': 1,
                       "Neutral": 2,
                       "Extremely Positive": 3,
                       "Extremely Negative": 4}}
    t.replace(d, inplace=True)

In [10]:
"""
Call rep() on the training dataset to encode the sentiment labels as integers

"""
rep(train)

In [11]:
"""
Show first 5 rows of the training dataset


"""
train.head()

Unnamed: 0,OriginalTweet,Sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,2
1,advice Talk to your neighbours family to excha...,0
2,Coronavirus Australia: Woolworths to give elde...,0
3,My food stock is not the only one which is emp...,0
4,"Me, ready to go at supermarket during the #COV...",4


In [12]:
"""
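Make an NLTK TweetTokenizer object to tokenize the tweets.
It uses regex patterns written for tweets to split a given tweet into words, special symbols, emojis, etc.
strip_handles=True removes @username handles and reduce_len=True shortens runs of three or more repeated characters to three.

Notes:
    Interesting that the nltk library had a Tweet tokenizer...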
    
Reference:
    https://www.nltk.org/api/nltk.tokenize.html
"""
tweettoken = TweetTokenizer(strip_handles=True, reduce_len=True)

In [13]:
"""
Make an NLTK WordNetLemmatizer to reduce a given word to its meaningful base form (lemma)

"""

lemmatizer = WordNetLemmatizer()

In [14]:
"""
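Make an NLTK PorterStemmer to reduce a given word to its stem, a base string of characters that is not necessarily a real word.
The stemmer is created here but is not used by preprocess() below.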

"""
stemmer = PorterStemmer()

In [15]:
"""
Function that takes in a line of text (a tweet) and:
1. Uses a regex to replace every non-alphabetic character with a space
2. Lowercases the text
3. Uses the TweetTokenizer object to tokenize the tweet into a list
4. Removes English stop words from the list
5. Lemmatizes the words in the list into a new list
6. Joins the lemmatized words from the new list into a string
7. Returns that string (the caller appends it to the list called collect)
"""

# List of preprocessed tweets (tokenized, stop words removed, lemmatized), one string per tweet
collect = []

# USED FOR THREADING
# stopwords = stopwords.words('english') # Late binding to prevent threading issues

def preprocess(t):

    # Regex remove non alphabet
    tee = re.sub('[^a-zA-Z]', " ", t)

    # Lowercase each word
    tee = tee.lower()

    # Tokenize
    res = tweettoken.tokenize(tee)

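    # Remove English stop words
    # Caveat: removing items from res while iterating over it skips the token after each
    # removal, so some stop words survive (e.g. "your" and "not" in the output of cell 17)
    for i in res:
        if i in stopwords.words('english'):
            res.remove(i)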

    # Make new list for words
    rest = []

    # Add lemmatized word into new list
    for k in res:
        rest.append(lemmatizer.lemmatize(k))

    # Make list into string
    ret = " ".join(rest)

    # Add string into list of strings
#     collect.append(ret)

    return ret
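
The stop word loop in preprocess removes items from `res` while iterating over it, so the element that follows each removed word is skipped. That would explain why words such as "your" and "not" still appear in the cleaned tweets. A minimal sketch of the difference, assuming a small hypothetical stop word set in place of `stopwords.words('english')`; building the set once also avoids recreating the NLTK list for every token.

```python
# Hypothetical stand-in for set(stopwords.words('english')), built once.
STOP_WORDS = {"this", "is", "a", "the", "to"}


def remove_in_place(tokens):
    """Mimics the notebook's loop: mutating the list during iteration skips items."""
    tokens = list(tokens)
    for token in tokens:
        if token in STOP_WORDS:
            tokens.remove(token)
    return tokens


def remove_with_filter(tokens):
    """Builds a new list instead, so every token is checked exactly once."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = ["this", "is", "a", "food", "stock"]
print(remove_in_place(tokens))     # ['is', 'food', 'stock']
print(remove_with_filter(tokens))  # ['food', 'stock']
```

Swapping the filtered version into preprocess would change every downstream number in this notebook, so the recorded outputs would need to be regenerated.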

In [16]:
"""
Preprocess every tweet, accessing each one by its row index, and append the result to collect

Notes:
    Cell duration: 2:38 minutes
    
    Can be threaded...
"""

# def preprocess_pd_df(pd, column_name):
#     collect_temp = []
#     for j in range(len(pd.index)):
#         collect_temp.append(preprocess(train[column_name].iloc[j]))
        
#     return collect_temp

# threads_amount = 5

# chunk_size = len(train["OriginalTweet"])//threads_amount

# list_chunk_sizes = [chunk for chunk in range(0, len(train.index), chunk_size)]
# print(list_chunk_sizes)

# np_array_pd_df_text = np.array_split(train["OriginalTweet"], list_chunk_sizes)

# pd_df_temp = pd.DataFrame()

# with concurrent.futures.ThreadPoolExecutor(max_workers=threads_amount) as executor:
#     results = [executor.submit(preprocess_pd_df, i, "OriginalTweet") for i in np_array_pd_df_text]
    
#     # type: concurrent.futures.Future
#     for i in concurrent.futures.as_completed(results):
#         print(i.result())
# #         pd_df_temp.append(i.result)
        

for j in range(len(train.index)):
    collect.append(preprocess(train["OriginalTweet"].iloc[j]))





In [17]:
"""
Print the first 5 strings from collect
"""
for i, text in enumerate(collect[:5]):
    print(f"{i+1}.", text, "\n")

1. menyrbie phil gahan chrisitv http co ifz fan pa http co xx ghgfzcc http co nlzdxno 

2. advice talk your neighbour family exchange phone number create contact list phone number neighbour school employer chemist gp set online shopping account po adequate supply regular med not order 

3. coronavirus australia woolworth give elderly disabled dedicated shopping hour amid covid outbreak http co binca vp p 

4. food stock not only one is empty please panic will enough food everyone not take than you need stay calm stay safe covid france covid covid coronavirus confinement confinementotal confinementgeneral http t co zrlg z j 

5. ready go supermarket covid outbreak because m paranoid because food stock litteraly empty the coronavirus a serious thing please panic cause shortage coronavirusfrance restezchezvous stayathome confinement http t co usmualq n 
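
The output above still contains handle fragments ("menyrbie phil gahan") and URL pieces ("http co"). That is consistent with the order of operations in preprocess: the regex replaces `@`, `_`, `:`, `/`, and digits with spaces before the tokenizer runs, so `strip_handles=True` has no handles left to strip. A rough sketch of one possible alternative, assuming handles and links should be dropped entirely; the two patterns and the sample tweet are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical tweet; the handles and link are made up for illustration.
tweet = "@SomeUser @Other_User stay safe https://t.co/abc123"

# Notebook order: strip non-letters first, which breaks handles and URLs into word-like pieces.
notebook_style = re.sub("[^a-zA-Z]", " ", tweet).lower().split()

# Alternative order: drop whole handles and links first, then strip the remaining non-letters.
without_noise = re.sub(r"@\w+|https?://\S+", " ", tweet)
alternative = re.sub("[^a-zA-Z]", " ", without_noise).lower().split()

print(notebook_style)
print(alternative)
```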



In [18]:
"""
Function that uses sklearn's CountVectorizer to build a bag-of-words representation. It learns a vocabulary from the
given list of preprocessed tweets (the list called collect), keeps the 200 most frequent words (max_features=200),
and counts how often each of those words occurs in each tweet. It returns the count vectors as an array of arrays
(one row per tweet, one column per word; the words themselves are not shown) together with the fitted vectorizer.

Reference:
    https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

"""


def bow(ll) -> Tuple[Any, CountVectorizer]:
    cv = CountVectorizer(max_features=200)
    x = cv.fit_transform(ll).toarray()
    return x, cv
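
To make the returned array less opaque, the following pure-Python sketch applies the bag-of-words idea to a made-up three-tweet corpus: keep the most frequent words across the corpus, order them alphabetically, and count them per tweet. `bag_of_words` is a hypothetical helper that only approximates what `CountVectorizer(max_features=...)` does; it ignores details such as CountVectorizer's own tokenization and how ties are broken.

```python
from collections import Counter


def bag_of_words(texts, max_features):
    """Return (count vectors, vocabulary) using the max_features most frequent words."""
    corpus_counts = Counter(word for text in texts for word in text.split())
    vocabulary = sorted(word for word, _ in corpus_counts.most_common(max_features))
    vectors = []
    for text in texts:
        counts = Counter(text.split())
        # A Counter returns 0 for words that do not occur in this text.
        vectors.append([counts[word] for word in vocabulary])
    return vectors, vocabulary


# Made-up corpus in the style of the preprocessed tweets.
texts = ["food stock empty food", "stay home stay safe", "food shop stay"]
vectors, vocabulary = bag_of_words(texts, max_features=2)
print(vocabulary)  # ['food', 'stay']
print(vectors)     # [[2, 0], [0, 2], [1, 1]]
```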

# Assigning y using CountVectorizer on the list called "collect"

In [19]:
"""
Call bow() on the list of preprocessed tweets (collect). Note that y holds the feature vectors here, not the labels.

"""

y, cv = bow(collect)

In [20]:
"""
Show the first tweet in its vector representation

"""
y[:1]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0]], dtype=int64)

In [21]:
"""
Show the first tweet in its vector representation with its corresponding word

Notes:
    I made it because I want to see the words and their counts - Joseph
"""

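# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# newer versions provide get_feature_names_out() instead
pd_df_temp = pd.DataFrame(y[:1], columns=cv.get_feature_names())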
pd.options.display.max_columns = len(pd_df_temp.columns)

pd_df_temp

Unnamed: 0,all,also,amid,amp,an,and,are,around,at,back,bank,be,been,business,buy,buying,can,care,case,chain,change,check,co,come,company,consumer,corona,coronacrisis,coronavirus,could,country,covid,crisis,customer,day,delivery,demand,distancing,do,don,due,economy,employee,empty,essential,even,every,everyone,face,family,find,first,food,for,free,gas,get,getting,global,go,going,good,got,government,grocery,hand,have,health,help,high,home,hour,how,http,impact,in,increase,industry,is,it,item,job,just,keep,know,last,let,life,like,line,local,lockdown,long,look,low,make,many,market,mask,may,money,month,more,much,my,need,new,news,no,not,now,of,oil,on,one,online,open,order,other,our,outbreak,pandemic,panic,paper,people,please,price,product,public,quarantine,re,read,really,report,retail,right,risk,safe,said,sanitizer,say,see,service,shelf,shop,shopping,since,social,socialdistancing,some,spread,staff,state,stay,still,stock,stop,store,supermarket,supply,support,take,thank,that,the,their,there,these,they,thing,think,this,time,to,today,toilet,toiletpaper,uk,use,ve,via,virus,wa,want,way,we,week,well,went,what,will,with,work,worker,working,world,would,year,you,your
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [22]:
"""
Show a given tweet's vector representation, keeping only the words whose count is not 0

Notes:
    I made it because I want to see the words and their counts clearly - Joseph
"""


def print_row_with_col_useful(df, index_row=0):
    display(df[[word for word, count in zip(
        df.columns, df.loc[index_row]) if count]])


print_row_with_col_useful(pd_df_temp)

Unnamed: 0,co,http
0,3,3


In [23]:
"""
Show the number of features (words) in the first tweet's vector representation

"""
len(y[0][:])

200

In [24]:
"""
Store the Sentiment column (the integer labels) in an array called values
"""
values = train["Sentiment"].values

In [25]:
"""
Display the values array

"""
values

array([2, 0, 0, ..., 0, 2, 1], dtype=int64)

In [26]:
"""
Call the train_test_split function to split the vector representations (variable y, used as the x) and the
sentiment labels (variable values, used as the y) into training and testing sets. 75% of the rows are used for
training, and random_state=42 fixes the random seed so the split is reproducible.

"""

(x_train, x_test, y_train, y_test) = train_test_split(
    y, values, train_size=0.75, random_state=42)
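
Because the variable names here are easy to mix up (`y` holds features, `values` holds labels), a small sketch with more conventional names may help. The arrays below are made up, and `features` and `labels` are hypothetical names. With `train_size=0.75`, 8 rows split into 6 training rows and 2 testing rows, and reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 8 samples with 3 count features each, and one label per sample.
features = np.arange(24).reshape(8, 3)
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, train_size=0.75, random_state=42)

print(x_train.shape, x_test.shape)  # (6, 3) (2, 3)
print(y_train.shape, y_test.shape)  # (6,) (2,)
```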

In [27]:
"""
Display the training features (the 75% of the rows of y, the array returned by bow, that were selected for training)
"""
x_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [28]:
"""
Make a Random Forest classifier with 200 trees and a fixed random seed

Notes:
    It builds many randomized decision trees and uses the consensus of all the trees
    to determine the classification of a given sample

Reference:
    https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
    
"""
rnd_clf = RandomForestClassifier(n_estimators=200, random_state=42)

In [29]:
"""
Train the Random Forests classifier on the training dataset of x and y

Notes:
    Cell duration: 18.3 seconds

"""
rnd_clf.fit(x_train, y_train)

RandomForestClassifier(n_estimators=200, random_state=42)

In [30]:
"""
Score the Random Forest classifier on the testing set (x_test, y_test) to see how accurate the classifier is

Notes:
    Cell duration 2:56 minutes
    
"""
rnd_clf.score(x_test, y_test)

0.4172983479105928

In [31]:
"""
Use the Random Forest classifier to predict the labels of the testing set from x_test.
Then build a confusion matrix that compares the true labels (rows) with the predictions (columns).
It can help decide which machine learning algorithm to use.

Each diagonal entry of the confusion matrix is the number of correct classifications for that class

Notes:
    Cell duration 2:56 minutes
"""
y_pred = rnd_clf.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[1358,  592,  501,  338,  109],
       [ 702, 1014,  448,  106,  243],
       [ 477,  438,  913,   46,   45],
       [ 653,  183,  162,  617,   28],
       [ 287,  457,  148,   33,  392]], dtype=int64)
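
The diagonal can be turned into numbers directly. The sketch below copies the matrix printed above and computes the overall accuracy (which should agree with the 0.417 score from the previous cell) and the per-class recall, assuming scikit-learn's convention that rows are true labels and columns are predictions.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[1358,  592,  501,  338,  109],
               [ 702, 1014,  448,  106,  243],
               [ 477,  438,  913,   46,   45],
               [ 653,  183,  162,  617,   28],
               [ 287,  457,  148,   33,  392]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by the number of true samples in that class.
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 4))
print(np.round(recall_per_class, 3))
```

Read this way, class 4 (Extremely Negative) has the lowest recall of the five classes.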

In [32]:
"""
Instead of the single Random Forest classifier above, train len(a) Random Forest classifiers,
each with a different number of trees in the forest

Notes:
    Cell duration: 8:06 minutes
    
    Cell duration Threaded: 1 minute 53.5 seconds
"""


a = [400, 500, 600, 700, 800, 900, 1000]


def run_rand_forest_classifer(i):
    rnd_clf = RandomForestClassifier(n_estimators=i, random_state=42)
    rnd_clf.fit(x_train, y_train)
    t = rnd_clf.score(x_test, y_test)
#     print(t)
    return i, t


with concurrent.futures.ThreadPoolExecutor(max_workers=len(a)) as executor:
    results = [executor.submit(run_rand_forest_classifer, i) for i in a]

    # type: concurrent.futures.Future
    for i in concurrent.futures.as_completed(results):
        print(i.result()[0], i.result()[1])

400 0.4152575315840622
500 0.41438289601554906
600 0.41438289601554906
700 0.41243926141885323
800 0.41564625850340137
900 0.4182701652089407
1000 0.41778425655976675


In [33]:
"""
Make a sklearn MultinomialNB classifier (Naive Bayes for multi-class classification on count features) and train it
on the training set (x_train, y_train)

Reference:
    https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
"""
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(x_train, y_train)

MultinomialNB()

In [34]:
"""
Display the accuracy of the MultinomialNB classifier on the testing set (x_test, y_test)
"""
clf.score(x_test, y_test)

0.3825072886297376
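
For intuition about what `MultinomialNB` fits, the following NumPy sketch implements multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, the scikit-learn default) on a made-up count matrix. It is a simplified illustration, not scikit-learn's implementation: each class gets a log prior from its share of the training rows and a smoothed log probability per word, and a tweet is assigned to the class with the highest total score.

```python
import numpy as np

# Made-up word counts: 4 tweets x 3 vocabulary words, with two classes.
X = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 2]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing, so unseen words never get probability zero

classes = np.unique(y)
# log P(class): fraction of training tweets in each class.
log_prior = np.log(np.array([(y == c).mean() for c in classes]))
# Smoothed word counts per class, normalized into log P(word | class).
word_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))


def predict(x):
    """Pick the class maximizing log P(class) + sum(count * log P(word | class))."""
    scores = log_prior + x @ log_likelihood.T
    return classes[np.argmax(scores)]


print(predict(np.array([1, 0, 0])))  # 0
print(predict(np.array([0, 2, 1])))  # 1
```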

In [35]:
"""
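Define a function that builds a term frequency-inverse document frequency (TF-IDF) vectorizer with up to 4000 features
and applies it to the given texts

Notes:
    TF-IDF measures how relevant a word is to one document within a collection of documents.
    In this case, how relevant a word is to one tweet compared with all the other tweets.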

Reference:
    https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
    
"""


def tfidf(xx) -> Tuple[Any, TfidfVectorizer]:
    cv = TfidfVectorizer(max_features=4000)
    x = cv.fit_transform(xx).toarray()
    return x, cv
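
A hand-rolled sketch of the weighting may make the TF-IDF numbers easier to read. It assumes `TfidfVectorizer`'s default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the count matrix is made up. A word that appears in every tweet, like "http" here, gets the minimum IDF of 1, while a rarer word gets a larger weight per occurrence.

```python
import numpy as np

# Made-up counts for 3 tweets over the vocabulary ["http", "fan"].
counts = np.array([[3, 1],
                   [1, 0],
                   [2, 0]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of tweets containing each word

# Smoothed inverse document frequency.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit Euclidean length.
weighted = counts * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(idf, 3))
print(np.round(tfidf, 3))
```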

In [36]:
"""
Recall what y looks like

"""
display(y[0])
print_row_with_col_useful(pd_df_temp)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0], dtype=int64)

Unnamed: 0,co,http
0,3,3


# Assigning y using TfidfVectorizer on the list called "collect"

In [37]:
"""
Apply TF-IDF to the list of preprocessed tweets (collect). This overwrites y with the TF-IDF feature vectors.

"""
y, cv2 = tfidf(collect)

In [56]:
"""
Display the first preprocessed tweet from collect and its TF-IDF vector from y
"""
# np.set_printoptions(threshold=sys.maxsize)

display(collect[0])
display(y[0])
print(len(y[0]))  # 4000
print(len(cv2.get_feature_names()))  # 4000
pd_df_temp_2 = pd.DataFrame(y[:1], columns=cv2.get_feature_names())
print_row_with_col_useful(pd_df_temp_2)

'menyrbie phil gahan chrisitv http co ifz fan pa http co xx ghgfzcc http co nlzdxno'

array([0., 0., 0., ..., 0., 0., 0.])

4000
4000


Unnamed: 0,co,fan,http,pa
0,0.395158,0.607944,0.396195,0.563279


In [57]:
"""
Again, split the dataset of the vector representations (the x) and the sentiment values aka "values" variable (the y) 
where 75% of the dataset will be used for training using a random seed to split the data.

"""
(x_train, x_test, y_train, y_test) = train_test_split(
    y, values, train_size=0.75, random_state=42)

In [58]:
"""
Make a Random Forest classifier, train it on the training set, and check how accurate it is on the testing set.
Unlike the earlier forest, this one also sets max_leaf_nodes=8.

Notes:
    Cell Duration: 25.4 seconds with n_estimators = 200
    
    Cell Duration: 2 minutes 14 seconds with n_estimators = 1000
"""
rnd_clf = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=8, random_state=42)
rnd_clf.fit(x_train, y_train)
rnd_clf.score(x_test, y_test)

0.2937803692905734

In [59]:
"""
Make a MultinomialNB classifier, train it on the training set, and check how accurate it is on the testing set.

"""
clf = MultinomialNB()
clf.fit(x_train, y_train)
clf.score(x_test, y_test)

0.46763848396501456

## Conclusion: 

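### MultinomialNB is better with TfidfVectorizer (0.468 accuracy) than with CountVectorizer (0.383), and RandomForestClassifier is better with CountVectorizer (0.417) than with TfidfVectorizer (0.294). The two setups also differ in max_features (200 vs. 4000), and the TF-IDF forest was limited to max_leaf_nodes=8, so the comparison is not strictly like-for-like.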