In [299]:
#Should be version is 3.7.3.
from platform import python_version

print("Python version:", python_version())

Python version: 3.7.3


### Import the required libraries

In [300]:
import os, re
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet

In [301]:
os.listdir(".")

['.ipynb_checkpoints',
 'NLP Disaster Tweets - Naive Bayes.ipynb',
 'nlp-disaster-tweets-shallow-bilstm-w-attention.ipynb',
 'test.csv',
 'train.csv']

--------

# 1. Exploratory Data Analysis

------------------

### Load the training and testing data & display the first 5 rows

In [302]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

train_data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [303]:
print("There are {} tweets".format(len(train_data)))

There are 7613 tweets


### Display the information content of the columns - specifically, how many null values are present in each column

In [304]:
print("Null values for each column (% of total amount of data):\n")
round(train_data.isnull().sum() / len(train_data) * 100, 1)

Null values for each column (% of total amount of data):



id           0.0
keyword      0.8
location    33.3
text         0.0
target       0.0
dtype: float64

### Observations: 
* 30% of location are NaN. Location *may* indicate likelihood of disaster tweet (eg. in locations such as Syria, Afghan, etc.) more likely tweet is about a bombing, but still not always indicative. We will take the easy route and leave the location out for now.

* Also, there is a very small percentage with missing keywords, so I will remove these entries for now too.

In [305]:
train_data = train_data.dropna(axis=0) # remove missing keyword entries

In [306]:
print("Null values for each column (% of total amount of data):\n")
(train_data.isnull().sum() / len(train_data)) * 100

Null values for each column (% of total amount of data):



id          0.0
keyword     0.0
location    0.0
text        0.0
target      0.0
dtype: float64

### Since we are performing supervised classification with the BLSTM model, it is essential that we have as close to equal class balance as possible (this is not so important with the Naive Bayes model). So, let's check how balanced the data is.

In [307]:
train_data.target.value_counts() # Sample from majority class accordingly

0    2884
1    2196
Name: target, dtype: int64

### So we have a slightly imbalanced 57-43 split of the data. Since there are plenty of observations for both classes (over 2000 for each), and the data is not wildly imbalanced (although that depends on who you ask) it may not be so necessary to balance the data. Nevertheless, we will do so here to ensure we have a perfectly equal split. 

In [308]:
num_neg = train_data.target.value_counts()[0] # number of examples belonging to class 0
num_pos = train_data.target.value_counts()[1] # number of examples belonging to class 1

frac = num_pos / num_neg

neg_data = train_data[train_data.target == 0].sample(frac=frac)
pos_data = train_data[train_data.target == 1]

new_train_data = pd.concat([pos_data, neg_data])

### IMPORTANT: We shuffle the new training dataset in order to better ensure a random mix of class labels, and less overfitting of our models to one particular class.

In [309]:
#Shuffle the dataframe
new_train_data = new_train_data.sample(frac=1)

### Are our balance issues fixed? (Hint: yes)

In [310]:
new_train_data.target.value_counts()

0    2196
1    2196
Name: target, dtype: int64

In [311]:
print("Total number of tweets in re-sampled dataframe: {}".format(len(new_train_data)))

Total number of tweets in re-sampled dataframe: 4392


In [312]:
print("Number of tweets w/ #: ", len([x for x in new_train_data.text.values if '#' in x])) # tweets with #
print("Number of tweets w/ @: ", len([x for x in new_train_data.text.values if '@' in x])) # tweets with @

Number of tweets w/ #:  1066
Number of tweets w/ @:  1199


### A quarter of the total tweets come with hashtags and just under a third come with @'s. Since this is such a large number, it might not be best to remove these tokens completely

### Below, we create the `clean_text` function which cleanses and standardizes the tweets into a form that is fit for our models

In [313]:
def clean_text(text):
    
    new_text = text.lower() # lowercase the text
    new_text = re.sub(r"\w+\:\/\/([a-z]+)\.co\/\w+(\n)?", "", new_text) #remove urls
    #new_text = re.sub(r"@[a-zA-Z0-9]+(?:;)*", "", new_text) # remove @s
    #new_text = re.sub(r"#", "", new_text) # remove #s
    new_text = re.sub(r"[^a-z0-9A-Z]", " ", new_text) # remove non alphanumerics
    new_text = re.sub(r"[0-9]+[^\w+]", "", new_text) # remove words made wholy of digits
    new_text = re.sub(r"\b\w{1,2}\b", "", new_text) # remove words w/ 1 char
    new_text = re.sub(" +", " ", new_text) # remove multiple consecutive spaces
    
    new_text = new_text.strip() # remove leading/trailing whitespaces
    
    return new_text

In [314]:
new_train_data.text.values # display some raw tweets

array(['Greedy bastards @Fullscreen way to ruin creativity.  CENSORSHIP ON YOUTUBE: https://t.co/nMtlpO4B58',
       'Is it seclusion when a class is evacuated and a child is left alone in the class to force compliance?  #MoreVoices',
       '#FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps http://t.co/S4SiCMYRmH',
       ...,
       "Couples having less sex... for fear it'll be a let down: Internet movies and books saying how sex 'ought to be' p\x89Û_ http://t.co/c1xhIzPrAd",
       'The Catastrophic Effects of Hiroshima and Nagasaki Atomic Bombings Still Being Felt Today http://t.co/WC8AqXeDF7',
       'Roosevelt Wash. under evacuation order due to wildfire http://t.co/FiJAPxyKRQ'],
      dtype=object)

In [315]:
#Test cleaning on given tweet
i = 6
for tweet in new_train_data.text.values[i:]:
    print("Original: ", tweet)
    tweet = clean_text(tweet)
    print("Cleaned: ", tweet)
    break

Original:  @Caitsroberts see U the night wee bArra to get absolutely wrecked ????
Cleaned:  caitsroberts see the night wee barra get absolutely wrecked


---------------

# 2. Data Preprocessing

-----

### Here, we preprocess the tweets using the `clean_text` function created above

In [316]:
tweets = {}
for i, tweet in enumerate(new_train_data.text):
    tweets[i] = clean_text(tweet)
    
labels = {}
for i, label in enumerate(new_train_data.target):
    labels[i] = label

----------------------------------------

### Data augmentation is a very important step in order to give your model more data to train on which can potentially improve performance. Here, we cleverly inflate the size of our training set by deleting a random word from each tweet, and adding the new tweet back into dataset

In [317]:
import random

for i in range(len(tweets)):
    temp, label = tweets[i], labels[i]
    j = random.randint(0, len(temp.split())-1)
    word = temp.split()[j]
    temp = temp.replace(word, "")
    temp = re.sub(" +", " ", temp) # remove multiple consecutive spaces
    temp = temp.strip() # remove leading/trailing whitespaces

    tweets[len(tweets)] = temp
    labels[len(labels)] = label

### Now, we extract the tokens and lemmatize the tweets. In previous experiments, all English stopwords were excluded from the token list but no noticeable performance increase was observed.

#### Note: Intuitively, stopwords (such as "and," "but," "a" etc.) convey little if any information when predicting sentiment, however it was found that removable of these words did not increase the performance of the model by any significant margin, so they were left in

In [318]:
lm = nltk.stem.WordNetLemmatizer()
all_tokens = [item for _, value in tweets.items() for item in word_tokenize(value)]
all_tokens_lm = [lm.lemmatize(t) for t in all_tokens]
#all_tokens_lm = [lm.lemmatize(t) for t in all_tokens if t not in stopwords.words('english')]

### Get the number of tokens and vocabulary size

In [319]:
N = len(all_tokens_lm)
V = len(set(all_tokens_lm))
        
print(f"There are {N} tokens after processing")
print(f"There are {V} unique tokens after processing")

There are 95656 tokens after processing
There are 10664 unique tokens after processing


### Below, we create a model for extracting tweets of a user-defined sentiment from our corpus. This comes in handy for the Naive Bayes model

In [320]:
def filter_dict(tweets, sentiments, sent):
    """
    Gets a dictionary with tweets of a certain sentiment
    
    Inputs:
        tweets: dict, contains the tweets (key = ID, value = tweet)
        sentiments: dict, contains the sentiments (key = ID, value = 0 or 1)
        sent: string, the sentiment (1 for "disaster", 0 for "non-disaster")
    
    Note: tweets & sentiments need to have the same ID
    """
    new_dict = {}
    for key, value in tweets.items():
        if sentiments[key] == sent:
            new_dict[key] = value
            
    return new_dict

def count_occurences(w, counts):
    try:
        return counts[w]
    except:
        return 0

In [321]:
### Testing Functionality ### 
test = filter_dict(tweets, labels, 0) # extract tweets belonging to class 0
list(test.items())[:10]

[(0, 'greedy bastards fullscreen way ruin creativity censorship youtube'),
 (4,
  'also dont think sewing thought leather belt would work out that well lol'),
 (5, 'ideas food flattened'),
 (6, 'caitsroberts see the night wee barra get absolutely wrecked'),
 (7,
  'just added some more fire the flames for saturday rick wonder will spinning guest set along with chachi'),
 (10,
  'petchary but can say that either should displeased move five spots jamaica congrats the reggaeboyz'),
 (11, 'utahcanary sigh daily battle'),
 (13,
  'officialtjonez your lost for words made new fan yours fam crazy skills beyond blessed keep blazing dude made love and respect'),
 (14,
  'today was such hastle from getting drug tested blood drown out shot document filling'),
 (15,
  'sirbrandonknt exactly that why the lesnar cena match from summerslam last year was great because brock annihilated guy who')]

## Here, we build the Naive Bayes model with +k smoothing. 

The NB model is a "bag of words" model that will predict the most likely sentiment $c$ given the words $w$ in a tweet. Formally, we compute $P(c|w)$ = $c_{NB}$ using Bayes Rule:

$c_{NB}$ = $argmax(log(P(c)) + \sum_{i}(log(P(w_i|c)))$

where

$P(c)$: prior probability, = # of tweets of sentiment c / total number of tweets

$P(w_i|c)$: likelihood (posterior), = count($w_i$) in all documents of class $c$ / number of words in docs of class $c$

#### Note: The smoothing parameter was varied but any value of k>1 did not significantly increase performance; thus, k=1 was chosen

In [322]:
class NaiveBayesClassifier():
    """Naive Bayes with +k smoothing"""
    def __init__(self, documents, sentiments):
        """
        Inputs: 
            documents: dict, key = ID, value = tweet
            sentiments: dict, key = ID, value = sentiment (1 or 0)
        """
        self.documents = documents
        self.sentiments = sentiments
        self.classes = list(set(self.sentiments.values()))
        
    def train(self, tokens, k):
        logprior, lhoods, bigdoc = {}, {}, {c : [] for c in self.classes}
        Ndoc = len(self.documents) 
        V = set(tokens)
        for c in self.classes:
            c_tweets = filter_dict(self.documents, self.sentiments, c)
            Nc = len(c_tweets)
            logprior[c] = np.log(Nc / Ndoc)
            bigdoc[c] = [item for _, value in c_tweets.items() for item in word_tokenize(value)]
            counts = Counter(bigdoc[c])
            allw_count = {v:count_occurences(v, counts) for v in V}
            likelihood = {}
            for w in V:
                w_count = allw_count[w]
                likelihood[w] = np.log((w_count + k) / (len(bigdoc[c]) + k*len(V)))
                
            lhoods[c] = likelihood
            
            print(f"Finished with class {c}")
        
        print("\nFinished training!")
        return logprior, lhoods
    
    def classify(self, tweet, tokens, prior, lhoods):
        V = set(tokens)
        probs = {}
        tweet = word_tokenize(tweet)
        for c in self.classes:
            probs[c] = prior[c]
            for i in range(len(tweet)):
                word = tweet[i]
                if word in V:
                    probs[c] += lhoods[c][word]
                    
        return self.classes[np.argmax(list(probs.values()))]

In [323]:
### Testing Class Functionality ###
model = NaiveBayesClassifier(tweets, labels)
print("Classes: ", model.classes)
print("Documents: ", list(model.documents.items())[:2])
print("Sentiments: ", list(model.sentiments.values())[:2])

Classes:  [0, 1]
Documents:  [(0, 'greedy bastards fullscreen way ruin creativity censorship youtube'), (1, 'seclusion when class evacuated and child left alone the class force compliance morevoices')]
Sentiments:  [0, 1]


### Split the training data into train + validation sets. Here, a standard 60-20-20 training/validation/testing split is employed.

In [324]:
#60-20-20 train-dev-test split

cutoff = int(0.8*len(tweets))
train_cutoff = int(0.6*len(tweets))

train_set = dict(list(tweets.items())[:train_cutoff])
train_labels = dict(list(labels.items())[:train_cutoff])

validation_set = dict(list(tweets.items())[train_cutoff:cutoff])
validation_labels = dict(list(labels.items())[train_cutoff:cutoff])

test_set = dict(list(tweets.items())[cutoff:])
test_labels = dict(list(labels.items())[cutoff:])

In [325]:
#For NB, we use only the training and testing set
model = NaiveBayesClassifier(train_set, train_labels)
prior, lhood = model.train(all_tokens_lm, k=1)

Finished with class 0
Finished with class 1

Finished training!


----- 

# 3. Evaluation

----

### Here, we evaluate the model on the basis of the F1 score. An `evaluate` function is created below that takes in the parameters we got from the training step and uses them for inference with the testing set. The `f1_score` function from the scikit-learn library is an easy way to get the F1 score.

In [326]:
from sklearn.metrics import f1_score

def evaluate(parameters):
    prior, lhood = parameters
    predictions = {k : model.classify(v, all_tokens_lm, prior, lhood) for k, v in test_set.items()}
    score = f1_score(list(test_labels.values()), list(predictions.values()))
    return score

In [327]:
score_nb = evaluate((prior, lhood))

print(f"Naive Bayes F1-score: {score_nb}")

Naive Bayes F1-score: 0.8991150442477878


### While the NB model achieves a pretty good result on the training set, it does not perform too well in practice. Part of the reason for this is that the model does not consider the context of the surrounding words and instead has its predictions based on frequentist statistics (ie. how frequently words occur in with particular sentiment labels and how often those labels occur in the dataset). It completely disregards language nuance and as a result fails to capture the meaning of words and sentences effectively.

### Try it out on custom tweets

In [328]:
linking_dict = {0: "non-disaster", 1: "disaster"}

In [329]:
tweet1 = "Help, there's been an earthquake!"
tweet2 = "Enjoying my time here in Mexico :)"
tweet3 = "My legs are killing me!"

your_tweet = tweet1

print("Your tweet:", your_tweet)

pred = linking_dict[model.classify(your_tweet, all_tokens_lm, prior, lhood)]
print("The Naive Bayes model predicts that your tweet is {} related.".format(pred))

Your tweet: Help, there's been an earthquake!
The Naive Bayes model predicts that your tweet is disaster related.
