**Authors**:
* *Joshua Frenchwood*
* *Sean Schuepbach*
* *Sai Sandeep Kumar*
* *Manikanta Addanki*

**Date**: *May 5th, 2020*

**Note**: This notebook runs remotely on Google's servers through Google Colab, which is typically much faster than running the code on an average local computer.

**For improved execution speed**: 
* Try using the GPU or TPU provided by Google Colab to run the code (not necessary):
 * Edit -> Notebook Settings -> Hardware Accelerator -> GPU or TPU

**GitHub Link**: https://github.com/langadudeabu/Tweet-Classifier

**To Run the Code**: Press Ctrl+F9

# Setup

# Imports and Globals

In [91]:
import nltk
import numpy as np
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize, TweetTokenizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
import random
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from nltk.classify import ClassifierI
from statistics import mode, mean

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
n = 8000 # number of features in algorithms ( top n most common words )
percent_train = 80

ORIG = 1 # 1 when you want to include it in the active dataset
CHI = 1  # 1 when you want to include it in the active dataset
HOU = 1  # 1 when you want to include it in the active dataset

# Functions and Descriptions

Class `VoteClassifier`:
* Inherits from `ClassifierI`, which makes it compatible with the `nltk.classify.accuracy()` function (the same function used to determine the accuracy of each individual classifier).
* `classify()` member function uses voting system to make a prediction
* `confidence()` member function returns the fraction of majority votes / total votes, and can be either 60%, 80%, or 100%.

In [0]:
class VoteClassifier(ClassifierI): #inherits from ClassifierI
  def __init__(self, *classifiers): # accepts any number of trained classifiers as positional arguments
    self._classifiers = classifiers
    self._name_list = [ 'Naive_Bayes_Classifier',
              'MultinomialNB_classifier',
              'BernoulliNB_classifier',
              'LogisticRegression_classifier',
              'SGDClassifier_classifier' ]

  def classify(self, tweetVector, p=0):
    votes = []
    for i in range(len(self._classifiers)):
      c = self._classifiers[i]
      v = c.classify(tweetVector) # from nltk.classify
      if p :
        print(self._name_list[i],':', v)
      votes.append(v)
    
    self.classification = votes
    bestPick = mode(votes) # from statistics.mode
    return bestPick
    
  def confidence(self, tweetVector):
    bestPick = self.classify(tweetVector, p=1)
    choice_votes = self.classification.count(bestPick)
    conf = choice_votes / len(self.classification)
    return conf
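
The voting arithmetic behind `classify()` and `confidence()` can be sketched in isolation. This is a minimal illustration, not the class above: the helper name `vote` is hypothetical, and it uses `collections.Counter` rather than `statistics.mode` to find the majority label. With five voters and two possible labels, the winning share can only be 3/5, 4/5, or 5/5.

```python
from collections import Counter

def vote(votes):
    """Return (winning label, fraction of votes it received).

    Hypothetical helper for illustration only.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Three of five classifiers agree, so the confidence is 3/5.
print(vote(['disaster', 'disaster', 'non-disaster', 'disaster', 'non-disaster']))
```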

Function `removeApostrophe()`:
* **Inputs**: tweet that needs apostrophe removed as a string
* **Globals**: None.
* **Outputs**: tweet without apostrophe as string

In [0]:
def removeApostrophe(tweet):
  return tweet.replace('\'', '')

Function `cleanTweet`:
* **Description**: This function removes irrelevant words from the tweet such as "of" and "the", along with symbols, hyperlinks, mentions, emojis, numbers, punctuation, apostrophes, and any token of two characters or fewer.
* **Inputs**: Unmodified tweet as string.
* **Globals**: None.
* **Outputs**: Cleaned tweet as string.

In [0]:
def cleanTweet(tweet):
  t = TweetTokenizer()
  stop_words = set(stopwords.words("english"))
  more = [';', '/','.',':','!','of','?','-','...','&','""','',
          "'s","'m",'\'','\'s','\'re','\'t','[',']','(',')',
          'û','ûª', 'ûªt', 'ûªs', 'ûªve', 'ûªre', 'ûªm']
  for letts in more:
    stop_words.add(letts)

  def isSymbol(word):
    syms = 0
    for i in range(len(word)):
      if(not word[i].isalpha()):
        syms = syms + 1
    if(syms >= len(word)/2): return True
    else: return False

  words = t.tokenize(tweet)
  resultwords  = [w.lower() for w in words
                  if w.lower() not in stop_words  # compare in lowercase so capitalized stop words are removed too
                  and not w.startswith('http')    # hyperlinks
                  and not w.startswith('@')       # mentions
                  and not w.isnumeric()
                  and not isSymbol(w)
                  and not len(w) <= 2
                  ]
  tweet = ' '.join(resultwords)
  tweet = removeApostrophe(tweet)
  return tweet
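
The symbol filter inside `cleanTweet()` treats a token as a symbol when at least half of its characters are non-alphabetic. A standalone sketch of that rule follows; the name `is_mostly_symbols` is hypothetical and the sample tokens are made up for illustration.

```python
def is_mostly_symbols(word):
    """True when at least half of the characters are non-alphabetic."""
    non_alpha = sum(1 for ch in word if not ch.isalpha())
    return non_alpha >= len(word) / 2

# Made-up sample tokens: plain words pass, digit-heavy or punctuation-heavy tokens are flagged.
for token in ['wildfire', '#fire', 'a1', '2020', ':-)']:
    print(token, is_mostly_symbols(token))
```

Note that a hashtag such as `#fire` is kept by this rule, because only one of its five characters is non-alphabetic.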

Function `cleanTweets()`:
* **Description**: This function goes through a list of tweets, and cleans them individually, by passing them to the function `cleanTweet()`
* **Inputs**: list of tweets to be cleaned (string form)
* **Globals**: None.
* **Outputs**: None. Just modifies the input parameter.

In [0]:
def cleanTweets(tweets):  
  for i in range(len(tweets)):
    tweets[i] = cleanTweet(tweets[i])

Function `y_convert()`:
* **Description**: This function translates the dataset's y-values (list of actual classifications) into either *disaster* or *non-disaster*.
* **Inputs**: `y_train` is a list of actual classification values from either ORIG, CHI, or HOU. The Original dataset has integer values (0 or 1), and the Chicago and Houston datasets have string values ("relevent" or "non-relevent")
* **Globals**: None.
* **Outputs**: `y_train2` is a list of string values ("disaster" or "non-disaster")

In [0]:
def y_convert(y_train):
  y_train2 = []
  for i in range(len(y_train)) :
    if (y_train[i] == 1) or (y_train[i] == 'relevent'):
      y_train2.append('disaster')
    else:
      y_train2.append('non-disaster')
  return y_train2

Function `read_in_tweets()`:
* **Description**: This function reads in data from CSV files hosted on GitHub. The `cleanTweets()` function cleans the tweets, and the `y_convert()` function translates the classifications to either "disaster" or "non-disaster".
* **Inputs**: None. The URLs are hardcoded in the function.
* **Outputs**: For each of the 3 datasets, a pandas Series containing the cleaned tweet text and a list containing the translated classification values.

In [0]:
def read_in_tweets():
  url = 'https://raw.githubusercontent.com/langadudeabu/MachineLearningData/master/train.csv'
  dataOrig = pd.read_csv(url) # Returns a pandas.DataFrame
  url = 'https://raw.githubusercontent.com/langadudeabu/nlp-disaster-analysis/master/dataset/recent_tweets_test/chicago_tweets-labeled.csv'
  dataChi = pd.read_csv(url) # Returns a pandas.DataFrame
  url = 'https://raw.githubusercontent.com/glrn/nlp-disaster-analysis/master/dataset/recent_tweets_test/houston_tweets-labeled.csv'
  dataHou = pd.read_csv(url) # Returns a pandas.DataFrame

  train_tweets = dataOrig.pop('text').str.lower() # Returns a pandas.Series
  y_train = dataOrig.pop('target') # Returns a pandas.Series

  chi_tweets = dataChi.pop('text').str.lower() # Returns a pandas.Series
  chi_results = dataChi.pop('choose_one').str.lower() # Returns a pandas.Series

  hou_tweets = dataHou.pop('text').str.lower() # Returns a pandas.Series
  hou_results = dataHou.pop('choose_one').str.lower() # Returns a pandas.Series

  cleanTweets(train_tweets)
  cleanTweets(chi_tweets)
  cleanTweets(hou_tweets)

  y_trainOrig = y_convert(y_train)
  y_trainChi = y_convert(chi_results)
  y_trainHou = y_convert(hou_results)

  return train_tweets, y_trainOrig, chi_tweets, y_trainChi, hou_tweets, y_trainHou

Function `combineDatasets()`:
* **Description**: This function appends the specified datasets together into one cohesive dataset. The tweets and results are appended to two separate lists of equal length, so that index *i* in one list corresponds to index *i* in the other.
* **Inputs**: `orig`, `chi`, and `hou` are boolean values. A value of 1 (True) means the particular dataset will be included in the overall training dataset.
* **Globals**: `train_tweets`, `chi_tweets`, and `hou_tweets` are global pandas Series of cleaned tweets. `y_trainOrig`, `y_trainChi`, and `y_trainHou` are global lists of the corresponding classification values.
* **Outputs**: `data` is a list of tweets(strings), and `results` is a list of corresponding classification values (strings - "disaster" or "non-disaster") 

In [0]:
def combineDatasets(orig=1, chi=0, hou=0):
  data = []
  results = []
  if orig :
    for i in range(len(train_tweets)):
      data.append(train_tweets[i])
      results.append(y_trainOrig[i])
  if chi :
    for i in range(len(chi_tweets)):
      data.append(chi_tweets[i])
      results.append(y_trainChi[i])
  if hou :
    for i in range(len(hou_tweets)):
      data.append(hou_tweets[i])
      results.append(y_trainHou[i])

  return data, results

Function `apply_BagOfWords()`:
* **Description**: This function applies the Bag of Words model, creating a frequency distribution of all words in the collection of cleaned tweets. It also creates a shuffled list of tuples that pairs each tweet (in word-tokenized form) with its corresponding classification value.
* **Inputs**: `train_data` and `train_results` are lists of tweets and results from our overall combined dataset.
* **Outputs**: `tokenizedData` is a list of tuples, each holding the tokenized tweet alongside the corresponding classification. `wordFrequencies` is an `nltk.FreqDist` that maps each word to its total number of occurrences.

In [0]:
def apply_BagOfWords(train_data, train_results):
  t = TweetTokenizer()
  tokenizedData = []
  for i in range(len(train_data)):
    tweet = train_data[i]
    classification = train_results[i]
    tokenizedTweet = list(t.tokenize(tweet)) #tokenize one tweet at a time
    
    tup = (tokenizedTweet, classification)
    tokenizedData.append(tup)
  
  random.shuffle(tokenizedData)

  all_words = [ w for tweet in train_data for w in t.tokenize(tweet) ] 
  wordFrequencies = nltk.FreqDist(all_words)

  return tokenizedData, wordFrequencies
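
The two outputs described above can be sketched with the standard library alone. This is an illustration under stated assumptions: `str.split()` stands in for `TweetTokenizer`, `collections.Counter` stands in for `nltk.FreqDist`, and the two sample tweets are made up.

```python
from collections import Counter

# Made-up, already-cleaned sample tweets and their labels.
tweets = ['forest fire near town', 'fire crews battle forest blaze']
labels = ['disaster', 'disaster']

# Pair each tokenized tweet with its label (str.split is a stand-in tokenizer).
tokenized = [(tweet.split(), label) for tweet, label in zip(tweets, labels)]

# Count every token across all tweets.
frequencies = Counter(w for words, _ in tokenized for w in words)

print(tokenized[0])
print(frequencies['fire'], frequencies['forest'], frequencies['town'])
```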

Function `vectorizeTweet()`:
* **Description**:  This function fills in the vectorized representation of a single tweet with Boolean values (True if contains feature, False if not).
* **Inputs**: `tokenizedTweet` is a word-tokenized version of a single tweet. `word_features` is the list of feature words to check for.
* **Globals**: None.
* **Outputs**: `features` is a dictionary of ( feature_word : Boolean ) pairs, where the Boolean is True if that specific tweet contains the feature, and False if not. This is the *vectorized form of the tweet* that will be fed into the ML algorithms.

In [0]:
def vectorizeTweet(tokenizedTweet, word_features):
    words = set(tokenizedTweet)
    features = {}
    for w in word_features :
      features[w] = (w in words) # the Value of the dictionary is a BOOLEAN
    return features
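
A small worked example may make the vectorized form concrete. The sketch below mirrors the logic of `vectorizeTweet()` under a hypothetical name, with a made-up three-word feature list.

```python
def vectorize(tokens, word_features):
    """Map each feature word to True/False depending on whether the tweet contains it."""
    present = set(tokens)
    return {w: (w in present) for w in word_features}

feature_words = ['fire', 'flood', 'storm']  # made-up feature list
vector = vectorize(['forest', 'fire', 'near', 'town'], feature_words)
print(vector)  # {'fire': True, 'flood': False, 'storm': False}
```

Every tweet is mapped onto the same set of keys, so all vectors have the same length regardless of how long the tweet is.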

Function `vectorizeTweets()`:
* **Description**: This function takes the *n* most common words from `wordFrequencies` and makes them into a list of features. It also creates a list of vectorized tweets, individually vectorizing them with the `vectorizeTweet()` function, and links them to their corresponding classification by returning them in tuple format.
* **Inputs**: None. Just references globals.
* **Globals**: `n` is an integer that represents the desired length of the featureset (top *n* most common words). `wordFrequencies` is the frequency distribution of all words. `tokenizedData` is a list of tuples containing: ( word-tokenized tweet : classification ).
* **Outputs**: `tweetVectorData` is a list of tuples, containing: ( vectorized tweet : classification ). This is the format that will be fed into the machine learning algorithms.

In [0]:
def vectorizeTweets():
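  # most_common(n) sorts by count; keys() alone follows insertion order in NLTK 3
  featureList = [ w for (w, count) in wordFrequencies.most_common(n) ] # Top n most common words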
  tweetVectorData = [ ( vectorizeTweet(tokenizedTweet, featureList) , classification ) 
                 for (tokenizedTweet, classification) in tokenizedData]

  return tweetVectorData, featureList
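
When selecting the top *n* words from a frequency distribution, the order in which the words are read matters. In NLTK 3, `FreqDist` subclasses `collections.Counter`, whose `keys()` follow insertion order while `most_common(n)` sorts by count. The sketch below uses a plain `Counter` and made-up tokens to show the difference.

```python
from collections import Counter

# Made-up tokens: 'town' appears first but only once.
counts = Counter(['town', 'fire', 'fire', 'flood', 'fire', 'flood'])

first_seen = list(counts.keys())[:2]                      # insertion order
most_frequent = [w for w, _ in counts.most_common(2)]     # sorted by count

print(first_seen)     # ['town', 'fire']
print(most_frequent)  # ['fire', 'flood']
```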

Function `splitData()`:
* **Description**: This function takes the Vectorized tweet data along with their corresponding classifications (tuple form), and splits them into two separate variables, one to be used for Training, and one to be used for calculating Accuracy.
* **Inputs**: None. Just references globals.
* **Globals**: `percent_train` is an integer representing the percentage of the total dataset that will be allocated for training purposes. `tweetVectorData` holds the vectorized form of each tweet along with the corresponding classifications (tuple form).
* **Outputs**: `training_set` holds the vectorized training data along with the corresponding answers for the ML algorithms to train on. `testing_set` holds the vectorized testing data along with the corresponding answers, used for calculating Accuracy. `cap` is the index in `tweetVectorData` at which the testing set begins, returned for reference if needed.

In [0]:
def splitData():
  cap = int( len(tweetVectorData) * percent_train/100 ) # index where the testing set begins
  training_set = tweetVectorData[:cap]
  testing_set = tweetVectorData[cap:]
  return training_set, testing_set, cap
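
The split arithmetic can be checked on a toy list. The helper name `split_by_percent` is hypothetical; it takes its inputs as parameters instead of globals but uses the same slicing as above.

```python
def split_by_percent(items, percent_train):
    """Split items into (training, testing) at the percent_train mark."""
    cap = int(len(items) * percent_train / 100)
    return items[:cap], items[cap:], cap

# With 10 items and an 80% split, the first 8 go to training and the last 2 to testing.
train, test, cap = split_by_percent(list(range(10)), 80)
print(len(train), len(test), cap)  # 8 2 8
```

Because the data is shuffled before it is split, a plain slice like this yields a random partition.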

Function `readFromKeyboard()`:
* **Description**: This function reads a sentence in from the keyboard (until *Enter* is pressed), cleans the "tweet", vectorizes it, and returns the vectorized form of the entered tweet. *Note: only the tweet is known, not its classification.*
* **Inputs**: The values input to the Keyboard.
* **Globals**: `featureList` is required, so that the tweet can be vectorized along the elements in the list of working features.
* **Outputs**: `tweet` is the raw tweet as entered on the keyboard. `clean_tweet` is the cleaned version of the tweet (only relevant words). `tweetVector` is the vectorized version of the tweet (same length as *featureList*).

In [0]:
def readFromKeyboard():
  t = TweetTokenizer()
  tweet = input("Enter Tweet: ")
  clean_tweet = cleanTweet(tweet)
  print('Cleaned Tweet:', clean_tweet,'\n')

  tweetVector = vectorizeTweet(t.tokenize(clean_tweet), featureList)
  return tweet, clean_tweet, tweetVector

Function `enterTweet()`:
* **Description**: This function allows the User to input their own "tweet" and makes a prediction of whether the input tweet was referring to a disaster or a non-disaster, using the `voted_classifier` (uses 5-classifier voting system).
* **Inputs**: None. Tweet will eventually be entered in through the keyboard.
* **Globals**:  `voted_classifier` (object from ClassifierI family) is a *VoteClassifier* object, which uses member functions `self.classify()` and `self.confidence()` to output the results.
* **Outputs**: Prints the voting system's prediction, along with its confidence value. Also, prints the predictions of each individual classifier in the voting system.

In [0]:
def enterTweet():
  _, _, tweetVector = readFromKeyboard() 
  classification = voted_classifier.classify(tweetVector)
  confidence = voted_classifier.confidence(tweetVector)*100

  print("\nClassification:__", classification, 
        "__\tConfidence %:__", confidence, '__')
  print('----------------------------------------------------------\n')

# Main

# Load and Clean Data

In [0]:
train_tweets, y_trainOrig, chi_tweets, y_trainChi, hou_tweets, y_trainHou = read_in_tweets()

In [0]:
train_data, train_results = combineDatasets(orig=ORIG, chi=CHI, hou=HOU)

# Bag of Words

In [0]:
tokenizedData, wordFrequencies = apply_BagOfWords(train_data, train_results) # does frequency distribution, shuffles data

# Create Features / Vectorize Tweets 

In [0]:
tweetVectorData, featureList = vectorizeTweets() # returns list of tuples: (vectorized tweet : classification), and the list of features
training_set, testing_set, cap = splitData() # splits tweetVectorData into two sets

# Training

In [0]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [112]:
MultinomialNB_classifier = SklearnClassifier(MultinomialNB())
MultinomialNB_classifier.train(training_set)

<SklearnClassifier(MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))>

In [113]:
BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)

<SklearnClassifier(BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))>

In [114]:
LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)

<SklearnClassifier(LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False))>

In [115]:
SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)

<SklearnClassifier(SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False))>

In [116]:
LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)

<SklearnClassifier(LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0))>

In [0]:
voted_classifier = VoteClassifier(classifier,
                                  MultinomialNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier,
                                  SGDClassifier_classifier)

# Results

In [118]:
classifier.show_most_informative_features(25)

Most Informative Features
               hiroshima = True           disast : non-di =     88.4 : 1.0
                northern = True           disast : non-di =     86.8 : 1.0
                malaysia = True           disast : non-di =     63.6 : 1.0
                  debris = True           disast : non-di =     57.0 : 1.0
                wildfire = True           disast : non-di =     47.1 : 1.0
                   kills = True           disast : non-di =     47.1 : 1.0
                  atomic = True           disast : non-di =     46.1 : 1.0
                 bombing = True           disast : non-di =     40.7 : 1.0
                 reunion = True           disast : non-di =     38.8 : 1.0
                 typhoon = True           disast : non-di =     35.5 : 1.0
                    amid = True           disast : non-di =     33.9 : 1.0
                 suicide = True           disast : non-di =     32.6 : 1.0
                  feared = True           disast : non-di =     32.2 : 1.0

In [0]:
accuracy1 = (nltk.classify.accuracy(classifier, testing_set))*100
accuracy2 = (nltk.classify.accuracy(MultinomialNB_classifier, testing_set))*100
accuracy3 = (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100
accuracy4 = (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100
accuracy5 = (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100
accuracy6 = (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100
votedAccuracy = (nltk.classify.accuracy(voted_classifier, testing_set))*100

In [120]:
print("Naive Bayes Algo accuracy percent: ", accuracy1)
print("MultinomialNB_classifier Algo accuracy percent: ", accuracy2)
print("BernoulliNB_classifier Algo accuracy percent: ", accuracy3)
print("LogisticRegression_classifier Algo accuracy percent: ", accuracy4)
print("SGDClassifier_classifier Algo accuracy percent: ", accuracy5)
print("LinearSVC_classifier Algo accuracy percent: ", accuracy6)
print("voted_classifier accuracy percent:", votedAccuracy)

Naive Bayes Algo accuracy percent:  83.93574297188755
MultinomialNB_classifier Algo accuracy percent:  83.89112003569835
BernoulliNB_classifier Algo accuracy percent:  83.80187416331995
LogisticRegression_classifier Algo accuracy percent:  83.75725122713075
SGDClassifier_classifier Algo accuracy percent:  81.25836680053547
LinearSVC_classifier Algo accuracy percent:  81.07987505577867
voted_classifier accuracy percent: 84.24810352521196
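
The accuracy percentages above are the fraction of test examples whose predicted label matches the true label, multiplied by 100. A minimal sketch of that calculation follows; the `accuracy` helper, the one-feature `predict` rule, and the four labeled examples are all made up for illustration and are not the NLTK implementation.

```python
def accuracy(predict, labeled_examples):
    """Fraction of (features, label) pairs for which predict(features) == label."""
    correct = sum(1 for features, label in labeled_examples
                  if predict(features) == label)
    return correct / len(labeled_examples)

# Made-up rule: call it a disaster whenever the 'fire' feature is present.
def predict(features):
    return 'disaster' if features['fire'] else 'non-disaster'

examples = [({'fire': True},  'disaster'),
            ({'fire': False}, 'non-disaster'),
            ({'fire': True},  'non-disaster'),   # the rule gets this one wrong
            ({'fire': False}, 'non-disaster')]

print(accuracy(predict, examples) * 100)  # 75.0
```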


# Try it Yourself

In [139]:
enterTweet()

Enter Tweet: Texas man kills 30 in school shooting
Cleaned Tweet: texas man kills school shooting 

classifier : disaster
MultinomialNB_classifier : disaster
BernoulliNB_classifier : disaster
LogisticRegression_classifier : disaster
SGDClassifier_classifier : disaster

Classification:__ disaster __	Confidence %:__ 100.0 __
----------------------------------------------------------

