<a href="https://colab.research.google.com/github/malaika-n/Inspirit-AI-ClassifyingTweets/blob/main/Section_2__Data_Preprocessing_and_Simple_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Disaster Relief
# Section 2: Data Preprocessing and Simple Models


Today we will be working with some more sophisticated techniques: one-hot encoding, CountVectorizer, and the Bag of Words model. You can review the slides to remember how they work. But first we will need to clean our data a little bit, an important step in any AI pipeline.

In [None]:
#@title Load your dataset { display-mode: "form" }
# Run this every time you open the notebook
%load_ext autoreload
%autoreload 2
from collections import Counter
from importlib.machinery import SourceFileLoader
import numpy as np
from os.path import join
import warnings
warnings.filterwarnings("ignore")

import nltk
nltk.download('punkt')
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import *
from nltk.corpus import stopwords
nltk.download('stopwords' ,quiet=True)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import multilabel_confusion_matrix
from sklearn import metrics
import gdown


gdown.download('https://drive.google.com/uc?id=1umFXM7SvdBvTlHW0r0CXDcxNqL73jU8Z', 'disaster_data.csv', True)


import re
import string

nltk.download('wordnet')

# Introduction - What is A Model?


Up until now, we have looked at disaster-related tweets, made a rule-based classifier, and evaluated it. Today, we shall clean the data, stem it, and learn about One-Hot Encoding, CountVectorizer, and Logistic Regression.

Before we do all this, we must understand what a model really is.

What is a model? A model is something that we will make to predict the category of a given tweet. It is a numerical understanding of the data such that given a new data point, we can figure out how it links to the previous data. This is a crude definition, which you will understand more through this project.

The models we have used or will use in this project are:

1. Rule Based Classifier (You specify rules based on which the model gives the output - the category of the tweet)
2. CountVectorizer + Logistic Regression: A model based on counting the number of occurrences of each word and applying logistic regression to those counts to predict the category of a new tweet
3. Word2Vec + Logistic Regression: A model which applies the idea that words that occur in similar contexts tend to have similar meanings, and uses this to predict the category of the tweet (It is quite difficult to understand this, so don't worry if you don't get it in the first go! More explanation on this later.)


## Data Preprocessing

Before we can jump into building a model, we must clean the data a little bit!

In [None]:
# Load the data.
disaster_tweets = pd.read_csv('disaster_data.csv',encoding ="ISO-8859-1")

In [None]:
disaster_tweets.head()

**Discussion Exercise**: Consider the tweet *I really need food.... I am very hungry. The hunger is unbearable. #pleasehelp*

1. Are all words in the tweet equally informative?
2. Are there any words in this sentence that mean the same thing, but are technically distinct words?
3. Are there any unnecessary words or symbols that we could remove from the tweet before building a model?

We are going to play with three pre-processing steps to address these three questions.

### Removal of non-alphabetic characters

In tweet classification we use words as the features, so it's important to remove unwanted characters such as punctuation marks and symbols, as they don't provide us with any valuable information. Note that the regular expression used below keeps letters and digits, and replaces every other run of characters with a single space.

In [None]:
# Read the tweet data and convert it to lowercase
tweets = disaster_tweets['text'].str.lower()
# Replace every run of characters that are not letters or digits with a single space
tweets = tweets.apply(lambda x: re.sub(r'[^a-zA-Z0-9]+', ' ',x))
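
To see what this cleaning step does to a single tweet, here is a small sketch that applies the same lowercase-and-regex combination to the sample tweet from the discussion exercise. The variable names `sample` and `cleaned` are only for illustration and are not used elsewhere in the notebook.

```python
import re

# The sample tweet from the discussion exercise above
sample = "I really need food.... I am very hungry. The hunger is unbearable. #pleasehelp"

# Lowercase first, then replace every run of non-letter, non-digit characters with one space
cleaned = re.sub(r'[^a-zA-Z0-9]+', ' ', sample.lower())

print(cleaned)
# i really need food i am very hungry the hunger is unbearable pleasehelp
```

Notice that the hashtag symbol disappears but the word `pleasehelp` stays glued together, since the regex only removes characters and does not split words.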

In [None]:
#Extract the labels from the csv
tweet_labels = disaster_tweets['category']

### Tokenizing
First we need to split a sentence into individual words, or *tokens*.


###**Discussion Exercise**:  

What would the token list of the following sentence be? "*AI is so fun! I love to learn about NLP and machine learning!*"

Discuss your answer with your group, then write it down in your worksheet.

In [None]:
#@title Tokenizing
tweet = "AI is so fun! I love to learn about NLP and machine learning!" #@param {type:'string'}
for i in word_tokenize(tweet):
  print(i)

## Stemming and Lemmatization

Remember that the goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

A difference between stemming and lemmatization is that stemming chops off word endings using a fixed set of rules, while lemmatization uses a vocabulary and, where available, the word's part of speech to return its dictionary form (the *lemma*). Either way, doing this pre-processing step by hand would be tedious. Luckily, `nltk` provides tools for both.



### Exercise: Stemming using the Porter stemmer
*Porter's algorithm*, developed in the 1980s, is one of the most commonly used stemmers.  



Try and find a word that Porter's stemming doesn't work well on! (Hint: Try some plurals of words that end in -e)

In [None]:
#@title Stem words { run: "auto", vertical-output: true, display-mode: "form" }
stemmer = PorterStemmer()
word = "" #@param {type:"string"}
print(stemmer.stem(word))


### Lemmatizer

You can try more words in the stemming cell above and see what the stemming results look like.  
You may find that the results look a bit mechanical. This is because Porter's algorithm is essentially a sequential application of a set of rules. To get better looking results, let's try out a lemmatizer.

In [None]:
#@title Lemmatize Words { run: "auto", vertical-output: true, display-mode: "form" }
# Get the lemmatizer
lemma = WordNetLemmatizer()
word = "" #@param {type:"string"}
print(lemma.lemmatize(word))


###**Discussion Exercise**:

 What are the differences between the Porter stemmer and the lemmatizer? How do you think the lemmatizer works?

 Discuss your answer then write it down.

## Stop Words

###**Discussion Exercise**:

Are there words that can be removed without affecting the model?

Write examples of a few words that you think can be removed from the sentence without the sentence being mis-classified. (Think of words that occur most commonly, and in both tweet categories...)

Stop words are words that occur in every category and are not relevant to the context, such as 'at', 'is', 'the' and so on. It is usually advantageous for the classifier to ignore these stop words, since they add noise and extra baggage to the model without helping it tell the categories apart.
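
As a rough illustration of the idea, the sketch below filters a single token list against a tiny hand-written stop word set. The set `toy_stopwords` is an assumption made up for this example; it is much smaller than the NLTK list used in the rest of this notebook.

```python
# A tiny, hand-written stop word set (illustrative only, not NLTK's list)
toy_stopwords = {"i", "am", "the", "is", "very"}

# One tokenized, lowercased tweet
tokens = ["i", "am", "very", "hungry", "the", "hunger", "is", "unbearable"]

# Keep only the tokens that are not stop words
kept = [t for t in tokens if t not in toy_stopwords]

print(kept)
# ['hungry', 'hunger', 'unbearable']
```

The words that survive are the ones that actually hint at the category of the tweet.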

In [None]:
#@title Few Stop-Words { vertical-output: true, display-mode: "form" }
eng_stopwords = set(stopwords.words('english'))
for i,word in enumerate(eng_stopwords):
    if i>10: break
    print(word)

Let us see if the words you identified are stop words or not. Check your words here, using this interactive piece of code.

In [None]:
#@title Check stop words { run: "auto", vertical-output: true, display-mode: "form" }
word = "" #@param {type:"string"}
if not word: raise Exception('Please enter a word')
eng_stopwords = set(stopwords.words('english'))
if word.lower().strip() in eng_stopwords: print('YES')
else: print('NO')



## Preprocessing pipeline of our data

Explore how combining these methods changes the structure of our tweet dataset. Here are the first 5 tweets after preprocessing.

In [None]:
stopword_set = set(stopwords.words('english'))

'''
Complete the following function to remove the stopwords from the tokenized tweets
'''
def remove_stopwords(token_list):
  filtered_sentences = []
  """
  YOUR CODE HERE
  """
  return filtered_sentences

In [None]:
#Tokenize all the tweets
tokenized_tweets = [word_tokenize(t) for t in tweets]

#Remove Stopwords from all the tweets
tweet_set = remove_stopwords(tokenized_tweets)

## First 5 tweets:
for i in range(5):
  print("Original tweet: %s: \nCleaned and tokenized data: %s\n" % (tweets[i], tweet_set[i]))


# Bag of Words Model

## One-Hot Encoding
**One Hot Encoding**, also known as the one-of-K scheme, is a way to encode the data to be used in other functions (such as linear regression).

Let us consider an example to understand one hot encoding:

Before we apply a model on our tweets, we need to convert them to a form the model, i.e. a machine, can understand: essentially, convert a tweet to numerical form. We cannot just pass words to the model, because it works only with numbers and has no notion of what the words mean. Hence, a numerical format is the best.

The easiest way to do so is to map each word in a tweet to a number, a categorical value. This will represent all words in a tweet uniquely!

---
Suppose we have a tweet:
> Tweet: 'I am hungry need food' <br>
> Category: Food

Its numerical representation would be:

In [None]:
#@title Numeric Represention { vertical-output: true, display-mode: "form" }
d = {'I': 1, 'am': 2, 'hungry': 3, 'need': 4, 'food': 5}
print('{:<12}|{:>2}'.format('word', 'value'))
print('-------------------')
for k,v in d.items(): print('{:<12}|{:>3}'.format(k,v))

**Discussion Exercise**: Why do you think the above representation is wrong? What information could a model possibly extract from the above representation such that its conclusions would be wrong or way off from what we want to achieve?


We need to encode the words and include them as features to train the model. That is where One Hot Encoding comes into play. Understanding how it is done will make the process clearer.

Let us consider the same example and see what its one hot encoding would be.

In [None]:
#@title One Hot Encoding { vertical-output: true, display-mode: "form" }
print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('I', 'am', 'hungry','need','food'))
print('---------------------------------------------------')
print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('1', '0', '0','0','0'))
print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('0', '1', '0','0','0'))
print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('0', '0', '1','0','0'))
print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('0', '0', '0','1','0'))
print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('0', '0', '0','0','1'))


You see here, each word is represented using a row of 1's and 0's, a row with one position for every word in the vocabulary. This is called one hot encoding.

So 'I' would be [1,0,0,0,0]

And a whole sentence is the combination of all the words and hence their one hot encoding. So the representation of the sentence 'I hungry' would be [1,0,1,0,0]

### One Hot Encoding Function

In [None]:
def one_hot_encoding(sentences, sentence, print_word_dict = False):
  """
  param: sentences - list of sentences to form the one hot encoding
  param: sentence - sentence to return the one hot encoding of
  return: sent_encoding - the encoded sentence
  """

  # words_list = an empty list
  words_list = []

  # Loop through all the sentences.
  # Split each sentence into a list of words using word_tokenize
  # Add those words to words_list
  for sent in sentences:
    words_list.extend(word_tokenize(sent))

  # remove the duplicates by making it into a set, then sort it back into a list
  # so that every word gets the same position on every run
  words_list = sorted(set(words_list))

  # words_map_dict = make an empty dictionary
  words_map_dict = {}

  # loop through all the words in words_list
  # add each word as a key in words_map_dict and the index of the word as value.
  for w in range(len(words_list)):
    word = words_list[w]
    words_map_dict[word] = w

  # sent_encoding = a numpy array of all zeros of the same length as words_list
  sent_encoding = np.zeros(len(words_list))

  # loop through the words in the given sentence (sentence to check)
  # Find the index of each word, using the words_map_dict dictionary
  # Change the value of the numpy array at that index to one
  for word in word_tokenize(sentence):
    # skip words that are not in the vocabulary
    if word not in words_map_dict:
      continue
    indx = words_map_dict[word]
    sent_encoding[indx] = 1

  if print_word_dict:
    print ("Word Dictionary: ", words_map_dict)

  return sent_encoding



In [None]:
encoding = one_hot_encoding(["oh I love AI", "AI is so much fun", "is AI fun or what"], "oh what fun AI is ", True)
print("Encoding: ", encoding)


## Count Vectorizer - The Bag of words model

### Introduction

Now that we have the one hot encoding for all tweets, it's time to attach a value to each word, such that it can be distinguished and used by the model.

One idea is to assign each word the same weight, say a weight of one. However, that is non distinguishable and the model won't be able to learn anything with that information. Can you think of a simple way to add weights to words which, say, occur more frequently?

That is the Bag of Words Model. The **Bag of Words Model** converts the tweets into a matrix of token counts. Let us consider an example

----

Recall the first 3 tweets:

In [None]:
for t in tweets[:3]:
  print(t)

The Vocabulary for the tweets would be:

In [None]:
#@title Vocabulary { vertical-output: true, display-mode: "form" }
word_count = Counter()
for tweet in tweets[:3]:
  for t in word_tokenize(tweet):
    word_count[t]+=1
word_count_list = [(k,v) for k,v in word_count.items()]
word_count_list.sort(key=lambda x:x[0])
print('{:<12}|{:>2}'.format('word', 'position'))
print('-------------------')
for k,v in enumerate(word_count_list): print('{:<12}|{:>3}'.format(v[0],k))

If there are N words in the vocabulary, each row in the matrix would have N entries, one per word.

Based on this, one hot encoding for each tweet and the matrix that will be given to the model would be the following.


In [None]:
print("Tweet 1: ", one_hot_encoding([str(t) for t in tweets[:3]], str(tweets[0])))
print("Tweet 2: ", one_hot_encoding([str(t) for t in tweets[:3]], str(tweets[1])))
print("Tweet 3: ", one_hot_encoding([str(t) for t in tweets[:3]], str(tweets[2])))

In [None]:
#@title Token Counts { vertical-output: true }
print('{:<12}|{:>2}'.format('word', 'word_count'))
print('-------------------')
for k,v in word_count_list: print('{:<12}|{:>3}'.format(k,v))
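
The totals above can also be laid out per tweet, which is what "a matrix of token counts" means. The following is a minimal hand-rolled sketch on two made-up sentences; `docs`, `vocab`, and `matrix` are illustrative names, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter

# Two made-up, already cleaned "tweets"
docs = ["water water please", "need food please"]

# Vocabulary: every distinct word, sorted so each word has a fixed column
vocab = sorted({w for d in docs for w in d.split()})

# One row per tweet, one column per vocabulary word, each cell is a count
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)   # ['food', 'need', 'please', 'water']
print(matrix)  # [[0, 0, 1, 2], [1, 1, 1, 0]]
```

Unlike the one hot encoding, the first row contains a 2, because "water" appears twice in that tweet.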

The above is the Bag of Words model. Sklearn's CountVectorizer does the same thing, however, in a much more sophisticated manner.

In [None]:
train_text = [t for t in tweets[:3]]
print(train_text)
vectorizer = CountVectorizer()
vectorizer.fit(train_text)
print('Number of Words in Vocabulary of train tweets are: {}'.format(len(vectorizer.vocabulary_)))

And the Vocabulary is *almost* the same as ours above.

**Discussion Exercise**: How does the vocabulary produced by the CountVectorizer differ from the one produced by our method above? Why do you think that is? Which is better?

In [None]:
vectorizer.vocabulary_

### CountVectorizer - Fit and Transform

CountVectorizer's `fit` method learns from the given data: it tokenizes the text and builds the vocabulary, a mapping from each word to a column position. It does not produce a matrix by itself. Think of the fit method this way: it decides what the columns of the matrix will be.

CountVectorizer's `transform` method converts the given data to a matrix of word counts using the vocabulary learned by `fit`; there is no learning here. It does not generate a vocabulary, so it cannot be used as a substitute for the fit method, and words that were not seen during fitting are ignored. This matrix is what will be passed on to logistic regression (later, so don't worry about understanding that yet). The `fit_transform` method does both steps in one call.<br>
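
A small sketch of the two methods on made-up sentences may help. The sentences and the name `vec` are illustrative assumptions; the calls themselves (`fit`, `transform`, `vocabulary_`) are the standard sklearn ones.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()

# fit: learn the vocabulary (word -> column index) from the training sentences
vec.fit(["send water now", "need food now"])
print(vec.vocabulary_)
# columns are assigned alphabetically: food=0, need=1, now=2, send=3, water=4

# transform: count words of a new sentence using the learned columns
row = vec.transform(["need water water tents"]).toarray()[0]
print(row)
# [0 1 0 0 2]
```

The word "tents" was never seen during `fit`, so it has no column and is silently dropped, while "water" is counted twice.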

### Coding Exercise

Let us make a CountVectorizer and use its fit and transform methods to learn how they work. The starting tweets are given below. Try it on tweets of your choice!

In [None]:
tweet_01 = 'This is a big tweet '
tweet_02 = 'Sample Tweet two'
# You can add more tweets here
train_text = [tweet_01, tweet_02]
print(train_text)

### Your code starts here ###

# make a CountVectorizer
# fit the data train_text to the vectorizer using the .fit method
# print the vocabulary of the vectorizer

tweet_03 = 'Sample tweet'
# Now, transform this tweet's tokenList

### Your code ends here ###



# Logistic Regression

## Logistic Regression Refresher

We've just spent the last week or so learning about more sophisticated neural network architectures.  Remember that logistic regression is just linear regression followed by a sigmoid function.

**Review:** What is Logistic Regression?

Logistic regression is a linear model that is generally used for classification. Unlike linear regression, which outputs continuous number values, logistic regression uses the logistic function, also called the sigmoid function, to transform the output into a probability value between 0 and 1, which can then be mapped to the different categories. The logistic (sigmoid) function looks something like this:

![Logistic Function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/sigmoid.png)
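
To get a feel for the curve, here is a minimal sketch of the sigmoid function, 1 / (1 + e^(-z)), written with Python's standard `math` module. The function name `sigmoid` and the sample inputs are chosen for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number z into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Large negative inputs go towards 0, large positive inputs go towards 1
for z in (-4, 0, 4):
    print(z, round(sigmoid(z), 3))
# -4 0.018
# 0 0.5
# 4 0.982
```

An input of 0 sits exactly in the middle at 0.5, which is why 0.5 is a common cut-off between two categories.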

Consider an example to understand logistic regression and to highlight the difference between logistic and linear regression:

> Suppose we are given data on time spent studying and exam scores. Linear regression and logistic regression can predict different things:

>> *Linear Regression* could help us predict the student’s test score on a scale of 0 - 100. Linear regression predictions are continuous (numbers in a range).<br>
*Logistic Regression* could help us predict whether the student passed or failed. Logistic regression predictions are discrete (only specific values or categories are allowed). We can also view probability scores underlying the model’s classifications.

 **Why Logistic Regression?**

**A few reasons for using Logistic regression:**
1.   Using a simpler model tells us how much room we have to improve.
2.   A simple model makes iteration quick and easy.
3.   Lastly, and perhaps most importantly, logistic regression is interpretable. You may have heard in the past that one thing deep neural networks struggle with is interpretability. When you are using these models to make predictions that affect people's wellbeing (e.g., sentencing decisions, predictive policing decisions), it becomes extremely important that you are able to understand why a model is making the predictions it makes. For simpler models like logistic regression, we get interpretability for free!

## Logistic Regression in Python

Logistic regression in Python can be done easily with the help of sklearn's `LogisticRegression` class. Let us first do it for three small example tweets.

In [None]:
tweet1 = 'please help we desperately need food'
label1 = "Food"
tweet2 = 'We are very thirsty please send water'
label2 = 'Water'
tweet3 = 'we need water and are very thirsty'
label3 = 'Water'

train_tweets = [tweet1, tweet2]
train_tweets_label = [label1, label2]
test_tweets = [tweet3]
test_tweets_label = [label3]

Next, let us make the vectorizer and encode tweets

In [None]:
vectorizer = CountVectorizer() # Countvectorizer
train_vect = vectorizer.fit_transform(train_tweets) # fit_transform fits the train tweets and returns the sparse matrix of the tweets
model = LogisticRegression() # create a LogisticRegression Model
model.fit(train_vect, train_tweets_label) # Fit the data values to the model

Over here, `model.fit(train_vect, train_tweets_label)` fits the logistic regression model to the data given by the matrix. It takes two arguments: the matrix, which is the `train_vect` variable, and `train_tweets_label`, which holds the category each tweet belongs to.

Let us now predict the third tweet using this model, using the method predict. However, first we need to transform the tweet into a vector form

In [None]:
test_vect = vectorizer.transform(test_tweets)
result = model.predict(test_vect)
print('Actual Category: {}\nPredicted Category: {}'.format(label3, result[0]))

Yay! It predicted it correctly! However, that might not always be the case, as the training set here is so small.
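
Because logistic regression is interpretable, we can look behind the prediction. The sketch below rebuilds the same two-tweet model in a self-contained way, then prints the class probabilities and the learned weight of each word. The names `probs`, `words`, and `weights` are illustrative; the exact numbers depend on your sklearn version, so none are shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_tweets = ['please help we desperately need food',
                'We are very thirsty please send water']
train_labels = ['Food', 'Water']

vectorizer = CountVectorizer()
train_vect = vectorizer.fit_transform(train_tweets)
model = LogisticRegression()
model.fit(train_vect, train_labels)

# Probability of each category for the test tweet (order follows model.classes_)
test_vect = vectorizer.transform(['we need water and are very thirsty'])
probs = model.predict_proba(test_vect)[0]
for label, p in zip(model.classes_, probs):
    print(label, round(p, 3))

# One weight per vocabulary word: with two classes, a positive weight
# pushes towards model.classes_[1] and a negative weight towards model.classes_[0]
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
weights = dict(zip(words, model.coef_[0]))
for w in sorted(weights, key=weights.get):
    print(w, round(weights[w], 3))
```

Words that appear only in the Water tweet get positive weights and words that appear only in the Food tweet get negative weights, even when the word itself has nothing to do with water or food.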

**Exercise**: Can you trick the logistic regression? Try and make a tweet get classified as "Water" when it is about "Food".  *Hint: What are some words that were in the Water tweet in the training set that had no significance to water?*

In [None]:
#@title Trick the logistic regression { run: "auto", vertical-output: true, display-mode: "form" }
test_tweet = "Apple juice is the best" #@param {type:"string"}
true_label = "Food" #@param ["Food", "Water"]

test_vect = vectorizer.transform([test_tweet])
result = model.predict(test_vect)
print('Actual Category: {}\nPredicted Category: {}'.format(true_label, result[0]))


## Logistic Regression for Tweet Classification (Coding Exercise)

In [None]:
#Split the Data into Training and Testing
X_train, X_test, y_train, y_test = train_test_split(tweet_set, tweet_labels, test_size=0.2, random_state=1)

Let us now build our own logistic regression model for all the training data:

In [None]:
def train_model(tweets_to_train, train_labels):
  """
  param: tweets_to_train - list of tokenized tweets to train on
  param: train_labels - the category of each training tweet
  return: the vectorizer, the logistic regression model, the train_vect
  """

  # Join each token list back into one string, with spaces between the words
  train_tweets = [" ".join(t) for t in tweets_to_train]
  train_tweets_label = [l for l in train_labels]

  ### Your code starts here ###

  # vectorizer = initialize a CountVectorizer
  vectorizer = ___
  # train_vect = fit the vectorizer on train_tweets and get their sparse matrix
  train_vect = vectorizer.fit_transform(train_tweets)

  # model = initialize a LogisticRegression model
  model = ___
  # fit train_vect and train_tweets_label using the fit function of LogisticRegression
  model.fit(___, ___)

  ### Your code ends here ###

  return vectorizer, model, train_vect



In [None]:
def predict(tweets_to_test, vectorizer, model):
  """
  param: tweets_to_test - list of tokenized tweets to test the model on
  param: vectorizer - the fitted CountVectorizer
  param: model - the LogisticRegression model
  return: result (the predicted categories)
  """

  # Join each token list back into one string, with spaces between the words
  test_tweets = [" ".join(t) for t in tweets_to_test]

  ### Your code starts here ###

  # test_vect = transform the test_tweets to a sparse matrix using the vectorizer's transform function
  test_vect = ____
  # result = predict the result using the model's predict function on test_vect
  result = ____

  ### Your code ends here ###

  return result

In [None]:
vectorizer, model, train_countvect = train_model(X_train, y_train)
# Predict labels for the test set
y_pred = predict(X_test, vectorizer, model)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Evaluation
Let's see how our classifier did! We trained our classifier on 80% of the dataset and tested it on the remaining 20%. This is called a *train-test split* and is usually done to evaluate models.
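
As a quick sanity check of what an 80/20 split looks like, here is a sketch on ten made-up items, using the same `train_test_split` arguments as above. The lists `items` and `labels` are placeholders for illustration.

```python
from sklearn.model_selection import train_test_split

# Ten made-up data points with made-up labels
items = list(range(10))
labels = ['a'] * 5 + ['b'] * 5

# test_size=0.2 holds out 20% of the data; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(items, labels, test_size=0.2, random_state=1)

print(len(X_tr), len(X_te))
# 8 2
```

Every item ends up in exactly one of the two parts, so the model is never tested on a tweet it was trained on.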

In [None]:
table=pd.DataFrame([[" ".join(t) for t in X_test],y_pred, y_test]).transpose()
table.columns = ['Tweet', 'Predicted Category', 'True Category']
print("Percent Correct: %.2f%%" % (100 * sum(table['Predicted Category'] == table['True Category'])/len(table['True Category'])))
table

**Discussion Exercise**: Which categories does the classifier perform best on?  Would the classifier perform better or worse if we only used the food vs water tweets?

## Evaluation Metrics

Let us look at some stats about the prediction to understand what the model predicted! Review day 1 for a refresher on accuracy-related metrics.
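
Before looking at the real results, here is a toy sketch of how accuracy and a confusion matrix are computed with `sklearn.metrics`. The five true and predicted labels are made up for illustration.

```python
from sklearn import metrics

# Made-up true and predicted categories for five tweets
y_true = ['Food', 'Food', 'Water', 'Water', 'Water']
y_hat  = ['Food', 'Water', 'Water', 'Water', 'Food']

# Accuracy: fraction of predictions that match the true label (3 out of 5 here)
acc = metrics.accuracy_score(y_true, y_hat)
print(acc)  # 0.6

# Confusion matrix: rows are true labels, columns are predicted labels
cm = metrics.confusion_matrix(y_true, y_hat, labels=['Food', 'Water'])
print(cm)
# [[1 1]
#  [1 2]]
```

The diagonal holds the correct predictions, so the top-right 1 is the Food tweet that was mistaken for Water.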

In [None]:
#@title Helper Function-Confusion Matrix
'''
Plots the confusion matrix as a heatmap
'''
def plot_confusion_matrix(y_true,y_predicted):
  cm = metrics.confusion_matrix(y_true, y_predicted)
  print ("Plotting the Confusion Matrix")
  labels = ['Energy', 'Food', 'Medical', 'None', 'Water']
  df_cm = pd.DataFrame(cm,index =labels,columns = labels)
  fig = plt.figure()
  res = sns.heatmap(df_cm, annot=True,cmap='Blues', fmt='g')
  plt.yticks([0.5,1.5,2.5,3.5,4.5], labels,va='center')
  plt.title('Confusion Matrix - TestData')
  plt.ylabel('True label')
  plt.xlabel('Predicted label')
  plt.show()
  plt.close()




In [None]:
plot_confusion_matrix(y_test,y_pred)


In [None]:
print('The total number of correct predictions is: {}'.format(sum(table['Predicted Category'] == table['True Category'])))
print('The total number of incorrect predictions is: {}'.format(sum(table['Predicted Category'] != table['True Category'])))

In [None]:
print('Accuracy on the test data is: {:.2f}%'.format(metrics.accuracy_score(y_test, y_pred)*100))

### Exercise: Discussion

Comment out the following code line in the Data Preprocessing part and run through the algorithm again!

`tweets = tweets.apply(lambda x: re.sub(r'[^a-zA-Z0-9]+', ' ',x))`

Comment on the classifier's accuracy!

## Conclusion

**Discussion Exercise**:
1. How could you use what you have built to help during a disaster?


That is it for today! Tomorrow we shall cover another model - GloVe. Review the concepts we covered today and enjoy the rest of your day!
