# Project: Anti-Refugee Tweet Classification using Sentiment Analysis
---
### Goal: Predict whether a given tweet is Pro-Refugee or Anti-Refugee.
---

## Introduction - A Model

Up until now, we have looked at how to extract tweets from Twitter, tokenize them, and build a rule-based classifier. Today, we shall clean the data and train a binary classifier on Bag of Words vectors. We will also learn what tf-idf weighting is, and train a simple neural network to see if it is more accurate than the simple logistic regression model.

Before we do all this, we must understand what the standard pipeline for an NLP model is like. 

What is a model? A model is something that we will use to make predictions from our data. It gains a numerical understanding of the data such that given a new data point it has never seen before, it can figure out how it links to the previous data. We can choose to specify the *functional form* of the model beforehand, if we have some understanding of how our data looks (for example, using linear regression to predict a continuous value and logistic regression to predict binary values). 

Yesterday, we created rule-based classifiers based on the nature of the specific problem at hand (deciding whether or not a particular tweet expresses anti-refugee sentiment). Today, we will take a more general approach that allows us to classify text without needing to hard-code rules about the specific problem at hand.

In many ways this will be similar to the Yelp Review Sentiment Analysis we did last week. Getting practice with creating these baseline models is extremely important because it is often the first step one takes when trying to solve a more difficult problem.

## Imports

Run the cells below to get started. This will take a minute or so.

In [None]:
#@title Run this to import all the necessary packages { display-mode: "form" }
import json
import tweepy
from datetime import datetime, timedelta
import re
import numpy as np
import random
import math
from collections import Counter
import matplotlib.pyplot as plt
import os
import sys
import pandas

import nltk
nltk.download('punkt', quiet=True)
nltk.download('wordnet',quiet=True)
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords' ,quiet=True)

from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

import warnings
warnings.filterwarnings('ignore')

import gdown
import zipfile
import shutil

gdown.download('https://drive.google.com/uc?id=1ifYLZ-19ZyjjRUICe4PDRmZFAkyL73d0','./source_data.zip',True)
my_zip = zipfile.ZipFile('./source_data.zip', mode = 'r')
my_zip.extractall()
basepath = './drive/Team Drives/Inspirit Curriculum/Inspirit AI Program/Working Materials/Tejit\'s Material/Anti-Refugee Sentiment Analysis'

try:
  shutil.move('./Anti-Refugee Sentiment Analysis/', basepath)
except shutil.Error:
  pass

module_folder = './drive/Team Drives/Inspirit Curriculum/Inspirit AI Program/Working Materials/Tejit\'s Material/Anti-Refugee Sentiment Analysis/'
if module_folder not in sys.path: sys.path.append(module_folder)
import lib
from lib import Tweet
from lib import Tweet_counts

# # If the above doesn't work, then upload the file!
# from google.colab import files
# src = list(files.upload().values())[0]
# open('lib.py','wb').write(src)
# import lib
# from lib import Tweet
# from lib import Tweet_counts

In [None]:
#@title Run this to read the data from data.json and split it into train and test sets { display-mode: "form" }
file_name = 'data.json'
dir_name = 'Data'
file_path = os.path.join(basepath, dir_name, file_name)
data = lib.read_json(file_path, shuffle=True)
train, test = train_test_split(data, test_size=0.1)

Total number of unique tweets read are: 672


# Milestone 1: Data Cleaning

## Cleaning


We often can improve the performance of our models by cleaning data before using it for training and prediction. You might have heard that data is what makes machine learning models effective, and garbage training data in will lead to garbage output predictions. This is especially true for data pulled from the web, which is often especially in need of cleaning.

### Exercise (Discussion)

Why do you think data cleaning might be important for this dataset? Discuss a few reasons!

### Examining data to clean - Exercise (Discussion)

Let us look at the data once more, to figure out what we should clean.

In [None]:
tweet_list = [t.original_tweet_text for t in data] # Shorthand For loop
tweet_list[:10] # Looking at the first 10

['rt @care4calais c4c volunteer sophie documents what life is like for refugee families living in calais &gt&gt',
 "i knew this would happen .  it\\'s one reason why i was #nevertrump",
 "merkel should shut she\\'s the cause of europe\\'s refugee problems and terrorism .  liberal douchebag !",
 'this entire thread = excellent analysis of how the so-called #refugee crisis operates as a rhetorical device a del',
 'rt @m_star_online thousands of refugees in britain at risk of homelessness and destitution due to a shortage of government support',
 "rt @thenewdomshow my friend is working in a refugee camp in calais making syrian kids smile by teaching them how to juggle and i\\'ve never",
 'rt @refugeecouncil did you know 1 in every 113 people on the planet is an asylum-seeker internally displaced or a refugee ?  if you a',
 'rt @addamschloe albert einstein was literally a refugee of an aspiring ethnostate and was politically very left-wing',
 'refugee history british aid to refugees in swe

We can get the following properties from the tweets:

> Hashtags

> Mentions

> Punctuation

**Out of these properties, what properties do you think should be removed and why?**


### Tweets post-cleaning

Let us look at what a tweet looks like after cleaning!

In [None]:
#@title Tweet post cleaning { run: "auto", vertical-output: true }
text = "rt @shadilayforever well at least now the african immigrants we get won\\\\'t have ebola !" #@param {type:"string"}
if not text: raise Exception ('Please enter some text')
category = "True" #@param ["True", "False"] {type:"string"}
if category=='True': 
  t = Tweet(text,True)
else: 
  t = Tweet(text,False)
t.tweet_text



'rt well at least now the african immigrants we get won have ebola'

The cleaned tweet text, based on the above rules, can be accessed through the `tweet_text` attribute of a `Tweet` object, e.g. `t.tweet_text` for the tweet `t` created above.
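
The exact cleaning rules live inside the `Tweet` class in `lib.py` and are not shown here. As a rough sketch of what this kind of cleaning might look like, the function below removes mentions, drops the `#` from hashtags, and replaces punctuation with spaces. The function name `clean_tweet` and the specific regular expressions are assumptions made for illustration, not the library's implementation.

```python
import re

def clean_tweet(text):
    """Illustrative cleaning only; the real rules are defined in lib.py."""
    text = text.lower()
    text = re.sub(r'@\w+', ' ', text)        # drop @mentions entirely
    text = text.replace('#', ' ')            # keep the hashtag word, drop the '#'
    text = re.sub(r'[^\w\s=-]', ' ', text)   # replace other punctuation with a space
    return ' '.join(text.split())            # collapse repeated whitespace

print(clean_tweet("i knew this would happen .  it's one reason why i was #nevertrump"))
print(clean_tweet("rt @m_star_online thousands of refugees"))
```

Replacing punctuation with a space (rather than deleting it) is why a contraction such as "it's" ends up as two tokens, "it" and "s", in this sketch.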

In [None]:
tweet_list = [t.tweet_text for t in data] # Shorthand For loop
tweet_list[:10] # Looking at the first 10

['rt c4c volunteer sophie documents what life is like for refugee families living in calais gt gt',
 'i knew this would happen it s one reason why i was nevertrump',
 'merkel should shut she s the cause of europe s refugee problems and terrorism liberal douchebag',
 'this entire thread = excellent analysis of how the so-called refugee crisis operates as a rhetorical device a del',
 'rt thousands of refugees in britain at risk of homelessness and destitution due to a shortage of government support',
 'rt my friend is working in a refugee camp in calais making syrian kids smile by teaching them how to juggle and i ve never',
 'rt did you know 1 in every 113 people on the planet is an asylum-seeker internally displaced or a refugee if you a',
 'rt albert einstein was literally a refugee of an aspiring ethnostate and was politically very left-wing',
 'refugee history british aid to refugees in sweden during the napoleonic wars',
 'refugee support shouldn be a hard ask it s about common sen

## Stop Words

### Exercise (Discussion)

We covered the properties of a tweet that need to be removed. What about words? Are there words that can be removed without affecting the model? Write a few examples of words that you think can be removed from a sentence without causing it to be misclassified. (Think of words that occur very commonly, and in both tweet categories...)

### Stop Words

Stop words are words that occur in both categories and are not relevant to the context, such as 'at', 'is', 'the' and so on. It is usually advantageous for the classifier to ignore these stop words, since they may add noise or cause numerical issues as they add baggage to the model. For example, the word "are" doesn't tell you much that could be helpful for our sentiment analysis classification task, and it can be removed to simplify the task for our model.
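
As a minimal sketch of the idea, the snippet below filters stop words out of a tokenized sentence. The tiny stop list `SAMPLE_STOPWORDS` and the helper `remove_stop_words` are made up for illustration; NLTK's English stop word list is much longer.

```python
# A tiny hand-picked stop list, for illustration only
SAMPLE_STOPWORDS = {'at', 'is', 'the', 'are', 'in', 'a', 'of'}

def remove_stop_words(text, stop_words=SAMPLE_STOPWORDS):
    """Return the tokens of `text` that are not in `stop_words`."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

print(remove_stop_words('refugees are welcome in the city'))
# ['refugees', 'welcome', 'city']
```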

In [None]:
#@title Few Stop-Words { vertical-output: true, display-mode: "form" }
eng_stopwords = set(stopwords.words('english'))
for i,word in enumerate(eng_stopwords):
    if i>10: break
    print(word)


here
against
no
mustn
hasn't
those
such
before
now
did
you're


Let us see if the words you identified are stop words or not. Check your words here, using this interactive piece of code. After you run once, the code will run automatically whenever you change the word.

In [None]:
#@title Check stop words { run: "auto", vertical-output: true, display-mode: "both" }
word = "how" #@param {type:"string"}
if not word: raise Exception('Please enter a word')
eng_stopwords = set(stopwords.words('english'))
if word.lower().strip() in eng_stopwords: print('Yes, this is a stop word.')
else: print('No, ' + word + ' is not a stop word.')




Yes, this is a stop word.


## Stemming vs Lemmatization

### Introduction

Stemming is the process of reducing words to their root/base word. For instance, it reduces all of the following words to "like":

* "likes"
* "liked"
* "likely"
* "liking"

Lemmatization is also the process of reducing words to their root/base, but it refers to the dictionary form of the word. While lemmatization always preserves a grammatically correct version of a word (as opposed to stemming), it may also not reduce some words down to the form you expected. In the forms of the word "like" above, only "likes" would be lemmatized down to "like".

We can see the difference between the lemmatized and the stemmed versions below.

```
 Original |Stem |Lemma 
--------------------
  likes   |like | like 
  liked   |like |liked 
  likely  |like |likely
  liking  |like |liking
```

The stem of each word is the same!

### Stemming vs Lemmatization - Interactive

The following piece of code allows you to check the stem and the lemma of different words. You can enter multiple words at a time by separating them with commas.

In [None]:
#@title Stemming { display-mode: "both" }
word = "was" #@param {type:'string'}
if not word: raise Exception('Please enter a word/words')
# split on commas so that several words can be checked at once
words = [w.strip() for w in word.split(',') if w.strip()]
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for w in words:
  print('Original: ' + w)
  print("STEM: ")
  print(stemmer.stem(w))
  print("LEMMA: ")
  print(lemmatizer.lemmatize(w))
  


Original: was
STEM: 
wa
LEMMA: 
wa


# Milestone 2: Bag of Words and Tfidf Vectorizers

## One-Hot Encoding

**One Hot Encoding**, also known as the one-of-K scheme, is a way to encode categorical data as numbers so that it can be used by other functions (such as linear regression).

Let us consider an example to understand one hot encoding!

Before we apply a model to our tweets, we need to convert them to a form the model, i.e. a machine, can understand: essentially, we convert each tweet to numerical form. We cannot just pass words to the model, because it won't know what they mean and might try to extract information from them. Hence, a numerical format is best.

The easiest way to do so is to map each word in a tweet to a number, a categorical value. This will represent all words in a tweet uniquely!

---
Suppose we have a tweet:
> Tweet: 'supporting refugee children in education' <br>
> Category: False

Its numerical representation would be:

In [None]:
#@title Numeric Represention { vertical-output: true, display-mode: "form" }
d = {'supporting': 1, 'refugee': 2, 'children': 3, 'in': 4, 'education': 5}
print('{:<12}|{:>2}'.format('word', 'value'))
print('-------------------')
for k,v in d.items(): print('{:<12}|{:>3}'.format(k,v))

word        |value
-------------------
supporting  |  1
refugee     |  2
children    |  3
in          |  4
education   |  5


### Exercise (Discussion)

Why do you think the above representation is wrong (or at least unhelpful)? What information could a model extract from it that would lead to conclusions far from what we want to achieve?

### One Hot Encoding

We need to encode the words and include them as features to train the model. That is where One Hot Encoding comes into play. Understanding how it is done will make the process clearer.

Let us consider the same example and see what its one hot encoding would be.

In [None]:
#@title One Hot Encoding { vertical-output: true, display-mode: "form" }

print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('supporting', 'refugee', 'children','in','education'))
print('---------------------------------------------------')
print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('1', '0', '0','0','0'))
print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('0', '1', '0','0','0'))
print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('0', '0', '1','0','0'))
print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('0', '0', '0','1','0'))
print('{:^10}|{:^7}|{:^8}|{:^2}|{:^9}'.format('0', '0', '0','0','1'))


supporting|refugee|children|in|education
---------------------------------------------------
    1     |   0   |   0    |0 |    0    
    0     |   1   |   0    |0 |    0    
    0     |   0   |   1    |0 |    0    
    0     |   0   |   0    |1 |    0    
    0     |   0   |   0    |0 |    1    


As you can see, each word is represented by a row of 1's and 0's, with one position for every word in the vocabulary. This is called one hot encoding.

So 'supporting' would be [1,0,0,0,0]

And a whole sentence is the combination of all the words and hence their one hot encoding. So the representation of the sentence 'supporting children' would be [1,0,1,0,0]. We can see how this encoding does not suffer the problem of our initial numerical representation.
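
The encoding described above can be sketched in a few lines of plain Python. The helper names `one_hot` and `encode_sentence` are made up for this illustration; the vocabulary is the five-word example tweet.

```python
vocab = ['supporting', 'refugee', 'children', 'in', 'education']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector with a single 1 at the position of `word` in the vocabulary."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

def encode_sentence(sentence):
    """Combine the one hot vectors of all words in the sentence."""
    vec = [0] * len(vocab)
    for word in sentence.split():
        vec[word_to_index[word]] = 1
    return vec

print(one_hot('supporting'))                   # [1, 0, 0, 0, 0]
print(encode_sentence('supporting children'))  # [1, 0, 1, 0, 0]
```

Unlike the 1-to-5 numbering, no word gets a "bigger" value than another here, so the model cannot read an ordering into the encoding.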

### The Bag of words model

Now that we have the one hot encoding for all tweets, it's time to attach a value to each word, such that it can be distinguished and used by the model. 

One idea is to assign each word the same weight, say a weight of one. However, that is non-distinguishable and the model won't be able to learn anything with that information. Can you think of a simple way to add weights to words which, say, occur more frequently?

That is the Bag of Words Model. The **Bag of Words Model** converts the tweets into a matrix of token counts. Let us consider an example.

----

Say I have the following tweets:

> Tweet: 'supporting refugee children in education' <br>Category: False

> Tweet: 'supporting children is bad' <br>Category: True

> Tweet: 'refugee have the same right' <br>Category: False

The Vocabulary for the tweets would be:

In [None]:
#@title Vocabulary { vertical-output: true, display-mode: "form" }
tweet1 = Tweet('supporting refugee children in education', False)
tweet2 = Tweet('supporting children is bad', True)
tweet3 = Tweet('refugee have the same right', False)
word_count = Counter()
for tweet in [tweet1, tweet2, tweet3]:
  for t in tweet.tokenList: word_count[t]+=1
word_count_list = [(k,v) for k,v in word_count.items()]
word_count_list.sort(key=lambda x:x[0])
print('{:<12}|{:>2}'.format('word', 'position'))
print('-------------------')
for k,v in enumerate(word_count_list): print('{:<12}|{:>3}'.format(v[0],k))

word        |position
-------------------
bad         |  0
children    |  1
education   |  2
have        |  3
in          |  4
is          |  5
refugee     |  6
right       |  7
same        |  8
supporting  |  9
the         | 10


Since there are 11 words in the vocabulary, each row in the matrix has 11 entries, one per word.

Based on this, one hot encoding for each tweet is:

> Tweet: 'supporting refugee children in education' <br>Encoding: [0,1,1,0,1,0,1,0,0,1,0]

> Tweet: 'supporting children is bad' <br>Encoding: [1,1,0,0,0,1,0,0,0,1,0]

> Tweet: 'refugee have the same right' <br>Encoding: [0,0,0,1,0,0,1,1,1,0,1]

The matrix that will be given to the model would be:


In [None]:
#@title Matrix { vertical-output: true }
np.array([[0,1,1,0,1,0,1,0,0,1,0],
          [1,1,0,0,0,1,0,0,0,1,0],
          [0,0,0,1,0,0,1,1,1,0,1]], dtype=np.int64)


array([[0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1]])

And the token count for each word is

In [None]:
#@title Token Counts { vertical-output: true }
print('{:<12}|{:>2}'.format('word', 'word_count'))
print('-------------------')
for k,v in word_count_list: print('{:<12}|{:>3}'.format(k,v))

word        |word_count
-------------------
bad         |  1
children    |  2
education   |  1
have        |  1
in          |  1
is          |  1
refugee     |  2
right       |  1
same        |  1
supporting  |  2
the         |  1


### Exercise (Coding) (Challenge)

Let us try and code a function that gives us the bag of words of a sentence. As arguments, we will need all the sentences to make up the list of words, and we will need the sentence that we want to get the bag of words of. 

First we would need to create a vocabulary of all the unique words in all the sentences. Next, we can use a Python dictionary to map each word to a particular index i.e. the `key` would be the word and the `value` would be a unique number corresponding to each word.

We can then initialize our bag of words encoding as a numpy array of zeros (`np.zeros()`) of the same length as the vocabulary, and loop through all the words of our individual sentence. We can access the index of that particular word using our dictionary, and then increment that corresponding index in the bag of words encoding.

If you have any questions about this code, feel free to ask your instructor!

*Hint: an easy way to delete duplicate items from a list is to use a set, e.g. `lst = list(set(lst)).`*

In [None]:
def bow_encoding(sentences, sentence):
  """
  param: sentences - list of sentences used to build the vocabulary
  param: sentence - sentence to return the bag of words encoding of
  return: sent_encoding - the encoded sentence
  """
  vocab = []
  for sent in sentences:
    vocab.extend(sent.split())
  vocab = list(set(vocab))
  
  str_to_idx = {}
  for w in range(len(vocab)):
    word = vocab[w]
    str_to_idx[word] = w     # fill in the blanks!
    
  sent_encoding = np.zeros(len(vocab))
  
  # split the sentence into words; looping over the string itself would iterate over characters
  for word in sentence.split():
    try:
      indx = str_to_idx[word]
      sent_encoding[indx] += 1
    except KeyError:
      pass   # the word is not in the vocabulary
  
  return sent_encoding

Test your encoding function here.

In [None]:
encode = bow_encoding(tweet_list,'remembering an earlier refugee crisis and a family who risked their lives to help')

In [None]:
# print the count of every vocabulary word that appears in the sentence
for i in encode:
  if i > 0.:
    print (i)

In [None]:
# we can see how many nonzero elements are in our vector by running this cell
np.count_nonzero(encode)

Congratulations! You have replicated the core functionality of scikit-learn's `CountVectorizer` class! We will still use `CountVectorizer` going forward, because it's important to get familiar with the tools that are commonly used in industry.

### CountVectorizer - Review

CountVectorizer produces a vector from each individual piece of text. The length of each vector is the same (the length of the vocabulary) and so the group of all the texts together is a matrix, which can then be passed on to a machine learning model later on (such as logistic regression or a neural network). Think of it this way: it just allows you to generate the correct kind of input to be used by the model to learn information about the training data. Any test data that it makes predictions on must be similarly transformed with a vectorizer. 

`CountVectorizer`'s `fit` method learns the vocabulary from the training text. Its `transform` method then converts the given data to the count matrix using that vocabulary; there is no learning here, and `transform` does not generate a vocabulary of its own. Hence, `transform` cannot be used as a substitute for `fit`: it just produces the matrix representation of the data, and the vectorizer must be fitted first.
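
To make the split between `fit` and `transform` concrete, here is a toy sketch in plain Python. The class `TinyCountVectorizer` is a made-up stand-in, not scikit-learn's implementation; for instance, the real `CountVectorizer` also lowercases text and, by default, ignores single-character tokens.

```python
class TinyCountVectorizer:
    """Toy illustration of the fit/transform split; not scikit-learn's code."""

    def fit(self, texts):
        # Learn the vocabulary: every unique whitespace-separated token, sorted
        tokens = sorted({tok for text in texts for tok in text.split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, texts):
        # Count tokens using the already-learned vocabulary; unseen tokens are skipped
        rows = []
        for text in texts:
            row = [0] * len(self.vocabulary_)
            for tok in text.split():
                if tok in self.vocabulary_:
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows

vec = TinyCountVectorizer().fit(['sample tweet two', 'this is the third sample'])
print(vec.vocabulary_)
print(vec.transform(['sample sample unseen tweet']))
```

Note that the word "unseen" is silently dropped by `transform`, because it was not part of the text the vectorizer was fitted on. This is also why test data must be transformed with the vectorizer that was fitted on the training data.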

### Exercise (Coding)

Let us make the CountVectorizer and use its fit and transform methods. An example has been shown for you below. Try it on tweets of your choice!

In [None]:
tweet_01 = Tweet('This is a big tweet ', True)
tweet_02 = Tweet('Sample Tweet two', True)
tweet_03 = Tweet('this is the third sample', True)
train_text = [t.tweet_text for t in [tweet_01, tweet_02, tweet_03]]
print(train_text)


vectorizer = CountVectorizer()
vectorizer.fit(train_text)

print(vectorizer.transform(tweet_02.tokenList))

['this is a big tweet', 'sample tweet two', 'this is the third sample']
  (0, 2)	1
  (1, 6)	1
  (2, 7)	1


The transformed vector is generally very sparse, with a large majority of zeros (even though our example above has only 8 words in the vocabulary). Instead of displaying all the zeros, `print` on this sparse matrix outputs only the nonzero counts.

Each row of the printed output has the form `(row index, index in the vocabulary)                  count`. Because we passed `tweet_02.tokenList` (a list of tokens) to `transform`, each token is treated as its own one-word text, so the row index here is the position of the word in the sentence.

## Tf-Idf Weighting

Counting how frequently words occur, as Bag of Words does, is the most basic way to "vectorize" text. One step up in sophistication is **tf-idf** weighting. Tf-idf stands for "term frequency-inverse document frequency". This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. The two terms are calculated as follows:

*   TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization: 

  TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).


*   IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following: 

  IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
  
The intuition for the inverse document frequency weighting term is similar to that of stop words: some words do not contribute much meaning. However, it is more subtle. Imagine we have a collection of legal documents we want to classify. There are certain words that are not stop words but would be common to many of the documents we may want to classify, such as "affidavit", "prosecute", "case" etc. These words indicate that we are in the legal space but do not help us in our task of distinguishing one legal document from another. You might see how this would be helpful for our task of distinguishing anti-refugee tweets from pro-refugee tweets!
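
As a worked sketch of the two formulas above, the snippet below computes TF, IDF and their product for a tiny made-up corpus. The three documents and the helper names `tf`, `idf` and `tf_idf` are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed variant of IDF and normalizes each vector by default, so its values will not match these exactly.

```python
import math

# Made-up toy corpus, for illustration only
docs = [
    'refugee children need education',
    'refugee policy is bad',
    'refugee families need support',
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # (number of times the term appears in the document) / (total terms in the document)
    return tokens.count(term) / len(tokens)

def idf(term, all_docs):
    # log_e(total documents / documents containing the term)
    # assumes the term occurs in at least one document
    n_containing = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / n_containing)

def tf_idf(term, tokens, all_docs):
    return tf(term, tokens) * idf(term, all_docs)

for term in ['refugee', 'need', 'bad']:
    print(term, round(idf(term, tokenized), 4))
```

Here "refugee" appears in every document, so its IDF is log(3/3) = 0 and it contributes nothing, while "bad" appears in only one document and gets the largest weight.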

We can use scikit-learn's `TfidfVectorizer` in the same way as `CountVectorizer`

`vectorizer = TfidfVectorizer()`

![alt text](https://skymind.ai/images/wiki/tfidf.png)







In [None]:
tweet_01 = Tweet('This is a big tweet ', True)
tweet_02 = Tweet('Sample Tweet two', True)
tweet_03 = Tweet('this is a third sample', True)
# You can add more tweets here
train_text = [t.tweet_text for t in [tweet_01, tweet_02, tweet_03]]
print(train_text)

### YOUR CODE STARTS HERE ###

# make a TfidfVectorizer
vectorizer = TfidfVectorizer()
# fit the data train_text to the vectorizer using the .fit method
vectorizer.fit(train_text)
# print the vocabulary of the vectorizer (the learned mapping is stored in vocabulary_)
print(vectorizer.vocabulary_)

### END CODE ###


print(vectorizer.transform(tweet_03.tokenList))


['this is a big tweet', 'sample tweet two', 'this is a third sample']
  (0, 4)	1.0
  (1, 1)	1.0
  (3, 3)	1.0
  (4, 2)	1.0


## Logistic Regression

### What is Logistic Regression?

We've just spent the last week or so learning about more sophisticated neural network architectures. Why should we begin working on a complicated task using such a simple model? Remember that logistic regression is just linear regression followed by a sigmoid function.

**Review:** What is Logistic Regression?

Logistic regression is a linear model that is generally used for binary classification. Unlike linear regression, which outputs continuous values, logistic regression passes the linear output through the logistic function (also called the sigmoid function) to return a probability value between 0 and 1, which can then be mapped to the different categories. The logistic (sigmoid) function looks something like this:

![Logistic Function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/sigmoid.png)

Consider an example that highlights the difference between logistic and linear regression:

> Suppose we are given data on time spent studying and exam scores. Linear regression and logistic regression can predict different things:

>> *Linear Regression* could help us predict the student’s test score on a scale of 0 - 100. Linear regression predictions are continuous (numbers in a range).<br>
*Logistic Regression* could help us predict whether the student passed or failed. Logistic regression predictions are discrete (only specific values or categories are allowed). We can also view probability scores underlying the model’s classifications.
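
The pass/fail example can be sketched as a linear score followed by a sigmoid. The `weight`, `bias` and `threshold` values below are made up for illustration and are not fitted to any data; `predict_pass` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_pass(hours_studied, weight=1.5, bias=-6.0, threshold=0.5):
    # weight and bias are illustrative values, not learned from data
    score = weight * hours_studied + bias   # the linear regression part
    probability = sigmoid(score)            # probability of passing
    return probability, probability >= threshold

print(predict_pass(4))   # score is 0, so the probability is exactly 0.5
print(predict_pass(2))   # negative score, probability below 0.5
```

Training a logistic regression model amounts to finding the weight and bias values that make these probabilities agree with the labelled examples as well as possible.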

### Why Logistic Regression?

**A couple of reasons for using Logistic regression:**
1.   Using a simpler model tells us how much room we have to improve. 
2.   A simple model makes iteration quick and easy. We'll see that for the project of classifying a tweet based on its words, hashtags and punctuation, cleverly extracting features from the text will be important for our success. Using a model that trains and evaluates quickly is essential for rapid feature selection.
3.   Lastly, and perhaps most importantly, logistic regression is interpretable. You may have heard in the past that one thing deep neural networks struggle with is interpretability–when you are using these models to make predictions that affect people's wellbeing (e.g., sentencing decisions, predictive policing decisions), it becomes extremely important that you are able to understand why a model is making the predictions it makes. For simpler models like logistic regression, we get interpretability for free! 

### Comparison between Bag of Words and Tfidf

We can train a Bag of Words model and a tf-idf model separately to vectorize the tweets, and use those vectors to train a logistic regression model. We can then compare which of the two performs better.

However, there are also a lot of arguments we could pass into our vectorizers and our logistic regression model that could potentially improve it. In particular, we could pass a list of custom `stop_words` to our vectorizers: words that are common in our problem but do not help distinguish between a pro-refugee and an anti-refugee tweet.

We can also pass in a parameter to our vectorizer `sublinear_tf = True` (this replaces tf with 1 + log(tf)). 

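For the Logistic Regression model, we can pass in a parameter `penalty` that can take the values `'l1'` and `'l2'`. These are two kinds of regularization, which discourage large weights in different ways. Note that scikit-learn's default solver only supports `'l2'`; to try `'l1'`, also pass a solver that supports it, such as `solver='liblinear'`.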
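
As a small sketch of how these parameters fit together, the snippet below trains two logistic regression models on a made-up four-tweet dataset, one with each penalty. The texts and labels are invented for illustration and are far too few to say anything about accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy data, for illustration only (1 = anti-refugee, 0 = pro-refugee)
texts = ['refugees welcome here', 'support refugee children',
         'send them back', 'close the border now']
labels = [0, 0, 1, 1]

X = CountVectorizer().fit_transform(texts)

# liblinear supports both 'l1' and 'l2'; the default solver only supports 'l2'
l1_model = LogisticRegression(penalty='l1', solver='liblinear').fit(X, labels)
l2_model = LogisticRegression(penalty='l2', solver='liblinear').fit(X, labels)

# l1 tends to push some weights to exactly zero; l2 tends to shrink them instead
print('non-zero weights with l1:', (l1_model.coef_ != 0).sum())
print('non-zero weights with l2:', (l2_model.coef_ != 0).sum())
```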

[Other parameters for Bag of Words (CountVectorizer)](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

[Other parameters for Tf-idf](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

[Other parameters for Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [None]:
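#@title List of your custom stop words
# enter a Python list of words; note that this name replaces the nltk `stopwords` module imported earlier
stopwords = [] #@param {type:'raw'}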


In [None]:
# load the train and test data
train_tweets = [t.tweet_text for t in train]
train_tweets_label = [t.category for t in train]
test_tweets = [t.tweet_text for t in test]

In [None]:
bow_vectorizer = CountVectorizer(stop_words=stopwords) # try passing in different parameter values
bow_train_vect = bow_vectorizer.fit_transform(train_tweets)
bow_test_vect = bow_vectorizer.transform(test_tweets)

bow_model = LogisticRegression() # try passing in penalty='l2', or penalty='l1' together with solver='liblinear'
bow_model.fit(bow_train_vect, train_tweets_label)

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords,sublinear_tf=True) # try passing in different parameter values (try sublinear_tf = True)
tfidf_train_vect = tfidf_vectorizer.fit_transform(train_tweets)
tfidf_test_vect = tfidf_vectorizer.transform(test_tweets)

tfidf_model = LogisticRegression() # try passing in penalty='l2', or penalty='l1' together with solver='liblinear'
tfidf_model.fit(tfidf_train_vect, train_tweets_label)

In [None]:
# make the predictions and get the correct values
bow_predictions = bow_model.predict(bow_test_vect)
tfidf_predictions = tfidf_model.predict(tfidf_test_vect)
correct = np.array([t.category for t in test])

In [None]:
bow_accuracy = accuracy_score(correct,bow_predictions)
tfidf_accuracy = accuracy_score(correct,tfidf_predictions)
print('Bag of words accuracy: {:0.4f}'.format(bow_accuracy))
print('Tf-idf accuracy: {:0.4f}'.format(tfidf_accuracy))

Play around with the parameters mentioned above to try and get the highest accuracy!

### Confusion Matrix

Let us look at some stats about the tf-idf model's predictions to understand what the model predicted, where it did well, and where it went wrong.

In [None]:
matrix = confusion_matrix(correct, tfidf_predictions)
lib.disp_confusion_matrix(matrix)
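
To make the four cells of a binary confusion matrix concrete, here is a plain-Python sketch that counts them by hand. The helper `binary_confusion_counts` and the example labels are made up for illustration; `True` is treated as the positive class (anti-refugee, as in the labelled examples earlier).

```python
def binary_confusion_counts(actual, predicted):
    """Count true/false positives and negatives, with True as the positive class."""
    counts = {'TN': 0, 'FP': 0, 'FN': 0, 'TP': 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts['TP'] += 1      # correctly flagged
        elif a and not p:
            counts['FN'] += 1      # missed
        elif not a and p:
            counts['FP'] += 1      # wrongly flagged
        else:
            counts['TN'] += 1      # correctly left alone
    return counts

actual    = [True, True, False, False, True]
predicted = [True, False, False, True, True]
counts = binary_confusion_counts(actual, predicted)
print(counts)   # {'TN': 1, 'FP': 1, 'FN': 1, 'TP': 2}

# accuracy is the share of predictions on the diagonal (TP and TN)
accuracy = (counts['TP'] + counts['TN']) / len(actual)
print(accuracy)  # 0.6
```

For boolean labels, scikit-learn's `confusion_matrix` arranges the same four numbers as `[[TN, FP], [FN, TP]]`, with rows for the actual class and columns for the predicted class.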

Let us look at the tweets that the Bag of Words model predicted incorrectly:

In [None]:
print('{:^125}|{:^10}|{:^7}'.format('Tweet','Category','Result'))
for i in range(len(test)):
  if correct[i] != bow_predictions[i]:
    print('{:<125}|{:^10}|{:^7}'.format(str(test[i]), correct[i], bow_predictions[i]))

Based on these incorrect predictions, are there any words you think you might want to add/remove to the list of stop words passed in to your model above?

# Milestone 3: Using Tfidf Vectors in a Neural Network

We used the tf-idf vectors we created for each tweet in a logistic regression model. We can also use these vectors as inputs to a simple neural network. We can use `scikit-learn`'s `MLPClassifier` (a multilayer perceptron classifier) and modify the size of the hidden layer to see if that affects our accuracy.

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
nnet = MLPClassifier(hidden_layer_sizes=(400,)) # one hidden layer of 400 units; try passing in solver='lbfgs' or solver='sgd'

In [None]:
nnet.fit(tfidf_train_vect, train_tweets_label)
nn_predictions = nnet.predict(tfidf_test_vect)

In [None]:
nn_acc = accuracy_score(correct,nn_predictions)

In [None]:
nn_acc

### Exercise (Discussion)

How did the neural network perform compared to the logistic regression model? What can be done to improve it?

In [None]:
'''
YOUR ANSWER HERE
'''

## Conclusion

That is it for today! Tomorrow we will cover another concept: word embeddings, and in particular word2vec.