### Codio Activity 18.5: Tfidf

This activity focuses on using term frequency-inverse document frequency (tfidf) to vectorize text.  First, you will compute tfidf by hand on a small example.  Then, you will use scikit-learn's `TfidfVectorizer` together with a `LogisticRegression` estimator to see whether performance on predicting the WhatsApp status improves with a different representation.

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

### Small Example

As discussed in the lectures, the formula for tfidf is given as:

$$\text{tfidf} = \text{term frequency} \times \text{inverse document frequency}$$



In [3]:
tokens = [['The', 'burritos', 'were', 'not', 'great'],
          ['The', 'burritos', 'were', 'great', 'great', 'great'],
          ['The', 'taco', 'was', 'good']]

In [4]:
tokens

[['The', 'burritos', 'were', 'not', 'great'],
 ['The', 'burritos', 'were', 'great', 'great', 'great'],
 ['The', 'taco', 'was', 'good']]

### Problem 1

#### Term Frequency

$$tf(t, d) = \frac{\text{number of times that t occurs in d}} {\text{number of words in d}}$$

Compute the tf scores for the three documents in `tokens` for the word **great**.  Assign as a list of floats to `tfs` below.

In [5]:
# "great" appears in 1 of 5 words, 3 of 6 words, and 0 of 4 words
tfs = [0.20, 0.50, 0.0]
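
As a quick check on the hand computation, the term frequency formula can be sketched in plain Python. The helper name `term_frequency` and the variable `tfs_check` are illustrative and not part of the activity.

```python
# Same toy corpus as above, redefined so the snippet runs on its own.
tokens = [['The', 'burritos', 'were', 'not', 'great'],
          ['The', 'burritos', 'were', 'great', 'great', 'great'],
          ['The', 'taco', 'was', 'good']]

def term_frequency(term, doc):
    """Fraction of the words in `doc` that are equal to `term`."""
    return doc.count(term) / len(doc)

# One tf score per document for the word "great".
tfs_check = [term_frequency('great', doc) for doc in tokens]
print(tfs_check)  # [0.2, 0.5, 0.0]
```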

### Problem 2

#### Inverse document frequency

The inverse document frequency is given by the formula:

$$idf(t) = -\log\left(\frac{\text{number of documents that contain t}}{\text{total number of documents}}\right)$$

Compute the idf score for the word great.  Assign as a float to `idf` below. Be sure to use `np.log` to compute the logarithm.

In [6]:
idf = -np.log(2/3)
idf

np.float64(0.40546510810816444)
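
The same value can be derived from the corpus instead of typing `2/3` directly. This is a minimal sketch that uses `math.log` from the standard library, which returns the same natural logarithm as `np.log` for a scalar. The helper name `inverse_document_frequency` is illustrative.

```python
import math

# Same toy corpus as above, redefined so the snippet runs on its own.
tokens = [['The', 'burritos', 'were', 'not', 'great'],
          ['The', 'burritos', 'were', 'great', 'great', 'great'],
          ['The', 'taco', 'was', 'good']]

def inverse_document_frequency(term, docs):
    """Negative natural log of the share of documents that contain `term`.

    Raises ValueError if no document contains the term, because log(0)
    is undefined.
    """
    n_containing = sum(term in doc for doc in docs)
    return -math.log(n_containing / len(docs))

# "great" appears in 2 of the 3 documents.
idf_check = inverse_document_frequency('great', tokens)
print(round(idf_check, 6))  # 0.405465
```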

### Problem 3

####  tfidf by hand

Now, combine the tf and idf scores to compute the tfidf as:

$$tfidf(t, d) = tf(t, d) \times idf(t)$$

for the word "great".  Assign your solution as a list of floats to `tfidfs` below.



In [7]:
tfidfs = [i*idf for i in tfs]
tfidfs

[np.float64(0.0810930216216329),
 np.float64(0.20273255405408222),
 np.float64(0.0)]
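
To see how the weighting behaves beyond a single word, the sketch below applies the same two formulas to every term in the toy corpus. It is an illustration only; the names `idf_by_term` and `tfidf_rows` are made up here. Note that a word appearing in every document, such as "The", has an idf of zero and therefore a tfidf of zero everywhere.

```python
import math

# Same toy corpus as above, redefined so the snippet runs on its own.
tokens = [['The', 'burritos', 'were', 'not', 'great'],
          ['The', 'burritos', 'were', 'great', 'great', 'great'],
          ['The', 'taco', 'was', 'good']]

n_docs = len(tokens)
vocab = sorted({word for doc in tokens for word in doc})

# log(n / df) is the same quantity as -log(df / n) in the idf formula.
idf_by_term = {
    term: math.log(n_docs / sum(term in doc for doc in tokens))
    for term in vocab
}

# One dictionary per document, mapping each term to tf * idf.
tfidf_rows = [
    {term: (doc.count(term) / len(doc)) * idf_by_term[term] for term in vocab}
    for doc in tokens
]

for row in tfidf_rows:
    print({term: round(score, 3) for term, score in row.items() if score > 0})
```

In the first document the rare word "not" ends up with a higher weight than "great", even though both occur once, because "not" appears in fewer documents.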

### Problem 4

#### Using `TfidfVectorizer` in a `Pipeline`

Now, you are to use the scikit-learn transformer `TfidfVectorizer` to transform the WhatsApp data from Kaggle.  The data is loaded and split below.

Create a `TfidfVectorizer`, assigning it to the variable `tfidf`. Fit it on the training data and assign the transformed document-term matrix to the variable `dtm`.


In [9]:
happy_df = pd.read_csv('codio_18_5_solution/data/Emotion(happy).csv')
sad_df = pd.read_csv('codio_18_5_solution/data/Emotion(sad).csv.zip', compression = 'zip')
full_df = pd.concat([happy_df, sad_df]).reset_index(drop = True)
X = full_df.drop('sentiment', axis = 1)
y = full_df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X['content'], y, random_state = 42)

In [10]:
tfidf = TfidfVectorizer()
dtm = tfidf.fit_transform(X_train)

pd.DataFrame(dtm.toarray(), columns = tfidf.get_feature_names_out()).head()

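|   | 0_0 | 100 | 123whatsappstatus | 204 | 30 | 404 | 44 | 45 | 55 | 805 | ... | yes | yesterday | yet | you | young | your | yours | yourself | yous | yuh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.01397 | 0.0 | 0.006924 | 0.398355 | 0.0 | 0.070646 | 0.0 | 0.01305 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.184358 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |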
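
The weights produced by `TfidfVectorizer` will not match the hand formula from Problems 1 to 3. With its default settings, scikit-learn lowercases the text, uses raw counts for the term frequency, applies a smoothed idf of $\ln\frac{1 + n}{1 + df(t)} + 1$, and then scales each row to unit Euclidean length. The sketch below reproduces that recipe in plain Python on the toy corpus, assuming those defaults; the helper names `smoothed_idf` and `l2_normalized_row` are illustrative and are not scikit-learn functions.

```python
import math

# Toy corpus, already lowercased to mimic the vectorizer's default behavior.
docs = [['the', 'burritos', 'were', 'not', 'great'],
        ['the', 'burritos', 'were', 'great', 'great', 'great'],
        ['the', 'taco', 'was', 'good']]

n_docs = len(docs)
vocab = sorted({word for doc in docs for word in doc})

def smoothed_idf(term):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1, which is never zero."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def l2_normalized_row(doc):
    """Raw count times smoothed idf per term, scaled to unit length."""
    raw = [doc.count(term) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [l2_normalized_row(doc) for doc in docs]
great = vocab.index('great')
print([round(row[great], 3) for row in rows])
```

Because of the `+ 1` term, a word that occurs in every document (here "the") keeps a nonzero weight, unlike in the hand formula.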


### Problem 5

#### Pipeline with `TfidfVectorizer`

Now, create a pipeline named `tfidf_pipe` below.  It should have named steps `tfidf` and `lgr` that implement a `TfidfVectorizer` and a `LogisticRegression` estimator, respectively.

In [11]:
tfidf_pipe = Pipeline([('tfidf', TfidfVectorizer()),
                       ('lgr', LogisticRegression())])
tfidf_pipe

In [12]:
tfidf_pipe.fit(X_train, y_train)
test_acc = tfidf_pipe.score(X_test, y_test)

tfidf_pipe.named_steps

{'tfidf': TfidfVectorizer(), 'lgr': LogisticRegression()}

### Problem 6

#### Grid Searching the Pipeline

Use the parameter grid below to create a grid search object named `tfidf_grid` from your pipeline `tfidf_pipe`.  Assess the performance on the test set as `tfidf_acc` and consider how this representation of the text compares to the pure counts of `CountVectorizer`.

In [13]:
params = {'tfidf__max_features': [100, 500, 1000, 2000],
         'tfidf__stop_words': ['english', None]}

In [14]:
# Names follow the problem statement: `tfidf_grid` and `tfidf_acc`
tfidf_grid = GridSearchCV(tfidf_pipe, param_grid = params)
tfidf_grid.fit(X_train, y_train)
tfidf_acc = tfidf_grid.score(X_test, y_test)
tfidf_grid.best_params_


{'tfidf__max_features': 500, 'tfidf__stop_words': None}
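
The keys in `params` use the `step__parameter` convention, so `tfidf__max_features` sets `max_features` on the pipeline step named `tfidf`. The sketch below enumerates the combinations that such a grid describes using only `itertools`, to make the size of the search concrete. It assumes the `GridSearchCV` defaults of 5-fold cross-validation and a final refit on the full training set; the names `candidates` and `n_fits` are illustrative.

```python
from itertools import product

params = {'tfidf__max_features': [100, 500, 1000, 2000],
          'tfidf__stop_words': ['english', None]}

# Every combination of one value per key (order here is not
# necessarily the order scikit-learn iterates in).
names = list(params)
candidates = [dict(zip(names, values)) for values in product(*params.values())]

n_folds = 5                               # default when `cv` is not given
n_fits = len(candidates) * n_folds + 1    # +1 for the refit on all training data

for candidate in candidates:
    print(candidate)
print(len(candidates), 'candidates,', n_fits, 'pipeline fits')
```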

### Codio Activity 18.6: Naive Bayes Algorithm

This activity focuses on the implementation of the Naive Bayes algorithm.  You will use the scikit-learn estimator together with your earlier vectorization strategies to model the WhatsApp text and compare to your earlier work with Logistic Regression.   

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV

### Problem 1

#### Small Example

The example below is adapted from Marsland's *Machine Learning: An Algorithmic Perspective*.  It is a small dataset whose features are whether a student has a looming deadline, whether there is a party going on, and whether the student feels lazy.  The `activity` column is the target, and your aim is to use the Naive Bayes formula below:

$$P(C_i) \prod_{k} P(X_j^k = a_k | C_i)$$

In [16]:
deadline = ['urgent','urgent','near', 'none', 'none', 'none', 'near', 'near', 'near','urgent']
party = ['yes', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no']
lazy = ['yes', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no']
activity = ['party', 'study', 'party', 'party', 'pub', 'party', 'study', 'tv', 'party', 'study']

In [17]:
df = pd.DataFrame({'deadline': deadline, 
                  'party': party,
                  'lazy': lazy,
                  'activity': activity})
df

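|   | deadline | party | lazy | activity |
|---|---|---|---|---|
| 0 | urgent | yes | yes | party |
| 1 | urgent | no | yes | study |
| 2 | near | yes | yes | party |
| 3 | none | yes | no | party |
| 4 | none | no | yes | pub |
| 5 | none | yes | no | party |
| 6 | near | no | no | study |
| 7 | near | no | yes | tv |
| 8 | near | yes | yes | party |
| 9 | urgent | no | no | study |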


Here, $C_i$ represents a class in the `activity` column.  Suppose we want to predict the activity given the input:

```
deadline = near
party = no
lazy = yes
```

This means we need four probabilities:

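- $P(\text{party}) \times P(\text{near} \mid \text{party}) \times P(\text{no party} \mid \text{party}) \times P(\text{lazy} \mid \text{party})$
- $P(\text{study}) \times P(\text{near} \mid \text{study}) \times P(\text{no party} \mid \text{study}) \times P(\text{lazy} \mid \text{study})$
- $P(\text{pub}) \times P(\text{near} \mid \text{pub}) \times P(\text{no party} \mid \text{pub}) \times P(\text{lazy} \mid \text{pub})$
- $P(\text{tv}) \times P(\text{near} \mid \text{tv}) \times P(\text{no party} \mid \text{tv}) \times P(\text{lazy} \mid \text{tv})$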

Compute these four probabilities and assign them to the list `probs` in the order above (party, study, pub, tv). 

Hint: There is no need to work out the arithmetic by hand; you can enter each product as a Python expression.

In [18]:
probs = [1/2*2/5*0, 3/10*1/3*1*1/3, 1/10*0, 1/10*1*1*1]

In [19]:
1/2*2/5*0

0.0

In [20]:
probs

[0.0, 0.03333333333333333, 0.0, 0.1]
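
The same four products can be obtained by counting rows in the table instead of reading the fractions off by eye. The following is a minimal sketch in plain Python; the names `rows`, `query`, and `naive_bayes_score` are illustrative and not part of the activity.

```python
# The ten rows of the table: (deadline, party, lazy, activity).
rows = [
    ('urgent', 'yes', 'yes', 'party'),
    ('urgent', 'no',  'yes', 'study'),
    ('near',   'yes', 'yes', 'party'),
    ('none',   'yes', 'no',  'party'),
    ('none',   'no',  'yes', 'pub'),
    ('none',   'yes', 'no',  'party'),
    ('near',   'no',  'no',  'study'),
    ('near',   'no',  'yes', 'tv'),
    ('near',   'yes', 'yes', 'party'),
    ('urgent', 'no',  'no',  'study'),
]

# The input to classify: deadline = near, party = no, lazy = yes.
query = ('near', 'no', 'yes')

def naive_bayes_score(label):
    """Prior P(label) times the product of P(feature value | label)."""
    in_class = [row for row in rows if row[3] == label]
    score = len(in_class) / len(rows)  # prior
    for j, value in enumerate(query):
        matches = sum(row[j] == value for row in in_class)
        score *= matches / len(in_class)  # conditional probability
    return score

classes = ['party', 'study', 'pub', 'tv']
probs_check = [naive_bayes_score(label) for label in classes]
print([round(p, 4) for p in probs_check])  # [0.0, 0.0333, 0.0, 0.1]
```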

### Problem 2

#### MAP solution

Using these probabilities, the maximum a posteriori (MAP) solution selects the outcome associated with the highest probability.  Use your list of probabilities to identify the `argmax`.  Note that you can use `np.argmax` for this or just inspect the values.  What is the activity associated with the MAP solution?  Assign your answer as a string (`party`, `study`, `pub`, or `tv`) to `ans2` below.

In [21]:
ans2 = 'tv'
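
The selection step can also be written out in code. The sketch below picks the class with the largest score and, as an extra step not required by the activity, divides by the sum of the scores to turn them into posterior probabilities, since the four products are only proportional to the posteriors. The variable names are illustrative.

```python
classes = ['party', 'study', 'pub', 'tv']
probs = [0.0, 1/30, 0.0, 0.1]  # the four products from Problem 1

# Index of the largest score; equivalent to np.argmax(probs) for this list.
best_index = max(range(len(probs)), key=probs.__getitem__)
map_class = classes[best_index]
print(map_class)  # tv

# Normalizing by the total gives probabilities that sum to one.
total = sum(probs)
posteriors = {label: p / total for label, p in zip(classes, probs)}
print({label: round(p, 2) for label, p in posteriors.items()})
```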

### Larger Example

Now, you are to use the scikit-learn vectorizers together with the `MultinomialNB` estimator to implement the Naive Bayes algorithm for classifying the WhatsApp data.  The data is loaded and split for you below.

In [23]:
happy_df = pd.read_csv('codio_18_5_solution/data/Emotion(happy).csv')
sad_df = pd.read_csv('codio_18_5_solution/data/Emotion(sad).csv.zip', compression = 'zip')
full_df = pd.concat([happy_df, sad_df]).reset_index(drop = True)
X = full_df.drop('sentiment', axis = 1)
y = full_df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X['content'], y, random_state = 42)

### Problem 3

#### Pipeline with `CountVectorizer`

Below, create a pipeline called `cvect_pipe` with named steps `cvect` and `bayes` that first vectorizes the text and then uses the `MultinomialNB` estimator with all default settings.  Fit it on the training data and score it on the test data, assigning the accuracy to `cvect_acc` below.

In [24]:
cvect_pipe = Pipeline([('cvect', CountVectorizer()),
                       ('bayes', MultinomialNB())])
cvect_pipe.fit(X_train, y_train)
cvect_acc = cvect_pipe.score(X_test, y_test)

cvect_pipe.named_steps

{'cvect': CountVectorizer(), 'bayes': MultinomialNB()}

### Problem 4

#### Pipeline with `TfidfVectorizer`

Below, create a pipeline called `tfidf_pipe` with named steps `tfidf` and `bayes` that first vectorizes the text and then uses the `MultinomialNB` estimator with all default settings.  Fit it on the training data and score it on the test data, assigning the accuracy to `tfidf_acc` below.


In [25]:
tfidf_pipe = Pipeline([('tfidf', TfidfVectorizer()),
                       ('bayes', MultinomialNB())])
tfidf_pipe.fit(X_train, y_train)
tfidf_acc = tfidf_pipe.score(X_test, y_test)

tfidf_pipe.named_steps

{'tfidf': TfidfVectorizer(), 'bayes': MultinomialNB()}

### Problem 5

#### Assessing performance

Now, consider searching the hyperparameters of the model.  Specifically, which parameter of `MultinomialNB` controls Laplace (additive) smoothing?  Assign your answer as a string to `ans5` below.  As an extra activity, perform a grid search over this parameter and compare the performance to that of `LogisticRegression`.  Also, compare the fit times of the logistic regression and Naive Bayes models.

In [26]:
ans5 = 'alpha'
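
To see what additive smoothing does, it helps to return to the small student example, where several classes scored exactly zero because one feature value never occurred with them. The sketch below adds `alpha` to every count before dividing. It is a categorical analogue meant only as an illustration: `MultinomialNB` applies the same idea to word counts, and the names `smoothed_score` and `n_values` are made up here.

```python
# The ten rows of the student table: (deadline, party, lazy, activity).
rows = [
    ('urgent', 'yes', 'yes', 'party'),
    ('urgent', 'no',  'yes', 'study'),
    ('near',   'yes', 'yes', 'party'),
    ('none',   'yes', 'no',  'party'),
    ('none',   'no',  'yes', 'pub'),
    ('none',   'yes', 'no',  'party'),
    ('near',   'no',  'no',  'study'),
    ('near',   'no',  'yes', 'tv'),
    ('near',   'yes', 'yes', 'party'),
    ('urgent', 'no',  'no',  'study'),
]
query = ('near', 'no', 'yes')
n_values = [3, 2, 2]  # distinct values per feature: deadline, party, lazy

def smoothed_score(label, alpha=1.0):
    """Naive Bayes score with `alpha` added to each conditional count.

    The prior is left unsmoothed. With alpha=0 this reduces to the
    plain counting estimate.
    """
    in_class = [row for row in rows if row[3] == label]
    score = len(in_class) / len(rows)
    for j, value in enumerate(query):
        count = sum(row[j] == value for row in in_class)
        score *= (count + alpha) / (len(in_class) + alpha * n_values[j])
    return score

for label in ['party', 'study', 'pub', 'tv']:
    print(label, smoothed_score(label, alpha=0.0), round(smoothed_score(label), 4))
```

With `alpha = 1` no class is ruled out by a single unseen feature value, which is why the choice of `alpha` can change the predicted class and is worth including in a grid search.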