### Codio Activity 18.5: TF-IDF

**Expected Time = 90 minutes** 

**Total Points = 60** 

This activity focuses on using term frequency-inverse document frequency (TF-IDF) to vectorize text.  First, you will compute TF-IDF by hand on a small example.  Then, you will use scikit-learn's `TfidfVectorizer` together with a `LogisticRegression` estimator to see whether performance on predicting the sentiment of WhatsApp statuses improves with a different representation.

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

### Small Example

As discussed in the lectures, TF-IDF is the product of the term frequency and the inverse document frequency:

$$\text{tfidf} = \text{term frequency} \times \text{inverse document frequency}$$



In [5]:
tokens = [['The', 'burritos', 'were', 'not', 'great'],
          ['The', 'burritos', 'were', 'great', 'great', 'great'],
          ['The', 'taco', 'was', 'good']]

In [7]:
tokens

[['The', 'burritos', 'were', 'not', 'great'],
 ['The', 'burritos', 'were', 'great', 'great', 'great'],
 ['The', 'taco', 'was', 'good']]

[Back to top](#-Index)


### Problem 1

#### Term Frequency

**10 Points** 

$$tf(t, d) = \frac{\text{number of times that t occurs in d}} {\text{number of words in d}}$$

Compute the tf score of the word **great** in each of the three documents in `tokens`.  Assign the result as a list of floats to `tfs` below.

In [9]:
### GRADED
tfs = [(doc.count('great') / len(doc)) for doc in tokens]

### ANSWER CHECK
len(tfs) #should be 3
tfs

[0.2, 0.5, 0.0]
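
The same calculation extends to every term in a document. The sketch below is one possible way to do this with `collections.Counter`; the helper name `term_frequencies` and the variables `all_tfs` and `great_tfs` are hypothetical and not part of the graded solution.

```python
from collections import Counter

tokens = [['The', 'burritos', 'were', 'not', 'great'],
          ['The', 'burritos', 'were', 'great', 'great', 'great'],
          ['The', 'taco', 'was', 'good']]

def term_frequencies(doc):
    """Return {term: count / len(doc)} for one tokenized document."""
    counts = Counter(doc)
    return {term: count / len(doc) for term, count in counts.items()}

# One dictionary of tf scores per document
all_tfs = [term_frequencies(doc) for doc in tokens]

# A term that is absent from a document has tf 0, so look it up with a default
great_tfs = [tf.get('great', 0.0) for tf in all_tfs]
print(great_tfs)  # [0.2, 0.5, 0.0]
```

Because each score is a count divided by the document length, the tf scores within a single document sum to 1.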

[Back to top](#-Index)


### Problem 2

#### Inverse document frequency

**10 Points** 

The inverse document frequency is given by the formula:

$$idf(t) = -\log\left(\frac{\text{number of documents that contain t}}{\text{total number of documents}}\right)$$

Compute the idf score for the word **great**.  Assign the result as a float to `idf` below. 

Be sure to use `np.log` to compute the logarithm.

In [11]:
### GRADED
# Count the number of documents that contain the word "great"
num_docs_with_great = sum(1 for doc in tokens if 'great' in doc)

# Total number of documents
total_docs = len(tokens)

# Compute the IDF score
idf = -np.log(num_docs_with_great / total_docs)

### ANSWER CHECK
print(idf)

0.40546510810816444
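
To see how idf separates common terms from rare ones, the following sketch applies the same formula to every term in the toy corpus. It is an illustration only: it uses `math.log` from the standard library rather than `np.log` so that it runs without NumPy, writes the formula in the equivalent form $\log(N / df)$, and the names `doc_freq` and `idfs` are assumptions introduced here.

```python
import math

tokens = [['The', 'burritos', 'were', 'not', 'great'],
          ['The', 'burritos', 'were', 'great', 'great', 'great'],
          ['The', 'taco', 'was', 'good']]

n_docs = len(tokens)
vocabulary = sorted({term for doc in tokens for term in doc})

# Document frequency: number of documents containing each term at least once
doc_freq = {term: sum(term in doc for doc in tokens) for term in vocabulary}

# idf(t) = -log(df / N), written equivalently as log(N / df)
idfs = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}

# Print terms from least to most distinctive
for term, value in sorted(idfs.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} {value:.4f}")
```

A term that appears in every document, such as `The`, gets an idf of 0, while a term that appears in only one document, such as `taco`, gets the largest idf in this corpus.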


[Back to top](#-Index)


### Problem 3

#### TF-IDF by hand

**10 Points** 

Now, combine the tf and idf scores to compute the TF-IDF as:

$$tfidf(t, d) = tf(t, d) \times idf(t)$$

for the word **great**.  Assign your solution as a list of floats to `tfidfs` below.



In [13]:
### GRADED
tfidfs = [tf * idf for tf in tfs]

### ANSWER CHECK
print(tfidfs)

[0.0810930216216329, 0.20273255405408222, 0.0]


[Back to top](#-Index)


### Problem 4

#### Using `TfidfVectorizer`

**10 Points** 

Now, use the scikit-learn transformer `TfidfVectorizer` to transform the WhatsApp data from Kaggle.  The data is loaded and split below. 

Initialize a `TfidfVectorizer` object with default parameters and assign it to the variable `tfidf`. 

Next, call the `fit_transform` method of `tfidf` with `X_train` as its argument. Assign the result to the variable `dtm`.


In [15]:
happy_df = pd.read_csv('../data/Emotion(happy).csv')
sad_df = pd.read_csv('../data/Emotion(sad).csv.zip', compression = 'zip')
full_df = pd.concat([happy_df, sad_df]).reset_index(drop = True)
X = full_df.drop('sentiment', axis = 1)
y = full_df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X['content'], y, random_state = 42)

In [17]:
### GRADED
tfidf = TfidfVectorizer()
dtm = tfidf.fit_transform(X_train)

### ANSWER CHECK
pd.DataFrame(dtm.toarray(), columns = tfidf.get_feature_names_out()).head()

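| | 0_0 | 100 | 123whatsappstatus | 204 | 30 | 404 | 44 | 45 | 55 | 805 | ... | yes | yesterday | yet | you | young | your | yours | yourself | yous | yuh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.01397 | 0.0 | 0.006924 | 0.398355 | 0.0 | 0.070646 | 0.0 | 0.01305 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.184358 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |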
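
The values produced by `TfidfVectorizer` will generally not match the hand formula from Problems 1 to 3. With its default parameters (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), scikit-learn uses the raw term count as tf, a smoothed idf of $\ln\left(\frac{1 + N}{1 + df}\right) + 1$, and then scales each row to unit Euclidean length. The sketch below reproduces that weighting in plain Python on the toy `tokens` corpus; it does not call scikit-learn, it skips the vectorizer's lowercasing and tokenization, and the names `smooth_idf`, `tfidf_row`, and `rows` are assumptions introduced for illustration.

```python
import math
from collections import Counter

tokens = [['The', 'burritos', 'were', 'not', 'great'],
          ['The', 'burritos', 'were', 'great', 'great', 'great'],
          ['The', 'taco', 'was', 'good']]

n_docs = len(tokens)
vocabulary = sorted({term for doc in tokens for term in doc})
doc_freq = {t: sum(t in doc for doc in tokens) for t in vocabulary}

# Smoothed idf: ln((1 + N) / (1 + df)) + 1, so no term gets a weight of zero
smooth_idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1
              for t in vocabulary}

def tfidf_row(doc):
    """Raw count times smoothed idf, then scaled to unit Euclidean length."""
    counts = Counter(doc)  # missing terms count as 0
    raw = [counts[t] * smooth_idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokens]

# Weights for the word "great" in each of the three documents
great_col = vocabulary.index('great')
print([round(row[great_col], 4) for row in rows])
```

The row normalization is why entries in a document-term matrix built this way lie between 0 and 1 regardless of document length.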


[Back to top](#-Index)


### Problem 5

#### Pipeline with `TfidfVectorizer`

**10 Points** 

Below, create a pipeline named `tfidf_pipe` with steps `tfidf` and `lgr`, given by a `TfidfVectorizer` transformer and a `LogisticRegression` estimator, respectively. 

Next, call the `fit` method of `tfidf_pipe` on the training data `X_train` and `y_train`.

Finally, call the `score` method of `tfidf_pipe` to compute the accuracy on the test data `X_test` and `y_test`. Assign the result to `test_acc`.

In [21]:
### GRADED
tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('lgr', LogisticRegression())
])

tfidf_pipe.fit(X_train, y_train)

test_acc = tfidf_pipe.score(X_test, y_test)

### ANSWER CHECK
tfidf_pipe.named_steps
print(test_acc)

0.7946428571428571


[Back to top](#-Index)


### Problem 6

#### Grid Searching the Pipeline

**10 Points** 

Initialize a `GridSearchCV` object with the pipeline `tfidf_pipe` and parameter grid `params` given below. Assign this result to the variable `grid`.

Fit the `grid` object on training data `X_train` and `y_train`.

Finally, call the `score` method of `grid` to evaluate it on the test set `X_test` and `y_test`. Assign the result to `test_acc`. 


In [25]:
params = {'tfidf__max_features': [100, 500, 1000, 2000],
         'tfidf__stop_words': ['english', None]}

In [27]:
### GRADED
grid = GridSearchCV(estimator=tfidf_pipe, param_grid=params, cv=5)
grid.fit(X_train, y_train)
test_acc = grid.score(X_test, y_test)

### ANSWER CHECK
grid.best_params_

{'tfidf__max_features': 500, 'tfidf__stop_words': 'english'}

In [29]:
print(test_acc)

0.8005952380952381
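
The keys in `params` follow the `<step name>__<parameter>` convention, which is how `GridSearchCV` routes each value to the right step of a `Pipeline`. The sketch below shows this on a small made-up corpus and then lists the mean cross-validated score for every combination via `cv_results_`. The texts, labels, and the reduced grid values are hypothetical stand-ins chosen so the example runs in isolation; they are not the WhatsApp data, and the scores carry no meaning beyond illustrating the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (not the WhatsApp data)
texts = ["great day", "so happy today", "feeling great and happy",
         "love this happy song", "so sad today", "feeling bad and sad",
         "terrible awful day", "sad lonely night"]
labels = ["happy"] * 4 + ["sad"] * 4

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("lgr", LogisticRegression())])

# "tfidf__max_features" sets max_features on the step named "tfidf"
toy_params = {"tfidf__max_features": [5, 10],
              "tfidf__stop_words": ["english", None]}

toy_grid = GridSearchCV(estimator=pipe, param_grid=toy_params, cv=2)
toy_grid.fit(texts, labels)

# One entry per parameter combination: 2 x 2 = 4 candidates
for candidate, score in zip(toy_grid.cv_results_["params"],
                            toy_grid.cv_results_["mean_test_score"]):
    print(candidate, round(score, 3))

print(toy_grid.best_params_)
```

Looking at `cv_results_` rather than only `best_params_` shows how close the competing settings are, which helps when the improvement over the ungridded pipeline is small.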
