### Codio Activity 18.4: Bag of Words: Count Vectorization

**Expected Time = 60 minutes**

**Total Points = 25**

In this activity, you will use scikit-learn's `CountVectorizer` to create a bag-of-words representation of the text in a DataFrame. You will then explore how different parameter settings affect the performance of a `LogisticRegression` estimator on a binary classification problem.

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

### The Data

The cells below load the "happy" and "sad" sentiment datasets from Kaggle and combine them into a single DataFrame. The `sentiment` column is the target of our classification models and the `content` column holds the text. The data is then split into training and test sets.

In [4]:
happy_df = pd.read_csv('../data/Emotion(happy).csv')
sad_df = pd.read_csv('../data/Emotion(sad).csv.zip', compression = 'zip')

In [6]:
full_df = pd.concat([happy_df, sad_df]).reset_index(drop = True)

In [8]:
X = full_df.drop('sentiment', axis = 1)
y = full_df['sentiment']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X['content'], y, random_state = 42)

In [12]:
X_train.head()

1287    ['You Hurt Me But I Still Love You.', 'True Lo...
1112    Sorry isn’t always enough. Sometimes you actua...
823     Sometimes two people have to fall apart to rea...
651     True love isn’t love at first sight but love a...
1101    i am scared of getting too close to anyone bec...
Name: content, dtype: object

[Back to top](#-Index)

### Problem 1

#### Using the `CountVectorizer`

**5 Points**

To create a bag-of-words representation of your text data, create an instance of `CountVectorizer` with default settings and assign it to `cvect`.

Next, use the `fit_transform` method of `cvect` to transform the training data `X_train`, and assign the resulting document-term matrix to `dtm`.

Hint: Make sure to transform `X_train`.

In [14]:
### GRADED
cvect = CountVectorizer()
dtm = cvect.fit_transform(X_train)

### ANSWER CHECK
pd.DataFrame(dtm.toarray(), columns = cvect.get_feature_names_out()).head()

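| | 0_0 | 100 | 123whatsappstatus | 204 | 30 | 404 | 44 | 45 | 55 | 805 | ... | yes | yesterday | yet | you | young | your | yours | yourself | yous | yuh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 0 | 1 | 112 | 0 | 13 | 0 | 2 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |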


### Problem 2

#### Limiting words with the `CountVectorizer`

**5 Points**

Now, to remove stop words from the text before vectorizing, create a new instance of `CountVectorizer` with the argument `stop_words = 'english'`. Assign this to the variable `cvect2`.

Next, use the `fit_transform` method of `cvect2` to transform the training data `X_train`, and assign the transformed text to `X_train_vect_2`.

Finally, transform the test data `X_test` with the already fitted `cvect2` (using `transform`, not `fit_transform`) and assign the result to `X_test_vect_2`.

In [16]:
### GRADED
cvect2 = CountVectorizer(stop_words='english')
X_train_vect_2 = cvect2.fit_transform(X_train)
X_test_vect_2 = cvect2.transform(X_test)

### ANSWER CHECK
X_train_vect_2

<1007x1622 sparse matrix of type '<class 'numpy.int64'>'
	with 41589 stored elements in Compressed Sparse Row format>

### Problem 3

#### Limiting words with stop words and `max_features`

**5 Points**


Now create a new instance of `CountVectorizer` that removes English stop words and limits the vocabulary to 300 features, using the arguments `stop_words = 'english'` and `max_features = 300`. With `max_features` set, only the most frequent terms in the training corpus are kept. Assign this to the variable `cvect3`.

Next, use the `fit_transform` method of `cvect3` to transform the training data `X_train`, and assign the transformed text to `X_train_vect_3`.

Finally, transform the test data `X_test` with the fitted `cvect3` and assign the result to `X_test_vect_3`.


In [36]:
### GRADED
cvect3 = CountVectorizer(stop_words='english', max_features=300)
X_train_vect_3 = cvect3.fit_transform(X_train)
X_test_vect_3 = cvect3.transform(X_test)

### ANSWER CHECK
X_train_vect_3

<1007x300 sparse matrix of type '<class 'numpy.int64'>'
	with 33225 stored elements in Compressed Sparse Row format>

[Back to top](#-Index)

### Problem 4

#### Using the text with `LogisticRegression`

**5 Points**

Create a `Pipeline` object named `vect_pipe_1` below with steps named `cvect` and `lgr`, using a default `CountVectorizer` transformer and a default `LogisticRegression` estimator.

Fit this pipeline on the training data `X_train` and `y_train`.

Finally, use the pipeline's `score` method to evaluate it on the test set `X_test` and `y_test`, and assign the result to `test_acc`. For a classifier, `score` returns the mean accuracy.

In [24]:
### GRADED
vect_pipe_1 = Pipeline([
    ('cvect', CountVectorizer()),
    ('lgr', LogisticRegression())
])

vect_pipe_1.fit(X_train, y_train)

test_acc = vect_pipe_1.score(X_test, y_test)

### ANSWER CHECK
vect_pipe_1.named_steps
print(test_acc)

0.8273809523809523
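
A fitted pipeline accepts raw strings directly, because the vectorizer step converts them before they reach the classifier. The following is a minimal sketch on made-up sentences; the texts, the label values `"happy"` and `"sad"`, and the `toy_*` names are assumptions for illustration and may not match the labels in the activity's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data and labels
toy_text = [
    "so happy today",
    "what a joyful happy day",
    "feeling sad and alone",
    "sad lonely night",
]
toy_labels = ["happy", "happy", "sad", "sad"]

toy_pipe = Pipeline([
    ("cvect", CountVectorizer()),   # text -> word counts
    ("lgr", LogisticRegression()),  # word counts -> class label
])
toy_pipe.fit(toy_text, toy_labels)

# Predict on new raw text; no manual vectorizing is needed
toy_pred = toy_pipe.predict(["a happy joyful morning"])
print(toy_pred)

# score() returns mean accuracy (here measured on the training data)
toy_acc = toy_pipe.score(toy_text, toy_labels)
print(toy_acc)
```

Wrapping both steps in one object also ensures the vocabulary is learned only from the data passed to `fit`, which matters once cross-validation is involved.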


[Back to top](#-Index)

### Problem 5

#### Pipeline and Grid Search

**5 Points**

Initialize a `GridSearchCV` object with the pipeline `vect_pipe_1` and the parameter grid `params` given below. Assign the result to the variable `grid`.

Fit the `grid` object on the training data `X_train` and `y_train`.

Finally, use the `score` method of `grid` to evaluate it on the test set `X_test` and `y_test`. Assign the result to `test_acc`.

In [28]:
params = {'cvect__max_features': [100, 500, 1000, 2000],
         'cvect__stop_words': ['english', None]}

In [34]:
### GRADED
grid = GridSearchCV(estimator=vect_pipe_1, param_grid=params, cv=5)
grid.fit(X_train, y_train)
test_acc = grid.score(X_test, y_test)

### ANSWER CHECK
grid.best_params_
print(test_acc)

0.8273809523809523
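
The keys in a pipeline parameter grid follow the pattern `<step name>__<parameter name>`, which is why `params` uses `cvect__max_features` and `cvect__stop_words`. The sketch below runs the same kind of search on made-up data and lists the score for every combination; the texts, labels, grid values, `cv=2`, and the `toy_*` names are all assumptions chosen to keep the example small.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four texts per class
toy_text = [
    "so happy today", "what a joyful happy day",
    "happy smiles all around", "joyful and happy morning",
    "feeling sad and alone", "sad lonely night",
    "tears and sad goodbyes", "lonely and sad again",
]
toy_labels = ["happy"] * 4 + ["sad"] * 4

toy_pipe = Pipeline([
    ("cvect", CountVectorizer()),
    ("lgr", LogisticRegression()),
])

# "cvect__" routes each setting to the step named "cvect"
toy_params = {
    "cvect__max_features": [5, 50],
    "cvect__stop_words": ["english", None],
}

toy_grid = GridSearchCV(toy_pipe, param_grid=toy_params, cv=2)
toy_grid.fit(toy_text, toy_labels)

print(toy_grid.best_params_)

# One row per parameter combination, with its mean cross-validated accuracy
toy_results = pd.DataFrame(toy_grid.cv_results_)[[
    "param_cvect__max_features",
    "param_cvect__stop_words",
    "mean_test_score",
    "rank_test_score",
]]
print(toy_results)
```

Looking at `cv_results_` rather than only `best_params_` shows how close the competing settings are, which helps explain cases where the tuned model scores the same on the test set as the default pipeline.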
