### Bag of words: Exercises


- In this exercise, you will classify whether a given movie review is **positive or negative**.
- You will use Bag of Words to pre-process the text and then apply different classification algorithms.
- scikit-learn's `CountVectorizer` provides a built-in implementation of Bag of Words.

In [1]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from  sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download


- The data consists of two columns:
    - review
    - sentiment
- Each review is the text a user wrote after watching a movie.
- The sentiment column tells whether the given review is positive or negative.

In [3]:
#1. read the data provided in the same directory with name 'movies_sentiment_data.csv' and store it in df variable

df = pd.read_csv("movies_sentiment_data.csv")

#2. print the shape of the data
df.shape

#3. print top 5 datapoints


(19000, 2)

In [4]:
df.head(5)

|   | review | sentiment |
|---|--------|-----------|
| 0 | I first saw Jake Gyllenhaal in Jarhead (2005) ... | positive |
| 1 | I enjoyed the movie and the story immensely! I... | positive |
| 2 | I had a hard time sitting through this. Every ... | negative |
| 3 | It's hard to imagine that anyone could find th... | negative |
| 4 | This is one military drama I like a lot! Tom B... | positive |


In [8]:
#create a new column "Category" that is 1 if the sentiment is positive and 0 if it is negative
df["Category"]=df['sentiment'].apply(lambda x:1 if x=='positive' else 0)
df.head()

|   | review | sentiment | Category |
|---|--------|-----------|----------|
| 0 | I first saw Jake Gyllenhaal in Jarhead (2005) ... | positive | 1 |
| 1 | I enjoyed the movie and the story immensely! I... | positive | 1 |
| 2 | I had a hard time sitting through this. Every ... | negative | 0 |
| 3 | It's hard to imagine that anyone could find th... | negative | 0 |
| 4 | This is one military drama I like a lot! Tom B... | positive | 1 |


In [10]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.
df.Category.value_counts()

1    9500
0    9500
Name: Category, dtype: int64

In [11]:
#Do the 'train-test' splitting with test size of 20%
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(df.review,df.Category,test_size=0.2)
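
The split above is random, so the exact rows in each set (and every number printed later) change from run to run. As an optional sketch, `train_test_split` also accepts `random_state` for reproducibility and `stratify` to keep the class balance in both sets. The tiny `toy_df` below is a made-up stand-in for `df`, and the seed value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df: 10 rows, perfectly balanced labels
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "Category": [1, 0] * 5,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df.review,
    toy_df.Category,
    test_size=0.2,
    random_state=42,           # arbitrary fixed seed, makes the split repeatable
    stratify=toy_df.Category,  # keep the 50/50 class ratio in both sets
)

# Both classes appear equally often in the 2-row test set
print(y_te.value_counts())
```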


In [18]:
y_train[:1]

7265    1
Name: Category, dtype: int64

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
v=CountVectorizer()
x_train_cv = v.fit_transform(X_train.values)
x_train_cv

<15200x62369 sparse matrix of type '<class 'numpy.int64'>'
	with 2079352 stored elements in Compressed Sparse Row format>

In [20]:
x_train_cv.shape

(15200, 62369)

In [36]:
v.get_feature_names_out()[1327]

'actress'

In [35]:
x_train_cv.toarray()[0][1327]

1

In [None]:
v.vocabulary_

In [28]:
x_train_np = x_train_cv.toarray()

In [31]:
np.where(x_train_np[0]!=0)

(array([ 1327,  1741,  1747,  2272,  2693,  3080,  3580,  3673,  3748,
         3929,  5297,  5388,  5893,  6250,  8335,  8539,  9521, 10561,
        12235, 13777, 16168, 18188, 19853, 20472, 20503, 21185, 21222,
        21747, 21755, 22984, 23231, 23401, 23805, 25569, 26137, 26557,
        26566, 27093, 27465, 27526, 28878, 28951, 29023, 31819, 32271,
        32933, 34762, 36450, 36707, 36715, 36789, 37112, 38281, 38360,
        38783, 38994, 39796, 42834, 44696, 44791, 44807, 45601, 48067,
        49538, 51229, 51416, 51468, 52756, 53176, 54900, 54908, 55349,
        55362, 55367, 55404, 55530, 55563, 55981, 56004, 57239, 58604,
        60185, 60229, 60655, 60660, 61169, 61465, 61951, 61967],
       dtype=int64),)

In [34]:
v.get_feature_names_out()[1327]

'actress'

**Exercise-1**

1. Using the sklearn `Pipeline` module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- Use `CountVectorizer` for pre-processing the text.

- Use **Random Forest** as the classifier with `n_estimators=50` and `criterion='entropy'`.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
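
If the `Pipeline` pattern is unfamiliar, the sketch below shows its general shape on made-up toy data. It deliberately uses `LogisticRegression` as a stand-in classifier, so it is not the solution to this exercise; the step names `"vectorizer"` and `"classifier"` and all `toy_*` variables are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the IMDB reviews
toy_train = ["great film", "loved it", "wonderful acting",
             "terrible film", "hated it", "awful acting"]
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_test = ["great acting", "awful film"]
toy_test_labels = [1, 0]

# A pipeline chains the vectorizer and the classifier into one estimator,
# so raw text can be passed straight to fit() and predict()
toy_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression()),  # stand-in, swap in the required model
])

toy_clf.fit(toy_train, toy_train_labels)
toy_pred = toy_clf.predict(toy_test)

# zero_division=0 avoids warnings if a class is never predicted
print(classification_report(toy_test_labels, toy_pred, zero_division=0))
```

Because the vectorizer is fitted inside the pipeline, its vocabulary is learned from the training text only and then reused unchanged on the test text.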

In [4]:
#1. create a pipeline object




#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report


**Exercise-2**

1. Using the sklearn `Pipeline` module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- Use `CountVectorizer` for pre-processing the text.
- Use **KNN** as the classifier with `n_neighbors=10` and `metric='euclidean'`.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html



In [5]:

#1. create a pipeline object


#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred


#4. print the classification report


**Exercise-3**

1. Using the sklearn `Pipeline` module, create a classification pipeline that classifies movie reviews as positive or negative.

**Note:**
- Use `CountVectorizer` for pre-processing the text.
- Use **Multinomial Naive Bayes** as the classifier.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html



In [6]:

#1. create a pipeline object



#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report


### Can you write some observations on why a model like KNN fails to produce good results, unlike RandomForest and MultinomialNB?



## [**Solution**](./bag_of_words_exercise_solutions.ipynb)