### Bag of words: Exercises


- In this exercise, you will classify a given movie review as **positive or negative**.
- You will use Bag of Words to pre-process the text and then apply different classification algorithms.
- scikit-learn's `CountVectorizer` provides a built-in implementation of Bag of Words.

In [1]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download


- This data consists of two columns:
    - review
    - sentiment
- Reviews are the statements written by users after watching the movie.
- The sentiment column tells whether the given review is positive or negative.

In [3]:
#1. read the data provided in the same directory with name 'movies_sentiment_data.csv' and store it in df variable

df = pd.read_csv("movies_sentiment_data.csv")

#2. print the shape of the data
df.shape

#3. print top 5 datapoints


(19000, 2)

In [4]:
df.head(5)

| | review | sentiment |
|---|---|---|
| 0 | I first saw Jake Gyllenhaal in Jarhead (2005) ... | positive |
| 1 | I enjoyed the movie and the story immensely! I... | positive |
| 2 | I had a hard time sitting through this. Every ... | negative |
| 3 | It's hard to imagine that anyone could find th... | negative |
| 4 | This is one military drama I like a lot! Tom B... | positive |


In [8]:
# Create a new column "Category": 1 if the sentiment is positive, 0 if it is negative
df["Category"] = df["sentiment"].apply(lambda x: 1 if x == "positive" else 0)
df.head()

| | review | sentiment | Category |
|---|---|---|---|
| 0 | I first saw Jake Gyllenhaal in Jarhead (2005) ... | positive | 1 |
| 1 | I enjoyed the movie and the story immensely! I... | positive | 1 |
| 2 | I had a hard time sitting through this. Every ... | negative | 0 |
| 3 | It's hard to imagine that anyone could find th... | negative | 0 |
| 4 | This is one military drama I like a lot! Tom B... | positive | 1 |


In [10]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.
df.Category.value_counts()

1    9500
0    9500
Name: Category, dtype: int64

In [45]:
# Split the data into train and test sets with a test size of 20%
# (train_test_split was already imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(df.review, df.Category, test_size=0.2)
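
The split above is random, so the exact numbers in the outputs that follow will differ from run to run. If repeatable results are wanted, one option is to pass `random_state` and `stratify`. The sketch below uses a small invented DataFrame as a stand-in for `df`, and the seed value 42 is an arbitrary assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df: 10 positive and 10 negative placeholder reviews
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "Category": [1] * 10 + [0] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.review,
    toy.Category,
    test_size=0.2,
    random_state=42,        # arbitrary seed, chosen only so the split is repeatable
    stratify=toy.Category,  # keep the 50/50 class ratio in both splits
)

print(len(X_tr), len(X_te))           # 16 4
print(sorted(y_te.tolist()))          # [0, 0, 1, 1]
```

With `stratify`, the test set keeps the same class balance as the full data, which matters less here because the labels are already balanced, but it removes one source of run-to-run variation.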


In [46]:
y_train[:1]

4520    0
Name: Category, dtype: int64

In [47]:
# Learn the vocabulary from the training reviews and build the document-term count matrix
v = CountVectorizer()
x_train_cv = v.fit_transform(X_train.values)
x_train_cv

<15200x62448 sparse matrix of type '<class 'numpy.int64'>'
	with 2073586 stored elements in Compressed Sparse Row format>

In [48]:
x_train_cv.shape

(15200, 62448)

In [49]:
v.get_feature_names_out()[1327]

'actores'

In [50]:
# Index the sparse matrix directly instead of densifying all of it first
x_train_cv[0, 1327]

0

In [51]:
v.vocabulary_

{'horrible': 26496,
 'movie': 36788,
 'this': 55603,
 'beat': 5352,
 'out': 39512,
 'revenge': 46342,
 'of': 38857,
 'the': 55433,
 'living': 32612,
 'zombies': 62352,
 'for': 21321,
 'worst': 61542,
 'have': 25235,
 'ever': 19052,
 'suffered': 53615,
 'through': 55718,
 'what': 60703,
 'were': 60627,
 'morons': 36577,
 'who': 60875,
 'made': 33489,
 'film': 20572,
 'thinking': 55580,
 'was': 60277,
 'it': 29032,
 'supposed': 53940,
 'to': 56064,
 'be': 5296,
 'scary': 48290,
 'because': 5412,
 'man': 33834,
 'let': 32124,
 'me': 34837,
 'tall': 54670,
 'you': 62039,
 'wasn': 60296,
 'so': 51267,
 'dumb': 17141,
 'funny': 22107,
 'we': 60418,
 'all': 2175,
 'know': 30871,
 'that': 55423,
 'tropical': 57071,
 'islands': 28995,
 'are': 3404,
 'natural': 37492,
 'hunting': 26876,
 'grounds': 24173,
 'killer': 30584,
 'snowmen': 51247,
 'and': 2694,
 'those': 55646,
 'stupid': 53351,
 'baby': 4508,
 'snowballs': 51238,
 'fake': 19801,
 'snow': 51234,
 'lousy': 32984,
 'actors': 1329,
 'oh'

In [52]:
# Dense copy of the count matrix; at 15200 x 62448 int64 values this needs roughly 7.6 GB of RAM
x_train_np = x_train_cv.toarray()

In [53]:
np.where(x_train_np[0] != 0)

(array([ 1329,  2175,  2694,  3404,  4508,  5296,  5352,  5412,  7278,
        15079, 16256, 16328, 17141, 19052, 19801, 20572, 21321, 21938,
        22107, 24173, 25235, 25314, 26496, 26876, 28995, 29032, 30584,
        30871, 32124, 32361, 32612, 32837, 32984, 33489, 33834, 34837,
        36577, 36788, 37492, 38352, 38368, 38857, 38932, 39052, 39512,
        45816, 46342, 48290, 51234, 51238, 51247, 51267, 53351, 53615,
        53940, 54670, 55423, 55433, 55580, 55603, 55646, 55718, 55898,
        56064, 57071, 60277, 60296, 60304, 60418, 60627, 60703, 60875,
        61542, 62039, 62056, 62352], dtype=int64),)

In [54]:
v.get_feature_names_out()[1327]

'actores'

**Exercise-1**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **Random Forest** as the classifier with `n_estimators` set to 50 and `criterion` set to entropy.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [59]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer()),    # convert raw text to token counts
    ('random_forest', RandomForestClassifier(n_estimators=50, criterion='entropy'))    # 50 trees, entropy split criterion
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.82      0.82      1880
           1       0.82      0.82      0.82      1920

    accuracy                           0.82      3800
   macro avg       0.82      0.82      0.82      3800
weighted avg       0.82      0.82      0.82      3800
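
A pipeline is a convenience wrapper: `fit` runs `fit_transform` on the vectorizer and then fits the classifier, and `predict` runs `transform` followed by the classifier's `predict`. The sketch below checks this on a tiny invented data set by comparing the pipeline against the same steps written out by hand. The texts, labels, and the `random_state=0` seed are assumptions added only so the two runs are comparable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Tiny invented data set, only to make the sketch runnable
train_texts = ["great film", "loved it", "terrible film", "hated it"]
train_labels = [1, 1, 0, 0]
test_texts = ["great great film", "terrible terrible"]

# Pipeline version: raw text goes in, predictions come out
pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("random_forest", RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0)),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

# Manual version: the same two steps written out explicitly
vec = CountVectorizer()
train_counts = vec.fit_transform(train_texts)  # vocabulary is learned from training text only
forest = RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0)
forest.fit(train_counts, train_labels)
manual_pred = forest.predict(vec.transform(test_texts))  # test text reuses the training vocabulary

print(pipe_pred, manual_pred)
```

Both versions should give identical predictions because they use the same seed; the pipeline simply makes it harder to fit the vectorizer on test data by accident.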



**Exercise-2**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **KNN** as the classifier with `n_neighbors` set to 10 and `metric` set to 'euclidean'.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html



In [57]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer()),    # convert raw text to token counts
    ('KNN', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))    # KNN classifier with 10 neighbors
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.64      0.66      0.65      1880
           1       0.66      0.64      0.65      1920

    accuracy                           0.65      3800
   macro avg       0.65      0.65      0.65      3800
weighted avg       0.65      0.65      0.65      3800



**Exercise-3**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **Multinomial Naive Bayes** as the classifier.
- Print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html



In [58]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer()),    # convert raw text to token counts
    ('nb', MultinomialNB())               # Multinomial Naive Bayes with default settings
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.82      0.87      0.85      1880
           1       0.86      0.82      0.84      1920

    accuracy                           0.84      3800
   macro avg       0.84      0.84      0.84      3800
weighted avg       0.85      0.84      0.84      3800
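
The three exercises share the same structure and differ only in the classifier, so they can also be run in one loop. The following is a sketch under stated assumptions: the `evaluate` helper and the tiny data set are hypothetical, and KNN uses 3 neighbors here only because the toy training set is too small for 10.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline


def evaluate(classifier, train_x, train_y, test_x, test_y):
    """Hypothetical helper: fit CountVectorizer + classifier, return the report as a dict."""
    pipe = Pipeline([("vectorizer", CountVectorizer()), ("classifier", classifier)])
    pipe.fit(train_x, train_y)
    return classification_report(test_y, pipe.predict(test_x), output_dict=True, zero_division=0)


# Tiny invented data set, only to make the sketch runnable
train_texts = ["great film", "loved it", "wonderful acting",
               "terrible film", "hated it", "awful acting"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great acting", "awful film"]
test_labels = [1, 0]

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, criterion="entropy"),
    "knn": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),  # 3, not 10: toy set is tiny
    "naive_bayes": MultinomialNB(),
}

reports = {
    name: evaluate(model, train_texts, train_labels, test_texts, test_labels)
    for name, model in models.items()
}
for name, report in reports.items():
    print(f"{name}: accuracy={report['accuracy']:.2f}")
```

On the real data, `X_train`, `y_train`, `X_test`, and `y_test` would take the place of the toy lists, which makes it easier to compare the classifiers on exactly the same split.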

