### Bag of words: Exercises


- In this exercise, you will classify whether a given movie review is **positive or negative**.
- You will use bag of words to pre-process the text and then apply different classification algorithms.
- scikit-learn's `CountVectorizer` provides a built-in implementation of bag of words.

In [1]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download


- This data consists of two columns:
  - review
  - sentiment
- Each review is the text a user wrote after watching a movie.
- The sentiment column tells whether the given review is positive or negative.

In [19]:
#1. read the data provided in the same directory with name 'IMDB Dataset.csv' and store it in df variable

df=pd.read_csv("IMDB Dataset.csv")

#2. print the shape of the data
df.shape



(50000, 2)

In [20]:
#3. print top 5 datapoints
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


review
Loved today's show!!! It was a variety and not solely cooking (which would have been great too). Very stimulating and captivating, always keeping the viewer peeking around the corner to see what was coming up next. She is as down to earth and as personable as you get, like one of us which made the show all the more enjoyable. Special guests, who are friends as well made for a nice surprise too. Loved the 'first' theme and that the audience was invited to play along too. I must admit I was shocked to see her come in under her time limits on a few things, but she did it and by golly I'll be writing those recipes down. Saving time in the kitchen means more time with family. Those who haven't tuned in yet, find out what channel and the time, I assure you that you won't be disappointed.

In [23]:
#create a new column "Category" that is 1 if the sentiment is positive and 0 if it is negative
df['Category']=df['sentiment'].apply(lambda x: 1 if x=="positive" else 0)
df

Unnamed: 0,review,sentiment,Category
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1
...,...,...,...
49995,I thought this movie did a down right good job...,positive,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,0
49997,I am a Catholic taught in parochial elementary...,negative,0
49998,I'm going to have to disagree with the previou...,negative,0


In [24]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.
df['Category'].value_counts()


Category
1    25000
0    25000
Name: count, dtype: int64

In [28]:
#Do the 'train-test' splitting with test size of 20%
X_train,X_test,y_train,y_test=train_test_split(df['review'],df['Category'],test_size=0.2)
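
The split above is random, so the reported scores will change from run to run. A minimal sketch of a reproducible, class-balanced split follows, assuming a fixed seed is acceptable for this exercise. The seed value 42 is arbitrary and `toy_df` is a hypothetical stand-in for `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df, using the same column names
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "Category": [1, 0] * 5,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df["review"],
    toy_df["Category"],
    test_size=0.2,
    random_state=42,              # arbitrary seed (an assumption) for repeatable splits
    stratify=toy_df["Category"],  # keep the class ratio the same in both splits
)

print(len(X_tr), len(X_te))  # 8 2
```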


In [29]:
X_test

46260    This tearful movie about a sister and her batt...
27869    Watching this odd little adventure movie, it's...
249      'Airport 4' is basically a slopped together me...
46503    If you hate redneck accents, you'll hate this ...
7069     Funny how a studio thinks it can make a sequel...
                               ...                        
8915     New Year 2006, and I'm watching Glimmer Man ag...
46038    I saw this movie as a child and fell in love w...
7623     I have to be honest and say I bought this movi...
40063    Anyone who complains about Peter Jackson makin...
14489    John Landis truly outdid himself when he direc...
Name: review, Length: 10000, dtype: object

**Exercise-1**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- use CountVectorizer for pre-processing the text.

- use **Random Forest** as the classifier with `n_estimators` of 50 and `criterion` of 'entropy'.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
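
As a reminder of the expected structure, `Pipeline` takes a flat list of `(name, estimator)` tuples, where every step but the last is a transformer. The sketch below runs on hypothetical toy data. The step names, the tiny forest size, and `random_state=0` are assumptions chosen for illustration, not the exercise settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the IMDB reviews
toy_texts = ["good good film", "great film", "bad film", "awful bad film"]
toy_labels = [1, 1, 0, 0]

# One flat list; each element is a (name, estimator) tuple
toy_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=5, random_state=0)),
])

# fit() vectorizes the raw text, then trains the classifier on the counts
toy_pipe.fit(toy_texts, toy_labels)

step_names = [name for name, _ in toy_pipe.steps]
toy_pred = toy_pipe.predict(["good great film"])
print(step_names)  # ['vectorizer', 'classifier']
```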

In [31]:
#1. create a pipeline object
# steps must be a flat list of (name, estimator) tuples
pip=Pipeline([('CountVectorizer',CountVectorizer()),
              ('RandomForestClassifier',RandomForestClassifier(n_estimators=50,criterion='entropy'))])



#2. fit with X_train and y_train

pip.fit(X_train,y_train)

#3. get the predictions for X_test and store it in y_pred

y_pred=pip.predict(X_test)

#4. print the classification report
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test,y_pred))
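
Note that `classification_report` takes the true labels as its first argument and the predictions as its second. Swapping them exchanges precision and recall for each class, as this small sketch with hypothetical labels illustrates (`y_true` and `y_guess` are made up for the example).

```python
from sklearn.metrics import classification_report

# Hypothetical labels: one positive example is misclassified as negative
y_true = [1, 1, 1, 0]
y_guess = [1, 1, 0, 0]

# Correct order: true labels first, predictions second
report = classification_report(y_true, y_guess, output_dict=True)
# Swapped order, shown only for comparison
swapped = classification_report(y_guess, y_true, output_dict=True)

# For class 1, precision and recall trade places when the arguments are swapped
print(report["1"]["precision"], report["1"]["recall"])
print(swapped["1"]["precision"], swapped["1"]["recall"])
```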

**Exercise-2**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- use CountVectorizer for pre-processing the text.
- use **KNN** as the classifier with n_neighbors of 10 and metric as 'euclidean'.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html



In [5]:

#1. create a pipeline object


#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred


#4. print the classification report


**Exercise-3**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- use CountVectorizer for pre-processing the text.
- use **Multinomial Naive Bayes** as the classifier.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html



In [6]:

#1. create a pipeline object



#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report


### Can you write some observations on why a model like KNN fails to produce good results, unlike RandomForest and MultinomialNB?
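
One factor that may contribute, offered as a hint rather than the expected answer: Euclidean distance on raw word counts is sensitive to document length. In the hypothetical toy corpus below, a short positive review ends up closer to a short negative review than to a longer positive review that uses exactly the same words. Cosine distance is included only for comparison.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Hypothetical toy corpus, not taken from the IMDB data
toy_reviews = [
    "great film",                                   # short positive
    "great film great film great film great film",  # long positive, same words
    "awful film",                                   # short negative
]

X = CountVectorizer().fit_transform(toy_reviews)

euclid = euclidean_distances(X)
cosine = cosine_distances(X)

# Euclidean: the short negative review is the nearer neighbour of review 0
print(euclid[0, 1], euclid[0, 2])
# Cosine: the long positive review is the nearer neighbour of review 0
print(cosine[0, 1], cosine[0, 2])
```

With tens of thousands of sparse vocabulary columns, effects like this can make nearest neighbours less informative, which is worth keeping in mind when writing your observations.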



## [**Solution**](./bag_of_words_exercise_solutions.ipynb)