### IMDB_review_classification


- In this exercise, we classify a given movie review as **positive or negative**.
- We use Bag of Words to turn the review text into numeric features and then apply several classification algorithms.
- scikit-learn's `CountVectorizer` provides a built-in implementation of Bag of Words.
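To make the Bag of Words idea concrete, here is a minimal sketch on a toy corpus. The three short reviews and the names `toy_reviews`, `toy_cv`, and `toy_counts` are made up for illustration and are not part of the dataset. Each row of the resulting matrix is one review, each column is one vocabulary word, and each cell is the number of times that word occurs in that review.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical reviews, not taken from the IMDB data)
toy_reviews = ["good movie", "bad movie", "good good plot"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_reviews)  # sparse document-term matrix

# Order the vocabulary by column index so it lines up with the matrix columns
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)                 # ['bad', 'good', 'movie', 'plot']
print(toy_counts.toarray())  # [[0 1 1 0], [1 0 1 0], [0 2 0 1]]
```

Word order is discarded: only the counts survive, which is why the third review has a 2 in the `good` column.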

In [1]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download


- This data consists of two columns:
    - review
    - sentiment
- Reviews are the statements written by users after watching the movie.
- The sentiment column tells whether the given review is positive or negative.

In [2]:
df=pd.read_csv('movies_sentiment_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive
1,I enjoyed the movie and the story immensely! I...,positive
2,I had a hard time sitting through this. Every ...,negative
3,It's hard to imagine that anyone could find th...,negative
4,This is one military drama I like a lot! Tom B...,positive


In [3]:
df.shape

(19000, 2)

In [4]:
df.review

0        I first saw Jake Gyllenhaal in Jarhead (2005) ...
1        I enjoyed the movie and the story immensely! I...
2        I had a hard time sitting through this. Every ...
3        It's hard to imagine that anyone could find th...
4        This is one military drama I like a lot! Tom B...
                               ...                        
18995    - Bad Stuff: This movie is real crap. Bad stun...
18996    If you've seen the trailer for this movie, you...
18997    This has to be the all time best computer anim...
18998    I've seen 'NSNA' just after I've seen all Roge...
18999    Norris plays a Chicago cop who stumbles upon a...
Name: review, Length: 19000, dtype: object

In [5]:
# Create a new column 'category': 1 if the sentiment is positive, 0 if it is negative
df['category'] = df.sentiment.apply(lambda x:1 if x=='positive' else 0)

In [6]:
# Check the distribution of 'category' to see whether the target labels are balanced
df.category.value_counts()

1    9500
0    9500
Name: category, dtype: int64

In [33]:
x=df.review
y=df.category

In [35]:
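from sklearn.feature_extraction.text import CountVectorizer

# Build the Bag of Words matrix for all reviews.
# Note: fitting before the split lets test-set words into the vocabulary.
cv=CountVectorizer()

x_cv=cv.fit_transform(x.values)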

In [36]:
#Do the 'train-test' splitting with test size of 20%
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_cv,y, test_size=0.2)

In [37]:
x_train.shape

(15200, 68545)

In [38]:
x_test.shape

(3800, 68545)
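The split above is called without a `random_state`, so the rows that land in the test set change from run to run (which is why the class supports in the reports differ between sections). A minimal sketch of a reproducible, class-balanced split follows. The toy frame `toy_df` and the seed value 42 are assumptions for illustration; `random_state` and `stratify` are standard `train_test_split` parameters.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df: 10 reviews with balanced labels
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "category": [1, 0] * 5,
})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_df.review,
    toy_df.category,
    test_size=0.2,
    random_state=42,            # fixed seed (arbitrary value) for repeatable splits
    stratify=toy_df.category,   # keep the class ratio the same in train and test
)

print(len(x_tr), len(x_te))
print(y_te.value_counts().to_dict())
```

With stratification, the test set keeps the same 50/50 class ratio as the full data.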

In [40]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()

model.fit(x_train, y_train)

In [42]:
from sklearn.metrics import classification_report

y_pred = model.predict(x_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.88      0.85      1916
           1       0.87      0.81      0.84      1884

    accuracy                           0.85      3800
   macro avg       0.85      0.85      0.85      3800
weighted avg       0.85      0.85      0.85      3800



In [64]:
x=df.review
y=df.category

In [65]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)

In [72]:
from sklearn.feature_extraction.text import CountVectorizer

cv=CountVectorizer()

x_train_cv=cv.fit_transform(x_train.values)

x_test_cv=cv.transform(x_test.values)

x_train_cv.shape,x_test_cv.shape
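The manual steps above (fit the vectorizer on the training text, then only transform the test text) are what a `Pipeline` does internally. The sketch below checks that on toy data; the documents, labels, and variable names are invented for illustration, and `MultinomialNB` is used because it is deterministic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not taken from the IMDB reviews
train_docs = ["great film", "loved it great acting", "terrible film", "boring and terrible"]
train_labels = [1, 1, 0, 0]
test_docs = ["great acting", "boring film"]

# Manual route: fit_transform on train, transform on test
manual_cv = CountVectorizer()
manual_nb = MultinomialNB()
manual_nb.fit(manual_cv.fit_transform(train_docs), train_labels)
manual_pred = manual_nb.predict(manual_cv.transform(test_docs))

# Pipeline route: the same two steps bundled into one estimator
pipe = Pipeline([("cv", CountVectorizer()), ("nb", MultinomialNB())])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(test_docs)

print(list(manual_pred) == list(pipe_pred))
```

Because the pipeline takes raw text, the following sections pass `x_train` and `x_test` directly rather than the pre-vectorized matrices.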

## RandomForest Pipeline

In [74]:
clf=Pipeline([
    ('cv',CountVectorizer()),
    ('rf',RandomForestClassifier())
])

### Fit with x_train and y_train

In [76]:
clf.fit(x_train,y_train)

### Print the classification report

In [77]:
y_pred = clf.predict(x_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.83      0.84      1936
           1       0.82      0.84      0.83      1864

    accuracy                           0.83      3800
   macro avg       0.83      0.83      0.83      3800
weighted avg       0.83      0.83      0.83      3800



## KNN Pipeline

In [78]:
clf=Pipeline([
    ('cv',CountVectorizer()),
    ('knn',KNeighborsClassifier(n_neighbors=10,metric='euclidean'))
])

### Fit with x_train and y_train

clf.fit(x_train,y_train)

### Print the classification report

y_pred = clf.predict(x_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.66      0.60      0.63      1936
           1       0.62      0.68      0.65      1864

    accuracy                           0.64      3800
   macro avg       0.64      0.64      0.64      3800
weighted avg       0.64      0.64      0.64      3800



## MultinomialNB Pipeline

In [79]:
clf=Pipeline([
    ('cv',CountVectorizer()),
    ('nb',MultinomialNB())
])

### Fit with x_train and y_train

clf.fit(x_train,y_train)

### Print the classification report

y_pred = clf.predict(x_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.87      0.84      1936
           1       0.85      0.80      0.83      1864

    accuracy                           0.84      3800
   macro avg       0.84      0.83      0.84      3800
weighted avg       0.84      0.84      0.84      3800

