<a href="https://colab.research.google.com/github/pushyag1/NLPClass/blob/master/Movie_Review_MultinomialNB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Use the Movie_Reviews.txt file.

• Load the data

• Change the txt into a csv file

• Remove the ‘Unnamed’ column

• Print the first 5 rows

• Set X as movie_data['Summary'] and y as movie_data['Sentiment']

• Split the dataset into docs_train, docs_test, y_train, and y_test, with a test size of 20%

• Initialize CountVectorizer:
  o movieVzer = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize)
    This uses all 25K words, which gives higher accuracy.

• Locate the words ‘screen’ and ‘Steven Seagal’ in the corpus

• Determine the shape of ‘docs_train_counts’

• Convert raw frequency counts into Tfidf values

• Using the fitted vectorizer and transformer, transform the test data

• Use Multinomial Naive Bayes as the classifier

• Predict the test set results and determine the accuracy

• Provide the confusion matrix

• Provide the classification report

• Use GridSearchCV with logistic regression to identify the accuracy, best estimator, best score, and best parameter. Use scoring="roc_auc" and cv=5.

• Provide the confusion matrix

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import pickle
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
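# Read the raw text file (comma-separated) and name its two columns
read_file = pd.read_csv('Movie_Reviews.txt')
read_file.columns = ['Summary', 'Sentiment']
# Save as CSV; the row index is written too, which creates the 'Unnamed: 0' column on reload
read_file.to_csv('Movie_Reviews.csv', header=['Summary', 'Sentiment'])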

In [3]:
movie_data = pd.read_csv('Movie_Reviews.csv')
movie_data = movie_data.loc[:, ~movie_data.columns.str.contains('^Unnamed')]
movie_data.head(5)

,Summary,Sentiment
0,rock destined st century new conan going make ...,1
1,gorgeously elaborate continuation lord ring tr...,1
2,effective tepid biopic,1
3,sometimes like go movie fun wasabi good place ...,1
4,emerges something rare issue movie honest keen...,1
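
The 'Unnamed: 0' column removed above comes from the row index that `to_csv` writes by default. A minimal sketch, using a hypothetical two-row frame and in-memory buffers rather than the real review file, suggests that passing `index=False` avoids the extra column in the first place:

```python
import io
import pandas as pd

# Hypothetical stand-in for the review data (not taken from Movie_Reviews.txt)
toy = pd.DataFrame({"Summary": ["good fun", "dull plot"], "Sentiment": [1, 0]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'Summary', 'Sentiment']
print(cols_without_index)  # ['Summary', 'Sentiment']
```

With the index left out, the regex filter on `^Unnamed` becomes unnecessary, although keeping it is harmless.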


In [4]:
X, y = movie_data['Summary'], movie_data['Sentiment']

In [5]:
# Split data into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
docs_train, docs_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=12)
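
As a quick sanity check on the 20% split, the sketch below applies the same `train_test_split` settings to a plain list of row indices. The total row count is an assumption derived from the training shape (8529) and the test support (2133) reported further down in this notebook.

```python
from sklearn.model_selection import train_test_split

# Assumed total: 8529 training rows + 2133 test rows, as reported later on
n_rows = 8529 + 2133

# Split row indices instead of the real documents
train_idx, test_idx = train_test_split(
    list(range(n_rows)), test_size=0.20, random_state=12)

print(len(train_idx), len(test_idx))  # 8529 2133
```

The test size is rounded up, which is why 20% of 10,662 rows yields 2,133 test rows rather than 2,132.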

In [6]:
# Initialize CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Keep only the 3,000 most frequent terms that appear in at least 2 documents
movieVzer = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features=3000)
# movieVzer = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize) # use all 25K words. Higher accuracy
# Fit and transform using the training text
docs_train_counts = movieVzer.fit_transform(docs_train)

In [7]:
# 'screen' is found in the corpus, mapped to index 2268
movieVzer.vocabulary_.get('screen')

2268

In [8]:
# Likewise, 'seagal' (Mr. Steven Seagal) is present, mapped to index 2274
movieVzer.vocabulary_.get('seagal')

2274
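
The lookup uses the single token 'seagal' because a default `CountVectorizer` stores individual words, not phrases. A small sketch on a hypothetical two-sentence corpus (using the default tokenizer rather than `nltk.word_tokenize`) illustrates that the full name only becomes a vocabulary entry when bigrams are enabled through `ngram_range`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, for illustration only
docs = ["steven seagal fills the screen", "the screen is dark"]

# Default settings: single words only
unigram_vzer = CountVectorizer()
unigram_vzer.fit(docs)
has_word = "seagal" in unigram_vzer.vocabulary_
has_phrase_unigram = "steven seagal" in unigram_vzer.vocabulary_

# ngram_range=(1, 2): single words and two-word phrases
bigram_vzer = CountVectorizer(ngram_range=(1, 2))
bigram_vzer.fit(docs)
has_phrase_bigram = "steven seagal" in bigram_vzer.vocabulary_

print(has_word, has_phrase_unigram, has_phrase_bigram)  # True False True
```

For a term that is absent, `vocabulary_.get(...)` returns `None` instead of an index.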

In [9]:
# Large matrix: 8,529 documents by 3,000 terms
docs_train_counts.shape

(8529, 3000)

In [10]:
# Convert raw frequency counts into TF-IDF values
from sklearn.feature_extraction.text import TfidfTransformer
movieTfmer = TfidfTransformer()
docs_train_tfidf = movieTfmer.fit_transform(docs_train_counts)

In [11]:
# Using the fitted vectorizer and transformer, transform the test data
docs_test_counts = movieVzer.transform(docs_test)
docs_test_tfidf = movieTfmer.transform(docs_test_counts)
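
To make the conversion less of a black box, the sketch below reproduces what `TfidfTransformer` computes with its default settings: a smoothed idf of ln((1 + n) / (1 + df)) + 1, multiplied by the raw counts, followed by L2 normalization of each row. The count matrix is a hypothetical 3x3 example, not data from the reviews.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical 3-document, 3-term count matrix (illustration only)
counts = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [1, 0, 0]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf
weighted = counts * idf                     # tf * idf
# L2-normalize each row
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's default transformer
sk = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, sk))  # True
```

Because the idf weights are learned in `fit`, the test data is only passed through `transform`, so it is weighted with the training-set document frequencies.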

In [12]:
# Now ready to build a classifier.
# We will use Multinomial Naive Bayes as our model
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Train a Multinomial Naive Bayes classifier. Again, we call it "fitting"
clf = MultinomialNB()
clf.fit(docs_train_tfidf, y_train)
# Predict the test set results and compute the accuracy
y_pred = clf.predict(docs_test_tfidf)
print("Accuracy:\n", round(accuracy_score(y_test, y_pred), 2))

Accuracy:
 0.75
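
The vectorizer, TF-IDF transformer, and classifier used so far can also be chained into a single estimator. The following is a minimal sketch with `make_pipeline` on a hypothetical four-review corpus; the documents, labels, and the name `text_clf` are illustrative assumptions, and the default tokenizer is used instead of `nltk.word_tokenize`.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews and labels (1 = positive, 0 = negative)
train_docs = ["great fun film", "wonderful acting",
              "dull boring plot", "terrible boring film"]
train_labels = [1, 1, 0, 0]

# Chain the three steps so that fit and predict apply them in order
text_clf = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
text_clf.fit(train_docs, train_labels)

# Raw strings go in; the pipeline handles counting and weighting
pred = text_clf.predict(["boring plot", "wonderful fun"])
print(pred)  # [0 1]
```

A pipeline of this kind keeps the fit-on-train, transform-on-test discipline automatic, since `predict` never refits the vectorizer or the transformer.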


In [13]:
# Provide the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Reverse the row order only: rows become true class [1, 0], columns stay predicted [0, 1]
cm = cm[::-1, :]
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[260 797]
 [804 272]]
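
Because the cell reverses only the row order, the first printed row holds the true class 1 and the second the true class 0, while the columns remain predicted 0 and predicted 1. The short worked example below reads the four cells under that layout and recomputes the headline metrics from the printed numbers:

```python
# Values copied from the printed matrix
# Row 1: true class 1 -> [predicted 0, predicted 1]
fn, tp = 260, 797
# Row 2: true class 0 -> [predicted 0, predicted 1]
tn, fp = 804, 272

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
recall_pos = tp / (tp + fn)      # recall for class 1
precision_pos = tp / (tp + fp)   # precision for class 1

print(total)                                          # 2133
print(round(accuracy, 2))                             # 0.75
print(round(recall_pos, 2), round(precision_pos, 2))  # 0.75 0.75
```

The row sums (1057 and 1076) match the per-class support, which is a convenient way to confirm which row belongs to which class.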


In [14]:
# Provide the Classification Report
from sklearn.metrics import classification_report
print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.75      0.75      1076
           1       0.75      0.75      0.75      1057

    accuracy                           0.75      2133
   macro avg       0.75      0.75      0.75      2133
weighted avg       0.75      0.75      0.75      2133



In [16]:
# Using GridSearchCV with logistic regression
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(class_weight="balanced", random_state=0)
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(logreg, param_grid, scoring="roc_auc", cv=5)
# Note: the search is fit on the test split and then evaluated on that same split,
# so the results below are not held-out estimates; fitting on
# docs_train_tfidf, y_train would give a held-out test accuracy.
grid_train = grid.fit(docs_test_tfidf, y_test)
pred_grid = grid_train.predict(docs_test_tfidf)
# Reverse the row order only: rows become true class [1, 0], columns stay predicted [0, 1]
cm = confusion_matrix(y_test, pred_grid)[::-1, :]
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[ 95 962]
 [964 112]]


In [17]:
print("Accuracy:\n", round(accuracy_score(y_test, pred_grid),3))

Accuracy:
 0.903
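
The grid search cell fits on `docs_test_tfidf` and `y_test` and then predicts on that same split, so the accuracy above is not a held-out figure. A minimal sketch of the held-out variant, using synthetic features from `make_classification` as a stand-in for the TF-IDF matrix (an assumption for illustration, as are the names `grid_toy` and `held_out_acc`), could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical; not the movie reviews)
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=12)

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
logreg = LogisticRegression(class_weight="balanced", random_state=0, max_iter=1000)
grid_toy = GridSearchCV(logreg, param_grid, scoring="roc_auc", cv=5)

# Tune C on the training split only
grid_toy.fit(X_tr, y_tr)

# Score on the untouched test split
held_out_acc = accuracy_score(y_te, grid_toy.predict(X_te))
print("Best C:", grid_toy.best_params_["C"])
print("Held-out accuracy:", round(held_out_acc, 3))
```

In the notebook itself, the equivalent change would be to fit on `docs_train_tfidf, y_train` and keep the prediction on `docs_test_tfidf`.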


In [18]:
print("Best Estimator:\n", grid.best_estimator_)

Best Estimator:
 LogisticRegression(C=1, class_weight='balanced', dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)


In [19]:
# best_score_ is the mean cross-validated ROC AUC of the best C, not an accuracy
print("Best Score:\n", grid.best_score_)

Best Score:
 0.7551331366677749


In [20]:
print("Best Parameter:\n", grid.best_params_)


Best Parameter:
 {'C': 1}
