**This project analyzes the sentiment of movie reviews using natural language processing (NLP) and machine learning techniques. The goal is to classify each review as positive or negative, giving insight into a movie's overall reception.**

In [1]:
# importing packages
import pandas as pd
from textblob import TextBlob
from sklearn.metrics import accuracy_score
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier

In [2]:
# load dataset into dataframe
movie_review_df = pd.read_csv('movie_review_dataset.tsv.zip', sep='\t', compression='zip', header=0, quotechar='"')

# display first five rows to verify if data loaded properly
movie_review_df.head(5)

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [3]:
# group by sentiment to get number of positive and negative reviews
movie_review_df.groupby('sentiment')['id'].count()

sentiment
0    12500
1    12500
Name: id, dtype: int64

> **'0' represents a negative review, and '1' represents a positive review. There are 12,500 negative reviews (0) and 12,500 positive reviews (1), so the dataset is balanced.**

In [4]:
'''
Common helper methods for getting the sentiment of a text.
Each method takes the text and the name of the model to use.
Currently supported models: TextBlob ('textblob') and VADER ('vader').
'''


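# method to get polarity from the text using different models
def get_polarity_score(text, model):
    if model == 'textblob':
        return TextBlob(text).sentiment.polarity
    elif model == 'vader':
        # vadersid is the SentimentIntensityAnalyzer instance created in cell 7
        return vadersid.polarity_scores(text)['compound']
    # fail loudly instead of silently returning None for an unknown model
    raise ValueError(f"Unsupported model: {model}")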

# method to find sentiment from the text. 0 - negative review, 1- positive review
def find_sentiment_using_model(text, model):
    polarity_score = get_polarity_score(text,model)
    if polarity_score>=0:
        return 1
    else:
        return 0
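
> A minimal sketch of the thresholding rule used above. The polarity scores are made-up stand-ins for TextBlob or VADER output (illustrative assumptions, not model results), and `polarity_to_label` is a hypothetical helper name. Note that a score of exactly 0 is labelled positive.

```python
# Hypothetical polarity scores in [-1, 1]; real values would come from
# TextBlob(text).sentiment.polarity or VADER's 'compound' score.
example_scores = [0.8, -0.35, 0.0, -0.01]


def polarity_to_label(polarity_score):
    """Map a polarity score to 1 (positive) or 0 (negative)."""
    return 1 if polarity_score >= 0 else 0


example_labels = [polarity_to_label(score) for score in example_scores]
print(example_labels)
```

> Because neutral text scores 0, this rule counts neutral reviews as positive; a stricter cutoff would be a different design choice.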


In [5]:
# creating new column for sentiment using textblob
movie_review_df['sentiment_using_textblob'] = movie_review_df['review'].apply(lambda x:find_sentiment_using_model(x,'textblob'))

In [6]:
# find accuracy using accuracy_score from sklearn
accuracy_score(movie_review_df['sentiment'], movie_review_df['sentiment_using_textblob'])

0.68524

> **The accuracy is 0.68524, which means TextBlob predicted the review sentiment correctly 68.52% of the time. With two balanced classes (positive and negative), random guessing would be correct about 1/number_of_classes = 50% of the time, so TextBlob performs better than random guessing on this dataset.**
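
> The 50% baseline can be checked with a few lines of arithmetic. This sketch reuses the class counts from cell 3 and the accuracy from cell 6; the variable names are illustrative.

```python
# Class counts reported by the groupby in cell 3.
class_counts = {0: 12500, 1: 12500}
total_reviews = sum(class_counts.values())

# Always predicting the most frequent class gives this accuracy.
majority_baseline = max(class_counts.values()) / total_reviews

# Uniform random guessing over k classes is right 1/k of the time on average.
random_baseline = 1 / len(class_counts)

textblob_accuracy = 0.68524  # value reported in cell 6
improvement = round(textblob_accuracy - random_baseline, 5)
print(majority_baseline, random_baseline, improvement)
```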

In [7]:
# create object for the vader model analysis
vadersid = SentimentIntensityAnalyzer()

# creating new column for sentiment using vader
movie_review_df['sentiment_using_vader'] = movie_review_df['review'].apply(lambda x:find_sentiment_using_model(x,'vader'))

# display first 5 rows to view the updated dataframe
movie_review_df.head(5)

Unnamed: 0,id,sentiment,review,sentiment_using_textblob,sentiment_using_vader
0,5814_8,1,With all this stuff going down at the moment w...,1,0
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",1,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,0,0
3,3630_4,0,It must be assumed that those who praised this...,1,0
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,0,1


In [8]:
# find accuracy using accuracy_score from sklearn
accuracy_score(movie_review_df['sentiment'], movie_review_df['sentiment_using_vader'])

0.69356

> **The accuracy is 0.69356, which means the VADER model predicted the review sentiment correctly 69.36% of the time. With two balanced classes (positive and negative), random guessing would be correct about 50% of the time, so VADER performs better than random guessing on this dataset. It is also slightly more accurate than the TextBlob model (0.69356 versus 0.68524).**

In [9]:
'''
Creating a method to clean the text for a custom model.
This method does the following cleanups:
1. Convert text to lowercase
2. Remove all punctuation and special characters
3. Remove English stop words
4. Stem the remaining words
'''

# create object for stemming
porter = PorterStemmer()

# get the english stop word list
stop_words = stopwords.words('english')

# method to clean up the text
def model_preprocessor_cleaning(text):
    # convert text to lowercase
    lower_case_text = text.lower()
    # remove all punctuation and special characters
    alnum_only_text = re.sub(r'[^\w\s]', '', lower_case_text)
    # tokenizing the text for stemming and stop word removal
    word_tokens = word_tokenize(alnum_only_text)
    # removing the stop words
    tokens_without_stopwords = [word for word in word_tokens if word not in stop_words]
    # stemming text and joining back
    cleansed_text = " ".join(porter.stem(word) for word in tokens_without_stopwords)
    return cleansed_text

In [10]:
# Create a new column to store cleaned text
movie_review_df['cleansed_review_text'] = movie_review_df['review'].apply(model_preprocessor_cleaning)
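
> To see the cleaning steps on a single sentence, here is a simplified, standard-library-only sketch. It is not the notebook's method: `demo_clean` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word list is a tiny hand-written assumption rather than NLTK's list, and stemming is skipped so the example runs without NLTK data.

```python
import re

# Tiny illustrative stop-word list (an assumption); the notebook uses
# NLTK's full English list instead.
DEMO_STOP_WORDS = {"the", "a", "is", "and", "of", "this", "it"}


def demo_clean(text):
    """Lowercase, strip punctuation, and drop stop words (no stemming)."""
    lowered = text.lower()
    # Keep only word characters and whitespace.
    no_punctuation = re.sub(r"[^\w\s]", "", lowered)
    tokens = no_punctuation.split()
    return " ".join(tok for tok in tokens if tok not in DEMO_STOP_WORDS)


cleaned = demo_clean("This movie is GREAT, and the cast is superb!")
print(cleaned)  # movie great cast superb
```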

In [11]:
# display first few rows
movie_review_df['cleansed_review_text'].head(5)

0    stuff go moment mj ive start listen music watc...
1    classic war world timothi hine entertain film ...
2    film start manag nichola bell give welcom inve...
3    must assum prais film greatest film opera ever...
4    superbl trashi wondrous unpretenti 80 exploit ...
Name: cleansed_review_text, dtype: object

In [12]:
# create new object for bag of words matrix transform
count_vect = CountVectorizer()

# get bag of words matrix
bag_of_words_matrix = count_vect.fit_transform(movie_review_df['cleansed_review_text'])

# Display the dimensions
bag_of_words_matrix.shape

(25000, 92528)

In [13]:
# Display the dimensions
movie_review_df.shape

(25000, 6)

> **The result above shows that the original dataframe and bag_of_words_matrix have the same number of rows (25000).**
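
> A bag-of-words matrix has one row per document and one column per distinct term. The sketch below builds one by hand for two toy documents. It only illustrates the idea and does not reproduce every CountVectorizer default (for example its tokenization rules); all names are illustrative.

```python
from collections import Counter

toy_docs = ["good film good cast", "bad film"]

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = sorted({tok for doc in toy_docs for tok in doc.split()})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}


def to_count_row(doc):
    """Return the term counts of one document, ordered by vocabulary column."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocabulary]


count_matrix = [to_count_row(doc) for doc in toy_docs]
print(vocabulary)
print(count_matrix)
```

> The row count equals the number of documents, which is why the real matrix above has 25000 rows.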

In [None]:
# create new object for term frequency-inverse document frequency matrix
tf_idf = TfidfVectorizer()

# get tf-idf matrix
tf_idf_matrix = tf_idf.fit_transform(movie_review_df['cleansed_review_text'])

# Display the dimensions
tf_idf_matrix.shape

> **The result above shows that bag_of_words_matrix and tf_idf_matrix have the same dimensions (25000, 92528).**
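
> Tf-idf keeps the same rows and columns as the bag-of-words matrix but reweights each count. The hand-rolled sketch below assumes scikit-learn's default weighting (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and applies it to two toy documents; all names are illustrative.

```python
import math
from collections import Counter

toy_docs = ["good film good cast", "bad film"]
n_docs = len(toy_docs)
tokenized = [doc.split() for doc in toy_docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

# Document frequency: number of documents containing each term.
doc_freq = {term: sum(term in toks for toks in tokenized) for term in vocab}

# Smoothed idf, assumed to match scikit-learn's default (smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}


def tfidf_row(tokens):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


tfidf_matrix = [tfidf_row(toks) for toks in tokenized]
print(vocab)
print(tfidf_matrix)
```

> A term that appears in every document ("film") gets the lowest idf, so rarer terms such as "bad" carry more weight within a row.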

In [None]:
'''
Using train_test_split from sklearn, split
the dataset into train and test sets.
sentiment is the target that needs to be predicted, and
the cleaned review text (the last column) is the only feature.
The parameter test_size = 0.2 allocates 20% of the data for testing,
and random_state=0 gives the same train and
test sets across different executions.
'''
# get independent variables
X = movie_review_df.iloc[:,-1:]

# get dependent variable
y = movie_review_df.iloc[:,(1)]


# split the dataset
feature_train, feature_test, target_train, target_test =\
train_test_split(X ,y,test_size = 0.2, random_state = 0 )
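
> The sketch below shows what an 80/20 split with a fixed seed does, using only the standard library. `simple_train_test_split` is a hypothetical helper, not scikit-learn's implementation, so it will not pick the same rows as `train_test_split`; it only illustrates the resulting sizes and the reproducibility that a fixed seed provides.

```python
import random


def simple_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle indices with a fixed seed, then hold out test_size of them."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(25000))  # stand-in for the 25,000 reviews
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```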

In [None]:
# create new object for term frequency-inverse document frequency matrix
tf_idf = TfidfVectorizer(max_features=20000, ngram_range = (1,2))

# get tf-idf matrix
tf_idf_matrix_train = tf_idf.fit_transform(feature_train['cleansed_review_text'])

# Display the dimensions
tf_idf_matrix_train.shape

In [None]:
# get tf-idf matrix
tf_idf_matrix_test = tf_idf.transform(feature_test['cleansed_review_text'])

# Display the dimensions
tf_idf_matrix_test.shape

> **Fitting lets a transformer learn its parameters from the data; for the tf-idf vectorizer these are the vocabulary and the inverse document frequency (idf) weights. Transform then applies the learned parameters to convert text into features. We fit only on the training data so that nothing is learned from the test data, which must stay hidden from the model. We then apply transform to the test set so that it is expressed in the same feature space the model was trained on.**
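
> A toy illustration of the fit/transform distinction, using plain counts instead of tf-idf to keep it short. The names are hypothetical. The vocabulary is learned from the training documents only, so a word that appears only in the test data ("plot") has no column and is ignored.

```python
train_docs = ["good film", "bad film"]
test_docs = ["good plot"]  # 'plot' never appears in the training data

# "fit": learn the vocabulary from the training documents only.
learned_vocab = sorted({tok for doc in train_docs for tok in doc.split()})


def transform(doc):
    """Count only the learned terms; unseen terms are dropped."""
    tokens = doc.split()
    return [tokens.count(term) for term in learned_vocab]


train_matrix = [transform(doc) for doc in train_docs]
test_matrix = [transform(doc) for doc in test_docs]
print(learned_vocab)
print(train_matrix, test_matrix)
```

> Both matrices end up with the same number of columns, which is what lets a model trained on one make predictions on the other.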

In [None]:
# create instance of logisticregression
logistic_regression = LogisticRegression()

# train the model using training data
logistic_regression.fit(tf_idf_matrix_train, target_train)

In [None]:
# predict for test dataset
logit_predictions = logistic_regression.predict(tf_idf_matrix_test)
logit_predictions

In [None]:
'''
Creating a method to evaluate the model performance.
Performance metrics such as accuracy_score, confusion_matrix,
precision_score, recall_score, and f1_score are calculated in this method.
accuracy_score - The share of all predictions that the model got right.
confusion_matrix - A table of actual versus predicted classes, showing the counts of
true negatives, false positives, false negatives, and true positives.
precision_score - Of the instances predicted as positive, the share that are actually positive.
recall_score - Of the instances that are actually positive, the share
that the model predicted as positive.
f1_score - The harmonic mean of precision and recall.
'''

def model_performance_metrices(target_test, predictions):

    # find accuracy using accuracy_score from sklearn
    accuracy_score_val = accuracy_score(target_test, predictions)
    # print accuracy score
    print('Accuracy score: %.3f\n' % accuracy_score_val)

    # create confusion matrix; with labels 0/1, rows are actual classes and
    # columns are predicted classes: [[TN, FP], [FN, TP]]
    cf_matrix = confusion_matrix(target_test, predictions)
    # visualize the confusion matrix (tick labels follow the 0, 1 label order)
    sns.heatmap(cf_matrix,
                annot=True,
                fmt='g',
                xticklabels=['Negative_Review', 'Positive_Review'],
                yticklabels=['Negative_Review', 'Positive_Review']
               )
    # set labels and title
    plt.ylabel('Actual', fontsize=13)
    plt.xlabel('Prediction', fontsize=13)
    plt.title('Confusion Matrix', fontsize=17)
    # display the plot
    plt.show()

    # find precision score using sklearn metrics precision_score
    precision_score_val = precision_score(target_test, predictions)
    # print precision score
    print('\nPrecision Score: %.3f' % precision_score_val)

    # find recall score using sklearn metrics recall_score
    recall_score_val = recall_score(target_test, predictions)
    # print recall score
    print('\nRecall Score: %.3f' % recall_score_val)

    # find f1 score using sklearn metrics f1_score
    f1_score_val = f1_score(target_test, predictions)
    # print f1 score
    print('\nF1 Score: %.3f' % f1_score_val)

    

In [None]:
# Calling method to get performance metrics
model_performance_metrices(target_test, logit_predictions)
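
> All four scores printed by the method can be derived from the four cells of a binary confusion matrix. The sketch below shows the formulas with hypothetical counts (not results from this notebook). scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]` for labels 0 and 1.

```python
# Hypothetical confusion-matrix counts (not results from this notebook).
true_neg, false_pos = 45, 10
false_neg, true_pos = 5, 40

total = true_neg + false_pos + false_neg + true_pos

# Share of all predictions that were correct.
accuracy = (true_pos + true_neg) / total
# Of everything predicted positive, how much really was positive.
precision = true_pos / (true_pos + false_pos)
# Of everything really positive, how much was found.
recall = true_pos / (true_pos + false_neg)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print('%.3f %.3f %.3f %.3f' % (accuracy, precision, recall, f1))
```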

**Output comment**:

> Accuracy Score - The logistic regression model predicts the sentiment correctly for about 89.7% of the reviews in the test set, i.e. (2257 + 2228) of 5000.

> Confusion Matrix - 
1. 2257 true negatives - instances where the model correctly predicted the negative class.
2. 291 false positives - instances where the model incorrectly predicted the positive class.
3. 224 false negatives - instances where the model incorrectly predicted the negative class.
4. 2228 true positives - instances where the model correctly predicted the positive class.

> Precision Score - The precision score of the above model is 0.884, i.e. 2228 / (2228 + 291). Higher precision means fewer false positives among the reviews predicted as positive.

> Recall Score - The above model has a recall score of about 91%, i.e. 2228 / (2228 + 224). Recall is the share of actually positive reviews that the model identified as positive.

> F1 Score - The F1 score for the above model is 0.896. F1 is the harmonic mean of precision and recall, so a higher score means a better balance of the two.

In [None]:
# predict probabilities
y_pred_proba = logistic_regression.predict_proba(tf_idf_matrix_test)[::,1]

# calculate roc curve
fpr, tpr, thresholds = roc_curve(target_test,  y_pred_proba)

#create ROC curve
plt.plot(fpr,tpr)
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

**Output comment:**

> The closer the ROC curve hugs the top-left corner of the plot, the better the model classifies the data. The curve above lies close to that corner, so the area under the curve (AUC) is close to 1, which indicates a good model.
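
> The following is a minimal sketch of how ROC points and the AUC can be computed by hand. It uses made-up labels and scores, and it assumes all scores are distinct (tied scores would need to be grouped). `roc_points` and `trapezoid_auc` are illustrative helpers, not scikit-learn functions.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists obtained by lowering the threshold one score at a time."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(y_score, y_true), reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Made-up labels and predicted probabilities for the positive class.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
print(round(toy_auc, 3))
```

> An AUC of 1 means every positive example is scored above every negative one, while 0.5 corresponds to random ranking.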

In [None]:
# creating instance for KNN

knn = KNeighborsClassifier()

# Fitting k-nearest neighbors model
knn.fit(tf_idf_matrix_train,target_train)

In [None]:
# Making Knn prediction
knn_predictions = knn.predict(tf_idf_matrix_test)
knn_predictions

In [None]:
# Calling method to get performance metrics
model_performance_metrices(target_test, knn_predictions)

**Output comment**:

> Accuracy Score - The KNN model predicts the sentiment correctly for about 77.7% of the reviews in the test set, i.e. (1839 + 2044) of 5000 according to the confusion matrix counts.

> Confusion Matrix - 
1. 1839 true negatives - instances where the model correctly predicted the negative class.
2. 709 false positives - instances where the model incorrectly predicted the positive class.
3. 408 false negatives - instances where the model incorrectly predicted the negative class.
4. 2044 true positives - instances where the model correctly predicted the positive class.

> Precision Score - The precision score of the above model is 0.742, i.e. 2044 / (2044 + 709). Higher precision means fewer false positives among the reviews predicted as positive.

> Recall Score - The above model has a recall score of about 83%, i.e. 2044 / (2044 + 408). Recall is the share of actually positive reviews that the model identified as positive.

> F1 Score - The F1 score for the above model is 0.785. F1 is the harmonic mean of precision and recall, so a higher score means a better balance of the two.

In [None]:
# predict probabilities
y_pred_proba = knn.predict_proba(tf_idf_matrix_test)[::,1]

# calculate roc curve
fpr, tpr, thresholds = roc_curve(target_test,  y_pred_proba)

#create ROC curve
plt.plot(fpr,tpr)
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

**Output comment:**

> The closer the ROC curve hugs the top-left corner of the plot, the better the model classifies the data. The above ROC curve suggests an AUC close to 0.75, which means the KNN model separates the classes better than chance but noticeably worse than the logistic regression model.

**Conclusion:**

> Overall, the logistic regression model outperforms the KNN model on this dataset across accuracy, precision, recall, and F1 score.