Prathvi Shetty                                                                                    

In [1]:
# Python 3.11.4

In [1]:
import pandas as pd
import numpy as np
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
import re
from bs4 import BeautifulSoup
import contractions 
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.metrics import precision_score, recall_score, f1_score
from datetime import datetime
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
import warnings
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\prath\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\prath\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
warnings.simplefilter(action='ignore', category=Warning)

## Read Data

In [3]:
data = pd.read_csv('data.tsv',on_bad_lines='skip', sep="\t",usecols=['star_rating','review_body'])

The above code uses pandas to read the csv file. The dataframe contains data from columns start_rating & review_body only

## Create 2 classes

Below is the code to split convert the start_rating values to either 1 or 2. Any review that has a rating of 1,2 or 3 is set to 1 else it is set to 2
The lambda function is used to update values in the dataframe

In [4]:
data['class'] = data['star_rating'].apply(lambda x: 1 if x in [1,2,3] else 2)

 ## We form two classes and select 50000 reviews randomly from each class.



To get 50,000 reviews of both classes I filter out the data in a dataframe for both class 1 & 2. following this step I randomly select 50,000 reviews from each class using the sample method

In [5]:
balanced_data = pd.DataFrame(columns=data.columns)
for rating in data['class'].unique():
    class_set = data[data['class'] == rating]
    random_sample = class_set.sample(n=50000)
    balanced_data = pd.concat([balanced_data,random_sample]) 

Below is the code to find the average length of reviews in terms of character length in the dataset before cleaning

In [6]:
review_col = balanced_data['review_body']

average_before = sum(len(str(review)) for review in review_col)/len(review_col)

# Data Cleaning



# Pre-processing

Below is the code to perfom cleaning by performing the following steps:
1) The reviews are converted to lower case
2) Any HTML tags are removed using regular expressions
3) Any URLs presnt are removed using regular expression
4) All the non alphabetical characters are removed followed by removal of any extra spaces
5) The library contractions is used to perform contractions

In [7]:
# Converting into lower case
balanced_data['review_body'] = balanced_data['review_body'].str.lower()
# Remove HTML tags
balanced_data['review_body'] = balanced_data['review_body'].apply(lambda d: re.sub(r'<.*?>','',str(d)))
# Remove URLs
balanced_data['review_body'] = balanced_data['review_body'].apply(lambda d: re.sub(r'https?://\S+www.\S+', '', d))
# Remove non alphabetical character
balanced_data['review_body'] = balanced_data['review_body'].apply(lambda d: re.sub(r'[^a-zA-Z\s]','',d))
# Remove extra spaces
balanced_data['review_body'] = balanced_data['review_body'].apply(lambda d: ' '.join(d.split()))
# Perform contractions
balanced_data['review_body'] = balanced_data['review_body'].apply(lambda d: contractions.fix(d))

Below is the code to find the average length of reviews in terms of character length in the dataset after data cleaning

In [8]:
review_col = balanced_data['review_body']

average_after = sum(len(str(review)) for review in review_col)/len(review_col)
print(f"{average_before}, {average_after}")

319.57937, 303.36312


## remove the stop words 

The stopwords are removed in the below cell. The list of stopwords is retreived using nltk

In [9]:
stopwords = set(stopwords.words('english'))
balanced_data['review_body'] = balanced_data['review_body'].apply(lambda d: ' '.join([word for word in d.split() if word not in stopwords])) 

## perform lemmatization  

The below cell performs lemmetization using NLTK's WordNetLemmatizer

In [10]:
lemmatizer = WordNetLemmatizer()
balanced_data['review_body'] = balanced_data['review_body'].apply(lambda d: ' '.join([lemmatizer.lemmatize(word) for word in d.split()]))

Below is the code to find the average length of reviews in terms of character length in the dataset after preprocessing. The average value before and after preprocessing is printed

In [11]:
review_col = balanced_data['review_body']

average_after_preprocessing = sum(len(str(review)) for review in review_col)/len(review_col)
print(f"{average_after}, {average_after_preprocessing}")

303.36312, 188.80034


# TF-IDF and BoW Feature Extraction

1) Sklearn's TfidfVectorizer is used to obtain the document-term matrix of TF-IDF features that is used to train the models. TF-IDF is a numerical statistics used to evaluate the importance of a word within a documentrelative to a collection of documents
2) Sklearn's CountVectorizer is used to obtain matrix of token counts from the collection of texts. This is used to train the models.CountVectorizer converts input data and transforms into bag of words. BOW is a technique that represents text as vector of word counts disregarding the order & context of words in the text

In [12]:
    X = balanced_data['review_body']
    Y = balanced_data['class']

    tfidf_vectorizer = TfidfVectorizer()
    tfidf_X = tfidf_vectorizer.fit_transform(X)
    X_traintf, X_testtf, Y_traintf, Y_testtf = train_test_split(tfidf_X, Y, test_size=0.2)

    count_vectorizer = CountVectorizer()
    bow_X = count_vectorizer.fit_transform(X)
    X_trainbow, X_testbow, Y_trainbow, Y_testbow = train_test_split(bow_X, Y, test_size=0.2)

# Perceptron Using Both Features

# Perceptron using TFIDF

In the below cell the TFIDF features extracted in the previous step is used to train the model. The perceptron model is initialised, trained and used to make predictions. The prediction are measured using the precision, recall & F1 score

In [13]:
perceptron_tfidf = Perceptron()
perceptron_tfidf.fit(X_traintf, Y_traintf.astype('int'))

y_tf = perceptron_tfidf.predict(X_testtf)

precision_tf = precision_score(Y_testtf.astype('int'), y_tf.astype('int'), average='binary')
recall_tf = recall_score(Y_testtf.astype('int'), y_tf.astype('int'), average='binary')
f1_tf = f1_score(Y_testtf.astype('int'), y_tf.astype('int'), average='binary')

print(f"{precision_tf :.4f}; {recall_tf:.4f}; {f1_tf:.4f}")

0.7972; 0.7299; 0.7621


In the below cell the Bag Of Words features extracted in the previous step is used to train the model. The perceptron model is initialised, trained and used to make predictions. The prediction are measured using the precision, recall & F1 score

In [14]:
perceptron_bow = Perceptron()
perceptron_bow.fit(X_trainbow, Y_trainbow.astype('int'))

y_bow = perceptron_bow.predict(X_testbow)

precision_bow = precision_score(Y_testbow.astype('int'),y_bow.astype('int'), average='binary')
recall_bow = recall_score(Y_testbow.astype('int'),y_bow.astype('int'), average='binary')
f1_bow = f1_score(Y_testbow.astype('int'),y_bow.astype('int'), average='binary')

print(f"Prescision: {precision_bow :.4f}; Recall: {recall_bow:.4f}; F1 score: {f1_bow:.4f}")

Prescision: 0.7804; Recall: 0.7551; F1 score: 0.7676


# SVM Using Both Features

# Linear SVM using TFIDF

In the below cell the TFIDF features extracted in the previous step is used to train the model. The SVM model is initialised, trained and used to make predictions. The prediction are measured using the precision, recall & F1 score. The Linear Support Vector Classification is used to fit the data

In [15]:
svm_model_tf = LinearSVC()
svm_model_tf.fit(X_traintf, Y_traintf.astype('int'))


y_tf_svm = svm_model_tf.predict(X_testtf)

# Precision
precision_tf_svm = precision_score(Y_testtf.astype('int'), y_tf_svm.astype('int'), average='binary')

# Recall
recall_tf_svm = recall_score(Y_testtf.astype('int'), y_tf_svm.astype('int'), average='binary')

# F1 score
f1_tf_svm = f1_score(Y_testtf.astype('int'), y_tf_svm.astype('int'), average='binary')

print(f"Prescision: {precision_tf_svm :.4f}; Recall: {recall_tf_svm:.4f}; F1 score: {f1_tf_svm:.4f}")

Prescision: 0.8253; Recall: 0.8443; F1 score: 0.8347


# Linear SVM using BOW

In the below cell the Bag Of Words features extracted in the previous step are used to train the model. The SVM model is initialised, trained and used to make predictions. The prediction are measured using the precision, recall & F1 score. The Linear Support Vector Classification is used to fit the data

In [16]:
svm_model_bow = LinearSVC()
svm_model_bow.fit(X_trainbow, Y_trainbow.astype('int'))

y_bow_svm = svm_model_bow.predict(X_testbow)

# Precision
precision_bow_svm = precision_score(Y_testbow.astype('int'),y_bow_svm.astype('int'), average='binary')

# Recall
recall_bow_svm = recall_score(Y_testbow.astype('int'),y_bow_svm.astype('int'), average='binary')

# F1 score
f1_bow_svm = f1_score(Y_testbow.astype('int'),y_bow_svm.astype('int'), average='binary')

print(f"Prescision: {precision_bow_svm :.4f}; Recall: {recall_bow_svm:.4f}; F1 score: {f1_bow_svm:.4f}")

Prescision: 0.8273; Recall: 0.7896; F1 score: 0.8080


# Logistic Regression Using Both Features

## Logistic Regression using TFIDF

In the below cell the TFIDF features extracted in the previous step are used to train the model. The Logistic Regression model is initialised, trained and used to make predictions.

In [17]:
logisctic_regression_tfidf = LogisticRegression()
logisctic_regression_tfidf.fit(X_traintf, Y_traintf.astype('int'))

y_tf_reg = logisctic_regression_tfidf.predict(X_testtf)

# Precision
precision_tf_reg = precision_score(Y_testtf.astype('int'), y_tf_reg.astype('int'), average='binary')

# Recall
recall_tf_reg = recall_score(Y_testtf.astype('int'), y_tf_reg.astype('int'), average='binary')

# F1 score
f1_tf_reg = f1_score(Y_testtf.astype('int'), y_tf_reg.astype('int'), average='binary')

print(f"Prescision: {precision_tf_reg :.4f}; Recall: {recall_tf_reg:.4f}; F1 score: {f1_tf_reg:.4f}")

Prescision: 0.8270; Recall: 0.8534; F1 score: 0.8400


## Logistic Regression using BOW

In the below cell the Bag Of Words features extracted in the previous step are used to train the model. The Logistic Regression model is initialised, trained and used to make predictions. The prediction are measured using the precision, recall & F1 score. 

In [18]:
logisctic_regression_bow = LogisticRegression(max_iter = 400)
logisctic_regression_bow.fit(X_trainbow, Y_trainbow.astype('int'))

y_bow_reg = logisctic_regression_bow.predict(X_testbow)

# Precision
precision_bow_reg = precision_score(Y_testbow.astype('int'), y_bow_reg.astype('int'), average='binary')

# Recall
recall_bow_reg = recall_score(Y_testbow.astype('int'), y_bow_reg.astype('int'), average='binary')

# F1 score
f1_bow_reg = f1_score(Y_testbow.astype('int'), y_bow_reg.astype('int'), average='binary')

print(f"Prescision: {precision_bow_reg :.4f}; Recall: {recall_bow_reg:.4f}; F1 score: {f1_bow_reg:.4f}")

Prescision: 0.8447; Recall: 0.8075; F1 score: 0.8257


# Naive Bayes Using Both Features

# Naive Bayes on TF-IDF

In the below cell the TFIDF features extracted in the previous step are used to train the model. The Naive Bayes model is initialised, trained and used to make predictions.The Naive Bayes classifier for multinomial models is used to fit the data

In [19]:
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_traintf, Y_traintf.astype('int'))

y_tfidf_nb = nb_tfidf.predict(X_testtf)

# Precision
precision_tfidf_nb = precision_score(Y_testtf.astype('int'), y_tfidf_nb.astype('int'), average='binary')

# Recall
recall_tfidf_nb = recall_score(Y_testtf.astype('int'), y_tfidf_nb.astype('int'), average='binary')

# F!1 score
f1_tfidf_nb = f1_score(Y_testtf.astype('int'), y_tfidf_nb.astype('int'), average='binary')

print(f"Prescision: {precision_tfidf_nb :.4f}; Recall: {recall_tfidf_nb:.4f}; F1 score: {f1_tfidf_nb:.4f}")

Prescision: 0.7871; Recall: 0.8459; F1 score: 0.8154


# Naive Bayes on BOW

In the below cell the TFIDF features extracted in the previous step are used to train the model. The Naive Bayes model is initialised, trained and used to make predictions.The Naive Bayes classifier for multinomial models is used to fit the data

In [20]:
nb_bow = MultinomialNB()
nb_bow.fit(X_trainbow, Y_trainbow.astype('int'))

y_bow_nb = nb_tfidf.predict(X_testbow)

# Precision
precision_bow_nb = precision_score(Y_testbow.astype('int'), y_bow_nb.astype('int'), average='binary')

# Recall
recall_bow_nb = recall_score(Y_testbow.astype('int'), y_bow_nb.astype('int'), average='binary')

# F1 score
f1_bow_nb = f1_score(Y_testbow.astype('int'), y_bow_nb.astype('int'), average='binary')

print(f"Prescision: {precision_bow_nb :.4f}; Recall: {recall_bow_nb:.4f}; F1 score: {f1_bow_nb:.4f}")

Prescision: 0.8132; Recall: 0.8597; F1 score: 0.8358


For the given dataset Perceptron model is giving me the lowest F1 scores when compared to other models.