## Package Requirements

To run the code, make sure the following packages are installed:

- `pandas=1.5.3=py311heda8569_0`
- `numpy=1.24.3=py311hdab7c0b_1`
- `nltk=3.8.1=py311haa95532_0`
- `scikit-learn=1.3.0=py311hf62ec03_0`
- `scikit-learn-intelex=2023.1.1=py311haa95532_0`

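I used the Anaconda Python distribution, which typically comes with these packages pre-installed. The notebook also imports `bs4` and `contractions`, which are installed with pip in the cells below.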
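
Before running the notebook, it can help to confirm which of the listed packages are actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are assumed to match the list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the requirements list above.
REQUIRED = ["pandas", "numpy", "nltk", "scikit-learn", "scikit-learn-intelex"]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as missing can then be installed with conda or pip before continuing.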


In [2]:
#importing/installing necessary libraries
import pandas as pd
import numpy as np
import nltk
nltk.download('wordnet')
import re
from bs4 import BeautifulSoup


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Negar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
! pip install bs4 




## Read Data
First, we download the data and save it as `data.tsv` in the same folder as the Jupyter notebook. Then we create a pandas DataFrame from it.

In [4]:
#reading the downloaded data as a table and creating dataframe "df"
df = pd.read_table('data.tsv', delimiter='\t', header=0, on_bad_lines='skip')
df.head()



Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,43081963,R18RVCKGH1SSI9,B001BM2MAC,307809868,"Scotch Cushion Wrap 7961, 12 Inches x 100 Feet",Office Products,5,0.0,0.0,N,Y,Five Stars,Great product.,2015-08-31
1,US,10951564,R3L4L6LW1PUOFY,B00DZYEXPQ,75004341,"Dust-Off Compressed Gas Duster, Pack of 4",Office Products,5,0.0,1.0,N,Y,"Phffffffft, Phfffffft. Lots of air, and it's C...",What's to say about this commodity item except...,2015-08-31
2,US,21143145,R2J8AWXWTDX2TF,B00RTMUHDW,529689027,Amram Tagger Standard Tag Attaching Tagging Gu...,Office Products,5,0.0,0.0,N,Y,but I am sure I will like it.,"Haven't used yet, but I am sure I will like it.",2015-08-31
3,US,52782374,R1PR37BR7G3M6A,B00D7H8XB6,868449945,AmazonBasics 12-Sheet High-Security Micro-Cut ...,Office Products,1,2.0,3.0,N,Y,and the shredder was dirty and the bin was par...,Although this was labeled as &#34;new&#34; the...,2015-08-31
4,US,24045652,R3BDDDZMZBZDPU,B001XCWP34,33521401,"Derwent Colored Pencils, Inktense Ink Pencils,...",Office Products,4,0.0,0.0,N,Y,Four Stars,Gorgeous colors and easy to use,2015-08-31


## Keep Reviews and Ratings
To do sentiment analysis, we only need the reviews and their ratings, since we are going to predict the sentiment class of a review. We do this by training a model on reviews labeled with their sentiment class.

In [5]:
#Keeping only the review_body and star_rating columns

df = df[['star_rating', 'review_body']]

print(df.head())



  star_rating                                        review_body
0           5                                     Great product.
1           5  What's to say about this commodity item except...
2           5    Haven't used yet, but I am sure I will like it.
3           1  Although this was labeled as &#34;new&#34; the...
4           4                    Gorgeous colors and easy to use


## We form two classes and select 50000 reviews randomly from each class.
This is a binary classification problem, so we form only two sentiment classes (class 1 and class 2). Reviews with a star rating of 1, 2 or 3 go into class 1 (`df_1`), and reviews with a rating of 4 or 5 go into class 2 (`df_2`). From each class, we use the `.sample` method to randomly select 50000 reviews. The `random_state` parameter seeds the random number generator, so passing a fixed integer makes the results reproducible each time we run the code.
We then create a new column to represent the class in each dataframe and remove the `star_rating` column (since we work with classes from now on). The average length of the reviews in each class before data cleaning is also reported in this step.

In [65]:
#splitting our data frame into class 1 and 2 based on star rating
df_1 = df[(df['star_rating'] == 1) | (df['star_rating'] == 2) | (df['star_rating'] == 3)]
df_2 = df[(df['star_rating'] == 4) | (df['star_rating'] == 5)]

#randomly selecting 50000 reviews from each class
df_1 = df_1.sample(n = 50000,random_state = 1)
df_2 = df_2.sample(n = 50000, random_state = 1)

# creating a new column for class in each data frame and dropping star_rating column
df_1['Class'] = 1
df_1.drop(columns=['star_rating'], inplace=True)

df_2['Class'] = 2
df_2.drop(columns=['star_rating'], inplace=True)

print(df_1.head(20))
print(df_2.head(20))

#measuring length of review before data cleaning
class_1_len_before_cleaning = df_1['review_body'].str.len().mean()
class_2_len_before_cleaning = df_2['review_body'].str.len().mean()
print("Average length of class 1 reviews before data cleaning is",class_1_len_before_cleaning)
print("Average length of class 2 reviews before data cleaning is",class_2_len_before_cleaning)


                                               review_body  Class
999431   Crappy color, fuzzy, it puts ink down in layer...      1
1313016  These envelopes are of good quality however I ...      1
2421335  I did not purchase this thru Amazon,     IT IS...      1
20415    Returned it. Very cheap. Zipper wouldn't work....      1
1225242  If you want to print lots of pics all at once,...      1
2377357  Product works as described. It's certainly not...      1
736148   Absolute piece of garbage. Better sharpening a...      1
64340    There is hardly ANY tape on this roll! You nee...      1
2205549  The light is too dim and not enough options to...      1
573481   The cartridges were over filled and ink spilli...      1
1739005  This may have been a random occurrence since t...      1
501817   Broke within 2 weeks. Every time I plugged in ...      1
1206175  Many times unable to read some of the letters/...      1
1965712  This is my second - and last - HP printer.  DO...      1
549869    
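
The effect of `random_state` is easiest to see on a small example. The sketch below uses a made-up toy frame rather than the real review data, and the helper name `sample_per_class` is hypothetical. It maps ratings to the two classes and shows that sampling twice with the same seed returns the same rows.

```python
import pandas as pd

# Toy stand-in for the review table (made-up rows, not the real data).
toy = pd.DataFrame({
    "star_rating": [1, 2, 3, 4, 5, 5, 4, 1],
    "review_body": ["bad", "poor", "meh", "good", "great", "love it", "nice", "broke"],
})

# Ratings 1-3 -> class 1, ratings 4-5 -> class 2.
toy["Class"] = toy["star_rating"].map(lambda r: 1 if r <= 3 else 2)

def sample_per_class(df, n, seed):
    """Draw n rows from each class with a fixed seed so that reruns match."""
    parts = [group.sample(n=n, random_state=seed) for _, group in df.groupby("Class")]
    return pd.concat(parts).drop(columns=["star_rating"])

first = sample_per_class(toy, n=2, seed=1)
second = sample_per_class(toy, n=2, seed=1)
print(first)
print(first.equals(second))
```

Changing the seed, or omitting it, would generally select different rows on each run.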

# Data Cleaning

Data cleaning reduces noise and removes unnecessary information before the analysis. In this step, we use the `str.lower()` method to turn all strings in the review body into lower case. Then, using `str.replace()`, all URLs and HTML tags (matched by regular expressions) are replaced with an empty string, and `str.strip()` removes leading and trailing whitespace.

To expand contractions (e.g. it's -> it is), we install and use the contractions library (https://github.com/kootenpv/contractions) with a lambda function. Before that, we make sure that if x is NaN (a missing value) or not a string (e.g. a value of another data type), it is replaced with an empty string ''. This ensures that all elements in the 'review_body' column are either valid strings or empty strings.

In the final step, `replace('[^a-zA-Z ]', '', regex=True)` removes characters from each element of the 'review_body' column based on a regular expression pattern. The pattern `[^a-zA-Z ]` matches any character that is not a lowercase letter (a-z), an uppercase letter (A-Z), or a space. Specifying `regex=True` makes pandas treat the pattern as a regular expression. The average length of the reviews in each class is reported at the end; it is shorter than it was before data cleaning.

In [70]:
# convert all reviews into lower case
df_1['review_body'] = df_1['review_body'].str.lower()
df_2['review_body'] = df_2['review_body'].str.lower()

#remove urls from the reviews
#regex=True is passed explicitly: pandas 2.0+ treats the pattern as a literal string by default
df_1['review_body'] = df_1['review_body'].str.replace(r'http\S+|www\S+|https\S+', '', case=False, regex=True)
df_2['review_body'] = df_2['review_body'].str.replace(r'http\S+|www\S+|https\S+', '', case=False, regex=True)

#remove html tags from the reviews
df_1['review_body'] = df_1['review_body'].str.replace(r'<.*?>', '', regex=True)
df_2['review_body'] = df_2['review_body'].str.replace(r'<.*?>', '', regex=True)

#remove leading and trailing whitespace
df_1['review_body'] = df_1['review_body'].str.strip()
df_2['review_body'] = df_2['review_body'].str.strip()

print(df_1.head(20))



                                               review_body  Class
999431   crappy color fuzzy it puts ink down in layers ...      1
1313016  these envelopes are of good quality however i ...      1
2421335  i did not purchase this thru amazon     it is ...      1
20415    returned it very cheap zipper wouldnt work buy...      1
1225242  if you want to print lots of pics all at once ...      1
2377357  product works as described its certainly not f...      1
736148   absolute piece of garbage better sharpening al...      1
64340    there is hardly any tape on this roll you need...      1
2205549  the light is too dim and not enough options to...      1
573481   the cartridges were over filled and ink spilli...      1
1739005  this may have been a random occurrence since t...      1
501817   broke within  weeks every time i plugged in my...      1
1206175  many times unable to read some of the lettersn...      1
1965712  this is my second  and last  hp printer  do no...      1
549869    



In [71]:
#expand contractions in the reviews: https://github.com/kootenpv/contractions
!pip install contractions
import contractions
#first we turn NaN (missing values) and non-string values into empty strings, then expand contractions
df_1['review_body'] = df_1['review_body'].apply(lambda x: '' if pd.isna(x) or not isinstance(x, str) else x)
df_1['review_body'] = df_1['review_body'].apply(contractions.fix)

df_2['review_body'] = df_2['review_body'].apply(lambda x: '' if pd.isna(x) or not isinstance(x, str) else x)
df_2['review_body'] = df_2['review_body'].apply(contractions.fix)

print(df_1.head(20))

                                               review_body  Class
999431   crappy color fuzzy it puts ink down in layers ...      1
1313016  these envelopes are of good quality however i ...      1
2421335  i did not purchase this thru amazon     it is ...      1
20415    returned it very cheap zipper would not work b...      1
1225242  if you want to print lots of pics all at once ...      1
2377357  product works as described its certainly not f...      1
736148   absolute piece of garbage better sharpening al...      1
64340    there is hardly any tape on this roll you need...      1
2205549  the light is too dim and not enough options to...      1
573481   the cartridges were over filled and ink spilli...      1
1739005  this may have been a random occurrence since t...      1
501817   broke within  weeks every time i plugged in my...      1
1206175  many times unable to read some of the lettersn...      1
1965712  this is my second  and last  hp printer  do no...      1
549869    

In [67]:
# remove non-alphabetical characters
df_1['review_body'] = df_1['review_body'].replace('[^a-zA-Z ]', '', regex=True)
df_2['review_body'] = df_2['review_body'].replace('[^a-zA-Z ]', '', regex=True)

print(df_1.head(20))

                                               review_body  Class
999431   crappy color fuzzy it puts ink down in layers ...      1
1313016  these envelopes are of good quality however i ...      1
2421335  i did not purchase this thru amazon     it is ...      1
20415    returned it very cheap zipper wouldnt work buy...      1
1225242  if you want to print lots of pics all at once ...      1
2377357  product works as described its certainly not f...      1
736148   absolute piece of garbage better sharpening al...      1
64340    there is hardly any tape on this roll you need...      1
2205549  the light is too dim and not enough options to...      1
573481   the cartridges were over filled and ink spilli...      1
1739005  this may have been a random occurrence since t...      1
501817   broke within  weeks every time i plugged in my...      1
1206175  many times unable to read some of the lettersn...      1
1965712  this is my second  and last  hp printer  do no...      1
549869    

In [72]:
class_1_len_after_cleaning = df_1['review_body'].str.len().mean()
class_2_len_after_cleaning = df_2['review_body'].str.len().mean()
print("Average length of class 1 reviews after data cleaning is",class_1_len_after_cleaning)
print("Average length of class 2 reviews after data cleaning is",class_2_len_after_cleaning)


Average length of class 1 reviews after data cleaning is 365.60226
Average length of class 2 reviews after data cleaning is 248.64432
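
The cleaning steps in this section can also be expressed as one function applied to a single review, which makes them easy to test in isolation. This is a sketch under stated assumptions: `clean_review` is a hypothetical helper, the small `CONTRACTIONS` dictionary is only a stand-in for the `contractions` library, and the `https\S+` alternative is omitted because `http\S+` already matches it.

```python
import re

# Tiny hypothetical stand-in for the `contractions` library, for illustration only.
CONTRACTIONS = {"it's": "it is", "wouldn't": "would not", "don't": "do not"}

URL_RE = re.compile(r"http\S+|www\S+", re.IGNORECASE)
TAG_RE = re.compile(r"<.*?>")
NON_ALPHA_RE = re.compile(r"[^a-zA-Z ]")

def clean_review(text):
    """Apply the cleaning steps described above to one review."""
    if not isinstance(text, str):  # NaN (a float) and other non-strings become ''
        return ""
    text = text.lower()
    text = URL_RE.sub("", text)    # drop URLs
    text = TAG_RE.sub("", text)    # drop HTML tags
    text = text.strip()            # trim leading and trailing whitespace
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return NON_ALPHA_RE.sub("", text)  # keep only letters and spaces

print(clean_review("It's <br />GREAT! See http://example.com now."))
```

A function like this could be applied to the column with `df_1['review_body'].apply(clean_review)`. Note that interior runs of spaces are left in place, as in the notebook's own output.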


# Pre-processing

The first step of pre-processing is to remove stop words (like "a", "the", "is" and "are").
Before that, we tokenize the reviews, meaning that we break each review into its individual words. The `word_tokenize` function from NLTK is used for this purpose. We also download the list of English stop words from NLTK and remove them with a lambda function, which keeps only the tokens in each 'review_body' list that are not in `stop_words`. To improve text analysis, we also perform lemmatization, which reduces words to their base or dictionary form. `lemmatizer.lemmatize(word)` is applied to each word in the list of tokens through a lambda function on the 'review_body' column of both data frames.

In [73]:
from nltk.corpus import stopwords
nltk.download('stopwords')

#tokenizing the reviews
from nltk.tokenize import word_tokenize
nltk.download('punkt')

df_1['review_body'] = df_1['review_body'].astype(str)
df_2['review_body'] = df_2['review_body'].astype(str)

df_1['review_body'] = df_1['review_body'].apply(word_tokenize)
df_2['review_body'] = df_2['review_body'].apply(word_tokenize)

#measuring length of class 1 and 2 reviews before preprocessing
class_1_len_before_preprocessing = df_1['review_body'].str.len().mean()
class_2_len_before_preprocessing = df_2['review_body'].str.len().mean()
print("Average length of class 1 reviews before pre-processing is",class_1_len_before_preprocessing)
print("Average length of class 2 reviews before pre-processing is",class_2_len_before_preprocessing)

#we create a set of stop words (set lookups are faster than list lookups) and then remove all stop words from our review body
stop_words = set(stopwords.words('english'))
df_1['review_body'] = df_1['review_body'].apply(lambda x: [word for word in x if word not in stop_words])
df_2['review_body'] = df_2['review_body'].apply(lambda x: [word for word in x if word not in stop_words])
print(df_1.head(20))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Negar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Negar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Average length of class 1 reviews before pre-processing is 69.52882
Average length of class 2 reviews before pre-processing is 47.24498
                                               review_body  Class
999431   [crappy, color, fuzzy, puts, ink, layers, goes...      1
1313016  [envelopes, good, quality, however, ordered, i...      1
2421335  [purchase, thru, amazon, cheaper, walmart, sho...      1
20415    [returned, cheap, zipper, would, work, buy, so...      1
1225242  [want, print, lots, pics, works, great, howeve...      1
2377357  [product, works, described, certainly, fancy, ...      1
736148   [absolute, piece, garbage, better, sharpening,...      1
64340    [hardly, tape, roll, need, equal, one, roll, r...      1
2205549  [light, dim, enough, options, change, brightne...      1
573481   [cartridges, filled, ink, spilling, good, qual...      1
1739005  [may, random, occurrence, since, product, many...      1
501817   [broke, within, weeks, every, time, plugged, i...      1
120617

## Perform Lemmatization

In [74]:
# Performing lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# the lambda argument is named tokens to avoid shadowing the built-in list
df_1['review_body'] = df_1['review_body'].apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])
df_2['review_body'] = df_2['review_body'].apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])
print(df_1.head(20))


                                               review_body  Class
999431   [crappy, color, fuzzy, put, ink, layer, go, pr...      1
1313016  [envelope, good, quality, however, ordered, in...      1
2421335  [purchase, thru, amazon, cheaper, walmart, sho...      1
20415    [returned, cheap, zipper, would, work, buy, so...      1
1225242  [want, print, lot, pic, work, great, however, ...      1
2377357  [product, work, described, certainly, fancy, g...      1
736148   [absolute, piece, garbage, better, sharpening,...      1
64340    [hardly, tape, roll, need, equal, one, roll, r...      1
2205549  [light, dim, enough, option, change, brightnes...      1
573481   [cartridge, filled, ink, spilling, good, quali...      1
1739005  [may, random, occurrence, since, product, many...      1
501817   [broke, within, week, every, time, plugged, ip...      1
1206175  [many, time, unable, read, lettersnumbers, rui...      1
1965712  [second, last, hp, printer, buy, printer, revi...      1
549869    

In [75]:
#measuring length of class 1 and 2 reviews after preprocessing
class_1_len_after_preprocessing = df_1['review_body'].str.len().mean()
class_2_len_after_preprocessing = df_2['review_body'].str.len().mean()
print("Average length of class 1 reviews after data processing is",class_1_len_after_preprocessing)
print("Average length of class 2 reviews after data processing is",class_2_len_after_preprocessing)


Average length of class 1 reviews after data processing is 33.55024
Average length of class 2 reviews after data processing is 23.4124
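
The tokenize-then-filter step can be illustrated without downloading any NLTK data. In the sketch below, `tokenize` is a simplified stand-in for `word_tokenize`, and `STOP_WORDS` is a small hypothetical set rather than NLTK's English list.

```python
import re

# Small hypothetical stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"a", "an", "the", "is", "are", "it", "this", "and", "to", "of"}

def tokenize(text):
    """Split cleaned, lower-case text into word tokens (simplified stand-in for word_tokenize)."""
    return re.findall(r"[a-z]+", text)

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = tokenize("the light is too dim and this is not enough")
filtered = remove_stop_words(tokens)
print(len(tokens), filtered)
```

One design point worth keeping in mind: NLTK's English list also contains negations such as "not", which is why "would not work" appears as `[..., would, work, ...]` in the output above. Whether to keep negations for sentiment analysis is a choice that may be worth testing.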


# TF-IDF and BoW Feature Extraction
Before moving forward, we concatenate and shuffle the data from class 1 and class 2 to create a balanced dataset. The `.sample` method with `frac=1` does the shuffling for us. Then we split the data into 80% training data and 20% testing data using `train_test_split` from sklearn. In this step, x contains the features and y contains the labels (classes). We then fit each feature extractor on the training data only and apply it to the test data. I had to set `max_features` to 1000 because I ran out of memory without a limit.
`CountVectorizer` does feature extraction based on the Bag of Words method. The `fit_transform()` method is used during the training phase (`bow_x_train = bow_vectorizer.fit_transform(x_train)`) to both fit the `bow_vectorizer` to the training data (`x_train`) and transform it into a sparse matrix. The sparse matrix gives the count of each vocabulary word (feature) in each document (review). The `transform()` method is used during the testing phase (`bow_x_test = bow_vectorizer.transform(x_test)`) to apply the vocabulary learned from the training data to the test data. This ensures that the same vocabulary (word features) used for training is applied to the test data, resulting in consistent feature representations.
We repeat feature extraction using the TF-IDF method as well so that we can compare results from both feature sets. This is done in the same way for train and test data. The output of the TF-IDF vectorization is the TF-IDF (Term Frequency-Inverse Document Frequency) matrix, which weights each word by its importance in a review while down-weighting words that appear in many reviews.

In [76]:
# concatenating data from class 1 and 2 and then shuffling them
balanced_data = pd.concat([df_1, df_2])
balanced_data = balanced_data.sample(frac=1, random_state=1)

print(balanced_data.head())

                                               review_body  Class
2598989  [psc, little, year, cutterhead, supposedly, ye...      1
2323412  [pretty, solidly, built, cabinet, flimsy, thou...      2
1764168  [placed, combo, order, prismacolor, premier, s...      1
1210270  [perfect, replacement, fast, shipping, prior, ...      2
1093602                                       [great, buy]      2


In [77]:
#Splitting data into 80% training data and 20% testing data

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

x = balanced_data['review_body']
y = balanced_data['Class']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#Bag of Words Feature Extraction
#the reviews are already token lists, so the analyzer passes them through unchanged
bow_vectorizer = CountVectorizer(analyzer=lambda x: x, max_features = 1000)

bow_x_train = bow_vectorizer.fit_transform(x_train)
bow_x_test = bow_vectorizer.transform(x_test)

print(bow_x_test)
print(bow_x_test)


  (0, 40)	1
  (0, 64)	1
  (0, 88)	1
  (0, 335)	1
  (0, 590)	1
  (0, 872)	1
  (0, 908)	1
  (1, 308)	1
  (1, 352)	1
  (1, 356)	1
  (1, 774)	1
  (1, 983)	1
  (2, 16)	1
  (2, 35)	1
  (2, 78)	1
  (2, 122)	1
  (2, 352)	1
  (2, 412)	3
  (2, 418)	1
  (2, 488)	1
  (2, 532)	1
  (2, 661)	1
  (2, 674)	1
  (2, 675)	1
  (2, 703)	1
  :	:
  (19999, 87)	1
  (19999, 101)	1
  (19999, 135)	1
  (19999, 162)	1
  (19999, 274)	1
  (19999, 352)	1
  (19999, 395)	1
  (19999, 403)	1
  (19999, 412)	1
  (19999, 450)	1
  (19999, 452)	1
  (19999, 466)	1
  (19999, 488)	1
  (19999, 552)	1
  (19999, 566)	1
  (19999, 573)	1
  (19999, 575)	2
  (19999, 657)	2
  (19999, 658)	1
  (19999, 818)	1
  (19999, 895)	1
  (19999, 903)	1
  (19999, 960)	1
  (19999, 989)	1
  (19999, 996)	1
  (0, 40)	1
  (0, 64)	1
  (0, 88)	1
  (0, 335)	1
  (0, 590)	1
  (0, 872)	1
  (0, 908)	1
  (1, 308)	1
  (1, 352)	1
  (1, 356)	1
  (1, 774)	1
  (1, 983)	1
  (2, 16)	1
  (2, 35)	1
  (2, 78)	1
  (2, 122)	1
  (2, 352)	1
  (2, 412)	3
  (2, 418)	1
  (2, 488)
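
The difference between fitting and transforming can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the function names are hypothetical, and ties between equally frequent tokens may be broken differently by `CountVectorizer`.

```python
from collections import Counter

def fit_vocabulary(train_docs, max_features):
    """Learn the vocabulary from training documents only, keeping the most frequent tokens."""
    counts = Counter(tok for doc in train_docs for tok in doc)
    top = [tok for tok, _ in counts.most_common(max_features)]
    return {tok: idx for idx, tok in enumerate(sorted(top))}

def transform(docs, vocabulary):
    """Turn each token list into a sparse {column index: count} row; unseen tokens are dropped."""
    rows = []
    for doc in docs:
        row = Counter(vocabulary[tok] for tok in doc if tok in vocabulary)
        rows.append(dict(row))
    return rows

train = [["great", "product"], ["bad", "product", "bad"]]
test = [["great", "printer", "great"]]

vocab = fit_vocabulary(train, max_features=3)
test_rows = transform(test, vocab)
print(vocab)
print(test_rows)
```

The token "printer" never occurs in the training documents, so it has no column and is ignored in the test row. This is the same reason the test matrix above has exactly the columns learned from `x_train`.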

In [44]:
#TF-IDF feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer=lambda x: x, max_features = 1000)

tfidf_x_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_x_test = tfidf_vectorizer.transform(x_test)

print(tfidf_x_train)
print(tfidf_x_test)


  (0, 602)	0.6134238710039859
  (0, 356)	0.4776859525926707
  (0, 499)	0.6289096001637421
  (1, 964)	0.20493697448296408
  (1, 509)	0.2703759573757694
  (1, 798)	0.20543586282169735
  (1, 920)	0.1825865198205182
  (1, 592)	0.2685840745635208
  (1, 674)	0.19703140100231215
  (1, 289)	0.2017628236134194
  (1, 635)	0.24789170576434108
  (1, 966)	0.1494722654513359
  (1, 281)	0.2595991219704854
  (1, 386)	0.2838099994002555
  (1, 532)	0.18558720011288682
  (1, 21)	0.22996610304699885
  (1, 787)	0.5536764153141333
  (1, 938)	0.17908731401778002
  (2, 333)	0.17734902785217496
  (2, 664)	0.08723951504516353
  (2, 94)	0.14982025121408182
  (2, 676)	0.16875522035531876
  (2, 984)	0.11604163872358368
  (2, 598)	0.15884865040465349
  (2, 359)	0.15411417277389455
  :	:
  (79999, 550)	0.13793586177413716
  (79999, 870)	0.1173751622943249
  (79999, 38)	0.1722340541953556
  (79999, 692)	0.15904852696031457
  (79999, 162)	0.15115283437870666
  (79999, 278)	0.13067844416521895
  (79999, 431)	0.11794658
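
To make the TF-IDF weights above less opaque, the sketch below computes them by hand for three toy documents. It assumes the formula documented for scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the function name `tfidf_rows` and the toy documents are illustrative only.

```python
import math

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2 row normalisation, one {token: weight} dict per document."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents each token occurs.
    df = {tok: sum(tok in doc for doc in docs) for tok in vocab}
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        weights = {tok: doc.count(tok) * idf[tok] for tok in set(doc)}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return rows

docs = [["great", "product"], ["bad", "product"], ["great", "great", "buy"]]
rows = tfidf_rows(docs)
for row in rows:
    print({tok: round(w, 3) for tok, w in sorted(row.items())})
```

In the second document, "bad" occurs in only one document while "product" occurs in two, so "bad" receives the larger weight even though both appear once.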

# Perceptron Using Both Features

In this step, we train a Perceptron classifier using Bag of Words (BoW) features (`bow_x_train`) and the corresponding labels (`y_train`). The same is done for the TF-IDF features (`tfidf_x_train`). Then we evaluate the model's performance on the test data (`bow_x_test` and `tfidf_x_test`) using accuracy, precision, recall, and F1-score. Note that `precision_score`, `recall_score` and `f1_score` default to `pos_label=1`, so the reported precision, recall and F1 treat class 1 (ratings 1 to 3) as the positive class.
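
For intuition, the classic perceptron learning rule can be sketched in a few lines. This is the textbook algorithm, not scikit-learn's implementation (which adds options such as shuffling and a stopping tolerance); the function names, the +1/-1 label encoding, and the toy vocabulary are assumptions made for illustration.

```python
def train_perceptron(rows, labels, n_features, epochs=10):
    """Textbook perceptron: on each mistake, move the weights toward the correct side.

    rows are sparse {feature index: value} dicts; labels are +1 or -1.
    """
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for row, label in zip(rows, labels):
            score = bias + sum(weights[i] * v for i, v in row.items())
            if label * score <= 0:  # misclassified (or on the boundary)
                for i, v in row.items():
                    weights[i] += label * v
                bias += label
                mistakes += 1
        if mistakes == 0:  # every training row is classified correctly
            break
    return weights, bias

def predict(row, weights, bias):
    """Return +1 or -1 depending on the sign of the linear score."""
    score = bias + sum(weights[i] * v for i, v in row.items())
    return 1 if score > 0 else -1

# Toy BoW rows over a hypothetical vocabulary: 0="great", 1="bad", 2="product"
rows = [{0: 1, 2: 1}, {1: 2, 2: 1}, {0: 2}, {1: 1}]
labels = [1, -1, 1, -1]
weights, bias = train_perceptron(rows, labels, n_features=3)
predictions = [predict(r, weights, bias) for r in rows]
print(weights, bias, predictions)
```

On this separable toy set the loop stops after the second pass, with a positive weight on "great" and a negative weight on "bad".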

In [45]:
#let's train a model using sklearn implementation of perceptron

from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

#creating the perceptron instance(classifier)

ppn_clf = Perceptron()

#bow_features --------------------------
#training the classifier

ppn_clf.fit(bow_x_train, y_train)

#making predictions
bow_ppn_y_predict = ppn_clf.predict(bow_x_test)

#measuring performance using accuracy score
print("accuracy:", accuracy_score(y_test, bow_ppn_y_predict))

#performance using precision, recall and F1 score
bow_ppn_precision = precision_score(y_test, bow_ppn_y_predict)
bow_ppn_recall = recall_score(y_test, bow_ppn_y_predict)
bow_ppn_f1 = f1_score(y_test, bow_ppn_y_predict)
print("bow_ppn_precision:", bow_ppn_precision, ", bow_ppn_recall:", bow_ppn_recall, ", bow_ppn_f1:", bow_ppn_f1)

accuracy: 0.7712
bow_ppn_precision: 0.7625146656237779 , bow_ppn_recall: 0.7841343253569274 , bow_ppn_f1: 0.7731733914940022


In [46]:
#tfidf_features --------------------------

#training the classifier
ppn_clf.fit(tfidf_x_train, y_train)

#making predictions
tfidf_ppn_y_predict = ppn_clf.predict(tfidf_x_test)

#measuring performance using accuracy score
print("accuracy:", accuracy_score(y_test, tfidf_ppn_y_predict))

#performance using precision, recall and F1 score
tfidf_ppn_precision = precision_score(y_test, tfidf_ppn_y_predict)
tfidf_ppn_recall = recall_score(y_test, tfidf_ppn_y_predict)
tfidf_ppn_f1 = f1_score(y_test, tfidf_ppn_y_predict)
print("tfidf_ppn_precision:", tfidf_ppn_precision, ", tfidf_ppn_recall:", tfidf_ppn_recall, ", tfidf_ppn_f1_score:", tfidf_ppn_f1)

accuracy: 0.74315
tfidf_ppn_precision: 0.802186753801684 , tfidf_ppn_recall: 0.641765533882968 , tfidf_ppn_f1_score: 0.7130648494665699
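
Because the labels here are 1 and 2 rather than 0 and 1, it matters which class the metrics treat as "positive". The sketch below computes precision, recall and F1 by hand for a chosen positive label; the helper name `binary_metrics` and the toy labels are illustrative. With scikit-learn's default `pos_label=1`, the numbers printed in this notebook correspond to the first call, where class 1 is the positive class.

```python
def binary_metrics(y_true, y_pred, pos_label):
    """Precision, recall and F1 for the class chosen as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 1]
class_1_metrics = binary_metrics(y_true, y_pred, pos_label=1)
class_2_metrics = binary_metrics(y_true, y_pred, pos_label=2)
print(class_1_metrics)
print(class_2_metrics)
```

The same predictions give different precision and recall depending on the positive class, so it is worth stating which class a reported score refers to. Passing `pos_label=2` to the scikit-learn metric functions would report the scores for class 2 instead.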


# SVM Using Both Features
We train an SVM classifier using both BoW and TF-IDF feature representations and evaluate its performance in terms of accuracy, precision, recall, and F1-score for each representation. This allows us to compare how well the SVM classifier performs with different text feature representations. The `max_iter` parameter is set to a large value (100,000) so that the solver has enough iterations to converge.

In [47]:
from sklearn.svm import SVC

#creating the SVM instance (classifier)
SVM_clf = SVC(max_iter = 100000)

#bow_features --------------------------
#training the classifier
SVM_clf.fit(bow_x_train, y_train)


#making predictions
bow_svm_y_predict = SVM_clf.predict(bow_x_test)

#measuring performance using accuracy score
print("accuracy:", accuracy_score(y_test, bow_svm_y_predict))


#performance using precision, recall and F1 score
bow_svm_precision = precision_score(y_test, bow_svm_y_predict)
bow_svm_recall = recall_score(y_test, bow_svm_y_predict)
bow_svm_f1 = f1_score(y_test, bow_svm_y_predict)
print("bow_svm_precision:", bow_svm_precision, ", bow_svm_recall:", bow_svm_recall, ", bow_svm_f1_score:", bow_svm_f1)


accuracy: 0.82665
bow_svm_precision: 0.8109820485744457 , bow_svm_recall: 0.8493866881158254 , bow_svm_f1_score: 0.8297402150960075


In [48]:
#tfidf_features --------------------------
#training the classifier
SVM_clf.fit(tfidf_x_train, y_train)


#making predictions
tfidf_svm_y_predict = SVM_clf.predict(tfidf_x_test)

#measuring performance using accuracy score
print("accuracy:", accuracy_score(y_test, tfidf_svm_y_predict))


#performance using precision, recall and F1 score
tfidf_svm_precision = precision_score(y_test, tfidf_svm_y_predict)
tfidf_svm_recall = recall_score(y_test, tfidf_svm_y_predict)
tfidf_svm_f1 = f1_score(y_test, tfidf_svm_y_predict)
print("tfidf_svm_precision:", tfidf_svm_precision, ", tfidf_svm_recall:", tfidf_svm_recall, ", tfidf_svm_f1_score:", tfidf_svm_f1)


accuracy: 0.8433
tfidf_svm_precision: 0.8381652104845115 , tfidf_svm_recall: 0.8487834305248341 , tfidf_svm_f1_score: 0.8434409031871316


# Logistic Regression Using Both Features
We train a Logistic Regression classifier with two different text features, Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). After training, we evaluate the classifier's performance using accuracy, precision, recall, and F1-score for each feature set. This lets us compare how well the Logistic Regression classifier performs with each of them.

In [49]:
from sklearn.linear_model import LogisticRegression

LR_clf = LogisticRegression(max_iter = 1000)

#bow_features --------------------------
#training the classifier
LR_clf.fit(bow_x_train, y_train)

#making predictions
bow_lr_y_predict = LR_clf.predict(bow_x_test)

#measuring performance using accuracy score
print("accuracy:", accuracy_score(y_test, bow_lr_y_predict))

#performance using precision, recall and F1 score
bow_lr_precision = precision_score(y_test, bow_lr_y_predict)
bow_lr_recall = recall_score(y_test, bow_lr_y_predict)
bow_lr_f1 = f1_score(y_test, bow_lr_y_predict)
print("bow_lr_precision:", bow_lr_precision, ", bow_lr_recall:", bow_lr_recall, ", bow_lr_f1_score:", bow_lr_f1)

accuracy: 0.82635
bow_lr_precision: 0.8339696625735218 , bow_lr_recall: 0.8125879750653529 , bow_lr_f1_score: 0.8231399908336303


In [79]:
#tfidf_features --------------------------
#training the classifier
LR_clf.fit(tfidf_x_train, y_train)

#making predictions
tfidf_lr_y_predict = LR_clf.predict(tfidf_x_test)

#measuring performance using accuracy score
print("accuracy:", accuracy_score(y_test, tfidf_lr_y_predict))

#performance using precision, recall and F1 score
tfidf_lr_precision = precision_score(y_test, tfidf_lr_y_predict)
tfidf_lr_recall = recall_score(y_test, tfidf_lr_y_predict)
tfidf_lr_f1 = f1_score(y_test, tfidf_lr_y_predict)
print("tfidf_lr_precision:", tfidf_lr_precision, ", tfidf_lr_recall:", tfidf_lr_recall, ", tfidf_lr_f1_score:", tfidf_lr_f1)

accuracy: 0.8297
tfidf_lr_precision: 0.8226124704025256 , tfidf_lr_recall: 0.8383269656143173 , tfidf_lr_f1_score: 0.8303953789463201


# Naive Bayes Using Both Features
In the final step, we train a Naive Bayes classifier using Bag of Words (BoW) features (`bow_x_train`) and the corresponding labels (`y_train`). The same is done for the TF-IDF features (`tfidf_x_train`). We have used Multinomial Naive Bayes for better scores. Then we evaluate the model's performance on the test data (`bow_x_test` and `tfidf_x_test`) using accuracy, precision, recall, and F1-score. These metrics show how well the classifier assigns reviews to the correct class.
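
As background, the idea behind Multinomial Naive Bayes can be sketched directly on token lists: estimate a prior per class and a smoothed probability per token, then pick the class with the highest log posterior. The code below is a simplified illustration with Laplace smoothing (`alpha=1.0`), not scikit-learn's implementation; the function names and toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; out-of-vocabulary tokens are ignored."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

docs = [["broke", "cheap"], ["cheap", "garbage"], ["great", "buy"], ["great", "quality"]]
labels = [1, 1, 2, 2]
log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict_nb(["great", "cheap", "great"], log_prior, log_likelihood))
print(predict_nb(["garbage"], log_prior, log_likelihood))
```

In the first test review, two occurrences of "great" outweigh one "cheap", so the sketch predicts class 2. The smoothing term keeps tokens that were never seen in a class from driving its probability to zero.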

In [53]:
from sklearn.naive_bayes import MultinomialNB

NB_clf = MultinomialNB()

#bow_features --------------------------
#training the classifier
#MultinomialNB accepts sparse matrices directly, so no dense conversion is needed
NB_clf.fit(bow_x_train, y_train)

#making predictions
bow_nb_y_predict = NB_clf.predict(bow_x_test)


#measuring performance using accuracy score
print("accuracy:", accuracy_score(y_test, bow_nb_y_predict))


#performance using precision, recall and F1 score
bow_nb_precision = precision_score(y_test, bow_nb_y_predict)
bow_nb_recall = recall_score(y_test, bow_nb_y_predict)
bow_nb_f1 = f1_score(y_test, bow_nb_y_predict)
print("bow_nb_precision:", bow_nb_precision, ", bow_nb_recall:", bow_nb_recall, ", bow_nb_f1_score:", bow_nb_f1)

accuracy: 0.78715
bow_nb_precision: 0.8239380480583077 , bow_nb_recall: 0.7274281118037402 , bow_nb_f1_score: 0.7726811555508091


In [78]:
#tfidf_features --------------------------
#training the classifier
#MultinomialNB accepts sparse matrices directly, so no dense conversion is needed
NB_clf.fit(tfidf_x_train, y_train)

#making predictions
tfidf_nb_y_predict = NB_clf.predict(tfidf_x_test)

#measuring performance using accuracy score
print("accuracy:", accuracy_score(y_test, tfidf_nb_y_predict))

#performance using precision, recall and F1 score
tfidf_nb_precision = precision_score(y_test, tfidf_nb_y_predict)
tfidf_nb_recall = recall_score(y_test, tfidf_nb_y_predict)
tfidf_nb_f1 = f1_score(y_test, tfidf_nb_y_predict)
print("tfidf_nb_precision:", tfidf_nb_precision, ", tfidf_nb_recall:", tfidf_nb_recall, ", tfidf_nb_f1_score:", tfidf_nb_f1)

accuracy: 0.80255
tfidf_nb_precision: 0.8006015037593985 , tfidf_nb_recall: 0.8029358536094913 , tfidf_nb_f1_score: 0.8017669795692988
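
To compare the eight runs side by side, the scores printed above can be collected into one table. The sketch below copies the accuracy values as reported and rounds precision, recall and F1 to four decimals; the `RESULTS` dictionary and the layout are just one possible way to summarize them.

```python
# (classifier, features) -> (accuracy, precision, recall, F1), copied from the outputs above.
RESULTS = {
    ("Perceptron", "BoW"): (0.7712, 0.7625, 0.7841, 0.7732),
    ("Perceptron", "TF-IDF"): (0.74315, 0.8022, 0.6418, 0.7131),
    ("SVM", "BoW"): (0.82665, 0.8110, 0.8494, 0.8297),
    ("SVM", "TF-IDF"): (0.8433, 0.8382, 0.8488, 0.8434),
    ("Logistic Regression", "BoW"): (0.82635, 0.8340, 0.8126, 0.8231),
    ("Logistic Regression", "TF-IDF"): (0.8297, 0.8226, 0.8383, 0.8304),
    ("Naive Bayes", "BoW"): (0.78715, 0.8239, 0.7274, 0.7727),
    ("Naive Bayes", "TF-IDF"): (0.80255, 0.8006, 0.8029, 0.8018),
}

# Sort the runs from highest to lowest accuracy.
ranked = sorted(RESULTS.items(), key=lambda item: item[1][0], reverse=True)

print(f"{'Classifier':<20}{'Features':<8}{'Acc':>8}{'Prec':>8}{'Rec':>8}{'F1':>8}")
for (clf, feats), (acc, prec, rec, f1) in ranked:
    print(f"{clf:<20}{feats:<8}{acc:>8.4f}{prec:>8.4f}{rec:>8.4f}{f1:>8.4f}")

best = ranked[0][0]
worst = ranked[-1][0]
print("Highest accuracy:", best)
```

On these numbers, the SVM with TF-IDF features has the highest accuracy and F1, and the Perceptron with TF-IDF features has the lowest accuracy.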
