# Introduction

## **Project: Sentiment Analysis on Amazon Product Reviews**

With the rise of e-commerce, online product reviews have become crucial for consumers.

Analyzing vast volumes of reviews manually is impractical. Supervised learning models can streamline sentiment analysis on large-scale datasets.

Our **study focuses on categorizing feedback as positive or negative and building an efficient sentiment analysis model.**



In [1]:
FIRST_NAME = "Muthu"
LAST_NAME = "Selvam"
STUDENT_ID = "801276057"

In [2]:
pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m289.9/289.9 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m110.7/110.7 kB[0m [31m6.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.1.0 textsearch-0.0.24


In [3]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
import re
from contractions import contractions_dict
from string import punctuation
import warnings
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

warnings.filterwarnings("ignore")
pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Create Amazon Customer Reviews DataFrame from JSON objects

In [4]:
import pandas as pd
import gzip
import json

def parse(path):
  g = gzip.open(path, 'rb')
  for l in g:
    yield json.loads(l)

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

df = getDF('amazon_product_reviews.json.gz')

df.head()

EOFError: Compressed file ended before the end-of-stream marker was reached

In [None]:
df = df.dropna(subset = ['reviewText','summary'])
df.isna().sum()

In [None]:
print(df['overall'].value_counts())
df['overall'].value_counts(normalize=True) * 100

## Separate positive and negative reviews for analysis

In [None]:
df_negative_reviews = df[df['overall']<3].iloc[:50000]
df_positive_reviews = df[df['overall']>3].iloc[:50000]

In [None]:
df_new = df.iloc[:100]
df_new.head()

In [None]:
# @title overall

from matplotlib import pyplot as plt
df_new['overall'].plot(kind='line', figsize=(8, 4), title='overall')
plt.gca().spines[['top', 'right']].set_visible(False)

In [None]:
# @title overall vs unixReviewTime

from matplotlib import pyplot as plt
df_new.plot(kind='scatter', x='overall', y='unixReviewTime', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

## Review Text PreProcssing Function Definitions

In [None]:
def expand_contractions(text, contractions_dict):
    contractions_pattern = re.compile('({})'.format('|'.join(contractions_dict.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contractions_dict.get(match) \
            if contractions_dict.get(match) \
            else contractions_dict.get(match.lower())
        expanded_contraction = expanded_contraction
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text



def strip_punctuation(s):
    return ''.join(c for c in s if c not in punctuation)

In [None]:
from nltk.tokenize import RegexpTokenizer
def cleanme(txt):
    sent = txt.lower()
    sent_expanded_contractions = expand_contractions(sent,contractions_dict)
    sent_expanded_contractions = re.sub(r'(?<=[.,])(?=[^\s])', r' ', sent_expanded_contractions)
    sent_without_punct = strip_punctuation(sent_expanded_contractions)
    sent_without_digits=re.sub('[0-9]+', '', sent_without_punct)

    TOKENIZER = RegexpTokenizer('(?u)\W+|\$[\d\.]+|\S+')
    wrds = word_tokenize(sent_without_digits)
    to_remove = ['no', 'not']
    new_stopwords = set(stopwords.words('english')).difference(to_remove)
    clwrds = [w for w in wrds if not w in new_stopwords]
    ln = len(clwrds)
    if ln>0:
        pos = pd.DataFrame(pos_tag(wrds))
        pos = (" ".join(list(pos[pos[1].str.contains("JJ")].iloc[:,0]))).split(" ")
        l2 = ["i","you","me"]
        pos = [x for x in pos if x not in l2]
    else:
        pos = [""]
    rt = [ln, " ".join(clwrds), " ".join(pos)]
    return(rt)

## Create Negative Reviews WordCloud

**This will take take time to load all the dataset. Please wait.**

In [None]:
tmp = list()
num_rows = min(50000, len(df_negative_reviews))
for i in range(num_rows):
    tmp.append(cleanme(df_negative_reviews.iloc[i,:]['reviewText']))

tmp = pd.DataFrame(tmp)
tmp.columns = ['reviewlen', 'cleanrev', 'adjreview']

df_negative_reviews_new = df_negative_reviews.reset_index()
df_negative_reviews_new = pd.concat([df_negative_reviews_new,tmp], axis=1)
df_negative_reviews_new = df_negative_reviews_new[['overall','reviewText','summary','reviewlen', 'cleanrev', 'adjreview']]
df_negative_reviews_new.head()


In [None]:
# @title reviewlen

from matplotlib import pyplot as plt
df_negative_reviews_new['reviewlen'].plot(kind='line', figsize=(8, 4), title='reviewlen')
plt.gca().spines[['top', 'right']].set_visible(False)

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
nltk.download('wordnet')

wordnet_lemmatizer = WordNetLemmatizer()
snowball_stemmer = SnowballStemmer('english')

txt = df_negative_reviews_new.cleanrev.str.lower().str.replace(r'\|', ' ').str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in words]
bgs = nltk.trigrams(lemmatized_word)

#compute frequency distribution for all the bigrams in the text
fdist = nltk.FreqDist(bgs)
fdist.most_common(40)

In [None]:
d = {}
for key, value in fdist.items() :
    d["_".join(key)] = value

In [None]:
!pip install wordcloud

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
WC_height = 200
WC_width = 400
WC_max_words = 50
wordcloud = WordCloud(max_words=WC_max_words, height=WC_height, width=WC_width, background_color="white")
wordcloud.generate_from_frequencies(frequencies=d)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
wordcloud.to_file("WordCloud_Bigrams_frequent_words.png")

## Create Positive Reviews WordCloud

**This will take take time to load all the dataset. Please wait.**

In [None]:
tmp = list()
for i in range(50000):
    tmp.append(cleanme(df_positive_reviews.iloc[i,:]['reviewText']))
tmp = pd.DataFrame(tmp)
tmp.columns = ['reviewlen', 'cleanrev', 'adjreview']

(tmp.head())


df_positive_reviews_new = df_positive_reviews.reset_index()
df_positive_reviews_new = pd.concat([df_positive_reviews_new,tmp], axis=1)
df_positive_reviews_new = df_positive_reviews_new[['overall','reviewText','summary','reviewlen', 'cleanrev', 'adjreview']]
df_positive_reviews_new.head()

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer

wordnet_lemmatizer = WordNetLemmatizer()
snowball_stemmer = SnowballStemmer('english')

txt = df_positive_reviews_new.cleanrev.str.lower().str.replace(r'\|', ' ').str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in words]

bgs = nltk.trigrams(lemmatized_word)

#compute frequency distribution for all the bigrams in the text
fdist = nltk.FreqDist(bgs)
fdist.most_common(30)

In [None]:
d = {}
for key, value in fdist.items() :
    d["_".join(key)] = value

In [None]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
WC_height = 200
WC_width = 400
WC_max_words = 50
wordcloud = WordCloud(max_words=WC_max_words, height=WC_height, width=WC_width, background_color="white")
wordcloud.generate_from_frequencies(frequencies=d)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
wordcloud.to_file("WordCloud_Positive_Reviews.png")

## PreProcess 50,000 reviews to be used to build classification models

**This will take take time to load all the dataset. Please wait.**

In [None]:
df_new = df.iloc[:50000]


tmp = list()
for i in range(50000):
    tmp.append(cleanme(df_new.iloc[i,:]['reviewText']))
tmp = pd.DataFrame(tmp)
tmp.columns = ['reviewlen', 'cleanrev', 'adjreview']

(tmp.head())


df_new = df_new.reset_index()
df_new = pd.concat([df_new,tmp], axis=1)
df_new = df_new[['overall','reviewText','summary','reviewlen', 'cleanrev', 'adjreview']]
df_new.head()

In [None]:
df_new.columns = ['overall_rating','reviewText','summary','cleanReviewLength', 'cleanReview', 'adjectives']
df_new.head()

## Calculate Polarity of Reviews

In [None]:
!pip install textblob

from textblob import TextBlob, Word
def detect_polarity(text):
    return TextBlob(text).sentiment.polarity

df_new['polarity'] = df_new.reviewText.apply(detect_polarity)
df_new[1:10]

## Naive Bayes Multi-Class Classifier

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import numpy as np
from scipy import sparse

tfidf = TfidfVectorizer(sublinear_tf=False, max_features = 10000, min_df=5,max_df=0.60,ngram_range= (1,2))

review_df = pd.concat([
    df_new[df_new['overall_rating']==1.0].sample(n=10000, replace=True),
    df_new[df_new['overall_rating']==2.0].sample(n=10000, replace=True),
    df_new[df_new['overall_rating']==3.0].sample(n=10000, replace=True),
    df_new[df_new['overall_rating']==4.0].sample(n=10000, replace=True),
    df_new[df_new['overall_rating']==5.0].sample(n=10000, replace=True)
])

review_df = review_df[review_df['cleanReviewLength']<50]
review_df = review_df[['cleanReview','overall_rating']]
train, test = train_test_split(review_df, test_size=0.2)

train['overall_rating'].hist();
test['overall_rating'].hist();

train = pd.get_dummies(train, columns = ['overall_rating'])
train.head()

test = pd.get_dummies(test, columns = ['overall_rating'])
test.head()

train.shape, test.shape


In [None]:
class NBFeatures(BaseEstimator):
    '''Class implementation of Jeremy Howards NB Linear model'''
    def __init__(self, alpha):
        # Smoothing Parameter: always going to be one for my use
        self.alpha = alpha

    def preprocess_x(self, x, r):
        return x.multiply(r)

    # calculate probabilities
    def pr(self, x, y_i, y):
        p = x[y == y_i].sum(0)
        return (p + self.alpha)/((y==y_i).sum()+self.alpha)

    # calculate the log ratio and represent as sparse matrix
    # ie fit the nb model
    def fit(self, x, y = None):
        self._r = sparse.csr_matrix(np.log(self.pr(x, 1, y) /self.pr(x, 0, y)))
        return self

    # apply the nb fit to original features x
    def transform(self, x):
        x_nb = self.preprocess_x(x, self._r)
        return x_nb

In [None]:
# Create pipeline using sklearn pipeline:
    # I basically create my tfidf features which are fed to my NB model
    # for probability calculations. Then those are fed as input to my
    # logistic regression model.
lr = LogisticRegression()
nb = NBFeatures(1)
p = Pipeline([
    ('tfidf', tfidf),
    ('nb', nb),
    ('lr', lr)
])

In [None]:
class_names = ['overall_rating_1.0', 'overall_rating_2.0','overall_rating_3.0','overall_rating_4.0','overall_rating_5.0']
scores = []
preds = np.zeros((len(test), len(class_names)))
for i, class_name in enumerate(class_names):
    train_target = train[class_name]
    cv_score = np.mean(cross_val_score(estimator = p, X = train['cleanReview'].values,
                                      y = train_target, cv = 3, scoring = 'accuracy'))
    scores.append(cv_score)
    print('CV score for class {} is {}'.format(class_name, cv_score))
    p.fit(train['cleanReview'].values, train_target)
    preds[:,i] = p.predict_proba(test['cleanReview'].values)[:,1]

In [None]:
t = metrics.classification_report(np.argmax(test[class_names].values, axis = 1),np.argmax(preds, axis = 1))
print(t)

## Some Analysis

In [None]:
import seaborn as sns
print(df_new['overall_rating'].value_counts())
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [8, 5]})
sns.distplot(
    (df_new['overall_rating']), norm_hist=False, kde=False, bins=20, hist_kws={"alpha": 1}
).set(xlabel='Overall Rating', ylabel='Count');

In [None]:
df_res = df_new[df_new['cleanReviewLength']>0]
print(df_res.groupby('overall_rating', as_index=False)['cleanReviewLength'].mean())
print(df_res.groupby('overall_rating', as_index=False)['polarity'].mean())

sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [8, 5]})
sns.distplot(
    (df_res['cleanReviewLength']), norm_hist=False, kde=False, bins=20, hist_kws={"alpha": 1}
).set(xlabel='Review Length', ylabel='Count');

## Some Other Multi-Class Classifier Models which also takes into account Review Length and Polarity

In [None]:
df_res = pd.concat([
    df_new[df_new['overall_rating']==1.0].sample(n=10000, replace=True),
    df_new[df_new['overall_rating']==2.0].sample(n=10000, replace=True),
    df_new[df_new['overall_rating']==3.0].sample(n=10000, replace=True),
    df_new[df_new['overall_rating']==4.0].sample(n=10000, replace=True),
    df_new[df_new['overall_rating']==5.0].sample(n=10000, replace=True)
])
df_res = df_res[df_res['cleanReviewLength'] < 50]
df_res['overall_rating'].value_counts()


In [None]:
df_100 = df_res.copy()

v = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.60)
x = v.fit_transform(df_100['cleanReview'])

df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
df_100 = df_100.drop('cleanReview', axis=1)

df_100.reset_index(drop=True, inplace=True)
df1.reset_index(drop=True, inplace=True)

res = pd.concat([df_100, df1], axis=1)


In [None]:
res1 = res[res.columns.difference(['reviewText','summary', 'adjectives', 'overall_rating'])]
normalized_res1 = res1
normalized_res1['cleanReviewLength']= (normalized_res1['cleanReviewLength']-normalized_res1['cleanReviewLength'].min())/(normalized_res1['cleanReviewLength'].max()-normalized_res1['cleanReviewLength'].min())
normalized_res1['polarity']= (normalized_res1['polarity']-normalized_res1['polarity'].min())/(normalized_res1['polarity'].max()-normalized_res1['polarity'].min())
y = res['overall_rating'].values.reshape(-1,1)

res1.shape, y.shape

## Logistic Regression Classifier and ROC Curve

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn import preprocessing

def multiclass_roc_auc_score(y_test, y_pred, average="macro"):
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_auc_score(y_test, y_pred, average=average)


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

X_train, X_test, y_train, y_test = train_test_split(normalized_res1, y, test_size=0.2, random_state= 51)
lr = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg').fit(X_train,y_train)

print ("Multinomial Logistic regression Train Accuracy :: ", metrics.accuracy_score(y_train, lr.predict(X_train)))
print ("Multinomial Logistic regression Test Accuracy :: ", metrics.accuracy_score(y_test, lr.predict(X_test)))
print ("Area under ROC curve:: ",multiclass_roc_auc_score(y_test,lr.predict(X_test)))

cnf_matrix = metrics.confusion_matrix(y_test, lr.predict(X_test))


class_names=[1,2,3,4,5] # name  of classes
fig, ax = plt.subplots()
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
ax.xaxis.set_ticklabels(class_names)
ax.yaxis.set_ticklabels(class_names)
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt


y = label_binarize(y, classes=[1,2,3,4,5])
n_classes = 5

# shuffle and split training and test sets
X_train, X_test, y_train, y_test =train_test_split(normalized_res1, y, test_size=0.2, random_state=51)

# classifier
clf = OneVsRestClassifier(linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg'))
y_score = clf.fit(X_train, y_train).decision_function(X_test)


# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
colors = ['blue', 'red', 'green', 'yellow', 'violet']
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i+1, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()

## SVM Linear Classifier and ROC Curve

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Assuming y_train and y_test are one-hot encoded
# Convert them to 1D arrays by selecting the index of the maximum value
y_train_1d = np.argmax(y_train, axis=1)
y_test_1d = np.argmax(y_test, axis=1)

# Check the shape of normalized_res1 and y_train_1d
print("Shape of normalized_res1:", normalized_res1.shape)
print("Shape of y_train_1d:", y_train_1d.shape)

# Check if the number of samples in y_train_1d is less than the number of samples in normalized_res1
if y_train_1d.shape[0] < normalized_res1.shape[0]:
    # Trim normalized_res1 to match the number of samples in y_train_1d
    normalized_res1_trimmed = normalized_res1[:y_train_1d.shape[0]]
    print("Shape of trimmed normalized_res1:", normalized_res1_trimmed.shape)

    # Use the trimmed normalized_res1 for training
    X_train, X_test, y_train_1d, y_test_1d = train_test_split(normalized_res1_trimmed, y_train_1d, test_size=0.2, random_state=51)
else:
    # Use the original normalized_res1 and y_train_1d for training
    X_train, X_test, y_train_1d, y_test_1d = train_test_split(normalized_res1, y_train_1d, test_size=0.2, random_state=51)

# Train LinearSVC model
svm = LinearSVC()
svm.fit(X_train, y_train_1d)

# Evaluate model performance
train_accuracy = metrics.accuracy_score(y_train_1d, svm.predict(X_train))
test_accuracy = metrics.accuracy_score(y_test_1d, svm.predict(X_test))
print("Multinomial SVM Train Accuracy:", train_accuracy)
print("Multinomial SVM Test Accuracy:", test_accuracy)

# Calculate and display confusion matrix
cnf_matrix = metrics.confusion_matrix(y_test_1d, svm.predict(X_test))
class_names = np.unique(y_test_1d)  # Use unique classes from the test set

plt.figure(figsize=(8, 6))
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('Actual label')
plt.xticks(ticks=np.arange(len(class_names)) + 0.5, labels=class_names)
plt.yticks(ticks=np.arange(len(class_names)) + 0.5, labels=class_names)
plt.show()


In [None]:
from sklearn.metrics import roc_curve, auc
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.preprocessing import label_binarize
# from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt


y = label_binarize(y, classes=[1,2,3,4,5])
n_classes = 5

# shuffle and split training and test sets
X_train, X_test, y_train, y_test =train_test_split(normalized_res1, y, test_size=0.2, random_state=51)

# classifier
clf = OneVsRestClassifier(LinearSVC(random_state=0))
y_score = clf.fit(X_train, y_train).decision_function(X_test)


#Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
colors = ['blue', 'red', 'green', 'yellow', 'violet']
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i+1, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()

## Multinomial Naive Bayes Multi-Class Classifier

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(df_res['cleanReview'], df_res['overall_rating'], random_state = 51)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
print ("Multinomial Naive Bayes Train Accuracy :: ", metrics.accuracy_score(y_train, clf.predict(X_train_tfidf)))
print ("Multinomial Naive Bayes Accuracy :: ", metrics.accuracy_score(y_test, clf.predict(count_vect.transform(X_test))))
print ("Area under ROC curve:: ",multiclass_roc_auc_score(y_test,clf.predict(count_vect.transform(X_test))))

cnf_matrix = metrics.confusion_matrix(y_test, clf.predict(count_vect.transform(X_test)))
cnf_matrix

class_names=[1,2,3,4,5] # name  of classes
fig, ax = plt.subplots()
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
ax.xaxis.set_ticklabels(class_names)
ax.yaxis.set_ticklabels(class_names)
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
# from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt


y = label_binarize(y, classes=[1,2,3,4,5])
n_classes = 5

# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(df_res['cleanReview'], y, random_state = 51)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# classifier
clf = OneVsRestClassifier(MultinomialNB())
y_score = clf.fit(X_train_tfidf, y_train).predict_proba(count_vect.transform(X_test))


#Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:,i], y_score[:,i])
    roc_auc[i] = auc(fpr[i], tpr[i])
colors = ['blue', 'red', 'green', 'yellow', 'violet']
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i+1, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()

# Accuracy of Result

**Logistic Regression Classifier:** Achieved a training accuracy of around 73.87% and a test accuracy of around 63.39%. The area under the ROC curve is approximately 0.77.

**SVM Linear Classifier:** Achieved a training accuracy of around 50.19% and a test accuracy of around 19.95%. This model seems to perform poorly compared to Logistic Regression.

**Naive Bayes Multi-Class Classifier:** Achieved a training accuracy of around 73.15% and a test accuracy of around 60.34%. The area under the ROC curve is approximately 0.75.

**Automated Data Labeling using PCA:** Utilized PCA to reduce the dimensionality of TF-IDF vectorized review data, followed by KMeans clustering to assign cluster labels as automated data labels for sentiment analysis.

Overall, Logistic Regression and Naive Bayes classifiers performed relatively better compared to SVM. It's interesting to see the use of PCA for automated data labeling, which can be a useful technique for exploratory analysis and understanding the data distribution.

# Automated data labelling using PCA and generalizing the approach

Automating data labeling using PCA (Principal Component Analysis) for sentiment analysis involves reducing the dimensionality of the data and then using clustering algorithms to group similar data points together.

In this code:


1.   We perform PCA to reduce the dimensionality of the TF-IDF vectorized review data.

2.   We then scale the data and apply PCA to obtain 2 principal components for visualization.

3.   We visualize the PCA components to understand the data distribution.

4.   We use the Elbow Method to determine the optimal number of clusters for KMeans clustering.

5.   We apply KMeans clustering with the optimal number of clusters.

6.   We assign cluster labels to the data.

7.   We visualize the clustered data to see how the data points are grouped.

8.   We label the clusters based on the interpretation of reviews within each cluster.

8.   Finally, we can use these cluster labels as automated data labels for sentiment analysis.

In [None]:
!pip show scikit-learn

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster._kmeans import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Assuming df_negative_reviews_new contains the preprocessed negative reviews data
# Perform PCA to reduce dimensionality
X = df_negative_reviews_new['cleanrev']
vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X = vectorizer.fit_transform(X)

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.toarray())

# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize PCA components
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Components')
plt.show()

# Determine optimal number of clusters using Elbow Method
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X_pca)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

# Based on the elbow method, select optimal number of clusters
optimal_clusters = 3  # Adjust as needed

# Apply KMeans clustering
kmeans = KMeans(n_clusters=optimal_clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_pca)

# Assign cluster labels to data
df_negative_reviews_new['cluster_label'] = cluster_labels

# Visualize clustered data
plt.figure(figsize=(8, 6))
for i in range(optimal_clusters):
    plt.scatter(X_pca[cluster_labels == i, 0], X_pca[cluster_labels == i, 1], label=f'Cluster {i}', alpha=0.5)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', label='Centroids')
plt.title('Clustered Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()


# Conclusion


Based on the analysis conducted for sentiment analysis using various classifiers and automated data labeling techniques:

**Classifier Performance:**
*   Logistic Regression and Naive Bayes classifiers outperformed the SVM classifier in terms of accuracy and area under the ROC curve.

*   Logistic Regression achieved the highest accuracy among the tested classifiers, followed closely by Naive Bayes.


**Model Evaluation:**
*   The evaluation metrics, including accuracy, precision, recall, and F1-score, provide insights into the performance of each classifier across different sentiment classes.

*   Logistic Regression and Naive Bayes classifiers demonstrated relatively balanced performance across all sentiment classes.

**Automated Data Labeling:**
*   Utilizing PCA and KMeans clustering for automated data labeling proved to be a valuable technique for exploratory analysis and understanding the data distribution.

*   Automated data labeling can provide insights into the underlying patterns in the data and assist in feature engineering for improving classifier performance.

**Future Directions:**
*   Further experimentation with advanced feature engineering techniques, such as word embeddings or deep learning models, could potentially improve sentiment classification accuracy.

*   Exploring ensemble learning methods or model stacking techniques may enhance overall model performance by combining the strengths of multiple classifiers.

*   Continuous monitoring and updating of the sentiment analysis model with new data can ensure its relevance and effectiveness over time.

**Limitations:**
*   The analysis focused primarily on traditional machine learning classifiers, and incorporating more advanced techniques could lead to further improvements.

*   The evaluation metrics used may not fully capture the nuances of sentiment analysis, and exploring additional metrics or domain-specific evaluation approaches could provide deeper insights.


In conclusion, the sentiment analysis project highlights the effectiveness of logistic regression and naive Bayes classifiers for sentiment classification tasks. Additionally, the use of automated data labeling techniques such as PCA and KMeans clustering can aid in exploratory analysis and improve understanding of the underlying data distribution. Moving forward, further experimentation with advanced techniques and continuous model refinement are essential for enhancing sentiment analysis accuracy and applicability in real-world scenarios.