<h2>Text Preprocessing</h2>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import re 
import spacy

In [2]:
preprocessor = spacy.load("en_core_web_sm", disable=["ner", "parser"])

In [3]:
html = re.compile(r"<.*?>")
url = re.compile(r"http\S+|www\S+")  # match links starting with either "http" or "www"
punctuation = re.compile(r"[^\w\s]")
numbers = re.compile(r"\d+")

def standardization(text):
    return numbers.sub("", punctuation.sub("", url.sub("", html.sub("",text.lower()))))

def stop_removal_and_lemmatize(text):
    text = standardization(text)
    doc = preprocessor(text)
    return " ".join([token.lemma_ for token in doc if not token.is_stop and token.is_alpha])
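Because `standardization` substitutes the empty string, words separated only by punctuation or markup are fused, which is visible in later outputs as tokens such as `peanutsthe` and `chickenthis`. The following is a minimal sketch of one possible alternative, not the function used in this notebook: `standardization_spaced` and the `whitespace` pattern are hypothetical names, the sample sentence is made up, and the URL pattern assumes the intent was to match either `http...` or `www...` links.

```python
import re

html = re.compile(r"<.*?>")
url = re.compile(r"http\S+|www\S+")
punctuation = re.compile(r"[^\w\s]")
numbers = re.compile(r"\d+")
whitespace = re.compile(r"\s+")  # hypothetical helper pattern

def standardization_spaced(text):
    """Hypothetical variant: replace each removed span with a space, then collapse whitespace."""
    text = text.lower()
    for pattern in (html, url, punctuation, numbers):
        text = pattern.sub(" ", text)
    return whitespace.sub(" ", text).strip()

# Made-up sample containing markup, ellipsis punctuation and two links
sample = "Jumbo Salted Peanuts...the peanuts<br />were small. See www.example.com or http://example.com/p?id=1"
cleaned = standardization_spaced(sample)
print(cleaned)
```

Substituting a space keeps `peanuts` and `the` as separate tokens, so the stop-word filter and lemmatizer see real words instead of a fused string.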

In [60]:
data = pd.read_csv("Reviews.csv")
data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [5]:
feature = data[['Text']]
feature

Unnamed: 0,Text
0,I have bought several of the Vitality canned d...
1,Product arrived labeled as Jumbo Salted Peanut...
2,This is a confection that has been around a fe...
3,If you are looking for the secret ingredient i...
4,Great taffy at a great price. There was a wid...
...,...
568449,Great for sesame chicken..this is a good if no...
568450,I'm disappointed with the flavor. The chocolat...
568451,"These stars are small, so you can give 10-15 o..."
568452,These are the BEST treats for training and rew...


In [6]:
feature = feature.copy()  # work on an explicit copy so pandas does not raise SettingWithCopyWarning
feature['Text'] = feature['Text'].apply(stop_removal_and_lemmatize)


In [57]:
feature_copy = feature.copy()

In [58]:
feature_copy

Unnamed: 0,Text
0,buy vitality can dog food product find good qu...
1,product arrive label jumbo salt peanutsthe pea...
2,confection century light pillowy citrus gelati...
3,look secret ingredient robitussin believe find...
4,great taffy great price wide assortment yummy ...
...,...
568449,great sesame chickenthis good well resturant e...
568450,m disappointed flavor chocolate note especiall...
568451,star small training session try train dog ceas...
568452,good treat training reward dog good groom low ...


In [61]:
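def sentiment_conversion(score):
    # Map the 1-5 star rating to a sentiment label:
    # 4-5 -> 1 (positive), 2-3 -> 0 (neutral), 1 -> -1 (negative)
    if score > 3:
        return 1
    elif score > 1:
        return 0
    else:
        return -1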

feature_copy['Score'] = data['Score'].apply(sentiment_conversion)
feature_copy

Unnamed: 0,Text,Score
0,buy vitality can dog food product find good qu...,1
1,product arrive label jumbo salt peanutsthe pea...,-1
2,confection century light pillowy citrus gelati...,1
3,look secret ingredient robitussin believe find...,0
4,great taffy great price wide assortment yummy ...,1
...,...,...
568449,great sesame chickenthis good well resturant e...,1
568450,m disappointed flavor chocolate note especiall...,0
568451,star small training session try train dog ceas...,1
568452,good treat training reward dog good groom low ...,1
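To make the label boundaries explicit, this small sketch repeats `sentiment_conversion` and lists the label assigned to each possible star rating. The `mapping` dictionary is only an illustration and is not part of the notebook.

```python
def sentiment_conversion(score):
    if score > 3:
        return 1
    elif score > 1:
        return 0
    else:
        return -1

# Enumerate every star rating from 1 to 5 and record its sentiment label
mapping = {stars: sentiment_conversion(stars) for stars in range(1, 6)}
print(mapping)
```

Two-star reviews fall in the neutral class together with three-star reviews, and only one-star reviews are labelled negative.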


<h2>Feature Extraction</h2>

In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

x = feature_copy['Text']
y = feature_copy['Score']

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size= 0.2)

vectorizer = TfidfVectorizer()

x_train_processed = vectorizer.fit_transform(x_train) 
x_test_processed = vectorizer.transform(x_test)

print("Size of training set: ", x_train_processed.shape[0])

Size of training set:  454763
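The vectorizer is fitted on the training split only and then reused to transform the test split, so the vocabulary never sees test data. Below is a minimal sketch of that behaviour on a made-up corpus; `train_docs`, `test_docs` and the `*_toy` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for the preprocessed reviews
train_docs = ["great taffy great price", "dog food good quality"]
test_docs = ["great dog treat"]

toy_vectorizer = TfidfVectorizer()
x_train_toy = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
x_test_toy = toy_vectorizer.transform(test_docs)        # reuses them, ignoring unseen words

print(sorted(toy_vectorizer.vocabulary_))
print(x_train_toy.shape, x_test_toy.shape)
```

The word `treat` appears only in the test document, so it has no column; both matrices share the same number of features.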


<h2>Model Selection</h2>

<h6>Lexicon-based approach</h6>

In [63]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 

In [64]:
# VADER works directly on text and needs no vectorizer, so it is applied to the preprocessed text rather than the TF-IDF features.
vader_analyser = SentimentIntensityAnalyzer()

# Map VADER's compound score (range -1 to 1) to the labels -1, 0 and 1
def vader_to_label(score):
    if score <= -0.1:
        return -1  
    elif score >= 0.1:
        return 1   
    else:
        return 0   

#Predicted score
feature_copy['VaderScore'] = feature_copy['Text'].apply(lambda x: vader_analyser.polarity_scores(x)['compound'])
feature_copy['PredictedScore'] = feature_copy['VaderScore'].apply(vader_to_label)


In [65]:
feature_copy

Unnamed: 0,Text,Score,VaderScore,PredictedScore
0,buy vitality can dog food product find good qu...,1,0.9118,1
1,product arrive label jumbo salt peanutsthe pea...,-1,-0.1027,-1
2,confection century light pillowy citrus gelati...,1,0.8532,1
3,look secret ingredient robitussin believe find...,0,0.4404,1
4,great taffy great price wide assortment yummy ...,1,0.9468,1
...,...,...,...,...
568449,great sesame chickenthis good well resturant e...,1,0.9231,1
568450,m disappointed flavor chocolate note especiall...,0,-0.8221,-1
568451,star small training session try train dog ceas...,1,0.8860,1
568452,good treat training reward dog good groom low ...,1,0.9719,1
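The thresholding step can be checked in isolation. This sketch repeats `vader_to_label` and applies it to compound scores copied from the table above, plus two hypothetical near-zero values that fall inside the neutral band.

```python
def vader_to_label(score):
    if score <= -0.1:
        return -1
    elif score >= 0.1:
        return 1
    else:
        return 0

# First four values come from the table above; the last two are hypothetical
samples = [0.9118, -0.1027, 0.4404, -0.8221, 0.05, -0.09]
labels = [vader_to_label(s) for s in samples]
print(labels)
```

Only scores strictly between -0.1 and 0.1 are labelled neutral, which is a narrow band compared with how often the reference labels are neutral.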


<h6>Machine Learning based approach</h6>

In [66]:
from sklearn.svm import LinearSVC
svm = LinearSVC(class_weight='balanced')

In [67]:
svm.fit(x_train_processed, y_train)

In [68]:
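prediction = svm.predict(x_test_processed)
# train_test_split shuffles the rows, so align on the index rather than by position:
# training rows keep their true label, test rows receive the SVM prediction.
feature_copy['PredictedScore2'] = pd.concat([y_train, pd.Series(prediction, index=y_test.index)])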
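`train_test_split` shuffles rows by default, so the test predictions belong to the rows in `y_test.index`, not to the last rows of the frame. Here is a minimal sketch of index-aligned assignment on toy data; `frame`, `test_index` and `toy_prediction` are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({"Score": [1, -1, 0, 1, 1]})

test_index = pd.Index([3, 0])  # hypothetical held-out rows, in shuffled order
toy_prediction = [1, 0]        # hypothetical model outputs for those rows

# Assigning a Series aligns on the index, so each prediction lands on its own row
frame["Predicted"] = pd.Series(toy_prediction, index=test_index)
print(frame)
```

Rows outside the test split stay empty (`NaN`), which makes it obvious which rows were actually predicted.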

<h2>Model Evaluation</h2>

In [69]:
from sklearn.metrics import accuracy_score, classification_report

In [70]:
print("Lexicon-based accuracy: ", accuracy_score(feature_copy['Score'], feature_copy['PredictedScore']))
print("Lexicon-based classification report: ", classification_report(feature_copy['Score'], feature_copy['PredictedScore']))

Lexicon-based accuracy:  0.7757056859482033
Lexicon-based classification report:                precision    recall  f1-score   support

          -1       0.41      0.33      0.37     52268
           0       0.23      0.06      0.10     72409
           1       0.83      0.94      0.88    443777

    accuracy                           0.78    568454
   macro avg       0.49      0.45      0.45    568454
weighted avg       0.71      0.78      0.73    568454



In [71]:
# Score the SVM on the held-out test rows only; the training rows in PredictedScore2 hold true labels.
print("Machine learning based accuracy: ", round(accuracy_score(y_test, prediction), 2))
print("Machine learning based classification report: ", classification_report(y_test, prediction))

Machine learning based accuracy:  0.87
Machine learning based classification report:                precision    recall  f1-score   support

          -1       0.66      0.73      0.69     10570
           0       0.58      0.58      0.58     14355
           1       0.94      0.93      0.93     88766

    accuracy                           0.87    113691
   macro avg       0.73      0.75      0.74    113691
weighted avg       0.87      0.87      0.87    113691
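Most reviews are positive, so accuracy alone can flatter a classifier. This sketch computes the accuracy of always predicting the positive class, using the support counts from the two classification reports above; `majority_baseline` is a hypothetical helper.

```python
# Support counts copied from the two classification reports above
full_support = {-1: 52268, 0: 72409, 1: 443777}  # whole dataset (lexicon report)
test_support = {-1: 10570, 0: 14355, 1: 88766}   # held-out test split (SVM report)

def majority_baseline(support):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(support.values()) / sum(support.values())

baseline_full = majority_baseline(full_support)
baseline_test = majority_baseline(test_support)
print(f"Always-positive baseline, full data: {baseline_full:.4f}")
print(f"Always-positive baseline, test split: {baseline_test:.4f}")
```

With a baseline of about 0.78 on both sets, the macro-averaged F1 in the reports is a more informative basis for comparing the two approaches than raw accuracy.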



In [72]:
feature_copy.rename(columns={'PredictedScore':'LexiconScore','PredictedScore2':'MLScore'}, inplace=True)

In [73]:
def reverse_sentiment_conversion(score):
    if score == 1:
        return 'Positive'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Negative'

score_column = ['LexiconScore', 'MLScore', 'Score']

for i in score_column:
    feature_copy[i] = feature_copy[i].apply(reverse_sentiment_conversion)
    data[i] = feature_copy[i]
data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,LexiconScore,MLScore
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,Positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,Positive,Positive
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,Negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,Negative,Positive
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,Positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,Positive,Positive
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,Neutral,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,Positive,Positive
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,Positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...,Positive,Positive
...,...,...,...,...,...,...,...,...,...,...,...,...
568449,568450,B001EO7N10,A28KG5XORO54AY,Lettie D. Carter,0,0,Positive,1299628800,Will not do without,Great for sesame chicken..this is a good if no...,Positive,Neutral
568450,568451,B003S1WTCU,A3I8AFVPEE8KI5,R. Sawyer,0,0,Neutral,1331251200,disappointed,I'm disappointed with the flavor. The chocolat...,Negative,Positive
568451,568452,B004I613EE,A121AA1GQV751Z,"pksd ""pk_007""",2,2,Positive,1329782400,Perfect for our maltipoo,"These stars are small, so you can give 10-15 o...",Positive,Positive
568452,568453,B004I613EE,A3IBEVCTXKNOH,"Kathy A. Welch ""katwel""",1,1,Positive,1331596800,Favorite Training and reward treat,These are the BEST treats for training and rew...,Positive,Positive


In [74]:
data.to_csv("Processed_Review.csv")

<h2>Discussion</h2>
<h4>Group Members</h4>
<h5>Thong Hao Hong(SW01083725)</h5>
<h5>Vishnu Ram(SW01083727)</h5>
<h5>Jeevesh(SW01083692)</h5>

<h3>Lexicon based Sentiment Classification</h3>
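<pre>
Strengths:
Efficient.
Does not require additional feature extraction such as TF-IDF.
No training required: a prebuilt lexicon can be applied to the text directly.

Weaknesses:
Lower accuracy and macro F1 than the SVM in the classification reports above (0.78 and 0.45 on the full dataset, against 0.87 and 0.74 on the SVM's test split).
Weak on the minority classes: F1 of 0.37 for negative and 0.10 for neutral reviews.
Limited language support, as the lexicon targets English.
Oversimplifies the text into a single polarity score.
</pre>



<h3>Machine Learning based Sentiment Classification</h3>
<pre>
Strengths:
Higher accuracy and macro F1 on the held-out test set (0.87 and 0.74).
Less susceptible to noise and rare words.

Weaknesses:
Requires labelled data and a training step.
Takes considerable time to preprocess and vectorize the data.
Prone to overfitting.
</pre>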

