In [1]:
import numpy as np 
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/amazon-product-reviews/Reviews.csv


In [2]:
df = pd.read_csv('/kaggle/input/amazon-product-reviews/Reviews.csv')
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568428 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 445.8 MB


In [3]:
df.Score.value_counts(normalize=True)

Score
5    0.638789
4    0.141885
1    0.091948
3    0.075010
2    0.052368
Name: proportion, dtype: float64

Scores of 4 and 5 are positive, 1 and 2 are negative, and 3 is neutral. Let's stick to the binary case, but first resample to balance the class distribution, since about 78% of the reviews currently carry positive sentiment.

In [4]:
ds = df.groupby('Score').apply(lambda x: x.sample(min(50_000, len(x)), random_state=42)).reset_index(drop=True)
ds.Score.value_counts(normalize=True)

Score
1    0.224811
4    0.224811
5    0.224811
3    0.191719
2    0.133848
Name: proportion, dtype: float64

In [5]:
ds = ds[['Id', 'ProductId', 'Summary', 'Text', 'Score']]
ds.rename(columns={
    "Id": "id",
    "ProductId": "product_id",
    "Summary": "summary",
    "Text": "text",
    "Score": "score"
}, inplace=True)

ds['sentiment'] = np.where(ds['score'] > 3, 1, np.where(ds['score'] < 3, 0 , np.nan))
ds = ds[~(ds['score'] == 3)]
ds.sentiment.value_counts()

sentiment
1.0    100000
0.0     79769
Name: count, dtype: int64

The techniques for performing sentiment analysis can be broken down into simple rule-based techniques and supervised machine learning approaches. Rule-based techniques are easier to apply since they do not require annotated training data. Supervised learning approaches generally provide better results but require the additional effort of labeling the data. The main options are:

- Sentiment analysis using lexicon-based approaches
- Sentiment analysis by building additional features from text data and applying a supervised machine learning algorithm
- Sentiment analysis using transfer learning technique and pretrained language models like BERT

In [6]:
! pip install textacy

Collecting textacy
  Downloading textacy-0.13.0-py3-none-any.whl (210 kB)
Collecting floret~=0.10.0 (from textacy)
  Obtaining dependency information for floret~=0.10.0 from https://files.pythonhosted.org/packages/16/ee/388a5c76c9292f4bef85d7ef895005bb39a0899f8004e9daceb57b2bb0c9/floret-0.10.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading floret-0.10.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.1 kB)
Collecting jellyfish>=0.8.0 (from textacy)
  Obtaining dependency information for jellyfish>=0.8.0 from https://files.pythonhosted.org/packages/26/87/8d31224804af9dfa7b34657e083b67b24b322c41dd9464b52218c1a33890/jellyfish-1.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading jellyfish-1.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.5 kB)
Collectin

In [7]:
from textacy import preprocessing as tprep
from spacy.lang.en.stop_words import STOP_WORDS
import re
import spacy
from tqdm.autonotebook import tqdm

tqdm.pandas()

process = tprep.make_pipeline(
    tprep.replace.emails,
    tprep.replace.emojis,
    tprep.replace.urls,
    tprep.replace.phone_numbers,
    tprep.replace.hashtags,
    tprep.replace.currency_symbols,
    lambda text: re.sub(r"\n", " ", text),
    tprep.remove.html_tags,
    tprep.remove.brackets,
    tprep.remove.punctuation,
    tprep.normalize.hyphenated_words,
    tprep.normalize.quotation_marks,
    tprep.normalize.unicode,
    tprep.normalize.bullet_points,
    tprep.normalize.whitespace,
)
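
`make_pipeline` chains the listed functions so that the output of each step becomes the input of the next. A rough stdlib-only sketch of that idea follows; the `compose_left` helper and the two toy steps are illustrative names chosen here, not textacy's implementation.

```python
import re
from functools import reduce


def compose_left(*funcs):
    """Return a function that applies funcs left to right (hypothetical helper)."""
    def pipeline(text):
        return reduce(lambda acc, func: func(acc), funcs, text)
    return pipeline


# Two toy steps standing in for the textacy preprocessing functions.
def strip_tags(text):
    return re.sub(r"<[^>]+>", " ", text)


def squeeze_spaces(text):
    return re.sub(r"\s+", " ", text).strip()


toy_process = compose_left(strip_tags, squeeze_spaces)
print(toy_process("Great <br/> coffee,\n  would buy again"))
```

Order matters in such a chain: the `replace` steps run before `remove.punctuation`, because URLs and e-mail addresses can no longer be recognized once their punctuation is gone.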



In [8]:
ds['clean_text'] = ds['text'].progress_apply(process)


## Lexicon Based Approaches

What is a lexicon? A lexicon is like a dictionary: a collection of words compiled using expert knowledge. What sets a lexicon apart is that it incorporates specific knowledge and has been collected for a specific purpose. We will use sentiment lexicons, which contain commonly used words together with the sentiment associated with them. A simple example is the word happy, with a sentiment score of 1; another is the word frustrated, which would have a score of -1.

Several standardized lexicons are available for the English language; popular ones include the AFINN lexicon, SentiWordNet, Bing Liu’s lexicon, and the VADER lexicon. They differ in the size of their vocabulary and in their representation. For example, the AFINN lexicon comes as a single dictionary of about 3,300 words, each assigned a signed sentiment score ranging from -5 to +5. The sign indicates the polarity, and the magnitude indicates the strength. The Bing Liu lexicon, on the other hand, comes as two lists, one of positive words and one of negative words, with a combined vocabulary of about 6,800 words.
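
To make the two representations concrete, here is a minimal sketch with toy lexicons. The words and scores are made up for illustration and are not taken from AFINN or the Bing Liu lexicon; `graded_score` and `polarity_score` are hypothetical helper names.

```python
# Toy lexicons; the words and scores are illustrative only.
graded_lexicon = {"happy": 1, "outstanding": 3, "frustrated": -1, "terrible": -3}
positive_words = {"happy", "outstanding"}
negative_words = {"frustrated", "terrible"}


def graded_score(tokens):
    """Sum signed scores: the sign is the polarity, the magnitude the strength."""
    return sum(graded_lexicon.get(token, 0) for token in tokens)


def polarity_score(tokens):
    """Count +1 for each positive-list match and -1 for each negative-list match."""
    return sum((token in positive_words) - (token in negative_words) for token in tokens)


tokens = "the service was terrible but the staff seemed happy".split()
print(graded_score(tokens))    # -3 + 1
print(polarity_score(tokens))  # -1 + 1
```

The graded lexicon lets a strong word outweigh a mild one, while the two-list form treats every match equally.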

In [9]:
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize
import random


print('Total number of words in opinion lexicon', len(opinion_lexicon.words()))
print('Examples of positive words in opinion lexicon', random.sample(sorted(opinion_lexicon.positive()), 5))
print('Examples of negative words in opinion lexicon', random.sample(sorted(opinion_lexicon.negative()), 5))

Total number of words in opinion lexicon 6789
Examples of positive words in opinion lexicon ['affectation', 'proud', 'majestic', 'fervent', 'easy-to-use']
Examples of negative words in opinion lexicon ['annihilation', 'inconstant', 'rebuke', 'delay', 'adversarial']


In [10]:
import nltk


nltk.download('punkt')


pos_score, neg_score = 1, -1
word_dict = {}

# Adding the positive words to the dictionary
for word in opinion_lexicon.positive():
    word_dict[word] = pos_score

# Adding the negative words to the dictionary
for word in opinion_lexicon.negative():
    word_dict[word] = neg_score


def bing_liu_score(text):
    """Return the net lexicon polarity of a text, normalized by its token count."""
    sentiment_score = 0
    bag_of_words = word_tokenize(text.lower())

    # Guard against empty or whitespace-only reviews (avoids division by zero)
    if not bag_of_words:
        return 0.0

    for word in bag_of_words:
        if word in word_dict:
            sentiment_score += word_dict[word]

    return sentiment_score / len(bag_of_words)

ds['Bing_Liu_Score'] = ds['text'].progress_apply(bing_liu_score)
ds.head()

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



| | id | product_id | summary | text | score | sentiment | clean_text | Bing_Liu_Score |
|---|---|---|---|---|---|---|---|---|
| 0 | 348179 | B000O160KE | Sweet & Low without the cancer. | If you like the (bitter) taste of Sweet & Low,... | 1 | 0.0 | If you like the taste of Sweet Low get this If... | 0.020619 |
| 1 | 306508 | B004NB79VU | wedding mom | item was much smaller than appeared on line. ... | 1 | 0.0 | item was much smaller than appeared on line Yo... | 0.0 |
| 2 | 228313 | B003VXHGPK | Don't waste your money or your Keurig on this! | This coffee tastes very flavorful and is not t... | 1 | 0.0 | This coffee tastes very flavorful and is not t... | 0.005917 |
| 3 | 448369 | B0030FGMFY | MADE IN CHINA!!! | I bought these for my Dalmatian for the first ... | 1 | 0.0 | I bought these for my Dalmatian for the first ... | 0.0 |
| 4 | 515441 | B004S04X4W | Tastes like cheap meat and salt | I guess I am in the minority, but this hash pr... | 1 | 0.0 | I guess I am in the minority but this hash pro... | -0.006135 |


Now that we have calculated the sentiment score, we would like to check whether it matches the expectation set by the rating the customer provided. Instead of checking this for each review, we can compare the average sentiment score across reviews with different ratings. We would expect a review with a five-star rating to have a higher sentiment score than a review with a one-star rating.

In [11]:
from sklearn.preprocessing import scale


ds['Bing_Liu_Score'] = scale(ds['Bing_Liu_Score'])
ds.groupby('score').agg({'Bing_Liu_Score':'mean'})

| score | Bing_Liu_Score |
|---|---|
| 1 | -0.658122 |
| 2 | -0.320433 |
| 4 | 0.311744 |
| 5 | 0.537158 |
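
The `scale` call standardizes the column to zero mean and unit variance, so the group means above are expressed in standard deviations from the overall mean. A minimal stdlib sketch of the same transformation, using made-up raw scores and assuming the population standard deviation (which is what `scale` uses by default, as far as I know):

```python
from statistics import fmean, pstdev

raw_scores = [0.02, 0.0, -0.01, 0.05, -0.06]  # made-up values for illustration

mean = fmean(raw_scores)
std = pstdev(raw_scores)  # population standard deviation (ddof=0)
standardized = [(value - mean) / std for value in raw_scores]

print([round(value, 3) for value in standardized])
```

After this step a value of -0.66 reads as "about two thirds of a standard deviation below the average review", which makes the per-rating means easier to compare.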


### Disadvantages of a Lexicon-Based Approach

While the lexicon-based approach is simple, it has some obvious disadvantages.

- First, we are bound by the size of the lexicon; if a word does not exist in the chosen lexicon, then we are unable to use this information while determining the sentiment score for this review. In the ideal scenario, we would like to use a lexicon that captures all the words in the language, but this is not feasible.
- Second, we assume that the chosen lexicon is a gold standard and trust the sentiment score/polarity provided by the author(s). This is a problem because a particular lexicon may not be the right fit for a given use case. The Bing Liu lexicon is relevant here because it captures the online usage of language and includes common typos and slang in its lexicon. But if we were working on a dataset of tweets, then the VADER lexicon would be better suited since it includes support for popular acronyms (e.g., LOL) and emojis.
- Finally, one of the biggest disadvantages of lexicons is that they overlook negation. Since the lexicon only matches words and not phrases, a sentence containing "not bad" receives a negative score even though its sentiment is closer to neutral.
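
The negation problem can be softened with a simple heuristic: flip the polarity of a lexicon word that directly follows a negator. The sketch below is only an illustration of that idea; the negator set, the toy lexicon, and the `score_with_negation` helper are assumptions made here, not part of the Bing Liu lexicon or NLTK.

```python
NEGATORS = {"not", "no", "never"}  # illustrative, not exhaustive
toy_lexicon = {"bad": -1, "good": 1, "great": 1, "awful": -1}


def score_with_negation(text):
    """Sum lexicon scores, flipping the sign of a word that directly follows a negator."""
    tokens = text.lower().split()
    total = 0
    for position, token in enumerate(tokens):
        value = toy_lexicon.get(token, 0)
        if position > 0 and tokens[position - 1] in NEGATORS:
            value = -value
        total += value
    return total


print(score_with_negation("the coffee was bad"))      # -1
print(score_with_negation("the coffee was not bad"))  # 1
```

A full sign flip is crude, since "not bad" is weaker than "good"; damping the flipped value instead would be one possible refinement.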

## Applying a Simple Machine Learning Algorithm

In [12]:
# https://github.com/nltk/nltk/issues/3028
! unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/

Archive:  /usr/share/nltk_data/corpora/wordnet.zip
   creating: /usr/share/nltk_data/corpora/wordnet/
  inflating: /usr/share/nltk_data/corpora/wordnet/lexnames  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adv  
  inflating: /usr/share/nltk_data/corpora/wordnet/adv.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/cntlist.rev  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/LICENSE  
  inflating: /usr/share/nltk_data/corpora/wordnet/citation.bib  
  inflating: /usr/share/nltk_data/corpora/wordnet/noun.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/verb.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/README  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.sense  
  inflating: /usr/share/nltk_data

In [13]:
from nltk.corpus import wordnet
# from nltk.tokenize import word_tokenize
from spacy.lang.en import STOP_WORDS
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer


def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun for every other tag
        return wordnet.NOUN


# Create the lemmatizer once instead of once per token
lemmatizer = WordNetLemmatizer()

lemmas = tprep.make_pipeline(
    lambda t: t.lower(),
    tprep.remove.punctuation,
    tprep.replace.numbers,
    WhitespaceTokenizer().tokenize,   # Returns a list, so every later step must accept a list
    lambda t: [x for x in t if x not in STOP_WORDS],
    lambda t: [x for x in t if len(x) > 0],
    pos_tag,                          # Returns (token, tag) pairs
    lambda t: [lemmatizer.lemmatize(x[0], get_wordnet_pos(x[1])) for x in t],
    lambda t: [x for x in t if len(x) > 1],
    lambda t: " ".join(t)
)

ds['lemmas'] = ds['clean_text'].progress_apply(lemmas)    


In [14]:
## Remove observations that are empty after the cleaning step
ds = ds[ds['lemmas'].str.len() != 0]

In [15]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(ds['lemmas'], ds['sentiment'], test_size=0.2, random_state=42)

print ('Size of Training Data ', X_train.shape[0])
print ('Size of Test Data ', X_test.shape[0])

Size of Training Data  143494
Size of Test Data  35874


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf = TfidfVectorizer(min_df=10, ngram_range=(1,1))
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)
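
For intuition about what the vectorizer produces, here is a stdlib-only sketch of TF-IDF weighting on a toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which I believe are the scikit-learn defaults; it does not model `min_df` or the library's tokenizer, and `tfidf_vector` is a hypothetical helper.

```python
import math
from collections import Counter

docs = ["good coffee", "bad coffee", "good good tea"]  # toy corpus
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """TF-IDF weights with smoothed IDF and L2 normalization for one document."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(weight ** 2 for weight in weights.values()))
    return {term: weight / norm for term, weight in weights.items()}


for tokens in tokenized:
    print({term: round(weight, 3) for term, weight in tfidf_vector(tokens).items()})
```

Terms that occur in fewer documents (here "bad" and "tea") receive a larger weight than terms shared across documents (here "coffee"). In the notebook, `min_df=10` additionally drops every term that appears in fewer than 10 training reviews.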

In [17]:
from sklearn.svm import LinearSVC


model1 = LinearSVC(random_state=42, tol=1e-5)
model1.fit(X_train_tf, y_train)

### Baseline vs ML Model

In [18]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

def baseline_scorer(text):
    return 1 if bing_liu_score(text) > 0 else 0
    
y_pred_baseline = X_test.progress_apply(baseline_scorer)
print("Baseline Accuracy Score: - ", accuracy_score(y_test, y_pred_baseline))
print("Baseline ROC-AUC Score: - ", roc_auc_score(y_test, y_pred_baseline))

y_pred = model1.predict(X_test_tf)
print ('Accuracy Score - ', accuracy_score(y_test, y_pred))
print ('ROC-AUC Score - ', roc_auc_score(y_test, y_pred))


Baseline Accuracy Score: -  0.7161175224396499
Baseline ROC-AUC Score: -  0.6999802330009797
Accuracy Score -  0.8774599988849864
ROC-AUC Score -  0.8754400205839005


The TF-IDF plus LinearSVC model improves substantially on the lexicon baseline: accuracy rises from about 0.72 to 0.88, and ROC-AUC from about 0.70 to 0.88.
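
One caveat about the ROC-AUC figures: `roc_auc_score` is given hard 0/1 predictions here, which reduces the ROC curve to a single operating point. The metric is normally computed from continuous scores, for example the output of `model1.decision_function(X_test_tf)`. The stdlib-only sketch below uses made-up labels and scores to show how the two can differ; the `auc` helper is a hypothetical pairwise implementation, not scikit-learn's.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0]
margins = [2.1, 0.4, -0.2, 0.1, -0.9, -1.5]  # made-up decision scores
hard = [1 if margin > 0 else 0 for margin in margins]

print(auc(labels, margins))  # ranking quality of the continuous scores
print(auc(labels, hard))     # same scores after thresholding at 0
```

With hard labels the value collapses to the balanced accuracy at that one threshold, so the ROC-AUC numbers reported above are best read as balanced accuracies rather than as ranking quality.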