*** 

# <a id='toc5_'></a>[Sentiment Analysis](#toc0_)

We perform sentiment analysis on the `reviewText` column. We conduct one type of sentiment analysis:

1. Sentiment Analysis using Lexicon-based Methods


We apply this analysis to each of the data sets.


In [None]:
# clear all variables from the notebook namespace (IPython magic; does not change the working directory)
%reset -f

# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import re
import nltk

In [None]:
# read in data from csv file

# # load data - Set 1
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set1_data_cleaned.csv')

# # load data - Set 2
# amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set2_data_cleaned.csv')

# # load data - Set 3
# amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set3_data_cleaned.csv')

# load data - Set 4
# amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set4_data_cleaned.csv')

In [None]:
# initial data view
display(amz_data.head(3))
print("Shape of the data: ", amz_data.shape)

In [None]:
# data summary
print("Shape of data =>", amz_data.shape)
print("Number of unique products =>", amz_data['asin'].nunique())
print("Number of unique users =>", amz_data['reviewerID'].nunique())

In [None]:
import ast

# Change to list of words for: filtered_tokens_revText
amz_data['filtered_tokens_revText'] = amz_data['filtered_tokens_revText'].apply(lambda x: ast.literal_eval(x))

# Change to list of words for: stemmed_words_revText
amz_data['stemmed_words_revText'] = amz_data['stemmed_words_revText'].apply(lambda x: ast.literal_eval(x))

# Change to list of words for: lemmatized_words_revText
amz_data['lemmatized_words_revText'] = amz_data['lemmatized_words_revText'].apply(lambda x: ast.literal_eval(x))

# view data
amz_data.head(4)



## <a id='toc5_1_'></a>[Sentiment Analysis using Lexicon-based Methods](#toc0_)

First we use lexicon-based methods to perform sentiment analysis. Lexicon-based methods use a lexicon, or a collection of words and phrases associated with emotions, to assign sentiment scores to a body of text.


### <a id='toc5_1_1_'></a>[VADER](#toc0_)

We use the VADER (Valence Aware Dictionary and Sentiment Reasoner) lexicon to perform sentiment analysis on the reviewText column. VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is available in the NLTK package and can be applied directly to unlabeled text data.

VADER relies on a sentiment lexicon in which each word carries a sentiment score. Rather than simply labeling words as positive or negative, it accounts for the intensity of sentiment and applies rules for negations, intensifiers, and other cues such as capitalization and punctuation. This makes VADER well suited to informal text such as social media posts, where these nuances play a significant role.
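
To make the idea of "lexicon plus rules" concrete, the sketch below scores a token list with a tiny lexicon and two simplified rules: one for intensifiers and one for negations. It is a toy illustration only. The lexicon entries, the rule constants, and the function name `toy_rule_score` are all made up for this example and are not VADER's actual values or implementation.

```python
# Toy illustration only: these entries and constants are invented for the
# example and are NOT taken from the VADER lexicon.
TOY_LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "terrible": -3.0}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}


def toy_rule_score(tokens):
    """Sum word valences, adjusting each by the word directly before it."""
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token)
        if valence is None:
            continue  # word carries no sentiment in the toy lexicon
        previous = tokens[i - 1] if i > 0 else None
        if previous in INTENSIFIERS:
            valence *= INTENSIFIERS[previous]  # "very good" scores above "good"
        elif previous in NEGATIONS:
            valence *= -1  # "not good" flips the sign
        total += valence
    return total


print(toy_rule_score(["good"]))          # 2.0
print(toy_rule_score(["very", "good"]))  # 3.0
print(toy_rule_score(["not", "good"]))   # -2.0
```

A plain word-count lexicon would give all three inputs the same score; the two rules are what separate them.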

In [None]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

# instance of the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Iterate through the raw review text (not the token lists) and score each review
sentiments_vader_revText = []
for review in amz_data['reviewText']:
    if isinstance(review, str):  # Check if the review is a string
        sentiment = sia.polarity_scores(review)
    else:
        sentiment = sia.polarity_scores('')  # Replace NaN with an empty string
    sentiments_vader_revText.append(sentiment)

# store the sentiment scores in the dataframe
amz_data['sentiments_vader_revText'] = sentiments_vader_revText

In [None]:
# see some results
amz_data.sentiments_vader_revText.head(4).values

The output is a NumPy array of dictionaries, one per review, produced by VADER (Valence Aware Dictionary and sEntiment Reasoner). Each dictionary holds four scores. `neg`, `neu`, and `pos` give the proportions of the text rated negative, neutral, and positive; each lies between 0 and 1, and together they sum to approximately 1. `compound` is a normalized overall score between -1 and 1: a value near 1 indicates extremely positive sentiment, a value near -1 indicates extremely negative sentiment, and values close to 0 indicate neutral or balanced sentiment.
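
The sketch below shows how a compound score can be turned into a coarse label. The cut-offs of ±0.05 are the thresholds commonly suggested for VADER, and the normalization formula reflects our understanding of how NLTK's implementation squashes the summed valences into (-1, 1). The helper names `normalize_compound` and `label_from_compound` are hypothetical and not part of NLTK.

```python
import math


def normalize_compound(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1).

    alpha=15 is assumed here to match the default in NLTK's VADER code.
    """
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a coarse label using symmetric cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(label_from_compound(0.6249))  # positive
print(label_from_compound(-0.3))    # negative
print(label_from_compound(0.0))     # neutral
```

Applied to this notebook, such a helper could be mapped over the `compound` entry of each dictionary in `sentiments_vader_revText`, for example to compare the resulting labels with the `overall` star rating.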





In [None]:
# see dataframe
amz_data[['reviewerID', 'asin', 'reviewText', 'overall', 'sentiments_vader_revText']].head(4)

### <a id='toc5_1_2_'></a>[TextBlob](#toc0_)

We also use the TextBlob library to perform sentiment analysis on the reviewText column. TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. TextBlob is built on top of NLTK and Pattern and provides an easy-to-use interface to the NLTK library. We use the sentiment analysis functionality of TextBlob to calculate the polarity and subjectivity scores for each review in the reviewText column. 

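TextBlob's default sentiment analyzer (`PatternAnalyzer`), which we use here, is itself lexicon-based: it looks up words in a sentiment lexicon inherited from the Pattern library and averages their polarity and subjectivity values, adjusting for modifiers such as negations and intensifiers. TextBlob also ships an alternative `NaiveBayesAnalyzer`, a classifier trained on a movie-review corpus, but we do not use it in this section.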

In [None]:
from textblob import TextBlob

sentiments_textblob_revText = []
subjectivities_textblob_revText = []

for review in amz_data['reviewText']:
    if isinstance(review, str):
        blob = TextBlob(review)
        sentiment = blob.sentiment.polarity
        subjectivity = blob.sentiment.subjectivity
    else:
        blob = TextBlob('')
        sentiment = blob.sentiment.polarity
        subjectivity = blob.sentiment.subjectivity

    sentiments_textblob_revText.append(sentiment)
    subjectivities_textblob_revText.append(subjectivity)

amz_data['sentiments_textblob_revText'] = sentiments_textblob_revText
amz_data['subjectivities_textblob_revText'] = subjectivities_textblob_revText


In [None]:
# see some results - sentiment
print(amz_data.sentiments_textblob_revText.head(4).values)

# see some results - subjectivity
print(amz_data.subjectivities_textblob_revText.head(4).values)

For each review, TextBlob returns two scores:

**Polarity**: It indicates the sentiment of the text on a scale from -1 to 1. A polarity score close to -1 indicates negative sentiment, a score close to 1 indicates positive sentiment, and a score around 0 indicates neutral sentiment.

**Subjectivity**: It measures the subjectivity of the text on a scale from 0 to 1. A subjectivity score of 0 means the text is objective and factual, while a score of 1 means the text is highly subjective and opinionated.

**Because the default analyzer works from a lexicon, it scores the words it recognizes and applies simple rules for modifiers such as negations and intensifiers; it does not model the broader context of a review.**
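
To make the two scales concrete, the sketch below aggregates per-word polarity and subjectivity values into one pair of scores per review by simple averaging. It is a toy illustration, not TextBlob's implementation: the entries in `TOY_ENTRIES` and the function name `toy_polarity_subjectivity` are hypothetical.

```python
# Hypothetical per-word (polarity, subjectivity) pairs, for illustration only.
TOY_ENTRIES = {
    "great": (0.8, 0.75),
    "awful": (-1.0, 1.0),
    "plastic": (0.0, 0.0),
}


def toy_polarity_subjectivity(tokens):
    """Average the scores of the tokens that appear in TOY_ENTRIES."""
    matched = [TOY_ENTRIES[t] for t in tokens if t in TOY_ENTRIES]
    if not matched:
        return 0.0, 0.0  # nothing matched: treat as neutral and objective
    polarity = sum(p for p, _ in matched) / len(matched)
    subjectivity = sum(s for _, s in matched) / len(matched)
    return polarity, subjectivity


print(toy_polarity_subjectivity(["great", "plastic", "case"]))  # (0.4, 0.375)
print(toy_polarity_subjectivity(["case", "arrived"]))           # (0.0, 0.0)
```

Note that an objective word such as "plastic" pulls both averages toward zero, which is one reason long, descriptive reviews tend to receive milder scores than short, emphatic ones under this kind of aggregation.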




### <a id='toc5_1_3_'></a>[Bing, AFINN, and NRC](#toc0_)

The Bing lexicon classifies words as either positive or negative. AFINN assigns each word a numerical sentiment score from -5 to 5, where a positive score indicates positive sentiment and a negative score indicates negative sentiment. NRC goes further by associating words with multiple sentiment dimensions, such as anger, joy, and fear, in addition to the positive and negative labels.
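
Because the three lexicons store different kinds of labels, they are aggregated differently. The sketch below scores one token list with three miniature lexicons. The entries are hypothetical stand-ins chosen for illustration; the real values come from the lexicon files loaded in the cells that follow.

```python
from collections import Counter

# Hypothetical miniature lexicons, for illustration only.
toy_afinn = {"love": 3, "broken": -1, "awful": -3}  # AFINN-style: integer valence
toy_bing = {"love": "positive", "broken": "negative", "awful": "negative"}  # Bing-style: binary label
toy_nrc = {  # NRC-style: a word can carry several labels at once
    "love": {"joy", "positive"},
    "awful": {"anger", "disgust", "fear", "negative", "sadness"},
}

tokens = ["love", "case", "awful", "broken", "clip"]

# AFINN-style: sum the numeric valences; unknown words count as 0
afinn_score = sum(toy_afinn.get(t, 0) for t in tokens)

# Bing-style: +1 per positive word, -1 per negative word
bing_polarity = {"positive": 1, "negative": -1}
bing_score = sum(bing_polarity.get(toy_bing.get(t), 0) for t in tokens)

# NRC-style: count how often each label occurs across the tokens
nrc_counts = Counter(label for t in tokens for label in toy_nrc.get(t, ()))

print(afinn_score)             # -1
print(bing_score)              # -1
print(nrc_counts["negative"])  # 1
```

AFINN and Bing each reduce a review to a single number, whereas NRC yields a profile of counts from which a dominant category can be chosen.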



#### AFINN

In [None]:
# read lexicons in
afinn = pd.read_csv('/Users/pavansingh/Desktop/Afinn.csv')

# AFINN
print("Shape of AFINN:", afinn.shape)
print("Unique Sentiments:", afinn.value.unique())
display(afinn.head(3))
afinn_dict = dict(zip(afinn['word'], afinn['value']))

In [None]:
# Get the sentiment score for each review using AFINN
sentiment_scores_afinn = []

for review_tokens in amz_data['filtered_tokens_revText']:
    sentiment_score = sum(afinn_dict.get(word, 0) for word in review_tokens)
    sentiment_scores_afinn.append(sentiment_score)

# Add the sentiment scores to the dataframe
amz_data['sentiment_score_afinn_revText'] = sentiment_scores_afinn


In [None]:
# see data
amz_data[['reviewerID', 'reviewText', 'overall', 'sentiment_score_afinn_revText']].head(4)

#### BING

In [None]:
# read in 
bing = pd.read_csv('/Users/pavansingh/Desktop/Bing.csv')

# BING
print("Shape of Bing:", bing.shape)
print("Unique Sentiments:", bing.sentiment.unique())
display(bing.head(3))
bing_dict = dict(zip(bing['word'], bing['sentiment']))

In [None]:
# Get the sentiment score for each review using Bing:
# +1 per positive word, -1 per negative word, 0 for words not in the lexicon
bing_polarity = {'positive': 1, 'negative': -1}
sentiment_scores_bing_revText = []

for review_tokens in amz_data['filtered_tokens_revText']:
    sentiment_score = sum(bing_polarity.get(bing_dict.get(word), 0) for word in review_tokens)
    sentiment_scores_bing_revText.append(sentiment_score)

# Add the sentiment scores to the dataframe
amz_data['sentiment_score_bing_revText'] = sentiment_scores_bing_revText


In [None]:
# see data
amz_data[['reviewerID', 'reviewText', 'overall','sentiment_score_bing_revText']].head(4)

#### NRC

In [None]:
# read in 
nrc = pd.read_csv('/Users/pavansingh/Desktop/NRC.csv')

# NRC
print("Shape of NRC:", nrc.shape)
print("Unique Sentiments:", nrc.sentiment.unique())
display(nrc.head(3))
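# a word can carry several NRC labels (one row per word-label pair), so map each
# word to the set of all its labels; dict(zip(...)) would keep only the last one
nrc_dict = nrc.groupby('word')['sentiment'].apply(set).to_dict()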

In [None]:
unique_sentiments = ['trust', 'fear', 'negative', 'sadness', 'anger', 'surprise', 'positive', 'disgust', 'joy', 'anticipation']

# Get the sentiment score for each review using NRC
sentiment_scores_nrc_revText = []

# Calculate sentiment score and overall sentiment for each review using NRC lexicon
for review_tokens in amz_data['filtered_tokens_revText']:
    review_sentiment_scores = {sentiment: 0 for sentiment in unique_sentiments}
    for word in review_tokens:
        word_sentiments = nrc_dict.get(word, [])
        for sentiment in unique_sentiments:
            if sentiment in word_sentiments:
                review_sentiment_scores[sentiment] += 1

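    # label the review with its most frequent NRC category; use None when no token
    # matched, since max() would otherwise silently return the first key ('trust')
    if any(review_sentiment_scores.values()):
        overall_sentiment = max(review_sentiment_scores, key=review_sentiment_scores.get)
    else:
        overall_sentiment = None
    sentiment_scores_nrc_revText.append(overall_sentiment)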

# Add the sentiment scores to the dataframe
amz_data['sentiment_score_nrc_revText'] = sentiment_scores_nrc_revText

In [None]:
# see data
amz_data[['reviewerID', 'reviewText', 'overall', 'sentiment_score_nrc_revText']].head(4)

#### Final Data Frame with BING, AFINN and NRC Sentiment Scores

In [None]:
# see data 
amz_data[['reviewerName',  'reviewText','overall', 'sentiment_score_afinn_revText',
        'sentiment_score_bing_revText',  'sentiment_score_nrc_revText']].head(4)

In [None]:
# # save data as csv - data set 1
amz_data.to_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set1_data_sentiment.csv", index=False)

# # save data as csv - data set 2
# amz_data.to_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set2_data_sentiment.csv", index=False)

# # save data as csv - data set 3
# amz_data.to_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set3_data_sentiment.csv", index=False)

# save data as csv - data set 4
# amz_data.to_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set4_data_sentiment.csv", index=False)