# Sentiment Analysis

The data is loaded from the **reviewsData** table in the **test_db** schema.

Index | Date | ProductName | Category | Price | ReviewContent 
--- | --- | --- | --- | --- | ---
001 | 2020-09-27 09:11:04 | name1 | Category1 | 209.71 | Text Description
011 | 2022-12-13 15:00:54 | name2 | Category2 | 12.38 | Text Description

## VaderSentiment Analysis
VADER Sentiment Analysis is a Python package ([installation instructions](https://pypi.org/project/vaderSentiment/)) used here to detect the sentiment of the reviews. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and it also works well on texts from other domains (Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014).

VADER uses a sentiment lexicon, which is a list of lexical features (e.g., words) that are labeled according to their semantic orientation as either positive or negative. VADER reports not only whether a text is positive or negative but also how positive or negative it is.

VADER relies on a dictionary that maps words and many other lexical features common to sentiment expression in microblogs to sentiment intensity scores.
These features include:
- A full list of Western-style emoticons (for example, :D and :P)
- Sentiment-related acronyms (for example, LOL and ROFL)
- Commonly used slang with sentiment value (for example, nah and meh)


VADER also applies a set of rules that capture how each part of a text affects the perceived sentiment intensity of the sentence. These rules are called heuristics. They go beyond what a typical bag-of-words model captures, because they incorporate word-order-sensitive relationships between terms.
There are five of them:
- *Punctuation* - the exclamation point (!) in particular increases the magnitude of the intensity without modifying the semantic orientation. e.g., "It is hot!!!" is more intense than "It is hot."
- *Capitalization* - using ALL-CAPS to emphasize a sentiment-relevant word in the presence of other non-capitalized words increases the magnitude of the sentiment intensity without affecting the semantic orientation. e.g., "It is HOT." conveys more intensity than "It is hot."
- *Degree modifiers* (also called intensifiers, booster words, or degree adverbs) - these either increase or decrease the sentiment intensity. e.g., "It is extremely hot." is more intense than "It is hot.", whereas "It is slightly hot." reduces the intensity.
- *Polarity shift due to conjunctions* - the contrastive conjunction "but" signals a shift in sentiment polarity, with the sentiment of the text following the conjunction being dominant. e.g., "It is hot, but it is bearable." has mixed sentiment, with the latter half dictating the overall rating.
- *Catching polarity negation* - by examining the contiguous sequence of 3 items preceding a sentiment-laden lexical feature, VADER is able to catch nearly 90% of cases where negation flips the polarity of the text. An example of a negated sentence is "It isn't really that hot."

**Compound VADER scores for analyzing sentiment**

The compound score is computed by summing the valence scores of each word of the text that is found in the lexicon, adjusting them according to the rules, and then normalizing the sum to lie between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence.
As explained in the paper, the normalization is

$$x_{\text{norm}} = \frac{x}{\sqrt{x^2 + \alpha}}$$
where $$x = \text{sum of valence scores of constituent words}, \alpha = \text{normalization constant (default value is 15)}$$
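
To make the normalization concrete, here is a small sketch in plain Python. The function name `normalize` and the sample raw sums are chosen for illustration. Only the formula and the default value of 15 for the normalization constant come from the text above.

```python
import math

def normalize(x, alpha=15):
    """Squash a raw valence sum x into the open interval (-1, 1)."""
    return x / math.sqrt(x * x + alpha)

# a raw sum of 0 stays neutral; larger magnitudes approach +/-1 without reaching it
for raw in (-8.0, -1.9, 0.0, 1.9, 8.0):
    print(f"raw={raw:+.1f} -> compound={normalize(raw):+.4f}")
```

The mapping keeps the sign and the ordering of the raw sums, so a stronger raw sentiment always gives a compound score of larger magnitude.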

As a result, each row receives a sentiment label. The relevant terms (excluding stopwords) in the content of the reviews are also extracted into an additional column. The output table has the following format.

Index | Date | ProductName | Category | Price | Rating | ReviewContent | Sentiment
--- | --- | --- | --- | --- | --- | --- | --- 
001 | 2020-09-27 09:11:04 | name1 | Category1 | 21.89 | 4.5 | Text Description | Positive
011 | 2022-12-13 15:00:54 | name2 | Category2 | 130.13 | 1.2 | Text Description | Negative

This data is saved in the Hive Metastore under the **test_db** schema.

## Import packages

In [0]:
# spark packages
from pyspark.sql.functions import *
from pyspark.sql.types import *

from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

import re
import pandas as pd
import numpy as np

# Warnings
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer



## Load Data
Load the data from the test_db.reviewsData table

In [0]:
# read the full reviews table into a Spark DataFrame
SQL = "select * from test_db.reviewsData"
df = spark.sql(SQL)

In [0]:
df.show(5)

+-----+-------------------+-------------+---------+-----+--------------------+
|Index|               Date|  ProductName| Category|Price|             Content|
+-----+-------------------+-------------+---------+-----+--------------------+
|    6|2022-12-23 03:52:04|ProductName-1|Category2|76.68|Buyer beware. Thi...|
|   18|2023-03-14 16:40:26|ProductName-4|Category2|15.52|i liked this albu...|
|   21|2021-02-14 15:35:50|ProductName-2|Category3|91.25|Problem with char...|
|   25|2021-10-29 14:49:19|ProductName-1|Category1|10.86|Batteries died wi...|
|   28|2020-08-28 19:38:12|ProductName-4|Category3|61.41|Excellent choice ...|
+-----+-------------------+-------------+---------+-----+--------------------+
only showing top 5 rows



## Sentiment Analysis
The vaderSentiment package is used for this step. The analyzer can be applied directly to the raw reviews without any additional clean-up: VADER is lexicon and rule-based rather than trained, and its heuristics rely on cues such as punctuation and capitalization that a clean-up would remove.

The compound score determines whether the sentiment is positive, negative, or neutral, with the following cut-offs:
- score >= 0.05: positive
- score <= -0.05: negative
- -0.05 < score < 0.05: neutral

An additional column is created to capture the sentiments from the reviews.

In [0]:
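# sentiment analysis function

analyser = SentimentIntensityAnalyzer()

def choices(score):
    """Map a compound score to a sentiment label using the cut-offs above."""
    if score >= 0.05:
        return "Positive"
    elif score <= -0.05:
        return "Negative"
    return "Neutral"

@udf(returnType=StringType())
def get_sentiment(sentence):
    # null reviews cannot be scored, so they keep a null sentiment
    if sentence is None:
        return None
    # the compound score is rounded to 2 decimals before the cut-offs are applied
    score = float(np.round(analyser.polarity_scores(sentence)["compound"], 2))
    return choices(score)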

In [0]:
df = df.withColumn("Sentiment", get_sentiment("Content"))

In [0]:
df.show(5)

+-----+-------------------+-------------+---------+-----+--------------------+---------+
|Index|               Date|  ProductName| Category|Price|             Content|Sentiment|
+-----+-------------------+-------------+---------+-----+--------------------+---------+
|    6|2022-12-23 03:52:04|ProductName-1|Category2|76.68|Buyer beware. Thi...| Negative|
|   18|2023-03-14 16:40:26|ProductName-4|Category2|15.52|i liked this albu...| Positive|
|   21|2021-02-14 15:35:50|ProductName-2|Category3|91.25|Problem with char...| Negative|
|   25|2021-10-29 14:49:19|ProductName-1|Category1|10.86|Batteries died wi...| Positive|
|   28|2020-08-28 19:38:12|ProductName-4|Category3|61.41|Excellent choice ...| Positive|
+-----+-------------------+-------------+---------+-----+--------------------+---------+
only showing top 5 rows



## Processing Text Columns

In [0]:
@udf("string")
def clean_sentence(sentence):

    '''function to clean up the sentence
    - remove punctuation, special characters and numbers,
    - collapse additional spaces between words,
    - remove any words shorter than 3 characters'''

    # null input stays null instead of raising inside the UDF
    if sentence is None:
        return None
    sentence = re.sub(r"[^a-z A-Z]", " ", sentence)
    sentence = re.sub(r"\s+", " ", sentence)
    sentence = " ".join([ele for ele in sentence.split() if len(ele) >= 3])
    return sentence

In [0]:
# replace any run of three identical consecutive characters (e.g. "ooo" in "sooo") with a space, then clean the sentence

df = df.withColumn("Text", clean_sentence(regexp_replace(col("Content"), r"(\w)\1{2}", " ")))
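
The two cleaning steps can be tried outside Spark with Python's `re` module. This is a sketch, assuming Spark's `regexp_replace` treats this simple pattern the same way Python does. `clean` is a plain-Python stand-in for the `clean_sentence` UDF and the sample review is made up.

```python
import re

def clean(sentence):
    """Plain-Python stand-in for the clean_sentence UDF."""
    # keep letters and spaces only
    sentence = re.sub(r"[^a-z A-Z]", " ", sentence)
    # drop words shorter than 3 characters and collapse the spacing
    return " ".join(w for w in sentence.split() if len(w) >= 3)

raw = "Sooo good!!! It cost 20 dollars, an ok buy"

# a run of three identical word characters becomes a single space
squeezed = re.sub(r"(\w)\1{2}", " ", raw)
print(repr(squeezed))
print(repr(clean(squeezed)))
```

Here "Sooo" loses its repeated letters, and the leftover "S", the digits, the punctuation, and the short words "It", "an", and "ok" are dropped, leaving `good cost dollars buy`.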

Define the following Spark NLP stages to tokenize, POS-tag, chunk, lemmatize, and remove stopwords from the sentences.

In [0]:
documentAssembler = DocumentAssembler() \
    .setInputCol("Text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

POSTag = PerceptronModel.pretrained() \
    .setInputCols("document", "token") \
    .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols("sentence", "pos") \
    .setOutputCol("chunk") \
    .setRegexParsers(["<NN>", "<NNS>", "<NNP>", "<JJ>", "<ADJ>"])

lemmatizer = LemmatizerModel.pretrained() \
     .setInputCols(["token"]) \
     .setOutputCol("lemmatized")

stopwordsCleaner = StopWordsCleaner() \
    .setStopWords(StopWordsRemover \
    .loadDefaultStopWords("english")) \
    .setInputCols(["lemmatized"]) \
    .setOutputCol("unigram")

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[ | ][OK!]
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[ | ][OK!]


In [0]:
pipeline1 = Pipeline() \
            .setStages([
                documentAssembler,
                sentence,
                tokenizer,
                POSTag,
                chunker
            ])

In [0]:
pipelinedf = pipeline1.fit(df).transform(df)

In [0]:
pipelinedf = pipelinedf.select("Index", "Date", "ProductName", "Category", "Price", "Content", "Sentiment", "Text", \
                               col("chunk.result").alias("chunked")) \
                .withColumn("chunked", clean_sentence(
                            regexp_replace(clean_sentence(concat_ws(", ", array_distinct(split( \
                                 regexp_replace(concat_ws(", ", col("chunked")), "[^A-Za-z0-9]", " "), " ")))), r"\s*[A-Z]\w*\s*", " ")))
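
The nested column expression above is easier to follow as a sequence of steps. The following is a plain-Python approximation for a single row, assuming `chunks` is the list of chunk strings produced by the pipeline and that duplicate removal keeps the first occurrence. The helper names `clean` and `tidy_chunks` are hypothetical, the sample chunks are made up, and the character class is written as `A-Za-z0-9` here.

```python
import re

def clean(sentence):
    """Plain-Python stand-in for the clean_sentence UDF."""
    sentence = re.sub(r"[^a-z A-Z]", " ", sentence)
    return " ".join(w for w in sentence.split() if len(w) >= 3)

def tidy_chunks(chunks):
    # concat_ws(", ", chunked): join the chunk strings into one string
    joined = ", ".join(chunks)
    # regexp_replace: turn every non-alphanumeric character into a space
    spaced = re.sub(r"[^A-Za-z0-9]", " ", joined)
    # split + array_distinct: keep each token once, in first-seen order
    unique = list(dict.fromkeys(spaced.split(" ")))
    # concat_ws + clean_sentence: rejoin and drop short tokens
    cleaned = clean(", ".join(unique))
    # regexp_replace: remove words that start with a capital letter
    no_caps = re.sub(r"\s*[A-Z]\w*\s*", " ", cleaned)
    # final clean_sentence pass tidies the spacing again
    return clean(no_caps)

print(tidy_chunks(["battery", "Great", "charger", "battery"]))
```

For this made-up input the duplicate "battery" and the capitalized "Great" are removed, so the result is `battery charger`.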

In [0]:
pipelinedf = pipelinedf.filter(col("chunked").isNotNull()) \
                        .select("Index", "Date", "ProductName", "Category", "Price", "Content", "Sentiment", "Text", "chunked")

In [0]:
pipeline2 = Pipeline() \
            .setStages([
                documentAssembler,
                sentence,
                tokenizer,
                lemmatizer,
                stopwordsCleaner
            ])

In [0]:
pipelinedf = pipeline2.fit(pipelinedf).transform(pipelinedf)

In [0]:
pipelinedf = pipelinedf.select("Index", "Date", "ProductName", "Category", "Price", "Content", "Sentiment", \
                               "chunked", col("unigram.result").alias("unigrams")) \
                .withColumn("unigrams", clean_sentence(regexp_replace( \
                                          regexp_replace(clean_sentence(concat_ws(", ", \
                                                  array_distinct(split(regexp_replace(concat_ws(", ", \
                                                        col("unigrams")), "[^A-Za-z0-9]", " "), " ")))), \
                                             r"\s*[A-Z]\w*\s*", " "), r"(\w)\1{2}", " "))) \
                .withColumn("words", split(col("unigrams"), " ")) 

In [0]:
pipelinedf = pipelinedf.filter((col("words").isNotNull()) ) 

In [0]:
data = pipelinedf.select("Index", "Date", "ProductName", "Category", "Price", "Content", "Sentiment", "words")

In [0]:
# synthetic rating derived from the sentiment label:
# Positive -> 3.9-4.9, Neutral -> 2.7-3.7, Negative -> 0.5-1.5
# (the second "Negative" branch was unreachable and has been removed)
data = data.withColumn("Rating", when(col("Sentiment") == "Positive", round(rand()+3.9, 1)) \
                       .when(col("Sentiment") == "Neutral", round(rand()+2.7, 1)) \
                       .when(col("Sentiment") == "Negative", round(rand()+0.5, 1)) )

In [0]:
data.show(5)

+-----+-------------------+-------------+---------+-----+--------------------+---------+--------------------+------+
|Index|               Date|  ProductName| Category|Price|             Content|Sentiment|               words|Rating|
+-----+-------------------+-------------+---------+-----+--------------------+---------+--------------------+------+
|    1|2020-12-17 05:17:03|ProductName-5|Category3|27.99|Inspiring. I hope...| Positive|[hope, lot, peopl...|   4.1|
|   10|2022-10-06 09:16:45|ProductName-1|Category1| 25.2|Awful beyond beli...| Negative|[beyond, belief, ...|   0.6|
|   34|2022-03-27 09:41:51|ProductName-5|Category3|98.63|This is a very go...| Positive|[good, book, stud...|   4.1|
|   48|2021-12-25 18:34:59|ProductName-5|Category1|34.38|supermarionation ...| Positive|[supermarionation...|   4.7|
|   65|2021-06-16 06:01:31|ProductName-2|Category1|52.51|Didn't live up to...| Positive|[live, expectatio...|   4.7|
+-----+-------------------+-------------+---------+-----+--------------------+---------+--------------------+------+

In [0]:
# expand the array of terms in the words column for better visualization

datadf = data.withColumn("Terms", explode("words")) \
                .select("Index", "Date", "ProductName", "Category", "Price", "Rating", "Content", "Sentiment", "Terms")

In [0]:
datadf.show(5)

+-----+-------------------+-------------+---------+-----+------+--------------------+---------+------+
|Index|               Date|  ProductName| Category|Price|Rating|             Content|Sentiment| Terms|
+-----+-------------------+-------------+---------+-----+------+--------------------+---------+------+
|    1|2020-12-17 05:17:03|ProductName-5|Category3|27.99|   4.1|Inspiring. I hope...| Positive|  hope|
|    1|2020-12-17 05:17:03|ProductName-5|Category3|27.99|   4.1|Inspiring. I hope...| Positive|   lot|
|    1|2020-12-17 05:17:03|ProductName-5|Category3|27.99|   4.1|Inspiring. I hope...| Positive|people|
|    1|2020-12-17 05:17:03|ProductName-5|Category3|27.99|   4.1|Inspiring. I hope...| Positive|  hear|
|    1|2020-12-17 05:17:03|ProductName-5|Category3|27.99|   4.1|Inspiring. I hope...| Positive|  need|
+-----+-------------------+-------------+---------+-----+------+--------------------+---------+------+
only showing top 5 rows



## Write data to Databricks Hive Metastore

In [0]:
# saving the data dataframe (drop words column) for use in the topic modeling
# words column is redundant, so the column is dropped

data.drop("words").write.mode("overwrite").option("overwriteSchema", "true").saveAsTable("test_db.resultSentiment")

In [0]:
# saving the datadf dataframe to be used for visualization

datadf.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable("test_db.resultSentimentTerms")

## Validation of data/tables

In [0]:
%sql
select * from test_db.resultSentiment

Index,Date,ProductName,Category,Price,Content,Sentiment,Rating
2,2022-06-26T18:29:13.000+0000,ProductName-5,Category1,78.8,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.",Positive,4.0
7,2020-08-13T13:20:02.000+0000,ProductName-4,Category3,81.03,"I was a dissapointed to see errors on the back cover, but since I paid for the book I read it anyway. I have to say I love it. I couldn't put it down. I read the whole book in two hours. I say buy it. I say read it. It is sad, but it gives an interesting point of view on church today. We spend too much time looking at the faults of others. I also enjoyed beloved.Sincerly,Jaylynn R",Positive,4.7
11,2021-04-04T20:09:20.000+0000,ProductName-4,Category1,45.74,"""When you hear folks say that they don't make 'em like that anymore, they might be talking about """"BY THE SEA"""". This is a very cool story about a young Cuban girl searching for idenity who stumbles into a coastal resort kitchen gig with a zen motorcycle maintenance man",Positive,4.4
36,2020-05-23T15:51:47.000+0000,ProductName-5,Category1,85.97,"It seems somebody was complaining for the printing quality. This is not a calculus book. If you take most theoretical books, and certainly most Springer's book, they don't have nice full color Barney images. This is technical (mostly theoretical) stuff. There is, in my opinion, no problem with the printing at all, clear, quality monochromatic printing.With respect to the contents of the book, it has almost everything you may want to know about Vector (and even Scalar) quantization and Signal compression. It was a great help while I was writing my doctoral thesis. Gray is probably one of the most respected authorities in the field.",Positive,4.8
54,2022-10-30T14:06:38.000+0000,ProductName-5,Category1,30.5,Recieved this item in very fast time. was perfectely happy with shipper and sylvania lcd tv is excellent buy and works perfect. would do business with shipper anytime.,Positive,4.8
61,2020-08-22T15:39:54.000+0000,ProductName-4,Category2,43.86,"""I have no quarrel with the book itself. Wanted a copy for a long time. However, I purchased a used book with the understanding that it was a """"Signed Copy."""" The copy I received WAS NOT autographed. This is a disappointment. If you advertise it as an autographed copy",Negative,1.0
73,2020-07-26T09:30:11.000+0000,ProductName-1,Category4,30.03,"I ordered this DVD and received a substitute I never received the DVD I ordered from Importcds (the Vendor). I contacted them and did not recieve any feedback. I can't rate a DVD I have never seen. I didn't bother to send it back because it would have cost me more that I orginally paid for it. In the future I will watch for the name of the person and/or persons I am buying from. I thought they were a good company. I understand a simple mistake but, to not get a response at all is not good businees sense. I spend hundreds of dollars a month on Amazon.com building my DVD collection. I guess I will be more careful in the future.",Neutral,3.1
82,2021-03-09T20:18:50.000+0000,ProductName-1,Category4,85.96,"I ordered the cake topper June 27, 2010. I was given an estimated shipping date of July 1-7, 2010. Those dates came & went with no cake topper. I contacted the seller twice with no response. I filed a claim with Amazon. I did end up receiving a cake topper on July 16, 2010; however, it's not even the one I ordered! The seller did refund my money, but has never contacted me or apologized for the mishap. I will never buy anything from the seller again. I looked at the seller's feedback & I'm not the only person they have done wrong, so buyer beware!!",Negative,0.7
100,2022-10-27T05:39:45.000+0000,ProductName-3,Category1,75.85,"""Don't usually purchase shoes without trying them on first and I should have stuck with my rule. They are advertised as a Wide, but actually they didn't feel like it. The quality of the shoe is Wonderful and the shoe is really cute. I think it would be great for the average foot. I'm wondering since they put elastic in the shoe, that's what they considered """"wide"""". Since I wore it once",Positive,4.5
140,2021-10-25T22:25:52.000+0000,ProductName-3,Category4,3.64,"These blocks are really colorful and cute, however, when you take into consideration the price, I am not sure they are worth it.",Positive,4.0
