# Are reviews more subjective for some product categories than for others? A sentiment analysis

- Use pre-existing sentiment analysis approaches (e.g. the “TextBlob” package [2]).
- Compare the subjectivity scores of different product categories.

- Perform a sentiment analysis with TextBlob on subjectivity (objectivity = 1 - subjectivity). This should be done for both `summary` and `reviewText`.
- Additionally, we could compare the scores against `reviewTime` to see whether it has an effect.
- Maybe we could investigate whether men or women are more prone to writing in a specific manner.
- Lastly, it would be interesting to see whether subjectivity correlates with votes, i.e. whether readers react to it.
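
Once every review carries a subjectivity score, the category comparison and the vote correlation from the list above reduce to a group-by and a correlation. The following is a minimal pandas sketch of that idea, not the notebook's actual pipeline: the frame `scores`, its column names and all values in it are made up for illustration and do not come from the data set.

```python
import pandas as pd

# Hypothetical, hand-made scores purely for illustration; real values would
# come from running TextBlob over the reviews.
scores = pd.DataFrame({
    "category": ["Books", "Books", "Electronics", "Electronics"],
    "subjectivity": [0.8, 0.6, 0.3, 0.5],
    "vote": [4, 2, 0, 2],
})

# Objectivity is defined as the complement of subjectivity.
scores["objectivity"] = 1 - scores["subjectivity"]

# Mean subjectivity per category, most subjective category first.
per_category = (
    scores.groupby("category")["subjectivity"]
    .mean()
    .sort_values(ascending=False)
)

# Pearson correlation between subjectivity and the number of votes.
vote_correlation = scores["subjectivity"].corr(scores["vote"])

print(per_category)
print(f"correlation with votes: {vote_correlation:.2f}")
```

On the full data the same aggregation would have to run in Spark rather than in pandas, but the shape of the computation stays the same.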

## Imports

In [None]:
! pip install textblob

In [1]:
import pyspark.pandas as ps
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pyspark.sql.functions import udf, count, col, coalesce, lit, avg
from pyspark.sql.types import StringType, DoubleType
from pyspark.sql import SparkSession
from textblob import TextBlob



## Load data

In [None]:
data = ps.read_parquet('/data/data.parquet', index_col=['reviewerID', 'asin'])
# data = ps.read_parquet('/data/data_sample.parquet', index_col=['reviewerID', 'asin'])
print(data.shape)


(157656869, 25)


                                                                                

In [3]:
categories = data['category'].drop_duplicates().reset_index(drop=True).to_pandas()

                                                                                

In [6]:
max_number_of_samples_per_category = 10000

##
## Intended: calculate the distribution of star ratings for each category and pick equally sized
## samples per category that follow this distribution.
## Currently: takes the first `max_number_of_samples_per_category` rows of a single category.
##
rows_for_category = ps.read_parquet(f'/data/{categories[1]}.parquet', index_col=['reviewerID', 'asin'])
rows_for_category = rows_for_category.head(max_number_of_samples_per_category)
rows_for_category.to_parquet("/data/question2.parquet", mode="w", index_col=['reviewerID', 'asin'])
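
The cell's comment describes samples that follow each category's star-rating distribution, while the code takes the first rows of one category. A minimal sketch of rating-proportional sampling could look like the following. It uses plain pandas on a toy frame instead of pandas-on-Spark, and the helper name `sample_like_rating_distribution` is hypothetical; only the `overall` column name is taken from the notebook.

```python
import pandas as pd


def sample_like_rating_distribution(reviews, n_samples, rating_col="overall", seed=0):
    """Draw about n_samples rows whose star-rating shares match those of `reviews`."""
    # Relative frequency of every star rating in the input.
    shares = reviews[rating_col].value_counts(normalize=True)
    parts = []
    for rating, share in shares.items():
        group = reviews[reviews[rating_col] == rating]
        # Rounding means the total can differ from n_samples by a few rows.
        n = min(len(group), int(round(share * n_samples)))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts)


# Toy frame: 60% five-star, 30% four-star and 10% one-star reviews.
toy = pd.DataFrame({"overall": [5.0] * 60 + [4.0] * 30 + [1.0] * 10})
sample = sample_like_rating_distribution(toy, n_samples=20)
print(sample["overall"].value_counts())
```

Calling the helper once per category with the same `n_samples` would give equally sized samples, as long as every category has enough reviews per rating.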



In [5]:
rows_for_category.head()


In [None]:
# Target columns: reviewerID, asin, summary [polarity, subjectivity],
# reviewText [polarity, subjectivity], vote, reviewerName
def apply_sentiment_analysis(df):
    print(f"Number of rows with empty reviewText: {len(df[df['reviewText'].isna()])}")
    print(f"Number of rows with empty summary: {len(df[df['summary'].isna()])}")

    # Keep the index columns when converting to a Spark DataFrame.
    df = df.to_spark(index_col=['reviewerID', 'asin'])

    # TextBlob reports polarity in [-1, 1] and subjectivity in [0, 1].
    # Missing texts are mapped to null instead of raising an error.
    calculate_polarity = udf(
        lambda x: float(TextBlob(x).sentiment.polarity) if x is not None else None, DoubleType())
    calculate_subjectivity = udf(
        lambda x: float(TextBlob(x).sentiment.subjectivity) if x is not None else None, DoubleType())

    # Score both text fields, as laid out in the plan at the top of the notebook.
    for source, prefix in [('summary', 'summary'), ('reviewText', 'review')]:
        df = df.withColumn(f"{prefix}_polarity", calculate_polarity(df[source]))
        df = df.withColumn(f"{prefix}_subjectivity", calculate_subjectivity(df[source]))

    df = df[['reviewerID', 'asin', 'reviewerName', 'category', 'overall', 'verified', 'vote',
             'summary_polarity', 'summary_subjectivity', 'review_polarity', 'review_subjectivity']]
    return df

