<a href="https://colab.research.google.com/github/kk-deng/Big-Data-Challenge/blob/main/Big_Data_Level_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Config for Spark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Old Spark releases are kept on archive.apache.org rather than on the regional mirrors
!wget -q https://archive.apache.org/dist/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz
!tar xf spark-3.0.2-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"

# Download the PostgreSQL JDBC driver first, so the jar referenced in
# extraClassPath already exists when the Spark JVM launches
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

# Start a SparkSession
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Basics")
    .config("spark.driver.extraClassPath", "/content/postgresql-42.2.9.jar")
    .getOrCreate()
)

--2021-03-21 03:25:11--  https://jdbc.postgresql.org/download/postgresql-42.2.9.jar
Resolving jdbc.postgresql.org (jdbc.postgresql.org)... 72.32.157.228, 2001:4800:3e1:1::228
Connecting to jdbc.postgresql.org (jdbc.postgresql.org)|72.32.157.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 914037 (893K) [application/java-archive]
Saving to: ‘postgresql-42.2.9.jar’


2021-03-21 03:25:12 (5.52 MB/s) - ‘postgresql-42.2.9.jar’ saved [914037/914037]



In [2]:
url = "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Luggage_v1_00.tsv.gz"

from pyspark import SparkFiles
spark.sparkContext.addFile(url)
# No schema is passed, so every column (including the vote counts) is read as a string
spark_df = spark.read.csv(SparkFiles.get("amazon_reviews_us_Luggage_v1_00.tsv.gz"), sep="\t", header=True)
spark_df.show()

+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|         US|   40884699| R9CO86UUJCAW5|B00VGTN02Y|     786681372|Teenage Mutant Ni...|         Luggage|          3|            0|          0|   N|                Y|my review of this...|my review of this...| 2015-08-31|
|         US|   23208852|R3PR8X6QGVJ8B1|B005KIWL0E|     618251799|Kenneth Cole Reac...|         Luggage|          5|    

In [4]:
df_select = spark_df.select(["star_rating", "helpful_votes", "total_votes", "vine", "verified_purchase"])
df_select.show()

+-----------+-------------+-----------+----+-----------------+
|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+-----------+-------------+-----------+----+-----------------+
|          3|            0|          0|   N|                Y|
|          5|            0|          0|   N|                Y|
|          4|            0|          0|   N|                Y|
|          4|            0|          0|   N|                Y|
|          5|            0|          0|   N|                Y|
|          3|            0|          0|   N|                Y|
|          4|            1|          1|   N|                Y|
|          5|            0|          0|   N|                Y|
|          1|            2|          2|   N|                Y|
|          5|            0|          0|   N|                Y|
|          1|            1|          1|   N|                Y|
|          5|            4|          4|   N|                Y|
|          5|            0|          1|   N|           

In [11]:
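# Drop rows with missing values
df_select = df_select.dropna(how='any')
# dropDuplicates() returns a new DataFrame, so the original unassigned call had no effect.
# It is left out here: review_id is no longer selected, so identical rows are
# distinct reviews, and deduplicating would discard real data.
df_select.show()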

+-----------+-------------+-----------+----+-----------------+
|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+-----------+-------------+-----------+----+-----------------+
|          3|            0|          0|   N|                Y|
|          5|            0|          0|   N|                Y|
|          4|            0|          0|   N|                Y|
|          4|            0|          0|   N|                Y|
|          5|            0|          0|   N|                Y|
|          3|            0|          0|   N|                Y|
|          4|            1|          1|   N|                Y|
|          5|            0|          0|   N|                Y|
|          1|            2|          2|   N|                Y|
|          5|            0|          0|   N|                Y|
|          1|            1|          1|   N|                Y|
|          5|            4|          4|   N|                Y|
|          5|            0|          1|   N|           

In [12]:
# Keep reviews with at least 10 total votes, of which at least half are helpful
df_vine = df_select.filter("total_votes>=10").filter(df_select["helpful_votes"]/df_select["total_votes"] >= 0.5)
df_vine.show()

+-----------+-------------+-----------+----+-----------------+
|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+-----------+-------------+-----------+----+-----------------+
|          1|           29|         31|   N|                Y|
|          5|            9|         10|   N|                Y|
|          5|           10|         11|   N|                N|
|          5|           11|         15|   N|                Y|
|          5|           20|         22|   N|                N|
|          5|           34|         38|   N|                Y|
|          5|           20|         23|   N|                Y|
|          5|           11|         12|   N|                Y|
|          5|           23|         23|   N|                N|
|          5|           30|         30|   N|                N|
|          5|           28|         28|   N|                Y|
|          5|           18|         20|   N|                Y|
|          5|           13|         15|   N|           
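
The filter in the cell above keeps a review only when it has at least 10 total votes and at least half of those votes are helpful. A minimal plain-Python sketch of the same rule follows; the function name `is_helpful` is hypothetical and not part of the notebook, and the sample rows are taken from the tables shown above.

```python
def is_helpful(helpful_votes: int, total_votes: int) -> bool:
    """Return True when a review passes the helpfulness rule used above."""
    # The total_votes check runs first, so a review with zero votes
    # never reaches the division.
    return total_votes >= 10 and helpful_votes / total_votes >= 0.5


# (helpful_votes, total_votes) pairs from the displayed rows
kept = [(29, 31), (9, 10), (11, 15)]
dropped = [(0, 0), (4, 4), (0, 1)]

print([is_helpful(h, t) for h, t in kept])
print([is_helpful(h, t) for h, t in dropped])
```

Rows such as `(4, 4)` are fully helpful but are still dropped because they have fewer than 10 votes.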

# Analysis

In [13]:
from pyspark.sql.functions import col, avg
# Split the filtered reviews into Vine (paid) and non-Vine (unpaid) groups
df_paid = df_vine.filter("vine='Y'")
df_unpaid = df_vine.filter("vine='N'")

In [14]:
df_paid.describe().show()

+-------+------------------+------------------+-----------------+----+-----------------+
|summary|       star_rating|     helpful_votes|      total_votes|vine|verified_purchase|
+-------+------------------+------------------+-----------------+----+-----------------+
|  count|                55|                55|               55|  55|               55|
|   mean| 4.381818181818182|23.381818181818183|             25.4|null|             null|
| stddev|0.7815205760403311|22.643362946388624|23.52508511418459|null|             null|
|    min|                 1|                10|               10|   Y|                N|
|    max|                 5|                 9|               76|   Y|                N|
+-------+------------------+------------------+-----------------+----+-----------------+



In [15]:
df_unpaid.describe().show()

+-------+------------------+------------------+------------------+-----+-----------------+
|summary|       star_rating|     helpful_votes|       total_votes| vine|verified_purchase|
+-------+------------------+------------------+------------------+-----+-----------------+
|  count|             15141|             15141|             15141|15141|            15141|
|   mean| 3.773462783171521|30.616736014794267|33.052110164454135| null|             null|
| stddev|1.5165963714566046|53.772057265912686|55.758208360231144| null|             null|
|    min|                 1|                10|                10|    N|                N|
|    max|                 5|                99|                99|    N|                Y|
+-------+------------------+------------------+------------------+-----+-----------------+
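
The `min` and `max` rows in the two summaries look inconsistent: for Vine reviews, `helpful_votes` has a minimum of 10 and a maximum of 9. A likely explanation is that the file was read without a schema, so the vote columns are strings and `min`/`max` compare text rather than numbers (the mean and standard deviation are unaffected because Spark casts the strings to numbers for those). The plain-Python sketch below reproduces the effect with illustrative values, not values from the dataset.

```python
# Vote counts stored as text, the way a schema-less CSV read leaves them
votes_as_text = ["10", "76", "9", "23"]

# String comparison goes character by character, so "9" sorts after "76"
text_max = max(votes_as_text)

# Converting to int first gives the numeric maximum
numeric_max = max(int(v) for v in votes_as_text)

print(text_max, numeric_max)
```

In the notebook, casting the columns before calling `describe()`, for example with `df_select["helpful_votes"].cast("int")`, should give numeric extremes; that change is a suggestion and was not run for the outputs shown here.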



## Five-star paid reviews

In [18]:
five_star = df_paid[df_paid['star_rating'] == 5].count()
five_star

28

In [20]:
all_paid = df_paid.count()
all_paid

55

In [21]:
# Fraction of five-star reviews among Vine reviews
print(five_star/all_paid)

0.509090909090909


## Five-star unpaid reviews

In [22]:
# Filter on df_unpaid's own column rather than on df_paid's
unpaid_five_star = df_unpaid[df_unpaid['star_rating'] == 5].count()
unpaid_five_star

7643

In [23]:
all_unpaid = df_unpaid.count()
all_unpaid

15141

In [24]:
# Fraction of five-star reviews among non-Vine reviews
print(unpaid_five_star/all_unpaid)

0.5047883230962288
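
To check whether the two five-star fractions above differ by more than chance, one option is a two-proportion z-test. The sketch below is an assumption about how such a check could be done, not part of the original analysis; it uses only the counts printed in the cells above and the standard library.

```python
import math

# Counts taken from the cells above
vine_five, vine_total = 28, 55
other_five, other_total = 7643, 15141

p_vine = vine_five / vine_total
p_other = other_five / other_total

# Pooled proportion under the null hypothesis that both groups share one rate
p_pool = (vine_five + other_five) / (vine_total + other_total)
std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / vine_total + 1 / other_total))

z = (p_vine - p_other) / std_err
# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

The z statistic comes out well below 1, so these counts give no evidence that the five-star rate differs between the two groups.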


# Conclusion

* The share of 5-star reviews is nearly the same for Vine and non-Vine reviews (50.9% vs. 50.5%).

* Although the number of Vine reviews is small (55, against 15,141 non-Vine reviews), I treat it as still representative of the product. However, the average rating from Vine customers is 4.38 with a standard deviation of 0.78, which is much higher than the 3.77 (standard deviation 1.52) from non-Vine customers.

* I believe Vine customers tend to give higher ratings, and their ratings cluster more tightly at the high end. For that reason, reviews from Vine customers are not that trustworthy to me.
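
The gap between the two average ratings can be put on a rough scale with Welch's t statistic, computed from the summary values in the `describe()` outputs. This is a sketch under assumptions: star ratings are ordinal and far from normal, so the number is only an indication, and this test was not part of the original analysis.

```python
import math

# Summary statistics copied from the describe() outputs
mean_vine, std_vine, n_vine = 4.381818181818182, 0.7815205760403311, 55
mean_other, std_other, n_other = 3.773462783171521, 1.5165963714566046, 15141

# Welch's standard error allows the two groups to have different variances
std_err = math.sqrt(std_vine**2 / n_vine + std_other**2 / n_other)
t_stat = (mean_vine - mean_other) / std_err

print(f"difference = {mean_vine - mean_other:.3f}, t = {t_stat:.2f}")
```

The statistic lands between 5 and 6, which suggests the difference in mean rating is unlikely to be noise even with only 55 Vine reviews.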