# **Analysis**

A cloud ETL analysis was carried out on Amazon's Apparel review dataset. A database instance was created in Amazon RDS and connected to PostgreSQL. PySpark in Google Colab was used to load the dataset from an Amazon S3 bucket. A table called apparel_vine was loaded into the RDS database. The data was then cleaned and analyzed to see whether any bias exists in the Amazon Vine program.

The Apparel review data contains 2,336 reviews from Vine program participants. This is approximately 0.04 percent of the 5,906,333 reviews in the dataset (a fraction of about 0.0004). Because Vine reviews make up such a small share, they can have little effect on the overall ratings in the Apparel data.
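
The Vine share can be checked directly from the counts the notebook reports. The sketch below uses only those three numbers (total rows, rows with vine "Y", rows with vine "N"); the variable names are illustrative and do not appear in the notebook.

```python
# Counts reported by the notebook cells
total_reviews = 5_906_333     # apparel_df.count()
vine_reviews = 2_336          # rows where vine == "Y"
non_vine_reviews = 5_903_986  # rows where vine == "N"

# Share of Vine reviews as a fraction and as a percentage
vine_fraction = vine_reviews / total_reviews
vine_percent = 100 * vine_fraction

# Rows whose vine flag is neither "Y" nor "N" (for example, nulls)
unlabelled = total_reviews - vine_reviews - non_vine_reviews

print(f"fraction: {vine_fraction:.6f}")
print(f"percent:  {vine_percent:.4f}%")
print(f"rows with neither flag: {unlabelled}")
```

The two flagged groups do not quite add up to the total, so a small number of rows carry neither flag.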

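Further analysis looked at how many five-star reviews come from customers in the Vine program. The corresponding cell reports 74 reviews; it filters the helpful-review subset (at least 10 total votes, at least 50% of them helpful) on the vine flag alone, so 74 is the number of Vine reviews in that subset across all star ratings.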

This notebook shows an in-depth analysis of the Apparel review dataset.

In [1]:
import os
# Find the latest Spark 3.x release at http://www.apache.org/dist/spark/ and enter it as the Spark version
# For example:
# spark_version = 'spark-3.0.3'
spark_version = 'spark-3.2.0'
os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Start a SparkSession
import findspark
findspark.init()


In [2]:
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

--2021-11-22 06:15:24--  https://jdbc.postgresql.org/download/postgresql-42.2.9.jar
Resolving jdbc.postgresql.org (jdbc.postgresql.org)... 72.32.157.228, 2001:4800:3e1:1::228
Connecting to jdbc.postgresql.org (jdbc.postgresql.org)|72.32.157.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 914037 (893K) [application/java-archive]
Saving to: ‘postgresql-42.2.9.jar.1’


2021-11-22 06:15:24 (5.14 MB/s) - ‘postgresql-42.2.9.jar.1’ saved [914037/914037]



In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CloudETL").config("spark.driver.extraClassPath","/content/postgresql-42.2.9.jar").getOrCreate()

In [4]:
from pyspark import SparkFiles
# Load in Apparel data from s3.amazonaws.com into a DataFrame
url = "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Apparel_v1_00.tsv.gz"
spark.sparkContext.addFile(url)

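# Read the tab-separated file; the first row holds the column names.
# Note: in Spark datetime patterns "mm" means minutes (month is "MM").
apparel_df = spark.read.option('header', 'true').csv(SparkFiles.get("amazon_reviews_us_Apparel_v1_00.tsv.gz"), inferSchema=True, sep="\t", timestampFormat="mm/dd/yy")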
apparel_df.show(10)

+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|         US|   32158956|R1KKOXHNI8MSXU|B01KL6O72Y|      24485154|Easy Tool Stainle...|         Apparel|          4|            0|          0|   N|                Y|★ THESE REALLY DO...|These Really Do W...| 2013-01-14|
|         US|    2714559|R26SP2OPDK4HT7|B01ID3ZS5W|     363128556|V28 Women Cowl Ne...|         Apparel|          5|    

In [5]:
#count rows
print(apparel_df.count())


5906333


In [6]:
# Columns
len(apparel_df.columns)

15

In [7]:
#drop duplicates
apparel_df = apparel_df.dropDuplicates()
#show total count of Apparel reviews after dropping duplicates
print(apparel_df.count())

5906333


In [7]:
apparel_df.printSchema()

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: integer (nullable = true)
 |-- product_title: string (nullable = true)
 |-- product_category: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: string (nullable = true)



In [8]:
apparel = apparel_df.select(["review_id", "product_title", "star_rating", "helpful_votes", "total_votes", "vine", "verified_purchase"])
apparel.show()

+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|     review_id|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|R1KKOXHNI8MSXU|Easy Tool Stainle...|          4|            0|          0|   N|                Y|
|R26SP2OPDK4HT7|V28 Women Cowl Ne...|          5|            1|          2|   N|                Y|
| RWQEDYAX373I1|James Fiallo Men'...|          5|            0|          0|   N|                Y|
|R231YI7R4GPF6J|Belfry Gangster 1...|          5|            0|          0|   N|                Y|
|R3KO3W45DD0L1K|JAEDEN Women's Be...|          5|            0|          0|   N|                Y|
|R1C4QH63NFL5NJ|Levi's Boys' 514 ...|          5|            0|          0|   N|                Y|
|R2GP65O1U9N7BP|Minimalist Wallet...|          5|            0|          0|   N|                Y|
|R3O29CT5M

In [17]:
# Configuration for RDS instance
mode="append"
jdbc_url = "jdbc:postgresql://<rds endpoint>:5432/databasename"
config = {"user":"root",
          "password": "password",
          "driver":"org.postgresql.Driver"}
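
The configuration above writes the user name and password into the notebook. One possible alternative, sketched here under assumptions, is to read them from environment variables. The helper `build_jdbc_settings` and the variable names `RDS_ENDPOINT`, `RDS_DATABASE`, `RDS_USER`, and `RDS_PASSWORD` are hypothetical and not part of the original notebook.

```python
import os


def build_jdbc_settings(env=os.environ):
    """Build the JDBC URL and properties from environment variables.

    The variable names used here are hypothetical; any names would do.
    """
    endpoint = env["RDS_ENDPOINT"]
    database = env["RDS_DATABASE"]
    jdbc_url = f"jdbc:postgresql://{endpoint}:5432/{database}"
    config = {
        "user": env["RDS_USER"],
        "password": env["RDS_PASSWORD"],
        "driver": "org.postgresql.Driver",
    }
    return jdbc_url, config


# Example with a stand-in mapping instead of real environment variables
demo_env = {
    "RDS_ENDPOINT": "example-host",
    "RDS_DATABASE": "databasename",
    "RDS_USER": "root",
    "RDS_PASSWORD": "not-a-real-password",
}
jdbc_url, config = build_jdbc_settings(demo_env)
print(jdbc_url)
```

The returned `jdbc_url` and `config` have the same shape as the values defined in the cell above, so they could be passed to `write.jdbc` in the same way.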

In [18]:
apparel.write.jdbc(url=jdbc_url, table='apparel_vine', mode=mode, properties=config)

In [51]:
apparel1 = apparel.filter(apparel["vine"] == "Y").count()
print(apparel1)

2336


In [52]:
# show() prints the rows and returns None, so there is nothing to assign
apparel.filter(apparel["vine"] == "Y").show()

+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|     review_id|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|R2S3OT2TGTNLRT|Genuine Little Gi...|          4|            1|          1|   Y|                N|
| R1YUILMCVON7J|Genuine Little Gi...|          2|            1|          1|   Y|                N|
| R2039OSLK2OAW|Genuine Little Gi...|          3|            1|          1|   Y|                N|
| R8T7CLRNUA19L|Genuine Big Boys'...|          1|            1|          1|   Y|                N|
|R1WBILU1035YWZ|Genuine Big Boys'...|          5|            7|          7|   Y|                N|
|R1H4E9IG55K7DR|Genuine Big Boys'...|          4|            0|          0|   Y|                N|
|R3BXABC3ONYY3X|Genuine Big Boys'...|          5|            1|          1|   Y|                N|
|R387TVQPO

In [49]:
apparel2 = apparel.filter(apparel["vine"] == "N").count()
print(apparel2)

5903986


In [50]:
# show() prints the rows and returns None, so there is nothing to assign
apparel.filter(apparel["vine"] == "N").show()

+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|     review_id|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|R1KKOXHNI8MSXU|Easy Tool Stainle...|          4|            0|          0|   N|                Y|
|R26SP2OPDK4HT7|V28 Women Cowl Ne...|          5|            1|          2|   N|                Y|
| RWQEDYAX373I1|James Fiallo Men'...|          5|            0|          0|   N|                Y|
|R231YI7R4GPF6J|Belfry Gangster 1...|          5|            0|          0|   N|                Y|
|R3KO3W45DD0L1K|JAEDEN Women's Be...|          5|            0|          0|   N|                Y|
|R1C4QH63NFL5NJ|Levi's Boys' 514 ...|          5|            0|          0|   N|                Y|
|R2GP65O1U9N7BP|Minimalist Wallet...|          5|            0|          0|   N|                Y|
|R3O29CT5M

In [25]:
from pyspark.sql.functions import desc
# Number of reviews per product title
apparel_df1 = apparel.groupby("product_title").agg({"star_rating":"count"})

apparel_df1.orderBy(desc("count(star_rating)")).show()

+--------------------+------------------+
|       product_title|count(star_rating)|
+--------------------+------------------+
|Levi's Men's 505 ...|              5001|
|SHARKK® Aluminum ...|              4699|
|Glamorise Women's...|              4195|
|Levi's Men's 501 ...|              3881|
|Playtex Women's 1...|              3520|
|Squeem 'Perfect W...|              3495|
|Ann Chery Women's...|              3383|
|Levi's Men's 511 ...|              3261|
|Columbia Women's ...|              3125|
|Ann Chery Women's...|              3109|
|Fruit of the Loom...|              3081|
|Lee Men's Regular...|              3026|
|Levi's Men's 550 ...|              2979|
|Dickies Men's Ori...|              2965|
|Spalding Women's ...|              2691|
|Haggar Men's Cool...|              2545|
|Azules Women'S Ra...|              2538|
|Hanes Men's Ultim...|              2319|
|Lamaze Maternity ...|              2303|
|Carhartt Men's Ac...|              2205|
+--------------------+------------------+

In [21]:
# Keep only reviews with at least 10 total votes
votes = apparel.filter("total_votes >= 10")
votes.show()

+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|     review_id|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|R28TEK5081Q3SQ|V28 Women Girls J...|          5|           12|         12|   N|                Y|
|R35IM9R63H3OVQ|Alain Dupetit Men...|          5|            7|         12|   N|                N|
|R24PZ7L6UNRYLS|Silver Lilly Unis...|          5|           13|         13|   N|                N|
| RHDEDSO28V1UG|Slip-On Scarf - P...|          5|            8|         10|   N|                N|
|R3BCHELYQ0CHUF|Komene Women's El...|          5|           15|         18|   N|                N|
| RL1L1QKPFMFOZ|Komene Women's El...|          1|            1|         11|   N|                N|
|R35PT06NWP7LDP|Women Padded Spor...|          5|           30|         32|   N|                N|
|R2P76PJFU

In [39]:
votes_ratings_df = votes.select(["star_rating","helpful_votes", "total_votes","vine"])\
  .groupby("star_rating")\
  .agg({"star_rating": "count", "helpful_votes": "count", "total_votes":"count"})
votes_ratings_df.show(truncate=False)

+-----------+------------------+------------------+--------------------+
|star_rating|count(total_votes)|count(star_rating)|count(helpful_votes)|
+-----------+------------------+------------------+--------------------+
|1          |19006             |19006             |19006               |
|3          |11869             |11869             |11869               |
|5          |56778             |56778             |56778               |
|4          |19479             |19479             |19479               |
|2          |8854              |8854              |8854                |
+-----------+------------------+------------------+--------------------+



In [41]:
votes_ratings_df.orderBy(desc("star_rating")).show(truncate=False)

+-----------+------------------+------------------+--------------------+
|star_rating|count(total_votes)|count(star_rating)|count(helpful_votes)|
+-----------+------------------+------------------+--------------------+
|5          |56778             |56778             |56778               |
|4          |19479             |19479             |19479               |
|3          |11869             |11869             |11869               |
|2          |8854              |8854              |8854                |
|1          |19006             |19006             |19006               |
+-----------+------------------+------------------+--------------------+
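
The counts in the table above are easier to compare as shares of all reviews with at least 10 total votes. A minimal sketch in plain Python, using only the five counts printed above (the dictionary is typed in by hand, not read from Spark):

```python
# Reviews per star rating among reviews with total_votes >= 10,
# copied from the table above
counts = {5: 56778, 4: 19479, 3: 11869, 2: 8854, 1: 19006}

total = sum(counts.values())
shares = {star: n / total for star, n in counts.items()}

# Print the distribution from five stars down to one star
for star in sorted(shares, reverse=True):
    print(f"{star} stars: {counts[star]:>6}  {shares[star]:.1%}")
```

On these numbers, five-star reviews account for a little under half of the reviews that received at least 10 votes.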



In [26]:
# Keep reviews where at least 50% of the total votes are helpful votes
helpful_total = votes.filter(votes["helpful_votes"]/votes["total_votes"]>=0.5)
helpful_total.show()

+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|     review_id|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|R28TEK5081Q3SQ|V28 Women Girls J...|          5|           12|         12|   N|                Y|
|R35IM9R63H3OVQ|Alain Dupetit Men...|          5|            7|         12|   N|                N|
|R24PZ7L6UNRYLS|Silver Lilly Unis...|          5|           13|         13|   N|                N|
| RHDEDSO28V1UG|Slip-On Scarf - P...|          5|            8|         10|   N|                N|
|R3BCHELYQ0CHUF|Komene Women's El...|          5|           15|         18|   N|                N|
|R35PT06NWP7LDP|Women Padded Spor...|          5|           30|         32|   N|                N|
| RQW4AFOG9MR4Z|UONBOX Celebrity ...|          5|           51|         52|   N|                N|
|R2SMUEBMG

In [27]:
helpful_total.count()

110524
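
The helpful-review subset is built in two steps: first keep reviews with at least 10 total votes, then keep those where helpful votes are at least half of the total. The sketch below restates that logic in plain Python on a few made-up rows; the function `is_helpful` and the toy data are illustrative only and are not taken from the dataset.

```python
# Toy rows (made up for illustration): (review_id, helpful_votes, total_votes)
rows = [
    ("a", 12, 12),
    ("b", 1, 11),   # dropped: only about 9% of votes are helpful
    ("c", 3, 4),    # dropped: fewer than 10 total votes
    ("d", 5, 10),   # kept: exactly 50% helpful
    ("e", 0, 0),    # dropped before any division happens
]


def is_helpful(helpful_votes, total_votes, min_votes=10, min_ratio=0.5):
    """Return True when a review has enough votes and enough of them are helpful."""
    if total_votes < min_votes:
        return False  # also guards against dividing by zero
    return helpful_votes / total_votes >= min_ratio


kept = [rid for rid, helpful, total in rows if is_helpful(helpful, total)]
print(kept)
```

Applying the vote-count threshold first means the ratio is never computed for reviews with zero votes.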

In [28]:
# helpful_total filtered to show vine reviews
helpful_total.filter(helpful_total["vine"] == "Y").show()

+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|     review_id|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
| R6U9701C3BGO6|Wrangler Authenti...|          3|          139|        147|   Y|                N|
|R1XK3ALB45D7N4|Wrangler Authenti...|          5|           33|         34|   Y|                N|
|R1PLLCVDGANA0J|Wrangler Authenti...|          5|           15|         15|   Y|                N|
|R1IZCSTLX81D6C|Wrangler Authenti...|          5|           31|         33|   Y|                N|
|R3S53FVP06C7AL|Wrangler Authenti...|          4|           17|         19|   Y|                N|
|R2C8NC8EQLH4JF|Wrangler Authenti...|          3|           45|         48|   Y|                N|
| RK727CQ82BPVD|Wrangler Authenti...|          5|           10|         10|   Y|                N|
|R1KTWAAJA

In [29]:
# DataFrame filtered to show reviews that are not in the vine program 
helpful_total.filter(helpful_total["vine"] == "N").show()

+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|     review_id|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|R28TEK5081Q3SQ|V28 Women Girls J...|          5|           12|         12|   N|                Y|
|R35IM9R63H3OVQ|Alain Dupetit Men...|          5|            7|         12|   N|                N|
|R24PZ7L6UNRYLS|Silver Lilly Unis...|          5|           13|         13|   N|                N|
| RHDEDSO28V1UG|Slip-On Scarf - P...|          5|            8|         10|   N|                N|
|R3BCHELYQ0CHUF|Komene Women's El...|          5|           15|         18|   N|                N|
|R35PT06NWP7LDP|Women Padded Spor...|          5|           30|         32|   N|                N|
| RQW4AFOG9MR4Z|UONBOX Celebrity ...|          5|           51|         52|   N|                N|
|R2SMUEBMG

In [30]:
# number of five star reviews
FiveStar = helpful_total.filter(helpful_total["star_rating"]== 5)
FiveStar.show()

+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|     review_id|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+--------------+--------------------+-----------+-------------+-----------+----+-----------------+
|R28TEK5081Q3SQ|V28 Women Girls J...|          5|           12|         12|   N|                Y|
|R35IM9R63H3OVQ|Alain Dupetit Men...|          5|            7|         12|   N|                N|
|R24PZ7L6UNRYLS|Silver Lilly Unis...|          5|           13|         13|   N|                N|
| RHDEDSO28V1UG|Slip-On Scarf - P...|          5|            8|         10|   N|                N|
|R3BCHELYQ0CHUF|Komene Women's El...|          5|           15|         18|   N|                N|
|R35PT06NWP7LDP|Women Padded Spor...|          5|           30|         32|   N|                N|
| RQW4AFOG9MR4Z|UONBOX Celebrity ...|          5|           51|         52|   N|                N|
|R2SMUEBMG

In [55]:
# Vine reviews in the helpful subset (filters on vine only, not on star_rating)
vine_helpful = helpful_total.filter(helpful_total["vine"]== "Y")
vine_helpful.count()

74

In [31]:
FiveStar.count()

55872

In [32]:
# total number of reviews in the helpful subset
helpful_total.count()

110524

In [33]:
# fraction of five-star reviews in the helpful subset
FiveStar.count() / helpful_total.count()

0.5055191632586588

In [34]:
# Five-star fraction among verified purchases (the filter is on verified_purchase, not vine)
FiveStar.filter(FiveStar["verified_purchase"] == "Y").count()/helpful_total.filter(helpful_total["verified_purchase"]== "Y").count()

0.5086851412454672

In [36]:
# Five-star fraction among unverified purchases (the filter is on verified_purchase, not vine)
FiveStar.filter(FiveStar["verified_purchase"] == "N").count()/helpful_total.filter(helpful_total["verified_purchase"]== "N").count()

0.4921609369097091
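
The last two cells split the five-star fraction by verified_purchase. The same kind of split could be made on the vine flag, which is what the bias question asks about. The sketch below shows the calculation in plain Python on made-up rows; `five_star_share_by_group` is a hypothetical helper, and the toy values say nothing about the real dataset.

```python
from collections import Counter


def five_star_share_by_group(rows, group_key):
    """Return {group value: share of 5-star reviews} for the given column."""
    totals = Counter()
    five_star = Counter()
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row["star_rating"] == 5:
            five_star[group] += 1
    return {group: five_star[group] / totals[group] for group in totals}


# Toy rows, made up for illustration; not taken from the dataset
rows = [
    {"star_rating": 5, "vine": "Y", "verified_purchase": "N"},
    {"star_rating": 3, "vine": "Y", "verified_purchase": "N"},
    {"star_rating": 5, "vine": "N", "verified_purchase": "Y"},
    {"star_rating": 5, "vine": "N", "verified_purchase": "Y"},
    {"star_rating": 1, "vine": "N", "verified_purchase": "N"},
    {"star_rating": 4, "vine": "N", "verified_purchase": "Y"},
]

by_vine = five_star_share_by_group(rows, "vine")
by_verified = five_star_share_by_group(rows, "verified_purchase")
print(by_vine)
print(by_verified)
```

Grouping by the column of interest keeps the numerator and denominator tied to the same subset, which is easy to get wrong when the two filters are written out separately.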