<a href="https://colab.research.google.com/github/karenbennis/Xy/blob/Data_ETL/Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<br><br>**ETL**<br><br>

In [1]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

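# Locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()

# Download the PostgreSQL JDBC driver, which the RDS writes below require
!wget -q https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

# Start the Spark session with the JDBC driver on the classpath
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ReviewsETL").config("spark.driver.extraClassPath", "/content/postgresql-42.2.9.jar").getOrCreate()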


In [2]:
from pyspark import SparkFiles
from pyspark.sql.types import IntegerType, DateType
from pyspark.sql.functions import col, avg

In [3]:
url = "https://raw.githubusercontent.com/karenbennis/Xy/Data_ETL/yelp.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("yelp.csv"), sep=",", header=True)
df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|         business_id|                date|           review_id|               stars|                text|                type|             user_id|                cool|              useful|               funny|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|9yKzy9PApeiPPOUJE...|          2011-01-26|fWKvX83p0-ka4JS3d...|                   5|My wife took me h...|                null|                null|                null|                null|                null|
|Do yourself a fav...|                null|                null|                null|                null|                null|                null|    
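
The second row of the output above shows review text in the `business_id` column, which usually means a quoted field containing a line break was split into separate rows. Below is a minimal sketch of the problem and one possible fix. It assumes that `yelp.csv` wraps multi-line review text in double quotes, which has not been verified here. The sample data and the helper `read_reviews` are hypothetical; `multiLine`, `quote`, and `escape` are options of Spark's CSV reader.

```python
import csv
import io

# Hypothetical two-record sample; the first review text contains a line break.
SAMPLE = (
    "business_id,stars,text\n"
    'b1,5,"My wife took me here.\nDo yourself a favor and go."\n'
    "b2,4,Great food\n"
)

# Splitting on line breaks tears the first review into two physical lines.
physical_lines = SAMPLE.splitlines()

# A quote-aware parser keeps the embedded line break inside the text field.
records = list(csv.DictReader(io.StringIO(SAMPLE)))

print(len(physical_lines), "physical lines,", len(records), "records")


def read_reviews(spark, path):
    """Read a CSV whose quoted fields may span several lines.

    Assumes embedded quotes are doubled (""), hence escape='"'.
    """
    return spark.read.csv(path, header=True, multiLine=True, quote='"', escape='"')
```

In the notebook, `df = read_reviews(spark, SparkFiles.get("yelp.csv"))` would replace the plain `spark.read.csv` call. Checking `df.count()` against the expected number of reviews is a quick way to confirm that the rows are no longer split.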

In [None]:
# Fix data types
# NOTE: video_games_df is not created in the cells above (they load yelp.csv into df);
# it must already hold the Amazon review data with the columns cast below.
int_columns = ["customer_id", "product_parent", "star_rating", "helpful_votes", "total_votes"]
for column in int_columns:
    video_games_df = video_games_df.withColumn(column, col(column).cast(IntegerType()))
video_games_df = video_games_df.withColumn("review_date", col("review_date").cast(DateType()))

In [None]:
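# Create the tables
review_id_table = video_games_df.select(['review_id', 'customer_id', 'product_id', 'product_parent', 'review_date'])
# Keep one row per product; the raw data repeats product_id for every review
products = video_games_df.select(['product_id', 'product_title']).dropDuplicates(['product_id'])
# Number of reviews per customer (the resulting column is named "count")
customers = video_games_df.groupBy('customer_id').count()
vine_table = video_games_df.select(['review_id', 'star_rating', 'helpful_votes', 'total_votes', 'vine'])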

In [None]:
# Configure settings for RDS
from getpass import getpass

mode = "append"
jdbc_url = "jdbc:postgresql://challenge.cde4fgpazxbm.ca-central-1.rds.amazonaws.com:5432/"
config = {"user": "postgres",
          "password": getpass("RDS password: "),  # prompt instead of storing the password in the notebook
          "driver": "org.postgresql.Driver"}



In [None]:
# Write dataframe to tables in RDS
review_id_table.write.jdbc(url=jdbc_url, table='review_id_table', mode=mode, properties=config)
vine_table.write.jdbc(url=jdbc_url, table='vine_table', mode=mode, properties=config)
products.write.jdbc(url=jdbc_url, table='products', mode=mode, properties=config)
customers.write.jdbc(url=jdbc_url, table='customers', mode=mode, properties=config)
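
For non-interactive runs, the connection settings used above can be kept out of the notebook source entirely. As a minimal sketch, they could be assembled from environment variables. The variable names `RDS_HOST`, `RDS_PORT`, `RDS_DATABASE`, `RDS_USER`, and `RDS_PASSWORD` and the helper `jdbc_settings` are hypothetical and not part of the original notebook.

```python
import os


def jdbc_settings(env=None):
    """Build a PostgreSQL JDBC URL and connection properties.

    Values come from environment variables; the variable names are
    assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    # Fail early with a clear message if a required value is absent.
    missing = [name for name in ("RDS_HOST", "RDS_PASSWORD") if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    host = env["RDS_HOST"]
    port = env.get("RDS_PORT", "5432")
    database = env.get("RDS_DATABASE", "")
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": env.get("RDS_USER", "postgres"),
        "password": env["RDS_PASSWORD"],
        "driver": "org.postgresql.Driver",
    }
    return url, properties
```

With the variables set, `jdbc_url, config = jdbc_settings()` yields values that can be passed to the same `write.jdbc` calls.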




<br><br><br><br><br><br>

**Analysis**

The tables below show that non-Vine reviews vastly outnumber Vine reviews, with more than 400 times as many entries. Because the two groups differ so much in size, raw counts are not comparable, so it only makes sense to compare averages and distributions. Average product ratings are similar, at just over 4.0 stars in both groups, and 4- and 5-star reviews are the most common ratings in both. Furthermore, in both groups 75% of reviews have one or fewer "helpful" votes. Based on this preliminary analysis, the two sets match well enough to hypothesize that Amazon's Vine reviews are a good representation of the overall review population. Further analysis could compare Vine and non-Vine reviews item by item rather than across an entire category.
<br><br>

In [None]:
# Count reviews per group, then split them into Vine and non-Vine reviews
vine_table.groupBy('vine').count().show()
vine_reviews = vine_table.filter(col('vine') == 'Y')
nonvine_reviews = vine_table.filter(col('vine') == 'N')

<br><br><br>**Non-Vine Reviews**

In [None]:
nonvine_reviews['star_rating','helpful_votes', 'total_votes'].summary().show()
nonvine_reviews.groupBy('star_rating').count().show()

<br><br><br>**Vine Reviews**

In [None]:
vine_reviews['star_rating','helpful_votes', 'total_votes'].summary().show()
vine_reviews.groupBy('star_rating').count().show()
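
Because the two groups differ so much in size, the raw per-rating counts printed above are hard to compare directly. One option is to convert each group's counts into shares of that group's total. The helper `rating_shares` below is a hypothetical sketch, and the counts in the demonstration are made-up placeholders, not results from the data.

```python
def rating_shares(rows):
    """Convert (star_rating, count) pairs into each rating's share of the total.

    Pairs with a missing rating (None) are ignored.
    """
    counts = {rating: count for rating, count in rows if rating is not None}
    total = sum(counts.values())
    if total == 0:
        return {}
    return {rating: counts[rating] / total for rating in sorted(counts)}


# Demonstration with placeholder counts (not taken from the review data).
example = rating_shares([(5, 6), (4, 3), (1, 1)])
print(example)

# In the notebook, the pairs would come from the grouped Spark result, e.g.:
# rows = vine_reviews.groupBy('star_rating').count().collect()
# vine_shares = rating_shares((row['star_rating'], row['count']) for row in rows)
```

Collecting to the driver is reasonable here because the grouped result has at most one row per star rating. Applying the same helper to `nonvine_reviews` gives two distributions on a common scale.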