# Final Project Update (Task 1: check, pre-process, filter, and transform the dataset)

The dataset we chose for the final project is the Amazon review data: http://deepyeti.ucsd.edu/jianmo/amazon/index.html

The raw data is in JSON format. A single record looks like this:

{"overall": 2.0, "verified": false, "reviewTime": "12 5, 2015", "reviewerID": "A3KUPJ396OQF78", "asin": "B017O9P72A", "reviewerName": "Larry Russlin", "reviewText": "Can only control one of two bulbs from one of two echos", "summary": "Buggy", "unixReviewTime": 1449273600}

We only need five fields: Date, UserID, Product_id, Rate, and Review.
We use Spark to load and inspect the raw data, transform it into that format, and write the result back to HDFS.
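
To make the field mapping concrete, here is a minimal plain-Python sketch (no Spark needed) that pulls the five fields out of the sample record above. The names `FIELD_MAP` and `project` are our own and exist only for illustration; the mapping of output fields to JSON keys follows the transformation used in this notebook.

```python
import json

# Output field name -> key in the raw JSON record.
FIELD_MAP = [
    ("Date", "reviewTime"),
    ("UserID", "reviewerID"),
    ("Product_id", "asin"),
    ("Rate", "overall"),
    ("Review", "reviewText"),
]

def project(raw_line):
    """Parse one JSON record and keep only the five fields we need."""
    record = json.loads(raw_line)
    # .get() returns None when a key is missing instead of raising KeyError.
    return tuple(record.get(key) for _, key in FIELD_MAP)

# The sample record shown above, as a single JSON string.
sample = (
    '{"overall": 2.0, "verified": false, "reviewTime": "12 5, 2015", '
    '"reviewerID": "A3KUPJ396OQF78", "asin": "B017O9P72A", '
    '"reviewerName": "Larry Russlin", '
    '"reviewText": "Can only control one of two bulbs from one of two echos", '
    '"summary": "Buggy", "unixReviewTime": 1449273600}'
)

print(project(sample))
```

The Spark cells below do the same projection, only distributed over the whole file.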



In [1]:
# Task: load the raw data from HDFS, drop the fields we don't need, then write the result back to HDFS
peopleDF = spark.read.json("hdfs://orion11:21001/All_Amazon_Review_5.json")

peopleDF.printSchema()

root
 |-- asin: string (nullable = true)
 |-- image: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- style: struct (nullable = true)
 |    |-- Bore Diameter:: string (nullable = true)
 |    |-- Capacity:: string (nullable = true)
 |    |-- Closed Length String:: string (nullable = true)
 |    |-- Color Name:: string (nullable = true)
 |    |-- Color:: string (nullable = true)
 |    |-- Colorj:: string (nullable = true)
 |    |-- Colour:: string (nullable = true)
 |    |-- Conference Name:: string (nullable = true)
 |    |-- Configuration:: string (nullable = true)
 |    |-- Connectivity:: string (nullable = true)
 |    |-- Connector Type:: string (nullable = true)
 |    |-- Content:: string (nullable = true)
 |    |-- Curvature:: string (nullable = true)
 ...

In [2]:
peopleDF.take(5)

[Row(asin='B017O9P72A', image=None, overall=2.0, reviewText='Can only control one of two bulbs from one of two echos', reviewTime='12 5, 2015', reviewerID='A3KUPJ396OQF78', reviewerName='Larry Russlin', style=None, summary='Buggy', unixReviewTime=1449273600, verified=False, vote=None),
 Row(asin='B017O9P72A', image=None, overall=5.0, reviewText='Great skill', reviewTime='01 15, 2018', reviewerID='A3TXR8GLKS19RE', reviewerName='Nello', style=None, summary='Great', unixReviewTime=1515974400, verified=False, vote=None),
 Row(asin='B017O9P72A', image=None, overall=1.0, reviewText='Not happy. Can not connect to Alexa regardless.', reviewTime='01 4, 2018', reviewerID='A1FOHYK23FJ6CN', reviewerName='L. Ray Humphreys', style=None, summary='Can not connect to ECHO', unixReviewTime=1515024000, verified=False, vote='2'),
 Row(asin='B017O9P72A', image=None, overall=1.0, reviewText='Can not connect a hue lights to Alexa. Linked the LIFX in the Amazon Alexa app. Can not located the smart hue bulbs. ...

In [10]:
# Transform each raw record into a
# (Date, UserID, Product_id, Rate, Review) tuple.
# The tuples are later written out as tab-separated lines.
transformed_data = peopleDF.rdd.map(lambda data: (data["reviewTime"], data["reviewerID"], data["asin"], data["overall"], data["reviewText"]))


In [11]:
transformed_data.take(5)

[('12 5, 2015',
  'A3KUPJ396OQF78',
  'B017O9P72A',
  2.0,
  'Can only control one of two bulbs from one of two echos'),
 ('01 15, 2018', 'A3TXR8GLKS19RE', 'B017O9P72A', 5.0, 'Great skill'),
 ('01 4, 2018',
  'A1FOHYK23FJ6CN',
  'B017O9P72A',
  1.0,
  'Not happy. Can not connect to Alexa regardless.'),
 ('12 30, 2017',
  'A1RRDX9AOST1AN',
  'B017O9P72A',
  1.0,
  'Can not connect a hue lights to Alexa. Linked the LIFX in the Amazon Alexa app. Can not located the smart hue bulbs. It should not be this hard to connect to Alexa. Even watched a you tube video and still'),
 ('12 29, 2017',
  'AA4DHYT5YSSIT',
  'B017O9P72A',
  1.0,
  'The service works with google home, but doesn\'t work with alexa. I\'m getting rid of the "I\'m  not sure" machine.')]
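
The Date field is still in the raw `reviewTime` form (for example `'12 5, 2015'`), which does not sort chronologically as a string. If an ISO date is wanted later, a sketch like the following could be applied per record. The helper names `parse_review_time` and `date_from_unix` are hypothetical, and treating `unixReviewTime` as a UTC timestamp is our assumption; it does agree with `reviewTime` for the sample record.

```python
from datetime import datetime, timezone

def parse_review_time(review_time):
    """Parse a raw reviewTime string such as '12 5, 2015' into a date."""
    # %m and %d also accept values without a leading zero.
    return datetime.strptime(review_time, "%m %d, %Y").date()

def date_from_unix(unix_review_time):
    """Alternative: derive the date from unixReviewTime (assumed to be UTC)."""
    return datetime.fromtimestamp(unix_review_time, tz=timezone.utc).date()

print(parse_review_time("12 5, 2015").isoformat())  # 2015-12-05
print(date_from_unix(1449273600).isoformat())       # 2015-12-05
```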

In [13]:
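# Write the RDD back to HDFS as tab-separated text

def toCSVLine(data):
    # Join the five fields of one record with tabs
    return '\t'.join(str(d) for d in data)

lines = transformed_data.map(toCSVLine)

# saveAsTextFile creates a directory containing one part file per partition
lines.saveAsTextFile('hdfs://orion11:21001/filtered_AMAZON.csv')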
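
One caveat with `toCSVLine`: `str(None)` is written as the literal text `None`, and a review containing a tab or a line break would split one record across several columns or lines when the file is reloaded. We have not checked how many reviews are affected. A possible hardened variant is sketched below; `clean_field` and `to_tsv_line` are hypothetical names, and the review text in `row` is made up for the example.

```python
def clean_field(value):
    """Render one field as text that is safe inside a tab-separated line."""
    if value is None:
        return ""  # avoid writing the literal string "None"
    text = str(value)
    # Replace characters that would break the one-record-per-line layout.
    for ch in ("\t", "\r", "\n"):
        text = text.replace(ch, " ")
    return text

def to_tsv_line(data):
    """Join the cleaned fields of one record with tabs."""
    return "\t".join(clean_field(d) for d in data)

# Hypothetical record whose review contains a line break and a tab.
row = ("12 5, 2015", "A3KUPJ396OQF78", "B017O9P72A", 2.0, "line one\nline two\twith tab")
line = to_tsv_line(row)
print(repr(line))
```

It could be dropped in as `lines = transformed_data.map(to_tsv_line)`.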

In [14]:
# Reload the tab-separated part files from HDFS as a DataFrame
transformed_data = spark.read.load('hdfs://orion11:21001/filtered_AMAZON.csv/*', format='csv', sep='\t', inferSchema=True, header=False)

In [15]:
transformed_data.take(2)

[Row(_c0='08 27, 2014', _c1='A3IW6WI7AITDGG', _c2='B000038ABO', _c3=5.0, _c4='this game is great'),
 Row(_c0='11 27, 2013', _c1='A14Q3D86A82GVH', _c2='B000038ABO', _c3=5.0, _c4='Kick ass old school game. If you like games from Square Enix then this is a must have for your library.')]

In [17]:
transformed_data.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: string (nullable = true)
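
Because the files are read with `header=False`, the columns come back as `_c0` to `_c4`. A small sketch for restoring the five field names with `DataFrame.toDF` follows. `COLUMN_NAMES` and `name_columns` are our own hypothetical names, and the sketch assumes the column order is unchanged from the write step.

```python
# The five fields, in the order they were written to HDFS.
COLUMN_NAMES = ["Date", "UserID", "Product_id", "Rate", "Review"]

def name_columns(df):
    """Return a DataFrame whose positional columns carry the field names."""
    # toDF(*cols) returns a new DataFrame with the columns renamed in order.
    return df.toDF(*COLUMN_NAMES)
```

Usage would be `transformed_data = name_columns(transformed_data)`, after which `printSchema()` should list `Date`, `UserID`, `Product_id`, `Rate`, and `Review` instead of `_c0` to `_c4`.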

