## 1. Import and initialize PySpark environment

- Import the Spark environment using findspark
- Initialize the Spark environment
- Create the SparkContext
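
The notebook uses a `spark` object below without showing the cell that creates it. A minimal sketch of the three steps above, assuming `findspark` and `pyspark` are installed and that `SPARK_HOME` (or a discoverable Spark install) is available; the function name `create_spark_session` and the application name are hypothetical, not taken from the original notebook:

```python
def create_spark_session(app_name="movie-metadata"):
    """Locate Spark with findspark, then return a SparkSession (sketch)."""
    # findspark.init() has to run before pyspark is imported,
    # which is why both imports live inside the function.
    import findspark
    findspark.init()

    from pyspark.sql import SparkSession

    # getOrCreate() reuses an already running session if there is one.
    return SparkSession.builder.appName(app_name).getOrCreate()
```

In a notebook cell, `spark = create_spark_session()` followed by `sc = spark.sparkContext` would then provide both the session and the SparkContext.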


In [74]:
!ls ./data/input/

metadata  reviews


In [75]:
from os import path

ROOT_DIR = "./"
DATA_DIR = path.join(ROOT_DIR, 'data')
INPUT_DATA_PATH = path.join(DATA_DIR, 'input', 'metadata')
!ls {INPUT_DATA_PATH}

meta_Movies_and_TV.json.gz


In [76]:
spark

## 2. Read data

Read all the fields of the gzipped JSON metadata file into a Spark DataFrame.


In [77]:
"""
Read JSON-formatted data file to dataframe
"""

data = spark.read.json(INPUT_DATA_PATH)
data.show(10)

+--------------------+--------------------+----------+--------------+--------------------+----+--------------------+-------+-------+---+--------------------+-----------+------+--------------------+------------+-----+-----+--------------------+
|            also_buy|           also_view|      asin|         brand|            category|date|         description|details|feature|fit|               image|   main_cat| price|                rank|similar_item|tech1|tech2|               title|
+--------------------+--------------------+----------+--------------+--------------------+----+--------------------+-------+-------+---+--------------------+-----------+------+--------------------+------------+-----+-----+--------------------+
|                  []|                  []|0000695009|              |[Movies & TV, Mov...|    |                  []|   null|     []|   |                  []|Movies & TV|      |886,503 in Movies...|            |     |     |Understanding Sei...|
|                  []|  
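
`spark.read.json` expects one JSON object per line by default, so a quick way to inspect raw records without Spark is to read the first few lines of the gzipped file directly. This is only a sketch using the Python standard library; the helper name `peek_json_lines` is hypothetical and not part of the original notebook:

```python
import gzip
import json
from itertools import islice


def peek_json_lines(gz_path, n=3):
    """Return up to the first n records of a gzipped JSON Lines file as dicts."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        # islice stops after n lines, so the rest of the file is never read.
        return [json.loads(line) for line in islice(handle, n) if line.strip()]
```

For example, `peek_json_lines(path.join(INPUT_DATA_PATH, 'meta_Movies_and_TV.json.gz'))` would return the first three product records, which makes the raw key names easier to inspect than the truncated table output.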

## 3. Data Exploration

In [78]:
data.printSchema()

root
 |-- also_buy: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- also_view: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- asin: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- category: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- date: string (nullable = true)
 |-- description: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- details: struct (nullable = true)
 |    |-- 
    Item Weight: 
    : string (nullable = true)
 |    |-- 
    Package Dimensions: 
    : string (nullable = true)
 |    |-- 
    Product Dimensions: 
    : string (nullable = true)
 |    |-- ASIN:: string (nullable = true)
 |    |-- ASIN: : string (nullable = true)
 |    |-- Audio CD: string (nullable = true)
 |    |-- Audio Description:: string (nullable = true)
 |    |-- Blu-ray Audio: string (nullable = true)
 |    |-- DVD Audio: string (nullable = true)
 |    |-- Digital Co

In [79]:
"""
View a small random sample of the data
"""

data.sample(fraction = 0.001, withReplacement = False).show()

+--------------------+--------------------+----------+--------------------+--------------------+----+--------------------+--------------------+-------+---+-----+-----------+------+--------------------+------------+-----+-----+--------------------+
|            also_buy|           also_view|      asin|               brand|            category|date|         description|             details|feature|fit|image|   main_cat| price|                rank|similar_item|tech1|tech2|               title|
+--------------------+--------------------+----------+--------------------+--------------------+----+--------------------+--------------------+-------+---+-----+-----------+------+--------------------+------------+-----+-----+--------------------+
|[B001AEF6BS, B01M...|[B00RJXKUOM, B076...|078322687X|      Emilio Estevez|[Movies & TV, Stu...|    |[They were five t...|                null|     []|   |   []|Movies & TV| $8.89|22,438 in Movies ...|            |     |     |  The Breakfast Club|
|[B000MV

In [80]:
feature_used = ['asin',
                'brand',
                'category',
                'main_cat',
                'description',
                'details.Label:',
                'details.Publisher:',
                'title']

data_used = data.select(feature_used)
data_used.select('description').show(truncate=False)
# data_used.sample(fraction = 0.001, withReplacement = False).show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [81]:
"""
Summarize the selected columns
to see how well populated they are
"""

# show() returns None, so keep the summary DataFrame and display it separately
summary = data_used.describe()
summary.show()
data_used.printSchema()

+-------+--------------------+-----------------+-----------+------------+--------------------+------------------+
|summary|                asin|            brand|   main_cat|      Label:|          Publisher:|             title|
+-------+--------------------+-----------------+-----------+------------+--------------------+------------------+
|  count|              203766|           203766|     203766|          24|                  16|            203766|
|   mean| 5.595807938177439E9|            152.5|       null|        null|                null|15173.021212121212|
| stddev|1.8679576229842582E9|461.5257522609112|       null|        null|                null| 76214.04439271425|
|    min|          0000143502|                 |           |     CD Baby|406 Productions (...|                  |
|    max|          B01HJF79XO|                ~|Video Games|Warner Bros.|             Unknown|                ~~|
+-------+--------------------+-----------------+-----------+------------+---------------

In [82]:
from pyspark.sql.functions import col

data_used = data_used.select('asin', 'title', 'brand', 'category', 'description')
data_used.sample(fraction = 0.001, withReplacement = False).show(truncate=True)

+----------+--------------------+-----------------+--------------------+--------------------+
|      asin|               title|            brand|            category|         description|
+----------+--------------------+-----------------+--------------------+--------------------+
|0929915453|Coyote Hunting Wi...|      Tom Bechdel|[Movies & TV, Gen...|[As a retired WVD...|
|144033742X|Acrylic Painting ...|                 |[Movies & TV, Mov...|[<div>, Acrylic T...|
|157252331X|North Shore Fish VHS|   Mercedes Ruehl|[Movies & TV, Ind...|                  []|
|6300189376|Heated Vengeance VHS|    Richard Hatch|[Movies & TV, Gen...|  [Action Adventure]|
|6301706811|          S.O.B. VHS|    Julie Andrews|[Movies & TV, Stu...|[It's been years ...|
|6303029329|      Cold Sweat VHS|        Ben Cross|[Movies & TV, Par...|[Previously viewe...|
|6303184464|     Riding High VHS|      Bing Crosby|[Movies & TV, Mus...|[RIDING HIGH tell...|
|6303283349|Upper Body/Variab...|                 |[Movies &

## 4. Visualization

## 5. Data transformation and save

We want to build a recommendation system that suggests movies users might be interested in, based on their historical reviews, so the rating scores will be used. In this part, we extract and transform the information we need and save it.

In [84]:
""" Store the selected metadata as gzipped JSON """

# Only `path` was imported from os above, so use path.join rather than os.path.join
target = path.join(DATA_DIR, 'movie-meta')

data_used.write.json(target, mode='overwrite', compression='gzip')
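
Spark writes the result as a directory of `part-*.json.gz` files rather than a single file. As a sanity check, the number of records written can be counted without Spark. This is a sketch using only the standard library; the helper name `count_written_records` is hypothetical:

```python
import gzip
from pathlib import Path


def count_written_records(output_dir):
    """Count non-empty lines across all gzipped JSON part files in output_dir."""
    total = 0
    # Marker files such as _SUCCESS do not match the pattern and are skipped.
    for part in sorted(Path(output_dir).glob("part-*.json.gz")):
        with gzip.open(part, "rt", encoding="utf-8") as handle:
            total += sum(1 for line in handle if line.strip())
    return total
```

If the write succeeded, `count_written_records(target)` should equal `data_used.count()`.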