If you work locally, it might be a good idea to create symbolic link to your data directory:

```ln -s /your/data/directory data```

This will create a link named data in the current dir and you can use the data as it was in the directory `data`

In [24]:
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import pandas as pd

In [25]:
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
sql_context = SQLContext(sc)

In [26]:
DATA_DIR = './data/'

`index.txt` describes format of the data. It is tab delimited, each row is a separate review entry.
We define the schema of the data to facilitate reading.

__TODO__ convert date to a suitable format

__TODO__ convert vine and verified purchase to bool Type

__TODO__ category can be categorical value?

In [71]:
def read_tsv_to_pyspark_DF(filename):
    schema = StructType([
        StructField('marketplace', StringType(), True), #2 letter country code
        StructField('customer_id', StringType(), True), #author identifier
        StructField('review_id', StringType(), True), #unique review ID
        StructField('product_id', StringType(), True), # unique product ID
        StructField('product_parent', StringType(), True), # product identifier to be used to aggregate reviews for a product
        StructField('product_title', StringType(), True),
        StructField('product_category', StringType(), True),
        StructField('star_rating', IntegerType(), True), # 1-5 star rating 
        StructField('helpful_votes', IntegerType(), True), # positive votes for the review
        StructField('total_votes', IntegerType(), True), # total votes for the review
        StructField('vine', StringType(), True), # review is part of Vine Program
        StructField('verfied_purchase', StringType(), True), # Review is on Verified Purchase
        StructField('review_headline', StringType(), True), # title of the review
        StructField('review_body', StringType(), True), # text
        StructField('review_date', StringType(), True)]) # date of review 

    return sql_context.read.option('sep', '\t').csv(filename, schema=schema, header=True)

In [72]:
sample_df = read_tsv_to_pyspark_DF('sample_us.tsv')
sample_df.printSchema()

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- product_category: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verfied_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: string (nullable = true)



In [52]:
df_books = read_tsv_to_pyspark_DF(DATA_DIR+'amazon_reviews_us_Books_v1_00.tsv.gz')

In [78]:
df_books.count()

10319091

In [83]:
df_books.filter("product_category == 'Books'").count()

10318984

In [80]:
sample_df.select('marketplace').distinct().show()

+-----------+
|marketplace|
+-----------+
|         US|
+-----------+

