## Internal Documentation

- First, install the EPFL VPN client by following the instructions here: https://epnet.epfl.ch/AnyConnect-VPN-Clients

- How to run jobs in the cluster: https://drive.google.com/open?id=1n9tIfMkDPW6RDLFPvhhetOIcFCBNMnpfESzHc20MM4w

- To install PyTorch (the ML library), run the following commands on the cluster:

```curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py```

```python get-pip.py --user```

```pip install --user torch torchvision```

### Get the Code from the Notebook to the Cluster

We can write our Python script in this notebook, but to run it on the cluster we need to copy it to the server with ```scp```, log in with ```ssh```, and launch it with ```spark-submit```:

```scp script.py GASPARID@iccluster028.iccluster.epfl.ch:/home/GASPARID/script.py```

```ssh GASPARID@iccluster028.iccluster.epfl.ch```

```spark-submit --master yarn --deploy-mode client --driver-memory 4G --num-executors 5 --executor-memory 4G --executor-cores 5 script.py```
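
If the submit step gets repetitive, the flags can be assembled programmatically. The following is a minimal sketch, not part of the project: the helper name `build_spark_submit_args` is hypothetical, its defaults simply mirror the command above, and it only builds the argument list without executing anything.

```python
def build_spark_submit_args(script, num_executors=5, executor_cores=5,
                            driver_memory="4G", executor_memory="4G"):
    """Assemble the spark-submit command shown above as an argument list.

    Hypothetical helper for illustration; the defaults copy the flags
    used in this document.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        "--driver-memory", driver_memory,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        script,
    ]

args = build_spark_submit_args("script.py")
print(" ".join(args))
```

On the cluster, a list like this could be passed to `subprocess.run(args, check=True)` instead of typing the command by hand.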

In [None]:
########## Don't include when running on the cluster ##########
import findspark
findspark.init()
###############################################################

from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.functions import min
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import *

from pyspark.ml.feature import StringIndexer

from datetime import datetime  
from datetime import timedelta

from pyspark.sql import SparkSession
from pyspark import SparkContext

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

########## Use this when running on cluster ##########
# This reads in all files that start with amazon_reviews_us (the English reviews)
dataFile = 'hdfs:///datasets/amazon_multiling/tsv/amazon_reviews_us*tsv.gz'
############ Use this when running locally ###########
#dataFile = 'sample_us.tsv'
######################################################

# Manually specify the schema
schema = StructType([
    StructField('marketplace', StringType()),
    StructField('customer_id', IntegerType()),
    StructField('review_id', StringType()),
    StructField('product_id', StringType()),
    StructField('product_parent', IntegerType()),
    StructField('product_title', StringType()),
    StructField('product_category', StringType()),
    StructField('star_rating', IntegerType()),
    StructField('helpful_votes', IntegerType()),
    StructField('total_votes', IntegerType()),
    StructField('vine', StringType()),
    StructField('verified_purchase', StringType()),
    StructField('review_headline', StringType()),
    StructField('review_body', StringType()),
    StructField('review_date', DateType()),
])

df = spark.read.csv(dataFile, sep="\t", header=True, schema=schema)

# We will want to take a subset of the data
# x-core means that all items have at least x reviews
x_core = 1

# This query returns the IDs of the products with at least x reviews
query1 = '''
    SELECT product_id
    FROM df
    GROUP BY product_id
    HAVING COUNT(*) >= %s
''' % x_core

# This query returns the rows for reviews for products with at least x reviews
query2 = '''
SELECT *
FROM df
WHERE product_id IN
    (     
        SELECT product_id
        FROM df
        GROUP BY product_id
        HAVING COUNT(*) >= %s
    )
''' % x_core

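# Register the DataFrame as a temporary view so the SQL queries can refer to it as "df"
# (createOrReplaceTempView replaces the deprecated registerTempTable)
df.createOrReplaceTempView("df")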
df = spark.sql(query2)
# Results for all files that start with amazon_reviews_us (number of rows returned by query1):
#  Number of 1-core reviews: 21390118
#  Number of 2-core reviews: 10213901
#  Number of 3-core reviews:  6931152
#  Number of 4-core reviews:  5318037
#  Number of 5-core reviews:  4342875

# Index categorical StringType columns as numeric values
# This is to help with ML later on
# Might want to index more columns later on; these are just the obvious ones
categoricalColumns = ["marketplace", "product_category", "vine", "verified_purchase"]
# The loop variable is named `column` so it does not shadow `col` from pyspark.sql.functions
for column in categoricalColumns:
    indexer = StringIndexer(inputCol=column, outputCol=column + "Index")
    df = indexer.fit(df).transform(df)




print("********** FINISHED **********")
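
To make the x-core subset concrete, here is a minimal pure-Python sketch of what the `GROUP BY ... HAVING COUNT(*) >= x` filter in `query2` computes. The sample data and the function name `x_core_filter` are made up for illustration; the actual pipeline performs this step in Spark SQL.

```python
from collections import Counter

def x_core_filter(reviews, x_core):
    """Keep only reviews whose product has at least `x_core` reviews.

    `reviews` is a list of (review_id, product_id) pairs. This mirrors the
    single-pass GROUP BY / HAVING filter in query2; it is not iterative.
    """
    counts = Counter(product_id for _, product_id in reviews)
    return [r for r in reviews if counts[r[1]] >= x_core]

# Made-up sample data: product "A" has 3 reviews, "B" has 2, "C" has 1
sample = [("r1", "A"), ("r2", "A"), ("r3", "B"),
          ("r4", "A"), ("r5", "B"), ("r6", "C")]

two_core = x_core_filter(sample, 2)
print(len(two_core))  # 5: the single review of product "C" is dropped
```

With `x_core = 1`, as set in the cell above, every review is kept, so the filter only starts to shrink the data from `x_core = 2` onward.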