## Spark cluster (standalone) - Prediction notebook

> Dockerized env : [JupyterLab server => Spark (master <-> 1 worker) ]  
`docker-compose.yml` was (slightly) adapted from this [article](https://towardsdatascience.com/first-steps-in-machine-learning-with-apache-spark-672fe31799a3)  

> Original notebook is heavily modified:

- random forest regressor instead of the article's linear regression
- use of a Pipeline (pyspark.ml.pipeline) to streamline the whole prediction process
- no more SQL-type queries

#### Connect to Spark

Reminder (as defined in docker-compose.yml) :
- This notebook: http://localhost:8888
- Spark master UI: http://localhost:8080
- Spark worker UI: http://localhost:8081

In [26]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# SparkSession
URL_SPARK = "spark://spark:7077"

spark = (
    SparkSession.builder
    .appName("spark-ml")
    .config("spark.executor.memory", "4g")  # Spark property names carry the "spark." prefix
    .master(URL_SPARK)
    .getOrCreate()
)

### Run example - pyspark.sql / pyspark.ml

Uses the Avocado dataset (how original). If you cloned the git repo, it is in /data; otherwise download it from Kaggle.

#### Load data

*Quick desc / scope of dataset :*  
No EDA: this exercise has been done a million times  
Years 2015 to 2018  
Two avocado types : organic or conventional  
Region = region of consumption  
Avocado sizes (PLU): 4046 (small-medium), 4225 (large), 4770 (x-large), expressed in total # of sold avocados

In [51]:
# Mark the DataFrame for caching with .cache() so later cells can reuse it.
# Caching is lazy: it only happens when an action (count, show, take or write) runs on the same DataFrame.
df_avocado = spark.read.csv(
    "data/avocado.csv",
    header=True,
    inferSchema=True
).cache()  # lazy: nothing is cached yet

df_avocado.printSchema()
df_avocado.show(4)  # show() is an action, so it triggers caching of the partitions it reads

root
 |-- _c0: integer (nullable = true)
 |-- Date: timestamp (nullable = true)
 |-- AveragePrice: double (nullable = true)
 |-- Total Volume: double (nullable = true)
 |-- 4046: double (nullable = true)
 |-- 4225: double (nullable = true)
 |-- 4770: double (nullable = true)
 |-- Total Bags: double (nullable = true)
 |-- Small Bags: double (nullable = true)
 |-- Large Bags: double (nullable = true)
 |-- XLarge Bags: double (nullable = true)
 |-- type: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- region: string (nullable = true)

+---+-------------------+------------+------------+-------+---------+-----+----------+----------+----------+-----------+------------+----+------+
|_c0|               Date|AveragePrice|Total Volume|   4046|     4225| 4770|Total Bags|Small Bags|Large Bags|XLarge Bags|        type|year|region|
+---+-------------------+------------+------------+-------+---------+-----+----------+----------+----------+-----------+------------+----+------+
|  0|

#### Steps overview

- No EDA (it has been done a million times); we jump straight to the implementation
- Feature creation from 'Date': extract year (yy) and month (mm)
- One-hot encode the categorical 'region'
- Scale numerical features
- Drop columns: _c0 (index), Total Bags, Total Volume (strongly correlated with their respective subcategories)
- Drop transformed columns (if not already done): Date, region