# Spark - Machine Learning Fundamentals
In this courselet, we explore the basics of using Spark to train a Machine Learning model. By the end of this courselet, you should be able to:

- Recognize the fundamentals of training an ML model using Spark
- Identify two different ML algorithms
- Identify the main libraries and documentation to perform your tasks

This courselet assumes a basic understanding of machine learning workflows and the Spark framework. It is designed primarily to introduce coding in these contexts, rather than to focus on developing rigorously accurate models.

For this courselet, we will use taxi trips reported to the City of Chicago in 2020. This data is publicly available through the [Chicago Data Portal](https://data.cityofchicago.org/en/Transportation/Taxi-Trips-2020/r2u4-wwk3/about_data). If you previously covered the Exploratory Data Analysis with Spark courselet, you should be familiar with this dataset.

In this courselet, we are going to explore the following cases:

- **Regression:** We are going to try to estimate the fare price of a trip, given a collection of features based on location, temporality, and trip duration.
- **Clustering:** We are going to segment our trips into 10 different clusters, using the coordinates and temporality components as our clustering features.

We are using the data in [Parquet format](https://parquet.apache.org/), a columnar format that stores the schema alongside the data and compresses well, which makes it efficient to read in Spark.

## Module 1: Regression

As a first step, we start our Spark session.

In [None]:
import pyspark
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder \
    .appName("ML-Process-Regression") \
    .getOrCreate()

In [None]:
# We start by loading the data and printing the schema
# (Parquet files store their own schema, so CSV options such as header/inferSchema are not needed)
df = spark.read.parquet("data/chicago-taxi-2020.parquet")
df.printSchema()

In [None]:
# A display of the first few rows
df.limit(5).toPandas()

### Exploratory Analysis

We briefly explore our data. We might want to start by analyzing the features that will be part of our regression model and displaying the missingness rate per column in the dataframe.

In [None]:
# List of continuous variables for our regression analysis
continuous = ["Trip Seconds", "Trip Miles", "Fare", "Tips", "Tolls", "Extras", "Trip Total"]
df_cont = df.select(continuous)
df_cont_summary = df_cont.describe()

df_cont_summary.show()

In [None]:
# Missingness rate per column
from pyspark.sql.functions import col, count, when, lit
total_count = df.count()
missingness_rate = df.select([((count(when(col(c).isNull(), c)) / lit(total_count))).alias(c) for c in df.columns])

missingness_rate.toPandas().transpose() # Using the toPandas method for a more readable display
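
To make the missingness calculation concrete, here is a minimal plain-Python sketch of the same idea: for each column, count the null entries and divide by the total number of rows. The `rows` list and the `missingness_rate` helper are hypothetical stand-ins for the DataFrame and are not part of the courselet's pipeline.

```python
# Toy rows standing in for a DataFrame; None marks a missing value (made-up data).
rows = [
    {"Fare": 10.0, "Tips": None, "Trip Miles": 1.2},
    {"Fare": None, "Tips": None, "Trip Miles": 0.8},
    {"Fare": 7.5, "Tips": 2.0, "Trip Miles": 3.1},
    {"Fare": 12.0, "Tips": None, "Trip Miles": 2.4},
]

def missingness_rate(rows):
    """Return {column: share of rows in which the column is None}."""
    total = len(rows)
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / total for c in columns}

rates = missingness_rate(rows)
print(rates)  # {'Fare': 0.25, 'Tips': 0.75, 'Trip Miles': 0.0}
```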

### Data pre-processing and feature engineering

**Regression Case**

For our regression modeling, our data pre-processing will go as follows:
1. We will start by creating a subset DataFrame that keeps only the features that are part of our analysis
2. We will remove outliers from the *Fare* column (our target feature) by using the [1.5xIQR rule](https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule#:~:text=A%20commonly%20used%20rule%20says,3%20%2B%201.5%20%E2%8B%85%20IQR%20%E2%80%8D%20.)
3. We will extract the hour and the day of the week from *Trip Start Timestamp*
4. We will encode the hour, day of the week and community area of the pickup to treat them as categories for the model
5. We will place all of our explanatory features into a vector column using [VectorAssembler](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html)
6. We will reduce the number of features using Principal Component Analysis
7. We will create our final dataframe, keeping the relevant features with *Fare*, our target column.

**Vector Assembler** 

An important component of our data preparation pipeline is the VectorAssembler class, which combines a list of columns from our dataframe into a single vector column. PySpark ML algorithms expect the explanatory features in this single-column vector representation, so we assemble them before fitting a model.
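
Conceptually, the assembling step concatenates the selected columns, in order, into one flat vector per row. The following plain-Python sketch illustrates that idea on a single made-up row; the `assemble` helper is hypothetical and only mimics the behavior for scalar and list-valued columns.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            # Vector-like columns contribute all of their entries
            features.extend(float(v) for v in value)
        else:
            # Scalar columns contribute a single entry
            features.append(float(value))
    return features

# Made-up row: two numeric columns and one already-encoded vector column
row = {"Trip Seconds": 600, "Trip Miles": 2.5, "pickup_hour_vec": [0.0, 1.0, 0.0]}
features = assemble(row, ["Trip Seconds", "Trip Miles", "pickup_hour_vec"])
print(features)  # [600.0, 2.5, 0.0, 1.0, 0.0]
```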

In [None]:
# Select regression features
regression_features = ["Trip Start Timestamp", "Trip Seconds", "Trip Miles", "Pickup Community Area", "Fare"]
df_reg = df.select(*regression_features).dropna(how='any', subset=regression_features) # We make sure to drop any missing values

In [None]:
# Removing outliers from our target column
from pyspark.sql.functions import col, lit

quantiles = df_reg.approxQuantile("Fare", [0.25, 0.75], 0.05) # https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.approxQuantile.html
Q1, Q3 = quantiles

IQR = Q3 - Q1 # Calculate the IQR
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_reg = df_reg.filter((col("Fare") >= lit(lower_bound)) & (col("Fare") <= lit(upper_bound))) # We remove the outliers using the bounds
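
To make the 1.5xIQR rule concrete, the following plain-Python sketch applies the same bounds to a small, made-up list of fares. The values and the use of `statistics.quantiles` are illustrative assumptions and not part of the courselet's pipeline; Spark's `approxQuantile` with a non-zero relative error may return slightly different quartiles.

```python
import statistics

# Made-up fares with one obvious outlier
fares = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 60.0]

# Quartiles of the sample (exact, unlike approxQuantile with a relative error)
q1, _, q3 = statistics.quantiles(fares, n=4, method="inclusive")

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values inside the bounds
kept = [f for f in fares if lower_bound <= f <= upper_bound]

print(lower_bound, upper_bound)  # 0.0 16.0
print(kept)  # the 60.0 fare is removed
```

In the Spark cell, the third argument of `approxQuantile` (0.05) is the relative error. Setting it to 0 computes exact quantiles, at a higher computational cost.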

In [None]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import to_timestamp, hour, dayofweek

# We start by converting the column's format to timestamp and extracting hour and day of the week
df_reg =  df_reg.withColumn("Trip Start Timestamp", 
                            to_timestamp("Trip Start Timestamp", 'MM/dd/yyyy hh:mm:ss a')) \
                .withColumn("pickup_hour", hour("Trip Start Timestamp")) \
                .withColumn("pickup_day_of_week", dayofweek("Trip Start Timestamp"))

# We cast the new columns to string so that they are indexed as categories
df_reg = df_reg.withColumn("pickup_hour", df_reg["pickup_hour"].cast("string")) \
               .withColumn("pickup_day_of_week", df_reg["pickup_day_of_week"].cast("string"))

# Now we index and encode the new columns, along with Pickup Community Area
hour_indexer = StringIndexer(inputCol="pickup_hour", 
                             outputCol="pickup_hour_indexed")
day_of_week_indexer = StringIndexer(inputCol="pickup_day_of_week", 
                                    outputCol="pickup_day_of_week_indexed")
community_area_indexer = StringIndexer(inputCol="Pickup Community Area", 
                                       outputCol="Pickup Community Area Index")

hour_encoder = OneHotEncoder(inputCols=["pickup_hour_indexed"], 
                             outputCols=["pickup_hour_vec"])
day_of_week_encoder = OneHotEncoder(inputCols=["pickup_day_of_week_indexed"], 
                                    outputCols=["pickup_day_of_week_vec"])
community_area_encoder = OneHotEncoder(inputCols=["Pickup Community Area Index"], 
                                       outputCols=["pickup_community_area_vec"])

# We create a pipeline of transformations (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html)
pipeline_reg = Pipeline(stages=[hour_indexer, day_of_week_indexer, 
                            community_area_indexer, hour_encoder, 
                            day_of_week_encoder, community_area_encoder])

# Now we execute the pipeline
df_reg = pipeline_reg.fit(df_reg).transform(df_reg)
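
The sketch below illustrates, in plain Python, what the indexing and encoding stages produce under what we understand to be the default settings: `StringIndexer` assigns index 0 to the most frequent label, and `OneHotEncoder` drops the last category, so N categories become a vector of length N-1. The helper functions and the sample labels are hypothetical, and dense lists are used here for readability, whereas Spark returns sparse vectors.

```python
from collections import Counter

def string_index(values):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot_drop_last(index, n_categories):
    """Encode an index as a vector of length n_categories - 1; the last index is all zeros."""
    vec = [0.0] * (n_categories - 1)
    if index < n_categories - 1:
        vec[index] = 1.0
    return vec

# Made-up day-of-week labels (as strings, like the casted columns)
days = ["6", "6", "6", "2", "2", "7"]
mapping = string_index(days)
encoded = {label: one_hot_drop_last(i, len(mapping)) for label, i in mapping.items()}

print(mapping)  # {'6': 0, '2': 1, '7': 2}
print(encoded)  # {'6': [1.0, 0.0], '2': [0.0, 1.0], '7': [0.0, 0.0]}
```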

In [None]:
# Now we transform our features of interest into a single column vector called "features" 
reg_input_cols = ["Trip Seconds", "Trip Miles", "pickup_hour_vec", "pickup_day_of_week_vec", "pickup_community_area_vec"]
reg_assembler = VectorAssembler(inputCols=reg_input_cols, outputCol="features")
df_reg = reg_assembler.transform(df_reg)

Now, to reduce the dimensionality, we will apply Principal Component Analysis (PCA) to project the assembled feature vector onto 3 principal components, the directions that capture the most variance in the data.
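
As a reminder of what the 3 components represent, textbook PCA can be summarized as follows. The notation is introduced here for illustration ($x_i$ is the assembled feature vector of trip $i$, $d$ is its length, and $k = 3$), and implementation details such as centering may differ in Spark's `PCA`.

$$
C = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^{\top}, \qquad C\,v_j = \lambda_j v_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d
$$

$$
W_k = [\,v_1 \; v_2 \; \dots \; v_k\,], \qquad z_i = W_k^{\top}(x_i-\bar{x}), \qquad \text{variance share of component } j = \frac{\lambda_j}{\sum_{l=1}^{d}\lambda_l}
$$

Because the components follow variance, features on a large scale, such as *Trip Seconds*, can dominate them unless the inputs are standardized first. After fitting, the variance share of each component can be inspected through `pcaModel.explainedVariance`.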

In [None]:
from pyspark.ml.feature import PCA

# We apply PCA on the "features" column
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
pcaModel = pca.fit(df_reg)
df_reg_pca = pcaModel.transform(df_reg)

In [None]:
# Our final data only includes our components vector and the target column (Fare)
reg_data = df_reg_pca.select(col("pcaFeatures").alias("features"), col("Fare"))

### Training

We proceed to train our model. First, we will split the data into our training and testing datasets with an 80/20 split.

In [None]:
# We split the data into training and testing
train, test = reg_data.randomSplit([0.8, 0.2], seed=42)
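
As far as we understand `randomSplit`, the weights act as sampling probabilities rather than exact proportions, so the resulting sizes are close to 80/20 but not guaranteed to match exactly. The plain-Python sketch below mimics that behavior on made-up rows; the `random_split` helper is hypothetical and is not Spark's implementation.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(10_000))  # made-up row identifiers
train_rows, test_rows = random_split(rows, train_weight=0.8, seed=42)

# The share is close to 0.8, but usually not exactly 0.8
print(len(train_rows) / len(rows))
```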

In [None]:
# Now we train the model
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="Fare")
lr_model = lr.fit(train)

### Evaluation

To evaluate the performance of our regression model, we calculate the [Root Mean Squared Error](https://en.wikipedia.org/wiki/Root-mean-square_deviation).

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# We first create a new df with a new column named "prediction", using our LR model
predictions = lr_model.transform(test)

# Now we create an evaluator, setting the label column, the prediction column, and the metric
reg_evaluator = RegressionEvaluator(labelCol="Fare", predictionCol="prediction", metricName="rmse")

# Calculate RMSE
rmse = reg_evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE) on test data = {rmse}")
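
For reference, RMSE is the square root of the mean of the squared differences between the actual and predicted values, so it is expressed in the same unit as *Fare*. The following plain-Python sketch computes it by hand on made-up numbers; the `rmse` helper and the sample values are illustrative only.

```python
import math

def rmse(labels, predictions):
    """Root of the mean squared difference between labels and predictions."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up fares and predictions: squared errors are 1, 4, 0 and 16
actual_fares = [10.0, 12.0, 8.0, 20.0]
predicted_fares = [11.0, 10.0, 8.0, 16.0]

toy_rmse = rmse(actual_fares, predicted_fares)
print(toy_rmse)  # square root of 21 / 4, about 2.29
```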

### Saving the model for deployment

Once we are comfortable with the results of a model and want to make it available for production deployment, we can store the model with the *.save()* method and later load it in the production environment with the *.load()* method.

In [None]:
# Choosing the current path for storing. This can be any path
path = "the/path/to/lr_model"
lr_model.save(path)

In [None]:
# Now, in a production environment, we could load the model like this
from pyspark.ml.regression import LinearRegressionModel

model_path = "the/path/to/lr_model"
model = LinearRegressionModel.load(model_path)

### Explore More

The extensive collection of algorithms, frameworks and utilities that PySpark offers for Machine Learning tasks can be found in the following links:

- [MLlib (DataFrame-based)](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html)
- [MLlib (RDD-based)](https://spark.apache.org/docs/latest/api/python/reference/pyspark.mllib.html)