# An Introduction to Machine Learning with Spark

Spark lets you build a full ML pipeline, from ingesting data through building a model to evaluating the model's output.

`spark.mllib` vs `spark.ml`:
 * `spark.mllib` is the older library that works with RDDs. It has been in maintenance mode since Spark 2.0; its removal was once planned for Spark 3.0, but it still ships with Spark 3.x.
 * `spark.ml` is the newer API built around Spark DataFrames.

**Resources:**
 * [ML Pipelines on Apache Spark Website](https://spark.apache.org/docs/latest/ml-pipeline.html)
 * [Spark Pipelines: Elegant Yet Powerful](https://blog.insightdatascience.com/spark-pipelines-elegant-yet-powerful-7be93afcdd42)
 * [Beginner’s Guide to Create End-to-End Machine Learning Pipeline in PySpark](https://towardsdatascience.com/beginners-guide-to-create-first-end-to-end-machine-learning-pipeline-in-pyspark-d3df25a08dfd)

## Import Libraries and Create Spark Session

In [1]:
from os.path import abspath
from pyspark.sql import SparkSession, HiveContext
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
import pyspark.ml.evaluation as evals
import pyspark.ml.tuning as tune
import numpy as np

In [2]:
warehouse_location = abspath('../data/spark-warehouse')
spark = SparkSession \
         .builder \
         .config("spark.sql.warehouse.dir", warehouse_location) \
         .config('spark.driver.extraJavaOptions','-Dderby.system.home=../data/tmp') \
         .enableHiveSupport() \
         .getOrCreate()

## Transformers and Estimators

The `pyspark.ml` module is built around two base classes, `Transformer` and `Estimator`:
 * The `transform()` method of a `Transformer` takes a Spark data frame and returns a new data frame, usually with one or more *transformed* columns appended.
 * The `fit()` method of an `Estimator` takes a Spark data frame and returns a model object, which is itself a `Transformer`.
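
The contract between the two classes can be illustrated without Spark. The following is a toy sketch in plain Python, not the `pyspark.ml` implementation: `ToyIndexer` and `ToyIndexerModel` are hypothetical names, rows are plain dictionaries, and labels are numbered alphabetically for simplicity.

```python
class ToyIndexerModel:
    """Plays the role of a Transformer: applies an already-learned mapping."""

    def __init__(self, mapping, input_col, output_col):
        self.mapping = mapping
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        # Return new rows with one extra column; the input rows are left untouched.
        return [
            {**row, self.output_col: self.mapping[row[self.input_col]]}
            for row in rows
        ]


class ToyIndexer:
    """Plays the role of an Estimator: learns a mapping from the data."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        # Assumption for this sketch: labels are numbered in alphabetical order.
        labels = sorted({row[self.input_col] for row in rows})
        mapping = {label: float(i) for i, label in enumerate(labels)}
        return ToyIndexerModel(mapping, self.input_col, self.output_col)


rows = [
    {"class_iris": "Iris-virginica"},
    {"class_iris": "Iris-setosa"},
    {"class_iris": "Iris-versicolor"},
]
model = ToyIndexer("class_iris", "class_iris_index").fit(rows)  # fit() returns a model
indexed = model.transform(rows)  # transform() returns new rows with the added column
print(indexed[0])
```

The point of the split is that everything learned from the data lives in the object returned by `fit()`, so the same fitted model can later be applied to other data.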

## Data Types for Modeling

Spark models work on numeric data only, so the columns of a data frame used for modeling must be **integers** or **doubles**.

When data is imported, Spark can usually infer the column types correctly if asked to (for CSV files, by passing `inferSchema=True`). If a column still has the wrong type, use the `cast()` method to convert it to a numeric type. `cast()` takes one argument naming the target type, for example the string `"integer"` or `"double"`.

### Casting `string` to Numeric

Let's read in the *Iris* data **without** `inferSchema=True`, so that all columns are read as strings:

In [3]:
df_iris = spark.read.csv("../data/raw/iris.csv", header=True)

In [4]:
df_iris

DataFrame[sepal_length_cm: string, sepal_width_cm: string, petal_length_cm: string, petal_width_cm: string, class_iris: string]

In [5]:
df_iris.show(5)

+---------------+--------------+---------------+--------------+-----------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| class_iris|
+---------------+--------------+---------------+--------------+-----------+
|            5.1|           3.5|            1.4|           0.2|Iris-setosa|
|            4.9|           3.0|            1.4|           0.2|Iris-setosa|
|            4.7|           3.2|            1.3|           0.2|Iris-setosa|
|            4.6|           3.1|            1.5|           0.2|Iris-setosa|
|            5.0|           3.6|            1.4|           0.2|Iris-setosa|
+---------------+--------------+---------------+--------------+-----------+
only showing top 5 rows



We now have a data frame `df_iris` whose first 4 columns hold numeric data stored as strings. Let's use the `cast()` and `withColumn()` methods to re-create the column `"sepal_length_cm"` with type `double` instead of `string`:

In [6]:
df_iris = df_iris.withColumn("sepal_length_cm", df_iris.sepal_length_cm.cast("double"))

Let's look at the data frame again and note that the type of the first column, `sepal_length_cm`, is now `double`:

In [7]:
df_iris

DataFrame[sepal_length_cm: double, sepal_width_cm: string, petal_length_cm: string, petal_width_cm: string, class_iris: string]

In [8]:
df_iris.show(5)

+---------------+--------------+---------------+--------------+-----------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| class_iris|
+---------------+--------------+---------------+--------------+-----------+
|            5.1|           3.5|            1.4|           0.2|Iris-setosa|
|            4.9|           3.0|            1.4|           0.2|Iris-setosa|
|            4.7|           3.2|            1.3|           0.2|Iris-setosa|
|            4.6|           3.1|            1.5|           0.2|Iris-setosa|
|            5.0|           3.6|            1.4|           0.2|Iris-setosa|
+---------------+--------------+---------------+--------------+-----------+
only showing top 5 rows



### Casting Boolean to Numeric

A boolean column should also be converted to the numeric type `integer`. For example, let's add a boolean column `sepal_length_big` to the data frame:

In [9]:
df_iris = df_iris.withColumn("sepal_length_big", df_iris.sepal_length_cm > 6.0)

In [10]:
df_iris

DataFrame[sepal_length_cm: double, sepal_width_cm: string, petal_length_cm: string, petal_width_cm: string, class_iris: string, sepal_length_big: boolean]

In [11]:
df_iris.show(5)

+---------------+--------------+---------------+--------------+-----------+----------------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| class_iris|sepal_length_big|
+---------------+--------------+---------------+--------------+-----------+----------------+
|            5.1|           3.5|            1.4|           0.2|Iris-setosa|           false|
|            4.9|           3.0|            1.4|           0.2|Iris-setosa|           false|
|            4.7|           3.2|            1.3|           0.2|Iris-setosa|           false|
|            4.6|           3.1|            1.5|           0.2|Iris-setosa|           false|
|            5.0|           3.6|            1.4|           0.2|Iris-setosa|           false|
+---------------+--------------+---------------+--------------+-----------+----------------+
only showing top 5 rows



Convert the boolean column to integer and store the result in a new column named `label` - **this is the default name for the response variable in Spark's machine learning**.

In [12]:
df_iris = df_iris.withColumn("label", df_iris.sepal_length_big.cast("integer"))

In [14]:
df_iris.show(5)

+---------------+--------------+---------------+--------------+-----------+----------------+-----+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| class_iris|sepal_length_big|label|
+---------------+--------------+---------------+--------------+-----------+----------------+-----+
|            5.1|           3.5|            1.4|           0.2|Iris-setosa|           false|    0|
|            4.9|           3.0|            1.4|           0.2|Iris-setosa|           false|    0|
|            4.7|           3.2|            1.3|           0.2|Iris-setosa|           false|    0|
|            4.6|           3.1|            1.5|           0.2|Iris-setosa|           false|    0|
|            5.0|           3.6|            1.4|           0.2|Iris-setosa|           false|    0|
+---------------+--------------+---------------+--------------+-----------+----------------+-----+
only showing top 5 rows



### One-Hot Vectors

If a column cannot easily be converted to numeric values, it is usually converted to *one-hot vectors*, in which each position represents one level of the feature. For example, the column `class_iris` has three classes (levels), so each row can be encoded as a vector that has a `1` only in the position of its class. This is done in 4 steps:
 * use `StringIndexer` to map each unique class in the column to a number (for example, in the *Iris* data it labels `Iris-setosa` as `0`, `Iris-versicolor` as `1` and `Iris-virginica` as `2`).
 * use `OneHotEncoder` to convert the index column from the previous step into a column of vectors of length 2, each with at most one `1`. Two entries are enough to label 3 classes, since (1,0) represents the 1st class, (0,1) represents the 2nd class and (0,0) represents the 3rd class.
 * use `VectorAssembler` to combine all columns required for the model into a single vector column.
 * use `Pipeline` to run all the `Estimators` and `Transformers` listed in its `stages` argument.
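
The encoding described in these steps can be reproduced in plain Python. This is a minimal sketch, not Spark code: the helper `one_hot_drop_last` is a hypothetical name, and the index order is assumed to match the example in the first step.

```python
def one_hot_drop_last(index, num_categories):
    """Encode a category index as a vector of length num_categories - 1.

    The last category gets the all-zeros vector, so no entry is needed for it.
    """
    size = num_categories - 1
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Assumed index order, as in the StringIndexer example above.
class_index = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

for name, index in class_index.items():
    print(name, one_hot_drop_last(index, len(class_index)))

# Assembling features: one numeric column followed by the encoded class.
features = [5.1] + one_hot_drop_last(class_index["Iris-setosa"], 3)
print(features)
```

Dropping the last category avoids a redundant entry: if the first two positions are both `0`, the row must belong to the third class.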

Define an object `iris_class_indexer` of the `StringIndexer` class, which creates a column of indexes for the different *Iris* types:

In [15]:
iris_class_indexer = StringIndexer(inputCol="class_iris", outputCol="class_iris_index")

Define an object `iris_class_onehotencoder` of the `OneHotEncoder` class, which creates a column of sparse vectors that label the different types of *Iris*:

In [16]:
iris_class_onehotencoder = OneHotEncoder(inputCol="class_iris_index", outputCol="class_iris_onehot")

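Note: in Spark 2.3 and 2.4 the `OneHotEncoder` transformer is deprecated in favor of `OneHotEncoderEstimator`. In Spark 3.0 the old class was removed and `OneHotEncoderEstimator` was renamed to `OneHotEncoder`, so on Spark 3.x the name `OneHotEncoder` refers to the newer estimator.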

Define an object `iris_vec_assembler` of the `VectorAssembler` class, which creates a column holding, for each row, a vector of the features used for modeling. Name this column `features` - **this is the default name for the input features in Spark's machine learning**.

In [17]:
iris_vec_assembler = VectorAssembler(inputCols=["sepal_length_cm","class_iris_onehot"], outputCol="features")

Define an object `iris_pipe` of the `Pipeline` class to combine all the steps. The input is a list of stage objects in the order in which they are applied:

In [18]:
iris_pipe = Pipeline(stages=[iris_class_indexer, iris_class_onehotencoder, iris_vec_assembler])

This is where we finally pass the data through the pipeline `iris_pipe`. Note that we have to call two methods, `fit()` and then `transform()`, both with the initial data frame `df_iris` as input:

In [19]:
iris_piped_data = iris_pipe.fit(df_iris).transform(df_iris)
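
Two calls are needed because `fit()` only learns from the data and returns a fitted pipeline model, while `transform()` applies that model. A conceptual sketch of this in plain Python follows; it is not Spark's implementation, and `AddOne`, `Center`, `CenterModel`, `fit_pipeline` and `transform_pipeline` are hypothetical names operating on plain lists of numbers.

```python
class AddOne:
    """Toy transformer: needs no fitting."""

    def transform(self, values):
        return [v + 1 for v in values]


class CenterModel:
    """Toy fitted model: subtracts a mean learned earlier."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [v - self.mean for v in values]


class Center:
    """Toy estimator: learns the mean of the values it is fitted on."""

    def fit(self, values):
        return CenterModel(sum(values) / len(values))


def fit_pipeline(stages, values):
    """Fit stages in order, feeding each stage the output of the previous one."""
    fitted = []
    current = values
    for stage in stages:
        # Estimators are fitted first; plain transformers are used as they are.
        model = stage.fit(current) if hasattr(stage, "fit") else stage
        current = model.transform(current)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, values):
    """Apply every fitted stage in order."""
    for model in fitted:
        values = model.transform(values)
    return values


train_values = [1.0, 2.0, 3.0]
fitted = fit_pipeline([AddOne(), Center()], train_values)
print(transform_pipeline(fitted, train_values))
print(transform_pipeline(fitted, [10.0]))  # reuses the mean learned during fitting
```

Keeping the fitted stages separate from the data is what allows the same learned transformation to be applied to data that was not seen during fitting.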

Let's examine the data after the transformation by printing all records:

In [20]:
iris_piped_data.show(150)

+---------------+--------------+---------------+--------------+---------------+----------------+-----+----------------+-----------------+-------------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm|     class_iris|sepal_length_big|label|class_iris_index|class_iris_onehot|     features|
+---------------+--------------+---------------+--------------+---------------+----------------+-----+----------------+-----------------+-------------+
|            5.1|           3.5|            1.4|           0.2|    Iris-setosa|           false|    0|             0.0|    (2,[0],[1.0])|[5.1,1.0,0.0]|
|            4.9|           3.0|            1.4|           0.2|    Iris-setosa|           false|    0|             0.0|    (2,[0],[1.0])|[4.9,1.0,0.0]|
|            4.7|           3.2|            1.3|           0.2|    Iris-setosa|           false|    0|             0.0|    (2,[0],[1.0])|[4.7,1.0,0.0]|
|            4.6|           3.1|            1.5|           0.2|    Iris-setosa|         

## Testing and Training Data

Only after all the data has been indexed and the feature column created can we split the data into training and test sets with the `randomSplit()` method. The method accepts a list of weights, here 2 numbers giving the fraction of training data and the fraction of test data. For example, for an 80%/20% split:

In [21]:
train, test = iris_piped_data.randomSplit([.8, .2], seed = 1)

Note: the `seed` parameter makes the split reproducible. Change or omit it to get a different random split.

In [22]:
train.groupBy().count().show()

+-----+
|count|
+-----+
|  118|
+-----+



In [23]:
test.groupBy().count().show()

+-----+
|count|
+-----+
|   32|
+-----+
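
The split sizes above (118 and 32) are close to, but not exactly, 120 and 30. `randomSplit()` treats its arguments as weights and assigns rows at random, so the resulting sizes are only approximately proportional to them. A plain-Python sketch of the idea follows; `toy_random_split` is a hypothetical helper, not Spark's implementation, and its random numbers will not match Spark's.

```python
import random


def toy_random_split(rows, weights, seed=None):
    """Assign each row to one split by an independent random draw."""
    total = sum(weights)
    # Cumulative boundaries of the normalized weights, e.g. [0.8, 1.0].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


rows = list(range(150))
train_rows, test_rows = toy_random_split(rows, [0.8, 0.2], seed=1)
print(len(train_rows), len(test_rows))  # sizes vary with the seed; only their sum is fixed
```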



## Model Building

### Logistic Regression

Create a `LogisticRegression` estimator:

In [24]:
lr = LogisticRegression()

Create a `BinaryClassificationEvaluator`:

In [25]:
ev = evals.BinaryClassificationEvaluator(metricName="areaUnderROC")

Create the parameter grid:

In [26]:
grid = tune.ParamGridBuilder()

Add hyperparameters to the grid:

In [27]:
grid = grid.addGrid(lr.regParam, np.arange(0, .1, .01))
grid = grid.addGrid(lr.elasticNetParam, [0, 1])
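
For reference, the two hyperparameters enter the model through the elastic net penalty that Spark adds to the logistic loss. The form below follows the Spark MLlib documentation, with $\lambda$ standing for `regParam`, $\alpha$ for `elasticNetParam` and $w$ for the coefficient vector (the symbols are introduced here for illustration):

```latex
\alpha \left( \lambda \lVert w \rVert_1 \right)
  + (1 - \alpha) \left( \frac{\lambda}{2} \lVert w \rVert_2^2 \right),
\qquad \alpha \in [0, 1], \quad \lambda \ge 0
```

So `elasticNetParam = 0` gives a pure L2 (ridge) penalty, `elasticNetParam = 1` gives a pure L1 (lasso) penalty, and `regParam = 0` switches regularization off.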

Build the grid:

In [28]:
grid = grid.build()

Create the `CrossValidator`:

In [29]:
cv = tune.CrossValidator(estimator=lr,
               estimatorParamMaps=grid,
               evaluator=ev
               )
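
It can help to estimate how much work the cross-validation will do before running it. The sketch below counts the parameter combinations in plain Python. It assumes that `np.arange(0, .1, .01)` yields ten values, and it uses `CrossValidator`'s default of 3 folds, since `numFolds` is not set above. The dictionary keys are illustrative labels, not Spark `Param` objects.

```python
from itertools import product

# Assumed to mirror np.arange(0, .1, .01): ten values from 0.00 to 0.09.
reg_params = [i / 100 for i in range(10)]
elastic_net_params = [0, 1]

# Every combination of the two hyperparameters, like the built grid.
param_maps = [
    {"regParam": reg, "elasticNetParam": enet}
    for reg, enet in product(reg_params, elastic_net_params)
]

num_folds = 3  # CrossValidator's default numFolds
fits_during_search = len(param_maps) * num_folds

print(len(param_maps), fits_during_search)
```

After the search, `CrossValidator` refits the best parameter combination once more on the full training set, which is the model exposed as `bestModel`.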

Fit logistic regression models using cross validation:

In [30]:
%%time
models = cv.fit(train)

Extract the best model:

In [31]:
best_lr = models.bestModel

In [32]:
print(best_lr)

LogisticRegressionModel: uid = LogisticRegression_083748035d8e, numClasses = 2, numFeatures = 3


Use the model to predict the test set:

In [33]:
test_results = best_lr.transform(test)

Evaluate the predictions by calculating the AUC (`areaUnderROC`), the metric chosen above in `evals.BinaryClassificationEvaluator(metricName="areaUnderROC")`:

In [36]:
print(ev.evaluate(test_results))

1.0


Note that the result is perfect (the closer the AUC is to 1, the better the model) because in this example the `label` column was derived from `sepal_length_cm`, and the same column was then included in `features`.
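
To make the metric concrete: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python. `toy_auc` is a hypothetical helper rather than Spark's implementation, and the scores are made up for illustration, not taken from the model above.

```python
def toy_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
print(toy_auc(labels, [0.1, 0.2, 0.8, 0.9]))  # every positive outranks every negative
print(toy_auc(labels, [0.1, 0.8, 0.2, 0.9]))  # one positive/negative pair is out of order
```

A score of 1.0 therefore means every positive example was ranked above every negative one, which is expected when a feature fully determines the label.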