### 1. Create a Spark session

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

This dataset consists of a categorical label with two values (good or bad), a categorical variable
(color), and two numerical variables. While the data is synthetic, let’s imagine that this dataset
represents a company’s customer health. The “color” column represents some categorical health
rating made by a customer service representative. The “lab” column represents the true customer
health. The other two values are some numerical measures of activity within an application (e.g.,
minutes spent on site and purchases). Suppose that we want to train a classification model where
we hope to predict a binary variable (the label) from the other values.

Let's read the synthetic dataset called simple-ml, which represents a company's customer health. It contains two categorical variables (color and lab) and two numerical variables (value1 and value2). color is the health rating assigned to a customer, and lab is that customer's actual health status.
value1 and value2 are simply two measures of a person's activity within an application, e.g. minutes spent on site and purchases.

### 2. Import data

In [2]:
df = spark.read.json("./../data/simple-ml")
df.orderBy("value2").show()

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|green|good|     1|14.386294994851129|
|green| bad|    16|14.386294994851129|
| blue| bad|     8|14.386294994851129|
| blue| bad|     8|14.386294994851129|
| blue| bad|    12|14.386294994851129|
|green| bad|    16|14.386294994851129|
|green|good|    12|14.386294994851129|
|  red|good|    35|14.386294994851129|
|  red|good|    35|14.386294994851129|
|  red| bad|     2|14.386294994851129|
|  red| bad|    16|14.386294994851129|
|  red| bad|    16|14.386294994851129|
| blue| bad|     8|14.386294994851129|
|green|good|     1|14.386294994851129|
|green|good|    12|14.386294994851129|
| blue| bad|     8|14.386294994851129|
|  red|good|    35|14.386294994851129|
| blue| bad|    12|14.386294994851129|
|  red| bad|    16|14.386294994851129|
|green|good|    12|14.386294994851129|
+-----+----+------+------------------+
only showing top 20 rows



We use RFormula (which supports a subset of R's formula operators) to select all variables, plus the interactions between value1 and color and between value2 and color.

In [3]:
from pyspark.ml.feature import RFormula
supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")

The next step is to fit the RFormula transformer to the data. Fitting lets it discover the possible values of each column, as well as whether a column is categorical or not. This returns a trained version of our transformer that we can use to actually transform our data.

In [4]:
fittedRF = supervised.fit(df)

In [5]:
preparedDF = fittedRF.transform(df)
preparedDF.show(20)

+-----+----+------+------------------+--------------------+-----+
|color| lab|value1|            value2|            features|label|
+-----+----+------+------------------+--------------------+-----+
|green|good|     1|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
| blue| bad|     8|14.386294994851129|(10,[2,3,6,9],[8....|  0.0|
| blue| bad|    12|14.386294994851129|(10,[2,3,6,9],[12...|  0.0|
|green|good|    15| 38.97187133755819|(10,[1,2,3,5,8],[...|  1.0|
|green|good|    12|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
|green| bad|    16|14.386294994851129|(10,[1,2,3,5,8],[...|  0.0|
|  red|good|    35|14.386294994851129|(10,[0,2,3,4,7],[...|  1.0|
|  red| bad|     1| 38.97187133755819|(10,[0,2,3,4,7],[...|  0.0|
|  red| bad|     2|14.386294994851129|(10,[0,2,3,4,7],[...|  0.0|
|  red| bad|    16|14.386294994851129|(10,[0,2,3,4,7],[...|  0.0|
|  red|good|    45| 38.97187133755819|(10,[0,2,3,4,7],[...|  1.0|
|green|good|     1|14.386294994851129|(10,[1,2,3,5,8],[...|  1.0|
| blue| ba

The features column holds our previously raw data, now encoded as a numeric vector.

In [6]:
preparedDF.select("features").take(5)

[Row(features=SparseVector(10, {1: 1.0, 2: 1.0, 3: 14.3863, 5: 1.0, 8: 14.3863})),
 Row(features=SparseVector(10, {2: 8.0, 3: 14.3863, 6: 8.0, 9: 14.3863})),
 Row(features=SparseVector(10, {2: 12.0, 3: 14.3863, 6: 12.0, 9: 14.3863})),
 Row(features=SparseVector(10, {1: 1.0, 2: 15.0, 3: 38.9719, 5: 15.0, 8: 38.9719})),
 Row(features=SparseVector(10, {1: 1.0, 2: 12.0, 3: 14.3863, 5: 12.0, 8: 14.3863}))]
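The sparse vectors above can be read as an indicator encoding of color followed by the two numeric columns and the interaction terms requested in the formula. The following plain-Python sketch (no Spark required) reproduces that layout as inferred from the printed output: red at index 0, green at index 1, and blue as the level without an indicator. The function name `encode_row` and the constants are assumptions made for illustration; they are not part of RFormula's API.

```python
# Hypothetical re-creation of the feature layout seen in the output above.
# The index assignments are inferred from the printed SparseVectors,
# not taken from RFormula's implementation.
COLORS = ["red", "green", "blue"]   # order inferred from the output
REFERENCE = "blue"                  # level that gets no indicator column


def encode_row(color, value1, value2):
    """Return {index: value} for the non-zero entries of the 10-dim vector."""
    pos = COLORS.index(color)
    sparse = {}
    if color != REFERENCE:
        sparse[pos] = 1.0               # indices 0-1: color indicators
    sparse[2] = float(value1)           # index 2: value1
    sparse[3] = float(value2)           # index 3: value2
    sparse[4 + pos] = float(value1)     # indices 4-6: color:value1
    sparse[7 + pos] = float(value2)     # indices 7-9: color:value2
    # A sparse vector only stores non-zero entries.
    return {i: v for i, v in sparse.items() if v != 0.0}


print(encode_row("green", 1, 14.3863))
print(encode_row("blue", 8, 14.3863))
```

The two printed dictionaries match the first two rows of the output above, which is a quick way to check that the inferred layout is at least consistent with what RFormula produced here.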

Train-test split

In [7]:
train, test = preparedDF.randomSplit([0.7, 0.3])
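
A weighted random split assigns each row to a split at random, so the resulting sizes are only approximately 70/30, and they change from run to run unless a seed is fixed (`randomSplit` accepts an optional `seed` argument). The sketch below illustrates the idea in plain Python; `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split by drawing one uniform number per row."""
    total = float(sum(weights))
    # Cumulative boundaries in [0, 1], e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train_rows, test_rows = random_split(range(100), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))  # close to 70 and 30, not exact
```

Fixing the seed makes the split repeatable, which helps when comparing models across runs.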

### 3. Develop a model

Next we actually fit a model. label and features are the default column names for estimators in Spark MLlib, so setting them explicitly below is optional.

In [8]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label",featuresCol="features")

Printing the model's parameters shows which options are available:

In [9]:
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The

In [10]:
# Fit a model

fittedLR = lr.fit(train)

# Training of the model is performed immediately.

Logically, making predictions with the model means transforming features into labels:

In [11]:
fittedLR.transform(train).select("label", "prediction").show()

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
+-----+----------+
only showing top 20 rows
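
To make "transforming features into labels" concrete, here is a minimal plain-Python sketch of how a binary logistic regression turns one feature vector into a prediction: a weighted sum, a sigmoid, and a threshold. The function `predict` and the coefficient values are made up for illustration; they are not the parameters of `fittedLR`, and the 0.5 threshold is assumed to be the default.

```python
import math


def predict(features, coefficients, intercept, threshold=0.5):
    """Return (probability of class 1, predicted label) for one feature vector."""
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    probability = 1.0 / (1.0 + math.exp(-margin))
    label = 1.0 if probability > threshold else 0.0
    return probability, label


# Made-up coefficients for a two-feature example; not the fitted model's values.
prob, label = predict([1.0, 2.0], coefficients=[0.5, -0.25], intercept=0.0)
print(prob, label)  # 0.5 0.0
```

Here the margin is exactly zero, so the probability is 0.5 and the label falls on the negative side of the threshold.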



The next step is to evaluate model performance, which is more conveniently done using Pipelines.

In [12]:
# Here we create a random split of the raw input dataset (before any RFormula transformation)
train, test = df.randomSplit([0.7, 0.3])

In [13]:
# These are the two base stages of the pipeline: a transformer and an estimator.

rForm = RFormula()
lr = LogisticRegression().setLabelCol("label").setFeaturesCol("features")
stages = [rForm, lr]

In [14]:
from pyspark.ml import Pipeline
pipeline = Pipeline().setStages(stages)

Training and evaluation

Now that we have arranged the logical pipeline, the next step is training. In our case, we won’t
train just one model (like we did previously); we will train several variations of the model by
specifying different combinations of hyperparameters that we would like Spark to test. We will
then select the best model using an Evaluator that compares their predictions on our validation
data. We can test different hyperparameters in the entire pipeline, even in the RFormula that we
use to manipulate the raw data. The following cells show how to set up that hyperparameter search:

In [15]:
from pyspark.ml.tuning import ParamGridBuilder

params = ParamGridBuilder()\
  .addGrid(rForm.formula, [
    "lab ~ . + color:value1",
    "lab ~ . + color:value1 + color:value2"])\
  .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
  .addGrid(lr.regParam, [0.1, 2.0])\
  .build()

# In our current parameter grid, three hyperparameters diverge from the defaults:
#   - two different versions of the RFormula
#   - three different options for the ElasticNet parameter
#   - two different options for the regularization parameter
# This gives us a total of 2 * 3 * 2 = 12 different combinations of these parameters,
# which means we will be training 12 different versions of logistic regression.
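
The count of 12 is simply the size of the Cartesian product of the three value lists. A small plain-Python sketch of that enumeration follows; the dictionary keys are illustrative labels, not the objects that `ParamGridBuilder` actually returns.

```python
from itertools import product

formulas = ["lab ~ . + color:value1", "lab ~ . + color:value1 + color:value2"]
elastic_net_params = [0.0, 0.5, 1.0]
reg_params = [0.1, 2.0]

# One entry per combination of the three hyperparameters.
grid = [
    {"formula": f, "elasticNetParam": e, "regParam": r}
    for f, e, r in product(formulas, elastic_net_params, reg_params)
]
print(len(grid))  # 12
```

Adding one more value to any list multiplies the number of models to train, so grids grow quickly.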

In [16]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()\
  .setMetricName("areaUnderROC")\
  .setRawPredictionCol("prediction")\
  .setLabelCol("label")

# The evaluator lets us automatically and objectively compare multiple models against the same evaluation metric.
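
As a reminder of what areaUnderROC measures: it can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The plain-Python sketch below uses that pairwise view; `area_under_roc` is a hypothetical helper, and Spark itself derives the metric from the ROC curve rather than by enumerating pairs.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


print(area_under_roc([0.0, 0.0, 1.0, 1.0], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

When the scores are hard 0/1 predictions, many pairs tie, so the metric distinguishes models less finely than it does with continuous scores.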

In [17]:
from pyspark.ml.tuning import TrainValidationSplit

tvs = TrainValidationSplit()\
  .setTrainRatio(0.75)\
  .setEstimatorParamMaps(params)\
  .setEstimator(pipeline)\
  .setEvaluator(evaluator)
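
Conceptually, a train-validation split holds out part of the training data, fits every candidate configuration on the rest, scores each on the held-out part, and keeps the best one. The sketch below shows that loop in plain Python on a toy problem. All names (`train_validation_select`, `fit`, `score`, the cut-off candidates) are hypothetical, and the final refit on all rows is an assumption about the usual procedure rather than a statement about Spark's internals.

```python
import random


def train_validation_select(rows, candidates, fit, score, train_ratio=0.75, seed=0):
    """Fit every candidate on the training part and keep the best-scoring one."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    train_part, validation_part = shuffled[:cut], shuffled[cut:]
    results = []
    for params in candidates:
        model = fit(train_part, params)
        results.append((score(model, validation_part), params))
    best_score, best_params = max(results, key=lambda pair: pair[0])
    # Assumed final step: refit the winning configuration on all rows.
    return fit(list(rows), best_params), best_params, best_score


# Toy problem: the label is 1 when x > 5; each candidate is a fixed cut-off.
rows = [(x, 1 if x > 5 else 0) for x in range(12)]
candidates = [{"cutoff": c} for c in (5, 2, 8)]


def fit(train_rows, params):
    # A real estimator would learn from train_rows; this toy one ignores them.
    return lambda x: 1 if x > params["cutoff"] else 0


def score(model, validation_rows):
    # Accuracy on the held-out rows.
    return sum(model(x) == y for x, y in validation_rows) / len(validation_rows)


model, best_params, best_score = train_validation_select(rows, candidates, fit, score)
print(best_params, best_score)  # {'cutoff': 5} 1.0
```

With a 0.75 train ratio, each of the 12 configurations in the grid above would be scored on the same held-out quarter of `train`.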

In [18]:
tvsFitted = tvs.fit(train)

And, of course, we then evaluate how it performs on the test set!

In [None]:
TBC ...