# Setting up Spark

This first section prepares the notebook for running Spark: it sets the environment variables Spark needs and creates a Spark session on the cluster.

In [1]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/zulu8.78.0.19-ca-jdk8.0.412-linux_aarch64"
os.environ["SPARK_HOME"] = "/opt/spark"

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("spark://rpi0:7077") \
    .appName("MyApp") \
    .getOrCreate()
sc = spark.sparkContext

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/29 20:37:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


The cell above prepares the Spark session and the Spark context needed to use Spark. If you have not done so yet, install the Python package "findspark" by running pip on your own machine (not in this notebook): pip3 install findspark

# Exercise

You are now ready to run the actual exercise.  

The dataset describes characteristics of irises. We reduced the dataset to two species, Iris-versicolor and Iris-virginica, so that we can run a (binary) logistic regression. Additional information about this dataset can be found here: <br>
https://en.wikipedia.org/wiki/Iris_flower_data_set <br>
https://archive.ics.uci.edu/ml/datasets/Iris


The columns have the following meanings (label = 1 means the flower is an Iris-versicolor, label = 0 means it is an Iris-virginica):

| column | description |
| --- | --- |
| sl | sepal length in cm |
| sw | sepal width in cm |
| pl | petal length in cm |
| pw | petal width in cm |

In [2]:
data = spark.read.csv('hdfs://rpi0:8020/data/iris.csv', header=True, inferSchema=True)
data.show()


+---+---+---+---+-----+
| sl| sw| pl| pw|label|
+---+---+---+---+-----+
|7.0|3.2|4.7|1.4|    1|
|6.4|3.2|4.5|1.5|    1|
|6.9|3.1|4.9|1.5|    1|
|5.5|2.3|4.0|1.3|    1|
|6.5|2.8|4.6|1.5|    1|
|5.7|2.8|4.5|1.3|    1|
|6.3|3.3|4.7|1.6|    1|
|4.9|2.4|3.3|1.0|    1|
|6.6|2.9|4.6|1.3|    1|
|5.2|2.7|3.9|1.4|    1|
|5.0|2.0|3.5|1.0|    1|
|5.9|3.0|4.2|1.5|    1|
|6.0|2.2|4.0|1.0|    1|
|6.1|2.9|4.7|1.4|    1|
|5.6|2.9|3.6|1.3|    1|
|6.7|3.1|4.4|1.4|    1|
|5.6|3.0|4.5|1.5|    1|
|5.8|2.7|4.1|1.0|    1|
|6.2|2.2|4.5|1.5|    1|
|5.6|2.5|3.9|1.1|    1|
+---+---+---+---+-----+
only showing top 20 rows



                                                                                

We start by building a pipeline consisting of feature selection (via RFormula) and a logistic regression model.

In [3]:
pip install numpy

Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple
Note: you may need to restart the kernel to use updated packages.


In [4]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RFormula

rForm = RFormula()
lr = LogisticRegression(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[rForm, lr])
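
To make the role of the formula concrete: in a formula such as `label ~ sl + pw`, the left-hand side names the label column and the right-hand side lists the columns that are assembled into the feature vector. The following pure-Python sketch mimics only that splitting step. `parse_simple_formula` is a hypothetical helper, not part of Spark, and it ignores the other operators RFormula understands (such as `.`, `:` and `-`).

```python
def parse_simple_formula(formula):
    """Split 'label ~ a + b' into ('label', ['a', 'b']).

    Hypothetical helper for illustration only; handles just '~' and '+'.
    """
    label_part, feature_part = formula.split("~")
    features = [term.strip() for term in feature_part.split("+")]
    return label_part.strip(), features


label_col, feature_cols = parse_simple_formula("label ~ sl + pw")
print(label_col, feature_cols)
```

In this notebook the RFormula stage is created without a formula; the candidate formulas are supplied through the parameter grid.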

Tuning a model boils down to trying out many parameter settings, so we use a grid search, which exhaustively evaluates every combination of the parameter values we specify. After running the whole notebook, come back to this section and modify the formulas to find a better model.

For a brief introduction to RFormula, go to  
https://www.datacamp.com/tutorial/r-formula-tutorial

In [5]:
from pyspark.ml.tuning import ParamGridBuilder

params = ParamGridBuilder() \
    .addGrid(rForm.formula, ["label ~ sl", "label ~ sw", "label ~ pl", "label ~ pw"]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .addGrid(lr.regParam, [0.1, 2.0]) \
    .build()
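
The grid built above is the Cartesian product of the listed values, so every formula is paired with every elasticNetParam and every regParam. As a rough illustration in plain Python (the dictionary keys are only labels chosen for this sketch, not Spark objects), the same combinations can be enumerated like this:

```python
from itertools import product

formulas = ["label ~ sl", "label ~ sw", "label ~ pl", "label ~ pw"]
elastic_net_params = [0.0, 0.5, 1.0]
reg_params = [0.1, 2.0]

# One dictionary per candidate model, covering all combinations.
grid = [
    {"formula": f, "elasticNetParam": a, "regParam": r}
    for f, a, r in product(formulas, elastic_net_params, reg_params)
]

print(len(grid))  # 4 * 3 * 2 = 24
print(grid[0])
```

Each additional value in any list multiplies the number of models that have to be trained, which is worth keeping in mind when extending the grid.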

We need to evaluate the performance of the model. As we use logistic regression, we need to be able to evaluate a binary classification.

In [6]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

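# The metric is computed from the column named in rawPredictionCol. Here that is the
# hard 0/1 "prediction" column; the evaluator's default would be "rawPrediction".
evaluator = BinaryClassificationEvaluator(metricName="areaUnderPR", rawPredictionCol="prediction", labelCol="label")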
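
The areaUnderPR metric summarises how precision and recall trade off against each other. As a minimal sketch of those two quantities for hard 0/1 predictions (the labels and predictions below are made-up toy values, not taken from the iris data, and `precision_recall` is a hypothetical helper):

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) for binary labels and 0/1 predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


toy_labels = [1, 1, 1, 0, 0, 0]
toy_predictions = [1, 1, 0, 1, 0, 0]
precision, recall = precision_recall(toy_labels, toy_predictions)
print(precision, recall)
```

Precision is the share of predicted positives that are correct; recall is the share of actual positives that were found.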

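Using the test dataset during tuning risks overfitting the model to the test dataset, so we only use the test dataset at the very end to evaluate the final model. For model tuning we hold out a validation dataset from the training data instead. TrainValidationSplit does this with a single random split (here 75% for training and 25% for validation) and evaluates each parameter combination once. k-fold cross-validation (CrossValidator in Spark) is a more expensive alternative that is especially useful for small datasets.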

In [7]:
from pyspark.ml.tuning import TrainValidationSplit

tvs = TrainValidationSplit() \
    .setTrainRatio(0.75) \
    .setEstimatorParamMaps(params) \
    .setEstimator(pipeline) \
    .setEvaluator(evaluator)
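
The idea behind the train ratio can be sketched in plain Python: the rows are shuffled once and cut into a training part and a validation part. This is only an approximation of what Spark does. `train_validation_split` is a hypothetical helper, and Spark assigns rows randomly so its actual split sizes are only approximately 75/25.

```python
import random


def train_validation_split(rows, train_ratio, seed):
    """Shuffle rows once and cut them into (train, validation) parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 data rows
train_part, validation_part = train_validation_split(rows, 0.75, seed=1)
print(len(train_part), len(validation_part))
```

Every candidate from the parameter grid is fitted on the training part and scored on the validation part, and the best-scoring candidate is kept.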

Finally, we train the model on the training split and evaluate it on the held-out test split.

In [8]:
train, test = data.randomSplit([0.7, 0.3])

tvsFitted = tvs.fit(train)

                                                                                

In [9]:
evaluator.evaluate(tvsFitted.transform(test))

                                                                                

0.9822580645161291

We can have a closer look at the best model found during the training and access its parameters to see which parameter set came out on top. Go back to the section with the grid search and try to find a parameter set which gives a better evaluation result.

In [10]:
from pyspark.ml import PipelineModel
from pyspark.ml.feature import RFormulaModel

trainedPipeline = tvsFitted.bestModel
trainedLRFeat = trainedPipeline.stages[0]
trainedLRModel = trainedPipeline.stages[1]

print(trainedLRModel.summary.objectiveHistory)
print(trainedLRModel.coefficients)
print(trainedLRModel.extractParamMap())
print("\n")

trainedLRFeat.getFormula()


[0.684616277801305, 0.4649861276066079, 0.45283915294315713, 0.4518109068746574, 0.4517975493415185, 0.4517975361625189, 0.4517975361617237]
[-1.7299045386027792]
{Param(parent='LogisticRegression_be0cb8c7c199', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LogisticRegression_be0cb8c7c199', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_be0cb8c7c199', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto', Param(parent='LogisticRegression_be0cb8c7c199', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_be0cb8c7c199', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LogisticRegression_be0cb8c7c199', 

'label ~ pl'
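
To read the output above: the best model uses petal length (pl) as its only feature, and the single coefficient is negative, so longer petals lower the predicted probability of label = 1 (Iris-versicolor). A small sketch of that relationship follows. The intercept is not part of the printed output, so the value used here is a hypothetical placeholder, and `probability_of_label_1` is a helper defined only for this illustration.

```python
import math

# Coefficient printed above for the single feature pl (petal length in cm).
coefficient = -1.7299045386027792

# Multiplicative change in the odds of label = 1 per extra cm of petal length.
odds_ratio = math.exp(coefficient)
print(round(odds_ratio, 3))


def probability_of_label_1(pl, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + coefficient * pl)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + coefficient * pl)))


# Hypothetical intercept, used only to make the function callable.
hypothetical_intercept = 8.0
shorter = probability_of_label_1(4.0, hypothetical_intercept)
longer = probability_of_label_1(5.5, hypothetical_intercept)
print(shorter > longer)
```

In the notebook, the fitted intercept is available as `trainedLRModel.intercept` and can be substituted for the placeholder.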