<h1 style="text-align:center"> INFO 323: Cloud Computing and Big Data</h1>
<h2 style="text-align:center"> College of Computing and Informatics</h2>
<h2 style="text-align:center">Drexel University</h2>

<h3 style="text-align:center"> Spark's Advanced Analytics Tools</h3>
<h3 style="text-align:center"> Yuan An, PhD</h3>
<h3 style="text-align:center">Associate Professor</h3>

## Spark's Advanced Analytics Tools
Spark includes several core packages and many
external packages for performing advanced analytics. The primary package is MLlib, which provides
an interface for building machine learning pipelines.
![](https://i.imgur.com/LxNbxBY.png)

### What Is MLlib?
MLlib is a package, built on and included in Spark, that provides interfaces for gathering and cleaning
data, feature engineering and feature selection, training and tuning large-scale supervised and
unsupervised machine learning models, and using those models in production.

### When and why should you use MLlib (versus scikit-learn, TensorFlow, etc.)?
At a high level, MLlib might sound like a lot of other machine learning packages you’ve probably
heard of, such as scikit-learn for Python or the variety of R packages for performing similar tasks. 

So why should you bother with MLlib at all? There are numerous tools for performing machine learning
on a single machine, and while there are several great options to choose from, these single machine
tools do have their limits either in terms of the size of data you can train on or the processing time.

This means single-machine tools are usually complementary to MLlib. When you hit those scalability
issues, take advantage of Spark’s abilities.

There are two key use cases where you want to leverage Spark’s ability to scale. 
1. **First, you want to leverage Spark for preprocessing and feature generation to reduce the time it takes to produce training and test sets from a large amount of data. You might then use single-machine learning libraries to train on those data sets.**

2. **Second, when your input data or model size becomes too difficult or inconvenient to fit on one machine, use Spark to do the heavy lifting. Spark makes distributed machine learning very simple.**

An important caveat to all of this is that while training and data preparation are made simple, there
are still some complexities you will need to keep in mind, especially when it comes to deploying a
trained model. For example, **Spark does not provide a built-in way to serve low-latency predictions
from a model, so you may want to export the model to another serving system or a custom application
to do that.**

MLlib is generally designed to allow inspecting and exporting models to other tools where
possible.

## Data Types for Spark ML
MLlib has several lower-level data types you may need to work with, Vector being the most common.
Whenever we pass a set of features into a machine learning model, we must do it as a vector of Doubles.

This vector can be either sparse (where most of the elements are zero) or dense (where there are many
unique values). Vectors are created in different ways. To create a dense vector, we can specify an
array of all the values. To create a sparse vector, we can specify the total size and the indices and
values of the non-zero elements. Sparse is the best format, as you might have guessed, when the
majority of values are zero as this is a more compressed representation. Here is an example of how to
manually create a Vector:

In [0]:
# Create a dense vector with Spark MLlib
from pyspark.ml.linalg import Vectors
denseVec = Vectors.dense(1.0, 2.0, 3.0)

In [0]:
size = 3
idx = [1, 2] # locations of non-zero elements in vector
values = [2.0, 3.0]
sparseVec = Vectors.sparse(size, idx, values)

In [0]:
denseVec

Out[4]: DenseVector([1.0, 2.0, 3.0])

In [0]:
sparseVec

Out[5]: SparseVector(3, {1: 2.0, 2: 3.0})
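
To make the two representations concrete, here is a minimal plain-Python sketch (no Spark required) that expands the size/indices/values triple used above into the equivalent dense list. The helper name `sparse_to_dense` is hypothetical and is not part of MLlib.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse description into a dense list."""
    dense = [0.0] * size          # every position defaults to zero
    for i, v in zip(indices, values):
        dense[i] = float(v)       # fill in only the non-zero entries
    return dense


# Same triple as sparseVec above: size 3, non-zeros at positions 1 and 2
dense_from_sparse = sparse_to_dense(3, [1, 2], [2.0, 3.0])
print(dense_from_sparse)
```

Note that sparseVec therefore encodes a vector whose first element is zero, so it is not the same vector as denseVec.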

## MLlib in Action
We’ll use a small synthetic dataset that will
help illustrate Spark's ML.

In [0]:
filepath = "dbfs:/FileStore/tables/simple-ml"
df = spark.read.json(filepath)

In [0]:
df.printSchema()

In [0]:
df.count()

In [0]:
df.show(5)

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|green|good|     1|14.386294994851129|
| blue| bad|     8|14.386294994851129|
| blue| bad|    12|14.386294994851129|
|green|good|    15| 38.97187133755819|
|green|good|    12|14.386294994851129|
+-----+----+------+------------------+
only showing top 5 rows



In [0]:
df.orderBy("value2").show(5)

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|  red|good|    35|14.386294994851129|
| blue| bad|    12|14.386294994851129|
|  red| bad|     2|14.386294994851129|
| blue| bad|     8|14.386294994851129|
|  red| bad|    16|14.386294994851129|
+-----+----+------+------------------+
only showing top 5 rows



This dataset consists of a categorical label with two values (good or bad), a categorical variable
(color), and two numerical variables. Suppose that we want to train a classification model where we
hope to predict a binary variable—the label—from the other values.

## Feature Engineering with Transformers
When we use MLlib, all inputs to machine learning algorithms in Spark must consist of type Double (for labels) and Vector[Double] (for features).

The current dataset does not meet that requirement and therefore we need to transform it to the proper
format.

To achieve this in our example, we are going to specify an RFormula. This is a declarative language
for specifying machine learning transformations and is simple to use once you understand the syntax.

RFormula supports a limited subset of the R operators that in practice work quite well for simple
models and manipulations. The
basic RFormula operators are:

- `~`: separates the target from the terms.
- `+`: concatenates terms; `+ 0` removes the intercept (the y-intercept of the line that we will fit will be 0).
- `-`: removes a term; `- 1` removes the intercept (the y-intercept of the line that we will fit will be 0; yes, this does the same thing as `+ 0`).
- `:`: interaction (multiplication for numeric values, or binarized categorical values).
- `.`: all columns except the target/dependent variable.

To specify transformations with this syntax, we need to import the relevant class and then define our
formula. We start with a single interaction between color and value1.

Later, we use all available variables (the .) and add the interactions between value1 and color and
between value2 and color, treating those as new features:

In [0]:
df.select('color').distinct().show()

+-----+
|color|
+-----+
|green|
|  red|
| blue|
+-----+



In [0]:
df.select('value1').distinct().show()

+------+
|value1|
+------+
|     1|
|    12|
|     8|
|    35|
|     2|
|    15|
|    16|
|    45|
+------+



In [0]:
from pyspark.ml.feature import RFormula

In [0]:
supervised = RFormula(formula="lab ~ color:value1")

In [0]:
fittedRF = supervised.fit(df)

In [0]:
preparedDF = fittedRF.transform(df)
preparedDF.show(10, False)

+-----+----+------+------------------+--------------+-----+
|color|lab |value1|value2            |features      |label|
+-----+----+------+------------------+--------------+-----+
|green|good|1     |14.386294994851129|[0.0,1.0,0.0] |1.0  |
|blue |bad |8     |14.386294994851129|[0.0,0.0,8.0] |0.0  |
|blue |bad |12    |14.386294994851129|[0.0,0.0,12.0]|0.0  |
|green|good|15    |38.97187133755819 |[0.0,15.0,0.0]|1.0  |
|green|good|12    |14.386294994851129|[0.0,12.0,0.0]|1.0  |
|green|bad |16    |14.386294994851129|[0.0,16.0,0.0]|0.0  |
|red  |good|35    |14.386294994851129|[35.0,0.0,0.0]|1.0  |
|red  |bad |1     |38.97187133755819 |[1.0,0.0,0.0] |0.0  |
|red  |bad |2     |14.386294994851129|[2.0,0.0,0.0] |0.0  |
|red  |bad |16    |14.386294994851129|[16.0,0.0,0.0]|0.0  |
+-----+----+------+------------------+--------------+-----+
only showing top 10 rows
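
The features column above can be reproduced by hand. The following plain-Python sketch illustrates what the color:value1 interaction computes; it is not MLlib's implementation. It one-hot encodes the color and multiplies the result by value1. The function name is hypothetical, and the category order (red, green, blue) is an assumption read off the output above.

```python
def interaction(color, value, categories=("red", "green", "blue")):
    """Multiply a one-hot encoding of `color` by the numeric `value`."""
    return [float(value) if color == c else 0.0 for c in categories]


print(interaction("green", 1))   # compare with the first row above
print(interaction("blue", 8))    # compare with the second row above
```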



In [0]:
supervised = RFormula(formula="lab~.+color:value1+color:value2")

At this point, we have declaratively specified how we would like to change our data into what we
will train our model on. The next step is to fit the RFormula transformer to the data to let it discover
the possible values of each column. Not all transformers have this requirement but because RFormula
will automatically handle categorical variables for us, it needs to determine which columns are
categorical and which are not, as well as what the distinct values of the categorical columns are. For
this reason, we have to call the fit method. Once we call fit, it returns a “trained” version of our
transformer we can then use to actually transform our data.
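
The fit-then-transform pattern can be illustrated with a toy example in plain Python. This is only a sketch of the idea: the class names `ToyIndexer` and `ToyIndexerModel` are hypothetical, and the alphabetical ordering of categories is an arbitrary choice for the example, not a description of what MLlib does.

```python
class ToyIndexerModel:
    """The "trained" transformer: holds the categories discovered by fit."""

    def __init__(self, index_of):
        self.index_of = index_of

    def transform(self, rows):
        # Replace each category with the numeric index learned during fit
        return [float(self.index_of[r]) for r in rows]


class ToyIndexer:
    """The estimator: it must see the data before it can transform anything."""

    def fit(self, rows):
        categories = sorted(set(rows))  # discover the distinct values
        return ToyIndexerModel({c: i for i, c in enumerate(categories)})


colors = ["green", "blue", "blue", "red"]
model = ToyIndexer().fit(colors)
print(model.transform(colors))
```

The estimator itself stores nothing about the data; everything learned lives in the object returned by fit, which is why the fitted object, not the original one, is used for transform.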

In [0]:
supervised = RFormula(formula="lab ~ color:value1+color:value2")

In [0]:
fittedRF = supervised.fit(df)

In [0]:
preparedDF = fittedRF.transform(df)

In [0]:
preparedDF.show(10, False)

+-----+----+------+------------------+-----------------------------------+-----+
|color|lab |value1|value2            |features                           |label|
+-----+----+------+------------------+-----------------------------------+-----+
|green|good|1     |14.386294994851129|(6,[1,4],[1.0,14.386294994851129]) |1.0  |
|blue |bad |8     |14.386294994851129|(6,[2,5],[8.0,14.386294994851129]) |0.0  |
|blue |bad |12    |14.386294994851129|(6,[2,5],[12.0,14.386294994851129])|0.0  |
|green|good|15    |38.97187133755819 |(6,[1,4],[15.0,38.97187133755819]) |1.0  |
|green|good|12    |14.386294994851129|(6,[1,4],[12.0,14.386294994851129])|1.0  |
|green|bad |16    |14.386294994851129|(6,[1,4],[16.0,14.386294994851129])|0.0  |
|red  |good|35    |14.386294994851129|(6,[0,3],[35.0,14.386294994851129])|1.0  |
|red  |bad |1     |38.97187133755819 |(6,[0,3],[1.0,38.97187133755819])  |0.0  |
|red  |bad |2     |14.386294994851129|(6,[0,3],[2.0,14.386294994851129]) |0.0  |
|red  |bad |16    |14.386294

The fitted RFormula assigns a numerical value to each possible color category, creates additional
features for the interactions between color and value1/value2, and puts them all into a
single vector. Calling transform on that fitted object converts our input data into the
expected output format.

Let’s now create a simple test set based on a random split of the data.

In [0]:
preparedDF.first()

Out[16]: Row(color='green', lab='good', value1=1, value2=14.386294994851129, features=SparseVector(10, {1: 1.0, 2: 1.0, 3: 14.3863, 5: 1.0, 8: 14.3863}), label=1.0)

In [0]:
preparedDF.take(3)

Out[17]: [Row(color='green', lab='good', value1=1, value2=14.386294994851129, features=SparseVector(10, {1: 1.0, 2: 1.0, 3: 14.3863, 5: 1.0, 8: 14.3863}), label=1.0),
 Row(color='blue', lab='bad', value1=8, value2=14.386294994851129, features=SparseVector(10, {2: 8.0, 3: 14.3863, 6: 8.0, 9: 14.3863}), label=0.0),
 Row(color='blue', lab='bad', value1=12, value2=14.386294994851129, features=SparseVector(10, {2: 12.0, 3: 14.3863, 6: 12.0, 9: 14.3863}), label=0.0)]

In [0]:
train, test = preparedDF.randomSplit([0.7, 0.3])
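
A random split like this assigns each row to a subset independently, so the resulting sizes are only approximately 70/30 and change from run to run unless a seed is fixed (randomSplit accepts an optional seed argument). The plain-Python sketch below illustrates the idea; `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one of len(weights) subsets with the given proportions."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # one independent draw per row
        for k, upper in enumerate(bounds):
            if u < upper:
                splits[k].append(row)
                break
    return splits


train_rows, test_rows = random_split(range(100), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```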

## Modeling
Now that we have transformed our data into the correct format and created some valuable features,
it’s time to actually fit our model. To create our classifier, we instantiate LogisticRegression with the
default configuration (hyperparameters). We then set the label column and the feature column; the
column names we are setting, label and features, are actually the defaults for all
estimators in Spark MLlib, so they can be omitted:

In [0]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label",featuresCol="features")

Before we actually go about training this model, let’s inspect the parameters. This is also a great way
to remind yourself of the options available for each particular model:

In [0]:
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The

The output explains every parameter of
Spark’s implementation of logistic regression. The explainParams method exists on all algorithms
available in MLlib.

In [0]:
fittedLR = lr.fit(train)

Once training is complete, you can use the model to make predictions. Logically this means transforming features
into labels. We make predictions with the transform method. For example, we can transform our
training dataset to see what labels our model assigned to the training data and how those compare to
the true outputs. The result, again, is just another DataFrame we can manipulate. Let’s perform that
prediction with the following code snippet:

In [0]:
fittedLR.transform(train).select("label", "prediction").show()

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 20 rows
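
For a binary logistic regression, turning features into a label amounts to a weighted sum passed through the sigmoid function and compared with a threshold (0.5 by default in MLlib). The sketch below uses made-up coefficients purely for illustration; they are not the values learned by fittedLR, and `predict` is a hypothetical helper.

```python
import math


def predict(features, coefficients, intercept, threshold=0.5):
    """Return (probability of class 1, predicted label) for one feature vector."""
    margin = sum(w * x for w, x in zip(coefficients, features)) + intercept
    probability = 1.0 / (1.0 + math.exp(-margin))  # sigmoid
    return probability, 1.0 if probability > threshold else 0.0


# Hypothetical coefficients and intercept, chosen only for this example
prob, label = predict([0.0, 1.0, 0.0], coefficients=[0.1, 0.2, -0.3], intercept=-1.0)
print(prob, label)
```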



Our next step would be to manually evaluate this model and calculate performance metrics like the
true positive rate, false negative rate, and so on. We might then turn around and try a different set of
parameters to see if those perform better. However, while this is a useful process, it can also be quite
tedious. Spark helps you avoid manually trying different models and evaluation criteria by allowing
you to specify your workload as a declarative pipeline of work that includes all your transformations
as well as tuning your hyperparameters.
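
As a sketch of that manual evaluation, the plain-Python function below computes the true positive rate and false negative rate from (label, prediction) pairs such as those shown in the DataFrame above. The function name and the sample pairs are made up for illustration.

```python
def positive_rates(pairs):
    """Return (true positive rate, false negative rate) for (label, prediction) pairs."""
    tp = sum(1 for label, pred in pairs if label == 1.0 and pred == 1.0)
    fn = sum(1 for label, pred in pairs if label == 1.0 and pred == 0.0)
    positives = tp + fn
    if positives == 0:
        raise ValueError("no positive labels, so the rates are undefined")
    return tp / positives, fn / positives


# Made-up pairs: three actual positives, one of which the model missed
sample = [(1.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0), (0.0, 1.0)]
tpr, fnr = positive_rates(sample)
print(tpr, fnr)
```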

## Pipelining Our Workflow
As you probably noticed, if you are performing a lot of transformations, writing all the steps and
keeping track of DataFrames ends up being quite tedious. That’s why Spark includes the Pipeline
concept. A pipeline allows you to set up a dataflow of the relevant transformations that ends with an
estimator that is automatically tuned according to your specifications, resulting in a tuned model ready
for use.

In [0]:
# Create held out set
train, test = df.randomSplit([0.7, 0.3])

In [0]:
train.show(3)

+-----+---+------+------------------+
|color|lab|value1|            value2|
+-----+---+------+------------------+
| blue|bad|     8|14.386294994851129|
| blue|bad|     8|14.386294994851129|
| blue|bad|     8|14.386294994851129|
+-----+---+------+------------------+
only showing top 3 rows



In [0]:
test.show(3)

+-----+---+------+------------------+
|color|lab|value1|            value2|
+-----+---+------+------------------+
| blue|bad|     8|14.386294994851129|
| blue|bad|     8|14.386294994851129|
| blue|bad|    12|14.386294994851129|
+-----+---+------+------------------+
only showing top 3 rows



In [0]:
# Create the two pipeline stages: an RFormula and a LogisticRegression (both must be fit)
rForm = RFormula()
lr = LogisticRegression().setLabelCol("label").setFeaturesCol("features")

Now instead of manually using
our transformations and then tuning our model we just make them stages in the overall pipeline, as in
the following code snippet:

In [0]:
from pyspark.ml import Pipeline
stages = [rForm, lr]
pipeline = Pipeline().setStages(stages)

### Training and Evaluation
Now that you have arranged the logical pipeline, the next step is training. In our case, we won’t train just
one model (like we did previously); we will train several variations of the model by specifying
different combinations of hyperparameters that we would like Spark to test. We will then select the
best model using an Evaluator that compares their predictions on our validation data. We can test
different hyperparameters in the entire pipeline, even in the RFormula that we use to manipulate the
raw data.

In [0]:
from pyspark.ml.tuning import ParamGridBuilder
params = ParamGridBuilder()\
.addGrid(rForm.formula, [
"lab ~ . + color:value1",
"lab ~ . + color:value1 + color:value2"])\
.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
.addGrid(lr.regParam, [0.1, 2.0])\
.build()
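
The grid is the Cartesian product of the listed values. A plain-Python sketch with itertools.product, independent of ParamGridBuilder, shows how the combinations are counted; the dictionary keys simply mirror the parameter names used above.

```python
from itertools import product

formulas = ["lab ~ . + color:value1", "lab ~ . + color:value1 + color:value2"]
elastic_net_values = [0.0, 0.5, 1.0]
reg_values = [0.1, 2.0]

# One dictionary per combination of formula, elasticNetParam, and regParam
grid = [
    {"formula": f, "elasticNetParam": e, "regParam": r}
    for f, e, r in product(formulas, elastic_net_values, reg_values)
]
print(len(grid))  # 2 * 3 * 2 = 12
```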

With two formulas, three elasticNetParam values, and two regParam values, the grid defines
2 × 3 × 2 = 12 different logistic regression models. The evaluator allows us to
automatically and objectively compare multiple models against the same evaluation metric. We will use the
BinaryClassificationEvaluator, which has a number of potential evaluation metrics. In this case we will use areaUnderROC, which is the total area under the
receiver operating characteristic curve, a common measure of classification performance:

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
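# Note: this scores the hard 0/1 "prediction" column; the evaluator's default
# rawPredictionCol, "rawPrediction", would use the model's continuous scores instead.
evaluator = BinaryClassificationEvaluator()\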
.setMetricName("areaUnderROC")\
.setRawPredictionCol("prediction")\
.setLabelCol("label")
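
Area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting one half. The following plain-Python sketch computes it by that pairwise definition. It is not how Spark computes the metric, the function name is hypothetical, and the scores are made up.

```python
def area_under_roc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


labels = [0.0, 0.0, 1.0, 1.0]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up model scores
print(area_under_roc(labels, scores))  # 3 of the 4 pairs are ordered correctly
```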

Spark provides two options for performing
hyperparameter tuning automatically. We can use TrainValidationSplit, which will simply
perform an arbitrary random split of our data into two different groups, or CrossValidator, which
performs K-fold cross-validation by splitting the dataset into k non-overlapping, randomly partitioned
folds:

In [0]:
from pyspark.ml.tuning import TrainValidationSplit
tvs = TrainValidationSplit()\
.setTrainRatio(0.75)\
.setEstimatorParamMaps(params)\
.setEstimator(pipeline)\
.setEvaluator(evaluator)
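
For comparison with the single split used here, k-fold cross-validation partitions the rows into k non-overlapping folds and uses each fold once for validation while training on the rest. The plain-Python sketch below shows only the index bookkeeping; `k_fold_indices` is a hypothetical helper, not the CrossValidator implementation.

```python
import random


def k_fold_indices(n_rows, k, seed=None):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)          # random partitioning
    folds = [indices[i::k] for i in range(k)]     # k disjoint folds
    for i, validation in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


for train_idx, val_idx in k_fold_indices(10, k=3, seed=0):
    print(len(train_idx), len(val_idx))
```

Every row is used for validation exactly once, which is why cross-validation costs roughly k times as much training as a single train/validation split.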

In [0]:
# Fit and evaluate the models. tvs stands for trainValidationSplit
tvsFitted = tvs.fit(train)

In [0]:
evaluator.evaluate(tvsFitted.transform(test))

Out[46]: 1.0

#### Evaluate the single model trained earlier

In [0]:
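# Re-split preparedDF; with no fixed seed, this test set can include rows that fittedLR was trained on
train, test = preparedDF.randomSplit([0.7, 0.3])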

In [0]:
evaluator.evaluate(fittedLR.transform(test))

Out[48]: 1.0