<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px"> 
# Spark Machine Learning Cluster Preparations
---

We first load the data and run a grid search, then save the best model together with its configuration and scores. Because the goal is to run the entire process on a cluster, these saved artifacts are what let us retrieve the results afterwards.

In [1]:
import pyspark as ps   
import warnings         
from pyspark.sql import SQLContext

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

## Set up the spark context

In [2]:
try:
    # try to create a SparkContext that runs locally on 4 cores
    sc = ps.SparkContext('local[4]')
    sqlContext = SQLContext(sc)
    print("Just created a SparkContext")
except ValueError:
    # give a warning if SparkContext already exists (for use inside pyspark)
    warnings.warn("SparkContext already exists in this scope")

Just created a SparkContext


## Load the data

In [3]:
spark = ps.sql.SparkSession(sc)

spark_df = spark.read.csv(
    path='data/boston_housing.csv', 
    header=True,
    mode="DROPMALFORMED",
    inferSchema=True,
    enforceSchema=False
    )

(data_train, data_test) = spark_df.randomSplit([0.7, 0.3], seed=1)

## Fit the model

In [4]:
features = [col for col in spark_df.columns if col != 'MEDV']

vectorAssembler = VectorAssembler(inputCols=features,
                                  outputCol="features")

scaler = StandardScaler(withMean=True,
                        inputCol="features",
                        outputCol="scaledfeatures")

model = LinearRegression(featuresCol=scaler.getOutputCol(),
                         labelCol='MEDV',
                         maxIter=3000,
                         regParam=0.0,
                         elasticNetParam=0.0)

pipeline = Pipeline(stages=[vectorAssembler, scaler, model])

evaluator = RegressionEvaluator(predictionCol='prediction',
                                labelCol='MEDV',
                                metricName='r2')

reg_strengths = sc.range(-4, 4).map(lambda x: 10**x).collect()

paramGrid = ParamGridBuilder() \
    .addGrid(model.regParam, reg_strengths) \
    .addGrid(model.fitIntercept, [True, False]) \
    .build()

# the actual gridsearch
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5,
                          parallelism=2)

# Run cross-validation, and choose the best set of parameters.
model_fit = crossval.fit(data_train)
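
To get a feel for how much work this grid search schedules, the following plain-Python sketch (no Spark required) rebuilds the same grid with `itertools.product` and counts the fits. The names `grid`, `n_cv_fits`, and `n_total_fits` are illustrative and not part of the notebook. The final `+ 1` assumes that `CrossValidator` refits the best configuration once on the full training set.

```python
from itertools import product

# Numerically the same values as sc.range(-4, 4).map(lambda x: 10**x).collect()
reg_strengths = [10.0 ** x for x in range(-4, 4)]
fit_intercept_options = [True, False]

# One dict per parameter combination, analogous to ParamGridBuilder().build()
grid = [
    {"regParam": reg, "fitIntercept": fit}
    for reg, fit in product(reg_strengths, fit_intercept_options)
]

num_folds = 5
n_cv_fits = len(grid) * num_folds   # one pipeline fit per combination and fold
n_total_fits = n_cv_fits + 1        # plus the final refit of the best combination

print(len(grid), n_cv_fits, n_total_fits)
```

Counting the fits this way gives a rough basis for choosing the `parallelism` value and for estimating run time before moving to a cluster.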

## Get the predictions

In [5]:
predictions_train = model_fit.transform(data_train)
predictions_test = model_fit.transform(data_test)
predictions_all = model_fit.transform(spark_df)

print(evaluator.evaluate(predictions_train))
print(evaluator.evaluate(predictions_test))
print(evaluator.evaluate(predictions_all))

0.7310936297006648
0.6934885107911399
0.7218117084843778


In [6]:
java_model = model_fit.bestModel.stages[2]._java_obj
best_parameters = {
    param.name: java_model.getOrDefault(java_model.getParam(param.name))
    for param in paramGrid[0]
}

print(best_parameters)
print()
print(java_model.explainParams())

{'regParam': 1.0, 'fitIntercept': True}

aggregationDepth: suggested depth for treeAggregate (>= 2) (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.0)
epsilon: The shape parameter to control the amount of robustness. Must be > 1.0. (default: 1.35)
featuresCol: features column name (default: features, current: scaledfeatures)
fitIntercept: whether to fit an intercept term (default: true, current: true)
labelCol: label column name (default: label, current: MEDV)
loss: The loss function to be optimized. Supported options: squaredError, huber. (Default squaredError) (default: squaredError)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 3000)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 1.0)
solver: The solver algorithm for optimization. Supported option

## Save the best model

Spending a lot of time fitting a model is of little use if we don't save it, all the more so when we work on an expensive cluster from which we have to retrieve our results.

The model is saved in a directory whose name is generated from the current time.
To verify that everything worked, we load the saved model again. Additionally, we write the scores and model parameters to a file.

In [13]:
import time
time_now = time.strftime("%Y%m%d-%H%M%S")
path = 'model-{}'.format(time_now)
model_fit.bestModel.save(path)

In [14]:
best_model = PipelineModel.load(path)

l_predictions_train = best_model.transform(data_train)
l_predictions_test = best_model.transform(data_test)
l_predictions_all = best_model.transform(spark_df)

print(evaluator.evaluate(l_predictions_train))
print(evaluator.evaluate(l_predictions_test))
print(evaluator.evaluate(l_predictions_all))

0.7310936297006648
0.6934885107911399
0.7218117084843778


In [15]:
def add_line(text, file, n_lines=1):
    return file.write(str(text)+'\n'*n_lines)

with open('results-{}.txt'.format(time_now), 'w') as file:
    add_line(best_parameters, file, n_lines=2)
    add_line(java_model.explainParams(), file, n_lines=2)
    add_line(evaluator.evaluate(predictions_train), file)
    add_line(evaluator.evaluate(predictions_test), file)
    add_line(evaluator.evaluate(predictions_all), file)
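
A plain-text file is easy to read but awkward to parse later. As an alternative sketch, the same information could be written as JSON. The helper `save_results` and the file name pattern `results-<timestamp>.json` are hypothetical. The values below are simply the ones printed earlier in this notebook; in practice you would pass `best_parameters` and the `evaluator.evaluate(...)` results instead.

```python
import json
import time


def save_results(path, best_parameters, scores):
    """Write parameters and scores to a JSON file and return the path."""
    payload = {"best_parameters": best_parameters, "r2": scores}
    with open(path, "w") as file:
        json.dump(payload, file, indent=2)
    return path


time_now = time.strftime("%Y%m%d-%H%M%S")
results_path = save_results(
    "results-{}.json".format(time_now),
    best_parameters={"regParam": 1.0, "fitIntercept": True},
    scores={
        "train": 0.7310936297006648,
        "test": 0.6934885107911399,
        "all": 0.7218117084843778,
    },
)

# Reading the file back recovers the same structure.
with open(results_path) as file:
    print(json.load(file))
```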

In the file [ml_load_model](ml_load_model.ipynb) you will find the code for loading the saved model and evaluating it on the data.

On the cluster, we don't want to go through the lengthy process of setting up Anaconda with notebook access. Instead, we simply submit a script based on the code above, which loads the data and performs the computation. You will find it [here](scripts/spark_ml_cluster.py), and the same can be done for the model loading procedure with another [script](scripts/spark_model_loader.py).

Each of them can be run from the command line. Change into the scripts directory, activate your pyspark environment, and then run the following:

```bash
spark-submit spark_ml_cluster.py
```

Once it has finished, it will have stored a folder `model-...`. You can load the model by passing the appropriate folder name to the loader script:

```bash
spark-submit spark_model_loader.py model-...
```

The print-outs are buried among a lot of verbose log output. To save them, redirect the output into a file:

```bash
spark-submit spark_model_loader.py model-... > result.txt
```
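
The redirect keeps only the print-outs because `>` captures standard output, while Spark's default logging configuration typically writes log messages to standard error. The following sketch illustrates that split without Spark. The child process is a hypothetical stand-in, not one of the scripts above: it prints one line to stdout and one to stderr, and the two streams are captured separately.

```python
import subprocess
import sys

# Stand-in for a Spark script: one result on stdout, one log line on stderr.
child_code = (
    "import sys\n"
    "print('result line')\n"
    "print('INFO log line', file=sys.stderr)\n"
)

completed = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,   # the stream that `> result.txt` would capture
    stderr=subprocess.PIPE,   # the stream that would stay on the terminal
    universal_newlines=True,
)

print("stdout:", completed.stdout.strip())
print("stderr:", completed.stderr.strip())
```

If you also want the log messages in the file, appending `2>&1` to the shell command redirects standard error to the same place.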

Note that when running the scripts on a cluster, you will have to use

```python
sc = ps.SparkContext('yarn')
```

when creating the Spark context. You should also adjust the parallelism level in the grid search (the `parallelism` argument of `CrossValidator`) to the resources available on the cluster.
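
One way to avoid editing the script when moving between a laptop and a cluster is to read the master and the parallelism from the command line. The sketch below uses only `argparse`. The option names `--master` and `--parallelism` and the function `parse_settings` are hypothetical and not taken from the linked scripts; the defaults mirror the local values used above.

```python
import argparse


def parse_settings(argv=None):
    """Parse script options; defaults match the local notebook setup."""
    parser = argparse.ArgumentParser(description="Grid search settings")
    parser.add_argument("--master", default="local[4]",
                        help="Spark master, e.g. 'local[4]' or 'yarn'")
    parser.add_argument("--parallelism", type=int, default=2,
                        help="number of models CrossValidator fits in parallel")
    return parser.parse_args(argv)


# No arguments: the local defaults are used.
local_settings = parse_settings([])
# Cluster run; 8 is an arbitrary illustrative value, not a recommendation.
cluster_settings = parse_settings(["--master", "yarn", "--parallelism", "8"])

print(local_settings.master, local_settings.parallelism)
print(cluster_settings.master, cluster_settings.parallelism)

# In the script these values would feed the existing calls, for example:
#   sc = ps.SparkContext(settings.master)
#   CrossValidator(..., parallelism=settings.parallelism)
```

Arguments placed after the script name in the `spark-submit` call are passed on to the script, in the same way as the model folder in the loader call above.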