# ML vs. MLLib

There're two machine learning packages in PySpark, `ml` and `mllib`. `pyspark.ml` is an older library based on RDD, whereas `mllib` is a newer library based on `DataFrame`. **We recommend using `ml` whenever it is possible, and only use `mllib` is the needed feature doesn't exist in `ml`.** More differences between `ml` and `mllib` are summarized in this post: http://yuqli.com/?p=2330

# Data Structures

## Local Vector
A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.

MLlib recognizes the following types as dense vectors:

NumPy’s [array](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html)

Python’s list, e.g., [1, 2, 3]

In [48]:
import numpy as np
from pyspark.mllib.linalg import Vectors

# create through numpy array
v1 = Vectors.dense(np.array([1.0, 2.0, 3.0]))
# create through Python's list
v2 = Vectors.dense([0, 1, 2])
# create through 
v3 = Vectors.dense(0, 1, 2)
v1, v2, v3

(DenseVector([1.0, 2.0, 3.0]),
 DenseVector([0.0, 1.0, 2.0]),
 DenseVector([0.0, 1.0, 2.0]))

and the following as sparse vectors:

MLlib’s [SparseVector](https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseVector).

SciPy’s [csc_matrix (Compressed Sparse Column matrix)](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix) with a single column

In [49]:
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# Create a SparseVector through factory methods.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
# Create a SparseVector through scipy's csc_matrix
sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1))
sv1, sv2

(SparseVector(3, {0: 1.0, 2: 3.0}),
 <3x1 sparse matrix of type '<type 'numpy.float64'>'
 	with 2 stored elements in Compressed Sparse Column format>)

**We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented in Vectors to create sparse vectors.**

## LabeledPoint
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....

In [50]:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, Vectors.sparse(3, [0, 2], [1.0, 3.0]))
pos, neg

(LabeledPoint(1.0, [1.0,0.0,3.0]), LabeledPoint(0.0, (3,[0,2],[1.0,3.0])))

## Local matrix
A local matrix has integer-typed row and column indices and double-typed values, stored on a single machine. MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order, and sparse matrices, whose non-zero entry values are stored in the Compressed Sparse Column (CSC) format in column-major order. For example, the following dense matrix

\begin{pmatrix}
1.0 & 2.0 \\
3.0 & 4.0 \\
5.0 & 6.0
\end{pmatrix}

is stored in a one-dimensional array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0] with the matrix size (3, 2).

The base class of local matrices is Matrix, and we provide two implementations: DenseMatrix, and SparseMatrix. We recommend using the factory methods implemented in Matrices to create local matrices. Remember, local matrices in MLlib are stored in column-major order.

In [51]:
from pyspark.ml.linalg import Matrix, Matrices

# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
dm2 = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])

# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
dm2, sm

(DenseMatrix(3, 2, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], False),
 SparseMatrix(3, 2, [0, 1, 3], [0, 2, 1], [9.0, 6.0, 8.0], False))

## Distributed Matrix
A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Four types of distributed matrices have been implemented so far.

The basic type is called [RowMatrix](https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix). A RowMatrix is a row-oriented distributed matrix without meaningful row indices, e.g., a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge for a RowMatrix so that a single local vector can be reasonably communicated to the driver and can also be stored / operated on using a single node. 

An [IndexedRowMatrix](https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix) is similar to a RowMatrix but with row indices, which can be used for identifying rows and executing joins. 

A [CoordinateMatrix](https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.CoordinateMatrix) is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries. 

A [BlockMatrix](https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix) is a distributed matrix backed by an RDD of MatrixBlock which is a tuple of (Int, Int, Matrix).

### RowMatrix

A RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range but it should be much smaller in practice.

A [RowMatrix](https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) can be created from an RDD of vectors.

Refer to the [RowMatrix Python docs](https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) for more details on the API.

In [53]:
from pyspark.mllib.linalg.distributed import RowMatrix

# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)

# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3

# Get the rows as an RDD of vectors again.
rowsRDD = mat.rows
mat

<pyspark.mllib.linalg.distributed.RowMatrix at 0x7f2fcc4dc890>

## DataFrame

**This is different from `pandas.DataFrame`!**

[DataFrame](https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#datasets-and-dataframes) is a new data structure introduced to support `spark.ml` library.

In [54]:
# Note that we're using the ml version of Vectors
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (1.218, Vectors.dense(1.560, -0.605)),
    (2.949, Vectors.dense(0.346, 2.158)),
    (3.627, Vectors.dense(1.380, 0.231)),
    (0.273, Vectors.dense(0.520, 1.151)),
    (4.199, Vectors.dense(0.795, -0.226))], ["label", "features"])
df.show()

+-----+--------------+
|label|      features|
+-----+--------------+
|1.218| [1.56,-0.605]|
|2.949| [0.346,2.158]|
|3.627|  [1.38,0.231]|
|0.273|  [0.52,1.151]|
|4.199|[0.795,-0.226]|
+-----+--------------+



# A Regression Example - Least Square

In this tutorial, we will be using the same data set used by The Elements of Statistical Machine Learning, which comes from a study by Stamey et al. (1989) that examined the correlation between the level of prostate specific antigen (PSA) and a number of clinical measures, in 97 men who were about to receive a radical prostatectonmy.

Through this simple example, we hope you can learn the following topics:
  * Read text data into `DataFrame`;
  * View basic statistics such as correlation matrix;
  * Transform data using `StandardScaler`;
  * Train a regression model using `LinearRegression`;
  
#### Note that this tutorial is based on `pyspark.ml`, not `pyspark.mllib`.

## Read Data into DataFrame

As the first step, we will download the data file and save it locally. The file has the following format: 
```
        lcavol  lweight age     lbph    svi     lcp     gleason pgg45   lpsa    train
1       -0.579818495    2.769459        50      -1.38629436     0       -1.38629436     6         0     -0.4307829      T
2       -0.994252273    3.319626        58      -1.38629436     0       -1.38629436     6         0     -0.1625189      T
3       -0.510825624    2.691243        74      -1.38629436     0       -1.38629436     7        20     -0.1625189      T
4       -1.203972804    3.282789        58      -1.38629436     0       -1.38629436     6         0     -0.1625189      T
5        0.751416089    3.432373        62      -1.38629436     0       -1.38629436     6         0      0.3715636      T
6       -1.049822124    3.228826        50      -1.38629436     0       -1.38629436     6         0      0.7654678      T
```

There're 8 predictors (column 1--8) and the outcome is `lpsa` (column 9). This last column indicates which 67 observations were used as the 
"training set" and which 30 as the test set, as described on page 48
in the book.

In [56]:
from urllib.request import urlopen
from pyspark.mllib.regression import LabeledPoint
import google.datalab.storage as storage
bucket = storage.Bucket('mth9898-bucket')

# Read data from Google cloud storage
f = bucket.object('prostate.data')
psa_file = sc.parallelize(f.read_stream().splitlines())

Parse the first row to get the column names

In [59]:
# skip the header row
header = psa_file.first()
# split the first row to get the list of header names
feature_labels = header.split('\t')[1:-2]
print('Predictors: %s' % (feature_labels))

Predictors: ['lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp', 'gleason', 'pgg45']


In [61]:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

# read data as LabeledPoint RDDs
def parse_data_point(row):
    values = row.split('\t')
    is_train = values[-1] == 'T'
    label = float(values[-2])
    features = [float(v) for v in values[1:-2]] # skip the id column
    return Row(**{
        'train': is_train,
        'label': label,
        'features': Vectors.dense(features)
    })

df_data = sqlContext.createDataFrame(psa_file.filter(lambda x: x != header).map(parse_data_point))
print('Total rows: %s' % df_data.count())

Total rows: 97


In [62]:
df_data.show()

+--------------------+----------+-----+
|            features|     label|train|
+--------------------+----------+-----+
|[-0.579818495,2.7...|-0.4307829| true|
|[-0.994252273,3.3...|-0.1625189| true|
|[-0.510825624,2.6...|-0.1625189| true|
|[-1.203972804,3.2...|-0.1625189| true|
|[0.751416089,3.43...| 0.3715636| true|
|[-1.049822124,3.2...| 0.7654678| true|
|[0.737164066,3.47...| 0.7654678|false|
|[0.693147181,3.53...| 0.8544153| true|
|[-0.776528789,3.5...|  1.047319|false|
|[0.223143551,3.24...|  1.047319|false|
|[0.254642218,3.60...| 1.2669476| true|
|[-1.347073648,3.5...| 1.2669476| true|
|[1.613429934,3.02...| 1.2669476| true|
|[1.477048724,2.99...| 1.3480731| true|
|[1.205970807,3.44...| 1.3987169|false|
|[1.541159072,3.06...|  1.446919| true|
|[-0.415515444,3.5...| 1.4701758| true|
|[2.288486169,3.64...| 1.4929041| true|
|[-0.562118918,3.2...| 1.5581446| true|
|[0.182321557,3.82...| 1.5993876| true|
+--------------------+----------+-----+
only showing top 20 rows



## Split training/test data set

Normally we would use `data.randomSplit` to split the data into training and test sets. Since the data contains the split information, we'll just use it. 

In [65]:
df_training = df_data.filter('train')
df_testing = df_data.filter('!train')

df_training.count(), df_testing.count()

(67, 30)

## Feature Standardization

Standardizes predictors by scaling to unit variance and/or removing the mean using column summary statistics on the samples in the training set. **Note that we need to compute the mean and std on the training set, and use them to transform the testing set. **

In [67]:
from pyspark.ml.feature import StandardScaler

# scaler (an Estimator)
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=True)

# Fit the training data to produce a scaler Model (a Transformer)
scaler_model = scaler.fit(df_training)

# Normalize each feature to have unit standard deviation.
df_training_scaled = scaler_model.transform(df_training)
df_testing_scaled = scaler_model.transform(df_testing)

## Basic Statistics

In [69]:
from pyspark.ml.stat import Correlation

print(str(Correlation.corr(df_training, column='features').collect()[0][0]))

DenseMatrix([[ 1.        ,  0.30023199,  0.28632427,  0.06316772,  0.59294913,
               0.69204308,  0.42641407,  0.48316136],
             [ 0.30023199,  1.        ,  0.31672347,  0.43704154,  0.18105448,
               0.15682859,  0.02355821,  0.07416632],
             [ 0.28632427,  0.31672347,  1.        ,  0.28734645,  0.12890226,
               0.1729514 ,  0.36591512,  0.27580573],
             [ 0.06316772,  0.43704154,  0.28734645,  1.        , -0.1391468 ,
              -0.08853456,  0.03299215, -0.03040382],
             [ 0.59294913,  0.18105448,  0.12890226, -0.1391468 ,  1.        ,
               0.67124021,  0.30687537,  0.48135774],
             [ 0.69204308,  0.15682859,  0.1729514 , -0.08853456,  0.67124021,
               1.        ,  0.47643684,  0.66253335],
             [ 0.42641407,  0.02355821,  0.36591512,  0.03299215,  0.30687537,
               0.47643684,  1.        ,  0.7570565 ],
             [ 0.48316136,  0.07416632,  0.27580573, -0.03040382,  0.


## Linear Regression - a simple example

We will show how to use `LinearRegression` to train a LS model. 

In [70]:
from pyspark.ml.regression import LinearRegression

# no regularization
reg_param = 0.0
# standardization = False because we already did it
lr = LinearRegression(maxIter=10, featuresCol='scaledFeatures', 
                      regParam=reg_param, standardization=False)

In [71]:
# Fit the model
lrModel = lr.fit(df_training_scaled)

# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(np.round(lrModel.coefficients, 3)))
print("Intercept: %s" % str(round(lrModel.intercept, 3)))

Coefficients: [ 0.716  0.293 -0.143  0.212  0.31  -0.289 -0.021  0.277]
Intercept: 2.452


In [72]:
# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print('Training Summary:')
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
print('MSE = %f' % trainingSummary.meanSquaredError)
print("RMSE = %f" % trainingSummary.rootMeanSquaredError)
print("R-squared = %f" % trainingSummary.r2)
print("MAE = %f" % trainingSummary.meanAbsoluteError)
print("Explained variance = %f" % trainingSummary.explainedVariance)

Training Summary:
numIterations: 1
objectiveHistory: [0.0]
MSE = 0.439200
RMSE = 0.662721
R-squared = 0.694371
MAE = 0.498614
Explained variance = 0.997837


In [86]:
# Prediction
lrPred = lrModel.transform(df_testing_scaled)
lrPred.show()

+--------------------+---------+-----+--------------------+------------------+
|            features|    label|train|      scaledFeatures|        prediction|
+--------------------+---------+-----+--------------------+------------------+
|[0.737164066,3.47...|0.7654678|false|[-0.4638113238022...|1.9690384442936995|
|[-0.776528789,3.5...| 1.047319|false|[-1.6819865855447...|1.1699557741534656|
|[0.223143551,3.24...| 1.047319|false|[-0.8774798387409...|1.2611792855419313|
|[1.205970807,3.44...|1.3987169|false|[-0.0865295175639...|1.8837591422420163|
|[2.059238834,3.50...|1.6582281|false|[0.60015536615859...|2.5443188606403004|
|[0.385262401,3.66...|1.7316555|false|[-0.7470113808369...|1.9327540170823725|
|[1.446918983,3.12...|1.7664417|false|[0.10737845154330...|2.0423357081761413|
|[-0.400477567,3.8...|1.8164521|false|[-1.3793516789484...|1.8309162520003734|
|[0.182321557,3.80...| 2.008214|false|[-0.9103321727276...|  1.99115928590372|
|[0.009950331,3.26...|2.0215476|false|[-1.0490514397

In [84]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction")
mse = evaluator.evaluate(lrPred, {evaluator.metricName: "mse"})
r2 = evaluator.evaluate(lrPred, {evaluator.metricName: "r2"})
sde = (lrPred.rdd.map(lambda x: (x.prediction - x.label)**2).variance() / (lrPred.count() - 1))**0.5
print('Test Summary:')
print("MSE = %g" % mse)
print("R squared = %g" % r2)
print('Std Error = %g' % sde)

Test Summary:
MSE = 0.521274
R squared = 0.50338
Std Error = 0.178724


# Model Selection with Cross Validation

Now let's introduce regularization (ridge regression) and use cross-validation to select the best model

In [78]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Ridge regression (by setting elasticNetParam = 0)
rr = LinearRegression(maxIter=10, elasticNetParam=0.0, standardization=False)

# Here we just make a few wild guesses of lambdas
# It's better to calculate the optimal lambdas by inferring them from effective degree of freeom
all_lambdas = [0.1, 1.0, 10.0]
paramGrid = ParamGridBuilder() \
    .addGrid(rr.regParam, all_lambdas) \
    .build()
    
crossval = CrossValidator(estimator=rr,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(metricName='mse'),
                          numFolds=10)
# Fit the model
rrModel = crossval.fit(df_training_scaled)

In [80]:
# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(np.round(rrModel.bestModel.coefficients, 3)))
print("Intercept: %s" % str(round(rrModel.bestModel.intercept, 3)))

Coefficients: [ 0.554  0.444 -0.015  0.154  0.407 -0.11  -0.047  0.009]
Intercept: 1.033


In [81]:
# Summarize the model over the training set and print out some metrics
trainingSummary = rrModel.bestModel.summary
print('Training Summary:')
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
print('MSE = %f' % trainingSummary.meanSquaredError)
print("RMSE = %f" % trainingSummary.rootMeanSquaredError)
print("R-squared = %f" % trainingSummary.r2)
print("MAE = %f" % trainingSummary.meanAbsoluteError)
print("Explained variance = %f" % trainingSummary.explainedVariance)

Training Summary:
numIterations: 1
objectiveHistory: [0.0]
MSE = 0.458445
RMSE = 0.677085
R-squared = 0.680979
MAE = 0.523436
Explained variance = 0.860463


In [82]:
# Prediction
rrPred = rrModel.transform(df_testing_scaled)
rrPred.show()

+--------------------+---------+-----+--------------------+------------------+
|            features|    label|train|      scaledFeatures|        prediction|
+--------------------+---------+-----+--------------------+------------------+
|[0.737164066,3.47...|0.7654678|false|[-0.4638113238022...|1.9851040528449706|
|[-0.776528789,3.5...| 1.047319|false|[-1.6819865855447...|1.1236746647713867|
|[0.223143551,3.24...| 1.047319|false|[-0.8774798387409...|1.3062080912389102|
|[1.205970807,3.44...|1.3987169|false|[-0.0865295175639...|1.9236425303120877|
|[2.059238834,3.50...|1.6582281|false|[0.60015536615859...| 2.762047335548208|
|[0.385262401,3.66...|1.7316555|false|[-0.7470113808369...|1.9519164872808261|
|[1.446918983,3.12...|1.7664417|false|[0.10737845154330...|2.1151631356846217|
|[-0.400477567,3.8...|1.8164521|false|[-1.3793516789484...| 1.809175290412458|
|[0.182321557,3.80...| 2.008214|false|[-0.9103321727276...| 1.976599848919253|
|[0.009950331,3.26...|2.0215476|false|[-1.0490514397

In [83]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction")
mse = evaluator.evaluate(rrPred, {evaluator.metricName: "mse"})
r2 = evaluator.evaluate(rrPred, {evaluator.metricName: "r2"})
sde = (rrPred.rdd.map(lambda x: (x.prediction - x.label)**2).variance() / (rrPred.count() - 1))**0.5
print('Test Summary:')
print("MSE = %g" % mse)
print("R squared = %g" % r2)
print('Std Error = %g' % sde)

Test Summary:
MSE = 0.526516
R squared = 0.498386
Std Error = 0.185974


# Pipelines

## Main concepts in Pipelines

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.

[DataFrame](https://spark.apache.org/docs/2.2.0/ml-pipeline.html#dataframe): This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

[Transformer](https://spark.apache.org/docs/2.2.0/ml-pipeline.html#transformers): A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.

[Estimator](https://spark.apache.org/docs/2.2.0/ml-pipeline.html#estimators): An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.

[Pipeline](https://spark.apache.org/docs/2.2.0/ml-pipeline.html#pipeline): A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

[Parameter](https://spark.apache.org/docs/2.2.0/ml-pipeline.html#parameters): All Transformers and Estimators now share a common API for specifying parameters.

### Pipeline
<img src="https://spark.apache.org/docs/2.2.0/img/ml-Pipeline.png" alt="Drawing" style="width: 750px;"/>

### PipelineModel
<img src="https://spark.apache.org/docs/2.2.0/img/ml-PipelineModel.png" alt="Drawing" style="width: 750px;"/>

In [91]:
from pyspark.ml.feature import PCA
from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator

# standardize the data
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=True)

# extract principal components
pca = PCA(k=2, inputCol="scaledFeatures", outputCol="pcaFeatures")

# Ridge regression
rr = LinearRegression(featuresCol='features', standardization=False)

# Configure an ML pipeline, which consists of three stages: scaler, pca, and rr.
pipeline = Pipeline(stages=[scaler, pca, rr])

# build the params grid
paramGrid = ParamGridBuilder() \
    .addGrid(pca.k, [2, 3, 4]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(metricName='mse'),
                          numFolds=2)  # use more folds in practice

In [93]:
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(df_training)

In [101]:
# Prediction
cvPred = cvModel.transform(df_testing)

In [102]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction")
mse = evaluator.evaluate(cvPred, {evaluator.metricName: "mse"})
r2 = evaluator.evaluate(cvPred, {evaluator.metricName: "r2"})
sde = (cvPred.rdd.map(lambda x: (x.prediction - x.label)**2).variance() / (cvPred.count() - 1))**0.5
print('Test Summary:')
print("MSE = %g" % mse)
print("R squared = %g" % r2)
print('Std Error = %g' % sde)

Test Summary:
MSE = 0.521274
R squared = 0.50338
Std Error = 0.178724


# References
 * [Data Types - RDD-based API](https://spark.apache.org/docs/2.2.0/mllib-data-types.html)
 * [Linear Methods - RDD-based API](https://spark.apache.org/docs/2.2.0/mllib-linear-methods.html#mjx-eqn-eqregPrimal)
 * [Basic Statistics](https://spark.apache.org/docs/2.2.0/ml-statistics.html)
 * [Extracting, transforming and selecting features](https://spark.apache.org/docs/2.2.0/ml-features.html)
 * [ML Pipelines](https://spark.apache.org/docs/2.2.0/ml-pipeline.html)
 * [Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/)