
```
---
title: Machine Learning with Spark
type:  lesson + lab + demo
duration: "1:25"
creator:
    name: David Yerrington
    city: SF
---
```
<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 10px">

#  Intro to:  Machine Learning with Spark
Week 9 | 4.3


<img src="https://snag.gy/vD04Y2.jpg" width="600">

Common cases with data preprocessing and Spark MLlib for machine learning.


## Loading Data

Working with RDDs in Spark can make loading data a bit more challenging.  The most common method provided for working with text data is `sc.textFile()`.  The data returned will be a semi-structured RDD in which each line of the file becomes one string element.
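
To make "rows of strings" concrete, here is a minimal sketch in plain Python (no Spark needed) of the parsing that each line still requires. The sample lines and the `parse_line` helper are made up for illustration; with a real RDD, the same function could be handed to `map()`, for example `sc.textFile(path).map(parse_line)`.

```python
import csv

def parse_line(line):
    """Split one raw CSV line into a list of string fields."""
    # csv.reader copes with quoted fields that contain commas
    return next(csv.reader([line]))

# Hypothetical lines, standing in for what sc.textFile() would yield
lines = [
    'Date,Store Number,Category Name',
    '11/04/2015,3717,APRICOT BRANDIES',
    '03/02/2016,2614,"WHISKIES, BLENDED"',
]

rows = [parse_line(line) for line in lines]
header, records = rows[0], rows[1:]
print(header)
print(records)
```

Every field is still a string after parsing, which is why the type cleanup later in this lesson is needed.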

In [1]:
import findspark
findspark.init()
import pyspark

In [2]:
sc = pyspark.SparkContext()

## RDD with Schema
Let's look at `SQLContext`, which we can use to read CSV files in Spark via `read.csv`.

In [30]:
from pyspark.sql import SQLContext
sqlctx = SQLContext(sc)
df = sqlctx.read.csv('../../../datasets/iowa_liquor/Iowa_Liquor_sales_sample_10pct.csv',
                     header=True)

df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Store Number: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zip Code: string (nullable = true)
 |-- County Number: string (nullable = true)
 |-- County: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Category Name: string (nullable = true)
 |-- Vendor Number: string (nullable = true)
 |-- Item Number: string (nullable = true)
 |-- Item Description: string (nullable = true)
 |-- Bottle Volume (ml): string (nullable = true)
 |-- State Bottle Cost: string (nullable = true)
 |-- State Bottle Retail: string (nullable = true)
 |-- Bottles Sold: string (nullable = true)
 |-- Sale (Dollars): string (nullable = true)
 |-- Volume Sold (Liters): string (nullable = true)
 |-- Volume Sold (Gallons): string (nullable = true)



In [5]:
df.select("Store Number").describe().show()

+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    max|
+-------+



## Update Types / Cleanup

In [6]:
df.select("Date", "Store Number", "Category Name", "Bottles Sold", "Sale (Dollars)").show(5)

+----------+------------+--------------------+------------+--------------+
|      Date|Store Number|       Category Name|Bottles Sold|Sale (Dollars)|
+----------+------------+--------------------+------------+--------------+
|11/04/2015|        3717|    APRICOT BRANDIES|          12|        $81.00|
|03/02/2016|        2614|    BLENDED WHISKIES|           2|        $41.26|
|02/11/2016|        2106|STRAIGHT BOURBON ...|          24|       $453.36|
|02/03/2016|        2501|  AMERICAN COCKTAILS|           6|        $85.50|
|08/18/2015|        3654|      VODKA 80 PROOF|          12|       $129.60|
+----------+------------+--------------------+------------+--------------+
only showing top 5 rows



In [5]:
# crap = df.select('Sale (Dollars)').na
# crap.replace("$", "").collect()

In [32]:
from pyspark.sql.types import StringType, IntegerType, DoubleType
from pyspark.sql.functions import udf, regexp_replace

# stripDollarSigns = udf(lambda s: s.replace("$", ""), DoubleType())

df = df.withColumn("Store Number", df["Store Number"].cast("integer"))\
.withColumn("Sale (Dollars)",        regexp_replace("Sale (Dollars)", "\\$", "").cast("double")) \
.withColumn("Zip Code",              df["Zip Code"].cast("integer")) \
.withColumn("County Number",         df["County Number"].cast("integer")) \
.withColumn("Vendor Number",         df["Vendor Number"].cast("integer")) \
.withColumn("Item Number",           df["Item Number"].cast("integer")) \
.withColumn("Bottle Volume (ml)",    df["Bottle Volume (ml)"].cast("integer")) \
.withColumn("State Bottle Cost",     regexp_replace("State Bottle Cost", "\\$", "")) \
.withColumn("State Bottle Retail",   regexp_replace("State Bottle Retail", "\\$", "")) \
.withColumn("Bottles Sold",          df["Bottles Sold"].cast("integer")) \
.withColumn("Volume Sold (Liters)",  df["Volume Sold (Liters)"].cast("double")) \
.withColumn("Volume Sold (Gallons)", df["Volume Sold (Gallons)"].cast("double")) \

df.printSchema()
df.show(5)

root
 |-- Date: string (nullable = true)
 |-- Store Number: integer (nullable = true)
 |-- City: string (nullable = true)
 |-- Zip Code: integer (nullable = true)
 |-- County Number: integer (nullable = true)
 |-- County: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Category Name: string (nullable = true)
 |-- Vendor Number: integer (nullable = true)
 |-- Item Number: integer (nullable = true)
 |-- Item Description: string (nullable = true)
 |-- Bottle Volume (ml): integer (nullable = true)
 |-- State Bottle Cost: string (nullable = true)
 |-- State Bottle Retail: string (nullable = true)
 |-- Bottles Sold: integer (nullable = true)
 |-- Sale (Dollars): double (nullable = true)
 |-- Volume Sold (Liters): double (nullable = true)
 |-- Volume Sold (Gallons): double (nullable = true)

+----------+------------+-----------+--------+-------------+----------+---------+--------------------+-------------+-----------+--------------------+------------------+--------------
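
As a rough, Spark-free illustration of what the `regexp_replace(...).cast("double")` step does to a single value, the sketch below strips the dollar sign and converts to a float. The `parse_dollars` helper and the sample values are hypothetical; passing `None` through is an assumption meant to mirror nullable columns.

```python
import re

def parse_dollars(value):
    """Turn a string such as '$81.00' into a float; pass None through."""
    if value is None:
        return None
    # Same pattern as in the Spark cell: a literal dollar sign
    return float(re.sub(r"\$", "", value))

sales = ["$81.00", "$41.26", None]
cleaned = [parse_dollars(s) for s in sales]
print(cleaned)
```

Note that in the cell above the two `State Bottle` columns have the dollar sign removed but are not cast, which is why they still appear as `string` in the printed schema.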

## Basic Summary Statistics
Once column types are defined, `describe()` followed by `show()` will report useful summary statistics.

In [8]:
df.select(["Zip Code", "Bottle Volume (ml)", "Bottles Sold", "Sale (Dollars)", "Volume Sold (Liters)"]).describe().show()

+-------+-----------------+------------------+-----------------+------------------+--------------------+
|summary|         Zip Code|Bottle Volume (ml)|     Bottles Sold|    Sale (Dollars)|Volume Sold (Liters)|
+-------+-----------------+------------------+-----------------+------------------+--------------------+
|  count|           270738|            270955|           270955|            270955|              270955|
|   mean|51264.20559729332| 924.8303408315033|9.871284899706593| 128.9023747485706|   8.981351183775748|
| stddev|988.9071803701167|493.08848860663403|24.04091157393874|383.02736884240466|  28.913690130072464|
|    min|            50002|                50|                1|              1.34|                 0.1|
|    max|            56201|              6000|             2508|           36392.4|              2508.0|
+-------+-----------------+------------------+-----------------+------------------+--------------------+



> The **Spark** equivalent of **Pandas** `df.describe()` is: 
> ```python 
> df.select(df.columns).describe().show()
> ```

## Quick Notes About Matrix/Vector Types in Spark

There are lots of types to familiarize yourself with inside Spark/MLlib.  In general, it is important to choose the right format when storing large, distributed matrices.

_"MLlib supports **local vectors** and matrices stored on a single machine, as well as **distributed matrices** backed by one or more RDDs. **Local vectors** and **local matrices** are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by [Breeze](http://www.scalanlp.org/). A training example used in supervised learning is called a **“labeled point”** in MLlib."_

> ### Local Vectors

> A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.
>
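
The dense and sparse formats described above can be sketched in plain Python. The helper names `to_sparse` and `to_dense` are made up for illustration; in MLlib itself the corresponding constructors are `Vectors.dense` and `Vectors.sparse` from `pyspark.mllib.linalg`.

```python
def to_sparse(dense):
    """Return (size, indices, values) for the non-zero entries of a list."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def to_dense(size, indices, values):
    """Rebuild the full list from the sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

sparse = to_sparse([1.0, 0.0, 3.0])
print(sparse)
print(to_dense(*sparse))
```

The sparse form only pays off when most entries are zero, since it stores two arrays instead of one.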

> ### Labeled Points
> A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....

Further reading about Spark matrix / vector types:
http://spark.apache.org/docs/latest/mllib-data-types.html

## Preparing our training data

First, we need to take our DataFrame and encode the training features / predictors as labeled points.  Notice that our response variable is the first parameter.

> **LabeledPoint(response, [features])**

In [9]:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel

features = ["Bottles Sold", "Sale (Dollars)", "Bottle Volume (ml)"]
response = "Volume Sold (Liters)"

X = df.rdd.map( 
    lambda row: LabeledPoint(row[response], [row[feature] for feature in features])
)
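
To see what the `map` above produces for a single row, here is a small stand-in that needs no Spark. `Point` is a hypothetical namedtuple mimicking `LabeledPoint` (label first, then features), and the row is a made-up dict; a Spark `Row` supports the same `row["column"]` lookup.

```python
from collections import namedtuple

# Stand-in for LabeledPoint: label first, then the feature list
Point = namedtuple("Point", ["label", "features"])

features = ["Bottles Sold", "Sale (Dollars)", "Bottle Volume (ml)"]
response = "Volume Sold (Liters)"

# Hypothetical row with already-cleaned numeric values
row = {
    "Bottles Sold": 12,
    "Sale (Dollars)": 81.0,
    "Bottle Volume (ml)": 750,
    "Volume Sold (Liters)": 9.0,
}

def to_point(row):
    """Build one labeled point, casting everything to float."""
    return Point(float(row[response]), [float(row[f]) for f in features])

point = to_point(row)
print(point)
```

The order of `features` matters: the model's weights come back in the same order.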

## Test / Train Split

It's also possible to do K-fold cross-validation, but here is the equivalent of Scikit-learn's `train_test_split()`:


In [10]:
# Split the data into training and test sets (30% held out for testing)
trainingData, testData = X.randomSplit([0.7, 0.3])
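
A weighted random split assigns each row to a split by chance, so the resulting sizes are only approximately 70/30. The plain-Python sketch below illustrates the idea; `random_split` is a hypothetical helper and not Spark's implementation. Spark's `randomSplit` also accepts an optional `seed` argument if you need a reproducible split.

```python
import random

def random_split(items, weights, seed=None):
    """Assign each item to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any floating point leftovers
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(item)
                break
    return splits

train, test = random_split(list(range(1000)), [0.7, 0.3], seed=42)
print(len(train), len(test))
```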

## Linear Regression in Spark

>Train a linear regression model using Stochastic Gradient
Descent (SGD). This solves the least squares regression
formulation

>    `f(weights) = 1/(2n) ||A weights - y||^2`

Read about the SGD parameters here:<br>
http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.regression.LinearRegressionWithSGD

The default `step` parameter is 1.0.  A good value for the step hyperparameter depends on the data, in particular on the scale of the features, so it will vary from one dataset to the next.  Plotting the error against the iteration number to see how SGD converges is the best way to tune this parameter.  For the sake of this example, the step size was chosen very roughly to give a cursory sense of how this model is applied.
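
To get a feel for why the step size matters, here is a toy sketch that minimises the same objective, `1/(2n) ||A weights - y||^2`, for a single weight. It uses full-batch gradient descent rather than Spark's stochastic version, and the data, function names, and step values are all made up for illustration.

```python
def loss(w, xs, ys):
    """1/(2n) * sum((w*x - y)^2): least squares with one weight, no intercept."""
    n = float(len(xs))
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2.0 * n)

def gradient_descent(xs, ys, step, iterations=20, w0=0.0):
    """Run plain gradient descent and record the loss after every update."""
    n = float(len(xs))
    w = w0
    losses = [loss(w, xs, ys)]
    for _ in range(iterations):
        grad = sum(x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= step * grad
        losses.append(loss(w, xs, ys))
    return w, losses

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated with a true weight of 2

w_small, losses_small = gradient_descent(xs, ys, step=0.1)
w_large, losses_large = gradient_descent(xs, ys, step=1.0)

print("step=0.1 -> w = %.6f, final loss = %.3g" % (w_small, losses_small[-1]))
print("step=1.0 -> w = %.3g, final loss = %.3g" % (w_large, losses_large[-1]))
```

With the small step the loss shrinks toward zero; with the large step every update overshoots and the loss grows. In this toy problem the largest stable step scales with one over the mean of the squared feature values, so features in the hundreds or thousands (such as bottle volume in ml) push the usable step down by several orders of magnitude, which is one way to see why a very small step is used below.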

In [11]:
linearModel = LinearRegressionWithSGD.train(trainingData, iterations=100, step=0.000001)



### Examine Coefficients

In [12]:
zip(features, linearModel.weights.array)

[('Bottles Sold', 0.0028388114049113112),
 ('Sale (Dollars)', 0.038752006629178624),
 ('Bottle Volume (ml)', 0.0052608986484459662)]

## Familiar Regression Metrics

In [13]:
from pyspark.mllib.evaluation import RegressionMetrics

prediObserRDD = testData.map(lambda row: (float(linearModel.predict(row.features)), row.label)).cache()
metrics = RegressionMetrics(prediObserRDD)

print """
                R2:  %.6f
Explained Variance:  %.6f
               MSE:  %.6f
              RMSE:  %.6f
""" % (metrics.r2, metrics.explainedVariance, metrics.meanSquaredError, metrics.rootMeanSquaredError)


                R2:  0.615141
Explained Variance:  233.887470
               MSE:  316.502071
              RMSE:  17.790505
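
For reference, MSE, RMSE, and R2 can be reproduced by hand from `(prediction, observation)` pairs. The sketch below uses the textbook definitions on made-up numbers; `regression_metrics` is a hypothetical helper, and explained variance is left out because its exact definition in Spark is not covered here.

```python
import math

def regression_metrics(pairs):
    """Compute MSE, RMSE and R2 from (prediction, observation) pairs."""
    n = float(len(pairs))
    mean_obs = sum(obs for _, obs in pairs) / n
    ss_res = sum((obs - pred) ** 2 for pred, obs in pairs)   # residual sum of squares
    ss_tot = sum((obs - mean_obs) ** 2 for _, obs in pairs)  # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}

# Hypothetical (prediction, observation) pairs
pairs = [(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)]
result = regression_metrics(pairs)
print(result)
```

A quick sanity check on any such output: RMSE should equal the square root of MSE, as it does in the notebook results above.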



## No Problem?

By now you might see some similarities between Pandas DataFrames and Spark DataFrames.  The nuance and complexity lie in Spark's many matrix and data types, and which of them you need is specific to your application.  Your application, in turn, largely depends on the scale of your problem and the model(s) you choose to use.

# Logistic Regression

The logistic regression model we've become familiar with is implemented in Spark in much the same way, as are many other models available in Scikit-learn.  We will attempt to demonstrate a more concise example using a prior dataset.

## Load / Clean / Apply Schema
This time, we will attempt to apply a schema when loading our data.

> Note:  If you don't see your data show up after applying a schema, it's likely that you forgot a field or that your supplied schema doesn't match your dataset 1:1.
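
One way to catch a schema mismatch before loading is to compare the CSV header with the field names you plan to declare. This is a plain-Python sketch: `schema_mismatches` is a hypothetical helper, and the sample CSV text (with an extra `Ticket` column and no `Embarked`) is made up.

```python
import csv
import io

expected_fields = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex",
    "Age", "SibSp", "Parch", "Fare", "Embarked",
]

def schema_mismatches(csv_text, expected):
    """Compare the header row of a CSV with the expected field names."""
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = [f for f in expected if f not in header]   # declared but absent
    extra = [h for h in header if h not in expected]     # present but undeclared
    return missing, extra, header == expected

# Hypothetical file contents
sample = (
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare\n"
    '1,0,3,"Braund, Mr. Owen",male,22.0,1,0,A/5 21171,7.25\n'
)

missing, extra, same_order = schema_mismatches(sample, expected_fields)
print("missing:", missing)
print("extra:", extra)
print("exact match:", same_order)
```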

In [14]:
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

schema = StructType([
    StructField("PassengerId", IntegerType()),
    StructField("Survived",    IntegerType()),
    StructField("Pclass",      IntegerType()),
    StructField("Name",        StringType()),
    StructField("Sex",         StringType()),
    StructField("Age",         DoubleType()),
    StructField("SibSp",       IntegerType()),
    StructField("Parch",       IntegerType()),
    StructField("Fare",        DoubleType()),
    StructField("Embarked",    StringType()) 
])

df = sqlctx.read.csv(
    "../../../datasets/titanic/titanic_clean.csv", header=True, mode="DROPMALFORMED", schema=schema
)

# Print schema, and then show the first 5 records in printed format
df.printSchema()
df.show(5)

# rdd.printSchema()
# df.select(df.columns).describe().show()
# df.printSchema()


# Build the model
# model = LogisticRegressionWithLBFGS.train(parsedData)


root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Embarked: string (nullable = true)

+-----------+--------+------+--------------------+------+----+-----+-----+-------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|   Fare|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+-------+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|   7.25|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|71.2833|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|  7.925|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.

##  Labeled Points

Time to create our training data and train test splits.

> **LabeledPoint(response, [features])**

In [15]:
features = ["Pclass", "Age", "SibSp", "Parch"]
response = "Survived"

X = df.rdd.map( 
    lambda row: LabeledPoint(row[response], [row[feature] for feature in features])
)

# Split the data into training and test sets (30% held out for testing)
trainingData, testData = X.randomSplit([0.7, 0.3])

## Setup Logistic Model

Just like Scikit-learn, Spark's MLlib follows a consistent pattern.  Once you learn how to apply one model, the rest quickly become familiar.

In [18]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.evaluation import BinaryClassificationMetrics
logisticModel = LogisticRegressionWithLBFGS.train(trainingData)

## Coefficients

In [17]:
zip(features, logisticModel.weights.array)

[('Pclass', -0.40109411126250538),
 ('Age', 0.0074837668128318402),
 ('SibSp', -0.052646294269188945),
 ('Parch', 0.31349198435142495)]

In [23]:
prediObserRDD = testData.map(lambda row: (float(logisticModel.predict(row.features)), row.label)).cache()

## Classification Metrics

In [26]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Overall test error (fraction of rows where prediction and label differ)
def testError(lap):
    return lap.filter(lambda (v, p): v != p).count() / float(testData.count())
    
error = testError(prediObserRDD)

print "Test Error = %s" % error

# Instantiate metrics object
metrics = BinaryClassificationMetrics(prediObserRDD)

# Area under precision-recall curve
print "Area under PR = %s" % metrics.areaUnderPR

# Area under ROC curve
print "Area under ROC = %s" % metrics.areaUnderROC

Test Error = 0.385650224215
Area under PR = 0.555840273329
Area under ROC = 0.541125541126
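
Counting the pairs where prediction and label differ gives an error rate; accuracy is its complement. The sketch below shows the relationship on made-up `(prediction, label)` pairs, with `error_rate` as a hypothetical helper.

```python
def error_rate(pairs):
    """Fraction of (prediction, label) pairs that disagree."""
    wrong = sum(1 for pred, label in pairs if pred != label)
    return wrong / float(len(pairs))

# Hypothetical (prediction, label) pairs: 2 of the 5 are wrong
pairs = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]

err = error_rate(pairs)
acc = 1.0 - err
print("error = %.2f, accuracy = %.2f" % (err, acc))
```

Also keep in mind that `BinaryClassificationMetrics` is designed for `(score, label)` pairs. Feeding it hard 0/1 predictions, as done here, leaves only one threshold to work with, so the PR and ROC areas should be read with caution.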


## Multi-Class Metrics
If we had a multinomial response, we could use `MulticlassMetrics` to get a more complete picture of model performance. It appears that the Scala and Java flavors include these metrics in the binary version of the metrics class, but the Python flavor does not.  This is here mainly as a reference point for future problems where you may want to use it.

> These metrics are highly suspect.  Only an example of the implementation with Pyspark.


In [27]:
from pyspark.mllib.evaluation import MulticlassMetrics

metrics = MulticlassMetrics(prediObserRDD)

precision = metrics.precision()
recall = metrics.recall()
f1Score = metrics.fMeasure()

print "Summary Stats" 
print "--------------------"
print "Accuracy  = %s" % metrics.accuracy
print "Precision = %s" % precision 
print "Recall    = %s" % recall 
print "F1 Score  = %s" % f1Score 

Summary Stats
--------------------
Accuracy  = 0.614349775785
Precision = 0.614349775785
Recall    = 0.614349775785
F1 Score  = 0.614349775785
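
One plausible reason the four numbers above coincide: when precision and recall are pooled over all classes (micro-averaged) in a single-label problem, every wrong prediction counts once as a false positive and once as a false negative, so both collapse to plain accuracy. Whether that is exactly what `MulticlassMetrics` computed here is an assumption; the sketch below only demonstrates the arithmetic on made-up pairs.

```python
def micro_precision_recall(pairs):
    """Pool true/false positives and false negatives over every class."""
    classes = set(p for p, _ in pairs) | set(l for _, l in pairs)
    tp = fp = fn = 0
    for c in classes:
        tp += sum(1 for p, l in pairs if p == c and l == c)
        fp += sum(1 for p, l in pairs if p == c and l != c)
        fn += sum(1 for p, l in pairs if p != c and l == c)
    return tp / float(tp + fp), tp / float(tp + fn)

# Hypothetical (prediction, label) pairs for a 3-class problem
pairs = [(0, 0), (1, 1), (2, 1), (0, 2), (2, 2), (1, 0), (1, 1)]

accuracy = sum(1 for p, l in pairs if p == l) / float(len(pairs))
precision, recall = micro_precision_recall(pairs)
print(accuracy, precision, recall)
```

Per-class precision and recall are usually more informative than these pooled values.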


## Random Forests in Pyspark

This is a very basic setup for random forests in PySpark.  One thing missing at the time of writing is that "feature importances" aren't implemented in the Python library for Spark yet; if you implement these models in Scala or Java, that metric is available as part of the API.  It's quite possible that this feature is available in the Python API but isn't documented.  Python features are usually implemented last in the Spark ecosystem.

You may recall a rant on Scala vs Python within the Spark ecosystem.  This should give you a sense of the value of learning Scala or Java for big data.  The best way to learn Spark is to re-implement what we've done in class using Spark.  Some tasks are much easier, but overall it's a little slower to implement than using Pandas + Sklearn while you are still unfamiliar with the MLlib / Spark stack. 

In [28]:
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())

print('Test Error = ' + str(testErr))
print('Learned classification forest model:')
print(model.toDebugString())

Test Error = 0.264573991031
Learned classification forest model:
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 1 <= 17.0)
     If (feature 2 <= 2.0)
      If (feature 0 <= 2.0)
       Predict: 1.0
      Else (feature 0 > 2.0)
       If (feature 3 <= 1.0)
        Predict: 1.0
       Else (feature 3 > 1.0)
        Predict: 0.0
     Else (feature 2 > 2.0)
      If (feature 3 <= 1.0)
       Predict: 0.0
      Else (feature 3 > 1.0)
       Predict: 0.0
    Else (feature 1 > 17.0)
     If (feature 0 <= 2.0)
      If (feature 1 <= 23.0)
       If (feature 0 <= 1.0)
        Predict: 0.0
       Else (feature 0 > 1.0)
        Predict: 0.0
      Else (feature 1 > 23.0)
       If (feature 2 <= 0.0)
        Predict: 1.0
       Else (feature 2 > 0.0)
        Predict: 1.0
     Else (feature 0 > 2.0)
      If (feature 1 <= 32.0)
       If (feature 1 <= 30.0)
        Predict: 0.0
       Else (feature 1 > 30.0)
        Predict: 1.0
      Else (feature 1 > 32.0)
       If (feature 
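
The debug string refers to features by position, which is hard to read. A small, Spark-free sketch for swapping the indices back to column names follows; `name_features` is a hypothetical helper, and the snippet is copied from the first lines of the tree printed above.

```python
import re

features = ["Pclass", "Age", "SibSp", "Parch"]

def name_features(debug_string, names):
    """Replace 'feature N' with the N-th column name."""
    def swap(match):
        return names[int(match.group(1))]
    # \d+ keeps multi-digit indices such as 'feature 10' intact
    return re.sub(r"feature (\d+)", swap, debug_string)

snippet = """If (feature 1 <= 17.0)
 If (feature 2 <= 2.0)
  If (feature 0 <= 2.0)
   Predict: 1.0"""

readable = name_features(snippet, features)
print(readable)
```

The same call could be applied to the full output of `model.toDebugString()`, as long as `names` is in the order used to build the labeled points.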

## What Else!?

This barely scratches the surface of what's possible with Spark + Python.  Some models implemented in MLlib are not in Scikit-learn, although Scikit-learn is generally considered the more robust toolset for analysis on a single machine (with some exceptions).  Spark requires a little more attention to types, and preprocessing can be a bit more involved, but being able to iterate quickly on a machine learning and data processing pipeline is still a great asset when building predictive models.

Some notable features:

- Pipelines
- Grid search (`ParamGridBuilder`)
- Model Loading / Saving

# Independent Practice

Load up the merged version of the wine dataset and attempt to build an entire analysis pipeline with schema, test / train split, model evaluation.

In [29]:
df = sqlctx.read.csv(
    "../../../datasets/wine_quality/winequality_merged.csv", header=True, mode="DROPMALFORMED"
)

### Attempt linear regression on one of the features as the response!