# Binary Classification with Spark ML Lab

### In this notebook, we will explore binary classification using Spark ML. We will use Spark ML's high-level APIs, built on top of DataFrames, to create and tune machine learning pipelines. Spark ML Pipelines combine multiple algorithms into a single pipeline or workflow. We will make heavy use of Spark ML's feature transformers to convert, modify, and scale the features used to develop the machine learning model. Finally, we will evaluate and cross-validate our model to demonstrate the process of selecting a best-fit model.

### The binary classification demo uses the famous Titanic dataset, which has been used for Kaggle competitions and is available at the link below. There is no need to download the data manually, as it is downloaded directly within the notebook.
https://www.kaggle.com/c/titanic/data


### The Titanic data set was chosen for this binary classification demonstration because it contains both text-based and numeric features, some continuous and some categorical. This gives us the opportunity to explore and use a number of the feature transformers available in Spark ML.
     
          
               
               
    


![IBM Logo](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSzlUYaJ9xykGC-N5PijcV_eDBGCXy_pMn7sy6ymrVypmJ22q5ZmA)

## Verify Spark version and existence of Spark and Spark SQL contexts

In [None]:
print('The spark version is {}.'.format(spark.version))

## Import required Spark libraries

In [None]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import Bucketizer
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.feature import Normalizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

## Download Data

In [None]:
!rm -f Titanic.csv
!wget https://ibm.box.com/shared/static/crceca9g1ym3nl0hwaxa5c0j0m3e19l8.csv -O Titanic.csv -q
!ls -l Titanic.csv

## Read data in as a DataFrame
### Source data is in CSV format and includes a header. We will ask Spark to infer the schema/data types.
### Drop unwanted columns and rows with null or invalid data.

In [None]:
loadTitanicData = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("Titanic.csv")
TitanicData = loadTitanicData.drop("PassengerId").drop("Name").drop("Ticket").drop("Cabin").dropna(how="any", subset=("Age", "Embarked"))

## We will use the 'Survived' column as a label for training the machine learning model
#### Spark ML requires that the labels are of data type Double, so we will cast the column to Double (it was inferred as Integer when read into Spark).

In [None]:
LabeledTitanicData = (TitanicData.withColumn("SurvivedTemp", TitanicData["Survived"]
    .cast("Double")).drop("Survived")
    .withColumnRenamed("SurvivedTemp", "Survived"))

## Show the labeled data
<div class="panel-group" id="accordion-1">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-1" href="#collapse-1">
        Hint</a>
      </h4>
    </div>
    <div id="collapse-1" class="panel-collapse collapse">
      <div class="panel-body">Type (or copy) the following in the cell below: <br>
          LabeledTitanicData.show()<br>
      </div>
    </div>
  </div>
</div>

In [None]:
# Print out a sample of the DataFrame 'LabeledTitanicData'


In [None]:
print('The total number of rows is {}.'.format(LabeledTitanicData.count()))
print('The number of rows labeled Not Survived is {}.'.format(LabeledTitanicData.filter(LabeledTitanicData['Survived'] == 0).count()))
print('The number of rows labeled Survived is {}.'.format(LabeledTitanicData.filter(LabeledTitanicData['Survived'] == 1).count()))

## Show the schema of the data
<div class="panel-group" id="accordion-2">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-2" href="#collapse-2">
        Hint</a>
      </h4>
    </div>
    <div id="collapse-2" class="panel-collapse collapse">
      <div class="panel-body">Type (or copy) the following in the cell below: <br>
          LabeledTitanicData.printSchema()<br>
      </div>
    </div>
  </div>
</div>

In [None]:
# Show the DataFrame schema


## StringIndexer

### StringIndexer is a transformer that encodes a string column to a column of indices. The indices are ordered by value frequencies, so the most frequent value gets index 0. If the input column is numeric, it is cast to string first. 

### For the Titanic data set, we will index the Sex/gender column as well as the Embarked column, which specifies the port at which the passenger boarded the ship.
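
The frequency-based ordering described above can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: the helper name `fit_string_index` and the sample values are hypothetical, and breaking ties alphabetically is an assumption made here only to keep the result deterministic.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (an assumption of this sketch).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(index) for index, value in enumerate(ordered)}

# Hypothetical sample of the Embarked column
embarked = ["S", "C", "S", "Q", "S", "C"]
mapping = fit_string_index(embarked)
indexed = [mapping[v] for v in embarked]

print(mapping)   # "S" is the most frequent value, so it receives index 0.0
print(indexed)
```

Because the mapping is learned from the data, an indexer has to be fitted before it can transform rows, which is why it behaves as an estimator inside a pipeline.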

In [None]:
SexIndexer = StringIndexer(inputCol="Sex", outputCol="SexIndex")
EmbarkedIndexer = StringIndexer(inputCol="Embarked", outputCol="EmbarkedIndex")

## Bucketizer is a transformer that transforms a column of continuous features to a column of feature buckets, where the buckets are specified by a splits parameter.

### For the Titanic data set, we will bucketize the Age and Fare features.
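
To make the role of the splits concrete, here is a small plain-Python sketch of how a value could be mapped to a bucket index. It is an illustration rather than Spark's code: the function name `bucketize` is hypothetical, and it assumes that each bucket covers the half-open range from one split up to, but not including, the next, with the last bucket also including its upper bound.

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i + 1]) that contains value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value {} is outside the splits".format(value))
    # The last bucket also includes its upper bound.
    if value == splits[-1]:
        return float(len(splits) - 2)
    # Count how many split points are <= value, then shift to a zero-based bucket index.
    return float(bisect_right(splits, value) - 1)

age_splits = [0.0, 6.0, 12.0, 18.0, 40.0, 65.0, 80.0, float("inf")]
age_buckets = [bucketize(age, age_splits) for age in [4.0, 6.0, 29.0, 80.0]]
print(age_buckets)
```

Note that n + 1 split points define n buckets, and that using `float("inf")` as the final split keeps unusually large values inside the last bucket instead of out of range.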

In [None]:
AgeBucketSplits = [0.0, 6.0, 12.0, 18.0, 40.0, 65.0, 80.0, float("inf")]
AgeBucket = Bucketizer(splits=AgeBucketSplits, inputCol="Age", outputCol="AgeBucket")

FareBucketSplits = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 80.0, 100.0, float("inf")]
FareBucket = Bucketizer(splits=FareBucketSplits, inputCol="Fare", outputCol="FareBucket")

## VectorAssembler is a transformer that combines a given list of columns in the order specified into a single vector column in order to train a model.

In [None]:
assembler = VectorAssembler(inputCols= ["SexIndex", "EmbarkedIndex", "AgeBucket", "FareBucket", "SibSp", "Pclass", "Parch"], outputCol="features")

## Normalizer is a Transformer which transforms a dataset of Vector rows, normalizing each Vector to have unit norm
### This normalization can help standardize your input data and improve the behavior of learning algorithms.
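
As a rough illustration of what "unit norm" means, the plain-Python sketch below divides a vector by its p-norm. The function name `normalize` and the sample vector are hypothetical, and returning an all-zero vector unchanged is a choice made for this sketch.

```python
def normalize(vector, p=2.0):
    """Scale a vector so that its p-norm equals 1."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        # An all-zero vector cannot be scaled; this sketch returns it unchanged.
        return list(vector)
    return [x / norm for x in vector]

features = [1.0, 2.0, 2.0]          # hypothetical feature vector
l1 = normalize(features, p=1.0)     # absolute values sum to 1
l2 = normalize(features, p=2.0)     # Euclidean length is 1
print(l1)
print(l2)
```

With `p=1.0`, as used in this lab, each feature is divided by the sum of the absolute values in its row, so every row is rescaled independently of all other rows.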

In [None]:
normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)

## Logistic regression is a popular method to predict a binary response
### It is a special case of generalized linear models that predicts the probability of an outcome.
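
The sketch below shows, in plain Python, how a fitted logistic regression model turns a feature vector into a probability and then into a 0/1 prediction. The weights, intercept, and feature values are made up for illustration, and the 0.5 decision threshold is an assumption of this sketch.

```python
import math

def predict_survival(features, weights, intercept, threshold=0.5):
    """Return (probability of class 1, predicted label) for one feature vector."""
    # Linear combination of the features, also called the margin
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    # The logistic (sigmoid) function squeezes the margin into the range (0, 1)
    probability = 1.0 / (1.0 + math.exp(-margin))
    prediction = 1.0 if probability > threshold else 0.0
    return probability, prediction

weights = [-2.0, 0.5, 0.0]   # hypothetical coefficients, not learned from the Titanic data
intercept = 0.5              # hypothetical intercept
probability, prediction = predict_survival([0.2, 0.4, 0.4], weights, intercept)
print(probability, prediction)
```

Training consists of finding the weights and intercept that best fit the labeled data; `regParam` and `elasticNetParam` control how strongly large weights are penalized during that search.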

In [None]:
lr = LogisticRegression(featuresCol="normFeatures", labelCol="Survived", predictionCol="prediction", maxIter=10, regParam=0.01, elasticNetParam=0.8)

## A Pipeline is a sequence of stages where each stage is either a Transformer or an Estimator
### These stages are run in order and the input DataFrame is transformed as it passes through each stage. 

### In machine learning, it is common to run a sequence of algorithms to process and learn from data.
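
The difference between a Transformer and an Estimator can be sketched with a toy pipeline in plain Python. All class and function names below are hypothetical and only mimic the idea: an estimator learns something from the data in `fit()` and returns a transformer, while a transformer is applied directly.

```python
class Doubler:
    """Toy transformer: needs no fitting."""
    def transform(self, rows):
        return [2 * x for x in rows]

class Shifter:
    """Toy transformer produced by fitting MeanCenterer."""
    def __init__(self, offset):
        self.offset = offset
    def transform(self, rows):
        return [x + self.offset for x in rows]

class MeanCenterer:
    """Toy estimator: learns the mean, then returns a transformer that subtracts it."""
    def fit(self, rows):
        return Shifter(-sum(rows) / len(rows))

def fit_pipeline(stages, rows):
    """Run the stages in order, fitting each estimator on the data transformed so far."""
    fitted = []
    for stage in stages:
        transformer = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = transformer.transform(rows)
        fitted.append(transformer)
    return fitted

def transform_pipeline(fitted, rows):
    """Apply the already fitted stages to new data."""
    for transformer in fitted:
        rows = transformer.transform(rows)
    return rows

fitted = fit_pipeline([Doubler(), MeanCenterer()], [1.0, 2.0, 3.0])
print(transform_pipeline(fitted, [4.0]))  # doubled to 8.0, then the learned mean 4.0 is subtracted
```

In this lab, StringIndexer and LogisticRegression are estimators, while Bucketizer, VectorAssembler, and Normalizer are transformers.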

In [None]:
pipeline = Pipeline(stages=[SexIndexer, EmbarkedIndexer, AgeBucket, FareBucket, assembler, normalizer, lr])

## Randomly split the source data into training and test data sets
### 90% training, 10% test
### Cache the resulting DataFrames
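
A weighted random split can be pictured as one random draw per row. The plain-Python sketch below is a conceptual illustration under that assumption, not Spark's sampling code, and the function name `random_split` is hypothetical. It shows two points: the weights are normalized, so `[90.0, 10.0]` behaves like `[0.9, 0.1]`, and the resulting sizes are only approximately 90% and 10%.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one part with probability proportional to that part's weight."""
    total = sum(weights)
    # Cumulative boundaries in [0, 1], for example [90.0, 10.0] -> [0.9, 1.0]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            if draw < bound:
                parts[i].append(row)
                break
    return parts

train_rows, test_rows = random_split(list(range(1000)), [90.0, 10.0], seed=1)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, which is why the cell below passes `seed=1`.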

In [None]:
train, test = LabeledTitanicData.randomSplit([90.0,10.0], seed=1)
train.cache()
test.cache()
print('The number of records in the training data set is {}.'.format(train.count()))
print('The number of rows labeled Not Survived in the training data set is {}.'.format(train.filter(train['Survived'] == 0).count()))
print('The number of rows labeled Survived in the training data set is {}.'.format(train.filter(train['Survived'] == 1).count()))
train.sample(False, 0.01, seed=0).show(5)
print('')

print('The number of records in the test data set is {}.'.format(test.count()))
print('The number of rows labeled Not Survived in the test data set is {}.'.format(test.filter(test['Survived'] == 0).count()))
print('The number of rows labeled Survived in the test data set is {}.'.format(test.filter(test['Survived'] == 1).count()))
test.sample(False, 0.1, seed=0).show(5)

## Fit the pipeline to the training data
<div class="panel-group" id="accordion-3">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-3" href="#collapse-3">
        Hint</a>
      </h4>
    </div>
    <div id="collapse-3" class="panel-collapse collapse">
      <div class="panel-body">Type (or copy) the following in the cell below: <br>
          model = pipeline.fit(train)<br>
      </div>
    </div>
  </div>
</div>

In [None]:
# Fit the pipeline to the training data assigning the result to a variable called 'model'.


## Make predictions on passengers in the test data set
### Keep in mind that the model has not seen the data in the test data set
<div class="panel-group" id="accordion-4">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-4" href="#collapse-4">
        Hint</a>
      </h4>
    </div>
    <div id="collapse-4" class="panel-collapse collapse">
      <div class="panel-body">Type (or copy) the following in the cell below: <br>
          predictions = model.transform(test)<br>
      </div>
    </div>
  </div>
</div>

In [None]:
# Make predictions on the test data assigning the result to a variable called 'predictions'.


## Show results

In [None]:
predictions.sample(False, 0.1, seed=0).show(5)

In [None]:
print('The number of predictions labeled Not Survived is {}.'.format(predictions.filter(predictions['prediction'] == 0).count()))
print('The number of predictions labeled Survived is {}.'.format(predictions.filter(predictions['prediction'] == 1).count()))

In [None]:
(predictions.filter("Survived = 0.0")
     .select("Sex", "Age", "Fare", "Embarked", "Pclass", "Parch", "SibSp", "Survived", "prediction")
     .sample(False, 0.1, seed=0).show(5))

(predictions.filter("Survived = 1.0")
     .select("Sex", "Age", "Fare", "Embarked", "Pclass", "Parch", "SibSp", "Survived", "prediction")
     .sample(False, 0.5, seed=0).show(5))

## Create an evaluator for the binary classification using area under the ROC Curve as the evaluation metric

### Receiver operating characteristic (ROC) is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied

The curve is created by plotting the true positive rate against the false positive rate at various threshold settings. The ROC curve is thus the sensitivity as a function of fall-out. The area under the ROC curve is useful for comparing and selecting the best machine learning model for a given data set. A model with an area under the ROC curve score near 1 has very good performance. A model with a score near 0.5 is about as good as flipping a coin.
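
One way to build intuition for the area under the ROC curve is its pairwise interpretation: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The plain-Python sketch below computes it that way. This is not how Spark computes the metric, and the function name, labels, and scores are made up for illustration.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

labels = [1.0, 0.0, 1.0, 0.0, 1.0]      # hypothetical true labels
scores = [0.9, 0.4, 0.35, 0.2, 0.8]     # hypothetical predicted probabilities of class 1
auc = area_under_roc(labels, scores)
print(auc)  # 5 of the 6 pairs are ranked correctly
```

A model that gives every row the same score ranks no pair correctly or incorrectly, so it scores 0.5, which matches the coin-flip baseline mentioned above.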

In [None]:
evaluator = BinaryClassificationEvaluator().setLabelCol("Survived").setMetricName("areaUnderROC")
print('Area under the ROC curve = {}.'.format(evaluator.evaluate(predictions)))

## Tune Hyperparameters
### Generate hyperparameter combinations by taking the cross product of some parameter values

Spark ML algorithms provide many hyperparameters for tuning models. These hyperparameters are distinct from the model parameters being optimized by Spark ML itself. Hyperparameter tuning is accomplished by choosing the best set of parameters based on model performance on test data that the model was not trained with. All combinations of hyperparameters specified will be tried in order to find the one that leads to the model with the best evaluation result.
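
The cross product of candidate values can be sketched in plain Python with `itertools.product`. The values below mirror the parameter grid used in this lab; the dictionary layout itself is only an illustration, not the structure that `ParamGridBuilder` returns.

```python
from itertools import product

# Candidate values for each hyperparameter (same values as the grid in this lab)
grid = {
    "regParam": [0.0, 0.1, 0.3],
    "elasticNetParam": [0.0, 1.0],
    "p": [1.0, 2.0],
}

names = list(grid)
# One dictionary per combination: every value of one parameter paired with every value of the others
combinations = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

print(len(combinations))  # 3 * 2 * 2 = 12
print(combinations[0])
```

The number of combinations grows multiplicatively with every parameter added, so it is worth keeping each list of candidate values short.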

## Build a Parameter Grid specifying what parameters and values will be evaluated in order to determine the best combination

In [None]:
paramGrid = (ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1, 0.3])
                 .addGrid(lr.elasticNetParam, [0.0, 1.0])
                 .addGrid(normalizer.p, [1.0, 2.0])
                 .build())

## Create a cross validator to tune the pipeline with the generated parameter grid
Spark ML provides for cross-validation for hyperparameter tuning. Cross-validation attempts to fit the underlying estimator with user-specified combinations of parameters, cross-evaluate the fitted models, and output the best one.
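
The idea of k-fold cross-validation can be sketched in plain Python as follows. The helper `k_fold_indices` is hypothetical and assigns rows to folds round-robin for readability, whereas a real cross validator would assign them randomly.

```python
def k_fold_indices(num_rows, num_folds):
    """Yield (train_indices, validation_indices) once per fold."""
    # Fold i receives rows i, i + num_folds, i + 2 * num_folds, ...
    folds = [list(range(i, num_rows, num_folds)) for i in range(num_folds)]
    for i, validation in enumerate(folds):
        # Train on every fold except the one held out for validation
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(num_rows=20, num_folds=10))
for train_idx, validation_idx in splits[:2]:
    print(len(train_idx), validation_idx)
```

Every row is used for validation exactly once, and each parameter combination is scored by averaging its metric over the folds. With the 12 combinations in this lab's grid and 10 folds, that amounts to 120 pipeline fits, followed by one final fit of the best combination on the full input data.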

In [None]:
cv = CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(10)

## Cross-validate the ML Pipeline to find the best model
### using the area under the ROC evaluator and hyperparameters specified in the parameter grid

In [None]:
# Tune on the training data only, so that the test set stays unseen by the tuned model
cvModel = cv.fit(train)
print('Area under the ROC curve for best fitted model = {}.'.format(evaluator.evaluate(cvModel.transform(test))))

## Let's see what improvement we achieve by tuning the hyperparameters using cross-validation

In [None]:
# Score both models on the same test set so that the two numbers are comparable
baselineAUC = evaluator.evaluate(predictions)
tunedAUC = evaluator.evaluate(cvModel.transform(test))
print('Area under the ROC curve for non-tuned model = {}.'.format(baselineAUC))
print('Area under the ROC curve for best fitted model = {}.'.format(tunedAUC))
print('Improvement = {0:0.2f}%'.format((tunedAUC - baselineAUC) * 100 / baselineAUC))

## Make improved predictions using the Cross-validated model
### Using the Test data set and DataFrame API

In [None]:
cvModel.transform(test).select("Survived", "prediction").sample(False, 0.1, seed=0).show(10)

### Like above, but now using SQL
<div class="panel-group" id="accordion-5">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-5" href="#collapse-5">
        Hint</a>
      </h4>
    </div>
    <div id="collapse-5" class="panel-collapse collapse">
      <div class="panel-body">Type (or copy) the following in the cell below: <br>
            spark.sql("select Survived, prediction from cvModelPredictions").sample(False, 0.1, seed=0).show(10)<br>
      </div>
    </div>
  </div>
</div>

In [None]:
# create temporary table
cvModel.transform(test).createOrReplaceTempView("cvModelPredictions")
# use Spark SQL to query this temporary table to see the same columns as using the DataFrame APIs above


## Make a prediction on an imaginary passenger

## Define the imaginary passenger's features
Try other combinations of features to see if your imaginary passenger would have survived the Titanic!
Please make sure to not change any data types.

In [None]:
SexValue = 'female'
AgeValue = 40.0
FareValue = 15.0
EmbarkedValue = 'C'
PclassValue = 2
SibSpValue = 1
ParchValue = 1

PredictionFeatures = (spark.createDataFrame([(SexValue, AgeValue, FareValue, EmbarkedValue, PclassValue, SibSpValue, ParchValue)],
    ['Sex', 'Age', 'Fare', 'Embarked', 'Pclass', 'SibSp', 'Parch']))
PredictionFeatures.show()

## Predict whether the imaginary person would have survived
### using the best fit model

In [None]:
SurvivedOrNotPrediction = cvModel.transform(PredictionFeatures)
SurvivedOrNotPrediction.select('rawPrediction', 'probability', 'prediction').show(1, False)

## Display Prediction Result

In [None]:
SurvivedOrNot = SurvivedOrNotPrediction.select("prediction").first()[0]
if SurvivedOrNot == 0.0:
    print("Did NOT Survive")
elif SurvivedOrNot == 1.0:
    print("Did Survive!!!")
else:
    print("Invalid Prediction")