# PySpark ML - Ensembles & Pipelines

Finally you'll learn how to make your models more efficient. You'll find out how to use pipelines to make your code clearer and easier to maintain. Then you'll use cross-validation to better test your models and select good model parameters. Finally you'll dabble in two types of ensemble model.

## Preparing the environment

### Importing libraries

In [1]:
import pandas as pd

from environment import SEED
from pprint import pprint
from pyspark.sql.types import (StructType, StructField,
                               DoubleType, IntegerType, StringType)
from pyspark.sql import SparkSession, functions as F
from pyspark.mllib.linalg import DenseVector
from pyspark.ml import Pipeline
from pyspark.ml.feature import (OneHotEncoder, StringIndexer, VectorAssembler,
                                Tokenizer, StopWordsRemover, HashingTF, IDF)
from pyspark.ml.regression import LinearRegression
from pyspark.ml.classification import (LogisticRegression, RandomForestClassifier,
                                       GBTClassifier, DecisionTreeClassifier)
from pyspark.ml.evaluation import RegressionEvaluator, BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

### Connect to Spark

In [2]:
spark = (SparkSession.builder
                     .master('local[*]') \
                     .appName('spark_application') \
                     .config("spark.sql.repl.eagerEval.enabled", True)  # eval DataFrame in notebooks
                     .getOrCreate())

sc = spark.sparkContext
print(f'Spark version: {spark.version}')

Spark version: 3.5.1


## Loading data

### Flights

In [3]:
# Reading the file
schema_flights = StructType([
    StructField("mon", IntegerType()),
    StructField("dom", IntegerType()),
    StructField("dow", IntegerType()),
    StructField("carrier", StringType()),
    StructField("flight", IntegerType()),
    StructField("org", StringType()),
    StructField("mile", IntegerType()),
    StructField("depart", DoubleType()),
    StructField("duration", IntegerType()),
    StructField("delay", IntegerType())
])
flights_data = spark.read.csv('data-sources/flights.csv', header=True, schema=schema_flights, nullValue='NA')

# Cleaning and mutating some columns
flights_data = flights_data.dropna()
flights_data = flights_data.withColumn('km', F.round(flights_data['mile'] * 1.60934, 0))

# Reviewing the result
flights_data.createOrReplaceTempView("flights")
print(f'Dataframe shape: ({flights_data.count()}, {len(flights_data.columns)})')
flights_data.printSchema()
flights_data.limit(2)

Dataframe shape: (47022, 11)
root
 |-- mon: integer (nullable = true)
 |-- dom: integer (nullable = true)
 |-- dow: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- org: string (nullable = true)
 |-- mile: integer (nullable = true)
 |-- depart: double (nullable = true)
 |-- duration: integer (nullable = true)
 |-- delay: integer (nullable = true)
 |-- km: double (nullable = true)



mon,dom,dow,carrier,flight,org,mile,depart,duration,delay,km
0,22,2,UA,1107,ORD,316,16.33,82,30,509.0
2,20,4,UA,226,SFO,337,6.17,82,-8,542.0


### SMS

In [4]:
# Reading the file
schema_sms = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])
sms_data = spark.read.csv("data-sources/sms.csv", sep=';', header=False, schema=schema_sms)

# Reviewing the result
sms_data.createOrReplaceTempView("sms")
print(f'Dataframe shape: ({sms_data.count()}, {len(sms_data.columns)})')
sms_data.printSchema()
sms_data.limit(2)

Dataframe shape: (5574, 3)
root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = true)



id,text,label
1,"Sorry, I'll call ...",0
2,Dont worry. I gue...,0


### Cars

In [5]:
# Reading the file
schema_cars = StructType([
    StructField("maker", StringType()),
    StructField("model", StringType()),
    StructField("origin", StringType()),
    StructField("type", StringType()),
    StructField("cyl", IntegerType()),
    StructField("size", DoubleType()),
    StructField("weight", IntegerType()),
    StructField("length", DoubleType()),
    StructField("rpm", IntegerType()),
    StructField("consumption", DoubleType())
])
cars_data = spark.read.csv('data-sources/cars.csv', header=True, schema=schema_cars, nullValue='NA')

# Cleaning and mutating some columns
cars_data = cars_data.dropna()
cars_data = cars_data.withColumn('mass', F.round(cars_data.weight / 2.205, 0))
cars_data = cars_data.withColumn('length', F.round(cars_data.length * 0.0254, 3))
cars_data = cars_data.withColumn('consumption', F.round(cars_data.consumption * 3.78541, 3))

# Reviewing the result
cars_data.createOrReplaceTempView("cars")
print(f'Dataframe shape: ({cars_data.count()}, {len(cars_data.columns)})')
cars_data.printSchema()
cars_data.limit(2)

Dataframe shape: (92, 11)
root
 |-- maker: string (nullable = true)
 |-- model: string (nullable = true)
 |-- origin: string (nullable = true)
 |-- type: string (nullable = true)
 |-- cyl: integer (nullable = true)
 |-- size: double (nullable = true)
 |-- weight: integer (nullable = true)
 |-- length: double (nullable = true)
 |-- rpm: integer (nullable = true)
 |-- consumption: double (nullable = true)
 |-- mass: double (nullable = true)



maker,model,origin,type,cyl,size,weight,length,rpm,consumption,mass
Geo,Metro,non-USA,Small,3,1.0,1695,3.835,5700,7.571,769.0
Honda,Civic,non-USA,Small,4,1.5,2350,4.394,5900,8.214,1066.0


### Books

In [6]:
# Reading the file
schema_books = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType())
])
books_data = spark.read.csv("data-sources/books.csv", sep=';', header=True, schema=schema_books)

# Reviewing the result
books_data.createOrReplaceTempView("books")
print(f'Dataframe shape: ({cars_data.count()}, {len(cars_data.columns)})')
books_data.printSchema()
books_data.limit(2)

Dataframe shape: (92, 11)
root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)



id,text
0,"Forever, or a Lon..."
1,Winnie-the-Pooh


### BMI

In [7]:
# Reading the file
schema_bmi = StructType([
    StructField("height_mt", DoubleType()),
    StructField("mass_kg", DoubleType())
])
bmi_data = spark.read.csv("data-sources/bmi.csv", sep=',', header=True, schema=schema_bmi)

# Reviewing the result
bmi_data.createOrReplaceTempView("bmi")
print(f'Dataframe shape: ({cars_data.count()}, {len(cars_data.columns)})')
bmi_data.printSchema()
bmi_data.limit(2)

Dataframe shape: (92, 11)
root
 |-- height_mt: double (nullable = true)
 |-- mass_kg: double (nullable = true)



height_mt,mass_kg
1.7496714153011232,54.45891245893144
1.6861735698828817,63.85688444230989


### Tables catalogue

In [8]:
spark.catalog.listTables()

[Table(name='bmi', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='books', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='cars', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='flights', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='sms', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

# Cars Dataset

## First Pipeline Model

### Applying steps

In [9]:
# Loading the data
df_cars = cars_data.select('*')

# setting the label and features columns
label_col = 'consumption'
feature_cols = ['mass', 'cyl', 'type_vec']

# Setting the steps
indexer_car = StringIndexer(inputCol='type', outputCol='type_idx')
onehot_car = OneHotEncoder(inputCols=['type_idx'], outputCols=['type_vec'])
assemble_car = VectorAssembler(inputCols=feature_cols, outputCol='features')
regression_car = LinearRegression(labelCol=label_col)

# Split into train and test set.
df_cars_train, df_cars_test = df_cars.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_cars_train.count()}, Testing set: {df_cars_test.count()}")

Training set: 75, Testing set: 17


### Building the pipeline

In [10]:
# Define the pipeline
pipeline_car = Pipeline(stages=[indexer_car, onehot_car, assemble_car, regression_car])

# Fitting the model on the trainning data
pipeline_car = pipeline_car.fit(df_cars_train)
df_cars_train = pipeline_car.transform(df_cars_train)

# Evaluating on the testing data
predictions_cars = pipeline_car.transform(df_cars_test)

eval_cars = RegressionEvaluator(labelCol=label_col)
print(f'''
RMSE: {eval_cars.evaluate(predictions_cars)}
 MAE: {eval_cars.evaluate(predictions_cars, {eval_cars.metricName: "mae"})}
  R²: {eval_cars.evaluate(predictions_cars, {eval_cars.metricName: "r2"})}
 MSE: {eval_cars.evaluate(predictions_cars, {eval_cars.metricName: "mse"})}
''')


RMSE: 0.8852987870756872
 MAE: 0.7074495911938766
  R²: 0.8105462852852746
 MSE: 0.783753942397683



### Stages

In [11]:
# Reviewing the steps in the pipeline
pipeline_car.stages

[StringIndexerModel: uid=StringIndexer_055199545c4c, handleInvalid=error,
 OneHotEncoderModel: uid=OneHotEncoder_d43fadc12aaf, dropLast=true, handleInvalid=error, numInputCols=1, numOutputCols=1,
 VectorAssembler_c3e21b76706b,
 LinearRegressionModel: uid=LinearRegression_16325fae09c0, numFeatures=7]

### Interpreting intercept & Coefficients

In [12]:
df_cars_train.select('type', 'type_idx', 'type_vec').distinct().sort('type_idx').show()

+-------+--------+-------------+
|   type|type_idx|     type_vec|
+-------+--------+-------------+
|Midsize|     0.0|(5,[0],[1.0])|
|  Small|     1.0|(5,[1],[1.0])|
|Compact|     2.0|(5,[2],[1.0])|
| Sporty|     3.0|(5,[3],[1.0])|
|    Van|     4.0|(5,[4],[1.0])|
|  Large|     5.0|    (5,[],[])|
+-------+--------+-------------+



In [13]:
# Getting the intercept, coefficients and features
intercept = pipeline_car.stages[3].intercept
coefficients = list(pipeline_car.stages[3].coefficients)

colIdx =  sorted((value, key) for (key, value) 
                 in df_cars_train.select('type', "type_idx").distinct().rdd.collectAsMap().items())
newCols_type = list(map(lambda x: x[1], colIdx))

print(f'''
Intercept: {intercept}
Coefficients:
{coefficients}

Feature cols: {feature_cols}
Encoded categos (type_vec): {newCols_type[:-1]}
''')

# With more detail
coefficient_len = len(coefficients)
pd.DataFrame({
    'Slope': ['Intercept'] + ['Coefficients']*coefficient_len,
    'Feature': [''] + feature_cols[:-1] + newCols_type[:-1],
    'value': [intercept] + coefficients
})


Intercept: 4.451124924882582
Coefficients:
[0.0044227611335224395, 0.37491422671277763, 0.7328547001153257, 0.4402829780657993, 0.9579155303172929, 1.1222235235549773, 3.207167823278289]

Feature cols: ['mass', 'cyl', 'type_vec']
Encoded categos (type_vec): ['Midsize', 'Small', 'Compact', 'Sporty', 'Van']



Unnamed: 0,Slope,Feature,value
0,Intercept,,4.451125
1,Coefficients,mass,0.004423
2,Coefficients,cyl,0.374914
3,Coefficients,Midsize,0.732855
4,Coefficients,Small,0.440283
5,Coefficients,Compact,0.957916
6,Coefficients,Sporty,1.122224
7,Coefficients,Van,3.207168


## Cross-Validation

### Building the pipeline

In [14]:
# Loading the data
df_cars = cars_data.select('*')

# setting the label and features columns
label_col = 'consumption'
feature_cols = ['mass', 'cyl']

# Setting the steps
assemble_car = VectorAssembler(inputCols=feature_cols, outputCol='features')
regression_car = LinearRegression(labelCol=label_col)

# Split into train and test set.
df_cars_train, df_cars_test = df_cars.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_cars_train.count()}, Testing set: {df_cars_test.count()}")

# Define the pipeline
pipeline_car = Pipeline(stages=[assemble_car, regression_car])

Training set: 75, Testing set: 17


### Grid and cross-validator

In [15]:
# A grid of parameter values (empty for the moment).
params = ParamGridBuilder().build()

# An object to evaluate model performance.
evaluator_car = RegressionEvaluator(labelCol='consumption')

# The cross-validation object.
cv_car = CrossValidator(estimator=pipeline_car, estimatorParamMaps=params,
                        evaluator=evaluator_car, numFolds=10, seed=SEED)

# Apply cross-validation to the training data.
cv_car = cv_car.fit(df_cars_train)

# What's the average RMSE across the folds?
print(f'''
What's the average RMSE across the folds? (cross validation score)
{cv_car.avgMetrics}
''')


What's the average RMSE across the folds? (cross validation score)
[1.2602868736515767]



### Evaluationg on test data

In [16]:
# Evaluating on the testing data
predictions_cars = cv_car.transform(df_cars_test)

eval_cars = RegressionEvaluator(labelCol=label_col)
print(f'''
RMSE: {eval_cars.evaluate(predictions_cars)}
 MAE: {eval_cars.evaluate(predictions_cars, {eval_cars.metricName: "mae"})}
  R²: {eval_cars.evaluate(predictions_cars, {eval_cars.metricName: "r2"})}
 MSE: {eval_cars.evaluate(predictions_cars, {eval_cars.metricName: "mse"})}
''')


RMSE: 1.2274630007114407
 MAE: 1.0808052770143333
  R²: 0.6357997773880841
 MSE: 1.5066654181155341



## GridSearch

### Manual process selection

In [17]:
# Loading the data
df_cars = cars_data.select('*')

# setting the label and features columns
label_col = 'consumption'
feature_cols = ['mass', 'cyl']

# Setting the steps
assemble_car = VectorAssembler(inputCols=feature_cols, outputCol='features')
df_cars = assemble_car.transform(df_cars)

# Evaluator 
evaluator_car = RegressionEvaluator(labelCol='consumption')

# Split into train and test set.
df_cars_train, df_cars_test = df_cars.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_cars_train.count()}, Testing set: {df_cars_test.count()}")

# Opcion fitIntercept=True
regression_car = LinearRegression(labelCol=label_col, fitIntercept=True).fit(df_cars_train)
predictions_cars = regression_car.transform(df_cars_test)
print(f'''LinearRegression(fitIntercept=True) RMSE: {evaluator_car.evaluate(predictions_cars)}''')

# Opcion fitIntercept=False
regression_car = LinearRegression(labelCol=label_col, fitIntercept=False).fit(df_cars_train)
predictions_cars = regression_car.transform(df_cars_test)
print(f'''LinearRegression(fitIntercept=False) RMSE: {evaluator_car.evaluate(predictions_cars)}''')

Training set: 75, Testing set: 17
LinearRegression(fitIntercept=True) RMSE: 1.2274630007114407
LinearRegression(fitIntercept=False) RMSE: 1.5719472988594767


### A simple Grid

In [18]:
# Loading the data
df_cars = cars_data.select('*')

# setting the label and features columns
label_col = 'consumption'
feature_cols = ['mass', 'cyl']

# Setting the steps
assemble_car = VectorAssembler(inputCols=feature_cols, outputCol='features')
df_cars = assemble_car.transform(df_cars)

# Split into train and test set.
df_cars_train, df_cars_test = df_cars.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_cars_train.count()}, Testing set: {df_cars_test.count()}")

# Evaluator 
evaluator_car = RegressionEvaluator(labelCol='consumption')

# Model
regression_car = LinearRegression(labelCol=label_col)

# ------------------------------------------------------------------------------
# Create a parameter grid builder
params_car = ParamGridBuilder().addGrid(regression_car.fitIntercept, [True, False]).build()
print('Number of models to be tested: ', len(params_car))

# ------------------------------------------------------------------------------
# Create a cross-validator and fit to the training data.
cv_car = CrossValidator(estimator=regression_car, estimatorParamMaps=params_car,
                    evaluator=evaluator_car)
cv_car = cv_car.setNumFolds(10).setSeed(SEED).fit(df_cars_train)
print(f'''\nWhat's the cross-validated RMSE for each model? {cv_car.avgMetrics}''')

# ------------------------------------------------------------------------------
# Access the best model
print(f'''\nBest Model: \n{cv_car.bestModel}''')

# Or just use the cross-validator object.
predictions_cars = cv_car.transform(df_cars_test)
print(f'''CV RMSE (best model): {evaluator_car.evaluate(predictions_cars)}''')

# Retrieve the best parameter.
print(f'''Best parameter: \n{cv_car.bestModel.explainParam("fitIntercept")}''')

Training set: 75, Testing set: 17
Number of models to be tested:  2

What's the cross-validated RMSE for each model? [1.2602868736515764, 1.415907832036895]

Best Model: 
LinearRegressionModel: uid=LinearRegression_9315520dd3d6, numFeatures=2
CV RMSE (best model): 1.2274630007114407
Best parameter: 
fitIntercept: whether to fit an intercept term. (default: True, current: True)


### A more complicated grid

In [19]:
# Loading the data
df_cars = cars_data.select('*')

# setting the label and features columns
label_col = 'consumption'
feature_cols = ['mass', 'cyl']

# Setting the steps
assemble_car = VectorAssembler(inputCols=feature_cols, outputCol='features')
df_cars = assemble_car.transform(df_cars)

# Split into train and test set.
df_cars_train, df_cars_test = df_cars.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_cars_train.count()}, Testing set: {df_cars_test.count()}")

# Evaluator 
evaluator_car = RegressionEvaluator(labelCol='consumption')

# Model
regression_car = LinearRegression(labelCol=label_col)

# ------------------------------------------------------------------------------
# Create a parameter grid builder
params_car = (ParamGridBuilder().addGrid(regression_car.fitIntercept, [True, False])
                                .addGrid(regression_car.regParam, [0.001, 0.01, 0.1, 1, 10])
                                .addGrid(regression_car.elasticNetParam, [0, 0.25, 0.5, 0.75, 1])
                                .build())
print('Number of models to be tested: ', len(params_car))

Training set: 75, Testing set: 17
Number of models to be tested:  50


## Ensemble

### RandomForestClassifier

In [20]:
# Loading the data
df_cars = cars_data.select('*')
df_cars = StringIndexer(inputCol="origin", outputCol="label").fit(df_cars).transform(df_cars)

# setting the label and features columns
feature_cols = ['cyl', 'size', 'mass', 'length', 'rpm', 'consumption']

# Setting the steps
assemble_car = VectorAssembler(inputCols=feature_cols, outputCol='features')
df_cars = assemble_car.transform(df_cars)

# Split into train and test set.
df_cars_train, df_cars_test = df_cars.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_cars_train.count()}, Testing set: {df_cars_test.count()}")

# ------------------------------------------------------------------------------
# Create a forest of trees
forest_car = RandomForestClassifier(numTrees=5, seed=SEED).fit(df_cars_train)

# Seeing the trees
print(f'''
Trees in RandomForestClassifier: {len(forest_car.trees)}
''')
pprint(forest_car.trees)

# Feature importances
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': DenseVector(forest_car.featureImportances)
})
print(f'''
Feature importance:
{feature_importance.sort_values('Importance', ascending=False)}
''')

# ------------------------------------------------------------------------------
# Consensus predictions
predictions_car = forest_car.transform(df_cars_test)
print('\nReviewing predictions (5 first records):')
predictions_car.select('label', 'probability', 'prediction').show(5, truncate=False)

# ------------------------------------------------------------------------------
# Evaluation
cm_car = predictions_car.groupBy("label", "prediction").count().toPandas().sort_values(["prediction", "label"])
cm_car.index = ['True negative (TN)', 'False negative (FN)', 'False positive (FP)', 'True positive (TP)']
TN, FN, FP, TP = cm_car['count'].to_list()
accuracy_forest_car = (TN + TP) / (TN + TP + FN + FP)
print(f'''
Accuracy : {accuracy_forest_car}

Confussion Matrix:
{cm_car}
''')

Training set: 75, Testing set: 17

Trees in RandomForestClassifier: 5

[DecisionTreeClassificationModel: uid=dtc_409dc29e8ae6, depth=5, numNodes=17, numClasses=2, numFeatures=6,
 DecisionTreeClassificationModel: uid=dtc_343b290d5ca9, depth=5, numNodes=17, numClasses=2, numFeatures=6,
 DecisionTreeClassificationModel: uid=dtc_2531f6c7fe2c, depth=5, numNodes=19, numClasses=2, numFeatures=6,
 DecisionTreeClassificationModel: uid=dtc_f6b53ed91162, depth=5, numNodes=19, numClasses=2, numFeatures=6,
 DecisionTreeClassificationModel: uid=dtc_99991c65db06, depth=4, numNodes=21, numClasses=2, numFeatures=6]

Feature importance:
       Feature  Importance
4          rpm    0.301180
3       length    0.264662
1         size    0.152525
5  consumption    0.143760
2         mass    0.129507
0          cyl    0.008366


Reviewing predictions (5 first records):
+-----+----------------------------------------+----------+
|label|probability                             |prediction|
+-----+--------------

### Gradient-Boosted Trees

In [21]:
# Loading the data
df_cars = cars_data.select('*')
df_cars = StringIndexer(inputCol="origin", outputCol="label").fit(df_cars).transform(df_cars)

# setting the label and features columns
feature_cols = ['cyl', 'size', 'mass', 'length', 'rpm', 'consumption']

# Setting the steps
assemble_car = VectorAssembler(inputCols=feature_cols, outputCol='features')
df_cars = assemble_car.transform(df_cars)

# Split into train and test set.
df_cars_train, df_cars_test = df_cars.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_cars_train.count()}, Testing set: {df_cars_test.count()}")

# ------------------------------------------------------------------------------
# Create a Gradient-Boosted Tree classifier
gbt_car = GBTClassifier(maxIter=10, seed=SEED).fit(df_cars_train)

# Seeing the trees
print(f'''
Trees in Gradient-Boosted Tree: {len(gbt_car.trees)}
''')
pprint(gbt_car.trees)

# Feature importances
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': DenseVector(gbt_car.featureImportances)
})
print(f'''
Feature importance:
{feature_importance.sort_values('Importance', ascending=False)}
''')

# ------------------------------------------------------------------------------
# Predictions
predictions_car = gbt_car.transform(df_cars_test)

# ------------------------------------------------------------------------------
# Evaluation
cm_car = predictions_car.groupBy("label", "prediction").count().toPandas().sort_values(["prediction", "label"])
cm_car.index = ['True negative (TN)', 'False negative (FN)', 'False positive (FP)', 'True positive (TP)']
TN, FN, FP, TP = cm_car['count'].to_list()
accuracy_gbt_car = (TN + TP) / (TN + TP + FN + FP)
print(f'''
Accuracy : {accuracy_gbt_car}

Confussion Matrix:
{cm_car}
''')

Training set: 75, Testing set: 17

Trees in Gradient-Boosted Tree: 10

[DecisionTreeRegressionModel: uid=dtr_380c018ebed1, depth=5, numNodes=21, numFeatures=6,
 DecisionTreeRegressionModel: uid=dtr_79814dddc780, depth=5, numNodes=33, numFeatures=6,
 DecisionTreeRegressionModel: uid=dtr_8230e0062dcb, depth=5, numNodes=37, numFeatures=6,
 DecisionTreeRegressionModel: uid=dtr_a65a69a496c8, depth=5, numNodes=37, numFeatures=6,
 DecisionTreeRegressionModel: uid=dtr_eb902d1b3291, depth=5, numNodes=41, numFeatures=6,
 DecisionTreeRegressionModel: uid=dtr_651efa8de78f, depth=5, numNodes=35, numFeatures=6,
 DecisionTreeRegressionModel: uid=dtr_dd23212651b6, depth=5, numNodes=35, numFeatures=6,
 DecisionTreeRegressionModel: uid=dtr_169b50283d65, depth=5, numNodes=29, numFeatures=6,
 DecisionTreeRegressionModel: uid=dtr_c093dbb701a2, depth=5, numNodes=29, numFeatures=6,
 DecisionTreeRegressionModel: uid=dtr_2ba15baf8525, depth=5, numNodes=35, numFeatures=6]

Feature importance:
       Feature  Im

### Decision Tree

In [22]:
# Loading the data
df_cars = cars_data.select('*')
df_cars = StringIndexer(inputCol="origin", outputCol="label").fit(df_cars).transform(df_cars)

# setting the label and features columns
feature_cols = ['cyl', 'size', 'mass', 'length', 'rpm', 'consumption']

# Setting the steps
assemble_car = VectorAssembler(inputCols=feature_cols, outputCol='features')
df_cars = assemble_car.transform(df_cars)

# Split into train and test set.
df_cars_train, df_cars_test = df_cars.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_cars_train.count()}, Testing set: {df_cars_test.count()}")

# ------------------------------------------------------------------------------
# Create a Decision Tree model
tree_car = DecisionTreeClassifier(seed=SEED).fit(df_cars_train)

# Feature importances
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': DenseVector(tree_car.featureImportances)
})
print(f'''
Feature importance:
{feature_importance.sort_values('Importance', ascending=False)}
''')

# ------------------------------------------------------------------------------
# Predictions
predictions_car = tree_car.transform(df_cars_test)

# ------------------------------------------------------------------------------
# Evaluation
cm_car = predictions_car.groupBy("label", "prediction").count().toPandas().sort_values(["prediction", "label"])
cm_car.index = ['True negative (TN)', 'False negative (FN)', 'False positive (FP)', 'True positive (TP)']
TN, FN, FP, TP = cm_car['count'].to_list()
accuracy_tree_car = (TN + TP) / (TN + TP + FN + FP)
print(f'''
Accuracy : {accuracy_tree_car}

Confussion Matrix:
{cm_car}
''')

Training set: 75, Testing set: 17

Feature importance:
       Feature  Importance
4          rpm    0.470220
3       length    0.233508
2         mass    0.109938
5  consumption    0.105161
1         size    0.081173
0          cyl    0.000000


Accuracy : 0.8235294117647058

Confussion Matrix:
                     label  prediction  count
True negative (TN)     0.0         0.0      9
False negative (FN)    1.0         0.0      2
False positive (FP)    0.0         1.0      1
True positive (TP)     1.0         1.0      5



### Comparing trees

In [23]:
print(f'''
RandomForestClassifier: {accuracy_forest_car}
  GradientBoostedTrees: {accuracy_gbt_car}
          DecisionTree: {accuracy_tree_car}
''')


RandomForestClassifier: 0.8235294117647058
  GradientBoostedTrees: 0.8235294117647058
          DecisionTree: 0.8235294117647058



# Fligths Dataset

## Pipeline

### Ex. 1 - Flight duration model: Pipeline stages

You're going to create the stages for the flights duration model pipeline. You will use these in the next exercise to build a pipeline and to create a regression model.

**Instructions:**

1. Create an indexer to convert the `org` column into an indexed column called `org_idx`.
2. Create a one-hot encoder to convert the `org_idx` and `dow` columns into vec variable columns called `org_vec` and `dow_vec`.
3. Create an assembler which will combine the `km` column with the two vec variable columns. The output column should be called `features`.
4. Create a linear regression object to predict flight `duration`.

In [24]:
# Loading the data
df_flights = flights_data.select('*')

# setting the label and features columns
label_col = 'duration'
feature_cols = ['km', 'org_vec', 'dow_vec']

# Setting the steps
indexer_flight = StringIndexer(inputCol='org', outputCol='org_idx')
onehot_flight = OneHotEncoder(inputCols=['org_idx', 'dow'], outputCols=['org_vec', 'dow_vec'])
assemble_flight = VectorAssembler(inputCols=feature_cols, outputCol='features')
regression_flight = LinearRegression(labelCol=label_col)

# Split into train and test set.
df_flights_train, df_flights_test = df_flights.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_flights_train.count()}, Testing set: {df_flights_test.count()}")

Training set: 37601, Testing set: 9421


### Ex. 2 - Flight duration model: Pipeline model

You're now ready to put those stages together in a pipeline.

You'll construct the pipeline and then train the pipeline on the training data. This will apply each of the individual stages in the pipeline to the training data in turn. None of the stages will be exposed to the testing data at all: there will be no leakage!

Once the entire pipeline has been trained it will then be used to make predictions on the testing data.

**Instructions:**

1. Import the class for creating a pipeline.
2. Create a pipeline object and specify the indexer, onehot, assembler and regression stages, in this order.
3. Train the pipeline on the training data.
4. Make predictions on the testing data.

In [25]:
cols_to_predict = ['org_idx', 'org_vec', 'dow_vec', 'features', 'prediction']
df_flights_train = df_flights_train.drop(*cols_to_predict)
df_flights_test = df_flights_test.drop(*cols_to_predict)
label_col = 'duration'

# Define the pipeline
pipeline_flight = Pipeline(stages=[indexer_flight, onehot_flight, assemble_flight, regression_flight])

# Fitting the model on the trainning data
pipeline_flight = pipeline_flight.fit(df_flights_train)
df_flights_train = pipeline_flight.transform(df_flights_train)

# Evaluating on the testing data
predictions_flights = pipeline_flight.transform(df_flights_test)

eval_flights = RegressionEvaluator(labelCol=label_col)
print(f'''
RMSE: {eval_flights.evaluate(predictions_flights)}
 MAE: {eval_flights.evaluate(predictions_flights, {eval_flights.metricName: "mae"})}
  R²: {eval_flights.evaluate(predictions_flights, {eval_flights.metricName: "r2"})}
 MSE: {eval_flights.evaluate(predictions_flights, {eval_flights.metricName: "mse"})}
''')


RMSE: 11.018250570030013
 MAE: 8.531036191848083
  R²: 0.9841835312882599
 MSE: 121.40184562396668



## Cross Validation

### Ex. 4 - Cross validating simple flight duration model

You've already built a few models for predicting flight duration and evaluated them with a simple train/test split. However, cross-validation provides a much better way to evaluate model performance.

In this exercise you're going to train a simple model for flight duration using cross-validation. Travel time is usually strongly correlated with distance, so using the km column alone should give a decent model.

**Instructions:**
1. Create an empty parameter grid.
2. Create objects for building and evaluating a linear regression model. The model should predict the "duration" field.
3. Create a cross-validator object. Provide values for the estimator, estimatorParamMaps and evaluator arguments. Choose 5-fold cross validation.
4. Train and test the model across multiple folds of the training data.

In [26]:
cols_to_predict = ['org_idx', 'org_vec', 'dow_vec', 'features', 'prediction']
df_flights_train = df_flights_train.drop(*cols_to_predict)
df_flights_test = df_flights_test.drop(*cols_to_predict)
label_col = 'duration'

# Define the estimator
pipeline_flight = Pipeline(stages=[indexer_flight, onehot_flight, assemble_flight, regression_flight])

# A grid of parameter values (empty for the moment).
params = ParamGridBuilder().build()

# An object to evaluate model performance.
evaluator_flight = RegressionEvaluator(labelCol='duration')

# The cross-validation object.
cv_flight = CrossValidator(estimator=pipeline_flight, estimatorParamMaps=params,
                           evaluator=evaluator_flight, numFolds=10, seed=SEED)

# Apply cross-validation to the training data.
cv_flight = cv_flight.fit(df_flights_train)

# What's the average RMSE across the folds?
print(f'''
What's the average RMSE across the folds? (cross validation score)
{cv_flight.avgMetrics}
''')


What's the average RMSE across the folds? (cross validation score)
[11.043356149070807]



### Ex. 5 - Cross validating flight duration model pipeline

The cross-validated model that you just built was simple, using `km` alone to predict duration.

Another important predictor of flight duration is the origin airport. Flights generally take longer to get into the air from busy airports. Let's see if adding this predictor improves the model!

In this exercise you'll add the `org` field to the model. However, since `org` is categorical, there's more work to be done before it can be included: it must first be transformed to an index and then one-hot encoded before being assembled with `km` and used to build the regression model. We'll wrap these operations up in a pipeline.

**Instructions:**

1. Create a string indexer. Specify the input and output fields as org and org_idx.
2. Create a one-hot encoder. Name the output field org_dummy.
3. Assemble the km and org_dummy fields into a single field called features.
4. Create a pipeline using the following operations: string indexer, one-hot encoder, assembler and linear regression. Use this to create a cross-validator.

In [27]:
# Loading the data
df_flights = flights_data.select('*')

# setting the label and features columns
label_col = 'duration'
feature_cols = ['km', 'org_vec']

# Setting the steps
indexer_flight = StringIndexer(inputCol='org', outputCol='org_idx')
onehot_flight = OneHotEncoder(inputCols=[indexer_flight.getOutputCol()], outputCols=['org_vec'])
assemble_flight = VectorAssembler(inputCols=feature_cols, outputCol='features')
regression_flight = LinearRegression(labelCol=label_col)

# Split into train and test set.
df_flights_train, df_flights_test = df_flights.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_flights_train.count()}, Testing set: {df_flights_test.count()}")

# A grid of parameter values (empty for the moment).
params = ParamGridBuilder().build()

# An object to evaluate model performance.
evaluator_flight = RegressionEvaluator(labelCol=label_col)

# The cross-validation object.
cv_flight = CrossValidator(estimator=pipeline_flight, estimatorParamMaps=params,
                           evaluator=evaluator_flight, numFolds=10, seed=SEED)

# Apply cross-validation to the training data.
cv_flight = cv_flight.fit(df_flights_train)

# What's the average RMSE across the folds?
print(f'''
What's the average RMSE across the folds? (cross validation score)
{cv_flight.avgMetrics}
''')

# Evaluating on the testing data
predictions_flights = cv_flight.transform(df_flights_test)

print(f'''
RMSE: {evaluator_flight.evaluate(predictions_flights)}
 MAE: {evaluator_flight.evaluate(predictions_flights, {evaluator_flight.metricName: "mae"})}
  R²: {evaluator_flight.evaluate(predictions_flights, {evaluator_flight.metricName: "r2"})}
 MSE: {evaluator_flight.evaluate(predictions_flights, {evaluator_flight.metricName: "mse"})}
''')

Training set: 37601, Testing set: 9421

What's the average RMSE across the folds? (cross validation score)
[11.043356149070807]


RMSE: 11.018250570030013
 MAE: 8.531036191848083
  R²: 0.9841835312882599
 MSE: 121.40184562396668



## GridSearch

### Ex. 6 - Optimizing flights linear regression

Up until now you've been using the default hyper-parameters when building your models. In this exercise you'll use cross validation to choose an optimal (or close to optimal) set of model hyper-parameters.

**Instructions:**

1. Create a parameter grid builder.
2. Add grids for with regression.regParam (values `0.01`, `0.1`, `1.0`, and `10.0`) and regression.`elasticNetParam` (values `0.0`, `0.5`, and `1.0`).
3. Build the grid.
4. Create a cross validator, specifying five folds.

In [28]:
# Loading the data
df_flights = flights_data.select('*')

# setting the label and features columns
label_col = 'duration'
feature_cols = ['km', 'org_vec']

# Setting the steps
indexer_flight = StringIndexer(inputCol='org', outputCol='org_idx')
onehot_flight = OneHotEncoder(inputCols=[indexer_flight.getOutputCol()], outputCols=['org_vec'])
assemble_flight = VectorAssembler(inputCols=feature_cols, outputCol='features')
regression_flight = LinearRegression(labelCol=label_col)

# Define the estimator
pipeline_flight = Pipeline(stages=[indexer_flight, onehot_flight, assemble_flight, regression_flight])

# Split into train and test set.
df_flights_train, df_flights_test = df_flights.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_flights_train.count()}, Testing set: {df_flights_test.count()}")

# A grid of parameter values (empty for the moment).
params = (ParamGridBuilder().addGrid(regression_flight.regParam, [0.01, 0.1, 1.0, 10.0])
                            .addGrid(regression_flight.elasticNetParam, [0.0, 0.5, 1.0])
                            .build())
print('Number of models to be tested: ', len(params))

# An object to evaluate model performance.
evaluator_flight = RegressionEvaluator(labelCol=label_col)

# The cross-validation object.
cv_flight = CrossValidator(estimator=pipeline_flight, estimatorParamMaps=params,
                           evaluator=evaluator_flight, numFolds=10, seed=SEED)

Training set: 37601, Testing set: 9421
Number of models to be tested:  12


### Ex. 7 - Dissecting the best flight duration model

You just set up a `CrossValidator` to find good parameters for the linear regression model predicting flight `duration`.

The model pipeline has multiple stages (objects of type `StringIndexer`, `OneHotEncoder`, `VectorAssembler` and `LinearRegression`), which operate in sequence. The stages are available as the stages attribute on the pipeline object. They are represented by a list and the stages are executed in the sequence in which they appear in the list.

Now you're going to take a closer look at the pipeline, split out the stages and use it to make predictions on the testing data.

The following objects have already been created:
- `cv_flight` — a trained CrossValidatorModel object and
- `evaluator_flight` — a RegressionEvaluator object.

**Instructions:**

1. Retrieve the best model.
2. Look at the stages in the best model.
3. Isolate the linear regression stage and extract its parameters.
4. Use the best model to generate predictions on the testing data and calculate the RMSE.

In [29]:
# Apply cross-validation to the training data.
cv_flight = cv_flight.fit(df_flights_train)
print(f"What's the cross-validated RMSE for each model? {cv_flight.avgMetrics}")

# Access the best model
best_model = cv_flight.bestModel
print(f'Best Model: \n{best_model}')

# Get the parameters for the LinearRegression object in the best model
print(f'''
LinearRegression Parameters (best model): 
best_model.stages[3].extractParamMap()
''')

# Generate predictions on testing data using the best model then calculate RMSE
predictions_flights = best_model.transform(df_flights_test)
print("RMSE =", evaluator_flight.evaluate(predictions_flights))

What's the cross-validated RMSE for each model? [11.043796482476411, 11.044215367880962, 11.044972888958773, 11.04569240725167, 11.07688496687515, 11.151202862483116, 11.168330585629565, 11.515881741405666, 11.693665918948176, 14.536143664571739, 17.014857807938384, 19.16091783706862]
Best Model: 
PipelineModel_515019dbe135

LinearRegression Parameters (best model): 
best_model.stages[3].extractParamMap()

RMSE = 11.017107718193145


## Ensemble

### Ex. 10 - Delayed flights with Gradient-Boosted Trees

You've previously built a classifier for flights likely to be delayed using a Decision Tree. In this exercise you'll compare a Decision Tree model to a Gradient-Boosted Trees model.

**Instructions:**

1. Import the classes required to create Decision Tree and Gradient-Boosted Tree classifiers.
2. Create Decision Tree and Gradient-Boosted Tree classifiers. Train on the training data.
3. Create an evaluator and calculate AUC on testing data for both classifiers. Which model performs better?
4. For the Gradient-Boosted Tree classifier print the number of trees and the relative importance of features.

In [30]:
# Loading the data
df_flights = flights_data.select('*')
df_flights = df_flights.withColumn('label', (df_flights['delay']>=15).cast('integer'))

# setting the label and features columns
feature_cols = ['mon', 'depart', 'duration']

# Setting the steps
assemble_flight = VectorAssembler(inputCols=feature_cols, outputCol='features')
df_flights = assemble_flight.transform(df_flights)

# Split into train and test set.
df_flights_train, df_flights_test = df_flights.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_flights_train.count()}, Testing set: {df_flights_test.count()}")

# ------------------------------------------------------------------------------
# Create the classifier model
gbt_flight = GBTClassifier(seed=SEED).fit(df_flights_train)
tree_flight = DecisionTreeClassifier(seed=SEED).fit(df_flights_train)

# Feature importances
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance GBT': DenseVector(gbt_flight.featureImportances),
    'Importance Tree': DenseVector(tree_flight.featureImportances)
})
print(f'''
Feature importance:
{feature_importance}
''')

# ------------------------------------------------------------------------------
# Predictions
predictions_flight_gbt = gbt_flight.transform(df_flights_test)
predictions_flight_tree = tree_flight.transform(df_flights_test)

# ------------------------------------------------------------------------------
# Evaluation
evaluator_flight = BinaryClassificationEvaluator()
print(f'''
Accuracy GBT: {evaluator_flight.evaluate(gbt_flight.transform(df_flights_test))}
Accuracy Tree: {evaluator_flight.evaluate(tree_flight.transform(df_flights_test))}
''')

Training set: 37601, Testing set: 9421

Feature importance:
    Feature  Importance GBT  Importance Tree
0       mon        0.306272         0.365375
1    depart        0.326609         0.424626
2  duration        0.367119         0.209999


Accuracy GBT: 0.6764592767631815
Accuracy Tree: 0.6177641355453237



### Ex. 11 - Delayed flights with a Random Forest

In this exercise you'll bring together cross validation and ensemble methods. You'll be training a Random Forest classifier to predict delayed flights, using cross validation to choose the best values for model parameters.

You'll find good values for the following parameters:
- `featureSubsetStrategy` — the number of features to consider for splitting at each node and
- `maxDepth` — the maximum number of splits along any branch.

Unfortunately building this model takes too long, so we won't be running the `.fit()` method on the pipeline.

The `RandomForestClassifier` class has already been imported into the session.

**Instructions:**

1. Create a random forest classifier object.
2. Create a parameter grid builder object. Add grid points for the featureSubsetStrategy and maxDepth parameters.
3. Create binary classification evaluator.
4. Create a cross-validator object, specifying the estimator, parameter grid and evaluator. Choose 5-fold cross validation.

In [31]:
# Create a random forest classifier
forest_flight = RandomForestClassifier()

# Create a parameter grid
params = (ParamGridBuilder().addGrid(forest_flight.featureSubsetStrategy, ['all', 'onethird', 'sqrt', 'log2'])
                            .addGrid(forest_flight.maxDepth, [2, 5, 10])
                            .build())
print('Number of models in GridSearch: ', len(params))

# Create a binary classification evaluator
evaluator_flight = BinaryClassificationEvaluator()

# Create a cross-validator
cv_flight = CrossValidator(estimator=forest_flight, 
                           estimatorParamMaps=params, 
                           evaluator=evaluator_flight, 
                           numFolds=5)

Number of models in GridSearch:  12


### Ex. 12 - Evaluating Random Forest

In this final exercise you'll be evaluating the results of cross-validation on a Random Forest model.

**Instructions:**

1. Print a list of average AUC metrics across all models in the parameter grid.
2. Display the average AUC for the best model. This will be the largest AUC in the list.
3. Print an explanation of the maxDepth and featureSubsetStrategy parameters for the best model.
4. Display the AUC for the best model predictions on the testing data.

In [32]:
# Loading the data
df_flights = flights_data.select('*')
df_flights = df_flights.withColumn('label', (df_flights['delay']>=15).cast('integer'))

# setting the label and features columns
feature_cols = ['mon', 'depart', 'duration']

# Setting the steps
assemble_flight = VectorAssembler(inputCols=feature_cols, outputCol='features')
df_flights = assemble_flight.transform(df_flights)

# Split into train and test set.
df_flights_train, df_flights_test = df_flights.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_flights_train.count()}, Testing set: {df_flights_test.count()}")

# ------------------------------------------------------------------------------
# Fitting the model
cv_flight = cv_flight.fit(df_flights_train)

# ------------------------------------------------------------------------------
# Inspecting results
print(f'''
Average AUC for each parameter combination in grid: 
{cv_flight.avgMetrics}

Average AUC for the best model: {max(cv_flight.avgMetrics)}

What's the optimal parameter value for maxDepth?
{cv_flight.bestModel.explainParam('maxDepth')}

What's the optimal parameter value for featureSubsetStrategy?
{cv_flight.bestModel.explainParam('featureSubsetStrategy')}

AUC for best model on testing data:
{evaluator_flight.evaluate(cv_flight.transform(df_flights_test))}
''')

Training set: 37601, Testing set: 9421

Average AUC for each parameter combination in grid: 
[0.6190417566601683, 0.6608697791808258, 0.6698831701918285, 0.646538142319198, 0.6636898339808173, 0.6736555609859828, 0.6420409970736392, 0.6644620268396466, 0.6709851288402727, 0.6420409970736392, 0.6644620268396466, 0.6709851288402727]

Average AUC for the best model: 0.6736555609859828

What's the optimal parameter value for maxDepth?
maxDepth: Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30]. (default: 5, current: 10)

What's the optimal parameter value for featureSubsetStrategy?
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the fe

# SMS Datasets

## Ex. 3 - SMS spam pipeline

You haven't looked at the SMS data for quite a while. Last time we did the following:
- split the text into tokens
- removed stop words
- applied the hashing trick
- converted the data from counts to IDF and
- trained a logistic regression model.
Each of these steps was done independently. This seems like a great application for a pipeline!

**Instructions:**

1. Create an object for splitting text into tokens.
2. Create an object to remove stop words. Rather than explicitly giving the input column name, use the `getOutputCol()` method on the previous object.
3. Create objects for applying the hashing trick and transforming the data into a TF-IDF. Use the `getOutputCol()` method again.
4. Create a pipeline which wraps all of the above steps as well as an object to create a Logistic Regression model.

In [33]:
# Loading the data
df_sms = sms_data.select('*')

# setting the label and features columns
label_col = 'label'
feature_cols = ["hash"]

# Setting the steps
tokenizer_sms = Tokenizer(inputCol='text', outputCol='words') # Break text into tokens at non-word characters
remover_sms = StopWordsRemover(inputCol=tokenizer_sms.getOutputCol(), outputCol='terms') # Remove stop words
hasher_sms = HashingTF(inputCol=remover_sms.getOutputCol(), outputCol="hash") # Hashing trick, transform to TF-IDF
idf_sms = IDF(inputCol=hasher_sms.getOutputCol(), outputCol="features")
logistic_sms = LogisticRegression() # Create a logistic regression object

# Split into train and test set.
df_sms_train, df_sms_test = df_sms.randomSplit([0.8, 0.2], seed=SEED)
print(f"Training set: {df_sms_train.count()}, Testing set: {df_sms_test.count()}")

Training set: 4503, Testing set: 1071


In [34]:
cols_to_drop = ['words', 'terms', 'hash', 'features', 
                'prediction', 'rawPrediction', 'probability']
df_sms_test = df_sms_test.drop(*cols_to_drop)
df_sms_train = df_sms_train.drop(*cols_to_drop)

# Define the pipeline
pipeline_sms = Pipeline(stages=[tokenizer_sms, remover_sms, hasher_sms, idf_sms, logistic_sms])

# Fitting the model on the trainning data
pipeline_sms = pipeline_sms.fit(df_sms_train)
df_sms_train = pipeline_sms.transform(df_sms_train)

# Evaluating on the testing data
predictions_sms = pipeline_sms.transform(df_sms_test)

# predictions_sms.groupBy('label', 'prediction').count().show() # Confusion matrix
cm_sms = predictions_sms.groupBy("label", "prediction").count().toPandas().sort_values(["prediction", "label"])
cm_sms.index = ['True negative (TN)', 'False negative (FN)',
                         'False positive (FP)', 'True positive (TP)']
TN = predictions_sms.filter('prediction = 0 AND label = 0').count()
TP = predictions_sms.filter('prediction = 1 AND label = 1').count()
FN = predictions_sms.filter('prediction = 0 AND label = 1').count()
FP = predictions_sms.filter('prediction = 1 AND label = 0').count()

# Accuracy measures the proportion of correct predictions
accuracy = (TN + TP) / (TN + TP + FN + FP)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print(f'''
Accuracy : {accuracy}
Precision: {precision}
Recall   : {recall}
''')
cm_sms


Accuracy : 0.976657329598506
Precision: 0.9534883720930233
Recall   : 0.8661971830985915



Unnamed: 0,label,prediction,count
True negative (TN),0,0.0,923
False negative (FN),1,0.0,19
False positive (FP),0,1.0,6
True positive (TP),1,1.0,123


## Ex. 8 - SMS spam optimised

The pipeline you built earlier for the SMS spam model used the default parameters for all of the elements in the pipeline. It's very unlikely that these parameters will give a particularly good model though. In this exercise you're going to run the pipeline for a selection of parameter values. We're going to do this in a systematic way: the values for each of the hyperparameters will be laid out on a grid and then pipeline will systematically run across each point in the grid.

In this exercise you'll set up a parameter grid which can be used with cross validation to choose a good set of parameters for the SMS spam classifier.

**Instructions:**

1. Create a parameter grid builder object.
2. Add grid points for `numFeatures` and `binary` parameters to the `HashingTF` object, giving values `1024`, `4096` and `16384`, and `True` and `False`, respectively.
3. Add grid points for `regParam` and `elasticNetParam` parameters to the `LogisticRegression` object, giving values of `0.01`, `0.1`, `1.0` and `10.0`, and `0.0`, `0.5`, and `1.0` respectively.
4. Build the parameter grid.

In [35]:
# Create parameter grid
params_sms = (ParamGridBuilder().addGrid(hasher_sms.numFeatures, [1024, 4096, 16384])    # Params for hashing
                                .addGrid(hasher_sms.binary, [True, False])
                                .addGrid(logistic_sms.regParam, [0.01, 0.1, 1.0, 10.0])  # Params for logReg
                                .addGrid(logistic_sms.elasticNetParam, [0.0, 0.5, 1.0])
                                .build())
print('Number of models created: ', len(params_sms))

Number of models created:  72


## Ex. 9 - How many models for grid search?

How many models will be built in the cross-validator?

In [36]:
# The cross-validation object.
numfolds = 5
cv_flight = CrossValidator(estimator=pipeline_flight, estimatorParamMaps=params,
                           evaluator=evaluator_flight, numFolds=numfolds, seed=SEED)
print('Number of models in the cross validator: ', len(params_sms)*numfolds)

Number of models in the cross validator:  360


# Close session

In [37]:
spark.stop()