# Model tuning and selection

In this last chapter, you'll apply what you've learned to create a model that predicts which flights will be delayed.

## Preparing the environment

### Importing libraries

In [1]:
import numpy as np

from pyspark.ml import Pipeline, evaluation as evals, tuning as tune
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.sql.types import (_parse_datatype_string, StructType, StructField,
                               DoubleType, IntegerType, StringType)
from pyspark.sql import SparkSession

### Connect to Spark

In [2]:
spark = SparkSession.builder.getOrCreate()

# Render DataFrames eagerly when they are the last expression in a notebook cell
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)

### Reading the data

In [3]:
schema_str = "year int, month int, day int, dep_time int, dep_delay int, arr_time int, " + \
             "arr_delay int, carrier string, tailnum string, flight int, origin string, " + \
             "dest string, air_time int, distance int, hour int, minute int"
customSchema = _parse_datatype_string(schema_str)
flights = spark.read.csv('data-sources/flights_small.csv', header=True, schema=customSchema)
flights.createOrReplaceTempView("flights")
flights.printSchema()
flights.limit(2)

root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: integer (nullable = true)
 |-- dep_delay: integer (nullable = true)
 |-- arr_time: integer (nullable = true)
 |-- arr_delay: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- tailnum: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- minute: integer (nullable = true)



year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,tailnum,flight,origin,dest,air_time,distance,hour,minute
2014,12,8,658,-7,935,-5,VX,N846VA,1780,SEA,LAX,132,954,6,58
2014,1,22,1040,5,1505,5,AS,N559AS,851,SEA,HNL,360,2677,10,40


In [4]:
schema_str = "faa string, name string, lat double, lon double, alt int, tz int, dst string"
customSchema = _parse_datatype_string(schema_str)
airports = spark.read.schema(customSchema).csv('data-sources/airports.csv', header=True)
airports.createOrReplaceTempView("airports")
airports.printSchema()
airports.limit(2)

root
 |-- faa: string (nullable = true)
 |-- name: string (nullable = true)
 |-- lat: double (nullable = true)
 |-- lon: double (nullable = true)
 |-- alt: integer (nullable = true)
 |-- tz: integer (nullable = true)
 |-- dst: string (nullable = true)



faa,name,lat,lon,alt,tz,dst
04G,Lansdowne Airport,41.1304722,-80.6195833,1044,-5,A
06A,Moton Field Munic...,32.4605722,-85.6800278,264,-5,A


In [5]:
customSchema = StructType([
    StructField("tailnum", StringType()),
    StructField("year", IntegerType()),
    StructField("type", StringType()),
    StructField("manufacturer", StringType()),
    StructField("model", StringType()),
    StructField("engines", IntegerType()),
    StructField("seats", IntegerType()),
    StructField("speed", DoubleType()),
    StructField("engine", StringType())
])
planes = (spark.read.schema(customSchema)
                    .format("csv")
                    .option("header", "true")
                    .load('data-sources/planes.csv'))
planes.createOrReplaceTempView("planes")
planes.printSchema()
planes.limit(2)

root
 |-- tailnum: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- engines: integer (nullable = true)
 |-- seats: integer (nullable = true)
 |-- speed: double (nullable = true)
 |-- engine: string (nullable = true)



tailnum,year,type,manufacturer,model,engines,seats,speed,engine
N102UW,1998,Fixed wing multi ...,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
N103US,1999,Fixed wing multi ...,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan


In [6]:
spark.catalog.listTables()

[Table(name='airports', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='flights', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='planes', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

## Preparing the data for the modelling process

### Getting the model data

In [7]:
planes = planes.withColumnRenamed('year', 'plane_year')

# Join the DataFrames
model_data = flights.join(planes, on='tailnum', how="leftouter")

# Create the column plane_age
model_data = model_data.withColumn("plane_age", model_data.year - model_data.plane_year)

# Create is_late
model_data = model_data.withColumn("is_late", model_data.arr_delay > 0)

# Convert to an integer
model_data = model_data.withColumn("label", model_data.is_late.cast('integer'))

# Remove missing values
model_data = model_data.filter("arr_delay is not NULL and "
                               "dep_delay is not NULL and "
                               "air_time is not NULL and "
                               "plane_year is not NULL")

model_data.printSchema()
print('Shape', (model_data.count(), len(model_data.columns)))
model_data.limit(2)

root
 |-- tailnum: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: integer (nullable = true)
 |-- dep_delay: integer (nullable = true)
 |-- arr_time: integer (nullable = true)
 |-- arr_delay: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- minute: integer (nullable = true)
 |-- plane_year: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- engines: integer (nullable = true)
 |-- seats: integer (nullable = true)
 |-- speed: double (nullable = true)
 |-- engine: string (nullable = true)
 |-- plane_age: integer (nullable = true)
 |-- is_late: 

tailnum,year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,flight,origin,dest,air_time,distance,hour,minute,plane_year,type,manufacturer,model,engines,seats,speed,engine,plane_age,is_late,label
N846VA,2014,12,8,658,-7,935,-5,VX,1780,SEA,LAX,132,954,6,58,2011,Fixed wing multi ...,AIRBUS,A320-214,2,182,,Turbo-fan,3,False,0
N559AS,2014,1,22,1040,5,1505,5,AS,851,SEA,HNL,360,2677,10,40,2006,Fixed wing multi ...,BOEING,737-890,2,149,,Turbo-fan,8,True,1


### Encoding the categorical variables

In [8]:
# Create a StringIndexer
carr_indexer = StringIndexer(inputCol="carrier", outputCol="carrier_index")
dest_indexer = StringIndexer(inputCol="dest", outputCol="dest_index")

# Create a OneHotEncoder
carr_encoder = OneHotEncoder(inputCol="carrier_index", outputCol="carrier_fact")
dest_encoder = OneHotEncoder(inputCol="dest_index", outputCol="dest_fact")
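
As a rough illustration of what these two stages do, here is a plain-Python sketch that needs no Spark session. It assumes the default settings: `StringIndexer` numbering categories from most to least frequent, and `OneHotEncoder` dropping the last category. The helper names `fit_string_index` and `one_hot` and the toy carrier list are hypothetical, not part of PySpark or of the flights data.

```python
from collections import Counter


def fit_string_index(values):
    """Assign index 0 to the most frequent category, 1 to the next, and so on.

    Ties are broken alphabetically in this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Return a dense one-hot list; with drop_last the last category is all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Toy carrier column (made up for illustration)
carriers = ["AS", "AS", "AS", "VX", "VX", "US"]
carrier_index = fit_string_index(carriers)
encoded = {c: one_hot(i, len(carrier_index)) for c, i in carrier_index.items()}
print(carrier_index)
print(encoded)
```

Spark stores the encoded column as a sparse vector, which is why values such as `(10,[7],[1.0])` show up in the output further down: a vector of length 10 whose only non-zero entry is at position 7.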

### Combine the selected features into a single column

In [9]:
# Make a VectorAssembler
vec_assembler = VectorAssembler(
    inputCols=["month", "air_time", "carrier_fact", "dest_fact", "plane_age"], 
    outputCol='features'
)

### Set the pipeline for the final result

In [10]:
# Make the pipeline
flights_pipe = Pipeline(
    stages=[dest_indexer, dest_encoder, carr_indexer, carr_encoder, vec_assembler]
)

# Fit and transform the data
piped_data = flights_pipe.fit(model_data).transform(model_data)
piped_data.printSchema()
print('Shape', (piped_data.count(), len(piped_data.columns)))
piped_data.limit(2)

root
 |-- tailnum: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: integer (nullable = true)
 |-- dep_delay: integer (nullable = true)
 |-- arr_time: integer (nullable = true)
 |-- arr_delay: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- minute: integer (nullable = true)
 |-- plane_year: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- engines: integer (nullable = true)
 |-- seats: integer (nullable = true)
 |-- speed: double (nullable = true)
 |-- engine: string (nullable = true)
 |-- plane_age: integer (nullable = true)
 |-- is_late: 

tailnum,year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,flight,origin,dest,air_time,distance,hour,minute,plane_year,type,manufacturer,model,engines,seats,speed,engine,plane_age,is_late,label,dest_index,dest_fact,carrier_index,carrier_fact,features
N846VA,2014,12,8,658,-7,935,-5,VX,1780,SEA,LAX,132,954,6,58,2011,Fixed wing multi ...,AIRBUS,A320-214,2,182,,Turbo-fan,3,False,0,1.0,"(68,[1],[1.0])",7.0,"(10,[7],[1.0])","(81,[0,1,9,13,80]..."
N559AS,2014,1,22,1040,5,1505,5,AS,851,SEA,HNL,360,2677,10,40,2006,Fixed wing multi ...,BOEING,737-890,2,149,,Turbo-fan,8,True,1,19.0,"(68,[19],[1.0])",0.0,"(10,[0],[1.0])","(81,[0,1,2,31,80]..."


### Splitting the data into training and test sets

In [11]:
training, test = piped_data.randomSplit([.6, .4])

print(f'Training set: {training.count()} rows.')
print(f'Testing set : {test.count()} rows.')

Training set: 5628 rows.
Testing set : 3675 rows.
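
The row counts above are only approximately 60% and 40% because the split is random per row rather than an exact cut. The following plain-Python sketch illustrates that idea under simplifying assumptions; the helper `random_split` is hypothetical and is not how Spark is implemented internally. Note that `randomSplit` also accepts an optional `seed` argument, which makes the split reproducible between runs.

```python
import random


def random_split(rows, train_fraction=0.6, seed=42):
    """Send each row to training with probability train_fraction, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    training, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            training.append(row)
        else:
            test.append(row)
    return training, test


# 1000 stand-in rows (integers instead of flights)
training_rows, test_rows = random_split(range(1000))
print(len(training_rows), len(test_rows))
```

Running the function twice with the same seed returns the same partition, while changing the seed changes both the membership and the exact sizes.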


## What is logistic regression?

The model you'll be fitting in this chapter is called a logistic regression. This model is very similar to a linear regression, but instead of predicting a numeric variable, it predicts the probability (between 0 and 1) of an event.

To use this as a classification algorithm, all you have to do is assign a cutoff point to these probabilities. If the predicted probability is above the cutoff point, you classify that observation as a 'yes' (in this case, the flight being late). If it's below, you classify it as a 'no'!
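
To make the cutoff idea concrete, here is a minimal plain-Python sketch. It assumes a cutoff of 0.5, and the three scores are made-up numbers rather than values from the flights data; the function names `sigmoid` and `classify` are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability: float, cutoff: float = 0.5) -> int:
    """Return 1 ('late') when the probability is above the cutoff, else 0."""
    return 1 if probability > cutoff else 0


# Hypothetical linear scores for three flights
scores = [-2.0, 0.0, 1.5]
probabilities = [sigmoid(z) for z in scores]
predictions = [classify(p) for p in probabilities]
print(predictions)  # → [0, 0, 1]
```

Raising the cutoff makes the classifier more reluctant to say 'late', and lowering it does the opposite.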

You'll tune this model by testing different values for several hyperparameters. A hyperparameter is just a value in the model that's not estimated from the data, but rather is supplied by the user to maximize performance. For this course it's not necessary to understand the mathematics behind all of these values - what's important is that you'll try out a few different choices and pick the best one.



## Ex. 1 - Create the modeler

The Estimator you'll be using is a `LogisticRegression` from the `pyspark.ml.classification` submodule.

**Instructions:**

1. Import the `LogisticRegression` class from `pyspark.ml.classification`. (Already done!).
2. Create a `LogisticRegression` called `lr` by calling `LogisticRegression()` with no arguments.

In [12]:
# Create a LogisticRegression Estimator
lr = LogisticRegression()

## Cross validation

In the next few exercises you'll be tuning your logistic regression model using a procedure called k-fold cross validation. This is a method of estimating the model's performance on unseen data (like your test DataFrame).

It works by splitting the training data into a few different partitions. The exact number is up to you, but in this course you'll be using PySpark's default value of three. Once the data is split up, one of the partitions is set aside, and the model is fit to the others. Then the error is measured against the held out partition. This is repeated for each of the partitions, so that every block of data is held out and used as a test set exactly once. Then the error on each of the partitions is averaged. This is called the cross validation error of the model, and is a good estimate of the actual error on the held out data.
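
The procedure described above can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: it assigns rows to folds in a fixed round-robin order, whereas a real cross validator would typically assign them at random. The names `k_fold_indices` and `cross_validation_error` are hypothetical.

```python
def k_fold_indices(n_rows: int, k: int = 3):
    """Yield (train_indices, held_out_indices) pairs, one pair per fold."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def cross_validation_error(fold_errors):
    """Average the per-fold errors into a single cross validation error."""
    return sum(fold_errors) / len(fold_errors)


splits = list(k_fold_indices(n_rows=9, k=3))
for train, held_out in splits:
    print("fit on", train, "-> measure on", held_out)
```

Every row index appears in exactly one held-out block, so each observation is used for testing exactly once and for fitting in the other two rounds.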

You'll be using cross validation to choose the hyperparameters by creating a grid of the possible pairs of values for the two hyperparameters, `elasticNetParam` and `regParam`, and using the cross validation error to compare all the different models so you can choose the best one!

## Ex. 2 - Create the evaluator

The first thing you need when doing cross validation for model selection is a way to compare different models. Luckily, the `pyspark.ml.evaluation` submodule has classes for evaluating different kinds of models. Your model is a binary classification model, so you'll be using the `BinaryClassificationEvaluator` from the `pyspark.ml.evaluation` module.

This evaluator calculates the area under the ROC curve. This metric combines the two kinds of errors a binary classifier can make (false positives and false negatives) into a single number. You'll learn more about this towards the end of the chapter!

**Instructions:**

1. Import the submodule `pyspark.ml.evaluation` as `evals`. (Already done!)
2. Create `evaluator` by calling `evals.BinaryClassificationEvaluator()` with the argument `metricName="areaUnderROC"`.

In [13]:
# Create a BinaryClassificationEvaluator
evaluator = evals.BinaryClassificationEvaluator(metricName="areaUnderROC")

## Ex. 3 - Make a grid

Next, you need to create a grid of values to search over when looking for the optimal hyperparameters. The submodule `pyspark.ml.tuning` includes a class called `ParamGridBuilder` that does just that (maybe you're starting to notice a pattern here; PySpark has a submodule for just about everything!).

You'll need to use the `.addGrid()` and `.build()` methods to create a grid that you can use for cross validation. The `.addGrid()` method takes a model parameter (an attribute of the model Estimator, `lr`, that you created a few exercises ago) and a list of values that you want to try. The `.build()` method takes no arguments; it just returns the grid that you'll use later.

**Instructions:**

1. Import the submodule `pyspark.ml.tuning` under the alias `tune`. (Already done!)
2. Call the class constructor `ParamGridBuilder()` with no arguments. Save this as `grid`.
3. Call the `.addGrid()` method on `grid` with `lr.regParam` as the first argument and `np.arange(0, .1, .01)` as the second argument. This second call is a function from the `numpy` module (imported as `np`) that creates an array of numbers from 0 up to, but not including, .1, incrementing by .01. Overwrite `grid` with the result.
4. Update `grid` again by calling the `.addGrid()` method a second time to create a grid for `lr.elasticNetParam` that includes only the values `[0, 1]`.
5. Call the `.build()` method on `grid` and overwrite it with the output.

In [14]:
# Create the parameter grid
grid = tune.ParamGridBuilder()

# Add the hyperparameter
grid = grid.addGrid(lr.regParam, np.arange(0, .1, .01))
grid = grid.addGrid(lr.elasticNetParam, [0, 1])

# Build the grid
grid = grid.build()
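
To see how many combinations a grid like this contains, here is a plain-Python sketch that builds the same kind of cross product with `itertools.product`. It uses ordinary dictionaries rather than PySpark param maps, and the rounded values only mirror `np.arange(0, .1, .01)`, which stops before the end value.

```python
from itertools import product

# Ten values 0.00, 0.01, ..., 0.09; the stop value 0.1 is excluded
reg_values = [round(i * 0.01, 2) for i in range(10)]
elastic_net_values = [0, 1]

# Every pairing of a regParam value with an elasticNetParam value
param_grid = [
    {"regParam": reg, "elasticNetParam": en}
    for reg, en in product(reg_values, elastic_net_values)
]

print(len(param_grid))  # → 20
print(param_grid[0], param_grid[-1])
```

With 20 combinations and three folds, cross validating this grid involves 20 x 3 = 60 model fits before the best setting is refit on the full training set.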

## Ex. 4 - Make the validator

The submodule `pyspark.ml.tuning` also has a class called `CrossValidator` for performing cross validation. This `Estimator` takes the modeler you want to fit, the grid of hyperparameters you created, and the evaluator you want to use to compare your models.

The submodule `pyspark.ml.tuning` has already been imported as `tune`. You'll create the `CrossValidator` by passing it the logistic regression Estimator `lr`, the parameter `grid`, and the `evaluator` you created in the previous exercises.

**Instructions:**

1. Create a `CrossValidator` by calling `tune.CrossValidator()` with the arguments:
    - `estimator=lr`
    - `estimatorParamMaps=grid`
    - `evaluator=evaluator`
2. Name this object `cv`.

In [15]:
# Create the CrossValidator
cv = tune.CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cv

CrossValidator_babb027848fc

## Ex. 5 - Fit the model(s)

Cross validation is a very computationally intensive procedure, because every combination in the grid is fit once per fold, so this step can take a while.
To run it locally, use the code:

```
# Fit cross validation models
models = cv.fit(training)

# Extract the best model
best_lr = models.bestModel
```

**Instructions:**
1. Fit `cv` on the `training` set.
2. Get the best model and save it as `best_lr`.
3. Print the best hyperparameters of the model.

In [16]:
# Fit cross validation models
models = cv.fit(training)

In [17]:
# Extract the best model
best_lr = models.bestModel
best_lr

LogisticRegressionModel: uid=LogisticRegression_1e23dad5a81a, numClasses=2, numFeatures=81

In [18]:
# Printing the best params
print('Best Param (regParam): ', best_lr.getRegParam())
print('Best Param (elasticNetParam): ', best_lr.getElasticNetParam())

Best Param (regParam):  0.0
Best Param (elasticNetParam):  0.0


In [19]:
models.getEstimatorParamMaps()[np.argmax(models.avgMetrics)]

{Param(parent='LogisticRegression_1e23dad5a81a', name='regParam', doc='regularization parameter (>= 0).'): 0.0,
 Param(parent='LogisticRegression_1e23dad5a81a', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0}
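
The lookup above pairs each parameter combination with its average metric across folds and picks the largest, since a higher AUC is better. A plain-Python sketch of that selection logic follows; the per-fold AUC values are made up purely for illustration and are not results from this notebook.

```python
# Hypothetical per-fold AUC values keyed by (regParam, elasticNetParam)
fold_metrics = {
    (0.0, 0): [0.70, 0.72, 0.71],
    (0.05, 0): [0.69, 0.70, 0.68],
    (0.05, 1): [0.50, 0.50, 0.50],
}

# Average each candidate's metric over the folds
avg_metrics = {params: sum(vals) / len(vals) for params, vals in fold_metrics.items()}

# Keep the candidate with the highest average AUC
best_params = max(avg_metrics, key=avg_metrics.get)
print(best_params)  # → (0.0, 0)
```

For a metric where smaller is better, such as an error rate, the same logic would use `min` instead of `max`.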

## Evaluating binary classifiers

For this course we'll be using a common metric for binary classification algorithms called the AUC, or area under the curve. In this case, the curve is the ROC, or receiver operating characteristic curve. The details of what these things actually measure aren't important for this course. All you need to know is that for our purposes, the closer the AUC is to one (1), the better the model is!
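
For readers who want some intuition anyway, one standard way to read the AUC is as the fraction of (late, on-time) pairs in which the late flight receives the higher predicted score. The plain-Python sketch below computes it that way on made-up labels and scores; it is a brute-force illustration, not how `BinaryClassificationEvaluator` computes the metric, and the function name `auc` is hypothetical.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = late) and predicted probabilities of being late
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))  # → 0.75
```

A model that ranks every late flight above every on-time flight scores 1.0, while one that assigns the same score to every flight scores 0.5.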

## Ex. 6 - Evaluate the model

Remember the `test` data that you set aside? It's finally time to test your model on it! You can use the same evaluator you made to fit the model.

**Instructions:**

1. Use your model to generate predictions by applying `best_lr.transform()` to the `test` data. Save this as `test_results`.
2. Call `evaluator.evaluate()` on `test_results` to compute the AUC. Print the output.

In [20]:
# Use the model to predict the test set
test_results = best_lr.transform(test)

# Evaluate the predictions
print(evaluator.evaluate(test_results))

0.7092976660204492


In [21]:
test_results.limit(2)

tailnum,year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,flight,origin,dest,air_time,distance,hour,minute,plane_year,type,manufacturer,model,engines,seats,speed,engine,plane_age,is_late,label,dest_index,dest_fact,carrier_index,carrier_fact,features,rawPrediction,probability,prediction
N102UW,2014,11,9,2220,-5,555,-11,US,1930,PDX,CLT,257,2282,22,20,1998,Fixed wing multi ...,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,16,False,0,33.0,"(68,[33],[1.0])",5.0,"(10,[5],[1.0])","(81,[0,1,7,45,80]...",[1.09596277884176...,[0.74950288792253...,0.0
N107US,2014,5,15,1058,-2,1845,-23,US,2092,SEA,CLT,264,2279,10,58,1999,Fixed wing multi ...,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan,15,False,0,33.0,"(68,[33],[1.0])",5.0,"(10,[5],[1.0])","(81,[0,1,7,45,80]...",[0.66860176047499...,[0.66118999931458...,0.0


## Close

In [22]:
spark.stop()