# Tuning Machine Learning models in Spark

<a href = "http://yogen.io"><img src="http://yogen.io/assets/logo.svg" alt="yogen" style="width: 200px; float: right;"/></a>

## ML Pipelines in Spark

Spark ships with the usual ML models and lets us train them using the resources of the cluster.

Moreover, if we want to tune parameters, for example, we have to run the training several times. Spark introduces the concept of a Pipeline, which improves the performance of these processes by making use of the cluster's capacity.

Using Spark for deep learning is not trivial. Deep learning is usually done on a single machine with a powerful GPU, on a multi-GPU machine (supported by libraries such as Keras or TensorFlow), or, as a last resort, on a GPU cluster. To learn more: Juan Arévalo.

In pyspark.ml we work in several steps, and each step applies a transformation. There are two kinds of steps:

- `Transformer`: takes a DataFrame as input and returns another DataFrame, working row by row. For example, computing a new column from two existing ones.
- `Estimator`: there are cases where we cannot transform row by row, because we first need to read the whole dataset. For example, to normalize a column we first need to know the mean of its values. The Spark developers noticed that this is the same situation as training a model: to fit a linear regression, for instance, we have to read the entire dataset to obtain the betas. So both cases are grouped under `Estimator`. We can also think of an Estimator as something that inspects a dataset and, from it, generates a Transformer.

>     `Transformer`---> Pipeline.transform()
>     `Estimator`  ---> Pipeline.fit()

**_A model is a transformer that has been generated by an estimator_**
 
ML model training and tuning often means running the same steps over and over. Often, we run the same steps with small variations in order to evaluate combinations of parameters.

In order to make this use case a lot easier, Spark provides the [Pipeline](https://spark.apache.org/docs/2.3.0/ml-pipeline.html) abstraction.

A Pipeline represents a series of steps in the processing of a dataset. Each step is a Transformer or an Estimator. The whole Pipeline is an Estimator, so we can `.fit` the whole pipeline in one step. When we do that, the steps' `.fit` and `.transform` methods are called in turn.

![pipelineestimator](https://spark.apache.org/docs/2.3.0/img/ml-Pipeline.png)

![PipelineModel](https://spark.apache.org/docs/2.3.0/img/ml-PipelineModel.png)
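
To make the fit/transform chaining concrete, here is a minimal plain-Python sketch. It is not Spark code: `MeanCenterer`, `Doubler`, `ToyPipeline` and the other class names are hypothetical, and lists of numbers stand in for DataFrames. It only illustrates the idea that fitting a pipeline fits each Estimator in turn, transforms the data with the resulting model, and passes the result on to the next stage.

```python
class MeanCenterer:
    """Toy Estimator: learns the mean in fit() and returns a Transformer."""

    def fit(self, rows):
        mean = sum(rows) / len(rows)
        return MeanCentererModel(mean)


class MeanCentererModel:
    """Toy Transformer produced by MeanCenterer.fit()."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [x - self.mean for x in rows]


class Doubler:
    """Toy Transformer: works row by row, there is nothing to learn."""

    def transform(self, rows):
        return [2 * x for x in rows]


class ToyPipeline:
    """Toy Estimator that chains stages, loosely mimicking Pipeline.fit()."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; Transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Toy Transformer holding only fitted stages."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


pipeline_model = ToyPipeline([Doubler(), MeanCenterer()]).fit([1.0, 2.0, 3.0])
centered = pipeline_model.transform([4.0])
print(centered)
```

The mean is learned from the doubled training values only, and the fitted model reuses it on new data.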

*NOTE*

pyspark.mllib is based on RDDs. It is in maintenance mode.

pyspark.ml is based on DataFrames. It is the API to use.

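Pseudocode:

```
data

train, test = split(data)

ntrain = train.normalize()   # uses the mean and variance of train
model.fit(ntrain)

ntest = test.normalize()     # uses the mean and variance of test
model.predict(ntest)
```

This is a problem, because information about the test data (its mean and variance) ends up in the process: the test set is scaled with its own statistics instead of the ones learned from the training set.

With a pipeline this is avoided, because the normalization is fitted once, on the training data, and then only applied to the test data.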
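
The pseudocode above can be sketched in plain Python with the statistics learned from the training data only. This is an illustration under simple assumptions (made-up numbers, population standard deviation, a hypothetical `normalize` helper), not what Spark does internally.

```python
from statistics import mean, pstdev

train = [10.0, 12.0, 14.0]
test = [20.0, 22.0]

# "fit": the statistics come from the training data only.
mu, sigma = mean(train), pstdev(train)


def normalize(values, mu, sigma):
    """Scale values with statistics that were learned elsewhere."""
    return [(v - mu) / sigma for v in values]


ntrain = normalize(train, mu, sigma)
ntest = normalize(test, mu, sigma)  # reuses the training statistics
print(ntrain, ntest)
```

Because the test values are scaled with the training mean and deviation, they stay comparable to what the model saw during training.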

## Example: predicting flight delays

We'll be using the same [Transtats](https://www.transtats.bts.gov/) OTP performance data from way back when. Remember it?

It's a table that contains all domestic departures by US air carriers that represent at least one percent of domestic scheduled passenger revenues, with data on each individual departure including [Tail Number](https://en.wikipedia.org/wiki/Tail_number), departure delay, origin, destination and carrier.


### Load the data

Opening .zip files in Spark is a bit of a pain. For now, let's just decompress the file we want to read. When we are ready to expand the processing to the cluster, we will need to do [this](https://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark).

In [118]:
df = spark.read.csv('data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2018_12.csv', 
                    header=True, inferSchema=True)

df = df[['Year', 'Month', 'DayOfMonth', 'FlightDate', 'DayOfWeek', 'Reporting_Airline', 'Tail_Number', 'Flight_Number_Reporting_Airline', 'Origin', 
                'OriginCityName', 'OriginStateName', 'Dest', 'DestCityName', 'DestStateName',
                'DepTime', 'DepDelay', 'AirTime', 'Distance']]
                    
df.show(10)

+----+-----+----------+-------------------+---------+-----------------+-----------+-------------------------------+------+--------------+---------------+----+----------------+-------------+-------+--------+-------+--------+
|Year|Month|DayOfMonth|         FlightDate|DayOfWeek|Reporting_Airline|Tail_Number|Flight_Number_Reporting_Airline|Origin|OriginCityName|OriginStateName|Dest|    DestCityName|DestStateName|DepTime|DepDelay|AirTime|Distance|
+----+-----+----------+-------------------+---------+-----------------+-----------+-------------------------------+------+--------------+---------------+----+----------------+-------------+-------+--------+-------+--------+
|2018|   12|        25|2018-12-25 00:00:00|        2|               WN|     N566WN|                           1823|   OAK|   Oakland, CA|     California| GEG|     Spokane, WA|   Washington|   1048|    18.0|  111.0|   723.0|
|2018|   12|        25|2018-12-25 00:00:00|        2|               WN|     N562WN|                     

### Drop nas

There are only a few departures for which any of the columns of interest contains null values. The most expedient way to handle them is to just drop them, since they won't make much of a difference.

In [119]:
df.dropna()

DataFrame[Year: int, Month: int, DayOfMonth: int, FlightDate: timestamp, DayOfWeek: int, Reporting_Airline: string, Tail_Number: string, Flight_Number_Reporting_Airline: int, Origin: string, OriginCityName: string, OriginStateName: string, Dest: string, DestCityName: string, DestStateName: string, DepTime: int, DepDelay: double, AirTime: double, Distance: double]

NA-related functions are grouped in a .na attribute of DataFrames.

In [120]:
df = df.na.drop()

## Feature extraction and generation of target variable

The departure hour is a key factor in delays, so we need to calculate it from the departure time. Since the input file encodes times as hhmm numbers (1048 means 10:48), Spark has interpreted them as integers:

In [121]:
df.select('DepTime').show(5)

+-------+
|DepTime|
+-------+
|   1048|
|    638|
|   1710|
|   1318|
|    953|
+-------+
only showing top 5 rows



In [122]:
from pyspark.sql import functions, types

minutes = (df['DepTime'] / 100).cast(types.IntegerType()) * 60 + df['DepTime'] % 100
with_minutes = df.withColumn('DepMinutes', minutes)

In [123]:
with_minutes.select('DepTime','DepMinutes').show(5)

+-------+----------+
|DepTime|DepMinutes|
+-------+----------+
|   1048|       648|
|    638|       398|
|   1710|      1030|
|   1318|       798|
|    953|       593|
+-------+----------+
only showing top 5 rows
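
The hhmm arithmetic used for `DepMinutes` can be checked in plain Python. This is a small sketch with a hypothetical helper name, using the five `DepTime` values shown in the output; it also returns the hour, which is the integer part of the division by 100.

```python
def hhmm_to_hour_and_minutes(dep_time):
    """Split an hhmm integer into (hour, minutes since midnight)."""
    hour, minute = divmod(dep_time, 100)
    return hour, hour * 60 + minute


examples = [1048, 638, 1710, 1318, 953]
converted = [hhmm_to_hour_and_minutes(t) for t in examples]
print(converted)
```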



#### Exercise

Calculate the 'DepHour' column that represents the hour as an int.

In [124]:
# already done above

We will also generate a binary target variable. The aviation industry considers a flight delayed when it departs more than 15 minutes after its scheduled departure time, so we will use that. We will create it as an integer, since that is what the learning algorithms expect.

In [125]:
with_delayed_col = with_minutes.withColumn('Delayed', (df['DepDelay'] > 15).cast(types.ByteType()) )

In [126]:
with_delayed_col.select('DepDelay','Delayed').show(5)

+--------+-------+
|DepDelay|Delayed|
+--------+-------+
|    18.0|      1|
|    -2.0|      0|
|     0.0|      0|
|    -2.0|      0|
|    -2.0|      0|
+--------+-------+
only showing top 5 rows



In order to make the training times manageable, let's pick only 10% of the data to train.

To split randomly, we pass the weights as proportions. We could also pass 2.0 and 18.0 and the result would be equivalent.

In [127]:
train, test = with_delayed_col.randomSplit([1.0,9.0]) # normally it would be the other way round (train with 0.9), but we do this to keep training fast
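
The weights passed to `randomSplit` are normalized when they do not add up to 1, which is why only their proportion matters. A plain-Python sketch of that normalization (the helper name is hypothetical):

```python
def normalize_weights(weights):
    """Turn arbitrary positive weights into fractions that add up to 1."""
    total = sum(weights)
    return [w / total for w in weights]


print(normalize_weights([1.0, 9.0]))
print(normalize_weights([2.0, 18.0]))
```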

In [128]:
data = with_delayed_col

## Handle different fields in different ways

We have features of at least three kinds:

* **Numeric continuous fields**, which we can use as input to many algorithms as they are. In particular, decision trees can take continuous variables with any value as input, since they only look for the cutoff point that most increases the homogeneity of the resulting groups. In contrast, if we were using a logistic regression with regularization, for example, we would need to first scale the variables to have comparable magnitudes.

* There are fields which we will treat as **categorical variables**, but which are **already integers**. These need to be one-hot encoded.

* Finally, there are several **categorical variables** that are encoded as **strings**. These need to be **one-hot encoded**, but OneHotEncoder requires numeric input. Therefore, we will need to apply a StringIndexer to each of them before one-hot encoding.

We have string columns that we need to convert into numbers. The StringIndexer maps the most frequent string to 0, the second most frequent to 1, and so on. It is an algorithm like any other, but one in which we lose information. If we knew what each string means (cities, for example), we could bring in information about the cities with a join.
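
A plain-Python sketch of frequency-based indexing, in the spirit of what StringIndexer does by default. The helper name and the sample values are made up, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Ties are broken alphabetically (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


airlines = ["WN", "DL", "WN", "AA", "DL", "WN"]
label_to_index = fit_string_index(airlines)
print(label_to_index)
```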

In [129]:
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler

We have generated the list of names of columns that have dataType string with a list comprehension, rather than hard-coding it, but it is just like the other ones.

In [130]:
categorical_fields = ['Year', 'Month', 'DayOfMonth', 'DayOfWeek', 'Reporting_Airline', 
                      'Origin', 'OriginCityName', 'OriginStateName', 
                      'Dest', 'DestCityName', 'DestStateName']

string_fields = [field.name for field in data.schema.fields if field.dataType == types.StringType()]

continuous_fields = ['Distance', 'DepMinutes']

target_field = 'Delayed'

## Handling categorical fields

Let's do the processing of just one field first, as an example. Then we will process the rest.

### StringIndexer 

A [StringIndexer](https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer) is an estimator that takes a single string field, then produces a transformer that codifies said field as numeric labels that are fit for feeding to a one-hot encoding. 

We need to specify an input column, an output column, and a way to handle invalids. In this case, invalids are values that the indexer has not seen during fitting but that the transformer finds during processing. The possible values are 'error' (the default), which is pretty self-explanatory, 'skip', which drops the affected rows, and 'keep', which is what we want: it assigns all unseen labels to a single extra category index.
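
The three `handleInvalid` behaviours can be sketched in plain Python. This is only an illustration: `index_label` and the `known` mapping are hypothetical, and returning `None` stands in for dropping the row.

```python
def index_label(label, label_to_index, handle_invalid="error"):
    """Look up a label, applying one of three policies to unseen labels."""
    if label in label_to_index:
        return label_to_index[label]
    if handle_invalid == "keep":
        # All unseen labels share one extra index after the known ones.
        return float(len(label_to_index))
    if handle_invalid == "skip":
        return None  # the caller would drop this row
    raise ValueError(f"Unseen label: {label}")


known = {"WN": 0.0, "DL": 1.0, "AA": 2.0}
print(index_label("WN", known))
print(index_label("ZZ", known, handle_invalid="keep"))
```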

In [131]:
indexerEstimator = StringIndexer(inputCol='Reporting_Airline', outputCol='Reporting_AirlineIndex')

indexerModel = indexerEstimator.fit(data)
indexerModel

StringIndexer_e201d1350cf8

**_The estimator returns a transformer, which is the model_**

The fit takes a while: it is an estimator, so it needs to go through the data.

In [132]:
indexed = indexerModel.transform(data)

The transform returns immediately: it is only recorded, not executed (lazy).

### OneHotEncoderEstimator

A [OneHotEncoder](https://spark.apache.org/docs/latest/ml-features#onehotencoderestimator) generates an n-1 length vector column for an n-category column of category indices. 

We need to specify the input columns (there can be several) and the matching output columns.

Suppose we have three airlines already converted to the indices 0, 1 and 2. A row with airline 0 yields the vector [1,0], airline 1 yields [0,1], and airline 2, the last category, yields [0,0], because the last category is dropped by default.

**NOTE: here we are fitting on the complete data, but normally we would fit on the training data only. What happens if the complete data contains airlines that did not end up in the training data? That is fine, because it reflects the fact that in production we will run into new airlines. That is what `handleInvalid` is for: with 'keep', it adds one extra position that is set to one for unknown values, acting as an `others` category.**
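
A minimal plain-Python sketch of one-hot encoding with the last category dropped, which is the default behaviour of Spark's encoder (`dropLast=True`). The `one_hot` helper is hypothetical and returns dense lists rather than sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a 0/1 vector, optionally dropping the last category."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


encoded = [one_hot(i, 3) for i in range(3)]
print(encoded)
```

The last category is encoded as all zeros, so n categories need only n-1 positions.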

In [133]:
oneHotEstimator = OneHotEncoderEstimator(inputCols=['Reporting_AirlineIndex'], outputCols=['Reporting_AirlineOneHot'], handleInvalid='keep')

oneHotEstimator

OneHotEncoderEstimator_299e30dcd5fa

In [134]:
oneHotModel = oneHotEstimator.fit(indexed)
oneHot = oneHotModel.transform(indexed)

(oneHot
 .select('Reporting_Airline', 'Reporting_AirlineIndex', 'Reporting_AirlineOneHot')
 .sample(fraction=.001)
 .take(5))

[Row(Reporting_Airline='WN', Reporting_AirlineIndex=0.0, Reporting_AirlineOneHot=SparseVector(17, {0: 1.0})),
 Row(Reporting_Airline='WN', Reporting_AirlineIndex=0.0, Reporting_AirlineOneHot=SparseVector(17, {0: 1.0})),
 Row(Reporting_Airline='WN', Reporting_AirlineIndex=0.0, Reporting_AirlineOneHot=SparseVector(17, {0: 1.0})),
 Row(Reporting_Airline='WN', Reporting_AirlineIndex=0.0, Reporting_AirlineOneHot=SparseVector(17, {0: 1.0})),
 Row(Reporting_Airline='WN', Reporting_AirlineIndex=0.0, Reporting_AirlineOneHot=SparseVector(17, {0: 1.0}))]

### SparseVectors

The vectors produced by OneHotEncoder will each have only one non-zero value, but can potentially be very long. An efficient way to represent them is therefore a SparseVector, and that is what OneHotEncoder generates. 

A SparseVector is a data structure that only stores the length of the vector, a list of positions, and a list of values. All other values are assumed to be 0s.

This way, a vector like the following, with length 15 and non-zero values only on positions 3 and 9:

```python
[0.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

can be compactly expressed as

```python
(15, [3, 9], [6.0, 4.0]) # A vector of size 15, with values 6 and 4 at positions 3 and 9
```

SparseVector: vectors that contain many zeros can be represented more compactly.
That is what we see in the output of the previous cell.
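
The conversion between the dense and the sparse representation can be sketched in a few lines of plain Python. The helpers `to_sparse` and `to_dense` are hypothetical and work on tuples and lists, not on Spark's `SparseVector` class.

```python
def to_sparse(dense):
    """Return (size, indices, values) keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list from its sparse representation."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = [0.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0]
sparse = to_sparse(dense)
print(sparse)
```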

## Let's build our first Pipeline!

Our pipeline consists of a number of StringIndexers, followed by one OneHotEncoder, followed by a VectorAssembler, with a RandomForestClassifier at the end.

A Spark Pipeline is a single Estimator. We build it by specifying the stages it comprises, and then we are ready to .fit it in one go. This will save us a lot of trouble, since we don't need to fit and transform each stage individually.

In [135]:
from pyspark.ml import Pipeline

pipeline = Pipeline()

### StringIndexer stages

We only need to StringIndex some of the fields. We are going to build the input and output column names programmatically.


In [136]:
string_fields

['Reporting_Airline',
 'Tail_Number',
 'Origin',
 'OriginCityName',
 'OriginStateName',
 'Dest',
 'DestCityName',
 'DestStateName']

In [137]:
indexers = [StringIndexer(inputCol=field, outputCol=field + 'Index') for field in string_fields]
indexers

[StringIndexer_ae5cc3a0894f,
 StringIndexer_3b6576569bb2,
 StringIndexer_3edbf3fe9748,
 StringIndexer_3a3e8075d085,
 StringIndexer_91e5bae6c9c9,
 StringIndexer_f904ca6a1446,
 StringIndexer_b80846695afd,
 StringIndexer_8905aa08b090]

### OneHotEncoderEstimator

One OneHotEncoderEstimator can handle all categorical columns. We are also going to build it programmatically.

In [138]:
inputCols = ( [field for field in categorical_fields if field not in string_fields]
             + [field + 'Index' for field in string_fields])

inputCols

['Year',
 'Month',
 'DayOfMonth',
 'DayOfWeek',
 'Reporting_AirlineIndex',
 'Tail_NumberIndex',
 'OriginIndex',
 'OriginCityNameIndex',
 'OriginStateNameIndex',
 'DestIndex',
 'DestCityNameIndex',
 'DestStateNameIndex']

In [139]:
outputCols = ( [field + 'OneHot' for field in categorical_fields if field not in string_fields]
             + [field + 'OneHot' for field in string_fields])

outputCols

['YearOneHot',
 'MonthOneHot',
 'DayOfMonthOneHot',
 'DayOfWeekOneHot',
 'Reporting_AirlineOneHot',
 'Tail_NumberOneHot',
 'OriginOneHot',
 'OriginCityNameOneHot',
 'OriginStateNameOneHot',
 'DestOneHot',
 'DestCityNameOneHot',
 'DestStateNameOneHot']

JL's version:

In [140]:
outputCols = [inputCol + 'OneHot' for inputCol in inputCols]

outputCols

['YearOneHot',
 'MonthOneHot',
 'DayOfMonthOneHot',
 'DayOfWeekOneHot',
 'Reporting_AirlineIndexOneHot',
 'Tail_NumberIndexOneHot',
 'OriginIndexOneHot',
 'OriginCityNameIndexOneHot',
 'OriginStateNameIndexOneHot',
 'DestIndexOneHot',
 'DestCityNameIndexOneHot',
 'DestStateNameIndexOneHot']

In [141]:
oneHotEstimator =  OneHotEncoderEstimator(inputCols=inputCols, outputCols=outputCols)
oneHotEstimator

OneHotEncoderEstimator_26e78adcd8d2

### VectorAssembler

Once we have generated our features, we can assemble them into a single features column, together with the continuous_fields.

Some of our columns have been converted into vectors; others are untouched plain numbers. The assembler joins them all into a new column (`features` in our case) that contains a single vector. Each scalar column adds one dimension to the vector; each one-hot column, being a vector itself, adds all of its positions.

An assembler is therefore just a transformer.

**In Python, adding lists concatenates them**

In [142]:
assembler = VectorAssembler(inputCols=outputCols + continuous_fields, outputCol='features')
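
What the assembler does to a single row can be sketched in plain Python. The `assemble` helper, the column names and the values are made up for illustration, and dense lists stand in for Spark vectors.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector columns of one row into a single feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, list):
            features.extend(value)  # a vector column adds all its positions
        else:
            features.append(float(value))  # a scalar column adds one dimension
    return features


row = {"AirlineOneHot": [0.0, 1.0], "Distance": 723.0, "DepMinutes": 648}
# Adding two lists of column names concatenates them.
features = assemble(row, ["AirlineOneHot"] + ["Distance", "DepMinutes"])
print(features)
```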

### RandomForestClassifier

Aaaaand we are ready to do some Machine Learning! We'll use a RandomForestClassifier to try to predict delayed versus non delayed flights, a binary classification task.

In [143]:
from pyspark.ml.classification import RandomForestClassifier

classifier = RandomForestClassifier(featuresCol='features', labelCol='Delayed')

### Pipeline!

Now that we have all the stages, we are finally ready to put them together into a single Estimator, our Pipeline.

In [144]:
pipeline = Pipeline(stages = indexers + [oneHotEstimator] + [assembler] + [classifier])

Now that we have gone to the trouble of building our Pipeline, fitting it and using it to predict the probability of delay on unseen data is as easy as using a single Estimator:

In [145]:
pipeline.fit(train)

PipelineModel_05965413d661

**Important: if you forget to store the previous result in a variable, `_` and `__` give you the output of the last cell and of the one before it, respectively**

In [146]:
model = _

In [165]:
result = model.transform(test)
result.select('Year','features', 'rawPrediction', 'probability', 'prediction').show(6)

+----+--------------------+--------------------+--------------------+----------+
|Year|            features|       rawPrediction|         probability|prediction|
+----+--------------------+--------------------+--------------------+----------+
|2018|(8776,[2035,2064,...|[16.6449634932409...|[0.83224817466204...|       0.0|
|2018|(8776,[2035,2064,...|[16.6449634932409...|[0.83224817466204...|       0.0|
|2018|(8776,[2035,2064,...|[16.6449634932409...|[0.83224817466204...|       0.0|
|2018|(8776,[2035,2064,...|[16.3880171372084...|[0.81940085686042...|       0.0|
|2018|(8776,[2035,2064,...|[16.6449634932409...|[0.83224817466204...|       0.0|
|2018|(8776,[2035,2064,...|[16.5002946725902...|[0.82501473362951...|       0.0|
+----+--------------------+--------------------+--------------------+----------+
only showing top 6 rows



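**The `probability` is the `rawPrediction` passed through a function (for a random forest, it is normalized so that its components add up to 1). The `prediction` is simply 0 if the first component of `probability` (the probability of class 0) is greater than 0.5, and 1 otherwise. We could create our own prediction by building a new column from `probability`**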
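
A plain-Python sketch of the relationship between the three columns, assuming the raw prediction is simply normalized to add up to 1. The helper names and the example values are made up, and the custom threshold illustrates how we could derive our own prediction from `probability`.

```python
def raw_to_probability(raw):
    """Normalize the raw scores so that they add up to 1."""
    total = sum(raw)
    return [r / total for r in raw]


def predict(probability, threshold=0.5):
    """Return 1.0 (delayed) when the probability of class 1 exceeds the threshold."""
    return 1.0 if probability[1] > threshold else 0.0


raw = [16.0, 4.0]
probability = raw_to_probability(raw)
print(probability, predict(probability), predict(probability, threshold=0.15))
```

Lowering the threshold flags more flights as delayed, trading precision for recall.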

## Evaluating and tuning our Pipeline

**We want to optimize the hyperparameters of our model**

Probably the most interesting use of Spark Pipelines is quickly (in terms of coding time) evaluating many combinations of hyperparameters to feed our model and choosing the best ones. For that, we can use a TrainValidationSplit or a CrossValidator. The CrossValidator will generally give a more reliable estimate, but it will take several times as long. I'm using the TrainValidationSplit here; the API is the same.

In [152]:
from pyspark.ml.tuning import TrainValidationSplit, CrossValidator



TrainValidationSplit evaluates the pipeline by splitting the data once into two groups (training and validation).

CrossValidator does it with several folds, each used in turn for validation, which is better.
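
The difference between the two strategies can be sketched in plain Python on row indices. The helper names are hypothetical and the splits here are deterministic for readability, whereas Spark assigns rows at random; the defaults (0.75 and 3) mirror Spark's default `trainRatio` and `numFolds`.

```python
def train_validation_split(n_rows, train_ratio=0.75):
    """One single (train, validation) split of the row indices."""
    cut = int(n_rows * train_ratio)
    indices = list(range(n_rows))
    return [(indices[:cut], indices[cut:])]


def k_fold_splits(n_rows, k=3):
    """k (train, validation) splits; every row is used for validation exactly once."""
    indices = list(range(n_rows))
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, validation))
    return splits


print(train_validation_split(8))
print(k_fold_splits(6, k=3))
```

With k folds the model is trained and evaluated k times per parameter combination, which is why cross-validation takes several times as long.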

In [None]:
TrainValidationSplit(estimator=pipeline)

### Params and Evaluators

In order to evaluate different sets of parameters, we need a) the set of parameters to iterate through and b) a metric to compare the results. 

The first element is represented by ParamMaps, which we build with a ParamGridBuilder, and the second by an Evaluator that needs to be specific to the relevant task.

In [153]:
classifier

RandomForestClassifier_8e3b9a1e411a

Our classifier is a RandomForest, which has many parameters (number of trees, etc.). We did not set any of them, so it is using the defaults. Now we can try different values:

In [154]:
from pyspark.ml.tuning import ParamGridBuilder

In [155]:
builder = ParamGridBuilder()

In [156]:
builder.addGrid(classifier.maxBins, [16,32,64])
builder.addGrid(classifier.numTrees, [10,20,50])
builder.addGrid(classifier.maxDepth, [3,5,10])

<pyspark.ml.tuning.ParamGridBuilder at 0x11ca29fd0>

In [159]:
param_map = builder.build()
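
The grid is the Cartesian product of the values given for each parameter. A plain-Python sketch with `itertools.product`, using plain dictionaries as a stand-in for Spark's ParamMaps:

```python
from itertools import product

grid = {
    "maxBins": [16, 32, 64],
    "numTrees": [10, 20, 50],
    "maxDepth": [3, 5, 10],
}

# One dictionary per combination of parameter values.
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(param_maps))
print(param_maps[0])
```

Three values for each of three parameters give 27 combinations, so the pipeline is fitted 27 times.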

In [161]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol='Delayed')
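
By default, BinaryClassificationEvaluator uses the area under the ROC curve as its metric. A plain-Python sketch of that quantity, computed as the fraction of (positive, negative) pairs that the scores rank correctly; the helper name and the data are made up, and Spark computes it differently (from the ROC curve).

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

A value of 0.5 corresponds to random scores and 1.0 to a perfect ranking.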

We now have all the elements in place to perform our fit:

In [162]:
split = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=param_map, evaluator=evaluator)

* First parameter: the model to evaluate.
* Second parameter: the combinations of parameter values we are going to try.
* Third parameter: how we will evaluate whether the model is good or not: in this case, how well it predicts 'Delayed'.

And now we can run the exploration by fitting it on the training data:

In [163]:
%%time 

results_of_exploration = split.fit(train)

Py4JJavaError: An error occurred while calling o3584.evaluate.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in stage 150.0 failed 1 times, most recent failure: Lost task 11.0 in stage 150.0 (TID 1311, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Unseen label: N708DN.  To handle unseen labels, set Param handleInvalid to keep.
	at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:260)
	at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:246)
	... 17 more

[... driver stack trace omitted; it ends in the same `Unseen label` cause ...]

The fit fails because of the `Unseen label` error shown above: during the internal train/validation split, a StringIndexer found a value (N708DN, which looks like a `Tail_Number`) that it had not seen while fitting. As the error message suggests, creating the StringIndexers with `handleInvalid='keep'` avoids it.


### Let's have a look

We are now ready to compare our predictions with reality. Do these features have any predictive power at all?

Not bad, considering we have not performed any feature engineering at all!

### Further Reading

https://spark.apache.org/docs/latest/ml-tuning.html

https://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark