<img style="float: left" src="images/spark.png" />
<img style="float: right" src="images/surfsara.png" />
<hr style="clear: both" />

## Airplane delays with Spark.ML and DataFrames

In this notebook we show how to work with DataFrames in Apache Spark's machine learning component Spark.ML. We'll be using part of the "Airline on-time performance" data set from Data expo 09, which you can find [here](http://stat-computing.org/dataexpo/2009/the-data.html) in full, together with a description of the records.

We are interested in the delays of flights and we will be trying to predict them by using the machine-learning model [Random Forests](https://en.wikipedia.org/wiki/Random_forest).

In [None]:
# Create a SparkSession, the 'DataFrame version' of the SparkContext
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("Airplane delay prediction")
    .getOrCreate()
)


In [None]:
# Import a number of libraries we will be using
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook

In [None]:
dfAirplane = spark.read.parquet("../data/airplane_2008.parquet")
dfAirplane.printSchema()

### Inspecting the DataFrame

Let's see what our DataFrame contains by using [show](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.show).  We will print out the first 10 records of the DataFrame. If we specify no arguments to show, it will print out 20 records.

In [None]:
dfAirplane.show(10)

Print the columns of the DataFrame. You may want to check the [DataFrame](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame) API documentation.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

# Print the columns of this DataFrame
dfAirplane.<FILL IN>

### Counting the records

Let's see how many records we have. Again, you may want to consult the [API docs](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame). 


In [None]:
# TODO: Replace <FILL IN> with appropriate code

# Print the number of records in this DataFrame
dfAirplane.<FILL IN>

In [None]:
printSample = dfAirplane.head()
from IPython.display import display, HTML

th = ["<th>" + d + "</th>" for d in dfAirplane.columns]
td = ["<td>" + str(d) + "</td>" for d in printSample]

display(HTML("<table><thead><tr>" + "".join(th) + "</tr></thead><tbody><tr>" + "".join(td) + "</tr></tbody></table>"))

### Basic descriptive statistics

We are interested in the departure delay. We can compute the number of records that list this field, together with the mean, the standard deviation, the minimum and the maximum, by invoking [describe](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.describe) on the column.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

dfAirplane.select(<FILL IN>).show()

### Data reduction - Average departure delay by month

In the next cell we compute the average departure delay per month. We do this by selecting the columns 'DepDelay' and 'Month', then group by 'Month' and compute the mean of 'DepDelay'.

The result of the [groupBy](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy) method is that for each month the values of 'DepDelay' are grouped together in a list. For each of these lists the mean is then computed.

Finally, we convert the DataFrame to a Python [pandas](http://pandas.pydata.org/) DataFrame. This means that we collect all data from the Spark DataFrame to the driver. The pandas DataFrame (called `pdf` here) is local: it will not be distributed over many machines when using a cluster. We will visualize this pandas DataFrame later with matplotlib, a visualisation library.
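To make the group-then-average idea concrete, here is a minimal plain-Python sketch of the same computation. The `records` list is made-up illustration data, not taken from the airline data set, and Spark itself performs this aggregation in a distributed and more efficient way; the sketch only mirrors the conceptual steps.

```python
from collections import defaultdict

# Hypothetical (Month, DepDelay) records standing in for DataFrame rows
records = [(1, 10.0), (1, 20.0), (2, 5.0), (2, 7.0), (2, 9.0)]

# Step 1: group the delays by month
delays_by_month = defaultdict(list)
for month, delay in records:
    delays_by_month[month].append(delay)

# Step 2: compute the mean of each group, ordered by month
mean_delay = {m: sum(d) / len(d) for m, d in sorted(delays_by_month.items())}
print(mean_delay)
```

The resulting dictionary plays the role of the small, local result that we later hold in `pdf`.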

In [None]:
# TODO: Replace <FILL IN> with appropriate code

grouped = dfAirplane[['DepDelay', 'Month']].<FILL IN>

# The last line is equivalent to
# grouped = dfAirplane.select(dfAirplane['DepDelay'], dfAirplane['Month']).groupby('Month').mean('DepDelay')

grouped_sorted = grouped.sort(grouped['Month'])

pdf = grouped_sorted.toPandas()
pdf

In [None]:
# The import of matplotlib is for the graphics
import matplotlib.pyplot as plt
# make sure the graphics are shown within the notebook
%matplotlib inline

pdf.plot(kind='bar', x='Month')

## What are the main causes of delay?

Let's see what the main causes of delay are. Because some fields have 'NA' values (Not Available), we first filter these out. We then use an aggregation to sum each of the field values that indicate causes of delay.
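The filter-then-sum pattern can be sketched in plain Python as follows. The `rows` below are invented for illustration, and we assume (as the comparison with 'NA' suggests) that the delay fields arrive as strings.

```python
# Hypothetical rows; delay fields are strings and may contain 'NA'
rows = [
    {"WeatherDelay": "0", "NASDelay": "12", "SecurityDelay": "0"},
    {"WeatherDelay": "NA", "NASDelay": "NA", "SecurityDelay": "NA"},
    {"WeatherDelay": "30", "NASDelay": "5", "SecurityDelay": "2"},
]
causes = ["WeatherDelay", "NASDelay", "SecurityDelay"]

# Keep only rows where none of the cause fields is 'NA'
complete = [r for r in rows if all(r[c] != "NA" for c in causes)]

# Sum each cause over the remaining rows
totals = {c: sum(int(r[c]) for r in complete) for c in causes}
print(totals)
```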

In [None]:
from pyspark.sql.types import IntegerType

df = dfAirplane.filter(dfAirplane['WeatherDelay']!= 'NA').filter(dfAirplane['SecurityDelay'] != 'NA')\
.filter(dfAirplane['NASDelay'] != 'NA')

In [None]:
df.agg({"WeatherDelay" : "sum", "NASDelay": "sum", "SecurityDelay": "sum",\
                "LateAircraftDelay":"sum", "CarrierDelay" : "sum"}).toPandas().head()

## Visualisation of delay during the week

The next cells show how to get some more insight into the distribution of delay during the week. The first thing we do is add a boolean column called 'DepDelayed' to the DataFrame. The value of the column is true when the departure delay for that row is more than 15 minutes.

Next, a function to extract the hour is defined and we use this as a User Defined Function (UDF) in the DataFrame API.

Then, for all the rows with more than 15 minutes delay, the average delay is displayed per day of the week and hour. This can also be visualised in matplotlib.
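The hour extraction relies on the scheduled departure time being encoded as an `hhmm` number. A small stand-alone sketch of that logic, with the hypothetical name `hour_of_day`, shows why left-padding to four digits matters for times shortly after midnight.

```python
def hour_of_day(hhmm):
    """Return the hour from a time encoded as an hhmm number (e.g. 930 -> 9)."""
    # Left-pad to four digits so that 5 (00:05) becomes "0005"
    return int(str(int(hhmm)).zfill(4)[:2])

for t in (5, 930, 1545, 2359):
    print(t, "->", hour_of_day(t))
```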

In [None]:
airline_df = df.withColumn('DepDelayed', df['DepDelay']>15)

In [None]:
from pyspark.sql.functions import udf

# define hour function to obtain hour of day
def hour_ex(x): 
    h = int(str(int(x)).zfill(4)[:2])
    return h

# register as a UDF 
f = udf(hour_ex, IntegerType())

#CRSDepTime: scheduled departure time (local, hhmm)
airline_df = airline_df.withColumn('hour', f(airline_df.CRSDepTime))

In [None]:
#Origin_Airport="SJC"

In [None]:
#df_ORG = airline_df.filter(airline_df['origin']==Origin_Airport)
df_ORG = airline_df

In [None]:
hour_grouped = df_ORG.filter(df_ORG['DepDelayed']).select('DayOfWeek','hour','DepDelay')\
.groupby('DayOfWeek','hour').mean('DepDelay')
hour_grouped.show(10)

In [None]:
from pylab import rcParams
from IPython.display import display
from ipywidgets import interact

rcParams['figure.figsize'] = (10,5)
dh = hour_grouped.toPandas()
c = dh.pivot(index='DayOfWeek', columns='hour')
X = c.columns.levels[1].values
Y = c.index.values
Z = c.values
# Use one tick per hour and per day that is actually present in the data
plt.xticks(range(len(X)), X)
plt.yticks(range(len(Y)), Y)
plt.xlabel('Hour of Day')
plt.ylabel('Day of Week')
plt.title('Average delay per hour and day of week')
plt.imshow(Z)

## Machine Learning - predicting delays with Random Forests

In the rest of this notebook we will build a simple model that can be used to predict departure delays for a given airport.

We will define a delay as a departure delay of more than 15 minutes. This new feature is categorical and binary: a flight is either delayed or not. Then we will (after some data munging) train a model using Random Forests. This model can then be used to predict delays, based on new observations.

Here we will not explain what Random Forests are. For more info you may want to refer to [this](https://www.youtube.com/watch?v=3kYujfDgmNk) video. If you are not familiar with decision trees you may want to see [this one first](https://www.youtube.com/watch?v=-dCtJjlEEgM).

In order to train and test the model we will divide the data into a training set and a test set. 

This exercise shows you how to use machine learning in Apache Spark. Obviously, the model that we'll build is not very special. The aim here is to show the principles.

Apache Spark has two machine learning libraries, one for RDDs (MLlib) and one for DataFrames (Spark.ML). The new developments will centre around Spark.ML. 

For those familiar with Python's [scikit-learn](http://scikit-learn.org/): Spark.ML is very similar in design and also supports [Pipelines](http://spark.apache.org/docs/latest/ml-pipeline.html). However, we will not be covering those in this notebook. If you are interested in combining Spark and scikit-learn you may want to read [this blogpost](https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html) from the Databricks blog.

### Predicting delays 

In the next step we will add an extra column to the DataFrame which indicates whether a flight had a delay of more than 15 minutes or not. If so, we enter a value of 1.0 in the column 'label'; otherwise we use 0.0.

We will try to predict the value of this 'label' column by using Random Forests later. 

In the next cell we use [withColumn](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn) to add an extra column to a DataFrame. Note that we do not 'change' or 'edit' an existing DataFrame but that we create a new one. Remember that DataFrames are immutable.

The [when](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.when) condition, together with 'otherwise', functions like an if-then-else statement.
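As a plain-Python analogue of this if-then-else behaviour, consider the sketch below. The function name `delay_label` and its `threshold` parameter are hypothetical; the handling of a missing delay reflects our understanding that a null condition is not true and therefore falls through to the 'otherwise' branch.

```python
def delay_label(dep_delay, threshold=15):
    """Plain-Python analogue of when(DepDelay > 15, 1.0).otherwise(0.0)."""
    # A missing delay does not satisfy the condition, so it falls through to 0.0
    if dep_delay is not None and dep_delay > threshold:
        return 1.0
    return 0.0

print([delay_label(d) for d in (3, 15, 16, None)])
```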

In [None]:
from pyspark.sql.functions import when

OrdDelayeddf = airline_df.withColumn('label', when(airline_df['DepDelay'] > 15, 1.0).otherwise(0.0))
Slimdf = OrdDelayeddf.select(['Month','DayofMonth', 'DayOfWeek', 'Distance', 'UniqueCarrier', 'Dest','label'])
Slimdf.show()

### Dealing with categorical variables - using StringIndexer

The feature (or column) 'UniqueCarrier' is a categorical feature. Our Random Forest Classifier requires that we map the values of a categorical feature to numbers. Spark.ML offers a function called [StringIndexer](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer) that does exactly this.

StringIndexer can be used to convert strings to indices. By default it assigns the lowest index to the most frequent label. Spark.ML also provides a separate transformer, IndexToString, to map indices (for example predictions) back to the original labels.

Let's see how it works in a toy example first.

In the next cell we create a DataFrame called `mldf`. It has two columns, which we name 'id' and 'category'. Next we transform the `mldf` DataFrame into a new DataFrame called `transformedDF` by using a StringIndexer. The StringIndexer is given an input and an output column. The fit method is called on the StringIndexer with the data as argument. The result is a StringIndexerModel that is then used to transform the `mldf` DataFrame. Run the cell and see if you understand what happens.
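For intuition, the frequency-based indexing can be imitated in a few lines of plain Python. This is only a sketch of the default behaviour as we understand it (most frequent label first); the alphabetical tie-breaking used here is an assumption and is not exercised by this data.

```python
from collections import Counter

categories = ["a", "b", "c", "a", "a", "c"]

# "fit": order labels by descending frequency (ties broken alphabetically here)
counts = Counter(categories)
ordered = sorted(counts, key=lambda label: (-counts[label], label))
label_to_index = {label: float(i) for i, label in enumerate(ordered)}

# "transform": look up the index of every value
indexed = [label_to_index[c] for c in categories]
print(label_to_index)
print(indexed)
```

Here 'a' occurs three times, 'c' twice and 'b' once, so they are mapped to 0.0, 1.0 and 2.0 respectively.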

In [None]:
from pyspark.ml.feature import StringIndexer

mldf = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
fittedmodel = indexer.fit(mldf)
transformedDF = fittedmodel.transform(mldf)
transformedDF.show()

### Using StringIndexer to transform 'UniqueCarrier'

Next, use StringIndexer yourself to transform 'UniqueCarrier'. Use 'Carrier' as the name of the output column! We will be using this name later on.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
stringIndexer = StringIndexer(inputCol ="UniqueCarrier", outputCol="Carrier")

si_model = <FILL IN>
si_df = <FILL IN>
si_df.show()

### Transforming the 'Dest' column

We will do the same for the 'Dest' column.

In [None]:
stringIndexer = StringIndexer(inputCol='Dest', outputCol='Destination')
si_model = stringIndexer.fit(si_df)
dest_model_df = si_model.transform(si_df)

### Creating a feature vector: using VectorAssembler

All features that are used for building the classification model should be assembled in a feature vector. And again, Spark has a method for doing this: [VectorAssembler](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler)

And again, let's see a toy example first.

First, we create a data frame with two rows and three columns, named 'a', 'b' and 'c'. Suppose now, that from this data frame we only want to use column 'a' and 'c' as features. Then we use only these columns as values for 'inputCols' in VectorAssembler. Note that the result of the transformation is a new DataFrame with an extra column called 'features'. This contains the features of the columns we selected, in the proper format.
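Conceptually, assembling a feature vector just collects the selected column values of each row into one list. A minimal plain-Python sketch of that idea follows; Spark stores the result in its own vector type, which this sketch approximates with a list of floats.

```python
rows = [{"a": 1, "b": 0, "c": 3}, {"a": 2, "b": 3, "c": 2}]
input_cols = ["a", "c"]

# Add a 'features' entry holding the selected values, in input_cols order
assembled = [
    {**row, "features": [float(row[c]) for c in input_cols]}
    for row in rows
]
for row in assembled:
    print(row)
```

Note that column 'b' is still present in each row; it is simply left out of 'features'.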

In [None]:
from pyspark.ml.feature import VectorAssembler

mldf = spark.createDataFrame([(1, 0, 3), (2,3,2)], ["a", "b", "c"])
vecAssembler = VectorAssembler(inputCols=['a','c'], outputCol="features")
assembled = vecAssembler.transform(mldf)
assembled.show()

### Using a VectorAssembler
 
In the cell below, build the VectorAssembler that can be used to transform `dest_model_df` into a data frame that contains a column called 'features' (we reuse this name!). As input use the `featureCols` list.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
featureCols = ['Month', 'DayofMonth', 'DayOfWeek', 'Distance', 'Carrier', 'Destination']

#set the input and output column names
assembler = <FILL IN>

# return a dataframe with all of the  feature columns in  a vector column
features_df = <FILL IN>
features_df.show()

### Splitting the data set into a part for training and testing

Next, we split the data set into two parts. 80 percent of the data will be in the training set and the remaining 20 percent will be our test set. The assignment of records to these two sets is random. We use the [randomSplit](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.randomSplit) method that needs a list of weights and a seed for the random generation of the splits. 
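The following plain-Python sketch illustrates a seeded random split. The helper `random_split` is hypothetical and not Spark's implementation; it assigns every row independently, so the resulting sizes are only approximately 80/20, which to our knowledge is also the case for Spark's randomSplit.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split, proportional to the weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, b in enumerate(bounds):
            # The last split catches any rounding error in the final boundary
            if u < b or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(list(range(1000)), [0.8, 0.2], seed=12345)
print(len(train_rows), len(test_rows))
```

Using the same seed again reproduces exactly the same split, which is why we pass a fixed seed in the notebook.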

In [None]:
train_data, test_data = features_df.randomSplit([0.8, 0.2], 12345)
train_data.show()

### Training the Random Forest Classifier

Now we are ready to train the Random Forest Classifier. As input we have to provide a few things: the number of trees we want to create, the depth of each tree, the label column (the class we want to predict). In addition, we provide a seed for random generation and a maxBins number for categorical variables.

Spark ML offers ways to search for the optimal set of configuration parameters. See here: http://spark.apache.org/docs/latest/ml-tuning.html . Running these can be quite slow in local mode.

So let's run the classifier with a limited number of trees, and with shallow depth. If we want to see the trees we can use the model's `toDebugString` attribute. The training may take a while.


In [None]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(numTrees=15, maxBins= 350, maxDepth=8, labelCol="label", seed=42)
model = rf.fit(train_data)

### Predictions on the test data

Let's use the model to make predictions on the test_data. Note, that by predicting we transform one data frame into another.

In [None]:
predictions = model.transform(test_data)
predictions.show()

### Evaluating the results

So is machine learning that easy? Yes... but wait till we see how well the model performs.

We use Spark's [BinaryClassificationEvaluator](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator) to get some information on how the model performs. Note that its default metric is the area under the ROC curve (`areaUnderROC`) rather than accuracy. Check the documentation for more evaluation metrics.
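Since the default metric of BinaryClassificationEvaluator is, according to the PySpark documentation, the area under the ROC curve rather than accuracy, it helps to see how the two differ. The sketch below computes both by hand on invented scores; it illustrates the definitions and is not how Spark computes them internally.

```python
# Hypothetical (label, predicted probability of delay) pairs
scored = [(1.0, 0.9), (1.0, 0.4), (0.0, 0.6), (0.0, 0.2), (0.0, 0.1)]

# Accuracy: fraction of correct predictions at a 0.5 threshold
correct = sum((score >= 0.5) == (label == 1.0) for label, score in scored)
accuracy = correct / len(scored)

# AUC: probability that a random positive is scored above a random negative
pos = [s for label, s in scored if label == 1.0]
neg = [s for label, s in scored if label == 0.0]
pairs = [(p, n) for p in pos for n in neg]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(accuracy, auc)
```

Accuracy depends on the chosen threshold, whereas the area under the ROC curve only depends on how well the scores rank delayed flights above non-delayed ones. A value of 0.5 corresponds to chance.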

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol='label')
# The default metric of BinaryClassificationEvaluator is areaUnderROC, not accuracy
auc = evaluator.evaluate(predictions)
auc

### And now...

This is not really great but it's better than chance. We can tweak the model by training again with different parameters, or we may choose other features, or add some more information to our features. For example, what about holidays? Or weather data, which could improve our model significantly if we can get it? Maybe you want to look at other models for classification? The journey does not end here, but really has just begun.

But for now, we leave it at this. 