# Machine Learning with Spark MLlib
Spark MLlib, sometimes known as Spark ML, is a library for building machine learning solutions on Spark.

## Data Preparation and Exploration
Machine learning begins with data preparation and exploration. We'll start by loading a dataframe of data about flights between airports in the US.

In [2]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

flightSchema = StructType([
  StructField("DayofMonth", IntegerType(), False),
  StructField("DayOfWeek", IntegerType(), False),
  StructField("Carrier", StringType(), False),
  StructField("OriginAirportID", StringType(), False),
  StructField("DestAirportID", StringType(), False),
  StructField("DepDelay", IntegerType(), False),
  StructField("ArrDelay", IntegerType(), False),
])

flights = spark.read.csv('wasb://spark@<YOUR_ACCOUNT>.blob.core.windows.net/data/raw-flight-data.csv', schema=flightSchema, header=True)
flights.show()

The data includes a record of each flight, including how late it departed and arrived. Let's see how many rows are in the data set:

In [4]:
flights.count()

### Data Cleansing
Generally, before you can use data to train a machine learning model, you need to do some pre-processing to clean the data so it's ready for use. For example, does our data include some duplicate rows?

In [6]:
flights.count() - flights.dropDuplicates().count()

Yes it does.

Does it have any missing values in the **ArrDelay** and **DepDelay** columns?

In [8]:
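# Compare against the de-duplicated data so the result counts only rows with missing delays
deduped = flights.dropDuplicates()
deduped.count() - deduped.dropna(how="any", subset=["ArrDelay", "DepDelay"]).count()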

Yes.

So let's clean the data by removing the duplicates and replacing the missing values with 0.

In [10]:
flights = flights.dropDuplicates().fillna(value=0, subset=["ArrDelay", "DepDelay"])
flights.count()

### Exploring the Data
The data includes details of departure and arrival delays. However, we want to simply classify flights as *late* or *not late* based on a rule that defines a flight as *late* if it arrives more than 25 minutes after its scheduled arrival time. We'll select the columns we need, and create a new one that indicates whether a flight was late or not with a **1** or a **0**.

In [12]:
flights = flights.select("DayofMonth", "DayOfWeek", "Carrier", "OriginAirportID","DestAirportID",
                         "DepDelay", "ArrDelay", ((col("ArrDelay") > 25).cast("Int").alias("Late")))
flights.show()

OK, let's examine this data in more detail. The machine learning algorithms we are going to use are based on statistics; so let's look at some fundamental statistics for our flight data.

In [14]:
flights.describe().show()

The *DayofMonth* must be a value between 1 and 31, and its mean is roughly halfway between these values, which seems about right. The same is true for *DayOfWeek*, which is a value between 1 and 7. *Carrier* is a string, so there are no numeric statistics; and we can ignore the statistics for the airport IDs - they're just unique identifiers for the airports, not actually numeric values. The departure and arrival delays range from as much as 63 or 94 minutes ahead of schedule to over 1,800 minutes behind schedule. The means are much closer to zero than these extremes, and the standard deviations are quite large, so there's quite a bit of variance in the delays. The *Late* indicator is a 1 or a 0, but the mean is very close to 0, which implies that there are significantly fewer late flights than non-late flights.

Let's verify that assumption by creating a table and using a SQL statement to count the number of late and non-late flights:

In [16]:
flights.createOrReplaceTempView("flightData")
spark.sql("SELECT Late, COUNT(*) AS Count FROM flightData GROUP BY Late").show()

Yes, it looks like there are more non-late flights than late ones - we can see this more clearly with a visualization. To use the notebook's native visualization tools, we'll need to use an embedded SQL query to retrieve a sample of the data:

In [18]:
%sql
SELECT * FROM flightData

The results of the query are shown in a table above, but you can also view the data returned as a **Bar** chart, showing the count of the ***&lt;id&gt;*** value by the ***Late*** key. This should confirm that there are significantly more on-time flights than late ones in the sample of 1000 records returned by the query.

While we're at it, we can also view histograms and box plots of the delays. Change the plot options to show a **Histogram** of **DepDelay** and confirm that most of the delays are within 100 minutes or so (either way) of 0, but there are a few extremely high delays. These are outliers. You can see these even more clearly if you change the plot type to a **Box Plot** in which the median value is shown as a line inside a box that represents the second and third quartiles of the delay values. The extreme outliers are shown as markers beyond the *whiskers* that indicate the first and fourth quartiles.
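To make the box plot reading concrete, the sketch below computes the quartiles of a small, made-up list of delays and flags the values that fall beyond the whiskers. Both the sample values and the whisker rule (1.5 times the interquartile range beyond the box) are assumptions for illustration; the notebook's plotting tool may place its whiskers differently.

```python
from statistics import quantiles

# Hypothetical departure delays in minutes (not taken from the flight data)
delays = [-5, -3, -2, 0, 1, 2, 3, 4, 6, 8, 10, 300]

# Three cut points that split the sorted values into four quartiles
q1, median, q3 = quantiles(delays, n=4)
iqr = q3 - q1  # height of the box

# Assumed whisker rule: 1.5 * IQR beyond the edges of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences would be drawn as an individual outlier marker
outliers = [d for d in delays if d < lower_fence or d > upper_fence]
print("box from", q1, "to", q3, "with median", median)
print("outliers:", outliers)
```

A single extreme value barely moves the median or the box, which is why a box plot shows outliers more clearly than a histogram does.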

So we have two problems: our data is *unbalanced* with more negative classes than positive ones, and the outlier values make the distribution of the data extremely *skewed*. Both of these issues are likely to affect any machine learning model we create from it as the most common class and extreme delay values might dominate the training of the model. We'll address this by removing the outliers and *undersampling* the dominant class - in this case non-late flights.
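Fraction-based sampling without replacement keeps each row independently with a given probability, so the size of the result is approximate rather than exact. The plain-Python sketch below imitates that behavior on made-up class counts; the counts, the seed, and the `bernoulli_sample` helper are hypothetical, and this is an illustration of the idea rather than Spark's implementation.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

# Hypothetical class sizes: many more on-time (0) than late (1) flights
neg = [0] * 9000
pos = [1] * 1000

# One choice of fraction that targets a majority sample about as large as the minority
fraction = len(pos) / len(neg)
neg_sample = bernoulli_sample(neg, fraction, seed=42)

# The undersampled majority class is combined with the untouched minority class
balanced = neg_sample + pos
print("late:", len(pos), "on-time after undersampling:", len(neg_sample))
```

Because the selection is random, the two classes end up close to the same size but rarely identical.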

In [20]:
# Remove flights with outlier delays
flights = flights.filter("DepDelay < 150 AND ArrDelay < 150")

# Undersample the most commonly occurring Late class
pos = flights.filter("Late = 1")
neg = flights.filter("Late = 0")
posCount = pos.count()
negCount = neg.count()
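# Sample without replacement so no duplicate rows are introduced; the fraction
# minority/majority targets roughly equal class sizes
if posCount > negCount:
  pos = pos.sample(False, negCount / float(posCount))
else:
  neg = neg.sample(False, posCount / float(negCount))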
flights = neg.union(pos).orderBy(rand()) # randomize order of unioned data so we can visualize a mixed sample in the notebook
flights.createOrReplaceTempView("flightData")
flights.describe().show()

Our statistics look a little better now, and we still have a lot of data. Let's take a look at that visually.

In [22]:
%sql
SELECT * FROM flightData

View histograms and box plots of the delays, and a bar chart of the *Late* classes as you did previously to see a more even distribution (though the delays are still skewed and far from *normal*).

You can also start to explore relationships in the data. For example, group the box plots of arrival delay by day or carrier to see if lateness varies by these factors. A box plot of **DepDelay** grouped by the **Late** indicator should show that on-time flights have a very low median departure delay and small variance compared to late flights.

Finally, to get a clearer picture of the relationship between **DepDelay** and **ArrDelay**, plot both of these fields as a scatter plot - you should see a linear relationship between these two - the later a flight departs, the later it tends to arrive!

We can use statistics to quantify this correlation:

In [24]:
flights.corr("DepDelay", "ArrDelay")

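A correlation is a value between -1 and 1. A value close to 1 indicates a *positive* correlation - in other words, increases in one value tend to correlate with increases in the other. A value close to -1 indicates a *negative* correlation, and a value near 0 indicates little or no linear relationship.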
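As a rough illustration of what this statistic measures, the Pearson correlation coefficient can be written out by hand in plain Python. The `pearson` helper and the delay values below are made up for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical delays in minutes
dep = [0, 10, 20, 30, 40]
arr_later = [5, 14, 26, 33, 47]    # rises with the departure delay
arr_earlier = [40, 30, 20, 10, 0]  # falls as the departure delay rises

print(pearson(dep, arr_later))    # close to 1
print(pearson(dep, arr_earlier))  # close to -1
```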

## Training a Machine Learning Model
OK, now we're ready to build a machine learning model.
First, we'll split the data randomly into two sets for training and testing the model:

In [27]:
# Split the data for training and testing
splits = flights.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("Late", "trueLabel")
print("Training:", train.count(), ". Test:", test.count())


### Define the Pipeline and Train the Model
Now we'll define a pipeline of steps that prepares the *features* in our data, and then trains a model to predict our **Late** *label* from the features.

A pipeline encapsulates the transformations we need to make to the data to prepare features for modeling, and then fits the features to a machine learning algorithm to create a model. In this case, the pipeline:
- Creates indexes for all of the categorical columns in our data. These are columns that represent categories, not numeric values.
- Normalizes numeric columns so they're on a similar scale - this prevents large numeric values from dominating the training. In this case, we only have one numeric value (**DepDelay**), so this step isn't strictly necessary - but it's included to show how it's done.
- Assembles all of the categorical indexes and the vector of normalized numeric values into a single vector of features.
- Fits the features to a logistic regression algorithm to create a model.
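The indexing step in the first bullet can be pictured with a small plain-Python sketch. It assigns index 0 to the most frequent category, which mirrors the default ordering of `StringIndexer`. The carrier codes are made up, the `fit_string_index` helper is hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index, with the most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical carrier codes
carriers = ["DL", "AA", "DL", "UA", "DL", "AA"]

index = fit_string_index(carriers)
indexed = [index[c] for c in carriers]
print(index)
print(indexed)
```

The mapping is learned once from the training data and then reused, so the same category always receives the same index.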

Using a pipeline makes it easier to use the trained model with new data by encapsulating all of the feature preparation steps and ensuring numeric features used to generate predictions from the model are scaled using the same distribution statistics as the training data.
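The point about reusing the training statistics can be illustrated with min-max scaling written out by hand. In this sketch the minimum and maximum are learned from made-up training delays and then applied unchanged to new values; `fit_min_max` and `scale` are hypothetical helpers, not part of the Spark API.

```python
def fit_min_max(values):
    """Learn the scaling statistics from the training values only."""
    return min(values), max(values)

def scale(value, lo, hi):
    """Rescale a value using previously learned statistics."""
    if hi == lo:
        return 0.5  # assumed convention for a constant column
    return (value - lo) / (hi - lo)

# Hypothetical training delays in minutes
train_delays = [-10, 0, 20, 90]
lo, hi = fit_min_max(train_delays)
scaled_train = [scale(v, lo, hi) for v in train_delays]

# New data is scaled with the training statistics, not its own min and max
new_delays = [40, 140]
scaled_new = [scale(v, lo, hi) for v in new_delays]

print(scaled_train)
print(scaled_new)
```

A new value beyond the training range scales to a number outside 0 to 1, which is expected: the model sees it on the same scale it was trained on.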

In [29]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, MinMaxScaler, VectorAssembler
from pyspark.ml import Pipeline

# Create indexes for the categorical features
monthdayIndexer = StringIndexer(inputCol="DayofMonth", outputCol="DayofMonthIdx")
weekdayIndexer = StringIndexer(inputCol="DayOfWeek", outputCol="DayOfWeekIdx")
carrierIndexer = StringIndexer(inputCol="Carrier", outputCol="CarrierIdx")
originIndexer = StringIndexer(inputCol="OriginAirportID", outputCol="OriginAirportIdx")
destIndexer = StringIndexer(inputCol="DestAirportID", outputCol="DestAirportIdx")

# Normalize numeric features
numVect = VectorAssembler(inputCols = ["DepDelay"], outputCol="numFeatures")
minMax = MinMaxScaler(inputCol = numVect.getOutputCol(), outputCol="normFeatures")

# Assemble a vector of features (exclude ArrDelay as we won't have this when predicting new flights)
assembler = VectorAssembler(inputCols = ["DayofMonthIdx", "DayOfWeekIdx", "CarrierIdx",
                                         "OriginAirportIdx", "DestAirportIdx", "normFeatures"],
                            outputCol="features")

# Train a logistic regression classification model using the pipeline
lr = LogisticRegression(labelCol="Late",featuresCol="features",maxIter=10,regParam=0.3)

pipeline = Pipeline(stages=[monthdayIndexer, weekdayIndexer, carrierIndexer, originIndexer, destIndexer, numVect, minMax, assembler, lr])
model = pipeline.fit(train)
print(model)

### Test the Model
Now we're ready to apply the model to the test data.

In [31]:
prediction = model.transform(test)
predicted = prediction.select("features", "rawPrediction", "probability", col("prediction").cast("Int"), "trueLabel")
predicted.show(100, truncate=False)

### Compute Confusion Matrix Metrics
Classifiers are typically evaluated by creating a *confusion matrix*, which indicates the number of:
- True Positives
- True Negatives
- False Positives
- False Negatives

From these core measures, other evaluation metrics such as *accuracy*, *precision* and *recall* can be calculated.
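As a worked example of these calculations, the sketch below derives the three metrics from hypothetical confusion matrix counts. The `classification_metrics` helper and the counts are made up, and returning NaN for an empty denominator is a choice of this sketch.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision and recall from confusion matrix counts."""
    def ratio(numerator, denominator):
        # Avoid a ZeroDivisionError when a denominator is empty
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),  # share of all predictions that are correct
        "precision": ratio(tp, tp + fp),                # share of predicted positives that are correct
        "recall": ratio(tp, tp + fn),                   # share of actual positives that are found
    }

# Hypothetical counts for 100 test flights
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(m)
```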

In [33]:
tp = float(predicted.filter("prediction == 1 AND trueLabel == 1").count())
fp = float(predicted.filter("prediction == 1 AND trueLabel == 0").count())
tn = float(predicted.filter("prediction == 0 AND trueLabel == 0").count())
fn = float(predicted.filter("prediction == 0 AND trueLabel == 1").count())
metrics = spark.createDataFrame([
 ("TP", tp),
 ("FP", fp),
 ("TN", tn),
 ("FN", fn),
 ("Accuracy", (tp + tn)/(tp + fp + tn + fn)),
 ("Precision", tp / (tp + fp)),
 ("Recall", tp / (tp + fn))],["metric", "value"])
metrics.show()

### Review the Area Under ROC
Another way to assess the performance of a classification model is to measure the area under a *receiver operating characteristic (ROC) curve* for the model. The **spark.ml** library includes a **BinaryClassificationEvaluator** class that you can use to compute this. A ROC curve plots the True Positive and False Positive rates for varying *threshold* values (the probability value over which a class label is predicted). The area under this curve gives an overall indication of the model's predictive performance as a value between 0 and 1. A value of 0.5 means that a binary classification model (which predicts one of two possible labels) is no better at predicting the right class than a random 50/50 guess, and a value below 0.5 means it is worse.
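One way to build intuition for the area under the ROC curve is its equivalent pairwise reading: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it that way on made-up labels and scores; `pairwise_auc` is a hypothetical helper and is not how the evaluator is implemented.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted probabilities of the positive class
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]

auc_sketch = pairwise_auc(labels, scores)
print(auc_sketch)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the result is 8/9. A model that gives every example the same score gets 0.5, matching the random-guess baseline.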

In [35]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="trueLabel",
                                          rawPredictionCol="rawPrediction",
                                          metricName="areaUnderROC")
auc = evaluator.evaluate(prediction)
print("AUC =", auc)