<img style="float: right" src="images/surfsara.png">
<br/>
<hr style="clear: both" />

# Machine Learning - Random forests in Spark
In this notebook, we'll try to assess the credit risk on a German credit data set. Based on a number of _features_, listed below, we will train a machine learning model that predicts whether a person can be safely offered credit or not.

Please note that we follow the example of a [blog post on the MapR website](https://mapr.com/blog/predicting-loan-credit-risk-using-apache-spark-machine-learning-random-forests/). That example uses Scala; below we use Python.

The structure of the data is shown in the table below. Notice that the first field is the _label_ (or _class_). This is the field we will try to predict later. Its value is either 0 (false - high credit risk) or 1 (true - low credit risk).
<br/>
<br/>
<img style="float: center" width="70%" src="https://mapr.com/blog/predicting-loan-credit-risk-using-apache-spark-machine-learning-random-forests/assets/blogimages/sparkmlgermancreditdata.png">
<br/>
<br/>
A more detailed description of the data, also listing the meaning of the attribute values, is available on the <a href="https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)">Machine Learning Repository</a> from the University of California, Irvine.


Because the label can only have two possible outcomes, this problem is known as a _binary classification problem_. To solve this problem, we must build a _classifier_ that takes all input features of the data set and successfully predicts the _label_ on new instances (persons we haven't seen before). 

Because the _label_ is known in advance, we will use a _supervised_ learning algorithm, in this case Random Forests. First, we use labeled data to train and build our model. This data contains features of each person and also the decision on creditability, the label that we have to predict for new data.

Please note that this dataset is very clean. In general you will spend a lot of time preprocessing your data before using it to train a model. In this example, we will only perform minimal preprocessing.

## Decision trees
A Random Forest consists of a number of _decision trees_ (hence the name). Before we dive into using Random Forests proper, we will study decision trees first to get a feel for how they work. We quote from the MapR blog post: 

"Decision trees create a model that predicts the class or label based on several input features. Decision trees work by evaluating an expression containing a feature at every node and selecting a branch to the next node based on the answer. A possible decision tree for predicting Credit Risk is shown below. The feature questions are the nodes, and the answers “yes” or “no” are the branches in the tree to the child nodes."

<img style="float: center" width="80%" src="https://mapr.com/blog/predicting-loan-credit-risk-using-apache-spark-machine-learning-random-forests/assets/blogimages/creditdecisiontree.png">
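To make the idea concrete, a decision tree can be read as a chain of nested `if` statements. The sketch below is a hand-written tree, not one learned from the data: the choice of features, the thresholds, and the function name `predict_creditability` are illustrative assumptions.

```python
# Hypothetical hand-written decision tree. The thresholds are made up for
# illustration; a trained tree would learn them from the data.
def predict_creditability(person):
    """Return 1 (low credit risk) or 0 (high credit risk) for one person."""
    # Root node: checking account status code 3 or 4
    if person["balance"] >= 3:
        return 1
    # Second node: credit duration in months
    if person["duration"] > 24:
        return 0
    # Leaf decision: amount of credit
    return 1 if person["amount"] <= 5000 else 0


print(predict_creditability({"balance": 2, "duration": 12, "amount": 3000}))
```

Each `if` corresponds to a node in the tree, and each `return` to a leaf that assigns a class.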

<br/>
Random Forests are a generalization of this method. Instead of creating one tree, we create many trees, each trained on a random portion of the data. When building the trees we also make sure that they use a random subset of the features. The final model combines the predictions of all these trees, for example by averaging them or by taking a majority vote.
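The two sources of randomness and the combination step can be sketched in plain Python. This is a simplified illustration and not Spark's implementation: the helper names are hypothetical, and a simple majority vote is used here, whereas an implementation may instead average per-tree class probabilities.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one tree's random portion of the data."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features that a tree is allowed to consider."""
    return rng.sample(feature_names, k)


def majority_vote(tree_predictions):
    """Combine per-tree class predictions into a single forest prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
rows = list(range(10))  # stand-in for ten data rows
sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(["balance", "duration", "amount", "age"], 2, rng)
forest_prediction = majority_vote([1, 0, 1, 1, 0])  # five hypothetical tree outputs
print(sample, subset, forest_prediction)
```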

If you want to know more you may enjoy this [video](https://www.youtube.com/watch?v=3kYujfDgmNk).

Spark creates all trees of a Random Forest in parallel. In this case the data set is small and the number of features and trees is limited, but for many real-time applications being able to scale this algorithm is very important.

In [None]:
# Create a SparkSession, the 'DataFrame version' of the SparkContext
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .getOrCreate()
)

We read in the csv data. We will let Spark figure out the schema:

In [None]:
credits_df = spark \
    .read \
    .format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("../data/germancredit.csv")

In [None]:
credits_df.printSchema()

## Data exploration

Let's inspect the resulting DataFrame by converting it to a Pandas dataframe:

In [None]:
# Limit in Spark first, so that only ten rows are converted to pandas
credits_df.limit(10).toPandas()

Spark can provide us with some information on the distribution of features. Usually you would perform a more extensive exploration of the data first. For example, we often calculate correlations between features to investigate the underlying structure of the data. Here we do all of this very briefly.
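As a reminder of what a correlation between two features measures, here is a minimal pure-Python sketch of the Pearson correlation coefficient. The function name and the sample values are made up for illustration and are not taken from the data set. On the Spark DataFrame itself, `credits_df.stat.corr("duration", "amount")` should compute the same quantity for two columns.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Made-up values: here the amount grows linearly with the duration
durations = [6, 12, 24, 36, 48]
amounts = [1000, 2000, 4000, 6000, 8000]
r = pearson_correlation(durations, amounts)
print(r)
```

A value close to 1 or -1 indicates a strong linear relation; a value close to 0 indicates none.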

In the cell below, we calculate some descriptive statistics for the `amount` feature, which corresponds to the amount of credit, and for the `age` feature:

In [None]:
credits_df.describe("amount", "age").toPandas()

Below we do some more descriptive analytics on the relation between creditability and the amount of credit. First, we look at the average amount of credit per creditability class:

In [None]:
credits_df.groupBy("creditability").agg({"amount" : "avg"}).toPandas()

We can visualise the amount of credit per class in a box plot using the [`seaborn`](https://seaborn.pydata.org/generated/seaborn.boxplot.html) plotting library:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

mpl.rcParams['axes.labelsize'] = 24
mpl.rcParams['xtick.labelsize'] = 18
mpl.rcParams['ytick.labelsize'] = 18

plt.figure(figsize=(16, 8))
sns.boxplot(data=credits_df.toPandas(), x='creditability', y='amount')
plt.ylabel('Amount of credit')
plt.xlabel('Credit risk class - high risk vs. low risk');

We can also create a table that lists the number of high risk and low risk persons by the status of their checking account (the `balance` feature):

In [None]:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

# For information on the mapping between the checking account attribute's value and its meaning, see:
# https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

CHECKING_ACCOUNT_STATUS = {
    1: '< 0 DM',
    2: '0 <= ... < 200 DM',
    3: '... >= 200 DM',
    4: 'no checking account'
}

checking_account_status = F.udf(lambda index: CHECKING_ACCOUNT_STATUS[index], StringType())

credits_df \
    .groupBy('balance') \
    .pivot('creditability') \
    .count() \
    .withColumnRenamed('0', 'high_risk') \
    .withColumnRenamed('1', 'low_risk') \
    .orderBy('balance', ascending=False) \
    .withColumn('checking_account_status', checking_account_status('balance')) \
    .toPandas()

## Data preprocessing

We need to decide which features to include for model training. We put these in a list. Notice that we choose all features. Normally a selection is made based on informative value and relevance.

In [None]:
feature_column_names = [
    "balance", "duration", "history", "purpose", "amount",
    "savings", "employment", "instPercent", "sexMarried",
    "guarantors", "residenceDuration", "assets", "age",
    "concCredit", "apartment", "credits", "occupation",
    "dependents", "hasPhone", "foreign",
]

The algorithm takes the data set as input, but it needs the features of each row in the form of a single feature vector. Spark has a helper class for this: `VectorAssembler`. We use it to put our features in the data set under the column `features`.

If you execute the next cell and scroll to the right, you can see that the feature vector has been added.

In [None]:
from pyspark.ml.feature import VectorAssembler  

assembler = VectorAssembler(inputCols=feature_column_names, outputCol="features")
features_df = assembler.transform(credits_df)
features_df.toPandas().head()
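Conceptually, assembling a feature vector just means collecting the selected column values of a row, in a fixed order, into one numeric vector. The pure-Python sketch below illustrates this for a single row; the function name `assemble_features` and the example values are hypothetical, and the real `VectorAssembler` produces Spark vector objects rather than Python lists.

```python
def assemble_features(row, input_cols):
    """Collect the values of input_cols from a row into one list of floats."""
    return [float(row[name]) for name in input_cols]


# Made-up example row; only the listed columns end up in the vector
example_row = {"creditability": 1, "balance": 1, "duration": 18, "amount": 1049}
example_cols = ["balance", "duration", "amount"]
features = assemble_features(example_row, example_cols)
print(features)
```

The label column is deliberately left out of the vector, just as `creditability` is not part of `feature_column_names`.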

## Model training

Then we take the data set and split it randomly into a training part (roughly 80% of the data) and a test part (roughly 20%).
We train our model on the training data and evaluate its performance on the test data later.

In [None]:
train_data, test_data = features_df.randomSplit([0.8, 0.2], 12345)
train_data.count(), test_data.count()
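The counts above will typically not be exactly 80% and 20% of the data, because a random split assigns each row to a part independently. The sketch below illustrates that idea in plain Python; it is a simplified illustration with a hypothetical helper name, not Spark's actual implementation.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one part, with probability
    proportional to that part's weight."""
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    part_indices = list(range(len(weights)))
    for row in rows:
        chosen = rng.choices(part_indices, weights=weights, k=1)[0]
        parts[chosen].append(row)
    return parts


rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=12345)
print(len(train), len(test))
```

Using the same seed again yields the same split, which keeps the experiment reproducible.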

Then we can run our Random Forest algorithm. The next cell builds 20 trees, each with a maximum depth of 5. The number of bins used to discretize continuous features is left at its default (`maxBins=32`), and we set a random seed. (This seed allows us to reproduce a run even though the algorithm involves randomness.)

In [None]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(numTrees=20, maxDepth=5, labelCol="creditability", seed=42)
model = rf.fit(train_data)

Our model has been built. Its predictions are the average of those of the 20 trees that were built. We can now make predictions. In the next cell we predict the label of the test data set.

The predictions show two probabilities, one for 0 (high credit risk) and one for 1 (low credit risk).

In [None]:
predictions = model.transform(test_data)
predictions.toPandas()[['probability', 'prediction']].head(10)
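The relation between the `probability` and `prediction` columns can be sketched as follows: unless thresholds are configured otherwise, the predicted class is the one with the largest probability. The helper name and the probability values below are made up for illustration.

```python
def predict_from_probability(probability):
    """Return the index of the class with the highest probability."""
    return max(range(len(probability)), key=lambda i: probability[i])


# Index 0 is high credit risk, index 1 is low credit risk
print(predict_from_probability([0.3, 0.7]))
print(predict_from_probability([0.62, 0.38]))
```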

To see how good the predictions are we can use an Evaluator. Below we compute the area under the ROC curve, the default metric of `BinaryClassificationEvaluator`, on the test set.

## Model evaluation

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol='creditability')
# The default metric is the area under the ROC curve, not accuracy
area_under_roc = evaluator.evaluate(predictions)
area_under_roc
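Note that `BinaryClassificationEvaluator` reports the area under the ROC curve by default. This number can be read as the probability that a randomly chosen low-risk person receives a higher score than a randomly chosen high-risk person. The pure-Python sketch below computes it that way for a handful of made-up labels and scores; the function name is hypothetical. If plain accuracy is wanted instead, `MulticlassClassificationEvaluator` with `metricName="accuracy"` should provide it.

```python
def roc_auc_from_scores(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive has the higher score. Ties count as half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities of class 1
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
auc = roc_auc_from_scores(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.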

## Model inspection

We can plot the importance of each feature for the final prediction using seaborn:

In [None]:
import pandas as pd

importances_df = pd \
    .DataFrame({'importance': model.featureImportances.toArray(), 'feature': feature_column_names}) \
    .sort_values('importance', ascending=False)

plt.figure(figsize=(16, 8))    
sns.barplot(data=importances_df, x='feature', y='importance')
plt.xticks(rotation=90, fontsize=18);

We can also print out one of the decision trees used in the Random Forest model:

In [None]:
# The debug string refers to features by index; we substitute the column names
tree_model_string = model.trees[0].toDebugString
for index, column_name in enumerate(feature_column_names):
    tree_model_string = tree_model_string.replace(
        'feature {} '.format(index),
        '{} '.format(column_name)
    )

print(tree_model_string)