# A working machine learning example
In this notebook, we'll examine a dataset and create a predictive model, or ensemble of models, to predict the output for unseen data points.

We will use the output of the FizzBuzz program and 'reverse engineer' the program using data analysis and machine learning. So, in the end, our model should be capable of predicting the output of the FizzBuzz program for a given input.

In [None]:
%pylab inline

In [None]:
# we use pandas for data analysis and plotting
import pandas as pd
# seaborn provides enhanced visualization functionality
import seaborn as sns
# Spark's MLlib provides machine learning functionality
from pyspark.mllib.tree import RandomForest, RandomForestModel
# LabeledPoint is defined in pyspark.mllib.regression
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, RidgeRegressionWithSGD

### Seaborn
Note: this notebook uses the seaborn package for visualization; see the [Gallery](http://web.stanford.edu/~mwaskom/software/seaborn/examples/index.html) for examples.

## Load and parse data

In [None]:
rdd = (
    sc
    .textFile('fizzbuzz.csv')                     # read textfile
    .map(lambda line: line.strip().split(','))    # parse CSV into two fields
    .map(lambda (n, fb): (int(n), fb))            # parse first element as int
)
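
For readers without a Spark context at hand, the parsing step above can be sketched in plain Python. The sample lines below are hypothetical; they only assume that each line of `fizzbuzz.csv` holds a number and its output separated by a comma, which is what the Spark code expects.

```python
# Hypothetical sample lines; the real fizzbuzz.csv is assumed to look like this.
sample_lines = ["1,1\n", "2,2\n", "3,Fizz\n", "4,4\n", "5,Buzz\n", "15,FizzBuzz\n"]

def parse_line(line):
    """Split a 'number,output' line and parse the first field as an int."""
    n, fb = line.strip().split(",")
    return int(n), fb

# Same shape as the RDD elements: (int input, string output)
records = [parse_line(line) for line in sample_lines]
print(records)
```

The second field stays a string on purpose, since it can hold either a number or a label.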

## Data overview
It is generally a good idea to count the number of samples in your dataset to check that it loaded properly and that there are no obvious errors. It also helps to eyeball a few rows to see which values occur and to get a better feeling for what the different columns might mean.

In [None]:
rdd.count()

In [None]:
rdd.take(20)

## Numbers vs. the rest
In this example, the target is to predict the second column of the data (fizzbuzz) based on the first column (the number input). It appears that the output is a string which can either be a number or some label: Fizz, Buzz or FizzBuzz. Let us verify that this is the case. We'll split the data into two parts based on the output:
- the numbers
- the rest (textual output)

Using this split and subsequent analysis, we might gain insight into what causes the output to be either a number or something else.

In [None]:
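def is_int(x):
    # True when x can be parsed as an integer, False otherwise
    try:
        int(x)
        return True
    except (TypeError, ValueError):   # int() raises these for unparseable input
        return False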

In [None]:
not_numbers = rdd.filter(lambda (n, fb): not is_int(fb)) # Filter only values that are not an int
not_numbers.take(20)

## Fizz, Buzz, FizzBuzz
Again, all the non-numerical output appears to be either Fizz, Buzz or FizzBuzz and nothing else. Here, we verify this and count how often each label occurs. With a large dataset, we should be careful when building histograms like these and collecting them locally, because the result might not fit in local memory. Therefore, we first count the number of distinct values.

In [None]:
not_numbers_hist = (
    not_numbers
    .map(lambda (n, fb): (fb, 1))       # Create tuples of (value, 1)
    .reduceByKey(lambda x,y: x + y)     # Group by value and sum the 1's
)
not_numbers_hist.count()                # Find out how many classes there are

In [None]:
not_numbers_hist.collect()              # Since there are only three classes, it's safe to collect
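
The map/reduceByKey pattern above is a distributed version of an ordinary frequency count. As a minimal local sketch, assuming a handful of hypothetical `(n, output)` pairs in place of the real RDD, the same aggregation looks like this:

```python
from collections import Counter

# Hypothetical (n, output) pairs standing in for the contents of not_numbers.
pairs = [(3, "Fizz"), (5, "Buzz"), (6, "Fizz"), (9, "Fizz"), (10, "Buzz"), (15, "FizzBuzz")]

# Counter does locally what "emit (label, 1), then sum per key" does on the cluster.
label_counts = Counter(fb for _, fb in pairs)

print(len(label_counts))   # number of distinct labels
print(label_counts)        # occurrences per label
```

This only works when the data fits in local memory, which is exactly why the Spark version counts the distinct keys before collecting.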

## 5 and 3 look important
In the sample above with all the fizzes and buzzes, every input number that results in a text label appears to be divisible by 3 or 5. We will try to incorporate this idea into a predictive model through feature engineering: creating derived features from the original input feature. In this case, we add a boolean feature that is True when the input is divisible by 3 and False otherwise, and we do the same for 5.

Note that at this point we do not verify whether our assumption about being divisible by 3 and 5 is complete and correct. If the assumed relation holds, the model will learn about it and the evaluation results will exhibit low error (later section). Otherwise, it's back to the drawing board.
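
As a rough illustration of the feature engineering described above, here is a plain-Python sketch. The helper name `engineer_features` and the sample pairs are made up for this example; the feature names match the ones used in this notebook.

```python
def engineer_features(n):
    """Derive the two divisibility features from the raw input n."""
    return {"n": n, "by_three": n % 3 == 0, "by_five": n % 5 == 0}

# Hypothetical labelled rows, as they might appear in not_numbers.
pairs = [(3, "Fizz"), (5, "Buzz"), (9, "Fizz"), (15, "FizzBuzz"), (20, "Buzz")]
rows = [dict(engineer_features(n), fizzbuzz=fb) for n, fb in pairs]

# Spot check of the hypothesis on this small sample only:
# is every labelled row divisible by 3 or by 5?
all_divisible = all(row["by_three"] or row["by_five"] for row in rows)
print(all_divisible)
```

A spot check like this is not a proof; the model evaluation later on remains the real test of the assumption.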

In [None]:
# We create a DataFrame from a sample of the RDD of not_numbers
not_number_frame = pd.DataFrame(
    not_numbers
    .sample(False, 0.01, 0)                # 1% sample
    .map(lambda (n, fb): {                 # Turn into a collection of dict's
        'n': n,
        'fizzbuzz': fb,
        'by_three': n % 3 == 0,            # Include engineered feature for divisibility by three
        'by_five': n % 5 == 0              # Include engineered feature for divisibility by five
        }).collect()                       # Collect the sample locally into the DataFrame
)

In [None]:
# Let's have a look
not_number_frame.head()

In [None]:
# What is the relation between divisibility by three and the outcome (in case of not a number)
# Note: we use seaborn for visualization
# What does it tell us?
sns.barplot(x=not_number_frame.fizzbuzz, y=not_number_frame.by_three)

In [None]:
# Same for five
sns.barplot(x=not_number_frame.fizzbuzz, y=not_number_frame.by_five)
# What does it tell us?

In [None]:
# Same for divisibility by both
not_number_frame['by_both'] = not_number_frame.by_three & not_number_frame.by_five
sns.barplot(x=not_number_frame.fizzbuzz, y=not_number_frame.by_both)

## Decisions and numbers
The three plots above show the absence or presence of each label output (Fizz, Buzz or FizzBuzz) given divisibility by 3, by 5 or by both. We see that different combinations of divisibility by 3 and 5 map to different labels, so there is a decision boundary between the label values. Decision trees should learn this effectively.
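
To make the idea of a decision boundary concrete, the sketch below writes out by hand the kind of rules a shallow decision tree could learn from the two boolean features. It is a hypothesis consistent with the plots, not the trained model, and the function name `label_from_features` is invented for this example.

```python
def label_from_features(by_three, by_five):
    """Hand-written stand-in for a depth-two decision tree on the two features."""
    if by_three:
        # First split: divisible by 3; second split: also divisible by 5?
        return "FizzBuzz" if by_five else "Fizz"
    # Not divisible by 3: only divisibility by 5 can still produce a label.
    return "Buzz" if by_five else "number"

for n in (3, 5, 7, 15):
    print(n, label_from_features(n % 3 == 0, n % 5 == 0))
```

Two splits are enough to separate the four outcomes, which is why a tree-based classifier is a natural fit here.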

What remains is the part of the data where the result is not a label, but a numeric value. Let's further investigate.

In [None]:
numbers = (
    rdd
    .filter(lambda (n, fb): is_int(fb))            # Filter only numbers (int's)
    .map(lambda (n, fb): (n, int(fb)))             # Parse the string into an int if it is one
)

numbers_frame = pd.DataFrame(
    numbers.sample(False, 0.01, 0).collect(),      # Take a 1% sample
    columns=['n', 'fizzbuzz'])                     # Name the columns

numbers_frame.head(10)

In [None]:
# Let's look at the relation between n and the outcome
numbers_frame.plot(kind='scatter', x='n', y='fizzbuzz')

## Linearity
The relation between the input and output in the case of numbers appears perfectly linear (who would have thought?). This part of the data is better described by a linear regressor.
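
The linear relation can also be checked numerically with an ordinary least-squares fit. The sketch below uses plain Python and a few hypothetical inputs that are not divisible by 3 or 5; `fit_line` is a helper written for this example, not part of Spark.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) * (x - mean_x) for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical numeric rows: for these inputs the output equals the input.
xs = [1.0, 2.0, 4.0, 7.0, 8.0, 11.0, 13.0, 14.0]
ys = list(xs)

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

A slope close to 1 and an intercept close to 0 would confirm that the output simply repeats the input for this part of the data.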

## Modeling
We conclude from the above analysis that we can handle this prediction problem with a combination of two models, using the following approach.

For training:
- Train a classification model (based on decision trees) on all of the data, treating every numeric output as a single extra class.
- Train a regression model on the part of the data with numeric output.

For prediction:
- Make a prediction for the type of output (a textual label or numeric) using the classification model.
- If the classification model predicts a label, predict the label.
- If the classification model predicts a numeric output, use the regression model to predict the value.

We will train both models and evaluate both models using a train/test split of the data.

### Classification
We start out with the classification model. Here we use a Random Forest.

In [None]:
def fizzbuzz_type(fb):
    # Spark MLlib requires class labels to be encoded as floats
    return {
        'Fizz': 1.0,
        'Buzz': 2.0,
        'FizzBuzz': 3.0
    }.get(fb, 0.0)

# create dataset
classification_points = rdd.map(lambda (n, fb): LabeledPoint(fizzbuzz_type(fb), [n, n % 3 == 0, n % 5 == 0]))
# split dataset into train- and test- set
classification_points_train, classification_points_test = classification_points.randomSplit([0.6, 0.4])

In [None]:
classification_model = RandomForest.trainClassifier(
    classification_points_train,              # Use only the training part of the data
    numClasses=4,                             # We can predict one of four classes
    categoricalFeaturesInfo={1: 2, 2: 2},     # RandomForest needs to know which features are categorical
    numTrees=3,
    featureSubsetStrategy="auto",
    impurity='gini',
    maxDepth=4,
    maxBins=32)

classification_model

### Regression model
For predicting the numerical values we use a linear regressor with SGD.

In [None]:
# create dataset
regression_points = numbers.map(lambda (n, fb): LabeledPoint(fb, [n]))
# split dataset into train- and test- set
regression_points_train, regression_points_test = regression_points.randomSplit([0.6, 0.4])

In [None]:
regression_model = LinearRegressionWithSGD.train(
    regression_points_train,   # Use only the training part of the data
    iterations=100,
    step=1.0,
    initialWeights=[1.0]       # Little cheating here, but otherwise it won't converge on perfectly linear data.
)

regression_model
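
One plausible reason for the convergence problem mentioned in the comment is the combination of a large step size and unscaled inputs. The sketch below is plain full-batch gradient descent on a single weight, written for illustration only; Spark's SGD differs in its details (for example, it samples mini-batches and decays the step size), so treat this as intuition rather than a description of its internals.

```python
def gradient_descent(xs, ys, step, initial_weight, iterations):
    """Minimise the mean of (w * x - y)^2 / 2 over a single weight w."""
    w = initial_weight
    for _ in range(iterations):
        gradient = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= step * gradient
    return w

# Hypothetical perfectly linear data: y equals x for inputs 1..100.
xs = [float(n) for n in range(1, 101)]
ys = list(xs)

# Large step from a poor starting point: every update overshoots further.
diverged = gradient_descent(xs, ys, step=1.0, initial_weight=0.0, iterations=10)
# Small step from the same starting point: the weight approaches 1.
converged = gradient_descent(xs, ys, step=1e-4, initial_weight=0.0, iterations=100)
# Starting at the optimum: the gradient is zero, so the weight never moves.
at_optimum = gradient_descent(xs, ys, step=1.0, initial_weight=1.0, iterations=100)

print(diverged, converged, at_optimum)
```

Starting at a weight of 1.0 hides the instability because the gradient is already zero there. Scaling the input feature or lowering the step size would be the more general remedies.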

### Model evaluation
We will evaluate the two trained models separately. The final model should normally also be evaluated for fitness, but this is left as an exercise to the reader.

We will use the [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error) as the evaluation metric. A perfect prediction would yield an error of 0.0.

In [None]:
# Calculate the Mean Squared Error between an RDD of predictions and an RDD of LabeledPoints with actuals.
def MSE(predictions, test_data):
    values_and_preds = test_data.map(lambda p: p.label).zip(predictions)
    return values_and_preds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / values_and_preds.count()

### Classification MSE
The MSE is actually not the standard way to evaluate a classifier. Do you know why?
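
One possible answer, sketched in plain Python: the class labels are arbitrary codes, so the squared distance between them carries no meaning. The example below assumes the encoding used in this notebook (0.0 for a number, 1.0 for Fizz, 2.0 for Buzz, 3.0 for FizzBuzz) and compares two hypothetical sets of predictions that each contain exactly one mistake.

```python
def mse(actual, predicted):
    """Mean squared error between two equally long lists of numbers."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual class exactly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

actual = [1.0, 1.0, 0.0, 2.0]    # Fizz, Fizz, number, Buzz
pred_a = [2.0, 1.0, 0.0, 2.0]    # one Fizz mistaken for Buzz
pred_b = [3.0, 1.0, 0.0, 2.0]    # one Fizz mistaken for FizzBuzz

print(mse(actual, pred_a), accuracy(actual, pred_a))
print(mse(actual, pred_b), accuracy(actual, pred_b))
```

Both predictions get the same number of rows wrong, yet the MSE differs because of how the labels happen to be numbered. Accuracy, or a confusion matrix, does not depend on that numbering.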

In [None]:
MSE(
    classification_model.predict(classification_points_test.map(lambda p: p.features)),
    classification_points_test
)

### Regression MSE

In [None]:
MSE(
    regression_points_test.map(lambda p: p.features).map(regression_model.predict),
    regression_points_test
)

## Prediction function
Here we combine the two models as described.

In [None]:
def predict(n):
    classification_features = [n, n % 3 == 0, n % 5 == 0]   # Features required by the classifier
    regression_features = [n]                               # Features required by the regression
    
    # This is required to translate the floating point labels back to the original,
    # since Spark requires floating point values as class labels.
    classes = {
        1.0: 'Fizz',
        2.0: 'Buzz',
        3.0: 'FizzBuzz'
    }
    
    return classes.get(
        classification_model.predict(classification_features),  # If the classifier gave us a textual output, use that
        regression_model.predict(regression_features))  # Otherwise, use the regression model's prediction

## Final predictions
Congratulations! We've machine learned FizzBuzz!

In [None]:
# predicted output
[ predict(x) for x in range(1,21) ]

In [None]:
# original output
rdd.take(20)