![banner](img/cdips_2017_logo.png)

## Introduction

Text goes here.

Credit AFSIS, CDIPS, Kaggle.

## Tech Tools

- pandas
- scikit-learn
- numpy
- matplotlib + seaborn
- Jupyter notebooks
- Binder

In [None]:
import numpy as np
import pandas as pd
import sklearn

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(font_scale=2)

%run notebooks/scripts/load_data.py
%run notebooks/scripts/comparison.py
%matplotlib inline

In [None]:
import sklearn.decomposition

import sklearn.linear_model
import sklearn.neural_network
import sklearn.svm
import sklearn.ensemble

## Data Visualization

In [None]:
X, y = load_training_spectra('./data/training.csv')

Add data visualizations here.
See more visualizations of the data in
[this notebook](link).

## Data Pre-Processing

Absorption was measured at over 3,000 wavelengths,
resulting in very rich data vectors.
This richness comes at a price.

Though we have 3,000-dimensional inputs,
we have only 1,000 data vectors.
For some models, like
[linear regression](link),
this will lead to
*over-fitting*:
a model that predicts
soil properties for previously-seen samples
extremely well,
but fails to predict soil properties accurately
for samples it has never seen before.
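
To make the over-fitting problem concrete,
here is a minimal sketch on synthetic data,
not the soil spectra.
The sample counts, feature counts, and noise level
are assumptions chosen for illustration,
and the least-squares fit uses plain NumPy
rather than the scikit-learn model used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the spectra: far more features than training samples.
n_train, n_test, n_features = 50, 200, 200
true_weights = np.zeros(n_features)
true_weights[:5] = 1.0  # only a few features carry signal


def make_data(n_samples):
    X = rng.normal(size=(n_samples, n_features))
    y = X @ true_weights + 0.5 * rng.normal(size=n_samples)
    return X, y


def r_squared(y_true, y_pred):
    residual = np.sum((y_true - y_pred) ** 2)
    total = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - residual / total


X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

# Least squares without an intercept. With more features than samples,
# lstsq returns the minimum-norm solution, which fits the training data exactly.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_r2 = r_squared(y_train, X_train @ weights)
test_r2 = r_squared(y_test, X_test @ weights)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

The training score is essentially perfect,
while the score on fresh samples is far lower.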

For other models, like
[random forest regression](link),
having many individually uninformative input dimensions
reduces performance.

And for all models, but especially
[support vector methods](link)
and
[neural networks](link),
more input dimensions
mean more time spent training.

We could resolve this by smoothing and down-sampling,
but many of the spectra contain sharp,
precisely-located peaks
that we'd rather not smooth over.

Instead, we use
[Principal Components Analysis](link),
or PCA,
to reduce the size of our data vectors
while trying to retain as much information as possible.

In [None]:
PCA_transform = sklearn.decomposition.PCA(n_components=100)
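
As a rough illustration of what the PCA step does,
the sketch below carries out the same kind of reduction
with a singular value decomposition in plain NumPy.
The data are synthetic,
and the sizes and variable names are placeholders,
so this shows the idea rather than the exact transform above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 samples, 500 correlated features
# generated from 10 hidden factors plus a little noise.
n_samples, n_features, n_factors = 200, 500, 10
factors = rng.normal(size=(n_samples, n_factors))
mixing = rng.normal(size=(n_factors, n_features))
X_demo = factors @ mixing + 0.1 * rng.normal(size=(n_samples, n_features))

n_components = 20

# Center each feature, then take the SVD of the centered data.
feature_means = X_demo.mean(axis=0)
X_centered = X_demo - feature_means
U, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions; keep the first n_components.
components = Vt[:n_components]
X_reduced = X_centered @ components.T

# Fraction of the total variance captured by the retained components.
explained_ratio = singular_values**2 / np.sum(singular_values**2)
retained = explained_ratio[:n_components].sum()
print(X_reduced.shape, f"variance retained: {retained:.3f}")
```

Because the synthetic features are driven by only 10 factors,
20 components keep nearly all of the variance;
real spectra will generally need the retained fraction checked
before settling on a component count.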

## Models

#### Linear Regression

Text goes here.

In [None]:
linear_model = sklearn.linear_model.LinearRegression()

#### Neural Network

The resurgence of
[neural networks](link)
is one of the most exciting new developments
in machine learning in the past decade,
finding applications as diverse as
[beating human beings at Go](https://www.blog.google/topics/machine-learning/what-we-learned-in-seoul-with-alphago/),
[creating visual art](https://deepart.io/),
and
[operating autonomous vehicles](https://blogs.nvidia.com/blog/2016/05/06/self-driving-cars-3/).

Neural networks perform best in situations
with large datasets of very rich, structured data vectors
composed of many individually-uninformative dimensions.
However, we were still able to get decent performance
on this smaller dataset by
[carefully tuning hyperparameters](link).

In [None]:
neural_network = sklearn.neural_network.MLPRegressor(
    activation='logistic',
    alpha=0.0001,
    batch_size=16,
    beta_1=0.95,
    beta_2=0.99,
    early_stopping=False,
    hidden_layer_sizes=(100,),
    learning_rate_init=0.0001,
    max_iter=10000,
    tol=1e-16)

#### Random Forest

Text goes here.

In [None]:
random_forest = sklearn.ensemble.RandomForestRegressor()

#### Support Vector Machine

[Support Vector Machines](link)
are a core component of the machine learning toolkit,
but they are primarily used for
classification tasks,
where the output is a label,
rather than regression tasks like this one,
where the outputs are numbers.

Additionally,
unlike the other models in this comparison,
support vector regression models
can only predict a single output,
rather than multiple outputs.
This meant that we had to train one model
for calcium content,
one for pH,
and so on.

Despite this handicap,
[by tuning hyperparameters](link)
we were able to get good performance
from our set of support vector machines.

In [None]:
num_outputs = y.shape[1]

# One SVR per output column, all sharing the same hyperparameters.
C = 100
epsilon = 1e-1
gamma = 1e-2
SVR_models = [sklearn.svm.SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
              for _ in range(num_outputs)]
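
The sketch below shows one way a list of single-output models
like this could be fit and used,
with each model trained on one target column.
The data are synthetic,
the names `X_demo`, `y_demo`, and `svr_per_output` are hypothetical,
and this is not necessarily how the comparison script handles the list.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 samples, 5 features, 3 target columns.
X_demo = rng.normal(size=(100, 5))
y_demo = np.column_stack([
    X_demo[:, 0],
    np.sin(X_demo[:, 1]),
    X_demo[:, 2] ** 2,
])

# One independent SVR per output column (same hyperparameter values as above).
svr_per_output = [SVR(kernel='rbf', C=100, epsilon=0.1, gamma=0.01)
                  for _ in range(y_demo.shape[1])]

# Fit model j on target column j only.
for column, model in enumerate(svr_per_output):
    model.fit(X_demo, y_demo[:, column])

# Stack the single-output predictions into an (n_samples, n_outputs) array.
y_pred = np.column_stack([model.predict(X_demo) for model in svr_per_output])
print(y_pred.shape)
```

scikit-learn also provides `sklearn.multioutput.MultiOutputRegressor`,
which wraps this fit-one-model-per-column pattern
behind a single estimator.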

### Model Comparison

We compare our models by running random-split
[cross-validation](link).
This yields two metrics:
the *training score*,
which tells us how well the model
can predict the soil properties
for spectra it has seen before,
and
the *test score*,
which estimates how well the model
will predict the soil properties
for novel spectra.
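
For readers unfamiliar with the procedure,
here is a minimal sketch of random-split cross-validation
on synthetic data.
It is not the `compare_models` function used below:
the ridge regression, the R² score,
the 25% hold-out fraction, and all `demo_` names
are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 120 samples, 30 features, one target.
X_demo = rng.normal(size=(120, 30))
y_demo = X_demo[:, :3].sum(axis=1) + 0.5 * rng.normal(size=120)


def fit_ridge(X, y, penalty=1.0):
    """Closed-form ridge regression weights (no intercept)."""
    gram = X.T @ X + penalty * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ y)


def r_squared(y_true, y_pred):
    residual = np.sum((y_true - y_pred) ** 2)
    total = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - residual / total


demo_num_splits, test_fraction = 20, 0.25
n_test = int(test_fraction * len(X_demo))
demo_train_scores, demo_test_scores = [], []

for _ in range(demo_num_splits):
    # Shuffle the sample indices, then hold out the first n_test of them.
    order = rng.permutation(len(X_demo))
    test_idx, train_idx = order[:n_test], order[n_test:]
    weights = fit_ridge(X_demo[train_idx], y_demo[train_idx])
    demo_train_scores.append(
        r_squared(y_demo[train_idx], X_demo[train_idx] @ weights))
    demo_test_scores.append(
        r_squared(y_demo[test_idx], X_demo[test_idx] @ weights))

print(f"mean train R^2: {np.mean(demo_train_scores):.3f}")
print(f"mean test R^2:  {np.mean(demo_test_scores):.3f}")
```

Each split draws a fresh random hold-out set,
so the spread of the test scores across splits
gives a sense of how much the estimate depends on
which samples happened to be held out.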

In [None]:
models = [linear_model, neural_network, random_forest, SVR_models]
model_names = ['Linear Model', 'Neural Network', 'Random Forest', 'Support Vector Machine']

num_splits = 20

In [None]:
train_scores, test_scores = compare_models(
    models, model_names, PCA_transform,
    X, y, num_splits=num_splits)

In [None]:
comparison_plot(model_names, train_scores, test_scores)

Error bars are ± standard error of the mean.
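
As a reminder of how such an error bar is computed,
the sketch below works through the standard error of the mean
for a handful of made-up scores.
The numbers are illustrative only, not results from this notebook,
and the use of the sample standard deviation
(with n - 1 in the denominator) is an assumption
about how the plotting script computes it.

```python
import math
import statistics

# Made-up scores from five hypothetical cross-validation splits.
example_scores = [0.81, 0.78, 0.85, 0.80, 0.76]

mean_score = statistics.mean(example_scores)
# Sample standard deviation (n - 1 denominator) divided by sqrt(n).
standard_error = statistics.stdev(example_scores) / math.sqrt(len(example_scores))

print(f"{mean_score:.3f} ± {standard_error:.3f}")
```

Because of the division by the square root of the number of splits,
running more splits shrinks the error bars
even when the split-to-split spread stays the same.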

## Conclusions

Text goes here.