# Overview
This notebook creates a model from a training dataset of 250 samples of 300 variables with binary labels and generates predictions for a dataset of nearly 20,000 samples.

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
import seaborn
%matplotlib inline

# Load data
Get the data from the CSV files provided by Kaggle (download them using the API command `kaggle competitions download dont-overfit-ii`).

In [None]:
input_dir = Path('../input')

train = pd.read_csv(input_dir / 'train.csv', index_col=0)

# Split targets from inputs.
train_targets = train[['target']].copy()
train_inputs = train.drop('target', axis=1)

train.head()

In [None]:
test_inputs = pd.read_csv(input_dir / 'test.csv', index_col=0)
test_inputs.head()

In [None]:
# Check dimensions
train_inputs.shape, test_inputs.shape

# Exploration
Understand the properties of the data and gather information that can be leveraged for modelling.

## Distributions
Look at the distributions of variables and test for normality.

In [None]:
fig, ax = plt.subplots(figsize=(15, 10))

# Plot individual histograms.
for col in train_inputs:
    x = train_inputs[col].values 
    freq, bins = np.histogram(x, bins=10)
    ax.plot(bins[:-1], freq, color='gray', alpha=0.1)
    
# Plot normal distribution for shape comparison.
def gaussian(x, m=0, s=1, norm=1):
    """Unnormalised Gaussian with mean m, standard deviation s and peak height norm."""
    return norm * np.exp(-((x - m) ** 2) / (2 * s ** 2))

# Range for plotting (100 points so the curve is drawn smoothly).
minimum = train_inputs.values.min()
maximum = train_inputs.values.max()
x = np.linspace(minimum, maximum, 100)

# Handpicked values for illustration.
mean = -0.25
stdev = 1
norm = 55
y = gaussian(x, m=mean, s=stdev, norm=norm)
ax.plot(x, y, color='r', label='Gaussian(-0.25, 1)')

ax.set_xlabel('Value')
ax.set_ylabel('Freq')
ax.set_title('Distribution of training variables')
ax.legend(loc='upper right')
ax.grid()


A fairly naive approach of overlaying a Gaussian (with hand-picked parameters) gives a reasonable approximation to the average shape of the distributions, but there are clearly quite a few variables for which the approximation is poor.

In [None]:
from scipy.stats import shapiro

# Test for normality in each column using Shapiro-Wilk.
non_norm_cols = []
for col in train_inputs:
    x = train_inputs[col].values
    stat, pval = shapiro(x)
    
    if pval < 0.05:
        print('P-val {0:.2f} ==> evidence column {1} not normally distributed'.format(pval, col))
        non_norm_cols.append(col)
        

So most columns show no significant departure from normality and can be treated as approximately normally distributed. Let's look at the ones with evidence to the contrary.

In [None]:
fig, ax = plt.subplots(figsize=(15, 8))
train_inputs[non_norm_cols].hist(ax=ax, bins=10);

Just from looking at the plots, it is pretty clear why these distributions return a significant Shapiro-Wilk result. However, since they are not too far from a normal distribution (i.e. they are generally peaked about a mean and fall off to either side), we shall ignore this for now and consider all of the variables to be approximately Gaussian.
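One caveat on the per-column test above: running 300 tests at a 0.05 threshold is expected to flag some columns by chance even if every column were truly Gaussian. The sketch below works through that arithmetic and shows a Bonferroni-corrected threshold as one possible (conservative) alternative. The variable names are illustrative and not part of the original notebook.

```python
# Counts mirror the notebook: 300 columns, each tested at alpha = 0.05.
n_tests = 300
alpha = 0.05

# Expected number of columns flagged purely by chance
# if every column were actually Gaussian.
expected_false_positives = n_tests * alpha

# Bonferroni-corrected per-test threshold, which keeps the
# family-wise error rate at or below alpha.
bonferroni_alpha = alpha / n_tests

print(f"Expected false positives: {expected_false_positives:.0f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha:.2e}")
```

If the number of flagged columns is close to the expected false-positive count, the flagged columns may not be genuinely non-Gaussian.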

## Collinearity
For completeness, check for linear correlations (Pearson) and plot a heatmap.

In [None]:
fig, ax = plt.subplots(figsize=(12, 10))
seaborn.heatmap(train_inputs.corr(), cmap='bwr', vmin=-1, vmax=1, ax=ax)

It is clear there are no strong linear correlations between any pair of columns. Thus, it seems that the training inputs represent 250 samples from something like a multivariate Gaussian distribution in which each dimension is independent of the others (for jointly Gaussian variables, zero correlation implies independence). In other words, we have 250 samples of 300 approximately iid Gaussian random variables.
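To put a number on "no strong correlations", one option is to extract the largest absolute off-diagonal entry of the correlation matrix. The sketch below does this on synthetic iid Gaussian data of the same shape as the training inputs (an assumption made so the example is self-contained), which also gives a baseline for how large a correlation can appear by chance with only 250 samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for train_inputs: 250 samples of 300 independent Gaussians.
x = rng.standard_normal((250, 300))

# Pearson correlation matrix between columns (300 x 300).
corr = np.corrcoef(x, rowvar=False)

# Keep each pair once by taking the upper triangle, excluding the diagonal.
upper = np.triu_indices_from(corr, k=1)
abs_corr = np.abs(corr[upper])
max_abs_corr = abs_corr.max()

print(f"Pairs compared: {abs_corr.size}")
print(f"Largest |correlation| between independent columns: {max_abs_corr:.2f}")
```

The same steps could be applied to `train_inputs.corr().values`; a maximum similar to the synthetic baseline would be consistent with independent columns.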

## Dimensionality reduction
The biggest obstacle to overcome is the high dimensionality relative to the number of samples; it seems sensible to try and reduce the dimensionality if we are to train any type of model given the data we have available.
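To illustrate why having more variables than samples is a problem, the sketch below fits a linear model to purely random labels. It uses synthetic data with the same shape as the training set (an assumption, not the competition data) and a minimum-norm least-squares fit chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_features = 250, 300

# Synthetic inputs and labels with no relationship between them.
x_train = rng.standard_normal((n_train, n_features))
y_train = rng.choice([-1.0, 1.0], size=n_train)

# With more features than samples, least squares can fit the labels exactly.
weights, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
train_acc = np.mean(np.sign(x_train @ weights) == y_train)

# Fresh random data: the "model" has learned nothing that generalises.
x_new = rng.standard_normal((2000, n_features))
y_new = rng.choice([-1.0, 1.0], size=2000)
new_acc = np.mean(np.sign(x_new @ weights) == y_new)

print(f"Training accuracy: {train_acc:.2f}")
print(f"Accuracy on new data: {new_acc:.2f}")
```

A perfect training score on noise shows that training accuracy alone says little here, which motivates reducing the dimensionality or regularising heavily.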

See how the total explained variance changes with the number of principal components in PCA.

In [None]:
from sklearn.decomposition import PCA

# Fit PCA once and compute the cumulative explained variance.
pca = PCA(n_components=200)
x = train_inputs.values
x_trans = pca.fit_transform(x)

# Entry k - 1 of the cumulative sum is the variance explained by the first k components.
n_comps = np.arange(1, pca.n_components_ + 1)
var = np.cumsum(pca.explained_variance_ratio_)
    
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(n_comps, var)
ax.set_xlabel('N components')
ax.set_ylabel('Sum of explained variance')
ax.set_title('Explained variance versus number of principal components')
ax.grid()
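For comparison, the sketch below computes the same cumulative explained-variance curve for synthetic iid Gaussian data of the same shape, using a plain SVD rather than scikit-learn. The synthetic data and the 90% cut-off are assumptions for illustration. If the real curve looks similar to this baseline, PCA is unlikely to find a compact low-dimensional representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for train_inputs: 250 samples of 300 independent Gaussians.
x = rng.standard_normal((250, 300))

# PCA via SVD: centre the columns, then square the singular values.
x_centered = x - x.mean(axis=0)
singular_values = np.linalg.svd(x_centered, compute_uv=False)
explained_variance_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches 90%.
n_for_90 = int(np.searchsorted(cumulative, 0.9) + 1)

print(f"Components needed for 90% of the variance: {n_for_90}")
```

With independent columns there is no dominant direction, so the curve rises gradually without an obvious elbow.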