# Intro to machine learning

In [None]:
import numpy as np
import matplotlib.pyplot as mpl
%matplotlib inline

import sklearn as sk
sk.__version__

## Read the data

`numpy` has a convenient function, `loadtxt`, that can load a CSV file. It needs a file, and ours is on the web. That's OK: we don't need to save it to disk. We can pass its text content to a `StringIO` object, which behaves like a file handle.

In [None]:
import requests
import io

r = requests.get('https://raw.githubusercontent.com/seg/2016-ml-contest/master/training_data.csv')
f = io.StringIO(r.text)

We can't just load the whole file, because we want NumPy to handle only an array of floats, and this file contains metadata too: a header row and some text columns. (We can't tell that yet, I just happen to know it, and it's normal for CSV files.)

Let's look at the first few rows:

In [None]:
r.text.split('\n')[:5]

For convenience later, we'll make a list of the features we're going to use.

In [None]:
# Drop the first three columns of the header row; they are not features.
features = r.text.split('\n')[0].split(',')[3:]
features

Now we'll load the data we want. First the feature vectors, `X`...

In [None]:
X = np.loadtxt(f, skiprows=1, delimiter=',', usecols=[3,4,5,6,7,8,9,10])

And the label vector, `y`:

In [None]:
_ = f.seek(0)  # Reset the file reader.
y = np.loadtxt(f, skiprows=1, delimiter=',', usecols=[0])

In [None]:
X.shape, y.shape
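The two-pass loading pattern above can be tried offline. This is a minimal sketch, assuming a small made-up CSV string (`demo_text`, with hypothetical column names) in place of the downloaded file. It shows `StringIO` standing in for a file, `skiprows` and `usecols` selecting the numeric part, and why the handle has to be rewound between reads.

```python
import io
import numpy as np

# Hypothetical CSV text standing in for the downloaded file.
demo_text = (
    "label,group,name,a,b\n"
    "1,g1,w1,0.5,1.5\n"
    "2,g1,w2,2.5,3.5\n"
    "3,g2,w1,4.5,5.5\n"
)
demo_f = io.StringIO(demo_text)

# First pass: numeric feature columns only, skipping the header row.
demo_X = np.loadtxt(demo_f, skiprows=1, delimiter=',', usecols=[3, 4])

# loadtxt read the handle to the end, so rewind before the second pass.
_ = demo_f.seek(0)
demo_y = np.loadtxt(demo_f, skiprows=1, delimiter=',', usecols=[0])

print(demo_X.shape, demo_y.shape)
```

Without the `seek(0)` call, the second `loadtxt` would start at the end of the text and find no rows to read.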

We have data! We're almost ready to train; we just have to sort out our train and test subsets.

## Getting ready to train

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
X_train.shape, y_train.shape
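To see what `test_size` and `random_state` do, here is a small sketch on made-up arrays (`split_X` and `split_y` are illustrative names, not part of the dataset). With ten samples and `test_size=0.2`, two samples go to the test set, and fixing `random_state` makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: ten samples with two features each, plus one label per sample.
split_X = np.arange(20).reshape(10, 2)
split_y = np.arange(10)

a_train, a_test, b_train, b_test = train_test_split(
    split_X, split_y, test_size=0.2, random_state=0
)

# Same random_state, same split.
a_train2, a_test2, b_train2, b_test2 = train_test_split(
    split_X, split_y, test_size=0.2, random_state=0
)

print(a_train.shape, a_test.shape)
print(np.array_equal(b_test, b_test2))
```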

Now the fun can really begin. 

## Training and evaluating a model

In [None]:
from sklearn.ensemble import ExtraTreesClassifier 

In [None]:
clf = ExtraTreesClassifier()

In [None]:
clf.fit(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

Maybe we can do better by tweaking some of the hyperparameters:

In [None]:
clf = ExtraTreesClassifier(n_estimators=2000, n_jobs=4, verbose=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

All models have the same API (but not the same hyperparameters), so it's very easy to try lots of models:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)

In [None]:
from sklearn.svm import SVC
SVC().fit(X_train, y_train).score(X_test, y_test)

In [None]:
from sklearn.naive_bayes import GaussianNB
GaussianNB().fit(X_train, y_train).score(X_test, y_test)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
GradientBoostingClassifier().fit(X_train, y_train).score(X_test, y_test)
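Because the estimators share `fit` and `score`, the comparison can also be written as a loop. The sketch below is one way to do it, assuming synthetic data from `make_classification` rather than the well-log data, so the scores it prints say nothing about the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 300 samples, 8 features, 2 classes.
syn_X, syn_y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
syn_X_train, syn_X_test, syn_y_train, syn_y_test = train_test_split(
    syn_X, syn_y, test_size=0.2, random_state=0
)

models = {
    'ExtraTrees': ExtraTreesClassifier(random_state=0),
    'KNeighbors': KNeighborsClassifier(),
    'GaussianNB': GaussianNB(),
}

# Every estimator is trained and scored with the same two calls.
syn_scores = {}
for name, model in models.items():
    syn_scores[name] = model.fit(syn_X_train, syn_y_train).score(syn_X_test, syn_y_test)
    print("{:>12s}  {:.3f}".format(name, syn_scores[name]))
```

Swapping in the notebook's own `X_train`, `X_test`, `y_train` and `y_test` should work the same way.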

## More in-depth evaluation: k-fold cross-validation

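We need a vector with one group label per sample, identifying the well each sample came from. The labels can be integers or strings; here we use the well names from the third column. Holding out one whole well at a time is a grouped variant of k-fold cross-validation, with one fold per well.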

In [None]:
wells = [row.split(',')[2] for row in r.text.split('\n')[1:] if row]

In [None]:
from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()
clf = ExtraTreesClassifier(random_state=0)

for train, test in logo.split(X, y, groups=wells):
    # train and test are the indices of the data to use.
    well_name = wells[test[0]]
    clf.fit(X[train], y[train])
    score = clf.score(X[test], y[test])
    print("{:>20s}  {:.3f}".format(well_name, score))
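The explicit loop above can also be replaced by `cross_val_score`, which accepts the same splitter and group labels. This is a sketch on synthetic data with three made-up group names (`well_a`, `well_b`, `well_c`), not the real wells.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Synthetic stand-in data: 120 samples in three hypothetical "wells" of 40 each.
cv_X, cv_y = make_classification(n_samples=120, n_features=8, random_state=0)
cv_groups = np.repeat(['well_a', 'well_b', 'well_c'], 40)

# One score per held-out group, in the order the splitter yields the groups.
cv_scores = cross_val_score(
    ExtraTreesClassifier(random_state=0),
    cv_X,
    cv_y,
    groups=cv_groups,
    cv=LeaveOneGroupOut(),
)

print(cv_scores)
print("mean: {:.3f}".format(cv_scores.mean()))
```

The loop version is still useful when you want the well name printed next to each score.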

<hr />

<div>
<img src="https://avatars1.githubusercontent.com/u/1692321?s=50"><p style="text-align:center">© Agile Geoscience 2016</p>
</div>