# Intro to machine learning - Scikit-Learn

We'll explore the Pandas and Scikit-Learn packages for simple machine learning tasks using geoscience data examples. By the end of the day, students will have a good overview of how to inspect large datasets and solve problems with state-of-the-art machine learning tools.

- Machine learning concepts
- What is it that you’re trying to solve? How can machine learning help?
- What's the difference between supervised and unsupervised methods?
- What's the difference between classification and regression?


<img src="../data/ML_loop.png"></img>

## Machine learning concepts

### The machine learning iterative loop
- Data: getting the data. How to load it and put it in an `array` and/or `DataFrame`.
- Processing: data exploration, inspection, cleaning, and feature engineering.
- Model: what is a model? Training a Scikit-Learn model.
- Results: assessing quality and performance metrics (accuracy, recall, F1, confusion matrices).
- Repeat: what can we do to improve performance?

### Data management for machine learning
- DataFrames: A new way to look at well logs.
- DataFrames vs arrays.




## Basic Pandas

This section introduces the concept of a `DataFrame` in Python. If you're familiar with R, it's pretty much the same idea! A useful cheat sheet is available [here](https://www.datacamp.com/community/blog/pandas-cheat-sheet-python#gs.59HV6BY).

The main purpose of Pandas is to allow easy manipulation of data in tabular form. Perhaps the most important idea that makes Pandas great for data science is that it always preserves **alignment** between data and labels.
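As a minimal sketch of what alignment means in practice, the example below uses two made-up `Series` (the depths and values are illustrative, not taken from the course data). Both share the same index labels but store them in a different order, and the addition matches values by label rather than by position.

```python
import pandas as pd

# Two hypothetical logs indexed by depth, stored in different orders
gr = pd.Series([80.0, 95.0, 60.0], index=[1000, 1001, 1002])
corr = pd.Series([1.0, 2.0, 3.0], index=[1002, 1000, 1001])

# Addition pairs values that share an index label, not a position
total = gr + corr
print(total)
```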

In [None]:
import pandas as pd
import numpy as np

The most common data structure in Pandas is the `DataFrame`: a 2D structure that can hold various types of Python objects, indexed by an `index` array (or multiple `index` arrays). Columns are usually labelled as well, using strings.

An easy way to think about a `DataFrame` is if you imagine it as an Excel spreadsheet.

Let's define one from a nested Python list (a NumPy array would work the same way):

In [None]:
arr =  [[1.23, 'sandstone'],
        [3.654, 'limestone'],
        [0.998, 'shale']]
arr

Make a `DataFrame` from `arr`

In [None]:
df = pd.DataFrame(arr, columns=['param1', 'lithology'])
df

In [None]:
df.loc[df['param1'] > 1,'param1']


Accessing the data is a bit more involved than with a NumPy array, but for good reasons: `.loc` selects by index label and column name rather than by position.

In [None]:
df.loc[1,'lithology'] # .loc[index, column]
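The positional counterpart of `.loc` is `.iloc`. The sketch below uses a small made-up `DataFrame` with a string index (`df_demo` is a hypothetical name) so that the difference between label-based and position-based access is visible.

```python
import pandas as pd

# A toy DataFrame with string index labels instead of the default integers
df_demo = pd.DataFrame(
    {"param1": [1.23, 3.654, 0.998],
     "lithology": ["sandstone", "limestone", "shale"]},
    index=["a", "b", "c"],
)

by_label = df_demo.loc["b", "lithology"]   # row labelled "b", column "lithology"
by_position = df_demo.iloc[1, 1]           # second row, second column
print(by_label, by_position)
```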

Add more data (row wise)

In [None]:
df.loc[3] = [5.6, 'shale']
df

Add data (column wise) specifying the index locations

In [None]:
df.loc[0:2, 'one_more_column'] = [6,7,8]
df

Add a new column with a "complete" list, array or series

In [None]:
df['second_new_column'] = ["x","y","z","a"]
df

Pandas also reads files from disk in tabular form ([here](http://pandas.pydata.org/pandas-docs/version/0.20/io.html)'s a list of all the formats that it can read and write). A very common one is CSV, so let's load one!

In [None]:
df = pd.read_csv("../data/2016_ML_contest_training_data.csv")
df.head()

# Inspecting the `DataFrame`

Using the `DataFrame` with well log information loaded before, we can make a summary using the `describe()` method of the `DataFrame` object

In [None]:
df.describe()

In [None]:
df = df.dropna()

## Adding more data to the `DataFrame`

In [None]:
def rhob(phi_rhob, Rho_matrix=2650.0, Rho_fluid=1000.0):
    """
    Bulk density from fractional density porosity (volumetric mixing law).

    The defaults are in kg/m^3; the reference values below are in g/cc
    (multiply by 1000 to convert).

    Rho_matrix (sandstone): 2.65 g/cc
    Rho_matrix (limestone): 2.71 g/cc
    Rho_matrix (dolomite): 2.876 g/cc
    Rho_matrix (anhydrite): 2.977 g/cc
    Rho_matrix (salt): 2.032 g/cc

    Rho_fluid (fresh water): 1.0 g/cc (is this more mud-like?)
    Rho_fluid (salt water): 1.1 g/cc
    See wiki.aapg.org/Density-neutron_log_porosity

    Returns the bulk density (RHOB) in the same units as the inputs.
    """
    return Rho_matrix*(1 - phi_rhob) + Rho_fluid*phi_rhob


In [None]:
phi_rhob = 2*(df.PHIND/100)/(1 - df.DeltaPHI/100) - df.DeltaPHI/100
calc_RHOB = rhob(phi_rhob)
df['RHOB'] = calc_RHOB
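The `rhob` function above applies the volumetric mixing law, bulk density = matrix density × (1 − porosity) + fluid density × porosity. As a sanity check, the sketch below inverts that relation to recover porosity from bulk density. The helper names `bulk_density` and `density_porosity` are hypothetical, and the sandstone and fresh-water defaults (in kg/m³) are assumptions carried over from the function above.

```python
def bulk_density(phi, rho_matrix=2650.0, rho_fluid=1000.0):
    """Bulk density from fractional porosity (same mixing law as rhob)."""
    return rho_matrix * (1 - phi) + rho_fluid * phi


def density_porosity(rho_bulk, rho_matrix=2650.0, rho_fluid=1000.0):
    """Invert the mixing law: fractional porosity from bulk density."""
    return (rho_matrix - rho_bulk) / (rho_matrix - rho_fluid)


rho = bulk_density(0.2)        # 20% porosity sandstone saturated with fresh water
phi = density_porosity(rho)    # should recover the porosity we started from
print(rho, phi)
```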

In [None]:
df.describe()

We can define a Python dictionary to relate facies with the integer label on the `DataFrame`

In [None]:
facies_dict = {1:'sandstone', 2:'c_siltstone', 3:'f_siltstone', 4:'marine_silt_shale',
               5:'mudstone', 6:'wackestone', 7:'dolomite', 8:'packstone', 9:'bafflestone'}

Let's add a new column with the name version of the facies

In [None]:
df["s_Facies"] = df.Facies.map(lambda x: facies_dict[x])
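`Series.map` also accepts a dictionary directly, without a `lambda`. One difference worth knowing: a label that is missing from the dictionary becomes `NaN` instead of raising a `KeyError`. A small sketch on made-up data (the `toy` frame and the shortened `facies_names` dictionary are illustrative only):

```python
import pandas as pd

facies_names = {1: "sandstone", 2: "c_siltstone", 3: "f_siltstone"}

# The last label (9) is deliberately absent from the dictionary
toy = pd.DataFrame({"Facies": [3, 1, 2, 9]})
toy["s_Facies"] = toy["Facies"].map(facies_names)
print(toy)
```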

In [None]:
df.head()

## Visual exploration of the data

We can easily visualize the properties of each facies and how they compare using a `PairGrid`. The library `seaborn` integrates with matplotlib to make these kinds of plots easily.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Note: the `size` argument was renamed to `height` in seaborn 0.9
g = sns.PairGrid(df, hue="s_Facies", vars=['GR','RHOB','PE','ILD_log10'], height=4)

g.map_upper(plt.scatter, alpha=0.4)
g.map_lower(plt.scatter, alpha=0.4)
g.map_diag(plt.hist, bins=20)
g.add_legend()

The plot shows that these facies overlap heavily in feature space, so they are hard to separate. Let's pick just a few facies and use Pandas to select the rows of the `DataFrame` that belong to them.

In [None]:
selected = ['f_siltstone', 'bafflestone', 'wackestone']

dfs = pd.concat(list(map(lambda x: df[df.s_Facies == x], selected)))

g = sns.PairGrid(dfs, hue="s_Facies", vars=['GR','RHOB','PE','ILD_log10'], height=4)
g.map_upper(plt.scatter, alpha=0.4)
g.map_lower(plt.scatter, alpha=0.4)
g.map_diag(plt.hist, bins=20)
g.add_legend()
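Another common way to select rows for a list of facies is a boolean mask built with `Series.isin`. Unlike the `pd.concat` approach above, which groups the rows facies by facies, the mask keeps the rows in their original order. A sketch on made-up data (the `toy` frame is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "s_Facies": ["wackestone", "sandstone", "bafflestone", "f_siltstone", "mudstone"],
    "GR": [40.0, 70.0, 20.0, 85.0, 60.0],
})
selected = ["f_siltstone", "bafflestone", "wackestone"]

# True for every row whose facies name is in the list
mask = toy["s_Facies"].isin(selected)
subset = toy[mask]
print(subset)
```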

---
# Feature engineering

Add PCA components? Average logs as function of Depth? ...
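As one possible take on the first idea, the sketch below appends principal components to a feature matrix. It runs on synthetic data standing in for the well logs, and keeping two components is an arbitrary assumption, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four log curves, with two of them correlated
rng = np.random.default_rng(0)
logs = rng.normal(size=(200, 4))
logs[:, 1] += 0.8 * logs[:, 0]

# PCA is sensitive to scale, so standardize first
scaled = StandardScaler().fit_transform(logs)

# Project onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Append the components to the original features
features = np.hstack([scaled, components])
print(features.shape, pca.explained_variance_ratio_)
```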

---
# Scikit-learn classifiers

Let's create a model that classifies between those three classes

### For a classifier comparison check the source code [here](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)

<img src="../data/ML_classifier_comparison_sklearn.png"></img>


*Choosing the right estimator:* Often the hardest part of solving a machine learning problem can be finding the right estimator for the job.
Different estimators are better suited for different types of data and different problems.

In [None]:
# Make X and y
X = dfs[['GR','RHOB','PE','ILD_log10']].values  # .as_matrix() was removed in pandas 1.0
y = dfs['s_Facies'].values

Some methods expect the data to be normalized. It is often a good idea to normalize it whichever method you try.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
plt.scatter(X[:,0], X[:,1], c=dfs['Facies'].values)

Split the data into a training set and a test set. **This is a key step in the process**

In [None]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
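One caveat: when the scaler is fitted on the full dataset before splitting, statistics from the test rows leak into the training features. A common way to avoid this is to split first and wrap the scaler and the classifier in a pipeline, so the scaler is fitted on the training rows only. The sketch below shows the pattern on synthetic data from `make_blobs` (an assumption, not the facies data).

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem with four features
X_demo, y_demo = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# fit() scales with training statistics only; score() reuses them on the test rows
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```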

A fairly common method for classifying data is to use the _k-nearest neighbors algorithm_. The label of the object in question is determined by the neighbouring data points in the feature space used. Its most important parameter, `k`, is the number of neighbors you include to make a membership decision.
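To make the idea concrete, here is a minimal NumPy sketch of the voting step, assuming Euclidean distance and a simple majority vote. The function `knn_predict` and the toy points are hypothetical, and this is not how Scikit-Learn implements it internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)          # count the labels among them
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters in a 2D feature space
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array(["shale", "shale", "shale",
                  "sandstone", "sandstone", "sandstone"])

label = knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3)
print(label)
```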

In [None]:
from sklearn.neighbors import KNeighborsClassifier

The next block is all you need to train a classifier model!

In [None]:
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

Before we can move on to making predictions, we need validation routines to make sure that the model we trained is _good_ and produces reasonable results. The most basic test is to look at how many correct predictions we make on our `Test` data.

In [None]:
score = clf.score(X_test, y_test)
print("The accuracy is {}%".format(np.round(score*100)))  # .score() returns mean accuracy

This score (the mean accuracy) is one of the _metrics_ we can use to check the quality of the predictions. There are many different metrics, and depending on your data and problem you may need to find the one that best fits your needs. A more robust metric that is often used is `F1`, the harmonic mean of `precision` (the fraction of predictions for a class that are correct) and `recall` (the fraction of the true members of a class that are found). Scikit-learn gives a nice summary of precision, recall and F1 using `classification_report`.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(X_test), digits=3))
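To see where these numbers come from, the sketch below computes precision, recall and F1 by hand for one class of a made-up two-class example and compares the F1 value with Scikit-Learn's `f1_score`. The labels `"a"` and `"b"` are illustrative only.

```python
from sklearn.metrics import f1_score

y_true = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]
y_hat  = ["a", "a", "b", "b", "a", "b", "b", "b", "b", "b"]

# Count outcomes for class "a"
tp = sum(t == "a" and p == "a" for t, p in zip(y_true, y_hat))  # true positives
fp = sum(t != "a" and p == "a" for t, p in zip(y_true, y_hat))  # false positives
fn = sum(t == "a" and p != "a" for t, p in zip(y_true, y_hat))  # false negatives

precision = tp / (tp + fp)   # of everything predicted "a", how much really is "a"
recall = tp / (tp + fn)      # of everything that really is "a", how much was found
f1_manual = 2 * precision * recall / (precision + recall)

f1_sklearn = f1_score(y_true, y_hat, pos_label="a")
print(precision, recall, f1_manual, f1_sklearn)
```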

Depending on your requirements, these results might be good enough to deploy this model and use it in a "Machine Learning Pipeline" product, but it is often not the best model you can get. Each method has a set of parameters (also known as _hyperparameters_) that can be tweaked to tune the training.

For the `KNeighborsClassifier` there are a few of these parameters:

In [None]:
KNeighborsClassifier()

For this particular method, the most important parameter to adjust is `n_neighbors` (it's the `K` in the `KNeighborsClassifier`!). Unfortunately, there's no rule that tells you what's the optimal value of `k`. To overcome this we can train many models with different values of `k` and compare the results of classifications applied to the _Test_ data.

In [None]:
nns = np.arange(1,60,2) # Generated array of values of k to try

Loop over each value in `nns` and store the `F1 Score`

In [None]:
from sklearn.metrics import f1_score

acc = []
for n in nns:
    clf = KNeighborsClassifier(n_neighbors=n)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    score = f1_score(y_test, y_pred, average='weighted')  # argument order is (y_true, y_pred)
    acc.append(score)

What value of `n` gives us the best result?

In [None]:
plt.plot(nns,acc)
_ = plt.xlabel('Number of neighbors')
_ = plt.ylabel('F1 Score')
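Picking `k` from its score on the test data means the test set has influenced the model, so the final test score tends to be optimistic. A common alternative is to choose `k` by cross-validation on the training data and keep the test set for a single final check. The sketch below shows that pattern on synthetic data from `make_blobs`; the range of `k` values and the five folds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, overlapping three-class problem
X_demo, y_demo = make_blobs(n_samples=300, centers=3, cluster_std=3.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Mean cross-validated weighted F1 on the training data for each candidate k
ks = np.arange(1, 30, 2)
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=int(k)), X_tr, y_tr,
                    cv=5, scoring="f1_weighted").mean()
    for k in ks
]
best_k = int(ks[int(np.argmax(cv_scores))])

# Refit with the chosen k and look at the test set once
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_score = final.score(X_te, y_te)
print(best_k, test_score)
```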

<div class="alert alert-success">
<b>Exercise</b>:
<ul>
</ul>
</div>

## More methods to train models!

Let's pick 3 different classifiers to train different models and then compare how well they perform

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

In [None]:
classifiers = [
    SVC(),
    RandomForestClassifier(),
    MLPClassifier()
    ]

names = ["RBF SVM", "RandomForest", "Neural Network"]  # SVC() uses an RBF kernel by default

In [None]:
classifiers

Let's iterate over these classifiers and print common metrics to evaluate the performance of each model using the testing dataset we defined before

In [None]:
# iterate over classifiers
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("{:12} {}".format(name,"-"*15))
    print(classification_report(y_test, clf.predict(X_test), digits=3))

<div class="alert alert-success">
<b>Exercise</b>:
<ul>
</ul>
</div>

# Parameter selection

Many of the models can be improved (or worsened) by changing the parameters that internally make the method work. It's always a good idea to check the documentation of each model (e.g. RandomForestClassifier [docs](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)). This process is usually called _hyperparameter tuning_.

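Scikit-learn offers a simple way to test different parameters for each model through a class called `GridSearchCV`, which trains and cross-validates one model for every combination of the parameter values you give it.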

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer as msc

# Select the parameters and values for each one to test
parameters = {'n_estimators':np.arange(1,100,5),
              'max_depth':np.arange(1,50,5)}

rfc = RandomForestClassifier()

clf = GridSearchCV(rfc, parameters, scoring=msc(f1_score, average='weighted'), cv=3, n_jobs=8)

clf.fit(X_train, y_train)

What does the parameter space look like with respect to the score of the classifier?

In [None]:
scores = clf.cv_results_['mean_test_score']
max_depths = clf.cv_results_["param_max_depth"].data.astype(int)
n_estimators = clf.cv_results_["param_n_estimators"].data.astype(int)

In [None]:
X_size = len(np.unique(max_depths))
Y_size = len(np.unique(n_estimators))
X = max_depths.reshape((X_size, Y_size))
Y = n_estimators.reshape((X_size, Y_size))
Z = scores.reshape((X_size, Y_size))

In [None]:
import scipy.interpolate

# Set up a regular grid of interpolation points
xi, yi = np.linspace(X.min(), X.max(), 100), np.linspace(Y.min(), Y.max(), 100)
xi, yi = np.meshgrid(xi, yi)

# Interpolate
rbf = scipy.interpolate.Rbf(X, Y, Z, function='linear')
zi = rbf(xi, yi)

plt.imshow(zi, vmin=0.8, vmax=Z.max(), origin='lower',
           extent=[X.min(), X.max(), Y.min(), Y.max()], aspect=X.max()/Y.max())
# plt.scatter(X, Y, c=Z)
plt.colorbar()

_ = plt.ylabel('n_estimators')
_ = plt.xlabel('max_depth')

`clf` can now tell us the best parameters to use with our `RandomForestClassifier`

In [None]:
clf.best_params_

In [None]:
clf.best_estimator_

The nice thing about `scikit-learn`'s methods is that they're all consistent and behave in the same way. Notice that we called `.fit()` on the `GridSearchCV` object, just as we would on a classifier. That means we can also call `.predict()` on it, and it will automatically use the best set of parameters!

In [None]:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))

It's also helpful to summarize the prediction tests using a [Confusion Matrix](https://en.wikipedia.org/wiki/Confusion_matrix). Scikit-learn has a function for that!

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

But as you can see, it's not very clear... What does each row/column represent? We can help a bit:

In [None]:
# itertools is a standard-library module for all kinds of handy iterator manipulation
import itertools

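# Compute confusion matrix; `labels=selected` fixes the row/column order
# so that it matches the tick labels used below
cnf_matrix = confusion_matrix(y_test, y_pred, labels=selected)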

title = 'Confusion matrix'
cmap = plt.cm.Reds

# Plot non-normalized confusion matrix
plt.figure()
plt.imshow(cnf_matrix, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(selected))
plt.xticks(tick_marks, selected, rotation=45)
plt.yticks(tick_marks, selected)

# Print the support numbers inside the plot
thresh = cnf_matrix.max() / 2.
for i, j in itertools.product(range(cnf_matrix.shape[0]), range(cnf_matrix.shape[1])):
    plt.text(j, i, format(cnf_matrix[i, j], 'd'),
             horizontalalignment="center",
             color="white" if cnf_matrix[i, j] > thresh else "black")

plt.tight_layout()
_ = plt.ylabel('True label')
_ = plt.xlabel('Predicted label')
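In a Scikit-Learn confusion matrix, rows are the true labels and columns are the predicted labels. By default the classes appear in sorted order, and the `labels` argument sets the order explicitly. A small sketch with made-up labels shows how to read it:

```python
from sklearn.metrics import confusion_matrix

y_true = ["shale", "shale", "sand", "sand", "sand", "lime"]
y_hat  = ["shale", "sand",  "sand", "sand", "lime", "lime"]

# Fix the row/column order explicitly instead of relying on sorted labels
order = ["sand", "shale", "lime"]
cm = confusion_matrix(y_true, y_hat, labels=order)

# cm[i, j] counts samples whose true class is order[i] and predicted class is order[j]
print(cm)
```

The first row, for example, says that of the three true `sand` samples, two were predicted as `sand` and one as `lime`.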

<div class="alert alert-success">
<b>Exercise</b>:
<ul>
</ul>
</div>

In [None]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(clf, 'facies_model.pkl')

How do you load a saved model?

In [None]:
clf = joblib.load('facies_model.pkl')
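A quick way to check that saving and loading worked is to confirm that the restored model predicts the same thing as the original. The sketch below does a full round trip with a toy classifier in a temporary directory; the file name `toy_model.pkl` and the data are made up.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Train a tiny model on toy data
X_toy = np.array([[0.0], [0.1], [0.9], [1.0]])
y_toy = np.array([0, 0, 1, 1])
model = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)

# Save it, load it back, and compare predictions
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same)
```

Keep in mind that a pickled model should only be loaded from a trusted source, and ideally with the same library versions that were used to save it.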

---
# Where to go next?

- More data!
- [XGBoost](https://xgboost.readthedocs.io/en/latest/)
- [LightGBM](https://github.com/Microsoft/LightGBM)
- If you want to get started on Neural Networks, [Keras](https://keras.io/) provides a scikit-learn type of experience

### Paper with classifier comparison ([link](https://arxiv.org/abs/1708.05070))

<img src="../data/model_performance.jpg"></img>

## The Data Science Hierarchy of Needs ([article](https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007))

<img src="../data/the_ai_hierarchy_of_needs.png"></img>

<hr />

<p style="color:gray">©2017 Agile Geoscience. Licensed CC-BY.</p>