# Intro to machine learning - k-means
---


Scikit-learn has a nice set of unsupervised learning routines which can be used to explore clustering in the parameter space.

In this notebook we will use k-means, included in Scikit-learn, to demonstrate how the different rocks occupy different regions in the available parameter space.

Let's load the data using pandas:

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("../data/2016_ML_contest_training_data.csv")
df.head()

In [None]:
df.describe()

In [None]:
df = df.dropna()

## Calculate RHOB from DeltaPHI and PHIND

In [None]:
def rhob(phi_rhob, Rho_matrix=2650.0, Rho_fluid=1000.0):
    """
    Bulk density from density porosity. The defaults are in kg/m3
    (2650.0 = 2.65 g/cc, 1000.0 = 1.0 g/cc).

    Rho_matrix (sandstone): 2.65 g/cc
    Rho_matrix (limestone): 2.71 g/cc
    Rho_matrix (dolomite): 2.876 g/cc
    Rho_matrix (anhydrite): 2.977 g/cc
    Rho_matrix (salt): 2.032 g/cc

    Rho_fluid (fresh water): 1.0 g/cc (is this more mud-like?)
    Rho_fluid (salt water): 1.1 g/cc
    see wiki.aapg.org/Density-neutron_log_porosity
    returns bulk density in the same units as Rho_matrix and Rho_fluid
    """
    return Rho_matrix*(1 - phi_rhob) + Rho_fluid*phi_rhob
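
The function above is a volume-weighted mixing law: `rho_b = rho_matrix*(1 - phi) + rho_fluid*phi`. As a quick sanity check, here is a minimal sketch that evaluates it for one porosity and then inverts it. The names `bulk_density` and `density_porosity` are illustrative and not part of this notebook; the defaults are the sandstone and fresh-water values from the docstring, expressed in kg/m3.

```python
def bulk_density(phi, rho_matrix=2650.0, rho_fluid=1000.0):
    """Bulk density of a rock with porosity phi (fraction, 0 to 1)."""
    return rho_matrix * (1.0 - phi) + rho_fluid * phi


def density_porosity(rho_b, rho_matrix=2650.0, rho_fluid=1000.0):
    """Invert the mixing law: porosity from bulk density."""
    return (rho_matrix - rho_b) / (rho_matrix - rho_fluid)


# 20 % porosity sandstone filled with fresh water:
# 2650 * 0.8 + 1000 * 0.2 = 2120 + 200 = 2320 kg/m3
rho_20 = bulk_density(0.20)

# Going back the other way should recover the porosity we started from.
phi_back = density_porosity(rho_20)
```

If the computed `RHOB` column falls well outside the range spanned by the fluid and matrix densities, the porosity fed into the function is probably not a fraction between 0 and 1.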


In [None]:
phi_rhob = 2*(df.PHIND/100)/(1 - df.DeltaPHI/100) - df.DeltaPHI/100
calc_RHOB = rhob(phi_rhob)
df['RHOB'] = calc_RHOB

In [None]:
df.describe()

We can define a Python dictionary that maps each integer facies label in the `DataFrame` to a facies name:

In [None]:
facies_dict = {1:'sandstone', 2:'c_siltstone', 3:'f_siltstone', 4:'marine_silt_shale',
               5:'mudstone', 6:'wackentstone', 7:'dolomite', 8:'packstone', 9:'bafflestone'}

In [None]:
df["s_Facies"] = df.Facies.map(lambda x: facies_dict[x])
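
A dictionary can also be passed straight to `Series.map`, and the two forms differ in how they treat labels that are missing from the dictionary. The sketch below uses a small made-up series (`labels`) and a shortened dictionary (`toy_dict`), both hypothetical, to show the difference.

```python
import pandas as pd

# Hypothetical toy data; the label 99 is deliberately absent from the dictionary.
toy_dict = {1: 'sandstone', 2: 'c_siltstone'}
labels = pd.Series([1, 2, 1, 99])

# Passing the dict directly: unmapped labels become NaN instead of raising.
names = labels.map(toy_dict)

# The lambda lookup raises KeyError as soon as it meets an unmapped label.
try:
    labels.map(lambda x: toy_dict[x])
    raised = False
except KeyError:
    raised = True
```

Which behaviour is preferable depends on the data: a hard failure flags unexpected labels immediately, while NaN lets you inspect them afterwards.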

In [None]:
df.head()

We can easily visualize the properties of each facies and how they compare using a `PairGrid`. The `seaborn` library integrates with matplotlib to make this kind of plot easy.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Note: the 'size' argument was renamed to 'height' in seaborn 0.9
g = sns.PairGrid(df, hue="s_Facies", vars=['GR','RHOB','PE','ILD_log10'], height=4)

g.map_upper(plt.scatter, alpha=0.4)
g.map_lower(plt.scatter, alpha=0.4)
g.map_diag(plt.hist, bins=20)
g.add_legend()
g.set(alpha=0.5)

The facies clearly overlap in feature space, so they are hard to separate. Let's pick just a few facies and use pandas to select the rows of the `DataFrame` that belong to them:

In [None]:
# Names must match the values in facies_dict exactly
selected = ['f_siltstone', 'bafflestone', 'wackentstone']

# Stack the rows for each selected facies, in the order listed above
dfs = pd.concat([df[df.s_Facies == name] for name in selected])

# Note: the 'size' argument was renamed to 'height' in seaborn 0.9
g = sns.PairGrid(dfs, hue="s_Facies", vars=['GR','RHOB','PE','ILD_log10'], height=4)
g.map_upper(plt.scatter, alpha=0.4)
g.map_lower(plt.scatter, alpha=0.4)
g.map_diag(plt.hist, bins=20)
g.add_legend()
g.set(alpha=0.5)

In [None]:
# Make X and y
X = dfs[['GR','ILD_log10','PE']].values  # as_matrix() was removed in pandas 1.0
y = dfs['Facies'].values

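Use scikit-learn's `StandardScaler` to standardize each feature to zero mean and unit variance. k-means relies on Euclidean distances, so without scaling the features with the largest numeric range would dominate the clustering.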

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)
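
To see what standardization does, the sketch below applies the same transform by hand with NumPy: subtract each column's mean and divide by its standard deviation. The data are synthetic stand-ins on two very different scales, not the well logs, and the names `X_demo` and `X_scaled` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-feature data on very different scales (illustrative only).
X_demo = np.column_stack([
    rng.normal(loc=60.0, scale=30.0, size=200),
    rng.normal(loc=0.6, scale=0.2, size=200),
])

# Standardize by hand. np.std defaults to the population standard deviation
# (ddof=0), which is also what StandardScaler uses.
X_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

col_means = X_scaled.mean(axis=0)  # approximately 0 for every column
col_stds = X_scaled.std(axis=0)    # approximately 1 for every column
```

After scaling, a unit step along either axis represents one standard deviation of that feature, so both contribute comparably to the distances k-means computes.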

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.3)

In [None]:
from sklearn.cluster import KMeans

clf = KMeans(n_clusters=4, random_state=1).fit(X)
y_pred = clf.predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, alpha=0.3)

In [None]:
clf.inertia_
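
`inertia_` is the sum of squared distances from each sample to its nearest cluster centre, so lower values mean tighter clusters. The sketch below recomputes it by hand and then records it for several values of `n_clusters`, which is one common way to judge how many clusters are reasonable. It runs on synthetic blobs from `make_blobs` rather than the well-log data, and the names `X_demo`, `km` and `inertias` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated blobs stand in for the well-log features.
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_demo)

# transform() returns the distance from each sample to every cluster centre;
# inertia is the sum of the squared distances to the nearest one.
manual_inertia = np.sum(np.min(km.transform(X_demo), axis=1) ** 2)

# Inertia for a range of cluster counts, e.g. to look for an "elbow".
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_demo).inertia_
    for k in range(1, 7)
]
```

Inertia generally keeps falling as clusters are added, so the useful signal is where the curve flattens, not its minimum.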

<hr />

<p style="color:gray">©2017 Agile Geoscience. Licensed CC-BY.</p>