<a id='top'></a>
# Log completion by ML regression

- Typical and useful Pandas
    - Data exploration using Matplotlib
    - Basic steps for data cleaning
    - **Exercise: Find problem in specific well log data.**
    - Feature engineering
- Set up scikit-learn workflow
    - Making X and y
- Choosing a model
    - Classification vs Regression
- Evaluating model performance
    - Parameter selection and tuning
    - GridSearch
- Add more data / remove data 

## More Pandas
---

Load Numpy, Pandas and Matplotlib

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Define the name of the file to be loaded and use Pandas to read it. Note that the name can be a PATH pointing at the file.

In [None]:
datafile = '../data/training_DataFrame.csv'

By default, Pandas creates a new integer index for the rows. Our file already stores a row index in its first column (position 0), so we pass `index_col=0` to use that column as the index instead.

In [None]:
wells = pd.read_csv(datafile, index_col=0)
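To see what `index_col=0` changes, here is a minimal sketch on a tiny in-memory CSV. The column names and values are made up for illustration and are not taken from the training file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose first column ("idx") holds the row labels.
csv_text = "idx,Depth,GR\n10,1000.0,55.2\n11,1000.5,60.1\n"

# Without index_col, pandas adds its own 0..n-1 index and keeps "idx" as data.
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
print(list(indexed_df.index))
```

In the second case the `idx` column no longer appears among the data columns, which keeps it out of any feature matrix built later.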

# Data Exploration and cleaning

Before feeding our machines with data to learn from, it's important to make sure that we feed them the best possible data. Pandas has a few methods to explore the contents of the data. The `head()` method shows the top rows of the DataFrame.

In [None]:
wells.head()

Another useful Pandas method is `describe()`, which compiles useful statistics for each numeric column in the `DataFrame`.

In [None]:
wells.describe()

Notice that the `count` row is not the same for all columns. `count` only includes non-null entries, so some columns contain missing values (`NaN`s). There are many strategies for dealing with missing data, but for this exercise we will simply drop the rows that contain them.

In [None]:
wells = wells.dropna()

In [None]:
wells.describe()
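The effect of `dropna()` can be checked on a toy table before trusting it on the real logs. This is only a sketch: the values below are invented, and `isna().sum()` is used here as one possible way to count missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy table with one missing value in each column (made-up numbers).
toy = pd.DataFrame({
    "GR": [55.0, np.nan, 61.0, 70.0],
    "RHOB": [2.45, 2.50, np.nan, 2.60],
})

# Count the missing entries in each column.
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value.
cleaned = toy.dropna()

print(missing_per_column)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Note that a row is removed even if only one of its columns is missing, so dropping can discard a lot of otherwise good data.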

Now every column in the `DataFrame` should contain the same number of elements, so we can focus on the statistics themselves. Look at each log property: do the `mean`, `min` and `max` look reasonable? `ILD` shouldn't have negative values, so let's remove those rows from our set:

In [None]:
wells = wells[wells.ILD > 0]

In [None]:
wells.describe()
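The filtering step above relies on boolean masking. A small sketch with invented values shows what the mask looks like and which rows survive:

```python
import pandas as pd

# Toy table (made-up values); two ILD entries are not physically plausible.
toy = pd.DataFrame({
    "ILD": [12.5, -999.25, 3.1, -1.0],
    "GR": [50, 60, 70, 80],
})

# Comparing a column with a number gives one True/False per row.
mask = toy["ILD"] > 0

# Indexing with the mask keeps only the rows where it is True.
filtered = toy[mask]

print(mask.tolist())
print(filtered)
```

The original row labels are kept after filtering, which is why the index of the result may have gaps.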

Another typical first approach to explore the data is to study the distribution of values in the dataset...

In [None]:
ax = wells.hist(column="RHOB", figsize=(8,6), bins=20)

<div class="alert alert-success">
    <b>Exercise</b>:
     <ul>
      <li>
      That distribution doesn't seem right. Can you exclude the `DataFrame` values for which `RHOB` is higher than `1800`?
      </li>
      <p>
    </ul>
</div>

In [None]:
# Put your code here


<div class="alert alert-success">
    <b>Exercise</b>:
     <ul>
      <li>
      Explore the rest of the `DataFrame`. Do all distributions look OK?
      </li>
      <p>
    </ul>
</div>

Seaborn offers a few ways to display histograms more clearly.

In [None]:
import seaborn as sns

In [None]:
wells.ILD.values

In [None]:
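# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(wells['ILD'], kde=True)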

<div class="alert alert-success">
    <b>Exercise</b>:
     <ul>
      <li>
      Calculate the `log` of ILD and store it in the `DataFrame`
      </li>
      <p>
    </ul>
</div>
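As a hint, the pattern for storing a transformed column can be sketched on toy data. The base-10 logarithm and the column name `log_ILD` are assumptions made for this illustration; the exercise does not prescribe either.

```python
import numpy as np
import pandas as pd

# Toy resistivity values spanning several orders of magnitude (made up).
toy = pd.DataFrame({"ILD": [1.0, 10.0, 100.0, 1000.0]})

# Assigning to a new column name adds that column to the DataFrame.
# The logarithm is only defined for positive values, which is one reason
# the non-positive ILD rows were removed earlier.
toy["log_ILD"] = np.log10(toy["ILD"])

print(toy)
```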

In [None]:
# Put your code here


In [None]:
wells = wells[wells.DPHI > 0]

In [None]:
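# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(wells.DPHI, kde=True)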

# Load testing data

In [None]:
w_train = wells.copy()
w_test = pd.read_csv('../data/testing_DataFrame.csv', index_col=0)
w_test_complete = pd.read_csv('../data/testing_DataFrame_complete.csv', index_col=0)

In [None]:
w_test.head()

In [None]:
w_test.describe()

In [None]:
w_test = w_test[w_test.DPHI > 0]

In [None]:
w_test_complete = w_test_complete[w_test_complete.DPHI > 0]

In [None]:
w_test.describe()

Let's start testing our training pipeline with a subset of wells. We can come back to this and change the number of wells we include, to see how it affects the result.

In [None]:
w_train = w_train[w_train.well_ID < 25]

In [None]:
# Make X and y
X = w_train[['Depth','GR','ILD','NPHI']].values  # as_matrix() was removed in pandas 1.0
y = w_train['RHOB'].values

In [None]:
X.shape

Set up the testing matrix of features we want to use to predict the missing `RHOB`

In [None]:
X_test = w_test[['Depth','GR','ILD','NPHI']].values  # as_matrix() was removed in pandas 1.0
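The shapes scikit-learn expects can be checked on a toy table: `X` is two-dimensional (samples by features) and `y` is one-dimensional (one target per sample). The values and the reduced feature list below are invented for illustration.

```python
import pandas as pd

# Toy log table with three depth samples (made-up values).
toy = pd.DataFrame({
    "Depth": [1000.0, 1000.5, 1001.0],
    "GR": [55.0, 60.0, 58.0],
    "RHOB": [2.45, 2.50, 2.48],
})

# Selecting with a list of names keeps a 2-D table; .values gives a NumPy array.
feature_names = ["Depth", "GR"]
X_toy = toy[feature_names].values

# Selecting a single column gives a 1-D array of targets.
y_toy = toy["RHOB"].values

print(X_toy.shape, y_toy.shape)
```

The training and testing matrices must list the same features in the same column order, otherwise the fitted model will be applied to the wrong logs.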

We will display the predicted vs. true results for a test well

In [None]:
well_id = 81

# Available scikit-learn models to choose from:

http://scikit-learn.org/stable/supervised_learning.html

# Linear Regression


A first simple approach is to apply a linear model

In [None]:
from sklearn import linear_model                

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X,y)

# Make predictions using the testing set
y_test_LR = regr.predict(X_test)

# add a new column to data frame that already exists
w_test_complete['RHOB_pred_LinReg'] = y_test_LR

my_well = w_test_complete[w_test_complete.well_ID==well_id]

plt.figure(figsize=(3,10))
plt.plot(my_well.RHOB, my_well.Depth, 'k')
plt.plot(my_well.RHOB_pred_LinReg, my_well.Depth,'r')

<div class="alert alert-success">
    <b>Exercise</b>:
     <ul>
      <li>
      Complete the following code to test different regressors, following the Linear Regression example above
      </li>
      <p>
    </ul>
</div>
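All scikit-learn regressors share the same `fit` / `predict` interface, so swapping models usually means changing a single line. The sketch below shows the pattern on synthetic data (not the well logs); the dictionary keys are hypothetical labels chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy linear combination of two features.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.05, size=200)

# Fit on the first 150 samples, predict the remaining 50.
X_fit, X_new = X_demo[:150], X_demo[150:]
y_fit = y_demo[:150]

models = {
    "LinReg": LinearRegression(),
    "DTR": DecisionTreeRegressor(random_state=0),
    "KNN": KNeighborsRegressor(),
}

predictions = {}
for name, model in models.items():
    model.fit(X_fit, y_fit)                   # same call for every model
    predictions[name] = model.predict(X_new)  # one prediction per new sample
    print(name, predictions[name].shape)
```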


# Decision Tree Regressor

In [None]:
# add a new column to data frame that already exists and plot the results


# Nearest Neighbours

In [None]:
from sklearn.neighbors import KNeighborsRegressor

nbrs = KNeighborsRegressor()


In [None]:
# add a new column to data frame that already exists and plot the results


# Gradient Boosting Ensemble Regressor

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor



# Evaluation Metrics

Although it's good to see how the plots look, a more general way to determine how good a model is at predicting data is to compute quantitative scores, ideally on data the model has not seen during training.


http://scikit-learn.org/stable/model_selection.html#model-selection

"Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally."

In [None]:
from sklearn.model_selection import cross_val_score

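# `est` must be an estimator defined earlier in the notebook (for example, the regressor from the Gradient Boosting exercise)
scores = cross_val_score(est, X_test, w_test_complete.RHOB, cv=5, scoring='neg_mean_squared_error')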
scores  
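The hold-out idea in the quote, and the sign convention of `neg_mean_squared_error`, can be sketched on synthetic data. Everything below (the data, the choice of a decision tree, the split size) is an assumption made for illustration rather than part of the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve (not the well logs).
rng = np.random.RandomState(42)
X_demo = rng.uniform(0, 1, size=(300, 1))
y_demo = np.sin(2 * np.pi * X_demo[:, 0]) + rng.normal(0, 0.3, size=300)

# Hold out a quarter of the samples that the model never sees during fitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# An unpruned tree can memorise the training labels.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_r2 = tree.score(X_tr, y_tr)  # R^2 on data it has seen
test_r2 = tree.score(X_te, y_te)   # R^2 on held-out data
print("train R2:", train_r2, "test R2:", test_r2)

# Cross-validation repeats the hold-out idea over 5 folds.
# scikit-learn reports the *negative* MSE so that larger is always better.
neg_mse = cross_val_score(
    DecisionTreeRegressor(random_state=0), X_demo, y_demo,
    cv=5, scoring="neg_mean_squared_error",
)
rmse_per_fold = np.sqrt(-neg_mse)
print("RMSE per fold:", rmse_per_fold)
```

The training score is close to perfect while the held-out score is lower, which is the overfitting the quote warns about. Flipping the sign of the cross-validation scores and taking the square root gives an error in the same units as the target.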

## Regression metrics

[TOP](#top)

http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

In [None]:
from sklearn.metrics import explained_variance_score
print(explained_variance_score(my_well.RHOB, my_well.RHOB_pred_LinReg))  
print(explained_variance_score(my_well.RHOB, my_well.RHOB_pred_DTR))
print(explained_variance_score(my_well.RHOB, my_well.RHOB_pred_KNN))


In [None]:
from sklearn.metrics import mean_squared_error
print(mean_squared_error(my_well.RHOB, my_well.RHOB_pred_LinReg))  
print(mean_squared_error(my_well.RHOB, my_well.RHOB_pred_DTR))
print(mean_squared_error(my_well.RHOB, my_well.RHOB_pred_KNN))
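The two metrics do not measure the same thing. As a sketch with made-up density values, a prediction that is offset by a constant still gets a perfect explained variance, while the mean squared error exposes the bias:

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error

# Made-up "true" densities and a prediction shifted by a constant 0.1.
y_true = np.array([2.3, 2.4, 2.5, 2.6])
y_biased = y_true + 0.1

# Explained variance only looks at the spread of the residuals,
# so a constant offset does not lower it.
ev = explained_variance_score(y_true, y_biased)

# MSE averages the squared residuals, so the offset shows up directly.
mse = mean_squared_error(y_true, y_biased)

print("explained variance:", ev)
print("mean squared error:", mse)
```

Looking at both numbers together helps separate a model that follows the shape of the curve but is shifted from one that is simply noisy.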

# Feature Engineering

What can we do to help our model?

<div class="alert alert-success">
    <b>Exercise</b>:
     <ul>
      <li>
      Create a function using `np.convolve` to smooth a log curve and return the smoothed version to add to the `DataFrame`
      </li>
      <p>
    </ul>
</div>
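One possible starting point is a moving average, which is a convolution with a flat kernel. This is a minimal sketch: the function name `smooth`, the default window length, and the toy curve are all hypothetical choices, not the notebook's reference solution.

```python
import numpy as np

def smooth(curve, window=5):
    """Moving average of a 1-D array using a flat kernel of `window` samples."""
    kernel = np.ones(window) / window
    # mode="same" returns an array as long as the input; samples near the
    # ends are averaged with implicit zeros, so the edges are biased.
    return np.convolve(curve, kernel, mode="same")

# Toy curve that alternates between 0 and 1.
noisy = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
smoothed = smooth(noisy, window=3)

print(smoothed)
```

On a `DataFrame` holding several wells, it is probably safer to apply the function one well at a time so that the averaging window does not mix samples from different wells.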

In [None]:
# s_NPHI will be the smoothed array!
X = w_train[['Depth','GR','ILD','NPHI','s_NPHI']].values  # as_matrix() was removed in pandas 1.0



In [None]:
print(mean_squared_error(my_well.RHOB, my_well.RHOB_pred_GBT)) 


<hr />

<p style="color:gray">©2017 Agile Geoscience. Licensed CC-BY.</p>