# First models

Here, we build a modelling skeleton on which later developments can hang.

In [98]:
import scipy
import pandas as pd
import numpy as np
import os
from os.path import join
from pprint import pprint

import xeek
import xeek.features as features

from sklearn.impute import KNNImputer

%matplotlib inline
from importlib import reload
reload(xeek)
reload(features)

<module 'xeek.features' from '/home/goyder/Projects/xeek/xeek/features.py'>

## Data import

In [3]:
df_train = pd.read_csv(xeek.raw_train_filepath, sep=";")

In [4]:
df_train.head()

,WELL,DEPTH_MD,X_LOC,Y_LOC,Z_LOC,GROUP,FORMATION,CALI,RSHA,RMED,...,ROP,DTS,DCAL,DRHO,MUDWEIGHT,RMIC,ROPA,RXO,FORCE_2020_LITHOFACIES_LITHOLOGY,FORCE_2020_LITHOFACIES_CONFIDENCE
0,15/9-13,494.528,437641.96875,6470972.5,-469.501831,NORDLAND GP.,,19.480835,,1.61141,...,34.63641,,,-0.574928,,,,,65000,1.0
1,15/9-13,494.68,437641.96875,6470972.5,-469.653809,NORDLAND GP.,,19.4688,,1.61807,...,34.63641,,,-0.570188,,,,,65000,1.0
2,15/9-13,494.832,437641.96875,6470972.5,-469.805786,NORDLAND GP.,,19.4688,,1.626459,...,34.779556,,,-0.574245,,,,,65000,1.0
3,15/9-13,494.984,437641.96875,6470972.5,-469.957794,NORDLAND GP.,,19.459282,,1.621594,...,39.965164,,,-0.586315,,,,,65000,1.0
4,15/9-13,495.136,437641.96875,6470972.5,-470.109772,NORDLAND GP.,,19.4531,,1.602679,...,57.483765,,,-0.597914,,,,,65000,1.0


In [5]:
df_test = pd.read_csv(xeek.raw_test_filepath, sep=";")

In [6]:
df_test.head()

,WELL,DEPTH_MD,X_LOC,Y_LOC,Z_LOC,GROUP,FORMATION,CALI,RSHA,RMED,...,SP,BS,ROP,DTS,DCAL,DRHO,MUDWEIGHT,RMIC,ROPA,RXO
0,15/9-14,480.628001,423244.5,6461862.5,-455.62442,NORDLAND GP.,,19.2031,,1.613886,...,35.525719,,96.46199,,,-0.538873,0.130611,,,
1,15/9-14,480.780001,423244.5,6461862.5,-455.776428,NORDLAND GP.,,19.2031,,1.574376,...,36.15852,,96.454399,,,-0.539232,0.130611,,,
2,15/9-14,480.932001,423244.5,6461862.5,-455.928436,NORDLAND GP.,,19.2031,,1.436627,...,36.873703,,96.446686,,,-0.54083,0.130611,,,
3,15/9-14,481.084001,423244.5,6461862.5,-456.080444,NORDLAND GP.,,19.2031,,1.276094,...,37.304054,,161.170166,,,-0.543943,0.130611,,,
4,15/9-14,481.236001,423244.53125,6461862.5,-456.232422,NORDLAND GP.,,19.2031,,1.204704,...,37.864922,,172.48912,,,-0.542104,0.130611,,,


## Modelling approach

For this first approach, we generate an "end-to-end" model that is deliberately simple. Our goal here is not to produce an impressive model, but rather to generate a skeleton on which further work can hang.

Our first modelling approach will:

* Use only features that appear in most of the wells
* Cross-validate on a well basis
* In-fill missing values with a mean-per-well value
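
The mean-per-well in-fill named in the last bullet can be sketched with a pandas `groupby`. This is a minimal illustration on made-up data, not the implementation used later in this notebook; the toy frame and the global-mean fallback for wells that have no readings at all for a column are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the well-log table; the values are made up.
toy = pd.DataFrame({
    "WELL": ["A", "A", "A", "B", "B"],
    "GR": [80.0, np.nan, 70.0, 50.0, np.nan],
    "CALI": [np.nan, np.nan, np.nan, 8.0, 9.0],
})
log_columns = ["GR", "CALI"]

# Step 1: fill each gap with the mean of that column within the same well.
per_well_mean = toy.groupby("WELL")[log_columns].transform("mean")
filled = toy[log_columns].fillna(per_well_mean)

# Step 2 (assumption): a well with no readings at all for a column
# (CALI in well A) has a NaN mean, so fall back to the global column mean.
filled = filled.fillna(toy[log_columns].mean())

toy_imputed = toy.copy()
toy_imputed[log_columns] = filled
print(toy_imputed)
```

Note that the fallback uses information from other wells, so in a cross-validated setting it would need to be computed on the training wells only.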

### Feature selection

We retrieve only the features that are present in the majority of the well logs:

In [9]:
universal_features = features.features_mostly_present(df_train, presence_threshold=.8)
pprint(universal_features)

['WELL',
 'DEPTH_MD',
 'X_LOC',
 'Y_LOC',
 'Z_LOC',
 'GROUP',
 'FORMATION',
 'CALI',
 'RMED',
 'RDEP',
 'RHOB',
 'GR',
 'DTC',
 'DRHO',
 'FORCE_2020_LITHOFACIES_LITHOLOGY',
 'FORCE_2020_LITHOFACIES_CONFIDENCE']
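
The source of `features.features_mostly_present` is not shown in this notebook. As a rough sketch of what such a helper might do, the hypothetical function below assumes that a feature is "present" in a well when it has at least one non-null value there, and that `presence_threshold` is the minimum fraction of wells in which it must be present. Both assumptions, and the function name, are illustrative only.

```python
import numpy as np
import pandas as pd


def features_mostly_present_sketch(df, presence_threshold=0.8, well_column="WELL"):
    """Hypothetical stand-in for xeek.features.features_mostly_present.

    Assumption: a feature is 'present' in a well when it has at least one
    non-null value there, and it is kept when the fraction of wells in which
    it is present is at least presence_threshold.
    """
    # One row per well, True where the column has any non-null value.
    present_in_well = (
        df.drop(columns=[well_column]).notna().groupby(df[well_column]).any()
    )
    # Fraction of wells in which each column is present.
    presence_fraction = present_in_well.mean()
    kept = presence_fraction[presence_fraction >= presence_threshold].index
    # Preserve the original column order and keep the well identifier.
    return [col for col in df.columns if col == well_column or col in kept]


toy = pd.DataFrame({
    "WELL": ["A", "A", "B", "B", "C"],
    "GR": [1.0, 2.0, 3.0, np.nan, 5.0],        # present in 3 of 3 wells
    "SP": [np.nan, np.nan, 1.0, 2.0, np.nan],  # present in 1 of 3 wells
})
print(features_mostly_present_sketch(toy))
```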


How does this compare to our test dataset?

In [10]:
universal_features_test = features.features_mostly_present(df_test, presence_threshold=.8)
pprint(universal_features_test)

['WELL',
 'DEPTH_MD',
 'X_LOC',
 'Y_LOC',
 'Z_LOC',
 'GROUP',
 'FORMATION',
 'CALI',
 'RMED',
 'RDEP',
 'RHOB',
 'GR',
 'NPHI',
 'PEF',
 'DTC',
 'DRHO']


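Apart from the two label columns (`FORCE_2020_LITHOFACIES_LITHOLOGY` and `FORCE_2020_LITHOFACIES_CONFIDENCE`), which only the training data carries, every mostly-present training feature is also mostly present in the test data; the test set additionally has `NPHI` and `PEF`. We are therefore safe to work with this limited feature set.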
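
The comparison can also be made explicit with set differences rather than by reading the two lists side by side. The lists below are copied from the two outputs above; the variable names are illustrative.

```python
train_features = [
    "WELL", "DEPTH_MD", "X_LOC", "Y_LOC", "Z_LOC", "GROUP", "FORMATION",
    "CALI", "RMED", "RDEP", "RHOB", "GR", "DTC", "DRHO",
    "FORCE_2020_LITHOFACIES_LITHOLOGY", "FORCE_2020_LITHOFACIES_CONFIDENCE",
]
test_features = [
    "WELL", "DEPTH_MD", "X_LOC", "Y_LOC", "Z_LOC", "GROUP", "FORMATION",
    "CALI", "RMED", "RDEP", "RHOB", "GR", "NPHI", "PEF", "DTC", "DRHO",
]

# Features we would train on but could not find in the test data.
missing_from_test = set(train_features) - set(test_features)
# Features available at test time that the training selection dropped.
extra_in_test = set(test_features) - set(train_features)

print(sorted(missing_from_test))
print(sorted(extra_in_test))
```

Only the label and confidence columns are missing from the test side, which is expected for an unlabelled test set.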

We will further limit our features to continuous features.

In [34]:
feature_columns = [feature for feature in universal_features if
                   (feature in features.well_log_features)]
columns = feature_columns + features.target + ["WELL"]

In [35]:
columns

['CALI',
 'RMED',
 'RDEP',
 'RHOB',
 'GR',
 'DTC',
 'FORCE_2020_LITHOFACIES_LITHOLOGY',
 'WELL']

In [13]:
df_train_limited = df_train.loc[:, columns]

In [14]:
df_train_limited

,CALI,RMED,RDEP,RHOB,GR,DTC,FORCE_2020_LITHOFACIES_LITHOLOGY,WELL
0,19.480835,1.611410,1.798681,1.884186,80.200851,161.131180,65000,15/9-13
1,19.468800,1.618070,1.795641,1.889794,79.262886,160.603470,65000,15/9-13
2,19.468800,1.626459,1.800733,1.896523,74.821999,160.173615,65000,15/9-13
3,19.459282,1.621594,1.801517,1.891913,72.878922,160.149429,65000,15/9-13
4,19.453100,1.602679,1.795299,1.880034,71.729141,160.128342,65000,15/9-13
...,...,...,...,...,...,...,...,...
1170506,8.423170,,,2.527984,77.654900,,30000,7/1-2 S
1170507,8.379244,,,2.537613,75.363937,,65030,7/1-2 S
1170508,8.350248,,,2.491860,66.452843,,65030,7/1-2 S
1170509,8.313779,,,2.447539,55.784817,,65030,7/1-2 S


We also create a small "canary" dataset, containing only the first three wells, so that we can test our methodology quickly.

In [15]:
first_wells = df_train["WELL"].unique()[:3]

In [16]:
first_wells

array(['15/9-13', '15/9-15', '15/9-17'], dtype=object)

In [17]:
df_train_canary = (df_train_limited
                  .query("WELL in @first_wells"))

### Missingness

With the columns selected, we briefly review the fraction of missing values per well:

In [21]:
(df_train_limited
.groupby("WELL")
.aggregate(lambda s: s.isna().sum() / len(s))
.sort_values("CALI", ascending=False))

WELL,CALI,RMED,RDEP,RHOB,GR,DTC,FORCE_2020_LITHOFACIES_LITHOLOGY
35/8-4,1.000000,0.000000,0.000000,0.000000,0.0,0.059432,0
36/7-3,0.982359,0.001103,0.000827,0.042448,0.0,0.033352,0
35/11-12,0.804521,0.003924,0.000000,0.822677,0.0,0.463949,0
31/5-4 S,0.796270,0.796270,0.158299,0.786371,0.0,0.843822,0
33/9-1,0.689027,0.000000,0.000000,0.398484,0.0,0.000058,0
...,...,...,...,...,...,...,...
25/8-5 S,0.000000,0.005481,0.000000,0.044053,0.0,0.448959,0
25/7-2,0.000000,0.002638,0.000000,0.004397,0.0,0.002387,0
25/6-3,0.000000,0.003421,0.000000,0.316479,0.0,0.000000,0
25/6-2,0.000000,0.000000,0.000000,0.000000,0.0,0.005470,0


Most of our wells have relatively complete data, but there are some outliers; `16/11-1 ST3` is particularly messy. In fact, were our threshold for whether a feature counts as "present" slightly stricter, we probably wouldn't include `X_LOC`, `Y_LOC`, `Z_LOC`, and `RDEP` as features.
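
Sorting by `CALI` alone only surfaces wells that lack caliper data. One possible variation, sketched here on made-up data, ranks wells by their average missingness across all columns; the `overall` column name is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "WELL": ["A", "A", "B", "B"],
    "CALI": [np.nan, 1.0, 2.0, 3.0],
    "GR": [np.nan, np.nan, 1.0, np.nan],
})
value_columns = ["CALI", "GR"]

# Fraction missing per well and column: the mean of a boolean mask
# is the share of True values, equivalent to isna().sum() / len(s).
missing_fraction = toy[value_columns].isna().groupby(toy["WELL"]).mean()

# Rank wells by average missingness over all columns, not just one.
missing_fraction["overall"] = missing_fraction.mean(axis=1)
missing_fraction = missing_fraction.sort_values("overall", ascending=False)
print(missing_fraction)
```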

## Pipeline

We wish to design a pipeline for our model. We will:

* Split our data into X and Y inputs
* Impute the missing values
* Build a basic linear classification model
* Judge this by our custom metric

In [49]:
from sklearn import compose, pipeline, preprocessing, linear_model, impute

### Column split out

In [81]:
out = df_train_canary[features.target[0]]

In [82]:
out

0        65000
1        65000
2        65000
3        65000
4        65000
         ...  
53332    65000
53333    65000
53334    65000
53335    65000
53336    65000
Name: FORCE_2020_LITHOFACIES_LITHOLOGY, Length: 53337, dtype: int64

In [85]:
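def split_into_x_y_groups(df, x_columns, y_column="FORCE_2020_LITHOFACIES_LITHOLOGY", group_column="WELL"):
    """Split a well-log frame into a feature matrix, mapped labels, and well identifiers."""
    X = df[x_columns].to_numpy()
    # Map the raw lithology codes (e.g. 65000) via features.lithology_mapping.
    Y = df[y_column].map(features.lithology_mapping).to_numpy()
    # ravel() gives a 1-D array whether group_column is a name or a one-element list.
    group = df[group_column].to_numpy().ravel()

    return X, Y, group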

In [86]:
X, Y, group = split_into_x_y_groups(df_train_canary, feature_columns, features.target[0], group_column=["WELL"])
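
The `group` array is what a well-based cross-validation would consume. As a hedged sketch of that idea, the example below uses scikit-learn's `GroupKFold` on synthetic data with a one-dimensional array of made-up well names; it is not the cross-validation scheme this project ends up using.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(12, 3))
y_toy = rng.integers(0, 2, size=12)
# Four samples from each of three hypothetical wells.
groups_toy = np.repeat(["well_a", "well_b", "well_c"], 4)

cv = GroupKFold(n_splits=3)
folds = list(cv.split(X_toy, y_toy, groups=groups_toy))

for fold, (train_idx, val_idx) in enumerate(folds):
    train_wells = set(groups_toy[train_idx])
    val_wells = set(groups_toy[val_idx])
    # No well contributes rows to both sides of a split.
    assert train_wells.isdisjoint(val_wells)
    print(fold, sorted(val_wells))
```

Holding out whole wells gives a more honest estimate of how a model generalises to an unseen well than a random row-level split would.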

### Build a pipeline

In [112]:
classification_pipeline = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    impute.SimpleImputer(strategy="median"),
    linear_model.SGDClassifier()
)

In [113]:
classification_pipeline.fit(X, Y)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('simpleimputer', SimpleImputer(strategy='median')),
                ('sgdclassifier', SGDClassifier())])
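
Scaling before imputing works because `StandardScaler` disregards NaNs when fitting and leaves them in place when transforming, so the imputer then fills them with the median of the already-scaled column. A small check of that behaviour on made-up data:

```python
import numpy as np
from sklearn import impute, pipeline, preprocessing

X_toy = np.array([[1.0, 10.0],
                  [3.0, np.nan],
                  [np.nan, 30.0],
                  [5.0, 20.0]])

scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X_toy)
# The two NaNs are skipped when computing mean/std and passed through unchanged.
print(scaler.mean_)
print(np.isnan(X_scaled).sum())

toy_pipeline = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    impute.SimpleImputer(strategy="median"),
)
X_ready = toy_pipeline.fit_transform(X_toy)
# After the imputer, nothing is missing any more.
print(np.isnan(X_ready).any())
```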

In [114]:
Y_hat = classification_pipeline.predict(X)

### Score the results

In [115]:
A = np.load(join(xeek.external_data_dir, 'penalty_matrix.npy'))

In [116]:
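def score(y_true, y_pred):
    """Mean negative penalty over all samples; more negative is worse."""
    S = 0.0
    y_true = y_true.astype(int)
    y_pred = y_pred.astype(int)
    # A[i, j] is the penalty for predicting class j when the true class is i.
    for i in range(0, y_true.shape[0]):
        S -= A[y_true[i], y_pred[i]]
    return S/y_true.shape[0]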
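
The loop can be slow on the full dataset. A vectorised equivalent, sketched here with a hypothetical 3-class penalty matrix (the real one is loaded from `penalty_matrix.npy`), uses NumPy integer-array indexing to look up every penalty at once:

```python
import numpy as np

# Hypothetical penalty matrix for illustration only.
A_toy = np.array([[0.0, 2.0, 4.0],
                  [2.0, 0.0, 1.0],
                  [4.0, 1.0, 0.0]])


def score_loop(y_true, y_pred, penalty):
    """Reference implementation: accumulate penalties one sample at a time."""
    total = 0.0
    for t, p in zip(y_true.astype(int), y_pred.astype(int)):
        total -= penalty[t, p]
    return total / y_true.shape[0]


def score_vectorized(y_true, y_pred, penalty):
    """Same quantity, using paired integer-array indexing."""
    return -penalty[y_true.astype(int), y_pred.astype(int)].mean()


y_true = np.array([0, 1, 2, 2])
y_pred = np.array([0, 2, 2, 0])
print(score_loop(y_true, y_pred, A_toy))
print(score_vectorized(y_true, y_pred, A_toy))
```

Both functions return the same value up to floating-point rounding, since summation order may differ.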

In [117]:
score(Y, Y_hat)

-0.42913221591015616

We have a model with extremely questionable results, even though this score is computed on the same three wells the model was fit on.

In [118]:
X_full, Y_full, group_full = split_into_x_y_groups(df_train, feature_columns, features.target[0], group_column=["WELL"])

In [119]:
Y_hat_full = classification_pipeline.predict(X_full)

In [120]:
score(Y_full, Y_hat_full)

-1.0578384141627033

Silly, but slightly better than the "everything is shale" method.
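
The "everything is shale" baseline itself is not computed in this notebook. A constant-prediction baseline of that kind could be sketched as follows, assuming shale is the most frequent class (which is not stated here) and using a hypothetical penalty matrix and made-up labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical 3-class penalty matrix and labels, for illustration only.
A_toy = np.array([[0.0, 2.0, 4.0],
                  [2.0, 0.0, 1.0],
                  [4.0, 1.0, 0.0]])


def score_with(penalty, y_true, y_pred):
    """Mean negative penalty; more negative is worse."""
    return -penalty[y_true, y_pred].mean()


X_toy = np.zeros((6, 1))               # features are ignored by the dummy model
y_toy = np.array([2, 2, 2, 2, 0, 1])   # class 2 plays the role of the majority class

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)
baseline_score = score_with(A_toy, y_toy, y_base)
print(baseline_score)
```

Any real model should beat this number on held-out wells before it is worth refining.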

This is a silly example, mostly contrived to work around the fact that the machine this is being written on breaks if the model is fit on the full dataset.

We now want to take these pieces and extend them into a meaningful model.