# First models

Here, we build a modelling skeleton on which to hang further developments.

In [1]:
import scipy
import pandas as pd
import os
from os.path import join
from pprint import pprint

import xeek
import xeek.features as features

from sklearn.impute import KNNImputer

%matplotlib inline
from importlib import reload
reload(xeek)
reload(features)

<module 'xeek.features' from '/home/goyder/Projects/xeek/xeek/features.py'>

## Data import

In [2]:
df_train = pd.read_csv(xeek.raw_train_filepath, sep=";")

In [3]:
df_train.head()

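| | WELL | DEPTH_MD | X_LOC | Y_LOC | Z_LOC | GROUP | FORMATION | CALI | RSHA | RMED | ... | ROP | DTS | DCAL | DRHO | MUDWEIGHT | RMIC | ROPA | RXO | FORCE_2020_LITHOFACIES_LITHOLOGY | FORCE_2020_LITHOFACIES_CONFIDENCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15/9-13 | 494.528 | 437641.96875 | 6470972.5 | -469.501831 | NORDLAND GP. | NaN | 19.480835 | NaN | 1.61141 | ... | 34.63641 | NaN | NaN | -0.574928 | NaN | NaN | NaN | NaN | 65000 | 1.0 |
| 1 | 15/9-13 | 494.68 | 437641.96875 | 6470972.5 | -469.653809 | NORDLAND GP. | NaN | 19.4688 | NaN | 1.61807 | ... | 34.63641 | NaN | NaN | -0.570188 | NaN | NaN | NaN | NaN | 65000 | 1.0 |
| 2 | 15/9-13 | 494.832 | 437641.96875 | 6470972.5 | -469.805786 | NORDLAND GP. | NaN | 19.4688 | NaN | 1.626459 | ... | 34.779556 | NaN | NaN | -0.574245 | NaN | NaN | NaN | NaN | 65000 | 1.0 |
| 3 | 15/9-13 | 494.984 | 437641.96875 | 6470972.5 | -469.957794 | NORDLAND GP. | NaN | 19.459282 | NaN | 1.621594 | ... | 39.965164 | NaN | NaN | -0.586315 | NaN | NaN | NaN | NaN | 65000 | 1.0 |
| 4 | 15/9-13 | 495.136 | 437641.96875 | 6470972.5 | -470.109772 | NORDLAND GP. | NaN | 19.4531 | NaN | 1.602679 | ... | 57.483765 | NaN | NaN | -0.597914 | NaN | NaN | NaN | NaN | 65000 | 1.0 |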


In [4]:
df_test = pd.read_csv(xeek.raw_test_filepath, sep=";")

In [5]:
df_test.head()

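| | WELL | DEPTH_MD | X_LOC | Y_LOC | Z_LOC | GROUP | FORMATION | CALI | RSHA | RMED | ... | SP | BS | ROP | DTS | DCAL | DRHO | MUDWEIGHT | RMIC | ROPA | RXO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15/9-14 | 480.628001 | 423244.5 | 6461862.5 | -455.62442 | NORDLAND GP. | NaN | 19.2031 | NaN | 1.613886 | ... | 35.525719 | NaN | 96.46199 | NaN | NaN | -0.538873 | 0.130611 | NaN | NaN | NaN |
| 1 | 15/9-14 | 480.780001 | 423244.5 | 6461862.5 | -455.776428 | NORDLAND GP. | NaN | 19.2031 | NaN | 1.574376 | ... | 36.15852 | NaN | 96.454399 | NaN | NaN | -0.539232 | 0.130611 | NaN | NaN | NaN |
| 2 | 15/9-14 | 480.932001 | 423244.5 | 6461862.5 | -455.928436 | NORDLAND GP. | NaN | 19.2031 | NaN | 1.436627 | ... | 36.873703 | NaN | 96.446686 | NaN | NaN | -0.54083 | 0.130611 | NaN | NaN | NaN |
| 3 | 15/9-14 | 481.084001 | 423244.5 | 6461862.5 | -456.080444 | NORDLAND GP. | NaN | 19.2031 | NaN | 1.276094 | ... | 37.304054 | NaN | 161.170166 | NaN | NaN | -0.543943 | 0.130611 | NaN | NaN | NaN |
| 4 | 15/9-14 | 481.236001 | 423244.53125 | 6461862.5 | -456.232422 | NORDLAND GP. | NaN | 19.2031 | NaN | 1.204704 | ... | 37.864922 | NaN | 172.48912 | NaN | NaN | -0.542104 | 0.130611 | NaN | NaN | NaN |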


## Modelling approach

For this first pass, we generate an "end-to-end" model with a deliberately simple approach. Our goal here is not to produce an impressive model, but rather to build a skeleton on which to hang further work.

Our first modelling approach will:

* Use only features that appear in *all* wells
* Cross-validate on a well basis
* In-fill missing values with a mean-per-well value
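The mean-per-well in-fill in the last bullet can be sketched as follows. This is a minimal illustration on a toy frame, not code from the `xeek` package: the helper name `impute_mean_per_well` and the global-mean fallback for wells that have no values at all for a feature are assumptions.

```python
import numpy as np
import pandas as pd


def impute_mean_per_well(df, feature_columns, well_column="WELL"):
    """Fill NaNs with the mean of the same feature within the same well.

    Hypothetical helper: if a well has no values at all for a feature,
    its mean is NaN, so we fall back to the overall mean of that feature.
    """
    out = df.copy()
    # transform("mean") returns a frame aligned row-by-row with the input.
    well_means = out.groupby(well_column)[feature_columns].transform("mean")
    out[feature_columns] = out[feature_columns].fillna(well_means)
    # Fallback for wells where the feature is missing entirely.
    out[feature_columns] = out[feature_columns].fillna(df[feature_columns].mean())
    return out


toy = pd.DataFrame({
    "WELL": ["A", "A", "A", "B", "B"],
    "GR": [10.0, np.nan, 30.0, np.nan, np.nan],
    "RDEP": [1.0, 2.0, np.nan, 4.0, 6.0],
})
imputed = impute_mean_per_well(toy, ["GR", "RDEP"])
print(imputed)
```

When this is used inside cross-validation, the fallback mean should be computed from the training folds only, otherwise information from the held-out wells leaks into the fill values.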

### Feature selection

We retrieve only the features that are present in all well logs:

In [6]:
universal_features = features.features_mostly_present(df_train, presence_threshold=1)
pprint(universal_features)

['WELL',
 'DEPTH_MD',
 'X_LOC',
 'Y_LOC',
 'Z_LOC',
 'GROUP',
 'RDEP',
 'GR',
 'FORCE_2020_LITHOFACIES_LITHOLOGY',
 'FORCE_2020_LITHOFACIES_CONFIDENCE']
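The implementation of `features.features_mostly_present` is not shown in this notebook. As a rough sketch of what such a selector might do, the hypothetical function below assumes that a feature counts as "present" in a well if the well has at least one non-null value for it, and that `presence_threshold` is the minimum fraction of wells in which the feature must be present. Both assumptions may differ from the real function.

```python
import numpy as np
import pandas as pd


def features_present_in_wells(df, presence_threshold=1.0, well_column="WELL"):
    """Hypothetical stand-in for a 'mostly present' feature selector."""
    # Per well: True if the feature has at least one non-null value.
    present = df.drop(columns=well_column).notna().groupby(df[well_column]).any()
    # Fraction of wells in which each feature is present.
    fraction_of_wells = present.mean()
    kept = fraction_of_wells[fraction_of_wells >= presence_threshold].index
    # Keep the well identifier and preserve the original column order.
    return [c for c in df.columns if c == well_column or c in kept]


toy = pd.DataFrame({
    "WELL": ["A", "A", "B", "B"],
    "GR": [1.0, np.nan, 2.0, 3.0],
    "CALI": [1.0, 2.0, np.nan, np.nan],   # never logged in well B
    "RDEP": [np.nan, 1.0, np.nan, 2.0],
})
strict = features_present_in_wells(toy, presence_threshold=1.0)
relaxed = features_present_in_wells(toy, presence_threshold=0.5)
print(strict)
print(relaxed)
```

Under this definition a feature can be "present" in every well and still have many missing rows inside a given well.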


How does this compare to our test dataset?

In [7]:
universal_features_test = features.features_mostly_present(df_test)
pprint(universal_features_test)

['WELL',
 'DEPTH_MD',
 'X_LOC',
 'Y_LOC',
 'Z_LOC',
 'GROUP',
 'FORMATION',
 'CALI',
 'RMED',
 'RDEP',
 'GR',
 'DTC']


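Ignoring the two target columns, which the test set does not contain, the features present across the test dataset are a superset of those present across the training dataset, so we are good to work with this limited feature set.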
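The comparison can also be checked mechanically rather than by eye. The sketch below copies the two printed lists from the outputs above and uses plain set arithmetic; the variable names are illustrative.

```python
# Lists copied from the two outputs above.
train_universal = [
    "WELL", "DEPTH_MD", "X_LOC", "Y_LOC", "Z_LOC", "GROUP", "RDEP", "GR",
    "FORCE_2020_LITHOFACIES_LITHOLOGY", "FORCE_2020_LITHOFACIES_CONFIDENCE",
]
test_universal = [
    "WELL", "DEPTH_MD", "X_LOC", "Y_LOC", "Z_LOC", "GROUP", "FORMATION",
    "CALI", "RMED", "RDEP", "GR", "DTC",
]
# The target columns only exist in the training data, so exclude them.
target_columns = {
    "FORCE_2020_LITHOFACIES_LITHOLOGY",
    "FORCE_2020_LITHOFACIES_CONFIDENCE",
}

train_inputs = set(train_universal) - target_columns
missing_from_test = train_inputs - set(test_universal)
extra_in_test = set(test_universal) - train_inputs

print("missing from test:", sorted(missing_from_test))
print("only in test:", sorted(extra_in_test))
```

An empty `missing_from_test` confirms that every input feature selected from the training data is also available in the test data.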

We further limit our features to continuous ones, keeping the target column and re-appending `WELL` so that we can still group rows by well.

In [8]:
# Keep continuous features and the target, then re-append WELL for grouping.
columns = [
    feature for feature in universal_features
    if feature in features.continuous_metadata_features
    or feature in features.well_log_features
    or feature in features.target
] + ["WELL"]

In [9]:
columns

['DEPTH_MD',
 'X_LOC',
 'Y_LOC',
 'Z_LOC',
 'RDEP',
 'GR',
 'FORCE_2020_LITHOFACIES_LITHOLOGY',
 'WELL']

In [10]:
df_train_limited = df_train.loc[:, columns]

In [11]:
df_train_limited

| | DEPTH_MD | X_LOC | Y_LOC | Z_LOC | RDEP | GR | FORCE_2020_LITHOFACIES_LITHOLOGY | WELL |
|---|---|---|---|---|---|---|---|---|
| 0 | 494.5280 | 437641.96875 | 6470972.5 | -469.501831 | 1.798681 | 80.200851 | 65000 | 15/9-13 |
| 1 | 494.6800 | 437641.96875 | 6470972.5 | -469.653809 | 1.795641 | 79.262886 | 65000 | 15/9-13 |
| 2 | 494.8320 | 437641.96875 | 6470972.5 | -469.805786 | 1.800733 | 74.821999 | 65000 | 15/9-13 |
| 3 | 494.9840 | 437641.96875 | 6470972.5 | -469.957794 | 1.801517 | 72.878922 | 65000 | 15/9-13 |
| 4 | 495.1360 | 437641.96875 | 6470972.5 | -470.109772 | 1.795299 | 71.729141 | 65000 | 15/9-13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1170506 | 3169.3124 | NaN | NaN | NaN | NaN | 77.654900 | 30000 | 7/1-2 S |
| 1170507 | 3169.4644 | NaN | NaN | NaN | NaN | 75.363937 | 65030 | 7/1-2 S |
| 1170508 | 3169.6164 | NaN | NaN | NaN | NaN | 66.452843 | 65030 | 7/1-2 S |
| 1170509 | 3169.7684 | NaN | NaN | NaN | NaN | 55.784817 | 65030 | 7/1-2 S |


We also create a small "canary" dataset from the first three wells, so that we can test our methods quickly before running them on the full data.

In [14]:
first_wells = df_train["WELL"].unique()[:3]

In [15]:
first_wells

array(['15/9-13', '15/9-15', '15/9-17'], dtype=object)

In [16]:
df_train_canary = (df_train_limited
                  .query("WELL in @first_wells"))

### Missingness

With the feature set fixed, we briefly review missingness, computed as the fraction of missing values for each feature within each well:

In [12]:
(df_train_limited
.groupby("WELL")
.aggregate(lambda s: s.isna().sum() / len(s))
.sort_values("X_LOC", ascending=False))

| WELL | DEPTH_MD | X_LOC | Y_LOC | Z_LOC | RDEP | GR | FORCE_2020_LITHOFACIES_LITHOLOGY |
|---|---|---|---|---|---|---|---|
| 16/11-1 ST3 | 0.0 | 0.493640 | 0.493640 | 0.493640 | 0.493640 | 0.0 | 0 |
| 31/5-4 S | 0.0 | 0.158299 | 0.158299 | 0.158299 | 0.158299 | 0.0 | 0 |
| 25/8-7 | 0.0 | 0.039449 | 0.039449 | 0.039449 | 0.039449 | 0.0 | 0 |
| 31/3-1 | 0.0 | 0.034980 | 0.034980 | 0.034980 | 0.034980 | 0.0 | 0 |
| 35/11-6 | 0.0 | 0.022426 | 0.022426 | 0.022426 | 0.022426 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 25/7-2 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0 |
| 25/6-3 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0 |
| 25/6-2 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0 |
| 25/6-1 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0 |


Most of our wells have relatively complete data, but there are some outliers. `16/11-1 ST3` is particularly messy: almost half of its rows are missing `X_LOC`, `Y_LOC`, `Z_LOC`, and `RDEP`. In fact, were our criterion for whether a feature counts as "present" slightly stricter, we probably wouldn't include these four as features.

## Data imputation

Regardless, we push on with a basic data imputation method. As a first experiment on the canary dataset, we try [KNN imputation](https://scikit-learn.org/stable/modules/impute.html#nearest-neighbors-imputation), which fills each missing value from the nearest rows in feature space.

In [20]:
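# Fill each missing value with the unweighted mean of its 2 nearest rows.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
# Index.difference returns the columns sorted alphabetically; the target column is still included here.
imputer.fit_transform(df_train_canary[df_train_canary.columns.difference(["WELL"])])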

array([[ 4.94528000e+02,  6.50000000e+04,  8.02008514e+01, ...,
         4.37641969e+05,  6.47097250e+06, -4.69501831e+02],
       [ 4.94680000e+02,  6.50000000e+04,  7.92628860e+01, ...,
         4.37641969e+05,  6.47097250e+06, -4.69653809e+02],
       [ 4.94832000e+02,  6.50000000e+04,  7.48219986e+01, ...,
         4.37641969e+05,  6.47097250e+06, -4.69805786e+02],
       ...,
       [ 3.11419000e+03,  6.50000000e+04,  1.09434319e+02, ...,
         4.38594719e+05,  6.47896750e+06, -3.09195459e+03],
       [ 3.11434200e+03,  6.50000000e+04,  1.11738235e+02, ...,
         4.38594719e+05,  6.47896750e+06, -3.09210669e+03],
       [ 3.11449400e+03,  6.50000000e+04,  1.12680618e+02, ...,
         4.38594719e+05,  6.47896750e+06, -3.09225855e+03]])
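To make the behaviour of `KNNImputer` concrete, here is a small worked example on a toy array, independent of the well data. With `n_neighbors=2` and uniform weights, each missing entry is replaced by the mean of that column over the two nearest rows that have a value there, where distance is computed only on the coordinates both rows have.

```python
import numpy as np
from sklearn.impute import KNNImputer

X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

toy_imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = toy_imputer.fit_transform(X_toy)
print(X_filled)
# Row 0, column 2: its two nearest donors hold 3.0 and 5.0, so the fill is 4.0.
# Row 2, column 0: its two nearest donors hold 3.0 and 8.0, so the fill is 5.5.
```

Because the distance is computed on the raw values, columns with large magnitudes (such as `X_LOC` and `Y_LOC` in the canary data) dominate the neighbour search unless the features are scaled first.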

```
TODO: Implement a basic pipeline - 
 * feature scale
 * feature imputation
 * fold by well
 * minimal classifier
```
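One way the pipeline in this TODO might be assembled with scikit-learn is sketched below on synthetic data. The choice of `LogisticRegression` as the minimal classifier, the synthetic features, and the placeholder well names are all assumptions made for illustration; none of this is taken from the `xeek` package.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 300

# Synthetic stand-in for well-log features and a binary lithology label.
X = rng.normal(size=(n_rows, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# Knock out roughly 10% of the values to mimic missing log readings.
X[rng.random(X.shape) < 0.1] = np.nan
# Three placeholder wells of 100 rows each.
groups = np.repeat(["well_a", "well_b", "well_c"], n_rows // 3)

pipeline = make_pipeline(
    StandardScaler(),             # ignores NaNs when fitting, keeps them in transform
    KNNImputer(n_neighbors=2),    # imputes on the scaled features
    LogisticRegression(max_iter=1000),
)

# GroupKFold keeps all rows from a well in the same fold.
cv = GroupKFold(n_splits=3)
scores = cross_val_score(pipeline, X, y, groups=groups, cv=cv)
print(scores)
```

Placing the scaler and imputer inside the pipeline means they are refit on the training wells of each fold, so nothing learned from the held-out well leaks into preprocessing.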

```
TODO: utility functions for extracting x/y tables 
```
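A utility of this kind could look like the sketch below. The function name `split_features_target` and its defaults are hypothetical; the column names are the ones used earlier in this notebook.

```python
import pandas as pd


def split_features_target(
    df,
    target_column="FORCE_2020_LITHOFACIES_LITHOLOGY",
    group_column="WELL",
):
    """Split a limited frame into features X, target y, and well groups."""
    X = df.drop(columns=[target_column, group_column])
    y = df[target_column]
    groups = df[group_column]
    return X, y, groups


toy = pd.DataFrame({
    "DEPTH_MD": [494.528, 494.680, 3169.3124],
    "GR": [80.200851, 79.262886, 77.654900],
    "FORCE_2020_LITHOFACIES_LITHOLOGY": [65000, 65000, 30000],
    "WELL": ["15/9-13", "15/9-13", "7/1-2 S"],
})
X, y, groups = split_features_target(toy)
print(list(X.columns))
```

Returning the well labels alongside `X` and `y` makes it straightforward to pass them as `groups` to a group-aware cross-validation splitter.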