# Building a Simple Neural Network with Scikit-Learn to Predict Well Log Measurements

Material from:
- [SPWLA Workshop on Machine Learning and Artificial Intelligence](https://github.com/andymcdgeo/spwla_2022_machine_learning_workshop). Instructors: Lalitha Venkataramanan (SLB), Andy McDonald (LR), Vikas Jain (SLB)

- Work by Andy McDonald at [Towards Data Science](https://towardsdatascience.com/how-to-create-a-simple-neural-network-model-in-python-70697967738f)

**Introduction**

Interest in, and adoption of, Machine Learning (ML) algorithms for Oil & Gas problems has increased dramatically over the past decade. ML is a subfield of Artificial Intelligence (AI) and is the process by which computers can learn and make predictions from data without being explicitly programmed to do so.

This workshop will focus on the applications of AI & ML to petrophysical data. It will provide an introduction to ML, data quality considerations for well log data, and how to identify missing data and outliers. It will also cover sample workflows for applying supervised learning to predict key reservoir properties (Porosity, Water Saturation and Shale Volume) and unsupervised algorithms to identify facies without the need for labelled data.

**Data**
In 2018, Equinor released the entire contents of the Volve Field to the public domain to foster research and learning. Data includes:

- Well Logs
- Petrophysical interpretations
- Reports
- Core measurements
- Seismic data
- Models
- And more

The data is licensed under the Equinor Open Data Licence. The full licence agreement can be found here: https://www.equinor.com/content/dam/statoil/documents/what-we-do/Equinor-HRS-Terms-and-conditions-for-licence-to-data-Volve.pdf

The Volve Field is located some 200 km west of Stavanger in the Norwegian Sector of the North Sea. Hydrocarbons were discovered within the Jurassic-aged Hugin Formation in 1993. Oil production began in 2008 and lasted for 8 years (twice as long as planned) until 2016, when production ceased. In total, 63 MMBO were produced over the field's lifetime, with production reaching a plateau of 56,000 B/D.

**Selected Data**

Five wells:

- 15/9-F-1 A
- 15/9-F-1 B
- 15/9-F-1 C
- 15/9-F-11 A
- 15/9-F-11 B

In [2]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt

# scikit learn imports
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

In [3]:
url = 'https://github.com/andymcdgeo/spwla_2022_machine_learning_workshop/blob/main/data/spwla_volve_data.csv?raw=TRUE'
df = pd.read_csv(url)
df.columns

Index(['wellName', 'MD', 'BS', 'CALI', 'DT', 'DTS', 'GR', 'NPHI', 'RACEHM',
       'RACELM', 'RHOB', 'RPCEHM', 'RPCELM', 'PHIF', 'SW', 'VSH'],
      dtype='object')

In [4]:
df.head()

| | wellName | MD | BS | CALI | DT | DTS | GR | NPHI | RACEHM | RACELM | RHOB | RPCEHM | RPCELM | PHIF | SW | VSH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15/9-F-1 A | 3431.0 | 8.5 | 8.6718 | 86.9092 | 181.2241 | 53.9384 | 0.3222 | 0.5084 | 0.8457 | 2.7514 | 0.6461 | 0.6467 | 0.02 | 1.0 | 0.6807 |
| 1 | 15/9-F-1 A | 3431.1 | 8.5 | 8.625 | 86.4334 | 181.1311 | 57.2889 | 0.3239 | 0.4695 | 0.8145 | 2.7978 | 0.7543 | 0.657 | 0.02 | 1.0 | 0.7316 |
| 2 | 15/9-F-1 A | 3431.2 | 8.5 | 8.625 | 85.9183 | 180.9487 | 59.0455 | 0.3277 | 0.5012 | 0.8048 | 2.8352 | 0.8718 | 0.6858 | 0.02 | 1.0 | 0.7583 |
| 3 | 15/9-F-1 A | 3431.3 | 8.5 | 8.625 | 85.3834 | 180.7211 | 58.255 | 0.3357 | 0.6048 | 0.7984 | 2.8557 | 0.9451 | 0.7913 | 0.02 | 1.0 | 0.7462 |
| 4 | 15/9-F-1 A | 3431.4 | 8.5 | 8.625 | 84.8484 | 180.493 | 59.4569 | 0.3456 | 0.7115 | 0.7782 | 2.8632 | 1.0384 | 0.873 | 0.02 | 1.0 | 0.7646 |


In [5]:
# clean data
df = df.dropna()
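
Because `dropna()` removes every row that has at least one missing value, it can be worth checking how much data each column and each well loses before relying on the cleaned dataframe. The sketch below shows one way to do that. It uses a small hypothetical table (`demo`, with made-up well names and values) rather than the Volve data, so the numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature log table; values are illustrative, not Volve data
demo = pd.DataFrame({
    "wellName": ["W1", "W1", "W2", "W2"],
    "GR": [55.0, np.nan, 60.2, 48.9],
    "DT": [86.9, 85.4, np.nan, 80.1],
    "RHOB": [2.45, 2.50, 2.47, 2.52],
})

# Count missing values per column before removing anything
missing_per_column = demo.isna().sum()

# Count rows per well before and after dropping incomplete rows
rows_before = demo.groupby("wellName").size()
demo_clean = demo.dropna()
rows_after = (
    demo_clean.groupby("wellName").size()
    .reindex(rows_before.index, fill_value=0)  # keep wells that lost all rows
)

print(missing_per_column)
print(pd.DataFrame({"before": rows_before, "after": rows_after}))
```

Running the same two checks on `df` before the `dropna()` call would show which curves drive the row loss and whether any well disappears entirely.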

In [6]:
df.wellName.unique()

array(['15/9-F-1 A', '15/9-F-1 B', '15/9-F-11 A'], dtype=object)

In [7]:
# Split training and test datasets
# Assigning wells to train/test

# Training Wells
training_wells = ['15/9-F-11 A', '15/9-F-1 A']

# Test Well
test_well = ['15/9-F-1 B']

# Create training and testing dataframes
train_val_df = df[df['wellName'].isin(training_wells)].copy()
test_df = df[df['wellName'].isin(test_well)].copy()
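
Splitting by well, rather than by random rows, keeps the test well completely unseen during training. A quick sanity check can confirm that no well ends up on both sides of the split. The sketch below repeats the same `isin` logic on a small hypothetical dataframe (`demo`, with made-up `DT` values) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe; DT values are illustrative
demo = pd.DataFrame({
    "wellName": ["15/9-F-1 A", "15/9-F-1 A", "15/9-F-1 B", "15/9-F-11 A"],
    "DT": [86.9, 85.4, 77.1, 78.4],
})

training_wells = ["15/9-F-11 A", "15/9-F-1 A"]
test_well = ["15/9-F-1 B"]

train_val_demo = demo[demo["wellName"].isin(training_wells)].copy()
test_demo = demo[demo["wellName"].isin(test_well)].copy()

# No well may appear on both sides of the split
overlap = set(train_val_demo["wellName"]) & set(test_demo["wellName"])
assert not overlap, f"Wells in both sets: {overlap}"

# Every row should land in exactly one of the two dataframes
assert len(train_val_demo) + len(test_demo) == len(demo)

print(len(train_val_demo), len(test_demo))
```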


In [8]:
train_val_df.describe()

| | MD | BS | CALI | DT | DTS | GR | NPHI | RACEHM | RACELM | RHOB | RPCEHM | RPCELM | PHIF | SW | VSH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3481.0 | 3481.0 | 3481.0 | 3481.0 | 3481.0 | 3481.0 | 3481.0 | 3481.0 | 3481.0 | 3481.0 | 3481.0 | 3481.0 | 3481.0 | 3481.0 | 3481.0 |
| mean | 3581.346423 | 8.5 | 8.651004 | 78.42327 | 131.617011 | 48.10187 | 0.170462 | 23.273315 | 4.327236 | 2.442223 | 250.637309 | 4.88733 | 0.123395 | 0.779902 | 0.329942 |
| std | 77.77277 | 0.0 | 0.043905 | 7.998634 | 14.6191 | 23.515874 | 0.050523 | 333.784034 | 53.714008 | 0.140364 | 3781.377973 | 12.228492 | 0.073296 | 0.304893 | 0.209257 |
| min | 3431.0 | 8.5 | 8.4688 | 57.603 | 96.9007 | 8.477 | 0.05 | 0.1974 | 0.2349 | 2.153 | 0.2109 | 0.1366 | 0.001 | 0.079 | 0.01 |
| 25% | 3518.0 | 8.5 | 8.625 | 72.1771 | 121.8402 | 33.2968 | 0.1332 | 1.0824 | 1.0481 | 2.321 | 1.1563 | 1.1175 | 0.0609 | 0.497 | 0.1927 |
| 50% | 3591.5 | 8.5 | 8.6602 | 78.608 | 132.3542 | 43.0132 | 0.173 | 2.0286 | 1.9434 | 2.4545 | 2.3956 | 2.149 | 0.118 | 1.0 | 0.2769 |
| 75% | 3636.2 | 8.5 | 8.672 | 84.926 | 139.1118 | 56.8333 | 0.206 | 3.561 | 3.1727 | 2.552 | 4.5416 | 4.022 | 0.1877 | 1.0 | 0.4008 |
| max | 3722.0 | 8.5 | 8.8749 | 94.871 | 186.0908 | 127.0557 | 0.4063 | 6381.098 | 2189.603 | 2.9315 | 62290.77 | 202.161 | 0.264 | 1.0 | 1.0 |


In [9]:
test_df.describe()

| | MD | BS | CALI | DT | DTS | GR | NPHI | RACEHM | RACELM | RHOB | RPCEHM | RPCELM | PHIF | SW | VSH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1858.0 | 1858.0 | 1858.0 | 1858.0 | 1858.0 | 1858.0 | 1858.0 | 1858.0 | 1858.0 | 1858.0 | 1858.0 | 1858.0 | 1858.0 | 1858.0 | 1858.0 |
| mean | 3330.15 | 8.5 | 8.673006 | 77.094378 | 130.365142 | 44.077149 | 0.164187 | 3.806371 | 2.610038 | 2.442585 | 5.766324 | 4.330117 | 0.121497 | 0.807265 | 0.261368 |
| std | 53.650272 | 0.0 | 0.033749 | 6.575979 | 9.124651 | 14.790551 | 0.042316 | 10.380389 | 3.196635 | 0.136678 | 17.171186 | 11.281374 | 0.064124 | 0.291485 | 0.107317 |
| min | 3237.3 | 8.5 | 8.4784 | 58.6318 | 99.9092 | 8.0015 | 0.0595 | 0.3561 | 0.4071 | 2.1501 | 0.2349 | 0.3734 | 0.02 | 0.0815 | 0.0171 |
| 25% | 3283.725 | 8.5 | 8.6637 | 73.048825 | 125.981175 | 34.669225 | 0.1415 | 1.236075 | 1.215275 | 2.3463 | 1.249425 | 1.2662 | 0.076125 | 0.6044 | 0.1908 |
| 50% | 3330.15 | 8.5 | 8.6638 | 75.4478 | 131.21145 | 46.0101 | 0.1631 | 1.47975 | 1.4638 | 2.46835 | 1.58615 | 1.5312 | 0.1044 | 1.0 | 0.27725 |
| 75% | 3376.575 | 8.5 | 8.6976 | 82.74905 | 136.093025 | 54.6927 | 0.184875 | 2.7147 | 2.65435 | 2.51705 | 3.21565 | 2.949825 | 0.177975 | 1.0 | 0.338675 |
| max | 3423.0 | 8.5 | 8.7991 | 96.2241 | 151.9516 | 101.08 | 0.4109 | 129.649 | 29.0693 | 3.0517 | 134.6995 | 96.4255 | 0.2726 | 1.0 | 0.5665 |


In [11]:

# Set up the input feature columns (X) and the target column (y)
X = train_val_df[['RHOB', 'GR', 'NPHI']]
y = train_val_df['DT']

# Split the data into training and validation datasets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
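
As written, `train_test_split` shuffles the rows differently on every run, so the training and validation sets (and any metrics computed from them) will change between runs. If repeatable results are wanted, one option is to pass a fixed `random_state`. The sketch below illustrates this on a hypothetical random table (`demo`) with the same column names. The value `42` is an arbitrary choice and not something taken from the workshop material.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature table standing in for train_val_df (illustrative values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "RHOB": rng.uniform(2.1, 2.9, 100),
    "GR": rng.uniform(8, 130, 100),
    "NPHI": rng.uniform(0.05, 0.4, 100),
    "DT": rng.uniform(57, 95, 100),
})
X_demo = demo[["RHOB", "GR", "NPHI"]]
y_demo = demo["DT"]

# random_state=42 is arbitrary; any fixed integer gives a repeatable split
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr2, X_va2, y_tr2, y_va2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_va.shape)          # 80% / 20% of the 100 rows
print(X_tr.index.equals(X_tr2.index))  # same rows selected both times
```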

**Standardise values**

- Standardising a feature means calculating its mean, subtracting that mean from each data point, and then dividing by the feature’s standard deviation.
- We use scikit-learn's `StandardScaler` class to transform our data.

Note: We don’t fit the StandardScaler to the validation data, because it has already been fitted to the training data. Instead, we only apply it using the `transform` method, so the validation data is scaled with the mean and standard deviation learned from the training data.
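
In formula form, each standardised value is $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of that feature in the training data. The sketch below checks this against `StandardScaler` on a tiny hypothetical array (the density and gamma-ray values are made up for illustration). It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative density and gamma-ray values (not taken from the Volve data)
X_train_demo = np.array([[2.30, 40.0], [2.45, 55.0], [2.60, 70.0]])
X_val_demo = np.array([[2.50, 60.0]])

scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)  # learn mean/std, then scale
X_val_scaled = scaler_demo.transform(X_val_demo)          # reuse training mean/std

# Manual version: training mean and population standard deviation (ddof=0)
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)
manual_train = (X_train_demo - mu) / sigma
manual_val = (X_val_demo - mu) / sigma

print(np.allclose(X_train_scaled, manual_train))
print(np.allclose(X_val_scaled, manual_val))
```

Note that the validation rows are scaled with the training statistics, so their mean and standard deviation will generally not be exactly 0 and 1.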

In [12]:
scaler = StandardScaler()

# Fit the StandardScaler to the training data and scale it in one step
X_train = scaler.fit_transform(X_train)

# Apply the already-fitted StandardScaler to the validation data (no refitting)
X_val = scaler.transform(X_val)