# Binary classifier of Spurious Closure of DHSV

This is an **example** of how to use the 3W toolkit, a software package written in Python 3 that contains resources that make the following easier:

* 3W dataset overview generation;
* Experimentation and comparative analysis of Machine Learning-based approaches and algorithms for specific problems related to undesirable events that occur in offshore oil wells during their respective production phases;
* Standardization of key points of the Machine Learning-based algorithm development pipeline.

The 3W toolkit and the 3W dataset are the major resources that compose the 3W project, a pilot of a Petrobras program called [Conexões para Inovação - Módulo Open Lab](https://prd.hotsitespetrobras.com.br/pt/nossas-atividades/tecnologia-e-inovacao/conexoes-para-inovacao/) that promotes experimentation with Machine Learning-based approaches and algorithms for specific problems related to undesirable events that occur in offshore oil wells.

# 1. Introduction

This [Jupyter Notebook](https://jupyter.org/) presents a **basic** example of how to use the 3W toolkit's resources to develop an experiment for a specific problem.

You can adapt this example to experiment with other approaches. To do so, follow the instructions included as comments in the code cells below.

**IMPORTANT**: to experiment with very different approaches that use other Machine Learning pipelines, we need to evolve the 3W toolkit first. Your help with this is greatly appreciated.

# 2. Imports and Configurations

In [1]:
import sys
import os
import numpy as np
import collections
from sklearn import preprocessing
from matplotlib import pyplot as plt
from tensorflow import keras

sys.path.append(os.path.join('..','..','..'))
import toolkit as tk

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

ModuleNotFoundError: No module named 'sklearn'

# 3. Creating an Experiment for a Specific Problem

A specific event type defined in `dataset\dataset.ini` must be used as `event_name` when we create an experiment.

In [None]:
experiment = tk.Experiment(event_name="SPURIOUS_CLOSURE_OF_DHSV")

# 4. Setting up Folds for an Experiment

As the 3W toolkit defines and standardizes a number of things, we don't need to worry about labels and IDs associated with the specific event type chosen, number of folds, and which folds consider which instances.

In [None]:
event_labels = list(experiment.event_labels.values())
event_labels_idx = {v: i for i, v in enumerate(event_labels)}
fold: tk.EventFold
folds: tk.EventFolds = experiment.folds()
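
The label-to-index mapping built above can be illustrated in isolation. The sketch below is not part of the toolkit: the label values `[0, 2, 102]` and the list `y_raw` are hypothetical stand-ins for what `experiment.event_labels` and a fold would provide.

```python
# Hypothetical labels, for illustration only; the real values come from
# experiment.event_labels.
event_labels = [0, 2, 102]

# Map each label to its position in event_labels.
event_labels_idx = {v: i for i, v in enumerate(event_labels)}

# Translate a list of raw labels into class indices.
y_raw = [0, 102, 2, 0]  # hypothetical raw labels
y_idx = [event_labels_idx[label] for label in y_raw]

print(event_labels_idx)  # {0: 0, 2: 1, 102: 2}
print(y_idx)             # [0, 2, 1, 0]
```

Working with positions rather than raw label values keeps the class order aligned with `event_labels`, which is the order the probability estimates are expected to follow later in this notebook.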

# 5. Executing an Experiment

In [None]:
print(event_labels)
print(event_labels_idx)

In [None]:
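# Counts the folds. After the loop, X_train and y_train hold the
# training samples of the last fold.
cont = 0
for fold in folds:
    X_train, y_train = fold.extract_training_samples()
    cont += 1
print(cont)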

In [None]:
fig, axs = plt.subplots(8)
for i in range(8):
    axs[i].plot(X_train[0][:,i])
print(collections.Counter(y_train))
print(X_train[0][:,6])

In [None]:
# Normalizes the samples (zero mean and unit variance)
scaler = preprocessing.StandardScaler()
X_train_normalized = []

for sample in X_train:
    X_train_normalized.append(scaler.fit_transform(np.nan_to_num(sample)))
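
To make the normalization step concrete, here is a minimal NumPy sketch of the same computation on a toy sample. The helper `standardize` and the toy array are hypothetical; the sketch assumes the usual column-wise definition (subtract the mean, divide by the population standard deviation) and is not the scikit-learn implementation itself.

```python
import numpy as np

def standardize(sample):
    """Column-wise zero mean and unit variance, after replacing NaN with 0."""
    filled = np.nan_to_num(sample)   # NaN -> 0.0
    mean = filled.mean(axis=0)
    std = filled.std(axis=0)         # population standard deviation
    std[std == 0.0] = 1.0            # leave constant columns unscaled
    return (filled - mean) / std

# Toy sample: 4 time steps x 2 variables, with one missing reading.
sample = np.array([[1.0, 10.0],
                   [2.0, np.nan],
                   [3.0, 30.0],
                   [4.0, 40.0]])

scaled = standardize(sample)
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

Note that the missing reading is replaced by 0 before the statistics are computed, so it pulls the column mean toward zero. Each sample is also scaled independently, so differences in absolute level between samples are removed.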


In [None]:
fig, axs = plt.subplots(8)
for i in range(8):
    axs[i].plot(X_train_normalized[0][:,i])

In [None]:
print(len(X_train))
print(len(y_train))
print(y_train[3])

In [None]:
y_train[240]

In [None]:
fig, axs = plt.subplots(8)
for i in range(8):
    axs[i].plot(X_train_normalized[240][:,i])
print(collections.Counter(y_train))
print(X_train[0][:,6])

In [None]:
num_classes = 3
def make_model(input_shape):
    input_layer = keras.layers.Input(input_shape)

    conv1 = keras.layers.Conv1D(filters=64, kernel_size=3, padding="same")(input_layer)
    conv1 = keras.layers.BatchNormalization()(conv1)
    conv1 = keras.layers.ReLU()(conv1)

    conv2 = keras.layers.Conv1D(filters=64, kernel_size=3, padding="same")(conv1)
    conv2 = keras.layers.BatchNormalization()(conv2)
    conv2 = keras.layers.ReLU()(conv2)

    conv3 = keras.layers.Conv1D(filters=64, kernel_size=3, padding="same")(conv2)
    conv3 = keras.layers.BatchNormalization()(conv3)
    conv3 = keras.layers.ReLU()(conv3)

    gap = keras.layers.GlobalAveragePooling1D()(conv3)

    output_layer = keras.layers.Dense(num_classes, activation="softmax")(gap)

    return keras.models.Model(inputs=input_layer, outputs=output_layer)


model = make_model(input_shape=(X_train[0].shape))
#keras.utils.plot_model(model, show_shapes=True)

In [None]:
# Required by the neural network: the labels must be a NumPy array
y_train = np.array(y_train)

In [None]:
epochs = 100
batch_size = 128

callbacks = [
    keras.callbacks.ModelCheckpoint(
        "best_model.h5", save_best_only=True, monitor="val_loss"
    ),
    keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=20, min_lr=0.0001
    ),
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=50, verbose=1),
]
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)
history = model.fit(
    np.stack(X_train),  # Keras expects a single array, not a list of arrays
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=callbacks,
    validation_split=0.2,
    verbose=1,
)

You can see below that the 3W toolkit has methods to extract samples for both training and testing, and also to calculate metrics for each fold.

In [None]:

for fold in folds:
    X_train, y_train = fold.extract_training_samples()
    X_test = fold.extract_test_samples()
    
    # This is the section that contains the heart of this basic example.
    # 
    # Note that this basic example learns the frequency of each label 
    # in the training set and always uses these frequencies as probabilities
    # in the testing samples (regardless of the samples themselves).
    #  
    # It is interesting to mention that the metrics obtained with this 
    # simple approach seem to be good because of the considerable imbalance 
    # of the 3W dataset that is not addressed by the 3W toolkit.
    #
    # You can modify this section to try other more interesting approaches.
    # All you have to do is generate an array (numpy.ndarray) `y_pred` 
    # with probability estimates for the test samples. For each test sample
    # you need to estimate an array (numpy.ndarray) of probabilities 
    # associated with each label in the order they are in the `event_labels`.
    # So `y_pred` must be an array of arrays.
    #
     
    #################################################################### 
    y_train_idx = list(map(event_labels_idx.__getitem__, y_train))
    y_bins = np.bincount(y_train_idx) / len(y_train_idx)
    y_pred = np.tile(y_bins, (len(X_test), 1))
    ####################################################################

    fold.calculate_partial_metrics(y_pred, event_labels)
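
The baseline in the cell above can be reproduced on toy data to see the shape of `y_pred`. This is only a sketch: the labels, the training list, and `n_test` are hypothetical, and `minlength` is an addition not present in the original cell (it guards against a label that does not occur in a given fold).

```python
import numpy as np

# Hypothetical labels and toy training targets, for illustration only.
event_labels = [0, 2, 102]
event_labels_idx = {v: i for i, v in enumerate(event_labels)}
y_train = [0, 0, 0, 0, 0, 0, 102, 102, 2, 2]
n_test = 3  # stands in for len(X_test)

# Relative frequency of each label in the training set.
y_train_idx = [event_labels_idx[label] for label in y_train]
y_bins = np.bincount(y_train_idx, minlength=len(event_labels)) / len(y_train_idx)

# Repeat the same probabilities for every test sample.
y_pred = np.tile(y_bins, (n_test, 1))

print(y_bins)        # [0.6 0.2 0.2]
print(y_pred.shape)  # (3, 3)
```

Each row of `y_pred` holds one probability per label, in the order of `event_labels`. Any model that outputs an array of this shape could replace the baseline.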

`X_train` is a list of `ndarray` objects.

In [None]:
print(X_train[1].shape)
print("len(X_train):",len(X_train))
print("len(y_train):",len(y_train))

In [None]:
for fold in folds: 
    X_test= fold.extract_test_samples()
    break

X_test[0][:,0]

# 6. Printing the Results

The 3W toolkit provides a method for retrieving and presenting the metrics calculated for each fold.

In [None]:
folds.get_metrics(boxplot=True)