# 3W dataset's General Presentation

This notebook gives a general presentation of the 3W dataset. To the best of its authors' knowledge, it is the first realistic and public dataset with rare undesirable real events in oil wells that can be readily used as a benchmark for developing machine learning techniques that deal with the inherent difficulties of actual data.

For more information about the theory behind this dataset, refer to the paper **A Realistic and Public Dataset with Rare Undesirable Real Events in Oil Wells** published in the **Journal of Petroleum Science and Engineering** (link [here](https://doi.org/10.1016/j.petrol.2019.106223)).

# 1. Introduction

This Jupyter Notebook presents an overview of the 3W dataset. To that end, it shows one **interactive plot** of a specific instance from an event class.
By default, the instance is downsampled (n=100) and scaled with a Z-score scaler.
To aid visualization, transient labels are changed to '0.5'.

# 2. Imports and Configurations

In [1]:
import warnings
warnings.simplefilter("ignore", FutureWarning)

import sys
import os
sys.path.append(os.path.join('..','..'))
import toolkit as tk

import plotly.offline as py 
import plotly.graph_objs as go 
import glob
import pandas as pd
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
import matplotlib.pyplot as plt 
import pandas_profiling

%matplotlib inline
%config InlineBackend.figure_format = 'svg'





Each instance is stored in a CSV file and loaded into a pandas DataFrame. Each observation occupies one line of the CSV file and one row of the DataFrame. The first line of each CSV file is a header with the column identifiers. The columns store the following information:

* **timestamp**: observation timestamps, loaded into the pandas DataFrame as its index;
* **P-PDG**: pressure variable at the Permanent Downhole Gauge (PDG);
* **P-TPT**: pressure variable at the Temperature and Pressure Transducer (TPT);
* **T-TPT**: temperature variable at the Temperature and Pressure Transducer (TPT);
* **P-MON-CKP**: pressure variable upstream of the production choke (CKP);
* **T-JUS-CKP**: temperature variable downstream of the production choke (CKP);
* **P-JUS-CKGL**: pressure variable downstream of the gas lift choke (CKGL);
* **T-JUS-CKGL**: temperature variable downstream of the gas lift choke (CKGL);
* **QGL**: gas lift flow rate;
* **class**: observation labels associated with three types of periods (normal, fault transient, and faulty steady state).
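
The three period types encoded in the **class** column can be decoded with a small helper. This is a minimal sketch, not part of the toolkit: the function name `period_type` is hypothetical, and it assumes that `0` marks normal operation, the event class number marks the faulty steady state, and the event class number plus 100 marks the fault transient (the offset used by the downsampling code in this notebook).

```python
def period_type(label, class_number):
    """Map one observation label to its period type.

    Assumed encoding (illustrative): 0 = normal, class_number = faulty
    steady state, class_number + 100 = fault transient.
    """
    if label == 0:
        return "normal"
    if label == class_number + 100:
        return "fault transient"
    if label == class_number:
        return "faulty steady state"
    raise ValueError(f"unexpected label {label} for class {class_number}")


# Labels as they might appear along one instance of event class 2
labels = [0, 0, 102, 102, 2]
periods = [period_type(x, class_number=2) for x in labels]
print(periods)
```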

Other information is also loaded into each pandas DataFrame:

* **label**: instance label (event type);
* **well**: well name. Hand-drawn and simulated instances have fixed names. Real instances have names masked with incremental id;
* **id**: instance identifier. Hand-drawn and simulated instances have incremental id. Each real instance has an id generated from its first timestamp.
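
The layout described above can be sketched as follows. This is an illustration under stated assumptions, not the toolkit's loader: the helper `load_instance`, the two-row CSV excerpt, and the well name and id values are all hypothetical stand-ins for a real instance file.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt standing in for a real instance CSV file
csv_text = (
    "timestamp,P-PDG,P-TPT,T-TPT,P-MON-CKP,T-JUS-CKP,"
    "P-JUS-CKGL,T-JUS-CKGL,QGL,class\n"
    "2017-01-01 00:00:00,1.0,2.0,3.0,4.0,5.0,6.0,7.0,0.0,0\n"
    "2017-01-01 00:00:01,1.1,2.1,3.1,4.1,5.1,6.1,7.1,0.0,0\n"
)


def load_instance(source, label, well, instance_id):
    """Load one instance and attach its metadata columns."""
    # The header line supplies the column names; timestamp becomes the index
    df = pd.read_csv(source, index_col="timestamp", parse_dates=["timestamp"])
    # Instance-level information is repeated on every row
    df["label"] = label
    df["well"] = well
    df["id"] = instance_id
    return df


df = load_instance(
    io.StringIO(csv_text), label=0, well="WELL-00001", instance_id="20170101000000"
)
print(df.index.name, list(df.columns))
```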

More information about these variables can be obtained from the following publicly available documents:

* ***Option in Portuguese***: R.E.V. Vargas. Base de dados e benchmarks para prognóstico de anomalias em sistemas de elevação de petróleo. Universidade Federal do Espírito Santo. Doctoral thesis. 2019. https://github.com/petrobras/3W/raw/main/docs/doctoral_thesis_ricardo_vargas.pdf.
* ***Option in English***: B.G. Carvalho. Evaluating machine learning techniques for detection of flow instability events in offshore oil wells. Universidade Federal do Espírito Santo. Master's degree dissertation. 2021. https://github.com/petrobras/3W/raw/main/docs/master_degree_dissertation_bruno_carvalho.pdf.

# 8. Plot Instances

Function for downsampling instances. The original sampling rate is 1 Hz. In some 3W classes, instances contain so many samples that plotting the entire series is impractical. This function reduces the sampling rate (for example, from seconds to minutes) so that the data can be visualized.

In [2]:
def resample(data, n, class_number):
    """Downsample an instance by a factor of n.

    Args:
        data (pandas.DataFrame): Instance loaded from its CSV file, with the
            default integer index and a 'timestamp' column
        n (integer): Downsampling factor (number of observations per group)
        class_number (integer): Integer that represents the event class [0-8]

    Returns:
        pandas.DataFrame: Downsampled instance DataFrame
    """
    # Group the timestamps into non-overlapping blocks of n and keep the last one
    resampleTimestamp = data.timestamp.groupby(data.index // n).max()
    # Temporarily replace the transient label (class_number + 100) with 0.5
    tempClassLabel = data['class'].replace(class_number+100, 0.5)
    # Get the max label of each block
    resampleClass = tempClassLabel.groupby(tempClassLabel.index // n).max()
    # Restore the transient label value
    resampleClass.replace(0.5, class_number+100, inplace=True)
    # Average the numeric columns over the same non-overlapping blocks
    # (numeric_only skips the string timestamp column)
    dfResample = data.groupby(data.index // n).mean(numeric_only=True)
    # Drop the averaged class column
    dfResample.drop(['class'], axis=1, inplace=True)
    # Insert the block labels as the last column
    dfResample.insert(len(dfResample.columns), 'class', resampleClass)
    # Insert the block timestamps as the first column
    dfResample.insert(0, 'timestamp', resampleTimestamp)
    return dfResample
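
The block-wise downsampling idea can be checked on a tiny synthetic instance. The sketch below is a self-contained simplification, not the notebook's function: the name `downsample` and the single-sensor data are made up for illustration, and it assumes the DataFrame keeps its default integer index.

```python
import pandas as pd


def downsample(df, n, class_number):
    """Average sensors over non-overlapping blocks of n rows."""
    groups = df.index // n  # block id of each row (default integer index assumed)
    transient = class_number + 100
    # Mean of the sensor columns per block
    out = df.drop(columns=["timestamp", "class"]).groupby(groups).mean()
    # Rank labels as steady state > transient > normal, then take the block max
    labels = df["class"].replace(transient, 0.5).groupby(groups).max()
    out["class"] = labels.replace(0.5, transient)
    # Last timestamp of each block (strings sort chronologically in this format)
    out.insert(0, "timestamp", df["timestamp"].groupby(groups).max())
    return out


df = pd.DataFrame({
    "timestamp": [f"2017-01-01 00:00:0{i}" for i in range(6)],
    "P-PDG": [1.0, 3.0, 5.0, 7.0, 9.0, 11.0],
    "class": [0, 102, 102, 102, 2, 2],
})
small = downsample(df, n=3, class_number=2)
print(small)
```

With `n=3`, six observations collapse into two rows. The first block contains normal and transient labels and is labelled as transient; the second contains transient and steady-state labels and is labelled as steady state.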



Plot one specific instance of an event class. By default, the instance is downsampled (n=100) and scaled with a Z-score scaler.
To aid visualization, transient labels are changed to '0.5'.

In [8]:
def plot_instance(class_number, instance_index):
    """Plot one specific instance of an event class. By default, the instance is downsampled (n=100) and scaled with a Z-score scaler.
        To aid visualization, transient labels are changed to '0.5'.

    Args:
        class_number (integer): integer that represents the event class [0-8]
        instance_index (integer): input the instance file index
    """
    # Build the glob pattern with os.path.join so it also works outside Windows
    instances_path = os.path.join(tk.PATH_DATASET, str(class_number), '*.csv')
    instances_path_list = glob.glob(instances_path)
    if class_number > 8 or class_number < 0:
        print(f'invalid class number: {class_number} - Type a valid class number 0 to 8')
    elif instance_index >= len(instances_path_list):
        print(f'instance index {instance_index} out of range - Insert a valid index between 0 and {len(instances_path_list)-1}')
    else:
        df_instance = pd.read_csv(instances_path_list[instance_index], sep=',', header=0)
        df_instance_resampled = resample(df_instance, 100, class_number)
        df_drop_resampled = df_instance_resampled.drop(['timestamp','class'], axis=1)
        df_drop_resampled.interpolate(method='linear', limit_direction='both', axis=0, inplace=True)
        df_drop_resampled.fillna(0, inplace=True)
        scaler_resampled = TimeSeriesScalerMeanVariance().fit_transform(df_drop_resampled)
        df_scaler_resampled = pd.DataFrame(scaler_resampled.squeeze(), index=df_drop_resampled.index, columns=df_drop_resampled.columns)

        data = [
            go.Scatter(
                x=df_instance_resampled['timestamp'],
                y=df_scaler_resampled[tk.VARS[0]],
                mode='lines+markers',
                marker_symbol='circle',
                marker_size=3,
                name=tk.VARS[0]
                ),
            go.Scatter(
                x=df_instance_resampled['timestamp'], 
                y=df_scaler_resampled[tk.VARS[1]],
                mode='lines+markers',
                marker_symbol='diamond',
                marker_size=3,
                name=tk.VARS[1]
                ),
            go.Scatter(
                x=df_instance_resampled['timestamp'],
                y=df_scaler_resampled[tk.VARS[2]],
                mode='lines+markers',
                marker_symbol='x',
                marker_size=3,
                name=tk.VARS[2]
                ),
            go.Scatter(
                x=df_instance_resampled['timestamp'], 
                y=df_scaler_resampled[tk.VARS[3]],
                mode='lines+markers',
                marker_symbol='star',
                marker_size=3,
                name=tk.VARS[3]
                ),
            go.Scatter(
                x=df_instance_resampled['timestamp'],
                y=df_scaler_resampled[tk.VARS[4]],
                mode='lines+markers',
                marker_symbol='triangle-up',
                marker_size=3,
                name=tk.VARS[4]
                ),
            go.Scatter(
                x=df_instance_resampled['timestamp'],
                y=df_scaler_resampled[tk.VARS[5]],
                mode='lines',
                name=tk.VARS[5]
                ),
            go.Scatter(
                x=df_instance_resampled['timestamp'],
                y=df_scaler_resampled[tk.VARS[6]],
                mode='lines',
                name=tk.VARS[6]
                ),
            go.Scatter(
                x=df_instance_resampled['timestamp'],
                y=df_scaler_resampled[tk.VARS[7]],
                mode='lines',
                name=tk.VARS[7]
                ),
            go.Scatter(
                x=df_instance_resampled['timestamp'], 
                y=df_instance_resampled['class'].replace(100+int(class_number),0.5),
                mode='markers',
                name='Label'
                ),
            
        ]

        fig = go.Figure(data)
        # os.path.basename extracts the file name regardless of the path separator
        fileName = os.path.basename(instances_path_list[instance_index])
        fig.update_layout(title=tk.EVENT_NAMES[class_number]+' - '+fileName,
                        xaxis_title='Time(s)',
                        yaxis_title='Scaled',
                        font=dict(size=12))
        fig.show()
        

Plot one interactive graph of a specific instance from an event class.
By default, the instance is downsampled (n=100) and scaled with a Z-score scaler.
To aid visualization, transient labels are changed to '0.5'.

In [9]:
class_number = 2
instance_index = 32
plot_instance(class_number,instance_index)
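
The two preparation steps applied before plotting, Z-score scaling and remapping the transient label, can be sketched with plain pandas. This is an illustrative alternative, not the notebook's implementation (which relies on `TimeSeriesScalerMeanVariance` from tslearn): the helper `zscore_columns`, the choice of the population standard deviation, and the guard for constant columns are assumptions made for this example.

```python
import pandas as pd


def zscore_columns(df):
    """Scale each column to zero mean and unit variance along time."""
    # Assumption: a constant column has zero deviation, so divide by 1 instead
    std = df.std(ddof=0).replace(0, 1.0)
    return (df - df.mean()) / std


# Made-up sensor values: one varying variable and one constant variable
sensors = pd.DataFrame({"P-PDG": [1.0, 2.0, 3.0], "QGL": [5.0, 5.0, 5.0]})
scaled = zscore_columns(sensors)

# Remap the transient label of event class 2 (102) to 0.5 so that it fits
# on the same axis as the scaled variables
labels = pd.Series([0, 102, 2]).replace(102, 0.5)

print(scaled)
print(labels.tolist())
```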

# 9. Profiling Report

In this part, we generate a complete interactive HTML report from the dataset. The report gives a complete view of one event class of the 3W dataset, including the number of rows, number of columns (variables), number of missing values (null cells, NaNs), duplicate rows, size, and the types of the variables. In addition, the tool provides statistics, histograms, interactions, and correlations.

In the Warnings field, the report highlights points that require care when analyzing the dataset. This makes it possible to assess whether the data need some initial treatment before the exploration starts.

The original sampling rate is 1 Hz. In some 3W classes, the large number of samples exceeds the maximum allowed size, so we reduce the sampling rate. The parameter that determines the new rate is "resize". In this case, we downsample by a factor of 100. To visualize the original data, use "resize=1", but there is no guarantee that the report will be generated.

In [8]:
class_number = 2
resize = 100
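# Build the glob pattern with os.path.join so it also works outside Windows
instances_path = os.path.join(tk.PATH_DATASET, str(class_number), '*.csv')
df_all_instances_class = pd.concat(
    [resample(pd.read_csv(f, sep=',', header=0), resize, class_number)
     for f in glob.glob(instances_path)],
    ignore_index=True)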
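
Collecting every instance of one class into a single DataFrame can be tried out without the real dataset. The sketch below is a minimal illustration under assumptions: the helper `load_class` is hypothetical, the directory layout (one sub-folder per class number) mirrors the glob pattern used in this notebook, and the CSV files are tiny synthetic stand-ins written to a temporary directory.

```python
import glob
import os
import tempfile

import pandas as pd


def load_class(dataset_dir, class_number):
    """Concatenate all CSV instances found in <dataset_dir>/<class_number>/."""
    pattern = os.path.join(dataset_dir, str(class_number), "*.csv")
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"no instances match {pattern}")
    # ignore_index=True renumbers the rows of the combined DataFrame from 0
    return pd.concat([pd.read_csv(f) for f in files], ignore_index=True)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "2"))
    for name in ("a.csv", "b.csv"):
        # Synthetic two-row instance with a single sensor column
        pd.DataFrame({"P-PDG": [1.0, 2.0], "class": [0, 2]}).to_csv(
            os.path.join(root, "2", name), index=False
        )
    combined = load_class(root, 2)

print(combined.shape)
```

Raising an error when no file matches is a design choice of this sketch: `pd.concat` fails on an empty list with a less descriptive message.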

Generate the profile report.

In [9]:
profile = df_all_instances_class.profile_report(title=tk.EVENT_NAMES[class_number]+' Profiling Report')
profile.to_file(tk.EVENT_NAMES[class_number].replace(" ","")+"DataReport.html")
print('Generated Profiling Report: ' + tk.EVENT_NAMES[class_number].replace(" ","")+"DataReport.html")


Generated Profiling Report: SPURIOUS_CLOSURE_OF_DHSVDataReport.html


Open the interactive report in a new browser tab.

In [5]:
import webbrowser
  
webbrowser.open_new_tab(tk.EVENT_NAMES[class_number].replace(" ","")+"DataReport.html")

True