# 3W Dataset's General Presentation

This is a general presentation of the 3W Dataset, which is, to the best of its authors' knowledge, the first realistic and public dataset with rare undesirable real events in oil wells. It can be readily used as a benchmark for developing machine learning techniques that address the inherent difficulties of real data.

For more information about the theory behind this dataset, refer to the paper **A Realistic and Public Dataset with Rare Undesirable Real Events in Oil Wells** published in the **Journal of Petroleum Science and Engineering** (link [here](https://doi.org/10.1016/j.petrol.2019.106223)).

# 1. Introduction

This Jupyter Notebook gives a general overview of the 3W Dataset 2.0.0. To do so, it demonstrates some data unification functionalities and the benefits of that process.

In complex datasets like 3W, data is often distributed across multiple folders and files, which can hinder quick insights and analysis. The data unification process loads, cleans, and merges these scattered files into a single, well-structured data frame. The functionalities involved and the benefits they bring are outlined below.

## Functionalities of Data Unification

**Automated Loading of Distributed Data:**

* The notebook loads all Parquet files from multiple folders efficiently.
* It filters out irrelevant files (e.g., simulated data) and extracts important metadata like timestamps directly from file names.

**Data Normalization:**

* Additional columns (e.g., folder ID, date, and time) are added, ensuring consistency across different data points.
* This enhances downstream analysis by making sure that different segments are harmonized.

**Handling Large-Scale Data with Dask:**

* The use of Dask allows seamless processing of large datasets that would otherwise not fit into memory.
* This makes it easier to explore and manipulate the entire dataset efficiently.
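
The loading and filtering steps described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual `load_and_combine_data` implementation: the function name `index_instances`, the `skip_prefixes` parameter, and the assumption that unwanted files can be recognized by a file name prefix such as `SIMULATED` are all hypothetical.

```python
from pathlib import Path


def index_instances(root, skip_prefixes=("SIMULATED",)):
    """Collect one record per Parquet file found in the numbered subfolders of `root`.

    Hypothetical helper: files whose names start with any of `skip_prefixes`
    are ignored (an assumed convention for marking unwanted instance types).
    """
    records = []
    for folder in sorted(Path(root).iterdir()):
        # Only numbered subfolders (e.g. "0" to "9") are treated as event folders.
        if not (folder.is_dir() and folder.name.isdigit()):
            continue
        for path in sorted(folder.glob("*.parquet")):
            if path.name.startswith(tuple(skip_prefixes)):
                continue
            # Keep the folder ID next to the path so it can become a column later.
            records.append({"folder_id": int(folder.name), "path": path})
    return records
```

Each record keeps the folder ID next to the file path, so a later step can add it as a column when the file is read. The collected paths could then be passed to a reader such as `pandas.read_parquet` (one file at a time) or `dask.dataframe.read_parquet` (which accepts a list of paths).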
## Benefits of Data Unification

* **Improved Data Accessibility:** With all data combined into a single structure, researchers and engineers can access relevant information faster, minimizing the time spent searching across files.
* **Enhanced Analytical Capabilities:** Unified data allows for richer analytics, such as visualizing trends and patterns across the entire dataset. Anomalies and transient events can be identified and classified more accurately.
* **Simplified Visualization:** By consolidating data into a single DataFrame, it's easier to generate comprehensive visualizations that provide meaningful insights about operational states.
* **Facilitates Collaboration:** When datasets are standardized and merged, it becomes easier for teams to share their findings and collaborate on data-driven projects. The unified dataset serves as a single source of truth.

This notebook demonstrates these functionalities and benefits by loading the 3W Dataset, classifying events across multiple operational states, and generating visualizations that offer a deeper understanding of system behavior.

# 2. Imports and Configurations

In [None]:
import os
import pandas as pd
import dask.dataframe as dd
import matplotlib.pyplot as plt

# from toolkit.misc import load_and_combine_data, classify_events, visualize_data

plt.style.use('ggplot')  # Matplotlib plot style
pd.set_option('display.max_columns', None)  # Show all DataFrame columns

# Local path to the dataset folder; adjust it to your own environment
dataset_dir = "C:/Users/anabe/OneDrive/Área de Trabalho/HACKATHON PETROBRÁS/dataset_modificado/dataset_modificado"

# 3. Instances' Structure

In this section, we explain the organization of the folders and files within the dataset. The 3W Dataset contains subfolders numbered from 0 to 9, where each folder represents a specific operational situation, as described below:

* 0 = Normal Operation
* 1 = Abrupt Increase of BSW
* 2 = Spurious Closure of DHSV
* 3 = Severe Slugging
* 4 = Flow Instability
* 5 = Rapid Productivity Loss
* 6 = Quick Restriction in PCK
* 7 = Scaling in PCK
* 8 = Hydrate in Production Line
* 9 = Hydrate in Service Line
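
For use in code, the folder codes listed above can be kept in a small lookup table. This is only a convenience sketch: the names `EVENT_NAMES` and `event_name` are illustrative and are not part of the 3W toolkit.

```python
# Folder code -> operational situation, as listed in this section.
EVENT_NAMES = {
    0: "Normal Operation",
    1: "Abrupt Increase of BSW",
    2: "Spurious Closure of DHSV",
    3: "Severe Slugging",
    4: "Flow Instability",
    5: "Rapid Productivity Loss",
    6: "Quick Restriction in PCK",
    7: "Scaling in PCK",
    8: "Hydrate in Production Line",
    9: "Hydrate in Service Line",
}


def event_name(folder):
    """Return the event description for a folder code given as int or str (e.g. "3")."""
    return EVENT_NAMES[int(folder)]
```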

Each file follows the naming pattern shown in this example:
* `WELL-00008_20170818000222.parquet`

The parts of the name are:
* `WELL-00008`: Identification of the well.
* `20170818000222`: Timestamp in the format yyyyMMddhhmmss.
* `.parquet`: File extension indicating the data format.
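
The naming pattern above can be parsed with a short helper. This is a minimal sketch under two assumptions: that the hour field uses a 24-hour clock, and that the identifier before the underscore is always an uppercase prefix, a hyphen, and digits. The name `parse_instance_name` is hypothetical.

```python
import re
from datetime import datetime
from pathlib import Path

# <PREFIX>-<digits>_<14-digit timestamp>, matched against the name without extension.
_NAME_PATTERN = re.compile(r"^(?P<well>[A-Z]+-\d+)_(?P<stamp>\d{14})$")


def parse_instance_name(filename):
    """Split an instance file name into its well ID and timestamp."""
    match = _NAME_PATTERN.match(Path(filename).stem)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    # Assumes the yyyyMMddhhmmss hour field is on a 24-hour clock.
    timestamp = datetime.strptime(match["stamp"], "%Y%m%d%H%M%S")
    return match["well"], timestamp
```

For the example file, `parse_instance_name("WELL-00008_20170818000222.parquet")` yields the well ID `WELL-00008` and the timestamp 2017-08-18 00:02:22. Names that do not match the pattern raise a `ValueError` rather than being silently skipped.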

In [None]:
from toolkit.misc import load_and_combine_data, classify_events, visualize_data

# Type of instances to load
datatype = 'SIMULATED'
df = load_and_combine_data(dataset_dir, datatype)

if df is not None:
    # Summarize the events in the combined data, then plot the summary
    event_summary = classify_events(df)

    visualize_data(event_summary)
else:
    print("No data was loaded.")