## Notebook Summary

This notebook performs an initial exploration and visualization of the 3W Dataset 2.0.0 for a selected class. The key steps are:

1.  Package Imports
2.  Data Loading
3.  Data Stacking (for a chosen class)
4.  Data Visualization (Single Instance)
5.  Filtered Numeric Column Selection
6.  Linked Subplots Visualization
7.  Normalized Single Plot Visualization

Overall, this notebook demonstrates the process of loading, combining, and visualizing a subset of the 3W Dataset, focusing on a specific class and a single instance to gain initial insights into the data's characteristics and time series patterns.

# Import Packages

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np
import tempfile, subprocess
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler

# Clone the Official Repository (source of the parquet files)

In [None]:
# Create temp folder and shallow clone
temp_dir = tempfile.mkdtemp()
subprocess.run(["git", "clone", "--depth", "1", "https://github.com/petrobras/3W.git", temp_dir], check=True)  # fail loudly if the clone does not succeed

# Variables Description


* timestamp = Instant at which observation was generated
* ABER-CKGL = Opening of the GLCK (gas lift choke) [%]
* ABER-CKP = Opening of the PCK (production choke) [%]
* ESTADO-DHSV = State of the DHSV (downhole safety valve) [0, 0.5, or 1]
* ESTADO-M1 = State of the PMV (production master valve) [0, 0.5, or 1]
* ESTADO-M2 = State of the AMV (annulus master valve) [0, 0.5, or 1]
* ESTADO-PXO = State of the PXO (pig-crossover) valve [0, 0.5, or 1]
* ESTADO-SDV-GL = State of the gas lift SDV (shutdown valve) [0, 0.5, or 1]
* ESTADO-SDV-P = State of the production SDV (shutdown valve) [0, 0.5, or 1]
* ESTADO-W1 = State of the PWV (production wing valve) [0, 0.5, or 1]
* ESTADO-W2 = State of the AWV (annulus wing valve) [0, 0.5, or 1]
* ESTADO-XO = State of the XO (crossover) valve [0, 0.5, or 1]
* P-ANULAR = Pressure in the well annulus [Pa]
* P-JUS-BS = Downstream pressure of the SP (service pump) [Pa]
* P-JUS-CKGL = Downstream pressure of the GLCK (gas lift choke) [Pa]
* P-JUS-CKP = Downstream pressure of the PCK (production choke) [Pa]
* P-MON-CKGL = Upstream pressure of the GLCK (gas lift choke) [Pa]
* P-MON-CKP = Upstream pressure of the PCK (production choke) [Pa]
* P-MON-SDV-P = Upstream pressure of the production SDV (shutdown valve) [Pa]
* P-PDG = Downhole pressure at the PDG (permanent downhole gauge) [Pa]
* PT-P = Subsea Xmas-tree pressure downstream of the PWV (production wing valve) in the production line [Pa]
* P-TPT = Subsea Xmas-tree pressure at the TPT (temperature and pressure transducer) [Pa]
* QBS = Flow rate at the SP (service pump) [m³/s]
* QGL = Gas lift flow rate [m³/s]
* T-JUS-CKP = Downstream temperature of the PCK (production choke) [°C]
* T-MON-CKP = Upstream temperature of the PCK (production choke) [°C]
* T-PDG = Downhole temperature at the PDG (permanent downhole gauge) [°C]
* T-TPT = Subsea Xmas-tree temperature at the TPT (temperature and pressure transducer) [°C]
* class = Label of the observation
* state = Well operational status
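
Since every pressure in the list above is given in Pa, raw values are typically in the millions and can be awkward to read on a plot axis. The following is a minimal sketch of a unit conversion helper, assuming the column names match the list above; `PRESSURE_COLS`, `pressures_to_bar`, and the toy values are illustrative names and data, not part of the dataset or its tooling.

```python
import pandas as pd

# Pressure columns copied from the variable list above.
PRESSURE_COLS = ["P-ANULAR", "P-JUS-BS", "P-JUS-CKGL", "P-JUS-CKP", "P-MON-CKGL",
                 "P-MON-CKP", "P-MON-SDV-P", "P-PDG", "PT-P", "P-TPT"]

def pressures_to_bar(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every known pressure column converted from Pa to bar."""
    out = frame.copy()
    present = [c for c in PRESSURE_COLS if c in out.columns]
    if present:
        out[present] = out[present] / 1e5  # 1 bar = 100 000 Pa
    return out

# Synthetic values for illustration only, not taken from the dataset.
toy = pd.DataFrame({"P-PDG": [2.0e7, 2.5e7], "T-PDG": [110.0, 111.0]})
toy_bar = pressures_to_bar(toy)
print(toy_bar)
```

The helper returns a copy, so the original frame stays in Pa and non-pressure columns such as temperatures are left unchanged.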

# Files per class from the repo

In [None]:
CLASSES = range(0,10)
data_files_by_class = {}
for classe in CLASSES:
  # Sort the matches so that list positions (used later to pick an instance) are reproducible
  data_files_by_class[classe] = sorted(Path(temp_dir).rglob(f"*/{classe}/*.parquet"))
  print(f"Found {len(data_files_by_class[classe])} parquet files for class {classe}")
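
To see how the `rglob` pattern in the cell above selects files, the sketch below builds a small throwaway directory tree and runs the same pattern against it. The `<root>/dataset/<class>/<file>.parquet` layout and the file names are assumptions made for illustration; the placeholder files are empty and are never read. Because the order in which `rglob` yields paths is not guaranteed, the sketch sorts the matches so that an index into the list refers to the same file on every run.

```python
import tempfile
from pathlib import Path

# Build a toy tree that mimics an assumed <root>/dataset/<class>/<file>.parquet layout.
root = Path(tempfile.mkdtemp())
for label, names in {0: ["b.parquet", "a.parquet"], 2: ["c.parquet"]}.items():
    folder = root / "dataset" / str(label)
    folder.mkdir(parents=True)
    for name in names:
        (folder / name).touch()  # empty placeholder, not a real parquet file

# Same pattern as the notebook, with the matches sorted for stable positions.
files_by_class = {
    label: sorted(root.rglob(f"*/{label}/*.parquet")) for label in range(10)
}
print({label: [p.name for p in paths] for label, paths in files_by_class.items() if paths})
```

Classes without a matching folder simply map to an empty list, which is why the notebook prints a file count for each class.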

# Load and stack data for a specific class


Select a class number, load all of its parquet files, and concatenate them into a single DataFrame.


In [None]:
CLASSES_NAMES = ['NORMAL', 'ABRUPT_INCREASE_OF_BSW', 'SPURIOUS_CLOSURE_OF_DHSV', 'SEVERE_SLUGGING',
                 'FLOW_INSTABILITY', 'RAPID_PRODUCTIVITY_LOSS', 'QUICK_RESTRICTION_IN_PCK', 'SCALING_IN_PCK',
                 'HYDRATE_IN_PRODUCTION_LINE', 'HYDRATE_IN_SERVICE_LINE']
chosen_class = 2
parquet_files = data_files_by_class[chosen_class]

dfs = []
for file in parquet_files:
  df_temp = pd.read_parquet(file, engine="pyarrow")
  dfs.append(df_temp)

df_combined = pd.concat(dfs, ignore_index=True)
print('Class Selected: ', CLASSES_NAMES[chosen_class])
display(df_combined.head())
display(df_combined.shape)
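
Stacking with `ignore_index=True` replaces each file's index with a fresh running integer, so the combined frame no longer records which instance a row came from (and, if the timestamp is stored as the index, it is dropped as well). As a hedged alternative, the sketch below passes a dict to `pd.concat` so that each source name becomes an outer index level. The instance names and values are synthetic and only serve as an illustration.

```python
import pandas as pd

# Two synthetic "instances"; file names and values are made up for illustration.
frames = {
    "instance_a.parquet": pd.DataFrame({"P-PDG": [1.0, 2.0], "class": [0, 2]}),
    "instance_b.parquet": pd.DataFrame({"P-PDG": [3.0], "class": [2]}),
}

# Passing a dict to pd.concat uses its keys as an outer index level,
# so the original row index of each instance is preserved underneath.
stacked = pd.concat(frames, names=["instance", "row"])
print(stacked)

# Rows of a single instance can then be recovered with .loc
print(stacked.loc["instance_b.parquet"])
```

With the real files, the dict could be built as `{file.name: pd.read_parquet(file) for file in parquet_files}`, assuming file names are unique within the chosen class.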

In [None]:
# Numeric summary
print(df_combined.describe(percentiles=[0.01, 0.1, 0.5, 0.9, 0.99]).T)

# Visualize Data - One Instance

Create visualizations for a single instance (one parquet file) of the chosen class.

In [None]:
# Select one instance to visualize
chosen_instance = 25
# Select only numeric columns for plotting that have a reasonable amount of data
# You can adjust the threshold (e.g., 0.5 for at least 50% non-missing values)
threshold = 0.5
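
The threshold above is compared against the fraction of non-missing values in each column, using a strict `>` comparison. The sketch below shows the same selection on a small synthetic frame (the column names `a`, `b`, `c` are placeholders), using `notna().mean()` as a compact way to get that fraction per column.

```python
import numpy as np
import pandas as pd

# Synthetic frame: "a" is complete, "b" is half missing, "c" is mostly missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, np.nan],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

threshold = 0.5
coverage = toy.notna().mean()  # fraction of non-missing values per column
kept = coverage[coverage > threshold].index.tolist()
print(coverage.to_dict())
print(kept)
```

Note that a column with exactly 50% non-missing values, like `b` here, is excluded because the comparison is strict.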

In [None]:
# Plot Chosen Instance
n_files = len(data_files_by_class[chosen_class])
if not 0 <= chosen_instance < n_files:
    # Stop here: continuing would fail later with an undefined (or stale) df
    raise IndexError(f"Choose a value between 0 and {n_files - 1} for Class {chosen_class}")
df = pd.read_parquet(data_files_by_class[chosen_class][chosen_instance], engine="pyarrow")

# print(f"Loaded {data_files_by_class[chosen_class][chosen_instance]}")

numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
valid_numeric_cols = [col for col in numeric_cols if df[col].notna().sum() / len(df) > threshold]


# Determine the number of subplots needed (one for each valid numeric column)
n_subplots = len(valid_numeric_cols)

if n_subplots == 0:
    print("No valid numeric columns found in the DataFrame to plot based on the non-missing value threshold.")
else:
    # Create subplots
    fig = make_subplots(rows=n_subplots, cols=1, shared_xaxes=True)

    # Add a trace for each valid numeric column to its own subplot
    for i, col in enumerate(valid_numeric_cols):
        fig.add_trace(go.Scattergl(x=df.index, y=df[col], mode='lines', name=col),
                      row=i+1, col=1)

    # Update layout for linked axes and better appearance
    fig.update_layout(
        title=f"Subplots for Numeric variables - Class: {CLASSES_NAMES[chosen_class]} - {data_files_by_class[chosen_class][chosen_instance].name}",
        hovermode='x unified',
        height=250 * n_subplots # Adjust height based on number of subplots
    )

    # Show the figure
    fig.show()

In [None]:
pd.options.plotting.backend = "plotly"
df[valid_numeric_cols].plot(title=f"Single Plot Numeric Variables - Class: {CLASSES_NAMES[chosen_class]} - {data_files_by_class[chosen_class][chosen_instance].name}")

In [None]:
# Normalize the data in valid_numeric_cols
scaler = MinMaxScaler()
df_normalized = df[valid_numeric_cols].copy()
df_normalized[valid_numeric_cols] = scaler.fit_transform(df_normalized[valid_numeric_cols])

# Create a single plot with normalized values
pd.options.plotting.backend = "plotly"
df_normalized.plot(title=f"Single Plot Normalized Numeric Variables (Class: {CLASSES_NAMES[chosen_class]}, Instance: {chosen_instance}) - {data_files_by_class[chosen_class][chosen_instance].name}")
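
Min-max normalization rescales each column independently to the range [0, 1] using (x - min) / (max - min), which is what lets pressures in Pa and valve states share one axis. The sketch below is a pandas-only illustration of that formula on synthetic values. It is not the scikit-learn implementation, and the `min_max` helper and column names are hypothetical.

```python
import pandas as pd

def min_max(frame: pd.DataFrame) -> pd.DataFrame:
    """Scale each column to [0, 1] with (x - min) / (max - min)."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # A constant column has zero range; dividing by 1 instead maps it to 0.
    col_range = col_range.replace(0, 1)
    return (frame - col_min) / col_range

# Synthetic values for illustration only.
toy = pd.DataFrame({"pressure": [1.0e7, 1.5e7, 2.0e7], "valve": [1.0, 1.0, 1.0]})
scaled = min_max(toy)
print(scaled)
```

Because each column is scaled by its own minimum and maximum within the selected instance, the normalized curves show relative shape only; amplitudes are not comparable across variables or across instances.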