# Multi-Year Anomaly Comparison (2017-2019)

This notebook compares the anomalies detected in the 2017, 2018, and 2019 datasets.

**Objectives:**
- Load the datasets for multiple years (2017, 2018, 2019)
- Extract anomaly windows from each dataset
- Visualize anomalies using the refactored functions from the `wqdab` module
- Compare anomaly patterns across years
- Create comparative visualizations

**Functions used:**
- `wqdab.data.load_single_dataset_*` - Load datasets
- `wqdab.utils.data_utils` - Data preparation
- `wqdab.visualization.anomaly_plots` - Anomaly visualization



In [1]:
# Imports and setup
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# WQDAB package imports
from wqdab.data import (
    load_single_dataset_2017,
    load_single_dataset_2018,
    load_single_dataset_2019,
)
from wqdab.utils.data_utils import (
    prepare_time_series,
    get_standard_sensors,
    load_and_prepare_dataset,
)
from wqdab.visualization import (
    get_anomaly_windows,
    plot_anomaly_zoom,
    plot_all_anomaly_windows,
)
from wqdab.utils.paths import ensure_directories_exist, FIGURES_DIR

# Configure matplotlib
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)


In [2]:
# Ensure directories exist
ensure_directories_exist()
print(f"📁 FIGURES_DIR: {FIGURES_DIR}")


📁 FIGURES_DIR: /home/nelso/Documents/IA - Detecção Falhas/reports/figures


## Load and Prepare Data

We load the 2017, 2018, and 2019 datasets, applying the standard cleaning and preparation steps.


In [4]:
# Load 2017 dataset
print("=" * 60)
print("📊 YEAR 2017")
print("=" * 60)
df_train_2017, df_test_2017, _ = load_and_prepare_dataset(
    load_single_dataset_2017,
    year=2017
)

# Get sensors list
sensors_2017 = get_standard_sensors(df_train_2017)
print(f"Sensors: {sensors_2017}")


📊 YEAR 2017
📥 Loading dataset for year 2017...


ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
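
The `ImportError` above says that neither parquet engine is installed in the kernel's environment, so pandas cannot read the dataset files. A minimal sketch of a pre-flight check follows; it could run before the loading cells. The helper name `available_parquet_engines` is hypothetical and not part of `wqdab`. It relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def available_parquet_engines(candidates=("pyarrow", "fastparquet")):
    """Return the candidate module names that can be imported in this environment.

    Hypothetical helper for illustration; it only checks that the module can be
    located, not that its version is recent enough for pandas.
    """
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()
if engines:
    print(f"Parquet engines available: {engines}")
else:
    print("No parquet engine found; install one, e.g. `pip install pyarrow`.")
```

After installing an engine, restart the kernel so that the new package is visible to pandas.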

In [5]:
# Load 2018 dataset
print("\n" + "=" * 60)
print("📊 YEAR 2018")
print("=" * 60)
df_train_2018, df_test_2018, _ = load_and_prepare_dataset(
    load_single_dataset_2018,
    year=2018
)

sensors_2018 = get_standard_sensors(df_train_2018)
print(f"Sensors: {sensors_2018}")



📊 YEAR 2018
📥 Loading dataset for year 2018...


ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

In [None]:
# Load 2019 dataset
print("\n" + "=" * 60)
print("📊 YEAR 2019")
print("=" * 60)
df_train_2019, df_test_2019, df_val_2019 = load_and_prepare_dataset(
    load_single_dataset_2019,
    year=2019
)

sensors_2019 = get_standard_sensors(df_train_2019)
print(f"Sensors: {sensors_2019}")


## Detect Anomaly Windows

We extract anomaly windows from each dataset using the refactored `get_anomaly_windows` function.


In [None]:
# Combine all available splits per year (train/test, plus validation for 2019) for a complete analysis
df_2017_all = pd.concat([df_train_2017, df_test_2017]).sort_index()
df_2018_all = pd.concat([df_train_2018, df_test_2018]).sort_index()
df_2019_all = pd.concat([df_train_2019, df_test_2019, df_val_2019]).sort_index()

print("📊 Dataset shapes after combining train/test/val:")
print(f"  2017: {df_2017_all.shape}")
print(f"  2018: {df_2018_all.shape}")
print(f"  2019: {df_2019_all.shape}")


In [None]:
# Get anomaly windows for each year
print("\n🔍 Detecting anomaly windows...\n")

windows_2017 = get_anomaly_windows(df_2017_all, margin_minutes=120)
windows_2018 = get_anomaly_windows(df_2018_all, margin_minutes=120)
windows_2019 = get_anomaly_windows(df_2019_all, margin_minutes=120)

print(f"✅ 2017: Found {len(windows_2017)} anomaly windows")
print(f"✅ 2018: Found {len(windows_2018)} anomaly windows")
print(f"✅ 2019: Found {len(windows_2019)} anomaly windows")
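
For readers without access to the `wqdab` source, the idea behind window extraction can be sketched as follows. This is not the implementation of `get_anomaly_windows`: it assumes a boolean-like `EVENT` column and a sorted `DatetimeIndex`, groups contiguous flagged rows into runs, and pads each run by the margin on both sides. The function name `sketch_anomaly_windows` is hypothetical, and it returns only `(start, end)` pairs, whereas the real function appears to return 4-tuples, judging by how its results are unpacked in this notebook.

```python
import pandas as pd


def sketch_anomaly_windows(df, event_col="EVENT", margin_minutes=120):
    """Return one (start, end) pair per contiguous run of flagged rows.

    Illustrative only: assumes df has a sorted DatetimeIndex and that
    event_col is truthy on anomalous rows. Overlapping windows are not merged.
    """
    flags = df[event_col].astype(bool)
    if not flags.any():
        return []

    # The run id increases by one every time the flag changes value.
    run_id = (flags != flags.shift(fill_value=False)).cumsum()
    margin = pd.Timedelta(minutes=margin_minutes)

    windows = []
    # Keep only flagged rows, then group them by the run they belong to.
    for _, run in flags[flags].groupby(run_id[flags]):
        windows.append((run.index[0] - margin, run.index[-1] + margin))
    return windows


# Toy data: ten hourly samples with two separate anomalous runs.
toy = pd.DataFrame(
    {"EVENT": [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]},
    index=pd.date_range("2017-01-01", periods=10, freq="60min"),
)
toy_windows = sketch_anomaly_windows(toy, margin_minutes=30)
for start, end in toy_windows:
    print(start, "->", end)
```

With a large margin, neighbouring windows can overlap. This sketch leaves them as separate entries, and whether `wqdab` merges them is not stated here.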


## Visualize Anomalies - First 5 Windows per Year

We visualize the first 5 anomaly windows of each year for comparison.


In [None]:
# Visualize first 5 anomaly windows from 2017
print("\n📊 Visualizing 2017 anomalies...")
for i in range(min(5, len(windows_2017))):
    start, end, _, _ = windows_2017[i]
    plot_anomaly_zoom(
        df_2017_all, start, end,
        sensors=get_standard_sensors(df_2017_all),
        figsize=(14, 10),
        save_path=FIGURES_DIR / f'anomaly_comparison_2017_window_{i}.png',
        show=False
    )


In [None]:
# Visualize first 5 anomaly windows from 2018
print("\n📊 Visualizing 2018 anomalies...")
for i in range(min(5, len(windows_2018))):
    start, end, _, _ = windows_2018[i]
    plot_anomaly_zoom(
        df_2018_all, start, end,
        sensors=get_standard_sensors(df_2018_all),
        figsize=(14, 10),
        save_path=FIGURES_DIR / f'anomaly_comparison_2018_window_{i}.png',
        show=False
    )


In [None]:
# Visualize first 5 anomaly windows from 2019
print("\n📊 Visualizing 2019 anomalies...")
for i in range(min(5, len(windows_2019))):
    start, end, _, _ = windows_2019[i]
    plot_anomaly_zoom(
        df_2019_all, start, end,
        sensors=get_standard_sensors(df_2019_all),
        figsize=(14, 10),
        save_path=FIGURES_DIR / f'anomaly_comparison_2019_window_{i}.png',
        show=False
    )


## Summary Statistics

We compare anomaly statistics across the different years.


In [None]:
# Compare anomaly statistics across years
print("\n" + "=" * 60)
print("📊 ANOMALY STATISTICS SUMMARY")
print("=" * 60)

years_data = {
    '2017': (df_2017_all, windows_2017),
    '2018': (df_2018_all, windows_2018),
    '2019': (df_2019_all, windows_2019),
}

for year, (df, windows) in years_data.items():
    total_samples = len(df)
    anomaly_samples = df['EVENT'].sum()
    anomaly_rate = (anomaly_samples / total_samples) * 100
    n_windows = len(windows)
    
    avg_window_duration = np.mean([
        (end - start).total_seconds() / 3600  # hours
        for start, end, _, _ in windows
    ]) if windows else 0
    
    print(f"\n{year}:")
    print(f"  Total samples: {total_samples:,}")
    print(f"  Anomaly samples: {anomaly_samples:,} ({anomaly_rate:.2f}%)")
    print(f"  Anomaly windows: {n_windows}")
    print(f"  Avg window duration: {avg_window_duration:.2f} hours")
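
The printed summary above can also be collected into a single table, which makes year-to-year comparison easier. The sketch below is one way to do this, under the same assumptions as the cell above: an `EVENT` column that sums to the number of anomalous samples, and windows whose first two elements are start and end timestamps. The helper `summarize_year` and the toy data are hypothetical.

```python
import pandas as pd


def summarize_year(df, windows, event_col="EVENT"):
    """Compute the per-year statistics printed above as a plain dict (hypothetical helper)."""
    total = len(df)
    anomalies = int(df[event_col].sum())
    # Only the first two elements of each window tuple are used (start, end).
    durations_h = [(end - start).total_seconds() / 3600 for start, end, *_ in windows]
    return {
        "total_samples": total,
        "anomaly_samples": anomalies,
        "anomaly_rate_pct": 100 * anomalies / total if total else 0.0,
        "n_windows": len(windows),
        "avg_window_hours": sum(durations_h) / len(durations_h) if durations_h else 0.0,
    }


# Toy stand-in for one year's combined dataframe and its anomaly windows.
toy_df = pd.DataFrame(
    {"EVENT": [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]},
    index=pd.date_range("2017-01-01", periods=10, freq="60min"),
)
toy_windows_4 = [
    (pd.Timestamp("2017-01-01 01:00"), pd.Timestamp("2017-01-01 03:00"), None, None),
]

# One row per year; with real data the dict would hold the 2017, 2018 and 2019 entries.
summary = pd.DataFrame.from_dict(
    {"toy": summarize_year(toy_df, toy_windows_4)}, orient="index"
)
print(summary)
```

With the real data, the same call could be made for each entry of `years_data` to get one row per year.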
