# 📚 Classification Fundamentals and Problem Formulation (5 minutes)

### What is Supervised Classification?

**Supervised Classification** is a machine learning task where we train algorithms to predict discrete class labels for new data based on labeled training examples.

### Key Components:

1. **Training Data**: Labeled examples (X, y) where X = features, y = class labels
2. **Features**: Input variables that describe each sample
3. **Labels**: Target classes we want to predict
4. **Model**: Algorithm that learns the mapping from features to labels
5. **Evaluation**: Metrics to measure how well our model performs

### Our Classification Problem: 3W Oil Well Fault Detection

- **Objective**: Classify oil well operational states from sensor data
- **Input Features**: Time series sensor measurements (flattened into feature vectors)
- **Output Classes**: Different types of operational faults (0=normal, 1-9=different fault types)
- **Challenge**: Multi-class classification with imbalanced classes
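
To make "flattened into feature vectors" concrete, here is a minimal sketch. It assumes each window is a NumPy array of shape (time steps, sensors) and that any label column has already been removed. `flatten_window` and the toy data are hypothetical and are not part of the `src` package.

```python
import numpy as np


def flatten_window(window: np.ndarray) -> np.ndarray:
    """Turn a (time_steps, n_sensors) window into one 1-D feature vector."""
    return window.reshape(-1)


# Hypothetical toy data: 5 windows of 300 time steps x 3 sensors.
rng = np.random.default_rng(0)
windows = [rng.normal(size=(300, 3)) for _ in range(5)]

# Stack the flattened windows into the (n_samples, n_features) matrix X.
X = np.stack([flatten_window(w) for w in windows])
print(X.shape)  # (5, 900)
```

Row-major flattening keeps the readings of one time step next to each other, so feature `k` corresponds to time step `k // n_sensors` and sensor `k % n_sensors`.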

### Why Classification Matters in Oil Wells:
- **Early Fault Detection**: Prevent costly equipment failures
- **Operational Safety**: Avoid dangerous situations
- **Maintenance Planning**: Schedule repairs before critical failures
- **Cost Reduction**: Minimize downtime and repair costs

### Problem Characteristics:
- **Multi-class**: 10 different classes (0-9)
- **Time Series**: Sequential sensor measurements
- **High Dimensional**: Many sensors × time steps = many features
- **Imbalanced**: Some fault types are much rarer than others
- **Real-world**: Noisy, complex industrial data
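
One common way to quantify the imbalance noted above is an inverse-frequency weight per class. The sketch below is illustrative only: `inverse_frequency_weights` and the toy labels are hypothetical, and this notebook's pipeline balances classes through augmentation rather than weighting.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical toy labels: class 2 is much rarer than classes 0 and 1.
labels = [0] * 50 + [1] * 40 + [2] * 10
weights = inverse_frequency_weights(labels)
for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.2f}")
```

Rare classes receive weights above 1 and common classes below 1, which shows at a glance how uneven the label distribution is.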

Let's explore different algorithms to solve this classification challenge!

In [1]:
# ============================================================
# LOAD 3W DATASET FOR SUPERVISED CLASSIFICATION
# ============================================================
import time

start_time = time.time()

print("🤖 Loading 3W Dataset for Supervised Classification")
print("=" * 55)

# Import data loading utilities
import sys
import os

sys.path.append("src")

print("📦 Importing modules...", end=" ")
from src.data_persistence import DataPersistence
from src import config
import pandas as pd
import numpy as np

print("✅")

try:
    print("📂 Initializing data persistence...", end=" ")
    persistence = DataPersistence(base_dir=config.PROCESSED_DATA_DIR, verbose=False)
    print("✅")

    print(f"⚡ Using format: {config.SAVE_FORMAT} for maximum speed")

    # Check if windowed directory exists
    windowed_dir = os.path.join(persistence.cv_splits_dir, "windowed")
    print(f"📁 Checking windowed directory: {windowed_dir}...", end=" ")

    if not os.path.exists(windowed_dir):
        print("❌")
        print(
            "❌ No windowed data directory found. Please run Data Treatment notebook first to generate windowed time series data."
        )
    else:
        print("✅")

        # Look for fold directories
        print("🔍 Looking for fold directories...", end=" ")
        fold_dirs = [
            d
            for d in os.listdir(windowed_dir)
            if d.startswith("fold_") and os.path.isdir(os.path.join(windowed_dir, d))
        ]
        fold_dirs.sort()
        print(f"✅ Found {len(fold_dirs)} folds")

        if not fold_dirs:
            print("❌ No fold directories found in windowed data.")
        else:
            # Load data from ALL folds for comprehensive classification
            print(
                f"📊 Loading windowed data from ALL {len(fold_dirs)} folds for classification..."
            )

            all_train_windows = []
            all_train_classes = []
            all_train_fold_info = []  # Track which fold each training sample comes from
            all_test_windows = []
            all_test_classes = []
            all_test_fold_info = []  # Track which fold each test sample comes from

            load_start = time.time()

            for fold_name in fold_dirs:
                fold_path = os.path.join(windowed_dir, fold_name)
                fold_num = fold_name.replace("fold_", "")

                print(f"📁 Processing {fold_name}...", end=" ")

                # Load training data
                train_pickle = os.path.join(
                    fold_path, f"train_windowed.{config.SAVE_FORMAT}"
                )
                train_parquet = os.path.join(fold_path, "train_windowed.parquet")

                if os.path.exists(train_pickle):
                    fold_train_dfs, fold_train_classes = persistence._load_dataframes(
                        train_pickle, config.SAVE_FORMAT
                    )
                    all_train_windows.extend(fold_train_dfs)
                    all_train_classes.extend(fold_train_classes)
                    all_train_fold_info.extend(
                        [fold_name] * len(fold_train_dfs)
                    )  # Track fold info
                elif os.path.exists(train_parquet):
                    fold_train_dfs, fold_train_classes = persistence._load_from_parquet(
                        train_parquet
                    )
                    all_train_windows.extend(fold_train_dfs)
                    all_train_classes.extend(fold_train_classes)
                    all_train_fold_info.extend(
                        [fold_name] * len(fold_train_dfs)
                    )  # Track fold info

                # Load test data
                test_pickle = os.path.join(
                    fold_path, f"test_windowed.{config.SAVE_FORMAT}"
                )
                test_parquet = os.path.join(fold_path, "test_windowed.parquet")

                if os.path.exists(test_pickle):
                    fold_test_dfs, fold_test_classes = persistence._load_dataframes(
                        test_pickle, config.SAVE_FORMAT
                    )
                    all_test_windows.extend(fold_test_dfs)
                    all_test_classes.extend(fold_test_classes)
                    all_test_fold_info.extend(
                        [fold_name] * len(fold_test_dfs)
                    )  # Track fold info
                elif os.path.exists(test_parquet):
                    fold_test_dfs, fold_test_classes = persistence._load_from_parquet(
                        test_parquet
                    )
                    all_test_windows.extend(fold_test_dfs)
                    all_test_classes.extend(fold_test_classes)
                    all_test_fold_info.extend(
                        [fold_name] * len(fold_test_dfs)
                    )  # Track fold info

                print("✅")

            load_time = time.time() - load_start

            if all_train_windows and all_test_windows:
                print("✅ Successfully loaded windowed data from ALL folds!")
                print(f"🚂 Training windows: {len(all_train_windows)}")
                print(f"🧪 Test windows: {len(all_test_windows)}")
                print(f"⚡ Loading time: {load_time:.3f} seconds")

                # Store for further processing
                train_dfs = all_train_windows
                train_classes = all_train_classes
                train_fold_info = all_train_fold_info  # Store fold tracking info
                test_dfs = all_test_windows
                test_classes = all_test_classes
                test_fold_info = all_test_fold_info  # Store fold tracking info

                # Display sample window information
                print("📋 Processing first training window...", end=" ")
                first_train_window = train_dfs[0]
                first_train_class = train_classes[0]
                print("✅")

                print("\n🪟 Sample Training Window (Window #1):")
                print(f"   • Shape: {first_train_window.shape}")
                print(f"   • Class: {first_train_class}")
                print(f"   • Features: {list(first_train_window.columns)}")

                # Show class distribution
                print("\n📊 Training Set Class Distribution:")
                train_unique, train_counts = np.unique(
                    train_classes, return_counts=True
                )
                for cls, count in zip(train_unique, train_counts):
                    print(f"   • Class {cls}: {count} windows")

                print("\n📊 Test Set Class Distribution:")
                test_unique, test_counts = np.unique(test_classes, return_counts=True)
                for cls, count in zip(test_unique, test_counts):
                    print(f"   • Class {cls}: {count} windows")

                total_time = time.time() - start_time
                print("\n⚡ Performance Summary:")
                print(f"   • Total execution time: {total_time:.3f} seconds")
                print(f"   • Data loading time: {load_time:.3f} seconds")
                print(f"   • File format: {config.SAVE_FORMAT}")
                print(f"   • Folds processed: {len(fold_dirs)}")

                print("\n🎯 Dataset Summary for Supervised Classification:")
                print(f"   • Total training windows: {len(train_dfs)}")
                print(f"   • Total test windows: {len(test_dfs)}")
                print(f"   • Window dimensions: {first_train_window.shape}")
                print(f"   • Classes available: {sorted(train_unique)}")
                print("   • Ready for: Decision Trees, SVM, Neural Networks")

            else:
                # Nothing was loaded: define empty containers so later cells can check them
                print("⚠️ No training or test windows found in any fold")
                train_dfs = []
                train_classes = []
                train_fold_info = []
                test_dfs = []
                test_classes = []
                test_fold_info = []

except Exception as e:
    print(f"❌ Error loading data: {str(e)}")
    print("\n💡 Troubleshooting:")
    print("   1. Make sure 'Data Treatment.ipynb' ran completely")
    print("   2. Check if windowed data was saved successfully")
    print("   3. Verify the processed_data directory exists")
    print("   4. Ensure pickle format is available for fast loading")

    # Show directory status
    expected_dir = config.PROCESSED_DATA_DIR
    print(f"\n📁 Directory check: {expected_dir}")
    if os.path.exists(expected_dir):
        print("✅ Base directory exists")
        windowed_path = os.path.join(expected_dir, "cv_splits", "windowed")
        if os.path.exists(windowed_path):
            print("✅ Windowed directory exists")
            try:
                contents = os.listdir(windowed_path)
                print(f"📄 Contents: {contents}")
            except OSError:
                # Catch only filesystem errors instead of using a bare except
                print("❌ Cannot list directory contents")
        else:
            print(f"❌ Windowed directory missing: {windowed_path}")
    else:
        print("❌ Base directory does not exist")

    # Initialize empty variables for error case
    train_dfs = []
    train_classes = []
    train_fold_info = []
    test_dfs = []
    test_classes = []
    test_fold_info = []

🤖 Loading 3W Dataset for Supervised Classification
📦 Importing modules... ✅
📂 Initializing data persistence... ✅
⚡ Using format: pickle for maximum speed
📁 Checking windowed directory: processed_data\cv_splits\windowed... ✅
🔍 Looking for fold directories... ✅ Found 3 folds
📊 Loading windowed data from ALL 3 folds for classification...
📁 Processing fold_1... ✅
📁 Processing fold_2... ✅
📁 Processing fold_3... ✅
✅ Successfully loaded windowed data from ALL folds!
🚂 Training windows: 505336
🧪 Test windows: 78491
⚡ Loading time: 94.042 seconds
📋 Processing first training window... ✅

🪟 Sample Training Window (Window #1):
   • Shape: (300, 4)
   • Class: 0
   • Features: ['P-PDG_scaled', 'P-TPT_scaled', 'T-TPT_scaled', 'class']

📊 Training Set Class Distribution:
   • Class 0: 52936 windows
   • Class 1: 80335 windows
   • Class 2: 5796 windows
   • Class 3: 46257 windows
   • Class 4: 10236 windows
   • Class 5

# 🔍 Enhanced Analysis Summary

## Five Key Analysis Features

The enhanced classification provides five critical analysis capabilities:

### 1. 🎯 **Configurable Class Selection**
- **Purpose**: Choose which specific classes to include in your analysis
- **Configuration**: Controlled via `src/config.py` for easy management
- **Options**:
  - All faults (`selected_classes = None`): Exclude only class 0 (analyze all fault types 1-9)
  - Custom: Select specific classes of interest (e.g., [2,3,8] for particular faults)
  - Presets: Use pre-configured selections for common scenarios
- **Use Case**: Focus on specific fault types, reduce problem complexity, targeted analysis

### 2. 📊 **Accuracy Per Fold**
- **Purpose**: Understand model consistency across different data splits
- **Insight**: Identifies if models perform consistently or if some folds are particularly challenging
- **Use Case**: Helps detect dataset bias, temporal patterns, or fold-specific issues
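
As a rough illustration of per-fold accuracy, the sketch below groups predictions by a fold tag and scores each group separately. `accuracy_per_fold` and the toy arrays are hypothetical; the `src` module may compute this differently.

```python
import numpy as np


def accuracy_per_fold(y_true, y_pred, fold_info):
    """Return {fold_name: accuracy} for predictions tagged with a fold label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fold_info = np.asarray(fold_info)
    if not (len(y_true) == len(y_pred) == len(fold_info)):
        raise ValueError("y_true, y_pred and fold_info must have equal length")
    scores = {}
    for fold in np.unique(fold_info):
        in_fold = fold_info == fold  # boolean mask for this fold's samples
        scores[str(fold)] = float(np.mean(y_true[in_fold] == y_pred[in_fold]))
    return scores


# Hypothetical toy predictions across two folds.
y_true = [1, 2, 3, 1, 2, 3]
y_pred = [1, 2, 1, 1, 2, 3]
folds = ["fold_1", "fold_1", "fold_1", "fold_2", "fold_2", "fold_2"]
per_fold = accuracy_per_fold(y_true, y_pred, folds)
print(per_fold)
```

A large gap between folds is a hint that one split contains wells or operating conditions the model has not seen elsewhere.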

### 3. 🏷️ **Accuracy Per Class**
- **Purpose**: Understand model performance on each selected class
- **Insight**: Reveals which fault types are easier/harder to detect
- **Use Case**: Critical for industrial applications where missing certain fault types has higher cost
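
Per-class accuracy can be read as the fraction of each class's true samples that the model labels correctly (per-class recall). The sketch below assumes that reading; `accuracy_per_class` and the toy labels are hypothetical.

```python
import numpy as np


def accuracy_per_class(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        int(cls): float(np.mean(y_pred[y_true == cls] == cls))
        for cls in np.unique(y_true)
    }


# Hypothetical toy labels for three fault classes.
y_true = [2, 2, 2, 2, 3, 3, 8, 8]
y_pred = [2, 2, 2, 3, 3, 3, 2, 8]
per_class = accuracy_per_class(y_true, y_pred)
print(per_class)  # {2: 0.75, 3: 1.0, 8: 0.5}
```

Overall accuracy here is 6/8 = 0.75, which hides the fact that class 8 is detected only half the time.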

### 4. ⚠️ **Flexible Class Filtering**
- **Purpose**: Either exclude normal operation (class 0) or focus on specific fault types
- **Insight**: Create focused classification problems aligned with operational needs
- **Use Case**: 
  - Fault diagnosis systems (exclude normal operation)
  - Specific fault type analysis (select particular faults)
  - Simplified problems for testing and development
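
The filtering rule described here can be sketched with a single boolean mask. This is an assumption about the behaviour, not the module's code: `filter_classes` is a hypothetical helper in which `None` means "every class except 0".

```python
import numpy as np


def filter_classes(X, y, selected_classes=None):
    """Keep only the requested classes; None means every class except 0."""
    X = np.asarray(X)
    y = np.asarray(y)
    if selected_classes is None:
        mask = y != 0
    else:
        mask = np.isin(y, selected_classes)
    # The same mask is applied to X and y so rows and labels stay aligned.
    return X[mask], y[mask]


# Hypothetical toy data: 6 samples with 2 features each.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 2, 3, 8, 0])

X_faults, y_faults = filter_classes(X, y)          # drop normal operation
X_sel, y_sel = filter_classes(X, y, [3, 8])        # keep two fault types
print(y_faults.tolist(), y_sel.tolist())  # [1, 2, 3, 8] [3, 8]
```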

### 5. 🔄 **Configurable Test Data Balancing**
- **Purpose**: Ensure robust evaluation with sufficient samples per class
- **Configuration**: Enable/disable via config file settings
- **Insight**: Balances test data to have minimum samples per selected class for reliable metrics
- **Use Case**: Prevents evaluation bias due to class imbalance in test set
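
The text does not say how the minimum per-class count is reached. One possible reading, shown purely as an assumption, is to resample under-represented classes with replacement until each has at least the minimum. `ensure_min_samples_per_class` is a hypothetical helper.

```python
import numpy as np


def ensure_min_samples_per_class(X, y, min_samples, seed=0):
    """Resample (with replacement) so every class has at least min_samples rows."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    keep = [np.arange(len(y))]  # start from every original row
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        shortfall = min_samples - len(idx)
        if shortfall > 0:
            # Draw extra copies of this class's rows to cover the shortfall.
            keep.append(rng.choice(idx, size=shortfall, replace=True))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


# Hypothetical toy test set: class 8 has a single sample.
X = np.arange(10).reshape(5, 2)
y = np.array([2, 2, 2, 2, 8])
X_bal, y_bal = ensure_min_samples_per_class(X, y, min_samples=3)
classes, counts = np.unique(y_bal, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))  # {2: 4, 8: 3}
```

Duplicated test rows add no new information; they only change how much each class contributes to aggregate metrics.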

## Configuration Management

### Config File Location:
```
📁 src/config.py
└── CLASSIFICATION_CONFIG section
└── CLASSIFICATION_PRESETS section
```

### Available Presets:
```python
CLASSIFICATION_PRESETS = {
    'all_faults': None,                # All fault types (exclude class 0)
    'specific_faults': [2, 3, 8],     # User-defined specific faults
    'early_faults': [1, 2, 3, 4, 5],  # First 5 fault types
    'late_faults': [7, 8, 9],         # Last 3 fault types
    'odd_faults': [1, 3, 5, 7, 9],    # Odd-numbered fault types
    'even_faults': [2, 4, 6, 8],      # Even-numbered fault types
    'critical_faults': [3, 6, 8, 9],  # Critical operational faults
    'minor_faults': [1, 2, 4, 5, 7],  # Minor operational faults
    'binary_test': [3, 8],            # Simple binary classification
}
```

### Configuration Examples:

#### In config.py:
```python
CLASSIFICATION_CONFIG = {
    'selected_classes': [2, 3, 8],     # Default selection
    'balance_test': False,             # Test balancing setting
    'min_test_samples_per_class': 300, # Min samples when balancing
    'balance_classes': True,           # Training balancing
    'balance_strategy': 'combined',    # Balancing strategy
    'max_samples_per_class': 1000,     # Training limit
    'verbose': True                    # Progress display
}
```

#### In notebook override:
```python
# Quick override in notebook
selected_classes = config.CLASSIFICATION_PRESETS['binary_test']  # Use preset
# OR
selected_classes = [1, 4, 7]  # Custom selection
```
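
Because the `all_faults` preset is stored as `None`, code that needs an explicit class list has to expand it. The sketch below shows one way to do that; `resolve_classes`, `FAULT_CLASSES`, and the trimmed preset dictionary are hypothetical stand-ins, not the contents of `src/config.py`.

```python
# Hypothetical stand-ins for the dictionaries described above.
CLASSIFICATION_PRESETS = {
    "all_faults": None,
    "binary_test": [3, 8],
    "early_faults": [1, 2, 3, 4, 5],
}
FAULT_CLASSES = list(range(1, 10))  # classes 1-9; class 0 is normal operation


def resolve_classes(preset_name, presets=CLASSIFICATION_PRESETS):
    """Return the explicit class list for a preset, expanding None to all faults."""
    if preset_name not in presets:
        raise KeyError(f"Unknown preset {preset_name!r}; choose from {sorted(presets)}")
    selected = presets[preset_name]
    return list(FAULT_CLASSES) if selected is None else list(selected)


print(resolve_classes("binary_test"))  # [3, 8]
print(resolve_classes("all_faults"))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Failing loudly on an unknown preset name catches typos before a long training run starts.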

## Why Configuration Management Matters

### Easy Experimentation:
- **Quick Changes**: Modify config file without editing notebook code
- **Preset Library**: Use common configurations instantly
- **Version Control**: Track configuration changes easily
- **Reproducibility**: Share exact configurations with team

### Industrial Relevance:
- **Deployment Ready**: Production configurations separate from code
- **Operational Flexibility**: Adapt to different operational scenarios
- **Maintenance Efficiency**: Update configurations without code changes
- **Team Collaboration**: Shared configuration standards

### Model Development:
- **Systematic Testing**: Compare different class combinations systematically
- **Performance Optimization**: Focus computational resources efficiently
- **Iterative Development**: Progressive complexity increase
- **Configuration Documentation**: Track what works best for different scenarios

In [2]:
# ============================================================
# COMPREHENSIVE SUPERVISED CLASSIFICATION WITH ENHANCED ANALYSIS
# ============================================================

print("🤖 Running Enhanced Supervised Classification Analysis")
print("=" * 60)

# Check if we have loaded data from previous cell
if (
    "train_dfs" in locals()
    and train_dfs is not None
    and len(train_dfs) > 0
    and "test_dfs" in locals()
    and test_dfs is not None
    and len(test_dfs) > 0
):

    # Import the enhanced supervised classification module and config
    from src.supervised_classification import enhanced_fold_analysis
    from src import config

    print("📊 Using enhanced supervised classification module...")
    print(f"🚂 Training windows available: {len(train_dfs)}")
    print(f"🧪 Test windows available: {len(test_dfs)}")

    # Check if fold information is available
    fold_available = (
        "test_fold_info" in locals()
        and test_fold_info is not None
        and len(test_fold_info) == len(test_dfs)
    )

    if fold_available:
        unique_folds = sorted(set(test_fold_info))
        print(f"📁 Fold information available: {len(unique_folds)} folds detected")
        print(f"📁 Folds: {unique_folds}")
    else:
        print("⚠️ Fold information not available - will skip per-fold analysis")

    # ============================================================
    # LOAD CLASSIFICATION CONFIGURATION FROM CONFIG FILE
    # ============================================================

    print("\n⚙️ Loading classification configuration from config.py...")

    # Load configuration from config file
    classification_config = config.CLASSIFICATION_CONFIG
    classification_presets = config.CLASSIFICATION_PRESETS

    # You can easily change the configuration by modifying these lines:
    # Option 1: Use configuration directly from config file
    selected_classes = classification_config["selected_classes"]
    balance_test = classification_config["balance_test"]
    min_test_samples_per_class = classification_config["min_test_samples_per_class"]

    # Option 2: Override with a preset (uncomment one to use)
    # selected_classes = classification_presets['all_faults']         # All fault types
    # selected_classes = classification_presets['early_faults']       # Classes 1-5
    # selected_classes = classification_presets['late_faults']        # Classes 7-9
    # selected_classes = classification_presets['odd_faults']         # Odd classes
    # selected_classes = classification_presets['binary_test']        # Classes 3,8

    # Option 3: Custom selection (uncomment and modify)
    # selected_classes = [1, 4, 7]  # Your custom selection

    print("📋 Available classification presets:")
    for preset_name, preset_classes in classification_presets.items():
        print(f"   • {preset_name}: {preset_classes}")

    print("\n🎯 Current Classification Configuration:")
    print(f"   📊 Selected classes: {selected_classes}")
    print(f"   📊 Balance test data: {balance_test}")
    if balance_test:
        print(f"   📊 Min test samples per class: {min_test_samples_per_class}")

    if selected_classes is None:
        print("   📊 Analysis Mode: All fault types (exclude only class 0)")
        print("   📊 Classes to analyze: All available fault classes (1-9)")
    else:
        print("   📊 Analysis Mode: Custom class selection")
        print(f"   📊 Classes to analyze: {selected_classes}")

    # Run enhanced supervised classification with fold analysis
    # This will provide:
    # 1. Class filtering based on selected_classes parameter
    # 2. Accuracy per fold (if fold info available and test balancing disabled)
    # 3. Accuracy per class for selected classes only
    # 4. Confirmation of class selection status
    # Plus all the standard analysis (model comparison, visualizations, etc.)

    print("\n🔄 Starting enhanced classification pipeline...")
    print("📋 Analysis will include:")
    if selected_classes is None:
        print("   ⚠️ Complete exclusion of class 0 (normal operation)")
        print("   ✅ All fault types (classes 1-9) included")
    else:
        print(f"   🎯 Custom class selection: {selected_classes}")
        print("   ✅ Only selected classes included in analysis")
    print("   ✅ Standard model training (Decision Trees, SVM, Neural Networks)")
    print("   ✅ Data augmentation for class balancing (training)")
    if balance_test:
        print(
            f"   ✅ Test data balancing (ensures min {min_test_samples_per_class} samples per class)"
        )
    else:
        print("   ⚠️ Test data balancing disabled")
    print("   ✅ Per-class accuracy analysis (selected classes only)")
    if fold_available:
        print("   ⚠️ Per-fold analysis (may be disabled due to test balancing)")
    else:
        print("   ⚠️ Per-fold analysis (skipped - fold info unavailable)")

    classifier = enhanced_fold_analysis(
        train_dfs=train_dfs,  # Full training data
        train_classes=train_classes,  # Full training labels
        test_dfs=test_dfs,  # Full test data
        test_classes=test_classes,  # Full test labels
        fold_info=(
            test_fold_info if fold_available else None
        ),  # Pass fold info if available
        balance_classes=classification_config["balance_classes"],  # From config
        balance_strategy=classification_config["balance_strategy"],  # From config
        max_samples_per_class=classification_config[
            "max_samples_per_class"
        ],  # From config
        balance_test=balance_test,  # From config (configurable)
        min_test_samples_per_class=min_test_samples_per_class,  # From config (configurable)
        selected_classes=selected_classes,  # From config (configurable)
        verbose=classification_config["verbose"],  # From config
    )

    print("\n🎉 Enhanced Supervised Classification Complete!")
    print("✅ All models trained and compared")
    if selected_classes is None:
        print("⚠️ Class 0 (normal operation) completely excluded from analysis")
        print("✅ All fault types (classes 1-9) analyzed")
    else:
        print(f"🎯 Custom class selection applied: {selected_classes}")
        print("✅ Only selected classes analyzed")
    print("✅ Class balancing applied using data augmentation (training)")
    if balance_test:
        print(
            f"✅ Test data balanced to ensure robust evaluation (min {min_test_samples_per_class} per class)"
        )
    else:
        print("⚠️ Test data balancing disabled")
    print("✅ Per-class accuracy analysis completed (selected classes only)")
    if fold_available:
        print("⚠️ Per-fold accuracy analysis (may be disabled due to test balancing)")
    print("✅ Standard performance visualizations created")
    print("✅ Enhanced analysis provides comprehensive classification insights")

    print("\n💡 Configuration Tips:")
    print("   • Edit 'src/config.py' to change default settings")
    print("   • Use classification presets for common scenarios")
    print("   • Override settings in this cell for quick experiments")

    # Store classifier for further analysis if needed
    supervised_classifier = classifier

else:
    print(
        "❌ No data available. Please run the previous cell first to load training and test data."
    )
    supervised_classifier = None

🤖 Running Enhanced Supervised Classification Analysis
📊 Using enhanced supervised classification module...
🚂 Training windows available: 505336
🧪 Test windows available: 78491
📁 Fold information available: 3 folds detected
📁 Folds: ['fold_1', 'fold_2', 'fold_3']

⚙️ Loading classification configuration from config.py...
📋 Available classification presets:
   • all_faults: None
   • specific_faults: [2, 3, 8]
   • early_faults: [1, 2, 3, 4, 5]
   • late_faults: [7, 8, 9]
   • odd_faults: [1, 3, 5, 7, 9]
   • even_faults: [2, 4, 6, 8]
   • critical_faults: [3, 6, 8, 9]
   • minor_faults: [1, 2, 4, 5, 7]
   • binary_test: [3, 8]

🎯 Current Classification Configuration:
   📊 Selected classes: [1, 2, 3, 4, 5, 8, 9]
   📊 Balance test data: False
   📊 Analysis Mode: Custom class selection
   📊 Classes to analyze: [1, 2, 3, 4, 5, 8, 9]

🔄 Starting enhanced classification pipeline...
📋 Analysis will include:
   🎯 Custom class se

IndexError: boolean index did not match indexed array along axis 0; size of axis is 31555 but size of corresponding boolean axis is 2664
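
NumPy raises this `IndexError` when a boolean mask and the array it indexes have different lengths (here 2664 versus 31555). The traceback is cut off, so the cause inside `enhanced_fold_analysis` is not visible. One plausible cause, offered only as an assumption, is a mask built from one array being applied to another that was filtered or resampled differently. The hypothetical `apply_mask` helper below sketches a guard that applies one mask to several arrays and fails early with a clearer message.

```python
import numpy as np


def apply_mask(mask, *arrays):
    """Apply one boolean mask to several arrays, failing early on length mismatch."""
    mask = np.asarray(mask, dtype=bool)
    out = []
    for i, arr in enumerate(arrays):
        arr = np.asarray(arr)
        if arr.shape[0] != mask.shape[0]:
            raise ValueError(
                f"array {i} has {arr.shape[0]} rows but the mask has {mask.shape[0]}"
            )
        out.append(arr[mask])
    return out


# Hypothetical toy data: labels, predictions and fold tags of equal length.
y_test = np.array([0, 3, 8, 3, 0, 8])
y_pred = np.array([0, 3, 8, 8, 0, 8])
fold_info = np.array(["fold_1", "fold_1", "fold_2", "fold_2", "fold_3", "fold_3"])

mask = np.isin(y_test, [3, 8])
y_sel, pred_sel, fold_sel = apply_mask(mask, y_test, y_pred, fold_info)
print(len(y_sel), len(pred_sel), len(fold_sel))  # 4 4 4
```

Filtering labels, predictions, and fold tags through the same call keeps them the same length, which per-fold and per-class metrics rely on.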

## Configuration-Based Classification (New!)

### Easy Configuration Management

The classification system now uses a configuration file approach for easy experimentation:

**📁 Configuration File:** `src/config.py`
- Contains all classification settings
- Pre-configured presets for common scenarios
- Easy to modify without changing notebook code

### Quick Configuration Examples:

#### 1. Use Default Configuration (the `selected_classes` value in `config.py`):
```python
# Already configured in config.py
selected_classes = config.CLASSIFICATION_CONFIG['selected_classes']
```

#### 2. Use a Preset:
```python
selected_classes = config.CLASSIFICATION_PRESETS['binary_test']    # Classes [3, 8]
selected_classes = config.CLASSIFICATION_PRESETS['early_faults']   # Classes [1,2,3,4,5]
selected_classes = config.CLASSIFICATION_PRESETS['all_faults']     # All fault types (1-9)
```

#### 3. Custom Selection:
```python
selected_classes = [1, 4, 7]  # Your specific choice
```

### Test Data Balancing Control:
```python
balance_test = True   # Enable balanced test evaluation
balance_test = False  # Use original test distribution (faster)
```

### Benefits:
- **🚀 Quick Experiments**: Change focus without code modifications
- **📋 Reproducible**: Share exact configurations
- **⚙️ Production Ready**: Separate configuration from code
- **🔄 Version Control**: Track configuration changes

---

## 🌳 Decision Trees and Random Forest (8 minutes)

### Decision Trees
- **How it works**: Creates a tree-like model of decisions
- **Advantages**: Easy to interpret, handles both numerical and categorical data
- **Disadvantages**: Prone to overfitting, can be unstable

### Random Forest
- **How it works**: Combines many decision trees (ensemble method)
- **Advantages**: Reduces overfitting, more robust, provides feature importance
- **Disadvantages**: Less interpretable than single tree, can still overfit with noisy data

### Why Good for Oil Well Data:
- Handles high-dimensional features well
- Provides feature importance (which sensors are most important)
- Robust to outliers and noise
- Works well with time series features
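
To show what "a tree-like model of decisions" means at the level of one node, the sketch below picks the threshold on a single feature that minimises weighted Gini impurity. It is a teaching sketch with hypothetical helpers (`gini`, `best_split`) and toy data, not the implementation used by the `src` module or by any library.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))


def best_split(feature, labels):
    """Return (threshold, weighted_gini) of the best 'feature <= t' split."""
    best = (None, gini(labels))
    for t in np.unique(feature)[:-1]:  # the largest value cannot split anything
        left, right = labels[feature <= t], labels[feature > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (float(t), score)
    return best


# Hypothetical toy sensor reading: low values are class 1, high values class 2.
feature = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
labels = np.array([1, 1, 1, 2, 2, 2])
threshold, impurity = best_split(feature, labels)
print(threshold, impurity)  # 0.3 0.0
```

A full tree repeats this search over every feature and recurses into each side; a random forest trains many such trees on resampled data and random feature subsets, then combines their votes.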

In [None]:
# ============================================================
# INDIVIDUAL ALGORITHM TRAINING (OPTIONAL)
# ============================================================

print("🌳 Individual Algorithm Training Example")
print("=" * 50)

# This cell demonstrates how to use individual algorithms from the module
# The comprehensive classification above already ran all algorithms

if "supervised_classifier" in locals() and supervised_classifier is not None:
    print("📊 Comprehensive classification already completed above!")
    print("🔍 Here's how to access individual algorithm results:")

    # Access results from the comprehensive run
    results = supervised_classifier.results

    # Find tree-based models
    tree_models = [
        r for r in results if "Tree" in r["model_name"] or "Forest" in r["model_name"]
    ]

    if tree_models:
        print("\n🌳 Tree-Based Models Performance:")
        print("-" * 40)
        for result in tree_models:
            print(f"{result['model_name']}:")
            print(f"   • Training Accuracy: {result['train_accuracy']:.3f}")
            print(f"   • Test Accuracy: {result['test_accuracy']:.3f}")
            print(f"   • Training Time: {result['training_time']:.3f}s")

            # Show feature importance for Random Forest
            if (
                "Random Forest" in result["model_name"]
                and "feature_importance" in result
            ):
                print("   • Top 5 Most Important Features:")
                feature_importance = result["feature_importance"]
                top_features_idx = np.argsort(feature_importance)[-5:][::-1]
                for i, idx in enumerate(top_features_idx, 1):
                    print(f"     {i}. Feature {idx}: {feature_importance[idx]:.4f}")
            print()

    print("💡 To train individual algorithms separately:")
    print(
        "   1. Use supervised_classifier.prepare_data() to get X_train, y_train, X_test, y_test"
    )
    print(
        "   2. Call supervised_classifier.train_decision_trees(X_train, y_train, X_test, y_test)"
    )
    print("   3. Or use supervised_classifier.train_svm() or train_neural_networks()")

    # Example of how to train just decision trees individually
    print("\n🔧 Example: Training only Decision Trees individually")
    print("(This would be useful if you only want specific algorithms)")

    # Note: Data is already prepared in the classifier
    print(
        "✅ Data preparation, class balancing, and augmentation already handled by module"
    )
    print("✅ All models already trained - see comprehensive results above")

else:
    print(
        "❌ No classifier available. Please run the comprehensive classification cell first."
    )
    print("💡 The comprehensive cell above handles:")
    print("   • Data preparation and class balancing with augmentation")
    print("   • Training of all algorithms (Decision Trees, SVM, Neural Networks)")
    print("   • Model comparison and visualization")
    print("   • Performance analysis and recommendations")

🌳 Individual Algorithm Training Example
❌ No classifier available. Please run the comprehensive classification cell first.
💡 The comprehensive cell above handles:
   • Data preparation and class balancing with augmentation
   • Training of all algorithms (Decision Trees, SVM, Neural Networks)
   • Model comparison and visualization
   • Performance analysis and recommendations


## ⚡ Support Vector Machines (4 minutes)

### How SVM Works:
- **Goal**: Find the optimal hyperplane that separates classes with maximum margin
- **Kernel Trick**: Implicitly map data to a higher-dimensional space, where classes that are not linearly separable in the original features may become so
- **Support Vectors**: Data points closest to the decision boundary

### SVM Advantages:
- Effective in high-dimensional spaces (a good fit for our flattened time series)
- Memory efficient at prediction time (the decision function depends only on the support vectors)
- Versatile (different kernels for different data patterns)

### SVM for Oil Well Data:
- Handles high-dimensional sensor data well
- RBF kernel can capture non-linear patterns in sensor readings
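- A single SVM is a binary classifier (e.g. normal vs fault); multi-class problems such as ours are typically handled by combining several binary SVMs (one-vs-one or one-vs-rest)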
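
The RBF kernel mentioned above scores the similarity of two samples as exp(-gamma * squared distance). The sketch below computes the full kernel matrix with NumPy; `rbf_kernel`, the `gamma` value, and the toy points are hypothetical choices for illustration, not the settings used by the `src` module.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


# Hypothetical toy points: two close together, one far away.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
K = rbf_kernel(X, X, gamma=0.5)
print(np.round(K, 4))
```

Nearby points get a similarity close to 1 and distant points close to 0, which is how an RBF SVM can draw curved decision boundaries without ever building the higher-dimensional features explicitly.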

In [None]:
# ============================================================
# SVM RESULTS FROM COMPREHENSIVE CLASSIFICATION
# ============================================================

print("⚡ Support Vector Machines Results")
print("=" * 40)

if "supervised_classifier" in locals() and supervised_classifier is not None:
    print("üìä SVM models already trained in comprehensive classification!")

    # Access SVM results
    results = supervised_classifier.results
    svm_models = [r for r in results if "SVM" in r["model_name"]]

    if svm_models:
        print(f"\n‚ö° SVM Performance Summary:")
        print("-" * 35)
        for result in svm_models:
            print(f"{result['model_name']}:")
            print(f"   ‚Ä¢ Training Accuracy: {result['train_accuracy']:.3f}")
            print(f"   ‚Ä¢ Test Accuracy: {result['test_accuracy']:.3f}")
            print(f"   ‚Ä¢ Training Time: {result['training_time']:.3f}s")
            print()

        # Show which SVM performed better
        best_svm = max(svm_models, key=lambda x: x["test_accuracy"])
        print(
            f"üèÜ Best SVM: {best_svm['model_name']} (Test Accuracy: {best_svm['test_accuracy']:.3f})"
        )

        print(f"\n SVM Implementation Details:")
        print(f"   ‚Ä¢ Used subset sampling for computational efficiency")
        print(f"   ‚Ä¢ Linear SVM: Faster, good for linearly separable data")
        print(f"   ‚Ä¢ RBF SVM: Better for complex non-linear patterns")
        print(f"   ‚Ä¢ Data already normalized - optimal for SVM performance")
        print(f"   ‚Ä¢ Class balancing with augmentation improved SVM robustness")

    print(f"\n Module Benefits:")
    print(f"   ‚úÖ Automatic subset sampling for large datasets")
    print(f"   ‚úÖ Both Linear and RBF kernels tested")
    print(f"   ‚úÖ Optimized hyperparameters")
    print(f"   ‚úÖ Integrated with data augmentation for class balancing")

else:
    print("‚ùå No classifier results available.")
    print("üí° Run the comprehensive classification cell to see SVM results.")
    print("üìã The module automatically handles:")
    print("   ‚Ä¢ Subset sampling for SVM computational efficiency")
    print("   ‚Ä¢ Training both Linear and RBF SVM variants")
    print("   ‚Ä¢ Performance comparison with other algorithms")
    print("   ‚Ä¢ Integration with balanced dataset using augmentation")

## 🧠 Neural Networks and Deep Learning Basics (8 minutes)

### Neural Networks Fundamentals:
- **Neurons**: Basic processing units that compute a weighted sum of their inputs and pass it through an activation function
- **Layers**: Input layer → Hidden layers → Output layer
- **Backpropagation**: Learning algorithm that adjusts weights based on errors
- **Activation Functions**: Non-linear functions (ReLU, sigmoid, tanh)
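
The pieces above can be sketched in a few lines of NumPy. This is a minimal illustration on random data, assuming one hidden ReLU layer and a softmax output over 10 classes; the layer sizes, weights, and labels are made up, and only the first step of backpropagation (the gradient for the output-layer weights) is shown.

```python
import numpy as np


def relu(z):
    """ReLU activation: keep positive values, zero out the rest."""
    return np.maximum(z, 0.0)


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def forward(X, W1, b1, W2, b2):
    """Input layer -> hidden layer (weighted sum + ReLU) -> class probabilities."""
    hidden = relu(X @ W1 + b1)
    probs = softmax(hidden @ W2 + b2)
    return hidden, probs


def cross_entropy(probs, y):
    """Mean negative log-probability assigned to the true class."""
    return -np.mean(np.log(probs[np.arange(len(y)), y]))


rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 6, 4, 10  # illustrative sizes

W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

X = rng.normal(size=(3, n_features))  # three made-up samples
y = np.array([0, 3, 9])               # made-up class labels

hidden, probs = forward(X, W1, b1, W2, b2)
loss = cross_entropy(probs, y)

# Backpropagation starts at the output: for softmax + cross-entropy, the
# gradient with respect to the logits is (probs - one_hot(y)) / n_samples.
grad_logits = probs.copy()
grad_logits[np.arange(len(y)), y] -= 1.0
grad_logits /= len(y)
grad_W2 = hidden.T @ grad_logits  # gradient for the output-layer weights

print("Probabilities shape:", probs.shape)
print("Rows sum to 1:", np.allclose(probs.sum(axis=1), 1.0))
print("Loss:", loss)
```

A full training loop would continue the chain rule back through the hidden layer and repeat a small update such as `W2 -= learning_rate * grad_W2` over many iterations.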

### Why Neural Networks for Oil Well Data:
- **Automatic Feature Learning**: Can discover complex patterns in sensor data
- **Non-linear Relationships**: Capture complex interactions between sensors
- **Temporal Patterns**: Can model dependencies in time series data, especially with sequence-aware architectures (CNN, RNN/LSTM)
- **Scalability**: Handle large amounts of high-dimensional data

### Types of Neural Networks:
1. **Multi-Layer Perceptron (MLP)**: Standard feedforward network
2. **Convolutional Neural Networks (CNN)**: Detect local patterns with sliding filters
3. **Recurrent Neural Networks (RNN/LSTM)**: Designed for sequential data
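
One practical difference between these network types is the input shape each expects. The sketch below uses made-up window dimensions (`n_windows`, `n_timesteps`, `n_sensors` are assumptions, not the 3W values) to contrast the flattened vectors an MLP consumes with the time-ordered arrays that CNNs and RNNs typically take.

```python
import numpy as np

# Hypothetical sizes: 5 windows, 100 time steps each, 8 sensors.
n_windows, n_timesteps, n_sensors = 5, 100, 8
windows = np.arange(n_windows * n_timesteps * n_sensors, dtype=float).reshape(
    n_windows, n_timesteps, n_sensors
)

# MLP input: every window becomes one long feature vector.
mlp_input = windows.reshape(n_windows, -1)

# CNN / RNN input: the time axis is kept, so ordering is visible to the model.
seq_input = windows

print(mlp_input.shape)  # (5, 800)
print(seq_input.shape)  # (5, 100, 8)

# After flattening, sensor s at time step t sits at index t * n_sensors + s.
t, s = 3, 2
print(mlp_input[0, t * n_sensors + s] == windows[0, t, s])  # True
```

Flattening keeps every value but leaves it to the MLP to discover which features are neighbors in time, whereas sequence models receive that structure directly.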

### Deep Learning Advantages:
- Learns features automatically from raw data
- Can model very complex relationships
- State-of-the-art performance on many tasks

In [None]:
# ============================================================
# NEURAL NETWORKS RESULTS FROM COMPREHENSIVE CLASSIFICATION
# ============================================================
import numpy as np

print("🧠 Neural Networks Results")
print("=" * 35)

if "supervised_classifier" in locals() and supervised_classifier is not None:
    print("📊 Neural Network models already trained in comprehensive classification!")

    # Access Neural Network results
    results = supervised_classifier.results
    nn_models = [r for r in results if "Neural Network" in r["model_name"]]

    if nn_models:
        print("\n🧠 Neural Networks Performance Summary:")
        print("-" * 45)

        # Create comparison table
        print(
            f"{'Model':<25} {'Train Acc':<10} {'Test Acc':<10} {'Time (s)':<10} {'Iterations':<12}"
        )
        print("-" * 67)

        for result in nn_models:
            iterations = result.get("iterations", "N/A")
            print(
                f"{result['model_name']:<25} {result['train_accuracy']:<10.3f} "
                f"{result['test_accuracy']:<10.3f} {result['training_time']:<10.3f} {iterations:<12}"
            )

        # Show best neural network
        best_nn = max(nn_models, key=lambda x: x["test_accuracy"])
        print(
            f"\n🏆 Best Neural Network: {best_nn['model_name']} (Test Accuracy: {best_nn['test_accuracy']:.3f})"
        )

        print("\n🧠 Neural Network Architecture Analysis:")
        print("   • Simple NN (1 hidden layer): Fast baseline, good for simple patterns")
        print("   • Deep NN (3 hidden layers): Complex pattern recognition, may overfit")
        print("   • Regularized NN: Balanced approach using dropout to prevent overfitting")

        print("\n💡 Key Neural Network Insights:")
        print("   • Early stopping prevented overfitting automatically")
        print("   • Data already normalized - optimal for neural network training")
        print("   • Class balancing improved learning from minority classes")
        print("   • Adaptive learning rates helped convergence")

        # Show training efficiency
        total_nn_time = sum(r["training_time"] for r in nn_models)
        print("\n⚡ Training Efficiency:")
        print(f"   • Total NN training time: {total_nn_time:.3f} seconds")
        # Average only over models that actually report an iteration count
        iteration_counts = [r["iterations"] for r in nn_models if "iterations" in r]
        if iteration_counts:
            print(f"   • Average iterations: {np.mean(iteration_counts):.1f}")
        print("   • All models used early stopping for efficiency")

    print("\nModule Neural Network Features:")
    print("   ✅ Multiple architectures tested automatically")
    print("   ✅ Early stopping and regularization built-in")
    print("   ✅ Optimized hyperparameters for time series data")
    print("   ✅ Integrated with balanced dataset")
    print("   ✅ Automatic feature learning from flattened windows")

else:
    print("❌ No classifier results available.")
    print("💡 Run the comprehensive classification cell to see Neural Network results.")
    print("📋 The module automatically provides:")
    print("   • Simple Neural Network (1 hidden layer)")
    print("   • Deep Neural Network (3 hidden layers)")
    print("   • Regularized Neural Network (using dropout to prevent overfitting)")
    print("   • Early stopping and validation for all networks")
    print("   • Performance comparison and overfitting analysis")

## 🚀 Implementation and Training Summary (5 minutes)

### Model Performance Comparison
Let's compare all the models we trained and understand their strengths and weaknesses for oil well fault detection.
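
As a rough sketch of what such a comparison involves, the snippet below ranks result records by test accuracy and reports the train/test gap as a simple overfitting indicator. The record keys mirror those used in this notebook (`model_name`, `train_accuracy`, `test_accuracy`, `training_time`), but the model names and every number are placeholders invented for illustration, not actual results.

```python
# Placeholder records: names and numbers are invented, NOT real results.
results = [
    {"model_name": "Model A (tree)", "train_accuracy": 0.99,
     "test_accuracy": 0.85, "training_time": 2.0},
    {"model_name": "Model B (SVM)", "train_accuracy": 0.90,
     "test_accuracy": 0.88, "training_time": 30.0},
    {"model_name": "Model C (neural network)", "train_accuracy": 0.95,
     "test_accuracy": 0.86, "training_time": 12.0},
]


def rank_models(results):
    """Return records sorted by test accuracy, each with a train/test gap."""
    rows = [
        {**r, "gap": r["train_accuracy"] - r["test_accuracy"]} for r in results
    ]
    return sorted(rows, key=lambda r: r["test_accuracy"], reverse=True)


ranked = rank_models(results)
print(f"{'Model':<26} {'Test Acc':<10} {'Gap':<8} {'Time (s)':<8}")
for row in ranked:
    print(
        f"{row['model_name']:<26} {row['test_accuracy']:<10.3f} "
        f"{row['gap']:<+8.3f} {row['training_time']:<8.1f}"
    )
```

A large positive gap suggests the model memorized the training set. On imbalanced data, accuracy alone can also hide poor performance on rare fault classes, so per-class metrics are worth checking alongside this ranking.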

In [None]:
# ============================================================
# COMPREHENSIVE ANALYSIS AND TUTORIAL SUMMARY
# ============================================================

print("Supervised Classification Tutorial Summary")
print("=" * 50)

if "supervised_classifier" in locals() and supervised_classifier is not None:
    print("✅ Comprehensive supervised classification completed successfully!")

    # Show final summary of what was accomplished
    results = supervised_classifier.results

    print("\n📊 Tutorial Accomplishments:")
    print("   ✅ Loaded windowed data from ALL folds")
    print("   ✅ Applied data augmentation for class balancing")
    print("   ✅ Used existing data normalization (no additional scaling)")
    print(f"   ✅ Trained {len(results)} different classification models")
    print("   ✅ Compared model performance comprehensively")
    print("   ✅ Generated performance visualizations")
    print("   ✅ Provided practical recommendations")

    # Show algorithm categories covered
    tree_models = [
        r for r in results if "Tree" in r["model_name"] or "Forest" in r["model_name"]
    ]
    svm_models = [r for r in results if "SVM" in r["model_name"]]
    nn_models = [r for r in results if "Neural Network" in r["model_name"]]

    print("\n🔬 Algorithms Implemented:")
    print(f"   🌳 Tree-Based: {len(tree_models)} models (Decision Tree, Random Forest)")
    print(f"   ⚡ Support Vector Machines: {len(svm_models)} models (Linear, RBF)")
    print(f"   🧠 Neural Networks: {len(nn_models)} models (Simple, Deep, Regularized)")

    # Show best performers in each category
    if tree_models:
        best_tree = max(tree_models, key=lambda x: x["test_accuracy"])
        print(
            f"\n🏆 Best Tree Model: {best_tree['model_name']} ({best_tree['test_accuracy']:.3f})"
        )

    if svm_models:
        best_svm = max(svm_models, key=lambda x: x["test_accuracy"])
        print(
            f"🏆 Best SVM Model: {best_svm['model_name']} ({best_svm['test_accuracy']:.3f})"
        )

    if nn_models:
        best_nn = max(nn_models, key=lambda x: x["test_accuracy"])
        print(
            f"🏆 Best Neural Network: {best_nn['model_name']} ({best_nn['test_accuracy']:.3f})"
        )

    # Overall best model (guarded: max() raises ValueError on an empty list)
    if results:
        overall_best = max(results, key=lambda x: x["test_accuracy"])
        print(
            f"\n🥇 Overall Best Model: {overall_best['model_name']} ({overall_best['test_accuracy']:.3f})"
        )

    print("\n💡 Key Learning Outcomes Achieved:")
    print("   📚 Classification Fundamentals: Problem formulation and evaluation")
    print("   🌳 Decision Trees & Random Forest: Interpretable ensemble methods")
    print("   ⚡ Support Vector Machines: High-dimensional data classification")
    print("   🧠 Neural Networks: Deep learning for automatic feature extraction")
    print("   📊 Model Comparison: Performance analysis and selection criteria")

    print("\n🔧 Technical Implementations:")
    print(
        "   ✅ Data augmentation for class balancing (using src/data_augmentation.py)"
    )
    print(
        "   ✅ Comprehensive classification module (src/supervised_classification.py)"
    )
    print("   ✅ No redundant normalization (data already preprocessed)")
    print("   ✅ Efficient computational strategies (SVM subsampling)")
    print("   ✅ Early stopping and regularization for neural networks")

    print("\n🎯 Production Readiness:")
    print("   • Model code modularized in src/ folder")
    print("   • Class balancing pipeline integrated")
    print("   • Performance metrics and visualizations available")
    print("   • Best model identified with practical recommendations")
    print("   • Ready for deployment and further optimization")

    print("\n📈 Next Steps:")
    print("   1. Implement cross-validation for more robust evaluation")
    print("   2. Hyperparameter tuning for the best performing algorithms")
    print("   3. Ensemble methods combining multiple top performers")
    print("   4. Real-time inference pipeline development")
    print("   5. Model monitoring and maintenance procedures")

else:
    print("❌ Comprehensive classification not completed.")
    print("💡 Please run the comprehensive classification cell to:")
    print("   1. Load and prepare windowed data from all folds")
    print("   2. Apply data augmentation for class balancing")
    print("   3. Train all classification algorithms")
    print("   4. Compare model performance")
    print("   5. Generate visualizations and recommendations")

print("\nSupervised Classification Tutorial Complete!")
print("   This tutorial covered:")
print("   ✅ Decision Trees and Random Forest (8 min)")
print("   ✅ Support Vector Machines (4 min)")
print("   ✅ Neural Networks and Deep Learning (8 min)")
print("   ✅ Implementation and Training Summary (5 min)")
print("   ✅ Comprehensive model comparison and analysis")
print("   ✅ Production-ready modular implementation")