# OIBC Submission Pipeline - Demonstration Notebook

This notebook demonstrates the complete workflow for the OIBC (Open Innovation Big Competition) submission.

## Pipeline Overview

The pipeline consists of 5 main steps:
1. **Data Splitting**: Split data into train/validation sets
2. **Clustering**: Create location-based PV clusters
3. **Feature Engineering**: Generate cluster-based features
4. **Model Training**: Train ensemble or individual models
5. **Inference**: Generate predictions

## Quick Start

You can run the entire pipeline using the orchestration script:

```bash
python run_pipeline.py --config p34/config.yaml
```

## Setup and Imports

In [None]:
import pandas as pd
import numpy as np
import yaml
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

## Load Configuration

In [None]:
# Load configuration
with open('p34/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("Configuration:")
for key, value in config.items():
    print(f"  {key}: {value}")
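
Later cells read `train_split_path` and `save_path` through `config.get(...)`, so a missing key silently falls back to a default. The sketch below is one way to surface those fallbacks early. The helper `report_fallbacks` and the `example_config` contents are hypothetical and used only for illustration; the two keys and their defaults are the ones this notebook uses.

```python
# Keys this notebook reads via config.get(...), with the fallback values it uses.
NOTEBOOK_DEFAULTS = {
    'train_split_path': '/workspace/oibc/data/train.csv',
    'save_path': './output',
}


def report_fallbacks(config, defaults=NOTEBOOK_DEFAULTS):
    """Return {key: default} for every expected key that is absent from config."""
    return {key: default for key, default in defaults.items() if key not in config}


# Hypothetical config contents, standing in for the loaded YAML.
example_config = {'save_path': './runs/exp1'}

for key, default in report_fallbacks(example_config).items():
    print(f"'{key}' missing from config; the notebook falls back to {default!r}")
```

Calling `report_fallbacks(config)` on the real configuration right after loading it would show which paths come from the file and which come from the notebook's defaults.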

## Step 1: Data Exploration

Let's explore the training data to understand its structure.

In [None]:
# Load training data (update path as needed)
train_path = config.get('train_split_path', '/workspace/oibc/data/train.csv')

# Read only the first 10,000 rows to keep exploration fast (not a random sample)
df = pd.read_csv(train_path, nrows=10000, parse_dates=['time'])

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print("\nFirst few rows:")
df.head()

In [None]:
# Data statistics
print("Data types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nBasic statistics:")
df.describe()

## Step 2: Visualize PV Locations

Visualize the geographic distribution of PV (photovoltaic) systems.

In [None]:
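# Extract unique PV locations from the loaded rows.
# .copy() makes this an independent frame, so adding columns to it later is safe.
pv_locations = df[['pv_id', 'coord1', 'coord2']].drop_duplicates().copy()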

plt.figure(figsize=(10, 8))
plt.scatter(pv_locations['coord1'], pv_locations['coord2'], alpha=0.5)
plt.xlabel('Coordinate 1 (Longitude)')
plt.ylabel('Coordinate 2 (Latitude)')
plt.title(f'PV System Locations (n={len(pv_locations)})')
plt.grid(True, alpha=0.3)
plt.show()

## Step 3: Clustering Visualization

If the clustering step has already been run and saved a fitted model to `train_cluster_model.joblib`, load it and visualize the cluster assignments.

In [None]:
# Load cluster model if available
cluster_model_path = 'train_cluster_model.joblib'

if Path(cluster_model_path).exists():
    kmeans = joblib.load(cluster_model_path)
    
    # Predict clusters for PV locations
    pv_locations['cluster'] = kmeans.predict(pv_locations[['coord1', 'coord2']])
    
    # Plot clusters
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(
        pv_locations['coord1'], 
        pv_locations['coord2'],
        c=pv_locations['cluster'],
        cmap='tab10',
        alpha=0.6,
        s=50
    )
    
    # Plot cluster centers
    centers = kmeans.cluster_centers_
    plt.scatter(
        centers[:, 0],
        centers[:, 1],
        c='red',
        marker='X',
        s=200,
        edgecolors='black',
        label='Cluster Centers'
    )
    
    plt.xlabel('Coordinate 1 (Longitude)')
    plt.ylabel('Coordinate 2 (Latitude)')
    plt.title(f'PV Clustering ({kmeans.n_clusters} clusters)')
    plt.colorbar(scatter, label='Cluster ID')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    
    # Cluster size distribution
    plt.figure(figsize=(10, 6))
    pv_locations['cluster'].value_counts().sort_index().plot(kind='bar')
    plt.xlabel('Cluster ID')
    plt.ylabel('Number of PV Systems')
    plt.title('PV Systems per Cluster')
    plt.grid(True, alpha=0.3, axis='y')
    plt.show()
else:
    print("Cluster model not found. Run clustering step first.")
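
For readers without a saved model, the sketch below shows how a file of this kind could be produced. It assumes the model is a scikit-learn `KMeans` fitted on the two coordinate columns, which matches the attributes used above (`predict`, `cluster_centers_`, `n_clusters`) but is not confirmed by the pipeline scripts. The coordinates are synthetic, and the cluster count and file name are arbitrary choices for the example.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for PV coordinates: three well-separated groups of 20 sites.
rng = np.random.default_rng(0)
group_centers = np.array([[126.9, 37.5], [129.0, 35.1], [127.4, 36.3]])
coords = np.vstack([c + rng.normal(scale=0.05, size=(20, 2)) for c in group_centers])
toy_locations = pd.DataFrame(coords, columns=['coord1', 'coord2'])
toy_locations.insert(0, 'pv_id', range(len(toy_locations)))

# Fit K-Means on the coordinates only (n_clusters=3 is an assumption for this toy data).
toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy_kmeans.fit(toy_locations[['coord1', 'coord2']])

# Round-trip through joblib, mirroring how the cell above loads its model.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_model_path = Path(tmp_dir) / 'toy_cluster_model.joblib'
    joblib.dump(toy_kmeans, toy_model_path)
    reloaded = joblib.load(toy_model_path)

toy_locations['cluster'] = reloaded.predict(toy_locations[['coord1', 'coord2']])
print(toy_locations['cluster'].value_counts().sort_index())
```

Fitting on a DataFrame and predicting on a DataFrame with the same column names avoids scikit-learn's feature-name mismatch warning.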

## Step 4: Run Individual Pipeline Steps

Each step can also be run on its own. The commands below are commented out; uncomment a line to execute it.

In [None]:
# Run data splitting
# !python data_split/split.py
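
The strategy used by `data_split/split.py` is not shown in this notebook. As an illustration only, a chronological split on the `time` column is a common choice for time-indexed data because it keeps validation rows strictly later than training rows. The function name, the `valid_fraction` parameter, and the toy frame are all hypothetical.

```python
import pandas as pd


def chronological_split(frame, time_col='time', valid_fraction=0.2):
    """Split so that every validation row is at or after the cutoff timestamp."""
    ordered = frame.sort_values(time_col)
    n_valid = max(1, round(len(ordered) * valid_fraction))
    cutoff_time = ordered[time_col].iloc[len(ordered) - n_valid]
    # Split on the timestamp itself so rows sharing a timestamp stay together.
    train = ordered[ordered[time_col] < cutoff_time]
    valid = ordered[ordered[time_col] >= cutoff_time]
    return train, valid


# Toy data: ten daily rows for two hypothetical PV systems.
toy = pd.DataFrame({
    'time': pd.date_range('2024-01-01', periods=10, freq='D'),
    'pv_id': [0, 1] * 5,
    'value': range(10),
})
toy_train, toy_valid = chronological_split(toy)
print(len(toy_train), len(toy_valid))
```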

In [None]:
# Run clustering
# !python cluster_code/add_cluster_efficient.py
# !python cluster_code/add_cluster_from_train.py
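
Once each row carries a cluster label, cluster-based features can be derived from it. The exact features the pipeline generates are not listed here, so the following is only a sketch of one plausible pattern: a per-timestamp cluster mean and each site's deviation from it. The `irradiance` column and the derived column names are hypothetical.

```python
import pandas as pd

# Toy frame: four PV systems in two clusters, observed at two timestamps.
toy = pd.DataFrame({
    'time': pd.to_datetime(['2024-01-01 12:00'] * 4 + ['2024-01-01 13:00'] * 4),
    'pv_id': [0, 1, 2, 3] * 2,
    'cluster': [0, 0, 1, 1] * 2,
    'irradiance': [500.0, 520.0, 300.0, 340.0, 600.0, 640.0, 200.0, 260.0],
})

# transform('mean') returns one value per original row, aligned to the input index.
grouped = toy.groupby(['time', 'cluster'])['irradiance']
toy['cluster_mean_irradiance'] = grouped.transform('mean')
toy['irradiance_minus_cluster_mean'] = toy['irradiance'] - toy['cluster_mean_irradiance']

print(toy)
```

Using `transform` instead of `agg` followed by a merge keeps the row count unchanged, which makes it harder to duplicate or drop rows by accident.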

In [None]:
# Run model training
# !cd p34 && python main.py

## Step 5: Analyze Model Results

After training, analyze the model performance and predictions.

In [None]:
# Load predictions if available
save_path = Path(config.get('save_path', './output'))
# glob() yields files in arbitrary order; sort so "the first file" is reproducible
prediction_files = sorted(save_path.glob('predictions_*.csv'))

if prediction_files:
    print("Available prediction files:")
    for i, pred_file in enumerate(prediction_files, 1):
        print(f"  {i}. {pred_file.name}")
    
    # Load the first prediction file
    predictions = pd.read_csv(prediction_files[0])
    print(f"\nLoaded predictions from: {prediction_files[0].name}")
    print(f"Shape: {predictions.shape}")
    # A bare expression inside an `if` block is not rendered, so display it explicitly
    display(predictions.head())
else:
    print("No prediction files found. Run the inference step first.")

In [None]:
# Visualize prediction distribution (assumes the last column holds the predicted values)
if prediction_files:
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.hist(predictions.iloc[:, -1], bins=50, edgecolor='black')
    plt.xlabel('Predicted Value')
    plt.ylabel('Frequency')
    plt.title('Distribution of Predictions')
    plt.grid(True, alpha=0.3, axis='y')
    
    plt.subplot(1, 2, 2)
    predictions.iloc[:, -1].plot(kind='box')
    plt.ylabel('Predicted Value')
    plt.title('Box Plot of Predictions')
    plt.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print("\nPrediction statistics:")
    print(predictions.iloc[:, -1].describe())
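
Beyond the summary statistics, a few quick checks can catch malformed output before submission. The sketch below counts missing and negative values. Whether negative predictions are actually invalid depends on how the target is defined, which this notebook does not state, so treat that check as an assumption. The helper name and toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def sanity_check(values):
    """Count rows, missing values, and negative values in a prediction column."""
    values = pd.Series(values, dtype='float64')
    return {
        'n_rows': int(len(values)),
        'n_missing': int(values.isna().sum()),
        'n_negative': int((values < 0).sum()),  # NaN compares as False here
    }


# Toy predictions containing one missing and one negative value.
toy_predictions = pd.Series([0.0, 1.5, np.nan, -0.2, 3.1])
print(sanity_check(toy_predictions))
```

On real output, the same helper could be applied to `predictions.iloc[:, -1]`.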

## Step 6: Run Complete Pipeline

Run the entire pipeline using the orchestration script:

In [None]:
# Run complete pipeline (uncomment to execute)
# !python run_pipeline.py --config p34/config.yaml

In [None]:
# Run only training and inference (skip data prep)
# !python run_pipeline.py --config p34/config.yaml --skip-split --skip-cluster

In [None]:
# Run a specific step
# !python run_pipeline.py --config p34/config.yaml --step train
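
When the pipeline needs to be launched from Python code instead of a `!` shell escape, the same commands can be assembled as an argument list. This sketch only builds the list and does not run anything. It uses only the flags that appear in the cells above; how `run_pipeline.py` interprets them is not verified here, and `build_pipeline_command` is a hypothetical helper.

```python
import sys


def build_pipeline_command(config_path, skip_split=False, skip_cluster=False, step=None):
    """Assemble the run_pipeline.py invocation as a list of arguments."""
    command = [sys.executable, 'run_pipeline.py', '--config', config_path]
    if skip_split:
        command.append('--skip-split')
    if skip_cluster:
        command.append('--skip-cluster')
    if step is not None:
        command += ['--step', step]
    return command


# Show the arguments that follow the interpreter path.
training_only = build_pipeline_command('p34/config.yaml', skip_split=True, skip_cluster=True)
print(' '.join(training_only[1:]))
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which raises an exception if the script exits with a non-zero status.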

## Summary

This notebook demonstrates:
1. Data exploration and visualization
2. PV location clustering visualization
3. Running individual pipeline steps
4. Analyzing model predictions
5. Using the orchestration script for complete pipeline execution

For production runs, use the `run_pipeline.py` script for automated execution.