# Cluster 15 Cleaning & Feature Engineering

This notebook cleans and standardizes CAISO **Cluster 15** interconnection request data.

Goals:
- Load Cluster 15 active + withdrawn sheets
- Standardize column names for analysis
- Engineer capacity (`net_mw`) and time-to-withdrawal features
- Output `data/processed/cluster15_clean.csv`


## 1) Imports

Keep imports explicit and minimal.

In [7]:
import sys
from pathlib import Path
import pandas as pd

print('Python:', sys.version)
print('Executable:', sys.executable)

Python: 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
Executable: C:\Users\danci\Interconnection-Queue-Intelligence\.venv\Scripts\python.exe


## 2) Paths (robust to running from `notebooks/`)

Jupyter may run with the working directory set to `notebooks/`.
This block detects that and moves `ROOT` up to the repository root.


In [8]:
ROOT = Path.cwd()
if ROOT.name == 'notebooks':
    ROOT = ROOT.parent

RAW = ROOT / 'data' / 'raw'
PROCESSED = ROOT / 'data' / 'processed'
OUTPUTS = ROOT / 'outputs'

PROCESSED.mkdir(parents=True, exist_ok=True)
OUTPUTS.mkdir(parents=True, exist_ok=True)

print('ROOT:', ROOT)
print('RAW files:', [p.name for p in RAW.glob('*')])
print('PROCESSED exists:', PROCESSED.exists(), PROCESSED)

ROOT: C:\Users\danci\Interconnection-Queue-Intelligence
RAW files: ['cluster-15-interconnection-requests.xlsx', 'publicqueuereport.xlsx']
PROCESSED exists: True C:\Users\danci\Interconnection-Queue-Intelligence\data\processed
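
The `notebooks` name check above covers the common case of launching Jupyter from `notebooks/`. As a hedged alternative for deeper working directories, the sketch below walks up from a starting directory until it finds a folder containing a marker entry. The helper name `find_repo_root` and the choice of `data` as the marker are assumptions for illustration, not part of this notebook.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "data") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Hypothetical helper: falls back to `start` itself when no ancestor matches.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return start


ROOT = find_repo_root(Path.cwd())
print("ROOT:", ROOT)
```

The fallback to the starting directory keeps the behavior close to the original block when no marker is found.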


## 3) Load Cluster 15 sheets

The Cluster 15 workbook has:
- `Cluster 15 ` (active/ongoing; note the trailing space in the sheet name)
- `Withdrawn` (withdrawn requests)

We load both and add an explicit `status` label.


In [9]:
CLUSTER15_PATH = RAW / 'cluster-15-interconnection-requests.xlsx'

cluster15_active = pd.read_excel(CLUSTER15_PATH, sheet_name='Cluster 15 ')
cluster15_withdrawn = pd.read_excel(CLUSTER15_PATH, sheet_name='Withdrawn')

cluster15_active['status'] = 'active'
cluster15_withdrawn['status'] = 'withdrawn'

cluster15_all = pd.concat([cluster15_active, cluster15_withdrawn], ignore_index=True)

cluster15_active.shape, cluster15_withdrawn.shape, cluster15_all.shape

((108, 21), (62, 22), (170, 24))

### Inspect columns (once)

We print columns to confirm exact names before selecting and renaming.

In [10]:
for col in cluster15_all.columns:
    print(col)

Queue Number
Project Number
Project Name
Generation/Fuel 1
NET MW 1
Generation/Fuel 2
NET MW 2
Generation/Fuel 3
NET MW 3
NET MW POI
PROJECT COUNTY
Project State
Study Area
PTO
POI
Voltage kV
Requested COD
Queue Date 
Application Date
Service Type
status
Queue Date
Application Date 
Withdrawal Date


## 4) Select core columns

We keep submission-time attributes useful for survivability analysis.

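Note: Column names can vary slightly between downloads and between sheets. In the listing above, the active sheet uses `Queue Date ` (trailing space) and `Application Date`, while the withdrawn sheet uses `Queue Date` and `Application Date ` (trailing space). The rename map selects the unpadded names, so `queue_date` is populated only for withdrawn rows and `application_date` only for active rows.
We use a dictionary-based rename and raise an error early if any required column is missing.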


In [13]:
# --- Required columns for Cluster 15 analysis (based on what exists in your file) ---
RENAME_MAP = {
    "Queue Number": "queue_number",
    "Project Number": "project_number",
    "Project Name": "project_name",
    "Generation/Fuel 1": "fuel_primary",
    "Generation/Fuel 2": "fuel_secondary",
    "NET MW 1": "mw_1",
    "NET MW 2": "mw_2",
    "Service Type": "service_type",
    "Queue Date": "queue_date",
    "Application Date": "application_date",
    "Withdrawal Date": "withdrawal_date",
    "status": "status",
}

required = list(RENAME_MAP.keys())
missing = [c for c in required if c not in cluster15_all.columns]
if missing:
    raise KeyError(f"Missing required columns in Cluster 15 file: {missing}")

cluster15 = cluster15_all[required].rename(columns=RENAME_MAP).copy()
cluster15.shape

(170, 12)

## 5) Clean numeric fields + engineer capacity features

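The Cluster 15 workbook splits capacity for hybrid projects across `NET MW 1` and `NET MW 2` (renamed to `mw_1` and `mw_2`; `NET MW 3` is not included).
We compute:
- `net_mw` = `mw_1` + `mw_2`, left as NaN only when both components are missing
- `days_queue_to_withdrawal` and `days_app_to_withdrawal` = days from the queue or application date to the withdrawal date

The selected columns include no MWh field, so `storage_duration_hours` (MWh / net_mw) is not computed here.
In the summary output, `days_app_to_withdrawal` has a count of 0 because `application_date` is populated only for active rows, which have no withdrawal date.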


In [14]:
# Numeric MW
cluster15["mw_1"] = pd.to_numeric(cluster15["mw_1"], errors="coerce")
cluster15["mw_2"] = pd.to_numeric(cluster15["mw_2"], errors="coerce")
cluster15["net_mw"] = cluster15[["mw_1", "mw_2"]].sum(axis=1, min_count=1)

# Parse dates (some may be missing for active projects)
for c in ["queue_date", "application_date", "withdrawal_date"]:
    cluster15[c] = pd.to_datetime(cluster15[c], errors="coerce")

# Time-to-withdraw proxies (in days)
cluster15["days_queue_to_withdrawal"] = (cluster15["withdrawal_date"] - cluster15["queue_date"]).dt.days
cluster15["days_app_to_withdrawal"] = (cluster15["withdrawal_date"] - cluster15["application_date"]).dt.days

cluster15[["net_mw", "days_queue_to_withdrawal", "days_app_to_withdrawal"]].describe()

| | net_mw | days_queue_to_withdrawal | days_app_to_withdrawal |
|---|---|---|---|
| count | 170.0 | 62.0 | 0.0 |
| mean | 495.695573 | 208.693548 | NaN |
| std | 464.711599 | 101.962962 | NaN |
| min | 10.072 | 70.0 | NaN |
| 25% | 204.9075 | 93.25 | NaN |
| 50% | 365.542 | 283.5 | NaN |
| 75% | 558.482333 | 294.0 | NaN |
| max | 2346.96 | 303.0 | NaN |
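
To illustrate two details of the cell above, the toy example below shows how `min_count=1` changes the row-wise sum when every component is missing, and how `.dt.days` propagates a missing date. The frames and values are hypothetical and chosen only to make the behavior visible.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mw_1": [100.0, 50.0, np.nan],
    "mw_2": [25.0, np.nan, np.nan],
})

# The default sum turns an all-NaN row into 0.0, which would look like a real 0 MW project.
toy["sum_default"] = toy[["mw_1", "mw_2"]].sum(axis=1)

# min_count=1 requires at least one non-missing component; otherwise the result is NaN.
toy["net_mw"] = toy[["mw_1", "mw_2"]].sum(axis=1, min_count=1)
print(toy)

dates = pd.DataFrame({
    "queue_date": pd.to_datetime(["2024-01-01", None]),
    "withdrawal_date": pd.to_datetime(["2024-03-11", "2024-05-01"]),
})

# Subtracting datetimes gives a timedelta; a missing date on either side yields NaN days.
dates["days_queue_to_withdrawal"] = (dates["withdrawal_date"] - dates["queue_date"]).dt.days
print(dates)
```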


## 6) Technology categories (aligned to Public Queue)

We map fuel fields into a small, interpretable set of technology categories.
This mirrors the approach used in `02_clean_public_queue.ipynb`.


In [15]:
def normalize_fuel(x):
    if pd.isna(x):
        return ''
    return str(x).strip().lower()

def infer_technology(row):
    fuels = ' '.join([
        normalize_fuel(row.get('fuel_primary')),
        normalize_fuel(row.get('fuel_secondary')),
    ])

    has_solar = ('solar' in fuels) or ('photovoltaic' in fuels) or ('pv' in fuels)
    has_storage = ('storage' in fuels) or ('battery' in fuels)
    has_wind = ('wind' in fuels)

    if has_solar and has_storage:
        return 'hybrid_solar_storage'
    if has_storage:
        return 'storage'
    if has_solar:
        return 'solar'
    if has_wind:
        return 'wind'
    return 'other'

cluster15['technology'] = cluster15.apply(infer_technology, axis=1)
cluster15['technology'].value_counts(dropna=False)

technology
storage                 88
hybrid_solar_storage    56
solar                   23
wind                     2
other                    1
Name: count, dtype: int64
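
As a quick check of the mapping rules, the sketch below repeats the two functions from the cell above and applies them to a handful of hypothetical fuel strings. These strings are assumptions for illustration; the actual values in the workbook may be spelled differently.

```python
import pandas as pd


def normalize_fuel(x):
    if pd.isna(x):
        return ''
    return str(x).strip().lower()


def infer_technology(row):
    fuels = ' '.join([
        normalize_fuel(row.get('fuel_primary')),
        normalize_fuel(row.get('fuel_secondary')),
    ])
    has_solar = ('solar' in fuels) or ('photovoltaic' in fuels) or ('pv' in fuels)
    has_storage = ('storage' in fuels) or ('battery' in fuels)
    has_wind = ('wind' in fuels)

    # Order matters: the combined solar + storage case is checked before either alone.
    if has_solar and has_storage:
        return 'hybrid_solar_storage'
    if has_storage:
        return 'storage'
    if has_solar:
        return 'solar'
    if has_wind:
        return 'wind'
    return 'other'


# Hypothetical fuel labels, not taken from the Cluster 15 workbook.
examples = pd.DataFrame({
    'fuel_primary': ['Solar', 'Battery Storage', 'Wind Turbine', 'Photovoltaic', None],
    'fuel_secondary': ['Storage', None, None, None, None],
})
examples['technology'] = examples.apply(infer_technology, axis=1)
print(examples)
```

Because the checks are plain substring matches, any label that happens to contain `pv` is treated as solar, and a wind project paired with storage is labeled `storage`. Whether that matters depends on the fuel labels actually present in the data.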

## 7) Outcome flags + quick sanity checks

In [19]:
# Outcome flags
cluster15['is_withdrawn'] = (cluster15['status'] == 'withdrawn').astype(int)
cluster15['is_active'] = (cluster15['status'] == 'active').astype(int)

# Quick status check
print(cluster15['status'].value_counts(dropna=False))

# Show only columns that actually exist
cols_to_show = [
    'project_name',
    'status',
    'technology',
    'net_mw',
    'queue_date',
    'application_date',
    'withdrawal_date',
    'days_queue_to_withdrawal',
    'days_app_to_withdrawal',
    # future-proof (will be ignored if not present)
    'mwh',
    'storage_duration_hours',
]

cols_to_show = [c for c in cols_to_show if c in cluster15.columns]
cluster15[cols_to_show].head()

status
active       108
withdrawn     62
Name: count, dtype: int64


| | project_name | status | technology | net_mw | queue_date | application_date | withdrawal_date | days_queue_to_withdrawal | days_app_to_withdrawal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alisa Solar Energy Complex 2 | active | hybrid_solar_storage | 1000.0 | NaT | 2024-11-18 | NaT | NaN | NaN |
| 1 | Amanece | active | hybrid_solar_storage | 835.537811 | NaT | 2024-11-21 | NaT | NaN | NaN |
| 2 | Ambar Energy Storage | active | storage | 504.9 | NaT | 2024-11-21 | NaT | NaN | NaN |
| 3 | Annapurna | active | storage | 257.0 | NaT | 2024-11-20 | NaT | NaN | NaN |
| 4 | Antlia | active | storage | 204.859 | NaT | 2024-11-19 | NaT | NaN | NaN |
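
Beyond eyeballing the first rows, a couple of assertions can make the sanity checks explicit. The sketch below uses a small hypothetical frame with the same column names. The specific checks (flags are mutually exclusive, active rows carry no withdrawal date) are suggestions rather than checks this notebook already runs.

```python
import pandas as pd

# Hypothetical rows that mimic the cleaned frame's relevant columns.
toy = pd.DataFrame({
    "status": ["active", "withdrawn", "active"],
    "withdrawal_date": pd.to_datetime([None, "2025-03-01", None]),
})
toy["is_withdrawn"] = (toy["status"] == "withdrawn").astype(int)
toy["is_active"] = (toy["status"] == "active").astype(int)

# Every row should carry exactly one of the two flags.
assert (toy["is_withdrawn"] + toy["is_active"]).eq(1).all()

# Active rows should not have a withdrawal date.
active_with_date = toy.loc[toy["is_active"] == 1, "withdrawal_date"].notna().sum()
assert active_with_date == 0

# Share of withdrawn rows in the toy frame.
print(toy["is_withdrawn"].mean())
```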


## 8) Save cleaned dataset

This is the analysis-ready Cluster 15 cohort dataset.

In [20]:
out_path = PROCESSED / 'cluster15_clean.csv'
cluster15.to_csv(out_path, index=False)
out_path

WindowsPath('C:/Users/danci/Interconnection-Queue-Intelligence/data/processed/cluster15_clean.csv')
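
CSV stores dates as text, so downstream notebooks need to parse them again when loading the file. The sketch below shows one way to do that with `parse_dates`, using a toy frame and a temporary file rather than the real `cluster15_clean.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with made-up values, including one missing date.
toy = pd.DataFrame({
    "project_name": ["Example A", "Example B"],
    "queue_date": pd.to_datetime(["2024-11-18", None]),
    "net_mw": [100.0, 250.5],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_clean.csv"
    toy.to_csv(out_path, index=False)
    # Ask pandas to convert the date column back to datetimes on load.
    reloaded = pd.read_csv(out_path, parse_dates=["queue_date"])

print(reloaded.dtypes)
```

For the real file, the same call would list `queue_date`, `application_date`, and `withdrawal_date` in `parse_dates`.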

## Summary

In this notebook we:
- Loaded Cluster 15 active + withdrawn sheets
- Selected and standardized key columns
- Engineered capacity (`net_mw`) and time-to-withdrawal features
- Generated technology categories aligned with the public queue
- Saved `cluster15_clean.csv`

Next: `04_survivability_insights.ipynb` to generate plots and conclusions.
