# CAISO Interconnection Queue – Data Inventory

This notebook loads and inspects the two raw input datasets used in this project:

- `publicqueuereport.xlsx` (system-wide snapshot)
- `cluster-15-interconnection-requests.xlsx` (Cluster 15 cohort)

Goals:
1. Confirm file structure and column availability
2. Handle layout quirks (notably the Public Queue Report header rows)
3. Save quick parsed CSVs for downstream notebooks


## 1) Environment + Imports

We keep imports small and explicit for readability and reproducibility.

In [1]:
import sys
from pathlib import Path
import re
import pandas as pd

print('Python:', sys.version)
print('Executable:', sys.executable)

Python: 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
Executable: C:\Users\danci\Interconnection-Queue-Intelligence\.venv\Scripts\python.exe


## 2) Project paths (robust to running from `notebooks/`)

Jupyter sometimes runs with the working directory set to `notebooks/`.
This block detects that case and moves `ROOT` up to the repository root.


In [2]:
ROOT = Path.cwd()
if ROOT.name == 'notebooks':
    ROOT = ROOT.parent

RAW = ROOT / 'data' / 'raw'
PROCESSED = ROOT / 'data' / 'processed'
PROCESSED.mkdir(parents=True, exist_ok=True)

PUBLIC_QUEUE_PATH = RAW / 'publicqueuereport.xlsx'
CLUSTER15_PATH = RAW / 'cluster-15-interconnection-requests.xlsx'

print('ROOT:', ROOT)
print('RAW exists:', RAW.exists(), RAW)
print('RAW files:', [p.name for p in RAW.glob('*')])
print('PROCESSED exists:', PROCESSED.exists(), PROCESSED)

ROOT: C:\Users\danci\Interconnection-Queue-Intelligence
RAW exists: True C:\Users\danci\Interconnection-Queue-Intelligence\data\raw
RAW files: ['cluster-15-interconnection-requests.xlsx', 'publicqueuereport.xlsx']
PROCESSED exists: True C:\Users\danci\Interconnection-Queue-Intelligence\data\processed
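
The `notebooks/` check above covers the common case. If the notebook is ever run from a deeper folder, one possible alternative is to walk up the directory tree until a marker directory is found. The sketch below is only an illustration: `find_repo_root` is a hypothetical helper, and using a top-level `data/` directory as the marker is an assumption about this repository's layout.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "data") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    `marker` is an assumed convention (a top-level `data/` directory).
    Falls back to `start` itself when no ancestor matches.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return start


# Hypothetical usage in place of the `notebooks` name check:
# ROOT = find_repo_root(Path.cwd())
```

Compared with the name check, this does not depend on the folder being called `notebooks`, but it can pick the wrong directory if an unrelated `data/` folder sits closer to the working directory.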


## 3) Public Queue Report parsing helpers

The CAISO Public Queue Report does **not** place column headers on the first row.
Instead, headers appear a few rows down.

We:
1. Load the sheet without headers
2. Find the first row whose first column contains `Project Name`
3. Use that row as the header and keep only the rows below it


In [3]:
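def find_header_row(df: pd.DataFrame, first_col_contains: list[str]) -> int:
    """Return the position of the first row whose first cell contains any key (case-insensitive)."""
    # Fill NaN before casting; astype(str) alone turns NaN into the string 'nan'.
    col0 = df.iloc[:, 0].fillna('').astype(str)
    keys = [key.lower() for key in first_col_contains]
    for i, val in enumerate(col0):
        if any(key in val.lower() for key in keys):
            return i
    raise ValueError(f'Header row not found; looked for {first_col_contains} in the first column')

def read_public_queue_sheet(path: Path, sheet_name: str) -> pd.DataFrame:
    """Read one Public Queue Report sheet and promote the detected header row to column names."""
    raw = pd.read_excel(path, sheet_name=sheet_name, header=None)
    header_row = find_header_row(raw, first_col_contains=['Project Name', 'Project Name - Confidential'])
    # Keep only the data rows below the header.
    df = raw.iloc[header_row + 1:].copy()
    # Strip leading/trailing whitespace from header labels (embedded newlines are kept).
    df.columns = [str(c).strip() for c in raw.iloc[header_row]]
    # Drop rows that are entirely empty.
    df = df.dropna(how='all')
    return df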
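
To see the header detection without opening the Excel file, the same three steps can be tried on a small in-memory frame. This is a minimal sketch: the rows below are made-up placeholders, not CAISO data, and the banner text is an assumption about what a pre-header row might look like.

```python
import pandas as pd

# Toy stand-in for a sheet read with header=None: two banner rows, then headers.
raw = pd.DataFrame([
    ["Example Report Title", None, None],
    [None, None, None],
    ["Project Name", "Queue Position", "Net MWs to Grid"],
    ["EXAMPLE WIND", 1, 100.0],
    ["EXAMPLE SOLAR", 2, 50.0],
])

# Step 2: locate the first row whose first cell contains the key (case-insensitive).
first_col = raw.iloc[:, 0].fillna("").astype(str)
matches = first_col.str.contains("project name", case=False, regex=False)
if not matches.any():
    raise ValueError("Header row not found")
header_row = int(matches.idxmax())  # label of the first True; equals the position here

# Step 3: promote that row to column names and keep only the rows below it.
parsed = raw.iloc[header_row + 1:].copy()
parsed.columns = [str(c).strip() for c in raw.iloc[header_row]]
parsed = parsed.dropna(how="all").reset_index(drop=True)

print(header_row)            # 2
print(list(parsed.columns))  # ['Project Name', 'Queue Position', 'Net MWs to Grid']
print(parsed.shape)          # (2, 3)
```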

## 4) Load Public Queue Report (3 sheets) and combine

We load:
- Active projects
- Completed projects
- Withdrawn projects

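Then we add a simple `status` label and concatenate the results.
The sheets do not share an identical column set, so the combined frame holds the union of their columns, with `NaN` wherever a sheet lacks a column.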

In [4]:
public_active = read_public_queue_sheet(PUBLIC_QUEUE_PATH, 'Grid GenerationQueue')
public_completed = read_public_queue_sheet(PUBLIC_QUEUE_PATH, 'Completed Generation Projects')
public_withdrawn = read_public_queue_sheet(PUBLIC_QUEUE_PATH, 'Withdrawn Generation Projects')

public_active['status'] = 'active'
public_completed['status'] = 'completed'
public_withdrawn['status'] = 'withdrawn'

public_all = pd.concat([public_active, public_completed, public_withdrawn], ignore_index=True)
public_all.shape

(2285, 38)
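
A small illustration of how `pd.concat` treats frames whose columns differ may help when reading the combined shape. The frames below are toy placeholders with made-up rows; which sheet carries which column is an assumption made only for the example.

```python
import pandas as pd

# Toy frames: the second one has columns the first one lacks.
active = pd.DataFrame({"Project Name": ["A"], "Queue Position": [1]})
withdrawn = pd.DataFrame({
    "Project Name - Confidential": ["B"],
    "Queue Position": [2],
    "Withdrawn Date": ["2020-01-01"],
})

active["status"] = "active"
withdrawn["status"] = "withdrawn"

combined = pd.concat([active, withdrawn], ignore_index=True)

# Columns are the union, in order of first appearance; gaps are filled with NaN.
print(list(combined.columns))
# Rows per status label.
print(combined["status"].value_counts().to_dict())
# Share of missing values per column, a quick way to spot sheet-specific columns.
print(combined.isna().mean().to_dict())
```

Because `status` is added before concatenation, it lands ahead of any column that first appears in a later frame, which matches its position in the real column listing.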

### Inspect Public Queue columns and a few rows

In [5]:
public_all.columns

Index(['Project Name', 'Queue Position',
       'Interconnection Request\nReceive Date', 'Queue Date',
       'Application Status', 'Study\nProcess', 'Type-1', 'Type-2', 'Type-3',
       'Fuel-1', 'Fuel-2', 'Fuel-3', 'MW-1', 'MW-2', 'MW-3', 'Net MWs to Grid',
       'Full Capacity, Partial or Energy Only (FC/P/EO)',
       'TPD Allocation Percentage',
       'Off-Peak Deliverability and Economic Only', 'TPD Allocation Group',
       'County', 'State', 'Utility', 'PTO Study Region',
       'Station or Transmission Line',
       'Proposed\nOn-line Date\n(as filed with IR)', 'Current\nOn-line Date',
       'Suspension Status', 'Feasibility Study or Supplemental Review',
       'System Impact Study or \nPhase I Cluster Study',
       'Facilities Study (FAS) or \nPhase II Cluster Study',
       'Optional Study\n(OS)', 'Interconnection Agreement \nStatus', 'status',
       'Actual\nOn-line Date', 'Project Name - Confidential', 'Withdrawn Date',
       'Reason for Withdrawal'],
      dtype='object')
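
Several column names carry embedded newlines (for example `'Study\nProcess'`), which makes them awkward to type. A later cleaning notebook might flatten them; the sketch below shows one way to do that. `normalize_column` is a hypothetical helper, not part of this notebook, and the parsing code above deliberately leaves the names untouched.

```python
import re


def normalize_column(name: str) -> str:
    """Collapse any whitespace run (including newlines) to one space and trim the ends."""
    return re.sub(r"\s+", " ", str(name)).strip()


examples = [
    "Interconnection Request\nReceive Date",
    "System Impact Study or \nPhase I Cluster Study",
    "Queue Date ",
]
print([normalize_column(c) for c in examples])
# ['Interconnection Request Receive Date',
#  'System Impact Study or Phase I Cluster Study',
#  'Queue Date']

# Hypothetical usage on a parsed frame:
# public_all.columns = [normalize_column(c) for c in public_all.columns]
```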

In [6]:
public_all.head()

Unnamed: 0,Project Name,Queue Position,Interconnection Request\nReceive Date,Queue Date,Application Status,Study\nProcess,Type-1,Type-2,Type-3,Fuel-1,...,Feasibility Study or Supplemental Review,System Impact Study or \nPhase I Cluster Study,Facilities Study (FAS) or \nPhase II Cluster Study,Optional Study\n(OS),Interconnection Agreement \nStatus,status,Actual\nOn-line Date,Project Name - Confidential,Withdrawn Date,Reason for Withdrawal
0,MONTEZUMA (HIGH WINDS III),22,2003-11-18 00:00:00,2003-11-18 08:00:00,ACTIVE,AMEND 39,Wind Turbine,Storage,,Wind Turbine,...,,Complete,Complete,,Executed,active,,,,
1,TULE WIND,32,2004-05-12 00:00:00,2004-05-24 07:00:00,ACTIVE,Serial LGIP,Wind Turbine,Storage,,Wind Turbine,...,Waived,Complete,Complete,,Executed,active,,,,
2,MIDWAY PEAKING,54,2005-01-12 00:00:00,2005-01-12 08:00:00,ACTIVE,Serial LGIP,Gas Turbine,Storage,,Natural Gas,...,Waived,Complete,Re-Study,,Executed,active,,,,
3,FRESNO COGENERATION EXPANSION PROJECT,61,2005-03-28 00:00:00,2005-03-30 08:00:00,ACTIVE,AMEND 39,Steam Turbine,Storage,,Natural Gas,...,,Complete,Complete,,Executed,active,,,,
4,LAKE ELSINORE ADVANCED PUMPED STORAGE PROJECT,72,2005-04-26 00:00:00,2005-06-21 07:00:00,ACTIVE,Serial LGIP,Storage,,,Pumped-Storage hydro,...,Waived,Complete,Re-Study,,Executed,active,,,,


## 5) Load Cluster 15 cohort sheets

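The Cluster 15 file is well structured, with headers on the first row, so `pd.read_excel` can load it directly. It includes two sheets:
- `Cluster 15 ` (active/ongoing; note the trailing space in the sheet name)
- `Withdrawn` (withdrawn from Cluster 15)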


In [7]:
cluster15_active = pd.read_excel(CLUSTER15_PATH, sheet_name='Cluster 15 ')
cluster15_withdrawn = pd.read_excel(CLUSTER15_PATH, sheet_name='Withdrawn')

cluster15_active.shape, cluster15_withdrawn.shape

((108, 20), (62, 21))
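
The two sheets differ by one column (20 versus 21). A quick way to see which names are not shared is to compare the stripped column labels. The sketch below uses empty toy frames; `compare_columns` is a hypothetical helper and `Example Extra Column` is a placeholder, not the real name of the additional column.

```python
import pandas as pd


def compare_columns(left: pd.DataFrame, right: pd.DataFrame) -> dict[str, list[str]]:
    """List column names found in only one frame, ignoring surrounding whitespace."""
    left_cols = [str(c).strip() for c in left.columns]
    right_cols = [str(c).strip() for c in right.columns]
    return {
        "only_left": [c for c in left_cols if c not in right_cols],
        "only_right": [c for c in right_cols if c not in left_cols],
    }


# Toy frames: same columns apart from a trailing space and one extra column.
left = pd.DataFrame(columns=["Queue Number", "Queue Date "])
right = pd.DataFrame(columns=["Queue Number", "Queue Date", "Example Extra Column"])

diff = compare_columns(left, right)
print(diff)  # {'only_left': [], 'only_right': ['Example Extra Column']}

# Hypothetical usage on the real frames:
# compare_columns(cluster15_active, cluster15_withdrawn)
```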

### Inspect Cluster 15 columns and a few rows

In [8]:
cluster15_active.columns

Index(['Queue Number', 'Project Number', 'Project Name', 'Generation/Fuel 1',
       'NET MW 1', 'Generation/Fuel 2', 'NET MW 2', 'Generation/Fuel 3',
       'NET MW 3', 'NET MW POI', 'PROJECT COUNTY', 'Project State',
       'Study Area', 'PTO', 'POI', 'Voltage kV', 'Requested COD',
       'Queue Date ', 'Application Date', 'Service Type'],
      dtype='object')

In [9]:
cluster15_active.head()

Unnamed: 0,Queue Number,Project Number,Project Name,Generation/Fuel 1,NET MW 1,Generation/Fuel 2,NET MW 2,Generation/Fuel 3,NET MW 3,NET MW POI,PROJECT COUNTY,Project State,Study Area,PTO,POI,Voltage kV,Requested COD,Queue Date,Application Date,Service Type
0,2207,54516,Alisa Solar Energy Complex 2,Photovoltaic/Solar,500.0,Storage/Battery,500.0,,,500.0,Yuma,AZ,SAN DIEGO,SDGE,NORTH GILA - HOODOO WASH (SDGE Portion Only),525,2030-06-01,2025-02-12,2024-11-18,Energy Only Requested
1,2328,54934,Amanece,Photovoltaic/Solar,418.992798,Storage/Battery,416.545013,,,400.0,Stanislaus,CA,PG&E FRESNO,PGAE,QUINTO SW STA- FINK SW STA 230 kV,230,2029-07-31,2025-02-12,2024-11-21,Full Capacity Deliverability Status Requested
2,2322,55045,Ambar Energy Storage,Storage/Battery,504.9,,,,,500.01,San Bernardino,CA,SCE METRO,SCE,LUGO 500 kV,500,2030-06-01,2025-02-12,2024-11-21,Full Capacity Deliverability Status Requested
3,2244,54963,Annapurna,Storage/Battery,257.0,,,,,250.0,Merced County,CA,PG&E FRESNO,PGAE,QUINTO SW STA 230 kV,230,2028-06-01,2025-02-12,2024-11-20,Full Capacity Deliverability Status Requested
4,2204,54897,Antlia,Storage/Battery,204.859,,,,,199.0,Monterey,CA,PG&E GBA,PGAE,MOSS LANDING PP 115 kV,115,2031-12-01,2025-02-12,2024-11-19,Full Capacity Deliverability Status Requested


## 6) Save quick parsed CSVs

These CSVs are **not final cleaned outputs**.
They’re saved only to make downstream notebooks faster and easier to debug.

In [10]:
public_all.to_csv(PROCESSED / '_public_queue_raw_parsed.csv', index=False)
cluster15_active.to_csv(PROCESSED / '_cluster15_raw.csv', index=False)
cluster15_withdrawn.to_csv(PROCESSED / '_cluster15_withdrawn_raw.csv', index=False)

print('Wrote:')
print(' -', PROCESSED / '_public_queue_raw_parsed.csv')
print(' -', PROCESSED / '_cluster15_raw.csv')
print(' -', PROCESSED / '_cluster15_withdrawn_raw.csv')

Wrote:
 - C:\Users\danci\Interconnection-Queue-Intelligence\data\processed\_public_queue_raw_parsed.csv
 - C:\Users\danci\Interconnection-Queue-Intelligence\data\processed\_cluster15_raw.csv
 - C:\Users\danci\Interconnection-Queue-Intelligence\data\processed\_cluster15_withdrawn_raw.csv
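
One thing to keep in mind when a downstream notebook reads these files back: CSV stores no type information, so datetime columns return as plain text unless they are parsed again. The sketch below demonstrates this with a toy frame written to a temporary directory; the rows and the file name `_example_raw.csv` are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with one datetime column (made-up rows).
frame = pd.DataFrame({
    "Project Name": ["EXAMPLE A", "EXAMPLE B"],
    "Queue Date": pd.to_datetime(["2025-02-12", "2025-02-12"]),
    "NET MW POI": [500.0, 400.0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "_example_raw.csv"
    frame.to_csv(out_path, index=False)
    # Default read: dates come back as text.
    restored = pd.read_csv(out_path)
    # Explicitly ask pandas to parse the date column again.
    restored_dates = pd.read_csv(out_path, parse_dates=["Queue Date"])

print(restored.shape == frame.shape)
print(pd.api.types.is_datetime64_any_dtype(restored["Queue Date"]))
print(pd.api.types.is_datetime64_any_dtype(restored_dates["Queue Date"]))
```

The shape survives the round trip, but only the second read restores a datetime dtype for `Queue Date`.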


## Summary

In this notebook we:
- Loaded both raw CAISO datasets
- Parsed the non-standard Public Queue Report layout
- Confirmed column availability
- Saved quick parsed CSVs for later cleaning notebooks
