# Data Exploration: Hawaii Rainfall Data

**Goal**: Explore the available HCDP rainfall data, understand its structure, and prepare a dataset for the workshop.

We have data from 1990 to 2026 arranged in folders by Year -> Month.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import glob
import os
from tqdm import tqdm

plt.rcParams["font.family"] = "monospace"

In [None]:
# Define path to data
DATA_ROOT = "../data/HCDP_data/rainfall/new/day/statewide/partial/station_data"

# Check if path exists
if os.path.exists(DATA_ROOT):
    print(f"Data root found: {DATA_ROOT}")
else:
    print(f"WARNING: Data root not found at {DATA_ROOT}")

## 1. Load Data

We will iterate through the directories and load a subset of data to understand the structure. Loading 30+ years might take time, so let's start by listing files.

In [None]:
all_files = glob.glob(os.path.join(DATA_ROOT, "*", "*", "*.csv"))
all_files = sorted(all_files)

print(f"Total CSV files found: {len(all_files)}")
print("Sample files:", all_files[:3])
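
Before loading anything, it can help to check how many files exist per year. The sketch below assumes each path ends in `<year>/<month>/<file>.csv`, as the folder layout described above suggests. The helper name `count_files_per_year` and the example paths are hypothetical, used only for illustration.

```python
import os
from collections import Counter


def count_files_per_year(paths):
    """Count files per year, assuming paths end in <year>/<month>/<file>.csv."""
    counts = Counter()
    for p in paths:
        parts = os.path.normpath(p).split(os.sep)
        # Assumption: the third-from-last path component is the year folder
        counts[parts[-3]] += 1
    return dict(sorted(counts.items()))


# Hypothetical paths, for illustration only
example_paths = [
    os.path.join("station_data", "1990", "01", "a.csv"),
    os.path.join("station_data", "1990", "02", "b.csv"),
    os.path.join("station_data", "1991", "01", "c.csv"),
]
print(count_files_per_year(example_paths))
```

In the notebook, the same helper could be called on `all_files` to spot years with fewer files than expected.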

### Load one file to inspect columns

In [None]:
sample_df = pd.read_csv(all_files[0])
print("Shape:", sample_df.shape)
sample_df.head()

We can plot the `(LON, LAT)` pairs to get a general overview of where the stations are located.

In [None]:
fig, ax = plt.subplots(subplot_kw=dict(projection=ccrs.PlateCarree()))
sample_df.plot.scatter("LON", "LAT", ax=ax, s=2, c="r")
gl = ax.gridlines(draw_labels=True, ls="--", lw=0.5)
ax.coastlines()

The data is in **wide format**: each row is a station, and each day of the month is a separate column.
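
To make the wide layout concrete, here is a minimal sketch that reshapes a toy wide-format table into long format with `DataFrame.melt`. The toy values are made up. The `X<date>` column naming follows the format handled later in this notebook, and the output column names `date` and `rainfall_mm` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy wide-format frame: one row per station, one column per day (made-up values)
wide = pd.DataFrame({
    "SKN": [1.0, 2.0],
    "Station.Name": ["STATION A", "STATION B"],
    "X1990.01.01": [0.0, 1.5],
    "X1990.01.02": [2.0, None],
})

# Day columns are the ones whose names start with 'X'
date_cols = [c for c in wide.columns if c.startswith("X")]

# Reshape to one row per (station, day)
long = wide.melt(
    id_vars=["SKN", "Station.Name"],
    value_vars=date_cols,
    var_name="date",
    value_name="rainfall_mm",
)

# Strip the leading 'X' and parse the rest as a date
long["date"] = pd.to_datetime(long["date"].str[1:], format="%Y.%m.%d")
long = long.sort_values(["SKN", "date"]).reset_index(drop=True)
print(long)
```

The long form has one row per station and day, which is usually easier to filter and plot than one column per day.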

## 2. Extract Data for a Target Station

We want to predict rainfall for a specific location. Let's find a station with good data coverage. 
**Honolulu International Airport** is usually a reliable station. Let's look for it.

In [None]:
# Search for HNL Airport in the sample
hnl_station = sample_df[sample_df['Station.Name'].str.contains("HONOLULU INTERNATIONAL", case=False, na=False)]
hnl_station

If found, we will trace this specific `SKN` (Station Key Number) across all files to build a time series.

In [None]:
TARGET_SKN = 703 # Honolulu International Airport

def extract_station_timeseries(skn, file_list):
    """Collect daily rainfall for one station (by SKN) across all files.

    Returns a dict mapping date strings ('YYYY.MM.DD') to rainfall values.
    """
    daily_rainfall = {}

    for file_path in tqdm(file_list, desc="Processing files"):
        try:
            df = pd.read_csv(file_path)
            # Keep only the row(s) for the requested station
            station_data = df[df['SKN'] == skn]

            if not station_data.empty:
                # Date columns are named like 'X1990.01.01'
                date_cols = [c for c in df.columns if c.startswith('X')]
                row = station_data.iloc[0]  # use the first matching row

                for col in date_cols:
                    date_str = col[1:]  # strip the leading 'X'
                    daily_rainfall[date_str] = row[col]
        except Exception as e:
            # Report the problem and move on to the next file
            print(f"Error processing {file_path}: {e}")

    return daily_rainfall

# For workshop preparation, extract the full available record for the station.
rainfall_data = extract_station_timeseries(TARGET_SKN, all_files)

## 3. Visualize & Clean

Convert the dictionary to a DataFrame and parse the date strings into a datetime index.

In [None]:
ts_df = pd.DataFrame.from_dict(rainfall_data, orient='index', columns=['Rainfall_mm'])
ts_df.index = pd.to_datetime(ts_df.index, format='%Y.%m.%d')
ts_df = ts_df.sort_index()

print(f"Extracted {len(ts_df)} days of data.")
print("Missing values:", ts_df['Rainfall_mm'].isna().sum())
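
The missing-value count above only covers days that appear in the files. As a hedged sketch on a toy frame (made-up values), the following shows one way to also detect calendar days that are absent from the index altogether, by comparing against a complete daily range. The variable names are illustrative.

```python
import pandas as pd

# Toy series with a two-day hole (Jan 3 and Jan 4 are absent from the index)
toy = pd.DataFrame(
    {"Rainfall_mm": [0.0, 1.2, 3.4]},
    index=pd.to_datetime(["1990-01-01", "1990-01-02", "1990-01-05"]),
)

# Complete daily calendar between the first and last observed day
full_range = pd.date_range(toy.index.min(), toy.index.max(), freq="D")

# Days in the calendar that never appear in the data
absent_days = full_range.difference(toy.index)

# Reindexing turns absent days into explicit NaN rows
toy_full = toy.reindex(full_range)

print("Absent days:", len(absent_days))
print("NaN after reindex:", toy_full["Rainfall_mm"].isna().sum())
```

Applying the same idea to `ts_df` would reveal whether any whole days or months are missing from the extracted record.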

In [None]:
# Handle Missing Values (Interpolate)
ts_df['Rainfall_mm_filled'] = ts_df['Rainfall_mm'].interpolate(method='time')
print("Missing values after interpolation:", ts_df['Rainfall_mm_filled'].isna().sum())
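
To see what `interpolate(method='time')` does, here is a small sketch on a toy series (made-up values). Interior gaps are filled along a straight line between the neighbouring observations, while a gap at the very start of the series is left as `NaN` under the default settings, because there is no earlier value to interpolate from.

```python
import numpy as np
import pandas as pd

# Toy daily series: one leading gap and a two-day interior gap
idx = pd.date_range("2000-01-01", periods=6, freq="D")
toy = pd.Series([np.nan, 0.0, np.nan, np.nan, 6.0, 3.0], index=idx)

# Time-weighted linear interpolation, same call as used on ts_df
filled = toy.interpolate(method="time")

# Flag which values were created by interpolation
was_filled = toy.isna() & filled.notna()

print(pd.DataFrame({"original": toy, "filled": filled, "was_filled": was_filled}))
```

Keeping a flag like `was_filled` alongside the data makes it possible to exclude interpolated days later, which may matter for rainfall because interpolation can introduce rain on days that were actually dry.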

Now let's visualize the data.

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.bar(ts_df.index, ts_df['Rainfall_mm_filled'], 10, label='Rainfall (Interpolated)', alpha=0.7)
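# Take the year range from the data instead of hard-coding it
ax.set_title(f"Daily Rainfall for Station SKN {TARGET_SKN} ({ts_df.index.min().year}-{ts_df.index.max().year})")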
ax.set_ylabel("Rainfall (mm)")
ax.legend()
ax.set_xlim(ts_df.index.min(), ts_df.index.max())

## 4. Export data

We can save this processed time series as a single CSV, `station_703_daily_rainfall.csv`, so we have a consolidated dataset to work with.

In [None]:
os.makedirs("../data/processed", exist_ok=True)  # create the output folder if needed
ts_df.to_csv("../data/processed/station_703_daily_rainfall.csv")
print("Saved consolidated dataset.")
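
When reading the exported file back, the date index has to be parsed again. The sketch below round-trips a toy frame through a temporary CSV to show the pattern. The file name `toy_rainfall.csv` and the values are hypothetical. The column names match the ones used above.

```python
import os
import tempfile

import pandas as pd

# Toy frame shaped like ts_df (made-up values)
toy = pd.DataFrame(
    {"Rainfall_mm": [0.0, None, 2.0], "Rainfall_mm_filled": [0.0, 1.0, 2.0]},
    index=pd.to_datetime(["1990-01-01", "1990-01-02", "1990-01-03"]),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_rainfall.csv")  # hypothetical file name
    toy.to_csv(path)
    # index_col=0 restores the index; parse_dates=True converts it back to datetimes
    reloaded = pd.read_csv(path, index_col=0, parse_dates=True)

print(type(reloaded.index).__name__)
print(reloaded.head())
```

The same `read_csv` arguments should work for the consolidated station file, assuming it was written with the datetime index as its first column, as above.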