# Geospatial Tabular Template – Weather Prediction from ZIP Codes

This template is for **tabular ML with simple geospatial features**, e.g.:

- Predicting daily high temperature for a **US ZIP code**
- Using **latitude/longitude**, distances, and regional clusters as features
- Training a **standard regression model** (RandomForest, Gradient Boosting, etc.)

We stay in the **tabular world**, but add a light layer of geospatial thinking:

- ZIP → latitude/longitude
- Distances to reference points (e.g., weather stations, coastline, city center)
- Geo-clusters (KMeans on lat/lon)
- Optional interaction with time (date features)

This is the "bridge" between pure tabular and full GIS / geospatial ML.


## 🔁 High-Level Workflow (Template A – Tabular Geospatial)

1. Imports & config
2. Load data (ZIP, date, target, extra covariates)
3. ZIP → latitude/longitude (via lookup or pre-joined file)
4. Geospatial feature engineering
   - Distances (to weather stations or city centers)
   - Simple geo-clusters on lat/lon
5. Time features (month, day-of-year, etc.)
6. Train/validation split (random vs time-based)
7. Baseline regression models
8. Evaluation & feature importance


In [None]:
# ========== 1. Imports & Config (Geo Tabular Weather) ==========

from pathlib import Path
from typing import Optional, List

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["figure.dpi"] = 100

# If you want to use pgeocode for ZIP -> lat/lon, you can install it via:
#   pip install pgeocode
try:
    import pgeocode
    PGEOCODE_AVAILABLE = True
except ImportError:
    PGEOCODE_AVAILABLE = False
    print("pgeocode not installed; ZIP -> lat/lon via pgeocode will be skipped unless you install it.")

# ---- Config ----
DATA_DIR = Path("../input")
TRAIN_FILE = "weather_train.csv"   # edit to your file name

ZIP_COL = "zip"
DATE_COL = "date"
TARGET_COL = "temp_high"           # numeric regression target

ID_COL = "id"                      # optional

RANDOM_STATE = 42


In [None]:
# ========== 2. Load Data & Basic Checks ==========

def load_data(data_dir: Path = DATA_DIR, train_file: str = TRAIN_FILE) -> pd.DataFrame:
    path = data_dir / train_file
    if not path.exists():
        raise FileNotFoundError(f"Train file not found: {path}")
    df = pd.read_csv(path)
    print("Data shape:", df.shape)
    display(df.head())
    return df


df = load_data()

# Basic sanity checks
if ZIP_COL not in df.columns:
    raise ValueError(f"Expected ZIP column '{ZIP_COL}' not in dataframe")
if TARGET_COL not in df.columns:
    raise ValueError(f"Expected target column '{TARGET_COL}' not in dataframe")

print("\nDtypes:")
display(df.dtypes)

print("\nMissing (%):")
display((df.isna().mean() * 100).sort_values(ascending=False))

# Simple target distribution
sns.histplot(df[TARGET_COL], bins=40)
plt.title("Target distribution")
plt.xlabel(TARGET_COL)
plt.show()


## 3️⃣ ZIP → Latitude/Longitude

You have two main options:

1. **Pre-joined coordinates**: your CSV already has `lat` / `lon` columns.
2. **Lookup with pgeocode**: derive lat/lon from ZIP on the fly.

For performance and reproducibility, pre-joining coordinates into your dataset
(often by merging with a ZIP→lat/lon lookup table) is recommended.
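
A minimal sketch of the pre-join option, assuming a small in-memory lookup table. The names `obs` and `zip_lookup` and the coordinates are hypothetical and only approximate; a real lookup table would normally be loaded from a file.

```python
import pandas as pd

# Hypothetical observations; ZIPs read as integers lose their leading zeros.
obs = pd.DataFrame({"zip": [2134, 10001, 94105], "temp_high": [71.0, 75.5, 64.2]})

# Hypothetical ZIP -> lat/lon lookup table (approximate values, for illustration only).
zip_lookup = pd.DataFrame({
    "zip": ["02134", "10001", "94105"],
    "lat": [42.36, 40.75, 37.79],
    "lon": [-71.13, -74.00, -122.39],
})

# Normalize ZIPs to 5-character strings so both sides share the same key format.
obs["zip"] = obs["zip"].astype(str).str.zfill(5)

# Left join keeps every observation; validate raises if the lookup has duplicate ZIPs.
merged = obs.merge(zip_lookup, on="zip", how="left", validate="many_to_one")

# Rows with no match keep NaN coordinates and can be inspected before modeling.
n_unmatched = int(merged["lat"].isna().sum())
print(merged)
print("Unmatched ZIPs:", n_unmatched)
```

Saving `merged` back to disk once means later runs can skip the lookup entirely.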


In [None]:
# ========== 3. ZIP -> Lat/Lon (if not already present) ==========

if "lat" in df.columns and "lon" in df.columns:
    print("Using existing lat/lon columns in dataframe.")
else:
    if not PGEOCODE_AVAILABLE:
        raise ImportError(
            "lat/lon not in dataframe and pgeocode is not installed. "
            "Either add lat/lon to your CSV or install pgeocode."
        )
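    geo = pgeocode.Nominatim("us")
    # pgeocode expects 5-character strings; restore leading zeros lost if ZIPs were read as integers
    df[ZIP_COL] = df[ZIP_COL].astype(str).str.zfill(5)
    # Query each unique ZIP once (a list query returns a DataFrame in input order),
    # then map the coordinates back onto the rows
    unique_zips = df[ZIP_COL].unique()
    loc = geo.query_postal_code(list(unique_zips))
    df["lat"] = df[ZIP_COL].map(pd.Series(loc["latitude"].values, index=unique_zips))
    df["lon"] = df[ZIP_COL].map(pd.Series(loc["longitude"].values, index=unique_zips))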

print("Lat/Lon summary:")
display(df[["lat", "lon"]].describe(include="all"))


## 4️⃣ Geospatial Feature Engineering (Tabular)

We will add:

- **Distance to a reference point** (e.g., a "central" location or known station)
- **Regional geo-cluster** using KMeans on (lat, lon)

You can adapt this to:

- Distance to nearest coastline / city
- Distance to nearest station with real measurements
- Clusters based on your specific region of interest
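
As one possible adaptation, distance to the nearest station can be sketched with a broadcast haversine matrix. The `points` and `stations` arrays below are hypothetical, with illustrative coordinates; real station locations would come from your own data.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, NumPy-broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical ZIP centroids, one (lat, lon) pair per row.
points = np.array([[40.75, -74.00], [37.79, -122.39]])
# Hypothetical station locations, one (lat, lon) pair per row.
stations = np.array([[40.64, -73.78], [37.62, -122.37], [41.98, -87.90]])

# Broadcast (n_points, 1) against (1, n_stations) to get a full distance matrix.
dist = haversine_km(points[:, None, 0], points[:, None, 1],
                    stations[None, :, 0], stations[None, :, 1])

nearest_idx = dist.argmin(axis=1)      # index of the closest station per point
dist_nearest_km = dist.min(axis=1)     # distance to that station in km
print(nearest_idx, dist_nearest_km.round(1))
```

The full matrix needs memory proportional to points times stations, so for large inputs a spatial index (for example scikit-learn's `BallTree` with the haversine metric) is likely a better fit.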


In [None]:
# ========== 4. Geospatial Feature Engineering ==========

# Simple haversine distance (in kilometers)
def haversine_km(lat1, lon1, lat2, lon2):
    R = 6371.0  # Earth radius in km
    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)
    dlat = lat2_rad - lat1_rad
    dlon = lon2_rad - lon1_rad
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c


# Example: distance to a reference point (e.g., Kansas center ~ US centroid)
REF_LAT, REF_LON = 39.5, -98.35  # approximate geographic center of contiguous US

df["dist_ref_km"] = haversine_km(df["lat"], df["lon"], REF_LAT, REF_LON)

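# Geo-clusters on lat/lon (Euclidean KMeans on raw degrees: a rough regional grouping, not a true distance)
has_coords = df[["lat", "lon"]].notna().all(axis=1)
geo_coords = df.loc[has_coords, ["lat", "lon"]].values

N_CLUSTERS = 8
# n_init is set explicitly so results do not depend on the scikit-learn version's default
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=RANDOM_STATE)
cluster_labels = kmeans.fit_predict(geo_coords)

df.loc[has_coords, "geo_cluster"] = cluster_labels
df["geo_cluster"] = df["geo_cluster"].astype("Int64")  # nullable integer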

print("Geo features created: dist_ref_km, geo_cluster")
display(df[["lat", "lon", "dist_ref_km", "geo_cluster"]].head())


## 5️⃣ Time Features (Optional but Recommended)

If you have a date column, you can add:

- Year, month, day, day-of-week
- Day-of-year (captures seasonality)
- Simple cyclical encodings (sin/cos of day-of-year)

For **proper forecasting**, your train/valid split should be **time-based**, not random.
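
To illustrate why the sin/cos encoding listed above helps, the sketch below compares distances in the encoded space. The helper name `encode_doy` is hypothetical; the period of 365.25 days is an assumption that averages over leap years.

```python
import numpy as np

def encode_doy(doy, period=365.25):
    """Map day-of-year values to (sin, cos) points on the unit circle."""
    angle = 2 * np.pi * np.asarray(doy, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Dec 31, Jan 1, and Jul 1 as day-of-year values in a non-leap year.
dec31, jan1, jul1 = encode_doy([365, 1, 182])

# As raw integers, Dec 31 and Jan 1 look maximally far apart.
raw_gap = abs(365 - 1)

# In the encoded space, adjacent days are close and opposite seasons are far apart.
enc_gap_adjacent = float(np.linalg.norm(dec31 - jan1))
enc_gap_opposite = float(np.linalg.norm(dec31 - jul1))
print(raw_gap, round(enc_gap_adjacent, 3), round(enc_gap_opposite, 3))
```

Tree models can often cope with a raw day-of-year column, but the encoded pair removes the artificial jump at the year boundary, which matters more for linear and distance-based models.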


In [None]:
# ========== 5. Time Features from DATE_COL (if present) ==========

if DATE_COL in df.columns:
    df[DATE_COL] = pd.to_datetime(df[DATE_COL])
    df["year"] = df[DATE_COL].dt.year
    df["month"] = df[DATE_COL].dt.month
    df["day"] = df[DATE_COL].dt.day
    df["dayofyear"] = df[DATE_COL].dt.dayofyear
    df["dayofweek"] = df[DATE_COL].dt.dayofweek

    # Simple cyclical encoding for seasonality
    df["doy_sin"] = np.sin(2 * np.pi * df["dayofyear"] / 365.25)
    df["doy_cos"] = np.cos(2 * np.pi * df["dayofyear"] / 365.25)

    print("Added time features from DATE_COL.")
else:
    print(f"DATE_COL '{DATE_COL}' not in dataframe; skipping time features.")


## 6️⃣ Train/Validation Split – Random vs Time-Based

Two main options:

1. **Random split** (standard): fine if you treat this as generic regression.
2. **Time-based split**: required if you want honest forecasting evaluation.

For a **time-based split**, you can:

- Sort by date
- Use the earliest part for training, later part for validation
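
One variant of the time-based split picks a cutoff date rather than a row position, so that no single date ends up in both sets. This is a sketch on hypothetical data (`demo` and its values are made up), assuming several rows can share a date, as happens with one row per ZIP per day.

```python
import pandas as pd

# Hypothetical daily observations for two ZIPs over ten days.
dates = pd.date_range("2023-01-01", periods=10, freq="D")
demo = pd.DataFrame({
    "date": list(dates) * 2,
    "zip": ["10001"] * 10 + ["94105"] * 10,
    "temp_high": list(range(20)),
})

# Choose the cutoff among the unique dates: roughly the first 80% of dates go to training.
unique_dates = demo["date"].drop_duplicates().sort_values().reset_index(drop=True)
cutoff = unique_dates.iloc[int(len(unique_dates) * 0.8)]

# Everything strictly before the cutoff is training data; the rest is validation.
train = demo[demo["date"] < cutoff]
valid = demo[demo["date"] >= cutoff]
print(cutoff.date(), len(train), len(valid))
```

Cutting on dates keeps the validation period strictly after the training period, at the cost of a split ratio that is only approximately 80/20 when dates have uneven row counts.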


In [None]:
# ========== 6. Train/Validation Split ==========

# Choose random or time-based behavior here:
USE_TIME_BASED_SPLIT = DATE_COL in df.columns

# Exclude the target and identifier-like columns from the features
drop_cols = [TARGET_COL]
for c in [ID_COL, DATE_COL]:
    if c in df.columns:
        drop_cols.append(c)

# geo_cluster stays an integer code (a tree can split on it); one-hot encode it for linear models.
# Keep numeric columns only and drop incomplete rows, since the baseline needs complete numeric input.
X = df.drop(columns=drop_cols).select_dtypes(include="number").dropna().astype(float)
y = df.loc[X.index, TARGET_COL]
X, y = X[y.notna()], y[y.notna()]

if USE_TIME_BASED_SPLIT:
    # Order the rows kept in X by date, then cut at 80%
    order = df.loc[X.index, DATE_COL].sort_values().index
    split_idx = int(len(order) * 0.8)
    train_idx, valid_idx = order[:split_idx], order[split_idx:]
    X_train, X_valid = X.loc[train_idx], X.loc[valid_idx]
    y_train, y_valid = y.loc[train_idx], y.loc[valid_idx]
    print("Using time-based split (80% earliest dates for train, 20% latest for valid).")
else:
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE
    )
    print("Using random train/validation split.")

print("Train shape:", X_train.shape, "Valid shape:", X_valid.shape)


## 7️⃣ Baseline Regression Model

We will start with **RandomForestRegressor** because:

- Handles mixed feature types reasonably well (numeric + integer clusters)
- Captures nonlinear relationships without heavy tuning

You can later plug in:

- GradientBoosting, XGBoost, LightGBM, CatBoost
- Linear models on standardized features
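
As a sketch of the linear-model option, the example below fits a scikit-learn pipeline on synthetic data. The data-generating formula is invented for illustration, and `Ridge` is just one reasonable choice of linear model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features: latitude and a seasonal cosine term.
rng = np.random.default_rng(0)
n = 400
lat = rng.uniform(25, 49, n)
doy = rng.integers(1, 366, n)
doy_cos = np.cos(2 * np.pi * doy / 365.25)

# Invented target: cooler further north and cooler around the turn of the year, plus noise.
y = 90 - 1.2 * (lat - 25) - 15 * doy_cos + rng.normal(0, 2, n)
X = np.column_stack([lat, doy_cos])

# The pipeline fits the scaler on the training rows only, then applies it at predict time.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X[:300], y[:300])

r2_valid = model.score(X[300:], y[300:])  # R^2 on the held-out rows
print(f"Ridge R2 on held-out rows: {r2_valid:.3f}")
```

Wrapping the scaler in the pipeline avoids leaking validation statistics into training, which is easy to do by accident when scaling the whole dataframe up front.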


In [None]:
# ========== 7. Baseline Model: RandomForestRegressor ==========

rf = RandomForestRegressor(
    n_estimators=500,
    max_depth=None,
    n_jobs=-1,
    random_state=RANDOM_STATE,
)

rf.fit(X_train, y_train)
y_pred = rf.predict(X_valid)

# Take the square root explicitly; the squared=False argument was removed in scikit-learn 1.6
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
mae = mean_absolute_error(y_valid, y_pred)
r2 = r2_score(y_valid, y_pred)

print(f"RandomForest - RMSE: {rmse:.3f}, MAE: {mae:.3f}, R2: {r2:.3f}")

plt.scatter(y_valid, y_pred, alpha=0.3)
plt.xlabel("True")
plt.ylabel("Predicted")
plt.title("RandomForest predictions vs true")
plt.axline((0, 0), slope=1, color="red", linestyle="--")
plt.show()


## 8️⃣ Feature Importance & Next Steps

We can inspect feature importances to understand what the model is using:

- Are `lat`, `lon`, `dist_ref_km`, `geo_cluster` important?
- Are time features (month, doy_sin/cos) important?

Then, iterate:

- Add better distance-based features (e.g., distance to nearest station)
- Add richer time features (e.g., lagged temps if you have history)
- Try gradient boosting models
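
Lagged temperatures, mentioned in the list above, can be sketched with a per-ZIP `groupby` and `shift`. The `demo` frame and the column names `temp_lag1` and `temp_roll3_mean` are hypothetical, and the sketch assumes one row per ZIP per day with no gaps.

```python
import pandas as pd

# Hypothetical history: four consecutive days for two ZIPs.
demo = pd.DataFrame({
    "zip": ["10001"] * 4 + ["94105"] * 4,
    "date": list(pd.date_range("2023-06-01", periods=4, freq="D")) * 2,
    "temp_high": [78.0, 80.0, 83.0, 79.0, 64.0, 66.0, 65.0, 68.0],
})

# Sort so that shifting by one row means "previous day within the same ZIP".
demo = demo.sort_values(["zip", "date"]).reset_index(drop=True)
grouped = demo.groupby("zip")["temp_high"]

# Previous day's high; the first day of each ZIP has no history and stays NaN.
demo["temp_lag1"] = grouped.shift(1)

# Mean of up to three previous days; shifting first keeps the current day out of the window.
demo["temp_roll3_mean"] = grouped.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)
print(demo)
```

Because every lag feature looks only at earlier rows, it stays compatible with a time-based validation split; if the data has missing days, a date-aware shift would be needed instead of a row shift.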


In [None]:
# ========== 8. Feature Importance Plot ==========

importances = rf.feature_importances_
feat_names = X_train.columns

fi = pd.DataFrame({"feature": feat_names, "importance": importances})
fi = fi.sort_values("importance", ascending=False)

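# Size the figure by the number of bars actually drawn, with a minimum height
top_fi = fi.head(30)
plt.figure(figsize=(8, max(3, 0.3 * len(top_fi))))
sns.barplot(data=top_fi, x="importance", y="feature")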
plt.title("RandomForest Feature Importances (top 30)")
plt.tight_layout()
plt.show()

fi.head(30)
