# 🚀 V3.1 Advanced Model Architecture

> **Notebook 07**  
> Implementing a Hybrid Static-Dynamic Transformer with Country Embeddings.

## 💡 The "Best Approach" Strategy
Our previous Transformer treated all features equally. This new architecture separates concerns:

1.  **Country Embeddings**: Learn a unique vector for each country, capturing unmeasured geographic traits such as "desert" or "island".
2.  **Separate Static/Dynamic Paths**: Static features (latitude, longitude) don't change over time, so they bypass the temporal encoder.
3.  **Huber Loss**: A robust regression loss that is less sensitive to outliers than MSE.
4.  **Resampled Data**: Builds on the clean daily data from the previous step.
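
To make point 3 concrete, here is a minimal sketch of the per-sample Huber loss next to squared error. It assumes a threshold of 1.0 (the `delta` parameter name is illustrative): residuals below the threshold are penalized quadratically, larger ones only linearly, so a single bad reading contributes far less than it would under MSE.

```python
def huber(residual, delta=1.0):
    """Huber loss for one residual: quadratic near zero, linear in the tails."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def squared_error(residual):
    """Plain squared error for comparison."""
    return residual * residual


# A small error is treated similarly by both; a 10-degree outlier is not.
small, outlier = 0.5, 10.0
print(huber(small), squared_error(small))      # 0.125 0.25
print(huber(outlier), squared_error(outlier))  # 9.5 100.0
```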

In [5]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import StandardScaler, LabelEncoder
from pathlib import Path
import matplotlib.pyplot as plt
import joblib

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"✅ Using device: {DEVICE}")

# Paths
BASE_DIR = Path('..')
DATA_PATH = Path('../../data/processed/weather_v3_ready.csv')
MODELS_DIR = BASE_DIR / 'models'
MODELS_DIR.mkdir(parents=True, exist_ok=True)  # ensure the checkpoint directory exists

✅ Using device: cuda


---
## 1️⃣ Robust Data Loading & Resampling
Re-applying the daily resampling fix from the `train_improved.py` script.

In [6]:
def load_and_resample(path):
    print("📊 Loading raw data...")
    df = pd.read_csv(path)
    
    # Date parsing
    if 'last_updated' in df.columns:
        df['last_updated'] = pd.to_datetime(df['last_updated'], errors='coerce')
    
    # Prepare resampling
    resampled_dfs = []
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    
    print("🔄 Resampling to Daily Frequency...")
    for country, group in df.groupby('country'):
        group = group.set_index('last_updated')
        # Resample numeric columns to daily means
        daily = group[numeric_cols].resample('1D').mean()
        # Interpolate small gaps (up to 2 days), then drop rows that are still missing.
        # This runs before the string 'country' column is restored, so only
        # numeric columns are interpolated.
        daily = daily.interpolate(method='time', limit=2).dropna()
        # Restore country
        daily['country'] = country
        daily = daily.reset_index()
        resampled_dfs.append(daily)
        
    return pd.concat(resampled_dfs, ignore_index=True)

df = load_and_resample(DATA_PATH)
print(f"✅ Data Ready: {len(df):,} rows")

📊 Loading raw data...
🔄 Resampling to Daily Frequency...



✅ Data Ready: 108,906 rows
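
The effect of the resampling step is easiest to see on a toy frame. The sketch below uses made-up readings for a single location (the values are illustrative, not taken from the dataset): irregular timestamps are averaged into one row per day, and the single missing day is filled by time-based interpolation.

```python
import pandas as pd

# Hypothetical raw readings: two on day 1, one on day 2, none on day 3, one on day 4.
raw = pd.DataFrame({
    "last_updated": pd.to_datetime([
        "2024-01-01 06:00", "2024-01-01 18:00",
        "2024-01-02 12:00",
        "2024-01-04 12:00",
    ]),
    "temperature_celsius": [10.0, 14.0, 13.0, 15.0],
})

# One row per calendar day; days without readings become NaN.
daily = raw.set_index("last_updated").resample("1D").mean()

# Fill gaps of at most 2 days using the time index, drop anything still missing.
filled = daily.interpolate(method="time", limit=2).dropna()
print(filled["temperature_celsius"].tolist())  # [12.0, 13.0, 14.0, 15.0]
```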




---
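## 2️⃣ Feature Engineering for Hybrid Model
We need to separate features into three groups: dynamic features that vary over time, static features that are constant per location, and a country ID that feeds the embedding layer.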

In [7]:
# Encode Country ID for Embeddings
country_encoder = LabelEncoder()
df['country_id'] = country_encoder.fit_transform(df['country'])
NUM_COUNTRIES = len(country_encoder.classes_)
print(f"🌍 Number of Unique Countries: {NUM_COUNTRIES}")

# Feature Definitions
DYNAMIC_FEATURES = [
    'humidity', 'pressure_mb', 'wind_kph', 'cloud', 'precip_mm', 'uv_index',
    'visibility_km', 'gust_kph', 'wind_degree',
    'air_quality_Ozone', 'air_quality_PM2.5', 
    'month_sin', 'month_cos'  # Time varies dynamically
]

STATIC_FEATURES = [
    'latitude', 'longitude', 'abs_latitude', 'hemisphere_encoded'
]

TARGET_COL = 'temperature_celsius'

# Check availability
dyn_avail = [c for c in DYNAMIC_FEATURES if c in df.columns]
stat_avail = [c for c in STATIC_FEATURES if c in df.columns]

print(f"🔹 Dynamic Features: {len(dyn_avail)}")
print(f"🔸 Static Features: {len(stat_avail)}")

🌍 Number of Unique Countries: 211
🔹 Dynamic Features: 13
🔸 Static Features: 4
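
`month_sin` and `month_cos` are assumed to arrive precomputed from the earlier notebook; their construction is not shown here. A common way to build such a cyclical pair, sketched below as an assumption rather than the exact upstream code, maps the month onto a circle so that December and January end up close together.

```python
import math


def encode_month(month):
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2.0 * math.pi * (month - 1) / 12.0
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded months."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


jan, jun, dec = encode_month(1), encode_month(6), encode_month(12)
# December is a neighbor of January on the circle; June is nearly opposite.
print(distance(dec, jan) < distance(jun, jan))  # True
```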


In [8]:
# Sequence Creation (Complex: needs X_dyn, X_stat, Country_ID, y)
SEQ_LEN = 14
PRED_LEN = 7

def create_hybrid_sequences(df, dyn_cols, stat_cols, country_col, target_col):
    X_dyn, X_stat, X_country, y = [], [], [], []
    
    print("🔄 Generating sequences...")
    for _, group in df.groupby('country'):
        group = group.sort_values('last_updated')
        # Skip countries too short to yield a single input/target window
        if len(group) < SEQ_LEN + PRED_LEN:
            continue
            
        # Extract arrays
        d_data = group[dyn_cols].values.astype(np.float32)
        s_data = group[stat_cols].values.astype(np.float32)
        c_data = group[country_col].values.astype(np.int64)
        t_data = group[target_col].values.astype(np.float32)
        
        for i in range(len(d_data) - SEQ_LEN - PRED_LEN + 1):
            # Dynamic: Sequence
            X_dyn.append(d_data[i : i+SEQ_LEN])
            # Static: Take 1st value of sequence (it's constant anyway)
            X_stat.append(s_data[i]) 
            # Country: Take 1st value
            X_country.append(c_data[i])
            # Target: Future sequence
            y.append(t_data[i+SEQ_LEN : i+SEQ_LEN+PRED_LEN])
            
    return np.array(X_dyn), np.array(X_stat), np.array(X_country), np.array(y)

X_dyn, X_stat, X_country, y = create_hybrid_sequences(
    df, dyn_avail, stat_avail, 'country_id', TARGET_COL
)
print(f"📊 Total Samples: {len(y):,}")

🔄 Generating sequences...
📊 Total Samples: 105,161
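
The number of samples per country follows directly from the loop bounds: a series of `n` days yields `n - SEQ_LEN - PRED_LEN + 1` windows. Here is a minimal sketch on a toy series with the same 14-day input and 7-day target lengths. For brevity it slices one list for both input and target, whereas the notebook takes inputs from the dynamic features and targets from the temperature column.

```python
SEQ_LEN, PRED_LEN = 14, 7


def make_windows(series, seq_len=SEQ_LEN, pred_len=PRED_LEN):
    """Return (input, target) pairs using the same slicing as the notebook loop."""
    pairs = []
    for i in range(len(series) - seq_len - pred_len + 1):
        inputs = series[i : i + seq_len]
        target = series[i + seq_len : i + seq_len + pred_len]
        pairs.append((inputs, target))
    return pairs


days = list(range(30))       # toy series: day indices 0..29
pairs = make_windows(days)
print(len(pairs))            # 10
print(pairs[0][1])           # [14, 15, 16, 17, 18, 19, 20]
print(pairs[-1][1][-1])      # 29
```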


---
## 3️⃣ Advanced Model: `HybridClimateTransformer`

In [9]:
class HybridClimateTransformer(nn.Module):
    def __init__(self, 
                 num_countries, 
                 dyn_input_dim, 
                 stat_input_dim, 
                 d_model=128, 
                 nhead=4, 
                 num_layers=3, 
                 dropout=0.2,
                 pred_len=7):
        super().__init__()
        
        # 1. Feature Processors
        self.country_emb = nn.Embedding(num_countries, 16) # Learnable vector per country
        
        # Dynamic Feature Encoder (Linear projection to d_model)
        self.dyn_encoder = nn.Linear(dyn_input_dim, d_model)
        
        # Static Feature Encoder (Linear projection of static features + country embedding to d_model)
        self.stat_encoder = nn.Linear(stat_input_dim + 16, d_model) # +16 for country emb
        
        # 2. Transformer Backbone
        self.pos_encoder = nn.Parameter(torch.randn(1, SEQ_LEN, d_model) * 0.02)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=d_model*4, 
            dropout=dropout, batch_first=True, activation='gelu'
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        # 3. Gating / Combination
        # We'll use Static context to initialize or gate the decoder, but here we'll simplify:
        # Concatenate Transformer Output + Static Context -> MLP -> Output
        
        self.output_head = nn.Sequential(
            nn.Linear(d_model + d_model, d_model), # Concat [Dynamic_Context, Static_Context]
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model, d_model // 2),
            nn.GELU(),
            nn.Linear(d_model // 2, pred_len)
        )

    def forward(self, x_dyn, x_stat, x_country):
        # A. Process Static Context
        c_emb = self.country_emb(x_country)        # [Batch, 16]
        stat_in = torch.cat([x_stat, c_emb], dim=1) # [Batch, Stat_Dim + 16]
        stat_context = self.stat_encoder(stat_in)   # [Batch, d_model]
        
        # B. Process Dynamic Sequence
        # x_dyn: [Batch, Seq, Dyn_Dim]
        dyn_emb = self.dyn_encoder(x_dyn)           # [Batch, Seq, d_model]
        dyn_emb = dyn_emb + self.pos_encoder        # Add Position info
        
        # C. Transformer Pass
        # We could inject static context here as a token, but simple concat at end is often stable
        time_context = self.transformer(dyn_emb)    # [Batch, Seq, d_model]
        
        # Take last time step hidden state
        last_step = time_context[:, -1, :]          # [Batch, d_model]
        
        # D. Combine & Predict
        combined = torch.cat([last_step, stat_context], dim=1) # [Batch, d_model*2]
        return self.output_head(combined)

print("🧠 Hybrid Architecture Defined")

🧠 Hybrid Architecture Defined
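
The static path is small enough to trace by hand. The sketch below rebuilds only that part with standalone layers (the variable names mirror the class, but this is an illustration, not the trained model) and uses a random tensor as a stand-in for the Transformer's last-step output to show how the two contexts are concatenated.

```python
import torch
import torch.nn as nn

batch, stat_dim, emb_dim, d_model = 4, 4, 16, 128

country_emb = nn.Embedding(211, emb_dim)                # one 16-d vector per country
stat_encoder = nn.Linear(stat_dim + emb_dim, d_model)   # static features + embedding

x_stat = torch.randn(batch, stat_dim)                   # stand-in for scaled static features
x_country = torch.randint(0, 211, (batch,))             # integer country IDs

c_emb = country_emb(x_country)                                   # [4, 16]
stat_context = stat_encoder(torch.cat([x_stat, c_emb], dim=1))   # [4, 128]

last_step = torch.randn(batch, d_model)                 # stand-in for the encoder output
combined = torch.cat([last_step, stat_context], dim=1)  # [4, 256]
print(c_emb.shape, stat_context.shape, combined.shape)
```

Because the country ID only enters through the embedding table, it must stay an unscaled integer tensor; the continuous static features are the only ones that go through a scaler.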


---
## 4️⃣ Training Setup

In [10]:
# Preprocessing (Scaling)
# Note: Don't scale country ID!
scaler_dyn = StandardScaler()
scaler_stat = StandardScaler()
# Target scaling is optional: it can help Huber Loss stability, but we keep the
# raw target so that losses and MAE read directly in °C. This scaler stays unused.
scaler_target = StandardScaler()

# Flatten for scaling
# Caveat: both scalers are fit on all samples before the split, so validation
# and test statistics influence the scaling.
X_dyn_flat = X_dyn.reshape(-1, X_dyn.shape[-1])
X_dyn_scaled = scaler_dyn.fit_transform(X_dyn_flat).reshape(X_dyn.shape)
X_stat_scaled = scaler_stat.fit_transform(X_stat)

# Split
# Samples are ordered by country (see the loop in `create_hybrid_sequences`), so a
# plain sequential slice would fill the test set with countries from the end of
# the alphabet. The sound choice for time series is a chronological split per
# country, ideally done during sequence creation.
# For simplicity, this demo uses a shuffled split instead, to maximize coverage
# of all climates. It treats samples as independent even though overlapping
# windows share days, so the reported metrics are likely optimistic.
from sklearn.model_selection import train_test_split

indices = np.arange(len(y))
train_idx, temp_idx = train_test_split(indices, test_size=0.3, random_state=42, shuffle=True)
val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)

def get_tensors(idxs):
    return (
        torch.FloatTensor(X_dyn_scaled[idxs]),
        torch.FloatTensor(X_stat_scaled[idxs]),
        torch.LongTensor(X_country[idxs]),
        torch.FloatTensor(y[idxs])
    )

train_data = TensorDataset(*get_tensors(train_idx))
val_data = TensorDataset(*get_tensors(val_idx))
test_data = TensorDataset(*get_tensors(test_idx))

BATCH_SIZE = 128 # Increased batch size for stability
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_data, batch_size=BATCH_SIZE)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE)

print(f"📈 Train: {len(train_data):,} | Val: {len(val_data):,} | Test: {len(test_data):,}")

📈 Train: 73,612 | Val: 15,774 | Test: 15,775
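
The comments in the split cell name a per-country time split as the sounder alternative. A minimal sketch of that idea follows; it is not what this notebook runs. The function name and the percentage parameters are hypothetical, and the sketch assumes each country's samples appear in chronological order, which holds for the arrays produced by `create_hybrid_sequences`.

```python
from collections import defaultdict


def chronological_split(country_ids, train_pct=70, val_pct=15):
    """Split sample indices per country: earliest windows train, latest test."""
    by_country = defaultdict(list)
    for idx, cid in enumerate(country_ids):
        by_country[cid].append(idx)

    train, val, test = [], [], []
    for indices in by_country.values():
        n = len(indices)
        n_train = n * train_pct // 100   # integer arithmetic avoids float rounding
        n_val = n * val_pct // 100
        train.extend(indices[:n_train])
        val.extend(indices[n_train : n_train + n_val])
        test.extend(indices[n_train + n_val :])
    return train, val, test


# Toy input: 10 samples for country 0 followed by 20 samples for country 1.
ids = [0] * 10 + [1] * 20
train_idx, val_idx, test_idx = chronological_split(ids)
print(len(train_idx), len(val_idx), len(test_idx))  # 21 4 5
```

Windows on either side of a boundary still overlap by up to `SEQ_LEN + PRED_LEN - 1` days, so a stricter variant would also drop that many samples around each cut.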


In [11]:
# Model Init
model = HybridClimateTransformer(
    num_countries=NUM_COUNTRIES,
    dyn_input_dim=X_dyn.shape[-1],
    stat_input_dim=X_stat.shape[-1],
    d_model=128,
    num_layers=4
).to(DEVICE)

# Huber Loss (SmoothL1Loss) is more robust to outliers
criterion = nn.SmoothL1Loss(beta=1.0) 
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2, factor=0.5)

print(f"🧠 Parameters: {sum(p.numel() for p in model.parameters()):,}")

🧠 Parameters: 844,343
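
The reported parameter count can be reproduced by hand, which is a useful sanity check on the configuration. The sketch below assumes the standard parameter layout of `nn.TransformerEncoderLayer` (a fused query/key/value projection, an output projection, two feed-forward layers, and two LayerNorms) together with the dimensions printed earlier: 211 countries, 13 dynamic features, and 4 static features.

```python
def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


d_model, n_layers = 128, 4
n_countries, emb_dim = 211, 16
dyn_dim, stat_dim, seq_len, pred_len = 13, 4, 14, 7

encoder_layer = (
    linear(d_model, 3 * d_model)      # fused query/key/value projection
    + linear(d_model, d_model)        # attention output projection
    + linear(d_model, 4 * d_model)    # feed-forward expansion
    + linear(4 * d_model, d_model)    # feed-forward contraction
    + 2 * 2 * d_model                 # two LayerNorms (scale and shift each)
)

total = (
    n_countries * emb_dim                    # country embedding table
    + linear(dyn_dim, d_model)               # dynamic feature projection
    + linear(stat_dim + emb_dim, d_model)    # static feature projection
    + seq_len * d_model                      # learned positional encoding
    + n_layers * encoder_layer               # Transformer backbone
    + linear(2 * d_model, d_model)           # output head, layer 1
    + linear(d_model, d_model // 2)          # output head, layer 2
    + linear(d_model // 2, pred_len)         # output head, layer 3
)
print(f"{total:,}")  # 844,343
```

Roughly 94% of the parameters sit in the four encoder layers; the country embedding table accounts for only 3,376.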


---
## 5️⃣ Training Loop with Progress

In [12]:
EPOCHS = 40
best_loss = float('inf')
history = {'train': [], 'val': []}

print("🚀 Starting Hybrid Training...")
for epoch in range(EPOCHS):
    model.train()
    train_loss = 0
    for xd, xs, xc, y_true in train_loader:
        xd, xs, xc, y_true = xd.to(DEVICE), xs.to(DEVICE), xc.to(DEVICE), y_true.to(DEVICE)
        
        optimizer.zero_grad()
        y_pred = model(xd, xs, xc)
        loss = criterion(y_pred, y_true)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        train_loss += loss.item()
        
    train_loss /= len(train_loader)
    
    # Validation
    model.eval()
    val_loss = 0
    val_mae = 0
    with torch.no_grad():
        for xd, xs, xc, y_true in val_loader:
            xd, xs, xc, y_true = xd.to(DEVICE), xs.to(DEVICE), xc.to(DEVICE), y_true.to(DEVICE)
            pred = model(xd, xs, xc)
            val_loss += criterion(pred, y_true).item()
            val_mae += torch.mean(torch.abs(pred - y_true)).item()
            
    val_loss /= len(val_loader)
    val_mae /= len(val_loader)
    
    history['train'].append(train_loss)
    history['val'].append(val_loss)
    
    scheduler.step(val_loss)
    
    print(f"Epoch {epoch+1:02d} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val MAE: {val_mae:.2f}°C")
    
    if val_loss < best_loss:
        best_loss = val_loss
        torch.save(model.state_dict(), MODELS_DIR / 'v3_hybrid_best.pt')
        print("  💾 Saved Best Model")
        
print("✅ Training Complete")

🚀 Starting Hybrid Training...
Epoch 01 | Train Loss: 4.4463 | Val Loss: 2.6067 | Val MAE: 3.07°C
  💾 Saved Best Model
Epoch 02 | Train Loss: 2.3309 | Val Loss: 1.8408 | Val MAE: 2.28°C
  💾 Saved Best Model
Epoch 03 | Train Loss: 2.1209 | Val Loss: 1.7668 | Val MAE: 2.21°C
  💾 Saved Best Model
Epoch 04 | Train Loss: 1.9985 | Val Loss: 1.7153 | Val MAE: 2.15°C
  💾 Saved Best Model
Epoch 05 | Train Loss: 1.9146 | Val Loss: 1.6124 | Val MAE: 2.05°C
  💾 Saved Best Model
Epoch 06 | Train Loss: 1.8414 | Val Loss: 1.5617 | Val MAE: 2.00°C
  💾 Saved Best Model
Epoch 07 | Train Loss: 1.7653 | Val Loss: 1.6281 | Val MAE: 2.06°C
Epoch 08 | Train Loss: 1.7145 | Val Loss: 1.5757 | Val MAE: 2.01°C
Epoch 09 | Train Loss: 1.6702 | Val Loss: 1.4888 | Val MAE: 1.92°C
  💾 Saved Best Model
Epoch 10 | Train Loss: 1.6346 | Val Loss: 1.4817 | Val MAE: 1.91°C
  💾 Saved Best Model
Epoch 11 | Train Loss: 1.6016 | Val Loss: 1.4567 | Val MAE: 1.88°C
  💾 Saved Best Model
Epo
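
`ReduceLROnPlateau` with `patience=2` and `factor=0.5` halves the learning rate only after more than two consecutive epochs without improvement. The sketch below is a simplified re-implementation of that rule, not the PyTorch class itself: it ignores the cooldown and minimum-LR options and uses a relative improvement threshold of 1e-4, which matches PyTorch's default. It is applied to the eleven validation losses in the log.

```python
def simulate_plateau(losses, lr, patience=2, factor=0.5, threshold=1e-4):
    """Simplified 'min'-mode plateau rule; returns the LR in effect after each epoch."""
    best = float("inf")
    bad_epochs = 0
    lrs = []
    for loss in losses:
        if loss < best * (1.0 - threshold):
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        lrs.append(lr)
    return lrs


val_losses = [2.6067, 1.8408, 1.7668, 1.7153, 1.6124, 1.5617,
              1.6281, 1.5757, 1.4888, 1.4817, 1.4567]
lrs = simulate_plateau(val_losses, lr=5e-4)
print(lrs[-1])  # 0.0005
```

Epochs 7 and 8 are the only ones that fail to improve, and two in a row is within the patience window, so under this rule the learning rate stays at its initial value through epoch 11.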

---
## 6️⃣ Final Evaluation

In [13]:
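# map_location lets the checkpoint load even if it was saved on a different device
model.load_state_dict(torch.load(MODELS_DIR / 'v3_hybrid_best.pt', map_location=DEVICE))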
model.eval()

all_preds, all_targets = [], []
with torch.no_grad():
    for xd, xs, xc, y_true in test_loader:
        xd, xs, xc = xd.to(DEVICE), xs.to(DEVICE), xc.to(DEVICE)
        pred = model(xd, xs, xc)
        all_preds.append(pred.cpu().numpy())
        all_targets.append(y_true.numpy())

preds = np.concatenate(all_preds)
targets = np.concatenate(all_targets)

final_mae = np.mean(np.abs(preds - targets))
print(f"\n🏆 Final Test MAE: {final_mae:.4f}°C")


🏆 Final Test MAE: 1.7047°C
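
The single MAE above averages over every sample and all 7 forecast days at once. Forecast error usually grows with the horizon, so a per-day breakdown can be more informative. The sketch below shows the computation on made-up numbers with a 2-day horizon (the values are illustrative only); applied to the notebook's `preds` and `targets` arrays, the same reduction would be a mean over the sample axis.

```python
def mae_per_horizon(preds, targets):
    """Mean absolute error for each forecast step, averaged over samples."""
    n_samples = len(preds)
    horizon = len(preds[0])
    return [
        sum(abs(p[h] - t[h]) for p, t in zip(preds, targets)) / n_samples
        for h in range(horizon)
    ]


# Two toy samples, each a 2-day forecast in degrees Celsius.
toy_preds = [[20.0, 21.0], [10.0, 12.0]]
toy_targets = [[19.0, 23.0], [11.0, 9.0]]

per_day = mae_per_horizon(toy_preds, toy_targets)
overall = sum(per_day) / len(per_day)
print(per_day)   # [1.0, 2.5]
print(overall)   # 1.75
```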
