# 🤖 Production-Ready Forecasting Models

> **PM Accelerator Mission**: "By making industry-leading tools and education available to individuals from all backgrounds, we level the playing field for future PM leaders."

---

## Objectives
Train forecasting models using **only features available in production**:
- **Geographic**: latitude, longitude (from user location selection)
- **Temporal**: month, day, hour (from user-specified date/time)

**No lag features or rolling averages** - these require historical data that won't be available for new predictions.

## Model Type: Climatological Prediction
> "What is the expected temperature for [location] at [time of year]?"
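
One way to read the question above is as a lookup of historical averages. The sketch below is a minimal illustration of that idea, not the notebook's method: it averages made-up readings per (location, month) pair using only the standard library. The `readings` data and the `fit_climatology` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical readings: (location, month, temperature in °C). Values are made up.
readings = [
    ("Oslo", 1, -4.0), ("Oslo", 1, -2.0), ("Oslo", 7, 17.0), ("Oslo", 7, 19.0),
    ("Cairo", 1, 14.0), ("Cairo", 1, 16.0), ("Cairo", 7, 34.0), ("Cairo", 7, 36.0),
]

def fit_climatology(rows):
    """Average temperature for each (location, month) pair."""
    buckets = defaultdict(list)
    for location, month, temp in rows:
        buckets[(location, month)].append(temp)
    return {key: mean(temps) for key, temps in buckets.items()}

climatology = fit_climatology(readings)
print(climatology[("Oslo", 7)])   # 18.0
print(climatology[("Cairo", 1)])  # 15.0
```

The models trained below generalize this lookup: they take latitude, longitude, and time encodings as inputs, so they can also answer for coordinates that never appeared in the data.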

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries loaded!")

✅ Libraries loaded!


## 1. Load and Prepare Data

In [2]:
# Load cleaned data
df = pd.read_csv("../data/weather_cleaned.csv", parse_dates=['last_updated'])
print(f"📊 Loaded {len(df):,} records")
print(f"📍 {df['country'].nunique()} countries, {df['location_name'].nunique()} locations")
print(f"📅 Date range: {df['last_updated'].min().date()} to {df['last_updated'].max().date()}")

📊 Loaded 114,203 records
📍 204 countries, 255 locations
📅 Date range: 2024-05-16 to 2025-12-24


In [3]:
# Extract temporal features
df['year'] = df['last_updated'].dt.year
df['month'] = df['last_updated'].dt.month
df['day_of_year'] = df['last_updated'].dt.dayofyear
df['hour'] = df['last_updated'].dt.hour

# Create CYCLICAL encodings (sin/cos) for temporal features
# This helps models understand that December (12) is close to January (1)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

df['day_sin'] = np.sin(2 * np.pi * df['day_of_year'] / 365)
df['day_cos'] = np.cos(2 * np.pi * df['day_of_year'] / 365)

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

print("✅ Temporal cyclical features created")

✅ Temporal cyclical features created
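
To see why the sin/cos encoding helps, the sketch below compares distances between months after encoding. It is a standalone illustration using only the `math` module; `encode_cyclical` and `circle_distance` are hypothetical helper names, not part of the notebook.

```python
import math

def encode_cyclical(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def circle_distance(a, b, period):
    """Euclidean distance between two encoded values."""
    return math.dist(encode_cyclical(a, period), encode_cyclical(b, period))

dec_to_jan = circle_distance(12, 1, period=12)
jan_to_feb = circle_distance(1, 2, period=12)
dec_to_jun = circle_distance(12, 6, period=12)

print(f"Dec -> Jan: {dec_to_jan:.3f}")
print(f"Jan -> Feb: {jan_to_feb:.3f}")
print(f"Dec -> Jun: {dec_to_jun:.3f}")
```

With raw month numbers, December and January are 11 apart. After encoding, they are exactly as close as any other pair of adjacent months, while December and June sit on opposite sides of the circle.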


## 2. Define Production-Ready Feature Set

In [4]:
# PRODUCTION-READY FEATURES ONLY
# These are the ONLY features we can get in production:
#   - Geographic: from user location selection or API
#   - Temporal: from user-specified prediction date/time

FEATURES = [
    # Geographic features (available from location input)
    'latitude',
    'longitude',
    
    # Temporal cyclical features (calculated from prediction date)
    'month_sin', 'month_cos',
    'day_sin', 'day_cos',
    'hour_sin', 'hour_cos',
]

TARGET = 'temperature_celsius'

print(f"📋 Feature Set ({len(FEATURES)} features):")
for f in FEATURES:
    print(f"   • {f}")
print(f"\n🎯 Target: {TARGET}")

📋 Feature Set (8 features):
   • latitude
   • longitude
   • month_sin
   • month_cos
   • day_sin
   • day_cos
   • hour_sin
   • hour_cos

🎯 Target: temperature_celsius


In [5]:
# Prepare model data
df_model = df[FEATURES + [TARGET, 'year']].dropna()
print(f"📊 Model dataset: {len(df_model):,} samples")

# Check feature correlations with target
print("\n🔗 Feature Correlations with Temperature:")
for f in FEATURES:
    corr = df_model[f].corr(df_model[TARGET])
    print(f"   {f}: {corr:.3f}")

📊 Model dataset: 114,203 samples

🔗 Feature Correlations with Temperature:
   latitude: -0.318
   longitude: 0.149
   month_sin: -0.146
   month_cos: -0.323
   day_sin: -0.052
   day_cos: -0.354
   hour_sin: -0.249
   hour_cos: 0.013


## 3. Train-Test Split (Temporal)

In [6]:
# Temporal split: Train on 2024, Test on 2025
# This simulates a real-world scenario: train on the past, predict the future

train_mask = df_model['year'] == 2024
test_mask = df_model['year'] == 2025

# If not enough 2025 data, use random split
if test_mask.sum() < 1000:
    print("⚠️ Not enough 2025 data, using random 80/20 split")
    X = df_model[FEATURES]
    y = df_model[TARGET]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
else:
    X_train = df_model.loc[train_mask, FEATURES]
    y_train = df_model.loc[train_mask, TARGET]
    X_test = df_model.loc[test_mask, FEATURES]
    y_test = df_model.loc[test_mask, TARGET]

print(f"📊 Training set: {len(X_train):,} samples")
print(f"📊 Test set: {len(X_test):,} samples")

📊 Training set: 44,469 samples
📊 Test set: 69,734 samples


In [7]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Features scaled")

✅ Features scaled
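
One thing worth checking about the temporal split above is which calendar months each side covers. The data was reported to span 2024-05-16 to 2025-12-24, so a 2024/2025 split may leave some test months without any training examples. The sketch below works only from that date range and assumes observations exist on every day in it, which the notebook does not confirm.

```python
from datetime import date, timedelta

# Date range reported when the data was loaded.
start, end = date(2024, 5, 16), date(2025, 12, 24)

# Assumption: at least one observation exists on every day in the range.
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

train_months = {d.month for d in days if d.year == 2024}
test_months = {d.month for d in days if d.year == 2025}
unseen_months = sorted(test_months - train_months)

print("Months in training data:", sorted(train_months))
print("Test months never seen in training:", unseen_months)
```

Under that assumption, January through April appear only in the test set, so the reported test errors partly measure how well each model extrapolates to times of year it was never trained on.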


## 4. Train Models

In [8]:
# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42),
}

results = []

print("🔄 Training models...\n")
for name, model in models.items():
    print(f"Training {name}...", end=" ")
    
    # Use scaled features for linear models, raw for tree-based
    if 'Linear' in name or 'Ridge' in name:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results.append({
        'Model': name,
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2
    })
    
    print(f"✅ RMSE: {rmse:.2f}°C, MAE: {mae:.2f}°C, R²: {r2:.3f}")

print("\n✅ All models trained!")

🔄 Training models...

Training Linear Regression... ✅ RMSE: 7.84°C, MAE: 6.27°C, R²: 0.268
Training Ridge Regression... ✅ RMSE: 7.84°C, MAE: 6.27°C, R²: 0.268
Training Random Forest... ✅ RMSE: 4.32°C, MAE: 3.21°C, R²: 0.777
Training Gradient Boosting... ✅ RMSE: 3.94°C, MAE: 2.97°C, R²: 0.815

✅ All models trained!
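
For readers less familiar with the three metrics reported above, the sketch below computes them by hand on a few made-up numbers. The `actual` and `predicted` values are purely illustrative and unrelated to the notebook's data; the formulas are the standard definitions that scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` implement.

```python
import math

# Made-up actual and predicted temperatures in °C, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

errors = [p - a for a, p in zip(actual, predicted)]
n = len(errors)

# MAE: average size of the error, in °C.
mae = sum(abs(e) for e in errors) / n

# RMSE: like MAE, but squaring penalizes large misses more heavily.
rmse = math.sqrt(sum(e * e for e in errors) / n)

# R²: squared error relative to always predicting the mean of the actuals.
mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}°C, RMSE: {rmse:.2f}°C, R²: {r2:.3f}")
```

Because RMSE weights large errors more than MAE does, RMSE is never smaller than MAE, which matches the pattern in every row of the results above.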


## 5. Try XGBoost and LightGBM (if available)

In [9]:
try:
    import xgboost as xgb
    
    print("Training XGBoost...", end=" ")
    xgb_model = xgb.XGBRegressor(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        random_state=42
    )
    xgb_model.fit(X_train, y_train)
    y_pred_xgb = xgb_model.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
    mae = mean_absolute_error(y_test, y_pred_xgb)
    r2 = r2_score(y_test, y_pred_xgb)
    
    results.append({
        'Model': 'XGBoost',
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2
    })
    
    print(f"✅ RMSE: {rmse:.2f}°C, MAE: {mae:.2f}°C, R²: {r2:.3f}")
    
except ImportError:
    print("⚠️ XGBoost not installed. Run: pip install xgboost")

Training XGBoost... ✅ RMSE: 3.92°C, MAE: 2.94°C, R²: 0.817


In [10]:
try:
    import lightgbm as lgb
    
    print("Training LightGBM...", end=" ")
    lgb_model = lgb.LGBMRegressor(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        random_state=42,
        verbose=-1
    )
    lgb_model.fit(X_train, y_train)
    y_pred_lgb = lgb_model.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_lgb))
    mae = mean_absolute_error(y_test, y_pred_lgb)
    r2 = r2_score(y_test, y_pred_lgb)
    
    results.append({
        'Model': 'LightGBM',
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2
    })
    
    print(f"✅ RMSE: {rmse:.2f}°C, MAE: {mae:.2f}°C, R²: {r2:.3f}")
    
except ImportError:
    print("⚠️ LightGBM not installed. Run: pip install lightgbm")

⚠️ LightGBM not installed. Run: pip install lightgbm


## 6. Model Comparison

In [11]:
# Results dataframe
results_df = pd.DataFrame(results).sort_values('RMSE')
results_df['Rank'] = range(1, len(results_df) + 1)
results_df = results_df[['Rank', 'Model', 'RMSE', 'MAE', 'R2']]

print("📊 Model Performance Comparison (sorted by RMSE):")
print("="*60)
display(results_df.round(3))

📊 Model Performance Comparison (sorted by RMSE):


|   | Rank | Model | RMSE | MAE | R2 |
|---|------|-------|------|-----|----|
| 4 | 1 | XGBoost | 3.92 | 2.943 | 0.817 |
| 3 | 2 | Gradient Boosting | 3.943 | 2.966 | 0.815 |
| 2 | 3 | Random Forest | 4.325 | 3.208 | 0.777 |
| 1 | 4 | Ridge Regression | 7.843 | 6.272 | 0.268 |
| 0 | 5 | Linear Regression | 7.843 | 6.272 | 0.268 |


In [12]:
# Visualize comparison
fig = make_subplots(rows=1, cols=2, subplot_titles=['RMSE (Lower is Better)', 'R² Score (Higher is Better)'])

# RMSE bars
fig.add_trace(
    go.Bar(x=results_df['Model'], y=results_df['RMSE'], marker_color='#FF6B6B', name='RMSE'),
    row=1, col=1
)

# R² bars
fig.add_trace(
    go.Bar(x=results_df['Model'], y=results_df['R2'], marker_color='#4ECDC4', name='R²'),
    row=1, col=2
)

fig.update_layout(
    title='🤖 Model Performance Comparison',
    template='plotly_dark',
    height=400,
    showlegend=False
)
fig.show()

In [13]:
# Best model
best_model = results_df.iloc[0]
print(f"\n🏆 Best Model: {best_model['Model']}")
print(f"   RMSE: {best_model['RMSE']:.2f}°C")
print(f"   MAE: {best_model['MAE']:.2f}°C")
print(f"   R²: {best_model['R2']:.3f}")


🏆 Best Model: XGBoost
   RMSE: 3.92°C
   MAE: 2.94°C
   R²: 0.817


## 7. Feature Importance

In [14]:
# Get feature importance from the Random Forest model (ranked third by RMSE above)
rf_model = models['Random Forest']
importance = pd.DataFrame({
    'Feature': FEATURES,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=True)

fig = go.Figure(data=[
    go.Bar(
        x=importance['Importance'],
        y=importance['Feature'],
        orientation='h',
        marker_color='#96CEB4'
    )
])

fig.update_layout(
    title='📊 Feature Importance (Random Forest)',
    xaxis_title='Importance',
    yaxis_title='Feature',
    template='plotly_dark',
    height=400
)
fig.show()

## 8. Prediction vs Actual

In [15]:
# Get predictions from the Random Forest model (the same model used for feature importance)
y_pred_best = rf_model.predict(X_test)

# Sample for visualization (seeded so the plot is reproducible)
sample_size = min(2000, len(y_test))
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(y_test), sample_size, replace=False)

fig = go.Figure()

# Scatter plot
fig.add_trace(go.Scatter(
    x=y_test.iloc[sample_idx],
    y=y_pred_best[sample_idx],
    mode='markers',
    marker=dict(color='#4ECDC4', opacity=0.5, size=5),
    name='Predictions'
))

# Perfect prediction line
min_val = min(y_test.min(), y_pred_best.min())
max_val = max(y_test.max(), y_pred_best.max())
fig.add_trace(go.Scatter(
    x=[min_val, max_val],
    y=[min_val, max_val],
    mode='lines',
    line=dict(color='#FF6B6B', dash='dash'),
    name='Perfect Prediction'
))

fig.update_layout(
    title='🎯 Predicted vs Actual Temperature',
    xaxis_title='Actual Temperature (°C)',
    yaxis_title='Predicted Temperature (°C)',
    template='plotly_dark',
    height=500
)
fig.show()

## 9. Save Results

In [16]:
# Save model comparison
results_df.to_csv('../outputs/model_comparison.csv', index=False)
print("✅ Results saved to: outputs/model_comparison.csv")

# Display final summary
print("\n" + "="*60)
print("📋 PRODUCTION-READY MODEL SUMMARY")
print("="*60)
print(f"\n🎯 Best Model: {best_model['Model']}")
print(f"📊 Performance: RMSE = {best_model['RMSE']:.2f}°C, R² = {best_model['R2']:.3f}")
print(f"\n📋 Features Used ({len(FEATURES)}):")
for f in FEATURES:
    print(f"   • {f}")
print("\n✅ These features are all available in production:")
print("   - latitude/longitude: From user location input")
print("   - month/day/hour encodings: From prediction date/time")
print("="*60)

✅ Results saved to: outputs/model_comparison.csv

📋 PRODUCTION-READY MODEL SUMMARY

🎯 Best Model: XGBoost
📊 Performance: RMSE = 3.92°C, R² = 0.817

📋 Features Used (8):
   • latitude
   • longitude
   • month_sin
   • month_cos
   • day_sin
   • day_cos
   • hour_sin
   • hour_cos

✅ These features are all available in production:
   - latitude/longitude: From user location input
   - month/day/hour encodings: From prediction date/time


## 10. Example Prediction Function

In [17]:
def predict_temperature(model, latitude, longitude, month, day_of_year, hour):
    """
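    Predict temperature for a given location and time.
    This function can be used in production with a trained model.
    Pass a model fit on the unscaled FEATURES (the tree-based models above);
    the linear models were fit on scaled inputs and would need `scaler` applied first.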
    
    Parameters:
    - latitude, longitude: Location coordinates
    - month: Month of year (1-12)
    - day_of_year: Day of year (1-366; the encoding uses a 365-day period)
    - hour: Hour of day (0-23)
    
    Returns: Predicted temperature in Celsius
    """
    # Create cyclical features
    features = pd.DataFrame([{
        'latitude': latitude,
        'longitude': longitude,
        'month_sin': np.sin(2 * np.pi * month / 12),
        'month_cos': np.cos(2 * np.pi * month / 12),
        'day_sin': np.sin(2 * np.pi * day_of_year / 365),
        'day_cos': np.cos(2 * np.pi * day_of_year / 365),
        'hour_sin': np.sin(2 * np.pi * hour / 24),
        'hour_cos': np.cos(2 * np.pi * hour / 24),
    }])
    
    return model.predict(features)[0]

# Example: Predict for New York (40.7, -74.0) in July at noon
example_pred = predict_temperature(rf_model, 40.7, -74.0, 7, 182, 12)
print(f"🌡️ Example Prediction:")
print(f"   Location: New York (40.7°N, 74.0°W)")
print(f"   Date: July 1st, 12:00 PM")
print(f"   Predicted Temperature: {example_pred:.1f}°C")

🌡️ Example Prediction:
   Location: New York (40.7°N, 74.0°W)
   Date: July 1st, 12:00 PM
   Predicted Temperature: 29.2°C
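
In production the caller is more likely to have a timestamp than separate `month`, `day_of_year`, and `hour` values. The sketch below shows one possible way to derive the same eight features from a `datetime`. The helper name `features_from_datetime` is hypothetical and not part of the notebook; it reuses the encoding formulas from `predict_temperature` above and needs only the standard library.

```python
import math
from datetime import datetime

def features_from_datetime(latitude, longitude, when):
    """Build the eight model features from coordinates and a datetime."""
    month = when.month
    day_of_year = when.timetuple().tm_yday  # 1-based day of the year
    hour = when.hour
    return {
        'latitude': latitude,
        'longitude': longitude,
        'month_sin': math.sin(2 * math.pi * month / 12),
        'month_cos': math.cos(2 * math.pi * month / 12),
        'day_sin': math.sin(2 * math.pi * day_of_year / 365),
        'day_cos': math.cos(2 * math.pi * day_of_year / 365),
        'hour_sin': math.sin(2 * math.pi * hour / 24),
        'hour_cos': math.cos(2 * math.pi * hour / 24),
    }

# Same inputs as the New York example: July 1st at noon.
row = features_from_datetime(40.7, -74.0, datetime(2025, 7, 1, 12))
print(list(row))
```

The keys come out in the same order as `FEATURES`, so wrapping the result as `pd.DataFrame([row])` should give a frame that a model trained on the unscaled features can consume. For 2025-07-01 the derived day of year is 182, matching the value passed by hand in the example above.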
