# Q7: Modeling

**Phase 8:** Modeling  
**Points:** 9

**Focus:** Train multiple models, evaluate performance, compare models, extract feature importance.

**Lecture Reference:** Lecture 11, Notebook 4 ([`11/demo/04_modeling_results.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/04_modeling_results.ipynb)), Phase 8. Also see Lecture 10 (modeling with sklearn and XGBoost).

---

## Setup

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb
import os

# Load prepared data from Q6
X_train = pd.read_csv('output/q6_X_train.csv')
X_test = pd.read_csv('output/q6_X_test.csv')
y_train = pd.read_csv('output/q6_y_train.csv').squeeze()  # Series only if the CSV has a single column
y_test = pd.read_csv('output/q6_y_test.csv').squeeze()

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

Training set: (109547, 12)
Test set: (10866, 12)


In [3]:
# Sanity check: inspect the type, shape, and first rows of y_test
print(type(y_test))
print(getattr(y_test, "shape", None))
print(y_test.head() if hasattr(y_test, "head") else y_test)


<class 'pandas.core.frame.DataFrame'>
(10866, 2)
  Measurement Timestamp  air_temp
0   2024-07-01 00:00:00      18.2
1   2024-07-01 01:00:00      18.7
2   2024-07-01 02:00:00      18.8
3   2024-07-01 03:00:00      18.5
4   2024-07-01 04:00:00      17.9


In [8]:
# ---------------------------------------
# Q7: MODELING (no leakage)
# ---------------------------------------
import os
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# 1. Load X/y from Q6 outputs
X_train = pd.read_csv("output/q6_X_train.csv", index_col=0)
X_test  = pd.read_csv("output/q6_X_test.csv", index_col=0)

y_train = pd.read_csv("output/q6_y_train.csv", index_col=0).iloc[:, 0]
y_test  = pd.read_csv("output/q6_y_test.csv", index_col=0).iloc[:, 0]

print("Loaded shapes:")
print("X_train:", X_train.shape, "X_test:", X_test.shape)
print("y_train:", y_train.shape, "y_test:", y_test.shape)

# 2. Ensure all X columns are numeric (paranoia-pass)
for dfX in (X_train, X_test):
    for col in dfX.columns:
        dfX[col] = pd.to_numeric(
            dfX[col].astype(str).str.replace(",", "", regex=False),
            errors="coerce"
        )

# Drop any rows with NaNs in X or y (separately for train/test)
train_mask = ~(X_train.isna().any(axis=1) | y_train.isna())
X_train = X_train[train_mask]
y_train = y_train[train_mask]

test_mask = ~(X_test.isna().any(axis=1) | y_test.isna())
X_test = X_test[test_mask]
y_test = y_test[test_mask]

print("After NaN cleanup:")
print("X_train:", X_train.shape, "X_test:", X_test.shape)
print("y_train:", y_train.shape, "y_test:", y_test.shape)

# 3. Define models
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(
        n_estimators=200,
        random_state=42,
        n_jobs=-1
    )
}

results = {}

# 4. Prepare preds DataFrame
preds = pd.DataFrame(index=X_test.index)
preds["actual"] = y_test.values

# 5. Train, predict, evaluate
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    preds[f"predicted_{name}"] = y_pred

    r2 = r2_score(y_test, y_pred)
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    mae = mean_absolute_error(y_test, y_pred)

    results[name] = {"R2": r2, "RMSE": rmse, "MAE": mae}

# 6. Save predictions
os.makedirs("output", exist_ok=True)
preds.to_csv("output/q7_predictions.csv", index=False)  # required artifact has no index column

# 7. Save metrics
with open("output/q7_model_metrics.txt", "w") as f:
    for name, metrics in results.items():
        f.write(f"{name}:\n")
        for metric_name, value in metrics.items():
            f.write(f"  {metric_name}: {value:.4f}\n")
        f.write("\n")

# 8. Feature importance for RandomForest
rf = models["RandomForest"]
fi = pd.DataFrame({
    "feature": X_train.columns,
    "importance": rf.feature_importances_
}).sort_values("importance", ascending=False)
fi.to_csv("output/q7_feature_importance.csv", index=False)

print("✔ Q7 complete (no leakage):")
print(results)


Loaded shapes:
X_train: (109549, 11) X_test: (10866, 11)
y_train: (109549,) y_test: (10866,)
After NaN cleanup:
X_train: (109549, 11) X_test: (10866, 11)
y_train: (109549,) y_test: (10866,)
✔ Q7 complete (no leakage):
{'LinearRegression': {'R2': 0.9920213832390729, 'RMSE': 0.9523401244752799, 'MAE': 0.44861642854753697}, 'RandomForest': {'R2': 0.9997887473231599, 'RMSE': 0.15496349062088147, 'MAE': 0.05982607540532151}}


In [4]:
# ---------------------------------------
# Q7: MODELING
# ---------------------------------------
import os
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Load X/y from Q6 outputs
X_train = pd.read_csv("output/q6_X_train.csv", index_col=0)
X_test  = pd.read_csv("output/q6_X_test.csv", index_col=0)

y_train = pd.read_csv("output/q6_y_train.csv", index_col=0).iloc[:, 0]
y_test  = pd.read_csv("output/q6_y_test.csv", index_col=0).iloc[:, 0]

print("Loaded shapes:")
print("X_train:", X_train.shape, "X_test:", X_test.shape)
print("y_train:", y_train.shape, "y_test:", y_test.shape)

# Ensure numeric features (avoid commas, strings)
for dfX in (X_train, X_test):
    for col in dfX.columns:
        dfX[col] = pd.to_numeric(dfX[col].astype(str).str.replace(",", "", regex=False), errors="coerce")

# Drop NaN rows in both X and y
train_mask = ~(X_train.isna().any(axis=1) | y_train.isna())
X_train, y_train = X_train[train_mask], y_train[train_mask]

test_mask = ~(X_test.isna().any(axis=1) | y_test.isna())
X_test, y_test = X_test[test_mask], y_test[test_mask]

print("After NaN cleanup:")
print("X_train:", X_train.shape, "X_test:", X_test.shape)
print("y_train:", y_train.shape, "y_test:", y_test.shape)

# Define models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
}

# Prediction storage
preds = pd.DataFrame(index=X_test.index)
preds["actual"] = y_test.values
results = {}

# Train, evaluate, and collect metrics
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    # Column names follow the required predicted_* pattern
    preds[f"predicted_{name.lower().replace(' ', '_')}"] = y_test_pred

    # Metrics
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2  = r2_score(y_test, y_test_pred)
    train_rmse = mean_squared_error(y_train, y_train_pred) ** 0.5
    test_rmse  = mean_squared_error(y_test, y_test_pred) ** 0.5
    train_mae = mean_absolute_error(y_train, y_train_pred)
    test_mae  = mean_absolute_error(y_test, y_test_pred)

    results[name] = {
        "Train R²": train_r2, "Test R²": test_r2,
        "Train RMSE": train_rmse, "Test RMSE": test_rmse,
        "Train MAE": train_mae, "Test MAE": test_mae
    }

# Save predictions
os.makedirs("output", exist_ok=True)
preds.to_csv("output/q7_predictions.csv", index=False)  # required artifact has no index column

# Save metrics in formatted text
metrics_path = "output/q7_model_metrics.txt"
with open(metrics_path, "w") as f:
    f.write("MODEL PERFORMANCE METRICS\n========================\n\n")
    for name, metrics in results.items():
        f.write(f"{name.upper()}:\n")
        for metric_name, value in metrics.items():
            f.write(f"  {metric_name}: {value:.4f}\n")
        f.write("\n")

print(f"Metrics written to: {metrics_path}")

# Save feature importance (for Random Forest only)
rf = models["Random Forest"]
fi = pd.DataFrame({
    "feature": X_train.columns,
    "importance": rf.feature_importances_
}).sort_values("importance", ascending=False)
fi.to_csv("output/q7_feature_importance.csv", index=False)

print("Feature importance saved.")
print("Q7 complete:")
print(pd.DataFrame(results).T)


Loaded shapes:
X_train: (109547, 11) X_test: (10866, 11)
y_train: (109547,) y_test: (10866,)
After NaN cleanup:
X_train: (109547, 11) X_test: (10866, 11)
y_train: (109547,) y_test: (10866,)
Training Linear Regression...
Training Random Forest...
Metrics written to: output/q7_model_metrics.txt
Feature importance saved.
Q7 complete:
                   Train R²   Test R²  Train RMSE  Test RMSE  Train MAE  \
Linear Regression  0.996405  0.991902    0.621229   0.959460   0.452318   
Random Forest      0.999990  0.999795    0.033119   0.152648   0.020377   

                   Test MAE  
Linear Regression  0.450226  
Random Forest      0.059590  


---

## Objective

Train multiple models, evaluate performance, compare models, and extract feature importance.

---

## ⚠️ Data Leakage Warning

If you see suspiciously perfect model performance, this likely indicates data leakage. Common warning signs:

**Warning Metrics:**
- **Perfect R² = 1.0000** (or very close, like 0.9999+)
- **Zero or near-zero RMSE/MAE** (e.g., RMSE < 0.01°C for temperature prediction)
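- **Train and test performance nearly identical at a near-perfect level** (difference < 0.01 with R² close to 1); matching scores on a modest fit are not a concern by themselves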
- **Unrealistic precision**: Errors smaller than measurement precision (e.g., < 0.1°C for temperature sensors)
- **Feature correlation > 0.99** with target (check correlations between features and target)

**Common Causes:**
- **Circular prediction logic**: Using rolling windows of the target variable to predict itself
  - Example: Using `air_temp_rolling_7h` to predict `Air Temperature`
  - This is like predicting temperature from smoothed temperature - circular reasoning!
- **Features nearly identical to target**: Any feature with correlation > 0.99 with the target
- **Including target variable directly**: Accidentally including the target in features

**How to Check:**
- Calculate correlations between each feature and the target
- If any feature has correlation > 0.95, investigate whether it's legitimate or leakage
- For time series: Be especially careful with rolling windows, lag features, or any transformation of the target variable
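
The correlation check above can be sketched as follows. This is a minimal illustration on made-up data, not the assignment's dataset: the helper name `flag_leaky_features`, the column `air_temp_noisy_copy` (a stand-in for any target-derived feature), and the toy values are all assumptions.

```python
import numpy as np
import pandas as pd

def flag_leaky_features(X: pd.DataFrame, y: pd.Series, threshold: float = 0.95) -> pd.Series:
    """Return features whose absolute Pearson correlation with y exceeds threshold."""
    corr = X.corrwith(y).abs().sort_values(ascending=False)
    return corr[corr > threshold]

# Toy data (hypothetical): one feature derived from the target, one independent feature
rng = np.random.default_rng(0)
temp = pd.Series(rng.normal(15, 5, 500), name="air_temp")
X_demo = pd.DataFrame({
    "air_temp_noisy_copy": temp + rng.normal(0, 0.2, 500),  # target plus small noise
    "wind_speed": rng.normal(3, 1, 500),                     # unrelated to the target
})

suspects = flag_leaky_features(X_demo, temp)
print(suspects)
```

A flagged feature is a prompt to investigate, not proof of leakage; a legitimately strong predictor can also correlate highly with the target.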

**Example of Problematic Feature:**
- `air_temp_rolling_7h` (7-hour rolling mean of Air Temperature) when predicting Air Temperature
- This feature has a correlation of ~0.994 with the target, which signals circular logic rather than genuine predictive power

**Solution:**
- Only create rolling windows for **predictor variables**, not the target
- Use rolling windows of: Wind Speed, Humidity, Barometric Pressure, etc.
- Avoid rolling windows of: Air Temperature (if that's your target)
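
One way to apply this rule is to build rolling features in a loop that explicitly skips the target column. The sketch below uses a tiny made-up frame; the column names and the 3-hour window are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical hourly readings
df = pd.DataFrame({
    "air_temp": [18.2, 18.7, 18.8, 18.5, 17.9, 17.5],   # target
    "wind_speed": [2.0, 4.0, 6.0, 4.0, 2.0, 0.0],        # predictor
    "humidity": [60.0, 62.0, 64.0, 66.0, 68.0, 70.0],    # predictor
})

target = "air_temp"
predictors = [c for c in df.columns if c != target]

# Rolling means are created for predictors only, never for the target
for col in predictors:
    df[f"{col}_rolling_3h"] = df[col].rolling(window=3, min_periods=1).mean()

feature_cols = [c for c in df.columns if c != target]
print(feature_cols)
```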

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q7_predictions.csv`
**Format:** CSV file
**Required Columns (exact names):**
- `actual` - Actual target values from test set
- `predicted_linear` or `predicted_model1` - Predictions from first model (e.g., Linear Regression)
- `predicted_xgboost` or `predicted_model2` - Predictions from second model (e.g., XGBoost)
- Additional columns for additional models (e.g., `predicted_random_forest` or `predicted_model3`)

**Requirements:**
- Must have at least 2 model prediction columns (in addition to `actual`)
- All values must be numeric (float)
- Same number of rows as test set
- **No index column** (save with `index=False`)

**Example:**
```csv
actual,predicted_linear,predicted_xgboost
15.2,14.8,15.1
15.3,15.0,15.2
...
```
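
A minimal sketch of writing a file in this shape is shown below. The helper `save_predictions` is hypothetical, and the demo writes to a temporary directory; in the assignment the path would be `output/q7_predictions.csv`.

```python
import os
import tempfile
import pandas as pd

def save_predictions(actual, predictions, path):
    """Write `actual` plus one `predicted_<name>` column per model, with no index column."""
    out = pd.DataFrame({"actual": list(actual)})
    for name, values in predictions.items():
        out[f"predicted_{name}"] = list(values)
    out = out.astype(float)          # all values must be numeric
    out.to_csv(path, index=False)    # index=False keeps the index out of the file
    return out

# Demo with the example values shown above
demo_path = os.path.join(tempfile.mkdtemp(), "q7_predictions.csv")
save_predictions(
    actual=[15.2, 15.3],
    predictions={"linear": [14.8, 15.0], "xgboost": [15.1, 15.2]},
    path=demo_path,
)
reloaded = pd.read_csv(demo_path)
print(reloaded.columns.tolist())
```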

### 2. `output/q7_model_metrics.txt`
**Format:** Plain text file
**Content:** Performance metrics for each model
**Required information for each model:**
- Model name
- At least R² score for both train and test sets (additional metrics like RMSE, MAE recommended but optional)

**Requirements:**
- Clearly labeled (model name, metric name)
- **At minimum:** R² (or R-squared or R^2) for train and test for each model
- Additional metrics (RMSE, MAE) are recommended for a complete analysis
- Format should be readable

**Example format (minimum - R² only):**
```
MODEL PERFORMANCE METRICS
========================

LINEAR REGRESSION:
  Train R²: 0.3048
  Test R²:  0.3046

XGBOOST:
  Train R²: 0.9091
  Test R²:  0.7684
```

**Example format (recommended - with additional metrics):**
```
MODEL PERFORMANCE METRICS
========================

LINEAR REGRESSION:
  Train R²: 0.3048
  Test R²:  0.3046
  Train RMSE: 8.42
  Test RMSE:  8.43
  Train MAE:  7.03
  Test MAE:   7.04

XGBOOST:
  Train R²: 0.9091
  Test R²:  0.7684
  Train RMSE: 3.45
  Test RMSE:  4.87
  Train MAE:  2.58
  Test MAE:   3.66
```
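
For reference, the three metrics can be written out directly, which makes their definitions visible. The sketch below uses NumPy on made-up values; `regression_metrics` and `format_report` are hypothetical helper names. The `r2_score`, `mean_squared_error`, and `mean_absolute_error` functions from `sklearn.metrics` compute the same quantities.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R², RMSE, and MAE from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "R²": float(1.0 - ss_res / ss_tot),
        "RMSE": float(np.sqrt(np.mean(resid ** 2))),
        "MAE": float(np.mean(np.abs(resid))),
    }

def format_report(results):
    """Render {model: {"Train": metrics, "Test": metrics}} as labeled text."""
    lines = ["MODEL PERFORMANCE METRICS", "========================", ""]
    for model_name, splits in results.items():
        lines.append(f"{model_name.upper()}:")
        for metric_name in ("R²", "RMSE", "MAE"):
            for split in ("Train", "Test"):
                lines.append(f"  {split} {metric_name}: {splits[split][metric_name]:.4f}")
        lines.append("")
    return "\n".join(lines)

# Toy values (hypothetical)
train_m = regression_metrics([1, 2, 3, 4], [1.5, 2, 3, 4])
test_m = regression_metrics([1, 2, 3, 4], [1, 2, 3, 5])
report = format_report({"Linear Regression": {"Train": train_m, "Test": test_m}})
print(report)
```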

### 3. `output/q7_feature_importance.csv`
**Format:** CSV file
**Required Columns (exact names):** `feature`, `importance`
**Content:** Feature importance from tree-based models (XGBoost, Random Forest)
**Requirements:**
- One row per feature
- `feature`: Feature name (string)
- `importance`: Importance score (float; for tree-based models typically between 0 and 1, summing to 1)
- Sorted by importance (descending)
- **No index column** (save with `index=False`)

**Note:** Tree-based models (XGBoost, Random Forest) provide feature importance directly via `.feature_importances_`. If using only Linear Regression, you can use the absolute values of the coefficients as a proxy for importance; these are comparable across features only when the features are on a similar scale (e.g., standardized).

**Example:**
```csv
feature,importance
Air Temperature,0.6539
hour,0.1234
month,0.0892
Water Temperature,0.0456
...
```
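
A table in this shape can be built from either source of scores. The sketch below is one possible approach: `importance_table` is a hypothetical helper, the coefficients are made up, and normalizing the absolute values to sum to 1 is an optional choice rather than a requirement.

```python
import numpy as np
import pandas as pd

def importance_table(feature_names, scores, normalize=True):
    """Build a feature/importance table sorted by importance (descending).

    `scores` may be a tree model's `.feature_importances_` or a linear
    model's `.coef_`; absolute values are taken so signs do not matter.
    """
    scores = np.abs(np.asarray(scores, dtype=float))
    if normalize and scores.sum() > 0:
        scores = scores / scores.sum()
    table = pd.DataFrame({"feature": list(feature_names), "importance": scores})
    return table.sort_values("importance", ascending=False).reset_index(drop=True)

# Hypothetical linear-regression coefficients
coef = [-0.5, 2.0, 1.5]
fi_demo = importance_table(["hour", "humidity", "wind_speed"], coef)
print(fi_demo)
```

Saving with `fi_demo.to_csv(path, index=False)` then yields the two required columns with no index column.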

---

## Requirements Checklist

- [ ] At least 2 different models trained
  - **Suggested:** Linear Regression and XGBoost (or Random Forest)
  - You may choose other models if appropriate
- [ ] Performance evaluated on both train and test sets
- [ ] Models compared
- [ ] Feature importance extracted
  - Tree-based models: use `.feature_importances_`
  - Linear Regression: use absolute coefficient values
- [ ] Model performance documented with **at least R²** (additional metrics like RMSE, MAE recommended)
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Check for data leakage** - Before training, compute correlations between features and target. Any feature with correlation > 0.95 should be investigated and considered for removal.
2. **Train at least 2 models** - Fit models to training data, generate predictions for both train and test sets
3. **Calculate metrics** - At minimum R² for train and test; RMSE and MAE recommended
4. **Extract feature importance** - Use `.feature_importances_` for tree-based models, or coefficient magnitudes for linear models
5. **Save predictions** - DataFrame with `actual` column plus `predicted_*` columns for each model
6. **Save metrics** - Write clearly labeled metrics to text file

---

## Decision Points

- **Model selection:** Train at least 2 different models. We suggest starting with **Linear Regression** and **XGBoost**, which demonstrate different modeling approaches (linear vs gradient boosting). You may choose other models if appropriate (e.g., Random Forest or Gradient Boosting). See Lecture 11 Notebook 4 for examples.
- **Evaluation metrics:** Report at least **R² score** (coefficient of determination) for each model on both the train and test sets. R² applies to any regression model, measures the proportion of variance explained, and is easy to interpret. **RMSE** (Root Mean Squared Error) and **MAE** (Mean Absolute Error) are recommended additions, and you may include others if relevant (e.g., MAPE, adjusted R²). Compare train vs test performance to check for overfitting.
- **Feature importance:** If using tree-based models (like XGBoost), extract feature importance to understand which features matter most.

---

## Interpreting Model Performance

**Warning Signs of Data Leakage:**
- R² = 1.0000 (perfect score) or R² > 0.999
- RMSE or MAE = 0.0 or unrealistically small (< 0.01 for temperature)
- Train and test performance nearly identical at a near-perfect level (difference < 0.01 with R² close to 1)
- Any feature with correlation > 0.99 with target

**Realistic Expectations:**
- For temperature prediction: RMSE of 0.5-2.0°C is realistic
- R² of 0.85-0.98 is strong but realistic
- Some difference between train and test performance is normal

**If you see warning signs:**
1. Check your features for data leakage (see Data Leakage Warning above)
2. Calculate correlations between features and target
3. Remove features that are transformations of the target variable
4. Re-train models and verify performance is now realistic
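
As a rough aid, two of the thresholds listed above (R² > 0.999 and RMSE < 0.01) can be turned into an automated check. The function name `leakage_warnings` is hypothetical, and the sketch deliberately covers only those two signs; the demo inputs are the test-set values printed in the results table earlier in this notebook.

```python
def leakage_warnings(test_r2, test_rmse, r2_limit=0.999, rmse_floor=0.01):
    """Return a list of warning strings for suspiciously good test metrics."""
    warnings = []
    if test_r2 > r2_limit:
        warnings.append(f"Test R² {test_r2:.6f} exceeds {r2_limit}")
    if test_rmse < rmse_floor:
        warnings.append(f"Test RMSE {test_rmse:.6f} is below {rmse_floor}")
    return warnings

# Test-set metrics from the results table above
rf_flags = leakage_warnings(test_r2=0.999795, test_rmse=0.152648)   # Random Forest
lr_flags = leakage_warnings(test_r2=0.991902, test_rmse=0.959460)   # Linear Regression

print("Random Forest:", rf_flags)
print("Linear Regression:", lr_flags)
```

An empty list does not rule out leakage; it only means these two thresholds were not crossed, so the feature-level checks in the steps above still apply.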

---

## Checkpoint

After Q7, you should have:
- [ ] At least 2 models trained (suggested: Linear Regression and XGBoost)
- [ ] Performance metrics calculated (at minimum: R² for train and test; RMSE and MAE recommended)
- [ ] Models compared
- [ ] Feature importance extracted (if applicable - tree-based models like XGBoost)
- [ ] All 3 artifacts saved: `q7_predictions.csv`, `q7_model_metrics.txt`, `q7_feature_importance.csv`

---

**Next:** Continue to `q8_results.md` for Results.
