# Lecture 8: Non-linear Methods

Welcome to **Lecture 8**! Building on what we learned in Lecture 7, we will now explore more advanced **non-linear regression techniques** for energy systems data. We'll continue using the same dataset, but shift our focus to powerful **tree-based models**: **Random Forests** and **Gradient Boosting (LightGBM)**.

By the end of this notebook, you’ll know how to:

- Analyze feature correlations to inform model input selection
- Perform robust data splitting for training, validation, and testing
- Train and evaluate tree-based regression models
- Compare non-linear models with linear regression
- Interpret feature importance and model behavior

This notebook is designed for both **self-study and in-class exercises**.  

**Watch for the 'Exercise' sections to apply what you’ve learned!**

---

### ML Pipeline Overview

In this session, we’ll follow a **simple, repeatable pipeline** for any supervised learning task:

1. **Explore & Prepare Data**  
2. **Split** into train / validation / test  
3. **Train** one or more models  
4. **Evaluate** on validation and test sets  
5. **Interpret** results (feature importance, errors, plots)  
6. **Save** trained models for future use  

As we move through each exercise, you’ll see exactly where it fits in this pipeline.


## 1. Import Required Libraries

Let’s start by importing the libraries we’ll use throughout this notebook:

- `pandas` and `numpy`: Data manipulation and numerical operations
- `matplotlib.pyplot` and `seaborn`: Data visualization
- `scikit-learn`: Tools for model evaluation, splitting, metrics, and Random Forests
- `lightgbm`: Fast and efficient gradient boosting framework

If LightGBM is not installed on your system, you can install it with:

```bash
pip install lightgbm
```


In [None]:
# Install LightGBM if not already installed
%pip install lightgbm
# Install seaborn for more advanced visualization
%pip install seaborn

In [None]:
# --- Data Handling ---
import pandas as pd  # For loading, manipulating, and analyzing structured data
import numpy as np   # For numerical operations and array management
import pickle  # For serializing and deserializing Python objects, useful for saving models
import os  # For interacting with the operating system, e.g., file paths
import urllib.request # to download files

# --- Visualization ---
import matplotlib.pyplot as plt  # Core plotting library for static visualizations
import seaborn as sns            # High-level interface for drawing attractive statistical graphics

# --- Model Selection & Evaluation ---
from sklearn.model_selection import (
    train_test_split,  # To split the dataset into training, validation, and test sets
    GridSearchCV,      # For exhaustive hyperparameter tuning using cross-validation
)

# --- Machine Learning Models ---
from sklearn.ensemble import RandomForestRegressor  # Ensemble model using decision trees for regression
import lightgbm as lgb                              # Efficient gradient boosting framework (LightGBM)

# --- Evaluation Metrics ---
from sklearn.metrics import (
    mean_absolute_error,  # Measures average magnitude of errors in a set of predictions
    mean_squared_error    # Penalizes larger errors more than MAE by squaring them
)

## 2. Data Exploration & Correlation

### 2.1 Load the Dataset
Let's load the dataset we'll use for this lecture. It's the same data as in Lecture 7, so you can compare results directly.

**Why?**
A consistent dataset allows us to fairly compare linear and non-linear models.

**About the Dataset**

The day-ahead electricity market data used in this exercise is sourced from the **[SMARD.de](https://www.smard.de/home)** platform — the official transparency portal of the German electricity market operated by the Bundesnetzagentur.

SMARD provides free access to a wide range of real-time and historical data including:

- Electricity generation by source,
- Day-ahead and intraday market prices,
- Load forecasts and actual consumption,
- Cross-border electricity flows.

You are encouraged to explore this portal for additional datasets, either for your own projects or to extend this analysis.

**This cell will create an `inputs/` folder (if it doesn't already exist) and then download the `market_data_2024.csv` file from the GitHub repository into that folder.**

In [None]:
# 1) Create the 'inputs' folder if it doesn't exist
os.makedirs('inputs', exist_ok=True)

# 2) Define the URL to the raw CSV on GitHub and the local path
url = 'https://raw.githubusercontent.com/nick-harder/AIES/main/lecture8/data/market_data_2024.csv'
local_path = 'inputs/market_data_2024.csv'

# 3) Download the file only if it's not already present
if not os.path.exists(local_path):
    print(f"Downloading data from {url} ...")
    urllib.request.urlretrieve(url, local_path)
    print("Download complete.")
else:
    print("File already exists, skipping download.")

In [None]:
# Load data
data = pd.read_csv('inputs/market_data_2024.csv', index_col=0, parse_dates=True)
data.head()

### 2.2 Correlation Analysis & Matrix

Before building models, it's important to understand how your features relate to each other. The **correlation matrix** shows linear relationships between variables. High correlations can indicate redundancy (multicollinearity), which can affect some models.

**Goal:** Identify strong linear relationships and potential issues in your features.

In [None]:
# ============================
# Pearson Correlation Analysis
# ============================

# 1. Compute Pearson correlation matrix
# This measures linear correlation between each pair of features in the dataset.
corr_matrix = data.corr(method='pearson')

# 2. Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# 3. Create a heatmap
# - annot=True: show the actual correlation values inside the cells
# - fmt='.2f': format correlation values to 2 decimal places
# - cmap='coolwarm': red/blue colormap where red = strong positive, blue = strong negative
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    square=True,
    cbar_kws={'label': 'Correlation Coefficient'}
)

# 4. Add a descriptive title
plt.title('Feature Correlation Matrix (Pearson)', fontsize=14)

# 5. Improve layout
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

### 2.3 Exercise 1: Correlation Analysis

After computing the correlation matrix:

1. **Which features are most strongly correlated with the target?**
   - Look for the highest (positive or negative) correlation values in the target column.

2. **Are there any pairs of features with very high correlation (|r| > 0.8)?**
   - These indicate potential multicollinearity, which may negatively affect some models (e.g., linear regression).

3. **Why might high correlation between features be a problem for some models?**
   - Think in terms of model interpretability, overfitting, and redundancy.

4. **Based on your correlation analysis, choose the top 5 features that you believe are most relevant for predicting the target variable.**
   - Consider both strong correlation with the target and low redundancy with other features.
   - You will use these features in subsequent modeling tasks.

> **Note:** Use the `.corr()` matrix or `.abs().sort_values()` on the target column to help you decide.

Once you’ve made your selection, create a new reduced feature matrix with only those 5 features.
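As a rough illustration of the ranking step, here is a minimal sketch on a toy DataFrame. The column names (`load`, `wind`, `noise_feature`, `price`) and the data are made up for this example and are not part of the lecture dataset; only the pattern (`.corr()`, then `.abs()`, then `.sort_values()`) carries over.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names (NOT the lecture dataset)
rng = np.random.default_rng(0)
n = 500
load = rng.normal(50, 10, n)
wind = rng.normal(20, 5, n)
toy = pd.DataFrame({
    "load": load,
    "wind": wind,
    "noise_feature": rng.normal(0, 1, n),              # unrelated to the target
    "price": 2.0 * load - 3.0 * wind + rng.normal(0, 5, n),
})

target_col = "price"

# Absolute Pearson correlation of every feature with the target
corr_with_target = toy.corr(method="pearson")[target_col].drop(target_col).abs()

# Rank features from most to least correlated and keep the top k as a plain list
ranked = corr_with_target.sort_values(ascending=False)
top_k = ranked.head(2).index.tolist()

print(ranked)
print("Top features:", top_k)
```

Converting the result to a plain Python list with `.tolist()` makes it straightforward to append the target column name afterwards. Keep in mind that this ranking only looks at correlation with the target; checking redundancy between the selected features is still a separate, manual step.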


In [None]:
# =================================
# EXERCISE 1: Select Top 5 Features
# =================================

# 1. Compute the absolute correlation of each feature with the target
corr_with_target = ... # your code here

# 2. Sort the features by absolute correlation
top_features = ... # your code here

# 3. Create a new DataFrame with only the selected features
X_selected = data[top_features]

# Include the target column as well
selected_data = data[top_features + ['DA Price']]

# 4. Print the selected features
print("Selected Top 5 Features:", top_features)

## 3. Train/Validation/Test Split

To evaluate models fairly, we split our data into **training**, **validation**, and **test** sets. This helps prevent overfitting and gives us an unbiased estimate of model performance.

- **Training set:** Used to fit the model.
- **Validation set:** Used to tune hyperparameters.
- **Test set:** Used only for final evaluation.

We'll show both manual and sklearn-based splits.

### 3.1 Exercise 2: Manual Data Split

- Split the dataset into: training (first 80%), validation (next 10%), test (final 10%).
- Do NOT shuffle the data.
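
The idea of an ordered split can be sketched on a toy hourly DataFrame. The data and column names below are invented for illustration; the point is only that positional slicing with `.iloc` keeps the chronological order intact.

```python
import numpy as np
import pandas as pd

# Toy hourly frame with hypothetical columns (NOT the lecture dataset)
idx = pd.date_range("2024-01-01", periods=1000, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(
    {"feature": np.arange(1000), "target": np.arange(1000) * 0.5},
    index=idx,
)

# Compute the cut points once, then slice by position (no shuffling)
n = len(toy)
train_end = int(n * 0.8)
val_end = int(n * 0.9)

train = toy.iloc[:train_end]          # first 80%
val = toy.iloc[train_end:val_end]     # next 10%
test = toy.iloc[val_end:]             # final 10%

print(len(train), len(val), len(test))
print(train.index.max(), "<", val.index.min(), "<", test.index.min())
```

Because every training timestamp precedes every validation and test timestamp, the model is never evaluated on data that lies before something it was trained on.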

In [None]:
# ----------------------------
# 1. Perform the split
# ----------------------------
X_train = ...  # your code here, first 80%
X_eval = ...   # your code here, next 10%
X_test = ...   # your code here, last 10%

# ----------------------------
# 2. Extract the target column
# ----------------------------
y_train = X_train['DA Price']
y_eval = X_eval['DA Price']
y_test = X_test['DA Price']

# ----------------------------
# 3. Drop the target from feature sets
# ----------------------------
X_train = X_train.drop(columns=['DA Price'])
X_eval = X_eval.drop(columns=['DA Price'])
X_test = X_test.drop(columns=['DA Price'])

# ----------------------------
# 4. Print shapes to verify splits
# ----------------------------
print(f"Train set: {X_train.shape}, Validation set: {X_eval.shape}, Test set: {X_test.shape}")


### 3.2 Train-test split using `train_test_split`

Now that you’ve manually split the dataset into training, validation, and test sets based on index ranges, let's see how to obtain **the same 80/10/10 proportions** using a more convenient method. Unlike the manual split, this one shuffles the rows, so the resulting subsets will contain different samples.

If your dataset does **not require time-based ordering** (i.e., it's safe to shuffle the data), `sklearn.model_selection.train_test_split` provides a concise way to randomly split it into subsets.

We’ll achieve the same 80/10/10 split by:
- First splitting off 80% of the data for training,
- Then splitting the remaining 20% equally into validation and test sets.

This approach is especially useful when working with datasets where row order doesn't carry semantic meaning, and you want to ensure randomness and reproducibility.

**Note**: For time series or sequence models, avoid shuffling — use manual, time-based splits instead as we did previously.

Let’s implement the shuffle-based split now:


In [None]:
# Using sklearn shuffle split
features = selected_data.drop('DA Price', axis=1)
target = selected_data['DA Price']

# Perform a train-test split
random_state = 42  # For reproducibility
perform_shuffle = True  # Shuffle the data before splitting

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=random_state, shuffle=perform_shuffle)
# Further split the test set into validation and test sets
X_val, X_test, y_val, y_test = ... # your code here, 50% of the remaining data

# Print shapes to verify splits
print(f"Train set: {X_train.shape}, Validation set: {X_val.shape}, Test set: {X_test.shape}")

## 4. Random Forest Implementation

Random Forests are ensemble models that combine multiple decision trees to improve predictive performance and reduce overfitting. They are especially effective for capturing non-linear relationships in data.

**Key Hyperparameters**
- `n_estimators`: Number of trees in the forest
- `max_depth`: Maximum depth of each tree
- `random_state`: Controls reproducibility

We will now train a Random Forest Regressor and evaluate its performance on the validation set.

---

### 4.1 Exercise 3: Train, Evaluate, and Save a Random Forest

Your task is to implement a complete training pipeline for a Random Forest model. Write your own code for each step.

1. **Instantiate the model**  
   Use: `RandomForestRegressor(n_estimators=..., max_depth=..., random_state=42)`

2. **Train the model on the training data**  
   Use: `model.fit(X_train, y_train)`

3. **Predict on the validation set**  
   Use: `model.predict(X_val)`

4. **Evaluate the model**  
   - Compute **Mean Absolute Error (MAE)**  
     Use: `mean_absolute_error`  
   - Compute **Root Mean Squared Error (RMSE)**  
     Use: `mean_squared_error` and wrap it in `np.sqrt()`

5. **Print the results**  
   Display both training and validation MAE and RMSE.

6. **Save the trained model to disk**  
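
To make the two metrics concrete, here is a small worked example in plain NumPy on made-up numbers. It mirrors what `mean_absolute_error` and `np.sqrt(mean_squared_error(...))` compute, and shows how a single large error affects RMSE more than MAE.

```python
import numpy as np

# Made-up values for illustration
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 30.0, 50.0])

errors = y_pred - y_true               # [2, -2, 0, 10]

# MAE: average absolute error -> (2 + 2 + 0 + 10) / 4 = 3.5
mae = np.mean(np.abs(errors))

# RMSE: square, average, then take the root -> sqrt((4 + 4 + 0 + 100) / 4) = sqrt(27)
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

The one error of 10 accounts for most of the RMSE, which is why RMSE is always at least as large as MAE and grows faster when a model makes occasional large mistakes.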


In [None]:
# ========================================
# EXERCISE 3: Random Forest Implementation
# ========================================

# Step 1: Instantiate your RandomForestRegressor here
rf_model = RandomForestRegressor(...) # your code here

# Step 2: Fit the model to the training data
rf_model.fit(..., ...) # your code here

# Step 3: Predict on the validation set
y_pred_val_rf = ... # your code here

# Step 4: Evaluate MAE and RMSE for training and validation
mae_train_rf = ... # your code here
rmse_train_rf = ... # your code here
mae_val_rf = ... # your code here
rmse_val_rf = ... # your code here

# Step 5: Print your results
print('Training MAE:', ...)
print('Validation MAE:', ...)
print('Training RMSE:', ...)
print('Validation RMSE:', ...)

# Step 6: Save the model to a file using pickle
os.makedirs('outputs', exist_ok=True)  # make sure the output folder exists
with open('outputs/rf_model.pkl', 'wb') as f:
    pickle.dump(rf_model, f)

> **Optional Challenge:**
Try different values for `n_estimators` and `max_depth`.
How do these changes affect validation error? Can you identify overfitting or underfitting?

### 4.2 Hyperparameter Tuning for Random Forest (Optional)

In Random Forests, model performance and generalization depend heavily on the choice of **hyperparameters**—configuration settings that control the structure and behavior of the trees before any data is seen.

Let’s explore a **grid search**, which tests different combinations of hyperparameters using cross-validation and selects the best-performing model based on a scoring metric (here, **mean absolute error**).

In [None]:
# Grid Search over number of trees and max depth
param_grid = {
    'n_estimators': [50, 100],     # Number of trees in the forest
    'max_depth': [5, 10, None]     # Maximum depth of each tree
}

grid_rf = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring='neg_mean_absolute_error'  # sklearn maximizes scores, so MAE is negated
)

grid_rf.fit(X_train, y_train)

print('Best RF Params:', grid_rf.best_params_)

best_rf = grid_rf.best_estimator_
y_pred_val_best = best_rf.predict(X_val)
print('Tuned RF MAE:', mean_absolute_error(y_val, y_pred_val_best))

#### Hyperparameters Explained

- **`n_estimators`**:  
  The number of trees in the forest.  
  - More trees typically reduce variance and improve performance (up to a point).  
  - Too few trees → noisy, unstable predictions (high variance). Too many → increased training time with diminishing returns.

- **`max_depth`**:  
  The maximum depth of each tree, i.e., the longest allowed path from the root to a leaf.  
  - Low depth → high bias (underfitting).  
  - Very high/None → high variance (overfitting).
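
The effect of `max_depth` can be seen in a small experiment. The sketch below uses synthetic data (a noisy sine curve, chosen only for illustration) and compares training and validation MAE for a very shallow, a moderate, and an unrestricted forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear data (illustrative only, not the lecture dataset)
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.3, size=600)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)

results = {}
for depth in [1, 3, None]:
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, model.predict(X_tr))
    val_mae = mean_absolute_error(y_va, model.predict(X_va))
    results[depth] = (train_mae, val_mae)
    print(f"max_depth={depth}: train MAE={train_mae:.3f}, val MAE={val_mae:.3f}")
```

A depth of 1 cannot follow the curve, so both errors stay high. With `max_depth=None` the training error drops well below the validation error, and that gap is the usual sign that the trees are also fitting noise.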

## 5. LightGBM Implementation

LightGBM is a highly efficient implementation of gradient boosting that builds trees sequentially, with each one correcting the errors of the previous. It is particularly well-suited for structured/tabular data.

**Key Hyperparameters**
- `learning_rate`: Step size shrinkage to prevent overfitting
- `num_leaves`: Controls the complexity of each tree
- `objective`: Defines the task (e.g., 'regression')
- `metric`: Evaluation criteria (e.g., MAE 'l1', RMSE 'rmse')

We will now train a LightGBM model and evaluate its performance on the validation set.

---

### 5.1 Exercise 4: Train and Evaluate a LightGBM Model

You will now build and evaluate a LightGBM regression model using your training and validation data.

#### Tasks Overview

1. **Prepare Datasets**  
   Use: `lgb.Dataset(data=X_train, label=y_train)`  
   For validation, use `reference=lgb_train` to inform LightGBM about training statistics.

2. **Define Model Parameters**  
   Create a Python dictionary with key parameters:
   - `'objective': 'regression'`
   - `'metric': ['l1', 'rmse']`
   - `'learning_rate'`: e.g. `0.1`
   - `'num_leaves'`: e.g. `31`
   - `'verbose': -1`

3. **Train the Model**  
   Use: `lgb.train()` with:
   - your `params`
   - `train_set` and `valid_sets`
   - `num_boost_round` (e.g. 100)
   - `callbacks=[lgb.early_stopping(stopping_rounds=5)]`

4. **Make Predictions**  
   Use: `model.predict(X_val, num_iteration=model.best_iteration)`

5. **Evaluate Performance**  
   Use:
   - `mean_absolute_error()`  
   - `mean_squared_error()`  
   - Wrap with `np.sqrt()` to get RMSE

6. **Print Results**  
   Show both training and validation MAE and RMSE.

7. **Save the Trained Model**  
   Use: `model.save_model('outputs/lgbm_model.txt', num_iteration=model.best_iteration)`

In [None]:
# ============================
# EXERCISE 4: LightGBM Implementation
# ============================

# Step 1: Prepare LightGBM datasets
lgb_train = ... # your code here
lgb_val = ... # your code here

# Step 2: Define model parameters
params = {
    'objective': ..., # your code here
    'metric': [...],
    'learning_rate': ...,
    'num_leaves': ...,
    'verbose': ...
}

# Step 3: Train the model with early stopping
num_round = ... # your code here
lgb_model = lgb.train(
    params=..., # your code here
    train_set=...,
    num_boost_round=...,
    valid_sets=[..., ...],
    valid_names=['train', 'valid'],
    callbacks=[lgb.early_stopping(stopping_rounds=...)]
)

# Step 4: Predict on validation set
y_val_pred = ... # your code here

# Step 5: Evaluate performance
mae_train_lgb = ... # your code here
rmse_train_lgb = ...
mae_val_lgb = ...
rmse_val_lgb = ...

# Step 6: Print results
print('LGB Training MAE:', ...) # your code here
print('LGB Training RMSE:', ...)
print('LGB Validation MAE:', ...)
print('LGB Validation RMSE:', ...)

# Step 7: Save the trained model
if not os.path.exists('outputs'):
    os.makedirs('outputs')
lgb_model.save_model('outputs/lgbm_model.txt', num_iteration=...)

> **Optional Challenge:**  
Experiment with different values of `learning_rate` and `num_leaves`.  
How does model performance change? Can you detect underfitting or overfitting?

### 5.2 Understanding the Bias–Variance Tradeoff and Early Stopping

Before we finish, let’s revisit an important concept from the lecture: the **bias–variance tradeoff**.

This framework helps explain why models underfit or overfit—and how techniques like **early stopping** in LightGBM can help us find the right balance.


#### The Bias–Variance Decomposition

For a given model $\hat{f}(x)$ and target variable $y$, the expected test error can be decomposed as:

$$
\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{\text{Bias}^2}_{\text{underfitting}} + \underbrace{\text{Variance}}_{\text{overfitting}} + \underbrace{\sigma^2}_{\text{irreducible error}}
$$

- **Bias**: Error due to wrong assumptions (e.g. linear model for nonlinear data).
- **Variance**: Error due to model sensitivity to training data fluctuations.
- **Irreducible error**: Noise inherent in the data.

We aim to find a model complexity that minimizes the total error—not just training error.
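
The decomposition can also be estimated numerically. The following sketch is a simulation under assumed settings (a sine function as the "true" relationship, Gaussian noise with standard deviation 0.3, polynomial fits of different degrees), none of which come from the lecture data. It refits each model on many fresh training sets and measures how far the average prediction is from the truth (bias²) and how much predictions scatter between training sets (variance).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """Assumed ground-truth function for this simulation."""
    return np.sin(x)

sigma = 0.3                           # noise std; irreducible error is sigma**2
x_test = np.linspace(0.5, 5.5, 50)    # fixed evaluation points
n_repeats, n_train = 300, 30

def decompose(degree):
    """Estimate bias^2 and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Draw a fresh noisy training set and refit the model
        x_tr = rng.uniform(0, 6, n_train)
        y_tr = true_f(x_tr) + rng.normal(0, sigma, n_train)
        coefs = np.polyfit(x_tr, y_tr, degree)
        preds[r] = np.polyval(coefs, x_test)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in [1, 3, 7]:
    b, v = decompose(degree)
    total = b + v + sigma ** 2
    print(f"degree={degree}: bias^2={b:.4f}, variance={v:.4f}, expected error={total:.4f}")
```

The straight line (degree 1) is dominated by bias, while the high-degree polynomial has little bias but a larger variance. Tree depth and the number of boosting rounds play the same role as the polynomial degree here.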


#### How Early Stopping Controls Variance

Early stopping is a regularization technique that halts training once the **validation loss stops improving**. This helps prevent overfitting (i.e., high variance) by:

- Monitoring model performance on a **validation set**.
- Stopping when further training improves fit on training data but **not on validation data**.

In LightGBM, we specify:
- `lgb.early_stopping(stopping_rounds=...)` (passed via `callbacks`): number of rounds with no improvement before stopping.
- `valid_sets`: which data to monitor.

This is a practical way to find a sweet spot in the bias–variance tradeoff **without manual tuning**.
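
The stopping rule itself is simple enough to write out. The function below is a simplified illustration of the logic in plain Python, not LightGBM's actual implementation; the function name and the validation losses are made up for this example.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stopped_round), both 1-based.

    Training stops at the first round where the validation loss has not
    improved for `patience` consecutive rounds.
    """
    best_loss = float("inf")
    best_round = 0
    for round_idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and keep training
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, round_idx
    # Never triggered: training ran for all rounds
    return best_round, len(val_losses)


# Hypothetical validation losses: improving at first, then getting worse
losses = [5.0, 4.2, 3.9, 3.8, 3.85, 3.9, 3.95, 4.1, 4.3]
best, stopped = early_stopping_round(losses, patience=3)
print(f"Best round: {best}, stopped at round: {stopped}")
```

Here the loss bottoms out in round 4, and training stops in round 7 after three rounds without improvement. Predictions are then made with the model as it was at the best round, which is what `num_iteration=model.best_iteration` does in the exercise above.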


#### What Happens When the Model Overfits?

Let’s explore what happens if we **deliberately overfit** our LightGBM model.

> **Try This**: Change the following parameters in the cell above:
>
> - `learning_rate`: increase from `0.1` to `0.3`
> - `num_leaves`: increase from `31` to `256`
> - `num_boost_round`: increase from `100` to `1000`
> - Remove `early_stopping` (comment out the callback line)
>
> Then retrain the model and compare the **training** and **validation** MAE/RMSE.
>
> - Do you notice a much lower training error and a higher validation error?
> - What does this tell you about the model’s generalization?

This illustrates a classic **overfitting** scenario, where the model performs well on training data but poorly on unseen data.


### 5.3 Hyperparameter Tuning for LightGBM (Optional)

LightGBM offers a range of hyperparameters that can significantly affect model performance. In this section, we perform a **grid search** to find a good combination of values. 

We focus on key parameters that control the learning rate, tree complexity, and regularization.

In [None]:
# ==================================
# Hyperparameter Tuning for LightGBM
# ==================================

# Define the parameter grid
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'num_leaves': [15, 31, 63],
    'max_depth': [-1, 5, 10],
    'min_data_in_leaf': [10, 20, 50]
}

# Perform Grid Search
grid_lgb = GridSearchCV(
    estimator=lgb.LGBMRegressor(objective='regression', random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring='neg_mean_absolute_error',
    verbose=1
)

# Fit the model
grid_lgb.fit(X_train, y_train)

# Print the best parameters and corresponding score
print('Best LightGBM Params:', grid_lgb.best_params_)
print('Best LightGBM MAE:', -grid_lgb.best_score_)

# Use the best estimator to make predictions on the validation set
best_lgb = grid_lgb.best_estimator_
y_pred_val_best_lgb = best_lgb.predict(X_val)

# Evaluate performance
mae = mean_absolute_error(y_val, y_pred_val_best_lgb)
rmse = np.sqrt(mean_squared_error(y_val, y_pred_val_best_lgb))

print('Tuned LightGBM Validation MAE:', mae)
print('Tuned LightGBM Validation RMSE:', rmse)

#### Hyperparameters Explained

- **`learning_rate`**:  
  Controls how much the model adjusts per tree.  
  - Smaller values lead to slower but more stable learning.  
  - Larger values speed up learning but can overshoot the optimum.

- **`num_leaves`**:  
  Maximum number of leaves per tree.  
  - Higher values increase model complexity and accuracy but may lead to overfitting.  
  - Smaller values limit model flexibility and may underfit.

- **`max_depth`**:  
  Limits the depth of each tree.  
  - `-1` means no limit.  
  - Shallower trees reduce overfitting risk but may miss patterns.

- **`min_data_in_leaf`**:  
  Minimum number of data points required in a leaf.  
  - Helps prevent overfitting by avoiding very small, specific leaves.  
  - Larger values make the model more conservative.
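
Grid search cost grows multiplicatively with every parameter added. As a quick sanity check before launching a search, the number of model fits can be counted with the standard library; the grid below repeats the values used in this section.

```python
from itertools import product

# Same grid as in the tuning cell of this section
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'num_leaves': [15, 31, 63],
    'max_depth': [-1, 5, 10],
    'min_data_in_leaf': [10, 20, 50],
}
cv_folds = 3

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds

print("Parameter combinations:", len(combinations))
print("Cross-validation fits:", n_fits)
```

With four parameters of three values each, this gives 81 combinations and 243 fits, plus one final refit of the best model on the full training set (the `GridSearchCV` default). Adding a fifth parameter with three values would triple that number.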

## 6. Performance Comparison on the Test Set

Now that both models have been saved, we will load them from disk and evaluate their performance on the **unseen test set**.

We will:
- Load the trained Random Forest and LightGBM models,
- Generate predictions on the test set,
- Compute MAE and RMSE for each model,
- Compare the results to assess performance.

This is a good opportunity to verify that your saved models work correctly and produce consistent results.


In [None]:
# ==============================
# Load Saved Random Forest Model
# ==============================
with open('outputs/rf_model.pkl', 'rb') as f:
    loaded_rf = pickle.load(f)

# Predict and evaluate
y_pred_test_rf = loaded_rf.predict(X_test)
rf_mae = mean_absolute_error(y_test, y_pred_test_rf)
rf_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test_rf))

print('Random Forest Test MAE:', rf_mae)
print('Random Forest Test RMSE:', rf_rmse)

# =========================
# Load Saved LightGBM Model
# =========================
loaded_gbm = lgb.Booster(model_file='outputs/lgbm_model.txt')

# Predict and evaluate
y_pred_test_gbm = loaded_gbm.predict(X_test, num_iteration=loaded_gbm.best_iteration)
gbm_mae = mean_absolute_error(y_test, y_pred_test_gbm)
gbm_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test_gbm))

print('LightGBM Test MAE:', gbm_mae)
print('LightGBM Test RMSE:', gbm_rmse)

### 6.1 Scatter Plot Comparison: Predicted vs. Actual

These plots compare model predictions to the actual target values from the test set.

- Each point represents a single prediction.
- The red dashed line (`y = x`) shows where **perfect predictions** would lie.
- Points **close to the line** indicate good predictions.
- Systematic deviations from the line reveal **biases** (e.g., consistently over- or under-predicting).

By comparing both Random Forest and LightGBM side by side, you can visually assess which model aligns more closely with the ground truth across the test set.

In [None]:
# ===========================================
# Visual Comparison of Predictions vs. Actual
# ===========================================

# Set figure size and layout
plt.figure(figsize=(12, 5))

# --- Random Forest ---
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred_test_rf, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlim(0, 200)
plt.ylim(0, 200)
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Random Forest: Predicted vs. Actual")

# --- LightGBM ---
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_pred_test_gbm, alpha=0.6, color='orange')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlim(0, 200)
plt.ylim(0, 200)
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("LightGBM: Predicted vs. Actual")

plt.tight_layout()
plt.show()

### 6.2 Time Series Plot: Predicted vs. Actual Values Over Time

This plot compares the actual target values and model predictions **over time**.

- It is useful for identifying periods where the model performs well or poorly.
- Trends, shifts, and lagging predictions become easier to observe in a time series format.
- However, when plotting **long time ranges**, the visualization can become **visually dense** and harder to interpret.
- In practice, it's often more informative to zoom in on **smaller time windows** (e.g., a single day or week) to inspect prediction behavior in detail.
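
Zooming in is easy when the series carry a `DatetimeIndex`. The sketch below uses synthetic hourly data (the dates, values, and variable names are assumptions for illustration) and selects one week with label-based slicing.

```python
import numpy as np
import pandas as pd

# Synthetic hourly "actual" and "predicted" series (illustrative only)
idx = pd.date_range("2024-11-01", periods=24 * 30, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(1)
actual = pd.Series(60 + 20 * np.sin(np.arange(len(idx)) * 2 * np.pi / 24), index=idx)
predicted = actual + rng.normal(0, 5, len(idx))

# Label-based slicing on a DatetimeIndex includes both end dates in full
window = slice("2024-11-04", "2024-11-10")
actual_week = actual.loc[window]
predicted_week = predicted.loc[window]

# Error restricted to the selected window
mae_week = (actual_week - predicted_week).abs().mean()
print("Hours in window:", len(actual_week))
print(f"MAE in window: {mae_week:.2f}")
```

The two windowed series can then be passed to `plt.plot` in place of the full test set. Note that this only works if the test set is still in chronological order and keeps its timestamp index, which is not the case after a shuffled split unless you sort by index first.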

In [None]:
# ============================================
# Time Series Comparison: Predicted vs. Actual
# ============================================

# Create a positional index for plotting (a true time axis only if the test set was split without shuffling)
time_index = range(len(y_test))

plt.figure(figsize=(14, 6))

# Plot actual values
plt.plot(time_index, y_test, label='Actual', color='black', linewidth=2)

# Plot Random Forest predictions
plt.plot(time_index, y_pred_test_rf, label='Random Forest Prediction', linestyle='--')

# Plot LightGBM predictions
plt.plot(time_index, y_pred_test_gbm, label='LightGBM Prediction', linestyle='--')

plt.xlabel("Time Index")
plt.ylabel("Target Value")
plt.title("Test Set Predictions Over Time")
plt.legend()
plt.grid(True)
plt.ylim(-50, 200)  # Adjust based on your target value range
plt.tight_layout()
plt.show()

## 7. Feature Importance Visualization

One major advantage of tree-based models is their ability to quantify **feature importance** — that is, how much each feature contributes to the model’s predictions.

Understanding feature importance helps us:
- Interpret model behavior,
- Identify which variables the model relies on most,
- Potentially simplify the model by removing unimportant features.

Below, we'll extract and plot the feature importances from both the **Random Forest** and **LightGBM** models to compare them side by side.


In [None]:
# Ensure you have the feature names
feature_names = X_train.columns

# --- Random Forest Feature Importance ---
rf_importances = pd.Series(loaded_rf.feature_importances_, index=feature_names).sort_values(ascending=False)

# --- LightGBM Feature Importance ---
gbm_importances = pd.Series(
    lgb_model.feature_importance(importance_type='gain'),
    index=feature_names
)
gbm_importances = (gbm_importances / gbm_importances.sum()).sort_values(ascending=False)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Random Forest
rf_importances.head(10).plot(kind='barh', ax=axes[0])
axes[0].set_title("Random Forest: Top Feature Importances")
axes[0].invert_yaxis()

# LightGBM
gbm_importances.head(10).plot(kind='barh', color='orange', ax=axes[1])
axes[1].set_title("LightGBM: Top Feature Importances")
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

### 7.1 Exercise 5: Analyze Feature Importances

Now that you've seen the top features identified by both Random Forest and LightGBM:

1. Compare the feature importances to the **correlation matrix** you analyzed earlier.
2. Are the most important features also the most correlated with the target?
   - If yes, why might that be the case?
   - If not, what could explain the difference?
3. Do both models rank the same features as important? If not, what might explain the differences?

Take a few minutes to reflect on these questions. Write down your observations and hypotheses — you'll discuss them later in class.


## 8. Discussion & Conclusion

In this notebook, we explored **non-linear regression models**—Random Forest and LightGBM—and applied them to structured energy systems data. Starting from a **correlation analysis**, we selected a subset of features and evaluated each model using a clear **train/validation/test pipeline**.

You learned to:

* Identify important features based on correlation and model-derived importance,
* Implement and tune ensemble models,
* Use early stopping and save/load trained models,
* Visualize and interpret performance both numerically and graphically.

We found that:

* Both models performed well, but their feature importance rankings differed.
* Predictions generally followed the actual values, but visualizations showed areas where each model excelled or struggled.
* Even with just 5 input features, the models demonstrated strong predictive power.

This exercise emphasized the value of combining **statistical reasoning** (via correlation analysis) with **machine learning tools** to extract insights and build reliable predictive models.

---

## 9. What to Try at Home

To deepen your understanding and push your analysis further, try the following experiments:

---

### 9.1. Shuffle vs. No Shuffle: Does Order Matter?

In our train-test split, we used `shuffle=True`, which assumes the data is **i.i.d.** (independent and identically distributed). But what if you disable shuffling?

Try:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, shuffle=False  # random_state has no effect without shuffling
)
```

Observe how the performance changes. Why do you think the results differ?

---


### 9.2. Try K-Fold Cross-Validation

Instead of relying on a single train/validation split, **K-Fold Cross-Validation** splits the data into multiple folds and trains the model on different subsets, which can give more **robust and generalized performance estimates**.

**Why it matters:**

* Reduces dependency on a specific split.
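* Helps validate the model more reliably on small datasets. (For time-ordered data, shuffled folds mix past and future samples, so an ordered scheme such as `TimeSeriesSplit` is usually the safer choice.)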

**Example:**

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
import numpy as np

kf = KFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)

scores = cross_val_score(rf, features, target, cv=kf, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)

print("RMSE scores:", rmse_scores)
print("Average RMSE:", rmse_scores.mean())
```

Try it with both models. How stable are the results across folds?
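
Since electricity prices are time-ordered, it may also be worth comparing shuffled K-Fold with an ordered alternative. The sketch below shows how scikit-learn's `TimeSeriesSplit` forms its folds on a toy array of 20 samples; the toy array is an assumption for illustration, and in practice you would pass the splitter as `cv=` to `cross_val_score`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 time-ordered samples (toy data, index = time step)
X_toy = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_toy))

for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    print(
        f"Fold {fold}: train = {train_idx.min()}..{train_idx.max()}, "
        f"validation = {val_idx.min()}..{val_idx.max()}"
    )
```

Each fold trains on an expanding window of past samples and validates on the block that follows it, so no fold is evaluated on data that precedes its training set.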

---


### 9.3. More Features, Better Model?

We trained our models using only the **top 5 features** based on correlation. But what happens if we train using **all features**? Does the performance improve significantly?

Try:

* Train the same models on the full dataset (`X_all = data.drop(columns='DA Price')`).
* Compare MAE/RMSE against the 5-feature models.

Then, try plotting the gain in performance as you include features one by one (based on importance or correlation):

**Sample Sketch:**

```python
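# Assume features_ranked is a list of column names ordered by importance,
# and rf is a RandomForestRegressor (e.g. the one defined in 9.2)
mae_list = []
for i in range(1, len(features_ranked) + 1):
    cols = features_ranked[:i]
    rf.fit(X_train[cols], y_train)        # fit on the training data only
    y_pred = rf.predict(X_val[cols])      # evaluate on held-out validation data
    mae = mean_absolute_error(y_val, y_pred)
    mae_list.append(mae)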

plt.plot(range(1, len(features_ranked)+1), mae_list)
plt.xlabel("Number of Features")
plt.ylabel("MAE")
plt.title("Performance vs. Number of Features")
plt.grid(True)
plt.show()
```

Observe:

* At what point do you stop gaining performance?
* Is there a plateau or even degradation?

---

By experimenting with these variations, you’ll not only reinforce your understanding but also build intuition around **model validation**, **feature selection**, and **generalization** — essential skills for any data scientist.