# Tutorial 3: Testing Model Generalization

**Week 2, Day 5: AI and Climate Change**

**By Climatematch Academy**

__Content creators:__  Deepak Mewada, Grace Lindsay

__Content reviewers:__ Jenna Pearson

__Content editors:__ Name Surname, Name Surname

__Production editors:__ Konstantine Tsafatinos

___

# Tutorial Objectives

*Estimated timing of tutorial: 25 minutes*

In this tutorial, we will...
* Understand the problem of overfitting
* Understand generalization
* Learn to split data into train and test data
* Evaluate trained models on held-out test data
* Think about the relationship between model capacity and overfitting



In [None]:
# @title Tutorial slides

# @markdown These are the slides for the videos in all tutorials today


## Uncomment the code below to display the slides

#from IPython.display import IFrame
#link_id = "<YOUR_LINK_ID_HERE>"

print("If you want to download the slides: 'Link to the slides'")
      # Example: https://osf.io/download/{link_id}/

#IFrame(src=f"https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{link_id}/?direct%26mode=render", width=854, height=480)



---
# **Setup**

In [None]:
# @title Import necessary libraries:

import pandas as pd                 # For data manipulation
from sklearn.model_selection import train_test_split      # For splitting dataset into train and test sets
from sklearn.ensemble import RandomForestRegressor        # For Random Forest Regression
from sklearn.tree import DecisionTreeRegressor            # For Decision Tree Regression

<details>
<summary> <font color='Red'>Click here if you are running on local machine or you encounter any error while importing   </font></summary>
If you are running this code on a local machine and encounter an error while importing a library, install the library via pip. For example, if you receive a `ModuleNotFoundError: No module named 'library_name'` error, run `pip install library_name` to install the required module.
</details>

In [None]:
# @title Set random seed { display-mode: "form" }

# @markdown Executing `set_seed(seed=seed)` you are setting the seed

# E.g., for DL it's critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here: https://pytorch.org/docs/stable/notes/randomness.html

# Call `set_seed` function in the exercises to ensure reproducibility.
import random
import numpy as np

def set_seed(seed=None):
  if seed is None:
    seed = np.random.choice(2 ** 32)
  random.seed(seed)
  np.random.seed(seed)
  print(f'Random seed {seed} has been set.')


set_seed(seed=2024)  # replace 2024 with any number you like
# Set a global seed value for reproducibility
random_state = 42
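To see what seeding buys us, here is a minimal sketch that seeds both generators twice with the same value and compares the draws. The helper name `draw_numbers` and the seed values are made up for illustration.

```python
import random
import numpy as np


def draw_numbers(seed):
    """Seed both generators, then draw one number from each."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand()


first = draw_numbers(2024)
second = draw_numbers(2024)  # same seed, so the same draws
other = draw_numbers(7)      # different seed, so different draws

print("Same seed gives identical draws     :", first == second)
print("Different seed gives identical draws:", first == other)
```

Reseeding with the same value replays the same sequence of random numbers, which is what makes a run repeatable.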

---
# **Section 1: Model generalization**
---


In [None]:
# @title Video 1: Video 1 Name  # put in the title of your video
# note the libraries are imported here on purpose

from ipywidgets import widgets
from IPython.display import YouTubeVideo
from IPython.display import IFrame
from IPython.display import display


class PlayVideo(IFrame):
  def __init__(self, id, source, page=1, width=400, height=300, **kwargs):
    self.id = id
    if source == 'Bilibili':
      src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'
    elif source == 'Osf':
      src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'
    super(PlayVideo, self).__init__(src, width, height, **kwargs)


def display_videos(video_ids, W=400, H=300, fs=1):
  tab_contents = []
  for i, video_id in enumerate(video_ids):
    out = widgets.Output()
    with out:
      if video_ids[i][0] == 'Youtube':
        video = YouTubeVideo(id=video_ids[i][1], width=W,
                             height=H, fs=fs, rel=0)
        print(f'Video available at https://youtube.com/watch?v={video.id}')
      else:
        video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,
                          height=H, fs=fs, autoplay=False)
        if video_ids[i][0] == 'Bilibili':
          print(f'Video available at https://www.bilibili.com/video/{video.id}')
        elif video_ids[i][0] == 'Osf':
          print(f'Video available at https://osf.io/{video.id}')
      display(video)
    tab_contents.append(out)
  return tab_contents

# curriculum or production team will provide these ids
video_ids = [('Youtube', '<video_id_1>'), ('Bilibili', '<video_id_2>'), ('Osf', '<video_id_3>')]
tab_contents = display_videos(video_ids, W=854, H=480)
tabs = widgets.Tab()
tabs.children = tab_contents
for i in range(len(tab_contents)):
  tabs.set_title(i, video_ids[i][0])
display(tabs)

As discussed in the video, machine learning models can *overfit*: they essentially memorize the data points they were trained on. An overfit model performs very well on those points but predicts poorly on data it has not seen. We therefore need to evaluate our models on data they were not trained on.

To do this, we will split the data into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate how well the model performs on unseen data. This helps us ensure that our model can generalize well to new data and avoid overfitting.
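To make the idea of memorization concrete, here is a minimal sketch on synthetic data (not the dataset used in this tutorial). The noisy sine wave, the sample sizes, and all variable names are assumptions chosen for illustration. A decision tree with no depth limit can reproduce its training targets almost exactly, yet it scores lower on points it has not seen.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine wave
rng = np.random.default_rng(0)
X_all = rng.uniform(0, 6, size=(200, 1))
y_all = np.sin(X_all[:, 0]) + rng.normal(scale=0.3, size=200)

# Hold out the last 50 points; the model never sees them during fitting
X_fit, y_fit = X_all[:150], y_all[:150]
X_held, y_held = X_all[150:], y_all[150:]

# With no depth limit, the tree can keep splitting until it fits the noise
memorizer = DecisionTreeRegressor(random_state=0).fit(X_fit, y_fit)

fit_score = memorizer.score(X_fit, y_fit)
held_out_score = memorizer.score(X_held, y_held)
print("Score on data used for fitting:", fit_score)
print("Score on held-out data        :", held_out_score)
```

The gap between the two scores is the symptom of overfitting that the rest of this tutorial looks for.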


## Section 1.1: Load and Prepare the Data

As in the previous tutorial, we load our dataset and prepare it by removing unnecessary columns and extracting the target variable `tas_FINAL`, which represents the temperature anomaly.

In [None]:
# Load and Prepare the Data
url_Climatebench_train_val = "https://osf.io/y2pq7/download" # Dataset URL
training_data = pd.read_csv(url_Climatebench_train_val)  # Load the training data from the provided URL
training_data.pop('scenario')  # Drop the `scenario` column: it is only a label and is not passed into the model
target = training_data.pop('tas_FINAL')  # Extract the target variable 'tas_FINAL' which we aim to predict

## Section 1.2: Data Splitting for Training and Testing

Next, we prepare our dataset for model training and evaluation using the `train_test_split` function from scikit-learn, which splits a dataset into training and testing subsets.

We imported this function in the Setup section:

```python
from sklearn.model_selection import train_test_split      
```

We randomly allocate 20% of the data for testing and reserve the remaining 80% for training. Because the model never sees the test rows during training, its score on them indicates how well it is likely to perform on new data.

Let's split the dataset.

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    training_data, target, test_size=0.2, random_state=1
)

We now have separated the input features (now called `X`) and the target variable (now called `y`) into a training set (`X_train`, `y_train`) and a test set (`X_test`, `y_test`).
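As a quick sanity check on what `train_test_split` does, the sketch below runs it on a toy table rather than the real data. The column names and the `toy_` variables are hypothetical. It confirms the 80/20 row counts, that no row lands in both subsets, and that features and targets stay aligned.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real table: 100 rows and two made-up feature columns
toy_features = pd.DataFrame(
    {"feature_a": np.arange(100), "feature_b": np.arange(100) * 2.0}
)
toy_target = pd.Series(np.arange(100) * 0.1, name="toy_target")

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=1
)

no_overlap = set(toy_X_train.index).isdisjoint(toy_X_test.index)
aligned = bool((toy_X_train.index == toy_y_train.index).all())

print("Train shape:", toy_X_train.shape, "Test shape:", toy_X_test.shape)
print("No row appears in both subsets:", no_overlap)
print("Features and targets share the same row order:", aligned)
```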

## Section 1.3: Train a decision tree model on the training data and evaluate it



In [None]:
# Training the model on the training data
dt_regressor = DecisionTreeRegressor(random_state=random_state, max_depth=20)
dt_regressor.fit(X_train, y_train)

Now we will evaluate the model on both the training and test data.

In [None]:
print('Performance on training data :', dt_regressor.score(X_train, y_train))
print('Performance on test data     :', dt_regressor.score(X_test, y_test))

We can see here that our model is overfitting: it is performing much better on the data it was trained on than on held-out test data.
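The scores compared here come from the regressor's `.score` method, which in scikit-learn returns the coefficient of determination, `R^2 = 1 - SS_res / SS_tot`. As a minimal sketch (the helper name `r_squared` and the toy numbers are made up for illustration), the same quantity can be computed by hand:

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
    return 1.0 - ss_res / ss_tot


toy_truth = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(toy_truth, toy_truth)              # predictions match exactly
baseline = r_squared(toy_truth, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
worse = r_squared(toy_truth, [4.0, 3.0, 2.0, 1.0])     # worse than predicting the mean
print(perfect, baseline, worse)
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean of the targets, and negative values mean worse than that baseline.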

## Section 1.4: Train a random forest model on the training data and evaluate it

Use what you know to train a random forest model on the training data and evaluate it on both the training and test data.
We have already imported `RandomForestRegressor` in the Setup section via:
```
from sklearn.ensemble import RandomForestRegressor  
```



In [None]:
def train_random_forest_model(X_train, y_train, X_test, y_test, random_state):
    """Train a Random Forest model and evaluate its performance.

    Args:
        X_train (ndarray): Training features.
        y_train (ndarray): Training labels.
        X_test (ndarray): Test features.
        y_test (ndarray): Test labels.
        random_state (int): Random seed for reproducibility.

    Returns:
        RandomForestRegressor: Trained Random Forest regressor model.
    """
    #################################################
    ## TODO for students:
    # Train a RandomForestRegressor model using X_train and y_train.
    # Then evaluate its performance on both training and test data using the .score() method
    # and print the performance on training and test data.
    # Remove the following line of code once you have completed the exercise:
    raise NotImplementedError("Student exercise: Implement the training and evaluation process.")
    #################################################

    # Train the model on the training data
    rf_regressor = RandomForestRegressor(random_state=random_state)

    #fill in the code
    #rf_regressor.fit(...., ...)

    print('Performance on training data:', rf_regressor.score(X_train, y_train))
    print('Performance on test data:', rf_regressor.score(X_test, y_test))

    #The difference between performance on training and test data is less for the random forest model

    return rf_regressor

# uncomment this to call the function:
# rf_model = train_random_forest_model(X_train, y_train, X_test, y_test, random_state=42)


In [None]:
# to_remove solution

def train_random_forest_model(X_train, y_train, X_test, y_test, random_state):
    """Train a Random Forest model and evaluate its performance.

    Args:
        X_train (ndarray): Training features.
        y_train (ndarray): Training labels.
        X_test (ndarray): Test features.
        y_test (ndarray): Test labels.
        random_state (int): Random seed for reproducibility.

    Returns:
        RandomForestRegressor: Trained Random Forest regressor model.
    """

    # Train the model on the training data
    rf_regressor = RandomForestRegressor(random_state=random_state)
    rf_regressor.fit(X_train, y_train)
    print('Performance on training data :', rf_regressor.score(X_train, y_train))
    print('Performance on test data     :', rf_regressor.score(X_test, y_test))

    #The difference between performance on training and test data is less for the random forest model

    return rf_regressor

# Call the function:
rf_model = train_random_forest_model(X_train, y_train, X_test, y_test, random_state=42)


### Think 3.3: Overfitting - Decision Tree vs Random Forest
1. Does the random forest model overfit less than a single decision tree?
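As background for this question, the sketch below illustrates the averaging idea behind ensembles using plain random numbers, with no trees involved. It assumes each hypothetical model's prediction equals the true value plus independent noise; all numbers are made up. Real trees in a forest are correlated, so the reduction in spread is smaller in practice than in this idealized case.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_trials, n_models = 10_000, 25

# Each column is one hypothetical model: the truth plus independent noise
single_predictions = true_value + rng.normal(scale=0.5, size=(n_trials, n_models))

# The ensemble prediction is the average over the models in each trial
ensemble_predictions = single_predictions.mean(axis=1)

single_spread = single_predictions[:, 0].std()
ensemble_spread = ensemble_predictions.std()
print("Spread of one model's predictions :", single_spread)
print("Spread of the averaged predictions:", ensemble_spread)
```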


In [None]:
#to_remove_explanation

"""
The difference between performance on training and test data is smaller for the random forest model, so it overfits less.
This is consistent with what we learned about the benefit of using ensemble models.
"""

## Section 1.5: Explore parameters of the random forest model

In the previous tutorial we saw how we can control the depth of a single decision tree.
We can also control the depth of the decision trees used in our random forest model by passing in a `max_depth` argument, and the number of trees by setting `n_estimators`.

Intuitively, these variables control the *capacity* of the model. Capacity loosely refers to the number of trainable parameters in the model. The more trees and the deeper they are, the more free parameters the model has to capture the training data. If the model's capacity is too low, it won't be powerful enough to capture complex relationships between the input features and the target variable. If it has too many free parameters, however, it may end up memorizing every single training point and therefore overfit.
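One rough way to see capacity grow is to count the nodes in the fitted trees. The sketch below does this on synthetic data, which is an assumption for illustration, as are the helper `total_nodes` and the chosen depths. It relies on the fitted forest's `estimators_` list and each tree's `tree_.node_count` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): a noisy sine wave
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)


def total_nodes(forest):
    """Sum the node counts of all fitted trees in the forest."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)


node_counts = {}
for depth in (2, 5, 10):
    forest = RandomForestRegressor(n_estimators=10, max_depth=depth, random_state=0)
    forest.fit(X_demo, y_demo)
    node_counts[depth] = total_nodes(forest)
    print(f"max_depth={depth:2d} -> {node_counts[depth]} nodes across 10 trees")
```

Node count is only a proxy for capacity, but it shows how quickly the number of fitted quantities grows as the depth limit is relaxed.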

Use the sliders below to experiment with different values of `n_estimators` and `max_depth` and see how they impact performance on training and test data.

### Interactive Demo 3.1: Performance of Random Forest
In this activity, you can adjust the sliders for `n_estimators` and `max_depth` to observe their effect on model performance:

* `n_estimators`: Controls the number of trees in the Random Forest.
* `max_depth`: Sets the maximum depth of each tree.

After you adjust the sliders, the code fits a new Random Forest model and prints the training and testing scores, showing how changes in these parameters impact model performance.

In [None]:
# @title Use the slider to change the values of 'n_estimators' and 'max_depth' and observe the effect on performance. { run: "auto", form-width: "50%", display-mode: "form" }
# @markdown Make sure you execute this cell to enable the widget!


n_estimators = 30 # @param {type:"slider", min:10, max:100, step:10}
max_depth = 20 # @param {type:"slider", min:5, max:50, step:5}
rf_regressor = RandomForestRegressor(
    n_estimators=n_estimators, max_depth=max_depth, random_state=random_state
)  # Use the slider values so that moving a slider changes the model


rf_regressor.fit(X_train, y_train) # Train the model on the training data
# Print scores
print(f"\tn_estimators = {n_estimators}, max_depth = {max_depth}:")
print(f"\n\tTraining Score : {rf_regressor.score(X_train, y_train)}")
print(f"\tTesting Score  : {rf_regressor.score(X_test, y_test)}")


### Interactive Demo 3.1 Discussion

1. Did you observe any trends in how the performance changes?  
2. Try to explain in your own words the concepts of capacity and overfitting and how they relate.
3. In addition to model capacity, what else could be changed to prevent overfitting?
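When exploring the third question, one tool worth knowing is cross-validation: instead of relying on a single train/test split, the data are split several ways and the held-out scores are averaged. The sketch below uses scikit-learn's `cross_val_score` on synthetic data; the data, the model settings, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration): a noisy sine wave
rng = np.random.default_rng(0)
X_cv = rng.uniform(0, 6, size=(200, 1))
y_cv = np.sin(X_cv[:, 0]) + rng.normal(scale=0.3, size=200)

cv_model = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0)

# cv=5: fit on four fifths of the data, score on the remaining fifth, five times
fold_scores = cross_val_score(cv_model, X_cv, y_cv, cv=5)
print("Score on each held-out fold:", fold_scores)
print("Mean held-out score        :", fold_scores.mean())
```

Cross-validation does not change the model itself; it gives a steadier estimate of held-out performance, which helps when comparing settings such as `max_depth`.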

In [None]:
#to_remove_explanation

"""
1. Observations: Adjusting `n_estimators` and `max_depth` changes model performance. Increasing `n_estimators` tends to improve performance at first and then level off; because the forest averages over its trees, adding more trees usually does not by itself make overfitting worse. Increasing `max_depth` initially improves performance by capturing more complex patterns, but excessively deep trees can fit noise in the training data and overfit.

2. Capacity and Overfitting: Capacity refers to a model's ability to capture complex patterns, while overfitting occurs when a model learns noise in the training data instead of the underlying patterns. Increasing capacity, for example by growing deeper trees, can lead to overfitting.

3. Preventing Overfitting: Apart from adjusting model capacity, we could also train on a larger dataset. Techniques such as regularization, cross-validation, feature selection, and ensemble methods also help limit overfitting, so that the model generalizes well to unseen data by balancing complexity and performance.

"""



---


# **Summary**

In this tutorial, we examined why separate training and test sets matter for building robust machine learning models. We introduced overfitting, saw why models should be assessed on held-out test data, and practiced splitting data, training models, and evaluating their performance.

---

Congratulations! You have reached the end of the tutorial.

---
