# Problem Session 10
## Concrete Compressive Strength I:  Ensemble Learning

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

#### 1.

We will work with the following dataset:

Yeh,I-Cheng. (2007). Concrete Compressive Strength. UCI Machine Learning Repository. https://doi.org/10.24432/C5PK67.

1. Print the ReadMe file and read the variable descriptions.  The file is `Concrete_readme.txt` in this directory.
2. Load the data as a pandas DataFrame. The data is located in `../../data/concrete.csv`.
    * Note:  the last column `Concrete compressive strength(MPa, megapascals)` is our target variable and the rest are features.
3. Make a train/test split.
4. Use `sns.pairplot` to visualize the relationship between each feature and the target.
    * Discussion question:  which of the following should you use for this visualization and why?
        * The full dataset
        * The training set
        * The testing set
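
A minimal sketch of steps 2 and 3, assuming a small synthetic DataFrame in place of `../../data/concrete.csv` so the block runs anywhere. The column names, `test_size`, and `random_state` are illustrative choices, not values from the worksheet.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for pd.read_csv("../../data/concrete.csv"):
# random numbers, with the target in the last column as in the real file.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.uniform(size=(200, 4)),
    columns=["feat_a", "feat_b", "feat_c", "strength"],
)

# Features are every column except the last; the target is the last column.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Hold out 20% of the rows; the split fraction and seed are arbitrary here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=216
)

# Recombine features and target into one frame, which is the shape
# sns.pairplot expects, e.g. sns.pairplot(train_df, y_vars=["strength"]).
train_df = pd.concat([X_train, y_train], axis=1)
print(train_df.shape, X_test.shape)
```

With the real data, only the `df = ...` line would change to a `pd.read_csv` call.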

#### 2a.

In this problem we are removing the usual scaffolding that we supply in the problem sessions!

Write your own code below to find the values of `max_depth` and `n_estimators` that give the lowest average cross-validation RMSE for either a Random Forest Regressor or an Extra Trees Regressor.

Your code should accomplish the following:
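1. Make a 5-fold split of the training data. Stratification applies to classification targets, so a shuffled `KFold` is enough for this regression problem.
2. Select `max_depth` from `range(1,11)` and `n_estimators` from the two choices `[100,500]` to minimize cross-validation RMSE.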

Try not to copy/paste from earlier problem sessions:  talk through the logic and write your own cross-validation loop.

Some further questions to consider when training a random forest regressor:

* Is scaling necessary as a preprocessing step?  Why or why not?
* Does collinearity of features matter?
* Should you do feature selection first?
* What bias/variance impact do you think increasing `n_estimators` or `max_depth` will have on the model?
* How might you decide whether to use Random Forest or Extra Trees?


In [None]:
# your code here

In [79]:
# Hint:  at some point you will want to know both the minimum cross-validation RMSE and the parameter values used to obtain it
# If you want the index of the minimal element of a numpy array `A`, the following code will give you the index

# Making a toy array first
np.random.seed(215)
A = np.random.randint(100, size = (2,4,3))

# Here is how to find the index of the minimal value
min_index = np.unravel_index(np.argmin(A), A.shape)

print(f"A = \n{A} \n \n The minimum value of A is A[{min_index}] = {A[min_index]}")

A = 
[[[70 26 68]
  [14 79 94]
  [31 26 96]
  [45 45 23]]

 [[75 41 85]
  [54 90 39]
  [90 70 17]
  [39 23 65]]] 
 
 The minimum value of A is A[(np.int64(0), np.int64(1), np.int64(0))] = 14
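
One possible shape for a hand-written grid search that uses this hint is sketched here. It assumes synthetic data and a deliberately reduced grid so that it runs in a few seconds; the names `depths`, `n_trees`, and `rmses` are illustrative, and the real exercise would use the concrete training set with the full grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the training data (an assumption, not the concrete data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 3))
y_train = X_train[:, 0] ** 2 + X_train[:, 1] + rng.normal(scale=0.1, size=120)

depths = [1, 3, 5]   # reduced grid so the sketch runs quickly
n_trees = [10, 25]
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# rmses[i, j, k] holds the RMSE for depths[i] and n_trees[j] on fold k.
rmses = np.zeros((len(depths), len(n_trees), kfold.get_n_splits()))

for k, (train_idx, holdout_idx) in enumerate(kfold.split(X_train)):
    X_tt, y_tt = X_train[train_idx], y_train[train_idx]
    X_ho, y_ho = X_train[holdout_idx], y_train[holdout_idx]
    for i, depth in enumerate(depths):
        for j, n in enumerate(n_trees):
            rf = RandomForestRegressor(max_depth=depth, n_estimators=n, random_state=0)
            rf.fit(X_tt, y_tt)
            rmses[i, j, k] = np.sqrt(mean_squared_error(y_ho, rf.predict(X_ho)))

# Average over folds, then locate the grid point with the smallest mean RMSE.
mean_rmses = rmses.mean(axis=2)
best_i, best_j = np.unravel_index(np.argmin(mean_rmses), mean_rmses.shape)
best_depth, best_n = depths[best_i], n_trees[best_j]
print(best_depth, best_n, mean_rmses[best_i, best_j])
```

Storing one RMSE per fold, rather than a running average, keeps the fold-to-fold spread available for inspection.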


##### b.

In this problem you will learn about `GridSearchCV`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html</a>, a handy class from `sklearn` that makes hyperparameter tuning through a grid search and cross-validation quicker to code up than writing out a series of `for` loops.


Read through the code chunks below and fill in the missing code to run the same grid search cross-validation you did above with `GridSearchCV`.

In [None]:
## first import GridSearchCV
from sklearn.model_selection import GridSearchCV

In [None]:
## This will also take 1-2 minutes to run
grid_cv = GridSearchCV(, # first put the model object here
                          param_grid = {'max_depth':, # place the grid values for max_depth here
                                        'n_estimators':}, # and the grid values for n_estimators here
                          scoring = , # put the score we are trying to optimize here as a string: 'neg_root_mean_squared_error'
                                      # Note that "score" is the opposite of "loss":  bigger score is better.
                          cv = ) # put the number of cv splits here

## you fit it just like a model
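
For reference, here is a hedged sketch of a complete `GridSearchCV` call on synthetic data. The arrays `X_toy` and `y_toy`, the choice of `ExtraTreesRegressor`, and the small grid are assumptions made to keep the example fast; they are not the values the exercise asks for.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the concrete training set.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy[:, 0] ** 2 + X_toy[:, 1] + rng.normal(scale=0.1, size=120)

toy_grid = GridSearchCV(
    ExtraTreesRegressor(random_state=0),           # the model object
    param_grid={"max_depth": [1, 3, 5],            # small illustrative grid
                "n_estimators": [10, 25]},
    scoring="neg_root_mean_squared_error",         # a score: bigger is better
    cv=5,                                          # number of cv splits
)

# Fitting runs every grid point through cross-validation,
# then refits the best combination on all of X_toy.
toy_grid.fit(X_toy, y_toy)

print(toy_grid.best_params_)
print(-toy_grid.best_score_)  # negate the score to recover an RMSE
```

Because the scorer is the negated RMSE, `best_score_` is non-positive, and flipping its sign gives a value comparable to a hand-computed cross-validation RMSE.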


Once a `GridSearchCV` object is fit, you can easily find which hyperparameter combination was best and what the optimal score was, and you can access the best model.

In [None]:
## You can find the hyperparameter grid point that
## gave the best performance like so
## .best_params_
grid_cv.best_params_

In [None]:
## You can find the best score like so
## .best_score_
grid_cv.best_score_

In [None]:
## Calling best_estimator_ returns the model with the 
## best avg cv performance after it has been refit on the
## entire training set passed to .fit
grid_cv.best_estimator_

The `best_estimator_` is a model with the optimal hyperparameters that has been fit on the entire training set. Try predicting the compressive strength on the training set with the `best_estimator_` below.

In [None]:
## code here



If you want to look at all of the results, you can do that as well with `.cv_results_`. Try that below.

In [None]:
## You can get all of the results with cv_results_
grid_cv.cv_results_
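
The raw `cv_results_` dictionary is easier to read as a table. The sketch below assumes a tiny synthetic problem and a one-parameter grid (`toy_grid`, `X_toy`, and `y_toy` are illustrative names) and shows one way to turn the results into a DataFrame with RMSE values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the search finishes quickly.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 2))
y_toy = X_toy[:, 0] * X_toy[:, 1] + rng.normal(scale=0.1, size=100)

toy_grid = GridSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=0),
    param_grid={"max_depth": [1, 2, 4]},
    scoring="neg_root_mean_squared_error",
    cv=3,
).fit(X_toy, y_toy)

# cv_results_ is a dict of equal-length arrays, one entry per grid point.
results = pd.DataFrame(toy_grid.cv_results_)

# The scorer is negated RMSE, so flip the sign to read it as an error.
results["mean_rmse"] = -results["mean_test_score"]

summary = results[["param_max_depth", "mean_rmse", "rank_test_score"]]
summary = summary.sort_values("rank_test_score")
print(summary)
```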

##### c.

Using either the fitted `best_estimator_` or a model refit with the best hyperparameters from your `for` loop cross-validation, find the feature importance scores.

In [None]:
## code here
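
As a rough illustration on synthetic data, tree ensembles in `sklearn` expose importance scores through the `feature_importances_` attribute once they are fit. The column names and coefficients below are made up so that one feature clearly dominates.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features: "signal" drives the target, "noise" is irrelevant.
rng = np.random.default_rng(2)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["signal", "weak", "noise"])
y_toy = 3 * X_toy["signal"] + 0.5 * X_toy["weak"] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
forest.fit(X_toy, y_toy)

# Pair each score with its column name and sort from most to least important.
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With a fitted grid search, the same attribute would be read from `grid_cv.best_estimator_.feature_importances_`.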



#### 5. Bagging one of the previous models

Consider the following regression algorithms:

* kNN with $k=2$
* kNN with $k=100$
* Linear Regression
* Support Vector Regressor using RBF kernel and $\gamma = 0.1$
* Support Vector Regressor using RBF kernel and $\gamma = 10$

Of these, which are likely to benefit from bagging, if any?

Choose one algorithm which you think could be improved by bagging and compare cross-validation RMSE of the base model and the bagged model.  Did bagging improve performance?  
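
The comparison could be set up along the following lines. This sketch picks kNN with $k=2$ purely as an example and uses synthetic data, so it says nothing about what happens on the concrete dataset; `cv_rmse` is a hypothetical helper and `n_estimators=20` is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Noisy synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(3)
X_toy = rng.uniform(-2, 2, size=(200, 2))
y_toy = np.sin(X_toy[:, 0]) + X_toy[:, 1] ** 2 + rng.normal(scale=0.5, size=200)

base = KNeighborsRegressor(n_neighbors=2)
# The base model is passed positionally as the first argument.
bagged = BaggingRegressor(KNeighborsRegressor(n_neighbors=2), n_estimators=20, random_state=0)

# Reuse the same folds for both models so the comparison is like for like.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

def cv_rmse(model):
    """Mean cross-validation RMSE of `model` on the toy data."""
    scores = cross_val_score(
        model, X_toy, y_toy, cv=kfold, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()

base_rmse = cv_rmse(base)
bagged_rmse = cv_rmse(bagged)
print(f"base: {base_rmse:.3f}  bagged: {bagged_rmse:.3f}")
```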


#### 3.  Write your own Bagging Regressor class  

Write your own BaggingRegressor class:

In [None]:
import numpy as np

class CustomBaggingRegressor:
    '''
    Trains a sequence of regressors on bootstrap resamples of the training data.
    Prediction is performed by taking the mean of the predictions of all regressors.
    This is only designed to work with MSE loss.
    '''
    def __init__(self, base_estimator, num_estimators=10, kwargs={}):
        '''
        Parameters:
            base_estimator: A regression class from sklearn
            num_estimators: The number of estimators in the ensemble.
            kwargs: A dictionary of keyword arguments to pass to the estimators.
        
        Attributes:
            self.estimators: A list of instantiated base estimators.
        '''
        self.kwargs = 
        self.num_estimators = 
        self.estimators = 
    
    def fit(self, X, y):
        '''
        Parameters:
            X: numpy array of shape (n, p), where n is the number of samples.
            y: numpy array of shape (n,), target values.
        '''
        rng = np.random.default_rng()
        n_samples = X.shape[0]
        for estimator in self.estimators:
            # Generate bootstrap indices (sampling rows, axis=0)
            indices =  # use rng.choice
            X_boot = 
            y_boot = 
            estimator.fit(X_boot, y_boot)
    
    def predict(self, X):
        '''
        Predict using the ensemble by averaging predictions from all estimators.

        Parameters:
            X: numpy array of shape (n, p)
        
        Returns:
            preds: numpy array of shape (n,), the aggregated predictions.
        '''
        # Collect predictions from each estimator and take their mean
        preds = 
        return preds
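
Two NumPy idioms that the class relies on, shown in isolation as a sketch on toy arrays: drawing bootstrap row indices with `rng.choice`, and averaging a list of prediction vectors. The arrays `X_demo`, `y_demo`, and `fake_preds` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = np.arange(12).reshape(6, 2)   # row i is [2i, 2i + 1]
y_demo = np.arange(6)                  # target i belongs to row i

# A bootstrap resample draws n row indices from range(n) with replacement,
# so some rows repeat and others are left out.
n_samples = X_demo.shape[0]
indices = rng.choice(n_samples, size=n_samples, replace=True)

# Index X and y with the same indices so rows and targets stay paired.
X_boot, y_boot = X_demo[indices], y_demo[indices]

# Averaging over axis 0 of a list of prediction vectors gives one
# aggregated prediction per sample.
fake_preds = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
mean_pred = np.mean(fake_preds, axis=0)
print(indices, mean_pred)
```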


In [68]:
# Make sure that your class is able to run the following code.
# Does increasing the number of estimators decrease the MSE?

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X = np.random.normal(size = (100,2))
y = X[:,0]**2 + X[:,0]*X[:,1]

model =  CustomBaggingRegressor(
            base_estimator = DecisionTreeRegressor, 
            num_estimators = 1, # try 1 and 10 a bunch of times.
            kwargs = {'max_depth' : 10}
            )

model.fit(X,y)

mean_squared_error(y, model.predict(X))

0.14264603928267197

--------------------------

This notebook was written for the Erdős Institute Data Science Boot Camp by Steven Gubkin, 2025.