# Intro to blending

**Blending** is an ensemble machine learning algorithm.

It is a colloquial name for **stacked generalization** or stacking ensemble where instead of fitting the meta-model on out-of-fold predictions made by the base model, it is fit on predictions made on a holdout dataset.

Blending was used to describe stacking models that combined many hundreds of predictive models by competitors in the 1M Netflix machine learning competition, and as such, remains a popular technique and name for stacking in competitive machine learning circles, such as the Kaggle community.

*Reference:* https://machinelearningmastery.com/blending-ensemble-machine-learning-with-python/

In [1]:
import pandas as pd
import numpy as np

## Data

In [2]:
df_train = pd.read_csv("/kaggle/input/widsdatathon2022/train.csv")
df_test = pd.read_csv("/kaggle/input/widsdatathon2022/test.csv")

### Data Preprocessing

For details of this section, refer to [my other notebook](https://www.kaggle.com/yunlinlew/wids-22-eda-lightgbm-comparecatencoding-shap).

#### Preprocess Train set

In [3]:
## Drop irrelevant columns

df_train = df_train.drop('Year_Factor', axis=1)
df_train = df_train.drop('id', axis=1)

## Handling null values

for col in ['year_built', 'energy_star_rating', 'direction_max_wind_speed', 'direction_peak_wind_speed', 'max_wind_speed', 'days_with_fog']:
    df_train[col] = df_train[col].fillna(df_train[col].median())
df_train.isna().sum().sum() # check if there is any null values left

## Encode categorical features (label encoding)
from sklearn.preprocessing import OrdinalEncoder
df_train_LE = df_train.copy()

ordinalencoder = OrdinalEncoder()
df_train_LE['State_Factor_Cat'] = ordinalencoder.fit_transform(df_train_LE['State_Factor'].to_numpy().reshape(-1, 1))
df_train_LE['building_class_Cat'] = ordinalencoder.fit_transform(df_train_LE['building_class'].to_numpy().reshape(-1, 1))
df_train_LE['facility_type_Cat'] = ordinalencoder.fit_transform(df_train_LE['facility_type'].to_numpy().reshape(-1, 1))

df_train_LE = df_train_LE.drop('State_Factor', axis=1)
df_train_LE = df_train_LE.drop('building_class', axis=1)
df_train_LE = df_train_LE.drop('facility_type', axis=1)

df_train_LE.head()

Unnamed: 0,floor_area,year_built,energy_star_rating,ELEVATION,january_min_temp,january_avg_temp,january_max_temp,february_min_temp,february_avg_temp,february_max_temp,...,days_above_100F,days_above_110F,direction_max_wind_speed,direction_peak_wind_speed,max_wind_speed,days_with_fog,site_eui,State_Factor_Cat,building_class_Cat,facility_type_Cat
0,61242.0,1942.0,11.0,2.4,36,50.5,68,35,50.589286,73,...,0,0,1.0,1.0,1.0,104.0,248.682615,0.0,0.0,13.0
1,274000.0,1955.0,45.0,1.8,36,50.5,68,35,50.589286,73,...,0,0,1.0,1.0,1.0,12.0,26.50015,0.0,0.0,55.0
2,280025.0,1951.0,97.0,1.8,36,50.5,68,35,50.589286,73,...,0,0,1.0,1.0,1.0,12.0,24.693619,0.0,0.0,48.0
3,55325.0,1980.0,46.0,1.8,36,50.5,68,35,50.589286,73,...,0,0,1.0,1.0,1.0,12.0,48.406926,0.0,0.0,6.0
4,66000.0,1985.0,100.0,2.4,36,50.5,68,35,50.589286,73,...,0,0,1.0,1.0,1.0,104.0,3.899395,0.0,0.0,56.0


#### Preprocess Test set

In [4]:
# Drop irrelevant columns
X_test = df_test.drop('Year_Factor', axis=1)
X_test = X_test.drop('id', axis=1)

# Handle null values
for col in ['year_built', 'energy_star_rating', 'direction_max_wind_speed', 'direction_peak_wind_speed', 'max_wind_speed', 'days_with_fog']:
    X_test[col] = X_test[col].fillna(df_train[col].median())

# Encode categorial features
X_test['State_Factor_Cat'] = ordinalencoder.fit_transform(X_test['State_Factor'].to_numpy().reshape(-1, 1))
X_test['building_class_Cat'] = ordinalencoder.fit_transform(X_test['building_class'].to_numpy().reshape(-1, 1))
X_test['facility_type_Cat'] = ordinalencoder.fit_transform(X_test['facility_type'].to_numpy().reshape(-1, 1))

X_test = X_test.drop('State_Factor', axis=1)
X_test = X_test.drop('building_class', axis=1)
X_test = X_test.drop('facility_type', axis=1)

### Hold-out set for blending
We need to split a portion the training set to serve as training dataset for the blending step.

In [5]:
from sklearn.model_selection import train_test_split

y = df_train_LE['site_eui']
X = df_train_LE.drop('site_eui', axis=1)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Model

## Train base models
We need to create a number of base models. These can be any models we like for a regression or classification problem. In this datathon, we are tackling a regression problem. Here we experiment with LightGBM and CatBoost models.

### Base model: LightGBM

In [6]:
from lightgbm import LGBMRegressor

params = {'learning_rate': 0.1, 'max_depth': 50, 'n_estimators': 20000, 'num_leaves': 65}
model_1 = LGBMRegressor(**params)

model_1.fit(X_train, y_train)

LGBMRegressor(max_depth=50, n_estimators=20000, num_leaves=65)

### Base model: CatBoost

Reference: https://www.kaggle.com/rhythmcam/wids-2022-catboost-rmse

In [7]:
from catboost import CatBoostRegressor

SEED = 42
MODEL_MAX_DEPTH = 12
MODEL_TASK_TYPE = 'GPU'
MODEL_RL = 0.025
MODEL_EVAL_METRIC ='RMSE'
MODEL_LOSS_FUNCTION = 'RMSE'
MODEL_ESR = 10
MODEL_VERBOSE = 1000
MODEL_ITERATIONS = 28000

model_2 = CatBoostRegressor(
    verbose=MODEL_VERBOSE,
    early_stopping_rounds=MODEL_ESR,
    random_seed=SEED,
    max_depth=MODEL_MAX_DEPTH,
    task_type=MODEL_TASK_TYPE,
    learning_rate=MODEL_RL,
    iterations=MODEL_ITERATIONS,
    loss_function=MODEL_LOSS_FUNCTION,
    eval_metric= MODEL_EVAL_METRIC
)
model_2.fit(X_train, y_train)

0:	learn: 57.6890721	total: 27ms	remaining: 12m 36s
1000:	learn: 33.2893989	total: 22.1s	remaining: 9m 57s
2000:	learn: 28.5632240	total: 43.8s	remaining: 9m 29s
3000:	learn: 25.4739806	total: 1m 6s	remaining: 9m 11s
4000:	learn: 23.2592787	total: 1m 28s	remaining: 8m 51s
5000:	learn: 21.5365875	total: 1m 50s	remaining: 8m 27s
6000:	learn: 20.1473116	total: 2m 12s	remaining: 8m 7s
7000:	learn: 19.0155426	total: 2m 35s	remaining: 7m 45s
8000:	learn: 18.0362486	total: 2m 56s	remaining: 7m 22s
9000:	learn: 17.2149065	total: 3m 19s	remaining: 7m 1s
10000:	learn: 16.4948920	total: 3m 42s	remaining: 6m 40s
11000:	learn: 15.8704474	total: 4m 4s	remaining: 6m 17s
12000:	learn: 15.3172980	total: 4m 27s	remaining: 5m 56s
13000:	learn: 14.8280272	total: 4m 49s	remaining: 5m 34s
14000:	learn: 14.3911027	total: 5m 11s	remaining: 5m 11s
15000:	learn: 13.9907609	total: 5m 33s	remaining: 4m 49s
16000:	learn: 13.6342933	total: 5m 55s	remaining: 4m 26s
17000:	learn: 13.3106784	total: 6m 18s	remaining: 4

<catboost.core.CatBoostRegressor at 0x7f6041b080d0>

In [8]:
models = [model_1, model_2]

## Fit blending ensemble
Next, we need to fit the blending model. The base models are previously fit on the training dataset. The meta-model is fit on the predictions made by each base model on the holdout dataset.

In [9]:
from sklearn.linear_model import LinearRegression

X_meta = []
for model in models:
    y_pred = model.predict(X_val) # predict on hold-out set
    y_pred = y_pred.reshape(len(y_pred), 1) # reshape predictions into a matrix with one column
    X_meta.append(y_pred)
# create 2d array from predictions, each set is an input feature
X_meta = np.hstack(X_meta)
# define blending model
blender = LinearRegression()
# fit on predictions from base models
blender.fit(X_meta, y_val)

LinearRegression()

# Submission
### Make predictions on the test set

In [10]:
X_meta = []
for model in models:
    y_pred = model.predict(X_test) # predict on hold-out set
    y_pred = y_pred.reshape(len(y_pred), 1) # reshape predictions into a matrix with one column
    X_meta.append(y_pred)
# create 2d array from predictions, each set is an input feature
X_meta = np.hstack(X_meta)

y_pred_b = blender.predict(X_meta)
results = pd.DataFrame(df_test['id'])
results['site_eui'] = y_pred_b
results.head()

Unnamed: 0,id,site_eui
0,75757,212.297059
1,75758,223.601129
2,75759,184.240192
3,75760,233.176849
4,75761,259.859086


In [11]:
# write predictions to CSV
results.to_csv("submission.csv", header=True, index=False)