<a href="https://colab.research.google.com/github/luferIPCA/MIA-MLA-24-25/blob/main/9_Ensemble_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#!Begin

# Master's in Applied Artificial Intelligence
## Machine Learning Algorithms Course

Notebooks for the MLA course

by [*lufer*](mailto:lufer@ipca.pt)

(ver 2.0)

---



# ML Modelling - Part IX - Ensemble Machine Learning Models
\
**Contents**:

1.  **Ensemble Models**



This notebook explores the requirements and processes to improve an ML model by combining several models into an ensemble.

# Environment preparation


**Importing necessary Libraries**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

#import libraries for training
from sklearn.model_selection import train_test_split


In [None]:
import datetime
print(f"Last updated: {datetime.datetime.now()}")

**Mounting Drive**

In [None]:

from google.colab import drive

# it will ask for your Google Drive credentials
drive.mount('/content/gDrive/', force_remount=True)

# Classification Ensemble

## Get data

In [None]:
#read in the dataset
filePath="/content/gDrive/MyDrive/Colab Notebooks/MIA - ML - 2024-2025/Datasets/"
df = pd.read_csv(filePath+'diabetes_data.csv')

#take a look at the data
df.head()

In [None]:
#check dataset size
df.shape

### Check Data Quality

### NaN Values

In [None]:


#count the non-null values in each column
df.notna().sum()
#there are no null values
#df.notna().shape

In [None]:
df.describe()

## Prepare Data

In [None]:
#split data into inputs and targets
X = df.drop(columns = ['diabetes'])
y = df['diabetes']

In [None]:
X.head()

In [None]:
sns.scatterplot(x=df['age'],y=df['insulin'], hue=df['diabetes'])

### Normalize and Split the Data

In [None]:
#split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

# Scale the features using StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Create Different Models

### Create and Fit a KNN Model


The principle behind Nearest Neighbor (NN) Methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning).

Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data.

Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels.

The kNN algorithm can be considered a voting system: the majority class label among the ‘k’ nearest neighbors of a new data point in the feature space (where k is an integer) determines the class label of that point.

In this classification problem we'll use the `KNeighborsClassifier`. It implements learning based on the $k$ nearest neighbors of each query point, where $k$ is an integer value specified by the user.

[See more in...](https://scikit-learn.org/stable/modules/neighbors.html#classification)
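
The voting idea described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not how `KNeighborsClassifier` is implemented: the function `knn_predict`, the toy arrays `X_toy` and `y_toy`, and the choice of Euclidean distance are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_fit, y_fit, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest points."""
    # Euclidean distance from x_new to every stored training sample
    distances = np.linalg.norm(X_fit - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(y_fit[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters labelled 0 and 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([0.95, 1.00]), k=3)
print(pred_a, pred_b)
```

Note that nothing is "learned" here: the training data is simply stored and searched at prediction time, which is why these methods are called non-generalizing.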

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

#create a new knn model
knn = KNeighborsClassifier()

#create a dictionary of all values we want to test for n_neighbors
params_knn = {'n_neighbors': np.arange(1, 25)}

#use gridsearch to test all values for n_neighbors
knn_gs = GridSearchCV(knn, params_knn, cv=5)

#fit model to training data
knn_gs.fit(X_train, y_train)

In [None]:
#current best model
knn_best = knn_gs.best_estimator_

#check best n_neighbors value
print(knn_gs.best_params_)

### Create and Fit a RandomForest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

#create a new rf classifier
rf = RandomForestClassifier()

#create a dictionary of all values we want to test for n_estimators
params_rf = {'n_estimators': [50, 100, 200]}

#use gridsearch to test all values for n_estimators
rf_gs = GridSearchCV(rf, params_rf, cv=5)

#fit model to training data
rf_gs.fit(X_train, y_train)

In [None]:
#current best model
rf_best = rf_gs.best_estimator_
#rf_best
#check best n_estimators value
print(rf_gs.best_params_)

### Create and Fit a Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

#create a new logistic regression model
log_reg = LogisticRegression()

#fit the model to the training data
log_reg.fit(X_train, y_train)

In [None]:
#test the three models with the test data and print their accuracy scores

print('knn: {}'.format(knn_best.score(X_test, y_test)))
print('rf: {}'.format(rf_best.score(X_test, y_test)))
print('log_reg: {}'.format(log_reg.score(X_test, y_test)))

## Ensemble all explored models

Ensemble models require a kind of "voting" process that combines the predictions of the different models into a single result.
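
To make the voting process concrete, here is a small sketch with made-up predictions from three models. It contrasts hard voting (majority of predicted classes) with soft voting (averaging predicted probabilities), which scikit-learn's `VotingClassifier` exposes through its `voting` parameter. The arrays `hard_preds` and `proba_class1` are hypothetical values chosen for illustration, not outputs of the models in this notebook.

```python
import numpy as np

# Hypothetical class predictions of three models for four samples (rows = models)
hard_preds = np.array([
    [0, 1, 1, 0],   # model A
    [0, 1, 0, 0],   # model B
    [1, 1, 0, 1],   # model C
])

# Hard voting: take the most frequent class for each sample (column)
hard_vote = np.array([np.bincount(col).argmax() for col in hard_preds.T])
print(hard_vote)   # → [0 1 0 0]

# Hypothetical predicted probabilities of class 1 from the same three models
proba_class1 = np.array([
    [0.40, 0.90, 0.55, 0.45],   # model A
    [0.30, 0.80, 0.10, 0.40],   # model B
    [0.95, 0.60, 0.45, 0.90],   # model C
])

# Soft voting: average the probabilities, then threshold at 0.5
soft_vote = (proba_class1.mean(axis=0) > 0.5).astype(int)
print(soft_vote)   # → [1 1 0 1]
```

The two schemes disagree on the first and last samples: one very confident model can outweigh two mildly confident ones under soft voting, while hard voting counts each model equally.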

In [None]:
from sklearn.ensemble import VotingClassifier

#create a list of (name, model) tuples with our models
estimators=[('knn', knn_best), ('rf', rf_best), ('log_reg', log_reg)]

#in classification: voting mechanism
#in regression: aggregation mechanism
#create the voting classifier, which picks the most frequently predicted class among all models (hard voting)
ensemble = VotingClassifier(estimators, voting='hard')

#fit model to training data
ensemble.fit(X_train, y_train)

#test our model on the test data
res=ensemble.score(X_test, y_test)


In [None]:

#collect the accuracy of the ensemble and of each individual model
result = pd.DataFrame({
    "Ensemble": ensemble.score(X_test, y_test),
    "K-NN": knn_best.score(X_test, y_test),
    "RF": rf_best.score(X_test, y_test),
    "LR": log_reg.score(X_test, y_test),
}, index=[0])
result

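In the original run, the ensemble model performed better than the individual k-NN, random forest and logistic regression models. Because neither the train/test split nor the random forest is seeded, the exact scores can change between runs.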

# Regression Ensemble

Explore the models KNN, RandomForest and Linear Regression

In [None]:
from sklearn.ensemble import VotingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Generate synthetic regression dataset
X, y = make_regression(n_samples=500, n_features=5, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Explore KNN, RandomForest, Linear Regression
knn_reg = KNeighborsRegressor(n_neighbors=5)
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
lr_reg = LinearRegression()

# Create Voting Regressor, which averages the predictions of the individual models
voting_reg = VotingRegressor(estimators=[('knn', knn_reg), ('rf', rf_reg), ('lr', lr_reg)])

# Train the ensemble
voting_reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = voting_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Ensemble Mean Squared Error: {mse:.4f}")


In [None]:
y.std()


In [None]:
y.var()

In [None]:
from sklearn.metrics import r2_score
r2=r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")

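Remember:

- If MSE is much smaller than `y.var()`, the model is performing well (R² is close to 1).
- If MSE is close to or larger than `y.var()`, the model does no better than always predicting the mean (R² near or below 0), so it might be underfitting.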
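
The comparison between MSE and the variance of the target is the same information that R² encodes: when both are computed on the same samples, R² = 1 - MSE / Var(y). The sketch below checks this numerically. The arrays `y_true_demo` and `y_pred_demo` are hypothetical values used only for illustration, and the variance is the population variance (NumPy's default, `ddof=0`).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only
y_true_demo = np.array([3.0, -1.0, 2.0, 7.0, 4.0])
y_pred_demo = np.array([2.5, -0.5, 2.0, 6.0, 4.5])

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
var_demo = y_true_demo.var()   # population variance (ddof=0)
r2_demo = r2_score(y_true_demo, y_pred_demo)

# R² equals 1 - MSE / Var(y) when both use the same samples
r2_from_mse = 1 - mse_demo / var_demo
print(f"MSE={mse_demo:.4f}  Var={var_demo:.4f}  "
      f"R2={r2_demo:.4f}  1-MSE/Var={r2_from_mse:.4f}")

# Baseline: always predicting the mean gives MSE == Var(y), i.e. R² == 0
baseline_pred = np.full_like(y_true_demo, y_true_demo.mean())
baseline_mse = mean_squared_error(y_true_demo, baseline_pred)
print(f"Baseline MSE={baseline_mse:.4f}")
```

In the notebook above the variance is taken over the full `y` while the MSE is computed on the test split, so the identity holds only approximately there; using `y_test.var()` makes it exact.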