# Complex ML Models

## In this lesson you will learn

 - Ensemble models: Bagging, Boosting and Stacking
     - RandomForest, AdaBoost, XGBoost
     
 - Support Vector Machine



## Ensemble models


Ensemble models in machine learning combine the predictions from multiple individual models to produce a more accurate and robust prediction. The fundamental idea is that by aggregating the predictions of several models, the ensemble often performs better than any individual model.

The three main classes of ensemble learning methods are bagging, stacking, and boosting:

 - **Bagging** involves fitting many decision trees on different samples of the same dataset and averaging the predictions. For example, RandomForest (regressor or classifier).


 - **Stacking** involves fitting many different model types on the same data and using another model to learn how to best combine the predictions.


 - **Boosting** involves adding ensemble members sequentially, each one correcting the predictions made by prior models, and outputting a weighted average of the predictions.

https://machinelearningmastery.com/tour-of-ensemble-learning-algorithms/

### Bagging models

Bootstrap aggregation, or bagging for short, is an ensemble learning method that seeks a diverse group of ensemble members by varying the training data.

This typically involves using a single machine learning algorithm, almost always an unpruned decision tree, and training each model on a different sample of the same training dataset.

The predictions made by the ensemble members are then combined using simple statistics, such as voting or averaging.




We can **summarize** the key elements of bagging as follows:

 - Different training dataset for each ensembled model.
 - Unpruned decision trees fit on each sample.
 - Simple voting or averaging of predictions.

![image.png](attachment:image.png)
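
The three elements above can be sketched by hand. This is only a minimal illustration on synthetic data (`make_classification` and the choice of 25 trees are assumptions, not part of the lesson's datasets); in practice `RandomForestClassifier` does this for you and also randomizes the features considered at each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # 1. Different training sample for each member (bootstrap: sampling with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # 2. Unpruned decision tree fit on that sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# 3. Simple majority vote over the members' predictions
votes = np.stack([tree.predict(X) for tree in trees])   # shape (25, 300)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Training accuracy of the bagged ensemble:", (bagged_pred == y).mean())
```

An odd number of trees is used so that a binary vote can never tie.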

Differences between the gini and entropy criteria:

https://www.geeksforgeeks.org/gini-impurity-and-entropy-in-decision-tree-ml/
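
To get a feeling for the two criteria, here is a small sketch that computes both for a node, given its class proportions. The function names are ours, for illustration only; scikit-learn computes these internally.

```python
import numpy as np

def gini(p):
    """Gini impurity of a node with class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (in bits) of a node with class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # a class with proportion 0 contributes nothing
    return np.sum(p * np.log2(1 / p))

for proportions in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(proportions,
          "gini =", round(gini(proportions), 3),
          "entropy =", round(entropy(proportions), 3))
```

Both measures are 0 for a pure node and reach their maximum when the classes are evenly mixed, which is why they usually lead to similar Trees.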

### Stacking models

Stacking involves combining the predictions from multiple machine learning models fit on the same dataset.

Stacking has its own nomenclature, where ensemble members are referred to as level-0 models, and the model that uses predictions to weight the level-0 models is referred to as a level-1 model.

 - Level-0 Models (Base-Models): Models fit on the training data and whose predictions are compiled.


 - Level-1 Model (Meta-Model): Model that learns how to best combine the predictions of the base models. The meta-model is trained on the predictions made by base models on out-of-sample data.

Unlike bagging, in stacking, the models are typically different (e.g. not all decision trees) and fit on the same dataset (e.g. instead of samples of the training dataset).

Unlike boosting, in stacking, a single model is used to learn how to best combine the predictions from the contributing models (e.g. instead of a sequence of models that correct the predictions of prior models).

We can **summarize** the key elements of stacking as follows:

 - Unchanged training dataset.
 - Different machine learning algorithms for each ensemble member.
 - Machine learning model to learn how to best combine predictions.

![image.png](attachment:image.png)
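
The idea of training the meta-model on out-of-sample predictions can be sketched by hand as follows. This is a simplified illustration on synthetic data (the dataset and the three base models are assumptions chosen for the example); `StackingClassifier`, used later in this lesson, handles these steps internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Level-0 models: different algorithms, all fit on the same dataset
level0 = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
    KNeighborsClassifier(),
]

# Out-of-sample predictions: each row is predicted by a model that did not see it
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in level0
])

# Level-1 model: learns how to weight the level-0 predictions
level1 = LogisticRegression().fit(meta_features, y)
print("Weight given to each level-0 model:", level1.coef_.round(2))
```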

### Boosting models



Boosting is an ensemble method that seeks to change the training data to focus attention on examples that previously fit models have gotten wrong.

The key property of boosting ensembles is the idea of correcting prediction errors. The models are fit and added to the ensemble sequentially such that the second model attempts to correct the predictions of the first model, the third corrects the second model, and so on.

This typically involves the use of very simple decision trees that only make a single or a few decisions, referred to in boosting as weak learners. The predictions of the weak learners are combined using voting or averaging, with each learner's contribution weighted by its performance.




#### AdaBoost

AdaBoost is one of the most important Boosting ML algorithms.

Three ideas behind AdaBoost:

 - It combines several Decision Trees called weak learners (with depth of one or two)


 - In the final decision, some Trees have more importance than others


 - The errors a Tree makes are considered in the next Tree

The algorithm works as follows:

 - The first Tree is trained and the quality of the model is calculated (Accuracy for classification and MAPE for regression)


 - Then the importance of that Tree is calculated


 - In the first Tree, each observation has the same weight. In the second Tree, the wrong predictions get more weight, so the training of the second Tree is influenced by the quality of the first one. The better the previous model predicts, the larger the weight increase for each error, but the fewer errors there are.


 - The input data for the second Tree is selected depending on those weights, so the misclassifications have a higher probability of being selected to train the second Tree.


 - Then the process is repeated until every instance is well classified or predicted, or the maximum number of Trees set in AdaBoost is reached


 - Finally, we have several Trees and each one makes its own prediction for a new observation. The final decision is made by averaging the individual Trees' decisions, taking into account the importance of each Tree


https://www.youtube.com/watch?v=LsK-xG1cLYA
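
The weight update described in the steps above can be worked through on a toy case. The sketch below assumes the classic AdaBoost formula for binary classification, importance = 0.5 * ln((1 - error) / error), which is not spelled out in this lesson; the 8 observations and the misclassification mask are made up, and scikit-learn's implementation may differ in the details.

```python
import numpy as np

# Hypothetical toy case: 8 observations, the first Tree misclassifies 2 of them
n = 8
weights = np.full(n, 1 / n)                 # every observation starts with the same weight
misclassified = np.array([0, 0, 1, 0, 0, 0, 1, 0], dtype=bool)

# Weighted error of the Tree and its importance ("amount of say")
error = weights[misclassified].sum()
importance = 0.5 * np.log((1 - error) / error)

# Misclassified observations gain weight, the rest lose weight
weights = weights * np.exp(np.where(misclassified, importance, -importance))
weights = weights / weights.sum()           # renormalize so the weights sum to 1

# Resampling proportional to the weights favors the misclassified rows
rng = np.random.default_rng(0)
next_sample = rng.choice(n, size=n, p=weights)

print("error:", error, "importance:", importance.round(3))
print("new weights:", weights.round(3))
print("rows drawn for the second Tree:", next_sample)
```

After the update, the two misclassified rows together carry half of the total weight, so the second Tree is pushed to get them right.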

#### Gradient Boosting

Gradient Boosting (GB) also combines several Tree models **sequentially**, improving the prediction of the dependent variable model after model.

 - It combines several Decision Trees


 - Initially uses the mean (or the log of the odds in the case of classification) to make the first prediction, and calculates the residuals


 - Then a Tree is trained to predict those residuals


 - The predicted residuals are added to the prediction of the dependent variable, improving it, and reducing the initial residuals


 - A Learning Rate factor (usually 0.1) multiplies the residual prediction to avoid overfitting


 - These steps repeat: each new Tree is trained on the residuals left by the previous Trees and its prediction is added to the dependent variable prediction, until we reach the maximum number of Trees



 - The final algorithm has multiple sequential Trees, each one adding some value to the previous prediction. When a new individual is passed through the model, different residual predictions are added depending on the values of its features.

https://www.youtube.com/watch?v=3CC4N4z3GJc
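
The regression case of the steps above can be sketched in a few lines. This is a hand-rolled illustration for squared-error loss on synthetic data (the dataset, 50 trees and depth 3 are assumptions for the example), not what `GradientBoostingRegressor` does line by line.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used here only for illustration
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

learning_rate = 0.1
n_trees = 50

# First prediction: the mean of the dependent variable
prediction = np.full(len(y), y.mean())
initial_mae = np.abs(y - prediction).mean()

trees = []
for _ in range(n_trees):
    residuals = y - prediction                       # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)    # add a shrunken correction
    trees.append(tree)

final_mae = np.abs(y - prediction).mean()
print("MAE with the mean only:", round(initial_mae, 2))
print("MAE after", n_trees, "trees:", round(final_mae, 2))
```

To predict a new observation, you would start from the stored mean and add `learning_rate * tree.predict(...)` for every Tree in `trees`.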

#### XGBoost

eXtreme Gradient Boost (XGBoost) also sequences several Trees, adding residual predictions to the initial dependent variable prediction

 - Initially uses the mean (or the log of the odds in the case of classification) to make the first prediction of the dependent variable, and calculates the residuals

 - A similarity score is calculated to estimate the quality of that initial prediction

 - Then a Tree is trained to predict the residuals based on the most significant feature and threshold. This significance is calculated through the Information Gain, which depends on the similarity score

 - Then new nodes are added to the Tree using new features or a different threshold of the same feature as above. The maximum depth is usually set between 3 and 5.

 - The lambda parameter regularizes the similarity score and reduces overfitting. As in Gradient Boosting, a learning rate scales each Tree's residual prediction.

https://www.youtube.com/watch?v=OtD8wVaFm6E
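
The similarity score and the gain can be computed by hand for a single split. The sketch below assumes the usual presentation for regression with squared error, similarity = (sum of residuals)^2 / (number of residuals + lambda); the residual values are made up, the function names are ours, and this is not code from the xgboost library (classification uses a different denominator).

```python
import numpy as np

def similarity(residuals, lam=0.0):
    """Similarity score for regression: (sum of residuals)^2 / (count + lambda)."""
    residuals = np.asarray(residuals, dtype=float)
    return residuals.sum() ** 2 / (len(residuals) + lam)

def gain(left, right, lam=0.0):
    """Gain of a split: how much the two children improve on their parent."""
    parent = np.concatenate([left, right])
    return similarity(left, lam) + similarity(right, lam) - similarity(parent, lam)

# Hypothetical residuals, already separated by some feature threshold
left, right = np.array([-10.5, -7.5]), np.array([6.5, 7.5])

for lam in (0.0, 1.0):
    print("lambda =", lam, "gain =", round(gain(left, right, lam), 2))
```

A larger lambda shrinks every similarity score, so the gain of the split drops and the Tree becomes more conservative.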

## Complex models with Python

First, let's prepare the data:

 - Divide it into train and test datasets
 - Standardize continuous variables
 - When working with Trees, creating dummies for categorical variables is not always recommended, so we won't do it. However, we still have to encode categorical variables as numbers; otherwise scikit-learn will raise an error
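
As an illustration of the last point, a text column can be turned into integer codes with `OrdinalEncoder`. The `fuel` column below is hypothetical (the datasets in this lesson are already numeric), so take it as a sketch of the idea only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: 'fuel' is a categorical column stored as text
df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "electric"],
    "weight": [1200, 1500, 1100, 1700],
})

encoder = OrdinalEncoder()
# fit_transform expects a 2D input and returns floats, so flatten and cast to int
df["fuel_code"] = encoder.fit_transform(df[["fuel"]]).ravel().astype(int)

print(encoder.categories_[0])   # categories in the order of their codes
print(df)
```

With real data, fit the encoder on the train set only and reuse it to transform the test set.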


#### Import libraries

In [1]:
!pip install xgboost



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix, roc_curve, roc_auc_score
from imblearn.over_sampling import RandomOverSampler
# RepeatedStratifiedKFold for classification
# RepeatedKFold for regression

from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold, GridSearchCV, cross_val_score
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.ensemble import StackingClassifier, StackingRegressor, RandomForestClassifier, RandomForestRegressor
from xgboost import XGBClassifier, XGBRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC, SVR


from tabulate import tabulate

import warnings
warnings.filterwarnings("ignore")

from sklearn import metrics

### Diabetes dataset

Contains more than 700 patients, with several independent variables and one binary dependent variable, Outcome, used for classification. Outcome = 1 means diabetes = Yes

In [None]:
diabetes = pd.read_csv('/content/drive/MyDrive/IronhackDA2023/data/pima-indians-diabetes.csv', sep = ';')
diabetes.head(3)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1


#### Diabetes Train-Test split

In [None]:
diab_X_train, diab_X_test, diab_y_train, diab_y_test = train_test_split(diabetes[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI',
                                              'DiabetesPedigreeFunction','Age']],
                                                    diabetes['Outcome'], train_size = 0.8, random_state = 0)
diab_X_train.head(3)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
603,7,150,78,29,126,35.2,0.692,54
118,4,97,60,23,0,28.2,0.443,22
247,0,165,90,33,680,52.3,0.427,23


### Over Sampling

As the value_counts output below shows, the classes are imbalanced, so let's fix it by over-sampling the training set

In [None]:
diabetes.Outcome.value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [None]:
ros = RandomOverSampler(random_state=42)
diab_X_train, diab_y_train= ros.fit_resample(diab_X_train, diab_y_train)
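
Conceptually, random over-sampling duplicates randomly chosen rows of the minority class until both classes have the same size. The pandas sketch below shows the idea on a made-up toy table; it is not imblearn's implementation.

```python
import pandas as pd

# Hypothetical imbalanced training set: 6 negatives, 2 positives
train = pd.DataFrame({
    "Glucose": [85, 90, 99, 101, 110, 95, 160, 183],
    "Outcome": [0, 0, 0, 0, 0, 0, 1, 1],
})

counts = train["Outcome"].value_counts()
minority = counts.idxmin()
n_extra = counts.max() - counts.min()

# Draw extra minority rows with replacement and append them
extra = train[train["Outcome"] == minority].sample(n=n_extra, replace=True, random_state=42)
balanced = pd.concat([train, extra], ignore_index=True)

print(balanced["Outcome"].value_counts())
```

Note that only the train set is resampled, as in the cell above; the test set keeps its original class proportions.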

In [None]:
scaler = StandardScaler()
sc = scaler.fit(diab_X_train)

train_sc = sc.transform(diab_X_train)
diab_X_train_sc = pd.DataFrame(train_sc)
diab_X_train_sc.columns = ['Pregnancies_st','Glucose_st','BloodPressure_st','SkinThickness_st','Insulin_st','BMI_st',
       'DiabetesPedigreeFunction_st','Age_st']

test_sc = sc.transform(diab_X_test)
diab_X_test_sc = pd.DataFrame(test_sc)
diab_X_test_sc.columns = ['Pregnancies_st','Glucose_st','BloodPressure_st','SkinThickness_st','Insulin_st','BMI_st',
       'DiabetesPedigreeFunction_st','Age_st']

diab_X_test_sc.head(3)

Unnamed: 0,Pregnancies_st,Glucose_st,BloodPressure_st,SkinThickness_st,Insulin_st,BMI_st,DiabetesPedigreeFunction_st,Age_st
0,-0.892614,2.215251,0.318781,1.32409,-0.703392,1.353269,2.715018,-1.061804
1,-0.606949,-0.508176,0.217092,0.536939,0.133295,0.141652,-0.226186,-0.978028
2,-0.035617,-1.425852,-0.393042,-1.279564,-0.703392,0.193765,-0.264808,-0.810476


### Car Miles per Gallon dataset

Contains information about roughly 1,300 cars. The goal is to predict mpg.



In [None]:
cars = pd.read_csv('/content/drive/MyDrive/IronhackDA2023/data/car_miles_per_galon.csv', sep = ';')
cars.head(3)

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model year,origin,mpg
0,6,225.0,95,3264,16.0,75,1,19.0
1,6,250.0,88,3139,14.5,71,1,18.0
2,4,98.0,80,2164,15.0,72,1,28.0


#### Car Train test split



In [None]:
cars_X_train, cars_X_test, cars_y_train, cars_y_test = train_test_split(cars[['cylinders','displacement','horsepower',
                                                                              'weight','acceleration','model year','origin']],
                                                    cars['mpg'], train_size = 0.8, random_state = 0)
cars_X_train.head(3)

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model year,origin
1161,4,86.0,64,1875,16.4,81,1
567,4,98.0,68,2135,16.6,78,3
1270,4,141.0,71,3190,24.8,79,2


We don't scale origin, because it is a categorical code rather than a continuous variable

In [None]:
cars['origin'].value_counts()

1    830
3    280
2    219
Name: origin, dtype: int64

In [None]:
column_to_keep_train = cars_X_train['origin']
cars_X_train = cars_X_train.drop(columns=['origin'])
original_index_train = cars_X_train.index

column_to_keep_test = cars_X_test['origin']
cars_X_test = cars_X_test.drop(columns=['origin'])
original_index_test = cars_X_test.index

scaler = StandardScaler()
car_scaler = scaler.fit(cars_X_train)
train_car_scaler = car_scaler.transform(cars_X_train)
test_car_scaler = car_scaler.transform(cars_X_test)

# Create a new DataFrame with the standardized features
cars_X_train_car_scaler = pd.DataFrame(train_car_scaler, columns=cars_X_train.columns, index = original_index_train)
cars_X_test_car_scaler = pd.DataFrame(test_car_scaler, columns=cars_X_test.columns, index = original_index_test)

# Concatenate the unscaled column back onto the DataFrame
cars_X_train_car_scaler = pd.concat([cars_X_train_car_scaler, column_to_keep_train], axis=1)
cars_X_test_car_scaler = pd.concat([cars_X_test_car_scaler, column_to_keep_test], axis=1)
cars_X_train_car_scaler.head(3)

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model year,origin
1161,-0.861621,-1.020848,-1.050466,-1.288559,0.275634,1.345204,1
567,-0.861621,-0.904822,-0.943163,-0.978212,0.347677,0.529952,3
1270,-0.861621,-0.489063,-0.862686,0.281082,3.30144,0.801703,2


### Random Forest Regressor

In [None]:
RF_Reg = RandomForestRegressor()

grid = dict()
grid['n_estimators'] = [10, 50] # number of trees
grid['criterion'] = ['squared_error','absolute_error']



# define the evaluation procedure
cv = RepeatedKFold(n_splits = 5, n_repeats = 3, random_state = 1)

# define the grid search procedure
grid_search = GridSearchCV(estimator = RF_Reg, param_grid = grid, n_jobs = -1, cv = cv, scoring = 'neg_mean_absolute_error')

# execute the grid search
grid_result = grid_search.fit(cars_X_train_car_scaler, cars_y_train)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: -0.509470 using {'criterion': 'squared_error', 'n_estimators': 50}
-0.542892 (0.095516) with: {'criterion': 'squared_error', 'n_estimators': 10}
-0.509470 (0.095240) with: {'criterion': 'squared_error', 'n_estimators': 50}
-0.598809 (0.083874) with: {'criterion': 'absolute_error', 'n_estimators': 10}
-0.542938 (0.089356) with: {'criterion': 'absolute_error', 'n_estimators': 50}


In [None]:
RF_Reg = RandomForestRegressor(criterion = 'squared_error', n_estimators = 500)
RF_Reg.fit(cars_X_train_car_scaler, cars_y_train)

In [None]:
cars_X_test_car_scaler['RF_Reg'] = RF_Reg.predict(cars_X_test_car_scaler)
cars_X_train_car_scaler['RF_Reg'] = RF_Reg.predict(cars_X_train_car_scaler)

In [None]:
print("MAE: ", metrics.mean_absolute_error(cars_y_train, cars_X_train_car_scaler['RF_Reg']).round(4))
print("MSE: ", metrics.mean_squared_error(cars_y_train, cars_X_train_car_scaler['RF_Reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(cars_y_train, cars_X_train_car_scaler['RF_Reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(cars_y_train, cars_X_train_car_scaler['RF_Reg']).round(4))
print("R2: ", metrics.r2_score(cars_y_train, cars_X_train_car_scaler['RF_Reg']).round(4))

MAE:  0.1283
MSE:  0.1177
RMSE:  0.3431
MAPE:  0.0054
R2:  0.998


In [None]:
print("MAE: ", metrics.mean_absolute_error(cars_y_test, cars_X_test_car_scaler['RF_Reg']).round(4))
print("MSE: ", metrics.mean_squared_error(cars_y_test, cars_X_test_car_scaler['RF_Reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(cars_y_test, cars_X_test_car_scaler['RF_Reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(cars_y_test, cars_X_test_car_scaler['RF_Reg']).round(4))
print("R2: ", metrics.r2_score(cars_y_test, cars_X_test_car_scaler['RF_Reg']).round(4))

MAE:  0.3302
MSE:  0.6734
RMSE:  0.8206
MAPE:  0.0168
R2:  0.9875
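
Since the same five metrics are printed for every regressor in this lesson, they could be wrapped in a small helper. `regression_report` is a name made up for this sketch, and the values passed to it are hypothetical.

```python
import numpy as np
from sklearn import metrics

def regression_report(y_true, y_pred):
    """Return the five regression metrics used in this lesson as a dict."""
    mse = metrics.mean_squared_error(y_true, y_pred)
    return {
        "MAE": round(metrics.mean_absolute_error(y_true, y_pred), 4),
        "MSE": round(mse, 4),
        "RMSE": round(float(np.sqrt(mse)), 4),
        "MAPE": round(metrics.mean_absolute_percentage_error(y_true, y_pred), 4),
        "R2": round(metrics.r2_score(y_true, y_pred), 4),
    }

# Hypothetical values, only to show the call
report = regression_report([20.0, 30.0, 25.0], [22.0, 27.0, 25.0])
for name, value in report.items():
    print(f"{name}: {value}")
```

Calling it once with the train predictions and once with the test predictions makes the train/test gap easy to compare.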


### Random Forest Classifier

In [None]:
RF_Cl = RandomForestClassifier()

# define the grid of values to search
grid = dict()
grid['n_estimators'] = [100, 500]
grid['criterion'] = ['gini','entropy']

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

# define the grid search procedure
grid_search = GridSearchCV(estimator=RF_Cl, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy')

# execute the grid search
grid_result = grid_search.fit(diab_X_train_sc, diab_y_train)

# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))



Best: 1.000000 using {'criterion': 'gini', 'n_estimators': 100}
1.000000 (0.000000) with: {'criterion': 'gini', 'n_estimators': 100}
1.000000 (0.000000) with: {'criterion': 'gini', 'n_estimators': 500}
1.000000 (0.000000) with: {'criterion': 'entropy', 'n_estimators': 100}
1.000000 (0.000000) with: {'criterion': 'entropy', 'n_estimators': 500}


In [None]:
RF_Cl = RandomForestClassifier(criterion = 'entropy', n_estimators = 100)
RF_Cl.fit(diab_X_train_sc, diab_y_train)

In [None]:
print("Train set score (Accuracy) =", RF_Cl.score(diab_X_train_sc, diab_y_train).round(4))
print("Test set score (Accuracy) =", RF_Cl.score(diab_X_test_sc, diab_y_test).round(4))

conf_mat = confusion_matrix(diab_y_test, RF_Cl.predict(diab_X_test_sc))
print(tabulate(conf_mat,headers = ['pred Diab No','pred Diab Yes'], showindex = ['real Diab No','real Diab Yes'],
               tablefmt = 'fancy_grid'))

print(classification_report(diab_y_test, RF_Cl.predict(diab_X_test_sc)))

Train set score (Accuracy) = 1.0
Test set score (Accuracy) = 0.8117
╒═══════════════╤════════════════╤═════════════════╕
│               │   pred Diab No │   pred Diab Yes │
╞═══════════════╪════════════════╪═════════════════╡
│ real Diab No  │             91 │              16 │
├───────────────┼────────────────┼─────────────────┤
│ real Diab Yes │             13 │              34 │
╘═══════════════╧════════════════╧═════════════════╛
              precision    recall  f1-score   support

           0       0.88      0.85      0.86       107
           1       0.68      0.72      0.70        47

    accuracy                           0.81       154
   macro avg       0.78      0.79      0.78       154
weighted avg       0.82      0.81      0.81       154



In [None]:
diab_X_test_sc['RF_Cl'] = RF_Cl.predict(diab_X_test_sc)
diab_X_train_sc['RF_Cl'] = RF_Cl.predict(diab_X_train_sc)

### Stacking Classifier with Python

In [None]:
level0 = list()
level0.append(('lr', LogisticRegression()))
level0.append(('RF', RandomForestClassifier()))
level0.append(('svc', SVC()))

level1 = LogisticRegression()

# define the stacking ensemble
St_Cl = StackingClassifier(estimators = level0, final_estimator = level1)

# fit the model on all available data
St_Cl.fit(diab_X_train_sc, diab_y_train)




In [None]:
print("Train set score (Accuracy) =", St_Cl.score(diab_X_train_sc, diab_y_train).round(4))
print("Test set score (Accuracy) =", St_Cl.score(diab_X_test_sc, diab_y_test).round(4))

conf_mat = confusion_matrix(diab_y_test, St_Cl.predict(diab_X_test_sc))
print(tabulate(conf_mat,headers = ['pred Diab No','pred Diab Yes'], showindex = ['real Diab No','real Diab Yes'],
               tablefmt = 'fancy_grid'))

print(classification_report(diab_y_test, St_Cl.predict(diab_X_test_sc)))


In [None]:
diab_X_test_sc['St_Cl'] = St_Cl.predict(diab_X_test_sc)
diab_X_train_sc['St_Cl'] = St_Cl.predict(diab_X_train_sc)

### Stacking Regression with Python


In [None]:
level0 = list()
level0.append(('lr', LinearRegression()))
level0.append(('RF', RandomForestRegressor()))
level0.append(('svr', SVR()))

level1 = LinearRegression()

# define the stacking ensemble
St_reg = StackingRegressor(estimators = level0, final_estimator = level1)

# fit the model on all available data
St_reg.fit(cars_X_train_car_scaler, cars_y_train)



In [None]:
cars_X_test_car_scaler['St_reg'] = St_reg.predict(cars_X_test_car_scaler)
cars_X_train_car_scaler['St_reg'] = St_reg.predict(cars_X_train_car_scaler)

In [None]:
print("MAE: ", metrics.mean_absolute_error(cars_y_train, cars_X_train_car_scaler['St_reg']).round(4))
print("MSE: ", metrics.mean_squared_error(cars_y_train, cars_X_train_car_scaler['St_reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(cars_y_train, cars_X_train_car_scaler['St_reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(cars_y_train, cars_X_train_car_scaler['St_reg']).round(4))
print("R2: ", metrics.r2_score(cars_y_train, cars_X_train_car_scaler['St_reg']).round(4))


In [None]:
print("MAE: ", metrics.mean_absolute_error(cars_y_test, cars_X_test_car_scaler['St_reg']).round(4))
print("MSE: ", metrics.mean_squared_error(cars_y_test, cars_X_test_car_scaler['St_reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(cars_y_test, cars_X_test_car_scaler['St_reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(cars_y_test, cars_X_test_car_scaler['St_reg']).round(4))
print("R2: ", metrics.r2_score(cars_y_test, cars_X_test_car_scaler['St_reg']).round(4))


### AdaBoost Regression with Python

In [None]:
AB_Reg = AdaBoostRegressor()

grid = dict()
grid['n_estimators'] = [10, 100, 500] # number of trees
grid['learning_rate'] = [ 0.001, 0.01, 0.1, 1.0]



# define the evaluation procedure
cv = RepeatedKFold(n_splits = 5, n_repeats = 3, random_state = 1)

# define the grid search procedure
grid_search = GridSearchCV(estimator = AB_Reg, param_grid = grid, n_jobs = -1, cv = cv, scoring = 'neg_mean_absolute_error')

# execute the grid search
grid_result = grid_search.fit(cars_X_train_car_scaler, cars_y_train)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))


The best model has 100 Trees and a learning rate of 1. This search uses Mean Absolute Error as the scoring metric; try Mean Absolute Percentage Error as well, since it could change the best combination of parameters!
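
Switching the metric only requires changing the `scoring` argument. The sketch below runs the same small grid with both metrics on synthetic data (the dataset, the reduced grid and the 3-fold split are assumptions to keep it fast), so the parameters it selects say nothing about the cars dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data, used here only for illustration
X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)
y = y - y.min() + 10.0   # shift the target to be strictly positive so MAPE is meaningful

grid = {"n_estimators": [10, 50], "learning_rate": [0.1, 1.0]}
cv = KFold(n_splits=3, shuffle=True, random_state=1)

best = {}
for scoring in ("neg_mean_absolute_error", "neg_mean_absolute_percentage_error"):
    search = GridSearchCV(AdaBoostRegressor(random_state=0), grid, cv=cv, scoring=scoring)
    search.fit(X, y)
    best[scoring] = search.best_params_
    print(scoring, "->", search.best_params_)
```

MAPE penalizes errors relative to the true value, so it favors models that do well on small targets, whereas MAE treats all errors equally.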

Once we know which is the best combination let's train the definitive model and make the predictions

In [None]:
AB_Reg = AdaBoostRegressor(learning_rate = 1, n_estimators = 100)
AB_Reg.fit(cars_X_train_car_scaler, cars_y_train)


In [None]:
cars_X_test_car_scaler['AB_Reg'] = AB_Reg.predict(cars_X_test_car_scaler)
cars_X_train_car_scaler['AB_Reg'] = AB_Reg.predict(cars_X_train_car_scaler)


In [None]:
print("MAE: ", metrics.mean_absolute_error(cars_y_train, cars_X_train_car_scaler['AB_Reg']).round(4))
print("MSE: ", metrics.mean_squared_error(cars_y_train, cars_X_train_car_scaler['AB_Reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(cars_y_train, cars_X_train_car_scaler['AB_Reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(cars_y_train, cars_X_train_car_scaler['AB_Reg']).round(4))
print("R2: ", metrics.r2_score(cars_y_train, cars_X_train_car_scaler['AB_Reg']).round(4))

In [None]:
print("MAE: ", metrics.mean_absolute_error(cars_y_test, cars_X_test_car_scaler['AB_Reg']).round(4))
print("MSE: ", metrics.mean_squared_error(cars_y_test, cars_X_test_car_scaler['AB_Reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(cars_y_test, cars_X_test_car_scaler['AB_Reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(cars_y_test, cars_X_test_car_scaler['AB_Reg']).round(4))
print("R2: ", metrics.r2_score(cars_y_test, cars_X_test_car_scaler['AB_Reg']).round(4))


MAE:  1.6143
MSE:  4.0474
RMSE:  2.0118
MAPE:  0.0745
R2:  0.925


### AdaBoost Classifier with Python

In [None]:
AB_cl = AdaBoostClassifier()

# define the grid of values to search
grid = dict()
grid['n_estimators'] = [10, 100, 500]
grid['learning_rate'] = [0.001, 0.01, 0.1, 1.0]

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

# define the grid search procedure
grid_search = GridSearchCV(estimator=AB_cl, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy')

# execute the grid search
grid_result = grid_search.fit(diab_X_train_sc, diab_y_train)

# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))



Best: 0.747554 using {'learning_rate': 0.01, 'n_estimators': 500}
0.713324 (0.033152) with: {'learning_rate': 0.001, 'n_estimators': 10}
0.713324 (0.033152) with: {'learning_rate': 0.001, 'n_estimators': 100}
0.723648 (0.035688) with: {'learning_rate': 0.001, 'n_estimators': 500}
0.713324 (0.033152) with: {'learning_rate': 0.01, 'n_estimators': 10}
0.731281 (0.031052) with: {'learning_rate': 0.01, 'n_estimators': 100}
0.747554 (0.034162) with: {'learning_rate': 0.01, 'n_estimators': 500}
0.730201 (0.032379) with: {'learning_rate': 0.1, 'n_estimators': 10}
0.745942 (0.037197) with: {'learning_rate': 0.1, 'n_estimators': 100}
0.737825 (0.034146) with: {'learning_rate': 0.1, 'n_estimators': 500}
0.730170 (0.042411) with: {'learning_rate': 1.0, 'n_estimators': 10}
0.718819 (0.038249) with: {'learning_rate': 1.0, 'n_estimators': 100}
0.697094 (0.028260) with: {'learning_rate': 1.0, 'n_estimators': 500}


In [None]:
AB_cl = AdaBoostClassifier(learning_rate = 0.01, n_estimators = 500)
AB_cl.fit(diab_X_train_sc, diab_y_train)

# diab_X_train, diab_X_test, diab_y_train, diab_y_test

In [None]:
print("Train set score (Accuracy) =", AB_cl.score(diab_X_train_sc, diab_y_train).round(4))
print("Test set score (Accuracy) =", AB_cl.score(diab_X_test_sc, diab_y_test).round(4))

conf_mat = confusion_matrix(diab_y_test, AB_cl.predict(diab_X_test_sc))
print(tabulate(conf_mat,headers = ['pred Diab No','pred Diab Yes'], showindex = ['real Diab No','real Diab Yes'],
               tablefmt = 'fancy_grid'))

print(classification_report(diab_y_test, AB_cl.predict(diab_X_test_sc)))

Train set score (Accuracy) = 0.7863
Test set score (Accuracy) = 0.7987
╒═══════════════╤════════════════╤═════════════════╕
│               │   pred Diab No │   pred Diab Yes │
╞═══════════════╪════════════════╪═════════════════╡
│ real Diab No  │             86 │              21 │
├───────────────┼────────────────┼─────────────────┤
│ real Diab Yes │             10 │              37 │
╘═══════════════╧════════════════╧═════════════════╛
              precision    recall  f1-score   support

           0       0.90      0.80      0.85       107
           1       0.64      0.79      0.70        47

    accuracy                           0.80       154
   macro avg       0.77      0.80      0.78       154
weighted avg       0.82      0.80      0.80       154



In [None]:
diab_X_test_sc['AB_Cl'] = AB_cl.predict(diab_X_test_sc)
diab_X_train_sc['AB_Cl'] = AB_cl.predict(diab_X_train_sc)

### Gradient Boosting Regressor with Python

In [None]:
BG_Reg = GradientBoostingRegressor()

# define the grid of values to search
grid = dict()
grid['n_estimators'] = [100, 500]
grid['learning_rate'] = [0.01, 0.1, 1.0]
grid['max_depth'] = [3, 5, 8]

# define the evaluation procedure
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)

# define the grid search procedure
grid_search = GridSearchCV(estimator=BG_Reg, param_grid=grid, n_jobs=-1, cv=cv, scoring='neg_mean_absolute_error')
# execute the grid search
grid_result = grid_search.fit(cars_X_train_car_scaler, cars_y_train)

# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))


Best: -0.170705 using {'learning_rate': 0.1, 'max_depth': 8, 'n_estimators': 500, 'subsample': 0.5}
-2.756249 (0.171753) with: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.5}
-2.757431 (0.169782) with: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.7}
-2.759255 (0.171752) with: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 1.0}
-1.215482 (0.082850) with: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500, 'subsample': 0.5}
-1.218314 (0.083073) with: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500, 'subsample': 0.7}
-1.232304 (0.083389) with: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500, 'subsample': 1.0}
-2.645642 (0.143866) with: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 100, 'subsample': 0.5}
-2.636124 (0.142124) with: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 100, 'subsample': 0.7}
-2.628138 (0.140084) with: {'learning_rate': 0.01, '

In [None]:
BG_Reg = GradientBoostingRegressor(learning_rate = 1, n_estimators = 100, max_depth = 2)
BG_Reg.fit(cars_X_train_car_scaler, cars_y_train)

In [None]:
cars_X_test_car_scaler['BG_Reg'] = BG_Reg.predict(cars_X_test_car_scaler)
cars_X_train_car_scaler['BG_Reg'] = BG_Reg.predict(cars_X_train_car_scaler)

In [None]:
print("MAE: ", metrics.mean_absolute_error(cars_y_train, cars_X_train_car_scaler['BG_Reg']).round(4))
print("MSE: ", metrics.mean_squared_error(cars_y_train, cars_X_train_car_scaler['BG_Reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(cars_y_train, cars_X_train_car_scaler['BG_Reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(cars_y_train, cars_X_train_car_scaler['BG_Reg']).round(4))
print("R2: ", metrics.r2_score(cars_y_train, cars_X_train_car_scaler['BG_Reg']).round(4))

MAE:  0.0799
MSE:  0.012
RMSE:  0.1094
MAPE:  0.0037
R2:  0.9998


In [None]:
print("MAE: ", metrics.mean_absolute_error(cars_y_test, cars_X_test_car_scaler['BG_Reg']).round(4))
print("MSE: ", metrics.mean_squared_error(cars_y_test, cars_X_test_car_scaler['BG_Reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(cars_y_test, cars_X_test_car_scaler['BG_Reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(cars_y_test, cars_X_test_car_scaler['BG_Reg']).round(4))
print("R2: ", metrics.r2_score(cars_y_test, cars_X_test_car_scaler['BG_Reg']).round(4))

MAE:  0.3473
MSE:  1.4328
RMSE:  1.197
MAPE:  0.0167
R2:  0.9734


### Gradient Boosting Classifier with Python


In [None]:
BG_cl = GradientBoostingClassifier()

# define the grid of values to search
grid = dict()
grid['n_estimators'] = [100, 500]
grid['learning_rate'] = [0.01, 0.1, 1.0]
grid['max_depth'] = [3, 5, 8]

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

# define the grid search procedure
grid_search = GridSearchCV(estimator=BG_cl, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy')
# execute the grid search

grid_result = grid_search.fit(diab_X_train_sc, diab_y_train)

# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.750842 using {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 500, 'subsample': 0.7}
0.640064 (0.003169) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 0.5}
0.640064 (0.003169) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 0.7}
0.640064 (0.003169) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 1.0}
0.640064 (0.003169) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.5}
0.640064 (0.003169) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.7}
0.640064 (0.003169) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 100, 'subsample': 1.0}
0.718806 (0.033561) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 500, 'subsample': 0.5}
0.714461 (0.031336) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 500, 'subsample': 0.7}
0.703616 (0.025514) with: {'learning_rate': 0.001, 'max
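
Grid search cost grows quickly with the grid size. As a rough sketch, the block below enumerates the grid defined in the code cell above (`n_estimators`, `learning_rate`, `max_depth`) and counts how many model fits the repeated 5-fold, 3-repeat cross-validation implies. It uses only the standard library and does not call scikit-learn.

```python
from itertools import product

# Same grid as in the GradientBoostingClassifier cell above.
grid = {
    "n_estimators": [100, 500],
    "learning_rate": [0.01, 0.1, 1.0],
    "max_depth": [3, 5, 8],
}
n_splits, n_repeats = 5, 3  # as passed to RepeatedStratifiedKFold

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
folds_per_candidate = n_splits * n_repeats
total_fits = len(candidates) * folds_per_candidate

print(len(candidates), "candidates x", folds_per_candidate, "folds =", total_fits, "fits")
```

With the default `refit=True`, `GridSearchCV` fits the best candidate once more on the full training set after the search.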

In [None]:
BG_cl = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 5, n_estimators = 500)
BG_cl.fit(diab_X_train_sc, diab_y_train)

In [None]:
print("Train set score (Accuracy) =", BG_cl.score(diab_X_train_sc, diab_y_train).round(4))
print("Test set score (Accuracy) =", BG_cl.score(diab_X_test_sc, diab_y_test).round(4))

conf_mat = confusion_matrix(diab_y_test, BG_cl.predict(diab_X_test_sc))
print(tabulate(conf_mat,headers = ['pred Diab No','pred Diab Yes'], showindex = ['real Diab No','real Diab Yes'],
               tablefmt = 'fancy_grid'))

print(classification_report(diab_y_test, BG_cl.predict(diab_X_test_sc)))

diab_X_test_sc['BG_cl'] = BG_cl.predict(diab_X_test_sc)
diab_X_train_sc['BG_cl'] = BG_cl.predict(diab_X_train_sc)

Train set score (Accuracy) = 0.9847
Test set score (Accuracy) = 0.7987
╒═══════════════╤════════════════╤═════════════════╕
│               │   pred Diab No │   pred Diab Yes │
╞═══════════════╪════════════════╪═════════════════╡
│ real Diab No  │             88 │              19 │
├───────────────┼────────────────┼─────────────────┤
│ real Diab Yes │             12 │              35 │
╘═══════════════╧════════════════╧═════════════════╛
              precision    recall  f1-score   support

           0       0.88      0.82      0.85       107
           1       0.65      0.74      0.69        47

    accuracy                           0.80       154
   macro avg       0.76      0.78      0.77       154
weighted avg       0.81      0.80      0.80       154
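
The figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below does this for the four counts printed above; the variable names are chosen for illustration.

```python
# Confusion matrix from the test-set output above: rows = actual, columns = predicted.
tn, fp = 88, 19   # actual "Diab No":  predicted No, predicted Yes
fn, tp = 12, 35   # actual "Diab Yes": predicted No, predicted Yes

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Class 1 ("Diab Yes")
precision_yes = tp / (tp + fp)   # of the predicted Yes, how many were really Yes
recall_yes = tp / (tp + fn)      # of the real Yes, how many were found
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

# Class 0 ("Diab No")
precision_no = tn / (tn + fn)
recall_no = tn / (tn + fp)
f1_no = 2 * precision_no * recall_no / (precision_no + recall_no)

print(f"accuracy = {accuracy:.4f}")
print(f"class 0: precision={precision_no:.2f} recall={recall_no:.2f} f1={f1_no:.2f}")
print(f"class 1: precision={precision_yes:.2f} recall={recall_yes:.2f} f1={f1_yes:.2f}")
```

The lower precision on class 1 comes from the 19 false positives, which are large relative to the 35 true positives.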



### XGBoost Regressor with Python


In [None]:
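# Note: this cell uses scikit-learn's GradientBoostingRegressor under the XG_ name;
# the xgboost package provides its own XGBRegressor class.
XG_Reg = GradientBoostingRegressor()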

# define the grid of values to search
grid = dict()
grid['n_estimators'] = [50, 100]
grid['learning_rate'] = [0.01, 0.1, 1.0]
grid['max_depth'] = [3, 5, 8]

# define the evaluation procedure
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)

# define the grid search procedure
grid_search = GridSearchCV(estimator=XG_Reg, param_grid=grid, n_jobs=-1, cv=cv, scoring='neg_mean_absolute_error')
# execute the grid search
grid_result = grid_search.fit(cars_X_train_car_scaler, cars_y_train)

# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))


Best: -0.013461 using {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 500}
-2.409940 (0.134989) with: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100}
-0.065114 (0.006873) with: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500}
-2.378757 (0.125581) with: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 100}
-0.058029 (0.007673) with: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 500}
-2.378105 (0.123946) with: {'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 100}
-0.055142 (0.008199) with: {'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 500}
-0.031272 (0.004521) with: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
-0.017524 (0.004248) with: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500}
-0.018695 (0.005417) with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
-0.013461 (0.005263) with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 500}
-0.014311 (0.007493) with: {'learning_rate
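
In the grid results above, a learning rate of 0.01 scores much worse with 100 estimators than with 500. The sketch below illustrates why under a strong simplifying assumption: each boosting stage is an idealized weak learner that predicts the current residual exactly, so only the shrinkage from `learning_rate` limits progress. Real trees fit residuals only approximately, so this shows the direction of the effect and not the actual scores. The function name and targets are made up.

```python
def mae_after_boosting(y, learning_rate, n_estimators):
    """Boost with an idealized weak learner that predicts each residual exactly."""
    start = sum(y) / len(y)          # initial prediction: the target mean
    preds = [start] * len(y)
    for _ in range(n_estimators):
        residuals = [t - p for t, p in zip(y, preds)]
        # Each stage adds only a learning_rate-sized fraction of its correction.
        preds = [p + learning_rate * r for p, r in zip(preds, residuals)]
    return sum(abs(t - p) for t, p in zip(y, preds)) / len(y)

y = [10.0, 14.0, 21.0, 35.0]  # made-up targets
for lr, n in [(0.01, 100), (0.01, 500), (0.1, 100)]:
    print(f"learning_rate={lr}, n_estimators={n}: MAE={mae_after_boosting(y, lr, n):.4f}")
```

Under this assumption the remaining error after `n` stages is the initial error times `(1 - learning_rate) ** n`, which is why a small learning rate needs many more estimators to reach the same fit.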

In [None]:
XG_Reg = GradientBoostingRegressor(learning_rate = 0.1, n_estimators = 100, max_depth = 5)
XG_Reg.fit(cars_X_train_car_scaler, cars_y_train)

In [None]:
cars_X_test_car_scaler['XG_Reg'] = XG_Reg.predict(cars_X_test_car_scaler)
cars_X_train_car_scaler['XG_Reg'] = XG_Reg.predict(cars_X_train_car_scaler)

In [None]:
print("MAE: ", metrics.mean_absolute_error(cars_y_train, cars_X_train_car_scaler['XG_Reg']).round(4))
print("MSE: ", metrics.mean_squared_error(cars_y_train, cars_X_train_car_scaler['XG_Reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(cars_y_train, cars_X_train_car_scaler['XG_Reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(cars_y_train, cars_X_train_car_scaler['XG_Reg']).round(4))
print("R2: ", metrics.r2_score(cars_y_train, cars_X_train_car_scaler['XG_Reg']).round(4))

MAE:  0.0054
MSE:  0.0001
RMSE:  0.0097
MAPE:  0.0002
R2:  1.0


In [None]:
print("MAE: ", metrics.mean_absolute_error(cars_y_test, cars_X_test_car_scaler['XG_Reg']).round(4))
print("MSE: ", metrics.mean_squared_error(cars_y_test, cars_X_test_car_scaler['XG_Reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(cars_y_test, cars_X_test_car_scaler['XG_Reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(cars_y_test, cars_X_test_car_scaler['XG_Reg']).round(4))
print("R2: ", metrics.r2_score(cars_y_test, cars_X_test_car_scaler['XG_Reg']).round(4))

MAE:  0.271
MSE:  1.3648
RMSE:  1.1683
MAPE:  0.0131
R2:  0.9747


### XGBoost Classifier with Python

In [None]:
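# Note: this cell uses scikit-learn's GradientBoostingClassifier under the XG_ name;
# the xgboost package provides its own XGBClassifier class.
XG_cl = GradientBoostingClassifier()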

# define the grid of values to search
grid = dict()
grid['n_estimators'] = [50, 100]
grid['learning_rate'] = [0.01, 0.1, 1.0]
grid['max_depth'] = [3, 5, 8]

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

# define the grid search procedure
grid_search = GridSearchCV(estimator=XG_cl, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy')
# execute the grid search

grid_result = grid_search.fit(diab_X_train_sc, diab_y_train)

# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.838848 using {'learning_rate': 1.0, 'max_depth': 8, 'n_estimators': 50}
0.758263 (0.030810) with: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}
0.766317 (0.029425) with: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100}
0.775240 (0.041400) with: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 50}
0.785842 (0.030856) with: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 100}
0.802795 (0.039673) with: {'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 50}
0.808743 (0.040271) with: {'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 100}
0.787957 (0.027168) with: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}
0.801521 (0.025535) with: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
0.814679 (0.023642) with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 50}
0.823153 (0.026228) with: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
0.825703 (0.036398) with: {'learning_rate': 0.1, 'max_depth

In [None]:
XG_cl = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 5, n_estimators = 500, subsample = 0.7)
XG_cl.fit(diab_X_train_sc, diab_y_train)

In [None]:
print("Train set score (Accuracy) =", XG_cl.score(diab_X_train_sc, diab_y_train).round(4))
print("Test set score (Accuracy) =", XG_cl.score(diab_X_test_sc, diab_y_test).round(4))

conf_mat = confusion_matrix(diab_y_test, XG_cl.predict(diab_X_test_sc))
print(tabulate(conf_mat,headers = ['pred Diab No','pred Diab Yes'], showindex = ['real Diab No','real Diab Yes'],
               tablefmt = 'fancy_grid'))

print(classification_report(diab_y_test, XG_cl.predict(diab_X_test_sc)))

diab_X_test_sc['XG_cl'] = XG_cl.predict(diab_X_test_sc)
diab_X_train_sc['XG_cl'] = XG_cl.predict(diab_X_train_sc)

Train set score (Accuracy) = 0.9847
Test set score (Accuracy) = 0.7987
╒═══════════════╤════════════════╤═════════════════╕
│               │   pred Diab No │   pred Diab Yes │
╞═══════════════╪════════════════╪═════════════════╡
│ real Diab No  │             87 │              20 │
├───────────────┼────────────────┼─────────────────┤
│ real Diab Yes │             11 │              36 │
╘═══════════════╧════════════════╧═════════════════╛
              precision    recall  f1-score   support

           0       0.89      0.81      0.85       107
           1       0.64      0.77      0.70        47

    accuracy                           0.80       154
   macro avg       0.77      0.79      0.77       154
weighted avg       0.81      0.80      0.80       154





# Exercise: Cleveland


In [4]:
cleveland = pd.read_csv('drive/MyDrive/Ironhack/Data/cleveland.csv', sep = ';')
cleveland.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,unhealthy
0,63,0,4,124,197,0,0,136,1,0.0,2,0,3,1
1,54,0,3,110,214,0,0,158,0,1.6,2,0,3,0
2,65,1,4,135,254,0,2,127,0,2.8,2,1,7,1
3,60,0,4,150,258,0,2,157,0,2.6,2,2,7,1
4,57,0,4,120,354,0,0,163,1,0.6,1,0,3,0


In [5]:
c_X_train, c_X_test, c_y_train, c_y_test = train_test_split(cleveland[['age','sex','cp','trestbps','chol','fbs',
                                              'restecg','thalach','exang','oldpeak','slope','ca','thal']],
                                                    cleveland['unhealthy'], train_size = 0.8, random_state = 0)
c_X_train.head(3)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
62,41,1,3,130,214,0,2,168,0,2.0,2,0,3
79,58,0,2,136,319,1,2,152,0,0.0,1,2,3
647,64,1,3,125,309,0,0,131,1,1.8,2,0,7


In [6]:
ros = RandomOverSampler(random_state=42)
c_X_train, c_y_train= ros.fit_resample(c_X_train, c_y_train)
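
Random oversampling balances the classes by repeating minority-class rows. The block below is a minimal plain-Python sketch of that idea, not the imbalanced-learn implementation; `random_oversample` and the toy rows are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))   # sample with replacement
            y_out.append(label)
    return X_out, y_out

# Toy data: three rows of class 0, two rows of class 1.
X = [[63, 1], [54, 0], [65, 1], [60, 0], [57, 0]]
y = [1, 0, 1, 0, 0]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

As in the cell above, the resampling is applied to the training split only, so the test set keeps its original class proportions.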

In [7]:
scaler = StandardScaler()
sc = scaler.fit(c_X_train)

train_sc = sc.transform(c_X_train)
c_X_train_sc = pd.DataFrame(train_sc)
c_X_train_sc.columns = ['age_st','sex_st','cp_st','trestbps_st','chol_st','fbs_st',
                                              'restecg_st','thalach_st','exang_st','oldpeak_st','slope_st','ca_st','thal_st']

test_sc = sc.transform(c_X_test)
c_X_test_sc = pd.DataFrame(test_sc)
c_X_test_sc.columns = ['age_st','sex_st','cp_st','trestbps_st','chol_st','fbs_st',
                                              'restecg_st','thalach_st','exang_st','oldpeak_st','slope_st','ca_st','thal_st']

c_X_test_sc.head(3)

Unnamed: 0,age_st,sex_st,cp_st,trestbps_st,chol_st,fbs_st,restecg_st,thalach_st,exang_st,oldpeak_st,slope_st,ca_st,thal_st
0,0.598131,0.687597,0.846206,-0.938511,-0.325838,2.379399,-1.026404,0.500952,1.34599,0.229434,-1.064808,1.359663,1.099525
1,-1.078908,0.687597,0.846206,-1.052103,0.224319,-0.420274,0.991453,1.561751,-0.742948,-1.007409,-1.064808,-0.735476,-0.952605
2,1.604354,0.687597,-0.200672,0.367797,0.114288,-0.420274,0.991453,-0.093096,-0.742948,0.75951,0.614834,2.407233,1.099525
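
The scaling cell fits the scaler on the training data and reuses those statistics for the test data. The sketch below shows the same pattern for a single column in plain Python, using the population standard deviation as `StandardScaler` does; the helper names and the values are made up.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Standardize values with statistics learned elsewhere."""
    return [(x - mu) / sigma for x in column]

train_age = [41, 58, 64, 52, 45]   # made-up training values
test_age = [63, 39]                # made-up test values

mu, sigma = fit_scaler(train_age)                 # statistics come from train only
train_scaled = transform(train_age, mu, sigma)
test_scaled = transform(test_age, mu, sigma)      # test reuses the train statistics

print("train:", [round(v, 3) for v in train_scaled])
print("test: ", [round(v, 3) for v in test_scaled])
```

The scaled training column has mean 0 and standard deviation 1 by construction; the test column generally does not, which is expected.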


## AdaBoost Classifier Cleveland

In [8]:
AB_cl = AdaBoostClassifier()

# define the grid of values to search
grid = dict()
grid['n_estimators'] = [10, 100, 500]
grid['learning_rate'] = [0.001, 0.01, 0.1, 1.0]

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

# define the grid search procedure
grid_search = GridSearchCV(estimator=AB_cl, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy')

# execute the grid search
grid_result = grid_search.fit(c_X_train_sc, c_y_train)

# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.988450 using {'learning_rate': 1.0, 'n_estimators': 500}
0.787159 (0.026178) with: {'learning_rate': 0.001, 'n_estimators': 10}
0.787159 (0.026178) with: {'learning_rate': 0.001, 'n_estimators': 100}
0.799068 (0.031288) with: {'learning_rate': 0.001, 'n_estimators': 500}
0.787159 (0.026178) with: {'learning_rate': 0.01, 'n_estimators': 10}
0.842181 (0.019684) with: {'learning_rate': 0.01, 'n_estimators': 100}
0.864498 (0.019618) with: {'learning_rate': 0.01, 'n_estimators': 500}
0.846420 (0.021083) with: {'learning_rate': 0.1, 'n_estimators': 10}
0.878746 (0.016639) with: {'learning_rate': 0.1, 'n_estimators': 100}
0.911463 (0.018984) with: {'learning_rate': 0.1, 'n_estimators': 500}
0.873377 (0.024655) with: {'learning_rate': 1.0, 'n_estimators': 10}
0.963420 (0.015169) with: {'learning_rate': 1.0, 'n_estimators': 100}
0.988450 (0.009896) with: {'learning_rate': 1.0, 'n_estimators': 500}
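
The grid results above improve as `learning_rate` and `n_estimators` increase. One way to see why is to look at a single reweighting round of binary discrete AdaBoost, sketched below in plain Python. This is a simplified illustration (two classes, one round, hypothetical function name), not scikit-learn's full implementation.

```python
import math

def adaboost_round(weights, misclassified, learning_rate):
    """One sample-reweighting step of binary discrete AdaBoost."""
    # Weighted error rate of the current weak learner.
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong) / sum(weights)
    # Stage weight, shrunk by the learning rate.
    alpha = learning_rate * math.log((1 - err) / err)
    # Increase the weight of misclassified samples, then renormalize.
    updated = [w * math.exp(alpha) if wrong else w
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5
misclassified = [False, False, True, False, False]  # the weak learner misses sample 2

for lr in (1.0, 0.01):
    alpha, new_w = adaboost_round(weights, misclassified, lr)
    print(f"learning_rate={lr}: alpha={alpha:.4f}, weights={[round(w, 4) for w in new_w]}")
```

With a learning rate of 1.0 the misclassified sample receives half of the total weight after one round; with 0.01 its weight barely moves, so many more rounds are needed before the hard samples dominate.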


In [9]:
AB_cl = AdaBoostClassifier(learning_rate = 0.01, n_estimators = 500)
AB_cl.fit(c_X_train_sc, c_y_train)


In [12]:
print("Train set score (Accuracy) =", AB_cl.score(c_X_train_sc, c_y_train).round(4))
print("Test set score (Accuracy) =", AB_cl.score(c_X_test_sc, c_y_test).round(4))

conf_mat = confusion_matrix(c_y_test, AB_cl.predict(c_X_test_sc))
print(tabulate(conf_mat,headers = ['pred Unhealthy No','pred Unhealthy Yes'], showindex = ['real Unhealthy No','real Unhealthy Yes'],
               tablefmt = 'fancy_grid'))

print(classification_report(c_y_test, AB_cl.predict(c_X_test_sc)))

Train set score (Accuracy) = 0.8695
Test set score (Accuracy) = 0.8684
╒════════════════════╤═════════════════════╤══════════════════════╕
│                    │   pred Unhealthy No │   pred Unhealthy Yes │
╞════════════════════╪═════════════════════╪══════════════════════╡
│ real Unhealthy No  │                  81 │                    8 │
├────────────────────┼─────────────────────┼──────────────────────┤
│ real Unhealthy Yes │                  17 │                   84 │
╘════════════════════╧═════════════════════╧══════════════════════╛
              precision    recall  f1-score   support

           0       0.83      0.91      0.87        89
           1       0.91      0.83      0.87       101

    accuracy                           0.87       190
   macro avg       0.87      0.87      0.87       190
weighted avg       0.87      0.87      0.87       190



In [13]:
c_X_test_sc['AB_Cl'] = AB_cl.predict(c_X_test_sc)
c_X_train_sc['AB_Cl'] = AB_cl.predict(c_X_train_sc)

## Random Forest Classifier Cleveland



In [14]:
RF_Cl = RandomForestClassifier()

# define the grid of values to search
grid = dict()
grid['n_estimators'] = [100, 500]
grid['criterion'] = ['gini','entropy']

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

# define the grid search procedure
grid_search = GridSearchCV(estimator=RF_Cl, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy')

# execute the grid search
grid_result = grid_search.fit(c_X_train_sc, c_y_train)

# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.994220 using {'criterion': 'gini', 'n_estimators': 100}
0.994220 (0.008955) with: {'criterion': 'gini', 'n_estimators': 100}
0.994220 (0.008955) with: {'criterion': 'gini', 'n_estimators': 500}
0.994220 (0.008955) with: {'criterion': 'entropy', 'n_estimators': 100}
0.994220 (0.008955) with: {'criterion': 'entropy', 'n_estimators': 500}


In [15]:
RF_Cl = RandomForestClassifier(criterion = 'gini', n_estimators = 100)
RF_Cl.fit(c_X_train_sc, c_y_train)

In [16]:
print("Train set score (Accuracy) =", RF_Cl.score(c_X_train_sc, c_y_train).round(4))
print("Test set score (Accuracy) =", RF_Cl.score(c_X_test_sc, c_y_test).round(4))

conf_mat = confusion_matrix(c_y_test, RF_Cl.predict(c_X_test_sc))
print(tabulate(conf_mat,headers = ['pred Unhealthy No','pred Unhealthy Yes'], showindex = ['real Unhealthy No','real Unhealthy Yes'],
               tablefmt = 'fancy_grid'))

print(classification_report(c_y_test, RF_Cl.predict(c_X_test_sc)))

Train set score (Accuracy) = 1.0
Test set score (Accuracy) = 1.0
╒════════════════════╤═════════════════════╤══════════════════════╕
│                    │   pred Unhealthy No │   pred Unhealthy Yes │
╞════════════════════╪═════════════════════╪══════════════════════╡
│ real Unhealthy No  │                  89 │                    0 │
├────────────────────┼─────────────────────┼──────────────────────┤
│ real Unhealthy Yes │                   0 │                  101 │
╘════════════════════╧═════════════════════╧══════════════════════╛
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        89
           1       1.00      1.00      1.00       101

    accuracy                           1.00       190
   macro avg       1.00      1.00      1.00       190
weighted avg       1.00      1.00      1.00       190
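
A test accuracy of exactly 1.0 is unusual enough to justify a sanity check before trusting it. One thing worth ruling out is that identical rows appear in both the training and the test split, for example because the source file contains repeated rows. Nothing in the output above confirms that this happened here; the sketch below only shows how such a check could look in plain Python, with a hypothetical helper and made-up rows.

```python
def shared_rows(train_rows, test_rows):
    """Return the test rows that also occur, value for value, in the training rows."""
    train_set = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in train_set]

# Made-up rows for illustration only.
train_rows = [[63, 0, 4, 124], [54, 0, 3, 110], [65, 1, 4, 135]]
test_rows = [[54, 0, 3, 110], [60, 0, 4, 150]]

overlap = shared_rows(train_rows, test_rows)
print(f"{len(overlap)} of {len(test_rows)} test rows also appear in train")
```

If a large share of test rows also appears in the training data, the test score mostly measures memorization, and the duplicates should be removed before splitting.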



In [17]:
c_X_test_sc['RF_Cl'] = RF_Cl.predict(c_X_test_sc)
c_X_train_sc['RF_Cl'] = RF_Cl.predict(c_X_train_sc)

## Stacking Classifier Cleveland

In [18]:
level0 = list()
level0.append(('lr', LogisticRegression()))
level0.append(('RF', RandomForestClassifier()))
level0.append(('svc', SVC()))

level1 = LogisticRegression()

# define the stacking ensemble
St_Cl = StackingClassifier(estimators = level0, final_estimator = level1)

# fit the stacking model on the training data
St_Cl.fit(c_X_train_sc, c_y_train)

In [19]:
print("Train set score (Accuracy) =", St_Cl.score(c_X_train_sc, c_y_train).round(4))
print("Test set score (Accuracy) =", St_Cl.score(c_X_test_sc, c_y_test).round(4))

conf_mat = confusion_matrix(c_y_test, St_Cl.predict(c_X_test_sc))
print(tabulate(conf_mat,headers = ['pred Unhealthy No','pred Unhealthy Yes'], showindex = ['real Unhealthy No','real Unhealthy Yes'],
               tablefmt = 'fancy_grid'))

print(classification_report(c_y_test, St_Cl.predict(c_X_test_sc)))

Train set score (Accuracy) = 1.0
Test set score (Accuracy) = 1.0
╒════════════════════╤═════════════════════╤══════════════════════╕
│                    │   pred Unhealthy No │   pred Unhealthy Yes │
╞════════════════════╪═════════════════════╪══════════════════════╡
│ real Unhealthy No  │                  89 │                    0 │
├────────────────────┼─────────────────────┼──────────────────────┤
│ real Unhealthy Yes │                   0 │                  101 │
╘════════════════════╧═════════════════════╧══════════════════════╛
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        89
           1       1.00      1.00      1.00       101

    accuracy                           1.00       190
   macro avg       1.00      1.00      1.00       190
weighted avg       1.00      1.00      1.00       190



In [20]:
c_X_test_sc['St_Cl'] = St_Cl.predict(c_X_test_sc)
c_X_train_sc['St_Cl'] = St_Cl.predict(c_X_train_sc)

# Exercise: Autoprice

In [22]:
autoprice = pd.read_csv('drive/MyDrive/Ironhack/Data/autoprice.csv', sep = ';')
autoprice.head()

Unnamed: 0,normalized-losses,wheel-base,length,width,height,curb-weight,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,95,109.1,188.8,68.9,55.5,3062,3.78,3.15,9.5,114,5400,19,25,22.625
1,115,98.8,177.8,66.5,55.5,2425,3.39,3.39,8.6,84,4800,26,32,11.245
2,104,99.1,186.6,66.5,56.1,2758,3.54,3.07,9.3,110,5250,21,28,15.51
3,161,93.7,157.3,64.4,50.8,1918,2.97,3.23,9.4,68,5500,37,41,5.389
4,78,96.5,157.1,63.9,58.3,2024,2.92,3.41,9.2,76,6000,30,34,7.295


In [23]:
auto_X_train, auto_X_test, auto_y_train, auto_y_test = train_test_split(autoprice[['normalized-losses','wheel-base','length','height','curb-weight','bore',
                                              'stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg']],
                                                    autoprice['price'], train_size = 0.8, random_state = 0)
auto_X_train.head()

Unnamed: 0,normalized-losses,wheel-base,length,height,curb-weight,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg
211,103,94.5,170.2,53.5,2037,3.15,3.29,9.4,69,5200,31,37
500,81,95.7,169.7,59.1,2290,3.05,3.03,9.0,62,4800,27,32
497,93,106.7,187.5,54.9,3495,3.58,3.64,21.5,123,4350,22,25
10,188,101.2,176.8,54.3,2765,3.31,3.19,9.0,121,4250,21,28
475,150,99.1,186.6,56.1,2658,3.54,3.07,9.31,110,5250,21,28


In [25]:
autoprice['city-mpg'].value_counts()

31    90
24    74
27    52
19    44
23    36
26    35
21    28
28    26
30    24
37    19
25    16
38    16
22    16
17    10
18    10
20     9
29     8
32     5
49     4
34     4
45     4
47     4
35     4
15     3
16     1
Name: city-mpg, dtype: int64

In [26]:
column_to_keep_train = auto_X_train['city-mpg']
auto_X_train = auto_X_train.drop(columns=['city-mpg'])
original_index_train = auto_X_train.index

column_to_keep_test = auto_X_test['city-mpg']
auto_X_test = auto_X_test.drop(columns=['city-mpg'])
original_index_test = auto_X_test.index

scaler = StandardScaler()
auto_scaler = scaler.fit(auto_X_train)
train_auto_scaler = auto_scaler.transform(auto_X_train)
test_auto_scaler = auto_scaler.transform(auto_X_test)

# Build new DataFrames from the standardized features
auto_X_train_auto_scaler = pd.DataFrame(train_auto_scaler, columns=auto_X_train.columns, index = original_index_train)
auto_X_test_auto_scaler = pd.DataFrame(test_auto_scaler, columns=auto_X_test.columns, index = original_index_test)

# Concatenate the unstandardized column back onto each DataFrame
auto_X_train_auto_scaler = pd.concat([auto_X_train_auto_scaler, column_to_keep_train], axis=1)
auto_X_test_auto_scaler = pd.concat([auto_X_test_auto_scaler, column_to_keep_test], axis=1)
auto_X_train_auto_scaler.head(3)

Unnamed: 0,normalized-losses,wheel-base,length,height,curb-weight,bore,stroke,compression-ratio,horsepower,peak-rpm,highway-mpg,city-mpg
211,-0.492672,-0.739672,-0.160749,-0.21176,-0.87834,-0.591682,0.225063,-0.220402,-0.898631,0.198002,0.757482,31
500,-1.120355,-0.50137,-0.205145,2.328946,-0.321082,-0.958992,-0.646542,-0.319283,-1.153401,-0.637417,-0.049182,27
497,-0.777982,1.683066,1.375356,0.423416,2.333055,0.987749,1.398377,2.770724,1.06674,-1.577262,-1.178512,22


## Random Forest Regressor Autoprice

In [27]:
RF_Reg = RandomForestRegressor()

grid = dict()
grid['n_estimators'] = [10, 50] # number of trees
grid['criterion'] = ['squared_error','absolute_error']



# define the evaluation procedure
cv = RepeatedKFold(n_splits = 5, n_repeats = 3, random_state = 1)

# define the grid search procedure
grid_search = GridSearchCV(estimator = RF_Reg, param_grid = grid, n_jobs = -1, cv = cv, scoring = 'neg_mean_absolute_error')

# execute the grid search
grid_result = grid_search.fit(auto_X_train_auto_scaler, auto_y_train)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: -0.414273 using {'criterion': 'squared_error', 'n_estimators': 50}
-0.425646 (0.102743) with: {'criterion': 'squared_error', 'n_estimators': 10}
-0.414273 (0.089137) with: {'criterion': 'squared_error', 'n_estimators': 50}
-0.471083 (0.109226) with: {'criterion': 'absolute_error', 'n_estimators': 10}
-0.420396 (0.103353) with: {'criterion': 'absolute_error', 'n_estimators': 50}


In [28]:
RF_Reg = RandomForestRegressor(criterion = 'squared_error', n_estimators = 500)
RF_Reg.fit(auto_X_train_auto_scaler, auto_y_train)

In [29]:
auto_X_test_auto_scaler['RF_Reg'] = RF_Reg.predict(auto_X_test_auto_scaler)
auto_X_train_auto_scaler['RF_Reg'] = RF_Reg.predict(auto_X_train_auto_scaler)

In [30]:
print("MAE: ", metrics.mean_absolute_error(auto_y_train, auto_X_train_auto_scaler['RF_Reg']).round(4))
print("MSE: ", metrics.mean_squared_error(auto_y_train, auto_X_train_auto_scaler['RF_Reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(auto_y_train, auto_X_train_auto_scaler['RF_Reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(auto_y_train, auto_X_train_auto_scaler['RF_Reg']).round(4))
print("R2: ", metrics.r2_score(auto_y_train, auto_X_train_auto_scaler['RF_Reg']).round(4))

MAE:  0.1108
MSE:  0.0743
RMSE:  0.2725
MAPE:  0.009
R2:  0.9977


In [31]:
print("MAE: ", metrics.mean_absolute_error(auto_y_test, auto_X_test_auto_scaler['RF_Reg']).round(4))
print("MSE: ", metrics.mean_squared_error(auto_y_test, auto_X_test_auto_scaler['RF_Reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(auto_y_test, auto_X_test_auto_scaler['RF_Reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(auto_y_test, auto_X_test_auto_scaler['RF_Reg']).round(4))
print("R2: ", metrics.r2_score(auto_y_test, auto_X_test_auto_scaler['RF_Reg']).round(4))

MAE:  0.2459
MSE:  0.325
RMSE:  0.5701
MAPE:  0.0204
R2:  0.9906


## Stacking Regressor Autoprice

In [32]:
level0 = list()
level0.append(('lr', LinearRegression()))
level0.append(('RF', RandomForestRegressor()))
level0.append(('svr', SVR()))

level1 = LinearRegression()

# define the stacking ensemble
St_reg = StackingRegressor(estimators = level0, final_estimator = level1)

# fit the stacking model on the training data
St_reg.fit(auto_X_train_auto_scaler, auto_y_train)

In [33]:
auto_X_test_auto_scaler['St_reg'] = St_reg.predict(auto_X_test_auto_scaler)
auto_X_train_auto_scaler['St_reg'] = St_reg.predict(auto_X_train_auto_scaler)

In [34]:
print("MAE: ", metrics.mean_absolute_error(auto_y_train, auto_X_train_auto_scaler['St_reg']).round(4))
print("MSE: ", metrics.mean_squared_error(auto_y_train, auto_X_train_auto_scaler['St_reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(auto_y_train, auto_X_train_auto_scaler['St_reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(auto_y_train, auto_X_train_auto_scaler['St_reg']).round(4))
print("R2: ", metrics.r2_score(auto_y_train, auto_X_train_auto_scaler['St_reg']).round(4))

MAE:  0.1514
MSE:  0.0778
RMSE:  0.279
MAPE:  0.0132
R2:  0.9975


In [35]:
print("MAE: ", metrics.mean_absolute_error(auto_y_test, auto_X_test_auto_scaler['St_reg']).round(4))
print("MSE: ", metrics.mean_squared_error(auto_y_test, auto_X_test_auto_scaler['St_reg']).round(4))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(auto_y_test, auto_X_test_auto_scaler['St_reg'])).round(4))
print("MAPE: ", metrics.mean_absolute_percentage_error(auto_y_test, auto_X_test_auto_scaler['St_reg']).round(4))
print("R2: ", metrics.r2_score(auto_y_test, auto_X_test_auto_scaler['St_reg']).round(4))


MAE:  0.2623
MSE:  0.347
RMSE:  0.5891
MAPE:  0.0227
R2:  0.99
