Load the libraries used in this notebook.

In [229]:
import pandas as pd
import numpy as np
import webbrowser as wb
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns 

from dateutil.parser import parse
from ydata_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler
from pycaret.regression import *

# 1. Preliminary analysis

Load the House Prices training dataset (`train.csv`).

In [235]:
dataset = pd.read_csv("train.csv", sep = ",")

The **ProfileReport** function is used to generate an exhaustive preliminary analysis report of the dataset.

In [113]:
profile = ProfileReport(dataset, title = "Reporte preliminar")
#profile.to_notebook_iframe() # Inline display is disabled because of rendering errors; the report is exported to an HTML file instead.
profile.to_file("reporte.html")
wb.open_new_tab('reporte.html')

True
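
`webbrowser` treats its argument as a URL, so a bare relative path such as `'reporte.html'` may not open reliably on every platform. A minimal sketch of one way around this, assuming the report was written to the current working directory; the helper name `report_uri` is hypothetical and not part of the original notebook:

```python
from pathlib import Path


def report_uri(filename: str) -> str:
    """Build an absolute file:// URI for a file in the working directory."""
    # Path.cwd() is absolute, which as_uri() requires.
    return (Path.cwd() / filename).as_uri()


uri = report_uri("reporte.html")
print(uri)
# To open it in a browser tab: webbrowser.open_new_tab(uri)
```

The resulting string can be passed to `wb.open_new_tab` in place of the relative file name.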

Split the dataset into a training set (80% of the rows) and a hold-out test set (the remaining 20%).

In [231]:
data_train = dataset.sample(frac=0.8, random_state=2023) 
data_test = dataset.drop(data_train.index) 
data_train.shape, data_test.shape

((1168, 81), (292, 81))
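
The `sample` / `drop` pattern used here can be checked on a small example. The sketch below uses a hypothetical ten-row frame (not the real dataset) to show that the two parts are disjoint and together cover every row:

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({"Id": range(1, 11), "SalePrice": range(100, 1100, 100)})

train = toy.sample(frac=0.8, random_state=2023)  # 80% of the rows, reproducible
test = toy.drop(train.index)                     # every row not drawn for train

# No row appears in both parts, and no row is lost.
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
```

With the real data the same logic gives 1168 + 292 = 1460 rows, matching the shapes printed above.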

Configure the PyCaret experiment (preprocessing pipeline and logging).

In [244]:
df = setup(data = data_train,
           target = 'SalePrice', # Target variable
           session_id = 2023, # Seed for reproducibility
           numeric_imputation = 'knn', # Impute missing numeric values with K-Nearest Neighbors
           max_encoding_ohe = 3, # One-hot encode categorical features with at most 3 unique values; the rest use encoding_method
           categorical_imputation = 'mode', # Impute missing categorical values with the most frequent value
           transformation = True, # Apply a power transformation to make the features more Gaussian-like
           outliers_threshold = 0.05, # Fraction of rows treated as outliers; only takes effect when remove_outliers=True
           normalize = True, # Scale the features
           normalize_method = 'minmax', # Normalization method: rescale each feature to the [0, 1] range
           n_jobs = None, # None runs on a single processor; use -1 to run in parallel on all processors
           log_experiment = True, # Log the experiment via MLflow
           experiment_name = 'SalePrice_house') # Experiment name

,Description,Value
0,Session id,2023
1,Target,SalePrice
2,Target type,Regression
3,Original data shape,"(1168, 81)"
4,Transformed data shape,"(1168, 87)"
5,Transformed train set shape,"(817, 87)"
6,Transformed test set shape,"(351, 87)"
7,Ordinal features,6
8,Numeric features,37
9,Categorical features,43
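
The transformed train and test shapes in this summary show that `setup` performs its own internal split of `data_train`. The row counts are consistent with a 70/30 split, which is PyCaret's default `train_size`; that default is an assumption here, since the cell does not set the parameter. A short arithmetic sketch:

```python
import math

n_rows = 1168      # rows passed to setup() (data_train)
train_size = 0.7   # assumed: PyCaret default, not set explicitly in the cell

n_train = math.floor(n_rows * train_size)  # rows used for cross-validation
n_test = n_rows - n_train                  # rows held out inside setup()
print(n_train, n_test)
```

This gives 817 and 351, the row counts reported for the transformed train and test sets. The cross-validation metrics in the following cells are therefore computed on 817 rows, not on all 1168.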


The three best models are selected for the analysis based on RMSE. Linear Regression (*lr*) and Least Angle Regression (*lar*) are excluded from the comparison because their results ___________________ .

In [242]:
best = compare_models(sort='RMSE', exclude=['lar', 'lr'], n_select=3) 

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
gbr,Gradient Boosting Regressor,17280.3918,756015101.2977,26263.8898,0.8681,0.1397,0.1017,1.244
lightgbm,Light Gradient Boosting Machine,17506.711,836261965.7587,27928.658,0.8544,0.143,0.1014,1.032
et,Extra Trees Regressor,18325.2437,849421054.8134,28118.3078,0.8567,0.1468,0.1067,1.579
rf,Random Forest Regressor,17984.9218,861646526.8375,28143.1361,0.8494,0.1486,0.107,2.259
ada,AdaBoost Regressor,23028.6276,1190813511.3989,33690.9076,0.7939,0.1892,0.1474,0.949
huber,Huber Regressor,21382.3086,1284243932.0694,35001.7817,0.7822,0.1601,0.1178,0.863
br,Bayesian Ridge,23035.3078,1330416422.0882,35470.2268,0.7689,0.1977,0.1365,0.794
ridge,Ridge Regression,23129.017,1347606067.8172,35649.4398,0.7652,0.2085,0.1379,0.716
par,Passive Aggressive Regressor,23740.4706,1536816705.4081,38708.4949,0.7424,0.1702,0.1263,0.993
lasso,Lasso Regression,24029.842,1633920612.5883,38786.6141,0.7093,0.2318,0.1453,0.74
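
The choice of `sort` metric affects the order of the selected models. The sketch below re-ranks the rows of the comparison table in plain Python (values copied from the table; this is an illustration, not how PyCaret is implemented) and takes the top three by RMSE and by R2:

```python
# (model id, RMSE, R2) copied from the comparison table
results = [
    ("gbr", 26263.8898, 0.8681),
    ("lightgbm", 27928.658, 0.8544),
    ("et", 28118.3078, 0.8567),
    ("rf", 28143.1361, 0.8494),
    ("ada", 33690.9076, 0.7939),
    ("huber", 35001.7817, 0.7822),
    ("br", 35470.2268, 0.7689),
    ("ridge", 35649.4398, 0.7652),
    ("par", 38708.4949, 0.7424),
    ("lasso", 38786.6141, 0.7093),
]

# Lower RMSE is better, higher R2 is better.
by_rmse = sorted(results, key=lambda row: row[1])
by_r2 = sorted(results, key=lambda row: row[2], reverse=True)

top_by_rmse = [model for model, _, _ in by_rmse[:3]]
top_by_r2 = [model for model, _, _ in by_r2[:3]]
print(top_by_rmse)
print(top_by_r2)
```

Both metrics select the same three models, but `lightgbm` and `et` swap places: `lightgbm` has the lower RMSE while `et` has the higher R2. The gap between them is small under either metric.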


Display the selected models.

In [249]:
print(best)

[GradientBoostingRegressor(random_state=2023), ExtraTreesRegressor(random_state=2023), LGBMRegressor(random_state=2023)]


In [250]:
gbr = create_model('gbr')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,23605.5966,2345305516.2718,48428.3545,0.5385,0.2256,0.1524
1,15979.6412,577945060.6841,24040.4879,0.9254,0.1155,0.0874
2,18973.7058,1014087264.4503,31844.7368,0.884,0.1217,0.0924
3,17240.291,563449951.0299,23737.1007,0.8968,0.1455,0.1001
4,18466.1943,673688123.8643,25955.5028,0.8686,0.1324,0.0991
5,14729.7852,351311609.0062,18743.3084,0.9241,0.1471,0.1061
6,16338.8134,491932763.4293,22179.5573,0.9135,0.127,0.0981
7,15676.5257,521924660.3447,22845.6705,0.9085,0.131,0.0984
8,14832.4559,391093306.6132,19776.0792,0.9309,0.1145,0.0859
9,16960.9089,629412757.283,25088.0999,0.8904,0.1364,0.0973
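
The RMSE reported for `gbr` in the comparison table is the average of the per-fold RMSE values, not the square root of the average MSE. A small check using the fold values from the table above:

```python
import math
from statistics import mean

# Per-fold values copied from the gbr cross-validation table
fold_mse = [2345305516.2718, 577945060.6841, 1014087264.4503, 563449951.0299,
            673688123.8643, 351311609.0062, 491932763.4293, 521924660.3447,
            391093306.6132, 629412757.283]
fold_rmse = [48428.3545, 24040.4879, 31844.7368, 23737.1007, 25955.5028,
             18743.3084, 22179.5573, 22845.6705, 19776.0792, 25088.0999]

mean_rmse = mean(fold_rmse)                   # average of the fold RMSEs
rmse_of_mean_mse = math.sqrt(mean(fold_mse))  # square root of the average MSE

print(round(mean_rmse, 4))
print(round(rmse_of_mean_mse, 1))
```

The first value reproduces the 26263.8898 shown for `gbr` in the comparison table. The second is about 27495.7, which is larger because the square root is concave and the high-error fold 0 weighs more heavily in the MSE average. The two numbers should not be used interchangeably when comparing against results from other sources.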


In [251]:
lightgbm = create_model('lightgbm')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,22995.4623,2323967115.5401,48207.5421,0.5427,0.225,0.1479
1,16192.7498,752432340.1148,27430.5002,0.9029,0.1115,0.0845
2,19200.3789,1124837004.6183,33538.5898,0.8713,0.1282,0.098
3,18408.889,754508141.1261,27468.3116,0.8618,0.1495,0.0987
4,18465.5194,661344476.5216,25716.6187,0.871,0.1291,0.0965
5,16774.3432,581794456.8971,24120.4158,0.8743,0.1587,0.1133
6,16591.5988,629453611.4507,25088.9141,0.8893,0.1306,0.0939
7,14275.4273,438218584.8455,20933.6711,0.9232,0.1361,0.0952
8,16451.241,503831110.4771,22446.1825,0.911,0.1293,0.0942
9,15711.5003,592232815.996,24335.834,0.8968,0.132,0.0917


In [252]:
et = create_model('et')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,23900.2867,1886149120.7359,43429.8183,0.6288,0.2225,0.1507
1,15888.1376,573543449.3345,23948.7672,0.926,0.1146,0.087
2,24045.5902,1778508916.3421,42172.3715,0.7965,0.1503,0.1115
3,18211.3713,788775566.769,28085.1485,0.8555,0.1466,0.1032
4,18669.6151,723272739.7915,26893.7305,0.8589,0.13,0.0968
5,15836.4602,441666916.3909,21015.873,0.9045,0.1531,0.1132
6,16860.6917,576599470.4212,24012.4857,0.8986,0.1408,0.1042
7,15985.3515,547628577.5554,23401.4653,0.904,0.1468,0.1051
8,16264.6853,455580404.5813,21344.3296,0.9195,0.1262,0.097
9,17590.2469,722485386.212,26879.0883,0.8741,0.1376,0.0978
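
In all three tables fold 0 has the lowest R2 and the highest RMSE, so a single fold accounts for a large share of the average error. The sketch below summarizes the per-fold RMSE values (copied from the three tables) and shows how the mean changes when fold 0 is left out. Dropping a fold is only a diagnostic here, not a recommended evaluation procedure, and the population standard deviation is an assumed convention:

```python
from statistics import mean, pstdev

# Per-fold RMSE values copied from the three cross-validation tables
fold_rmse = {
    "gbr": [48428.3545, 24040.4879, 31844.7368, 23737.1007, 25955.5028,
            18743.3084, 22179.5573, 22845.6705, 19776.0792, 25088.0999],
    "lightgbm": [48207.5421, 27430.5002, 33538.5898, 27468.3116, 25716.6187,
                 24120.4158, 25088.9141, 20933.6711, 22446.1825, 24335.834],
    "et": [43429.8183, 23948.7672, 42172.3715, 28085.1485, 26893.7305,
           21015.873, 24012.4857, 23401.4653, 21344.3296, 26879.0883],
}

summary = {}
for name, values in fold_rmse.items():
    summary[name] = {
        "mean": mean(values),                    # matches the comparison table
        "std": pstdev(values),                   # spread across the 10 folds
        "mean_without_fold0": mean(values[1:]),  # diagnostic only
    }

for name, row in summary.items():
    print(f"{name:9s} mean={row['mean']:9.1f} std={row['std']:8.1f} "
          f"mean_without_fold0={row['mean_without_fold0']:9.1f}")
```

The means reproduce the RMSE column of the comparison table for the three models. The drop in the mean once fold 0 is excluded suggests that the rows in that fold deserve a closer look, for example for unusually expensive houses or outliers.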
