# Problem

Predict the cost of insurance

## Instructions

Use the dataset (insurance.csv) to train a regression model that predicts the insurance charge from the customer's characteristics. Carry out cleaning, preprocessing, modelling and testing of the model, and draw conclusions from each of these steps.

# The dataset

* age: age of primary beneficiary

* sex: gender of the insurance contractor (female, male)

* bmi: body mass index, an objective measure of body weight relative to height, computed as weight divided by height squared (kg / m^2); ideally 18.5 to 24.9

* children: Number of children covered by health insurance / Number of dependents

* smoker: whether the beneficiary smokes (yes, no)

* region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

* charges: Individual medical costs billed by health insurance
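
The bmi column follows the standard definition: weight in kilograms divided by the square of height in metres. As a small illustration, the sketch below computes it for one person. The helper names `body_mass_index` and `in_ideal_range` and the sample values are made up for this example and do not come from the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return the body mass index in kg / m^2."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(bmi: float) -> bool:
    """True when bmi falls inside the 18.5 to 24.9 range quoted in the list."""
    return 18.5 <= bmi <= 24.9


# Hypothetical person: 70 kg and 1.75 m tall.
example_bmi = body_mass_index(70.0, 1.75)
print(round(example_bmi, 2), in_ideal_range(example_bmi))
```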



In [2]:
# imports
import pandas as pd

In [3]:
ruta = "insurance.csv"
df = pd.read_csv(ruta)

In [4]:
print(df.shape)
df.head()

(1338, 7)


,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
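
The instructions ask for a cleaning step, which the notebook does not show. A minimal sketch of the usual checks (missing values, duplicated rows, unexpected category labels) might look like the following. The helper name `quality_report` and the tiny sample frame are assumptions made for illustration; in the notebook the function would be called on `df` right after loading.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame, categorical: list[str]) -> dict:
    """Summarise basic data-quality facts about a DataFrame."""
    return {
        # Number of missing values in each column.
        "missing": frame.isna().sum().to_dict(),
        # Number of rows that exactly repeat an earlier row.
        "duplicates": int(frame.duplicated().sum()),
        # Distinct labels per categorical column, to spot typos.
        "levels": {col: sorted(frame[col].dropna().unique()) for col in categorical},
    }


# Hypothetical sample with the same columns as insurance.csv.
sample = pd.DataFrame({
    "age": [19, 18, 18, None],
    "sex": ["female", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.77, 22.705],
    "children": [0, 1, 1, 0],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
    "charges": [16884.924, 1725.5523, 1725.5523, 21984.47061],
})

report = quality_report(sample, ["sex", "smoker", "region"])
print(report)
```

If the report shows duplicates or missing values, `drop_duplicates()` and either `dropna()` or an imputation step are the usual follow-ups; which one is appropriate depends on how many rows are affected.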


# Objective

Build a regression model that predicts the insurance charge from the customer's characteristics.

* Apply appropriate data-processing techniques

* Evaluate different regression models

* Compare the models

* Ensemble

* Metrics

* Final conclusions

## Preprocessing

In [5]:
numeric_features = ['age', 'bmi']
categoric_features = ['sex', 'children', 'smoker', 'region']

In [6]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Standardize the numeric columns (zero mean, unit variance)
scaler = StandardScaler()

df_scaled = scaler.fit_transform(df[numeric_features])

scaled_df = pd.DataFrame(df_scaled, columns=numeric_features)
df[numeric_features] = scaled_df

# One-hot encode the categorical columns ('children' is treated as categorical)
encoder = OneHotEncoder()

df_encoded = encoder.fit_transform(df[categoric_features])

# The encoder returns a sparse matrix, so build a sparse DataFrame with named columns
encoded_df = pd.DataFrame.sparse.from_spmatrix(df_encoded, columns=encoder.get_feature_names_out(categoric_features))
df = pd.concat([df.drop(columns=categoric_features), encoded_df], axis=1)

df

,age,bmi,charges,sex_female,sex_male,children_0,children_1,children_2,children_3,children_4,children_5,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,-1.438764,-0.453320,16884.92400,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,-1.509965,0.509621,1725.55230,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,-0.797954,0.383307,4449.46200,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,-0.441948,-1.305531,21984.47061,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,-0.513149,-0.292556,3866.85520,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,0.768473,0.050297,10600.54830,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1334,-1.509965,0.206139,2205.98080,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1335,-1.509965,1.014878,1629.83350,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1336,-1.296362,-0.797813,2007.94500,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
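
One caveat about the preprocessing cell: the scaler is fitted on the full dataset, so the mean and variance it uses already include the rows that later become the test set. A common alternative is to wrap the same two transformers in a `ColumnTransformer` and fit it on the training split only. The sketch below assumes that approach and uses a small synthetic frame in place of insurance.csv; `make_preprocessor` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "bmi"]
categoric_features = ["sex", "children", "smoker", "region"]


def make_preprocessor() -> ColumnTransformer:
    """Scale numeric columns and one-hot encode categorical ones."""
    return ColumnTransformer(
        [
            ("num", StandardScaler(), numeric_features),
            # handle_unknown="ignore" avoids errors on categories unseen in training.
            ("cat", OneHotEncoder(handle_unknown="ignore"), categoric_features),
        ],
        sparse_threshold=0,  # always return a dense array
    )


# Synthetic stand-in for the real data (values are random, not from the CSV).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "sex": rng.choice(["female", "male"], n),
    "children": rng.integers(0, 6, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})

train, test = train_test_split(data, test_size=0.2, random_state=42)

preprocessor = make_preprocessor()
X_train_demo = preprocessor.fit_transform(train)  # statistics come from train only
X_test_demo = preprocessor.transform(test)        # test rows are only transformed

print(X_train_demo.shape, X_test_demo.shape)
```

Tree-based models are largely insensitive to feature scaling, so this matters most for scale-sensitive models such as SVR.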


## Implementation

### Random Forest Regression

In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Separate the features from the target variable
X = df.drop('charges', axis=1)
y = df['charges']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the random forest regression model
random_forest_reg = RandomForestRegressor(random_state=42)

# Train the model
random_forest_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = random_forest_reg.predict(X_test)

# Compute the mean squared error (MSE) and its square root (RMSE) on the test set
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
print('Mean Squared Error (MSE):', mse)
print('Root Mean Squared Error (RMSE):', rmse)



Mean Squared Error (MSE): 21491836.404149912
Root Mean Squared Error (RMSE): 4635.928860988908
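
RMSE alone can be hard to interpret, so it may help to report the mean absolute error (MAE) and the coefficient of determination (R²) next to it. The helper below is a sketch: `regression_report` is a hypothetical name, and the toy arrays stand in for `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred) -> dict:
    """Return MAE, RMSE and R^2 for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "rmse": float(np.sqrt(mse)),
        "r2": float(r2_score(y_true, y_pred)),
    }


# Toy values: every prediction is off by exactly 10 units.
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 310.0, 390.0])

metrics_demo = regression_report(y_true_demo, y_pred_demo)
print(metrics_demo)
```

MAE and RMSE are in the same units as `charges`, while R² is unitless, which makes it easier to compare against results on other datasets.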




### XGBoost

In [8]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Separate the features from the target variable
X = df.drop('charges', axis=1)
y = df['charges']

# Split the data into training and test sets (same split as for the random forest)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the XGBoost regression model
xgboost_reg = xgb.XGBRegressor(random_state=42)

# Train the model
xgboost_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = xgboost_reg.predict(X_test)

# Compute the mean squared error (MSE) and its square root (RMSE) on the test set
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
print('Mean Squared Error (MSE):', mse)
print('Root Mean Squared Error (RMSE):', rmse)


Mean Squared Error (MSE): 23566319.33805541
Root Mean Squared Error (RMSE): 4854.515355630818


### SVR

In [9]:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Separate the features from the target variable
X = df.drop('charges', axis=1)
y = df['charges']

# Split the data into training and test sets (same split as for the previous models)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the support vector regression (SVR) model with default hyperparameters
svr_reg = SVR()

# Train the model
svr_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = svr_reg.predict(X_test)

# Compute the mean squared error (MSE) and its square root (RMSE) on the test set
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
print('Mean Squared Error (MSE):', mse)
print('Root Mean Squared Error (RMSE):', rmse)


Mean Squared Error (MSE): 166181398.8741574
Root Mean Squared Error (RMSE): 12891.13644618493
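
The SVR error is nearly three times that of the tree models. A likely cause (an assumption, not something verified in this notebook) is that `SVR` runs with its default `C=1.0` and `epsilon=0.1` on a target measured in thousands, so the fitted function barely moves away from a constant. Scaling the target with `TransformedTargetRegressor` is one way to test that hypothesis. The sketch below uses synthetic data, so its numbers say nothing about insurance.csv.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem with a target on the scale of thousands.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = 13000 + 8000 * X_demo[:, 0] + 3000 * X_demo[:, 1] + rng.normal(0, 500, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Default SVR trained directly on the raw target.
plain_svr = SVR().fit(X_tr, y_tr)

# Same SVR, but the target is standardized before fitting and
# predictions are mapped back to the original scale automatically.
scaled_svr = TransformedTargetRegressor(regressor=SVR(), transformer=StandardScaler())
scaled_svr.fit(X_tr, y_tr)

rmse_plain = mean_squared_error(y_te, plain_svr.predict(X_te)) ** 0.5
rmse_scaled = mean_squared_error(y_te, scaled_svr.predict(X_te)) ** 0.5
print(f"raw target RMSE: {rmse_plain:.1f}")
print(f"scaled target RMSE: {rmse_scaled:.1f}")
```

Tuning `C`, `epsilon` and the kernel with a grid search would be the natural next step if the scaled version looks competitive on the real data.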




## Ensemble

In [12]:
from sklearn.ensemble import VotingRegressor

# Build an ensemble from the two best models
ensemble = VotingRegressor(estimators=[
    ('rf', random_forest_reg),
    ('xgb', xgboost_reg)
])

# Train the ensemble (VotingRegressor refits a clone of each estimator)
ensemble.fit(X_train, y_train)

# Evaluate the ensemble on the test set
ensemble_y_pred = ensemble.predict(X_test)
ensemble_mse = mean_squared_error(y_test, ensemble_y_pred)
ensemble_rmse = ensemble_mse ** 0.5

print(f"Ensemble Model\nMSE: {ensemble_mse}\nRMSE: {ensemble_rmse}")



Ensemble Model
MSE: 21327769.16230562
RMSE: 4618.199775053654
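
`VotingRegressor` refits a clone of each estimator and, when no `weights` are given, returns the plain average of their predictions. The sketch below checks that on synthetic data. Because xgboost may not be installed everywhere, it swaps in scikit-learn's `GradientBoostingRegressor` as the second model; that substitution is an assumption of this example, not what the notebook uses.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)

# Synthetic data standing in for the insurance features and charges.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

voter = VotingRegressor(estimators=[
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ("gb", GradientBoostingRegressor(random_state=42)),
])
voter.fit(X_demo, y_demo)

# estimators_ holds the fitted clones, in the order given above.
individual = np.column_stack([est.predict(X_demo) for est in voter.estimators_])
manual_average = individual.mean(axis=1)

# The ensemble prediction should match the unweighted mean of its members.
matches = np.allclose(voter.predict(X_demo), manual_average)
print(matches)
```

Passing a `weights` list to `VotingRegressor` turns this into a weighted average, which can be useful when one model is clearly stronger than the other.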




# Conclusions

Combining the two best models in an ensemble gave a small improvement in test accuracy: the ensemble reaches an RMSE of about 4618, against 4636 for random forest alone, 4855 for XGBoost and 12891 for SVR. This suggests that averaging models which capture different aspects of the data can improve generalization and predictive performance, although a gain of under 0.5% on a single 80/20 split is modest.
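
An improvement this small can easily depend on the particular train/test split. One way to check is k-fold cross-validation, which gives a mean and spread of the RMSE for each model. The sketch below assumes 5 folds and synthetic data, and uses `GradientBoostingRegressor` as a stand-in for XGBoost; `cv_rmse` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.model_selection import KFold, cross_val_score


def cv_rmse(model, X, y, folds: int = 5) -> np.ndarray:
    """Return the RMSE obtained on each cross-validation fold."""
    cv = KFold(n_splits=folds, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    return -scores  # scikit-learn reports the negated error


# Synthetic data standing in for the insurance features and charges.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=30, random_state=42)
gb = GradientBoostingRegressor(n_estimators=50, random_state=42)
models = {
    "rf": rf,
    "gb": gb,
    "ensemble": VotingRegressor(estimators=[("rf", rf), ("gb", gb)]),
}

results = {name: cv_rmse(model, X_demo, y_demo) for name, model in models.items()}
for name, rmse_per_fold in results.items():
    print(f"{name}: {rmse_per_fold.mean():.1f} +/- {rmse_per_fold.std():.1f}")
```

If the ensemble's mean RMSE is lower than both members by more than the fold-to-fold spread, the improvement is more likely to be real than an artifact of one split.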