# Problem

Predict insurance charges

## Instructions

Use the dataset (insurance.csv) to train a regression model that predicts the insurance charges from the customer's characteristics. Clean and preprocess the data, then train and test the model, drawing conclusions from each of these steps.

# The dataset

* age: age of primary beneficiary

* sex: gender of the insurance contractor (female or male)

* bmi: body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg / m ^ 2); the ideal range is 18.5 to 24.9

* children: number of children covered by health insurance / number of dependents

* smoker: whether the beneficiary smokes (yes or no)

* region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

* charges: Individual medical costs billed by health insurance
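
As a quick illustration of the `bmi` definition above, the sketch below computes the index from a weight and a height. The function name `body_mass_index` and the sample person are illustrative assumptions; the dataset only stores the resulting `bmi` value.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return weight divided by height squared (kg / m^2)."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
bmi = body_mass_index(70.0, 1.75)
print(round(bmi, 2))

# Check against the ideal range quoted in the column description.
print(18.5 <= bmi <= 24.9)
```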



In [1]:
# imports
import pandas as pd


In [2]:
ruta = "insurance.csv"
data = pd.read_csv(ruta)

In [3]:
print(data.shape)
data.head()

(1338, 7)


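|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |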


# Objective

Build a regression model that predicts the insurance charges from the customer's characteristics.

* Apply the appropriate data processing techniques

* Evaluate different regression models

* Compare the models

* Ensemble

* Metrics

* Final conclusions

## Implementation

In [4]:
# Check for missing values in the dataset
data.isnull().sum()



age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
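
Missing values are only one part of cleaning. A minimal sketch of two further checks, repeated rows and column types, is shown below. It runs on a small illustrative frame (the third row is a deliberate copy of the second), not on insurance.csv, so it says nothing about whether the real data contains duplicates.

```python
import pandas as pd

# Small illustrative frame with the same columns as insurance.csv.
# The third row deliberately repeats the second one.
sample = pd.DataFrame({
    "age": [19, 18, 18],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.77],
    "children": [0, 1, 1],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 1725.5523],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(sample.duplicated().sum())

# Drop the copies and rebuild a clean 0..n-1 index.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated.dtypes)
```

The same two calls could be applied to `data` before encoding.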

In [5]:
# Encode the categorical variables
data_encoded = pd.get_dummies(data, columns=['region'], drop_first=True)
data_encoded['sex'] = data_encoded['sex'].map({'male': 1, 'female': 0})
data_encoded['smoker'] = data_encoded['smoker'].map({'yes': 1, 'no': 0})

# Show the first rows of the encoded dataset
data_encoded.head()


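|   | age | sex | bmi | children | smoker | charges | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 16884.924 | False | False | True |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 1725.5523 | False | True | False |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 4449.462 | False | True | False |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 21984.47061 | True | False | False |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 3866.8552 | True | False | False |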


In [6]:
from sklearn.model_selection import train_test_split

# Separate the features from the target variable
X = data_encoded.drop('charges', axis=1)
y = data_encoded['charges']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


((1070, 8), (268, 8), (1070,), (268,))
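
A single 80/20 split gives one estimate of test performance, which depends on which rows land in the test set. A common complement is k-fold cross-validation. The sketch below shows the mechanics on synthetic data. The names `X_demo` and `y_demo` and the generated values are assumptions for illustration and are unrelated to the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (not the insurance dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Five shuffled folds: each row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

# Mean and spread of R2 across the folds.
print(scores.mean(), scores.std())
```

On the real data, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`, keeping the test set aside for the final evaluation.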

# Linear regression

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Train the linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Predictions
y_pred_train = linear_model.predict(X_train)
y_pred_test = linear_model.predict(X_test)

# Model evaluation
mse_train = mean_squared_error(y_train, y_pred_train)
mae_train = mean_absolute_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)

mse_test = mean_squared_error(y_test, y_pred_test)
mae_test = mean_absolute_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)

mse_train, mae_train, r2_train, mse_test, mae_test, r2_test


(37277681.70201866,
 4208.234572492226,
 0.7417255854683333,
 33596915.85136146,
 4181.1944737536505,
 0.7835929767120723)
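
The MSE values are in squared charge units, which makes them hard to read next to the MAE. Taking the square root gives the RMSE, expressed in the same units as `charges`. The sketch below applies this to the two MSE values printed above. The variable names `rmse_train` and `rmse_test` are introduced here for illustration.

```python
import math

# MSE values copied from the linear regression output above.
mse_train = 37277681.70201866
mse_test = 33596915.85136146

# RMSE is the square root of MSE, in the same units as `charges`.
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"RMSE train: {rmse_train:.2f}")
print(f"RMSE test:  {rmse_test:.2f}")
```

Both values come out at roughly 6,000, noticeably above the MAE of about 4,200, which suggests that a few large errors weigh heavily on the squared metric.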

# Models

In [10]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Function to train and evaluate a model
def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the training data and return
    (MSE, MAE, R2) on the training set followed by (MSE, MAE, R2) on the test set."""
    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    mse_train = mean_squared_error(y_train, y_pred_train)
    mae_train = mean_absolute_error(y_train, y_pred_train)
    r2_train = r2_score(y_train, y_pred_train)
    mse_test = mean_squared_error(y_test, y_pred_test)
    mae_test = mean_absolute_error(y_test, y_pred_test)
    r2_test = r2_score(y_test, y_pred_test)
    return mse_train, mae_train, r2_train, mse_test, mae_test, r2_test


In [11]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Models to evaluate
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42)
}

# Evaluate the models
results = {}
for model_name, model in models.items():
    results[model_name] = evaluate_model(model, X_train, y_train, X_test, y_test)

# Show the results
results_df = pd.DataFrame(results, index=['MSE Train', 'MAE Train', 'R2 Train', 'MSE Test', 'MAE Test', 'R2 Test'])
print(results_df)


           Linear Regression  Ridge Regression  Lasso Regression  \
MSE Train       3.727768e+07      3.728069e+07      3.727774e+07   
MAE Train       4.208235e+03      4.217887e+03      4.208584e+03   
R2 Train        7.417256e-01      7.417048e-01      7.417252e-01   
MSE Test        3.359692e+07      3.364504e+07      3.360551e+07   
MAE Test        4.181194e+03      4.193585e+03      4.182426e+03   
R2 Test         7.835930e-01      7.832830e-01      7.835376e-01   

           Decision Tree  Random Forest  
MSE Train   2.442396e+05   3.757331e+06  
MAE Train   2.957252e+01   1.067874e+03  
R2 Train    9.983078e-01   9.739677e-01  
MSE Test    4.244691e+07   2.095569e+07  
MAE Test    3.195110e+03   2.550670e+03  
R2 Test     7.265877e-01   8.650186e-01  
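
The objectives list an ensemble, which the cells above do not build explicitly (the random forest is itself an ensemble of trees). One simple option is to average the predictions of a linear model and a random forest with scikit-learn's `VotingRegressor`. The sketch below shows the mechanics on synthetic data. The names `X_demo`, `y_demo` and `ensemble` and the generated target are assumptions for illustration, so the score it prints is not a result for the insurance dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a linear effect plus an interaction term and noise.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-2, 2, size=(400, 3))
y_demo = (
    2.0 * X_demo[:, 0]
    + X_demo[:, 1] * X_demo[:, 2]
    + rng.normal(scale=0.2, size=400)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# VotingRegressor fits each estimator and averages their predictions.
ensemble = VotingRegressor(estimators=[
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=42)),
])
ensemble.fit(X_tr, y_tr)

ensemble_r2 = r2_score(y_te, ensemble.predict(X_te))
print(f"Ensemble R2 on the synthetic test split: {ensemble_r2:.3f}")
```

Because `VotingRegressor` follows the usual estimator interface, an instance like this could be added to the `models` dictionary and passed through `evaluate_model` alongside the other candidates.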
