# Problem

Predict the cost of insurance

## Instructions

Use the dataset (insurance.csv) to train a regression model that predicts the insurance charge from the customer's characteristics. Clean and preprocess the data, then build and test the model, and report conclusions for each of these steps.

# The dataset

* age: age of primary beneficiary

* sex: gender of the insurance contractor (female, male)

* bmi: body mass index, an objective measure of body weight relative to height, computed as weight divided by height squared (kg / m^2); it indicates whether a weight is relatively high or low for a given height, with an ideal range of 18.5 to 24.9

* children: number of children covered by health insurance (number of dependents)

* smoker: whether the beneficiary smokes (yes, no)

* region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

* charges: Individual medical costs billed by health insurance
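
As a quick illustration of the `bmi` definition above, the following minimal sketch computes the index from a weight and a height and checks it against the 18.5 to 24.9 range quoted in the list. The function names `bmi` and `in_ideal_range` and the sample values are hypothetical; they are not part of the dataset or the notebook.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when the index falls inside the range quoted in the data dictionary."""
    return low <= value <= high


# Hypothetical person: 70 kg, 1.75 m
example = bmi(70.0, 1.75)
print(round(example, 2), in_ideal_range(example))
```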



In [None]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data (Google Colab: mount Google Drive first)
from google.colab import drive
drive.mount('/gdrive')


data = pd.read_csv('/gdrive/MyDrive/EDEM/ML/Regresión/ejercicios/insurance.csv')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [None]:
print(data.shape)
data.head()

(1338, 7)


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [None]:
# Assuming 'data' is your DataFrame,
# count the null values in each column
null_values = data.isnull().sum()

# Print the null count per column
print(null_values)


age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64


# Statistics

In [None]:
data['age'].value_counts()

age
18    69
19    68
50    29
51    29
47    29
46    29
45    29
20    29
48    29
52    29
22    28
49    28
54    28
53    28
21    28
26    28
24    28
25    28
28    28
27    28
23    28
43    27
29    27
30    27
41    27
42    27
44    27
31    27
40    27
32    26
33    26
56    26
34    26
55    26
57    26
37    25
59    25
58    25
36    25
38    25
35    25
39    25
61    23
60    23
63    23
62    23
64    22
Name: count, dtype: int64

In [None]:
mapping = {'male': 0, 'female': 1}
# Apply the mapping to the 'sex' column; map already yields numeric values,
# so the repeated pd.to_numeric calls were redundant and have been dropped
data['sex'] = data['sex'].map(mapping)
data['sex'].value_counts()

sex
0    676
1    662
Name: count, dtype: int64
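
One thing worth checking after a dictionary mapping like the one above: `Series.map` returns `NaN` for any value that is missing from the dictionary, so a misspelled or differently cased category would silently become a null. The sketch below shows the behaviour on a small made-up series (not the insurance data) and lists the labels that failed to map.

```python
import pandas as pd

# Toy series with one label ("Female") that the mapping does not cover
sex = pd.Series(["male", "female", "Female"])
mapped = sex.map({"male": 0, "female": 1})

# Labels that were not found in the mapping end up as NaN
unmapped = sex[mapped.isna()].unique().tolist()
print(unmapped)
```

In the notebook's data the counts after mapping (676 and 662) add up to the 1338 rows, so no value was lost there.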

In [None]:
# Columns: age, sex, bmi, children, smoker, region, charges
data['bmi'].value_counts()

bmi
32.300    13
28.310     9
30.495     8
30.875     8
31.350     8
          ..
46.200     1
23.800     1
44.770     1
32.120     1
30.970     1
Name: count, Length: 548, dtype: int64

In [None]:
data['children'].value_counts()

children
0    574
1    324
2    240
3    157
4     25
5     18
Name: count, dtype: int64

In [None]:
data['smoker'].value_counts()

smoker
no     1064
yes     274
Name: count, dtype: int64

In [None]:
mapping = {'no': 0, 'yes': 1}
# Apply the mapping to the 'smoker' column
data['smoker'] = data['smoker'].map(mapping)


In [None]:
mapping = {'southeast': 1, 'southwest': 2, 'northwest': 3, 'northeast': 4}
# Apply the mapping to the 'region' column
data['region'] = data['region'].map(mapping)
data['region'].value_counts()

region
1    364
2    325
3    325
4    324
Name: count, dtype: int64

In [None]:
import pandas as pd

# Re-encode 'region' as consecutive integer codes starting at 0;
# pd.factorize assigns the codes in order of first appearance
data['region'] = pd.factorize(data['region'])[0]

# Print the counts of the re-encoded column
print(data['region'].value_counts())


region
1    364
0    325
2    325
3    324
Name: count, dtype: int64
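
Integer codes such as 0 to 3 give `region` an order and a spacing that the four regions do not really have, and a linear model will treat those codes as a numeric quantity. A common alternative is one-hot encoding. The sketch below applies `pd.get_dummies` to a small made-up frame (not the insurance data); dropping the first level is an assumption made here to avoid a redundant column, not something the notebook does.

```python
import pandas as pd

# Toy frame with the four region labels from the data dictionary
toy = pd.DataFrame(
    {"region": ["southwest", "southeast", "southeast", "northwest", "northeast"]}
)

# One indicator column per region; the first level (alphabetically) is dropped
encoded = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With `drop_first=True`, the dropped level ("northeast" here) is represented by a row of zeros.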


In [None]:
data['charges'].value_counts()

charges
1639.56310     2
16884.92400    1
29330.98315    1
2221.56445     1
19798.05455    1
              ..
7345.08400     1
26109.32905    1
28287.89766    1
1149.39590     1
29141.36030    1
Name: count, Length: 1337, dtype: int64
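
The output above shows 1337 distinct values for 1338 rows, with one charge appearing twice. That may point to a duplicated record, although the notebook does not check whether the two rows match in every column. A minimal sketch of such a check, on a made-up frame whose values are purely illustrative:

```python
import pandas as pd

# Toy frame: the first two rows are identical in every column
toy = pd.DataFrame(
    {
        "age": [40, 40, 30],
        "bmi": [27.5, 27.5, 25.0],
        "charges": [1639.5631, 1639.5631, 4000.0],
    }
)

# duplicated() flags every repeat of an earlier, fully identical row
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```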

# Normalization

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data[['age', 'bmi', 'children']] = scaler.fit_transform(data[['age', 'bmi', 'children']])

# Show the first rows of the normalized DataFrame
data.head(2)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,0.021739,1,0.321227,0.0,1,0,16884.924
1,0.0,0,0.47915,0.2,0,1,1725.5523
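
The scaled values above can be reproduced by hand. Min-max scaling maps each value to `(x - min) / (max - min)`; the age counts shown earlier range from 18 to 64, and `children` from 0 to 5. The sketch below is plain Python with a hypothetical helper name, `min_max_scale`. Its second part illustrates a stricter variant that the notebook does not use: taking the minimum and maximum from the training split only, so that no information from the test rows enters the scaling.

```python
def min_max_scale(value, minimum, maximum):
    """Map value linearly so that minimum -> 0 and maximum -> 1."""
    return (value - minimum) / (maximum - minimum)


# Reproduce two values from the table above
print(round(min_max_scale(19, 18, 64), 6))  # age 19 -> 0.021739
print(min_max_scale(1, 0, 5))               # 1 child -> 0.2

# Stricter variant (an assumption, not what the notebook does):
# take min/max from made-up training values only, then apply them to test values
train_ages = [18, 25, 40, 60]
test_ages = [19, 64]
low, high = min(train_ages), max(train_ages)
scaled_test = [min_max_scale(age, low, high) for age in test_ages]
print(scaled_test)
```

With this variant, a test value outside the training range (64 here) lands outside [0, 1], which is expected.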


# Objective

Build a regression model that predicts the insurance charge from the customer's characteristics.

* Apply the appropriate data-processing techniques

* Evaluate different regression models

* Compare the models

* Ensemble

* Metrics

* Final conclusions
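
The implementation that follows fits individual models only. For the ensemble item in the list, one possible approach is to average the predictions of several regressors with scikit-learn's `VotingRegressor`. The sketch below is an illustration under stated assumptions: it runs on synthetic data from `make_regression`, not on the insurance dataset, and the choice of member models is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the insurance dataset)
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Average a linear model and a decision tree
ensemble = VotingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("tree", DecisionTreeRegressor(random_state=42)),
    ]
)
ensemble.fit(X_train, y_train)
ensemble_predictions = ensemble.predict(X_test)

# With no weights given, the ensemble output is the plain mean of its members
member_mean = np.mean(
    [member.predict(X_test) for member in ensemble.estimators_], axis=0
)
print(np.allclose(ensemble_predictions, member_mean))
```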

## Implementation

In [None]:
data.head(5)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,0.021739,1,0.321227,0.0,1,0,16884.924
1,0.0,0,0.47915,0.2,0,1,1725.5523
2,0.217391,0,0.458434,0.6,0,1,4449.462
3,0.326087,0,0.181464,0.0,0,2,21984.47061
4,0.304348,0,0.347592,0.0,0,2,3866.8552


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Split the data into independent variables (X) and the dependent variable (y)
X = data.drop(columns=['charges'])
y = data['charges']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
linear_predictions_train = linear_reg.predict(X_train)
linear_predictions_test = linear_reg.predict(X_test)
linear_r2 = r2_score(y_test, linear_predictions_test)
linear_mae = mean_absolute_error(y_test, linear_predictions_test)
linear_rmse = np.sqrt(mean_squared_error(y_test, linear_predictions_test))
print("Linear regression:")
print("R squared (R2):", linear_r2)
print("Mean absolute error (MAE):", linear_mae)
print("Root mean squared error (RMSE):", linear_rmse)

# Polynomial regression: expand the features to degree 2, then fit a linear model
degree = 2
poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)
poly_predictions_train = poly_reg.predict(X_train_poly)
poly_predictions_test = poly_reg.predict(X_test_poly)
poly_r2 = r2_score(y_test, poly_predictions_test)
poly_mae = mean_absolute_error(y_test, poly_predictions_test)
poly_rmse = np.sqrt(mean_squared_error(y_test, poly_predictions_test))
print("\nPolynomial regression (degree", degree, "):")
print("R squared (R2):", poly_r2)
print("Mean absolute error (MAE):", poly_mae)
print("Root mean squared error (RMSE):", poly_rmse)

# Decision tree
decision_tree_reg = DecisionTreeRegressor(random_state=42)
decision_tree_reg.fit(X_train, y_train)
dt_predictions_train = decision_tree_reg.predict(X_train)
dt_predictions_test = decision_tree_reg.predict(X_test)
dt_r2 = r2_score(y_test, dt_predictions_test)
dt_mae = mean_absolute_error(y_test, dt_predictions_test)
dt_rmse = np.sqrt(mean_squared_error(y_test, dt_predictions_test))
print("\nDecision tree:")
print("R squared (R2):", dt_r2)
print("Mean absolute error (MAE):", dt_mae)
print("Root mean squared error (RMSE):", dt_rmse)


Linear regression:
R squared (R2): 0.7833463107364537
Mean absolute error (MAE): 4186.508898366434
Root mean squared error (RMSE): 5799.587091438359

Polynomial regression (degree 2 ):
R squared (R2): 0.8677566718537741
Mean absolute error (MAE): 2730.315581680434
Root mean squared error (RMSE): 4531.071500534053

Decision tree:
R squared (R2): 0.7323462208330208
Mean absolute error (MAE): 2956.4767822985077
Root mean squared error (RMSE): 6446.1546440224965
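
To make the three reported metrics concrete, the sketch below computes them from their definitions on four made-up values (not taken from the dataset). The helper name `regression_metrics` is hypothetical. Because RMSE squares the errors before averaging, it weights large misses more heavily than MAE does; that is one plausible reading of why the decision tree above has a lower MAE but a higher RMSE than linear regression.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE and RMSE computed directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "rmse": rmse}


# Made-up targets and predictions
metrics = regression_metrics(
    [1000.0, 2000.0, 3000.0, 4000.0],
    [1100.0, 1900.0, 3200.0, 3800.0],
)
print(metrics)
```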


# Conclusions

- **Linear regression** gives only a moderate fit on this dataset: on the test set it reaches an R² of about 0.78, with an MAE of about 4,187 and an RMSE of about 5,800, the highest MAE of the three models. Degree-2 polynomial regression does clearly better (R² of about 0.87, MAE of about 2,730), which suggests that a purely linear model does not fully capture the relationship between the independent variables and the dependent variable.