# Predicting Insurance Charges

## M.Sc. Favio Vázquez (XXX Congreso nacional de actuarios)

![](https://media.istockphoto.com/photos/insurance-protecting-family-health-live-house-and-car-concept-picture-id1199060494?k=20&m=1199060494&s=612x612&w=0&h=Jw_XYEFO42jcs4aBFdbiEnPPNODyYjvQCpcrmXnCazM=)

We explore a dataset on the cost of health insurance premiums as a function of a few patient characteristics. Premium cost depends on many factors: diagnosis, type of clinic, city of residence, age, and so on. The dataset has no information on the patients' diagnoses, but it does contain other variables that can help us draw conclusions about the patients' health and practice regression analysis.

Column descriptions:

- **age**: age of the primary beneficiary
- **sex**: gender of the policyholder (female or male)
- **bmi**: body mass index, an objective index of body weight (kg/m^2) based on the ratio of weight to height, which indicates whether weight is relatively high or low for a given height; ideally 18.5 to 24.9
- **children**: number of children covered by the health insurance (number of dependents)
- **smoker**: whether the beneficiary smokes
- **region**: the beneficiary's residential area in the US: northeast, southeast, southwest, or northwest
- **charges**: individual medical costs billed by the health insurance
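
The `bmi` column follows the usual kg/m^2 definition. As a minimal sketch (the function names `bmi` and `in_ideal_range` are illustrative and do not appear in the notebook, and the example person is made up), the index and the 18.5 to 24.9 range can be computed like this:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when the BMI falls inside the range the column description calls ideal."""
    return low <= value <= high


example = bmi(70.0, 1.75)  # hypothetical person: 70 kg, 1.75 m
print(round(example, 2), in_ideal_range(example))
```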

In [None]:
!pip install plotly

In [None]:
!pip install --upgrade bamboolib --user

In [None]:
!python -m bamboolib install_nbextensions

In [None]:
!pip install shap

## Import data

In [None]:
import pandas as pd
import bamboolib as bam
import plotly.express as px
import plotly.graph_objects as go

In [None]:
df = pd.read_csv("../data/insurance.csv")

## Data exploration and preparation

In [None]:
df

In [None]:
# Number of rows and columns
rows = df.shape[0]
columns = df.shape[1]
print(f'Number of rows: {rows}')
print(f'Number of columns: {columns}')

In [None]:
# Missing values per column
df.isnull().sum()

In [None]:
# Data types
df.dtypes

In [None]:
# Summary statistics of the numeric variables
df.describe()

In [None]:
df

In [None]:
# Age distribution in 3-year bins (go was imported above from plotly.graph_objects)
go.Figure(
    data=[go.Histogram(x=df["age"], xbins={"start": 18, "end": 66.0, "size": 3.0})],
    layout=go.Layout(title="Histogram of age", yaxis={"title": "Count"}, bargap=0.05),
)

In [None]:
go.Figure(
    data=[go.Histogram(x=df["charges"], xbins={"start": 1000.0, "end": 65000.0, "size": 4000.0})],
    layout=go.Layout(title="Histogram of charges", yaxis={"title": "Count"}, bargap=0.05),
)

In [None]:
# Distribution of charges, split by smoking status
fig = px.violin(df, x='charges', color='smoker', box=True)
fig

In [None]:
fig = px.violin(df[df.sex=="female"], x='charges', color='smoker', box=True,
               title="Violin plot of charges for women, by smoking status")
fig

In [None]:
fig = px.violin(df[df.sex=="male"], x='charges', color='smoker', box=True,
               title="Violin plot of charges for men, by smoking status")
fig

In [None]:
fig = px.box(df[df.age==18], x='charges', color='smoker',
               title="Box plot of charges for 18-year-olds, by smoking status")
fig

In [None]:
fig = px.density_contour(df[df.smoker=="no"], x="age", y="charges",  marginal_x="histogram", marginal_y="histogram")
fig.show()

In [None]:
fig = px.density_contour(df[df.smoker=="yes"], x="age", y="charges",  marginal_x="histogram", marginal_y="histogram")
fig.show()

In [None]:
# Distribution of body mass index
fig = px.histogram(df, x='bmi')
fig

In [None]:
# Charges for beneficiaries with a BMI of 30 or above
fig = px.box(df[df.bmi >= 30], x='charges')
fig

In [None]:
# Charges for beneficiaries with a BMI below 30
fig = px.box(df[df.bmi < 30], x='charges')
fig

In [None]:
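# One-hot encode the categorical columns; drop_first=True drops one level per column
df = pd.get_dummies(df, columns=['sex', 'region', 'smoker'], drop_first=True, dummy_na=False)
# After encoding, sex is 1 for male and smoker is 1 for yes
df = df.rename(columns={'sex_male': 'sex', 'smoker_yes': 'smoker'})
df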
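
The encoding step above keeps one indicator fewer than the number of levels in each categorical column. The following plain-Python sketch mimics that idea; it is not pandas' implementation, `one_hot_drop_first` is a hypothetical helper, and it assumes levels are ordered alphabetically, which matches the `sex_male` and `smoker_yes` columns renamed above.

```python
def one_hot_drop_first(values, prefix):
    """Encode a list of category labels as 0/1 indicators, dropping the first level."""
    levels = sorted(set(values))   # assumption: alphabetical level order
    kept = levels[1:]              # the first level becomes the implicit baseline
    return [{f"{prefix}_{level}": int(v == level) for level in kept} for v in values]


sex_rows = one_hot_drop_first(["female", "male", "male"], "sex")
region_rows = one_hot_drop_first(
    ["southwest", "northeast", "northwest", "southeast"], "region"
)
print(sex_rows)
print(sorted(region_rows[0]))  # indicator names kept for region
```

A row whose indicators are all zero belongs to the dropped baseline level, so no information is lost.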

## Modeling

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV

In [None]:
X = df.drop(['charges'], axis=1)
y = df.charges

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [None]:
# Baseline: ordinary least squares; score() returns R^2 on the test set
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)
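
For regressors, scikit-learn's `score` method returns the coefficient of determination, R^2. As a rough sketch of that quantity in plain Python (the helper name `r2_score_manual` and the toy numbers are illustrative only, and the constant-target edge case is not handled):

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_score_manual(y_true, y_true)                  # predictions match exactly
mean_only = r2_score_manual(y_true, [6.0] * 4)             # always predict the mean
reversed_fit = r2_score_manual(y_true, [9.0, 7.0, 5.0, 3.0])  # worse than the mean
print(perfect, mean_only, reversed_fit)
```

A model that always predicts the mean scores 0, and a model that does worse than the mean scores below 0.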

In [None]:
# Ridge regression: least squares with an L2 penalty of strength alpha
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)
ridge.score(X_test, y_test)
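
To make the role of `alpha` concrete, here is a one-feature sketch of ridge regression with an unpenalized intercept. The helper `ridge_1d` and the toy data are invented for illustration; scikit-learn solves the general multi-feature problem with its own solvers.

```python
def ridge_1d(x, y, alpha):
    """Slope and intercept minimizing sum((y - a*x - b)^2) + alpha * a^2."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)           # the penalty shrinks the slope toward 0
    intercept = y_mean - slope * x_mean   # the intercept itself is not penalized
    return slope, intercept


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
for alpha in (0.0, 1.0, 100.0):
    print(alpha, ridge_1d(x, y, alpha))
```

With `alpha = 0` the result is the ordinary least-squares line; larger values pull the slope toward zero.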

In [None]:
model_params = {
    'random-forest': {
        'model': RandomForestRegressor(),
        'params': {
            'n_estimators': [5, 10, 15, 20, 25],
            'max_depth': [3, 5, 7, 9, 11, 13],
        },
    },
    'ridge': {
        'model': Ridge(),
        'params': {
            'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
        },
    },
    'decision_tree': {
        'model': DecisionTreeRegressor(),
        'params': {
            'max_depth': [3, 5, 7, 9, 11, 13],
        },
    },
}

In [None]:
model_scores = []

for model_name, mp in model_params.items():
    # n_iter=2 keeps the search cheap: only two sampled settings per model are tried
    search = RandomizedSearchCV(mp['model'], mp['params'], cv=5,
                                return_train_score=False, n_iter=2, random_state=42)
    search.fit(X_train, y_train)

    # score() uses the best estimator, refit on the full training set, and returns R^2
    model_scores.append({
        'model': model_name,
        'score_train': search.score(X_train, y_train),
        'score_test': search.score(X_test, y_test),
        'best_params': search.best_params_,
    })

In [None]:
results = pd.DataFrame(model_scores)
results
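
Because the search above runs with `n_iter = 2`, only two candidate settings per model are evaluated, so the reported best parameters depend heavily on which candidates were drawn. As a hedged sketch of the sampling idea (not scikit-learn's actual implementation; `sample_candidates` is a hypothetical helper), drawing without replacement from the random-forest grid looks like this:

```python
import itertools
import random


def sample_candidates(param_grid, n_iter, seed=0):
    """Enumerate every grid combination, then draw n_iter of them without replacement."""
    keys = sorted(param_grid)
    all_combos = [
        dict(zip(keys, values))
        for values in itertools.product(*(param_grid[k] for k in keys))
    ]
    rng = random.Random(seed)
    chosen = rng.sample(all_combos, min(n_iter, len(all_combos)))
    return all_combos, chosen


rf_grid = {"n_estimators": [5, 10, 15, 20, 25], "max_depth": [3, 5, 7, 9, 11, 13]}
all_combos, chosen = sample_candidates(rf_grid, n_iter=2)
print(f"{len(chosen)} of {len(all_combos)} combinations evaluated")
for combo in chosen:
    print(combo)
```

Raising `n_iter` covers more of the grid at the cost of more model fits.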

## Explanations

In [None]:
# Shallow random forest used for the SHAP explanations; random_state makes it reproducible
model_rf = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)
model_rf.fit(X_train, y_train)

In [None]:
model_rf.score(X_test,y_test)

In [None]:
sample = X_test.iloc[[0]]  # first test row as a one-row DataFrame (keeps column dtypes)

In [None]:
sample

In [None]:
y_test.iloc[0]

In [None]:
import shap
shap.initjs()

In [None]:
explainer = shap.TreeExplainer(model_rf)

In [None]:
shap_values = explainer.shap_values(X_test)

In [None]:
shap.summary_plot(shap_values, X_test)

In [None]:
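import numpy as np
sample = X_test.iloc[[0]]  # first test row, kept as a one-row DataFrame
shap_values = explainer.shap_values(sample)
# expected_value is a scalar or a one-element array depending on the shap version
base_value = np.atleast_1d(explainer.expected_value)[0]
shap.force_plot(base_value, shap_values[0], sample)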
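
A force plot shows how each feature's SHAP value moves the prediction away from the base value, so that the base value plus all contributions equals the model output. The following toy illustration computes exact Shapley values by brute force against a single reference row. The model `toy_model`, its coefficients, and both rows are invented for illustration, and this is not how `TreeExplainer` computes its values.

```python
from itertools import permutations


def toy_model(row):
    """Made-up charge model: smokers pay more, and more still with a high BMI."""
    charge = 250.0 * row["age"]
    if row["smoker"]:
        charge += 20000.0 + (15000.0 if row["bmi"] >= 30 else 0.0)
    return charge


def shapley_values(model, x, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    features = list(x)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for f in order:
            current[f] = x[f]              # switch feature f from the baseline to x
            value = model(current)
            totals[f] += value - previous  # marginal contribution in this ordering
            previous = value
    return {f: total / len(orderings) for f, total in totals.items()}


baseline = {"age": 40, "smoker": 0, "bmi": 25.0}  # reference row
x = {"age": 50, "smoker": 1, "bmi": 32.0}         # row to explain
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(toy_model(baseline) + sum(phi.values()), toy_model(x))
```

The interaction between smoking and BMI is split evenly between the two features, and the contributions add up to the difference between the explained prediction and the baseline prediction.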

In [None]:
df_features = df.drop(columns='charges')

In [None]:
shap.plots.partial_dependence(
    'age', model_rf.predict, df_features, 
    ice=False, model_expected_value=True, feature_expected_value=True
)

In [None]:
shap.plots.partial_dependence(
    'smoker', model_rf.predict, df_features, 
    ice=False, model_expected_value=True, feature_expected_value=True
)

In [None]:
shap.plots.partial_dependence(
    'bmi', model_rf.predict, df_features, 
    ice=False, model_expected_value=True, feature_expected_value=True
)

In [None]:
shap.plots.partial_dependence(
    'children', model_rf.predict, df_features, 
    ice=False, model_expected_value=True, feature_expected_value=True
)
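
Partial dependence, as plotted above, averages the model's predictions while one feature is forced to a fixed value across all rows. A minimal plain-Python sketch, with an invented model and invented rows (none of these names come from the notebook or from shap):

```python
def partial_dependence(model, rows, feature, grid):
    """For each grid value, set `feature` to it in every row and average the predictions."""
    curve = []
    for value in grid:
        preds = [model({**row, feature: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve


def toy_model(row):
    """Made-up charge model used only for this illustration."""
    return 250.0 * row["age"] + (20000.0 if row["smoker"] else 0.0)


rows = [
    {"age": 20, "smoker": 0},
    {"age": 40, "smoker": 1},
    {"age": 60, "smoker": 0},
]
curve = partial_dependence(toy_model, rows, "smoker", [0, 1])
for value, average in curve:
    print(value, average)
```

The gap between the two averages is the model's average effect of switching `smoker` from 0 to 1 in this toy data.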