# Predição de Despesas Médicas <br/>do Seguro de Saúde

Paulo Cysne Rios Jr. | Novembro 2017

## Solução do Aluno: João Holanda Freires

## Exercício

Este conjunto de dados *insurance.csv* representa despesas médicas de indivíduos nos EUA.<br/>
O valor objetivo é as despesas (expenses).  <br/>
Predizer despesas médicas é de fundamental importância para uma empresa de seguros de saúde. É tambem de interesse de cada pessoa.


- Encontre as despesas por região
- Usando regressão linear múltipla, vom a classe LinearRegression, prediza as despesas <br/>
e veja como se saiu usando MSE. 
- Melhore sua modelagem: use um indicador para BMI acima ou igual a 30<br/>
e veja como se saiu usando MSE. 
- Melhore sua modelagem: use regressão polinomial e veja como se saiu usando MSE. 
- Usando a classe SGDRegressor (de Descida de Gradiente Estocástico), <br/>
prediza as despesas (expenses) para os casos acima e veja como se saiu usando MSE. 
- Faça uma tabela com todas as MSE e encontre a melhor.
- Lembre-se de calcular o MSE usando a package metrics de SciKit-Learn.
- Os primeiros passos da modelagem estão abaixo.

In [125]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, PolynomialFeatures


%matplotlib inline

In [2]:
insurance = pd.read_csv("data/insurance.csv")

In [3]:
insurance.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'expenses'], dtype='object')

In [4]:
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
age         1338 non-null int64
sex         1338 non-null object
bmi         1338 non-null float64
children    1338 non-null int64
smoker      1338 non-null object
region      1338 non-null object
expenses    1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.2+ KB


In [6]:
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


## Despesas por região

In [9]:
regionAndExpenses = insurance[['region', 'expenses']]
regionAndExpenses.groupby('region').sum()

Unnamed: 0_level_0,expenses
region,Unnamed: 1_level_1
northeast,4343668.64
northwest,4035711.93
southeast,5363689.8
southwest,4012754.82


## Pré-processamento

In [110]:
def doPreprocessing(dataset):
    
    featuresEncoding = ['sex', 'smoker', 'region']
    
    # Aplicando Label Encoding em sex, smoker and region features
    labelEncoder = LabelEncoder()
    preProcessedInsurance = dataset[featuresEncoding].apply(labelEncoder.fit_transform)

    # Agora aplicando One Hot Encoding
    oneHotEncoder = OneHotEncoder(sparse=False)
    preProcessedInsurance = oneHotEncoder.fit_transform(preProcessedInsurance.as_matrix())
    preProcessedInsurance = pd.DataFrame(preProcessedInsurance, columns=['sex_female', 'sex_male', 'smoker_no', 'smoker_yes', 'reg_northeast', 'reg_northwest', 'reg_southeast','reg_southwest'])

    # Aplicando Standardization (Z-score normalization)

    featuresStd = ['age', 'bmi', 'children']

    stdScaler = StandardScaler()
    normalizedFeatures = stdScaler.fit_transform(dataset[featuresStd])
    normalizedFeatures = pd.DataFrame(normalizedFeatures, columns=featuresStd)

    # Novo dataset pré-processado
    newDataset = dataset.drop(featuresEncoding + featuresStd, axis = 1)
    newDataset = pd.concat([newDataset, preProcessedInsurance, normalizedFeatures], axis = 1)
    
    return newDataset


### Depois do pré-processamento

In [111]:
newInsurance = doPreprocessing(insurance)
newInsurance.head()

Unnamed: 0,expenses,sex_female,sex_male,smoker_no,smoker_yes,reg_northeast,reg_northwest,reg_southeast,reg_southwest,age,bmi,children
0,16884.92,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-1.438764,-0.453646,-0.908614
1,1725.55,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,-1.509965,0.514186,-0.078767
2,4449.46,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,-0.797954,0.382954,1.580926
3,21984.47,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,-0.441948,-1.30665,-0.908614
4,3866.86,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,-0.513149,-0.289606,-0.908614


## Criando métodos de predição

In [121]:
labelColumn = 'expenses'

def splitFeaturesAndLabel(dataset, labelColumn):
    X = dataset.drop(labelColumn, axis = 1)
    y = dataset[labelColumn]
    
    return X,y

def predictLinearRegression(dataset, labelColumn):
    X, y = splitFeaturesAndLabel(dataset, labelColumn)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=6)
    
    linReg = LinearRegression()
    linReg.fit(X_train, y_train)
    predLinReg = linReg.predict(X_test)

    return mean_squared_error(y_test, predLinReg)

def predictSGDRegressor(dataset, labelColumn):
    X, y = splitFeaturesAndLabel(dataset, labelColumn)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=6)
    
    sgdReg = SGDRegressor(max_iter=500)
    sgdReg.fit(X_train, y_train)
    predSGDReg = sgdReg.predict(X_test)

    return mean_squared_error(y_test, predSGDReg)
    

## Predição de despesas com Linear Regression

In [122]:
linRegMSE = predictLinearRegression(newInsurance, labelColumn)
sgdRegMSE = predictSGDRegressor(newInsurance, labelColumn)
print("MSE Linear Regression: {:.3f}".format(linRegMSE))
print("MSE SGD Regressor: {:.3f}".format(sgdRegMSE))


MSE Linear Regression: 31439534.082
MSE SGD Regressor: 31422902.402


### Com Regressão Polinomial

In [133]:
polyFeature = PolynomialFeatures()
onlyFeatures = newInsurance.drop(labelColumn, axis=1)
newFeaturesInsurance = polyFeature.fit_transform(onlyFeatures)
newFeaturesInsuranceWithoutBias = newFeaturesInsurance[:,1:]

polyInsurance = pd.DataFrame(newFeaturesInsuranceWithoutBias)
polyInsurance[labelColumn] = insurance[labelColumn]
polyInsurance.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,68,69,70,71,72,73,74,75,76,expenses
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-1.438764,-0.453646,...,-1.438764,-0.453646,-0.908614,2.070043,0.652689,1.307281,0.205794,0.412189,0.825579,16884.92
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,-1.509965,0.514186,...,-0.0,0.0,-0.0,2.279996,-0.776403,0.118936,0.264387,-0.040501,0.006204,1725.55
2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,-0.797954,0.382954,...,-0.0,0.0,0.0,0.63673,-0.30558,-1.261505,0.146654,0.605422,2.499326,4449.46
3,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,-0.441948,-1.30665,...,-0.0,-0.0,-0.0,0.195318,0.577471,0.40156,1.707333,1.18724,0.825579,21984.47
4,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,-0.513149,-0.289606,...,-0.0,-0.0,-0.0,0.263322,0.148611,0.466254,0.083872,0.26314,0.825579,3866.86


In [134]:
polyLinRegMSE = predictLinearRegression(polyInsurance, labelColumn)
polySgdRegMSE = predictSGDRegressor(polyInsurance, labelColumn)
print("Poly - MSE Linear Regression: {:.3f}".format(polyLinRegMSE))
print("Poly - MSE SGD Regressor: {:.3f}".format(polySgdRegMSE))

Poly - MSE Linear Regression: 18692731.748
Poly - MSE SGD Regressor: 18936161.592


## Predição de despesas com BMI >= 30

In [123]:
insuranceBMI = insurance[ insurance['bmi'] >= 30 ]
insuranceBMI = insuranceBMI.reset_index()
insuranceBMI = insuranceBMI.drop('index', axis=1)
procInsuranceBMI = doPreprocessing(insuranceBMI)
procInsuranceBMI.head()

Unnamed: 0,expenses,sex_female,sex_male,smoker_no,smoker_yes,reg_northeast,reg_northwest,reg_southeast,reg_southwest,age,bmi,children
0,1725.55,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,-1.552992,-0.36077,-0.089151
1,4449.46,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,-0.857945,-0.557219,1.591638
2,8240.59,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.393139,-0.458994,-0.089151
3,1826.84,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,-1.205469,-0.213433,-0.929545
4,11090.72,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.088186,1.112596,-0.929545


In [124]:
BMIlinRegMSE = predictLinearRegression(procInsuranceBMI, labelColumn)
BMIsgdRegMSE = predictSGDRegressor(procInsuranceBMI, labelColumn)

print("MSE Linear Regression com BMI >= 30: {:.3f}".format(BMIlinRegMSE))
print("MSE SGD Regressor com BMI >= 30: {:.3f}".format(BMIsgdRegMSE))

MSE Linear Regression com BMI >= 30: 19172086.033
MSE SGD Regressor com BMI >= 30: 19167045.380


### Com Regressão polinomial para BMI >= 30

In [135]:
polyFeature = PolynomialFeatures()
onlyFeaturesBMI = procInsuranceBMI.drop(labelColumn, axis=1)
newFeaturesInsuranceBMI = polyFeature.fit_transform(onlyFeaturesBMI)
newFeaturesInsuranceWithoutBiasBMI = newFeaturesInsuranceBMI[:,1:]

polyInsuranceBMI = pd.DataFrame(newFeaturesInsuranceWithoutBiasBMI)
polyInsuranceBMI[labelColumn] = insurance[labelColumn]
polyInsuranceBMI.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,68,69,70,71,72,73,74,75,76,expenses
0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,-1.552992,-0.36077,...,-0.0,-0.0,-0.0,2.411785,0.560273,0.13845,0.130155,0.032163,0.007948,16884.92
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,-0.857945,-0.557219,...,-0.0,-0.0,0.0,0.73607,0.478063,-1.365539,0.310493,-0.88689,2.533312,1725.55
2,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.393139,-0.458994,...,0.0,-0.0,-0.0,0.154558,-0.180449,-0.035049,0.210676,0.04092,0.007948,4449.46
3,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,-1.205469,-0.213433,...,-1.205469,-0.213433,-0.929545,1.453155,0.257287,1.120538,0.045554,0.198396,0.864054,21984.47
4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.088186,1.112596,...,0.0,0.0,-0.0,1.184149,1.210712,-1.011518,1.237871,-1.034209,0.864054,3866.86


In [136]:
BMIpolyLinRegMSE = predictLinearRegression(polyInsuranceBMI, labelColumn)
BMIpolySgdRegMSE = predictSGDRegressor(polyInsuranceBMI, labelColumn)

print("Poly - MSE Linear Regression com BMI >= 30: {:.3f}".format(BMIpolyLinRegMSE))
print("Poly - MSE SGD Regressor com BMI >= 30: {:.3f}".format(BMIpolySgdRegMSE))

Poly - MSE Linear Regression com BMI >= 30: 142115256.348
Poly - MSE SGD Regressor com BMI >= 30: 140380691.889


## Tabela de resultados para o MSE

In [139]:
MSEresults = np.array([['none', linRegMSE, sgdRegMSE, polyLinRegMSE, polySgdRegMSE], ['BMI >= 30', BMIlinRegMSE, BMIsgdRegMSE, BMIpolyLinRegMSE, BMIpolySgdRegMSE]])
MSEresults = pd.DataFrame(MSEresults, columns = ['Filtro', 'MSE Linear Reg', 'MSE SGD Reg', 'Poly - MSE Linear Reg', 'Poly - MSE SGD Reg'])
MSEresults.head()

Unnamed: 0,Filtro,MSE Linear Reg,MSE SGD Reg,Poly - MSE Linear Reg,Poly - MSE SGD Reg
0,none,31439534.0823,31422902.402,18692731.7477,18936161.5919
1,BMI >= 30,19172086.0326,19167045.3797,142115256.348,140380691.889
