# ML Challenge

The goal is to build a regression model that predicts the individual medical costs billed by health insurance.

## Data preparation

The dataset contains 1338 rows with information about individuals.

In [36]:
import pandas as pd

path = "D:/Repos/FIAP/ML/files/insurance.csv"
df = pd.read_csv(path, sep=",")
df.head(10)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
7,37,female,27.74,3,no,northwest,7281.5056
8,37,male,29.83,2,no,northeast,6406.4107
9,60,female,25.84,0,no,northwest,28923.13692


In [37]:
df.shape

(1338, 7)

Checking the state of the data: column types and null values.

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
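
As a complement to `df.info()`, the same checks can be made explicitly. The sketch below is a minimal example; it rebuilds the first five rows shown above as an inline sample (an assumption made only so the snippet runs without the CSV file), counts the nulls per column, and lists the non-numeric columns. The names `sample`, `null_counts`, and `text_columns` are illustrative.

```python
import pandas as pd

# Inline sample: the first five rows of the dataset shown above,
# used here only so the snippet runs without the CSV file.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Number of missing values per column
null_counts = sample.isna().sum()
print(null_counts)

# Columns that will need encoding before modelling
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```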


**Checking the non-numeric data**

I will inspect the text columns to see how many distinct labels each one has.

In [39]:
df["sex"].value_counts()

sex
male      676
female    662
Name: count, dtype: int64

The "smoker" column is imbalanced, as the counts below show. Later on, the sampling will need to be stratified so that the training and test sets keep the same proportion of smokers.

In [40]:
df["smoker"].value_counts()

smoker
no     1064
yes     274
Name: count, dtype: int64
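
To quantify the imbalance, `value_counts(normalize=True)` returns proportions instead of counts. The sketch below rebuilds a column with the same counts as above (1064 "no", 274 "yes") rather than reading the CSV; the variable names are illustrative.

```python
import pandas as pd

# Synthetic column with the same label counts as the "smoker" output above
smoker = pd.Series(["no"] * 1064 + ["yes"] * 274, name="smoker")

# Fraction of rows per label instead of absolute counts
proportions = smoker.value_counts(normalize=True)
print(proportions.round(3))
```

Roughly one row in five belongs to a smoker, which is the proportion a stratified split should preserve.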

In [41]:
df["region"].value_counts()

region
southeast    364
southwest    325
northwest    325
northeast    324
Name: count, dtype: int64

**Using One-Hot Encoding**

I will use *OneHotEncoder* to convert these text columns. First, a test with the 'region' column.

In [42]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

region_1hot = encoder.fit_transform(df[["region"]])
region_1hot.toarray()[:10]

array([[0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.]])

In [43]:
encoder.categories_

[array(['northeast', 'northwest', 'southeast', 'southwest'], dtype=object)]

**Creating a Pipeline**

I will create a pipeline with *StandardScaler* to standardize the numeric data (zero mean, unit variance).

In [44]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('std_scaler', StandardScaler()), # standardizing the scale of the data
    ])
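
To make the standardization concrete, the sketch below compares `StandardScaler` with the same computation done by hand on the ages from the first five rows of the dataset. It is only an illustration; the names `ages`, `scaled`, and `manual` are not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Ages from the first five rows of the dataset, as a single-column array
ages = np.array([[19.0], [18.0], [28.0], [33.0], [32.0]])

scaled = StandardScaler().fit_transform(ages)

# Same computation by hand: subtract the mean, then divide by the
# population standard deviation (ddof=0, which is NumPy's default)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, manual))
```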

I will use **ColumnTransformer** to apply the transformations.

In [45]:
from sklearn.compose import ColumnTransformer

# note: 'charges' (the target) is numeric, so it is selected here and standardized too
colunas_numericas = df.select_dtypes(include='number').columns.tolist()
print(colunas_numericas)
colunas_texto = df.select_dtypes(exclude='number').columns.tolist()
print(colunas_texto)

pipeline = ColumnTransformer([
        ("num", num_pipeline, colunas_numericas), # numeric variables (uses the pipeline defined above)
        ("cat", OneHotEncoder(), colunas_texto), # categorical variables
    ])

dados_preparados = pipeline.fit_transform(df)

['age', 'bmi', 'children', 'charges']
['sex', 'smoker', 'region']


In [46]:
dados_preparados

array([[-1.43876426, -0.45332   , -0.90861367, ...,  0.        ,
         0.        ,  1.        ],
       [-1.50996545,  0.5096211 , -0.07876719, ...,  0.        ,
         1.        ,  0.        ],
       [-0.79795355,  0.38330685,  1.58092576, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-1.50996545,  1.0148781 , -0.90861367, ...,  0.        ,
         1.        ,  0.        ],
       [-1.29636188, -0.79781341, -0.90861367, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.55168573, -0.26138796, -0.90861367, ...,  1.        ,
         0.        ,  0.        ]], shape=(1338, 12))

In [47]:
type(dados_preparados)

numpy.ndarray

I will rebuild the DataFrame, including the names of the new columns.

In [52]:
sex_cat, smoker_cat, region_cat = pipeline.named_transformers_["cat"].categories_
print(sex_cat)
print(region_cat)
print(smoker_cat)

['female' 'male']
['northeast' 'northwest' 'southeast' 'southwest']
['no' 'yes']


In [53]:
colunas_originais = df.columns
colunas_originais

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

Building the column names in the order the ColumnTransformer emits them: the numeric block first, then the one-hot columns for each text column.

In [None]:
# ColumnTransformer concatenates its outputs in the order the transformers
# were declared ("num", then "cat"), not in the original column order
novas_colunas = list(colunas_numericas)
novas_colunas.extend([f'sex_{cat}' for cat in sex_cat])
novas_colunas.extend([f'smoker_{cat}' for cat in smoker_cat])
novas_colunas.extend([f'region_{cat}' for cat in region_cat])

print(novas_colunas)

['age', 'bmi', 'children', 'charges', 'sex_female', 'sex_male', 'smoker_no', 'smoker_yes', 'region_northeast', 'region_northwest', 'region_southeast', 'region_southwest']


Checking the new DataFrame.

In [55]:
df_preparado = pd.DataFrame(data=dados_preparados, columns=novas_colunas)
df_preparado

Unnamed: 0,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,-1.438764,-0.453320,-0.908614,0.298584,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,-1.509965,0.509621,-0.078767,-0.953689,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
2,-0.797954,0.383307,1.580926,-0.728675,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
3,-0.441948,-1.305531,-0.908614,0.719843,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
4,-0.513149,-0.292556,-0.908614,-0.776802,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
1333,0.768473,0.050297,1.580926,-0.220551,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
1334,-1.509965,0.206139,-0.908614,-0.914002,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1335,-1.509965,1.014878,-0.908614,-0.961596,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1336,-1.296362,-0.797813,-0.908614,-0.930362,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


Checking for null values.

In [56]:
df_preparado.isnull().sum()

age                 0
bmi                 0
children            0
charges             0
sex_female          0
sex_male            0
smoker_no           0
smoker_yes          0
region_northeast    0
region_northwest    0
region_southeast    0
region_southwest    0
dtype: int64

## Splitting the data into training and test sets

Because the numbers of smokers and non-smokers differ so much, the sampling must be stratified so that the training and test sets reflect the same proportion.
For this I will use *StratifiedShuffleSplit*.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

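split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# stratify on the original "smoker" labels; after encoding, df_preparado has no "smoker" column
for train_index, test_index in split.split(df_preparado, df["smoker"]):
    # split() yields positional indices, so select the prepared rows with iloc
    strat_train_set = df_preparado.iloc[train_index]
    strat_test_set = df_preparado.iloc[test_index]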
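
A quick way to confirm that stratification worked is to compare the share of smokers in each subset with the share in the full data. The sketch below is a self-contained illustration: it uses synthetic labels with the same counts as the "smoker" column (1064 "no", 274 "yes") and a placeholder feature column, both assumptions made so the snippet runs without the CSV.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels with the same counts as the "smoker" column
labels = pd.Series(["no"] * 1064 + ["yes"] * 274, name="smoker")
# Placeholder features; only the number of rows matters for the split
features = pd.DataFrame({"row_id": range(len(labels))})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(features, labels))

# Share of smokers overall and in each subset
overall = (labels == "yes").mean()
train_share = (labels.iloc[train_index] == "yes").mean()
test_share = (labels.iloc[test_index] == "yes").mean()

print(f"overall={overall:.3f} train={train_share:.3f} test={test_share:.3f}")
```

The three shares should be nearly identical; with a purely random split they could drift apart, especially in the smaller test set.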