# Scikit-learn

Scikit-learn is one of the most widely used Python libraries for machine learning. As mentioned earlier, the data must be well prepared before we can build models.

Well-prepared data does not guarantee a good model. Poorly prepared data, however, does lead to a poor model.

## Preparing the data

First, besides cleaning the data, we need to separate the training data from the test data. The model must never see the test data while it is being trained, only when it is being tested.

The testing stage is where we can measure the model's accuracy.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
%matplotlib inline

heart_disease = pd.read_csv('./data/heart-disease.csv')
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In this `heart_disease` dataframe we want to predict whether a patient has heart disease, based on data from other patients. This gives us:

- Features (X): the data that help us predict the target. Here, every column except the target can be treated as a feature.
- Target (y): the variable we want to predict.

In [3]:
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

display(X.head(), y.head())

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

display(X_train, X_test, y_train, y_test)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
63,41,1,1,135,203,0,1,132,0,0.0,1,0,1
232,55,1,0,160,289,0,0,145,1,0.8,1,1,3
196,46,1,2,150,231,0,1,147,0,3.6,1,0,2
230,47,1,2,108,243,0,1,152,0,0.0,2,0,2
220,63,0,0,150,407,0,0,154,0,4.0,1,3,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
202,58,1,0,150,270,0,0,111,1,0.8,2,0,3
19,69,0,3,140,239,0,1,151,0,1.8,2,2,2
144,76,0,2,140,197,0,2,116,0,1.1,1,0,2
116,41,1,2,130,214,0,0,168,0,2.0,1,0,2


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
36,54,0,2,135,304,1,1,170,0,0.0,2,0,2
159,56,1,1,130,221,0,0,163,0,0.0,2,0,3
175,40,1,0,110,167,0,0,114,1,2.0,1,0,3
254,59,1,3,160,273,0,0,125,0,0.0,2,0,2
189,41,1,0,110,172,0,0,158,0,0.0,2,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
243,57,1,0,152,274,0,1,88,1,1.2,1,1,3
138,57,1,0,110,201,0,1,126,1,1.5,1,0,1
115,37,0,2,120,215,0,1,170,0,0.0,2,0,2
110,64,0,0,180,325,0,1,154,1,0.0,2,0,2


63     1
232    0
196    0
230    0
220    0
      ..
202    0
19     1
144    1
116    1
239    0
Name: target, Length: 242, dtype: int64

36     1
159    1
175    0
254    0
189    0
      ..
243    0
138    1
115    1
110    1
94     1
Name: target, Length: 61, dtype: int64
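
To make the train/test idea concrete, here is a minimal sketch of fitting on the training part and measuring accuracy only on the test part. The toy data, the names `toy_X`, `toy_y`, `clf`, and the choice of `RandomForestClassifier` are assumptions for illustration, not part of the original notebook. Passing `random_state` makes the split repeatable, and `stratify` keeps the class proportions similar in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the heart-disease features and target.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "age": rng.integers(30, 80, size=200),
    "chol": rng.integers(150, 400, size=200),
})
toy_y = pd.Series((toy_X["age"] > 55).astype(int), name="target")

# random_state makes the split repeatable; stratify keeps the class ratio
# similar in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_tr, y_tr)               # the model only sees the training rows

accuracy = clf.score(X_te, y_te)  # mean accuracy on rows it has never seen
print(X_tr.shape, X_te.shape, accuracy)
```

For a classifier, `score` returns the fraction of test rows predicted correctly.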

# Example: Predicting car prices

Predicting prices means predicting numbers. When a model predicts a continuous numeric value, the task is called **regression**. There are several kinds of regression, with different characteristics. Let's start with the basic workflow.

## Preparing the data

In [5]:
car_sales = pd.read_csv('./data/car-sales-extended.csv')
car_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Make           1000 non-null   object
 1   Colour         1000 non-null   object
 2   Odometer (KM)  1000 non-null   int64 
 3   Doors          1000 non-null   int64 
 4   Price          1000 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 39.2+ KB


## Handling the data

The initial dataset preparation step (cleaning the data) is not needed in this example, because there are no null or missing values. When values are missing, there are two main ways to deal with them:

1. Imputation: replace the missing values with a substitute, for example a statistic computed from the relevant group of rows
2. Removal: drop the rows with missing data

Imputation done badly can hurt model performance, and there is a lot of research on the topic. Removing rows with missing data can discard a large share of the dataset. Either option should be applied with care.
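
The two options can be sketched as follows. The frame `toy` is made up, since the car-sales data has no gaps. The mean strategy of `SimpleImputer` is only one of several possible choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with gaps; the real car-sales data here has none.
toy = pd.DataFrame({
    "Odometer (KM)": [35431.0, np.nan, 84714.0, 154365.0],
    "Doors": [4.0, 5.0, np.nan, 4.0],
})

# Option 1: imputation, filling each gap with the mean of its column.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Option 2: removal, dropping every row that has at least one gap.
dropped = toy.dropna()

print(imputed)
print(len(toy), "rows before,", len(dropped), "rows after dropna")
```

In this toy case removal discards half of the rows, which illustrates why dropping data can be costly.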

## Encoding

Even so, there is one more thing to consider: we want to fit a numeric regression, and the dataset contains categorical variables:

- Make, Colour, Doors -> categorical (`Doors` is stored as an integer but takes only a few distinct values)
- Odometer (KM), Price -> numeric

To turn categorical variables into numeric ones we can use **encoding**. The one-hot encoding used here replaces each categorical column with one new column per class.

Put simply, a column X with classes A, B, C becomes three columns: X_A, X_B, X_C. Each row gets a 1 in the column matching its class and a 0 in the others. This will be easier to see further down.

Let's start by separating features and target:
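
A tiny illustration of the X to X_A, X_B, X_C idea, using a made-up column rather than the car-sales data:

```python
import pandas as pd

# Hypothetical column X with classes A, B and C.
toy = pd.DataFrame({"X": ["A", "B", "C", "A"]})

# One new column per class; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["X"], dtype=int)
print(encoded)
```

Each row of `encoded` contains exactly one 1, in the column of its original class.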

In [8]:
X = car_sales.drop('Price', axis=1)
y = car_sales['Price']

display(X,y)

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3
...,...,...,...,...
995,Toyota,Black,35820,4
996,Nissan,White,155144,3
997,Nissan,Blue,66604,4
998,Honda,White,215883,4


0      15323
1      19943
2      28343
3      13434
4      14043
       ...  
995    32042
996     5716
997    31570
998     4001
999    12732
Name: Price, Length: 1000, dtype: int64

Knowing that there are categorical variables **not yet converted to numeric**, let's try to fit a model anyway to confirm that the conversion is needed:

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

In [11]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)

ValueError: could not convert string to float: 'Honda'

As expected, the model cannot be fitted, because the features still contain categorical values:

> `ValueError: could not convert string to float: 'Honda'`

So let's convert the categorical values to numeric ones with `OneHotEncoder`:

In [16]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', 'Colour', 'Doors']
encoder = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', 
                                    encoder,
                                    categorical_features)],
                                    remainder='passthrough')

transf_X = transformer.fit_transform(X)
display(pd.DataFrame(transf_X))


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


With `OneHotEncoder` and `ColumnTransformer` we turned all the categorical data into numeric data. The same can be done directly in pandas, and more readably, with `get_dummies()`:

In [18]:
# 'Doors' must be cast to a non-numeric dtype, otherwise get_dummies leaves it unencoded.
car_sales['Doors'] = car_sales['Doors'].astype(object) 
dummies = pd.get_dummies(car_sales[categorical_features])
display(dummies)

Unnamed: 0,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White,Doors_3,Doors_4,Doors_5
0,0,1,0,0,0,0,0,0,1,0,1,0
1,1,0,0,0,0,1,0,0,0,0,0,1
2,0,1,0,0,0,0,0,0,1,0,1,0
3,0,0,0,1,0,0,0,0,1,0,1,0
4,0,0,1,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,1,1,0,0,0,0,0,1,0
996,0,0,1,0,0,0,0,0,1,1,0,0
997,0,0,1,0,0,1,0,0,0,0,1,0
998,0,1,0,0,0,0,0,0,1,0,1,0
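
As an alternative to casting `Doors`, `get_dummies()` also accepts a `columns` argument, and the columns listed there are encoded whatever their dtype. A minimal sketch on made-up rows (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical rows; 'Doors' keeps its integer dtype, as in car_sales.
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan"],
    "Doors": [4, 5, 3],
})

# Listing the columns explicitly encodes them regardless of dtype.
toy_dummies = pd.get_dummies(toy, columns=["Make", "Doors"], dtype=int)
print(toy_dummies.columns.tolist())
```

This leaves the original frame untouched, whereas the cast above modifies `car_sales` in place.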


Now that the values are converted, we can fit the model and evaluate its performance:

In [26]:
X_train, X_test, y_train, y_test = train_test_split(transf_X,y, test_size=0.2)

np.random.seed(42) # Same seed as in the class; the split above ran before this line, so it is not covered by the seed
model = RandomForestRegressor(n_estimators=10)
model.fit(X_train, y_train)

model.score(X_test,y_test)

0.2148319788500227

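The model's score is very low: about 0.21. For a regressor, `score` returns the coefficient of determination (R²) rather than an accuracy, so the model explains roughly 21% of the variance in price on the test set. Tuning this model is not the focus for now. Several factors could explain the low score.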
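
To show what the number returned by a regressor's `score` measures, here is a minimal sketch that computes R² by hand and compares it with `sklearn.metrics.r2_score`. The prices and predictions are invented for illustration and are not taken from the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true prices and predictions, for illustration only.
y_true = np.array([15000.0, 20000.0, 28000.0, 13000.0])
y_pred = np.array([17000.0, 18000.0, 25000.0, 16000.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible for models that do worse than that.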