<a href="https://colab.research.google.com/github/guilhermeaugusto9/sigmoidal/blob/master/05_3_Lidando_com_vari%C3%A1veis_categ%C3%B3ricas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img alt="Colaboratory logo" width="15%" src="https://raw.githubusercontent.com/carlosfab/escola-data-science/master/img/novo_logo_bg_claro.png">

#### **Data Science na Prática 2.0**
*by [sigmoidal.ai](https://sigmoidal.ai)*

---

# Handling categorical variables

In machine learning, many models cannot handle categorical variables directly. It is therefore important to know the main encoding methods and how to apply them.

In this lesson we will see how to use `LabelEncoder` and `OneHotEncoder`. Beyond that, I will show you some situations where numeric columns are, in fact, categorical variables.

To illustrate these techniques, I will use the breast cancer dataset made available by the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/breast+cancer).
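
Before loading the real data, here is a minimal sketch of what "cannot handle categorical variables directly" looks like in practice. The frame `toy_X`, the target `toy_y` and the choice of `LogisticRegression` are my own illustrative assumptions, not part of the lesson's dataset.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical single-column feature matrix holding raw strings
toy_X = pd.DataFrame({"menopause": ["premeno", "ge40", "premeno", "lt40"]})
toy_y = [0, 1, 0, 1]

try:
    # scikit-learn tries to convert the input to a numeric array and fails
    LogisticRegression().fit(toy_X, toy_y)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    error_message = str(err)

print(fit_failed)
```

The estimator rejects the string column, which is why the encoders below are needed first.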

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df = pd.read_csv("https://raw.githubusercontent.com/carlosfab/dsnp2/master/datasets/breast-cancer.data", header=None,
                 names=["class", "age", "menopause", "tumor_size",
                        "inv_nodes", "nodes-caps", "deg_malig", "breast",
                        "breast_quad", "irradiat"])

df.head()

Unnamed: 0,class,age,menopause,tumor_size,inv_nodes,nodes-caps,deg_malig,breast,breast_quad,irradiat
0,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
3,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
4,no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no


In [None]:
X = df.drop('class', axis=1)
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y)

## Label encoding

In [None]:
# y_train before encoding
y_train

118    no-recurrence-events
8      no-recurrence-events
231       recurrence-events
112    no-recurrence-events
165    no-recurrence-events
               ...         
166    no-recurrence-events
213       recurrence-events
156    no-recurrence-events
248       recurrence-events
262       recurrence-events
Name: class, Length: 214, dtype: object

In [None]:
# y_test before encoding
y_test

140    no-recurrence-events
206       recurrence-events
80     no-recurrence-events
264       recurrence-events
247       recurrence-events
               ...         
25     no-recurrence-events
70     no-recurrence-events
134    no-recurrence-events
131    no-recurrence-events
10     no-recurrence-events
Name: class, Length: 72, dtype: object

In [None]:
# encoding the target variable: fit on the training labels only, then transform both splits
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [None]:
# y_train after encoding
y_train

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1])

In [None]:
# y_test after encoding
y_test

array([0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0])

In [None]:
# inspecting the classes learned during fit
le.classes_

array(['no-recurrence-events', 'recurrence-events'], dtype=object)

In [None]:
# converting the integer codes back to the original labels
le.inverse_transform(y_train)[:5]

array(['no-recurrence-events', 'no-recurrence-events',
       'recurrence-events', 'no-recurrence-events',
       'no-recurrence-events'], dtype=object)
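
To make the mapping explicit, here is a small sketch on a made-up list of labels (`toy_labels`, `toy_le` and `mapping` are names I am introducing for illustration). It shows that `classes_` is sorted, so the integer code is simply the position in that sorted array, and that a label not seen during `fit` is rejected.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, deliberately not in alphabetical order
toy_labels = ["recurrence-events", "no-recurrence-events", "recurrence-events"]

toy_le = LabelEncoder()
codes = toy_le.fit_transform(toy_labels)

# classes_ is sorted, so each label's code is its index in classes_
mapping = {label: code for code, label in enumerate(toy_le.classes_)}

try:
    # a label that was not present during fit
    toy_le.transform(["unknown-label"])
    unseen_accepted = True
except ValueError:
    unseen_accepted = False

print(codes, mapping, unseen_accepted)
```

This is one reason to fit the encoder on the training labels and reuse the same fitted object for the test labels, as done above.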

## One-hot encoding

What about when the order of the integer codes does not represent any real scale of importance among the categories?

<center><img alt="Colaboratory logo" width="45%" src="https://raw.githubusercontent.com/carlosfab/dsnp2/master/img/encoding.png"></center>
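
As a rough sketch of the idea, the snippet below encodes a made-up nominal column in both ways. The values in `quadrants` are hypothetical, and building the one-hot matrix by indexing an identity matrix is just one simple way to show the concept, not what `OneHotEncoder` does internally.

```python
import numpy as np
import pandas as pd

# Hypothetical nominal column: the quadrants have no natural order
quadrants = pd.Series(["left_low", "right_up", "central", "left_low"])
cat = quadrants.astype("category")

# Integer codes follow the sorted category names, which implies central < left_low < right_up
int_codes = cat.cat.codes.to_numpy()

# One-hot: each row of the identity matrix is the indicator vector of one category
one_hot = np.eye(len(cat.cat.categories), dtype=int)[int_codes]

print(list(cat.cat.categories))
print(int_codes)
print(one_hot)
```

With integer codes a model may read `right_up` as "greater than" `central`. With one indicator column per category, no such ordering is implied.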


In [None]:
# X_train before the OneHotEncoder
X_train

Unnamed: 0,age,menopause,tumor_size,inv_nodes,nodes-caps,deg_malig,breast,breast_quad,irradiat
118,30-39,premeno,10-14,0-2,no,1,right,left_low,no
8,40-49,premeno,50-54,0-2,no,2,left,left_low,no
231,40-49,premeno,30-34,3-5,no,2,right,left_up,no
112,40-49,premeno,20-24,0-2,no,2,right,left_up,no
165,40-49,premeno,20-24,3-5,no,2,right,left_up,no
...,...,...,...,...,...,...,...,...,...
166,40-49,premeno,20-24,3-5,no,2,right,left_low,no
213,50-59,premeno,25-29,0-2,no,1,right,left_up,no
156,50-59,ge40,25-29,3-5,yes,3,right,left_up,no
248,60-69,ge40,35-39,6-8,yes,3,left,left_low,no


In [None]:
from sklearn.preprocessing import OneHotEncoder

# use a separate name so the fitted LabelEncoder stored in `le` is not overwritten
ohe = OneHotEncoder()
ohe.fit(X_train)
X_train_enc = ohe.transform(X_train)

In [None]:
X_train_enc

<214x41 sparse matrix of type '<class 'numpy.float64'>'
	with 1926 stored elements in Compressed Sparse Row format>

In [None]:
X_train_enc.toarray()

array([[0., 1., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.]])
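
The cells above only transform the training split. Below is a hedged sketch of how a test split could be handled, on made-up frames (`toy_train`, `toy_test` and `toy_ohe` are illustrative names). I am assuming you want categories unseen during `fit` to be encoded as all zeros, which is what `handle_unknown="ignore"` does. With the default setting, an unseen category raises an error at transform time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical splits: "lt40" appears only in the test data
toy_train = pd.DataFrame({"menopause": ["premeno", "ge40", "premeno"],
                          "breast": ["left", "right", "left"]})
toy_test = pd.DataFrame({"menopause": ["lt40"],
                         "breast": ["right"]})

toy_ohe = OneHotEncoder(handle_unknown="ignore")
toy_ohe.fit(toy_train)

# categories_ holds one sorted array per input column; the output columns follow this order
learned = [list(c) for c in toy_ohe.categories_]

# the unseen "lt40" becomes zeros in both menopause columns
test_enc = toy_ohe.transform(toy_test).toarray()

print(learned)
print(test_enc)
```

Inspecting `categories_` is also a quick way to see where the 41 columns of `X_train_enc` come from: one column per distinct value in each feature.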

## Dummy variables

In [None]:
pd.get_dummies(df, columns=['menopause', 'breast'])

Unnamed: 0,class,age,tumor_size,inv_nodes,nodes-caps,deg_malig,breast_quad,irradiat,menopause_ge40,menopause_lt40,menopause_premeno,breast_left,breast_right
0,no-recurrence-events,30-39,30-34,0-2,no,3,left_low,no,0,0,1,1,0
1,no-recurrence-events,40-49,20-24,0-2,no,2,right_up,no,0,0,1,0,1
2,no-recurrence-events,40-49,20-24,0-2,no,2,left_low,no,0,0,1,1,0
3,no-recurrence-events,60-69,15-19,0-2,no,2,left_up,no,1,0,0,0,1
4,no-recurrence-events,40-49,0-4,0-2,no,2,right_low,no,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
281,recurrence-events,30-39,30-34,0-2,no,2,left_up,no,0,0,1,1,0
282,recurrence-events,30-39,20-24,0-2,no,3,left_up,yes,0,0,1,1,0
283,recurrence-events,60-69,20-24,0-2,no,1,left_up,no,1,0,0,0,1
284,recurrence-events,40-49,30-34,3-5,no,3,left_low,no,1,0,0,1,0


In [None]:
pd.get_dummies(df)

Unnamed: 0,deg_malig,class_no-recurrence-events,class_recurrence-events,age_20-29,age_30-39,age_40-49,age_50-59,age_60-69,age_70-79,menopause_ge40,menopause_lt40,menopause_premeno,tumor_size_0-4,tumor_size_10-14,tumor_size_15-19,tumor_size_20-24,tumor_size_25-29,tumor_size_30-34,tumor_size_35-39,tumor_size_40-44,tumor_size_45-49,tumor_size_5-9,tumor_size_50-54,inv_nodes_0-2,inv_nodes_12-14,inv_nodes_15-17,inv_nodes_24-26,inv_nodes_3-5,inv_nodes_6-8,inv_nodes_9-11,nodes-caps_?,nodes-caps_no,nodes-caps_yes,breast_left,breast_right,breast_quad_?,breast_quad_central,breast_quad_left_low,breast_quad_left_up,breast_quad_right_low,breast_quad_right_up,irradiat_no,irradiat_yes
0,3,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0
1,2,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,1,0
2,2,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0
3,2,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0
4,2,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
281,2,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0
282,3,0,1,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1
283,1,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0
284,3,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0
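
Note that `deg_malig` is still a single column in the output above: by default `pd.get_dummies` leaves integer columns untouched. This is one of those numeric columns that is really a category. A small sketch on a made-up frame (`toy` is an illustrative name) shows how listing the column in `columns=` forces it to be encoded as well.

```python
import pandas as pd

# Hypothetical frame: deg_malig is stored as integers but is really a grade
toy = pd.DataFrame({"deg_malig": [3, 2, 1],
                    "breast": ["left", "right", "left"]})

# Default behaviour: only the string column is expanded
default_cols = list(pd.get_dummies(toy).columns)

# Naming the column explicitly makes get_dummies treat it as categorical too
explicit_cols = list(pd.get_dummies(toy, columns=["deg_malig", "breast"]).columns)

print(default_cols)
print(explicit_cols)
```

Whether `deg_malig` should stay as an ordered number or become indicator columns is a modelling decision; the sketch only shows how to do either.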


In [None]:
df_enc = pd.get_dummies(df)
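
One caveat with `pd.get_dummies` is that it has no `fit` step, so encoding two frames separately can yield different columns. The sketch below, on made-up frames (`toy_train` and `toy_test` are illustrative names), shows one possible way to align them with `reindex`. It assumes the training columns are the reference and that missing indicators should be filled with 0.

```python
import pandas as pd

# Hypothetical splits: the test data contains only one of the three categories
toy_train = pd.DataFrame({"menopause": ["premeno", "ge40", "lt40"]})
toy_test = pd.DataFrame({"menopause": ["premeno", "premeno"]})

train_enc = pd.get_dummies(toy_train)

# Encoded on its own, the test frame would have a single column;
# reindex adds the missing indicator columns and fills them with 0
test_enc = pd.get_dummies(toy_test).reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc.astype(int))
```

When the same encoding has to be reused on new data, a fitted `OneHotEncoder` is usually the more convenient option.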