**These are variables that hold an assigned label rather than a numeric value (e.g., city, gender, country, job, color, etc.)**

**Ways to handle such variables in the data being used**


In [2]:
import pandas as pd
X = pd.read_csv("winemag-data_first150k.csv")
X.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,,Provence,Bandol,,Provence red blend,Domaine de la Bégude


- How to determine the categorical variables in a dataset

In [13]:
# Get list of categorical variables
s = (X.dtypes == 'object')
object_cols = list(s[s].index)

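# We will use only these
object_cols = ['country', 'province', 'region_1', 'region_2', 'variety']

X = X[object_cols]
# Note: dropna() returns a new DataFrame; the result is not assigned here,
# so X keeps its missing values (the outputs below reflect that)
X.dropna()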

print("Categorical variables:")
print(object_cols)


Categorical variables:
['country', 'province', 'region_1', 'region_2', 'variety']


# Drop Variables

Discards the columns of this type.
Information that could help with the task is lost.

In [None]:
drop_X = X.select_dtypes(exclude=['object'])
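
As a minimal sketch of what this drop does, assuming a small hypothetical DataFrame (`toy` and its values are illustrative, not the wine dataset), `select_dtypes(exclude=['object'])` keeps only the non-object columns. Note that in this notebook `X` was already reduced to `object_cols`, so running the cell above after that step would leave no columns at all.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns and two numeric ones.
# dtype=object is set explicitly so the example does not depend on the
# default string dtype of the installed pandas version.
toy = pd.DataFrame({
    "country": pd.Series(["US", "Spain", "France"], dtype=object),
    "variety": pd.Series(["Pinot Noir", "Tinta de Toro", "Rosé"], dtype=object),
    "points": [96, 96, 95],
    "price": [90.0, 110.0, 66.0],
})

# Keep only the columns whose dtype is not 'object'
toy_dropped = toy.select_dtypes(exclude=["object"])
print(list(toy_dropped.columns))
```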

# Label Encoding

Assigns a numeric value to each of the options.
 e.g.: João Pessoa = 1, CG = 2; Woman = 1, Man = 2;

The problem with this approach is that the categories receive values of different magnitudes, which a model may interpret as weights (larger value, greater importance).

In [17]:
from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data 
label_X = X.copy()

label_encoder = LabelEncoder()

# Each column is converted to str first, so missing values become the category 'nan'
for col in object_cols:
    label_X[col] = label_encoder.fit_transform(X[col].astype(str))
    
label_X.head()

Unnamed: 0,country,province,region_1,region_2,variety
0,44,51,738,7,70
1,40,274,1070,18,553
2,44,51,528,13,468
3,44,282,1222,17,402
4,15,313,66,18,422
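
To see which integer corresponds to which category, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a hypothetical list of cities (not from the dataset). `LabelEncoder` assigns codes starting at 0, in sorted order of the labels, so the numbers carry no meaning beyond alphabetical position. In the loop above a single encoder is refit for every column, so after the loop it only holds the mapping of the last column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical example values, for illustration only
cities = ["João Pessoa", "Campina Grande", "Recife", "João Pessoa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities).tolist()

# classes_ holds the sorted unique labels; a label's code is its position there
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)
```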


# One-hot Encoder

A strategy in which each possible value of the variable (column) becomes its own binary column.
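
A minimal sketch of the idea, assuming a hypothetical four-row column. It uses `pd.get_dummies` for brevity rather than the `OneHotEncoder` applied later in this notebook; each row ends up with exactly one 1.

```python
import pandas as pd

# Hypothetical toy column, for illustration only
toy = pd.DataFrame({"region_2": ["Napa", "Sonoma", "Napa", "Willamette Valley"]})

# One binary column per distinct value; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(toy["region_2"], dtype=int)
print(one_hot)
```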

## Check cardinality

Cardinality is the number of unique values a variable takes.
One-hot encoding is not worthwhile for high-cardinality variables, because the number of features grows considerably, which can hurt the model's ability to learn.

In [18]:
object_nunique = list(map(lambda col: X[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

[('region_2', 18),
 ('country', 48),
 ('province', 455),
 ('variety', 632),
 ('region_1', 1236)]
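
To make the cost concrete, the counts above can simply be added up: one-hot encoding every column would create at least one new column per unique value. The sketch below copies the counts from the output above; the `threshold` name is an assumption introduced for illustration. Since `nunique()` ignores missing values by default, the real total can be slightly higher when missing values are encoded as their own category.

```python
# Unique-value counts copied from the output above
cardinality = {
    "region_2": 18,
    "country": 48,
    "province": 455,
    "variety": 632,
    "region_1": 1236,
}

# One-hot encoding creates one column per unique value, so the counts add up
total_columns = sum(cardinality.values())
print(total_columns)

# Hypothetical cutoff: keep only columns with fewer unique values than this
threshold = 20
low = [col for col, n in cardinality.items() if n < threshold]
print(low)
```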

- One-hot encode

In [20]:
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X[col].nunique() < 20]
print(low_cardinality_cols)

['region_2']


In [28]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to the low-cardinality categorical columns
# Note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)

print('X', label_X.head())
OH_cols = pd.DataFrame(OH_encoder.fit_transform(label_X[low_cardinality_cols]))

print('OH_cols', OH_cols.head())

# One-hot encoding removed index; put it back
OH_cols.index = label_X.index

# Remove the columns that were one-hot encoded
num_X_train = label_X.drop(low_cardinality_cols, axis=1)

print('num_X_train', num_X_train.head())

# Add one-hot encoded columns to numerical features
OH_X = pd.concat([num_X_train, OH_cols], axis=1)


print('OH_X', OH_X.head())

X    country  province  region_1  region_2  variety
0       44        51       738         7       70
1       40       274      1070        18      553
2       44        51       528        13      468
3       44       282      1222        17      402
4       15       313        66        18      422
OH_cols     0    1    2    3    4    5    6    7    8    9    10   11   12   13   14  \
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
2  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0   
3  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
4  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   

    15   16   17   18  
0  0.0  0.0  0.0  0.0  
1  0.0  0.0  0.0  1.0  
2  0.0  0.0  0.0  0.0  
3  0.0  0.0  1.0  0.0  
4  0.0  0.0  0.0  1.0  
num_X_train    country  province  region_1  variety
0       44   