# Converting categorical columns to numeric


### OneHotEncoder:
- Lets you limit the number of columns: the less frequent values are grouped into a single "infrequent values" column. You can also specify which values do or do not get a column of their own.
- If the model is already running and a new, previously unseen category shows up, the encoder can ignore it or handle it.

### get_dummies:
- Simpler and more direct, but without the advantages of OneHotEncoder: if a category that was not in the original data shows up later, the generated columns no longer match the ones the model was trained on, which leads to an error.
- For binary variables, prefer the `drop_first` parameter, which creates a single column. Without it, two columns are created, each holding 0 or 1 (or True and False).
- It replaces the original column automatically.

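
To make the column mismatch concrete, here is a small sketch on made-up data (the frames and the category `"X"` are hypothetical). It also shows one possible workaround, aligning the new columns to the training columns with `reindex`; this is an assumption about how one might cope, not something the notebook does.

```python
import pandas as pd

# Hypothetical training data and later data with an unseen category "X".
train = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
new_data = pd.DataFrame({"Embarked": ["S", "X"]})

train_dummies = pd.get_dummies(train, columns=["Embarked"])
new_dummies = pd.get_dummies(new_data, columns=["Embarked"])

# The two results do not share the same columns.
print(list(train_dummies.columns))
print(list(new_dummies.columns))

# Possible workaround: keep only the training columns, filling gaps with 0.
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned)

# drop_first=True on a binary variable yields a single column.
sex = pd.DataFrame({"Sex": ["male", "female", "female"]})
sex_dummies = pd.get_dummies(sex, columns=["Sex"], drop_first=True)
print(list(sex_dummies.columns))
```
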
### Label Encoding:

- Used when the categories have an order among them, such as 'bad, fair, good'.
- Because each category becomes an integer, applying it to categories with no natural order can introduce an artificial order in the data, which may not be desirable.
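
One detail worth illustrating: scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order, which may not match the intended ranking. The sketch below uses a hypothetical `level` column and, as an alternative not used in this notebook, `OrdinalEncoder` with an explicit category order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered feature: low < medium < high.
levels = pd.DataFrame({"level": ["low", "high", "medium", "low"]})

# LabelEncoder sorts the classes alphabetically: high=0, low=1, medium=2.
alphabetical = LabelEncoder().fit_transform(levels["level"])

# OrdinalEncoder accepts the desired order explicitly: low=0, medium=1, high=2.
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered = ordinal.fit_transform(levels[["level"]])

print(alphabetical)
print(ordered.ravel())
```
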

In [1]:
import pandas as pd

In [2]:
# Load and preview the dataset
titanic = pd.read_csv('train2.csv')
titanic = titanic.drop(["Titulos"], axis = 1) # Drop the "Titulos" column to simplify
titanic.head(3)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S


# Using get_dummies

### Sex category

In [3]:
titanic = pd.get_dummies(titanic, columns=['Sex'], drop_first=True)
titanic.head(3)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked,Sex_male
0,0,3,22.0,1,0,7.25,S,True
1,1,1,38.0,1,0,71.2833,C,False
2,1,3,26.0,0,0,7.925,S,False


### Embarked category

In [4]:
titanic = pd.get_dummies(titanic, columns=['Embarked'])
titanic.head(3)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,3,22.0,1,0,7.25,True,False,False,True
1,1,1,38.0,1,0,71.2833,False,True,False,False
2,1,3,26.0,0,0,7.925,False,False,False,True
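
The output above holds True/False values. If 0/1 integers are preferred, `get_dummies` accepts a `dtype` argument. A small sketch on a hypothetical three-row sample:

```python
import pandas as pd

# Hypothetical sample; only the Embarked column matters here.
sample = pd.DataFrame({"Embarked": ["S", "C", "S"]})

# dtype=int produces 0/1 integer columns instead of booleans.
as_int = pd.get_dummies(sample, columns=["Embarked"], dtype=int)
print(as_int)
```
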


# Using OneHotEncoder

In [5]:
from sklearn.preprocessing import OneHotEncoder

### Loading the data again

In [6]:
# Load and preview the dataset
titanic2 = pd.read_csv('train2.csv')
titanic2 = titanic2.drop(["Titulos"], axis = 1) # Drop the "Titulos" column to simplify
titanic2.head(3)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S


## Sex category

In [7]:
# Instantiate the encoder and fit it on the column
ohe = OneHotEncoder()
ohe = ohe.fit(titanic2[['Sex']])

### Applying the transformation

In [8]:
new = ohe.transform(titanic2[['Sex']]).toarray()

### Converting to a DataFrame

In [9]:
ohe_df = pd.DataFrame(new)
ohe_df.head(2)

Unnamed: 0,0,1
0,0.0,1.0
1,1.0,0.0


### Naming the category columns

In [10]:
ohe_df.columns = ohe.get_feature_names_out()
ohe_df.head(2)

Unnamed: 0,Sex_female,Sex_male
0,0.0,1.0
1,1.0,0.0


### Concatenating with the original DataFrame and dropping the original "Sex" column

In [11]:
titanic2 = pd.concat([titanic2, ohe_df], axis=1)
titanic2 = titanic2.drop("Sex", axis = 1)
titanic2.head(3)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked,Sex_female,Sex_male
0,0,3,22.0,1,0,7.25,S,0.0,1.0
1,1,1,38.0,1,0,71.2833,C,1.0,0.0
2,1,3,26.0,0,0,7.925,S,1.0,0.0


## Embarked category

### There are 3 categories; they could be limited to 2 columns, but the default encoder below keeps all 3

In [12]:
titanic2.Embarked.value_counts()

Embarked
S    646
C    168
Q     77
Name: count, dtype: int64

In [13]:
# Instantiate the encoder and fit it on the column
ohe2 = OneHotEncoder()
ohe2 = ohe2.fit(titanic2[['Embarked']])

### Applying the transformation

In [14]:
new2 = ohe2.transform(titanic2[['Embarked']]).toarray()

### Converting to a DataFrame

In [15]:
ohe2_df = pd.DataFrame(new2)
ohe2_df.head(2)

Unnamed: 0,0,1,2
0,0.0,0.0,1.0
1,1.0,0.0,0.0


### Naming the category columns

In [16]:
ohe2_df.columns = ohe2.get_feature_names_out()
ohe2_df.head(2)

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0.0,0.0,1.0
1,1.0,0.0,0.0


### Concatenating with the original DataFrame and dropping the original "Embarked" column

In [17]:
titanic2 = pd.concat([titanic2, ohe2_df], axis=1)
titanic2 = titanic2.drop("Embarked", axis = 1)
titanic2.head(3)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,3,22.0,1,0,7.25,0.0,1.0,0.0,0.0,1.0
1,1,1,38.0,1,0,71.2833,1.0,0.0,1.0,0.0,0.0
2,1,3,26.0,0,0,7.925,1.0,0.0,0.0,0.0,1.0


# Using Label Encoding

- In this case there is no order among the categories; this is only an example of how to run it.

In [18]:
from sklearn.preprocessing import LabelEncoder

In [19]:
# Load and preview the dataset
titanic = pd.read_csv('train2.csv')
titanic = titanic.drop(["Titulos", "Embarked"], axis = 1) # Drop columns to simplify
titanic.head(3)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
0,0,3,male,22.0,1,0,7.25
1,1,1,female,38.0,1,0,71.2833
2,1,3,female,26.0,0,0,7.925


### Instantiating and applying fit_transform

In [20]:
label = LabelEncoder()
titanic['male?'] = label.fit_transform(titanic['Sex'])

In [21]:
titanic = titanic.drop("Sex", axis = 1)

In [22]:
titanic.head(3)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,male?
0,0,3,22.0,1,0,7.25,1
1,1,1,38.0,1,0,71.2833,0
2,1,3,26.0,0,0,7.925,0
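
To check which integer was assigned to which category, the fitted encoder exposes `classes_`, and `inverse_transform` maps the codes back. A short sketch on the three values shown in the preview (a hand-typed list, not the full dataset):

```python
from sklearn.preprocessing import LabelEncoder

# The Sex values of the first three rows shown above.
sex_values = ["male", "female", "female"]

label = LabelEncoder()
codes = label.fit_transform(sex_values)

# classes_ lists the categories in the order of their integer codes.
print(label.classes_)
print(codes)

# inverse_transform recovers the original labels from the codes.
restored = label.inverse_transform(codes)
print(restored)
```
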
