# Transforming categorical columns into numeric ones



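### get_dummies:
- Simpler and more direct, but without the advantages of OneHotEncoder: it does not remember the categories it has already seen, so data containing a new category produces a different set of columns, which usually breaks a model trained on the original ones.
- For binary variables, the `drop_first` parameter keeps a single column.
- It replaces the original column automatically.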
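
A small sketch of the first point, on made-up data (the `train` and `test` frames are hypothetical, not the Titanic file). `get_dummies` keeps no memory of earlier calls, so a frame with a different set of values yields different columns. Re-aligning with `reindex` is only one possible workaround and is not something this notebook does.

```python
import pandas as pd

# Hypothetical toy data: "test" contains a port ("Q") that "train" never saw.
train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
test = pd.DataFrame({"Embarked": ["S", "Q"]})

train_dummies = pd.get_dummies(train, columns=["Embarked"])
test_dummies = pd.get_dummies(test, columns=["Embarked"])

print(train_dummies.columns.tolist())  # ['Embarked_C', 'Embarked_S']
print(test_dummies.columns.tolist())   # ['Embarked_Q', 'Embarked_S']

# Possible workaround: force the test frame into the training layout.
# Columns unseen in training (Embarked_Q) are dropped,
# columns missing from the test frame (Embarked_C) are filled with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned.columns.tolist())   # ['Embarked_C', 'Embarked_S']
```

Note that the re-alignment silently discards the new category, which may or may not be acceptable for a given model.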
### OneHotEncoder:
- Lets you limit the number of columns: with `max_categories` (or `min_frequency`) it creates a single "infrequent" column and groups the remaining values there. You can also specify which values should or should not get a column of their own (for example through `categories`).
- For binary variables, the counterpart of `drop_first` is the `drop` parameter (`drop='if_binary'` or `drop='first'`), which keeps a single column.
- If the model is already running and a new category shows up, we can ignore it or handle it through `handle_unknown`.

### OrdinalEncoder:

- For ordered data, such as "elementary school, high school, and undergraduate degree", where the order matters to the model.
- The order must be defined through the `categories` parameter (from lowest to highest).
- The `handle_unknown` and `unknown_value` parameters set the code assigned to values that are not listed in `categories`; a missing value that is not listed there is treated the same way.
- The `dtype` parameter selects the output type (int, float).
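
As a minimal sketch of these parameters, assuming a shortened, made-up list of levels (`levels`, `df` and `unseen` are hypothetical and not part of the dataset used later):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical, shortened list of levels, ordered from lowest to highest.
levels = ["Primary School", "High School", "Graduate"]
df = pd.DataFrame({"education_level": ["Graduate", "Primary School", "High School"]})

oe_demo = OrdinalEncoder(
    categories=[levels],                 # position in the list becomes the code
    handle_unknown="use_encoded_value",  # do not raise on values outside the list
    unknown_value=-1,                    # code used for those values
    dtype=np.int32,                      # integer output instead of float
)
codes = oe_demo.fit_transform(df)
print(codes.ravel())  # [2 0 1]

# A value that is not in `levels` receives the unknown_value code.
unseen = pd.DataFrame({"education_level": ["Masters"]})
unseen_codes = oe_demo.transform(unseen)
print(unseen_codes.ravel())  # [-1]
```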

In [1]:
import pandas as pd

In [2]:
# Loading and previewing the dataset
titanic = pd.read_csv('train2.csv')
titanic = titanic.drop(["Titulos"], axis = 1) # Simplifying
titanic.head(3)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S


# Using get_dummies

### Sex category

In [3]:
titanic = pd.get_dummies(titanic, columns=['Sex'], drop_first=True)
titanic.head(3)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked,Sex_male
0,0,3,22.0,1,0,7.25,S,True
1,1,1,38.0,1,0,71.2833,C,False
2,1,3,26.0,0,0,7.925,S,False


### Embarked category

In [4]:
titanic = pd.get_dummies(titanic, columns=['Embarked'])
titanic.head(3)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,3,22.0,1,0,7.25,True,False,False,True
1,1,1,38.0,1,0,71.2833,False,True,False,False
2,1,3,26.0,0,0,7.925,False,False,False,True
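
The dummy columns above are booleans. If a model or library expects 0/1 integers, one option is the `dtype` argument of `get_dummies`. The sketch below uses a small made-up frame (`df` is hypothetical) rather than the Titanic data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
df = pd.DataFrame({"Embarked": ["S", "C", "S"], "Fare": [7.25, 71.2833, 7.925]})

# dtype=int asks get_dummies for 0/1 integers instead of True/False.
dummies_int = pd.get_dummies(df, columns=["Embarked"], dtype=int)
print(dummies_int)
```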


# Using OneHotEncoder

In [5]:
from sklearn.preprocessing import OneHotEncoder

### Loading the data again

In [6]:
# Loading and previewing the dataset
titanic2 = pd.read_csv('train2.csv')
titanic2 = titanic2.drop(["Titulos"], axis = 1) # Simplifying
titanic2.head(3)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S


## Sex category

In [7]:
# Instantiating and fitting
ohe = OneHotEncoder()
ohe = ohe.fit(titanic2[['Sex']])

### Applying the transformation

In [8]:
new = ohe.transform(titanic2[['Sex']]).toarray()

### Converting to a DataFrame

In [9]:
ohe_df = pd.DataFrame(new)
ohe_df.head(2)

Unnamed: 0,0,1
0,0.0,1.0
1,1.0,0.0


### Naming the categories

In [10]:
ohe_df.columns = ohe.get_feature_names_out()
ohe_df.head(2)

Unnamed: 0,Sex_female,Sex_male
0,0.0,1.0
1,1.0,0.0


### Concatenating with the original DataFrame and dropping the previous "Sex" column

In [11]:
titanic2 = pd.concat([titanic2, ohe_df], axis=1)
titanic2 = titanic2.drop("Sex", axis = 1)
titanic2.head(3)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked,Sex_female,Sex_male
0,0,3,22.0,1,0,7.25,S,0.0,1.0
1,1,1,38.0,1,0,71.2833,C,1.0,0.0
2,1,3,26.0,0,0,7.925,S,1.0,0.0


## Embarked category

### There are 3 values, but we will limit the output to 2 columns

In [12]:
titanic2.Embarked.value_counts()

Embarked
S    646
C    168
Q     77
Name: count, dtype: int64

In [13]:
# Instantiating and fitting
ohe2 = OneHotEncoder(max_categories=2)
ohe2 = ohe2.fit(titanic2[['Embarked']])

### Applying the transformation

In [14]:
new2 = ohe2.transform(titanic2[['Embarked']]).toarray()

### Converting to a DataFrame

In [15]:
ohe2_df = pd.DataFrame(new2)
ohe2_df.head(2)

Unnamed: 0,0,1
0,1.0,0.0
1,0.0,1.0


### Naming the categories
- Since we limited the encoder to 2 categories, it keeps the most frequent one and groups the less frequent ones into the other column

In [16]:
ohe2_df.columns = ohe2.get_feature_names_out()
ohe2_df.head(2)

Unnamed: 0,Embarked_S,Embarked_infrequent_sklearn
0,1.0,0.0
1,0.0,1.0


### Concatenating with the original DataFrame and dropping the previous "Embarked" column

In [17]:
titanic2 = pd.concat([titanic2, ohe2_df], axis=1)
titanic2 = titanic2.drop("Embarked", axis = 1)
titanic2.head(3)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_S,Embarked_infrequent_sklearn
0,0,3,22.0,1,0,7.25,0.0,1.0,1.0,0.0
1,1,1,38.0,1,0,71.2833,1.0,0.0,0.0,1.0
2,1,3,26.0,0,0,7.925,1.0,0.0,1.0,0.0


# Using ordinal encoding


In [18]:
from sklearn.preprocessing import OrdinalEncoder

In [19]:
# Loading and previewing the dataset
base = pd.read_csv('aug_test.csv')
base.head(3)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours
0,32403,city_41,0.827,Male,Has relevent experience,Full time course,Graduate,STEM,9,<10,,1,21
1,9858,city_103,0.92,Female,Has relevent experience,no_enrollment,Graduate,STEM,5,,Pvt Ltd,1,98
2,31806,city_21,0.624,Male,No relevent experience,no_enrollment,High School,,<1,,Pvt Ltd,never,15


### Instantiating

- categories: defines the order
- handle_unknown and unknown_value: set the code (here -1) for values that are not listed in categories, which in this setup includes missing values
- dtype: defines the output type

In [20]:
oe = OrdinalEncoder(
    categories=[['Primary School', 'High School', 'Graduate', 'Masters', 'Phd']],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
    dtype='int32',
)

### Fitting and transforming the data

In [21]:
oe = oe.fit(base[['education_level']])

In [22]:
oe.transform(base[['education_level']])

array([[2],
       [2],
       [1],
       ...,
       [0],
       [1],
       [3]])

### Creating a DataFrame

In [23]:
oe_df = pd.DataFrame(oe.transform(base[['education_level']]),columns=['education_level_oe'])
oe_df.head(3)

Unnamed: 0,education_level_oe
0,2
1,2
2,1


### Dropping the original education_level column

In [24]:
base = base.drop('education_level', axis=1)

### Joining with the original dataset

In [25]:
base = pd.concat([base,oe_df],axis=1)
base.head(3)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,major_discipline,experience,company_size,company_type,last_new_job,training_hours,education_level_oe
0,32403,city_41,0.827,Male,Has relevent experience,Full time course,STEM,9,<10,,1,21,2
1,9858,city_103,0.92,Female,Has relevent experience,no_enrollment,STEM,5,,Pvt Ltd,1,98,2
2,31806,city_21,0.624,Male,No relevent experience,no_enrollment,,<1,,Pvt Ltd,never,15,1
