
# Variable Transformation with Python

Importing the required libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler

## 1. Loading the Data

In [None]:
dados = pd.read_csv('adult.data.csv')

In [None]:
# Previewing the first rows of the dataset
dados.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## 2. Separating Predictors and Target

In [None]:
variaveis_preditoras = dados.iloc[:, 0:14]
alvo = dados.iloc[:, 14]

The predictors are all columns except the last one, `income`, which is the target variable.
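Positional slicing with `iloc` depends on the column order of the file. As a minimal sketch, assuming a small hypothetical frame `exemplo` (not the real dataset) whose target column is named `income`, the same split can also be written by column name:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
exemplo = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Positional split: every column except the last, then the last column
preditoras_pos = exemplo.iloc[:, :-1]
alvo_pos = exemplo.iloc[:, -1]

# Name-based split: drop the target column explicitly
preditoras_nome = exemplo.drop(columns="income")
alvo_nome = exemplo["income"]

# Compare the two approaches on this toy frame
print(preditoras_pos.equals(preditoras_nome))
print(alvo_pos.equals(alvo_nome))
```

The name-based form keeps working if columns are reordered, at the cost of hard-coding the target name.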

In [None]:
variaveis_preditoras

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States


In [None]:
alvo

0         <=50K
1         <=50K
2         <=50K
3         <=50K
4         <=50K
          ...  
32556     <=50K
32557      >50K
32558     <=50K
32559     <=50K
32560      >50K
Name: income, Length: 32561, dtype: object

## 3. Encoding the Target Variable with LabelEncoder

In [None]:
encoder = LabelEncoder()
alvo = encoder.fit_transform(alvo)

In [None]:
alvo

array([0, 0, 0, ..., 0, 0, 1])

We use `LabelEncoder` to transform the target variable, converting its categorical labels into integer codes (here `<=50K` becomes 0 and `>50K` becomes 1).
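To see which integer was assigned to which label, the fitted encoder can be inspected. The sketch below uses a short hypothetical list `rotulos` instead of the real column; `classes_` and `inverse_transform` are part of scikit-learn's `LabelEncoder`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels mimicking the `income` column
rotulos = ["<=50K", ">50K", "<=50K", ">50K", "<=50K"]

enc = LabelEncoder()
codigos = enc.fit_transform(rotulos)

# classes_ holds the sorted unique labels; each label's position is its code
mapeamento = {classe: int(i) for i, classe in enumerate(enc.classes_)}
print(mapeamento)

# inverse_transform recovers the original labels from the integer codes
originais = enc.inverse_transform(codigos)
print(list(originais))
```

Keeping the fitted encoder around makes it possible to translate model predictions back into the original labels later.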

## 4. Encoding Categorical Variables with One-Hot Encoding

In [None]:
# Counting the values in the 'workclass' column
variaveis_preditoras['workclass'].value_counts()

workclass
 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: count, dtype: int64

In [None]:
# Applying LabelEncoder to the 'workclass' column as an example
# Note: refitting `encoder` here discards the mapping it learned for the target
workclass = encoder.fit_transform(variaveis_preditoras['workclass'])

In [None]:
workclass

array([7, 6, 4, ..., 4, 4, 5])

In [None]:
valores_unicos, contagens = np.unique(workclass, return_counts=True)
resultado = np.column_stack((valores_unicos, contagens))
print("Valor | Contagem")  # "Value | Count"
print(resultado)

Valor | Contagem
[[    0  1836]
 [    1   960]
 [    2  2093]
 [    3     7]
 [    4 22696]
 [    5  1116]
 [    6  2541]
 [    7  1298]
 [    8    14]]


The example above is wrong for this column: it numbers the work classes from 0 to 8, which implies an order among categories that have none. A nominal variable with more than two categories needs a different encoding (see the states example in the slides).

### Using Dummy Variables

In [None]:
one_hot = pd.get_dummies(data=variaveis_preditoras, columns=['workclass'])

In [None]:
one_hot

Unnamed: 0,age,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,...,native-country,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay
0,39,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,...,United-States,False,False,False,False,False,False,False,True,False
1,50,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,...,United-States,False,False,False,False,False,False,True,False,False
2,38,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,...,United-States,False,False,False,False,True,False,False,False,False
3,53,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,...,United-States,False,False,False,False,True,False,False,False,False
4,28,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,...,Cuba,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,...,United-States,False,False,False,False,True,False,False,False,False
32557,40,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,...,United-States,False,False,False,False,True,False,False,False,False
32558,58,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,...,United-States,False,False,False,False,True,False,False,False,False
32559,22,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,...,United-States,False,False,False,False,True,False,False,False,False


One-Hot encoding creates one binary column per category, indicating whether that category is present in each row.
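On a frame small enough to read at a glance, the effect is easier to follow. This is a minimal sketch with a hypothetical three-row frame `exemplo`, not the real dataset:

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
exemplo = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

# Replace 'workclass' with one indicator column per category
dummies = pd.get_dummies(data=exemplo, columns=["workclass"])
print(dummies)

# Each row has exactly one active indicator among the workclass_* columns
colunas_workclass = [c for c in dummies.columns if c.startswith("workclass_")]
ativos_por_linha = dummies[colunas_workclass].sum(axis=1).tolist()
print(ativos_por_linha)
```

Unlike the integer codes from `LabelEncoder`, these indicator columns carry no implied ordering between categories.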

In [None]:
# Creating dummies for multiple columns
one_hot_full = pd.get_dummies(data=variaveis_preditoras, columns=['workclass', 'education', 'marital-status',
                                                                  'occupation', 'relationship', 'race',
                                                                  'gender', 'native-country'])

In [None]:
one_hot_full

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,39,77516,13,2174,0,40,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,50,83311,13,0,0,13,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,38,215646,9,0,0,40,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,53,234721,7,0,0,40,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
4,28,338409,13,0,0,40,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,257302,12,0,0,38,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
32557,40,154374,9,0,0,40,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
32558,58,151910,9,0,0,40,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
32559,22,201490,9,0,0,20,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


Here One-Hot encoding is applied to several categorical columns at once.

## Scaling

### 1. Standardizing the Data

Standardization centers each variable on its mean and rescales it to a standard deviation of 1: each value x becomes (x - mean) / standard deviation.
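As a small check of that formula, the sketch below standardizes a hypothetical single-column array `idades` by hand and compares it with `StandardScaler`. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages, shaped as one column (n_samples, n_features)
idades = np.array([[39.0], [50.0], [38.0], [53.0], [28.0]])

# Manual z-score; np.std defaults to ddof=0, matching StandardScaler
z_manual = (idades - idades.mean(axis=0)) / idades.std(axis=0)

# Same transformation through scikit-learn
z_sklearn = StandardScaler().fit_transform(idades)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))
```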

In [None]:
scaler = StandardScaler()
variaveis_preditoras_2 = scaler.fit_transform(one_hot_full)

In [None]:
variaveis_preditoras_2_df = pd.DataFrame(variaveis_preditoras_2, columns=one_hot_full.columns)
variaveis_preditoras_2_df = variaveis_preditoras_2_df.round(2)
variaveis_preditoras_2_df

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0.03,-1.06,1.13,0.15,-0.22,-0.04,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,0.34,-0.05,-0.02
1,0.84,-1.01,1.13,-0.15,-0.22,-2.22,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,0.34,-0.05,-0.02
2,-0.04,0.25,-0.42,-0.15,-0.22,-0.04,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,0.34,-0.05,-0.02
3,1.06,0.43,-1.20,-0.15,-0.22,-0.04,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,0.34,-0.05,-0.02
4,-0.78,1.41,1.13,-0.15,-0.22,-0.04,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,-2.93,-0.05,-0.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,-0.85,0.64,0.75,-0.15,-0.22,-0.20,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,0.34,-0.05,-0.02
32557,0.10,-0.34,-0.42,-0.15,-0.22,-0.04,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,0.34,-0.05,-0.02
32558,1.42,-0.36,-0.42,-0.15,-0.22,-0.04,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,0.34,-0.05,-0.02
32559,-1.22,0.11,-0.42,-0.15,-0.22,-1.66,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,0.34,-0.05,-0.02


In [None]:
variaveis_preditoras_2_df.describe().round(2)

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,...,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,...,0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-1.58,-1.68,-3.53,-0.15,-0.22,-3.19,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,-2.93,-0.05,-0.02
25%,-0.78,-0.68,-0.42,-0.15,-0.22,-0.04,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,0.34,-0.05,-0.02
50%,-0.12,-0.11,-0.03,-0.15,-0.22,-0.04,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,0.34,-0.05,-0.02
75%,0.69,0.45,0.75,-0.15,-0.22,0.37,-0.24,-0.17,-0.26,-0.01,...,-0.03,-0.06,-0.02,-0.05,-0.04,-0.02,-0.02,0.34,-0.05,-0.02
max,3.77,12.27,2.3,13.39,10.59,4.74,4.09,5.74,3.82,68.2,...,29.65,16.87,52.08,20.15,25.25,42.52,41.39,0.34,22.02,45.1


### 2. Min-Max Normalization
Min-Max normalization rescales each variable to a fixed interval, usually [0, 1], by computing (x - min) / (max - min). It is useful for algorithms that are sensitive to the scale of the data.
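The same rescaling can be reproduced by hand on a tiny example. This sketch uses a hypothetical single-column array `horas` and the default `feature_range=(0, 1)` of `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical weekly hours, shaped as one column (n_samples, n_features)
horas = np.array([[40.0], [13.0], [40.0], [20.0], [60.0]])

# Manual min-max scaling to [0, 1]
x_min, x_max = horas.min(axis=0), horas.max(axis=0)
mm_manual = (horas - x_min) / (x_max - x_min)

# Same transformation through scikit-learn (default feature_range=(0, 1))
mm_sklearn = MinMaxScaler().fit_transform(horas)

print(np.allclose(mm_manual, mm_sklearn))
print(mm_sklearn.min(axis=0), mm_sklearn.max(axis=0))
```

Because the result depends only on the observed minimum and maximum, a single extreme value compresses all other values into a narrow part of the interval.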

In [None]:
minmax = MinMaxScaler()
variaveis_preditoras_3 = minmax.fit_transform(one_hot_full)

In [None]:
variaveis_preditoras_3_df = pd.DataFrame(variaveis_preditoras_3, columns=one_hot_full.columns)
variaveis_preditoras_3_df.round(2)

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0.30,0.04,0.80,0.02,0.0,0.40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.45,0.05,0.80,0.00,0.0,0.12,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.29,0.14,0.53,0.00,0.0,0.40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.49,0.15,0.40,0.00,0.0,0.40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.15,0.22,0.80,0.00,0.0,0.40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0.14,0.17,0.73,0.00,0.0,0.38,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
32557,0.32,0.10,0.53,0.00,0.0,0.40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
32558,0.56,0.09,0.53,0.00,0.0,0.40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
32559,0.07,0.13,0.53,0.00,0.0,0.19,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [None]:
variaveis_preditoras_3_df.describe().round(2)

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,...,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,0.3,0.12,0.61,0.01,0.02,0.4,0.06,0.03,0.06,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9,0.0,0.0
std,0.19,0.07,0.17,0.07,0.09,0.13,0.23,0.17,0.25,0.01,...,0.03,0.06,0.02,0.05,0.04,0.02,0.02,0.31,0.05,0.02
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.15,0.07,0.53,0.0,0.0,0.4,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
50%,0.27,0.11,0.6,0.0,0.0,0.4,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,0.42,0.15,0.73,0.0,0.0,0.45,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


With these steps, we have covered everything from the basic encoding of categorical variables to standardization and normalization, preparing the data for machine learning models.