### Pair Programming - Encoding

- Encode the categorical variable(s) in your dataset.
- Remember that the first step is to decide whether your variables have an order, since that determines which encoding approach to use.
- Save the dataframe, which should now contain the standardized, normalized, and encoded variables, to a CSV file for use in the next pair programming session.
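
The ordered versus unordered decision described above can be sketched on toy data. This is only an illustration: the `size` and `color` columns, their values, and the `orden_size` dictionary are hypothetical and do not come from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: "size" has a natural order, "color" does not.
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
})

# Ordered variable: map each category to its rank.
orden_size = {"small": 0, "medium": 1, "large": 2}
toy["size_encoded"] = toy["size"].map(orden_size)

# Unordered variable: one indicator column per category, so no order is implied.
dummies = pd.get_dummies(toy["color"], prefix="color", dtype=int)
toy_encoded = pd.concat([toy, dummies], axis=1)

print(toy_encoded)
```

Using integers for an unordered variable would suggest a ranking that does not exist, which is why the indicator columns are used for `color`.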

In [299]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

import warnings
warnings.filterwarnings('ignore')

In [300]:
df = pd.read_csv("data/adult.data_limpio.csv", index_col = 0)

In [301]:
df.head(2)

| 39 | work_class | final_weight | education | education_yrs | marital_status | occupation | relationship | ethnicity | gender | capital_gain | capital_lost | hours_week | country | salary | census |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | Bajo |
| 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | Bajo |


In [302]:
# Explicit copy, so the columns added to df_categoricas later do not touch df
df_categoricas = df.select_dtypes(include = "object").copy()

In [303]:
df_categoricas.head(2)

| 39 | work_class | education | marital_status | occupation | relationship | ethnicity | gender | country | salary | census |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 50 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | United-States | <=50K | Bajo |
| 38 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | United-States | <=50K | Bajo |


In [304]:
df_categoricas.columns

Index(['work_class', 'education', 'marital_status', 'occupation',
       'relationship', 'ethnicity', 'gender', 'country', 'salary', 'census'],
      dtype='object')

In [305]:
df["ethnicity"].unique()

array([' White', ' Black', ' Asian-Pac-Islander', ' Amer-Indian-Eskimo',
       ' Other'], dtype=object)

In [306]:
df["gender"].unique()

array([' Male', ' Female'], dtype=object)

In [307]:
df["work_class"].unique()

array([' Self-emp-not-inc', ' Private', ' State-gov', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)

In [308]:
df_categoricas["census"].unique()

array(['Bajo', 'Alto'], dtype=object)

In [309]:
df_categoricas["salary"].unique()

array([' <=50K', ' >50K'], dtype=object)
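
The `unique()` outputs above show that every raw value carries a leading space (for example `' Male'`). A minimal sketch of how that whitespace could be removed, assuming one wanted cleaner labels, is shown here on a toy frame. The rest of this notebook keeps the raw values, so any dictionary keys used with them must keep the leading space too.

```python
import pandas as pd

# Toy frame that mimics the leading spaces seen in the unique() outputs.
toy = pd.DataFrame({
    "gender": [" Male", " Female", " Male"],
    "salary": [" <=50K", " >50K", " <=50K"],
})

# Strip surrounding whitespace from the chosen text columns.
columnas_texto = ["gender", "salary"]
toy[columnas_texto] = toy[columnas_texto].apply(lambda col: col.str.strip())

print(toy["gender"].unique())
```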

***
### Map

In [310]:
mapa_gender = {' Female': 1, ' Male': 0}
mapa_census = {'Bajo' : 0, 'Alto' : 1}
mapa_salary = {' <=50K' : 0, ' >50K' : 1}

In [311]:
df_categoricas["gender_map"] = df_categoricas["gender"].map(mapa_gender)

In [312]:
df_categoricas["census_map"] = df_categoricas["census"].map(mapa_census)

In [313]:
df_categoricas["salary_map"] = df_categoricas["salary"].map(mapa_salary)

In [314]:
df_categoricas.head(2)

| 39 | work_class | education | marital_status | occupation | relationship | ethnicity | gender | country | salary | census | gender_map | census_map | salary_map |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 50 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | United-States | <=50K | Bajo | 0 | 0 | 0 |
| 38 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | United-States | <=50K | Bajo | 0 | 0 | 0 |
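
`Series.map` with a dictionary returns `NaN` for any value that is not one of its keys, so a typo or a missing leading space in a key fails silently. A small check along these lines may help. The `" Unknown"` value is a hypothetical addition to show the failure case and does not appear in the real column.

```python
import pandas as pd

# Toy series in the same leading-space format as the "gender" column,
# plus one hypothetical value (" Unknown") that the mapping does not cover.
gender = pd.Series([" Male", " Female", " Unknown", " Male"])
mapa_gender = {" Female": 1, " Male": 0}

gender_map = gender.map(mapa_gender)

# Values missing from the dictionary become NaN; list them before going further.
sin_mapear = gender[gender_map.isna()].unique()
print(sin_mapear)
```

An empty result means every category was covered by the dictionary.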


***
### Ordinal Encoder

In [315]:
df_categoricas["education"].unique()

array([' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th',
       ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th',
       ' Doctorate', ' Prof-school', ' 5th-6th', ' 10th', ' 1st-4th',
       ' Preschool', ' 12th'], dtype=object)

In [316]:
orden_education = [
    ' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th',
    ' 11th', ' 12th', ' HS-grad', ' Assoc-acdm', ' Assoc-voc',
    ' Some-college', ' Prof-school', ' Bachelors', ' Masters', ' Doctorate',
]

In [317]:
df_sin_orden = pd.DataFrame(df_categoricas["education"])

In [318]:
df_sin_orden.head()

| 39 | education |
| --- | --- |
| 50 | Bachelors |
| 38 | HS-grad |
| 53 | 11th |
| 28 | Bachelors |
| 37 | Masters |


In [319]:
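def ordinal_encoder(df, columna, orden_valores):
    # Encode df[columna] as integers following the order given in orden_valores
    # (the parameter, not the global orden_education list).
    ordinal = OrdinalEncoder(categories = [orden_valores], dtype = int)

    transformados_oe = ordinal.fit_transform(df[[columna]])

    # Assign the flattened array directly: wrapping it in a new DataFrame would
    # create a 0..n-1 index that does not match df's index, misaligning the rows.
    df[columna] = transformados_oe.ravel()

    return df

In [320]:
df_education_oe = ordinal_encoder(df_sin_orden, "education", orden_education)

In [321]:
df_education_oe.head()

| 39 | education |
| --- | --- |
| 50 | 13 |
| 38 | 8 |
| 53 | 6 |
| 28 | 13 |
| 37 | 14 |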
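
One pitfall is worth illustrating when encoder output is written back into a dataframe whose index is not 0..n-1, as is the case here (the index holds values such as 50, 38, 53). pandas aligns on index labels when a Series or DataFrame is assigned to a column, while a plain NumPy array is assigned by position. The toy frame and the `codes` array below are hypothetical stand-ins, not real encoder output.

```python
import numpy as np
import pandas as pd

# Toy frame whose index is not 0..n-1 (hypothetical labels, like ages used as index).
toy = pd.DataFrame(
    {"education": [" HS-grad", " Bachelors", " Masters"]},
    index=[50, 38, 53],
)
codes = np.array([[8], [13], [14]])  # stand-in for an encoder's output, in row order

# A Series with a fresh 0..n-1 index is aligned on labels: none match here, so NaN.
misaligned = toy.copy()
misaligned["education_oe"] = pd.DataFrame(codes)[0]

# The flattened array is assigned by position, which keeps the original row order.
aligned = toy.copy()
aligned["education_oe"] = codes.ravel()

print(misaligned)
print(aligned)
```

When some labels do happen to match, the misaligned version yields plausible-looking but wrong integers instead of `NaN`, which makes the problem harder to spot.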


***
- Encode **work_class**, **marital_status**, **ethnicity**, and **country** (unordered).