### Pair Programming - Encoding

- Encode the categorical variable(s) in your dataset.
- Remember that the first step is to decide whether your variables are ordered or not, so that you can choose the appropriate approach.
- Save the dataframe, which should now contain the standardized, normalized and encoded variables, to a CSV file for use in the next pair-programming session.
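
As a rough illustration of that decision, the sketch below encodes a hypothetical ordered column (`size`) with integer codes that respect its order, and a hypothetical unordered column (`color`) with one indicator column per category. The toy frame and its column names are assumptions and are not part of the dataset.

```python
import pandas as pd

# Toy data (hypothetical): "size" has a natural order, "color" does not.
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
})

# Ordered variable: declare the order explicitly, then take the integer codes.
size_order = ["small", "medium", "large"]
toy["size_encoded"] = pd.Categorical(toy["size"], categories=size_order, ordered=True).codes

# Unordered variable: one 0/1 column per category, so no ranking is implied.
color_dummies = pd.get_dummies(toy["color"], prefix="color", dtype=int)

encoded = pd.concat([toy[["size_encoded"]], color_dummies], axis=1)
print(encoded)
```

Integer codes suit ordered variables because the distance between codes carries meaning; for unordered variables the same codes would suggest a ranking that does not exist.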

In [70]:
import pandas as pd
import sklearn 
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

import warnings
warnings.filterwarnings('ignore')

In [71]:
df = pd.read_csv("data/adult.numericas_robust.csv", index_col = 0)

***
### Map

In [72]:
mapa_gender = {' Female': 1, ' Male': 0}
mapa_census = {'Bajo' : 0, 'Alto' : 1}

In [73]:
df["gender_map"] = df["gender"].map(mapa_gender)

In [74]:
df["census_map"] = df["census"].map(mapa_census)

In [75]:
df.head(2)

| 39 | work_class | final_weight | education | education_yrs | marital_status | occupation | relationship | ethnicity | gender | capital_gain | capital_lost | hours_week | country | salary | census | salary_log | gender_map | census_map |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | Self-emp-not-inc | 0.407581 | Bachelors | -0.333333 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.0 | 0.0 | -2.0 | United-States | 32755 | Bajo | 10.396811 | 0 | 0 |
| 38 | Private | 0.730681 | HS-grad | 0.666667 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.0 | 0.0 | 0.0 | United-States | 45156 | Bajo | 10.717878 | 0 | 0 |
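
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a mismatch in whitespace or spelling goes unnoticed unless it is checked. A minimal sketch of such a check on a toy series follows; the sample values are assumptions, and the keys reuse `mapa_gender` from above.

```python
import pandas as pd

mapa_gender = {" Female": 1, " Male": 0}

# Toy values (hypothetical): the last one lacks the leading space used in the keys.
gender = pd.Series([" Male", " Female", "Male"])

mapped = gender.map(mapa_gender)
unmatched = gender[mapped.isna()].unique()
print(unmatched)  # values that have no key in the dictionary

# One possible remedy: strip whitespace on both sides before mapping.
mapa_clean = {key.strip(): value for key, value in mapa_gender.items()}
mapped_clean = gender.str.strip().map(mapa_clean)
print(mapped_clean.isna().sum())
```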


***
### Ordinal Encoder

- Education

In [76]:
orden_education = [' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th', ' 12th', ' HS-grad', ' Assoc-acdm', ' Assoc-voc', ' Some-college', ' Prof-school', ' Bachelors', ' Masters', ' Doctorate']

In [77]:
df_sin_orden = pd.DataFrame(df["education"])

In [78]:
df_sin_orden.head()

| 39 | education |
|---|---|
| 50 | Bachelors |
| 38 | HS-grad |
| 53 | 11th |
| 28 | Bachelors |
| 37 | Masters |


In [79]:
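def ordinal_encoder(df, columna, orden_valores):
    # Encode `columna` as integers following the order given in `orden_valores`.
    ordinal = OrdinalEncoder(categories = [orden_valores], dtype = int)

    transformados_oe = ordinal.fit_transform(df[[columna]])

    # Caution: this frame gets a fresh 0..n-1 index, so the assignment below
    # aligns on index labels, not on row position.
    oe_df = pd.DataFrame(transformados_oe)

    df[columna] = oe_df

    return df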

In [80]:
df_education_oe = ordinal_encoder(df_sin_orden, "education", orden_education)

In [81]:
df_education_oe.head()

| 39 | education |
|---|---|
| 50 | 8 |
| 38 | 9 |
| 53 | 8 |
| 28 | 8 |
| 37 | 11 |
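
In the output above, `Bachelors` and `11th` both appear as 8, which suggests the encoded values were aligned on index labels rather than on row position: `pd.DataFrame(transformados_oe)` gets a fresh 0..n-1 index, while the input frame is indexed by other values. A minimal sketch of a variant that keeps the original index is shown below on a toy frame. The function name `ordinal_encode_aligned`, the shortened order list and the sample rows are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

def ordinal_encode_aligned(df, columna, orden_valores):
    """Return a copy of df with `columna` replaced by its position in orden_valores."""
    ordinal = OrdinalEncoder(categories=[orden_valores], dtype=int)
    transformados = ordinal.fit_transform(df[[columna]])
    resultado = df.copy()
    # A flat array is assigned by position, so index labels play no role.
    resultado[columna] = transformados.ravel()
    return resultado

# Toy frame (hypothetical) with a non-default index, as in the notebook.
orden = [" 11th", " HS-grad", " Bachelors", " Masters"]
toy = pd.DataFrame(
    {"education": [" Bachelors", " HS-grad", " 11th", " Bachelors", " Masters"]},
    index=[50, 38, 53, 28, 37],
)
result = ordinal_encode_aligned(toy, "education", orden)
print(result)
```

With this variant, equal categories always receive equal codes, and the input frame is left unmodified.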


In [82]:

final = pd.concat([df, df_education_oe],axis=1)


***
- Encode **work_class**, **marital_status**, **ethnicity** and **country** (unordered).

In [83]:

dummies_marital = pd.get_dummies(df['marital_status'], prefix_sep = "_", prefix = 'marital_status', dtype = int)
dummies_marital.head(2)

| 39 | marital_status_ Divorced | marital_status_ Married-AF-spouse | marital_status_ Married-civ-spouse | marital_status_ Married-spouse-absent | marital_status_ Never-married | marital_status_ Separated | marital_status_ Widowed |
|---|---|---|---|---|---|---|---|
| 50 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 38 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


In [84]:

dummies_country = pd.get_dummies(df['country'], prefix_sep = "_", prefix = 'country', dtype = int)
dummies_country.head(2)

| 39 | country_ ? | country_ Cambodia | country_ Canada | country_ China | country_ Columbia | country_ Cuba | country_ Dominican-Republic | country_ Ecuador | country_ El-Salvador | country_ England | ... | country_ Portugal | country_ Puerto-Rico | country_ Scotland | country_ South | country_ Taiwan | country_ Thailand | country_ Trinadad&Tobago | country_ United-States | country_ Vietnam | country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


- Ethnicity

In [85]:

dummies_ethnicity = pd.get_dummies(df["ethnicity"], prefix_sep = "_", prefix = "ethnicity", dtype = int)
dummies_ethnicity.head(2)

| 39 | ethnicity_ Amer-Indian-Eskimo | ethnicity_ Asian-Pac-Islander | ethnicity_ Black | ethnicity_ Other | ethnicity_ White |
|---|---|---|---|---|---|
| 50 | 0 | 0 | 0 | 0 | 1 |
| 38 | 0 | 0 | 0 | 0 | 1 |


In [86]:

dummies_work_class = pd.get_dummies(df['work_class'], prefix_sep = "_", prefix = 'work_class', dtype = int)
dummies_work_class.head(2)

| 39 | work_class_ ? | work_class_ Federal-gov | work_class_ Local-gov | work_class_ Never-worked | work_class_ Private | work_class_ Self-emp-inc | work_class_ Self-emp-not-inc | work_class_ State-gov | work_class_ Without-pay |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 38 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |


In [87]:

dummies_occupation = pd.get_dummies(df['occupation'], prefix_sep = "_", prefix = 'occupation', dtype = int)
dummies_occupation.head(2)

| 39 | occupation_ ? | occupation_ Adm-clerical | occupation_ Armed-Forces | occupation_ Craft-repair | occupation_ Exec-managerial | occupation_ Farming-fishing | occupation_ Handlers-cleaners | occupation_ Machine-op-inspct | occupation_ Other-service | occupation_ Priv-house-serv | occupation_ Prof-specialty | occupation_ Protective-serv | occupation_ Sales | occupation_ Tech-support | occupation_ Transport-moving |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 38 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [88]:

dummies_relationship = pd.get_dummies(df['relationship'], prefix_sep = "_", prefix = 'relationship', dtype = int)
dummies_relationship.head(2)

| 39 | relationship_ Husband | relationship_ Not-in-family | relationship_ Other-relative | relationship_ Own-child | relationship_ Unmarried | relationship_ Wife |
|---|---|---|---|---|---|---|
| 50 | 1 | 0 | 0 | 0 | 0 | 0 |
| 38 | 0 | 1 | 0 | 0 | 0 | 0 |
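
The `get_dummies` calls above all follow the same pattern, so they could also be expressed as a single call with the `columns` argument, which prefixes each new column with its source column name and drops the originals. A minimal sketch on a toy frame (the sample rows and the list name `columnas_sin_orden` are assumptions):

```python
import pandas as pd

# Toy rows (hypothetical) with two of the unordered columns.
toy = pd.DataFrame({
    "hours_week": [-2.0, 0.0, 4.0],
    "relationship": [" Husband", " Not-in-family", " Husband"],
    "ethnicity": [" White", " White", " Black"],
})

columnas_sin_orden = ["relationship", "ethnicity"]

# Encodes every listed column and leaves the remaining columns untouched.
toy_encoded = pd.get_dummies(toy, columns=columnas_sin_orden, prefix_sep="_", dtype=int)
print(toy_encoded.columns.tolist())
```

This form also removes the need for a separate concatenation and drop step for those columns.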


In [89]:
df_encoding = pd.concat([df, dummies_relationship, dummies_occupation, dummies_work_class, dummies_ethnicity, dummies_marital, dummies_country, df_education_oe], axis = 1)
df_encoding.head(2)

| 39 | work_class | final_weight | education | education_yrs | marital_status | occupation | relationship | ethnicity | gender | capital_gain | ... | country_ Puerto-Rico | country_ Scotland | country_ South | country_ Taiwan | country_ Thailand | country_ Trinadad&Tobago | country_ United-States | country_ Vietnam | country_ Yugoslavia | education |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | Self-emp-not-inc | 0.407581 | Bachelors | -0.333333 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 8 |
| 38 | Private | 0.730681 | HS-grad | 0.666667 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 9 |


In [90]:
# Note: "education" now names two columns (original strings and ordinal codes); drop removes both.
df_encoding.drop(["occupation", "relationship", "marital_status", "country", "ethnicity", "work_class", "education", "gender", "census"], axis = 1, inplace = True)

In [91]:
df_encoding

| 39 | final_weight | education_yrs | capital_gain | capital_lost | hours_week | salary | salary_log | gender_map | census_map | relationship_ Husband | ... | country_ Portugal | country_ Puerto-Rico | country_ Scotland | country_ South | country_ Taiwan | country_ Thailand | country_ Trinadad&Tobago | country_ United-States | country_ Vietnam | country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 0.407581 | -0.333333 | 0.0 | 0.0 | -2.0 | 32755 | 10.396811 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 38 | 0.730681 | 0.666667 | 0.0 | 0.0 | 0.0 | 45156 | 10.717878 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 53 | -0.574814 | -0.333333 | 0.0 | 0.0 | 4.0 | 39938 | 10.595084 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 28 | 0.125840 | -0.333333 | 0.0 | 0.0 | 0.0 | 26464 | 10.183541 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 37 | -0.790191 | 0.000000 | 0.0 | 0.0 | -0.4 | 36976 | 10.518024 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27 | 1.584401 | -0.333333 | 0.0 | 0.0 | 8.0 | 42582 | 10.659187 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 40 | -0.753688 | 1.000000 | 0.0 | 0.0 | 0.0 | 62257 | 11.039026 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 58 | 0.082056 | -0.333333 | 5013.0 | 0.0 | 0.0 | 35340 | 10.472771 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 22 | -0.514381 | -1.000000 | 0.0 | 2042.0 | 0.0 | 34013 | 10.434498 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
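
The final frame above no longer ends with an `education` column: the concatenation produced two columns with that name (the original strings and the ordinal codes), and `drop` removes every column that matches a label. A minimal sketch of one way to keep the codes, by renaming them before concatenating, is shown below. The name `education_oe` and the toy values are assumptions.

```python
import pandas as pd

# Toy frames (hypothetical) mirroring the situation above.
original = pd.DataFrame({"education": [" Bachelors", " HS-grad"], "hours_week": [-2.0, 0.0]})
codes = pd.DataFrame({"education": [13, 8]})

# Same name twice: dropping "education" removes both columns.
clash = pd.concat([original, codes], axis=1).drop(["education"], axis=1)
print(clash.columns.tolist())

# Renaming the encoded column first keeps it after the drop.
codes_renamed = codes.rename(columns={"education": "education_oe"})
kept = pd.concat([original, codes_renamed], axis=1).drop(["education"], axis=1)
print(kept.columns.tolist())
```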


In [92]:
df_encoding.to_csv("data/adults_encoding.csv")