In [122]:
import sklearn
import pandas as pd
import numpy as np

# Dummy Variables and Encoding

One component of data preprocessing is feature selection and transformation. Transformations can include creating dummy variables from either categorical or numerical variables. A dummy variable is a predictor that takes only the values 0 or 1, corresponding to the absence or presence of a feature, respectively.
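As a minimal sketch of this definition, the toy dataframe below (its column names and values are made up for illustration and are not part of the Census data) derives one dummy variable from a categorical column and one from a numerical column.

```python
import pandas as pd

# Toy data; the column names and values are hypothetical.
toy = pd.DataFrame({"color": ["red", "blue", "red"],
                    "height_cm": [150, 182, 175]})

# Dummy from a categorical variable: 1 where the category is present, else 0.
toy["is_red"] = (toy["color"] == "red").astype(int)

# Dummy from a numerical variable: 1 where the value exceeds a threshold, else 0.
toy["is_tall"] = (toy["height_cm"] > 170).astype(int)

print(toy)
```

Both new columns contain only 0s and 1s, which is what makes them usable as predictors in libraries that expect numerical input.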


### Dataset

To test out encoding and dummy variable creation in Python, the Census income dataset from UCI Machine Learning Repository will be used. This dataset contains both numerical and categorical variables.

In [123]:
df = pd.read_csv(r"http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data",
                 header=None, index_col=False,
                 # fields in adult.data are separated by ", ", so strip the leading space from each value
                 skipinitialspace=True,
                 names=['age','workclass','fnlwgt','edu','edunum','marital-status','occupation','relationship',
                        'race','sex','cap-gain','cap-loss','hours-per-week','native-country'])
df.head()

Unnamed: 0,age,workclass,fnlwgt,edu,edunum,marital-status,occupation,relationship,race,sex,cap-gain,cap-loss,hours-per-week,native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


### Categorical Variables

Dummy variables are frequently used to encode categorical variables. Some ML libraries require all predictors to be numerical before a regression or classification technique can be applied to a dataset.

To encode categorical variables, one-hot encoding can be used to represent the categories as binary vectors: each category becomes a binary predictor column.
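The idea can be sketched on a handful of made-up values. This example uses pandas' `get_dummies` rather than the scikit-learn encoder used in the rest of this notebook, purely because it is compact; the sample values are hypothetical.

```python
import pandas as pd

# A small, made-up sample of marital-status values.
status = pd.Series(["Divorced", "Never-married", "Divorced", "Separated"],
                   name="marital-status")

# One binary column per distinct category; columns are ordered alphabetically.
one_hot = pd.get_dummies(status, dtype=int)
print(one_hot)
```

Each row contains exactly one 1, in the column that matches the row's original category.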

The use of scikit-learn's `OneHotEncoder` is demonstrated on the `marital-status` column, though it can be applied to any of the categorical columns in the dataset. The following custom function accepts the name of a column to encode and a dataframe, and returns a revised dataframe in which that column is replaced by binary predictors. The last category is left out, since it is implied when all of the other predictors are 0.

In [124]:
from sklearn.preprocessing import OneHotEncoder

def encode_into_df(column_to_encode, df):
    """Replace a column with one-hot dummy columns, omitting the last category."""
    # work on a copy so the caller's dataframe is not modified in place
    df = df.copy()
    hot_encoder = OneHotEncoder()
    # the encoder returns a sparse matrix by default; convert it to a dense array
    onehot = hot_encoder.fit_transform(
        np.array(df[column_to_encode]).reshape(-1, 1)).toarray()

    # encode all categories except the last one
    for i, category in enumerate(hot_encoder.categories_[0][:-1]):
        df[category] = onehot[:, i]

    return df.drop(columns=column_to_encode)

df = encode_into_df('marital-status', df)
df.head()

Unnamed: 0,age,workclass,fnlwgt,edu,edunum,occupation,relationship,race,sex,cap-gain,cap-loss,hours-per-week,native-country,Divorced,Married-AF-spouse,Married-civ-spouse,Married-spouse-absent,Never-married,Separated
0,39,State-gov,77516,Bachelors,13,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0.0,0.0,0.0,0.0,1.0,0.0
1,50,Self-emp-not-inc,83311,Bachelors,13,Exec-managerial,Husband,White,Male,0,0,13,United-States,0.0,0.0,1.0,0.0,0.0,0.0
2,38,Private,215646,HS-grad,9,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,1.0,0.0,0.0,0.0,0.0,0.0
3,53,Private,234721,11th,7,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0.0,0.0,1.0,0.0,0.0,0.0
4,28,Private,338409,Bachelors,13,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0.0,0.0,1.0,0.0,0.0,0.0


### Numerical Variables

Another use of dummy variables is for bucketing or binning numerical variables. For example, the hours-per-week column can be bucketed into three categories:
    
- part-time (less than 40 hours)
- regular (40 hours)
- over-time (greater than 40 hours; labeled `full-time` in the code and output below)

When using dummy variables for bins, the number of predictors required is one less than the number of buckets. This is because when all of the predictors are 0, the excluded bucket is known to be the one that applies. In this example, part-time and over-time columns will be created. If both part-time and over-time are 0, the observation must fall in the regular bucket.
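The reasoning above can be checked on a few made-up rows. In this sketch the dummy values are hypothetical; the point is only that the excluded bucket can always be recovered from the retained ones.

```python
import pandas as pd

# Toy indicator columns for the two retained buckets (values are made up).
dummies = pd.DataFrame({"part-time": [1, 0, 0],
                        "over-time": [0, 0, 1]})

# Each row belongs to exactly one bucket, so the excluded bucket is 1
# only when every retained dummy is 0.
regular = 1 - dummies.sum(axis=1)
print(regular.tolist())  # [0, 1, 0]
```

Because the third column carries no extra information, including it would make the predictors perfectly collinear, which is a problem for some models such as unregularized linear regression.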

In the dataframe below, `hours-per-week` has been replaced by categories.

In [125]:
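# pd.cut uses right-closed intervals by default: (0, 39], (39, 40], (40, 100]
bins = [0, 39, 40, 100]
# 'full-time' is the label used here for the bucket above 40 hours
labels = ['part-time', 'regular', 'full-time']
df['hours-per-week'] = pd.cut(df['hours-per-week'], bins, labels=labels)
df.head()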

Unnamed: 0,age,workclass,fnlwgt,edu,edunum,occupation,relationship,race,sex,cap-gain,cap-loss,hours-per-week,native-country,Divorced,Married-AF-spouse,Married-civ-spouse,Married-spouse-absent,Never-married,Separated
0,39,State-gov,77516,Bachelors,13,Adm-clerical,Not-in-family,White,Male,2174,0,regular,United-States,0.0,0.0,0.0,0.0,1.0,0.0
1,50,Self-emp-not-inc,83311,Bachelors,13,Exec-managerial,Husband,White,Male,0,0,part-time,United-States,0.0,0.0,1.0,0.0,0.0,0.0
2,38,Private,215646,HS-grad,9,Handlers-cleaners,Not-in-family,White,Male,0,0,regular,United-States,1.0,0.0,0.0,0.0,0.0,0.0
3,53,Private,234721,11th,7,Handlers-cleaners,Husband,Black,Male,0,0,regular,United-States,0.0,0.0,1.0,0.0,0.0,0.0
4,28,Private,338409,Bachelors,13,Prof-specialty,Wife,Black,Female,0,0,regular,Cuba,0.0,0.0,1.0,0.0,0.0,0.0
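To see how the bin edges behave at the boundaries, the sketch below applies the same `bins` and `labels` to a few hand-picked values (the sample hours are made up). With the default settings, `pd.cut` treats each interval as open on the left and closed on the right.

```python
import pandas as pd

bins = [0, 39, 40, 100]
labels = ['part-time', 'regular', 'full-time']

# Hand-picked values around the bin edges.
hours = pd.Series([1, 39, 40, 41, 99])
binned = pd.cut(hours, bins, labels=labels)
print(binned.tolist())
# ['part-time', 'part-time', 'regular', 'full-time', 'full-time']

# The lowest edge is exclusive, so a value of 0 falls outside every bin.
zero_binned = pd.cut(pd.Series([0]), bins, labels=labels)
print(zero_binned.isna().all())  # True
```

If a column could contain values equal to the lowest edge, passing `include_lowest=True` to `pd.cut` would keep them in the first bin.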


Now that `hours-per-week` is a categorical feature, the custom `encode_into_df` function can be applied to that column to create dummy variables.

In [126]:
df = encode_into_df('hours-per-week', df)
df.head()

Unnamed: 0,age,workclass,fnlwgt,edu,edunum,occupation,relationship,race,sex,cap-gain,cap-loss,native-country,Divorced,Married-AF-spouse,Married-civ-spouse,Married-spouse-absent,Never-married,Separated,full-time,part-time
0,39,State-gov,77516,Bachelors,13,Adm-clerical,Not-in-family,White,Male,2174,0,United-States,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,50,Self-emp-not-inc,83311,Bachelors,13,Exec-managerial,Husband,White,Male,0,0,United-States,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,38,Private,215646,HS-grad,9,Handlers-cleaners,Not-in-family,White,Male,0,0,United-States,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,53,Private,234721,11th,7,Handlers-cleaners,Husband,Black,Male,0,0,United-States,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,28,Private,338409,Bachelors,13,Prof-specialty,Wife,Black,Female,0,0,Cuba,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
