# PREPARE THE DATA

These are the steps required to prepare the data before it is passed to the *"Wide and Deep"* model at `wide_deep/torch_model.py`.

Let's first load the data and create a target:

In [None]:
from __future__ import print_function
import pandas as pd
import numpy as np

DF = pd.read_csv('data/adult_data.csv')

# Let's create a feature that will be our target for logistic regression
DF['income_label'] = (DF["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)

DF.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket,income_label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,0


### 1-Set the experiment

We need to define the columns in the dataset that will be passed to the *"wide"* and *"deep"* sides of the model. For more details on what "wide" and "deep" mean, I recommend reading [this tutorial](https://www.tensorflow.org/tutorials/wide_and_deep), the [original paper](https://arxiv.org/pdf/1606.07792.pdf) or demo2 in this repo.

In the example below, the wide and crossed columns will be passed to the wide side of the model, while the embedding and continuous columns will go through the deep side.

We also need to state our target and the method that will be used to fit/predict that target (regression, logistic or multiclass).

In [2]:
wide_cols = ['age','hours_per_week','education', 'relationship','workclass',
             'occupation','native_country','gender']
crossed_cols = (['education', 'occupation'], ['native_country', 'occupation'])
embeddings_cols = [('education',10), ('relationship',8), ('workclass',10),
                    ('occupation',10),('native_country',12)]
continuous_cols = ["age","hours_per_week"]
target = 'income_label'
method = 'logistic'

You will see that `embeddings_cols` is a list of two-element tuples: the column name and the dimension of the corresponding embedding (i.e. the length of the vector that represents each category of that feature). When passed through the deep side, each `education` level will therefore be represented by a vector of 10 numbers, each `relationship` level by a vector of 8, and so on.

If you want to use the same embedding dimension for *all* the embedding columns, you can simply list the column names and set the dimension when calling the `prepare_data` function (imported at the end of this notebook). This function has a parameter called `def_dim` (default dimension) that is applied to all embedding columns when no embedding dimension is given. The first few lines of `prepare_data` look like this:

In [3]:
# If embeddings_cols does not include the embeddings dimensions it will be set as
# def_dim
if type(embeddings_cols[0]) is tuple:
    emb_dim = dict(embeddings_cols)
    embeddings_cols = [emb[0] for emb in embeddings_cols]
else:
    emb_dim = {e:def_dim for e in embeddings_cols}
deep_cols = embeddings_cols+continuous_cols
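
To make the two accepted forms concrete, here is a minimal, self-contained sketch of the same logic. The helper name `split_embedding_spec` and the default value of 8 are assumptions made for illustration; they are not taken from the repo.

```python
def split_embedding_spec(embeddings_cols, def_dim=8):
    """Return (column names, {column: embedding dimension}).

    Hypothetical helper: accepts either a list of (name, dim) tuples or a
    plain list of names, in which case every column gets def_dim.
    """
    if isinstance(embeddings_cols[0], tuple):
        emb_dim = dict(embeddings_cols)
        names = [name for name, _ in embeddings_cols]
    else:
        emb_dim = {name: def_dim for name in embeddings_cols}
        names = list(embeddings_cols)
    return names, emb_dim


# Explicit dimensions per column
names, dims = split_embedding_spec([('education', 10), ('relationship', 8)])
# Names only: every column falls back to def_dim
same_names, same_dims = split_embedding_spec(['education', 'relationship'], def_dim=8)
```

Both calls return the same list of column names; only the dimension dictionary differs.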

### 2-Cross-product for binary features

As explained in the original paper: *"For binary features, a cross-product transformation (e.g., `AND(gender=female, language=en)`) is 1 if and only if the constituent features (`gender=female` and `language=en`) are all 1, and 0 otherwise"*. Here, this is implemented by combining the features into a new feature and one-hot encoding it afterwards.

In [4]:
Y = np.array(DF[target])
# We copy the original dataset so we do not mutate it
df_tmp = DF.copy()[list(set(wide_cols + deep_cols))]

# Build the crossed columns
crossed_columns = []
for cols in crossed_cols:
    colname = '_'.join(cols)
    df_tmp[colname] = df_tmp[cols].apply(lambda x: '-'.join(x), axis=1)
    crossed_columns.append(colname)

# Extract the categorical column names that can be one hot encoded later
categorical_columns = list(df_tmp.select_dtypes(include=['object']).columns)

Let's have a look at one of the crossed features:

In [5]:
df_tmp['education_occupation'].head()

0       Bachelors-Adm-clerical
1    Bachelors-Exec-managerial
2    HS-grad-Handlers-cleaners
3       11th-Handlers-cleaners
4     Bachelors-Prof-specialty
Name: education_occupation, dtype: object

When we one-hot encode this feature later, a given level will be 1 *if and only if* the two constituent features are 1. In other words, the level `Bachelors-Adm-clerical` of the `education_occupation` feature will be 1 *if and only if*, for that particular observation, `education=Bachelors` AND `occupation=Adm-clerical`.
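
The "if and only if" property can be checked on a small example. This is a sketch on a toy dataframe invented for illustration, not on the adult dataset: it builds the crossed column the same way as above and compares its one-hot indicator with the product of the two single-feature indicators.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'education':  ['Bachelors', 'Bachelors', 'HS-grad'],
    'occupation': ['Adm-clerical', 'Sales', 'Adm-clerical'],
})

# Same construction as above: join the constituent values into one string
toy['education_occupation'] = toy[['education', 'occupation']].apply(
    lambda row: '-'.join(row), axis=1)

# One-hot encode the crossed feature and the two original features
crossed = pd.get_dummies(toy['education_occupation']).astype(int)
singles = pd.get_dummies(toy[['education', 'occupation']]).astype(int)

# The crossed indicator equals the product (logical AND) of the two
# single-feature indicators
both = singles['education_Bachelors'] * singles['occupation_Adm-clerical']
assert (crossed['Bachelors-Adm-clerical'] == both).all()
```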

### 3-Label-encoding and splitting the dataframe into wide and deep.

We first label-encode the dataframe and keep a dictionary of the encodings for those columns that will be represented as embeddings (for the remaining ones it is unnecessary).

In [6]:
def label_encode(df, cols=None):
    """
    Helper function to label-encode some features of a given dataset.

    Parameters:
    --------
    df  (pd.DataFrame)
    cols (list): optional - columns to be label-encoded. Defaults to all
    columns of dtype object

    Returns:
    ________
    val_to_idx (dict) : Dictionary of dictionaries with useful information about
    the encoding mapping
    df (pd.DataFrame): mutated df with Label-encoded features.
    """

    if cols is None:
        cols = list(df.select_dtypes(include=['object']).columns)

    # unique values per column, in order of appearance
    val_types = dict()
    for c in cols:
        val_types[c] = df[c].unique()

    # .items() (rather than .iteritems()) works in both Python 2 and 3
    val_to_idx = dict()
    for k, v in val_types.items():
        val_to_idx[k] = {o: i for i, o in enumerate(v)}

    # replace each value with its integer index (df is modified in place)
    for k, v in val_to_idx.items():
        df[k] = df[k].apply(lambda x: v[x])

    return val_to_idx, df

# Encode the dataframe and get the encoding dictionary only for the
# deep_cols (for the wide_cols it is unnecessary)
encoding_dict, df_tmp = label_encode(df_tmp)
encoding_dict = {k: encoding_dict[k] for k in encoding_dict if k in deep_cols}
# One (column name, number of unique values, embedding dimension) per column
embeddings_input = []
for k, v in encoding_dict.items():
    embeddings_input.append((k, len(v), emb_dim[k]))
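
As a sanity check on what the encoding produces, here is a small sketch that uses `pandas.factorize` on a toy column. Note that this is a swapped-in equivalent for illustration (it also assigns integer codes in order of first appearance), not the `label_encode` function used in this repo, and the data and the embedding dimension are invented.

```python
import pandas as pd

# Toy column, invented for illustration
toy = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private', 'Self-emp']})

# factorize returns integer codes plus the unique values in order of appearance
codes, uniques = pd.factorize(toy['workclass'])
mapping = {value: idx for idx, value in enumerate(uniques)}

emb_dim = {'workclass': 10}  # assumed embedding dimension
# (column name, number of unique values, embedding dimension)
embeddings_input = [('workclass', len(mapping), emb_dim['workclass'])]
```

Each triple tells the deep side how many distinct categories a column has and how long the vector for each category should be.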

Then we split the dataframe into the wide and deep dataframes and keep the index of each deep column. This information will be used later, since we will slice the tensors by column index.

In [7]:
# select the deep_cols and get the column index that will be used later
# to slice the tensors. The explicit copy avoids pandas'
# SettingWithCopyWarning when the continuous columns are overwritten below
df_deep = df_tmp[deep_cols].copy()
deep_column_idx = {k:v for v,k in enumerate(df_deep.columns)}

# The continuous columns will be concatenated with the embeddings, so you
# might want to normalize them first
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
for cc in continuous_cols:
    # fit_transform returns a 2D array of shape (n, 1); flatten it back
    df_deep[cc] = scaler.fit_transform(df_deep[cc].values.reshape(-1,1)).ravel()

df_wide = df_tmp[wide_cols+crossed_columns]
del df_tmp

dummy_cols = [c for c in wide_cols+crossed_columns if c in categorical_columns]
df_wide = pd.get_dummies(df_wide, columns=dummy_cols)
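
To illustrate why the column index is kept, here is a sketch of how an array of deep features could be sliced by name through `deep_column_idx`. The array and the index values are toy stand-ins invented for this example; how the model in this repo actually consumes them is not shown here.

```python
import numpy as np

# Toy stand-ins for the objects built above (values invented)
deep_column_idx = {'education': 0, 'relationship': 1, 'age': 2}
X_deep = np.array([[0, 1, 0.54],
                   [1, 0, -0.48],
                   [0, 2, 0.10]])

# Categorical columns hold integer codes stored as floats; cast them back
# to integers before using them as lookup indices
education_codes = X_deep[:, deep_column_idx['education']].astype(int)

# Continuous columns are used as they are
age_values = X_deep[:, deep_column_idx['age']]
```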


### 4-Train/Test split and build the output dictionary

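The wide array, the deep array and the labels are split with the same seed and test size, so the rows of the three resulting train (and test) sets stay aligned. The splits are then stored, together with the encoding information, in a single output dictionary.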

In [8]:
from sklearn.model_selection import train_test_split
from collections import namedtuple

seed = 1981
X_train_deep, X_test_deep = train_test_split(df_deep.values, test_size=0.3, random_state=seed)
X_train_wide, X_test_wide = train_test_split(df_wide.values, test_size=0.3, random_state=seed)
y_train, y_test = train_test_split(Y, test_size=0.3, random_state=seed)

# Building the output dictionary
wd_dataset = dict()
train_dataset = namedtuple('train_dataset', 'wide, deep, labels')
test_dataset  = namedtuple('test_dataset' , 'wide, deep, labels')
wd_dataset['train_dataset'] = train_dataset(X_train_wide, X_train_deep, y_train)
wd_dataset['test_dataset']  = test_dataset(X_test_wide, X_test_deep, y_test)
wd_dataset['embeddings_input']  = embeddings_input
wd_dataset['deep_column_idx'] = deep_column_idx
wd_dataset['encoding_dict'] = encoding_dict
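
The alignment between the three splits relies on using the same seed and test size every time. The sketch below, on toy arrays invented for illustration, checks that this holds and shows an alternative: `train_test_split` accepts several arrays in one call, which makes the alignment explicit.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays; row i of each array describes the same observation
wide = np.arange(10).reshape(-1, 1)
deep = np.arange(10).reshape(-1, 1) * 10
labels = np.arange(10)

# Separate calls with the same seed and test_size give the same row split
w_tr, w_te = train_test_split(wide, test_size=0.3, random_state=1981)
y_tr, y_te = train_test_split(labels, test_size=0.3, random_state=1981)
assert (w_tr.ravel() == y_tr).all()

# A single call splits all arrays together
w_tr2, w_te2, d_tr2, d_te2, y_tr2, y_te2 = train_test_split(
    wide, deep, labels, test_size=0.3, random_state=1981)
assert (w_tr2.ravel() == y_tr2).all()
assert (d_tr2.ravel() == y_tr2 * 10).all()
```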

In [9]:
print(wd_dataset['train_dataset'])

train_dataset(wide=array([[46, 50,  0, ...,  0,  0,  0],
       [32, 45,  1, ...,  0,  0,  0],
       [30, 30,  0, ...,  0,  0,  0],
       ..., 
       [40, 40,  0, ...,  0,  0,  0],
       [45, 37,  1, ...,  0,  0,  0],
       [40, 45,  1, ...,  0,  0,  0]]), deep=array([[ 3.        ,  1.        ,  6.        , ...,  0.        ,
         0.53655844,  0.77292975],
       [ 0.        ,  0.        ,  2.        , ...,  0.        ,
        -0.48456647,  0.36942139],
       [ 1.        ,  4.        ,  2.        , ...,  0.        ,
        -0.63044146, -0.84110367],
       ..., 
       [ 1.        ,  0.        ,  2.        , ...,  0.        ,
         0.09893348, -0.03408696],
       [ 0.        ,  1.        ,  2.        , ...,  0.        ,
         0.46362095, -0.27619198],
       [ 0.        ,  1.        ,  2.        , ...,  0.        ,
         0.09893348,  0.36942139]]), labels=array([1, 0, 0, ..., 0, 0, 0]))


In [10]:
print(wd_dataset['embeddings_input'])

[('workclass', 9, 10), ('education', 16, 10), ('native_country', 42, 12), ('relationship', 6, 8), ('occupation', 15, 10)]


In [11]:
print(wd_dataset['deep_column_idx'])

{'hours_per_week': 6, 'native_country': 4, 'relationship': 1, 'age': 5, 'workclass': 2, 'education': 0, 'occupation': 3}


In [14]:
wd_dataset['encoding_dict']

{'education': {'10th': 12,
  '11th': 2,
  '12th': 15,
  '1st-4th': 13,
  '5th-6th': 11,
  '7th-8th': 8,
  '9th': 4,
  'Assoc-acdm': 6,
  'Assoc-voc': 7,
  'Bachelors': 0,
  'Doctorate': 9,
  'HS-grad': 1,
  'Masters': 3,
  'Preschool': 14,
  'Prof-school': 10,
  'Some-college': 5},
 'native_country': {'?': 4,
  'Cambodia': 17,
  'Canada': 10,
  'China': 28,
  'Columbia': 16,
  'Cuba': 1,
  'Dominican-Republic': 24,
  'Ecuador': 19,
  'El-Salvador': 25,
  'England': 9,
  'France': 26,
  'Germany': 11,
  'Greece': 35,
  'Guatemala': 27,
  'Haiti': 22,
  'Holand-Netherlands': 41,
  'Honduras': 8,
  'Hong': 38,
  'Hungary': 40,
  'India': 3,
  'Iran': 12,
  'Ireland': 39,
  'Italy': 14,
  'Jamaica': 2,
  'Japan': 29,
  'Laos': 20,
  'Mexico': 5,
  'Nicaragua': 36,
  'Outlying-US(Guam-USVI-etc)': 32,
  'Peru': 31,
  'Philippines': 13,
  'Poland': 15,
  'Portugal': 23,
  'Puerto-Rico': 7,
  'Scotland': 33,
  'South': 6,
  'Taiwan': 21,
  'Thailand': 18,
  'Trinadad&Tobago': 34,
  'United-Sta

Note again that all of this is wrapped up in a function in the module `wide_deep.data_utils`. Therefore, as long as your data is in a state similar to the original `DF` at the beginning of this notebook, you will be able to import it:

In [15]:
from wide_deep.data_utils import prepare_data

and simply call the function.