# How to use the model

It helps to have gone through demo1 and demo2 first, but you can also learn how to use the model just by reading this notebook.

I will use three examples to illustrate the different set-ups supported by this PyTorch implementation of Wide and Deep.

## 0. Load the data

Note that as long as your dataset is in a state similar to that of `adult_data.csv` below (NaN removed, missing values imputed, etc.), you are "good to go".

In [8]:
from __future__ import print_function
import pandas as pd
import numpy as np

DF = pd.read_csv('data/adult_data.csv')

DF.head()

| Unnamed: 0 | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket | income_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 0 |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | 0 |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | 0 |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | 0 |
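
To make the note about dataset state concrete, here is a minimal cleaning sketch on a toy frame. The helper `basic_clean` and the median/mode imputation choices are assumptions for illustration and are not part of the `wide_deep` package. The `"?"` placeholder is assumed to mark missing values, as in the raw Adult data.

```python
import numpy as np
import pandas as pd


def basic_clean(df):
    """Hypothetical helper: impute missing values so that no NaN remains."""
    out = df.copy()
    # ASSUMPTION: "?" marks a missing value and should be treated like NaN
    out = out.replace("?", np.nan)
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric columns: fill with the column median
            out[col] = out[col].fillna(out[col].median())
        else:
            # Categorical columns: fill with the most frequent value
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


toy = pd.DataFrame({
    "age": [39, np.nan, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
})
clean = basic_clean(toy)
print(clean)
```

Any imputation strategy works as long as the frame passed on to the next steps has no missing values left.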


## 1. Logistic regression with varying embedding dimensions, no dropout and Adam optimizer.

#### 1_1. Set the experiment

In [9]:
# Let's define a target for logistic regression:
DF['income_label'] = (DF["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)

# Experiment set up
wide_cols = ['age','hours_per_week','education', 'relationship','workclass',
             'occupation','native_country','gender']
crossed_cols = (['education', 'occupation'], ['native_country', 'occupation'])
embeddings_cols = [('education',10), ('relationship',8), ('workclass',10),
                    ('occupation',10),('native_country',10)]
continuous_cols = ["age","hours_per_week"]
target = 'income_label'
method = 'logistic'
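
As a rough picture of what `crossed_cols` means, the sketch below joins each pair of categorical columns into a single combined feature and one-hot encodes it. The helper `add_crossed` and the naming scheme are assumptions for illustration; the actual `prepare_data` implementation may build crossed features differently.

```python
import pandas as pd

toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors"],
    "occupation": ["Adm-clerical", "Handlers-cleaners", "Exec-managerial"],
})
crossed_cols = (["education", "occupation"],)


def add_crossed(df, crossed_cols):
    """Hypothetical helper: add one combined column per crossed pair."""
    out = df.copy()
    names = []
    for cols in crossed_cols:
        name = "_".join(cols)
        # Join the values of the crossed columns row by row
        out[name] = out[cols].astype(str).apply(lambda row: "-".join(row), axis=1)
        names.append(name)
    return out, names


crossed_df, crossed_names = add_crossed(toy, crossed_cols)
# One-hot encode the crossed feature, as a wide (linear) model would consume it
wide = pd.get_dummies(crossed_df[crossed_names])
print(wide.shape)
```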

#### 1_2. Prepare the data

In [10]:
from wide_deep.data_utils import prepare_data

# just call prepare_data
wd_dataset = prepare_data(DF, wide_cols,crossed_cols,embeddings_cols,continuous_cols,target,scale=True)
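
The embedding columns have to be turned into integer indices before they can be looked up in an embedding layer. The sketch below shows one way this could be done. The helper `label_encode` and the `(column, n_unique, dim)` tuple layout are assumptions, chosen only because they are consistent with keys such as `encoding_dict` and `embeddings_input` used later; they are not taken from the package source.

```python
import pandas as pd

toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "11th"],
    "relationship": ["Husband", "Wife", "Husband", "Husband"],
})
embeddings_cols = [("education", 10), ("relationship", 8)]


def label_encode(df, embeddings_cols):
    """Hypothetical helper: map each category to an integer index."""
    out = df.copy()
    encoding_dict = {}
    embeddings_input = []
    for col, dim in embeddings_cols:
        # Sorted so that the mapping is deterministic
        mapping = {value: idx for idx, value in enumerate(sorted(out[col].unique()))}
        out[col] = out[col].map(mapping)
        encoding_dict[col] = mapping
        # ASSUMPTION: (column name, number of categories, embedding dimension)
        embeddings_input.append((col, len(mapping), dim))
    return out, encoding_dict, embeddings_input


encoded, encoding_dict, embeddings_input = label_encode(toy, embeddings_cols)
print(embeddings_input)
```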

#### 1_3. Build the model

In [11]:
# Network set up
wide_dim = wd_dataset['train_dataset'].wide.shape[1]
n_class=1 # for logistic and regression
deep_column_idx = wd_dataset['deep_column_idx']
embeddings_input= wd_dataset['embeddings_input']
encoding_dict   = wd_dataset['encoding_dict']
hidden_layers = [100,50]
dropout = None

# Build the model. Again you just need to call WideDeep
from wide_deep.torch_model import WideDeep
model = WideDeep(wide_dim,embeddings_input,continuous_cols,deep_column_idx,hidden_layers, dropout, encoding_dict,n_class)

# I have included a compile method if you want to change the fitting method or the optimizer
model.compile(method=method, optimizer="Adam")
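
The three `method` values used in this notebook (`'logistic'`, `'multiclass'`, `'regression'`) plausibly correspond to binary cross-entropy, categorical cross-entropy and mean squared error. That mapping is an assumption, not something confirmed by the package source, and the NumPy functions below are only a sketch of those three losses.

```python
import numpy as np


def binary_cross_entropy(probs, targets):
    """Mean binary cross-entropy for predicted probabilities in (0, 1)."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1 - 1e-12)
    targets = np.asarray(targets, dtype=float)
    return float(-np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs)))


def cross_entropy(probs, targets):
    """Mean cross-entropy; probs has shape (n, n_class), targets holds class ids."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    targets = np.asarray(targets, dtype=int)
    return float(-np.mean(np.log(probs[np.arange(len(targets)), targets])))


def mean_squared_error(preds, targets):
    """Mean squared error between predictions and targets."""
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((preds - targets) ** 2))


# ASSUMPTION: hypothetical mapping from `method` to a loss function
LOSS_BY_METHOD = {
    "logistic": binary_cross_entropy,
    "multiclass": cross_entropy,
    "regression": mean_squared_error,
}

print(LOSS_BY_METHOD["regression"]([1.0, 2.0], [1.0, 4.0]))
```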

Let's have a look:

In [12]:
print(model)

WideDeep (
  (emb_layer_workclass): Embedding(9, 10)
  (emb_layer_education): Embedding(16, 10)
  (emb_layer_native_country): Embedding(42, 10)
  (emb_layer_relationship): Embedding(6, 8)
  (emb_layer_occupation): Embedding(15, 10)
  (linear_1): Linear (50 -> 100)
  (linear_2): Linear (100 -> 50)
  (output): Linear (848 -> 1)
)
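
The layer sizes printed above can be checked with a little arithmetic. The input to `linear_1` is the sum of the embedding dimensions plus the number of continuous columns. The second part of the sketch assumes that the output layer receives the last hidden layer concatenated with the wide features, which is the usual Wide and Deep layout but is not stated explicitly in this notebook.

```python
# Values copied from the experiment set-up and the print(model) output above
embeddings_cols = [('education', 10), ('relationship', 8), ('workclass', 10),
                   ('occupation', 10), ('native_country', 10)]
continuous_cols = ["age", "hours_per_week"]
hidden_layers = [100, 50]
output_in_features = 848  # from "(output): Linear (848 -> 1)"

# Deep input: all embeddings concatenated with the continuous columns
deep_input_dim = sum(dim for _, dim in embeddings_cols) + len(continuous_cols)

# ASSUMPTION: output layer input = last hidden layer + wide features
implied_wide_dim = output_in_features - hidden_layers[-1]

print(deep_input_dim)    # 50, matching "(linear_1): Linear (50 -> 100)"
print(implied_wide_dim)  # 798
```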


#### 1_4. Fit and Predict

In [13]:
train_dataset = wd_dataset['train_dataset']
test_dataset  = wd_dataset['test_dataset']

# As with your usual scikit-learn model, simply call fit/predict
model.fit(dataset=train_dataset, n_epochs=10, batch_size=64)
pred = model.predict(dataset=test_dataset)

from sklearn.metrics import accuracy_score
print(accuracy_score(test_dataset.labels, pred))  # scikit-learn order: (y_true, y_pred)


Epoch 1 of 10, Loss: 0.215, accuracy: 0.8175
Epoch 2 of 10, Loss: 0.356, accuracy: 0.8396
Epoch 3 of 10, Loss: 0.229, accuracy: 0.842
Epoch 4 of 10, Loss: 0.531, accuracy: 0.8425
Epoch 5 of 10, Loss: 0.197, accuracy: 0.8438
Epoch 6 of 10, Loss: 0.134, accuracy: 0.844
Epoch 7 of 10, Loss: 0.454, accuracy: 0.8463
Epoch 8 of 10, Loss: 0.156, accuracy: 0.8464
Epoch 9 of 10, Loss: 0.217, accuracy: 0.8452
Epoch 10 of 10, Loss: 0.445, accuracy: 0.8472
0.838258377124
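
For the logistic case, class predictions are typically obtained by thresholding the predicted probability. The sketch below assumes a 0.5 threshold, which may or may not be what `model.predict` does internally, and computes accuracy by hand on made-up values.

```python
import numpy as np

# Made-up probabilities and labels, for illustration only
probs = np.array([0.10, 0.70, 0.50, 0.93])
labels = np.array([0, 1, 1, 1])

# ASSUMPTION: a probability strictly above 0.5 is mapped to class 1
preds = (probs > 0.5).astype(int)

# Accuracy is the fraction of predictions that match the labels
accuracy = float(np.mean(preds == labels))
print(preds, accuracy)
```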


I have included a method to easily get the learned embeddings. It returns a dictionary whose keys are the column's category values and whose values are the corresponding embedding vectors.

In [14]:
model.get_embeddings('education')

{'10th': array([-0.18979575,  1.4436841 , -0.50139612, -0.85227281,  1.36461151,
         0.3559041 , -0.58077377,  0.57836998,  0.09822965,  0.45356399], dtype=float32),
 '11th': array([ 0.45051831, -1.17895794, -0.70969492, -0.41443011, -0.54592711,
         2.06732845,  0.97312623, -1.66578746,  0.15288909, -0.13219695], dtype=float32),
 '12th': array([-0.55539042,  1.34430635, -0.14818592, -1.01501787, -1.85061646,
        -1.42545903,  0.30155715,  1.02573991, -0.42215505,  1.02378154], dtype=float32),
 '1st-4th': array([ 1.84887922,  1.20987594,  0.2984882 , -1.79686284,  0.59199595,
        -0.09441201, -0.26749009,  0.20149775, -0.73544145, -0.51700133], dtype=float32),
 '5th-6th': array([-0.05392418,  0.36236417,  0.47461176,  0.41363204, -0.2278301 ,
        -0.5376063 ,  2.63320708,  2.04696202, -0.49895033, -0.29155737], dtype=float32),
 '7th-8th': array([ 0.12547047,  0.05075515, -1.44649279, -1.56195939, -1.32460868,
        -0.34222227,  0.88958579,  0.47252822, -0.09495

## 2. Multiclass classification with fixed embedding dimensions (10), varying dropout and RMSprop.

Let's first define a target for multiclass classification. Note that **this is only for illustration purposes**.
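
The next cell builds the target with `pd.cut`. As a quick check of how the bin edges `[0, 25, 50, 90]` behave, the sketch below bins a few made-up ages. By default `pd.cut` uses right-inclusive intervals, so the groups are (0, 25], (25, 50] and (50, 90].

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges
ages = pd.Series([20, 25, 26, 50, 51, 90])
age_groups = [0, 25, 50, 90]
age_labels = list(range(len(age_groups) - 1))  # [0, 1, 2]

groups = pd.cut(ages, age_groups, labels=age_labels)
print(groups.astype(int).tolist())  # [0, 0, 1, 1, 2, 2]
```

Ages outside (0, 90] would be mapped to NaN, so it is worth checking the range of the column before binning.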

In [15]:
# Let's define age groups
age_groups = [0, 25, 50, 90]
age_labels = list(range(len(age_groups) - 1))  # explicit list works in Python 2 and 3
DF['age_group'] = pd.cut(DF['age'], age_groups, labels=age_labels)

# Set the experiment
wide_cols = ['hours_per_week','education', 'relationship','workclass',
             'occupation','native_country','gender']
crossed_cols = (['education', 'occupation'], ['native_country', 'occupation'])
embeddings_cols = ['education', 'relationship','workclass','occupation','native_country']
continuous_cols = ["hours_per_week"]
target = 'age_group'
method = 'multiclass'

wd_dataset = prepare_data(DF,wide_cols,crossed_cols,embeddings_cols,continuous_cols,target,scale=True,def_dim=10)

wide_dim = wd_dataset['train_dataset'].wide.shape[1]
n_class=3
deep_column_idx = wd_dataset['deep_column_idx']
embeddings_input= wd_dataset['embeddings_input']
encoding_dict   = wd_dataset['encoding_dict']
hidden_layers = [100,50]
dropout = [0.5, 0.2]

model = WideDeep(wide_dim,embeddings_input,continuous_cols,deep_column_idx,hidden_layers,dropout,encoding_dict,n_class)
model.compile(method=method, optimizer="RMSprop")

# Let's have a look at the model
print(model)

WideDeep (
  (emb_layer_workclass): Embedding(9, 10)
  (emb_layer_education): Embedding(16, 10)
  (emb_layer_native_country): Embedding(42, 10)
  (emb_layer_relationship): Embedding(6, 10)
  (emb_layer_occupation): Embedding(15, 10)
  (linear_1): Linear (51 -> 100)
  (linear_1_drop): Dropout (p = 0.5)
  (linear_2): Linear (100 -> 50)
  (linear_2_drop): Dropout (p = 0.2)
  (output): Linear (847 -> 3)
)
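
The `Dropout` layers printed above randomly zero activations during training. The NumPy sketch below illustrates the "inverted dropout" scheme, in which kept activations are scaled by 1/(1-p) during training and nothing changes at evaluation time. It is an illustration of the technique, not the code that runs inside the model, and the function name `dropout` is hypothetical.

```python
import numpy as np


def dropout(x, p, training, rng):
    """Inverted dropout: zero each element with probability p while training."""
    if not training or p == 0.0:
        # At evaluation time the input is passed through unchanged
        return x
    keep_mask = rng.random(x.shape) >= p
    # Scale the kept activations so the expected value matches the input
    return x * keep_mask / (1.0 - p)


rng = np.random.default_rng(0)
x = np.ones((4, 5))

train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False, rng=rng)
print(train_out)
```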


In [16]:
train_dataset = wd_dataset['train_dataset']
model.fit(dataset=train_dataset, n_epochs=10, batch_size=64)
test_dataset  = wd_dataset['test_dataset']

# The model object also has a predict_proba method in case you want probabilities instead of class labels
pred = model.predict_proba(test_dataset)
print('\n {}'.format(pred))

Epoch 1 of 10, Loss: 0.699, accuracy: 0.6737
Epoch 2 of 10, Loss: 0.822, accuracy: 0.6855
Epoch 3 of 10, Loss: 0.717, accuracy: 0.6879
Epoch 4 of 10, Loss: 1.016, accuracy: 0.6931
Epoch 5 of 10, Loss: 0.842, accuracy: 0.6944
Epoch 6 of 10, Loss: 0.805, accuracy: 0.6942
Epoch 7 of 10, Loss: 0.783, accuracy: 0.6966
Epoch 8 of 10, Loss: 0.859, accuracy: 0.6975
Epoch 9 of 10, Loss: 0.929, accuracy: 0.6992
Epoch 10 of 10, Loss: 0.826, accuracy: 0.7006

 [[  9.99074221e-01   9.25758795e-04   3.93159311e-10]
 [  2.88534306e-13   1.00000000e+00   1.56172240e-15]
 [  1.73595769e-08   1.00000000e+00   4.79524920e-10]
 ..., 
 [  8.90251540e-04   9.71086264e-01   2.80234683e-02]
 [  2.58150152e-07   9.99999106e-01   6.09748270e-07]
 [  8.45011652e-01   1.54977426e-01   1.09334724e-05]]
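
Each row of the output above sums to one across the three classes. The sketch below shows how such probabilities are usually obtained from raw scores with a softmax, and how a class label follows by taking the argmax. Whether the model computes them exactly this way is an assumption. The first, second and last rows printed above are reused for the argmax step.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up raw scores for two rows and three classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 3.0, 0.0]])
probs = softmax(logits)

# Three rows copied from the predict_proba output above
printed = np.array([[9.99074221e-01, 9.25758795e-04, 3.93159311e-10],
                    [2.88534306e-13, 1.00000000e+00, 1.56172240e-15],
                    [8.45011652e-01, 1.54977426e-01, 1.09334724e-05]])
classes = printed.argmax(axis=1)
print(classes)  # [0 1 0]
```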


In [17]:
from sklearn.metrics import f1_score, accuracy_score

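# Note: scikit-learn metrics are documented as metric(y_true, y_pred). Weighted F1 is
# not symmetric in its arguments, so the score below reflects the order used here.
print("\n {}".format(f1_score(model.predict(test_dataset), test_dataset.labels, average="weighted")))

print("\n {}".format(accuracy_score(model.predict(test_dataset), test_dataset.labels)))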


 0.733689593553

 0.700402647922
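
Accuracy does not depend on the order in which labels and predictions are passed, but a weighted F1 does, because the per-class scores are weighted by the class counts of the first argument. The sketch below is a small NumPy reimplementation on made-up labels. The function `weighted_f1` is illustrative and follows the weighting by true-label support that scikit-learn documents for `average="weighted"`.

```python
import numpy as np


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class count in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for c in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        f1 = 2.0 * tp / denom if denom > 0 else 0.0
        # Weight by the number of true samples of this class
        total += np.sum(y_true == c) * f1
    return float(total / len(y_true))


# Made-up labels and predictions
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 1]

f1_correct_order = weighted_f1(y_true, y_pred)
f1_swapped_order = weighted_f1(y_pred, y_true)
print(f1_correct_order, f1_swapped_order)
```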


## 3. Linear regression with varying embedding dimensions and varying dropout.

Again, bear in mind that here we use `age` as the target just **for illustration purposes**.

In [18]:
# Set the experiment
wide_cols = ['hours_per_week','education', 'relationship','workclass',
             'occupation','native_country','gender']
crossed_cols  = (['education', 'occupation'], ['native_country', 'occupation'])
embeddings_cols  = [('education',10), ('relationship',8), ('workclass',10),
                    ('occupation',10),('native_country',10)]
continuous_cols = ["hours_per_week"]
target = 'age'
method = 'regression'

# Prepare the dataset
wd_dataset = prepare_data(DF, wide_cols,crossed_cols,embeddings_cols,continuous_cols,target)

wide_dim = wd_dataset['train_dataset'].wide.shape[1]
n_class=1
deep_column_idx = wd_dataset['deep_column_idx']
embeddings_input= wd_dataset['embeddings_input']
encoding_dict   = wd_dataset['encoding_dict']
hidden_layers = [100,50]
dropout = [0.5, 0.2]
model = WideDeep(wide_dim,embeddings_input,continuous_cols,deep_column_idx,hidden_layers,dropout,encoding_dict,n_class)
model.compile(method=method)
print(model)

WideDeep (
  (emb_layer_workclass): Embedding(9, 10)
  (emb_layer_education): Embedding(16, 10)
  (emb_layer_native_country): Embedding(42, 10)
  (emb_layer_relationship): Embedding(6, 8)
  (emb_layer_occupation): Embedding(15, 10)
  (linear_1): Linear (49 -> 100)
  (linear_1_drop): Dropout (p = 0.5)
  (linear_2): Linear (100 -> 50)
  (linear_2_drop): Dropout (p = 0.2)
  (output): Linear (847 -> 1)
)


In [19]:
train_dataset = wd_dataset['train_dataset']
model.fit(dataset=train_dataset, n_epochs=10, batch_size=64)

test_dataset  = wd_dataset['test_dataset']
pred = model.predict(test_dataset)

from sklearn.metrics import mean_squared_error
print("\n RMSE: {}".format(np.sqrt(mean_squared_error(pred, test_dataset.labels))))

Epoch 1 of 10, Loss: 151.295
Epoch 2 of 10, Loss: 108.425
Epoch 3 of 10, Loss: 82.35
Epoch 4 of 10, Loss: 36.353
Epoch 5 of 10, Loss: 50.06
Epoch 6 of 10, Loss: 147.494
Epoch 7 of 10, Loss: 176.602
Epoch 8 of 10, Loss: 167.916
Epoch 9 of 10, Loss: 40.365
Epoch 10 of 10, Loss: 107.579

 RMSE: 11.2378476775
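
The RMSE reported above is the square root of the mean squared error, so it is expressed in the same units as the target (years of age here). A minimal NumPy sketch on made-up values:

```python
import numpy as np

# Made-up predictions and true ages, for illustration only
pred = np.array([30.0, 40.0, 50.0])
true = np.array([33.0, 36.0, 50.0])

# Errors are -3, 4 and 0, so the mean squared error is 25 / 3
rmse = float(np.sqrt(np.mean((pred - true) ** 2)))
print("RMSE: {:.4f}".format(rmse))
```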
