# Lab Assignment Five: Wide and Deep Network Architectures
 

#### Everett Cienkus, Blake Miller, Colin Weil

### 1. Preparation

#### 1.1 Define and Prepare Class Variables

Data from https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

In [26]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Load the data into memory and save it to a pandas data frame.
df = pd.read_csv('promotion_dataset/train.csv')
df = df.dropna()

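df_train, df_test = train_test_split(df, train_size=0.8)
# copy both splits so the column assignments below do not trigger
# pandas' SettingWithCopyWarning
df_train, df_test = df_train.copy(), df_test.copy()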

In [27]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
# ========================================================
# define objects that can encode each variable as integer
encoders = dict() # save each encoder in dictionary
categorical_headers = ['department','region','education','gender','recruitment_channel']
# train all encoders
for col in categorical_headers:
    df_train[col] = df_train[col].str.strip()
    df_test[col] = df_test[col].str.strip()
    encoders[col] = LabelEncoder() # save the encoder
    df_train[col+'_int'] = encoders[col].fit_transform(df_train[col])
    df_test[col+'_int'] = encoders[col].transform(df_test[col])
# ========================================================
# scale the numeric, continuous variables
numeric_headers = ['no_of_trainings', 'previous_year_rating', 'length_of_service', 'awards_won?', 'avg_training_score']
ss = StandardScaler()
df_train[numeric_headers] = ss.fit_transform(df_train[numeric_headers].values)
df_test[numeric_headers] = ss.transform(df_test[numeric_headers].values)


categorical_headers_ints = [x+'_int' for x in categorical_headers]

feature_columns = categorical_headers_ints+numeric_headers

import pprint
pp = pprint.PrettyPrinter(indent=4)
print(f"We will use the following {len(feature_columns)} features:")
pp.pprint(feature_columns)


We will use the following 10 features:
[   'department_int',
    'region_int',
    'education_int',
    'gender_int',
    'recruitment_channel_int',
    'no_of_trainings',
    'previous_year_rating',
    'length_of_service',
    'awards_won?',
    'avg_training_score']


#### 1.2 Combine into Cross-Product Features

Identify groups of features in your data that should be combined into cross-product features. Provide justification for why these features should be crossed (or why some features should not be crossed):

We cross the department column with the education column, and the recruitment_channel column with the education column. The effect of an education level on promotion plausibly depends on the department an employee works in and on how the employee was recruited, so each pair may carry signal that neither column carries alone.

In [28]:
for col in categorical_headers:
    vals = df_train[col].unique()
    print(col,'has', len(vals), 'unique values:')
    print(vals)

department has 9 unique values:
['Procurement' 'Sales & Marketing' 'Legal' 'R&D' 'Technology' 'Operations'
 'Analytics' 'HR' 'Finance']
region has 34 unique values:
['region_7' 'region_13' 'region_2' 'region_22' 'region_6' 'region_23'
 'region_29' 'region_5' 'region_16' 'region_4' 'region_15' 'region_21'
 'region_11' 'region_27' 'region_24' 'region_31' 'region_9' 'region_33'
 'region_1' 'region_14' 'region_26' 'region_28' 'region_10' 'region_20'
 'region_32' 'region_17' 'region_30' 'region_12' 'region_8' 'region_25'
 'region_19' 'region_3' 'region_34' 'region_18']
education has 3 unique values:
["Bachelor's" "Master's & above" 'Below Secondary']
gender has 2 unique values:
['m' 'f']
recruitment_channel has 3 unique values:
['other' 'sourcing' 'referred']


In [29]:
# a quick example of crossing some columns

cross_columns = [
    ['department','education'],
    ['recruitment_channel','education']
]

# cross each set of columns in the list above
cross_col_df_names = []
for cols_list in cross_columns:
    # encode as ints for the embedding
    enc = LabelEncoder()

    # 1. create crossed labels by join operation
    X_crossed_train = df_train[cols_list].apply(lambda x: '_'.join(x), axis=1)
    X_crossed_test = df_test[cols_list].apply(lambda x: '_'.join(x), axis=1)

    # get a nice name for this new crossed column
    cross_col_name = '_'.join(cols_list)

    # 2. encode as integers, stacking all possibilities
    enc.fit(np.hstack((X_crossed_train.to_numpy(),  X_crossed_test.to_numpy())))

    # 3. Save into dataframe with new name
    df_train[cross_col_name] = enc.transform(X_crossed_train)
    df_test[cross_col_name] = enc.transform(X_crossed_test)

    # keep track of the new names of the crossed columns
    cross_col_df_names.append(cross_col_name)

cross_col_df_names

['department_education', 'recruitment_channel_education']

#### 1.3 Choose Metrics to Evaluate Performance

Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.
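
To make the case against accuracy concrete, the sketch below recomputes several metrics from the test-set confusion matrix printed later in this notebook (`[[8855, 13], [767, 97]]`). Rebuilding label vectors from the four counts is only an illustrative shortcut, and the variable names are our own.

```python
import numpy as np
from sklearn import metrics as mt

# Counts copied from the confusion matrix reported for the first model:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
tn, fp, fn, tp = 8855, 13, 767, 97

# Rebuild label vectors that reproduce exactly those counts.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

accuracy = mt.accuracy_score(y_true, y_pred)
recall = mt.recall_score(y_true, y_pred)
precision = mt.precision_score(y_true, y_pred)
f1 = mt.f1_score(y_true, y_pred)

# Accuracy a model would get by predicting "not promoted" for everyone.
baseline_accuracy = (tn + fp) / len(y_true)

print(f"accuracy={accuracy:.3f} baseline={baseline_accuracy:.3f}")
print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```

Accuracy lands within one percentage point of the all-negative baseline, while recall on the promoted class shows that most promotions are missed. That gap is why a metric focused on the positive class is easier to defend here.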

#### 1.4 Choose Method for Dividing Data

Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. Argue why your cross validation method is a realistic mirroring of how an algorithm would be used in practice.
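
One option worth weighing is stratification, which keeps the share of promoted employees the same in every partition. The sketch below uses synthetic stand-in data with roughly the same class imbalance as ours (the arrays `X` and `y` are made up for illustration) and shows both a stratified 80/20 hold-out split and stratified 10-fold cross validation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 5,000 rows, 10 features, about 9% positives.
X = rng.normal(size=(5000, 10))
y = (rng.random(5000) < 0.09).astype(int)

# Stratified 80/20 hold-out split keeps the positive rate equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

# Stratified 10-fold CV on the training portion only.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_rates = [y_tr[val_idx].mean() for _, val_idx in skf.split(X_tr, y_tr)]

print(f"overall positive rate: {y.mean():.4f}")
print(f"train / test rates:    {y_tr.mean():.4f} / {y_te.mean():.4f}")
print(f"fold rates min / max:  {min(fold_rates):.4f} / {max(fold_rates):.4f}")
```

With real data, the same calls would take `df[feature_columns]` and `df['is_promoted']` in place of the synthetic arrays.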

In [30]:
X_train = df_train[feature_columns].to_numpy()
X_test = df_test[feature_columns].to_numpy()

y_train = df_train['is_promoted'].to_numpy()
y_test = df_test['is_promoted'].to_numpy()


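After dropping rows with missing values, our dataset has 48,660 rows (38,928 for training and 9,732 for testing), so a single 80/20 split still leaves several thousand examples for evaluation. The classes are imbalanced, with only about 9% of rows marked as promoted, so the class ratio in each split is worth checking.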

### 2. Modeling

#### 2.1 Create Three Combined Wide and Deep Networks using Keras

Create at least three combined wide and deep networks to classify your data using Keras. Visualize the performance of the network on the training data and validation data in the same plot versus the training iterations. Note: use the "history" return parameter that is part of Keras "fit" function to easily access this data.
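
A minimal plotting helper for this step might look like the following. It assumes only that the dictionary passed in has the same layout as `history.history` returned by Keras `fit` (one list per metric, with validation metrics prefixed by `val_`). The function name `plot_history` and the example numbers are ours, not results from our models.

```python
import matplotlib.pyplot as plt

def plot_history(history_dict, metric="loss"):
    """Plot a training metric and its validation counterpart per epoch."""
    epochs = range(1, len(history_dict[metric]) + 1)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(epochs, history_dict[metric], label=f"training {metric}")
    ax.plot(epochs, history_dict[f"val_{metric}"], label=f"validation {metric}")
    ax.set_xlabel("epoch")
    ax.set_ylabel(metric)
    ax.legend()
    return ax

# Made-up numbers in the same layout as `history.history` from Keras `fit`.
example_history = {
    "loss": [0.090, 0.078, 0.074, 0.072],
    "val_loss": [0.082, 0.077, 0.075, 0.075],
}
ax = plot_history(example_history, metric="loss")
```

In the notebook the call would be `plot_history(history.history, "loss")`. Keras names metric keys automatically (for example `recall` or `recall_1`), so it is worth printing `history.history.keys()` before plotting recall.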

In [31]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation, Input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import concatenate
print(tf.__version__)
print(keras.__version__)

2.9.1
2.9.0


In [39]:
# get crossed columns
X_train_crossed = df_train[cross_col_df_names].to_numpy()
X_test_crossed = df_test[cross_col_df_names].to_numpy()
# save categorical features
X_train_cat = df_train[categorical_headers_ints].to_numpy()
X_test_cat = df_test[categorical_headers_ints].to_numpy()
# and save off the numeric features
X_train_num =  df_train[numeric_headers].to_numpy()
X_test_num =  df_test[numeric_headers].to_numpy()

# we need to create separate lists for each branch
crossed_outputs = []

# CROSSED DATA INPUT
input_crossed = Input(shape=(X_train_crossed.shape[1],), dtype='int64', name='wide_inputs')
for idx,col in enumerate(cross_col_df_names):

    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = max(df_train[col].max(),df_test[col].max())+1


    # select column idx of the input, equivalent to input_crossed[:, idx]
    x = tf.gather(input_crossed, idx, axis=1)

    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)

    # save these outputs to concatenate later
    crossed_outputs.append(x)


# now concatenate the crossed-column embeddings to form the wide branch
wide_branch = concatenate(crossed_outputs, name='wide_concat')

# start a new list of outputs for the deep branch
all_deep_branch_outputs = []

# CATEGORICAL DATA INPUT
input_cat = Input(shape=(X_train_cat.shape[1],), dtype='int64', name='categorical_input')
for idx,col in enumerate(categorical_headers_ints):

    # track what the maximum integer value will be for this variable
    # which is the same as the number of categories
    N = df_train[col].max()+1

    # select column idx of the input, equivalent to input_cat[:, idx]
    x = tf.gather(input_cat, idx, axis=1)

    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N,
                  output_dim=int(np.sqrt(N)),
                  input_length=1, name=col+'_embed')(x)

    # save these outputs to concatenate later
    all_deep_branch_outputs.append(x)

# NUMERIC DATA INPUT
# create dense input branch for numeric
input_num = Input(shape=(X_train_num.shape[1],), name='numeric')
x_dense = Dense(units=22, activation='relu',name='num_1')(input_num)

all_deep_branch_outputs.append(x_dense)


# merge the deep branches together
deep_branch = concatenate(all_deep_branch_outputs,name='concat_embeds')
deep_branch = Dense(units=50,activation='relu', name='deep1')(deep_branch)
deep_branch = Dense(units=25,activation='relu', name='deep2')(deep_branch)
deep_branch = Dense(units=10,activation='relu', name='deep3')(deep_branch)

# merge the deep and wide branch
final_branch = concatenate([wide_branch, deep_branch],
                           name='concat_deep_wide')
final_branch = Dense(units=1,activation='sigmoid',
                     name='combined')(final_branch)

model = Model(inputs=[input_crossed,input_cat,input_num],
              outputs=final_branch)

model.compile(optimizer='sgd',
              loss='mean_squared_error',
              metrics=[tf.keras.metrics.Recall()])



In [35]:
history = model.fit([X_train_crossed,X_train_cat,X_train_num],
                    y_train,
                    epochs=15,
                    batch_size=32,
                    verbose=1,
                    validation_data = ([X_test_crossed,X_test_cat,X_test_num],y_test))

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15

#### 2.2 Investigate Performance by Altering the Number of Layers in the Deep Branch of the Network

Investigate generalization performance by altering the number of layers in the deep branch of the network. Try at least two different number of layers. Use the method of cross validation and evaluation metric that you argued for at the beginning of the lab to select the number of layers that performs superiorly. 
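
The shape of a depth comparison can be sketched without TensorFlow. Here scikit-learn's `MLPClassifier` stands in for the Keras wide and deep model, the data are synthetic, and recall is used as the fold score. All of these are assumptions made so the example runs on its own, not our final procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic, imbalanced stand-in data (about 90% negatives).
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Candidate depths for the hidden stack, mirroring the 50/25/10 sizes used above.
candidates = {
    "2 layers": (50, 25),
    "3 layers": (50, 25, 10),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, hidden in candidates.items():
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=300, random_state=0)
    # Score each fold with recall on the positive class.
    results[name] = cross_val_score(clf, X, y, cv=cv, scoring="recall")

for name, scores in results.items():
    print(f"{name}: recall {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because both candidates are scored on the same folds, the per-fold scores in `results` are paired, which is what a later statistical comparison would need.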


In [42]:
from sklearn import metrics as mt
yhat = np.round(model.predict([X_test_crossed,X_test_cat,X_test_num]))
print(mt.confusion_matrix(y_test,yhat))
print(mt.classification_report(y_test,yhat))
# class counts in the training labels, to show the class imbalance
unique_y, counts_y = np.unique(y_train, return_counts=True)
print(np.asarray((unique_y, counts_y)).T)

[[8855   13]
 [ 767   97]]
              precision    recall  f1-score   support

           0       0.92      1.00      0.96      8868
           1       0.88      0.11      0.20       864

    accuracy                           0.92      9732
   macro avg       0.90      0.56      0.58      9732
weighted avg       0.92      0.92      0.89      9732

[[    0 35560]
 [    1  3368]]


#### 2.3 Investigate Performance of the Best Wide and Deep Network to Multi-Layer Perceptron

Compare the performance of your best wide and deep network to a standard multi-layer perceptron (MLP). Alternatively, you can compare to a network without the wide branch (i.e., just the deep network). For classification tasks, compare using the receiver operating characteristic and area under the curve. For regression tasks, use Bland-Altman plots and residual variance calculations.  Use proper statistical methods to compare the performance of different models.  
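
One way to set up this comparison is sketched below with hypothetical scores for two models, `scores_a` and `scores_b`, on the same synthetic test labels. It computes each AUC and then a paired bootstrap confidence interval for the AUC difference. The bootstrap is our choice of statistical method for the illustration, and other tests may be equally appropriate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
n = 2000
y_true = (rng.random(n) < 0.09).astype(int)

# Hypothetical scores: model A separates the classes better than model B.
scores_a = rng.normal(loc=1.5 * y_true, scale=1.0)
scores_b = rng.normal(loc=0.8 * y_true, scale=1.0)

auc_a = roc_auc_score(y_true, scores_a)
auc_b = roc_auc_score(y_true, scores_b)
fpr_a, tpr_a, _ = roc_curve(y_true, scores_a)  # points for plotting the curve

# Paired bootstrap: resample test rows, recompute both AUCs on the same rows.
diffs = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)
    if y_true[idx].min() == y_true[idx].max():
        continue  # skip resamples that contain only one class
    diffs.append(roc_auc_score(y_true[idx], scores_a[idx])
                 - roc_auc_score(y_true[idx], scores_b[idx]))
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])

print(f"AUC A={auc_a:.3f}, AUC B={auc_b:.3f}")
print(f"95% bootstrap CI for AUC difference: [{ci_low:.3f}, {ci_high:.3f}]")
```

If the interval excludes zero, the difference in AUC is unlikely to be an artifact of which test rows were sampled. With real models, the scores would come from `model.predict` on the shared test inputs.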

### 3. Capturing the Embedding Weights from the Deep Network

Capture the embedding weights from the deep network and (if needed) perform dimensionality reduction on the output of these embedding layers (only if needed). That is, pass the observations into the network, save the embedded weights (called embeddings), and then perform  dimensionality reduction in order to visualize results. Visualize and explain any clusters in the data.
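
As a sketch of the visualization step, the block below projects a department embedding matrix to two dimensions with PCA. The weights here are random stand-ins so the example runs without a trained model. The commented line shows how the real matrix could be read from Keras, assuming the layer keeps the name `department_int_embed` given to it in the model code above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Category names in LabelEncoder order (sorted), as printed for `department`.
departments = sorted([
    "Procurement", "Sales & Marketing", "Legal", "R&D", "Technology",
    "Operations", "Analytics", "HR", "Finance",
])

# Stand-in for the learned weights, shape (9 categories, int(sqrt(9)) = 3 dims).
# In the notebook this would instead be, assuming the layer naming used above:
#   weights = model.get_layer("department_int_embed").get_weights()[0]
rng = np.random.default_rng(0)
weights = rng.normal(size=(len(departments), 3))

# Project the embedding vectors to 2-D for a scatter plot.
coords = PCA(n_components=2).fit_transform(weights)

for name, (x, y) in zip(departments, coords):
    print(f"{name:18s} {x:+.3f} {y:+.3f}")
```

Row `i` of the embedding matrix corresponds to integer code `i` from the `LabelEncoder`, which is why the names are sorted before being paired with the coordinates.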