# Lab Assignment Five: Wide and Deep Network Architectures

By : Katie Rink

## Preparation

Data Set : https://www.kaggle.com/datasets/whenamancodes/infoseccyber-security-salaries?select=Cyber_salaries.csv

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import metrics as mt
import tensorflow as tf
from tensorflow.keras.layers import Dense, Activation, Input
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model

#Loading the dataset
df = pd.read_csv('../Data/Cyber_salaries.csv', low_memory=False)

#Showing data
#df.info()
#df.head()
print(df.shape)

2022-11-15 08:47:28.657131: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


(1349, 11)


In [2]:
# Only the last expression in a cell is displayed, so the output below is the minimum salary
df['salary_in_usd'].max()
df['salary_in_usd'].min()

2000

### Pre-Processing

Remove the variables that we will not be using, so that we are left with a defined set of features plus the salary target.

In [3]:
#Select which variables to use
df.drop(['work_year', 'salary', 'salary_currency', 'employee_residence'], axis=1, inplace=True)

Clean the data by removing all null values so that our data is easy to work with.

In [4]:
# Get rid of rows with any missing data
df.replace(to_replace=' ?',value=np.nan, inplace=True)
df.dropna(inplace=True)
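# reset_index returns a new frame by default, so apply it in place and discard the old index
df.reset_index(drop=True, inplace=True)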

df.head()

Unnamed: 0,experience_level,employment_type,job_title,salary_in_usd,remote_ratio,company_location,company_size
0,EN,FT,Information Security Officer,72762,100,DE,S
1,SE,FT,Security Officer,123400,0,US,M
2,SE,FT,Security Officer,88100,0,US,M
3,SE,FT,Security Engineer,163575,100,US,M
4,SE,FT,Security Engineer,115800,100,US,M


Replace the salary with broad categories, so that the target is a small set of classes rather than a continuous value.

In [5]:
def classify(row):
    k_amount = int(row['salary_in_usd'] / 10000)
    k_amount = str(k_amount)
    val = k_amount[0]
    for i in range(1, len(str(k_amount))) :
        val += '0'
    val += 'k'
    return val
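To make the binning rule concrete, the sketch below restates it as a standalone function and applies it to a few salaries, including four from the table above and the minimum of 2000. The name `salary_bucket` is introduced here for illustration only and is not part of the notebook.

```python
def salary_bucket(salary_in_usd):
    """Keep the leading digit of the salary in units of $10,000 and zero the rest."""
    k_amount = str(int(salary_in_usd / 10000))
    return k_amount[0] + '0' * (len(k_amount) - 1) + 'k'

for salary in (2000, 72762, 88100, 123400, 163575):
    print(salary, '->', salary_bucket(salary))
# 2000 -> 0k
# 72762 -> 7k
# 88100 -> 8k
# 123400 -> 10k
# 163575 -> 10k
```

Note that the label counts units of $10,000 rather than $1,000: `7k` covers $70,000 to $79,999, while `10k` covers everything from $100,000 to $199,999, so the bins are much wider at the top of the range.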

Next, we begin preprocessing by encoding the categorical data as integers.

In [6]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

#Convert the salary to categories 
df['salary_class'] = df.apply(classify, axis=1)

#Encode categorical data as integers  
encoders = dict() # save each encoder in dictionary
categorical_headers = ['experience_level','employment_type','job_title', 'company_location', 'company_size', 'salary_class']

for col in categorical_headers:
    # remove leading and trailing whitespace before encoding
    df[col] = df[col].str.strip()
    if col == "salary_class":
        # special case the target, just replace the column
        tmp = LabelEncoder()
        df[col] = tmp.fit_transform(df[col])
    else : 
        # integer encode strings that are features
        encoders[col] = LabelEncoder() # save the encoder
        df[col+'_int'] = encoders[col].fit_transform(df[col])
    
df.head()

Unnamed: 0,experience_level,employment_type,job_title,salary_in_usd,remote_ratio,company_location,company_size,salary_class,experience_level_int,employment_type_int,job_title_int,company_location_int,company_size_int
0,EN,FT,Information Security Officer,72762,100,DE,S,13,0,2,47,17,2
1,SE,FT,Security Officer,123400,0,US,M,1,3,2,71,54,1
2,SE,FT,Security Officer,88100,0,US,M,15,3,2,71,54,1
3,SE,FT,Security Engineer,163575,100,US,M,1,3,2,68,54,1
4,SE,FT,Security Engineer,115800,100,US,M,1,3,2,68,54,1


Pre-process the data by scaling the numeric variables (and, where needed, applying dimensionality reduction). <br/>
**TO-DO : handle scaling with categorical values ?**

In [7]:
# scale the numeric, continuous variables
numeric_headers = ["remote_ratio"]

ss = StandardScaler()
df[numeric_headers] = ss.fit_transform(df[numeric_headers].values)
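One caveat about the cell above: it fits the scaler on the full dataset, before any train/test split, so the test rows contribute to the mean and standard deviation. A possible alternative, sketched here on made-up `remote_ratio` values rather than the real data, is to fit on the training rows only and reuse those statistics for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy remote_ratio values, not taken from the dataset
train_vals = np.array([[0.0], [50.0], [100.0], [100.0]])
test_vals = np.array([[0.0], [100.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_vals)  # statistics come from training rows only
test_scaled = scaler.transform(test_vals)        # reuse the training mean and std

print("training mean:", scaler.mean_, "training std:", scaler.scale_)
print("scaled test values:", test_scaled.ravel())
```

Inside a cross-validation loop, the same pattern would be repeated per fold, fitting a fresh scaler on each training fold.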

Print the final set of features as they are represented. <br/>
**TO-DO : describe more in depth what each variable represents and how it will affect final regression.**

In [8]:
categorical_headers_ints = [x+'_int' for x in categorical_headers[:-1]]
feature_columns = categorical_headers_ints+numeric_headers

import pprint
pp = pprint.PrettyPrinter(indent=4)

print(f"We will use the following {len(feature_columns)} features:")
pp.pprint(feature_columns)

print("\nNumeric Headers:")
pp.pprint(numeric_headers) # normalized numeric data
print("\nCategorical String Headers:")
pp.pprint(categorical_headers) # string data
print("\nCategorical Headers, Encoded as Integer:")
pp.pprint(categorical_headers_ints) # string data encoded as an integer

We will use the following 6 features:
[   'experience_level_int',
    'employment_type_int',
    'job_title_int',
    'company_location_int',
    'company_size_int',
    'remote_ratio']

Numeric Headers:
['remote_ratio']

Categorical String Headers:
[   'experience_level',
    'employment_type',
    'job_title',
    'company_location',
    'company_size',
    'salary_class']

Categorical Headers, Encoded as Integer:
[   'experience_level_int',
    'employment_type_int',
    'job_title_int',
    'company_location_int',
    'company_size_int']


#### Features 

First we need to represent all of the features we have available to us. 

In [9]:
# sandbox for looking at different categorical variables
for col in categorical_headers:
    vals = df[col].unique()
    print(col,'has', len(vals), 'unique values:')
    print(vals)

experience_level has 4 unique values:
['EN' 'SE' 'MI' 'EX']
employment_type has 4 unique values:
['FT' 'PT' 'CT' 'FL']
job_title has 87 unique values:
['Information Security Officer' 'Security Officer' 'Security Engineer'
 'Penetration Testing Engineer' 'Security Analyst' 'Security Consultant'
 'Network Security Engineer' 'Penetration Tester' 'DevSecOps Engineer'
 'Security Specialist' 'Cloud Security Engineer'
 'Security Operations Engineer' 'Head of Information Security'
 'Chief Information Security Officer' 'Cyber Security Analyst'
 'Information Security Manager' 'Network and Security Engineer'
 'Threat Hunter' 'Information Security Compliance Lead'
 'Digital Forensics Analyst' 'Information Security Compliance Analyst'
 'Cyber Threat Analyst' 'Cyber Security Consultant' 'IT Security Engineer'
 'Cyber Program Manager' 'IT Security Analyst'
 'Application Security Architect' 'Security Researcher'
 'Information Security Compliance Manager'
 'Application Security Specialist' 'Security In

Now that we have our values, we create cross columns and encode them as integers. <br/>
**TO-DO : Determine what values to cross and justify**

In [10]:
# choose these as a class, what makes sense??
cross_columns = [
                    ['experience_level','employment_type'],
                ]

# cross each set of columns in the list above
cross_col_df_names = []
for cols_list in cross_columns:
    # encode as ints for the embedding
    enc = LabelEncoder()
    
    # 1. create crossed labels by join operation
    X_crossed = df[cols_list].apply(lambda x: '_'.join(x), axis=1)
    
    # get a nice name for this new crossed column
    cross_col_name = '_'.join(cols_list)
    
    # 2. encode as integers (fit on every crossed label in the dataframe)
    enc.fit(X_crossed)
    
    # 3. Save into dataframe with new name
    df[cross_col_name] = enc.transform(X_crossed)
    
    # Save the encoder used here for later:
    encoders[cross_col_name] = enc
    
    # keep track of the new names of the crossed columns
    cross_col_df_names.append(cross_col_name) 
    feature_columns.append(cross_col_name)
    
cross_col_df_names

['experience_level_employment_type']
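As a small illustration of what the crossing step produces, the sketch below repeats the join-then-encode idea on four made-up rows. The rows and the variable names (`toy`, `crossed`, `crossed_int`) are assumptions for this example and do not come from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy rows, not taken from the dataset
toy = pd.DataFrame({
    'experience_level': ['EN', 'SE', 'SE', 'EN'],
    'employment_type':  ['FT', 'FT', 'PT', 'FT'],
})

# join the two string columns row by row, then integer-encode the joined labels
crossed = toy['experience_level'] + '_' + toy['employment_type']
enc_toy = LabelEncoder()
crossed_int = enc_toy.fit_transform(crossed)

print(list(enc_toy.classes_))   # ['EN_FT', 'SE_FT', 'SE_PT']
print(list(crossed_int))        # [0, 1, 2, 0]
```

Each distinct combination gets its own integer, so the wide branch can learn a separate embedding for, say, senior part-time roles instead of treating the two attributes independently.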

### Metrics

**TO-DO : Choose metrics and describe why they are appropriate**
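When weighing candidate metrics, it may help to see how they behave if the salary classes turn out to be imbalanced (an assumption that has not been checked above). The sketch below uses made-up labels and a degenerate classifier that always predicts the majority class. It is only an illustration of how accuracy, balanced accuracy, and macro F1 differ, not a decision about which metric this lab should use.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# made-up labels: class 0 dominates, classes 1 and 2 are rare
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
# a degenerate classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f"accuracy={acc:.2f}  balanced accuracy={bal_acc:.2f}  macro F1={macro_f1:.2f}")
# accuracy=0.60  balanced accuracy=0.33  macro F1=0.25
```

Plain accuracy rewards the majority-class guess, while the two class-averaged metrics expose that two of the three classes are never predicted.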

### Training-Testing

Now that all of our processing is complete, we can begin modeling. The final step is to split our data into training and testing sets. <br/>
**TO-DO : Determine method to divide into training and testing and justify**

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

# combine the features into a single large matrix
X = df[feature_columns]
y = df['salary_class'].values.astype(np.int32)

df_train, df_test = train_test_split(df, test_size=0.4, random_state=42)

skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(X, y)

10

Now that we have our split, we can run cross-validation on it to estimate how accurate our model will be on data it has not seen. <br/>
**TO-DO : Select a cross validation and argue why it is a realistic mirroring**
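To show what the stratified splitter set up above does, the sketch below runs `StratifiedKFold` on made-up labels with an 80/20 class balance and records the share of the minority class in each held-out fold. The toy arrays and the choice of five shuffled folds are assumptions for the illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# made-up labels: 80 samples of class 0 and 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_fractions = []
for train_idx, test_idx in skf_toy.split(X_toy, y_toy):
    # fraction of class 1 in this held-out fold
    fold_fractions.append(float(y_toy[test_idx].mean()))

print(fold_fractions)  # every fold keeps the overall 20% share of class 1
```

Because each fold mirrors the overall class proportions, the per-fold scores are less likely to swing just because a rare salary class happened to be missing from one fold.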

## Modeling

### Wide and Deep Networks

#### Model 1

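Then, we create our first wide and deep network. In this network, the wide branch embeds the crossed `experience_level` and `employment_type` column, and the deep branch passes the categorical embeddings and the scaled numeric feature through three dense layers of 50, 25, and 10 units. <br/>
Once the model has been trained, we graph the performance of the network on the training and validation data, using the predetermined metrics, against each training iteration. <br/>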
**TO-DO : Create First Wide and Deep Network** <br/>
**TO-DO : Graph Performance** <br/>

In [12]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import concatenate
from tensorflow import keras
from sklearn import metrics as mt

# one entry is appended per fold; each entry holds that fold's per-epoch curve
accuracies = []
accuracies_val = []
losses = []
losses_val = []


for train_index, test_index in skf.split(X, y) : 
    x_train_fold, x_test_fold = X.iloc[train_index], X.iloc[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    
    # get crossed columns
    X_train_crossed = x_train_fold[cross_col_df_names].to_numpy()
    X_test_crossed = x_test_fold[cross_col_df_names].to_numpy()

    # save categorical features
    X_train_cat = x_train_fold[categorical_headers_ints].to_numpy() 
    X_test_cat = x_test_fold[categorical_headers_ints].to_numpy() 

    # and save off the numeric features
    X_train_num =  x_train_fold[numeric_headers].to_numpy()
    X_test_num = x_test_fold[numeric_headers].to_numpy()


    # we need to create separate lists for each branch
    crossed_outputs = []

    # CROSSED DATA INPUT
    input_crossed = Input(shape=(X_train_crossed.shape[1],), dtype='int64', name='wide_inputs')
    for idx,col in enumerate(cross_col_df_names):
    
        # the number of categories comes from the encoder fitted on this crossed column
        N = len(encoders[col].classes_)
        N_reduced = int(np.sqrt(N))
    
    
        # this line of code does this: input_branch[:,idx]
        x = tf.gather(input_crossed, idx, axis=1)
    
        # now use an embedding to deal with integers as if they were one hot encoded
        x = Embedding(input_dim=N, 
                  output_dim=N_reduced, 
                  input_length=1, name=col+'_embed')(x)
    
        # save these outputs to concatenate later
        crossed_outputs.append(x)
    

    # now concatenate the outputs and add a fully connected layer
    wide_branch = concatenate(crossed_outputs, name='wide_concat')

    # reset this input branch
    all_deep_branch_outputs = []
    # CATEGORICAL DATA INPUT
    input_cat = Input(shape=(X_train_cat.shape[1],), dtype='int64', name='categorical_input')
    for idx,col in enumerate(categorical_headers_ints):
    
        # track what the maximum integer value will be for this variable
        # which is the same as the number of categories
        N = max(df_train[col].max(),df_test[col].max())+1
        N_reduced = int(np.sqrt(N))
    
        # this line of code does this: input_branch[:,idx]
        x = tf.gather(input_cat, idx, axis=1)
    
        # now use an embedding to deal with integers as if they were one hot encoded
        x = Embedding(input_dim=N, 
                  output_dim=N_reduced, 
                  input_length=1, name=col+'_embed')(x)
    
        # save these outputs to concatenate later
        all_deep_branch_outputs.append(x)
    
    # NUMERIC DATA INPUT
    # create dense input branch for numeric
    input_num = Input(shape=(X_train_num.shape[1],), name='numeric')
    x_dense = Dense(units=20, activation='relu',name='num_1')(input_num)
    
    all_deep_branch_outputs.append(x_dense)


    # merge the deep branches together
    deep_branch = concatenate(all_deep_branch_outputs,name='concat_embeds')
    deep_branch = Dense(units=50,activation='relu', name='deep1')(deep_branch)
    deep_branch = Dense(units=25,activation='relu', name='deep2')(deep_branch)
    deep_branch = Dense(units=10,activation='relu', name='deep3')(deep_branch)
    
    # merge the deep and wide branch
    final_branch = concatenate([wide_branch, deep_branch],
                           name='concat_deep_wide')
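    # salary_class is a multi-class target, so use one softmax unit per class
    n_classes = int(y.max()) + 1
    final_branch = Dense(units=n_classes,activation='softmax',
                     name='combined')(final_branch)

    model = Model(inputs=[input_crossed,input_cat,input_num], 
              outputs=final_branch)

    # integer class labels pair with sparse categorical cross-entropy
    model.compile(optimizer='adagrad',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])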

    # lets also add the history variable to see how we are doing
    # and lets add a validation set to keep track of our progress
    history = model.fit([X_train_crossed,X_train_cat,X_train_num],
                    y_train_fold, 
                    epochs=50, 
                    batch_size=32, 
                    verbose=0, 
                    validation_data = ([X_test_crossed,X_test_cat,X_test_num],y_test_fold))
    
    # store this fold's per-epoch curves (Keras prefixes validation metrics with 'val_')
    accuracies.append(history.history['accuracy'])
    accuracies_val.append(history.history['val_accuracy'])
    losses.append(history.history['loss'])
    losses_val.append(history.history['val_loss'])

2022-11-15 08:47:51.233400: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-15 08:47:51.235440: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


TypeError: 'int' object is not subscriptable

In [None]:
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
# transpose so that each line is one fold and the x-axis is the epoch
plt.plot(np.array(accuracies).T)

plt.ylabel('Accuracy')
plt.title('Training')
plt.subplot(2,2,2)
plt.plot(np.array(accuracies_val).T)
plt.title('Validation')

plt.subplot(2,2,3)
plt.plot(np.array(losses).T)
plt.ylabel('Loss')
plt.xlabel('epochs')

plt.subplot(2,2,4)
plt.plot(np.array(losses_val).T)
plt.xlabel('epochs')

plt.tight_layout()
plt.show()
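With ten folds, one line per fold can be hard to read. One possible way to condense the curves, sketched below with placeholder numbers rather than results from this notebook, is to take the per-epoch mean and standard deviation across folds. The helper name `summarize_folds` is hypothetical.

```python
import numpy as np

def summarize_folds(curves):
    """Return the per-epoch mean and standard deviation across folds.

    `curves` is a list with one entry per fold, each a list of per-epoch values.
    """
    arr = np.asarray(curves, dtype=float)   # shape: (n_folds, n_epochs)
    return arr.mean(axis=0), arr.std(axis=0)

# placeholder numbers for illustration only, not results from the notebook
toy_curves = [[0.2, 0.4, 0.6],
              [0.4, 0.6, 0.8]]
mean_curve, std_curve = summarize_folds(toy_curves)
print(mean_curve, std_curve)
```

The mean curve can then be drawn with `plt.plot`, and the band between `mean_curve - std_curve` and `mean_curve + std_curve` shaded with `plt.fill_between` to show the spread across folds.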

#### Model 2

Then, we create our second wide and deep network. In this network we utilized . <br/>
Once we have completed the modeling, I graphed the performance of the network on the training and validation data vs each training iteration. <br/>
**TO-DO : Create Second Wide and Deep Network** <br/>
**TO-DO : Graph Performance** <br/>

#### Model 3

Finally, we create our third wide and deep network. In this network we utilized . <br/>
Once we have completed the modeling, I graphed the performance of the network on the training and validation data vs each training iteration. <br/>
**TO-DO : Create Third Wide and Deep Network** <br/>
**TO-DO : Graph Performance** <br/>

#### Performance Analysis

Now that we have all of our different models, we can compare their performance. <br/>
**TO-DO : Compare the performance of each of the networks**

### Layer Performance Analysis 

To further analyze our models, we recreate our model with different numbers of layers. Once we have these models, we run a cross-validation to see the effects of the layers. Then we pull out the metrics we already determined were significant. <br/>
**TO-DO : Create a model for each layer.** <br/>
**TO-DO : Run a cross-validation on each model** <br/>

Once we have run all of our measures, we can evaluate what the effect of each layer is. <br/>
**TO-DO : Graph our validation metric vs each layer** <br/>
**TO-DO : Write an analysis on what we see**

#### Network Performance Analysis
[1 points] Compare the performance of your best wide and deep network to a standard multi-layer perceptron (MLP). Alternatively, you can compare to a network without the wide branch (i.e., just the deep network). 

Now that we have a functioning deep and wide network, we can compare its performance to the Multi-Layer Perceptron we were working with. To compare them, we must run both our best-performing model and a standard MLP. <br/>
**TO-DO : Create a multi-layer perceptron and run it** <br/>
**TO-DO : Create our best wide and deep model and run it** <br/>
**TO-DO : Graph performance metric at each iteration for each model** <br/>
**TO-DO : Analyze the difference between each model** <br/>