# Lab Assignment Five: Wide and Deep Network Architectures

By: Katie Rink

### Preparation

Data Set: https://www.kaggle.com/datasets/whenamancodes/infoseccyber-security-salaries?select=Cyber_salaries.csv

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

#Loading the dataset
df = pd.read_csv('../Data/Cyber_salaries.csv', low_memory=False)

#Showing data
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1349 entries, 0 to 1348
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           1349 non-null   int64 
 1   experience_level    1349 non-null   object
 2   employment_type     1349 non-null   object
 3   job_title           1349 non-null   object
 4   salary              1349 non-null   int64 
 5   salary_currency     1349 non-null   object
 6   salary_in_usd       1349 non-null   int64 
 7   employee_residence  1349 non-null   object
 8   remote_ratio        1349 non-null   int64 
 9   company_location    1349 non-null   object
 10  company_size        1349 non-null   object
dtypes: int64(4), object(7)
memory usage: 116.1+ KB


Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2022,EN,FT,Information Security Officer,68000,EUR,72762,DE,100,DE,S
1,2022,SE,FT,Security Officer,123400,USD,123400,US,0,US,M
2,2022,SE,FT,Security Officer,88100,USD,88100,US,0,US,M
3,2022,SE,FT,Security Engineer,163575,USD,163575,US,100,US,M
4,2022,SE,FT,Security Engineer,115800,USD,115800,US,100,US,M


#### Class Variables
[1 point] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

In [2]:
#Select which variables to use
df.drop(['work_year', 'salary', 'salary_currency', 'employee_residence'], axis=1, inplace=True)

In [3]:
# let's just get rid of rows with any missing data
# and then reset the indices of the dataframe so it corresponds to row number
df.replace(to_replace=' ?',value=np.nan, inplace=True)
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True) # reset_index returns a copy unless inplace=True

df.head()

Unnamed: 0,experience_level,employment_type,job_title,salary_in_usd,remote_ratio,company_location,company_size
0,EN,FT,Information Security Officer,72762,100,DE,S
1,SE,FT,Security Officer,123400,0,US,M
2,SE,FT,Security Officer,88100,0,US,M
3,SE,FT,Security Engineer,163575,100,US,M
4,SE,FT,Security Engineer,115800,100,US,M


In [4]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

#Encode categorical data as integers  
encoders = dict() # save each encoder in dictionary
categorical_headers = ['experience_level','employment_type','job_title', 'company_location', 'company_size']

for col in categorical_headers:
    # remove leading/trailing whitespace from the string values
    df[col] = df[col].str.strip()
    
    # integer encode strings that are features
    encoders[col] = LabelEncoder() # save the encoder
    df[col+'_int'] = encoders[col].fit_transform(df[col])
    
# scale the numeric, continuous variables
numeric_headers = ["remote_ratio"]

ss = StandardScaler()
df[numeric_headers] = ss.fit_transform(df[numeric_headers].values)

df.head()

Unnamed: 0,experience_level,employment_type,job_title,salary_in_usd,remote_ratio,company_location,company_size,experience_level_int,employment_type_int,job_title_int,company_location_int,company_size_int
0,EN,FT,Information Security Officer,72762,0.705663,DE,S,0,2,47,17,2
1,SE,FT,Security Officer,123400,-1.836231,US,M,3,2,71,54,1
2,SE,FT,Security Officer,88100,-1.836231,US,M,3,2,71,54,1
3,SE,FT,Security Engineer,163575,0.705663,US,M,3,2,68,54,1
4,SE,FT,Security Engineer,115800,0.705663,US,M,3,2,68,54,1
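
The cell above fits the scaler on the full data frame, so rows that later land in the test set contribute to the mean and standard deviation. A minimal sketch of one way to avoid that, assuming a toy `remote_ratio` column (the values and the names `train_scaled` and `test_scaled` are invented for illustration): split first, fit on the training rows, and reuse those statistics for the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the remote_ratio column (illustrative values only)
remote_ratio = np.array(
    [0, 50, 100, 100, 0, 100, 50, 0, 100, 100], dtype=float
).reshape(-1, 1)

# Split before scaling so the test rows stay unseen
train, test = train_test_split(remote_ratio, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # mean/std come from training rows only
test_scaled = scaler.transform(test)        # reuse the training mean/std
```

With a single numeric feature the effect is likely small, but the same pattern carries over unchanged if more continuous columns are added.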


#### Features
[1 point] Identify groups of features in your data that should be combined into cross-product features. Provide justification for why these features should be crossed (or why some features should not be crossed).

In [5]:
# collect the integer-encoded categorical columns and the scaled numeric column
# no crossed features are built in this cell
categorical_headers_ints = [x+'_int' for x in categorical_headers]
feature_columns = categorical_headers_ints+numeric_headers

import pprint
pp = pprint.PrettyPrinter(indent=4)

print(f"We will use the following {len(feature_columns)} features:")
pp.pprint(feature_columns)

print("\nNumeric Headers:")
pp.pprint(numeric_headers) # normalized numeric data
print("\nCategorical String Headers:")
pp.pprint(categorical_headers) # string data
print("\nCategorical Headers, Encoded as Integer:")
pp.pprint(categorical_headers_ints) # string data encoded as an integer

We will use the following 6 features:
[   'experience_level_int',
    'employment_type_int',
    'job_title_int',
    'company_location_int',
    'company_size_int',
    'remote_ratio']

Numeric Headers:
['remote_ratio']

Categorical String Headers:
[   'experience_level',
    'employment_type',
    'job_title',
    'company_location',
    'company_size']

Categorical Headers, Encoded as Integer:
[   'experience_level_int',
    'employment_type_int',
    'job_title_int',
    'company_location_int',
    'company_size_int']
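
The cell above only lists the individual features; no crosses are formed. A minimal sketch of how cross-product features could be built, assuming toy rows and a hypothetical choice of crosses (`experience_level` with `job_title`, and `company_size` with `company_location`). The pairs are illustrative assumptions, not the author's selection.

```python
import pandas as pd

# Toy rows shaped like the lab's data frame (illustrative values only)
toy = pd.DataFrame({
    "experience_level": ["EN", "SE", "SE", "MI"],
    "job_title": ["Security Officer", "Security Officer",
                  "Security Engineer", "Security Engineer"],
    "company_size": ["S", "M", "M", "L"],
    "company_location": ["DE", "US", "US", "US"],
})

# Hypothetical crosses; each tuple names the two columns to combine
cross_columns = [("experience_level", "job_title"),
                 ("company_size", "company_location")]

crossed_headers = []
for first, second in cross_columns:
    name = f"{first}_{second}"
    # Join the strings so each distinct combination becomes one category
    crossed = toy[first] + "_" + toy[second]
    # factorize assigns one integer code per distinct combination
    codes, uniques = pd.factorize(crossed)
    toy[name + "_int"] = codes
    crossed_headers.append(name + "_int")
```

Crossing a high-cardinality column such as `job_title` with another categorical column can produce many rare combinations in a data set of 1349 rows, which is one argument for keeping the number of crosses small.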


#### Metrics
[1 point] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.
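
Because the target used later in this lab (`salary_in_usd`) is continuous, regression metrics are the natural candidates. A minimal sketch, assuming made-up salaries and a hypothetical helper named `regression_report`, of three common options: mean absolute error (in dollars), root mean squared error (penalizes large misses more), and mean absolute percentage error (relative to the true salary).

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and MAPE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        # MAPE assumes every true value is strictly positive
        "mape": float(np.mean(np.abs(err) / y_true)),
    }

# Illustrative salaries only; not taken from the data set
report = regression_report([100000, 80000, 120000], [110000, 70000, 120000])
```

MAE is often the easiest of the three to explain in a salary setting, since it reads directly as "the typical miss in USD".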

#### Training-Testing
[1 point] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. Argue why your cross validation method is a realistic mirroring of how an algorithm would be used in practice.

In [6]:
from sklearn.model_selection import train_test_split

# combine the features into a single large matrix
X = df[feature_columns].to_numpy()
y = df['salary_in_usd'].values.astype(np.int32)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
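
The cell above uses a single 80/20 hold-out split. Stratified splitting requires class labels, so for a continuous target a shuffled K-fold is one alternative. A minimal sketch, assuming random toy arrays in place of `X` and `y` (the names `X_toy`, `y_toy`, and `fold_sizes` are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-ins for the feature matrix and salary target
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(50, 6))
y_toy = rng.integers(40000, 200000, size=50)

# Shuffled 10-fold split; every row is used for testing exactly once
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

fold_sizes = []
for train_idx, test_idx in kfold.split(X_toy):
    X_tr, X_te = X_toy[train_idx], X_toy[test_idx]
    y_tr, y_te = y_toy[train_idx], y_toy[test_idx]
    # A model would be fit on (X_tr, y_tr) and scored on (X_te, y_te) here
    fold_sizes.append((len(train_idx), len(test_idx)))
```

Averaging the per-fold scores gives a less noisy estimate than one hold-out split, which matters with only 1349 rows.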

### Modeling

#### Wide and Deep Networks
[2 points] Create at least three combined wide and deep networks to classify your data using Keras. Visualize the performance of the network on the training data and validation data in the same plot versus the training iterations. Note: use the "history" return parameter that is part of Keras "fit" function to easily access this data.

In [7]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Activation, Input
from tensorflow.keras.models import Model

# First, lets setup the input size
num_features = X_train.shape[1]
input_tensor = Input(shape=(num_features,))

# a layer instance is callable on a tensor, and returns a tensor
# Dense means a fully connected layer with a bias term; the first has 10 hidden neurons
x = Dense(units=10, activation='relu')(input_tensor)
x = Dense(units=5, activation='tanh')(x)
# salary_in_usd is a continuous target, so the output unit is linear rather than sigmoid
predictions = Dense(1, activation='linear')(x)

# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=input_tensor, outputs=predictions)

ModuleNotFoundError: No module named 'tensorflow'
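
The cell above builds a single stack of Dense layers, with no separate wide branch. A minimal sketch of one wide and deep layout in the Keras functional API, assuming TensorFlow is installed and using a hypothetical helper `build_wide_deep` with made-up vocabulary sizes: the wide branch is a width-1 embedding over an integer-coded crossed feature, and the deep branch embeds each categorical column before passing through Dense layers.

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, concatenate
from tensorflow.keras.models import Model

def build_wide_deep(n_cross, cat_sizes, n_numeric, deep_units=(10, 5)):
    # Wide branch: a width-1 embedding is a learned weight per crossed category
    cross_in = Input(shape=(1,), dtype="int64", name="crossed")
    wide = Flatten()(Embedding(input_dim=n_cross, output_dim=1)(cross_in))

    # Deep branch: one small embedding per categorical column
    cat_inputs, deep_parts = [], []
    for i, size in enumerate(cat_sizes):
        inp = Input(shape=(1,), dtype="int64", name=f"cat_{i}")
        deep_parts.append(Flatten()(Embedding(input_dim=size, output_dim=4)(inp)))
        cat_inputs.append(inp)
    num_in = Input(shape=(n_numeric,), name="numeric")
    deep = concatenate(deep_parts + [num_in])
    for units in deep_units:
        deep = Dense(units, activation="relu")(deep)

    # Merge both branches and regress a single salary value
    out = Dense(1, activation="linear")(concatenate([wide, deep]))
    model = Model(inputs=[cross_in] + cat_inputs + [num_in], outputs=out)
    model.compile(optimizer="adam", loss="mean_absolute_error")
    return model

# Made-up sizes: 20 crossed categories, two categorical columns, one numeric
model_wd = build_wide_deep(n_cross=20, cat_sizes=[4, 3], n_numeric=1)
toy_inputs = [np.zeros((5, 1), dtype="int64"), np.zeros((5, 1), dtype="int64"),
              np.zeros((5, 1), dtype="int64"), np.zeros((5, 1), dtype="float32")]
preds = model_wd.predict(toy_inputs, verbose=0)
```

Changing `deep_units` gives differently sized deep branches from the same helper, which is one way to produce the several network variants the prompt asks for.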

#### Layer Performance Analysis
[2 points] Investigate generalization performance by altering the number of layers in the deep branch of the network. Try at least two different numbers of layers. Use the method of cross validation and evaluation metric that you argued for at the beginning of the lab to select the number of layers that performs best.
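
The selection loop this prompt describes can be sketched independently of Keras. The example below is a stand-in under stated assumptions: it uses scikit-learn's `MLPRegressor` instead of a wide and deep network, random toy data instead of the salary data, and mean absolute error as the assumed metric. Only the pattern (score each depth with the same folds, then pick the lowest mean error) is meant to carry over.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor

# Toy regression data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 6))
y_toy = X_toy @ rng.normal(size=6) + rng.normal(scale=0.1, size=60)

# Candidate depths, written as hidden layer sizes
candidates = {"2 layers": (10, 5), "3 layers": (10, 10, 5)}

# Reuse the same folds for every candidate so the scores are comparable
cv = KFold(n_splits=5, shuffle=True, random_state=42)

mean_mae = {}
for label, layers in candidates.items():
    net = MLPRegressor(hidden_layer_sizes=layers, max_iter=500, random_state=42)
    scores = cross_val_score(net, X_toy, y_toy, cv=cv,
                             scoring="neg_mean_absolute_error")
    mean_mae[label] = -scores.mean()  # flip sign back to a positive error

best = min(mean_mae, key=mean_mae.get)
```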

#### Network Performance Analysis
[1 point] Compare the performance of your best wide and deep network to a standard multi-layer perceptron (MLP). Alternatively, you can compare to a network without the wide branch (i.e., just the deep network).