# Classification of Mental Health Priority in Tech Workplace

### CS699 Term Project Code

Oliva Lee

Fall 2022

## Preparation

### Libraries & Set Up

In [None]:
pip install mlxtend --upgrade --no-deps

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
#Libraries

import math
import numpy as np
import pandas as pd

from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.linear_model import Lasso
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score, make_scorer, matthews_corrcoef, precision_score, recall_score, roc_auc_score, roc_curve
from sklearn.model_selection import cross_validate, GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn_pandas import CategoricalImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

import plotly.express as px
import plotly.graph_objects as go
colors = px.colors.sequential.Viridis

In [None]:
#Set Working Directory
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


### The Dataset

In [None]:
data = pd.read_csv('/content/drive/MyDrive/CS699 Project/data/data_0.csv')
data

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1254,2015-09-12 11:17:21,26,male,United Kingdom,,No,No,Yes,,26-100,...,Somewhat easy,No,No,Some of them,Some of them,No,No,Don't know,No,
1255,2015-09-26 01:07:35,32,Male,United States,IL,No,Yes,Yes,Often,26-100,...,Somewhat difficult,No,No,Some of them,Yes,No,No,Yes,No,
1256,2015-11-07 12:36:58,34,male,United States,CA,No,Yes,Yes,Sometimes,More than 1000,...,Somewhat difficult,Yes,Yes,No,No,No,No,No,No,
1257,2015-11-30 21:25:06,46,f,United States,NC,No,No,No,,100-500,...,Don't know,Yes,No,No,No,No,No,No,No,


#### Variables

The list below describes the 27 attributes includede in the dataset:

*  *Timestamp*: date/time the survey was conducted
*  *Age*: age of the survey participant in years
*  *Gender*: gender of the survey participant
*  *Country*: country of origin of the survey participant
*  *State*: state/territory of the survey participant
*  *self_employed*: Are you self-employed?
*  *family_history*: Do you have a family history of mental illness?
*  *treatment*: Have you sought treatment for a mental health condition?
*  *work_interfere*: If you have a mental health condition, do you feel that it interferes with your work?
*  *no_employees*: How many employees does your company or organization have?
*  *remote_work*: Do you work remotely (outside of an office) at least 50 of the time?
*  *tech_company*: Is your employer primarily a tech company/organization?
*  *benefits*: Does your employer provide mental health benefits?
*  *care_options*: Do you know the options for mental health care your employer provides?
*  *wellness_program*: Has your employer ever discussed mental health as part of an employee wellness program?
*  *seek_help*: Does your employer provide resources to learn more about mental health issues and how to seek help?
*  *anonymity*: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
*  *leave*: How easy is it for you to take medical leave for a mental health condition?
*  *mental_health_consequence*: Do you think that discussing a mental health issue with your employer would have negative consequences?
*  *phys_health_consequence*: Do you think that discussing a physical health issue with your employer would have negative consequences?
*  *coworkers*: Would you be willing to discuss a mental health issue with your coworkers?
*  *supervisor*: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
*  *mental_health_interview*: Would you bring up a mental health issue with a potential employer in an interview?
*  *phys_health_interview*: Would you bring up a physical health issue with a potential employer in an interview?
*  *mental_vs_physical*: Do you feel that your employer takes mental health as seriously as physical health?
*  *obs_consequence*: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
*  *comments*: Any additional notes or comments

### Data Cleaning

#### Step 1: Remove Duplicate or Irrelevant Data

In [None]:
#Remove irrelevant attributes
data.drop(['Timestamp', 'Country', 'state', 'comments'], axis=1, inplace=True)

#Remove "Don't Know" responses from survery
data = data[data.mental_vs_physical != "Don't know"]

#Rearrange class attribute to front of dataset
class_attribute = data.pop('mental_vs_physical')
data.insert(0, 'mental_vs_physical', class_attribute)

#### Step 2: Fix Structural Errors

In [None]:
#mental_vs_physical
data[['mental_vs_physical']].value_counts()

mental_vs_physical
Yes                   343
No                    340
dtype: int64

In [None]:
#Age
data.loc[(data['Age'] < 18) | (data['Age'] > 100), 'Age'] = None
data[['Age']].describe()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


Unnamed: 0,Age
count,677.0
mean,32.082718
std,7.293242
min,18.0
25%,27.0
50%,31.0
75%,36.0
max,65.0


In [None]:
#Gender

data.loc[data['Gender'] == 'Male ', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'male', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'M', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'm', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'Make', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'Man', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'Cis Male', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'male leaning androgynous', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'maile', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'msle', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'ostensibly male, unsure what that really means', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'cis male', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'Malr', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'Cis Man', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'Guy (-ish) ^_^', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'Mail', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'Mal', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'Male (CIS)', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'Male-ish', 'Gender'] = 'Male'
data.loc[data['Gender'] == 'something kinda male?', 'Gender'] = 'Male'

data.loc[data['Gender'] == 'female', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'F', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'f', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'Woman', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'Female ', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'Female (trans)', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'femail', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'cis-female/femme', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'Trans-female', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'Trans woman', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'Female (cis)', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'Femake', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'Cis Female', 'Gender'] = 'Female'
data.loc[data['Gender'] == 'woman', 'Gender'] = 'Female'

data.loc[data['Gender'] == 'All', 'Gender'] = 'Non-binary'
data.loc[data['Gender'] == 'Agender', 'Gender'] = 'Non-binary'
data.loc[data['Gender'] == 'Androgyne', 'Gender'] = 'Non-binary'
data.loc[data['Gender'] == 'Enby', 'Gender'] = 'Non-binary'
data.loc[data['Gender'] == 'Genderqueer', 'Gender'] = 'Non-binary'
data.loc[data['Gender'] == 'fluid', 'Gender'] = 'Non-binary'
data.loc[data['Gender'] == 'non-binary', 'Gender'] = 'Non-binary'
data.loc[data['Gender'] == 'queer', 'Gender'] = 'Non-binary'
data.loc[data['Gender'] == 'queer/she/they', 'Gender'] = 'Non-binary'

data.loc[data['Gender'] == 'A little about you', 'Gender'] = None
data.loc[data['Gender'] == 'Nah', 'Gender'] = None
data.loc[data['Gender'] == 'Neuter', 'Gender'] = None
data.loc[data['Gender'] == 'p', 'Gender'] = None

data[['Gender']].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


Gender    
Male          529
Female        142
Non-binary      8
dtype: int64

In [None]:
#self_employed
data[['self_employed']].value_counts()

self_employed
No               575
Yes               98
dtype: int64

In [None]:
#family_history
data[['family_history']].value_counts()

family_history
No                393
Yes               290
dtype: int64

In [None]:
#treatment
data[['treatment']].value_counts()

treatment
Yes          377
No           306
dtype: int64

In [None]:
#work_interfere
data[['work_interfere']].value_counts()

work_interfere
Sometimes         277
Never             104
Rarely             90
Often              88
dtype: int64

In [None]:
#no_employees
data[['no_employees']].value_counts()

no_employees  
More than 1000    159
6-25              151
26-100            145
1-5               105
100-500            91
500-1000           32
dtype: int64

In [None]:
#remote_work
data[['remote_work']].value_counts()

remote_work
No             479
Yes            204
dtype: int64

In [None]:
#tech_company
data[['tech_company']].value_counts()

tech_company
Yes             553
No              130
dtype: int64

In [None]:
#benefits
data[['benefits']].value_counts()

benefits  
Yes           276
No            237
Don't know    170
dtype: int64

In [None]:
#care_options
data[['care_options']].value_counts()

care_options
Yes             300
No              247
Not sure        136
dtype: int64

In [None]:
#wellness_program
data[['wellness_program']].value_counts()

wellness_program
No                  440
Yes                 155
Don't know           88
dtype: int64

In [None]:
#seek_help
data[['seek_help']].value_counts()

seek_help 
No            362
Yes           166
Don't know    155
dtype: int64

In [None]:
#anonymity
data[['anonymity']].value_counts()

anonymity 
Don't know    370
Yes           262
No             51
dtype: int64

In [None]:
#leave
data[['leave']].value_counts()

leave             
Don't know            223
Very easy             148
Somewhat easy         143
Somewhat difficult     91
Very difficult         78
dtype: int64

In [None]:
#mental_health_consequence
data[['mental_health_consequence']].value_counts()

mental_health_consequence
No                           293
Maybe                        212
Yes                          178
dtype: int64

In [None]:
#phys_health_consequence
data[['phys_health_consequence']].value_counts()

phys_health_consequence
No                         504
Maybe                      135
Yes                         44
dtype: int64

In [None]:
#coworkers
data[['coworkers']].value_counts()

coworkers   
Some of them    416
Yes             143
No              124
dtype: int64

In [None]:
#supervisor
data[['supervisor']].value_counts()

supervisor  
Yes             315
No              196
Some of them    172
dtype: int64

In [None]:
#mental_health_interview
data[['mental_health_interview']].value_counts()

mental_health_interview
No                         528
Maybe                      127
Yes                         28
dtype: int64

In [None]:
#phys_health_interview
data[['phys_health_interview']].value_counts()

phys_health_interview
Maybe                    317
No                       241
Yes                      125
dtype: int64

In [None]:
#obs_consequence
data[['obs_consequence']].value_counts()

obs_consequence
No                 550
Yes                133
dtype: int64

#### Step 3: Handle Missing Data

In [None]:
#Replace NAs with attribute mean / mode
data['Age'] = data['Age'].fillna(data['Age'].mean())

imputer = CategoricalImputer()

gender = np.array(data['Gender'], dtype=object)
data['Gender'] = imputer.fit_transform(gender)

self_employed = np.array(data['self_employed'], dtype=object)
data['self_employed'] = imputer.fit_transform(self_employed)

work_interfere = np.array(data['work_interfere'], dtype=object)
data['work_interfere'] = imputer.fit_transform(work_interfere)

na_rows = len(data[data.isna().any(axis=1)])
na_rows

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats i

0

#### Step 4: Filter Unwanted Outliers

In [None]:
age = list(data['Age'])
q1 = np.percentile(age, 25); q3 = np.percentile(age, 75)
iqr = q3 - q1; lower = q1 - (1.5 * iqr); upper = q3 + (1.5 * iqr)
len(data.loc[(data['Age'] < lower) | (data['Age'] > upper), 'Age'])

15

### Data Visualization

In [None]:
#Mental vs Physical Distribution
yes_count = len(data[data['mental_vs_physical'] == 'Yes'])
no_count = len(data[data['mental_vs_physical'] == 'No'])

fig = go.Figure(data = [go.Pie(labels = ['Yes', 'No'], values = [yes_count, no_count])])
fig.update_traces(textinfo = 'label+percent', marker = dict(colors = colors[0:2], line = dict(color='white', width=2)), opacity = .75, showlegend = False)
fig.update_layout(title = 'Distribution of Mental Health Priority Responses')
fig.layout.height = 450; fig.layout.width = 450; fig.show()

In [None]:
#Age Distribution by Mental Health Priority Response
fig = px.histogram(data, x = 'Age', color = 'mental_vs_physical', marginal = 'box', color_discrete_sequence = colors[0:2], opacity = .75)
fig.update_layout(title = 'Distribution of Age by Mental Health Priority Responses')
fig.show()

In [None]:
#Categorical Attributes by Mental Health Priority Response
def mossaic_plot(attribute):
  matrix = pd.crosstab(data['mental_vs_physical'], data[attribute])
  fig = px.imshow(matrix, text_auto = True, color_continuous_scale = 'Viridis')
  fig.update_layout(title = 'Mosaic Plot of Mental Health Priority Response vs ' + attribute)
  fig.layout.height = 500; fig.layout.width = 750; fig.show() 

attributes = list(data.columns[2:])
for attribute in attributes: 
  mossaic_plot(attribute)


### Data Preprocessing

In [None]:
#Label encoding
self_employed = np.where(data['self_employed'].str.contains('No'), 0, 1)
family_history = np.where(data['family_history'].str.contains('No'), 0, 1)
treatment = np.where(data['treatment'].str.contains('No'), 0, 1)
work_interfere = np.where(data['work_interfere'].str.contains('Never'), 0, np.where(data['work_interfere'].str.contains('Rarely'), 1, np.where(data['work_interfere'].str.contains('Sometimes'), 2, 3)))
no_employees = np.where(data['no_employees'].str.contains('1-5'), 0, np.where(data['no_employees'].str.contains('6-25'), 1, np.where(data['no_employees'].str.contains('26-100'), 2, np.where(data['no_employees'].str.contains('100-500'), 3, np.where(data['no_employees'].str.contains('500-1000'), 4, 5)))))
remote_work = np.where(data['remote_work'].str.contains('No'), 0, 1)
tech_company = np.where(data['tech_company'].str.contains('No'), 0, 1)
benefits = np.where(data['remote_work'].str.contains('No'), 0, np.where(data['remote_work'].str.contains("Don't know"), 1, 2))
care_options = np.where(data['care_options'].str.contains('No'), 0, np.where(data['care_options'].str.contains('Not sure'), 1, 2))
wellness_program = np.where(data['wellness_program'].str.contains('No'), 0, np.where(data['wellness_program'].str.contains("Don't know"), 1, 2))
seek_help = np.where(data['seek_help'].str.contains('No'), 0, np.where(data['seek_help'].str.contains("Don't know"), 1, 2))
anonymity = np.where(data['anonymity'].str.contains('No'), 0, np.where(data['anonymity'].str.contains("Don't know"), 1, 2))
leave = np.where(data['leave'].str.contains('Very difficult'), 0, np.where(data['leave'].str.contains('Somewhat difficult'), 1, np.where(data['leave'].str.contains("Don't know"), 2, np.where(data['leave'].str.contains('Somewhat easy'), 3, 4))))
mental_health_consequence = np.where(data['mental_health_consequence'].str.contains('No'), 0, np.where(data['mental_health_consequence'].str.contains('Maybe'), 1, 2))
phys_health_consequence = np.where(data['phys_health_consequence'].str.contains('No'), 0, np.where(data['phys_health_consequence'].str.contains('Maybe'), 1, 2))
coworkers = np.where(data['coworkers'].str.contains('No'), 0, np.where(data['coworkers'].str.contains('Some of them'), 1, 2))
supervisor = np.where(data['supervisor'].str.contains('No'), 0, np.where(data['supervisor'].str.contains('Some of them'), 1, 2))
mental_health_interview = np.where(data['mental_health_interview'].str.contains('No'), 0, np.where(data['mental_health_interview'].str.contains('Maybe'), 1, 2))
phys_health_interview = np.where(data['phys_health_interview'].str.contains('No'), 0, np.where(data['phys_health_interview'].str.contains('Maybe'), 1, 2))
obs_consequence = np.where(data['obs_consequence'].str.contains('No'), 0, 1)

In [None]:
#One-hot encoding
data_enc = data[['mental_vs_physical', 'Age', 'Gender']]
data_enc = pd.get_dummies(data_enc, columns = ['Gender'], drop_first=True)

In [None]:
data_enc['self_employed'] = self_employed
data_enc['family_history'] = family_history
data_enc['treatment'] = treatment
data_enc['no_employees'] = no_employees
data_enc['remote_work'] = remote_work
data_enc['tech_company'] = tech_company
data_enc['benefits'] = benefits
data_enc['care_options'] = care_options
data_enc['wellness_program'] = wellness_program
data_enc['seek_help'] = seek_help
data_enc['anonymity'] = anonymity
data_enc['leave'] = leave
data_enc['mental_health_consequence'] = mental_health_consequence
data_enc['phys_health_consequence'] = phys_health_consequence
data_enc['coworkers'] = coworkers
data_enc['supervisor'] = supervisor
data_enc['mental_health_interview'] = mental_health_interview
data_enc['phys_health_interview'] = phys_health_interview
data_enc['obs_consequence'] = obs_consequence

In [None]:
#Dataset after preprocessing
data_enc.to_csv('/content/drive/MyDrive/CS699 Project/data/data_1.csv')

In [None]:
def variables(data):
  #X and y variables
  attributes = list(data.columns[1:])
  X = data[attributes].values
  y = data['mental_vs_physical'].values
  
  #Standardize attribute variables
  scaler = StandardScaler()
  X = scaler.fit_transform(X)

  return X, y

X, y = variables(data_enc)

## Analysis

### Classification Algorithms

#### Classifier 1: K-Nearest Neighbors

In [None]:
#Determining Number of Neighbors w/ Elbow Method
def elbow_method(X, y, k_values): 
    error_rates = []
    for k in k_values:
      results = cross_validate(estimator = KNeighborsClassifier(n_neighbors = k), X = X, y = y, cv = 10, scoring = {'Accuracy' : make_scorer(accuracy_score)})
      avg_error =  (1 - results['test_Accuracy']).mean()
      error_rates.append(avg_error)
    return error_rates

k_values = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25]
error_rates = elbow_method(X, y, k_values)
k_error = pd.DataFrame({'k Value' : k_values, 'Error Rate' : error_rates})

fig = px.scatter(k_error, x = 'k Value', y = 'Error Rate', color = 'Error Rate', color_continuous_scale = 'Viridis', title = 'Elbow Method: Error Rate for kNN')
fig.update_traces(marker_size = 18); fig.show()

k = 9 #Optimal Neighbors

In [None]:
def kNN_classifier(X_train, y_train, X_test, y_test):
  classifier = KNeighborsClassifier(n_neighbors = k)
  classifier.fit(X_train, y_train)
  y_pred = classifier.predict(X_test)
  y_score = classifier.predict_proba(X_test)[:, 1]
  return y_pred, y_score

#### Classifier 2: Naive Bayesian

In [None]:
def nb_classifier(X_train, y_train, X_test, y_test):
  classifier = GaussianNB()
  classifier.fit(X_train, y_train)
  y_pred = classifier.predict(X_test)
  y_score = classifier.predict_proba(X_test)[:, 1]
  return y_pred, y_score

#### Classifier 3: Random Forest

In [None]:
def rf_classifier(X_train, y_train, X_test, y_test):
  classifier = RandomForestClassifier()
  classifier.fit(X_train, y_train) 
  y_pred = classifier.predict(X_test)
  y_score = classifier.predict_proba(X_test)[:, 1]
  return y_pred, y_score

#### Classifier 4: Support Vector Machines

In [None]:
def svm_classifier(X_train, y_train, X_test, y_test):
  classifier = svm.SVC(kernel = 'linear', probability = True)
  classifier.fit(X_train, y_train) 
  y_pred = classifier.predict(X_test)
  y_score = classifier.predict_proba(X_test)[:, 1]
  return y_pred, y_score

#### Classifier 5: Artificial Neural Network

In [None]:
def ann_classifier(X_train, y_train, X_test, y_test, k):
  #Encode class variable
  ohe = OneHotEncoder()
  y_train_enc = ohe.fit_transform(y_train.reshape(-1,1))
  y_train_enc = y_train_enc.toarray()
  y_test_enc = ohe.fit_transform(y_test.reshape(-1,1))
  y_test_enc = y_test_enc.toarray()

  #ANN
  classifier = Sequential()
  classifier.add(Dense(100, input_dim = k, activation = 'relu'))
  classifier.add(Dense(50, activation = 'relu'))
  classifier.add(Dense(50, activation = 'relu'))
  classifier.add(Dense(2, activation = 'softmax'))

  classifier.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
  history = classifier.fit(X_train, y_train_enc, epochs = 20, verbose = 0)
  y_score = classifier.predict(X_test)
  y_pred_enc = (y_score > 0.5).astype(int)
  y_pred = ohe.inverse_transform(y_pred_enc).ravel()
  
  return y_pred, y_score[:, 1]

### 10-Fold Cross Validation

In [None]:
#10-Fold Cross Validation
def cross_validation(model, X, y, cv = 10):
  
  y_enc = y.copy(); y_enc[y_enc == 'No'] = 0; y_enc[y_enc == 'Yes'] = 1; y_enc = y_enc.astype('int')
  
  scoring = {'Accuracy': make_scorer(accuracy_score),
           'TPR_No' : make_scorer(precision_score, pos_label = 0), 'Precision_No': make_scorer(precision_score, pos_label = 0), 'Recall_No': make_scorer(recall_score, pos_label = 0), 'F1_Measure_No': make_scorer(f1_score, pos_label = 0),
           'TPR_Yes' : make_scorer(precision_score, pos_label = 1), 'Precision_Yes': make_scorer(precision_score, pos_label = 1), 'Recall_Yes': make_scorer(recall_score, pos_label = 1), 'F1_Measure_Yes': make_scorer(f1_score, pos_label = 1),
           'MCC' : make_scorer(matthews_corrcoef), 'ROC_AUC' : make_scorer(roc_auc_score, average = 'weighted')}
  
  results = cross_validate(estimator = model, X = X, y = y_enc, cv = cv, scoring = scoring)
  
  print("Average Accuracy:", round(results['test_Accuracy'].mean(),4))
  
  results_df = pd.DataFrame({'TPR' : [results['test_TPR_No'].mean(), results['test_TPR_Yes'].mean()],
                             'FPR' : [1- results['test_TPR_Yes'].mean(), 1 - results['test_TPR_No'].mean()],
                             'Precision' : [results['test_Precision_No'].mean(), results['test_Precision_Yes'].mean()],
                             'Recall' : [results['test_Recall_No'].mean(), results['test_Recall_Yes'].mean()],
                             'F1 Measure' : [results['test_F1_Measure_No'].mean(), results['test_F1_Measure_Yes'].mean()],
                             'MCC' :  [results['test_MCC'].mean(), results['test_MCC'].mean()],
                             'ROC AUC' :  [results['test_ROC_AUC'].mean(), results['test_ROC_AUC'].mean()]},
                            index = ['No', 'Yes'])
  results_df.loc['Weighted Avg'] = list(results_df.mean(axis=0))
  
  return results_df

**1. K-Nearest Neighbors**

In [None]:
cv_results_1 = cross_validation(KNeighborsClassifier(n_neighbors = k), X, y)
cv_results_1

Average Accuracy: 0.8038


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.82122,0.206363,0.82122,0.782353,0.799331,0.610924,0.803529
Yes,0.793637,0.17878,0.793637,0.824706,0.806679,0.610924,0.803529
Weighted Avg,0.807428,0.192572,0.807428,0.803529,0.803005,0.610924,0.803529


**2. Naive Bayesian**

In [None]:
cv_results_2 = cross_validation(GaussianNB(), X, y)
cv_results_2

Average Accuracy: 0.8126


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.822489,0.191215,0.822489,0.8,0.808983,0.627955,0.812353
Yes,0.808785,0.177511,0.808785,0.824706,0.814826,0.627955,0.812353
Weighted Avg,0.815637,0.184363,0.815637,0.812353,0.811904,0.627955,0.812353


**3. Random Forest**

In [None]:
cv_results_3 = cross_validation(RandomForestClassifier(), X, y)
cv_results_3

Average Accuracy: 0.8126


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.810395,0.175625,0.810395,0.823529,0.814071,0.62985,0.812521
Yes,0.824375,0.189605,0.824375,0.801513,0.809243,0.62985,0.812521
Weighted Avg,0.817385,0.182615,0.817385,0.812521,0.811657,0.62985,0.812521


**4. Support Vector Machines**

In [None]:
cv_results_4 = cross_validation(svm.SVC(kernel = 'linear'), X, y)
cv_results_4

Average Accuracy: 0.8243


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.82955,0.174607,0.82955,0.820588,0.823044,0.651602,0.82416
Yes,0.825393,0.17045,0.825393,0.827731,0.824309,0.651602,0.82416
Weighted Avg,0.827472,0.172528,0.827472,0.82416,0.823677,0.651602,0.82416


**5. Artificial Neural Network**

In [None]:
def create_ann():
  classifier = Sequential()
  classifier.add(Dense(100, input_dim = 22, activation = 'relu'))
  classifier.add(Dense(50, activation = 'relu'))
  classifier.add(Dense(50, activation = 'relu'))
  classifier.add(Dense(2, activation = 'softmax'))

  classifier.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
  
  return classifier

ann_model = KerasClassifier(build_fn = create_ann, epochs = 10, verbose = 0)


KerasClassifier is deprecated, use Sci-Keras (https://github.com/adriangb/scikeras) instead. See https://www.adriangb.com/scikeras/stable/migration.html for help migrating.



In [None]:
cv_results_5 = cross_validation(ann_model, X, y)
cv_results_5









Average Accuracy: 0.8039


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.819684,0.206141,0.819684,0.786317,0.799572,0.609714,0.802971
Yes,0.793859,0.180316,0.793859,0.819624,0.803777,0.609714,0.802971
Weighted Avg,0.806771,0.193229,0.806771,0.802971,0.801674,0.609714,0.802971


### Performance Meaures

In [None]:
def performance(y_test, y_pred, y_score):
  
  #classification_report(y_test, y_pred)

  #Confusion Matrix
  cm = pd.DataFrame(confusion_matrix(y_test, y_pred))
  cm = cm.rename(columns={0:'No', 1:'Yes'}, index={0:'No', 1:'Yes'})
  fig_cm = px.imshow(cm, labels = dict(x = 'Predicted Label', y = 'True Label'), title = 'Confusion Matrix', text_auto = True, color_continuous_scale = colors, width = 400, height = 400)
  fig_cm.show()

  #ROC
  y_test_enc = [0 if i == 'No' else 1 for i in y_test]
  fpr, tpr, thresholds = roc_curve(y_test_enc, y_score)
  fig_roc = px.area(x = fpr, y = tpr, title = 'ROC Curve', labels = dict(x = 'False Positive Rate', y = 'True Positive Rate'), width = 500, height = 400)
  fig_roc.add_shape(type = 'line', line = dict(dash = 'dash'), x0 = 0, x1 = 1, y0 = 0, y1 = 1)
  fig_roc.update_yaxes(scaleanchor = "x", scaleratio = 1); fig_roc.update_xaxes(constrain='domain')
  fig_roc.show()

  #Performance Measures
  classes = ['No', 'Yes']
  TPR = []; FPR = []; precision = []; recall = []; f_measure = []; MCC = []; ROC = []

  for c in classes:
    TP = cm.loc[c, c]
    FP = sum(cm.loc[:, c]) - TP
    FN = sum(cm.loc[c]) - TP
    TN = cm.to_numpy().sum() - (TP + FP + FN)

    TPR.append(TP / (TP + FN))
    FPR.append(FP / (FP + TN))
    precision.append(TP / (TP + FP))
    recall.append(TP / (TP + FN))
    f_measure.append((2*TP) / (2*TP + FP + FN))
    MCC.append(((TP * TN) - (FP * FN)) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))
    ROC.append(roc_auc_score(y_test_enc, y_score, average = 'weighted'))

  results = pd.DataFrame(list(zip(TPR, FPR, precision, recall, f_measure, MCC, ROC)), 
                        columns = ['TPR', 'FPR', 'Precision', 'Recall', 'F1 Measure', 'MCC', 'ROC AUC'], index = classes)

  weights = np.array(np.unique(y_test, return_counts=True))[1].tolist()
  weighted_avg = [np.average(results[c], weights = weights) for c in results.columns]
  results.loc['Weighted Avg'] = weighted_avg

  print("Accuracy:", round(1 - np.mean(y_pred != y_test),4))

  return results

### Attribute Selection Methods

In [None]:
#Stratified train/test split
#X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, train_size = 0.66, test_size = 0.34, random_state = 10)
data_enc_train, data_enc_test = train_test_split(data_enc, stratify = y, train_size = 0.66, test_size = 0.34, random_state = 10)

#Train and Test Datasets
data_enc_train.to_csv('/content/drive/MyDrive/CS699 Project/data/data_train.csv')
data_enc_test.to_csv('/content/drive/MyDrive/CS699 Project/data/data_test.csv')

#### Method 1: Chi-Square Test

In [None]:
#Chi-Square Test for Attribute Selection (Filter Method) 
def chi_sqr(data_train, data_test):
  #Variables
  attributes = list(data_train.columns[1:])
  X_train = data_train[attributes].values
  y_train = data_train['mental_vs_physical'].values
  
  #Chi-Square Test
  chi_scores = chi2(X_train, y_train)
  results = pd.DataFrame({'Attribute' : data_enc.columns[1:], 'P-value' : chi_scores[1]})
  results = results.sort_values(by = ['P-value'], ascending = False)
  fig = px.bar(results, x = 'Attribute', y = 'P-value', title = 'Results of Chi-Square Test: Remove Attributes w/ P-value > 0.05'); fig.add_hline(y = 0.05, line_color = 'red'); fig.show()

  #Reduced Dataset
  remove = list(results[results['P-value'] > 0.05]['Attribute'])
  data_enc_train_1 = data_train.drop(columns = remove); data_enc_test_1 = data_test.drop(columns = remove)
  attributes_1 = data_enc_train_1.columns[1:]
  X_train_1, y_train_1 = variables(data_enc_train_1); X_test_1, y_test_1 = variables(data_enc_test_1)

  return X_train_1, y_train_1, X_test_1, y_test_1, attributes_1

X_train_1, y_train_1, X_test_1, y_test_1, attributes_1 = chi_sqr(data_enc_train, data_enc_test)

In [None]:
print('Attributes Selected:')
print(list(attributes_1))

Attributes Selected:
['self_employed', 'no_employees', 'benefits', 'wellness_program', 'seek_help', 'anonymity', 'leave', 'mental_health_consequence', 'phys_health_consequence', 'coworkers', 'supervisor', 'mental_health_interview', 'obs_consequence']


**1.1 K-Nearest Neighbors**

In [None]:
y_pred_1_1, y_score_1_1 = kNN_classifier(X_train_1, y_train_1, X_test_1, y_test_1)
results_1_1 = performance(y_test_1, y_pred_1_1, y_score_1_1)
results_1_1

Accuracy: 0.8412


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.827586,0.145299,0.849558,0.827586,0.838428,0.682589,0.914309
Yes,0.854701,0.172414,0.833333,0.854701,0.843882,0.682589,0.914309
Weighted Avg,0.841202,0.158915,0.841411,0.841202,0.841167,0.682589,0.914309


**1.2 Naive Bayesian**

In [None]:
y_pred_1_2, y_score_1_2 = nb_classifier(X_train_1, y_train_1, X_test_1, y_test_1)
results_1_2 = performance(y_test_1, y_pred_1_2, y_score_1_2)
results_1_2

Accuracy: 0.8627


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.853448,0.128205,0.868421,0.853448,0.86087,0.725404,0.917772
Yes,0.871795,0.146552,0.857143,0.871795,0.864407,0.725404,0.917772
Weighted Avg,0.862661,0.137418,0.862758,0.862661,0.862646,0.725404,0.917772


**1.3 Random Forest**

In [None]:
y_pred_1_3, y_score_1_3 = rf_classifier(X_train_1, y_train_1, X_test_1, y_test_1)
results_1_3 = performance(y_test_1, y_pred_1_3, y_score_1_3)
results_1_3

Accuracy: 0.8326


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.818966,0.153846,0.840708,0.818966,0.829694,0.665414,0.914346
Yes,0.846154,0.181034,0.825,0.846154,0.835443,0.665414,0.914346
Weighted Avg,0.832618,0.167499,0.83282,0.832618,0.832581,0.665414,0.914346


**1.4 Support Vector Machines**

In [None]:
y_pred_1_4, y_score_1_4 = svm_classifier(X_train_1, y_train_1, X_test_1, y_test_1)
results_1_4 = performance(y_test_1, y_pred_1_4, y_score_1_4)
results_1_4

Accuracy: 0.8498


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.853448,0.153846,0.846154,0.853448,0.849785,0.699602,0.923077
Yes,0.846154,0.146552,0.853448,0.846154,0.849785,0.699602,0.923077
Weighted Avg,0.849785,0.150183,0.849817,0.849785,0.849785,0.699602,0.923077


**1.5 Artificial Neural Network**

In [None]:
y_pred_1_5, y_score_1_5 = ann_classifier(X_train_1, y_train_1, X_test_1, y_test_1, len(attributes_1))
results_1_5 = performance(y_test_1, y_pred_1_5, y_score_1_5)
results_1_5



Accuracy: 0.824


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.775862,0.128205,0.857143,0.775862,0.81448,0.65083,0.900973
Yes,0.871795,0.224138,0.796875,0.871795,0.832653,0.65083,0.900973
Weighted Avg,0.824034,0.176377,0.82688,0.824034,0.823605,0.65083,0.900973


#### Method 2: Lasso Regression

In [None]:
#Lasso Regression for Attribute Selection (Embedded Method)
def lasso_reg(data_train, data_test):
  #Variables
  data_train_enc = data_train.copy()
  mental_vs_physical = np.where(data_train_enc['mental_vs_physical'].str.contains('No'), 0, 1)
  data_train_enc['mental_vs_physical'] = mental_vs_physical
  attributes = list(data_train_enc.columns[1:])
  X_train = data_train_enc[attributes].values
  y_train = data_train_enc['mental_vs_physical'].values

  #Lasso Regression
  pipeline = Pipeline([('scaler',StandardScaler()), ('model',Lasso())])
  grid_search = GridSearchCV(pipeline, {'model__alpha':np.arange(0.1,10)}, cv = 3, scoring = "neg_mean_squared_error",verbose = 0)
  grid_search.fit(X_train,y_train)
  coefficients = grid_search.best_estimator_.named_steps['model'].coef_
  results = pd.DataFrame({'Attribute' : data_train_enc.columns[1:], 'Abs. Coefficient' : np.abs(coefficients)})
  results = results.sort_values(by = ['Abs. Coefficient'], ascending = False)
  fig = px.bar(results, x = 'Attribute', y = 'Abs. Coefficient', title = 'Results of Lasso Regression: Remove Attributes w/ Absolute Coefficient = 0'); fig.add_hline(y = 0, line_color = 'red'); fig.show()
  
  #Reduced Dataset
  remove = list(results[results['Abs. Coefficient'] == 0]['Attribute'])
  data_enc_train_2 = data_train.drop(columns = remove); data_enc_test_2 = data_test.drop(columns = remove)
  attributes_2 = data_enc_train_2.columns[1:]
  X_train_2, y_train_2 = variables(data_enc_train_2); X_test_2, y_test_2 = variables(data_enc_test_2)

  return X_train_2, y_train_2, X_test_2, y_test_2, attributes_2

X_train_2, y_train_2, X_test_2, y_test_2, attributes_2 = lasso_reg(data_enc_train, data_enc_test)

In [None]:
print('Attributes Selected:')
print(list(attributes_2))

Attributes Selected:
['wellness_program', 'leave', 'mental_health_consequence']


**2.1 K-Nearest Neighbors**

In [None]:
y_pred_2_1, y_score_2_1 = kNN_classifier(X_train_2, y_train_2, X_test_2, y_test_2)
results_2_1 = performance(y_test_2, y_pred_2_1, y_score_2_1)
results_2_1

Accuracy: 0.7897


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.706897,0.128205,0.845361,0.706897,0.769953,0.586967,0.886826
Yes,0.871795,0.293103,0.75,0.871795,0.806324,0.586967,0.886826
Weighted Avg,0.7897,0.211008,0.797476,0.7897,0.788217,0.586967,0.886826


**2.2 Naive Bayesian**

In [None]:
y_pred_2_2, y_score_2_2 = nb_classifier(X_train_2, y_train_2, X_test_2, y_test_2)
results_2_2 = performance(y_test_2, y_pred_2_2, y_score_2_2)
results_2_2

Accuracy: 0.7897


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.810345,0.230769,0.77686,0.810345,0.793249,0.580003,0.883363
Yes,0.769231,0.189655,0.803571,0.769231,0.786026,0.580003,0.883363
Weighted Avg,0.7897,0.210124,0.790273,0.7897,0.789622,0.580003,0.883363


**2.3 Random Forest**

In [None]:
y_pred_2_3, y_score_2_3 = rf_classifier(X_train_2, y_train_2, X_test_2, y_test_2)
results_2_3 = performance(y_test_2, y_pred_2_3, y_score_2_3)
results_2_3

Accuracy: 0.8197


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.784483,0.145299,0.842593,0.784483,0.8125,0.640886,0.890694
Yes,0.854701,0.215517,0.8,0.854701,0.826446,0.640886,0.890694
Weighted Avg,0.819742,0.180559,0.821205,0.819742,0.819503,0.640886,0.890694


**2.4 Support Vector Machines**

In [None]:
y_pred_2_4, y_score_2_4 = svm_classifier(X_train_2, y_train_2, X_test_2, y_test_2)
results_2_4 = performance(y_test_2, y_pred_2_4, y_score_2_4)
results_2_4

Accuracy: 0.7682


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.801724,0.264957,0.75,0.801724,0.775,0.537878,0.894268
Yes,0.735043,0.198276,0.788991,0.735043,0.761062,0.537878,0.894268
Weighted Avg,0.76824,0.231473,0.769579,0.76824,0.768001,0.537878,0.894268


**2.5 Artificial Neural Network**

In [None]:
y_pred_2_5, y_score_2_5 = ann_classifier(X_train_2, y_train_2, X_test_2, y_test_2, len(attributes_2))
results_2_5 = performance(y_test_2, y_pred_2_5, y_score_2_5)
results_2_5



Accuracy: 0.7983


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.706897,0.111111,0.863158,0.706897,0.777251,0.606192,0.896552
Yes,0.888889,0.293103,0.753623,0.888889,0.815686,0.606192,0.896552
Weighted Avg,0.798283,0.202498,0.808155,0.798283,0.796551,0.606192,0.896552


#### Method 3: Decision Tree Induction

In [None]:
#Decision Tree Induction for Attribute Selection
def dec_tree(data_train, data_test):
  #Variables
  data_train_enc = data_train.copy()
  mental_vs_physical = np.where(data_train_enc['mental_vs_physical'].str.contains('No'), 0, 1)
  data_train_enc['mental_vs_physical'] = mental_vs_physical
  attributes = list(data_train_enc.columns[1:])
  X_train = data_train_enc[attributes].values
  y_train = data_train_enc['mental_vs_physical'].values
  
  #CART
  model = DecisionTreeRegressor()
  model.fit(X_train, y_train)
  importance = model.feature_importances_
  results = pd.DataFrame({'Attribute' : data_train_enc.columns[1:], 'Importance' : importance})
  results = results.sort_values(by = ['Importance'], ascending = False)
  fig = px.bar(results, x = 'Attribute', y = 'Importance', title = 'Results of Decision Tree Induction: Remove Attributes w/ Importance < 0.05'); fig.add_hline(y = 0.05, line_color = 'red'); fig.show()

  #Reduced Dataset
  remove = list(results[results['Importance'] < 0.05]['Attribute'])
  data_enc_train_3 = data_train.drop(columns = remove); data_enc_test_3 = data_test.drop(columns = remove)
  attributes_3 = data_enc_train_3.columns[1:]
  X_train_3, y_train_3 = variables(data_enc_train_3); X_test_3, y_test_3 = variables(data_enc_test_3)

  return X_train_3, y_train_3, X_test_3, y_test_3, attributes_3

X_train_3, y_train_3, X_test_3, y_test_3, attributes_3 = dec_tree(data_enc_train, data_enc_test)

In [None]:
print('Attributes Selected:')
print(list(attributes_3))

Attributes Selected:
['Age', 'family_history', 'no_employees', 'wellness_program', 'leave', 'mental_health_consequence', 'phys_health_interview']


**3.1 K-Nearest Neighbors**

In [None]:
y_pred_3_1, y_score_3_1 = kNN_classifier(X_train_3, y_train_3, X_test_3, y_test_3)
results_3_1 = performance(y_test_3, y_pred_3_1, y_score_3_1)
results_3_1

Accuracy: 0.7983


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.775862,0.179487,0.810811,0.775862,0.792952,0.597035,0.873084
Yes,0.820513,0.224138,0.786885,0.820513,0.803347,0.597035,0.873084
Weighted Avg,0.798283,0.201908,0.798797,0.798283,0.798172,0.597035,0.873084


**3.2 Naive Bayesian**

In [None]:
y_pred_3_2, y_score_3_2 = nb_classifier(X_train_3, y_train_3, X_test_3, y_test_3)
results_3_2 = performance(y_test_3, y_pred_3_2, y_score_3_2)
results_3_2

Accuracy: 0.7811


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.775862,0.213675,0.782609,0.775862,0.779221,0.562228,0.889589
Yes,0.786325,0.224138,0.779661,0.786325,0.782979,0.562228,0.889589
Weighted Avg,0.781116,0.218929,0.781129,0.781116,0.781108,0.562228,0.889589


**3.3 Random Forest**

In [None]:
y_pred_3_3, y_score_3_3 = rf_classifier(X_train_3, y_train_3, X_test_3, y_test_3)
results_3_3 = performance(y_test_3, y_pred_3_3, y_score_3_3)
results_3_3

Accuracy: 0.7768


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.784483,0.230769,0.771186,0.784483,0.777778,0.553754,0.877726
Yes,0.769231,0.215517,0.782609,0.769231,0.775862,0.553754,0.877726
Weighted Avg,0.776824,0.223111,0.776922,0.776824,0.776816,0.553754,0.877726


**3.4 Support Vector Machines**

In [None]:
y_pred_3_4, y_score_3_4 = svm_classifier(X_train_3, y_train_3, X_test_3, y_test_3)
results_3_4 = performance(y_test_3, y_pred_3_4, y_score_3_4)
results_3_4

Accuracy: 0.794


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.801724,0.213675,0.788136,0.801724,0.794872,0.588092,0.901267
Yes,0.786325,0.198276,0.8,0.786325,0.793103,0.588092,0.901267
Weighted Avg,0.793991,0.205942,0.794093,0.793991,0.793984,0.588092,0.901267


**3.5 Artificial Neural Network**

In [None]:
y_pred_3_5, y_score_3_5 = ann_classifier(X_train_3, y_train_3, X_test_3, y_test_3, len(attributes_3))
results_3_5 = performance(y_test_3, y_pred_3_5, y_score_3_5)
results_3_5



Accuracy: 0.8112


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.767241,0.145299,0.839623,0.767241,0.801802,0.624478,0.886789
Yes,0.854701,0.232759,0.787402,0.854701,0.819672,0.624478,0.886789
Weighted Avg,0.811159,0.189217,0.8134,0.811159,0.810775,0.624478,0.886789


#### Method 4: Forward Selection

In [None]:
#Forward Selection for Attribute Selection (Wrapper Method)
def forward(data_train, data_test):
  #Variables
  X_train, y_train = variables(data_train)
  X_test, y_test = variables(data_test)

  #Forward Selection
  knn = KNeighborsClassifier(n_neighbors = 5)
  sfs = SFS(knn, k_features = 11, forward = True, floating = False, verbose = 0, scoring = 'accuracy', cv = 0)
  sfs = sfs.fit(X_train, y_train)
  feature_idx = list(sfs.k_feature_idx_)

  #Reduced Dataset
  data_enc_train_4 = data_train.iloc[:,feature_idx]; data_enc_test_4 = data_test.iloc[:,feature_idx]
  attributes_4 = data_enc_train_4.columns[1:]
  X_train_4, y_train_4 = variables(data_enc_train_4); X_test_4, y_test_4 = variables(data_enc_test_4)

  return X_train_4, y_train_4, X_test_4, y_test_4, attributes_4

X_train_4, y_train_4, X_test_4, y_test_4, attributes_4 = forward(data_enc_train, data_enc_test)

In [None]:
print('Attributes Selected:')
print(list(attributes_4))

Attributes Selected:
['Age', 'Gender_Male', 'Gender_Non-binary', 'self_employed', 'no_employees', 'care_options', 'wellness_program', 'anonymity', 'leave', 'supervisor']


**4.1 K-Nearest Neighbors**

In [None]:
y_pred_4_1, y_score_4_1 = kNN_classifier(X_train_4, y_train_4, X_test_4, y_test_4)
results_4_1 = performance(y_test_4, y_pred_4_1, y_score_4_1)
results_4_1

Accuracy: 0.7682


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.75,0.213675,0.776786,0.75,0.763158,0.53672,0.83057
Yes,0.786325,0.25,0.760331,0.786325,0.773109,0.53672,0.83057
Weighted Avg,0.76824,0.231916,0.768523,0.76824,0.768155,0.53672,0.83057


**4.2 Naive Bayesian**

In [None]:
y_pred_4_2, y_score_4_2 = nb_classifier(X_train_4, y_train_4, X_test_4, y_test_4)
results_4_2 = performance(y_test_4, y_pred_4_2, y_score_4_2)
results_4_2

Accuracy: 0.8026


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.801724,0.196581,0.801724,0.801724,0.801724,0.605143,0.876363
Yes,0.803419,0.198276,0.803419,0.803419,0.803419,0.605143,0.876363
Weighted Avg,0.802575,0.197432,0.802575,0.802575,0.802575,0.605143,0.876363


**4.3 Random Forest**

In [None]:
y_pred_4_3, y_score_4_3 = rf_classifier(X_train_4, y_train_4, X_test_4, y_test_4)
results_4_3 = performance(y_test_4, y_pred_4_3, y_score_4_3)
results_4_3

Accuracy: 0.7811


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.775862,0.213675,0.782609,0.775862,0.779221,0.562228,0.862843
Yes,0.786325,0.224138,0.779661,0.786325,0.782979,0.562228,0.862843
Weighted Avg,0.781116,0.218929,0.781129,0.781116,0.781108,0.562228,0.862843


**4.4 Support Vector Machines**

In [None]:
y_pred_4_4, y_score_4_4 = svm_classifier(X_train_4, y_train_4, X_test_4, y_test_4)
results_4_4 = performance(y_test_4, y_pred_4_4, y_score_4_4)
results_4_4

Accuracy: 0.8197


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.784483,0.145299,0.842593,0.784483,0.8125,0.640886,0.890068
Yes,0.854701,0.215517,0.8,0.854701,0.826446,0.640886,0.890068
Weighted Avg,0.819742,0.180559,0.821205,0.819742,0.819503,0.640886,0.890068


**4.5 Artificial Neural Network**

In [None]:
y_pred_4_5, y_score_4_5 = ann_classifier(X_train_4, y_train_4, X_test_4, y_test_4, len(attributes_4))
results_4_5 = performance(y_test_4, y_pred_4_5, y_score_4_5)
results_4_5



Accuracy: 0.8112


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.784483,0.162393,0.827273,0.784483,0.80531,0.623054,0.877542
Yes,0.837607,0.215517,0.796748,0.837607,0.816667,0.623054,0.877542
Weighted Avg,0.811159,0.189069,0.811945,0.811159,0.811013,0.623054,0.877542


#### Method 5: Backwards Selection

In [None]:
#Backwards Selection for Attribute Selection (Wrapper Method)
def backward(data_train, data_test):
  #Variables
  X_train, y_train = variables(data_train)
  X_test, y_test = variables(data_test)

  #Backwards Selection
  knn = KNeighborsClassifier(n_neighbors = 5)
  sfs = SFS(knn, k_features = 11, forward = False, floating = False, verbose = 0, scoring = 'accuracy', cv = 0)
  sfs = sfs.fit(X_train, y_train)
  feature_idx = list(sfs.k_feature_idx_)

  #Reduced Dataset
  data_enc_train_5 = data_train.iloc[:,feature_idx]; data_enc_test_5 = data_test.iloc[:,feature_idx]
  attributes_5 = data_enc_train_5.columns[1:]
  X_train_5, y_train_5 = variables(data_enc_train_5); X_test_5, y_test_5 = variables(data_enc_test_5)

  return X_train_5, y_train_5, X_test_5, y_test_5, attributes_5

X_train_5, y_train_5, X_test_5, y_test_5, attributes_5 = backward(data_enc_train, data_enc_test)

In [None]:
print('Attributes Selected:')
print(list(attributes_5))

Attributes Selected:
['Gender_Non-binary', 'self_employed', 'family_history', 'treatment', 'no_employees', 'tech_company', 'care_options', 'anonymity', 'leave', 'supervisor']


**5.1 K-Nearest Neighbors**

In [None]:
y_pred_5_1, y_score_5_1 = kNN_classifier(X_train_5, y_train_5, X_test_5, y_test_5)
results_5_1 = performance(y_test_5, y_pred_5_1, y_score_5_1)
results_5_1

Accuracy: 0.7167


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.732759,0.299145,0.708333,0.732759,0.720339,0.433805,0.794724
Yes,0.700855,0.267241,0.725664,0.700855,0.713043,0.433805,0.794724
Weighted Avg,0.716738,0.283125,0.717036,0.716738,0.716676,0.433805,0.794724


**5.2 Naive Bayesian**

In [None]:
y_pred_5_2, y_score_5_2 = nb_classifier(X_train_5, y_train_5, X_test_5, y_test_5)
results_5_2 = performance(y_test_5, y_pred_5_2, y_score_5_2)
results_5_2

Accuracy: 0.7639


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.801724,0.273504,0.744,0.801724,0.771784,0.529627,0.856322
Yes,0.726496,0.198276,0.787037,0.726496,0.755556,0.529627,0.856322
Weighted Avg,0.763948,0.235729,0.765611,0.763948,0.763635,0.529627,0.856322


**5.3 Random Forest**

In [None]:
y_pred_5_3, y_score_5_3 = rf_classifier(X_train_5, y_train_5, X_test_5, y_test_5)
results_5_3 = performance(y_test_5, y_pred_5_3, y_score_5_3)
results_5_3

Accuracy: 0.6738


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.681034,0.333333,0.669492,0.681034,0.675214,0.347727,0.757552
Yes,0.666667,0.318966,0.678261,0.666667,0.672414,0.347727,0.757552
Weighted Avg,0.67382,0.326119,0.673895,0.67382,0.673808,0.347727,0.757552


**5.4 Support Vector Machines**

In [None]:
y_pred_5_4, y_score_5_4 = svm_classifier(X_train_5, y_train_5, X_test_5, y_test_5)
results_5_4 = performance(y_test_5, y_pred_5_4, y_score_5_4)
results_5_4

Accuracy: 0.8026


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.775862,0.17094,0.818182,0.775862,0.79646,0.60586,0.863469
Yes,0.82906,0.224138,0.788618,0.82906,0.808333,0.60586,0.863469
Weighted Avg,0.802575,0.197653,0.803336,0.802575,0.802422,0.60586,0.863469


**5.5 Artificial Neural Network**

In [None]:
y_pred_5_5, y_score_5_5 = ann_classifier(X_train_5, y_train_5, X_test_5, y_test_5, len(attributes_5))
results_5_5 = performance(y_test_5, y_pred_5_5, y_score_5_5)
results_5_5



Accuracy: 0.7382


Unnamed: 0,TPR,FPR,Precision,Recall,F1 Measure,MCC,ROC AUC
No,0.741379,0.264957,0.735043,0.741379,0.738197,0.476422,0.833702
Yes,0.735043,0.258621,0.741379,0.735043,0.738197,0.476422,0.833702
Weighted Avg,0.738197,0.261775,0.738225,0.738197,0.738197,0.476422,0.833702


## Evaluation

### Attribute Selection Results

In [None]:
attributes = list(data_enc.columns)[1:]; selection_methods = ['Chi-Square', 'Lasso Reg', 'Decision Tree', 'Forward Selection', 'Backwards Selection']
def attribute_check(attribute_x):
  attributes_x_lst = []
  for att in attributes:
    if att in attribute_x:
      attributes_x_lst.append(1)
    else:
      attributes_x_lst.append(0)
  return attributes_x_lst

attributes_selected = pd.DataFrame(list(zip(attribute_check(attributes_1), attribute_check(attributes_2), attribute_check(attributes_3), attribute_check(attributes_4), attribute_check(attributes_5))), 
                        columns = selection_methods, index = attributes) 
attributes_selected.loc['# Attributes'] = list(attributes_selected.sum(axis=0))
attributes_selected['Times Selected'] = list(attributes_selected.sum(axis=1))
attributes_selected

Unnamed: 0,Chi-Square,Lasso Reg,Decision Tree,Forward Selection,Backwards Selection,Times Selected
Age,0,0,1,1,0,2
Gender_Male,0,0,0,1,0,1
Gender_Non-binary,0,0,0,1,1,2
self_employed,1,0,0,1,1,3
family_history,0,0,1,0,1,2
treatment,0,0,0,0,1,1
no_employees,1,0,1,1,1,4
remote_work,0,0,0,0,0,0
tech_company,0,0,0,0,1,1
benefits,1,0,0,0,0,1


In [None]:
selected = pd.DataFrame({'Selection Methods': selection_methods, 'Number of Attributes': list(attributes_selected.loc['# Attributes'])[:5]})
selected.sort_values(by=['Number of Attributes'], inplace=True, ascending=False)
fig = px.bar(selected, x = 'Selection Methods', y = 'Number of Attributes', title = 'Number of Attributes for each Selection Method'); fig.show()

In [None]:
attributes = pd.DataFrame({'Attributes': attributes, 'Times Selected': list(attributes_selected['Times Selected'])[:22]})
attributes.sort_values(by=['Times Selected'], inplace=True, ascending=False)
fig = px.bar(attributes, x = 'Attributes', y = 'Times Selected', title = 'Number of Times an Attribute was Selected'); fig.show()

### ROC AUC Performance Results

In [None]:
#Summary table of the ROC AUC results from the 25 models
classifiers = ['kNN', 'Naive Bayes', 'Random Forest', 'SVM', 'ANN']
methods = ['Chi-Square', 'Lasso Reg', 'Decision Tree', 'Forward Selection', 'Backwards Selection']

auc_results = pd.DataFrame({'Chi-Square' : [results_1_1.loc['Weighted Avg', 'ROC AUC'], results_1_2.loc['Weighted Avg', 'ROC AUC'], results_1_3.loc['Weighted Avg', 'ROC AUC'], results_1_4.loc['Weighted Avg', 'ROC AUC'], results_1_5.loc['Weighted Avg', 'ROC AUC']],
                           'Lasso Reg' : [results_2_1.loc['Weighted Avg', 'ROC AUC'], results_2_2.loc['Weighted Avg', 'ROC AUC'], results_2_3.loc['Weighted Avg', 'ROC AUC'], results_2_4.loc['Weighted Avg', 'ROC AUC'], results_2_5.loc['Weighted Avg', 'ROC AUC']],
                           'Decision Tree' : [results_3_1.loc['Weighted Avg', 'ROC AUC'], results_3_2.loc['Weighted Avg', 'ROC AUC'], results_3_3.loc['Weighted Avg', 'ROC AUC'], results_3_4.loc['Weighted Avg', 'ROC AUC'], results_3_5.loc['Weighted Avg', 'ROC AUC']],
                           'Forward Selection' : [results_4_1.loc['Weighted Avg', 'ROC AUC'], results_4_2.loc['Weighted Avg', 'ROC AUC'], results_4_3.loc['Weighted Avg', 'ROC AUC'], results_4_4.loc['Weighted Avg', 'ROC AUC'], results_4_5.loc['Weighted Avg', 'ROC AUC']],
                           'Backwards Selection' : [results_5_1.loc['Weighted Avg', 'ROC AUC'], results_5_2.loc['Weighted Avg', 'ROC AUC'], results_5_3.loc['Weighted Avg', 'ROC AUC'], results_5_4.loc['Weighted Avg', 'ROC AUC'], results_5_5.loc['Weighted Avg', 'ROC AUC']]},
                           index = classifiers)

auc_results['Classifier Average'] = auc_results.mean(axis = 1)
auc_results.loc['Method Average'] = auc_results.mean(axis = 0)

print('ROC AUC Results of the Classification Algorithms:\n')
auc_results

ROC AUC Results of the Classification Algorithms:



Unnamed: 0,Chi-Square,Lasso Reg,Decision Tree,Forward Selection,Backwards Selection,Classifier Average
kNN,0.914309,0.886826,0.873084,0.83057,0.794724,0.859903
Naive Bayes,0.917772,0.883363,0.889589,0.876363,0.856322,0.884682
Random Forest,0.914346,0.890694,0.877726,0.862843,0.757552,0.860632
SVM,0.923077,0.894268,0.901267,0.890068,0.863469,0.89443
ANN,0.900973,0.896552,0.886789,0.877542,0.833702,0.879111
Method Average,0.914095,0.89034,0.885691,0.867477,0.821154,0.875752


In [None]:
#Bar chart of the ROC AUC resutls from the 25 models
selection_methods = [] 
for m in methods:
  selection_methods.extend([m] * 5)
classifiers_25 = classifiers * 5
auc_results_t = pd.DataFrame({'Selection Method': selection_methods, 'Classifier': classifiers_25, 'ROC AUC' : 
                             [results_1_1.loc['Weighted Avg', 'ROC AUC'], results_1_2.loc['Weighted Avg', 'ROC AUC'], results_1_3.loc['Weighted Avg', 'ROC AUC'], results_1_4.loc['Weighted Avg', 'ROC AUC'], results_1_5.loc['Weighted Avg', 'ROC AUC'],
                              results_2_1.loc['Weighted Avg', 'ROC AUC'], results_2_2.loc['Weighted Avg', 'ROC AUC'], results_2_3.loc['Weighted Avg', 'ROC AUC'], results_2_4.loc['Weighted Avg', 'ROC AUC'], results_2_5.loc['Weighted Avg', 'ROC AUC'],
                              results_3_1.loc['Weighted Avg', 'ROC AUC'], results_3_2.loc['Weighted Avg', 'ROC AUC'], results_3_3.loc['Weighted Avg', 'ROC AUC'], results_3_4.loc['Weighted Avg', 'ROC AUC'], results_3_5.loc['Weighted Avg', 'ROC AUC'],
                              results_4_1.loc['Weighted Avg', 'ROC AUC'], results_4_2.loc['Weighted Avg', 'ROC AUC'], results_4_3.loc['Weighted Avg', 'ROC AUC'], results_4_4.loc['Weighted Avg', 'ROC AUC'], results_4_5.loc['Weighted Avg', 'ROC AUC'],
                              results_5_1.loc['Weighted Avg', 'ROC AUC'], results_5_2.loc['Weighted Avg', 'ROC AUC'], results_5_3.loc['Weighted Avg', 'ROC AUC'], results_5_4.loc['Weighted Avg', 'ROC AUC'], results_5_5.loc['Weighted Avg', 'ROC AUC']]})

fig = px.bar(auc_results_t, x = 'Selection Method', y = 'ROC AUC', color = 'Classifier', barmode = 'group', 
             text_auto = '.2f', color_discrete_map = {'kNN':colors[0], 'Naive Bayes':colors[2], 'Random Forest':colors[4], 'SVM':colors[6], 'ANN':colors[8]}, opacity = .75, title = 'Performance Results from 25 Models: ROC AUC')
fig.show()

#### Best Method & Classifier: Voting

In [None]:
#Best Attribute Selection Method from Voting
indcies = [np.argmax(auc_results.loc[c]) for c in classifiers]
counts = list(map(methods.__getitem__, indcies))
counts = [counts.count(m) for m in methods]
print('Best Attribute Selection Method based on Voting:\n')
pd.DataFrame({'Method' : methods, 'No. Times Best Method': counts})

Best Attribute Selection Method based on Voting:



Unnamed: 0,Method,No. Times Best Method
0,Chi-Square,5
1,Lasso Reg,0
2,Decision Tree,0
3,Forward Selection,0
4,Backwards Selection,0


In [None]:
#Best Classification Algorithm from Voting
indcies = [np.argmax(auc_results[m]) for m in methods]
counts = list(map(classifiers.__getitem__, indcies))
counts = [counts.count(c) for c in classifiers]
print('Best Classifier based on Voting:\n')
pd.DataFrame({'Classifier' : classifiers, 'No. Times Best Classifier': counts})

Best Classifier based on Voting:



Unnamed: 0,Classifier,No. Times Best Classifier
0,kNN,0
1,Naive Bayes,0
2,Random Forest,0
3,SVM,4
4,ANN,1


#### Best Method & Classifier: Average

In [None]:
#Best Attribute Selection Method from Averaging
auc_results_classifier = pd.DataFrame({'Attribute Selection Method' : methods, 'ROC AUC' : list(auc_results.loc['Method Average'])[:-1]})
fig = px.bar(auc_results_classifier, x = 'Attribute Selection Method', y = 'ROC AUC', text_auto = '.2f', color = methods, color_discrete_map = {'Chi-Square':colors[1], 'Lasso Reg':colors[3], 'Decision Tree':colors[5], 'Forward Selection':colors[7], 'Backwards Selection':colors[9]}, opacity = .75, title = 'Average ROC AUC by Attribute Selection Method')
fig.update_layout(showlegend = False); fig.show()

In [None]:
#Best Classification Algorithm from Averaging
auc_results_classifier = pd.DataFrame({'Classifier' : classifiers, 'ROC AUC' : list(auc_results['Classifier Average'])[:-1]})
fig = px.bar(auc_results_classifier, x = 'Classifier', y = 'ROC AUC', text_auto = '.2f', color = classifiers, color_discrete_map = {'kNN':colors[0], 'Naive Bayes':colors[2], 'Random Forest':colors[4], 'SVM':colors[6], 'ANN':colors[8]}, opacity = .75, title = 'Average ROC AUC by Classification Algoirithm')
fig.update_layout(showlegend = False); fig.show()

#### Best Model

In [None]:
#Chi-Square Attribute Selection Method + SVM Classifier
auc_results_best_model = pd.DataFrame({'Method' : ['None', 'Chi-Square'], 'ROC AUC' : [cv_results_4.loc['Weighted Avg', 'ROC AUC'], results_1_4.loc['Weighted Avg', 'ROC AUC']]})
fig = px.bar(auc_results_best_model, x = 'Method', y = 'ROC AUC', text_auto = '.2f', color = 'Method', color_discrete_map = {'None':colors[0], 'Chi-Square':colors[2]}, opacity = .75, title = 'Performance Results: ROC AUC of Best Model vs All Attributes', width = 750, height = 500)
fig.update_layout(showlegend = False); fig.show()

In [None]:
#Train and Test Datasets from Best Model
best_attributes = ['mental_vs_physical']; best_attributes.extend(attributes_1)
data_enc_train[best_attributes].to_csv('/content/drive/MyDrive/CS699 Project/data/data_best_train.csv')
data_enc_test[best_attributes].to_csv('/content/drive/MyDrive/CS699 Project/data/data_best_test.csv')