#  Mental Health in the Tech Industry

> This data set measures the attitudes towards mental health and frequency of mental health disorders in the tech workplace. 
I will be using machine learning on this data set to analyze the data and see if we can predict which employees are in need of treatment. 

__Link to dataset__ <br>
https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed

<br>

## Data Exploration/Preprocessing

First, we need to do a little data exploration, see if there is any missing data and how we should tackle the missing values. After that we will preprocess the data, encode categorical data, fix errors within the data, etc. We can also come back to this process to enhance our model after making preliminary models. Let's import our python modules, and check out the data set. 

In [1]:


import time
import numpy as np
import pandas as pd

import sklearn
from IPython.display import display
from sklearn.model_selection import train_test_split
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV as gscv
from sklearn.model_selection import RandomizedSearchCV as rscv
import warnings
from itertools import permutations
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

warnings.filterwarnings('ignore')

pd.options.display.max_columns = None
pd.options.display.max_rows = 50

filepath = 'C:\\Users\\jpkli\\Desktop\\Machine Learning Projects\\Mental Health\\survey'

df_1 = pd.read_csv(f'{filepath}.csv')


In [2]:
df_1.head(15)

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,No,Yes,Yes,Not sure,No,Yes,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,No,No,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,No,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,No,Yes,No,Yes,No,No,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,Yes,Yes,Yes,No,Don't know,Don't know,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,
5,2014-08-27 11:31:22,33,Male,United States,TN,,Yes,No,Sometimes,6-25,No,Yes,Yes,Not sure,No,Don't know,Don't know,Don't know,No,No,Yes,Yes,No,Maybe,Don't know,No,
6,2014-08-27 11:31:50,35,Female,United States,MI,,Yes,Yes,Sometimes,1-5,Yes,Yes,No,No,No,No,No,Somewhat difficult,Maybe,Maybe,Some of them,No,No,No,Don't know,No,
7,2014-08-27 11:32:05,39,M,Canada,,,No,No,Never,1-5,Yes,Yes,No,Yes,No,No,Yes,Don't know,No,No,No,No,No,No,No,No,
8,2014-08-27 11:32:39,42,Female,United States,IL,,Yes,Yes,Sometimes,100-500,No,Yes,Yes,Yes,No,No,No,Very difficult,Maybe,No,Yes,Yes,No,Maybe,No,No,
9,2014-08-27 11:32:43,23,Male,Canada,,,No,No,Never,26-100,No,Yes,Don't know,No,Don't know,Don't know,Don't know,Don't know,No,No,Yes,Yes,Maybe,Maybe,Yes,No,


<br>

As we can see from a glimps of the data, there are missing values, typos, and a timestamp column. For this data in particular the timestamp relates to when each individual person submitted their survey, and is not representing multiple surveys from individuals, so for the time being I will drop that column. (May come back to this later). Also, the comments section may be useful to read, but for our current purposes we will not be using that column. 

Lets take a look at our columns with missing data.

In [3]:
df_1 = df_1.drop(['comments'], axis = 1)
df_1 = df_1.drop(['Timestamp'], axis = 1)
display(df_1.isnull().sum())

Age                            0
Gender                         0
Country                        0
state                        515
self_employed                 18
family_history                 0
treatment                      0
work_interfere               264
no_employees                   0
remote_work                    0
tech_company                   0
benefits                       0
care_options                   0
wellness_program               0
seek_help                      0
anonymity                      0
leave                          0
mental_health_consequence      0
phys_health_consequence        0
coworkers                      0
supervisor                     0
mental_health_interview        0
phys_health_interview          0
mental_vs_physical             0
obs_consequence                0
dtype: int64

<br>
Even though the 'state' column is missing about 41% of its data, it isn't as bad as it seems. Most of the missing state values are due to the fact that they are in a country without states. So I will try to work with that in this dataset. 
<br>

In [4]:
subset = df_1[(df_1['Country'] == 'United States') & (df_1['state'].isnull())]
subset

Unnamed: 0,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
52,31,M,United States,,No,No,No,,100-500,Yes,Yes,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,Maybe,Some of them,Some of them,Maybe,Maybe,Don't know,No
294,56,Male,United States,,No,No,Yes,Never,More than 1000,No,Yes,Yes,Not sure,Don't know,Don't know,Don't know,Don't know,No,Maybe,Yes,Some of them,No,Maybe,Don't know,No
367,36,Male,United States,,No,Yes,Yes,Often,100-500,No,Yes,No,Yes,No,No,Yes,Very easy,No,No,Some of them,Some of them,No,No,Don't know,No
525,41,Female,United States,,No,Yes,Yes,Rarely,500-1000,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Very easy,Maybe,Maybe,Some of them,Some of them,No,No,Yes,No
574,50,Male,United States,,No,No,No,Never,26-100,Yes,Yes,No,Yes,No,No,Don't know,Don't know,No,No,No,No,No,Maybe,No,No
596,24,Female,United States,,No,Yes,Yes,Sometimes,100-500,No,Yes,Yes,Not sure,No,No,Don't know,Somewhat difficult,Yes,Maybe,No,No,No,No,No,Yes
638,35,Male,United States,,Yes,No,No,,1-5,Yes,Yes,Yes,Not sure,No,No,Yes,Very easy,No,No,Some of them,Yes,No,No,Yes,No
817,44,male,United States,,Yes,Yes,Yes,Sometimes,1-5,Yes,Yes,No,Yes,No,No,No,Very easy,Yes,Yes,Some of them,No,No,No,Yes,No
854,31,Male,United States,,No,Yes,No,,6-25,No,Yes,Don't know,Not sure,No,No,Don't know,Don't know,Maybe,No,Some of them,Some of them,No,No,Don't know,No
926,43,M,United States,,No,Yes,No,Sometimes,500-1000,No,No,Yes,Not sure,No,Don't know,Don't know,Don't know,Maybe,No,No,Some of them,No,Maybe,No,No


In [8]:
df_1['state'].value_counts()

740

<br>
Seeing that only 11 of the rows missing values in state, we will replace those missing values with the mode, which happens to be California. For every other missing value we will replace with 'other'. This can be an area that we go back to later to try and be more precise with replacing the missing data. 

<br>

In [6]:
df_1 = df_1[df_1['Country'] == 'United States'].fillna(df_1['state'] == 'CA')

In [7]:
df_1['state'].value_counts().

CA    138
WA     70
NY     56
TN     45
TX     44
OH     30
PA     29
OR     29
IL     28
IN     27
MI     22
MN     21
MA     20
FL     15
NC     14
VA     14
GA     12
WI     12
MO     12
UT     10
CO      9
AL      8
MD      7
AZ      7
OK      6
NJ      6
KY      5
SC      5
DC      4
IA      4
CT      4
NV      3
NH      3
SD      3
VT      3
KS      3
WY      2
NE      2
NM      2
MS      1
ME      1
RI      1
LA      1
ID      1
WV      1
Name: state, dtype: int64