## Capstone Project
-------

### Stage 1 - Cleaning phase
------

#### Importing packages and data
------

In [0]:
# import packages
import pandas as pd
pd.set_option("display.max_columns", None)
import numpy as np

import warnings
warnings.filterwarnings("ignore")

from sklearn import pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

The data used for this project come from the [2017](https://www.kaggle.com/osmihelp/osmi-mental-health-in-tech-survey-2017) and [2018](https://www.kaggle.com/osmihelp/osmi-mental-health-in-tech-survey-2018) Mental Health in the Tech Industry surveys conducted by Open Sourcing Mental Illness (OSMI), both available on Kaggle.

In [2]:
# import data
data_2017 = pd.read_csv("Datasets/2017_survey.csv")
data_2018 = pd.read_csv("Datasets/2018_survey.csv")

# Combining the datasets into one table
data = pd.concat([data_2017,data_2018],sort=False,ignore_index=True)
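
The two survey years need not have identical column sets. As a minimal sketch with toy frames (the column names here are made up for illustration), `pd.concat` aligns columns by name and fills the gaps with NaN, which is why duplicate and partly empty columns need attention afterwards:

```python
import pandas as pd

# Toy stand-ins for the two survey years (hypothetical columns).
survey_a = pd.DataFrame({"#": [1, 2], "age": [30, 41]})
survey_b = pd.DataFrame({"#": [1], "age": [25], "race": ["Asian"]})

# Columns are aligned by name; values missing from one year become NaN.
combined = pd.concat([survey_a, survey_b], sort=False, ignore_index=True)

print(combined.shape)
print(combined["race"].isna().sum())
```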



#### Preliminary data cleaning
------

<u> Goals: </u>

1. Cleaning column titles
2. Combining and cleaning responses
3. Handling NaN values

In [0]:
# Defining some functions for use later on
def combine_columns(first_num,num_list,df_name):
    '''
    This function combines duplicate columns.
    
    Inputs:
    ------
    first_num: an integer of the column number you wish the information to be combined to
    num_list: a list of integers of the column numbers you wish the information to be combined
    df_name = the name of the dataframe
    
    '''
    for num in num_list:
        df_name.iloc[:,first_num] = df_name.iloc[:,first_num] + df_name.iloc[:,num]

def combine_info(my_list,column_name = "What is your race?"):
    '''
    This function combines similar responses (but spelled differently or used slightly different wording) 
    into one category of response.
    
    Inputs:
    ------
    my_list = a list of responses you want to put into the category
    column_name = the name of the column
    
    '''
    for info in my_list[1:]:
        # .loc avoids chained assignment, which may silently fail to update df
        df.loc[df[column_name]==info, column_name] = my_list[0]
            
def fillna_with_median(question = "What is your age?"):
    '''
    This function fills NaN values with the median of the column.
    
    Input:
    ------
    question: the name of the column
    
    '''
    median = df[question].median()  # Series.median skips NaN values by default
    df[question] = df[question].fillna(median)
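
The helpers above rely on a global `df`. A possible variant, shown here only as a sketch, passes the DataFrame in explicitly; the names `combine_info_explicit` and `fillna_with_median_explicit` are hypothetical and do not appear elsewhere in this notebook:

```python
import numpy as np
import pandas as pd

def combine_info_explicit(frame, labels, column_name):
    """Map every label in labels[1:] to labels[0] within one column."""
    mapping = {label: labels[0] for label in labels[1:]}
    frame[column_name] = frame[column_name].replace(mapping)
    return frame

def fillna_with_median_explicit(frame, column_name):
    """Fill NaN values in a numeric column with that column's median."""
    frame[column_name] = frame[column_name].fillna(frame[column_name].median())
    return frame

toy = pd.DataFrame({"race": ["White", "Caucasian", "Latino"],
                    "age": [20.0, np.nan, 40.0]})
toy = combine_info_explicit(toy, ["Caucasian", "White"], "race")
toy = fillna_with_median_explicit(toy, "age")
```

Passing the frame as an argument makes the functions usable on `data`, `df`, or `others_dummy` alike.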

Some of the questions have HTML code embedded in them, so some cleaning is required to make the columns easier to search later on.

In [0]:
# cleaning column titles
columns_to_clean = data.columns[data.columns.str.contains("<strong>")]

# grouping column names based on where the HTML code is at
list_1 = (columns_to_clean[:7],columns_to_clean[9:11])
list_2 = (columns_to_clean[-5],columns_to_clean[-3])
list_3 = (columns_to_clean[-4],columns_to_clean[-2])
list_4 = [["If you have a mental health disorder, how often do you feel that it interferes with your work <strong>when being treated effectively?</strong>",
            "If you have a mental health disorder, how often do you feel that it interferes with your work when being treated effectively?"],
           ["If you have a mental health disorder, how often do you feel that it interferes with your work <strong>when <em>NOT</em> being treated effectively (i.e., when you are experiencing symptoms)?</strong>",
            "If you have a mental health disorder, how often do you feel that it interferes with your work when NOT being treated effectively (i.e., when you are experiencing symptoms)?"]]

# renaming columns
for item in list_1:
    for question in range(len(item)):
        data.rename(columns = {f"{item[question]}": f"{item[question][8:-9]}"},inplace=True)

for question in list_2:
    data.rename(columns = {f"{question}": f"{question[:20]+question[28:32]+question[-4:]}"},inplace=True)

for question in list_3:
    data.rename(columns = {f"{question}": f"{question[:34]+question[42:46]+question[-4:]}"},inplace=True)

for pairs in list_4:
    data.rename(columns = {pairs[0]:pairs[1]},inplace=True)
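
The renaming above depends on fixed character positions. As an alternative sketch, a regular expression can strip anything that looks like an HTML tag from every column name at once. The toy column names are invented, and note that this keeps the full question text, whereas the slicing used for `list_2` and `list_3` above also shortens those questions:

```python
import pandas as pd

toy = pd.DataFrame(columns=[
    "<strong>Do you have previous employers?</strong>",
    "Work interference <strong>when <em>NOT</em> treated</strong>",
    "What is your age?",
])

# Remove anything that looks like an HTML tag, then trim stray whitespace.
toy.columns = toy.columns.str.replace(r"<[^>]+>", "", regex=True).str.strip()

print(list(toy.columns))
```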

A unique `id` is created for each row of the survey data, replacing the IDs in the `#` column.

In [0]:
# insert unique id and drop column "#"
data.insert(0,"id",(data.index+1))
data.drop(columns = "#",inplace=True)

##### Cleaning responses for MH disorders
-----

The MH disorder columns from the different survey years have not been merged. Their contents will therefore be combined and the duplicate columns deleted.

First, the information in the duplicated columns will be combined into the first 13 columns of MH disorders to avoid loss of data.

In [0]:
# combining data
start_num = 50

while start_num < 64:
    data.iloc[:,start_num].fillna(data.iloc[:,(start_num+13)],inplace=True)
    data.iloc[:,start_num].fillna(data.iloc[:,(start_num+26)],inplace=True)
    data.iloc[:,start_num].fillna(0,inplace=True)
    data.iloc[:,start_num].where(data.iloc[:,start_num]==0,1,inplace=True)
    start_num += 1
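
The loop above coalesces each disorder column with its two duplicates and then turns any response into a 1. The same idea can be sketched on a toy frame, assuming (as the loop does) that any non-missing entry counts as a positive response; the column names are illustrative only:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Anxiety": ["Anxiety", np.nan, np.nan, np.nan],
    "Anxiety.1": [np.nan, "Anxiety", np.nan, np.nan],
    "Anxiety.2": [np.nan, np.nan, "Anxiety", np.nan],
})

# A row is flagged 1 if any of the three duplicate columns holds a response.
duplicates = ["Anxiety", "Anxiety.1", "Anxiety.2"]
toy["Anxiety"] = toy[duplicates].notna().any(axis=1).astype(int)

print(toy["Anxiety"].tolist())
```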

The disorders in the "Other" category will be converted to dummy variables to match the format of other columns of MH disorders.

In [0]:
# converting the "Other" responses to dummy variables
others_dummy = pd.concat([pd.get_dummies(data["Other.1"]),pd.get_dummies(data["Other.2"])],axis=1)

Since some disorders like Asperger's Syndrome are repeated with slightly different names, those columns will be combined to avoid duplication.

In [0]:
# combining columns
ADHD_list = (5,8)
ASD_list = (1,2,3,7,9)
Depression_list = (15,16)

my_list = [(0,ADHD_list),(10,ASD_list),(14,Depression_list)]

for num, name in my_list:
    combine_columns(num,name,others_dummy)

# combining panic disorder
others_dummy.iloc[:,-3] = others_dummy.iloc[:,-3] + others_dummy.iloc[:,-2]

Once the information from the duplicate columns is combined, the duplicates in the DataFrame `others_dummy` can be dropped.

In [0]:
# dropping duplicate columns in others_dummy
drop_list = list(ADHD_list + ASD_list + Depression_list)
drop_list.append(20)
column_names = []

for num in drop_list:
    column_names.append(others_dummy.columns[num])

others_dummy.drop(columns = column_names,inplace=True)
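
Combining and dropping dummy columns by position works, but it breaks if the column order changes. One possible alternative, sketched here with invented responses and a hypothetical `canonical` mapping, is to normalize the spellings before creating the dummies so that each disorder yields a single column:

```python
import pandas as pd

responses = pd.Series(["ADHD", "Aspergers", "Asperger's Syndrome", "ADD", "PTSD"])

# Hypothetical spelling-to-category map; unlisted answers keep their own label.
canonical = {
    "ADD": "ADHD",
    "Aspergers": "Autism Spectrum Disorder",
    "Asperger's Syndrome": "Autism Spectrum Disorder",
}

dummies = pd.get_dummies(responses.replace(canonical), dtype=int)

print(dummies.sum().to_dict())
```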

Upon further inspection, some responses in the "Other" category are duplicates of existing categories in the main dataset (e.g. Mood Disorder). These responses will therefore be added to the main dataset.

In [0]:
# Mood disorder
data.iloc[:,51] = data.iloc[:,51] + others_dummy.iloc[:,-5] + others_dummy.iloc[:,4]

# ADHD
data.iloc[:,54] += others_dummy.iloc[:,0]

Now, those duplicate columns in `others_dummy` will be dropped as well.

In [0]:
drop_list = (0,4,7)
column_names2 = []

for num in drop_list:
    column_names2.append(others_dummy.columns[num])

others_dummy.drop(columns = column_names2,inplace=True)

Since the information from the "Other" categories has been extracted, all the duplicate columns can now be dropped.

In [0]:
# dropping duplicate columns in data
data.drop(columns=data.columns[62:89],inplace=True)

##### Handling duplicate columns for other questions
-----

Some other survey questions also appear to have duplicate columns. These are the questions on:

- How team members would react to your MH diagnosis
- How the MH disorder interferes with work when it is not treated effectively
- Current employer's MH coverage

The last column of the data is irrelevant to the analysis, so it will be dropped as well.

In [0]:
pairs = [(5,-4),(66,-3),(76,-2)]

for i,j in pairs:
    data.iloc[:,i].fillna(data.iloc[:,j],inplace=True)
    data.drop(columns=data.columns[j],inplace=True)

# drop last column in data
data.drop(columns=data.columns[-1],inplace=True)

Finally, the DataFrames `data` and `others_dummy` are combined and a copy of the DataFrame is made to preserve the original dataset.

In [0]:
# combining df
data = pd.concat([data.iloc[:,:62], others_dummy, data.iloc[:,62:]],axis = 1)

# making a copy of the dataset
df = data.copy()

##### Handling NaN values and combining responses
------

NaN values are handled in the following ways:

- A column will be dropped if more than 50% of its values are NaN, or if more than 25% are NaN and the column will not be used in the modelling phase
- In most text responses, NaN values will be replaced with "Did not answer" or "NA"
- In most categorical responses (i.e. 0/1), NaN values will be replaced with -1
- In most continuous responses, NaN values will be replaced with either the median value or a value predicted by a machine learning algorithm

In [0]:
# fill in some NaN for some columns that have over 50% NaN value to keep those columns
for num in [49,-6]:
    df.iloc[:,num].fillna("Did not answer",inplace=True)

# drop columns with over 50% NaN values
delete_list = df.isna().sum()[df.isna().sum() > 587]

for num in range(len(delete_list)):
    df.drop(columns = delete_list.index[num],inplace=True)

# drop columns with over 25% NaN values that are deemed not essential
df.drop(columns=df.columns[-12],inplace=True)

# fillna for describing things to improve
df.iloc[:,-12].fillna("Did not answer",inplace=True)
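
The cell above uses a fixed count of 587 as the cut-off. Assuming that figure is meant to represent half of the rows, the same rule can be sketched with a proportion so it does not depend on the size of the dataset (toy data, invented column names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "half_missing": [np.nan, np.nan, 1.0, 2.0],
    "complete": [1, 2, 3, 4],
})

# Share of NaN values per column; keep columns with at most 50% missing.
nan_share = toy.isna().mean()
kept = toy.loc[:, nan_share <= 0.5]

print(list(kept.columns))
```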

Many answers are duplicates that differ only in wording or spelling (e.g. Latino vs. Latina). The responses in a number of columns therefore need to be cleaned and combined, and the duplicate columns dropped.

#### Race
------

In [0]:
# cleaning race column
question = "What is your race?"
df[question].fillna(df["Other.3"],inplace=True)

# cleaning up racial responses
hispanics = ["Hispanic","Hispanic or Latino","Latina","Latino","Latinx","mexican american "]
no_answer = ["Did not answer","I prefer not to answer","I am of the race of Adam, the first human."]
mixed = ["Mixed","More than one of the above","Hispanic, White","Mestizo"]
jewish = ["Jewish","Ashkenazi"]
caucasian = ["Caucasian","White","European American","My race is white, but my ethnicity is Latin American"]
caribbean = ["Caribbean","Indo-Caribbean","West Indian"]
asian = ["Asian","South Asian"]
aa = ["African American","Afrcian American","Black or African American"]  # misspelled variant kept so it is still matched

race_list = [hispanics,no_answer,mixed,jewish,caucasian,caribbean,asian,aa]

for race in race_list:
    combine_info(race,column_name = "What is your race?")
    
# dropping duplicate column
df.drop(columns="Other.3",inplace=True)

#### Gender
------

In [0]:
# Cleaning gender
question = "What is your gender?"
df[question].fillna("Did not answer",inplace=True)

# Combine gender responses
male = ["Male","Cis Male","Cis male","Cis-male","Cisgender male","M","MALE","cis hetero male","cis male",
        "cis male ","cis-male","dude","m","male","male (hey this is the tech industry you're talking about)",
        "male, born with xy chromosoms","male/androgynous","man","God King of the Valajar","Mail","Male ",
        "Male (cis)","Male, cis","SWM","Malel","Man","Ostensibly Male"]

female = ["Female","*shrug emoji* (F)","Cis female ","Cis woman","Cis-Female","Cisgendered woman","F",
          "F, cisgender","Female ","Female (cis) ","Female (cisgender)","I identify as female","Woman",
          "Woman-identified","cis female","cis-Female","cisgender female","f","femail","female",
          "female (cis)","female (cisgender)","femalw","woman","My sex is female."]

genderqueer = ["Genderqueer","Agender","Agender/genderfluid","Contextual","Female-ish","Demiguy",
               "Female/gender non-binary.","Genderfluid","Genderqueer demigirl","Genderqueer/non-binary",
               "Male (or female, or both)","Male-ish","NB","Non binary","Non-binary","Nonbinary",
               "Nonbinary/femme","She/her/they/them","gender non-conforming woman","genderfluid",
               "non binary","non-binary","nonbinary","uhhhhhhhhh fem genderqueer?","male/androgynous "]

transgender = ["Transgender","Trans female","Trans man","Trans woman","Transfeminine",
               "trans woman","transgender"]

other = ["Other","None","\-","none","sometimes"]

gender_list = [male,female,genderqueer,transgender,other]

for gender in gender_list:
    combine_info(gender,column_name = "What is your gender?")
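
Many of the variants listed above differ only in case or trailing whitespace. As a sketch, normalizing the text first can shorten the lists considerably; the `lookup` dictionary below is a hypothetical, much reduced mapping and not the one used in this notebook:

```python
import pandas as pd

gender = pd.Series(["Male", "male ", "MALE", "Cis Male", "F", "woman", "NB"])

# Hypothetical lookup keyed on lower-cased, stripped responses.
lookup = {"male": "Male", "cis male": "Male", "m": "Male",
          "f": "Female", "woman": "Female", "female": "Female",
          "nb": "Genderqueer"}

normalized = gender.str.strip().str.lower()
# Responses missing from the lookup fall back to their original text.
cleaned = normalized.map(lookup).fillna(gender)

print(cleaned.tolist())
```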

#### Employment type and status
------

In [0]:
# Clean up # of employees
question = "How many employees does your company or organization have?"
df[question].fillna(0,inplace=True)

From a quick survey of the count of NaN values, there is a pattern of certain questions having 169 NaN values. 

In [19]:
# Visualizing NaN count with 169 NaN values
nan_table = pd.DataFrame(df.isna().sum(),columns = ["NaN"])
nan_table["NaN"].groupby(nan_table["NaN"]).count()

NaN
0      40
2       8
12      1
143    12
145     1
149     1
169    13
170     1
175     1
257     1
260     1
273     1
277     1
356     1
365     1
Name: NaN, dtype: int64

Since there are survey participants who are self-employed, it would be useful to find out if all the NaN values pertaining to employment are from the self-employed population.

In [20]:
# grabbing a list of participants who are self-employed
self_employed = df[df[question]==0].index.values

# check and see if # NaN = 169 are all from self-employed participants
my_list = df.isna().sum().index[df.isna().sum()==169]
index_list = list(my_list.values)

count = 0

for column in index_list:
    # row labels where this column is NaN
    b = df[column][df[column].isna()].index.values
    if (self_employed == b).sum() == 169:
        count += 1

if count == len(index_list):
    print("All NaN values are from the self-employed group")
else:
    print("Not all NaN values are from the self-employed group")

All NaN values are from the self-employed group
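
The same check can be expressed with boolean masks instead of comparing index arrays. This is only a sketch on toy data with invented column names; it tests whether, in every employment column, NaN occurs on exactly the self-employed rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "employees": [0, "6-25", 0, "100-500"],
    "benefits": [np.nan, "Yes", np.nan, "No"],
    "anonymity": [np.nan, "Yes", np.nan, "I don't know"],
})

is_self_employed = toy["employees"] == 0
employment_cols = ["benefits", "anonymity"]

# Compare each column's NaN mask with the self-employed mask, row by row.
nan_matches = toy[employment_cols].isna().eq(is_self_employed, axis=0).all().all()

print(nan_matches)
```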


The result above indicates that all the NaN values regarding employment come from self-employed participants. Those NaN values will be replaced with "Not Applicable".

In [0]:
# changing NaN values in employment-related text columns to "Not Applicable"
my_list = df.isna().sum().index[5:13]
column_list = list(my_list.values)
column_list.append(df.isna().sum().index[14])

# a single assignment covers every column in column_list, so no loop is needed
df.loc[self_employed,column_list]="Not Applicable"

#### Demographics
------

In [0]:
# Fill NaNs in age and overall rating with median
fillna_with_median(question = "What is your age?")
fillna_with_median(question = "Overall, how well do you think the tech industry supports employees with mental health issues?")

To fill in the NaN values for country of residence, the Network ID is used as a clue to infer the missing country.

In [23]:
# Using Network ID as a clue to fill in a NaN value
network_id_nan = df["Network ID"][df["What country do you live in?"].isna()==True].values

for network_id in network_id_nan:
    display(df.iloc[:,-10:][df.iloc[:,-1]==network_id])

Unnamed: 0,What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What is your race?,What country do you work in?,What US state or territory do you work in?,Start Date (UTC),Submit Date (UTC),Network ID
673,33.0,Male,United States of America,Indiana,Caucasian,United States of America,Indiana,2017-11-14 22:12:42,2017-11-14 22:22:28,bae691937c
753,34.0,Did not answer,,,Did not answer,,,2017-08-31 18:05:07,2017-08-31 18:06:56,bae691937c


Unnamed: 0,What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What is your race?,What country do you work in?,What US state or territory do you work in?,Start Date (UTC),Submit Date (UTC),Network ID
755,34.0,Did not answer,,,Did not answer,,,2017-08-31 13:40:57,2017-08-31 13:45:48,ebd922c723


The Network ID of one survey participant (index = 753) who did not fill in the country of residence matches that of another participant (index = 673) who filled in their country and state of residence. That information will be used to replace the NaN values. For the other survey participant (index = 755), who has no match, the NaN value will be filled in with "Did not answer".

In [0]:
# Using Network ID as a clue to fill in a NaN value
for num in [-8,-5]:
    df.iloc[753,num]="United States of America"
    
for num in [-7,-4]:
    df.iloc[753,num]="Indiana"

df.loc[755,"What country do you live in?"]="Did not answer"
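
With only two affected rows, filling the values by hand is reasonable. For a larger number of gaps, one possible approach (sketched on toy data with shortened column names) is to propagate known values within each Network ID group. This assumes that rows sharing a Network ID share a location, which should be checked before relying on it:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["United States of America", np.nan, np.nan],
    "network_id": ["bae691937c", "bae691937c", "ebd922c723"],
})

# Within each network ID, copy a known country onto rows that lack one;
# rows with no match in their group fall back to "Did not answer".
toy["country"] = (
    toy.groupby("network_id")["country"]
       .transform(lambda s: s.ffill().bfill())
       .fillna("Did not answer")
)

print(toy["country"].tolist())
```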

A quick check of the countries of residence reveals that not all survey participants live in the United States.

In [25]:
# Quick survey of country of residences
all(df["What country do you live in?"]=='United States of America')

False

Since not all participants live in the United States, the NaN values for US state/territory will be filled with "NA" (not applicable). The columns for work countries and states will also be dropped, since the analysis will use country of residence as the demographic feature.

In [0]:
# Fill in NaN values for US states
df["What US state or territory do you live in?"].fillna("NA",inplace=True)

# Dropping columns for work countries/states
df.drop(columns = df.columns[-5:-3],inplace=True)

For questions with a binary response (0/1), NaN values will be filled in with -1 to differentiate those who did not answer from those who answered "Yes" or "No".

In [0]:
# Filling some NaN with -1 - indicating did not answer
my_list = df.isna().sum()[df.isna().sum() > 1].index
positions = [0,1,3,4,5,8,-3,-4,-7,-8,-11,-12,-14]

for i in positions:
    df.loc[:,my_list[i]].fillna(-1,inplace=True)

##### Filling in the missing values in overall employer ratings using a machine learning algorithm
------

The missing values for overall employer ratings will be predicted by a machine learning algorithm rather than filled with the median rating, to avoid over-representing the median.

<u>Independent variables:</u> 
- Gender
- Country of residence
- Race

<u>Dependent variable:</u>
- Respective overall ratings

<u>Models considered:</u>
- Decision Tree Classifier
- Random Forest Classifier
- XGB Classifier

Overall industry ratings will be dealt with separately.

In [0]:
# creating a list of questions with overall ratings
rating_list = df.isna().sum().index[df.isna().sum().index.str.contains("Overall")].values

# creating independent and dependent variables for the model
original = {}
final = {}
train = {}
test = {}

for num in range(len(rating_list)-1):
    original[num] = df.loc[:,(rating_list[num],"What is your gender?","What country do you live in?",
                              "What is your race?")]
    
    dummies = pd.get_dummies(original[num].iloc[:,1:])
    final[num] = pd.concat([original[num].iloc[:,0],dummies],axis=1)

    train[num] = final[num][final[num].iloc[:,0].isna()==False]
    test[num] = final[num][final[num].iloc[:,0].isna()==True]

GridSearchCV is used to determine the optimum model for each rating.

In [29]:
# Using GridsearchCV to determine the optimum model for each rating

# to filter deprecation warning associated with numpy
warnings.filterwarnings("ignore",category=DeprecationWarning)

for num in range(len(rating_list)-1):
    x = train[num].iloc[:,1:]
    y = train[num].iloc[:,0]
    x_test = test[num].iloc[:,1:]

    estimators = [('model', DecisionTreeClassifier())]

    pipe = pipeline.Pipeline(estimators)

    param_grid = [{'model': [DecisionTreeClassifier()]},
                  {'model': [RandomForestClassifier()]},
                  {'model': [XGBClassifier()]}]

    grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=3)
    grid_search = grid.fit(x, y)
    print(num,grid_search.best_estimator_)

0 Pipeline(memory=None,
     steps=[('model', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))])
1 Pipeline(memory=None,
     steps=[('model', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1))])
2 Pipeline(memory=None,
     steps=[('model', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=
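
The search above treats the model itself as a hyperparameter of a one-step pipeline. A minimal, self-contained sketch of that pattern follows. It uses synthetic data and only scikit-learn estimators (XGBoost is left out here so the example has no extra dependency), so the printed result says nothing about the survey data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 4))   # stand-in for one-hot demographic features
y = rng.integers(0, 3, size=60)        # stand-in for rating classes

# The "model" step is a placeholder that each grid entry replaces.
pipe = Pipeline([("model", DecisionTreeClassifier())])
param_grid = [
    {"model": [DecisionTreeClassifier(random_state=0)]},
    {"model": [RandomForestClassifier(n_estimators=20, random_state=0)]},
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
best_model = search.best_estimator_.named_steps["model"]

print(type(best_model).__name__, round(search.best_score_, 3))
```

Reading the winner from `named_steps["model"]` avoids parsing the printed pipeline representation.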

Based on the results from GridSearchCV, the Decision Tree Classifier is best for the first question on the list and the XGB Classifier is best for the remaining questions.

In [0]:
# Filling in the NaN values
for num in range(len(rating_list)-1):
    x = train[num].iloc[:,1:]
    y = train[num].iloc[:,0]
    x_test = test[num].iloc[:,1:]

    if num == 0:
        dt = DecisionTreeClassifier()
        # 5-fold cross-validated to be the best one out of the box

        dt.fit(x,y)
        results = dt.predict(x_test)
    
    else:
        xgb = XGBClassifier()
        # 5-fold cross-validated to be the best one out of the box

        xgb.fit(x,y)
        results = xgb.predict(x_test)
    
    values = df.loc[:,rating_list[num]][df[rating_list[num]].isna()==True].index.values

    for position, value in enumerate(values):
        df.loc[value,rating_list[num]] = results[position]
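
The imputation step can also be written with a boolean mask, assigning all predictions in one statement rather than looping over row labels. The sketch below uses a toy frame with a single invented feature and a decision tree, purely to illustrate the mechanics:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

toy = pd.DataFrame({
    "rating": [1.0, 5.0, np.nan, 1.0, 5.0, np.nan],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Female"],
})

features = pd.get_dummies(toy[["gender"]])
missing = toy["rating"].isna()

# Train on rows with a known rating, then predict the missing ones.
model = DecisionTreeClassifier(random_state=0)
model.fit(features[~missing], toy.loc[~missing, "rating"])

# Assign all predictions at once; row order is preserved by the mask.
toy.loc[missing, "rating"] = model.predict(features[missing])

print(toy["rating"].tolist())
```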

Lastly, the remaining NaN values will be replaced with "Did not answer".

In [0]:
# Fill in more NaN values
column_list = df.isna().sum()[df.isna().sum() > 1].index

for column in column_list:
    df.loc[:,column].fillna("Did not answer",inplace=True)

##### Simplify responses for MH disorders
------

Responses for MH disorders will be grouped into 6 broader categories to aid the later modelling stage:

- Neurodevelopmental disorder
- Adjustment disorder
- Substance Use disorder
- Anxiety disorder
- Mood disorder
- Other

In [0]:
# Combining responses
neuro = ["Attention Deficit Hyperactivity Disorder","Autism Spectrum Disorder","Tourette's"]
adjust = ["Adjustment disorder","Stress Response Syndromes"]
substance = ["Substance Use Disorder","Addictive Disorder"]
anxiety = ["Anxiety Disorder (Generalized, Social, Phobia, etc)","Panic Disorder"]
mood = ["Mood Disorder (Depression, Bipolar Disorder, etc)","Cyclothymia"]
other = ['Suicidal','Codependence','Gender Dysphoria', 'Multiple Sclerosis & Mental Health']

column_list = [neuro,adjust,substance,anxiety,mood,other]

for var in column_list:
    for num,column in enumerate(var):
        if num > 0:
            df.loc[:,var[0]] += df.loc[:,var[num]]
            df.drop(columns = var[num],inplace=True)

# renaming some columns
my_list = [["Attention Deficit Hyperactivity Disorder","Neurodevelopmental Disorders"],
           ["Substance Use Disorder","Substance-Related and Addictive Disorders"],
           ["Suicidal","Other"],
           ["Anxiety Disorder (Generalized, Social, Phobia, etc)","Anxiety Disorder"],
           ["Mood Disorder (Depression, Bipolar Disorder, etc)","Mood Disorder"],
           ["Psychotic Disorder (Schizophrenia, Schizoaffective, etc)","Psychotic Disorder"],
           ["Eating Disorder (Anorexia, Bulimia, etc)","Eating Disorder"],
           ["Personality Disorder (Borderline, Antisocial, Paranoid, etc)","Personality Disorder"]]

for pairs in my_list:
    df.rename(columns = {pairs[0] : pairs[1]},inplace=True)

# summing merged columns can produce 2; reset to 1 since the answers are binary (0/1)
for num in range(36,48):
    df.iloc[:,num].replace(2,1,inplace=True)
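
Summing the member columns and then replacing 2 with 1 works for pairs. As an alternative sketch, taking the row-wise maximum keeps each flag at 0 or 1 however many columns are merged. The frame and the `groups` mapping below are illustrative and cover only two of the categories:

```python
import pandas as pd

toy = pd.DataFrame({
    "Anxiety Disorder": [1, 0, 1],
    "Panic Disorder": [1, 0, 0],
    "Mood Disorder": [0, 1, 0],
    "Cyclothymia": [0, 0, 0],
})

groups = {
    "Anxiety Disorder": ["Anxiety Disorder", "Panic Disorder"],
    "Mood Disorder": ["Mood Disorder", "Cyclothymia"],
}

# max() keeps each flag at 0 or 1 even when several members are set.
grouped = pd.DataFrame({name: toy[cols].max(axis=1) for name, cols in groups.items()})

print(grouped.to_dict("list"))
```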

### Advanced cleaning
------

Model-specific data cleaning and preparation.

In [0]:
# Modifying and grouping some of the answers
question = "Would you have been willing to discuss your mental health with your coworkers at previous employers?"
old_answer = "At some of my previous employers"
new_answer = "Some of my previous employers"

# .loc with a row mask and column label avoids chained assignment
df.loc[df[question]==old_answer, question] = new_answer

new_name = "Does your employer provide mental health benefits as part of healthcare coverage?"

df.rename(columns = {df.columns[5] : new_name}, inplace=True)

In [0]:
# Modifying and grouping some of the answers
question_1 = "Does your employer provide mental health benefits as part of healthcare coverage?"
question_2 = "Do you know the options for mental health care available under your employer-provided health coverage?"
question_3 = "Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?"
question_4 = "Have you observed or experienced supportive or well handled response to a mental health issue in your current or previous workplace?"

answer_1 = "Not Applicable"
answer_2 = "Not eligible for coverage / NA"
answer_3 = "Did not answer"
answer_4 = "Yes, I experienced"
answer_5 = "Yes, I observed"
answer_6 = "Yes"

# .loc with a row mask and column label avoids chained assignment
df.loc[df[question_1]==answer_1, question_1] = answer_2

df.loc[df[question_2]==answer_3, question_2] = answer_1

for question in [question_3,question_4]:
    # both "Yes, I experienced" and "Yes, I observed" collapse to "Yes"
    df.loc[df[question].isin([answer_4,answer_5]), question] = answer_6

The cleaned dataset is exported as a .csv file, `df.csv`, using `df.to_csv`. The beginning of the cleaned dataset is displayed below.

In [35]:
# Cleaned dataset
df.head(10)

Unnamed: 0,id,Are you self-employed?,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided health coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health disorders and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?",Would you feel more comfortable talking to your coworkers about your physical health or your mental health?,Would you feel comfortable discussing a mental health issue with your direct supervisor(s)?,Have you ever discussed your mental health with your employer?,Would you feel comfortable discussing a mental health issue with your coworkers?,Have you ever discussed your mental health with coworkers?,Have you ever had a coworker discuss their or another coworker's mental health with you?,"Overall, how much importance does your employer place on physical health?","Overall, how much importance does your employer place on mental health?",Do you have previous employers?,Was your employer primarily a tech company/organization?,Have your previous employers provided mental health benefits?,Were you aware of the options for mental health care provided by your previous employers?,Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?,Did your previous employers provide resources to learn more 
about mental health disorders and how to seek help?,Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?,Would you have felt more comfortable talking to your previous employer about your physical health or your mental health?,Would you have been willing to discuss your mental health with your direct supervisor(s)?,Did you ever discuss your mental health with your previous employer?,Would you have been willing to discuss your mental health with your coworkers at previous employers?,Did you ever discuss your mental health with a previous coworker(s)?,Did you ever have a previous coworker discuss their or another coworker's mental health with you?,"Overall, how much importance did your previous employer place on physical health?","Overall, how much importance did your previous employer place on mental health?",Do you currently have a mental health disorder?,Have you ever been diagnosed with a mental health disorder?,Anxiety Disorder,Mood Disorder,Psychotic Disorder,Eating Disorder,Neurodevelopmental Disorders,Personality Disorder,Obsessive-Compulsive Disorder,Post-Traumatic Stress Disorder,Dissociative Disorder,Substance-Related and Addictive Disorders,Other,Adjustment disorder,Have you had a mental health disorder in the past?,Have you ever sought treatment for a mental health disorder from a mental health professional?,Do you have a family history of mental illness?,"If you have a mental health disorder, how often do you feel that it interferes with your work when being treated effectively?","If you have a mental health disorder, how often do you feel that it interferes with your work when NOT being treated effectively (i.e., when you are experiencing symptoms)?",Have your observations of how another individual who discussed a mental health issue made you less likely to reveal a mental health issue yourself in your current workplace?,How willing would you be to share with friends 
and family that you have a mental illness?,Would you be willing to bring up a physical health issue with a potential employer in an interview?,Why or why not?,Would you bring up your mental health with a potential employer in an interview?,Why or why not?.1,Are you openly identified at work as a person with a mental health issue?,"If they knew you suffered from a mental health disorder, how do you think that team members/co-workers would react?",Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?,Have you observed or experienced supportive or well handled response to a mental health issue in your current or previous workplace?,"Overall, how well do you think the tech industry supports employees with mental health issues?",Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.,What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What is your race?,Start Date (UTC),Submit Date (UTC),Network ID
0,1,0,100-500,1.0,1.0,No,Yes,No,I don't know,I don't know,I don't know,Same level of comfort for each,Yes,0.0,Yes,1.0,1.0,6.0,0.0,1,0.0,I don't know,N/A (was not aware),Some did,Some did,"Yes, always",Physical health,"Yes, all of my previous supervisors",0.0,"No, at none of my previous employers",0.0,0.0,3.0,3.0,Possibly,Did not answer,1,0,0,0,0,0,0,0,0,0,0,0,Possibly,1,No,Sometimes,Sometimes,No,5,Yes,Did not answer,No,I'd be worried they wouldn't hire me,0.0,10.0,Yes,Yes,1.0,They don't take it seriously,27.0,Female,United Kingdom,,Did not answer,2018-05-16 12:32:04,2018-05-16 12:42:40,464b7a12f1
1,2,0,100-500,1.0,1.0,Yes,Yes,No,No,I don't know,I don't know,Same level of comfort for each,Maybe,0.0,Yes,1.0,1.0,7.0,2.0,1,1.0,Some did,I was aware of some,None did,None did,I don't know,Physical health,"No, none of my previous supervisors",0.0,Some of my previous employers,1.0,0.0,5.0,2.0,Possibly,Did not answer,0,1,0,0,0,0,0,0,0,0,0,0,Possibly,0,No,Not applicable to me,Sometimes,No,4,Yes,it may require specific measures to accomodate...,No,mental health issues are stigmatised and misun...,0.0,6.0,Yes,Maybe/Not sure,2.0,"raise awareness, talk about it to lessen the s...",31.0,Male,United Kingdom,,Did not answer,2018-05-16 12:31:13,2018-05-16 12:40:40,464b7a12f1
2,3,0,6-25,1.0,1.0,I don't know,No,I don't know,No,Yes,Difficult,Same level of comfort for each,Yes,1.0,Maybe,1.0,0.0,0.0,1.0,1,1.0,Some did,N/A (was not aware),None did,None did,I don't know,Physical health,"No, none of my previous supervisors",0.0,Some of my previous employers,1.0,0.0,8.0,0.0,Yes,Yes,1,1,0,0,0,0,0,0,0,1,0,0,Yes,1,Yes,Sometimes,Sometimes,Yes,5,Maybe,I will sometimes bring up my psoriasis just as...,No,stigma,1.0,5.0,Yes,Yes,1.0,"Education and awareness, statistics, add suppo...",36.0,Male,United States of America,Missouri,Caucasian,2018-05-09 05:34:05,2018-05-09 05:46:04,1eb7e0cb94
3,4,0,More than 1000,1.0,1.0,Yes,Yes,I don't know,I don't know,Yes,Difficult,Same level of comfort for each,Yes,1.0,Yes,1.0,0.0,7.0,5.0,0,-1.0,Did not answer,Did not answer,Did not answer,Did not answer,Did not answer,Did not answer,Did not answer,-1.0,Did not answer,-1.0,-1.0,5.0,5.0,Yes,Yes,0,0,0,0,1,0,0,0,0,0,0,0,No,1,I don't know,Sometimes,Often,No,10,No,Anything that may hurt my chances to be hired ...,No,Might hurt my chances,0.0,5.0,Maybe/Not sure,Maybe/Not sure,2.0,"More support, less burnout and death marches",22.0,Male,United States of America,Washington,Caucasian,2018-05-04 23:19:14,2018-05-04 23:23:23,63852edbc4
4,5,1,0,-1.0,-1.0,Not eligible for coverage / NA,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,-1.0,Not Applicable,-1.0,-1.0,5.0,5.0,1,0.0,"No, none did",N/A (none offered),None did,None did,"Yes, always",Same level of comfort for each,"No, none of my previous supervisors",0.0,"No, at none of my previous employers",1.0,1.0,8.0,8.0,No,Did not answer,0,0,0,0,0,0,0,0,0,0,0,0,Yes,1,Yes,Often,Sometimes,No,10,Maybe,It depends. it's not something you start with ...,No,Don't think it's connected to the job. You do ...,0.0,4.0,No,Yes,1.0,I think tech is more internal and they don't r...,52.0,Female,United States of America,Illinois,Mixed,2018-05-03 00:40:24,2018-05-03 00:53:20,43237889f1
5,6,0,100-500,1.0,0.0,Yes,No,No,I don't know,Yes,Somewhat easy,Physical health,Maybe,0.0,Maybe,0.0,0.0,9.0,5.0,1,1.0,"No, none did",I was aware of some,None did,Some did,"Yes, always",Physical health,Some of my previous supervisors,0.0,Some of my previous employers,0.0,0.0,7.0,3.0,No,Did not answer,0,0,0,0,0,0,0,0,0,0,0,0,No,0,Yes,Rarely,Not applicable to me,Maybe,5,Maybe,It would depend on what it is..,No,It wouldn't feel safe,0.0,4.0,Yes,Yes,2.0,"Awareness, changed work schedules and expectat...",30.0,Male,United States of America,California,Caucasian,2018-05-01 22:53:02,2018-05-01 22:59:21,8ac9b72b8a
6,7,0,6-25,1.0,1.0,Yes,Yes,No,No,Yes,Very easy,Same level of comfort for each,Yes,0.0,No,1.0,1.0,10.0,10.0,1,1.0,Some did,I was aware of some,None did,None did,"Yes, always",Same level of comfort for each,Some of my previous supervisors,1.0,Some of my previous employers,1.0,1.0,10.0,10.0,Yes,Yes,0,1,0,0,1,0,0,0,0,0,0,0,No,1,Yes,Rarely,Often,No,8,No,It seems like it would be a distraction.,No,I would be worried that it would affect my int...,1.0,5.0,No,Yes,2.0,Be more vocal about supporting employees with ...,36.0,Female,United States of America,Washington,Asian,2018-04-28 20:02:22,2018-04-28 20:12:23,2a299f981a
7,8,0,26-100,1.0,1.0,Yes,No,No,No,I don't know,Somewhat easy,Physical health,Yes,0.0,Maybe,0.0,1.0,10.0,8.0,1,1.0,Some did,I was aware of some,None did,None did,I don't know,Physical health,Some of my previous supervisors,0.0,Some of my previous employers,0.0,1.0,5.0,5.0,No,Did not answer,0,0,0,0,0,0,0,0,0,0,0,0,No,1,Yes,Not applicable to me,Not applicable to me,No,3,No,I want to maintain my privacy. Unless I end up...,No,Same as above - none of their business,0.0,7.0,No,Yes,2.0,I think we over work ourselves and each other....,38.0,Female,United States of America,Georgia,Caucasian,2018-04-27 17:42:55,2018-04-27 17:50:41,1533aa77ee
8,9,0,100-500,0.0,1.0,I don't know,No,No,No,Yes,Very easy,Same level of comfort for each,Maybe,0.0,Maybe,0.0,0.0,9.0,7.0,1,1.0,"No, none did",N/A (was not aware),None did,None did,"Yes, always",Physical health,"No, none of my previous supervisors",0.0,"No, at none of my previous employers",0.0,0.0,9.0,6.0,Don't Know,Did not answer,0,0,0,0,0,0,0,0,0,0,0,0,No,0,I don't know,Not applicable to me,Not applicable to me,Maybe,6,No,Fear of not getting the job,No,Fear of not getting the job,0.0,5.0,No,Yes,2.0,Don’t know,35.0,Male,Switzerland,,Did not answer,2018-04-26 22:35:49,2018-04-26 22:46:46,f5e9431851
9,10,1,0,-1.0,-1.0,Not eligible for coverage / NA,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,-1.0,Not Applicable,-1.0,-1.0,8.0,4.0,1,1.0,Some did,I was aware of some,Some did,Some did,Sometimes,Mental health,Some of my previous supervisors,1.0,Some of my previous employers,1.0,0.0,2.0,2.0,Possibly,Did not answer,0,0,0,0,1,0,0,0,0,0,0,0,No,1,No,Often,Sometimes,No,4,Maybe,ok,No,ok,1.0,4.0,Yes,Yes,3.0,ok,36.0,Male,India,,Did not answer,2018-04-25 07:18:35,2018-04-25 07:22:44,f5dd0e2917
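
The printed rows above suggest the cleaned table is stored as CSV, with the DataFrame index in the first column and the two timestamp columns as plain text. As a minimal sketch of how it might be reloaded in the modelling notebook, the snippet below reads a two-row excerpt (values copied from the output above, columns reduced for brevity) and parses the dates. Treating the unnamed first column as the index is an assumption, and `sample_csv`, `cleaned` and `duration_seconds` are illustrative names that do not appear in the project.

```python
import io

import pandas as pd

# Two rows and four columns copied from the output above; the real table is much wider.
sample_csv = """,What is your age?,What is your gender?,Start Date (UTC),Submit Date (UTC)
0,27.0,Female,2018-05-16 12:32:04,2018-05-16 12:42:40
1,31.0,Male,2018-05-16 12:31:13,2018-05-16 12:40:40
"""

date_cols = ["Start Date (UTC)", "Submit Date (UTC)"]

# Assumption: the first, unnamed column holds the saved DataFrame index.
cleaned = pd.read_csv(io.StringIO(sample_csv), index_col=0, parse_dates=date_cols)

# Time each respondent spent on the survey, in seconds.
duration_seconds = (
    cleaned["Submit Date (UTC)"] - cleaned["Start Date (UTC)"]
).dt.total_seconds()
print(duration_seconds.tolist())
```

For the real file, `io.StringIO(sample_csv)` would be replaced by the path the cleaned data was saved to.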


------
### Please continue to the notebook *Capstone modelling stage final* for the remainder of the project.
------