In [1]:
import pandas as pd
import numpy as np
df_16 = pd.read_csv('mental-health-in-tech-2016/mental-heath-in-tech-2016_20161114.csv', header=0)

## Data cleaning for df_16

In [2]:
df_16.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1433 entries, 0 to 1432
Data columns (total 63 columns):
Are you self-employed?                                                                                                                                                              1433 non-null int64
How many employees does your company or organization have?                                                                                                                          1146 non-null object
Is your employer primarily a tech company/organization?                                                                                                                             1146 non-null float64
Is your primary role within your company related to tech/IT?                                                                                                                        263 non-null float64
Does your employer provide mental health benefits as part of healthcare coverage?        

In [3]:
# rows answered by self-employed respondents
self_employed = df_16[df_16["Are you self-employed?"]==1]
self_employed


Unnamed: 0,Are you self-employed?,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health concerns and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:",...,"If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?","If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?",What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What country do you work in?,What US state or territory do you work in?,Which of the following best describes your work position?,Do you work remotely?
3,1,,,,,,,,,,...,Sometimes,Sometimes,43,male,United Kingdom,,United Kingdom,,Supervisor/Team Lead,Sometimes
9,1,,,,,,,,,,...,Rarely,Often,30,Male,United States of America,Kentucky,United States of America,Kentucky,One-person shop|Front-end Developer|Back-end D...,Always
18,1,,,,,,,,,,...,Often,Often,25,female,United States of America,Washington,United States of America,Washington,One-person shop,Always
24,1,,,,,,,,,,...,Not applicable to me,Not applicable to me,38,Male,United States of America,New York,United States of America,New York,Supervisor/Team Lead,Sometimes
33,1,,,,,,,,,,...,Not applicable to me,Not applicable to me,37,Male,Czech Republic,,Czech Republic,,Back-end Developer|Front-end Developer,Sometimes
40,1,,,,,,,,,,...,Rarely,Sometimes,34,Male,United States of America,Georgia,United States of America,Georgia,One-person shop|Back-end Developer|DevOps/SysA...,Sometimes
43,1,,,,,,,,,,...,Sometimes,Often,28,M,Lithuania,,Lithuania,,DevOps/SysAdmin|Back-end Developer,Sometimes
46,1,,,,,,,,,,...,Rarely,Often,32,Male,United States of America,West Virginia,United States of America,Ohio,Other,Sometimes
48,1,,,,,,,,,,...,Not applicable to me,Not applicable to me,43,male,United Kingdom,,United Kingdom,,Other,Sometimes
52,1,,,,,,,,,,...,Not applicable to me,Not applicable to me,31,Male,Netherlands,,Netherlands,,Supervisor/Team Lead|Back-end Developer|Front-...,Sometimes


### There are 287 self-employed records in which most of the information is NaN. These records are not suitable for the analysis, so I remove them from the dataframe.

In [4]:
drop_list = df_16[df_16["Are you self-employed?"] == 1].index
df_c16 = df_16.drop(drop_list, axis=0)
df_c16.shape


(1146, 63)
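The same row removal can also be written as a boolean filter. The sketch below uses a small made-up frame (the column names `self_employed` and `company_size` are placeholders, not the survey's real headers) and is only meant to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_16.
toy = pd.DataFrame({
    'self_employed': [0, 1, 0, 1],
    'company_size': ['6-25', None, '26-100', None],
})

# Keep only respondents who are not self-employed; copy() gives an
# independent frame so later column assignments do not raise warnings.
employees = toy[toy['self_employed'] == 0].copy()
print(employees.shape)
```

The original index labels are preserved, which is why the cleaned frame above still runs from 0 to 1432.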

## 1. Separate companies into 6 levels based on size

In [5]:
df_c16["How many employees does your company or organization have?"].value_counts(dropna=False)

26-100            292
More than 1000    256
100-500           248
6-25              210
500-1000           80
1-5                60
Name: How many employees does your company or organization have?, dtype: int64

In [6]:
def company_level(company_size):
    levels = {
        1: '1-5',
        2: '6-25',
        3: '26-100',
        4: '100-500',
        5: '500-1000',
        6: 'More than 1000'
    }
    result = 0
    for k, v in levels.items():
        if v == company_size:
            result = k
    return result

company_size = df_c16['How many employees does your company or organization have?'].apply(company_level)

In [7]:
df_c16['company_size'] = company_size
df_c16 = df_c16.drop(['How many employees does your company or organization have?'], axis=1)

In [8]:
df_c16['company_size'].value_counts()

3    292
6    256
4    248
2    210
5     80
1     60
Name: company_size, dtype: int64
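A lookup like `company_level` can also be expressed with `Series.map` and a dictionary keyed by the answer text. This is a minimal sketch on toy data; the `size_codes` name and the sample answers are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy column standing in for the company-size answers.
sizes = pd.Series(['1-5', '26-100', 'More than 1000', None, '6-25'])

# Ordinal code for each size bracket, smallest to largest.
size_codes = {
    '1-5': 1,
    '6-25': 2,
    '26-100': 3,
    '100-500': 4,
    '500-1000': 5,
    'More than 1000': 6,
}

# map() leaves unmatched or missing answers as NaN; fillna(0) reproduces
# the "0 means no match" convention used by company_level().
encoded = sizes.map(size_codes).fillna(0).astype(int)
print(encoded.tolist())  # [1, 3, 6, 0, 2]
```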

## 2. Check each column

#### Column: Is your primary role within your company related to tech/IT?

In [10]:
df_c16["Is your primary role within your company related to tech/IT?"].value_counts(dropna=False)

NaN     883
 1.0    248
 0.0     15
Name: Is your primary role within your company related to tech/IT?, dtype: int64

In [11]:
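# assign the result back instead of using inplace=True on a column selection
df_c16["Which of the following best describes your work position?"] = df_c16["Which of the following best describes your work position?"].replace(['Back-end Developer|Front-end Developer', 'Front-end Developer|Back-end Developer'], 'Full track')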


In [12]:
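# Based on the column 'Which of the following best describes your work position?', update the data in
# column 'Is your primary role within your company related to tech/IT?'.
# Note: a generator object is always truthy, so the condition below reduces to the isnull() mask:
# previously missing answers become 1.0 and previously answered rows become 0.0.
tech_list = ['Developer', 'Full track', 'DevOps', 'SysAdmin', 'Designer', 'Dev Evangelist']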

df_c16["Is your primary role within your company related to tech/IT?"] = np.where((x in df_c16["Which of the following best describes your work position?"] for x in tech_list) and (df_c16["Is your primary role within your company related to tech/IT?"].isnull()), 1.0, 0.0)


In [13]:
df_c16["Is your primary role within your company related to tech/IT?"].value_counts(dropna=False)

1.0    883
0.0    263
Name: Is your primary role within your company related to tech/IT?, dtype: int64
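The comment in cell 12 describes filling the missing role answers from the work-position text. One way that intent could be written is sketched below on toy data. The `tech_terms` list is an assumption based on `tech_list` (using the 'Full track' spelling from cell 11 and no leading space before 'DevOps'), and the column names are placeholders. Running this on the real data would give counts different from those shown above.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical toy frame: 'position' is the free-text role, 'tech_role' the 0/1 answer.
toy = pd.DataFrame({
    'position': ['Back-end Developer', 'HR', 'DevOps/SysAdmin', 'Sales', 'Designer'],
    'tech_role': [np.nan, np.nan, 1.0, 0.0, np.nan],
})

# Assumed list of tech-related terms, joined into one regular expression.
tech_terms = ['Developer', 'Full track', 'DevOps', 'SysAdmin', 'Designer', 'Dev Evangelist']
pattern = '|'.join(re.escape(term) for term in tech_terms)

# True where the position text mentions any tech term.
is_tech = toy['position'].str.contains(pattern, regex=True, na=False)

# Fill only the missing answers; existing 0/1 answers are left untouched.
missing = toy['tech_role'].isnull()
toy.loc[missing, 'tech_role'] = is_tech[missing].astype(float)
print(toy['tech_role'].tolist())  # [1.0, 0.0, 1.0, 0.0, 1.0]
```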

#### Column: Do you know the options for mental health care available under your employer-provided coverage?

In [14]:
df_c16["Do you know the options for mental health care available under your employer-provided coverage?"].value_counts(dropna=False)

No               354
I am not sure    352
Yes              307
NaN              133
Name: Do you know the options for mental health care available under your employer-provided coverage?, dtype: int64

In [15]:
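# fill NaN with 'I am not sure' (assigning back rather than using inplace=True on a column selection)
df_c16["Do you know the options for mental health care available under your employer-provided coverage?"] = df_c16["Do you know the options for mental health care available under your employer-provided coverage?"].fillna('I am not sure')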

In [16]:
df_c16["Do you know the options for mental health care available under your employer-provided coverage?"].value_counts(dropna=False)

I am not sure    485
No               354
Yes              307
Name: Do you know the options for mental health care available under your employer-provided coverage?, dtype: int64

#### Column: Does your employer provide mental health benefits as part of healthcare coverage?

In [17]:
df_c16["Does your employer provide mental health benefits as part of healthcare coverage?"].value_counts(dropna=False)

Yes                                531
I don't know                       319
No                                 213
Not eligible for coverage / N/A     83
Name: Does your employer provide mental health benefits as part of healthcare coverage?, dtype: int64

### The following columns apply only to self-employed respondents, so all of their remaining values are NaN. Drop these columns:

In [18]:
drop_list = ["Do you have medical coverage (private insurance or state-provided) which includes treatment of  mental health issues?", 
            "Do you know local or online resources to seek help for a mental health disorder?", 
            "If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?",
            "If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?",
            "If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?",
            "If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?",
            "Do you believe your productivity is ever affected by a mental health issue?",
            "If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?"]
df_c16 = df_c16.drop(drop_list, axis=1)

In [19]:
df_c16.shape

(1146, 55)
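Columns that are entirely NaN can also be found programmatically rather than listed by hand. A minimal sketch on a made-up frame (all column names here are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two columns carry no information at all.
toy = pd.DataFrame({
    'answered': ['Yes', 'No', 'Yes'],
    'self_employed_only_a': [np.nan, np.nan, np.nan],
    'partly_missing': ['x', np.nan, 'y'],
    'self_employed_only_b': [np.nan, np.nan, np.nan],
})

# isna().all() is True for a column only when every value is missing.
all_nan_cols = toy.columns[toy.isna().all()].tolist()
cleaned = toy.drop(columns=all_nan_cols)
print(all_nan_cols)
print(cleaned.shape)
```

`toy.dropna(axis=1, how='all')` gives the same result in one call; listing the names first makes it easier to review what is being dropped.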

### The columns in check_list are a group of survey questions about previous employers. These features have little bearing on this research, so remove them from the analysis dataframe.

In [20]:
check_list = ["Have your previous employers provided mental health benefits?",
              "Were you aware of the options for mental health care provided by your previous employers?",
             "Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?",
             "Did your previous employers provide resources to learn more about mental health issues and how to seek help?",
             "Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?",
             "Do you think that discussing a mental health disorder with previous employers would have negative consequences?",
             "Do you think that discussing a physical health issue with previous employers would have negative consequences?",
             "Would you have been willing to discuss a mental health issue with your previous co-workers?",
             "Would you have been willing to discuss a mental health issue with your direct supervisor(s)?",
             "Did you feel that your previous employers took mental health as seriously as physical health?",
             "Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces?"]


In [21]:
df_c16['Have your previous employers provided mental health benefits?'].value_counts(dropna=False)

Some did             326
No, none did         280
I don't know         258
Yes, they all did    151
NaN                  131
Name: Have your previous employers provided mental health benefits?, dtype: int64

In [22]:
check_df = df_c16[df_c16['Have your previous employers provided mental health benefits?'].isnull()]

In [23]:
check_df[check_list]

Unnamed: 0,Have your previous employers provided mental health benefits?,Were you aware of the options for mental health care provided by your previous employers?,Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?,Did your previous employers provide resources to learn more about mental health issues and how to seek help?,Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?,Do you think that discussing a mental health disorder with previous employers would have negative consequences?,Do you think that discussing a physical health issue with previous employers would have negative consequences?,Would you have been willing to discuss a mental health issue with your previous co-workers?,Would you have been willing to discuss a mental health issue with your direct supervisor(s)?,Did you feel that your previous employers took mental health as seriously as physical health?,Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces?
25,,,,,,,,,,,
30,,,,,,,,,,,
32,,,,,,,,,,,
56,,,,,,,,,,,
71,,,,,,,,,,,
83,,,,,,,,,,,
89,,,,,,,,,,,
101,,,,,,,,,,,
104,,,,,,,,,,,
108,,,,,,,,,,,


In [24]:
df_c16 = df_c16.drop(labels=check_list, axis=1)
df_c16.shape

(1146, 44)
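The table in cell 23 suggests that a respondent who skipped the first previous-employer question skipped the whole group. A check along these lines could confirm that without reading the rows by eye. The frame and column names below are hypothetical stand-ins for `df_c16` and `check_list`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with three "previous employer" columns.
toy = pd.DataFrame({
    'prev_benefits': ['Some did', np.nan, 'No, none did', np.nan],
    'prev_aware': ['Yes', np.nan, 'No', np.nan],
    'prev_discussed': ['No', np.nan, 'No', np.nan],
})
group = ['prev_benefits', 'prev_aware', 'prev_discussed']

# Rows missing the first question vs. rows missing every question in the group.
missing_first = toy['prev_benefits'].isna()
missing_all = toy[group].isna().all(axis=1)

# True when the two masks select exactly the same rows.
same_rows = bool((missing_first == missing_all).all())
print(same_rows)
```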

#### Remove 2 explanation columns: "Why or why not?" and "Why or why not?.1"

In [25]:
df_c16 = df_c16.drop(["Why or why not?", "Why or why not?.1"], axis=1)

#### Column: Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?

In [26]:
df_c16["Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?"].value_counts(dropna=False)

No                    491
Maybe/Not sure        278
Yes, I observed       193
Yes, I experienced    132
NaN                    52
Name: Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?, dtype: int64

In [27]:
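# treat missing answers as "Maybe/Not sure"
df_c16["Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?"] = df_c16["Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?"].fillna("Maybe/Not sure")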

In [28]:
df_c16["Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?"] = \
    df_c16["Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?"].replace(['Yes, I observed', 'Yes, I experienced'], 'Yes')
    
df_c16["Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?"] = \
    df_c16["Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?"].replace(['Maybe/Not sure'], 'Maybe')

In [29]:
df_c16["Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?"].value_counts(dropna=False)

No       491
Maybe    330
Yes      325
Name: Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?, dtype: int64

#### Column: Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?

In [30]:
df_c16["Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?"].value_counts(dropna=False)

NaN      635
No       198
Yes      181
Maybe    132
Name: Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?, dtype: int64

In [31]:
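# treat missing answers as "Maybe"
df_c16["Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?"] = df_c16["Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?"].fillna("Maybe")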

In [32]:
df_c16["Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?"].value_counts(dropna=False)

Maybe    767
No       198
Yes      181
Name: Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?, dtype: int64

#### Remove the diagnosis explanation columns

In [33]:
mental_health_diagnose = ["If yes, what condition(s) have you been diagnosed with?", "If maybe, what condition(s) do you believe you have?", "If so, what condition(s) were you diagnosed with?"]

# Remove explanation columns:
df_c16 = df_c16.drop(mental_health_diagnose, axis=1)

#### Categorize medical request levels

In [34]:
def medical_request_level(request):
    levels = {
        1: 'Very easy',
        2: 'Somewhat easy',
        3: 'Neither easy nor difficult',
        4: 'Somewhat difficult',
        5: 'Very difficult'
    }
    result = 0
    for k, v in levels.items():
        if v == request:
            result = k
    return result

requests = df_c16["If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:"].apply(medical_request_level)



In [35]:
df_c16["If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:"] = requests

In [36]:
df_c16["If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:"].value_counts(dropna=False)

2    281
1    220
4    199
3    178
0    150
5    118
Name: If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:, dtype: int64

#### Column: What is your age?

In [37]:
df_c16["What is your age?"].value_counts(dropna=False)

30     77
28     66
31     64
29     63
32     62
26     58
35     58
33     56
27     55
34     53
37     46
36     44
39     41
24     39
38     39
25     36
40     30
22     28
23     24
45     22
43     21
44     21
42     21
41     20
46     16
21     13
47     11
49      8
52      6
51      5
48      5
20      5
55      5
50      4
57      3
53      2
54      2
56      2
63      2
19      2
17      1
323     1
99      1
58      1
59      1
61      1
62      1
66      1
70      1
74      1
3       1
Name: What is your age?, dtype: int64

In [38]:
# calculate the average age, excluding the two records with invalid entries (3 and 323)
(np.sum(df_c16["What is your age?"]) - (323 + 3)) / (len(df_c16["What is your age?"]) - 2)

33.42919580419581

In [39]:
df_c16["What is your age?"] = df_c16["What is your age?"].replace([3, 323], 33)
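The mean-and-replace step can also be written with a validity mask so the invalid values do not have to be hard-coded. This is a sketch on toy ages; the 15 to 100 bounds are an assumption chosen for illustration, not a rule taken from the survey.

```python
import pandas as pd

# Hypothetical toy ages, including two implausible entries.
ages = pd.Series([30, 28, 3, 323, 40, 35])

# Assumed plausibility window; between() is inclusive on both ends.
plausible = ages.between(15, 100)

# Mean over plausible ages only: (30 + 28 + 40 + 35) / 4 = 33.25
mean_age = ages[plausible].mean()
fill_value = int(round(mean_age))

# Keep plausible ages and replace the rest with the rounded mean.
cleaned = ages.where(plausible, fill_value)
print(cleaned.tolist())  # [30, 28, 33, 33, 40, 35]
```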

In [41]:
# divide the age column into 5 groups based on 10-year ranges
age_class = []
for idx, row in df_c16.iterrows():
    age = row['What is your age?']
    if age < 20:
        age_class.append(1)
    elif (age < 30):
        age_class.append(2)
    elif (age < 40):
        age_class.append(3)
    elif (age < 50):
        age_class.append(4)
    else:
        age_class.append(5)

In [42]:
df_c16['age_class'] = age_class
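The row-by-row loop can be replaced by `pd.cut`, which bins a whole column at once. The sketch below reproduces the same boundaries (under 20, 20s, 30s, 40s, 50 and over) on toy ages.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages covering every boundary of the loop above.
ages = pd.Series([17, 19, 20, 29, 30, 39, 40, 49, 50, 74])

# right=False makes each bin closed on the left: [20, 30), [30, 40), ...
bins = [-np.inf, 20, 30, 40, 50, np.inf]
age_class = pd.cut(ages, bins=bins, right=False, labels=[1, 2, 3, 4, 5]).astype(int)
print(age_class.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```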

#### Column: What is your gender?

In [43]:
# clean the genders by grouping the genders into 3 categories: Female, Male, Genderqueer/Other 
# Borrowed from "https://www.kaggle.com/jchen2186/data-visualization-with-python-seaborn"
df_c16["What is your gender?"] = df_c16["What is your gender?"].replace([
    'male', 'Male ', 'M', 'm', 'man', 'Cis male',
    'Male.', 'Male (cis)', 'Man', 'Sex is male',
    'cis male', 'Malr', 'Dude', "I'm a man why didn't you make this a drop down question. You should of asked sex? And I would of answered yes please. Seriously how much text can this take? ",
    'mail', 'M|', 'male ', 'Cis Male', 'Male (trans, FtM)',
    'cisdude', 'cis man', 'MALE'], 'Male')
df_c16["What is your gender?"] = df_c16["What is your gender?"].replace([
    'female', 'I identify as female.', 'female ',
    'Female assigned at birth ', 'F', 'Woman', 'fm', 'f',
    'Cis female', 'Transitioned, M2F', 'Female or Multi-Gender Femme',
    'Female ', 'woman', 'female/woman', 'Cisgender Female', 
    'mtf', 'fem', 'Female (props for making this a freeform field, though)',
    ' Female', 'Cis-woman', 'AFAB', 'Transgender woman',
    'Cis female '], 'Female')
df_c16["What is your gender?"] = df_c16["What is your gender?"].replace([
    'Bigender', 'non-binary,', 'Genderfluid (born female)',
    'Other/Transfeminine', 'Androgynous', 'male 9:1 female, roughly',
    'nb masculine', 'genderqueer', 'Human', 'Genderfluid',
    'Enby', 'genderqueer woman', 'Queer', 'Agender', 'Fluid',
    'Genderflux demi-girl', 'female-bodied; no feelings about gender',
    'non-binary', 'Male/genderqueer', 'Nonbinary', 'Other', 'none of your business',
    'Unicorn', 'human', 'Genderqueer'], 'Genderqueer/Other')

# replace the one null with Male, the mode gender, so we don't have to drop the row
df_c16["What is your gender?"] = df_c16["What is your gender?"].replace(np.nan, 'Male')
df_c16["What is your gender?"].unique()

array(['Male', 'Female', 'Genderqueer/Other'], dtype=object)
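Many of the variants in the lists above differ only in case or surrounding whitespace. Normalizing the text first shortens the lookup considerably. The sketch below uses toy answers and deliberately small `male` and `female` sets, so it is an illustration of the approach and not a replacement for the full lists.

```python
import pandas as pd

# Hypothetical toy answers with mixed case, stray spaces, and a missing value.
raw = pd.Series(['Male ', 'male', 'M', ' Female', 'f', 'Nonbinary', None])

# Strip whitespace and lowercase so 'Male ', 'male' and 'MALE' compare equal.
normalized = raw.str.strip().str.lower()

# Assumed, intentionally incomplete sets of normalized spellings.
male = {'male', 'm', 'man'}
female = {'female', 'f', 'woman'}

def to_group(value):
    """Map a normalized answer to one of the three groups."""
    if pd.isna(value):
        return 'Male'  # mirrors the mode fill used in the cell above
    if value in male:
        return 'Male'
    if value in female:
        return 'Female'
    return 'Genderqueer/Other'

grouped = normalized.apply(to_group)
print(grouped.tolist())
```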

In [44]:
df_c16.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1146 entries, 0 to 1432
Data columns (total 40 columns):
Are you self-employed?                                                                                                                                                              1146 non-null int64
Is your employer primarily a tech company/organization?                                                                                                                             1146 non-null float64
Is your primary role within your company related to tech/IT?                                                                                                                        1146 non-null float64
Does your employer provide mental health benefits as part of healthcare coverage?                                                                                                   1146 non-null object
Do you know the options for mental health care available under your employer-provided co

In [45]:
df_c16.shape

(1146, 40)

## Convert all columns to the categorical data type

In [46]:
df_c16 = df_c16.apply(lambda x: x.astype('category'))

In [47]:
df_c16['If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:'].cat.as_ordered()

0       1
1       2
2       3
4       3
5       2
6       2
7       1
8       5
10      1
11      4
12      1
13      2
14      1
15      3
16      5
17      0
19      0
20      1
21      2
22      1
23      4
25      1
26      5
27      4
28      3
29      2
30      2
31      1
32      1
34      1
       ..
1398    1
1399    2
1400    1
1401    0
1402    0
1403    4
1405    0
1406    4
1407    0
1409    3
1410    5
1411    1
1412    3
1413    2
1414    5
1415    5
1416    2
1417    5
1418    3
1419    2
1421    1
1422    0
1423    2
1424    4
1425    2
1426    2
1427    2
1430    4
1431    4
1432    5
dtype: category
Categories (6, int64): [0 < 1 < 2 < 3 < 4 < 5]
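`cat.as_ordered()` returns a new Series and does not modify the column, so the call in cell 47 only displays an ordered copy. If the ordering is meant to persist, the result has to be assigned back. A minimal sketch on a toy column:

```python
import pandas as pd

# Hypothetical toy column of encoded leave-request levels.
leave = pd.Series([1, 2, 3, 0, 5, 2]).astype('category')

# as_ordered() returns an ordered copy; the original stays unordered.
ordered = leave.cat.as_ordered()
print(leave.cat.ordered, ordered.cat.ordered)

# Comparisons against a category value are only allowed on the ordered copy.
at_least_neutral = (ordered >= 3).tolist()
print(at_least_neutral)
```

In the notebook this would mean writing the ordered Series back to the dataframe column before saving or modeling.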

In [48]:
df_c16.describe()

Unnamed: 0,Are you self-employed?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health concerns and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:",Do you think that discussing a mental health disorder with your employer would have negative consequences?,...,What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What country do you work in?,What US state or territory do you work in?,Which of the following best describes your work position?,Do you work remotely?,company_size,age_class
count,1146,1146.0,1146.0,1146,1146,1146,1146,1146,1146,1146,...,1146,1146,1146,709,1146,716,1146,1146,1146,1146
unique,1,2.0,2.0,4,3,3,3,3,6,3,...,49,3,43,47,44,48,179,3,6,5
top,0,1.0,1.0,Yes,I am not sure,No,No,I don't know,2,Maybe,...,30,Male,United States of America,California,United States of America,California,Back-end Developer,Sometimes,3,3
freq,1146,883.0,883.0,531,485,813,531,742,281,487,...,77,850,709,110,716,120,238,611,292,542


#### About 77% (883/1146) of respondents are recorded with a tech-related role. 46.3% (531/1146) report that their employer provides mental health benefits. About 62% (709/1146) of respondents live in the USA, and 15.5% (110/709) of those live in California. 53.3% (611/1146) work remotely at least sometimes.
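Shares like these can be read from `value_counts(normalize=True)`, which avoids dividing the `describe()` frequencies by hand. A sketch on a toy column:

```python
import pandas as pd

# Hypothetical toy answers to the remote-work question.
remote = pd.Series(['Sometimes', 'Always', 'Never', 'Sometimes'])

# normalize=True returns fractions; scale to percentages and round.
share = remote.value_counts(normalize=True).mul(100).round(1)
print(share['Sometimes'])  # 50.0
```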

In [49]:
df_c16.to_csv('cleaned_data.csv')
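CSV files store plain text, so the category dtype set earlier is not kept in the saved file, and `to_csv` also writes the index as the first column. The sketch below shows one way to reload such a file, using an in-memory buffer and a toy frame in place of `cleaned_data.csv`.

```python
import io

import pandas as pd

# Hypothetical toy frame with a non-contiguous index, like df_c16 after dropping rows.
toy = pd.DataFrame(
    {'gender': ['Male', 'Female'], 'age_class': [3, 2]},
    index=[0, 4],
).astype('category')

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index; dtypes come back as plain types.
plain = pd.read_csv(buffer, index_col=0)

# Convert back to categorical after loading if that dtype is needed.
restored = plain.astype('category')
print(plain.index.tolist())
print(restored.dtypes)
```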

In [50]:
df_c16.shape

(1146, 40)