# 02.526 Data Visualisation Project
# DATA PREPARATION

This dataset comes from a 2014 survey that measures attitudes towards mental health and the frequency of mental health disorders in the tech workplace.

### Driving Questions:

- How do the frequency of mental illness and attitudes towards mental health vary by geographic location?
- What are the strongest predictors of mental illness or of certain attitudes towards mental health in the workplace?

In [1]:
import pandas as pd
import os
import seaborn as sns
import numpy as np

In [2]:
base_path = os.getcwd()
df = pd.read_csv(base_path+'/survey.csv')

In [3]:
df.head()

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [4]:
df.rename(columns = {'Age': 'age', 'Gender': 'gender'}, inplace = True)

In [5]:
df.shape

(1259, 27)

### Columns
- X - Timestamp
- Y - Age
- Y - Gender
- X - Country
- Y - state: If you live in the United States, which state or territory do you live in? - **convert to code**
- X - self_employed: Are you self-employed?
- Y - family_history: Do you have a family history of mental illness?
- Y - treatment: Have you sought treatment for a mental health condition?
- Y - work_interfere: If you have a mental health condition, do you feel that it interferes with your work?
- X - no_employees: How many employees does your company or organization have?
- X - remote_work: Do you work remotely (outside of an office) at least 50% of the time?
- T - tech_company: Is your employer primarily a tech company/organization?
- T - benefits: Does your employer provide mental health benefits?
- T - care_options: Do you know the options for mental health care your employer provides?
- T - wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?
- T - seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?
- T - anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
- Y - leave: How easy is it for you to take medical leave for a mental health condition?
- Y - mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
- X - phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?
- Y - coworkers: Would you be willing to discuss a mental health issue with your coworkers?
- Y - supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
- X - mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?
- X - phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?
- X - mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?
- X - obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
- X - comments: Any additional notes or comments

## Processing

In [6]:
df = df[df['Country'] == 'United States'] #filter on US
df = df[~df['state'].isna()] #remove state with NA

In [7]:
df = df[['age', 'gender','state',
            'family_history', 'treatment', 'work_interfere',
            'leave', 'mental_health_consequence','coworkers', 'supervisor']]

In [8]:
df = df[df['age'] > 18]
df = df[~df['work_interfere'].isna()]

In [9]:
# Group the free-text gender entries into three labels
female = ['Female', 'female', 'Cis Female', 'F', 'f', 'Femake', 'woman', 'Female ', 'Woman', 'cis-female/femme', 'Female (cis)', 'femail']
male   = ['M', 'Male', 'male', 'Male-ish', 'maile', 'Cis Male', 'm', 'Male (CIS)', 'Male ', 'Man', 'msle', 'Mail', 'cis male']
others = ['Trans-female', 'queer/she/they', 'non-binary', 'Make', 'Nah', 'Genderqueer', 'Trans woman', 'Female (trans)']
gender_map = {**dict.fromkeys(female, 'Female'),
              **dict.fromkeys(male, 'Male'),
              **dict.fromkeys(others, 'Others')}

# Series.map leaves any entry not listed above as NaN
df['gender'] = df['gender'].map(gender_map)

### FIPS

In [10]:
fips = pd.read_csv(base_path+'/fips_code_state.csv')
df_fips = df.merge(fips, left_on = 'state', right_on = 'Postal Code', how = 'left')
df_fips = df_fips[~df_fips['Postal Code'].isna()]
df_fips['FIPS'] = df_fips['FIPS'].astype(int)
df_fips.rename(columns = {'Name': 'state_name'}, inplace = True)
df_fips.drop(columns = ['Postal Code'], inplace = True)
df_fips.head()
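
The left merge followed by the `Postal Code` filter drops any respondent whose `state` value has no match in the FIPS table, without reporting which codes were lost. A minimal sketch of how `merge(..., indicator=True)` could list the unmatched codes first. The frames `toy_survey` and `toy_fips` and their values are made up for illustration; only the column names `Name`, `Postal Code` and `FIPS` come from the cell above.

```python
import pandas as pd

# Toy stand-ins for the survey and FIPS tables (hypothetical values).
toy_survey = pd.DataFrame({"state": ["IL", "TX", "ZZ"]})
toy_fips = pd.DataFrame({
    "Name": ["Illinois", "Texas"],
    "Postal Code": ["IL", "TX"],
    "FIPS": [17, 48],
})

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join.
merged = toy_survey.merge(
    toy_fips, left_on="state", right_on="Postal Code",
    how="left", indicator=True,
)

# State codes with no FIPS match; these rows would be removed by the filter.
unmatched = merged.loc[merged["_merge"] == "left_only", "state"].tolist()
print(unmatched)
```

Checking `unmatched` before filtering makes it easier to tell a genuine non-state value from a typo in the lookup file.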

## Categorical Values

In [17]:
for col in df_fips.columns.difference(['age', 'gender', 'state', 'state_name', 'FIPS']):
    df_fips[col+"_cat"] = df_fips[col].astype('category')
    df_fips[col+"_cat"] = df_fips[col+"_cat"].cat.codes
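
`astype('category')` sorts categories alphabetically, so the codes produced above do not follow the natural order of the answers. For `work_interfere`, the scaled output further down shows `Often` at 0.33 and `Rarely` at 0.67. If an ordinal encoding is wanted, one option is to declare the order explicitly. A minimal sketch on toy values; the ranking in `order` is an assumption about the intended order, not something this notebook specifies.

```python
import pandas as pd

answers = pd.Series(["Often", "Rarely", "Never", "Sometimes"])

# Default behaviour: categories are sorted alphabetically
# (Never, Often, Rarely, Sometimes).
alphabetical_codes = answers.astype("category").cat.codes.tolist()

# Assumed ranking from least to most interference (hypothetical choice).
order = ["Never", "Rarely", "Sometimes", "Often"]
ordered_codes = pd.Categorical(answers, categories=order, ordered=True).codes.tolist()

print(alphabetical_codes)  # [1, 2, 0, 3]
print(ordered_codes)       # [3, 1, 0, 2]
```

With an explicit order, the min-max scaled values and any score summed from them increase with the strength of the answer rather than with its spelling.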

In [18]:
from sklearn import preprocessing

cat_cols = ['coworkers_cat',
       'family_history_cat', 'leave_cat',
       'mental_health_consequence_cat', 'supervisor_cat', 'treatment_cat',
       'work_interfere_cat']
x = df_fips[cat_cols].values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x) #rescale each column to [0, 1]
#reuse df_fips's index so the index-based merge below lines up row by row
df_fips_cat = pd.DataFrame(x_scaled, columns = cat_cols, index = df_fips.index)

In [19]:
df_fips.drop(columns = ['coworkers_cat',
       'family_history_cat', 'leave_cat',
       'mental_health_consequence_cat', 'supervisor_cat', 'treatment_cat',
       'work_interfere_cat'], inplace = True)

In [20]:
df_fips = pd.merge(df_fips, df_fips_cat, left_index=True, right_index=True)

In [21]:
df_fips['mental_score'] = 0
for col in ['coworkers_cat',
       'family_history_cat', 'leave_cat',
       'mental_health_consequence_cat', 'supervisor_cat', 'treatment_cat',
       'work_interfere_cat']:
    df_fips['mental_score'] = df_fips['mental_score'] + df_fips[col]

In [22]:
df_fips['FIPS'] = 'US' + df_fips['FIPS'].astype(str)

In [23]:
df_fips = df_fips.round(2)
df_fips.head()

Unnamed: 0,age,gender,state,family_history,treatment,work_interfere,leave,mental_health_consequence,coworkers,supervisor,state_name,FIPS,coworkers_cat,family_history_cat,leave_cat,mental_health_consequence_cat,supervisor_cat,treatment_cat,work_interfere_cat,mental_score
0,37,Female,IL,No,Yes,Often,Somewhat easy,No,Some of them,Yes,Illinois,US17,0.5,0.0,0.5,0.5,1.0,1.0,0.33,3.83
1,44,Male,IN,No,No,Rarely,Don't know,Maybe,No,No,Indiana,US18,0.0,0.0,0.0,0.0,0.0,0.0,0.67,0.67
2,31,Male,TX,No,No,Never,Don't know,No,Some of them,Yes,Texas,US48,0.5,0.0,0.0,0.5,1.0,0.0,0.0,2.0
3,33,Male,TN,Yes,No,Sometimes,Don't know,No,Yes,Yes,Tennessee,US47,1.0,1.0,0.0,0.5,1.0,0.0,1.0,4.5
4,35,Female,MI,Yes,Yes,Sometimes,Somewhat difficult,Maybe,Some of them,No,Michigan,US26,0.5,1.0,0.25,0.0,0.0,1.0,1.0,3.75


### For Choropleth

In [24]:
df_fips.to_csv(base_path+'/survey_cleaned.csv', index = True)

### For Radar Chart

In [25]:
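# numeric_only=True skips the text columns (gender, state, FIPS, ...), which recent pandas versions refuse to average
total = pd.pivot_table(pd.DataFrame(df_fips.mean(numeric_only=True)).reset_index(), values=0, columns=['index']).round(2)
total['state_name'] = 'Total'
total.drop(columns = ['age', 'mental_score'], inplace = True)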

total.rename(columns = {'coworkers_cat': "Coworkers' Receptiveness",
                              'family_history_cat': 'Family MH History',
                              'leave_cat': 'Medical Leaves',
                              'mental_health_consequence_cat': 'Negative Consequences',
                              'supervisor_cat': "Supervisor's Receptiveness",
                              'treatment_cat': 'Finding Treatment',
                              'work_interfere_cat': 'Work Interference'
                             }, inplace = True)

In [26]:
df_fips_grp = df_fips.groupby(['state_name'])[['mental_score', 'coworkers_cat',
       'family_history_cat', 'leave_cat',
       'mental_health_consequence_cat', 'supervisor_cat', 'treatment_cat',
       'work_interfere_cat']].mean().reset_index()
df_fips_grp = df_fips_grp.round(2)
df_fips_grp.rename(columns = {'coworkers_cat': "Coworkers' Receptiveness",
                              'family_history_cat': 'Family MH History',
                              'leave_cat': 'Medical Leaves',
                              'mental_health_consequence_cat': 'Negative Consequences',
                              'supervisor_cat': "Supervisor's Receptiveness",
                              'treatment_cat': 'Finding Treatment',
                              'work_interfere_cat': 'Work Interference'
                             }, inplace = True)

In [29]:
# DataFrame.append was removed in pandas 2.0; pd.concat adds the 'Total' row instead
df_fips_grp = pd.concat([df_fips_grp, total], ignore_index=True)
# df_fips_grp.to_csv(base_path+'/survey_cleaned_grp.csv', index = True)

In [31]:
df_fips_grp.drop(columns = ['mental_score'], inplace = True)
df_fips_grpT = df_fips_grp.T.reset_index()
new_header = df_fips_grpT.iloc[0] #grab the first row for the header
df_fips_grpT = df_fips_grpT[1:] #take the data less the header row
df_fips_grpT.columns = new_header #set the header row as the df header
df_fips_grpT = df_fips_grpT.sort_values(by = 'state_name')
df_fips_grpT

Unnamed: 0,state_name,Alabama,Arizona,California,Colorado,Connecticut,Florida,Georgia,Idaho,Illinois,...,Tennessee,Texas,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming,Total
1,Coworkers' Receptiveness,0.57,0.5,0.46,0.38,0.75,0.46,0.45,0,0.5,...,0.44,0.5,0.39,0.5,0.46,0.46,0.0,0.55,0.5,0.48
2,Family MH History,0.57,0.4,0.58,0.5,1.0,0.42,0.36,1,0.54,...,0.48,0.4,0.44,0.0,0.62,0.55,0.0,0.45,0.5,0.5
6,Finding Treatment,0.86,1.0,0.74,0.62,0.5,0.75,0.55,1,0.71,...,0.55,0.74,0.44,1.0,0.54,0.66,0.0,0.73,0.5,0.68
3,Medical Leaves,0.36,0.45,0.28,0.19,0.25,0.4,0.3,0,0.3,...,0.3,0.29,0.39,0.0,0.25,0.43,0.5,0.45,0.0,0.32
4,Negative Consequences,0.29,0.1,0.46,0.25,0.25,0.29,0.36,1,0.33,...,0.44,0.41,0.67,0.5,0.54,0.42,1.0,0.45,0.75,0.42
5,Supervisor's Receptiveness,0.57,0.3,0.5,0.56,0.75,0.62,0.55,0,0.48,...,0.68,0.54,0.44,0.5,0.42,0.49,0.0,0.55,0.75,0.53
7,Work Interference,0.76,0.93,0.66,0.54,0.5,0.72,0.67,1,0.68,...,0.59,0.74,0.59,1.0,0.67,0.62,1.0,0.7,1.0,0.65
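
The header shuffle in cell 31 can also be written with `set_index` before transposing. A minimal sketch on a toy frame, assuming the same `state_name` column layout; the frame `toy` and its values are illustrative, not the survey results.

```python
import pandas as pd

toy = pd.DataFrame({
    "state_name": ["Alabama", "Total"],
    "Family MH History": [0.57, 0.50],
    "Work Interference": [0.76, 0.65],
})

# Make state names the index, transpose so states become columns,
# then expose the metric names as a regular column called state_name.
toy_T = (
    toy.set_index("state_name")
       .T
       .rename_axis("state_name")
       .reset_index()
)
toy_T.columns.name = None  # drop the leftover axis label on the columns
print(toy_T)
```

Because only numeric columns are transposed in this variant, the state columns keep a float dtype and need no separate cast.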


In [32]:
# Cast every state column (all columns except the metric names in 'state_name') to float
state_cols = df_fips_grpT.columns.drop('state_name')
df_fips_grpT[state_cols] = df_fips_grpT[state_cols].astype(float)

In [33]:
df_fips_grpT.to_csv(base_path+'/survey_cleaned_grpT.csv', index = True)

## Test Plots

In [34]:
# sns.catplot(x='self_employed', hue='remote_work', col='tech_company', kind='count', data=df)

In [35]:
# sns.catplot(x='anonymity', hue='leave', col='supervisor', row='coworkers', kind='count', data=df)

In [36]:
# sns.catplot(x='benefits', hue='treatment', col='wellness_program', row='care_options',kind='count', data=df)

In [37]:
# sns.catplot(x='seek_help', hue='mental_health_consequence', col='treatment', row='family_history', kind='count', data=df)