# Utilization of workplace resources and readiness to express concerns:


*a study on Mental Health in the Tech industry conducted by the Real Ricardo*

The data utilized in this notebook was sourced from Kaggle. The data originates from a survey conducted in 2014 by OSMI, a non-profit organization committed to increasing awareness, educating, and offering resources to support mental well-being within the tech and open source communities.



**Below is an explanation of the survey questions (columns) asked:**

| Column | Description |
|---|---|
| Timestamp | Date when the survey was completed |
| Gender | Identification of the surveyed individual's gender |
| Country | Country of residence |
| State | Only applicable to residents of the US |
| self_employed | Indicates if the respondent is self-employed |
| family_history | Inquires about any family history of mental illness |
| treatment | Asks if the respondent has sought treatment for a mental health condition |
| work_interfere | Assesses if a mental health condition interferes with work |
| no_employees | Indicates the size of the respondent's company or organization in terms of employees |
| remote_work | Inquires if the respondent works remotely (outside of an office) for at least 50% of the time |
| tech_company | Determines if the respondent's employer primarily operates as a tech company/organization |
| benefits | Inquires if the respondent's employer provides mental health benefits |
| care_options | Asks if the respondent knows the mental health care options provided by their employer |
| wellness_program | Determines if the respondent's employer has discussed mental health as part of an employee wellness program |
| seek_help | Inquires if the respondent's employer provides resources to learn about mental health issues and how to seek help |
| anonymity | Assesses if the respondent's anonymity is protected when utilizing mental health or substance abuse treatment resources |
| leave | Determines the ease of taking medical leave for a mental health condition |
| mental_health_consequence | Asks if discussing a mental health issue with the employer would have negative consequences |
| phys_health_consequence | Inquires if discussing a physical health issue with the employer would have negative consequences |
| coworkers | Assesses the willingness of the respondent to discuss a mental health issue with their coworkers |
| supervisor | Determines the willingness of the respondent to discuss a mental health issue with their direct supervisor(s) |
| mental_health_interview | Asks if the respondent would bring up a mental health issue with a potential employer during an interview |
| mental_vs_physical | Inquires if the respondent feels that their employer takes mental health as seriously as physical health |
| obs_consequence | Determines if the respondent has heard of or observed negative consequences for coworkers with mental health conditions in their workplace |
| comments | Provides space for any additional notes or comments |





In [None]:
# classic importing of libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

# choosing color palette
flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]

: 

## 1. Data Cleaning

In [None]:
# loading data into pandas dataframe and exploring columns and first rows
survey = pd.read_csv('survey.csv')
survey.head()

: 

In [None]:
# renaming columns to have all lowercase
survey.columns = [col.lower() for col in survey.columns]

# looking at dtypes, we can see all of them are pandas objects except for age that is int.
print(survey.dtypes)

: 

- **after exploring some of the columns I have decided not to use the *timestamp, country, state or comments columns*.**
- **i don't think there is enough data to do a geographical analysis.**

In [None]:
survey.drop(['timestamp', 'state', 'comments', 'country'], axis= 1, inplace= True)

: 

In [None]:
# after finding some weird responses in the age column ('such as 99999999 and -1729')
# i decided to only include 'valid' numbers for age
survey['age'] = survey['age'].apply(lambda x: x if 0 < x < 100 else np.nan)

: 

In [None]:
# gender is one big mess of data but I believe in myself.
# first lower case and strip everything to decrease options
survey['gender'] = survey['gender'].apply(lambda x: x.lower().strip())

# based on the replies I made these lists manually, I hope to not offend anyone.
male = ['male', 'm', 'make', 'cis male', 'man', 'cis man', 'msle', 'malr', 'mail', 'maile', 'something kinda male?', 'ostensibly male, unsure what that really means', 'male-ish', 'guy (-ish) ^_^', 'mal', 'male (cis)']
female = ['female', 'f', 'woman', 'female (cis)', 'cis-female/femme', 'femake', 'cis female', 'femail']

def regender(gender_input):
    
    """
    Input: a string about gender
    Output: male, female or other depending on the lists above
    
    """ 
    if gender_input in male:
        return 'male'
    elif gender_input in female:
        return 'female'
    else:
        return 'other'
    
survey['gender'] = survey['gender'].apply(regender)

# check our final results
survey.gender.value_counts(dropna= False)

: 

In [None]:
# checking for nans
nan_cols = []
for col in survey.columns:
    if survey[col].isnull().any():
        # report the share of missing values as a percentage
        print(col + ' (%): ' + str(round(survey[col].isnull().mean() * 100, 2)))
        nan_cols.append(col)

# i'm just gonna drop the rows without a valid age (doing this first keeps age numeric).
survey = survey.dropna(subset=['age']).copy()

# nans are also present in the self_employed column and in the work_interfere column.
# at most there are 20% of values with nans so we'll replace with 'Don't know'
for col in nan_cols:
    if col != 'age':
        survey[col] = survey[col].fillna("Don't know")

# in the context of our questions we need to drop everyone who is self-employed
print(survey['self_employed'].value_counts())
survey = survey[survey['self_employed'] != 'Yes'].copy()

: 

- **it seems the rest of the columns are questions that might be yes or no, with few additional options**
- **we can identify that by checking the possible answers in each question**

In [None]:
# getting unique answers for every column
exclude = ['age']
possible_answers = {col:[i for i in survey[col].unique()] for col in survey.columns if col not in exclude}
possible_answers

: 

- **after looking at these answer possibilities, I want to normalize all 'half-answers' to 'Don't know', which means changing 'Not sure', 'Maybe' and 'Some of them'**
- **this decision might be open for interpretation, but I made a judgement call for the sake of consistency**

In [None]:
def replace_uncertainty(survey_answer):
    """
    Uniformizes all uncertain answers.
    Input: half-answer (string)
    Output: 'Don't know'
    
    """
    
    uncertainty = ['Maybe', 'Some of them', 'Not sure']
    if survey_answer in uncertainty:
        return "Don't know"
    else:
        return survey_answer

# apply function to all columns
for col in survey.columns:
    survey[col] = survey[col].apply(replace_uncertainty)

: 

- **the column 'care_options' asks the question: "Do you know the options for mental health care your employer provides?" and about 30% of the answers are 'Don't know'.**
- **in my opinion: if the answer to a question 'Do you know?' is 'I don't know' - then the answer is No.**

In [None]:
survey['care_options'] = survey['care_options'].replace(to_replace="Don't know", value= 'No')
survey['care_options'].value_counts()

: 

## 2. Exploratory Data Analysis

**I was trying to come up with a question to ask the data but I wasn't feeling very creative.**
**So I'm gonna do an agnostic analysis and just make a correlation matrix between all the variables.**

1. The first thing I'll do is encode all the answers as numbers.
- **Important notes:**
- All the half answers go in the middle (0)

In [None]:
# create a dictionary to map out all the answers in the survey
encoding_dict = {'No': -1, "Don't know": 0, 'Yes': 1,
                 'Never': -2, 'Rarely': -1, 'Sometimes': 1, 'Often': 2,
                 'Very difficult': -2, 'Somewhat difficult': -1, 'Somewhat easy': 1, 'Very easy': 2,
                 '1-5': 0, '6-25': 1, '26-100': 2, '100-500': 3, '500-1000': 4, 'More than 1000': 5,
                 'male': -1, 'other': 0, 'female': 1}

: 

In [None]:
# made a copy just to check values
survey_coded = survey.copy()

# for each column except the excluded ones, map the values using our dictionary
for col in survey_coded.columns:
    if col not in exclude:
        survey_coded[col] = survey_coded[col].map(encoding_dict)

: 
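One thing worth checking after this step: `Series.map` with a dictionary returns NaN for any value that is not one of its keys, so an answer missing from `encoding_dict` would silently turn into missing data. Below is a minimal sketch of such a check. It uses a made-up toy column rather than the real survey, and the names `toy` and `toy_dict` are hypothetical.

```python
import pandas as pd

# toy data standing in for one survey column (not real responses)
toy = pd.Series(['Yes', 'No', "Don't know", 'Perhaps'])
toy_dict = {'No': -1, "Don't know": 0, 'Yes': 1}

coded = toy.map(toy_dict)

# answers that were present before mapping but are NaN afterwards
# were not keys of the dictionary
unmapped = toy[coded.isna() & toy.notna()].unique().tolist()
print(unmapped)
```

Running the same check per column on the real dataframe would show whether any answer option was left out of the dictionary.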

In [None]:
# let's take a look at our answer-encoded dataframe.
survey_coded

: 

- **because I'm still not feeling very creative, I'm gonna make a correlation matrix across all the variables in the study**

In [None]:
# make the correlation matrix
correlation = survey_coded.corr()

: 

In [None]:
# making a mask to only show half the table cause its duplicated.
mask = np.triu(np.ones_like(correlation, dtype=bool))

# choosing colors
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# creating the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation, cmap=cmap, mask=mask, annot= False)

: 

- **This plot is very pretty but I've come to the realization that it is useless to me. There are way too many variables. It's hard to tell what is important and what isn't, and I don't have the know-how yet to disentangle all of this.**

*Teach me please.*
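One possible way to make a large correlation matrix easier to read is to flatten it and sort the variable pairs by the absolute value of their correlation. The sketch below uses randomly generated toy data, not the survey, and the names `toy` and `pairs` are made up for illustration.

```python
import numpy as np
import pandas as pd

# toy data standing in for the encoded survey (not real responses)
rng = np.random.default_rng(0)
a = rng.integers(-1, 2, size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.integers(-1, 2, size=200),  # built to correlate with 'a'
                    'c': rng.integers(-1, 2, size=200)})     # generated independently of 'a'

corr = toy.corr()

# keep only the cells strictly below the diagonal so each pair appears once
lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
pairs = corr.where(lower).stack().dropna()

# sort the pairs from strongest to weakest absolute correlation
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```

Reading the top of such a list is often easier than scanning a heatmap, although it still says nothing about which relationships are meaningful.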

## 3. Hypothesis-driven analysis

**To create a score for the willingness to speak, we combined all the columns that ask whether you're willing to speak to someone about mental health.**

**Following the same logic, I combined all the columns that ask about services provided by the company**

We combined these columns by adding them together, so it was important to make sure they all 'move in the same direction'.
The result is two columns:
1. 'workplace_resources': possible values range from -6 to 6; this is the score of the company
2. 'willingness': possible values range from -3 to 3; this is the metric of the employee
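The ranges quoted above follow from the encoding. A small arithmetic check is sketched below. It assumes that four of the five resource columns are coded from -1 to 1, that 'leave' is coded from -2 to 2, and that each willingness column is coded from -1 to 1. The dictionary and function names are made up for this sketch.

```python
# assumed (min, max) codes per column, based on the encoding dictionary
resource_ranges = {'benefits': (-1, 1), 'wellness_program': (-1, 1),
                   'anonymity': (-1, 1), 'seek_help': (-1, 1),
                   'leave': (-2, 2)}
willingness_ranges = {'coworkers': (-1, 1), 'supervisor': (-1, 1),
                      'mental_health_interview': (-1, 1)}

def score_range(ranges):
    """Return the lowest and highest possible sum of the given columns."""
    lowest = sum(low for low, _ in ranges.values())
    highest = sum(high for _, high in ranges.values())
    return lowest, highest

print(score_range(resource_ranges))     # (-6, 6)
print(score_range(willingness_ranges))  # (-3, 3)
```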

In [None]:
# define columns to combine
workplace_resources = ['benefits', 
                      'wellness_program', 
                      'anonymity', 
                      'seek_help', 
                      'leave']

willingness = ['coworkers',
               'supervisor',
               'mental_health_interview']

# add columns together
survey_coded['workplace_resources'] = survey_coded.benefits + survey_coded.wellness_program + survey_coded.seek_help + survey_coded.leave + survey_coded.anonymity
survey_coded['willingness'] = survey_coded.coworkers + survey_coded.supervisor + survey_coded.mental_health_interview

: 

In [None]:
# gonna define a function to make a nice looking histogram

def pretty_histogram(data, colname, label, save= False):
    
    """
    Input: data (pandas dataframe), column name (str), x label (str), save (bool)
    Output: a pretty histogram
    """

    # sns.distplot is deprecated; histplot with discrete=True draws one bar per integer score
    hist = sns.histplot(data[colname],
                        discrete=True,
                        shrink=0.95,
                        color=flatui[-2],
                        alpha=1)
    
    hist.set(xlabel= label, ylabel= 'Count')
    sns.despine()
    
    if save:
        figure = hist.get_figure()
        figure.savefig('figures/' + colname + "_hist.png")
    

: 

In [None]:
# make a histogram for our newly created 'Willingness' variable.
pretty_histogram(survey_coded, 'willingness', 'Willingness Score', save= False)

: 

**This looks like a right-skewed distribution, meaning that the majority of people are unwilling to talk about mental health in their work environment.**
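A quick way to back up a visual impression of skew is the sample skewness, which is positive when the long tail points to the right. The sketch below uses `scipy.stats.skew` on a made-up list of scores, not the survey data.

```python
from scipy import stats

# toy willingness scores (not real responses): most values are low, a few are high
toy_scores = [-3, -3, -3, -2, -2, -2, -2, -1, -1, 0, 1, 3]

skewness = stats.skew(toy_scores)

# a positive value indicates a right-skewed distribution
print(skewness > 0)
```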

In [None]:
# make a histogram for our newly created 'Workplace Resources' variable
pretty_histogram(survey_coded, 'workplace_resources', 'Workplace Resources Score', save= True)

: 

**This one looks more normally distributed but still centered left of the zero, meaning that the majority of companies don't have many mental health resources**

#### Before moving forwards, one of the first questions we can ask is: does having more mental health services mean that your employees are aware these services exist?

In [None]:
# to check the proportion of people aware of services I needed to recode the values in the
# care options column from 'No' = -1 to 'No' = 0, so that the mean is the share of 'Yes' answers
survey_coded['care_options'] = survey_coded['care_options'].replace(to_replace=-1, value=0)
survey_coded.pivot_table(index=['workplace_resources'], values=['care_options'], aggfunc='mean')

: 

In [None]:
# plot proportions of people that know about care options for each score of workplace resource.

label = 'Workplace Resources Score'

bars = sns.barplot(x='workplace_resources', y='care_options', data= survey_coded, 
                   ci= None, color= flatui[4])

bars.set(xlabel= label, ylabel= 'Proportion Employee Awareness')
sns.despine()
figure = bars.get_figure()
figure.savefig('figures/resources_and_awareness.png')

: 

- **This is a really interesting result, kind of U-shaped.**
- **One possible interpretation is that the more services are available, more people know about them, and where there's none people ALSO know that.**

- **before moving on with our analysis, I would like to see what's the behavior of our willingness score according to some other variables, such as family history, treatment and work interfere**

In [None]:
# group by variable of interest, create a label and rename columns
family_data = survey_coded.groupby('family_history').agg({'willingness':'mean'}).reset_index()
family_data['label'] = ['Family History']*len(family_data)
family_data.rename(columns= {'family_history':'Answer'}, inplace=True)

treatment_data = survey_coded.groupby('treatment').agg({'willingness':'mean'}).reset_index()
treatment_data['label'] = ['Treatment']*len(treatment_data)
treatment_data.rename(columns= {'treatment':'Answer'}, inplace=True)

willingness_variables = pd.concat([family_data, treatment_data], axis=0)
# assign the result back instead of relying on inplace replacement through attribute access
willingness_variables['Answer'] = willingness_variables['Answer'].replace({-1: 'No', 1: 'Yes'})

: 

In [None]:
# making a category bar plot
fig, ax = plt.subplots(figsize=(5, 3))

bars= sns.barplot(x="label", y="willingness", hue="Answer", 
            palette= sns.color_palette(flatui[1:3]), 
            data=willingness_variables)

def change_width(ax, new_value) :
    for patch in ax.patches :
        current_width = patch.get_width()
        diff = current_width - new_value

        # we change the bar width
        patch.set_width(new_value)

        # we recenter the bar
        patch.set_x(patch.get_x() + diff * .5)


change_width(ax, .37)
sns.despine(bottom=True)
bars.set(xlabel= '', ylabel= 'Willingness Score')
bars.tick_params(axis='x', which='both', length=0)
figure = bars.get_figure()
figure.savefig('figures/history_treatment_willingness.png')

: 

In [None]:
xlabel = 'Work Interfere'
ylabel = 'Willingness Score'

fig, ax = plt.subplots(figsize=(5, 3))
bars = sns.barplot(x= 'work_interfere', y= 'willingness', data= survey_coded, color= flatui[-1], errwidth=0.75)

bars.set(xlabel= xlabel, ylabel= ylabel)
bars.set_xticklabels(['Never', 'Rarely', "Don't know", 'Sometimes', 'Often'], rotation=30)
sns.despine(bottom= True)
change_width(ax, .95)
bars.tick_params(axis='x', which='both', length=0)

figure = bars.get_figure()
figure.savefig('figures/work_interfere_willingness.png')

: 

In [None]:
# I also did this analysis for gender but I find it slightly less interesting.
gender_willingness = survey_coded.groupby('gender').agg({'willingness':'mean'})
gender_willingness

: 

- **contrary to expectations, it seems that women are less likely to open up about mental health. However, on further thought, this might make sense if we consider that tech is a male-dominated industry where climbing the ladder as a woman is perhaps significantly harder. In that case, they might be much more scared to bring up these issues and make their lives even harder.**

In [None]:
# to reach the answer to our main question
willingness_treatment = survey_coded.groupby('workplace_resources').agg({'willingness':'mean'})
willingness_treatment

: 

In [None]:
# create a bar plot with the average willingness score for each workplace resources category
xlabel = 'Workplace Resources Score'
ylabel = 'Willingness Score'

fig, ax = plt.subplots()
bars = sns.barplot(x='workplace_resources', y='willingness', data= survey_coded, 
                   color= flatui[3], errwidth=0.75)

bars.set(xlabel= xlabel, ylabel= ylabel)
sns.despine()
change_width(ax, .9)
figure = bars.get_figure()
figure.savefig('figures/resources_willingness.png')

: 

- **it looks like the average willingness score goes up the more resources are available in the company! which is pretty cool**
- **to put some statistical backing behind this, we took the raw data and fit a linear regression to see if we could predict a person's willingness score from the workplace resources score.**

In [None]:
# linear regression
x = survey_coded['workplace_resources']
y = survey_coded['willingness']

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

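print('The slope is: ' + str(slope))
print('The intercept is: ' + str(intercept))
# r_value is the correlation coefficient, so its square is the r-squared of the fit
print('The r-squared is: ' + str(r_value**2))
print('The std error is: ' + str(std_err))
print('The p-value is: ' + str(p_value))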

: 

In [None]:
xlabel = 'Workplace Resources Score'
ylabel = 'Willingness Score'
reg = sns.regplot(x=x, y=y, x_estimator=np.mean,
                  scatter_kws={'color':flatui[3]}, 
                  line_kws = {'color':flatui[3],
                              'ls':'dashed'})
reg.set(xlabel= xlabel, ylabel= ylabel)
sns.despine()
figure = reg.get_figure()
figure.savefig('figures/resources_willingness_regplot.png')

: 

- **there doesn't seem to be a strong relationship between these variables BUT!**
- **Even though our model only explains a small amount of the variation, when you consider all the things that might affect behavior towards mental health, even a small effect should not be ignored.**
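Since both scores are sums of ordinal answers, a rank-based measure such as Spearman's correlation may be a useful complement to the linear regression. This is a different technique from the one used above, sketched here with `scipy.stats.spearmanr` on made-up numbers rather than the survey data.

```python
from scipy import stats

# toy scores (not real responses)
toy_resources = [-6, -4, -3, -2, -1, 0, 1, 2, 4, 6]
toy_willingness = [-3, -2, -3, -1, -1, 0, -1, 1, 2, 1]

# Spearman's rho only uses the rank order of the values, not their spacing
rho, p_value = stats.spearmanr(toy_resources, toy_willingness)

print('Spearman rho: ' + str(round(rho, 3)))
print('p-value: ' + str(round(p_value, 4)))
```

On the real data, the two series would be `survey_coded['workplace_resources']` and `survey_coded['willingness']`.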