## Unit 1 Capstone

### Assignment

The main component of this capstone is an experimentation RFC. Using the data set you selected, propose and outline an experiment plan. The plan should consist of three key components:

- Analysis that highlights your experimental hypothesis.
- A rollout plan showing how you would implement and roll out the experiment.
- An evaluation plan showing what constitutes success in this experiment.

Your experiment should be as real as possible. Though you obviously will not have access to the full production environment to deploy your experiment, it should be feasible and of interest to the parties involved with your actual data source.

### Exploring the Data

For this capstone project, I will be using 2014 data from the Mental Health in Tech survey. This survey measures attitudes toward mental health and the frequency of mental health disorders in the tech workplace.

More information on the Mental Health in Tech survey is available [here](https://www.kaggle.com/osmi/mental-health-in-tech-survey/data).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [2]:
# Loading the survey data from a local CSV file.

df = pd.read_csv('datafiles/mental_health_tech_survey.csv')

In [3]:
# Let's see what we're working with.

df.head()

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [5]:
# How is Gender being categorized?

df['Gender'].unique()

array(['Female', 'M', 'Male', 'male', 'female', 'm', 'Male-ish', 'maile',
       'Trans-female', 'Cis Female', 'F', 'something kinda male?',
       'Cis Male', 'Woman', 'f', 'Mal', 'Male (CIS)', 'queer/she/they',
       'non-binary', 'Femake', 'woman', 'Make', 'Nah', 'All', 'Enby',
       'fluid', 'Genderqueer', 'Female ', 'Androgyne', 'Agender',
       'cis-female/femme', 'Guy (-ish) ^_^', 'male leaning androgynous',
       'Male ', 'Man', 'Trans woman', 'msle', 'Neuter', 'Female (trans)',
       'queer', 'Female (cis)', 'Mail', 'cis male', 'A little about you',
       'Malr', 'p', 'femail', 'Cis Man',
       'ostensibly male, unsure what that really means'], dtype=object)

In [6]:
# Looks like people wrote in responses.  Data cleaning time!

df['Gender'] = df['Gender'].str.lower()
df['Gender'] = df['Gender'].replace('m','male')
df['Gender'] = df['Gender'].replace('f','female')

df['Gender'] = df['Gender'].str.strip()

df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('cis ',''))
df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('(cis)',''))

df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('make','male'))
df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('mail','male'))
df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('mal','male'))
df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('malee','male'))
df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('malr','male'))
df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('woman','female'))
df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('maler','male'))
df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('msle','male'))
df['Gender'] = df['Gender'].apply(lambda x: str(x).replace('man','male'))

df['Gender'].value_counts()

male                                              988
female                                            245
female (trans)                                      2
male leaning androgynous                            1
queer/she/they                                      1
queer                                               1
neuter                                              1
female                                              1
a little about you                                  1
malee                                               1
trans-female                                        1
p                                                   1
agender                                             1
something kinda male?                               1
fluid                                               1
non-binary                                          1
male                                                1
enby                                                1
androgyne                   

In [7]:
# Changing those who didn't reply "male" or "female" to "other".

df['Gender'] = df['Gender'].apply(lambda x: 'other' if x != 'male' and x != 'female' else x)

df['Gender'].value_counts()

male      988
female    245
other      26
Name: Gender, dtype: int64
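
The chained `replace` calls above depend on the order in which they run (for example, `'mal'` → `'male'` also rewrites strings that already contain `'male'`). A lookup-based approach avoids that. The following is only a sketch: the `MALE` and `FEMALE` sets and the `normalize_gender` helper are hypothetical names, and the spellings are taken from the `unique()` output above. It would likely fold a few more write-ins (such as `'maile'` and `'Male (CIS)'`) into `male`/`female`, so the counts would differ slightly from those shown.

```python
import re

# Hypothetical lookup sets, built from the spellings in df['Gender'].unique().
MALE = {'m', 'male', 'man', 'mal', 'maile', 'make', 'mail', 'malr', 'msle'}
FEMALE = {'f', 'female', 'woman', 'femake', 'femail'}


def normalize_gender(raw):
    """Map a free-text gender response to 'male', 'female', or 'other'."""
    text = str(raw).strip().lower()
    # Drop "cis" qualifiers such as "cis male", "male (cis)", "cis-female".
    text = re.sub(r'\(cis\)|\bcis\b[\s-]*', '', text).strip()
    if text in MALE:
        return 'male'
    if text in FEMALE:
        return 'female'
    return 'other'


# Assumed usage on the survey frame:
# df['Gender'] = df['Gender'].map(normalize_gender)
for sample in ['Male (CIS)', 'Cis Female', 'maile', 'Enby', 'Female ']:
    print(repr(sample), '->', normalize_gender(sample))
```

Because every response is matched as a whole string, a typo fix for one spelling cannot accidentally alter another.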

### What is the frequency of mental health issues in tech?

When trying to determine the frequency of mental health conditions in the tech workplace, the most relevant survey question is this one: Have you sought treatment for a mental health condition?

It is worth mentioning that this question presents some limitations.

First, the question does not ask whether the respondent has a mental health condition, but rather whether they have sought treatment for one. A person who has a mental health condition but is forgoing treatment would respond "no." In this case, the data would under-report the frequency of mental health conditions in tech.

Second, respondents are not randomly selected to take this survey, but are instead self-selected. People who have mental health issues are more likely to complete surveys about mental health. In this case, the data would over-report the frequency of mental health conditions in tech.

In [8]:
# Changing treatment responses to binary.

df['treatment'] = df.treatment.map({'Yes':1, 'No':0})

In [15]:
# Proportion of survey respondents that say they have sought treatment for a mental health condition.

df['treatment'].mean()

0.50595710881652101

In [10]:
# Proportion of people who say they went for mental health treatment by gender.

df.groupby('Gender')['treatment'].mean()

Gender
female    0.689796
male      0.454453
other     0.730769
Name: treatment, dtype: float64

While the majority of survey respondents are male, a larger share of female respondents (69%) than male respondents (45%) say they have sought mental health treatment. This is consistent with [existing research](https://www.theguardian.com/society/2016/nov/05/men-less-likely-to-get-help--mental-health) showing that men are less likely than women to seek mental health treatment.

Among survey respondents in the "other" category, 73% have sought mental health treatment, although this group is small (26 respondents). This is also consistent with [existing research](https://www.healthypeople.gov/2020/topics-objectives/topic/lesbian-gay-bisexual-and-transgender-health#one). Most respondents in the "other" category identify as LGBT. LGBT individuals, who are more likely to experience societal stigma, discrimination, and denial of rights, have higher rates of mental health issues.
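
Since the groups differ greatly in size, it may help to attach an interval to each proportion. The sketch below computes a 95% Wilson score interval; `wilson_interval` is a hypothetical helper, and the treated counts (169, 449, 19) are reconstructed by multiplying the group sizes from cell 7 by the proportions from cell 10, so treat them as an assumption rather than values read from the data.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the (lower, upper) Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# (treated, group size): counts reconstructed from the printed outputs above.
groups = {'female': (169, 245), 'male': (449, 988), 'other': (19, 26)}

for name, (treated, size) in groups.items():
    low, high = wilson_interval(treated, size)
    print(f"{name}: {treated / size:.3f} (95% interval {low:.3f} to {high:.3f})")
```

The interval for the "other" group should come out much wider than the other two, which reflects how little can be concluded from 26 respondents.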

In [11]:
# Proportion of people who say they went for mental health treatment by 
#     whether or not the workplace provides mental health benefits.

df.groupby('benefits')['treatment'].mean()

benefits
Don't know    0.370098
No            0.483957
Yes           0.639413
Name: treatment, dtype: float64

In [17]:
# Is the difference between the No and Yes groups statistically significant?
stats.ttest_ind(df[df.benefits=='No'].treatment, df[df.benefits=='Yes'].treatment)

Ttest_indResult(statistic=-4.5986868552563989, pvalue=4.8985435953573476e-06)

In [12]:
# Proportion of people who say they went for mental health treatment by 
#     whether or not the workplace provides mental health care options.

df.groupby('care_options')['treatment'].mean()

care_options
No          0.413174
Not sure    0.391720
Yes         0.691441
Name: treatment, dtype: float64

In [18]:
# Is the difference between the No and Yes groups statistically significant?
stats.ttest_ind(df[df.care_options=='No'].treatment, df[df.care_options=='Yes'].treatment)

Ttest_indResult(statistic=-8.9163036150785491, pvalue=2.4538096289972692e-18)

In [13]:
# Proportion of people who say they went for mental health treatment by 
#     whether or not the workplace discussed mental health as part of an employee wellness program
df.groupby('wellness_program')['treatment'].mean()

wellness_program
Don't know    0.430851
No            0.498812
Yes           0.593886
Name: treatment, dtype: float64

In [19]:
# Is the difference between the No and Yes groups statistically significant?
stats.ttest_ind(df[df.wellness_program=='No'].treatment, df[df.wellness_program=='Yes'].treatment)

Ttest_indResult(statistic=-2.5586429041817653, pvalue=0.010645001151270303)

In [14]:
# Proportion of people who say they went for mental health treatment by 
#     whether or not the workplace provides resources to learn more about mental health and
#     how to seek help.

df.groupby('seek_help')['treatment'].mean()

seek_help
Don't know    0.4573
No            0.5000
Yes           0.5920
Name: treatment, dtype: float64

In [20]:
# Is the difference between the No and Yes groups statistically significant?
stats.ttest_ind(df[df.seek_help=='No'].treatment, df[df.seek_help=='Yes'].treatment)

Ttest_indResult(statistic=-2.4792815697578368, pvalue=0.013348107722419353)
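
The t-tests above are run on a 0/1 column, so they are effectively comparing two proportions. A pooled two-proportion z-test is a common alternative for that case and should give similar answers at these sample sizes. The sketch below is not what the notebook ran: `two_proportion_z_test` is a hypothetical helper, and the example uses the gender split only because its group sizes are printed above (the treated counts are reconstructed from the printed proportions).

```python
import math


def two_proportion_z_test(k1, n1, k2, n2):
    """Pooled two-sided z-test for the difference between two proportions."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Female vs. male treatment rates (counts reconstructed, see lead-in).
z, p_value = two_proportion_z_test(169, 245, 449, 988)
print(f"z = {z:.2f}, p = {p_value:.2e}")

# For the workplace questions, the counts could be pulled with, e.g.:
# df.groupby('benefits')['treatment'].agg(['sum', 'count'])
```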

Respondents whose employers offer mental health benefits are significantly more likely to have sought mental health treatment. This could be because respondents who need mental health services are more aware of the benefits their workplaces offer. Or it could mean that workplaces offering mental health services are more likely to have employees who seek treatment. Because the survey captures a single point in time, it cannot establish which came first.

It would be interesting to run an experiment to determine:

1. If a company adopts a mental health program, would employees use the services?
2. Given that employees are utilizing mental health services, would overall job satisfaction and productivity improve?

## Mental Health in Tech: Experimentation RFC

Setting:

(Review current literature on mental health in the workplace)

Our theoretical company, BookFace, is concerned about the mental health of its employees. Currently, BookFace offers employees health insurance and provides a physical wellness program that promotes exercise and healthy eating. However, leadership believes more can be done to support employees' mental wellness. Thus, they designed a mental health experiment to determine:

1. What usage of mental health services might look like.
2. How mental health benefits could improve overall job satisfaction and productivity.

To do this, they will implement a mental health wellness program in randomly selected offices in North America. They will measure employee sick days as a proxy for productivity.

Experiment:

BookFace has 30 locations across North America.  These locations will be randomly assigned to either the test or control groups.

For offices in the control group, employees will attend an info session that reviews their current health benefits, including any mental health service offerings with their health insurance plan.

Offices in the test group will adopt a more comprehensive mental health wellness program. This program includes a "wellness week," in which leadership and employees openly discuss and address mental health issues in the workplace to combat stigma around mental health conditions. For the duration of the experiment, a mental health professional will be available on staff, offering three free counselling sessions. Employees will be assured that the names of clients and the nature of their visits will be kept confidential. BookFace will also offer biweekly meditation sessions for staff members.
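
The random assignment of the 30 locations could be sketched as follows. The office IDs, the `assign_offices` helper, the fixed seed, and the even 15/15 split are all assumptions made for illustration; the plan itself only says that locations are assigned at random.

```python
import random


def assign_offices(office_ids, seed=2014):
    """Shuffle the offices and split them evenly into test and control."""
    rng = random.Random(seed)  # fixed seed so the assignment can be audited
    shuffled = list(office_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {'test': sorted(shuffled[:half]), 'control': sorted(shuffled[half:])}


# Hypothetical identifiers for the 30 BookFace locations.
offices = [f"office_{i:02d}" for i in range(1, 31)]
groups = assign_offices(offices)

print("test:   ", groups['test'])
print("control:", groups['control'])
```

Assigning whole offices rather than individual employees keeps the program from leaking between colleagues who share a workplace, at the cost of having only 30 units to compare.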


Hypothesis:
Employees in the test group will take fewer sick days on average than employees in the control group.

Null Hypothesis:
There is no difference in the average number of employee sick days between test and control groups.

Success metric: Average number of sick days.
Secondary metric: Number of visits to the mental health professional.
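
One way the hypothesis above might be evaluated is a one-sided permutation test on office-level averages, since offices (not individual employees) are the unit of randomization. This is a sketch under stated assumptions: `permutation_test` is a hypothetical helper, and the sick-day figures below are made-up placeholder values, not results.

```python
import random


def permutation_test(test_values, control_values, n_permutations=10_000, seed=0):
    """One-sided permutation test: is the test-group mean lower than control?"""
    rng = random.Random(seed)
    n_test = len(test_values)
    n_control = len(control_values)
    observed = sum(test_values) / n_test - sum(control_values) / n_control
    pooled = list(test_values) + list(control_values)
    as_extreme = 0
    for _ in range(n_permutations):
        # Re-deal the offices at random, as if the program had no effect.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_test]) / n_test - sum(pooled[n_test:]) / n_control
        if diff <= observed:
            as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (as_extreme + 1) / (n_permutations + 1)


# Placeholder data: average sick days per employee for each office (invented).
test_offices = [3.9, 4.2, 3.5, 4.8, 4.0, 3.7, 4.4, 3.6, 4.1, 3.8, 4.5, 3.3, 4.0, 4.3, 3.9]
control_offices = [5.1, 4.9, 5.6, 4.7, 5.3, 5.0, 5.8, 4.6, 5.2, 5.5, 4.8, 5.4, 5.0, 5.7, 4.9]

observed_diff, p_value = permutation_test(test_offices, control_offices)
print(f"difference in means: {observed_diff:.2f}, one-sided p = {p_value:.4f}")
```

A small p-value would count as evidence against the null hypothesis; the significance threshold would need to be fixed before the experiment starts.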

Timeline:

- Week 1: Implement "Wellness Week" in five offices in the test group and hold the info session for the control group.
- Week 3: Evaluate how Wellness Week went for the five test offices.