Analysis by - [Pratheek Nistala](https://pratheek.tech)

## Tech Minds: Analyzing Mental Health Trends in Technology Professionals

## 1. Business Understanding

In this project, titled "Tech Minds: Analyzing Mental Health Trends in Technology Professionals", I aim to measure employer attitudes toward mental health in tech workplaces and examine the prevalence of mental health disorders among tech workers. The dataset used for this analysis is derived from the 2014 Mental Health in Tech Survey.

This project seeks to answer the following key questions:   

    1. Did employers in the US tech industry recognize the importance of mental health in 2014?  
    2. How frequently does mental health impact employees' work performance?  
    3. What are the prevailing attitudes toward mental health within the tech industry?  
    4. How do factors such as family history and age influence mental health outcomes?  
    5. What insights can gender provide into mental health conditions among tech professionals?
     

Additionally, I will build five predictive models that estimate whether an employee is likely to seek medical treatment for mental health issues, and evaluate the performance of each model.

In [None]:
# Importing library for data processing
import pandas as pd
from google.colab import files

## 2. Data Understanding

In [None]:
# Upload the dataset manually
uploaded = files.upload()

# Load dataset
survey = list(uploaded.keys())[0]
data = pd.read_csv(survey)

Saving survey.csv to survey.csv
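
Outside Colab, `files.upload()` is unavailable, so the same file can be read from disk instead. The sketch below assumes `survey.csv` sits in the working directory; `load_survey` is a hypothetical helper name and not part of this notebook.

```python
from pathlib import Path

import pandas as pd


def load_survey(path="survey.csv"):
    """Read the survey CSV from a local path (hypothetical helper)."""
    csv_path = Path(path)
    if not csv_path.exists():
        # Fail early with a clear message if the file is not where we expect it
        raise FileNotFoundError(f"Survey file not found: {csv_path}")
    return pd.read_csv(csv_path)
```

Calling `data = load_survey()` would then stand in for the upload cell.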


In [None]:
data.head()

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


The dataset contains the following columns:   

    Timestamp : Date and time of the survey response.  
    Age : Age of the respondent.  
    Gender : Gender identity of the respondent.  
    Country : Country of residence.  
    state : State or territory of residence (for US respondents).  
    self_employed : Whether the respondent is self-employed.  
    family_history : Whether the respondent has a family history of mental illness.  
    treatment : Whether the respondent has sought treatment for a mental health condition.  
    work_interfere : Whether a mental health condition interferes with the respondent's work.  
    no_employees : Number of employees in the respondent’s company.  
    remote_work : Whether the respondent works remotely at least 50% of the time.  
    tech_company : Whether the respondent works for a primarily tech-focused company.  
    benefits : Whether the respondent’s employer provides mental health benefits.  
    care_options : Awareness of mental health care options provided by the employer.  
    wellness_program : Whether the employer has discussed mental health in wellness programs.  
    seek_help : Availability of resources to learn about mental health issues and seek help.  
    anonymity : Whether anonymity is protected when accessing mental health resources.  
    leave : Ease of taking medical leave for mental health conditions.  
    mental_health_consequence : Perception of negative consequences when discussing mental health with employers.  
    phys_health_consequence : Perception of negative consequences when discussing physical health with employers.  
    coworkers : Willingness to discuss mental health with coworkers.  
    supervisor : Willingness to discuss mental health with direct supervisors.  
    mental_health_interview : Likelihood of bringing up mental health in job interviews.  
    phys_health_interview : Likelihood of bringing up physical health in job interviews.  
    mental_vs_physical : Employer’s perceived seriousness of mental health compared to physical health.  
    obs_consequence : Observation of negative consequences for coworkers with mental health conditions.  
    comments : Additional notes or comments provided by respondents.
     

In [None]:
# Shape of the Data
data.shape

(1259, 27)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64 
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   object
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 14  wellness

In [None]:
data.describe()

Unnamed: 0,Age
count,1259.0
mean,79428150.0
std,2818299000.0
min,-1726.0
25%,27.0
50%,31.0
75%,36.0
max,100000000000.0
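
A mean of roughly 79 million next to a median of 31 points to a handful of extreme entries, not to a generally older sample. A toy illustration with made-up values (not the survey data) shows how such entries dominate the mean while leaving the median untouched:

```python
import pandas as pd

# Toy ages: three plausible values plus two impossible ones (illustrative only)
ages = pd.Series([27, 31, 36, 99999999999, -1726])

mean_age = ages.mean()      # pulled far upward by the single huge value
median_age = ages.median()  # middle value after sorting, unaffected by extremes

print(f"mean={mean_age:.1f}, median={median_age}")
```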


## 3. Data Preparation

3.1 Cleaning the Dataset

3.1.1 Updating Age Column Values

In [None]:
# Some values in this column don't make sense
data['Age'].unique()

array([         37,          44,          32,          31,          33,
                35,          39,          42,          23,          29,
                36,          27,          46,          41,          34,
                30,          40,          38,          50,          24,
                18,          28,          26,          22,          19,
                25,          45,          21,         -29,          43,
                56,          60,          54,         329,          55,
       99999999999,          48,          20,          57,          58,
                47,          62,          51,          65,          49,
             -1726,           5,          53,          61,           8,
                11,          -1,          72])

In [None]:
# Removing values that cannot be valid ages
values = [-1726, 329, 99999999999, -1, -29]
data = data[~data['Age'].isin(values)]

In [None]:
# Shape after Updating
data.shape

(1254, 27)
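
Listing bad values by hand works for five rows, but it leaves ages such as 5, 8 and 11 in place. A range filter is one alternative. The sketch below runs on toy data and assumes bounds of 18 and 75, which are an assumption for illustration and not a choice made in this notebook.

```python
import pandas as pd

# Toy frame with a mix of plausible and implausible ages (illustrative only)
toy = pd.DataFrame({"Age": [37, 44, -29, 329, 99999999999, 5, 8, 11, 72, 18]})

# Assumed plausible working-age bounds; adjust to the analysis at hand
MIN_AGE, MAX_AGE = 18, 75

# Series.between is inclusive on both ends by default
in_range = toy[toy["Age"].between(MIN_AGE, MAX_AGE)]

print(in_range["Age"].tolist())
```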

In [None]:
data.Country.count()

1254
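
`count()` only confirms that no country is missing. To see how concentrated the responses are, a per-country breakdown is more informative. A minimal sketch on toy data (the proportions here are made up):

```python
import pandas as pd

# Toy country column (illustrative only, not the survey proportions)
country = pd.Series(
    ["United States"] * 6 + ["Canada"] * 2 + ["United Kingdom"] * 2
)

# normalize=True returns shares instead of raw counts, sorted descending
shares = country.value_counts(normalize=True)

top_country = shares.index[0]
top_share = shares.iloc[0]
print(f"{top_country}: {top_share:.0%}")
```

On the real data, `data['Country'].value_counts()` has to run before the column is dropped.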

In [None]:
# Dropping non-essential columns. Country is dominated by the United States
# (751 responses), so keeping it in the dataset would introduce bias.
data = data.drop(["comments", "Timestamp", "Country", "state"], axis=1)

In [None]:
# Verifying the drop of 4 Columns
data.shape

(1254, 23)

3.1.2 Dealing with Null Values

In [None]:
# Checking for missing values
null_values = data.isnull().sum()
null_values[null_values > 0]

Column,Missing values
self_employed,18
work_interfere,263
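
Raw counts are easier to judge as a share of rows: 263 of 1254 is about 21% for `work_interfere`, while 18 of 1254 is about 1.4% for `self_employed`. A sketch of the same calculation on a toy frame:

```python
import pandas as pd

# Toy frame with gaps in two columns (illustrative only)
toy = pd.DataFrame({
    "self_employed": ["No", None, "Yes", "No"],
    "work_interfere": ["Often", None, None, "Never"],
    "treatment": ["Yes", "No", "No", "Yes"],
})

# isnull().mean() gives the fraction of missing entries per column
missing_share = toy.isnull().mean()
missing_share = missing_share[missing_share > 0]

print(missing_share)
```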


In [None]:
# Assigning a placeholder to columns with missing values
defaultString = 'NaN'  # both columns hold string values

# Columns in which null values are replaced
stringFeatures = ['self_employed', 'work_interfere']

# Making the missing-value marker consistent
data[stringFeatures] = data[stringFeatures].fillna(defaultString)

In [None]:
# Most values in the self_employed column are 'No', so NaN values will be changed to 'No'
data['self_employed'][data['self_employed']=='No'].count()

1092

In [None]:
# Replacing the NaN placeholder in the self_employed column with 'No'
data['self_employed'] = data['self_employed'].replace([defaultString], 'No')

# Unique values in self_employed column after replacement
data['self_employed'].unique()

array(['No', 'Yes'], dtype=object)

In [None]:
# Replacing the NaN placeholder in the work_interfere column with a separate "Don't know" category
data['work_interfere'] = data['work_interfere'].replace([defaultString], "Don't know")
print(data['work_interfere'].unique())

['Often' 'Rarely' 'Never' 'Sometimes' "Don't know"]
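
Filling gaps with a dedicated label and filling them with the most frequent observed answer produce different columns. The sketch below contrasts the two on toy data; the mode-based variant is shown only for comparison.

```python
import pandas as pd

# Toy answers with one missing entry (illustrative only)
answers = pd.Series(["Often", "Rarely", "Sometimes", "Sometimes", None])

# Option 1: keep missing answers as their own category
labelled = answers.fillna("Don't know")

# Option 2: impute with the most frequent observed answer
most_common = answers.mode().iloc[0]
imputed = answers.fillna(most_common)

print(labelled.tolist())
print(imputed.tolist())
```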


3.1.3 Making Gender Column Consistent

In [None]:
# The Gender column contains many inconsistent values
print(data["Gender"].unique())

['Female' 'M' 'Male' 'male' 'female' 'm' 'Male-ish' 'maile' 'Trans-female'
 'Cis Female' 'F' 'something kinda male?' 'Cis Male' 'Woman' 'f' 'Mal'
 'Male (CIS)' 'queer/she/they' 'non-binary' 'Femake' 'woman' 'Make' 'Nah'
 'Enby' 'fluid' 'Genderqueer' 'Female ' 'Androgyne' 'Agender'
 'cis-female/femme' 'Guy (-ish) ^_^' 'male leaning androgynous' 'Male '
 'Man' 'Trans woman' 'msle' 'Neuter' 'Female (trans)' 'queer'
 'Female (cis)' 'Mail' 'cis male' 'A little about you' 'Malr' 'femail'
 'Cis Man' 'ostensibly male, unsure what that really means']


In [None]:
# Changing all values to lowercase
data["Gender"] = data["Gender"].str.lower()
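
Lowercasing leaves values that differ only by surrounding whitespace, such as `'Female '` and `'Male '` in the output above, as separate entries. Stripping whitespace in the same pass would shrink the list of variants to handle. A sketch on a few of those values:

```python
import pandas as pd

# A handful of raw values taken from the Gender output
raw = pd.Series(["Female ", "Male ", "male", "F", "Cis Male"])

# Strip surrounding whitespace first, then lowercase
normalized = raw.str.strip().str.lower()

print(normalized.unique())
```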

In [None]:
# Making gender-groups to classify all values
male_str = ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man", "msle", "mail", "malr", "cis man", "cis male"]
trans_str = ["trans-female", "something kinda male?", "queer/she/they", "non-binary", "nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "trans woman", "neuter", "female (trans)", "queer", "ostensibly male, unsure what that really means"]
female_str = ["cis female", "f", "female", "woman", "femake", "female ", "cis-female/femme", "female (cis)", "femail"]

In [None]:
# Mapping every raw value to one of three categories
gender_map = {value: 'male' for value in male_str}
gender_map.update({value: 'female' for value in female_str})
gender_map.update({value: 'trans' for value in trans_str})
# Values missing from the map keep their original text
data['Gender'] = data['Gender'].map(gender_map).fillna(data['Gender'])

# Dropping entries that are not gender values
stk_list = ['a little about you', 'p']
data = data[~data['Gender'].isin(stk_list)]

In [None]:
# Values after Cleaning
data["Gender"].value_counts()

Gender,count
male,988
female,247
trans,18
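
Values missing from all three lists pass through the grouping step unchanged, so it is worth confirming that nothing unexpected is left. A sketch of such a check on toy data; `expected` and `leftovers` are illustrative names:

```python
import pandas as pd

# Toy column with one value the grouping lists would not cover
gender = pd.Series(["male", "female", "trans", "male", "femal"])

expected = {"male", "female", "trans"}

# Anything outside the expected categories was not covered by the grouping
leftovers = sorted(set(gender.unique()) - expected)

print(leftovers)
```

An empty list means every response landed in one of the three categories, which matches the counts shown above.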


In [None]:
# Data After Cleaning
data.head()

Unnamed: 0,Age,Gender,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,37,female,No,No,Yes,Often,6-25,No,Yes,Yes,...,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No
1,44,male,No,No,No,Rarely,More than 1000,No,No,Don't know,...,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No
2,32,male,No,No,No,Rarely,6-25,No,Yes,No,...,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No
3,31,male,No,Yes,Yes,Often,26-100,No,Yes,No,...,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes
4,31,male,No,No,No,Never,100-500,Yes,Yes,Yes,...,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No


3.1.4 Saving Cleaned Dataset

In [36]:
# Making a copy of cleaned data to use for visualizations in the next notebook
data_cleaned = data.copy()

# Saving the cleaned data to a CSV file
file_name = 'clean_data.csv'
data_cleaned.to_csv(file_name, index=False)

# Download the cleaned dataset to your local machine
from google.colab import files
files.download(file_name)
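
A quick round trip can confirm that the saved file reloads with the same shape and columns. The sketch writes to an in-memory buffer instead of disk so it runs anywhere; the toy frame stands in for `data_cleaned`.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned data (illustrative only)
toy_cleaned = pd.DataFrame({
    "Age": [37, 44, 32],
    "Gender": ["female", "male", "male"],
    "treatment": ["Yes", "No", "No"],
})

# Write without the index, as in the cell above
buffer = io.StringIO()
toy_cleaned.to_csv(buffer, index=False)

# Rewind and read the CSV text back
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.shape == toy_cleaned.shape)
```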
