# Mental Health in Tech Survey Analysis
**Objective:** Analyze mental health trends in tech workplaces and visualize how company support structures relate to employee mental health outcomes.

Dataset Source: [Mental Health in Tech Survey](https://www.kaggle.com/osmi/mental-health-in-tech-survey)

## 1. Initial Setup

**Libraries Used:**  
- Pandas & NumPy (data handling)  
- Bokeh (visualizations)  

In [57]:
# Import required libraries
import numpy as np
import pandas as pd
from bokeh.plotting import figure, show, output_notebook
from bokeh.layouts import gridplot
from bokeh.models import HoverTool
from bokeh.models import ColumnDataSource

# Enable inline Bokeh displays
output_notebook()

# Load dataset
df = pd.read_csv("survey.csv")

## 2. Data Preparation
### Initial Data Inspection

In [58]:
# Basic dataset inspection
print("Dataset shape:", df.shape)

print("\nFirst 5 rows:")
print(df.head())

print("\nColumn Summary:")
df.info()

print("\nMissing Values (Top 10):")
print(df.isna().sum().sort_values(ascending=False).head(10))

print("\nColumns with more than 30% missing values:")
missing_data = df.isna().mean() * 100  # Convert to percentage
print(missing_data[missing_data > 30])

Dataset shape: (1259, 27)

First 5 rows:
             Timestamp  Age  Gender         Country state self_employed  \
0  2014-08-27 11:29:31   37  Female   United States    IL           NaN   
1  2014-08-27 11:29:37   44       M   United States    IN           NaN   
2  2014-08-27 11:29:44   32    Male          Canada   NaN           NaN   
3  2014-08-27 11:29:46   31    Male  United Kingdom   NaN           NaN   
4  2014-08-27 11:30:22   31    Male   United States    TX           NaN   

  family_history treatment work_interfere    no_employees  ...  \
0             No       Yes          Often            6-25  ...   
1             No        No         Rarely  More than 1000  ...   
2             No        No         Rarely            6-25  ...   
3            Yes       Yes          Often          26-100  ...   
4             No        No          Never         100-500  ...   

                leave mental_health_consequence phys_health_consequence  \
0       Somewhat easy               

### Data Cleaning

In [59]:
# Select and clean relevant columns
columns_to_keep = [
    'Age', 'Gender', 'family_history', 'treatment', 'work_interfere',
    'remote_work', 'tech_company', 'benefits', 'seek_help',
    'anonymity', 'mental_health_consequence', 'phys_health_consequence'
]

df_clean = df[columns_to_keep].copy()

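# Cap implausible ages at the 18-80 bounds (rows are kept, values are clipped)
df_clean['Age'] = df_clean['Age'].clip(18, 80)

# Collapse gender into two categories.
# Caveat: only responses containing "female" become 'Female'; every other
# response, including abbreviations such as "F", is labeled 'Male'.
df_clean['Gender'] = np.where(
    df_clean['Gender'].str.lower().str.contains('female'),
    'Female',
    'Male'
)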

# Handle missing values
df_clean['work_interfere'] = df_clean['work_interfere'].fillna('Not Specified')
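
The substring match above sends abbreviated answers such as "M" or "F" and any other free-text response into a single bucket. A minimal sketch of a more explicit mapping follows. The token sets and the `'Other'` category are hypothetical choices for illustration, not part of the original analysis, and they would need to be extended after inspecting the actual unique values in the column.

```python
# Hypothetical lookup sets; extend them after inspecting df['Gender'].unique()
FEMALE_TOKENS = {"f", "female", "woman"}
MALE_TOKENS = {"m", "male", "man"}


def normalize_gender(value):
    """Map a free-text gender response to 'Female', 'Male', or 'Other'."""
    # Missing values (None or NaN) are not strings and fall into 'Other'
    if not isinstance(value, str):
        return "Other"
    token = value.strip().lower()
    if token in FEMALE_TOKENS:
        return "Female"
    if token in MALE_TOKENS:
        return "Male"
    return "Other"


# Toy responses, not taken from the survey
sample = ["Female", "M", "f", "Male ", "non-binary", None]
normalized = [normalize_gender(v) for v in sample]
print(normalized)
```

On the real data this could be applied with `df_clean['Gender'].map(normalize_gender)`. Doing so would change the gender proportions reported later in the notebook.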

## 3. Exploratory Data Analysis

In [60]:
# Age distribution
age_bins = pd.cut(df_clean['Age'], bins=10, right=True)
age_bins = age_bins.astype(str).apply(lambda x: f"({int(float(x.split(',')[0].strip()[1:]))}, {int(float(x.split(',')[1].strip()[:-1]))})")
sorted_age_bins = age_bins.value_counts(normalize=True, sort=False).sort_index(
    key=lambda x: x.str.extract(r'(\d+)').astype(int)[0]
)
print("\nAge Distribution:")
display(sorted_age_bins)



# Gender distribution
print("\nGender Distribution:")
display(df_clean['Gender'].value_counts(normalize=True))

# Treatment distribution
print("\nMental Health Treatment Distribution:")
display(df_clean['treatment'].value_counts(normalize=True))

# Benefits distribution
print("\nCompany Benefits Distribution:")
display(df_clean['benefits'].value_counts(normalize=True))


Age Distribution:


| Age | proportion |
|---|---|
| (17, 24) | 0.128674 |
| (24, 30) | 0.335981 |
| (30, 36) | 0.29865 |
| (36, 42) | 0.150119 |
| (42, 49) | 0.059571 |
| (49, 55) | 0.014297 |
| (55, 61) | 0.008737 |
| (61, 67) | 0.001589 |
| (67, 73) | 0.000794 |
| (73, 80) | 0.001589 |



Gender Distribution:


| Gender | proportion |
|---|---|
| Male | 0.848292 |
| Female | 0.151708 |



Mental Health Treatment Distribution:


| treatment | proportion |
|---|---|
| Yes | 0.505957 |
| No | 0.494043 |



Company Benefits Distribution:


| benefits | proportion |
|---|---|
| Yes | 0.378872 |
| Don't know | 0.324067 |
| No | 0.297061 |
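
The age distribution in this section is built by cutting ages into ten equal-width bins and then parsing the interval strings back into integers. A possible alternative, sketched below under stated assumptions, passes explicit integer edges and readable labels to `pd.cut` so that no string parsing is needed. The `edges` list and the toy `ages` series are hypothetical. The edges only approximate the bins printed above and are not the exact boundaries pandas computed.

```python
import pandas as pd

# Toy ages for illustration only (not survey data)
ages = pd.Series([18, 23, 29, 35, 41, 52, 80])

# Hypothetical integer bin edges; each bin is (lo, hi], i.e. right-inclusive
edges = [17, 24, 30, 36, 42, 49, 55, 61, 67, 73, 80]

# Because ages are integers, the bin (17, 24] holds ages 18 through 24
labels = [f"{lo + 1}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

age_groups = pd.cut(ages, bins=edges, labels=labels, right=True)

# sort=False keeps the bins in their natural (category) order
distribution = age_groups.value_counts(normalize=True, sort=False)
print(distribution)
```

Explicit labels make the right-inclusive boundary visible, which the rounded `(17, 24)` style labels obscure.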


## 4. Data Visualizations

### Visualization 1: Age Distribution by Treatment Status

In [65]:
# Prepare data
age_treatment = df_clean.groupby('treatment')['Age'].apply(list).reset_index()

# Define colors
colors = {'No': '#718dbf', 'Yes': '#e84d60'}

# Create figure
p1 = figure(title="Age Distribution by Treatment Status",
            width=600, height=450,
            tools="pan,wheel_zoom,box_zoom,reset")

# X-axis mapping
x_positions = {'No': 1, 'Yes': 2}

# Plot each treatment group as a jittered strip so overlapping ages stay visible
for treatment, ages in zip(age_treatment['treatment'], age_treatment['Age']):
    p1.scatter(x=np.random.normal(x_positions[treatment], 0.3, len(ages)),
               y=ages,
               color=colors[treatment],
               alpha=0.6,
               size=6,
               legend_label=treatment)

# Adjust x-axis labels
p1.xaxis.ticker = list(x_positions.values())
p1.xaxis.major_label_overrides = {1: 'No Treatment', 2: 'Sought Treatment'}
p1.yaxis.axis_label = 'Age'
p1.legend.title = "Treatment Status"
p1.legend.location = "top_right"

# Calculate the median age for each treatment group
median_ages = df_clean.groupby('treatment')['Age'].median()

# Print the median age by treatment status
print("Median Age by Treatment Status:")
print(f" - No Treatment: {median_ages['No']} years")
print(f" - Sought Treatment: {median_ages['Yes']} years")

# Show plot
show(p1)

Median Age by Treatment Status:
 - No Treatment: 31.0 years
 - Sought Treatment: 32.0 years
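
The horizontal jitter in Visualization 1 comes from an unseeded random generator, so the scatter looks different on every run. The sketch below shows one way to make it reproducible using NumPy's `default_rng`. The helper name `jitter_positions`, the seed, and the `spread` value are illustrative assumptions rather than part of the original notebook.

```python
import numpy as np


def jitter_positions(center, n, spread=0.1, seed=0):
    """Return n x-positions scattered around `center`, reproducibly."""
    # A fixed seed makes repeated calls return the same offsets
    rng = np.random.default_rng(seed)
    return rng.normal(loc=center, scale=spread, size=n)


# Two calls with the same arguments give identical positions
x_first = jitter_positions(center=1, n=5)
x_second = jitter_positions(center=1, n=5)
print(np.array_equal(x_first, x_second))
```

A smaller `spread` than the 0.3 used in the plot keeps the two groups, centered at x = 1 and x = 2, from overlapping as much.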


### Visualization 2: Mental Health Benefits vs Treatment Rates

In [62]:
# Prepare data
benefits_treatment = pd.crosstab(df_clean['benefits'], df_clean['treatment'], normalize='index') * 100

# Calculate actual treatment rates
treatment_rate_no_benefits = benefits_treatment.loc['No', 'Yes']  # % of employees seeking treatment (No benefits)
treatment_rate_with_benefits = benefits_treatment.loc['Yes', 'Yes']  # % of employees seeking treatment (With benefits)

# Calculate the percentage increase
percentage_increase = ((treatment_rate_with_benefits - treatment_rate_no_benefits) / treatment_rate_no_benefits) * 100

# Print actual rates
print(f"Treatment rate without benefits: {treatment_rate_no_benefits:.2f}%")
print(f"Treatment rate with benefits: {treatment_rate_with_benefits:.2f}%")
print(f"Percentage increase in treatment rate: {percentage_increase:.2f}%")

# Create plot
p2 = figure(title="Treatment Rates by Company Benefits",
           x_range=list(benefits_treatment.index),
           width=400, height=300,
           tools="hover")

p2.vbar(x='benefits', top='Yes', width=0.5,
       source=benefits_treatment.reset_index(),
       color='#4CAF50',
       legend_label="Sought Treatment")

p2.yaxis.axis_label = "Percentage (%)"
p2.xaxis.axis_label = "Company Provides Benefits"
p2.xgrid.grid_line_color = None
p2.legend.location = "top_left"

# Show plot
show(p2)

Treatment rate without benefits: 48.40%
Treatment rate with benefits: 63.94%
Percentage increase in treatment rate: 32.12%
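
The 32.12% figure is a relative increase, not a gap in percentage points. The short sketch below separates the two measures using the rounded rates printed above. The function name `compare_rates` is illustrative. Because the inputs here are already rounded, the relative figure can differ from the notebook's value in the last digit.

```python
def compare_rates(baseline_pct, comparison_pct):
    """Return (percentage-point difference, relative change in %)."""
    point_diff = comparison_pct - baseline_pct
    relative_change = point_diff / baseline_pct * 100
    return point_diff, relative_change


# Rounded rates taken from the printed output above
point_diff, relative_change = compare_rates(48.40, 63.94)
print(f"Difference: {point_diff:.2f} percentage points")
print(f"Relative change: {relative_change:.2f}%")
```

Reporting both numbers avoids the common misreading that treatment rates are 32 percentage points higher.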


### Visualization 3: Work Interference Distribution

In [63]:
# Prepare data
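# Count responses per level; rename_axis + reset_index(name=...) yields
# 'work_interfere' and 'count' columns on both pandas 1.x and 2.x
work_interfere = (
    df_clean['work_interfere']
    .value_counts()
    .rename_axis('work_interfere')
    .reset_index(name='count')
)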

# Share of all respondents (including 'Not Specified') who answered
# 'Sometimes' or 'Often' for work interference
total_responses = work_interfere['count'].sum()
interference_responses = work_interfere[work_interfere['work_interfere'].isin(['Sometimes', 'Often'])]['count'].sum()
percentage_interference = (interference_responses / total_responses) * 100

# Print the summary statement
print(f"Work Impact: {percentage_interference:.2f}% of respondents report mental health issues sometimes/often interfere with work")

# Create plot
p3 = figure(title="Work Interference Levels",
           x_range=list(work_interfere['work_interfere']),
           width=400, height=300)

p3.vbar(x='work_interfere', top='count', width=0.7,
       source=work_interfere,
       color='#FF9800')

p3.xaxis.axis_label = "Level of Work Interference"
p3.yaxis.axis_label = "Count"
p3.xgrid.grid_line_color = None
show(p3)


Work Impact: 48.37% of respondents report mental health issues sometimes/often interfere with work
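
The 48.37% share uses every respondent as the denominator, including those whose missing answer was filled with 'Not Specified'. The sketch below shows how the share changes when such rows are excluded. It uses toy responses rather than the survey data, and the helper name `interference_share` is a hypothetical choice.

```python
import pandas as pd


def interference_share(responses, levels=("Sometimes", "Often"), exclude=()):
    """Percentage of responses in `levels`, after dropping `exclude` values."""
    kept = responses[~responses.isin(list(exclude))]
    return kept.isin(list(levels)).mean() * 100


# Toy responses for illustration only (not survey data)
toy = pd.Series([
    "Sometimes", "Often", "Never", "Rarely",
    "Not Specified", "Not Specified", "Sometimes", "Never",
])

share_all = interference_share(toy)
share_answered = interference_share(toy, exclude=("Not Specified",))
print(f"All respondents: {share_all:.1f}%")
print(f"Answered only:   {share_answered:.1f}%")
```

Which denominator is appropriate depends on whether a missing answer should be read as "no interference" or as "question not applicable".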


## 5. Dashboard Layout

In [64]:
# Combine all visualizations
dashboard = gridplot([[p1, p2], [p3, None]])
show(dashboard)

## 6. Key Insights
1. **Age Distribution:** Respondents who sought treatment are slightly older (median age 32) than those who did not (median age 31).
2. **Company Benefits Matter:** Respondents whose employer offers mental health benefits sought treatment at a rate of 63.94%, compared with 48.40% for those without benefits. This is a relative increase of 32.12% (about 15.5 percentage points).
3. **Work Impact:** 48.37% of respondents report that mental health issues sometimes or often interfere with their work.