# Capstone Project: Mental Health in the Tech Industry


## Problem Statement

Mental health is a critical issue in the tech industry, where stress, high workloads, and remote work conditions can significantly impact employees.
This study examines the state of mental health among tech professionals, focusing on how various factors (remote work, employer support, fear of consequences) influence employees' well-being.

**Key Questions for Analysis:**

1. How common is mental health treatment in tech?
2. Do age or family history influence treatment rates?
3. Do employees fear consequences from employers?
4. Does working remotely or independently impact mental well-being?
5. How do companies support their employees?


## Import data

Download the dataset from Kaggle using the `kagglehub` Python package.


In [839]:
import kagglehub

path = kagglehub.dataset_download("osmi/mental-health-in-tech-survey")
print("Path to dataset files:", path)

Path to dataset files: /Users/elizabethsheremet/.cache/kagglehub/datasets/osmi/mental-health-in-tech-survey/versions/3


## Read data

Reading the data and assigning it to a variable for further manipulation.


In [840]:
import pandas as pd
import os

dataset_file = os.path.join(path, "survey.csv")
df = pd.read_csv(dataset_file)
print(df.head())

             Timestamp  Age  Gender         Country state self_employed  \
0  2014-08-27 11:29:31   37  Female   United States    IL           NaN   
1  2014-08-27 11:29:37   44       M   United States    IN           NaN   
2  2014-08-27 11:29:44   32    Male          Canada   NaN           NaN   
3  2014-08-27 11:29:46   31    Male  United Kingdom   NaN           NaN   
4  2014-08-27 11:30:22   31    Male   United States    TX           NaN   

  family_history treatment work_interfere    no_employees  ...  \
0             No       Yes          Often            6-25  ...   
1             No        No         Rarely  More than 1000  ...   
2             No        No         Rarely            6-25  ...   
3            Yes       Yes          Often          26-100  ...   
4             No        No          Never         100-500  ...   

                leave mental_health_consequence phys_health_consequence  \
0       Somewhat easy                        No                      No   
1 

## Clean data

**Step 1: Dropping unnecessary columns**

The `Timestamp`, `state`, and `comments` columns are not needed.

**Reasons:**

- `Timestamp` has no bearing on the results.
- `state` has many missing values and is not critical for the analysis.
- `comments` is an optional free-text field and is not used in the analysis.
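
The missing-value argument for dropping `state` can be checked rather than assumed. A minimal sketch on a small made-up frame (the values are illustrative, not taken from the survey) shows one way to rank columns by their share of missing entries:

```python
import pandas as pd

# Toy stand-in for the survey frame; values are made up for illustration.
toy = pd.DataFrame(
    {
        "Age": [37, 44, 32, 31],
        "state": ["IL", None, None, None],
        "comments": [None, None, None, "n/a"],
    }
)

# Share of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same `isna().mean()` expression on `df` gives the per-column view for the real data.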


In [841]:
cleaned_df = df.drop(["Timestamp", "comments", "state"], axis=1)
cleaned_df = pd.DataFrame(cleaned_df)
print(cleaned_df.head())

   Age  Gender         Country self_employed family_history treatment  \
0   37  Female   United States           NaN             No       Yes   
1   44       M   United States           NaN             No        No   
2   32    Male          Canada           NaN             No        No   
3   31    Male  United Kingdom           NaN            Yes       Yes   
4   31    Male   United States           NaN             No        No   

  work_interfere    no_employees remote_work tech_company  ...   anonymity  \
0          Often            6-25          No          Yes  ...         Yes   
1         Rarely  More than 1000          No           No  ...  Don't know   
2         Rarely            6-25          No          Yes  ...  Don't know   
3          Often          26-100          No          Yes  ...          No   
4          Never         100-500         Yes          Yes  ...  Don't know   

                leave mental_health_consequence phys_health_consequence  \
0       Somewhat 

**Step 2: Processing Key Values**

Some columns need to be normalized and processed.

**Categorical Data**
| Column | Issue | Solution |
|----------|---------|----------|
| "Gender" | Many spellings: "m", "F", "Male", "female", etc. | Standardize to "Male", "Female", "Other". |
| "self_employed" | Many missing values | Fill with "No" so the column is binary (Yes/No). |
| "tech_company" | Includes "No", while the focus is on the tech industry | Remove rows where the value is "No". |
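
As an alternative to an if/elif chain, the `Gender` standardization can be sketched with a lookup table. The alias table and the helper name `normalize_gender` below are hypothetical and deliberately short; the real list of spellings should come from inspecting the raw unique values.

```python
import pandas as pd

# Hypothetical alias table; extend it after inspecting the raw unique values.
GENDER_ALIASES = {
    "male": "Male", "m": "Male", "man": "Male",
    "female": "Female", "f": "Female", "woman": "Female",
}


def normalize_gender(series: pd.Series) -> pd.Series:
    """Strip whitespace, lowercase, map known aliases, and label the rest "Other"."""
    cleaned = series.str.strip().str.lower()
    return cleaned.map(GENDER_ALIASES).fillna("Other")


raw = pd.Series(["Male ", "F", "woman", "non-binary", None])
print(normalize_gender(raw).tolist())
```

Keeping the aliases in one dictionary makes it easier to review which spellings are covered and which fall through to "Other".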


In [842]:
# Strip surrounding whitespace and lower-case all values in the "Gender" column
cleaned_df["Gender"] = cleaned_df["Gender"].str.strip().str.lower()
# Uncomment to list all unique values in the "Gender" column
# print(cleaned_df["Gender"].unique())

# Map the known spellings to a standard label
def clean_gender(gender):
    if gender in ["male", "m", "man", "mail", "male-ish", "maile", "mal"]:
        return "Male"
    elif gender in ["female", "f", "woman", "fm", "femake"]:
        return "Female"
    else:
        return "Other"

# Apply function to "Gender" column
cleaned_df["Gender"] = cleaned_df["Gender"].apply(clean_gender)

# Fill NaN values in the "self_employed" column with "No"
cleaned_df["self_employed"] = cleaned_df["self_employed"].fillna("No")

# Keep only respondents who work at a tech company (.copy() avoids chained-assignment warnings later)
cleaned_df = cleaned_df[cleaned_df["tech_company"] == "Yes"].copy()
# print(cleaned_df)



**Numerical Data**
| Column | Issue | Solution |
|----------|---------|----------|
| "Age" | Values may be invalid (e.g., <18 or >70) | Remove rows with invalid values. |
| "no_employees" | Six different size ranges | Group them into "Small", "Medium", "Large". |
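
When grouping size ranges with a dictionary, any label missing from the dictionary silently becomes `NaN`. A small sketch of a coverage check follows; the mapping is a shortened copy used for illustration, and the "5000+" label is made up:

```python
import pandas as pd

# Shortened copy of a size mapping, for illustration only.
size_map = {"1-5": "Small", "6-25": "Small", "26-100": "Medium"}

labels = pd.Series(["1-5", "26-100", "6-25", "5000+"])  # "5000+" is a made-up label
mapped = labels.map(size_map)

# Labels the mapping does not cover end up as NaN; list them explicitly.
unmapped = sorted(labels[mapped.isna()].unique())
print(unmapped)
```

An empty list means every observed label has a group.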


In [843]:
# Keep only rows with a plausible age (18 to 70 inclusive); .copy() lets us add columns safely
cleaned_df = cleaned_df[(cleaned_df["Age"] >= 18) & (cleaned_df["Age"] <= 70)].copy()
print(type(cleaned_df))
print(cleaned_df["Age"].max())
# Group "no_employees" values into company-size buckets
print(cleaned_df["no_employees"].unique())
employee_groups = {
    "1-5": "Small",
    "6-25": "Small",
    "26-100": "Medium",
    "100-500": "Medium",
    "500-1000": "Large",
    "More than 1000": "Large",
}
# Create new column with the result
cleaned_df["company_size"] = cleaned_df["no_employees"].map(employee_groups)
# print(cleaned_df.head())

<class 'pandas.core.frame.DataFrame'>
62
['6-25' '26-100' '100-500' '1-5' '500-1000' 'More than 1000']


## Exploratory Data Analysis

### Importing libraries


In [844]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool, LabelSet
from bokeh.io import output_notebook
from bokeh.transform import factor_cmap, dodge, cumsum
from bokeh.palettes import Category20c

output_notebook()

### Question 1: How common is mental health treatment in tech?

**Variables used:** `treatment`

**Goal:**  
To measure how many tech employees have received treatment for mental health issues.

**Visualization:**  
Pie chart with percentage labels.

**Insight:**  
Gives a general picture of how common mental health treatment is in the tech industry.


In [845]:
import numpy as np

treatment_counts = cleaned_df["treatment"].value_counts()
treatment_percent = treatment_counts / treatment_counts.sum() * 100

data = pd.DataFrame(
    {"treatment": treatment_percent.index, "value": treatment_percent.values}
)
data["angle"] = data["value"] / data["value"].sum() * 2 * np.pi
data["color"] = ["#e57373", "#c62828"]
data["label"] = data.apply(
    lambda row: f"{row['treatment']} — {row['value']:.1f}%", axis=1
)
angles = np.cumsum([0] + list(data["angle"][:-1]))
data["theta"] = angles + data["angle"] / 2
data["x"] = 0.25 * np.cos(data["theta"])
data["y"] = 1 + 0.25 * np.sin(data["theta"])
data["percent"] = data["value"].apply(lambda x: f"{x:.1f}%")

source = ColumnDataSource(data)

p = figure(
    height=400,
    title="Undergoing treatment for mental illnesses",
    toolbar_location=None,
    tools="hover",
    tooltips="@label",
    x_range=(-0.5, 1.0),
)

p.wedge(
    x=0,
    y=1,
    radius=0.4,
    start_angle=cumsum("angle", include_zero=True),
    end_angle=cumsum("angle"),
    line_color="white",
    fill_color="color",
    legend_field="label",
    source=source,
)

labels = LabelSet(
    x="x",
    y="y",
    text="percent",
    level="glyph",
    x_offset=0,
    y_offset=0,
    text_align="center",
    text_baseline="middle",
    text_color="white",
    source=source,
    text_font_size="14pt",
)

p.add_layout(labels)

p.axis.visible = False
p.grid.visible = False
p.legend.label_text_font_size = "9pt"
p.title.text_font_size = "16pt"
p.title.align = "center"
show(p)

### Question 2: Does age affect the likelihood of treatment?

**Variables used:** `Age`, `treatment`

**Goal:**  
To explore if mental health treatment is more common in specific age groups.

**Visualization:**  
Histogram of age with grouped color bars (by treatment status).

**Insight:**  
Helps determine whether younger or older employees are more likely to seek help.
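
One detail of age binning is worth illustrating: with `right=False`, `pd.cut` builds half-open intervals, so a value equal to the last bin edge falls outside every bin. The sketch below uses made-up ages and a shortened bin list. In this notebook the largest age after cleaning is 62, so no rows are affected, but the behavior matters if the data changes.

```python
import pandas as pd

ages = pd.Series([18, 24, 25, 69, 70])
bins = [18, 25, 70]

# right=False makes each bin closed on the left and open on the right: [18, 25), [25, 70).
half_open = pd.cut(ages, bins, right=False)

# One way to keep the upper edge: extend the last bin edge by one year.
inclusive = pd.cut(ages, [18, 25, 71], right=False)

print(half_open.isna().sum(), inclusive.isna().sum())
```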


In [846]:
bins = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]

cleaned_df["age_group"] = pd.cut(cleaned_df["Age"], bins, right=False)
grouped = (
    cleaned_df.groupby(["age_group", "treatment"], observed=True)
    .size()
    .unstack(fill_value=0)
)
grouped.index = grouped.index.map(lambda x: f"{int(x.left)} - {int(x.right - 1)}")
total_counts = grouped.sum(axis=1)
percent_yes = (grouped["Yes"] / total_counts * 100).round(1).astype(str) + "%"
percent_no = (grouped["No"] / total_counts * 100).round(1).astype(str) + "%"

source = ColumnDataSource(
    data=dict(
        age_group=grouped.index.tolist(),
        Yes=grouped["Yes"].tolist(),
        No=grouped["No"].tolist(),
        percent_yes=percent_yes.tolist(),
        percent_no=percent_no.tolist(),
        y_yes=(grouped["Yes"] + 5).tolist(),
        y_no=(grouped["No"] + 5).tolist(),
        x_yes=grouped.index.tolist(),
        x_no=grouped.index.tolist(),
    )
)

p = figure(
    x_range=grouped.index.tolist(),
    y_range=(0, grouped.values.max() + 100),
    title="Treatment Seeking by Age Group",
    width=800,
    height=500,
    background_fill_color="#fafafa",
)

p.vbar(
    x=dodge("age_group", -0.2, range=p.x_range),
    top="Yes",
    width=0.4,
    source=source,
    color="#d0bcfe",
    legend_label="Yes",
)

p.vbar(
    x=dodge("age_group", 0.2, range=p.x_range),
    top="No",
    width=0.4,
    source=source,
    color="#6a7fd2",
    legend_label="No",
)


labels_yes = LabelSet(
    x=dodge("age_group", -0.2, range=p.x_range),
    y="y_yes",
    text="percent_yes",
    level="glyph",
    source=source,
    text_align="center",
    text_baseline="bottom",
    text_font_size="8pt",
    text_color="#333",
)

labels_no = LabelSet(
    x=dodge("age_group", 0.2, range=p.x_range),
    y="y_no",
    text="percent_no",
    level="glyph",
    source=source,
    text_align="center",
    text_baseline="bottom",
    text_font_size="8pt",
    text_color="#333",
)

p.add_layout(labels_yes)
p.add_layout(labels_no)

p.title.text_font_size = "16pt"
p.title.align = "center"
p.xaxis.axis_label = "Age Range"
p.yaxis.axis_label = "Number of Respondents"
p.xaxis.major_label_orientation = 1
p.outline_line_color = None
p.grid.grid_line_alpha = 0.25
p.legend.title = "Treatment"
p.legend.location = "top_right"
p.legend.label_text_font_size = "8pt"

show(p)

### Question 3: Does family history influence treatment?

**Variables used:** `family_history`, `treatment`

**Goal:**  
To check if people with a family history of mental illness are more likely to seek treatment.

**Visualization:**  
Grouped bar chart (horizontal or vertical).

**Insight:**  
Shows the relationship between inherited mental health factors and actual help-seeking behavior.
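
The per-group percentages behind a chart like this can also be produced in one step with `pd.crosstab`. A minimal sketch on made-up rows (not survey data):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "family_history": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
        "treatment": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes"],
    }
)

# Row-normalized cross-tabulation: each row sums to 100%.
rates = pd.crosstab(toy["family_history"], toy["treatment"], normalize="index") * 100
print(rates.round(1))
```

Note that the row labels come back in sorted order ("No" before "Yes"), which also determines the order of categories on a categorical axis built from the index.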


In [847]:
grouped = (
    cleaned_df.groupby(["family_history", "treatment"]).size().unstack(fill_value=0)
)
grouped_percent = grouped.div(grouped.sum(axis=1), axis=0) * 100
categories = list(grouped_percent.index)  # sorted alphabetically: ["No", "Yes"]
treatments = list(grouped_percent.columns)  # sorted alphabetically: ["No", "Yes"]

source = ColumnDataSource(
    data=dict(
        y=categories,
        yes=[grouped_percent.loc[i, "Yes"] for i in categories],
        no=[grouped_percent.loc[i, "No"] for i in categories],
    )
)

p = figure(
    y_range=categories,
    height=500,
    width=800,
    title="Family history and treatment",
    toolbar_location=None,
)

p.hbar(
    y=dodge("y", -0.15, range=p.y_range),
    right="yes",
    height=0.25,
    color="#1f77b4",
    legend_label="Yes",
    source=source,
)
p.hbar(
    y=dodge("y", 0.15, range=p.y_range),
    right="no",
    height=0.25,
    color="#ff7f0e",
    legend_label="No",
    source=source,
)
p.x_range.start = 0

p.xaxis.axis_label = "Percentage (%)"
p.yaxis.axis_label = "Family history"
p.xaxis.major_label_orientation = 0.5
p.legend.title = "Treatment"
p.legend.location = "top_right"
p.legend.label_text_font_size = "8pt"
p.title.text_font_size = "16pt"
p.title.align = "center"

show(p)

### Question 4: Does remote vs. office work relate to treatment?

**Variables used:** `remote_work`, `treatment`

**Goal:**  
To explore whether remote employees are more or less likely to seek treatment compared to office workers.

**Visualization:**  
Grouped bar chart or pie chart by remote_work status.

**Insight:**  
May show isolation or work environment influence on mental health outcomes.
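
Relabeling `Yes`/`No` as work types is safest when the result goes into a separate column, because mapping a column onto itself cannot be re-run: the new labels are no longer keys of the mapping. A short sketch with made-up rows; the column name `work_type` is a hypothetical choice:

```python
import pandas as pd

toy = pd.DataFrame({"remote_work": ["Yes", "No", "Yes"]})
relabel = {"Yes": "Remote", "No": "On-site"}

# Writing to a new column keeps the raw values intact, so the step can be re-run.
toy["work_type"] = toy["remote_work"].map(relabel)
toy["work_type"] = toy["remote_work"].map(relabel)  # second run gives the same result

# Applying the mapping twice to the same values loses everything.
twice = toy["remote_work"].map(relabel).map(relabel)
print(toy["work_type"].tolist(), twice.isna().all())
```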


In [848]:
cleaned_df["remote_work"] = cleaned_df["remote_work"].map(
    {"Yes": "Remote", "No": "On-site"}
)

grouped = (
    cleaned_df.groupby(["remote_work", "treatment"], observed=True)
    .size()
    .unstack(fill_value=0)
)

grouped["total"] = grouped["Yes"] + grouped["No"]
grouped["Yes_pct"] = (grouped["Yes"] / grouped["total"] * 100).round(1)
grouped["No_pct"] = (grouped["No"] / grouped["total"] * 100).round(1)

grouped["Yes_pct_str"] = grouped["Yes_pct"].astype(str) + "%"
grouped["No_pct_str"] = grouped["No_pct"].astype(str) + "%"

grouped["Yes_y"] = grouped["Yes"] / 2
grouped["No_y"] = grouped["Yes"] + (grouped["No"] / 2)

source = ColumnDataSource(
    data=dict(
        work_type=grouped.index.tolist(),
        Yes=grouped["Yes"].tolist(),
        No=grouped["No"].tolist(),
        Yes_y=grouped["Yes_y"].tolist(),
        No_y=grouped["No_y"].tolist(),
        Yes_pct_str=grouped["Yes_pct_str"].tolist(),
        No_pct_str=grouped["No_pct_str"].tolist(),
    )
)

categories = ["Yes", "No"]
colors = ["#5cdac5", "#256676"]

p2 = figure(
    x_range=grouped.index.tolist(),
    height=500,
    width=800,
    title="Remote VS On-site",
    background_fill_color="#fafafa",
)

p2.vbar_stack(
    stackers=categories,
    x="work_type",
    width=0.7,
    color=colors,
    source=source,
    legend_label=categories,
)

yes_labels = LabelSet(
    x="work_type",
    y="Yes_y",
    text="Yes_pct_str",
    source=source,
    text_align="center",
    text_baseline="middle",
    text_font_size="12pt",
    text_color="black",
)

no_labels = LabelSet(
    x="work_type",
    y="No_y",
    text="No_pct_str",
    source=source,
    text_align="center",
    text_baseline="middle",
    text_font_size="12pt",
    text_color="white",
)

p2.add_layout(yes_labels)
p2.add_layout(no_labels)

p2.title.text_font_size = "16pt"
p2.title.align = "center"
p2.xaxis.axis_label = "Work Type"
p2.yaxis.axis_label = "Number of Respondents"
p2.xaxis.major_label_orientation = 1
p2.outline_line_color = None
p2.grid.grid_line_alpha = 0.25
p2.legend.title = "Treatment"
p2.legend.location = "top_right"

show(p2)

### Question 5: Does company size influence mental health?

**Variables used:** `company_size`, `treatment`

**Goal:**  
To analyze whether employees in small, medium, or large companies are more likely to seek mental health treatment.

**Visualization:**  
Stacked or grouped bar chart.

**Insight:**  
Can reveal how organizational scale affects support and pressure.
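
Size labels sort alphabetically by default ("Large", "Medium", "Small"), which is rarely the order a reader expects on the axis. One possible approach, sketched on made-up rows, is to declare the column as an ordered categorical before grouping:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "company_size": ["Large", "Small", "Medium", "Small", "Large"],
        "treatment": ["Yes", "No", "Yes", "Yes", "No"],
    }
)

# An ordered categorical makes groupby follow size order instead of alphabetical order.
size_order = ["Small", "Medium", "Large"]
toy["company_size"] = pd.Categorical(
    toy["company_size"], categories=size_order, ordered=True
)

counts = (
    toy.groupby(["company_size", "treatment"], observed=True)
    .size()
    .unstack(fill_value=0)
)
print(counts.index.tolist())
```

An `x_range` built from `counts.index.tolist()` then places the bars from smallest to largest company.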


In [849]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, LabelSet

grouped = (
    cleaned_df.groupby(["company_size", "treatment"], observed=True)
    .size()
    .unstack(fill_value=0)
)

grouped["total"] = grouped["Yes"] + grouped["No"]
grouped["Yes_pct"] = (grouped["Yes"] / grouped["total"] * 100).round(1)
grouped["No_pct"] = (grouped["No"] / grouped["total"] * 100).round(1)

grouped["Yes_pct_str"] = grouped["Yes_pct"].astype(str) + "%"
grouped["No_pct_str"] = grouped["No_pct"].astype(str) + "%"

grouped["Yes_y"] = grouped["Yes"] / 2
grouped["No_y"] = grouped["Yes"] + (grouped["No"] / 2)

source = ColumnDataSource(
    data=dict(
        company_size=grouped.index.tolist(),
        Yes=grouped["Yes"].tolist(),
        No=grouped["No"].tolist(),
        Yes_y=grouped["Yes_y"].tolist(),
        No_y=grouped["No_y"].tolist(),
        Yes_pct_str=grouped["Yes_pct_str"].tolist(),
        No_pct_str=grouped["No_pct_str"].tolist(),
    )
)

categories = ["Yes", "No"]
colors = ["#718dbf", "#e84d60"]

p2 = figure(
    x_range=grouped.index.tolist(),
    height=500,
    width=800,
    title="Treatment Seeking by Company Size",
    background_fill_color="#fafafa",
)

p2.vbar_stack(
    stackers=categories,
    x="company_size",
    width=0.5,
    color=colors,
    source=source,
    legend_label=categories,
)

yes_labels = LabelSet(
    x="company_size",
    y="Yes_y",
    text="Yes_pct_str",
    source=source,
    text_align="center",
    text_baseline="middle",
    text_font_size="11pt",
    text_color="black",
)

no_labels = LabelSet(
    x="company_size",
    y="No_y",
    text="No_pct_str",
    source=source,
    text_align="center",
    text_baseline="middle",
    text_font_size="11pt",
    text_color="white",
)

p2.add_layout(yes_labels)
p2.add_layout(no_labels)

p2.title.text_font_size = "16pt"
p2.title.align = "center"
p2.xaxis.axis_label = "Company Size"
p2.yaxis.axis_label = "Number of Respondents"
p2.xaxis.major_label_orientation = 1
p2.outline_line_color = None
p2.grid.grid_line_alpha = 0.25
p2.legend.title = "Treatment"
p2.legend.location = "top_left"

show(p2)

### Question 6: Are employees afraid to talk to employers about mental health?

**Variables used:** `mental_health_consequence`

**Goal:**  
To determine how comfortable employees feel discussing mental health with their employers.

**Visualization:**  
Pie chart or grouped bar chart (e.g., Yes/No/Maybe).

**Insight:**  
Highlights stigma and fear of consequences in the workplace.
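
The geometry behind a pie chart is simple arithmetic: each share is converted to a fraction of the full circle, and labels are placed at the mid-angle of each wedge. A sketch with made-up percentages; the radius and centre values are illustrative assumptions:

```python
import numpy as np

shares = np.array([50.0, 30.0, 20.0])  # made-up percentages that sum to 100

# Each wedge spans its share of the full circle (2*pi radians).
angle = shares / shares.sum() * 2 * np.pi
end_angle = np.cumsum(angle)
start_angle = end_angle - angle

# Mid-angle of each wedge, used to place a label at radius r around the centre (cx, cy).
theta = start_angle + angle / 2
r, cx, cy = 0.25, 0.0, 1.0
label_x = cx + r * np.cos(theta)
label_y = cy + r * np.sin(theta)
print(np.degrees(start_angle), np.degrees(end_angle))
```

Each wedge starts where the previous one ends, and the last wedge closes the circle at 2π.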


In [850]:
counts = cleaned_df["mental_health_consequence"].value_counts()
percent = counts / counts.sum() * 100

data = pd.DataFrame({"consequence": percent.index, "value": percent.values})

data["angle"] = data["value"] / data["value"].sum() * 2 * np.pi
data["start_angle"] = np.cumsum([0] + list(data["angle"][:-1]))
data["end_angle"] = np.cumsum(data["angle"])

color_map = {"Yes": "#e63946", "No": "#a8dadc", "Maybe": "#adb5bd"}
data["color"] = data["consequence"].map(color_map)

angles = np.cumsum([0] + list(data["angle"][:-1]))
data["theta"] = angles + data["angle"] / 2
data["x"] = 0.25 * np.cos(data["theta"])
data["y"] = 1 + 0.25 * np.sin(data["theta"])

data["label"] = data.apply(lambda row: f"{row['value']:.1f}%", axis=1)
source = ColumnDataSource(data)

p = figure(
    width=800,
    height=400,
    title="Are employees afraid to talk to their employer about problems?",
    toolbar_location=None,
    tools="",
    x_range=(-1, 1),
)

p.wedge(
    x=0,
    y=1,
    radius=0.4,
    start_angle="start_angle",
    end_angle="end_angle",
    line_color="white",
    fill_color="color",
    legend_field="consequence",
    source=source,
)

labels = LabelSet(
    x="x",
    y="y",
    text="label",
    source=source,
    text_font_size="10pt",
    text_align="center",
)

p.add_layout(labels)
p.axis.visible = False
p.grid.visible = False
p.legend.location = "top_right"
p.legend.label_text_font_size = "9pt"
p.title.text_font_size = "16pt"
p.title.align = "center"
show(p)

### Question 7: How do companies support their employees?

**Variables used:** `wellness_program`, `seek_help`, `care_options`

**Goal:**  
To explore the availability and awareness of company-provided mental health support programs.

**Visualization:**  
Grouped horizontal bar chart.

**Insight:**  
Reveals gaps in employee awareness and actual support availability.
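
Comparing several survey columns on one axis only works if they share the same response labels. The sketch below, on made-up rows, assumes that one column records uncertainty as "Not sure" while another uses "Don't know", and folds the two together before computing percentage shares:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "seek_help": ["Yes", "No", "Don't know", "No"],
        "care_options": ["Not sure", "Yes", "No", "Not sure"],
    }
)
responses = ["Yes", "No", "Don't know"]

# Fold "Not sure" into "Don't know", then compute percentage shares per column.
shares = pd.DataFrame(
    {
        col: toy[col]
        .replace({"Not sure": "Don't know"})
        .value_counts(normalize=True)
        .reindex(responses, fill_value=0)
        * 100
        for col in toy.columns
    }
)
print(shares)
```

A quick sanity check is that every column of `shares` sums to 100; a lower total means some response label was not accounted for.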


In [851]:
cols = ["wellness_program", "seek_help", "care_options"]
valid_responses = ["Yes", "No", "Don't know"]
data = {}
for col in cols:
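    # Assumption: care_options may record uncertainty as "Not sure"; fold it into "Don't know"
    # (a no-op for columns that do not contain that label)
    counts = (
        cleaned_df[col].replace({"Not sure": "Don't know"}).value_counts(normalize=True)
        * 100
    )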
    for response in valid_responses:
        key = f"{col}_{response}"
        data[key] = counts.get(response, 0)

source = ColumnDataSource(
    data=dict(
        responses=valid_responses,
        wellness=[data[f"wellness_program_{r}"] for r in valid_responses],
        seek=[data[f"seek_help_{r}"] for r in valid_responses],
        care=[data[f"care_options_{r}"] for r in valid_responses],
    )
)
p = figure(
    y_range=valid_responses,
    height=450,
    width=800,
    title="Company-Provided Mental Health Support",
    toolbar_location=None,
)
p.hbar(
    y=dodge("responses", -0.25, range=p.y_range),
    right="wellness",
    height=0.2,
    source=source,
    color="#718dbf",
    legend_label="Wellness Program",
)
p.hbar(
    y=dodge("responses", 0.0, range=p.y_range),
    right="seek",
    height=0.2,
    source=source,
    color="#5cdac5",
    legend_label="Seek Help Resources",
)
p.hbar(
    y=dodge("responses", 0.25, range=p.y_range),
    right="care",
    height=0.2,
    source=source,
    color="#e84d60",
    legend_label="Care Options",
)
p.x_range.start = 0
p.ygrid.grid_line_color = None
p.xaxis.axis_label = "Percentage (%)"
p.yaxis.axis_label = "Availability"
p.legend.location = "top_right"
p.legend.label_text_font_size = "9pt"
p.title.text_font_size = "16pt"
p.title.align = "center"
show(p)

## Final Conclusions

Based on the analysis of mental health survey data in the tech industry, we can draw the following conclusions:

### General Observations:

- Mental health concerns are **very common** in tech. Nearly **50% of respondents have received treatment**.
- There is a **strong association between family history** of mental illness and an individual's likelihood of seeking treatment.
- Many employees are **afraid to discuss mental health issues** with their employers, with about **one-third expecting negative consequences**.
- **Self-employed individuals** appear **less likely to seek treatment**, possibly due to a lack of organizational support.
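
The self-employment observation is not charted in the sections above. A check along the following lines could be run on `cleaned_df` to confirm or refute it. This is a sketch on made-up rows, so the numbers it prints say nothing about the survey:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "self_employed": ["Yes", "No", "No", "Yes", "No"],
        "treatment": ["No", "Yes", "No", "Yes", "Yes"],
    }
)

# Treatment rate (%) per group: the mean of a boolean column is the share of True values.
rate = toy["treatment"].eq("Yes").groupby(toy["self_employed"]).mean() * 100
print(rate.round(1))
```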

### Company Support:

- Support from companies is **not consistently available or clearly communicated**.
- A significant percentage of employees **do not know** whether support programs exist.
- This suggests a need for **improved awareness** and communication about available mental health resources.

---

## Final Thoughts

The data shows that **mental health is a critical issue** in the tech industry.  
There is a clear need for:

- **Open company culture**
- **Better access to mental health resources**
- **Reduction of stigma** around discussing mental health

Companies that prioritize mental well-being can build **healthier, more productive teams**.
