# Capstone Project: Mental Health in the Tech Industry

## Problem Statement
Mental health is a critical issue in the tech industry, where stress, high workloads, and remote work conditions can significantly impact employees.
This study examines the state of mental health among tech professionals, focusing on how various factors (remote work, employer support, fear of consequences) influence employees' well-being.

**Key Questions for Analysis:**
1. How common are mental health disorders among tech workers?
2. Does a family history of mental illness increase the likelihood of seeking treatment?
3. Does remote work (remote_work) have a positive or negative impact on mental health?
4. How do employers support employees? Is employer-provided support (benefits, seek_help, anonymity) linked to seeking treatment?
5. Are employees afraid of negative consequences for discussing mental health at work (mental_health_consequence)?


## Import data
Download the dataset from Kaggle using the `kagglehub` package for Python.

In [None]:
import kagglehub
# Download data
path = kagglehub.dataset_download("osmi/mental-health-in-tech-survey")
# Check data path
print("Path to dataset files:", path)
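The next step assumes the download directory contains a file named `survey.csv`. A minimal sketch for checking which CSV files a directory actually holds is shown below. The helper `find_csv_files` and the temporary directory are hypothetical stand-ins for illustration; in the notebook, the directory would be `path`.

```python
import os
import tempfile


def find_csv_files(directory):
    """Return the sorted names of CSV files located directly inside `directory`."""
    return sorted(
        name for name in os.listdir(directory) if name.lower().endswith(".csv")
    )


# Temporary directory standing in for the kagglehub download path (assumption)
with tempfile.TemporaryDirectory() as demo_path:
    # Create two empty placeholder files
    open(os.path.join(demo_path, "survey.csv"), "w").close()
    open(os.path.join(demo_path, "readme.txt"), "w").close()
    csv_files = find_csv_files(demo_path)

print(csv_files)
```

Calling `find_csv_files(path)` before reading would confirm the file name rather than relying on it.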

## Read data
Read the data and assign it to a variable for further manipulation.

In [None]:
import pandas as pd
import os
dataset_file = os.path.join(path, "survey.csv")
df = pd.read_csv(dataset_file)
print(df.head())

## Clean data

**Step 1: Dropping unnecessary columns**

The Timestamp, state, and comments columns are not needed.

**Reasons:**
- Timestamp does not affect the results.
- The state column has many missing values and is not critical for the analysis.
- The comments column is an optional free-text field and is not relevant to the analysis.
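The share of missing values per column can be measured before deciding what to drop. The sketch below uses a small made-up frame (`sample_df`, not the real survey data) to show the idea; in the notebook the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the survey data (assumption, not real responses)
sample_df = pd.DataFrame({
    "Age": [29, 34, 41, 25],
    "state": ["CA", np.nan, np.nan, "NY"],
    "self_employed": ["No", np.nan, "Yes", "No"],
    "comments": [np.nan, np.nan, np.nan, "Thanks"],
})

# Fraction of missing values in each column, largest first
missing_share = sample_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns near the top of this list are the candidates for dropping or imputing.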

In [None]:
cleaned_df = df.drop(['Timestamp', 'comments', 'state'], axis=1)
print(cleaned_df.head())

**Step 2: Processing Key Values**

Some columns need to be normalized and processed.

**Categorical Data**
| Column | Issue | Solution |
|----------|---------|----------|
| "Gender" | Inconsistent values: "m", "F", "Male", "female", etc. | Standardize to "Male", "Female", "Other". |
| "self_employed" | Many missing values. | Fill with "No" and convert to binary. |
| "family_history" | String values. | Convert to binary. |
| "treatment" | String values. | Convert to binary. |
| "tech_company" | Includes "No", while the focus is on the tech industry. | Remove rows where the value is "No". |



In [None]:
# Normalize "Gender" values: strip surrounding whitespace and convert to lower case
cleaned_df["Gender"] = cleaned_df["Gender"].str.strip().str.lower()
# Show all unique values in the "Gender" column
print(cleaned_df["Gender"].unique())

# Function to map known spellings to a standard label
def clean_gender(gender):
    if gender in ["male", "m", "man", "mail", "male-ish", "maile", "mal"]:
        return "Male"
    elif gender in ["female", "f", "woman", "fm", "femake"]:
        return "Female"
    else:
        return "Other"

# Apply function to "Gender" column
cleaned_df["Gender"] = cleaned_df["Gender"].apply(clean_gender)

# Fill NaN values in the "self_employed" column with "No".
cleaned_df["self_employed"] = cleaned_df["self_employed"].fillna("No")

# Convert "Yes"/"No" string values to binary (1/0) in these columns:
cleaned_df['self_employed'] = cleaned_df['self_employed'].replace({"Yes": 1, "No": 0})
cleaned_df['family_history'] = cleaned_df['family_history'].replace({"Yes": 1, "No": 0})
cleaned_df['treatment'] = cleaned_df['treatment'].replace({"Yes": 1, "No": 0})

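# Keep only respondents who work at a tech company.
# .copy() makes the result an independent DataFrame, so later column assignments are safe.
cleaned_df = cleaned_df[cleaned_df["tech_company"] == "Yes"].copy()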
print(cleaned_df)
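Because every unrecognized spelling falls through to "Other", it can be worth listing which raw values end up there, so that missed typos can be added to the lists. The sketch below repeats the notebook's `clean_gender` function to stay self-contained and runs it on a made-up series (`raw`); the sample values are illustrative only.

```python
import pandas as pd


def clean_gender(gender):
    # Same mapping as in the cell above
    if gender in ["male", "m", "man", "mail", "male-ish", "maile", "mal"]:
        return "Male"
    elif gender in ["female", "f", "woman", "fm", "femake"]:
        return "Female"
    else:
        return "Other"


# Made-up raw answers (assumption, not taken from the survey)
raw = pd.Series(["Male", "male ", "F", "femail", "non-binary", "Woman"])

# Normalize case and surrounding whitespace before mapping
normalized = raw.str.strip().str.lower()
labels = normalized.apply(clean_gender)

# Distinct normalized spellings that were labeled "Other"
fallthrough = sorted(normalized[labels == "Other"].unique())
print(fallthrough)
```

Here "femail" is a likely typo that lands in "Other"; reviewing this list is one way to decide whether the spelling lists need extending.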



**Numerical Data**
| Column | Issue | Solution |
|----------|---------|----------|
| "Age" | Values may be invalid (e.g. <18 or >70). | Remove rows with invalid values. |
| "no_employees" | Many different size ranges. | Group them into "Small", "Medium", "Large". |



In [None]:
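# Keep only rows with a valid "Age" (18 to 70 inclusive).
# The filtered result must be assigned back, otherwise it is discarded.
cleaned_df = cleaned_df[(cleaned_df["Age"] >= 18) & (cleaned_df["Age"] <= 70)].copy()

# Group "no_employees" values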
print(cleaned_df["no_employees"].unique())
employee_groups = {
    "1-5": "Small",
    "6-25": "Small",
    "26-100": "Medium",
    "100-500": "Medium",
    "500-1000": "Large",
    "More than 1000": "Large"
}
# Create new column with the result
cleaned_df["company_size"] = cleaned_df["no_employees"].map(employee_groups)
print(cleaned_df.head())
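Two quick checks can confirm that this step behaved as intended: invalid ages were removed, and every `no_employees` value was covered by the mapping (`Series.map` yields NaN for keys missing from the dictionary). The sketch below runs both checks on a made-up frame (`sample`), which is an assumption for illustration and not the survey data.

```python
import pandas as pd

employee_groups = {
    "1-5": "Small",
    "6-25": "Small",
    "26-100": "Medium",
    "100-500": "Medium",
    "500-1000": "Large",
    "More than 1000": "Large",
}

# Made-up rows, including implausible ages (assumption)
sample = pd.DataFrame({
    "Age": [25, -5, 329, 40, 17],
    "no_employees": ["1-5", "26-100", "More than 1000", "500-1000", "6-25"],
})

# between() is inclusive on both ends by default
valid = sample[sample["Age"].between(18, 70)].copy()
valid["company_size"] = valid["no_employees"].map(employee_groups)

# Count categories the mapping did not cover
unmapped = int(valid["company_size"].isna().sum())
print(valid)
print("Unmapped rows:", unmapped)
```

A non-zero `unmapped` count would indicate a size range missing from `employee_groups`.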