<a href="https://colab.research.google.com/github/kaminikumari543/LabMetrix/blob/main/Mental_Health_Tech_Survey/Mental_health_in_tech_survey.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Mental Health In Tech Survey




##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**


The dataset analyzed in this study is derived from a public survey focused on mental health in the tech industry. It captures a wide range of information about individual demographics, work environments, and mental health treatment history. With 1,259 original entries and 27 columns, the dataset includes responses primarily from technology workers across different countries, providing a valuable lens into the intersection between mental well-being and professional life.

Data Structure and Variables
The dataset consists of 26 features and one timestamp. It contains both numerical and categorical data. The only numerical column is age, while all other fields are categorical, including important variables such as gender, country, self_employed, family_history, treatment, work_interfere, benefits, care_options, and leave. Each variable contributes insight into either the respondent’s demographic background, workplace policies, or their experience with mental health issues and support systems.

Demographics: Includes age, gender, country, and state.

Workplace Context: Captures company size (no_employees), presence of mental health benefits, wellness programs, and openness to discussing mental health with coworkers and supervisors.

Mental Health Indicators: Includes whether the respondent has a family history of mental illness, whether they have sought treatment, and whether mental health issues interfere with their work.

The comments column allows for open-ended responses, but it has a high proportion of missing values and is not used in structured analysis.

Data Quality and Preprocessing Needs
Before analysis, the dataset required several cleaning steps:

Gender Standardization: The gender column was an open text field, so it contained over 20 different formats and spellings for gender identities. These were standardized into consistent categories such as “male,” “female,” “trans,” and “non-binary” for effective analysis.

Age Outliers: The age variable had some unrealistic values like 5 and 300 years, likely due to user input errors. These outliers were filtered out by keeping only ages between 18 and 100, which are considered reasonable working-age boundaries.

Missing Values: Certain fields, especially state, self_employed, and work_interfere, had missing values. These were filled with the value "Unknown" to retain entries while marking uncertainty.

Geographic and Industry Focus
The majority of the respondents were from the United States, Canada, and the United Kingdom. The focus on technology professionals is evident from responses indicating employment in tech companies and remote work conditions. This demographic is particularly relevant because the tech industry is known for high workloads, flexible work environments, and increasing concern around employee mental health.

Purpose and Usefulness
The main purpose of this dataset is to examine the relationship between workplace policies and mental health outcomes. It can help identify factors that lead individuals to seek mental health treatment and understand how workplace support systems affect employee well-being. Organizations can use these insights to implement better mental health strategies, improve HR policies, and foster healthier work cultures.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



To analyze mental health survey data from the tech industry in order to identify key factors—such as demographics, workplace environment, and support systems—that influence whether individuals seek mental health treatment, and to provide insights that can help organizations improve mental health support for their employees.











#### **Define Your Business Objective?**


1. Identify workplace factors that impact employees' mental health and treatment-seeking behavior.

2. Improve mental health support systems through data-driven insights.

3. Enhance employee well-being and productivity by shaping effective HR policies.

4. Reduce stigma around mental health discussions in tech organizations.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/survey.csv')
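
The guidelines above ask for exception handling, so the load step could be wrapped as in the sketch below. The helper name `load_survey` and its error message are hypothetical and not part of the original notebook; the path is the one used in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_survey(path):
    """Read the survey CSV, raising a clear error if the file is missing."""
    csv_path = Path(path)
    if not csv_path.exists():
        # Fail early with an actionable message instead of a long traceback
        raise FileNotFoundError(
            f"{csv_path} not found; upload survey.csv to the session first"
        )
    return pd.read_csv(csv_path)


# Usage with the path from the cell above (left commented out so the
# sketch does not depend on the file being present):
# df = load_survey('/content/survey.csv')
```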

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# pd.read_csv already returns a DataFrame, so no conversion is needed
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
df.isnull().sum().plot(kind='bar')
plt.show()
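
Raw null counts are easier to compare as percentages of all rows. The sketch below shows one way to compute that; the `toy` frame is made-up illustration data standing in for `df`, and `missing_pct` is a hypothetical name.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are made up)
toy = pd.DataFrame({
    'state': ['CA', None, 'NY', None],
    'comments': [None, None, None, 'some text'],
    'age': [31, 28, 45, 39],
})

# Share of missing entries per column, as a percentage, largest first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have gaps
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```

On the real data the same two lines can be applied to `df` before deciding which columns to fill and which to drop.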

### What did you know about your dataset?


The dataset is a mental health survey from the tech industry with 1,259 responses. It includes demographic details (age, gender, country), workplace context (company size, remote work, benefits), and mental health factors (family history, treatment, work interference). Most columns are categorical, and the goal is to understand how personal and workplace factors influence mental health treatment decisions.











## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns


In [None]:
# Dataset Describe
df.describe()

### Variables Description




age: Respondent’s age

gender: Gender identity

country/state: Location details

self_employed: Self-employment status

family_history: Family history of mental illness

treatment: Whether respondent sought mental health treatment

work_interfere: Impact of mental health on work

no_employees: Company size

remote_work: Works remotely or not

tech_company: If the company is in tech

benefits: Employer provides mental health benefits

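care_options: Whether the respondent knows the mental health care options the employer provides

wellness_program: Whether the employer has discussed mental health as part of a wellness program

seek_help: Whether the employer provides resources for learning about mental health and seeking help

anonymity: Whether anonymity is protected when using mental health treatment resources

leave: Ease of taking medical leave for a mental health condition

mental_health_consequence / phys_health_consequence: Perceived negative consequences of discussing a mental or physical health issue with the employer

coworkers / supervisor: Willingness to discuss a mental health issue with coworkers or a supervisor

mental_health_interview / phys_health_interview: Willingness to bring up a mental or physical health issue in a job interview

mental_vs_physical: Whether the employer is felt to take mental health as seriously as physical health

obs_consequence: Whether the respondent has heard of or observed negative consequences for coworkers with mental health conditions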













### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Standardize column names
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Clean and standardize the gender column
df['gender'] = df['gender'].str.lower().str.strip()
df['gender'] = df['gender'].replace({
    'm': 'male', 'male-ish': 'male', 'msle': 'male', 'man': 'male',
    'f': 'female', 'woman': 'female', 'femake': 'female',
    'trans-female': 'trans', 'trans woman': 'trans', 'transgender': 'trans',
    'cis female': 'female', 'cis-female/femme': 'female', 'female (cis)': 'female',
    'cis male': 'male', 'cis man': 'male', 'male (cis)': 'male',
    'make': 'male', 'mal': 'male', 'maile': 'male', 'malr': 'male',
    'fluid': 'non-binary', 'genderqueer': 'non-binary', 'non-binary': 'non-binary',
    'androgyne': 'non-binary', 'enby': 'non-binary', 'queer': 'non-binary',
    'neuter': 'non-binary', 'nah': 'non-binary'
})

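# Remove outliers from the age column (keep ages 18-100, inclusive)
# .copy() avoids SettingWithCopyWarning on the assignments below
df = df[df['age'].between(18, 100)].copy()

# Fill missing values in object columns with 'Unknown'
for col in ['state', 'self_employed', 'work_interfere']:
    df[col] = df[col].fillna('Unknown')

# Drop the free-text comments column: it is mostly empty and not analyzed,
# and leaving it in would make dropna() discard every row without a comment
df = df.drop(columns=['comments'], errors='ignore')

# Final clean dataset: drop any rows that still contain missing values
df = df.dropna()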

# Save clean dataset
df.to_csv('clean_survey.csv', index=False)

### What all manipulations have you done and insights you found?

Manipulations Done

1. Standardized column names (lowercase, no spaces)

2. Cleaned gender values (grouped similar/incorrect entries)

3. Removed age outliers (kept only ages 18–100)

4. Filled missing values in categorical columns with 'Unknown'

5. Filtered and grouped data to analyze treatment patterns

Insights Found

1. 74% of respondents with a family history of mental illness sought treatment vs. 35% of those without.

2. 85% of those whose mental health "often" affects work sought treatment.

3. 69% of respondents with employer support (care options) sought treatment vs. 41% of those without.

4. The average age of treated individuals is ~32 years.

5. Women (69%) are more likely to seek treatment than men (45%).
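
Percentages like the ones above come from filtering and grouping on the `treatment` column. The notebook does not show that step, so the following is only a sketch of one way to do it; `treatment_rate` is a hypothetical helper and the `toy` frame is made-up data, so its output does not reproduce the reported figures.

```python
import pandas as pd

# Toy data standing in for the cleaned survey (values are made up)
toy = pd.DataFrame({
    'family_history': ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No'],
    'treatment':      ['Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'No'],
})


def treatment_rate(frame, by):
    """Percentage of respondents in each group of `by` who sought treatment."""
    sought = frame['treatment'].eq('Yes')  # True where treatment was sought
    return sought.groupby(frame[by]).mean().mul(100).round(1)


rates = treatment_rate(toy, 'family_history')
print(rates)
```

The same helper could be called with `'work_interfere'`, `'care_options'`, or `'gender'` on the cleaned `df` to obtain the other group rates.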

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Treatment Vs Family History
import matplotlib.pyplot as plt
import seaborn as sns
sns.countplot(x='treatment', hue='family_history', data=df)
plt.title('Treatment Vs Family History')
plt.show()


##### 1. Why did you pick the specific chart?

This grouped bar chart (count plot) is well suited to comparing two categorical variables, here treatment and family history, and shows their relationship clearly.

##### 2. What is/are the insight(s) found from the chart?

1. Individuals with a family history of mental illness are more likely to seek treatment.

2. Among those without a family history, fewer seek treatment.

3. A considerable number of respondents with no family history still seek treatment, which suggests awareness.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact
Yes, this insight can guide targeted mental health campaigns, focusing more on those with no known family history to raise awareness and encourage early intervention.

Negative Impact
Yes, the low treatment rate among people without family history may indicate stigma or lack of awareness, which could hinder mental wellness initiatives if not addressed properly.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Work Interference Vs Treatment
sns.countplot(x='work_interfere', hue='treatment', data=df)
plt.title('Work Interference Vs Treatment')
plt.show()


##### 1. Why did you pick the specific chart?


This grouped bar chart is ideal for comparing how different levels of work interference relate to mental health treatment status.


##### 2. What is/are the insight(s) found from the chart?

Respondents who say their mental health "Sometimes" interferes with work form the largest group of treatment seekers.

"Often" interference also shows a notable number seeking treatment.

Those who "Never" experience interference are less likely to undergo treatment.

Most "Unknown" respondents (those who left the question blank) have not sought treatment.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact

Yes, companies can see how closely interference between mental health and work is linked to treatment seeking, and offer targeted support to reduce that interference and encourage employees to seek help.

Negative Impact

Yes, respondents who report interference but have not sought treatment (visible in the “Sometimes” and “Often” bars), together with the largely untreated “Unknown” group, may indicate underreported mental stress and lack of support, which could lead to reduced employee productivity and morale if unaddressed.



#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Age distribution of respondents who sought treatment
sns.histplot(df[df['treatment'] == 'Yes']['age'], bins=15, kde=False, color='skyblue', edgecolor='black')
plt.title('Age Distribution of Respondents Who Sought Treatment')
plt.xlabel('Age')
plt.show()


##### 1. Why did you pick the specific chart?

This histogram is suitable for showing the age distribution of people who sought treatment, helping identify age groups with higher mental health support needs.

##### 2. What is/are the insight(s) found from the chart?

Most individuals who sought treatment are aged between 25 and 35 years.

The peak treatment-seeking age is around 33–35.

Very few individuals above 45 or below 22 sought treatment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact
Yes, organizations can focus mental health programs on the 25–35 age group, where the need is highest, ensuring better resource allocation and employee support.

Negative Impact

Yes, the low number of treatment seekers among younger and older employees may partly reflect how few respondents fall in those age ranges, but it could also point to neglect, stigma, or unawareness, which could impact long-term employee wellbeing and retention if not addressed.
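
A histogram of treated respondents shows counts, not rates. To compare age groups fairly, the share seeking treatment can be computed per age band alongside the number of respondents in that band. The sketch below assumes band edges of 18, 25, 35, 45, and 100, which are illustrative choices rather than anything defined in the notebook, and uses made-up toy data.

```python
import pandas as pd

# Toy data standing in for the cleaned survey (values are made up)
toy = pd.DataFrame({
    'age':       [21, 24, 27, 29, 33, 34, 41, 52],
    'treatment': ['No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'],
})

# Right-open age bands: [18, 25), [25, 35), [35, 45), [45, 100)
bands = pd.cut(toy['age'], bins=[18, 25, 35, 45, 100], right=False)

# Respondent count and share who sought treatment, per band
by_band = (
    toy.assign(sought=toy['treatment'].eq('Yes'))
       .groupby(bands, observed=False)['sought']
       .agg(respondents='count', treatment_rate='mean')
)
print(by_band)
```

Reading `treatment_rate` next to `respondents` makes it clear when a band looks extreme only because it contains very few people.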

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Gender Vs Treatment (filtered for the two largest gender groups)
sns.countplot(x='gender', hue='treatment', data=df[df['gender'].isin(['male', 'female'])])
plt.title('Gender Vs Treatment')
plt.show()



##### 1. Why did you pick the specific chart?

This grouped bar chart effectively shows the relationship between gender and mental health treatment, helping compare treatment rates across genders.


##### 2. What is/are the insight(s) found from the chart?

In absolute numbers, more males than females sought treatment.

However, a large number of males also did not seek treatment, indicating a possible hesitation.

Females show fewer numbers overall, but a relatively higher proportion sought treatment compared to males.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact

Yes, this helps design gender-sensitive mental health programs, addressing treatment gaps and encouraging openness in male-dominated environments.

Negative Impact

Yes, the high number of untreated males may reflect stigma or lack of awareness, leading to unaddressed mental health issues, which can reduce productivity and workplace morale.
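
Because the groups differ in size, counts and proportions can tell different stories. A normalized cross-tabulation puts both views side by side. This is a sketch on made-up toy data, not the notebook's actual figures.

```python
import pandas as pd

# Toy data standing in for the cleaned survey (values are made up)
toy = pd.DataFrame({
    'gender':    ['male'] * 6 + ['female'] * 3,
    'treatment': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No'],
})

# Absolute counts: what the count plot displays
counts = pd.crosstab(toy['gender'], toy['treatment'])

# Row-wise shares: each gender's responses sum to 1
shares = pd.crosstab(toy['gender'], toy['treatment'], normalize='index')

print(counts)
print(shares.round(2))
```

In this toy example more males than females sought treatment by count, while the female share is higher, which mirrors the pattern described above.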

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# How does the frequency of mental health illness and attitudes towards mental health vary by geographic location?
# 1. Treatment rates by country
treatment_by_country = df.groupby('country')['treatment'].value_counts(normalize=True).unstack()
treatment_by_country.plot(kind='bar', stacked=True, figsize=(10, 4))
plt.title('Treatment Rates By Country')
plt.ylabel('Share of respondents')
plt.show()



##### 1. Why did you pick the specific chart?




This stacked bar chart is ideal to compare the proportion of treatment seekers vs. non-seekers across different countries, giving a clear visual of international patterns.


##### 2. What is/are the insight(s) found from the chart?




Countries like the United States, Germany, and Canada show a higher proportion of treatment seekers.

Australia, Russia, and Singapore have lower treatment rates despite participation.

Some countries show nearly 100% non-treatment. This may suggest stigma or lack of access, but countries with only a handful of respondents can reach such extremes by chance.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact

Yes, it helps global companies customize mental health support based on regional openness and access to treatment, improving well-being across geographies.

Negative Impact

Yes, countries with very low treatment rates may indicate lack of awareness, support systems, or cultural stigma, leading to unaddressed issues and lower workforce performance in those regions.
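
Country-level rates are only meaningful when a country has enough respondents. One option is to restrict the comparison to countries above a minimum sample size, as sketched below. The threshold `MIN_RESPONDENTS` is a hypothetical value chosen for the toy data, and the data itself is made up.

```python
import pandas as pd

# Toy data standing in for the cleaned survey (values are made up)
toy = pd.DataFrame({
    'country': ['United States'] * 5 + ['Canada'] * 4 + ['Russia'],
    'treatment': ['Yes', 'Yes', 'Yes', 'No', 'No',
                  'Yes', 'No', 'Yes', 'No',
                  'No'],
})

MIN_RESPONDENTS = 3  # hypothetical cut-off; tune for the real data

# Countries with at least MIN_RESPONDENTS answers
country_sizes = toy['country'].value_counts()
large_enough = country_sizes[country_sizes >= MIN_RESPONDENTS].index

# Share of Yes/No per remaining country
treatment_by_country = (
    toy[toy['country'].isin(large_enough)]
    .groupby('country')['treatment']
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(treatment_by_country)
```

The resulting table can be passed to the same stacked bar plot used in Chart 5.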

#### Chart - 6

In [None]:
# Chart - 6 visualization code
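# 2. Family history by country
# Encode Yes/No as 1/0 so the line shows the share of respondents
# in each country who report a family history of mental illness
family_history_flag = (df['family_history'] == 'Yes').astype(int)
sns.lineplot(x=df['country'], y=family_history_flag)
plt.xticks(rotation=90)
plt.title('Family History By Country')
plt.show()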

##### 1. Why did you pick the specific chart?


This line chart helps to visualize variation in family history of mental illness across countries, making it easy to spot spikes or dips in the data trend.

##### 2. What is/are the insight(s) found from the chart?



Countries like Switzerland and Germany show a high presence of reported family history.

Several countries show no or very low reporting, indicating either true absence or underreporting.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact

Yes, businesses can use this to focus educational efforts and support programs in countries with low awareness or reporting, improving mental health disclosure and early action.

Negative Impact

Yes, low or inconsistent reporting in some countries may reflect cultural stigma or lack of mental health awareness, which can delay treatment and negatively affect employee productivity and morale.


#### Chart - 7

In [None]:
# Chart - 7 visualization code
# 3. Mental health care availability
sns.boxplot(x='country', y='care_options', data=df)
plt.title('Mental health care availability by country')
plt.show()

##### 1. Why did you pick the specific chart?


A box plot is ideal here to compare the distribution of mental health care availability responses ("Yes", "No", "Not sure") across countries and identify variations or outliers.


##### 2. What is/are the insight(s) found from the chart?


A large number of respondents across countries are "Not sure" about mental health care availability.

Few countries show a consistent "Yes", indicating confirmed access.

"No" responses are also common, suggesting limited mental health infrastructure in many regions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact

Yes, companies can use this insight to improve awareness and ensure clear communication about available mental health support in the workplace, especially in countries with uncertainty.

Negative Impact

Yes, the high number of "Not sure" responses shows a lack of awareness, which may prevent employees from seeking help, leading to lower well-being and reduced productivity.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# 2. What are the strongest predictors of mental health illness or certain attitudes towards mental health in the workplace?

# STEP 1: Select relevant features and target
features = [
    'age', 'gender', 'self_employed', 'family_history', 'work_interfere',
    'no_employees', 'remote_work', 'tech_company', 'benefits',
    'care_options', 'wellness_program', 'seek_help', 'anonymity',
    'leave', 'mental_health_consequence', 'phys_health_consequence',
    'coworkers', 'supervisor', 'mental_health_interview',
    'phys_health_interview', 'mental_vs_physical', 'obs_consequence'
]

target = 'treatment'

# STEP 2: Create a copy and encode all categorical variables
data = df[features + [target]].copy()
le = LabelEncoder()

for col in data.columns:
    if data[col].dtype == 'object':
        data[col] = le.fit_transform(data[col])

# STEP 3: Split the data
X = data[features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# STEP 4: Train the Random Forest Model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# STEP 5: Get feature importances
importances = pd.Series(model.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(10)

# STEP 6: Plot the top predictors
plt.figure(figsize=(10, 8))
sns.barplot(x=top_features.values, y=top_features.index, palette="coolwarm")
plt.title('Top Predictors of Mental Health Treatment')
plt.xlabel('Feature Importance Score')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?


This horizontal bar chart clearly ranks the top predictors of mental health treatment by their feature importance score, making it ideal for identifying key influencing factors.


##### 2. What is/are the insight(s) found from the chart?

Work interference is the strongest predictor of seeking mental health treatment.

Age, number of employees, and family history also rank among the most important features. Impurity-based importances tend to favor features with many distinct values, such as age, so this ranking should be read with some caution.

Organizational factors like benefits, leave, and care options contribute as well.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact

Yes, companies can focus on reducing work-related stress, improving benefits, and increasing mental health support, leading to a more engaged and healthier workforce.

Negative Impact

Yes, ignoring key predictors like work interference or lack of leave options can lead to untreated mental health issues, resulting in burnout, absenteeism, and decreased employee retention.
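
Feature importances are easier to trust when the model is also scored on the held-out split, which the cell above creates but does not use. The sketch below shows that check on made-up toy data with two numeric features; the rule linking `treatment` to `work_interfere` is invented purely for illustration, and `stratify=y` is an added assumption rather than a setting from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the encoded survey (values are made up)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    'work_interfere': rng.integers(0, 4, size=n),
    'age': rng.integers(18, 60, size=n),
})
# Invented rule for illustration: treatment depends only on work_interfere
toy['treatment'] = (toy['work_interfere'] >= 2).astype(int)

X = toy[['work_interfere', 'age']]
y = toy['treatment']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score on rows the model has not seen during training
test_accuracy = accuracy_score(y_test, model.predict(X_test))
importances = pd.Series(model.feature_importances_, index=X.columns)

print(f"Held-out accuracy: {test_accuracy:.2f}")
print(importances.sort_values(ascending=False))
```

On the real data, the same `accuracy_score` call can be applied to the `model`, `X_test`, and `y_test` objects already defined in the Chart 9 cell.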


#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**


In summary, this dataset offers a comprehensive snapshot of mental health awareness and support in the tech industry. With diverse variables spanning demographics, employment, and treatment history, it provides rich opportunities for analysis that can inform mental health policies and promote well-being in modern workplaces.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***