# **Project Name**    -
Mental Health & Workplace Dynamics



##### **Project Type**    - EDA
##### **Contribution**    - Individual- Nikhar Roy Chaudhuri

# **Project Summary -**

This project analyzes a workplace mental health survey to understand factors influencing employees’ decisions to seek treatment. The dataset includes demographics (age, gender, country), work-related variables (company size, remote work, tech status), and mental health factors (treatment history, benefits, stigma).

Key insights show that most respondents are aged 25–35, with a majority being male. Employees with mental health benefits and supportive work environments are more likely to seek treatment. Work interference strongly correlates with higher treatment-seeking behavior.

Visualizations such as bar charts, box plots, sunburst diagrams, and treemaps clearly reveal patterns across different groups. Correlation heatmaps highlight moderate links between work interference, treatment, and company size.

The results emphasize the need for accessible mental health benefits, reducing workplace stigma, and training managers to support employees.

Overall, this analysis helps organizations identify improvement areas to promote mental well-being, create a supportive culture, and boost employee satisfaction and productivity.





# **GitHub Link -**

https://github.com/nikharroy/mental-health-analysis

# **Problem Statement**


Despite growing awareness, mental health issues among employees often go unaddressed in many organizations. Many employees avoid seeking treatment due to lack of workplace support, stigma, or insufficient mental health resources.

This project aims to analyze survey data to identify key demographic and workplace factors that influence employees' decisions to seek mental health treatment. By uncovering these patterns, we can help organizations design targeted policies and interventions to promote mental well-being, reduce stigma, and improve overall employee productivity and satisfaction.

#### **Define Your Business Objective?**

The primary business objective of this project is to identify the key factors that encourage or discourage employees from seeking mental health treatment. By analyzing patterns in demographics, workplace support, and treatment-seeking behavior, the goal is to provide actionable insights that help organizations:

*   Design effective mental health programs and benefits.
*   Reduce workplace stigma around mental health.
*   Improve employee well-being and retention.
*   Enhance overall productivity and organizational culture.

Ultimately, these insights will support data-driven decision-making to build healthier, more supportive, and more engaged workforces.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

In [None]:
# pandas is used for data manipulation and analysis
import pandas as pd

# numpy is used for numerical operations (we may need it later)
import numpy as np

# matplotlib and seaborn are used for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# set seaborn style for better looking plots
sns.set(style="whitegrid")


### Dataset Loading

In [None]:
# Load Dataset

In [None]:
# Read the CSV file into a pandas DataFrame
# Make sure the path matches your file location
df = pd.read_csv("survey.csv")
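
The guidelines ask for exception handling so that the notebook runs end to end. The sketch below shows one way to guard the load step. `load_survey` and its `required` column tuple are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_survey(path="survey.csv", required=("Age", "Gender", "treatment")):
    """Read the survey CSV and fail early with a readable message.

    `required` is an illustrative subset of columns; adjust it to your file.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Survey file not found: {csv_path.resolve()}")

    data = pd.read_csv(csv_path)

    # Report every missing column at once instead of failing on first use
    missing = [col for col in required if col not in data.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return data
```

In the notebook this could replace the bare call, for example `df = load_survey("survey.csv")`.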

### Dataset First View

In [None]:
# Dataset First Look

In [None]:
# Display the first 5 rows of the dataset
# This helps us understand what columns we have and get a feel for the data
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
# Get the number of rows and columns using shape
print("The dataset has", df.shape[0], "rows and", df.shape[1], "columns.")

# Alternatively, you can display as a DataFrame if you want it prettier
pd.DataFrame({
    'Rows': [df.shape[0]],
    'Columns': [df.shape[1]]
})

### Dataset Information

In [None]:
# Dataset Info

In [None]:
# Get general info about the DataFrame
# This shows number of entries, column names, non-null counts, and data types
df.info()


In [None]:
# Get a quick statistical summary for numerical columns
df.describe()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
# Check how many duplicate rows are in the dataset
duplicate_count = df.duplicated().sum()

print("Number of duplicate rows:", duplicate_count)

# If you'd like to see those rows, you can use:
# df[df.duplicated()]

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
# Check how many missing (null) values are in each column
missing_values = df.isnull().sum()

# Display the count of missing values
print("Missing values in each column:")
print(missing_values)

# Alternatively, show as a DataFrame (nicer format)
missing_df = pd.DataFrame({
    'Column Name': df.columns,
    'Missing Values': df.isnull().sum(),
    'Percentage (%)': (df.isnull().sum() / len(df)) * 100
})

# Sort by most missing
missing_df = missing_df.sort_values(by='Missing Values', ascending=False)
missing_df


In [None]:
# Visualizing the missing values

In [None]:
# Plot a heatmap to visually see missing data locations
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Heatmap of Missing Values")
plt.show()

# Or plot a bar plot of missing values
missing_df_plot = missing_df[missing_df['Missing Values'] > 0]
plt.figure(figsize=(10, 6))
sns.barplot(x='Missing Values', y='Column Name', data=missing_df_plot, palette="crest")
plt.title("Number of Missing Values per Column")
plt.show()

### What did you know about your dataset?

This dataset comes from a survey conducted to understand mental health in the tech industry, focusing on demographics, mental health history, work environment, and attitudes towards mental health at work.

*   The dataset contains 1,259 rows and 27 columns.
*   Important columns include Age, Gender, Country, state, self_employed, family_history, treatment, and workplace support factors like benefits, leave, and anonymity.
*   There are missing values in some columns, especially comments, state, work_interfere, and self_employed, which need to be handled during cleaning.
*   There are no duplicate rows in the dataset, so no deduplication is required.
*   The data has a mix of categorical and numerical features, requiring different types of analysis and visualization.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
# Display all column names in the dataset
print("Columns in the dataset:")
print(df.columns.tolist())

In [None]:
# Dataset Describe

In [None]:
# Get statistical summary for numerical columns
df.describe()


### Variables Description

*   Timestamp: Date and time when the survey was submitted.
*   Age: Respondent's age (note: contains outliers and errors that need cleaning).
*   Gender: Respondent's gender.
*   Country: Country where the respondent lives.
*   state: U.S. state or territory (if applicable).
*   self_employed: Whether the respondent is self-employed.
*   family_history: Whether there is a family history of mental illness.
*   treatment: Whether the respondent has sought treatment for mental health.
*   work_interfere: How often a mental health condition interferes with work.
*   no_employees: Size of the respondent's company.
*   remote_work: Whether they work remotely at least 50% of the time.
*   tech_company: Whether their employer is primarily a tech company.
*   benefits: Whether mental health benefits are provided.
*   care_options: Awareness of mental health care options provided by the employer.
*   wellness_program: Whether the employer has discussed mental health as part of a wellness program.
*   seek_help: Whether resources are provided to learn about mental health and seek help.
*   anonymity: Whether anonymity is protected when using mental health resources.
*   leave: Ease of taking medical leave for mental health conditions.
*   mental_health_consequence: Perceived negative consequences of discussing mental health at work.
*   phys_health_consequence: Perceived negative consequences of discussing physical health at work.
*   coworkers: Willingness to discuss mental health with coworkers.
*   supervisor: Willingness to discuss mental health with supervisors.
*   mental_health_interview: Willingness to discuss mental health during a job interview.
*   phys_health_interview: Willingness to discuss physical health during a job interview.
*   mental_vs_physical: Whether the employer takes mental health as seriously as physical health.
*   obs_consequence: Whether they have observed negative consequences for colleagues with mental health issues.
*   comments: Additional free-text comments.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
# Loop through each column and print the number of unique values
for column in df.columns:
    unique_vals = df[column].nunique()
    print(f"{column}: {unique_vals} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
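# Fix invalid ages; .copy() makes df an independent frame so the
# column assignments below do not trigger SettingWithCopyWarning
df = df[(df['Age'] >= 18) & (df['Age'] <= 100)].copy()

# Standardize Gender: lowercase and trim whitespace first
df['Gender'] = df['Gender'].str.lower().str.strip()

# Entries are already stripped, so trailing-space variants are not needed
df['Gender'] = df['Gender'].replace([
    'female', 'f', 'woman', 'cis female', 'trans female', 'female (trans)', 'femake'
], 'female')

df['Gender'] = df['Gender'].replace([
    'male', 'm', 'man', 'cis male', 'male (cis)', 'msle', 'malr', 'mail'
], 'male')

# Anything that is not 'male' or 'female' (including missing values) becomes 'other'
df['Gender'] = df['Gender'].apply(lambda x: x if x in ['male', 'female'] else 'other')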

#  Drop columns safely
df = df.drop(columns=['comments', 'state'], errors='ignore')

#  Fill missing values
for col in ['self_employed', 'work_interfere']:
    df[col] = df[col].fillna(df[col].mode()[0])

#  Final checks
print("Shape after cleaning:", df.shape)
print("Remaining missing values:\n", df.isnull().sum())

### What all manipulations have you done and insights you found?

*   Removed invalid ages: Filtered out respondents with unrealistic ages (only kept ages between 18 and 100). This removed outliers and improved data quality.
*   Standardized gender entries: Cleaned and grouped the different spellings of gender labels into three categories: male, female, and other. This ensures consistency and makes analysis easier.
*   Dropped columns with high missing values or low relevance: The comments column (sparse free text) and the state column (applicable only to U.S. respondents) were dropped because they were not relevant for this analysis.
*   Handled missing values: Filled missing values in the self_employed and work_interfere columns with their mode (most frequent value), leaving no null values in the dataset.
*   Checked final shape and missing values: After cleaning, the final dataset has 1,251 rows and 25 columns, with no missing values left.

Insights after cleaning:

*   The dataset now has no null values, and the Age and Gender columns contain only valid, consistent entries.
*   With categories standardized and outliers removed, further analysis (such as visualizations or modeling) will be more accurate and reliable.
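
The cleaning steps described above can also be wrapped in a function that returns a cleaned copy and leaves the raw frame untouched, which makes the cell safe to re-run. This is a sketch under assumptions: `normalize_gender`, `clean_survey`, and the toy rows are hypothetical and only mirror the steps in the wrangling cell.

```python
import pandas as pd

# Spellings taken from the replace() lists in the wrangling cell
FEMALE = {"female", "f", "woman", "cis female", "trans female", "female (trans)", "femake"}
MALE = {"male", "m", "man", "cis male", "male (cis)", "msle", "malr", "mail"}


def normalize_gender(value):
    """Map a raw gender entry to 'male', 'female', or 'other'."""
    text = str(value).lower().strip()
    if text in FEMALE:
        return "female"
    if text in MALE:
        return "male"
    return "other"


def clean_survey(raw, min_age=18, max_age=100):
    """Return a cleaned copy; the input frame is left untouched."""
    out = raw[raw["Age"].between(min_age, max_age)].copy()
    out["Gender"] = out["Gender"].map(normalize_gender)
    out = out.drop(columns=["comments", "state"], errors="ignore")
    for col in ("self_employed", "work_interfere"):
        # Skip columns that are absent or entirely empty (no mode to fill with)
        if col in out.columns and out[col].notna().any():
            out[col] = out[col].fillna(out[col].mode()[0])
    return out


# Toy rows for illustration only (not taken from the survey)
toy = pd.DataFrame({
    "Age": [29, -5, 41, 33],
    "Gender": ["M", "Female", " cis male ", "non-binary"],
    "self_employed": ["No", "No", None, "Yes"],
    "work_interfere": ["Often", None, None, "Often"],
    "comments": [None, "n/a", None, None],
})
cleaned = clean_survey(toy)
print(cleaned)
```

Because the function works on a copy, running it twice gives the same result, unlike in-place filtering that depends on the current state of `df`.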


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

In [None]:
#Chart 1: Distribution of Gender


# Countplot to see the distribution of gender among respondents
plt.figure(figsize=(8, 5))
sns.countplot(x='Gender', data=df, palette='Set2')
plt.title('Distribution of Gender among Respondents')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a count plot because it clearly shows the distribution of respondents by gender. It gives an at-a-glance view of the demographic composition of the survey sample, which is important to know before exploring mental health trends by group.



##### 2. What is/are the insight(s) found from the chart?

The chart shows that most respondents identified as male, followed by female, with a small group identifying as other. The survey sample is therefore male-dominated, and later analyses should account for this imbalance when interpreting results.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help create a positive business impact. Knowing that most respondents are male lets organizations tailor mental health programs to reach them while also creating targeted initiatives for underrepresented groups (female and other gender identities).
There are no direct negative-growth insights here, but ignoring minority groups could harm inclusivity and employee satisfaction in the long term, potentially affecting retention and brand reputation.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

In [None]:
#chart 2: Treatment vs Gender


plt.figure(figsize=(8, 6))
sns.countplot(x='Gender', hue='treatment', data=df, palette='Set1')
plt.title('Treatment Seeking Behavior by Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.legend(title='Sought Treatment')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a grouped count plot to see how treatment-seeking behavior varies across gender groups. It places those who sought treatment next to those who did not within each gender, which helps identify potential disparities in accessing mental health support.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that a higher proportion of female respondents sought mental health treatment than male respondents. Among males, a noticeable group did not seek treatment. The "other" category has relatively few respondents but also leans toward seeking treatment. This suggests that female respondents may be more open to addressing mental health issues than male respondents.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help create targeted mental health awareness and support campaigns. Knowing that males are less likely to seek treatment can guide organizations to create safe spaces, reduce stigma, and encourage them to access resources.
Ignoring this insight could negatively impact employee well-being and productivity, especially among male employees who might avoid treatment due to stigma or lack of support.
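
Because the gender groups differ a lot in size, raw counts make proportions hard to compare by eye. One way to check the reading above is a row-normalized cross-tab. This is a minimal sketch: `treatment_share` is a hypothetical helper and the toy rows are illustrative, not survey results.

```python
import pandas as pd


def treatment_share(data, group_col, target_col="treatment"):
    """Percentage of each target answer within every group (rows sum to 100)."""
    table = pd.crosstab(data[group_col], data[target_col], normalize="index")
    return (table * 100).round(1)


# Toy rows for illustration only (not survey results)
toy = pd.DataFrame({
    "Gender": ["male", "male", "male", "male", "female", "female"],
    "treatment": ["Yes", "No", "No", "No", "Yes", "No"],
})
shares = treatment_share(toy, "Gender")
print(shares)
```

In the notebook, `treatment_share(df, "Gender")` would give the share of "Yes" and "No" answers within each gender, which is the quantity the insight refers to.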

#### Chart - 3

In [None]:
# Chart - 3 visualization code

In [None]:
#Chart 3: Family history of mental illness (Pie Chart)
# -----------------------------

# Count the values
family_counts = df['family_history'].value_counts()

# Create a pie chart
plt.figure(figsize=(7, 7))
plt.pie(family_counts, labels=family_counts.index, autopct='%1.1f%%', colors=sns.color_palette('pastel'))
plt.title('Proportion of Respondents with Family History of Mental Illness')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pie chart because it shows the proportion of respondents who reported a family history of mental illness. Pie charts are useful for highlighting parts of a whole and quickly communicate overall composition.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that around 39% of respondents have a family history of mental illness, while about 61% do not. This suggests that a significant portion of employees may have a genetic or environmental predisposition to mental health issues.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help organizations design targeted mental health initiatives, knowing that many employees might have personal or family experiences with mental illness. Proactive education and support programs can reduce stigma and encourage early intervention.
There are no direct negative growth implications here, but if these insights are ignored, organizations risk higher absenteeism, lower productivity, and increased turnover due to unaddressed mental health challenges.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

In [None]:
#Chart 4: Age distribution by treatment status (Box Plot)
# -----------------------------

plt.figure(figsize=(10, 6))
sns.boxplot(x='treatment', y='Age', data=df, palette='coolwarm')
plt.title('Age Distribution by Mental Health Treatment Status')
plt.xlabel('Sought Treatment')
plt.ylabel('Age')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot to visualize the distribution of ages among those who have and have not sought mental health treatment. A box plot is effective for comparing distributions, showing medians, quartiles, and potential outliers in age data for each group.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that respondents who sought treatment have a slightly higher median age than those who did not. Both groups contain some older respondents (outliers), but overall the age ranges are similar. This suggests that age has at most a mild influence on treatment-seeking behavior, with slightly older employees being somewhat more open to seeking help.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help businesses design age-sensitive mental health initiatives. Knowing that slightly older employees might be more proactive about seeking help can guide targeted communication and training programs for different age groups.
There are no negative growth insights here, but ignoring age-based preferences could reduce engagement and lead to less effective support programs.
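
The "slightly higher median" reading can be backed with the same numbers the box plot draws. The sketch below computes them per group; `age_summary` is a hypothetical helper and the toy ages are illustrative only.

```python
import pandas as pd


def age_summary(data, group_col="treatment"):
    """Count, quartiles, and median of Age per group (the box plot's numbers)."""
    grouped = data.groupby(group_col)["Age"]
    return pd.DataFrame({
        "n": grouped.size(),
        "q1": grouped.quantile(0.25),
        "median": grouped.median(),
        "q3": grouped.quantile(0.75),
    })


# Toy rows for illustration only (not survey results)
toy = pd.DataFrame({
    "Age": [25, 30, 35, 28, 32, 40],
    "treatment": ["No", "No", "No", "Yes", "Yes", "Yes"],
})
summary = age_summary(toy)
print(summary)
```

Calling `age_summary(df)` in the notebook would show how large the gap between the two medians actually is.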

#### Chart - 5

In [None]:
# Chart - 5 visualization code

In [None]:
#Chart 5: Work Interference vs Treatment (Stacked Bar)
# -----------------------------

# Create a cross-tab of work_interfere and treatment
work_treatment = pd.crosstab(df['work_interfere'], df['treatment'])

# Plot as stacked bar
work_treatment.plot(kind='bar', stacked=True, figsize=(10, 6), color=['#ff9999', '#66b3ff'])
plt.title('Work Interference and Mental Health Treatment')
plt.xlabel('Work Interference')
plt.ylabel('Number of Respondents')
plt.legend(title='Sought Treatment')
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a stacked bar chart because it compares multiple groups within each category clearly. It shows how treatment-seeking behavior differs across levels of work interference, with the "Yes" and "No" responses stacked together in one bar per level.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that respondents who reported that their mental health "often" or "rarely" interferes with work are more likely to have sought treatment. Meanwhile, those who reported that it "never" interferes have a much higher proportion of not seeking treatment. The "sometimes" group has the largest number of respondents but also a large group who did not seek treatment, suggesting a potential gap in support or awareness. Because missing work_interfere values were filled with the most frequent answer during cleaning, part of this group may consist of imputed rather than reported answers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help employers design targeted interventions. Companies can focus on employees who report occasional interference ("sometimes") and provide more proactive resources to encourage them to seek help before problems worsen.
If these patterns are ignored, productivity could fall and absenteeism could rise, which would hurt business outcomes. Addressing these gaps can create a healthier, more supportive work environment and improve employee well-being and performance.
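
Missing `work_interfere` values were mode-filled in the wrangling step, so the largest category can contain imputed answers. One way to keep that visible is to record a flag before filling and compare the cross-tab with and without the imputed rows. This is a sketch under assumptions: `fill_mode_with_flag` and the `_was_missing` column name are hypothetical, and the toy rows are illustrative.

```python
import pandas as pd


def fill_mode_with_flag(data, col):
    """Mode-impute `col` on a copy and add `<col>_was_missing` as a boolean flag."""
    out = data.copy()
    out[f"{col}_was_missing"] = out[col].isna()
    out[col] = out[col].fillna(out[col].mode()[0])
    return out


# Toy rows for illustration only (not survey results)
toy = pd.DataFrame({
    "work_interfere": ["Sometimes", None, "Often", "Sometimes", None, "Never"],
    "treatment": ["Yes", "No", "Yes", "No", "No", "No"],
})
flagged = fill_mode_with_flag(toy, "work_interfere")
observed_only = flagged[~flagged["work_interfere_was_missing"]]

# All rows (imputed included) versus only the answers respondents gave
print(pd.crosstab(flagged["work_interfere"], flagged["treatment"]))
print(pd.crosstab(observed_only["work_interfere"], observed_only["treatment"]))
```

If the two tables differ noticeably for the most frequent category, the chart is better read from the observed-only version.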

#### Chart - 6

In [None]:
# Chart - 6 visualization code

In [None]:
#Chart 6: Perceived mental health consequences (Horizontal bar)


# Count responses
consequence_counts = df['mental_health_consequence'].value_counts()

# Plot horizontal bar chart
plt.figure(figsize=(8, 5))
sns.barplot(x=consequence_counts.values, y=consequence_counts.index, palette='viridis')
plt.title('Perceived Negative Consequences of Discussing Mental Health at Work')
plt.xlabel('Number of Respondents')
plt.ylabel('Perception')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it is an effective way to display categorical data with long text labels, such as perceptions about negative consequences. This layout improves readability and clearly shows the number of respondents in each category.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most respondents do not believe there would be negative consequences for discussing mental health issues with their employer, while a significant number selected "Maybe," suggesting uncertainty. A smaller but notable group believes there would be negative consequences. Although many feel safe, a large group still hesitates or fears potential risks.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can guide employers to improve workplace culture and communication strategies. Knowing that a large group is uncertain ("Maybe") highlights the need for stronger assurances and policies to make employees feel safe.
If these concerns are not addressed, it could lead to underreporting of mental health issues, reduced engagement, and lower morale, ultimately harming productivity and retention.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

In [None]:
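# Chart 7: Number of respondents by company size (Line Chart)
# -----------------------------

# Size bands listed from smallest to largest; a plain sort_index() would
# order these strings alphabetically ('1-5', '100-500', '26-100', ...)
size_order = ['1-5', '6-25', '26-100', '100-500', '500-1000', 'More than 1000']

# Count respondents by company size in that order
company_counts = df['no_employees'].value_counts().reindex(size_order, fill_value=0)

# Plot line chart; sort=False keeps the x-axis in the order given above
plt.figure(figsize=(10, 6))
sns.lineplot(x=company_counts.index, y=company_counts.values, marker='o', color='purple', sort=False)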
plt.title('Number of Respondents by Company Size')
plt.xlabel('Company Size')
plt.ylabel('Number of Respondents')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a line chart because it shows how the number of respondents changes across company-size bands. The connected points make rises and drops easier to follow than individual bars would.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that respondents are most commonly from very small companies (1–5) and mid-size companies (6–25, 26–100). There is a sharp drop for companies with 500–1000 employees, while larger organizations ("More than 1000") have significant representation. The survey therefore reached both small startups and large corporations, but companies in the 500–1000 range are underrepresented in the sample.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help target mental health initiatives by company size. Companies with lower participation, like those in the 500–1000 range, might require more focused outreach to understand and support their employees' mental health needs.
Ignoring this gap could lead to missed opportunities for engagement, potentially resulting in lower employee satisfaction and higher turnover in that segment.
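
Company-size bands are stored as strings, so any alphabetical sort places "100-500" before "26-100". One way to keep the bands in their natural order for every later chart is an ordered categorical. In this sketch, `SIZE_ORDER` and `ordered_size_counts` are hypothetical names; the band labels are the ones used in `no_employees`.

```python
import pandas as pd

# Size bands as they appear in no_employees, smallest to largest
SIZE_ORDER = ["1-5", "6-25", "26-100", "100-500", "500-1000", "More than 1000"]


def ordered_size_counts(sizes):
    """Count respondents per size band, keeping bands in their natural order."""
    as_ordered = pd.Categorical(sizes, categories=SIZE_ORDER, ordered=True)
    # sort_index() on a categorical index follows the category order, not the alphabet
    return pd.Series(as_ordered).value_counts().sort_index()


# Toy values for illustration only (not survey results)
toy = pd.Series(["6-25", "More than 1000", "1-5", "6-25", "100-500"])
counts = ordered_size_counts(toy)
print(counts)
```

Bands with no respondents still appear with a count of 0, so gaps stay visible on the axis instead of silently disappearing.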

#### Chart - 8

In [None]:
# Chart - 8 visualization code

In [None]:
#Chart 8: Donut chart — Remote work distribution
# -----------------------------

# Count values
remote_counts = df['remote_work'].value_counts()

# Create pie chart
plt.figure(figsize=(8, 8))
plt.pie(remote_counts, labels=remote_counts.index, autopct='%1.1f%%', startangle=90, colors=sns.color_palette('pastel'))

# Draw a circle in the center to make it a donut
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.title('Distribution of Remote Work Status')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a donut chart because it shows proportions of a whole in a clean, visually appealing way. The open center reduces visual clutter compared with a traditional pie chart.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that about 30% of respondents work remotely at least 50% of the time, while around 70% do not. This indicates that most employees in this survey are primarily working on-site.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help organizations understand the current distribution of remote work and plan mental health support accordingly. For example, on-site employees may need more in-person support resources, while remote employees might benefit from virtual wellness programs.
Ignoring this distinction could lead to ineffective support programs and reduced engagement among remote workers.


#### Chart - 9

In [None]:
# Chart - 9 visualization code

In [None]:
# -----------------------------
# Chart 9: Benefits vs Treatment (Stacked Column)
# -----------------------------

# Create crosstab
benefits_treatment = pd.crosstab(df['benefits'], df['treatment'])

# Plot as stacked column
benefits_treatment.plot(kind='bar', stacked=True, figsize=(8, 6), color=['#ff9999', '#66b3ff'])
plt.title('Availability of Mental Health Benefits vs Treatment Status')
plt.xlabel('Benefits Availability')
plt.ylabel('Number of Respondents')
plt.xticks(rotation=0)
plt.legend(title='Sought Treatment')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a stacked column chart because it is a clear and commonly used way to compare parts of a whole across different categories. It shows both the total number of respondents in each benefits group and the breakdown of those who did or did not seek treatment.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that respondents who have mental health benefits available are more likely to seek treatment than those who do not have benefits or are unsure. This suggests a positive relationship between the availability of mental health benefits and the likelihood of seeking professional help.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can guide organizations to invest in and promote mental health benefits more actively. Providing clear information about these benefits and ensuring they are accessible can encourage more employees to seek help, improving overall well-being and productivity.
Ignoring this could result in lower treatment rates, higher stress levels, and reduced employee engagement, all of which negatively impact business growth and retention.
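
A stacked chart suggests a relationship but does not say how strong it is. One common effect-size measure for two categorical columns is Cramér's V, computed from the chi-square statistic of the contingency table. The sketch below is an addition, not part of the original analysis: `cramers_v` is a hypothetical helper, it applies no continuity correction, and it assumes each column has at least two distinct values.

```python
import numpy as np
import pandas as pd


def cramers_v(data, col_a, col_b):
    """Cramér's V (0 = no association, 1 = perfect) from a contingency table."""
    observed = pd.crosstab(data[col_a], data[col_b]).to_numpy(dtype=float)
    total = observed.sum()
    # Expected counts if the two columns were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (total * min_dim)))


# Toy rows for illustration only: benefits and treatment agree on every row
aligned = pd.DataFrame({
    "benefits": ["Yes", "Yes", "No", "No"],
    "treatment": ["Yes", "Yes", "No", "No"],
})
# Toy rows where treatment is split evenly within each benefits group
unrelated = pd.DataFrame({
    "benefits": ["Yes", "Yes", "No", "No"],
    "treatment": ["Yes", "No", "Yes", "No"],
})
print(cramers_v(aligned, "benefits", "treatment"))
print(cramers_v(unrelated, "benefits", "treatment"))
```

In the notebook, `cramers_v(df, "benefits", "treatment")` would give a single number to report alongside the chart; values near 0 would mean the visual difference is weak.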

#### Chart - 10

In [None]:
# Chart - 10 visualization code

In [None]:
# -----------------------------
# Chart 10: Age distribution by Gender (faceted histograms)
# -----------------------------
import seaborn as sns
import matplotlib.pyplot as plt

# Make sure gender categories are clean
df_clean = df[df['Gender'].isin(['male', 'female', 'other'])]

# Create FacetGrid: one histogram panel per gender
g = sns.FacetGrid(df_clean, col="Gender", height=5, aspect=1)
g.map(sns.histplot, "Age", bins=20, kde=True, color="teal")
g.set_titles("Gender: {col_name}")
g.set_axis_labels("Age", "Count")
# Treatment status is not encoded in this chart, so the title names only age and gender
g.fig.suptitle("Age Distribution by Gender", y=1.05)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a FacetGrid histogram because it clearly shows the age distribution within each gender category in separate panels. This makes it easy to compare patterns across genders side by side and identify trends without overlap or clutter.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can see that most respondents are males in the 25–35 age range, with a clear peak around 30. Female respondents show a similar peak but with lower counts. Respondents who identified as "other" are fewer and have a wider spread in ages. This indicates that the sample skews toward younger professionals and that gender representation is uneven.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can help target mental health programs by age and gender more effectively. Knowing the dominant age groups and gender distribution allows companies to tailor wellness campaigns to be more inclusive and focused. There are no direct negative impacts, but ignoring these differences could lead to lower engagement in support programs, indirectly affecting growth and employee well-being.
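
To put numbers on the "25–35" reading, ages can be grouped into bands and counted per gender. This is a minimal sketch: `age_band_table` is a hypothetical helper, and the bin edges are an assumption chosen only to isolate the 25–35 band.

```python
import pandas as pd


def age_band_table(data, bins=(17, 24, 35, 45, 100)):
    """Cross-tabulate age bands against gender; bin edges are illustrative."""
    # pd.cut intervals are right-inclusive, so (24, 35] is labeled '25-35'
    labels = [f"{lo + 1}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]
    bands = pd.cut(data["Age"], bins=list(bins), labels=labels)
    return pd.crosstab(bands, data["Gender"])


# Toy rows for illustration only (not survey results)
toy = pd.DataFrame({
    "Age": [22, 30, 31, 40, 28],
    "Gender": ["female", "male", "male", "other", "female"],
})
table = age_band_table(toy)
print(table)
```

With the cleaned frame, `age_band_table(df)` would show how many respondents of each gender fall into each band.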



#### Chart - 11

In [None]:
# Chart - 11 visualization code

In [None]:
# Chart 11: Scatter plot — Age vs Company Size
# Create a mapping to convert company size categories into approximate numeric values
company_size_mapping = {
    '1-5': 3,
    '6-25': 15,
    '26-100': 63,
    '100-500': 300,
    '500-1000': 750,
    'More than 1000': 2000
}

# Map the 'no_employees' column to numeric values
df['company_size_num'] = df['no_employees'].map(company_size_mapping)

# ---------------------------------------------
# Create scatter plot
# ---------------------------------------------
plt.figure(figsize=(10, 6))

# Plot Age on y-axis and numeric company size on x-axis
# alpha makes points slightly transparent to see overlaps better
# edgecolors='w' adds white borders to points for better visibility
plt.scatter(df['company_size_num'], df['Age'], alpha=0.6, color='teal', edgecolors='w')

# ---------------------------------------------
# Add titles and labels
# ---------------------------------------------
plt.title('Age vs Company Size (Scatter Plot)')
plt.xlabel('Approximate Company Size')
plt.ylabel('Age')

# Add grid for easier reading
plt.grid(True)

# ---------------------------------------------
# Show the plot
# ---------------------------------------------
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here-I chose a scatter plot to show the relationship between age and company size because it helps to visualize how individual data points are distributed and whether any trend or pattern exists. Scatter plots are commonly used in data analysis to identify clusters, outliers, or potential correlations.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-
From the chart, we can see that respondents across different company sizes generally have a wide age distribution, with most respondents concentrated between ages 25 and 40. There is no strong visible trend showing age preference for certain company sizes, but larger companies (e.g., more than 1,000 employees) show a slightly broader age range, possibly indicating more diversity in workforce age.
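
The "broader age range" reading above can be checked numerically instead of by eye. The following is a minimal sketch that summarizes age per company size; the `sample` rows are made up for illustration, and in the notebook the same `groupby` would be applied to `df`. Larger groups tend to show wider ranges simply because they contain more respondents, so the count is reported alongside the range.

```python
import pandas as pd

# Made-up rows for illustration only; in the notebook this would be `df`
sample = pd.DataFrame({
    'no_employees': ['1-5', '1-5', '6-25', '6-25',
                     'More than 1000', 'More than 1000', 'More than 1000'],
    'Age': [24, 31, 28, 35, 22, 40, 58],
})

# Order the size categories so the summary reads from small to large
size_order = ['1-5', '6-25', '26-100', '100-500', '500-1000', 'More than 1000']
sample['no_employees'] = pd.Categorical(
    sample['no_employees'], categories=size_order, ordered=True
)

# Per-size age summary: number of respondents, median, minimum and maximum age
age_by_size = (
    sample.groupby('no_employees', observed=True)['Age']
    .agg(count='count', median='median', min='min', max='max')
)

# Age range (max - min) within each company size
age_by_size['range'] = age_by_size['max'] - age_by_size['min']
print(age_by_size)
```

Comparing `range` together with `count` helps separate a genuinely more age-diverse workforce from a group that only looks broader because it is larger.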

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, these insights can help companies understand the age diversity in different company sizes, which is important for planning mental health programs and employee support. If age diversity is high, tailored programs might be needed to support different age groups effectively. There are no insights from this chart that would lead to negative growth; instead, it helps highlight opportunities to improve inclusivity and employee well-being strategies.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

In [None]:
# ---------------------------------------------
# Import plotly.express as px
# This is needed to create interactive Plotly charts
# ---------------------------------------------
import plotly.express as px

# ---------------------------------------------
# We are creating a Sunburst chart to show hierarchical relationships.
# Example hierarchy:
# Country → Gender → Treatment
# ---------------------------------------------

fig = px.sunburst(
    df,
    path=['Country', 'Gender', 'treatment'],  # Define the drill-down order
    values=None,                              # Count rows automatically
    color='treatment',                        # Color segments by treatment status
    color_discrete_map={
        'Yes': 'lightgreen',                  # Color for 'Yes'
        'No': 'lightcoral'                    # Color for 'No'
    },
    title='Sunburst Chart: Country → Gender → Treatment'
)

# ---------------------------------------------
# Adjust chart layout (margins)
# ---------------------------------------------
fig.update_layout(
    margin=dict(t=40, l=0, r=0, b=0),
)

# ---------------------------------------------
# Show interactive chart
# ---------------------------------------------
fig.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a sunburst chart because it is excellent for showing hierarchical relationships in the data.

In this case, we have Country → Gender → Treatment as a hierarchy.
A sunburst helps us see how each segment (like gender or treatment) contributes within a country, and we can drill down step by step.
It provides an intuitive, visual breakdown rather than using separate bar or pie charts.

##### 2. What is/are the insight(s) found from the chart?

Answer Here- We see that in countries like the United States, males form the largest group, and within that, a large proportion have not sought treatment (the red segment).
Females in most countries are more likely to have sought treatment (green), suggesting higher mental health awareness or willingness to seek help.
Countries such as the United Kingdom and Canada show more balanced distributions between genders and treatment statuses.
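
Segment sizes in a sunburst are hard to compare precisely, so the proportions described above can be backed by a table. This is a minimal sketch using a row-normalised `pd.crosstab`; the `sample` rows and the `'Male'`/`'Female'` labels are made up for illustration, and in the notebook the same call would be run on `df` with its cleaned `Gender` values.

```python
import pandas as pd

# Made-up rows for illustration only; in the notebook this would be `df`
sample = pd.DataFrame({
    'Country': ['United States'] * 6 + ['Canada'] * 4,
    'Gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female',
               'Male', 'Male', 'Female', 'Female'],
    'treatment': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes',
                  'Yes', 'No', 'Yes', 'No'],
})

# Row-normalised crosstab: each (Country, Gender) row sums to 1,
# so the 'Yes' column is the share of that group that sought treatment
rates = pd.crosstab(
    [sample['Country'], sample['Gender']],
    sample['treatment'],
    normalize='index',
)

# Add group sizes so that very small groups are not over-interpreted
rates['n'] = sample.groupby(['Country', 'Gender']).size()
print(rates.round(2))
```

Reporting `n` next to the rates matters here because many countries in a survey like this contribute only a handful of respondents.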

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here - Positive Business Impact:

Companies operating globally can tailor mental health initiatives per country and gender.
For example, targeted awareness programs for males in the US could encourage more treatment seeking.
HR policies can become more culturally and demographically sensitive.

Potential negative insight (to address):

If large segments of employees (like males in the US) do not seek treatment, it could lead to lower productivity, higher absenteeism, and poor workplace morale.
This insight is crucial to act upon to avoid negative business growth in the long term.


#### Chart - 13

In [None]:
# Chart - 13 visualization code

In [None]:
fig = px.treemap(
    df,
    path=['Gender', 'remote_work', 'treatment'],
    title='Treemap: Gender → Remote Work → Treatment',
    color='treatment',
    color_discrete_map={'Yes': 'green', 'No': 'red'}
)
fig.show()


##### 1. Why did you pick the specific chart?

Answer Here-The treemap chart is chosen because it allows us to represent multiple hierarchical categories together in a compact space. It shows Gender → Remote Work → Treatment in one view.
Unlike bar or pie charts, it visualizes proportions within nested groups, making it easier to compare different subgroups at the same time.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Male non-remote workers have a large proportion who did not seek treatment (large red block).
Female non-remote workers show a higher rate of seeking treatment compared to males (green area is larger).
Overall, there is a visible gap in treatment seeking among males, especially those not working remotely.
The "other" gender group is very small, indicating low representation in this dataset.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here- Positive business impact:

These insights can guide HR teams to design targeted mental health support, focusing on male employees, especially those working on-site.
They also help promote a culture where seeking treatment is normalized.

Any negative growth?

No direct negative growth, but if these gaps are ignored, it may lead to higher absenteeism, burnout, or lower productivity among groups that are not seeking help.
By acting on these insights, companies can improve employee well-being and long-term performance.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Select only numeric columns
df_numeric = df.select_dtypes(include=[np.number])

# Calculate correlation matrix
corr = df_numeric.corr()

# Create a mask for weak correlations (absolute value < 0.5)
mask = np.abs(corr) < 0.5

# Set up the matplotlib figure
plt.figure(figsize=(14, 10))

# Draw the heatmap
sns.heatmap(
    corr,
    mask=mask,           # Hide weak correlations
    cmap='coolwarm',     # Color scheme for intensity
    vmin=-1, vmax=1,     # Fix the color scale range from -1 to 1
    annot=True,          # Show correlation coefficients
    fmt='.2f',           # Decimal places
    linewidths=0.5,
    cbar_kws={'label': 'Correlation Coefficient'}  # Color bar label
)

plt.title('Correlation Heatmap (Stronger Correlations Highlighted by Intensity)', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I picked the correlation heatmap because it is a very effective way to quickly visualize relationships between multiple numerical features in one glance.

It shows both the strength and the direction (positive or negative) of correlations using color intensity and numerical values.
It helps in identifying which variables are strongly related, which can guide further analysis or feature selection in modeling.


##### 2. What is/are the insight(s) found from the chart?

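Answer Here-From this heatmap:

The only cells left visible are the 1.00 values (shown in red). Every variable correlates perfectly with itself, so a correlation matrix always has 1.0 on its diagonal (Age with Age, company_size_num with company_size_num). These diagonal cells are most likely what the heatmap shows, rather than a genuine relationship between Age and company_size_num.
All weak correlations (absolute value below 0.5) were masked out (blanked) to focus only on strong relationships.
A true correlation of 1.0 between Age and company_size_num would also contradict the scatter plot in Chart 11, which showed no strong trend. If an off-diagonal cell does show 1.0, this suggests that either:

These columns might actually represent the same or derived data (e.g., company_size_num might be accidentally coded using Age or vice versa), or
There is a data issue or strong encoding overlap that needs to be checked.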
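
One way to check whether a 1.0 in the heatmap is a real relationship or only a variable's correlation with itself is to mask the diagonal as well as the weak values. The sketch below uses made-up random data in `demo` as a stand-in for `df_numeric`; the 0.5 threshold is the one used in the heatmap code above.

```python
import numpy as np
import pandas as pd

# Made-up numeric data for illustration; the notebook would use `df_numeric`
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(20, 60, size=200),
    'company_size_num': rng.choice([3, 15, 63, 300, 750, 2000], size=200),
})

corr = demo.corr()

# Hide weak correlations AND the diagonal (self-correlation is always 1.0)
weak = np.abs(corr) < 0.5
diagonal = np.eye(len(corr), dtype=bool)
mask = weak | diagonal

# Number of cells that remain visible after masking
n_strong = int((~mask).to_numpy().sum())
print(f'Visible off-diagonal cells with |r| >= 0.5: {n_strong}')
```

Passing this `mask` to `sns.heatmap(corr, mask=mask, ...)` leaves only genuine pairwise correlations visible. If nothing remains, the numeric features have no strong linear relationship at this threshold.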

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

In [None]:
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

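# Select columns for the pair plot
# Only 'Age' and 'company_size_num' are numeric; 'work_interfere' is categorical and is used for hue
# 'no_employees' is left out because it is a text column already encoded as 'company_size_num'
# You may customize this list as per your cleaned and encoded dataset
pairplot_cols = ['Age', 'company_size_num', 'work_interfere']

# Create the pair plot
sns.pairplot(df[pairplot_cols], vars=['Age', 'company_size_num'], diag_kind='kde', hue='work_interfere')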

# Add a title to the whole plot
plt.suptitle('Pair Plot of Selected Numerical Features', y=1.02)

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here- I chose the pair plot because it helps visualize relationships and correlations between multiple numerical variables in a single compact plot. It also shows distributions on the diagonal, giving a complete overview of data spread and possible patterns in different combinations. Adding hue (work_interfere) gives further context on how different levels of work interference impact these relationships.

##### 2. What is/are the insight(s) found from the chart?

Answer Here- We can see that age distribution is mostly clustered between 20 and 40, with slight variations across different work interference categories.
Most respondents are from companies of smaller or very large sizes, suggesting a polarized distribution.
There isn’t a strong visible linear relationship between age and company size, but patterns hint at younger respondents often working in both very small and very large companies.
Work interference levels are spread across company sizes and ages, but "Sometimes" interference seems more common in middle-age groups.


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here- Business Objective Recommendation (Concise)
1️⃣ Improve mental health benefits and clearly communicate them to increase treatment uptake and reduce stigma.

2️⃣ Tailor awareness programs by gender and region, since male employees and certain countries show lower treatment-seeking behavior.

3️⃣ Promote flexible and remote work options to support mental well-being.

4️⃣ Train managers to reduce work interference perceptions and encourage open mental health discussions.

5️⃣ Develop company-size-specific strategies, focusing on reducing barriers in larger firms.

# **Conclusion**

The analysis shows that mental health support at workplaces is still inadequate. Many employees, especially males and those in larger companies, hesitate to seek treatment due to stigma and fear of negative consequences. Younger employees and females are more open to discussing mental health. Companies should strengthen mental health benefits, promote open communication, and create supportive work cultures to improve employee well-being and productivity.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***