<a href="https://colab.research.google.com/github/ishadvay3928/Bird-Species-Observation-Analysis/blob/main/Bird_Species_Observation_EDA_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Bird Species Observation Analysis**



##### **Project Type**    -  Exploratory Data Analysis (EDA)
##### **Contribution**    - Individual


# **Project Summary -**

This project presents a comprehensive Exploratory Data Analysis (EDA) of a mental health survey dataset focusing on employees in the tech industry. The main objective was to analyze employee responses to understand mental health trends, workplace support systems, and treatment-seeking behavior, ultimately aiding organizations in crafting more effective mental health policies and initiatives.

The dataset underwent data cleaning and transformation to handle inconsistencies and categorical encoding, followed by visual exploration through 15 different charts including histograms, pie charts, count plots, strip plots, and a correlation heatmap. These visualizations provided in-depth insights into demographic distributions, treatment patterns, and organizational support factors.

Key findings revealed that:

- Employees in their late 20s and early 30s dominate the respondent pool and face significant mental health interference.

- The respondent base is heavily male-dominated, and treatment rates among males are notably lower.

- The majority of responses came from the US, indicating regional participation bias.

- About half of the respondents are undergoing treatment, highlighting a growing openness towards mental health.

- Smaller companies (6–100 employees) form the majority, calling for scalable mental health solutions.

- Strong correlations were found between coworker-supervisor support (0.57), wellness programs and seeking help (0.47), and benefits-care options (0.44), suggesting key leverage points for integrated mental health strategies.

The analysis uncovered gaps such as lack of awareness around mental health leave, treatment hesitancy among certain groups, and minimal participation from non-Western countries. These insights provide organizations with actionable data to design inclusive, gender-sensitive, scalable, and globally relevant workplace mental health initiatives.

# **GitHub Link -**

https://github.com/ishadvay3928/Bird-Species-Observation-Analysis/blob/main/Bird_Species_Observation_Analysis.ipynb

# **Problem Statement**


**The project aims to analyze the distribution and diversity of bird species in two distinct ecosystems: forests and grasslands. By examining bird species observations across these habitats, the goal is to understand how environmental factors, such as vegetation type, climate, and terrain, influence bird populations and their behavior. The study will involve working on the provided observational data of bird species present in both ecosystems, identifying patterns of habitat preference, and assessing the impact of these habitats on bird diversity. The findings can provide valuable insights into habitat conservation, biodiversity management, and the effects of environmental changes on avian communities.**



#### **Define Your Business Objective?**

- Inform decisions on protecting critical bird habitats and enhancing biodiversity conservation efforts.
- Optimize land use and habitat restoration strategies by understanding the preferences of different bird species.
- Identify bird-rich areas to develop bird-watching tourism, attracting eco-tourists and boosting local economies.
- Support the development of agricultural practices that minimize the impact on bird populations in grasslands and forests.
- Provide data-driven insights to help environmental agencies create effective conservation policies and strategies for vulnerable bird species.
- Track the health and diversity of avian populations, aiding in the monitoring of ecosystem stability.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Dataset Loading

In [None]:
# Load forest Dataset (pandas is already imported as pd above)

# Specify the file path
file_path = "/content/Bird_Monitoring_Data_FOREST.XLSX"

# Read the Excel file with multiple sheets
excel_data = pd.ExcelFile(file_path)

# Get all sheet names
sheet_names = excel_data.sheet_names

# Read data from all sheets into a dictionary
sheets_dict = {sheet: excel_data.parse(sheet) for sheet in sheet_names}

In [None]:
Forest_combined_df = pd.concat(
    [df.assign(Sheet=sheet_name) for sheet_name, df in sheets_dict.items()],
    ignore_index=True
)

In [None]:
# Drop the helper 'Sheet' column once it is no longer needed
Forest_combined_df = Forest_combined_df.drop(columns=['Sheet'])

In [None]:
# Load grassland Dataset (pandas is already imported as pd above)

# Specify the file path
file_path = "/content/Bird_Monitoring_Data_GRASSLAND.XLSX"

# Read the Excel file with multiple sheets
excel_data = pd.ExcelFile(file_path)

# Get all sheet names
sheet_names = excel_data.sheet_names

# Read data from all sheets into a dictionary
sheets_dict = {sheet: excel_data.parse(sheet) for sheet in sheet_names}

In [None]:
Grassland_combined_df = pd.concat(
    [df.assign(Sheet=sheet_name) for sheet_name, df in sheets_dict.items()],
    ignore_index=True
)

In [None]:
# Drop the helper 'Sheet' column once it is no longer needed
Grassland_combined_df = Grassland_combined_df.drop(columns=['Sheet'])

### Dataset First View

In [None]:
# Dataset First Look
Forest_combined_df.head()

In [None]:
Grassland_combined_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
Forest_combined_df.shape

In [None]:
Grassland_combined_df.shape

### Dataset Information

In [None]:
# Dataset Info
Forest_combined_df.info()

In [None]:
Grassland_combined_df.info()

#### Duplicate Values

In [None]:
# Duplicate Value Count
Forest_combined_df.duplicated().sum()

In [None]:
Grassland_combined_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count of datasets
Forest_combined_df.isnull().sum()

In [None]:
Grassland_combined_df.isnull().sum()

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(Forest_combined_df)

In [None]:
import missingno as msno
msno.bar(Grassland_combined_df)

### What did you know about your dataset?

**Forest dataset**
- The dataset has 8546 rows and 29 columns.
- Five columns have missing values. 'Sub_Unit_Code' has the most, with 7824 missing values.
- 'ID_Method' has the fewest, with 1 missing value.

**Grassland dataset**
- The dataset has 8531 rows and 29 columns.
- Five columns have missing values. 'Sub_Unit_Code' has the most, with 8531 missing values (every row).
- 'ID_Method' has the fewest, with 1 missing value.
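
The counts above can be reproduced with a small helper. The following is a minimal sketch, not part of the original notebook: `missing_summary` and the toy frame are hypothetical names, and the toy values are illustrative rather than taken from the bird data.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for Forest_combined_df (illustrative values only)
toy = pd.DataFrame({
    "Sub_Unit_Code": [None, None, None, "A"],
    "ID_Method": ["Singing", None, "Calling", "Singing"],
    "Year": [2018, 2018, 2018, 2018],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(Forest_combined_df)` or `missing_summary(Grassland_combined_df)` in the notebook would list only the columns that contain nulls, which is easier to scan than the full `isnull().sum()` output.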


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
Forest_combined_df.columns

In [None]:
Grassland_combined_df.columns

In [None]:
#Dataset Describe
Forest_combined_df.describe(include='all')

In [None]:
Grassland_combined_df.describe(include='all')

### Variables Description

**The dataset contains observational data for bird species recorded across multiple forest and grassland sites. It includes detailed columns describing location, observation methods, bird species, and environmental conditions.**

- Admin_Unit_Code: The code for the administrative unit (e.g., "ANTI") where the observation was conducted.
- Sub_Unit_Code: The sub-unit within the administrative unit for further classification.
- Site_Name: The name of the specific observation site within the unit.
- Plot_Name: A unique identifier for the specific plot where observations were recorded.
- Location_Type: The habitat type of the observation area (e.g., "Forest").
- Year: The year in which the observation took place.
- Date: The exact date of the observation.
- Start_Time: The start time of the observation session.
- End_Time: The end time of the observation session.
- Observer: The individual who conducted the observation.
- Visit: The count of visits made to the same observation site or plot.
- Interval_Length: The duration of the observation interval (e.g., "0-2.5 min").
- ID_Method: The method used to identify the species (e.g., "Singing," "Calling," "Visualization").
- Distance: The distance of the observed species from the observer (e.g., "<= 50 Meters").
- Flyover_Observed: Indicates whether the bird was observed flying overhead (TRUE/FALSE).
- Sex: The sex of the observed bird (e.g., Male, Female, Undetermined).
- Common_Name: The common name of the observed bird species (e.g., "Eastern Towhee").
- Scientific_Name: The scientific name of the observed bird species (e.g., Pipilo erythrophthalmus).
- AcceptedTSN: The Taxonomic Serial Number for the observed species.
- NPSTaxonCode: A unique code assigned to the taxon of the species.
- AOU_Code: The American Ornithologists' Union code for the species.
- PIF_Watchlist_Status: Indicates whether the species is on the Partners in Flight Watchlist (e.g., "TRUE" for at-risk species).
- Regional_Stewardship_Status: Denotes the conservation priority within the region (TRUE/FALSE).
- Temperature: The temperature recorded at the time of observation (in degrees).
- Humidity: The humidity percentage recorded at the time of observation.
- Sky: The sky condition during the observation (e.g., "Cloudy/Overcast").
- Wind: The wind condition (e.g., "Calm (< 1 mph) smoke rises vertically").
- Disturbance: Notes any disturbances that could affect the observation (e.g., "No effect on count").
- Initial_Three_Min_Cnt: The count of the species observed in the first three minutes of the session.

**Sheets Information:**

The Excel file contains multiple sheets representing different administrative units, with their codes matching the Admin_Unit_Code column:

- ANTI: Data for the Antietam National Battlefield.
- CATO: Data for the Catoctin Mountain Park.
- CHOH: Data for the Chesapeake and Ohio Canal National Historical Park.
- GWMP: Data for the George Washington Memorial Parkway.
- HAFE: Data for Harpers Ferry National Historical Park.
- MANA: Data for the Manassas National Battlefield Park.
- MONO: Data for the Monocacy National Battlefield.
- NACE: Data for the National Capital East Parks.
- PRWI: Data for the Prince William Forest Park.
- ROCR: Data for the Rock Creek Park.
- WOTR: Data for the Wolf Trap National Park for the Performing Arts.
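
Because each sheet name is expected to match the `Admin_Unit_Code` values inside it, that assumption can be checked before the helper `Sheet` column is dropped. This is a sketch under stated assumptions: `mismatched_sheets` is a hypothetical helper, and `toy_sheets` stands in for the dictionary of parsed sheets.

```python
import pandas as pd

def mismatched_sheets(sheets: dict) -> list:
    """Return sheet names whose rows carry a different Admin_Unit_Code."""
    bad = []
    for name, frame in sheets.items():
        codes = set(frame["Admin_Unit_Code"].dropna().unique())
        # Any code other than the sheet's own name counts as a mismatch
        if codes - {name}:
            bad.append(name)
    return bad

# Hypothetical stand-in for the dict built from the Excel sheets
toy_sheets = {
    "ANTI": pd.DataFrame({"Admin_Unit_Code": ["ANTI", "ANTI"]}),
    "CATO": pd.DataFrame({"Admin_Unit_Code": ["CATO"]}),
}
print(mismatched_sheets(toy_sheets))
```

An empty list means the sheet name carries no information beyond `Admin_Unit_Code`, so dropping `Sheet` loses nothing.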

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable of dataset.
Forest_combined_df.nunique()

In [None]:
Grassland_combined_df.nunique()

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Merge both datasets
Merged_df = pd.concat([Forest_combined_df, Grassland_combined_df], ignore_index=True)

In [None]:
# Drop Sub_Unit_Code column as it has very few non-null values
Merged_df.drop(columns=['Sub_Unit_Code'], inplace=True)

In [None]:
# impute null values in Site_Name column
Merged_df['Site_Name'] = Merged_df['Site_Name'].fillna("Unknown")

# impute null values in Distance column
Merged_df['Distance'] = Merged_df['Distance'].fillna("Unknown")

# impute null values in Sex column
Merged_df['Sex'] = Merged_df['Sex'].fillna("Undetermined")

# impute null values in NPSTaxonCode column
Merged_df['NPSTaxonCode'] = Merged_df['NPSTaxonCode'].fillna("N/A")

# impute null values in TaxonCode column
Merged_df['TaxonCode'] = Merged_df['TaxonCode'].fillna("N/A")

In [None]:
# impute null values in Previously_Obs column using mode
Merged_df['Previously_Obs'] = Merged_df['Previously_Obs'].fillna(Merged_df['Previously_Obs'].mode()[0])

In [None]:
# Drop rows where 'ID_Method' or 'AcceptedTSN' are null
Merged_df = Merged_df.dropna(subset=['ID_Method', 'AcceptedTSN'])

In [None]:
# CHANGE DATATYPES

# Fix Year column
Merged_df['Year'] = pd.to_numeric(Merged_df['Year'], errors='coerce').astype('Int64')

# Handle 'Start_Time' and 'End_Time' columns
# Convert to string and strip spaces
Merged_df['Start_Time'] = Merged_df['Start_Time'].astype(str).str.strip()
Merged_df['End_Time'] = Merged_df['End_Time'].astype(str).str.strip()

# Extract only the HH:MM:SS part (last 8 characters)
Merged_df['Start_Time'] = Merged_df['Start_Time'].str[-8:]
Merged_df['End_Time'] = Merged_df['End_Time'].str[-8:]

# Convert to proper time format
Merged_df['Start_Time'] = pd.to_datetime(Merged_df['Start_Time'], format='%H:%M:%S', errors='coerce').dt.time
Merged_df['End_Time'] = pd.to_datetime(Merged_df['End_Time'], format='%H:%M:%S', errors='coerce').dt.time

# Create a column for observation hour (times that fail to parse become NaN)
Merged_df['Observation_Hour'] = pd.to_datetime(Merged_df['Start_Time'].astype(str), format='%H:%M:%S', errors='coerce').dt.hour


In [None]:
# Columns holding TRUE/FALSE-style flags
bool_cols = ['Flyover_Observed', 'PIF_Watchlist_Status',
             'Regional_Stewardship_Status', 'Initial_Three_Min_Cnt']

# Normalize text variants to real booleans; values outside the mapping become NaN
for col in bool_cols:
    Merged_df[col] = Merged_df[col].astype(str).str.strip().str.lower().map(
        {'true': True, 'false': False, 'yes': True, 'no': False}
    )

# Columns with a limited set of repeated labels
cat_cols = ['Admin_Unit_Code', 'Site_Name', 'Plot_Name', 'Location_Type',
            'Observer', 'ID_Method', 'Distance', 'Sex', 'Common_Name',
            'Scientific_Name', 'Sky', 'Wind', 'Disturbance']

# Store them as category dtype to save memory and speed up grouping
for col in cat_cols:
    Merged_df[col] = Merged_df[col].astype('category')

In [None]:
# Drop Duplicates from merged dataset
Merged_df.drop_duplicates(inplace=True)

In [None]:
Merged_df.info()

In [None]:
# Save cleaned dataset
Merged_df.to_csv("Bird_Monitoring_Clean_Merged_dataset.csv", index=False)

### What all manipulations have you done and insights you found?

### **Key Manipulations:**

* **Merged Forest and Grassland Datasets** to create a single unified dataset for analysis.
* **Dropped `Sub_Unit_Code`** as it contained very few non-null values (sparse, low-utility data).
* **Imputed Missing Values**:

  * `Site_Name` and `Distance` → `"Unknown"`
  * `Sex` → `"Undetermined"`
  * `NPSTaxonCode` & `TaxonCode` → `"N/A"`
  * `Previously_Obs` → Filled using the most frequent (mode) value.
* **Dropped Rows with Missing `ID_Method` and `AcceptedTSN`** to ensure essential identification information is retained.
* **Converted `Year` to Integer** (`Int64`) for consistent numeric analysis.
* **Cleaned and Standardized Time Fields** (`Start_Time` and `End_Time`): removed extra spaces, extracted `HH:MM:SS`, converted to proper time format.
* **Created `Observation_Hour` Column** from `Start_Time` to enable hourly trend analysis.
* **Standardized Boolean-like Columns** (`Flyover_Observed`, `PIF_Watchlist_Status`, `Regional_Stewardship_Status`, `Initial_Three_Min_Cnt`) by mapping variations like `'yes'/'no'` and `'true'/'false'` to `True`/`False`.
* **Converted Categorical Columns** (e.g., location codes, observer, species, environmental conditions) to category dtype for efficiency and consistency.
* **Removed Duplicate Rows** to maintain data integrity.
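
The time-field cleaning described above can be illustrated on a few sample strings. This is a minimal sketch on made-up values (`raw` is hypothetical, not data from the workbook); it follows the same strip, last-eight-characters, parse sequence as the wrangling cell.

```python
import pandas as pd

# Made-up inputs: padded time, datetime-like string, and an invalid entry
raw = pd.Series([" 06:15:00", "1900-01-01 07:40:00 ", "not a time"])

# Strip whitespace, then keep the trailing HH:MM:SS portion
cleaned = raw.astype(str).str.strip().str[-8:]

# Parse; anything that is not a valid HH:MM:SS becomes NaT
parsed = pd.to_datetime(cleaned, format="%H:%M:%S", errors="coerce")

times = parsed.dt.time   # time-of-day objects (NaT where parsing failed)
hours = parsed.dt.hour   # hour as a number (NaN where parsing failed)
print(hours)
```

Passing `errors="coerce"` keeps a single malformed value from raising an exception. The trade-off is that bad entries silently turn into missing values, so it is worth counting them afterwards with `parsed.isna().sum()`.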


### **Insights Gained:**

* **Data Deduplication** ensures no repeated entries, preventing double counting in species observations.
* **Consistent Missing Value Handling** preserves maximum usable data while avoiding gaps in analysis.
* **Categorical Standardization** improves the accuracy of grouping, filtering, and summary statistics.
* **Time Cleaning and Hour Extraction** enables meaningful time-based pattern detection (e.g., peak bird activity hours).
* **Boolean Standardization** supports reliable filtering and aggregation for conservation status and observation methods.
* **Dropping Low-Value Columns** removes noise and improves dataset quality for focused analysis.
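
One caveat of the boolean standardization is that any value outside the mapping becomes missing. The sketch below shows this on made-up inputs; `to_bool`, `BOOL_MAP`, and the sample series are hypothetical names introduced only for illustration.

```python
import pandas as pd

BOOL_MAP = {"true": True, "false": False, "yes": True, "no": False}

def to_bool(series: pd.Series) -> pd.Series:
    """Map text variants of true/false/yes/no to booleans; others become NaN."""
    return series.astype(str).str.strip().str.lower().map(BOOL_MAP)

# Made-up values covering real booleans, padded text, nulls, and an unknown label
raw = pd.Series([True, "FALSE", " Yes", "no ", None, "maybe"])
result = to_bool(raw)
print(result)
print("Unmapped values:", result.isna().sum())
```

Checking the unmapped count after the conversion helps confirm that a column really is boolean-like before it is treated as one.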


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 (Age Distribution)

In [None]:
# Age Distribution
plt.figure(figsize=(12,6))
sns.histplot(df['Age'], bins=30, kde=True, color='teal')
plt.title('Age Distribution of Respondents')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

This histogram was chosen to effectively display the frequency distribution of the respondents' ages, which is a continuous data set.

##### 2. What is/are the insight(s) found from the chart?

The data reveals that the largest group of respondents are young adults, predominantly in their late 20s and early 30s, with a significant drop in respondents over the age of 40.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This insight can drive positive impact by supporting a healthy working environment, particularly for the dominant age group, but it also reveals a potential negative factor by highlighting that a major working-age group is struggling with mental health issues.

#### Chart - 2 (Gender Distribution)

In [None]:
# Gender Count
plt.figure(figsize=(12, 6))
df['Gender'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=['cornflowerblue', 'plum','lavender'])
plt.title('Gender Distribution of Respondents')
plt.ylabel('')
plt.show()



##### 1. Why did you pick the specific chart?

This pie chart was chosen to clearly show the proportional breakdown of respondents by gender, which is a categorical variable.

##### 2. What is/are the insight(s) found from the chart?

The insight from the chart is that the respondent base is heavily skewed towards males, who make up a vast majority (78.4%) of the participants followed by females and others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight allows for tailoring mental health initiatives to the current male-dominated workforce, but it also signals a significant risk to organizational well-being, as a lack of gender diversity can lead to support systems that fail to address the unique challenges faced by other genders, fostering an exclusionary culture.

#### Chart - 3 (Top 10 Countries)

In [None]:
# Top 10 Countries
top_countries = df['Country'].value_counts().nlargest(10)

plt.figure(figsize=(12,6))
sns.barplot(x=top_countries.values, y=top_countries.index, palette="coolwarm")
plt.title('Top 10 Countries by Respondent Count')
plt.xlabel('Number of Respondents')
plt.ylabel('Country')
plt.show()

##### 1. Why did you pick the specific chart?

This horizontal bar chart was selected to clearly rank and compare the number of respondents from different countries, effectively handling long category labels.

##### 2. What is/are the insight(s) found from the chart?

The data shows that respondents are overwhelmingly from the United States, with a sharp decline in numbers for other countries, most of which are Western and English-speaking.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Most respondents are from the US, which shows that the US is the most active country in this mental health survey. On the other hand, with the large majority of respondents from the US and very few from other countries, the data cannot accurately represent mental health in tech globally.

#### Chart - 4 (Mental Health Treatment Distribution)

In [None]:
# Mental Health Treatment Distribution
plt.figure(figsize=(8, 5))
df['treatment'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=['lightgreen', 'lightcoral'])
plt.title('People undergoing Treatment for Mental Health')
plt.ylabel('')
plt.show()



##### 1. Why did you pick the specific chart?

A pie chart was chosen to effectively visualize the proportion of people undergoing mental health treatment versus those who are not, showcasing parts of a whole.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that a near-equal proportion of individuals are currently undergoing mental health treatment (50.5%) as those who are not (49.5%).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Roughly half of the respondents are undergoing treatment, which suggests a significant and addressable market for mental health in tech and validates demand. However, the equally large segment not undergoing treatment points to potential negative growth if these individuals are overlooked: a missed opportunity for market expansion and deeper penetration through awareness campaigns or less stigmatizing access points.

#### Chart - 5 (Work Interference Frequency)

In [None]:
# Work Interference Frequency
plt.figure(figsize=(10, 5))
sns.countplot(data=df, y='work_interfere', order=df['work_interfere'].value_counts().index, palette='pastel')
plt.title('Work Interference Due to Mental Health')
plt.xlabel('Count')
plt.ylabel('Interference Level')
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart was selected to clearly compare the frequency of different "Interference Levels" of mental health on work.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that "Sometimes" is the most common work interference level due to mental health, followed by "Never," then "Rarely," and "Often" is the least common.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that a significant portion of individuals are impacted by mental health at work presents a positive business impact for mental health in tech by indicating a large addressable market for workplace solutions; however, the "Never" category, while smaller, represents a potential for negative growth if mental health solutions don't proactively engage this segment through preventative measures or general wellness programs, missing an opportunity for broader impact across the entire workforce.

#### Chart - 6 (Company Size by Number of Employees)

In [None]:
# Company Size (No. of Employees)
plt.figure(figsize=(12, 6))
sns.countplot(data=df, y='no_employees', order=df['no_employees'].value_counts().index, palette='muted')
plt.title('Company Size by Number of Employees')
plt.xlabel('Count')
plt.ylabel('Company Size')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen to effectively compare the number of companies across various employee size ranges, making it easy to see which size categories are most prevalent.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that companies with 6-25 and 26-100 employees are the most numerous, followed by companies with "More than 1000" employees, then 100-500, 1-5, and lastly 500-1000 employees.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that small to medium-sized companies represent the largest segment of the market presents a positive business impact for mental health in tech by indicating a vast, accessible target audience for scalable mental health solutions. However, the relatively low count of 500-1000-employee companies suggests a risk of negative growth if effort is concentrated on that segment, overlooking the larger opportunity in serving the more numerous smaller organizations with simpler off-the-shelf offerings.

#### Chart - 7 (Family History)

In [None]:
# Family History
plt.figure(figsize=(12, 6))
df['family_history'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=['cornflowerblue', 'plum'])
plt.title('Family History of Respondents')
plt.ylabel('')
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart was chosen to clearly show the proportion of respondents with and without a family history of mental health conditions, illustrating parts of a whole.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that a majority of respondents (60.9%) do not have a family history of mental health issues, while a significant minority (39.1%) do.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insight that nearly 40% of respondents have a family history of mental health conditions presents a positive business impact for mental health by identifying a segment with potentially higher awareness and predisposition; however, the larger segment without a family history represents a potential for negative growth if solutions focus solely on known risk factors, potentially missing the broader population who may still benefit from general mental wellness tools.

#### Chart - 8 (Medical Leave Applicable)

In [None]:
# Medical Leave Applicable for Mental Health Condition
plt.figure(figsize=(12, 6))
sns.countplot(data=df, y='leave', order=df['leave'].value_counts().index, palette='muted')
plt.title('Medical Leave Applicable for Mental Health Condition')
plt.xlabel('Count')
plt.ylabel('Responses')
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen to effectively compare the number of responses for different levels of ease/difficulty in taking medical leave for mental health conditions.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that a large number of respondents "Don't know" if medical leave is applicable for mental health, while it's considered "Somewhat easy" or "Very easy" by a notable portion, and "Somewhat difficult" or "Very difficult" by a smaller segment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights highlight an opportunity for mental health platforms to improve awareness of workplace policies. However, the respondents who find leave access "difficult" signal potential risks to employee well-being and productivity, underscoring the need for policy reforms driven by this data.

#### Chart - 9 (Treatment vs Gender)

In [None]:
# Treatment vs Gender
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='Gender', hue='treatment', palette='Set1')
plt.title('Treatment vs Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

A grouped bar chart was chosen to effectively compare the number of individuals receiving or not receiving mental health treatment across different gender categories.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that a higher number of males are not receiving treatment compared to those who are, while for females, more are receiving treatment than not; the "Others" category shows very low counts for both treatment statuses.
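
Because the groups differ greatly in size, raw counts in a grouped bar chart can make "more likely" comparisons hard to judge. A row-normalized crosstab expresses the same comparison as within-group rates. This is a minimal sketch on made-up rows; the toy frame is hypothetical and its values do not come from the survey.

```python
import pandas as pd

# Made-up rows: four "Male" and two "Female" respondents
toy = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 2,
    "treatment": ["Yes", "No", "No", "No", "Yes", "Yes"],
})

# normalize="index" divides each row by its total, giving per-group shares
rates = pd.crosstab(toy["Gender"], toy["treatment"], normalize="index")
print(rates)
```

Applied to the real frame, `pd.crosstab(df['Gender'], df['treatment'], normalize='index')` would show the treatment rate within each gender regardless of how many respondents each group has.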

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The low treatment rates among males highlight an underserved group, offering mental health platforms a key opportunity for targeted outreach. However, neglecting the unique needs of males and the underrepresented "Others" group could worsen care disparities if not properly addressed in data and solutions.

#### Chart - 10 (Family History vs Treatment)

In [None]:
# Family History vs Treatment
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='family_history', hue='treatment', palette='Set2')
plt.title('Treatment vs Family History')
plt.xlabel('Family History of Mental Illness')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

A grouped bar chart was chosen to effectively compare the number of individuals receiving or not receiving mental health treatment based on their family history of mental illness.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that individuals with no family history are more likely to not be in treatment, whereas individuals with a family history of mental illness are more likely to be receiving treatment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Individuals with a family history are more likely to seek treatment, offering a clear target for advanced mental health solutions. However, focusing only on high-risk groups may overlook a wider audience, limiting growth in early detection and general wellness efforts.

#### Chart - 11 (Work Interference vs Treatment)

In [None]:
# Work Interference vs Treatment
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='work_interfere', hue='treatment', palette='Set2')
plt.title('Treatment vs Work Interference')
plt.xlabel('Work Interference')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart was chosen to effectively compare the number of individuals receiving or not receiving mental health treatment across different levels of work interference.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that individuals who experience "Often" or "Rarely" work interference are more likely to be receiving treatment, while those who experience "Never" work interference are significantly less likely to be in treatment, and "Sometimes" shows a roughly even split.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Individuals reporting frequent work interference are actively seeking treatment, indicating strong demand for workplace mental health support. However, overlooking those with little or occasional interference risks missing a large group that could benefit from early interventions, potentially limiting long-term impact and growth.

#### Chart - 12 (Remote Work vs Treatment)

In [None]:
# Remote Work vs Treatment
plt.figure(figsize=(7, 5))
sns.countplot(data=df, x='remote_work', hue='treatment', palette='cool')
plt.title('Treatment vs Remote Work')
plt.xlabel('Remote Work')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart was chosen to effectively compare the number of individuals receiving or not receiving mental health treatment based on their remote work status.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that a higher number of non-remote workers are not in treatment compared to those who are, while among remote workers, slightly more are receiving treatment than not.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Remote workers show a higher tendency to seek treatment, highlighting a key audience for digital mental health solutions. However, focusing too heavily on remote setups may overlook the larger non-remote workforce, limiting opportunities to address mental health needs in traditional and hybrid workplaces.

#### Chart - 13 (Supervisor Support vs Treatment)

In [None]:
# Supervisor Support vs Treatment
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='supervisor', hue='treatment', palette='spring')
plt.title('Treatment vs Supervisor Support')
plt.xlabel('Supervisor Support')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

A grouped bar chart was chosen to effectively compare the number of individuals receiving or not receiving mental health treatment across different levels of supervisor support.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that individuals with supervisor support ("Yes") are slightly less likely to be in treatment, while those with no supervisor support ("No") are more likely to be in treatment. Individuals with "Some of them" support show a slightly higher likelihood of being in treatment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Individuals who receive supervisor support still seek treatment, which reveals demand for mental health care tools. However, the fact that those without support are more likely to be in treatment signals a risk: without improving managerial empathy, mental health tools alone may fall short in fostering true workplace well-being.

#### Chart - 14 (Age vs Work Interfere colored by Treatment)

In [None]:
# Age vs Work Interfere colored by Treatment
plt.figure(figsize=(10,6))
sns.stripplot(data=df, x="work_interfere", y="Age", hue="treatment", jitter=True)
plt.title("Age vs Work Interfere colored by Treatment")
plt.show()

##### 1. Why did you pick the specific chart?

A strip plot (or jitter plot) was chosen to visualize the distribution of individual age data points at each work interference level, with color indicating treatment status.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that respondents at every work interference level, particularly "Sometimes" and "Never," span a wide age range. People of all ages experience work interference, and both treated and untreated respondents appear across the age range.
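The visual impression of a wide age spread in each group can be checked with a small numeric summary. The sketch below assumes the column names used in the plot (`work_interfere`, `treatment`, `Age`); the helper `age_summary` and the toy values are hypothetical and are not taken from the survey.

```python
import pandas as pd


def age_summary(frame: pd.DataFrame,
                by=("work_interfere", "treatment"),
                age_col: str = "Age") -> pd.DataFrame:
    """Count, median, minimum and maximum age for each group combination."""
    return frame.groupby(list(by))[age_col].agg(["count", "median", "min", "max"])


# Toy data for illustration only; these are NOT the survey responses.
toy = pd.DataFrame({
    "work_interfere": ["Never", "Never", "Sometimes", "Sometimes", "Sometimes", "Often"],
    "treatment":      ["No", "No", "Yes", "Yes", "No", "Yes"],
    "Age":            [25, 41, 29, 35, 52, 33],
})

summary = age_summary(toy)
print(summary)
```

Similar medians and overlapping ranges across rows would be consistent with the reading that age does not separate the groups.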

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Work interference affects all age groups, highlighting a broad market for age-inclusive mental health support in tech. However, the absence of age-based treatment patterns means interventions cannot be tailored to the specific needs of an age group, which may limit effectiveness and growth.

#### Chart - 15 (Correlation matrix)

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode categorical columns
encoded_df = df.copy()
label_encoders = {}
for col in encoded_df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    encoded_df[col] = le.fit_transform(encoded_df[col].astype(str))
    label_encoders[col] = le

# Correlation matrix
plt.figure(figsize=(16, 10))
sns.heatmap(encoded_df.corr(), cmap="coolwarm", annot=True, fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Survey Features", fontsize=16)
plt.show()



##### 1. Why did you pick the specific chart?

The correlation heatmap effectively visualizes relationships between multiple mental health survey features at once, helping identify key associations.

##### 2. What is/are the insight(s) found from the chart?

The strongest correlations are between coworker and supervisor (0.57), followed by seek_help and wellness_program (0.47), and then care_options and benefits (0.44).
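Reading the largest values off an annotated heatmap by eye is error-prone when there are many features. One possible way to list the top pairs programmatically is sketched below; the helper `top_correlations` and the toy columns `feat_a`, `feat_b`, `feat_c` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def top_correlations(corr: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n feature pairs with the largest absolute correlation."""
    # Keep the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by magnitude so strong negative correlations are not missed.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy numeric data for illustration only; these are NOT the survey features.
toy = pd.DataFrame({
    "feat_a": [1, 2, 3, 4],
    "feat_b": [2, 4, 6, 8],
    "feat_c": [1, -1, 1, -1],
})

top = top_correlations(toy.corr(), n=2)
print(top)
```

In the notebook, the equivalent call would be `top_correlations(encoded_df.corr())`.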

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can drive positive business impact by improving mental health support systems through aligned wellness programs, benefits, and managerial support. No strong negative correlations were found, but weak associations may highlight missed opportunities in underutilized support services.
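When interpreting the sign of a correlation on label-encoded data, it helps to know which integer each category received. `LabelEncoder` assigns codes in sorted order of the class labels, so the numeric order may not match any natural ordering of the answers. The sketch below shows one way to inspect the mapping; the helper `encoder_mapping` and the toy answers are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encoder_mapping(le: LabelEncoder) -> dict:
    """Map each original category to the integer code the encoder assigned."""
    return {str(label): code for code, label in enumerate(le.classes_)}


# Toy answers for illustration only; these are NOT the survey responses.
toy = pd.Series(["Yes", "No", "Some of them", "No"])
le = LabelEncoder().fit(toy.astype(str))

mapping = encoder_mapping(le)
print(mapping)
```

With the notebook's `label_encoders` dictionary, the same helper could be applied per column, for example `encoder_mapping(label_encoders["supervisor"])`.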

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

- Focus on employees in their late 20s to early 30s with tailored mental health programs.
- Launch campaigns for men and include underrepresented genders in mental health initiatives.
- Collaborate with global tech hubs to gather diverse data and improve outreach.
- Strengthen supervisor and coworker support to boost mental health outcomes.
- Make leave policies clearer and easier to access to reduce stress and absenteeism.
- Provide affordable, flexible programs for small to mid-sized companies.
- Offer wellness checks and preventive support even to those not yet seeking treatment.
- Create virtual mental health services to support the growing remote workforce.
- Reduce stigma through manager training, open dialogue, and anonymous feedback systems.

# **Conclusion**

This analysis of the Mental Health in Tech dataset reveals critical insights into employee well-being and organizational gaps. Young adults in their late 20s and early 30s, especially males, face significant mental health challenges, yet treatment rates among males remain comparatively low. Factors such as supervisor support, work interference, and leave policies appear closely linked to treatment decisions. Small to mid-sized companies dominate the landscape, highlighting the need for scalable solutions. Additionally, remote workers show higher engagement in treatment, indicating digital intervention potential. These findings emphasize the need for inclusive, targeted, and proactive mental health strategies to build a healthier, more supportive tech work environment.