# **Project Name**    - EDA Medibuddy




##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**


This project focused on the analysis of Medibuddy insurance data, which includes customer demographics, lifestyle attributes, and insurance charges. The primary goal was to derive actionable insights to help the company improve policy planning, risk assessment, and customer segmentation.

The dataset comprised two files: one containing personal details (age, sex, BMI, smoker status, region, and number of children) and the other containing policy costs (insurance charges). After merging the two datasets on the common key (Policy no.), the resulting dataset provided a comprehensive view of each customer’s profile alongside the insurance charges incurred.
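The merge step described above can be sketched on toy data. This is a minimal illustration, not the notebook's actual data: the column names mirror the notebook, but every value is invented, and the `validate` and `indicator` options of `pd.merge` are an assumed extra safety check that the original analysis did not necessarily use.

```python
import pandas as pd

# Hypothetical toy stand-ins for the two Medibuddy files (all values invented).
details = pd.DataFrame({
    "Policy no.": ["P1", "P2", "P3"],
    "age": [25, 40, 61],
    "smoker": ["no", "yes", "no"],
})
price = pd.DataFrame({
    "Policy no.": ["P1", "P2", "P4"],
    "charges in INR": [3000.0, 21000.0, 9000.0],
})

# An outer merge with indicator=True shows which policies appear in only one file;
# validate="one_to_one" raises an error if a policy number repeats in either file.
check = pd.merge(details, price, on="Policy no.", how="outer",
                 validate="one_to_one", indicator=True)
unmatched = check[check["_merge"] != "both"]

# The inner merge keeps only the policies present in both files.
merged = pd.merge(details, price, on="Policy no.", how="inner")

print(unmatched[["Policy no.", "_merge"]])
print(merged.shape)
```

Running the outer merge first makes it visible how many rows an inner merge silently drops.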

We started by cleaning the data: checking for missing values and duplicates, and ensuring correct data types. Outliers in the numerical variables (age, bmi, and charges) were detected with the Z-score method, flagging rows with an absolute Z-score above 3. These outliers, while extreme, provided important insights into high-risk or high-cost individuals.

The analysis explored the influence of several factors on insurance charges:

Gender: A boxplot analysis indicated that there is no significant difference in charges based on gender. Therefore, gender should not be a constraint when extending policies.

Average Cost: The average insurance cost per policy was approximately ₹13,270, providing a baseline for policy pricing and forecasting.

Geographic Region: Charges varied slightly by region. Certain regions showed higher average charges, potentially due to lifestyle or healthcare costs. This suggests that geographic segmentation of policies could be considered to optimize pricing strategies.

Dependents (Children): The number of children showed a weak correlation with charges. Households with more dependents did not necessarily incur higher costs, indicating that this variable may not significantly impact individual policy pricing.

BMI (Body Mass Index): A strong relationship was observed between BMI and insurance charges, especially when segmented by smoker status. Higher BMI often correlated with higher charges, suggesting that BMI can be a useful indicator for estimating policy risk. This opens up opportunities to offer health-based discounts or wellness incentives for customers within a healthy BMI range.

Smoker Status: Smoking status had one of the most significant impacts on insurance charges. Smokers were charged dramatically higher amounts than non-smokers, validating the risk assessment criteria. This highlights the importance of continuing to collect and evaluate lifestyle data for policy underwriting.

Age: Age was positively correlated with insurance charges. Older individuals tended to incur higher charges, which is consistent with the general increase in health risks over time. Age should continue to be a key factor in premium calculation.

Health-Based Discounts: Based on BMI distributions, a case can be made for offering discounts to customers within a healthy BMI range (18.5-24.9). This could incentivize healthier lifestyles and potentially reduce claim rates.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



The goal is to analyze Medibuddy's insurance data and identify the key demographic and lifestyle factors (age, gender, BMI, region, smoking status, and dependents) that influence insurance charges, in order to optimize policy pricing, risk assessment, and customer segmentation strategies.










#### **Define Your Business Objective?**

1. Optimize Policy Pricing based on customer risk profiles.

2. Identify High-Risk Individuals using factors like age, BMI, and smoking status.

3. Enhance Customer Segmentation for targeted policy offerings.

4. Evaluate Geographic Impact on insurance costs for regional pricing strategies.

5. Explore Discount Opportunities based on health indicators to encourage wellness.










# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats


### Dataset Loading

In [None]:
# Load Data
file_path_details = "/content/Medibuddy insurance data personal details (1) (2).xlsx"
file_path_price = "/content/Medibuddy Insurance Data Price (1) (2).xlsx"
# Read each Excel file into a DataFrame
df_details = pd.read_excel(file_path_details)
df_price = pd.read_excel(file_path_price)

In [None]:
# Merge on 'Policy no.'
merged_df = pd.merge(df_details, df_price, on="Policy no.", how="inner")

In [None]:
# Save to CSV
merged_df.to_csv("Merged_Medibuddy_Data.csv", index=False)

In [None]:
# Load the merged data
merged_df = pd.read_csv("Merged_Medibuddy_Data.csv")

In [None]:
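# Check head and tail of the data
# display() shows both outputs; a bare expression is rendered only when it is the last line of a cell
from IPython.display import display
display(merged_df.head())
display(merged_df.tail())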

### Dataset First View

In [None]:
# Dataset First Look
merged_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
merged_df.shape


### Dataset Information

In [None]:
# Dataset Info
merged_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
merged_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
merged_df.isnull().sum()

In [None]:
# Visualizing the missing values
# Create a heatmap of missing values
plt.figure(figsize=(8, 5))
sns.heatmap(merged_df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

There are no missing values in the dataset.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
merged_df.columns

In [None]:
# Dataset Describe
merged_df.describe()

### Variables Description

`Policy no.`: Unique identifier for each insurance policy

`age`: Age of the insured individual (in years)

`sex`: Gender of the individual (male / female)

`bmi`: Body Mass Index, a measure of body fat

`children`: Number of dependents (children) covered

`smoker`: Smoking status (yes / no)

`region`: Geographic location (northeast, northwest, etc.)

`charges in INR`: Total insurance claim amount in Indian Rupees (INR)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = merged_df.nunique()
unique_values

## 3. ***Data Wrangling***

In [None]:
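# Outlier detection (Z-score)
# Compute absolute Z-scores for the numeric columns, excluding the policy identifier
numeric_cols = merged_df.select_dtypes(include=[np.number]).drop(columns=["Policy no."], errors="ignore")
z_scores = np.abs(stats.zscore(numeric_cols))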


In [None]:
# Identify rows where any numeric column has an absolute Z-score above 3 (outliers)
outlier_rows = (z_scores > 3).any(axis=1)


In [None]:
print("Number of outlier rows:", outlier_rows.sum())
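To see which variable drives the flagged rows, the Z-score rule can be broken down per column. The sketch below uses invented toy data (19 ordinary charges plus one extreme value), not the Medibuddy data, and computes the Z-scores by hand with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Hypothetical toy data: 19 ordinary charges plus one extreme value.
toy = pd.DataFrame({
    "age": list(range(20, 40)),
    "charges in INR": [5000 + 100 * i for i in range(19)] + [60000],
})

# Z-score = (value - column mean) / column standard deviation.
# ddof=0 (population standard deviation) matches the scipy.stats.zscore default.
z = (toy - toy.mean()) / toy.std(ddof=0)

# Count flagged values per column to see which variable produces the outliers.
outliers_per_column = (z.abs() > 3).sum()

# Keep the full rows that have at least one flagged value.
outlier_rows = toy[(z.abs() > 3).any(axis=1)]

print(outliers_per_column)
print(outlier_rows)
```

Note that with very small samples no value can reach an absolute Z-score of 3, so the rule is only meaningful on reasonably sized data.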

In [None]:
# Exact duplicate rows
exact_duplicates = merged_df[merged_df.duplicated()]
print("Exact Duplicate Rows:\n", exact_duplicates)


In [None]:
# Duplicate rows based on 'age', 'region', 'charges in INR'
column_duplicates = merged_df[merged_df.duplicated(subset=['age', 'region', 'charges in INR'])]
print("Duplicate Rows based on 'age', 'region', 'charges in INR':\n", column_duplicates)

### Data Wrangling Code

### What all manipulations have you done and insights you found?

Merged two datasets on Policy no.

Checked the data for duplicates and missing values (no missing values were found)

Converted data types and standardized columns

Detected outliers using Z-scores

Performed exploratory data analysis with visualizations

Checked the merged dataset for exact duplicate rows and for rows duplicated on age, region, and charges

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#Gender vs Charges
plt.figure(figsize=(8, 6))
sns.boxplot(x='sex', y='charges in INR', data=merged_df)
plt.title('Gender vs Charges')
plt.xlabel('Gender')
plt.ylabel('Charges in INR')
plt.show()


Question: Does the gender of the person matter for the company as a constraint for extending policies?

Answer: No. The boxplot shows similar charge distributions for both genders, so gender should not be a constraint for extending policies; ethical, legal, and fairness concerns point the same way. Factors such as age, smoking status, and BMI are more relevant.

##### 1. Why did you pick the specific chart?

I used a boxplot because:

It effectively displays distribution, median, interquartile range (IQR), and outliers.

It compares a numerical variable (charges in INR) against a categorical variable (sex), which is ideal for boxplots.



##### 2. What is/are the insight(s) found from the chart?

The median charges for both males and females are very close.

The spread (IQR) is similar for both genders.

The number and position of outliers are also fairly balanced.
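A boxplot comparison is visual only, so a significance test could back up the claim that the two groups do not differ. The sketch below is an assumed follow-up, not part of the original analysis: it runs Welch's t-test and, because charges are typically right-skewed, the rank-based Mann-Whitney U test from `scipy.stats` on invented toy values.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: the two groups have similar, right-skewed charges.
toy = pd.DataFrame({
    "sex": ["male"] * 6 + ["female"] * 6,
    "charges in INR": [4000, 5200, 6100, 7300, 12000, 30000,
                       3900, 5400, 6000, 7500, 11500, 29000],
})
male = toy.loc[toy["sex"] == "male", "charges in INR"]
female = toy.loc[toy["sex"] == "female", "charges in INR"]

# Welch's t-test compares the means without assuming equal variances.
t_result = stats.ttest_ind(male, female, equal_var=False)

# Mann-Whitney U compares the distributions using ranks, so it is less
# sensitive to the few very large charges.
u_result = stats.mannwhitneyu(male, female, alternative="two-sided")

print(f"Welch t-test p-value: {t_result.pvalue:.3f}")
print(f"Mann-Whitney U p-value: {u_result.pvalue:.3f}")
```

A p-value above the chosen threshold (commonly 0.05) means the data give no evidence of a difference; it does not prove that the groups are identical.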

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
The company can avoid gender-based pricing, which:

Simplifies the pricing model

Promotes fairness and equality

Aligns with ethical insurance practices

Avoids potential legal and regulatory issues

No negative impact is expected.

#### Chart - 2

In [None]:

# Chart - 2 visualization code
# Average charges across all policies
average_charges = merged_df['charges in INR'].mean()
print("Average Charges:", average_charges)



Question: What is the average amount of money the company spent over each policy cover?

Answer: The average amount of money the company spends per policy cover is ₹13,270.42.



##### 1. Why did you pick the specific chart?

I did not use a chart here. A printed value is the simplest way to display a single summary metric such as the average charge: it is clean, quick, and to the point.

##### 2. What is/are the insight(s) found from the chart?

The average insurance charge is approximately ₹13,270 per policy.

This sets a benchmark for policy pricing and expected payouts.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Helps in budget forecasting, premium design, and understanding profitability margins.

No negative growth is indicated by this insight. If claims consistently exceeded this average, that would signal financial loss, but confirming it needs further analysis.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#Region vs Charges
plt.figure(figsize=(10, 5))
sns.barplot(x='region', y='charges in INR', data=merged_df)
plt.title('Region vs Charges')
plt.xlabel('Region')
plt.ylabel('Charges in INR')
plt.show()


Question: Could you advise whether the company needs to offer separate policies based upon the geographic location of the person?

Answer: Yes. Based on the bar plot of Region vs Charges, I advise the company to consider offering separate policies by geographic region.



##### 1. Why did you pick the specific chart?

I chose a bar chart because it clearly shows the average charges across different geographic regions. It is simple, effective, and ideal for comparing grouped numerical values like costs by region.



##### 2. What is/are the insight(s) found from the chart?

The chart reveals whether there are regional variations in how much the company is spending on policyholders.

If some regions show higher average charges, it may indicate higher risk, cost of healthcare, or health issues in those areas.

Other regions with lower charges may represent lower-risk customer groups.
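The regional comparison can also be read off a small summary table, which makes the size of the differences explicit. The sketch below uses invented values and region labels purely for illustration; it reports both mean and median because the mean, which the bar plot shows, is pulled upward by a few very large charges.

```python
import pandas as pd

# Hypothetical toy data (values and region labels are invented).
toy = pd.DataFrame({
    "region": ["northeast", "northeast", "northwest",
               "northwest", "southeast", "southeast"],
    "charges in INR": [12000.0, 14000.0, 9000.0, 11000.0, 15000.0, 21000.0],
})

# Count, mean, and median of charges per region, highest mean first.
summary = (
    toy.groupby("region")["charges in INR"]
       .agg(["count", "mean", "median"])
       .sort_values("mean", ascending=False)
)
print(summary)
```

Including the count column helps judge whether a regional difference rests on enough policies to be trusted.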



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Positive Impact: Yes. These insights allow the company to:

Segment policy offerings based on region,

Price premiums more accurately,

Reduce losses by adjusting risk exposure,

Improve customer targeting for marketing and policy design.

Negative Impact: Yes, if misused:

If the company increases premiums too much in high-cost regions, it may lose customers or face regulatory issues.

Over-segmentation can lead to complexity, higher operational costs, and customer dissatisfaction.





#### Chart - 4

In [None]:
# Chart - 4 visualization code
#Children (dependents) vs Charges
plt.figure(figsize=(10, 5))
sns.boxplot(x='children', y='charges in INR', data=merged_df)
plt.title('Children (Dependents) vs Charges')
plt.xlabel('Children (Dependents)')
plt.ylabel('Charges in INR')
plt.show()

Question: Does the number of dependents make a difference in the amount claimed?

Answer: The number of dependents does not significantly impact the amount claimed.

##### 1. Why did you pick the specific chart?

I chose a boxplot because it effectively shows the distribution and spread of charges across different numbers of dependents (children). It helps visualize:

Medians,

Variability (IQR),

Outliers.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that:

The number of children (dependents) has little to no impact on the amount claimed.

Medians and spreads are similar across categories, meaning charges are not strongly correlated with dependents.
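The "weak correlation" reading could be quantified with group medians and a rank correlation. This is a sketch on invented toy data, not a result from the Medibuddy dataset; the Spearman coefficient is computed here as the Pearson correlation of the ranks, which is its definition.

```python
import pandas as pd

# Hypothetical toy data: charges vary within each group but barely across groups.
toy = pd.DataFrame({
    "children": [0, 0, 1, 1, 2, 2, 3, 3],
    "charges in INR": [9000.0, 15000.0, 8000.0, 16000.0,
                       10000.0, 14000.0, 9500.0, 15500.0],
})

# Median charge for each number of dependents.
median_by_children = toy.groupby("children")["charges in INR"].median()

# Spearman rank correlation = Pearson correlation of the ranked values.
spearman = toy["children"].rank().corr(toy["charges in INR"].rank())

print(median_by_children)
print(f"Spearman correlation: {spearman:.3f}")
```

A coefficient close to zero, together with nearly equal medians, supports treating the number of dependents as a minor pricing factor.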



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Yes. The insight helps the company by:

Avoiding unnecessary segmentation based on number of dependents,

Keeping policies simple and fair,

Focusing pricing on more influential factors (e.g., age, smoking, BMI).

Negative Impact:  No negative growth if applied correctly.


#### Chart - 5

In [None]:
# Chart - 5 visualization code
# BMI Vs Charges
plt.figure(figsize=(10, 5))
sns.scatterplot(x='bmi', y='charges in INR', data=merged_df)
plt.title('BMI vs Charges')
plt.xlabel('BMI')
plt.ylabel('Charges in INR')
plt.show()

Question: Does a study of a person's BMI give the company any idea of the insurance claim that it would extend?

Answer: Yes, studying a person's BMI does help the company estimate potential insurance claims.



##### 1. Why did you pick the specific chart?

The scatter plot was chosen because it shows the relationship between two continuous variables, BMI and charges. It clearly reveals whether there is:

A trend or pattern,

A correlation (e.g., higher BMI → higher charges),

Any clusters or outliers.

##### 2. What is/are the insight(s) found from the chart?

There is a positive correlation: as BMI increases, especially beyond 30 (obese range), insurance charges tend to rise.

People with high BMI are more likely to incur higher healthcare costs due to obesity-related conditions.
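The strength of the BMI relationship can be checked with a correlation coefficient, overall and within each smoker group. The toy values below are invented and deliberately constructed so that the pattern is visible; they are not results from the Medibuddy data.

```python
import pandas as pd

# Hypothetical toy data: charges rise with BMI for smokers only.
toy = pd.DataFrame({
    "bmi": [22.0, 26.0, 30.0, 34.0, 22.0, 26.0, 30.0, 34.0],
    "smoker": ["no"] * 4 + ["yes"] * 4,
    "charges in INR": [5000.0, 5300.0, 4900.0, 5200.0,
                       18000.0, 24000.0, 33000.0, 41000.0],
})

# Pearson correlation between BMI and charges over all rows.
overall_r = toy["bmi"].corr(toy["charges in INR"])

# The same correlation computed separately for smokers and non-smokers.
r_by_smoker = {
    status: group["bmi"].corr(group["charges in INR"])
    for status, group in toy.groupby("smoker")
}

print(f"Overall: {overall_r:.3f}")
for status, r in r_by_smoker.items():
    print(f"smoker={status}: {r:.3f}")
```

Splitting by group matters because a pooled correlation can hide a relationship that holds strongly in one group and barely in the other.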



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:   Yes. These insights allow the company to:

Assess health risk more accurately,

Set appropriate premiums,

Encourage wellness programs for high-BMI customers to reduce future claims.

Negative Impact: No

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Smoker vs Non-Smoker
sns.histplot(x='smoker', y='charges in INR', data=merged_df)
plt.title('Smoker vs Non-Smoker')
plt.xlabel('Smoker')
plt.ylabel('Charges in INR')
plt.show()



Question: Does the company need to understand whether the person covered is a smoker or a non-smoker?

Answer: Yes, it is very important for the company to know whether a person is a smoker or non-smoker.



##### 1. Why did you pick the specific chart?

I chose a histogram because it compares the distribution of charges between smokers and non-smokers. It visually highlights the frequency and charge amounts for each group, making it easy to spot differences in cost behavior.

##### 2. What is/are the insight(s) found from the chart?

Smokers incur much higher charges compared to non-smokers.

The chart shows a clear and significant cost difference between the two groups.

Smoking is strongly associated with higher medical expenses and insurance claims.
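The size of the gap between the two groups can be put into numbers with a grouped summary. This is a minimal sketch on invented toy values; the ratio it prints illustrates the calculation only and is not the ratio in the Medibuddy data.

```python
import pandas as pd

# Hypothetical toy data (all values invented).
toy = pd.DataFrame({
    "smoker": ["no", "no", "no", "yes", "yes"],
    "charges in INR": [6000.0, 8000.0, 10000.0, 28000.0, 36000.0],
})

# Count, mean, and median of charges for each group.
by_smoker = toy.groupby("smoker")["charges in INR"].agg(["count", "mean", "median"])

# How many times higher the average smoker charge is than the non-smoker one.
ratio = by_smoker.loc["yes", "mean"] / by_smoker.loc["no", "mean"]

print(by_smoker)
print(f"Smokers are charged {ratio:.1f}x the non-smoker average")
```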



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Yes. These insights help the company:

Assess risk accurately,

Price premiums appropriately (higher for smokers),

Promote wellness programs or incentives to reduce smoking.


Negative Impact: Possibly, if not managed ethically:

Overcharging smokers or denying coverage may cause customer backlash or legal issues.

Could lead to loss of trust or negative brand perception.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
#  Age vs Charges
plt.figure(figsize=(10, 5))
sns.scatterplot(x='age', y='charges in INR', data=merged_df)
plt.title('Age vs Charges')
plt.xlabel('Age')
plt.ylabel('Charges in INR')
plt.show()

Question: Does age have any barrier on the insurance claimed?

Answer: Yes, age has a clear impact on insurance claims: older individuals tend to claim higher amounts.

##### 1. Why did you pick the specific chart?

The scatter plot was chosen because it is ideal for showing the relationship between two continuous variables, age and insurance charges.
It helps visualize:

Trends across age groups

Variations in charges



##### 2. What is/are the insight(s) found from the chart?

There is a positive correlation: as age increases, charges tend to rise.

Especially after middle age, charges increase more steeply, reflecting higher health risks and medical costs in older individuals.
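Grouping ages into bands is one way to check how charges change across the age range. The band edges below are a hypothetical choice made for this illustration, and the values are invented toy data.

```python
import pandas as pd

# Hypothetical toy data (all values invented).
toy = pd.DataFrame({
    "age": [19, 27, 33, 41, 48, 55, 62],
    "charges in INR": [3000.0, 4000.0, 6000.0, 8000.0,
                       10000.0, 13000.0, 15000.0],
})

# Hypothetical age bands; right=False makes each bin include its lower edge,
# so the first band covers ages 18 up to (but not including) 30.
bins = [18, 30, 45, 65]
labels = ["18-29", "30-44", "45-64"]
toy["age_band"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# Average charge within each age band.
mean_by_band = toy.groupby("age_band", observed=True)["charges in INR"].mean()
print(mean_by_band)
```

Comparing the step between consecutive bands shows whether charges rise evenly with age or faster in the older bands.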



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Yes. The insights enable the company to:

Design age-based pricing for policies,

Predict future claim costs,

Develop age-specific coverage plans (e.g., senior health policies),

Improve risk management and profitability.

Negative Impact: Potentially, yes, if age-based pricing is misapplied. With balanced pricing and senior-friendly plans, however, it can drive positive business growth and customer retention.





#### Chart - 8

In [None]:
# Chart - 8 visualization code
# BMI Distribution for Health-Based Discounts
plt.figure(figsize=(10, 5))
sns.histplot(x='bmi', kde=True, data=merged_df)
plt.title('BMI Distribution for Health-Based Discounts')
plt.xlabel('BMI')
plt.ylabel('Count')
plt.show()

Question: Can the company extend certain discounts after checking the health status (BMI) in this case?

Answer: Yes, the company can extend health-based discounts using BMI as a criterion.

##### 1. Why did you pick the specific chart?

I chose a histogram with a KDE overlay because it clearly shows the distribution of BMI across the population.
It helps identify:

How many people fall into healthy, overweight, or obese categories,

Where discount thresholds could be applied.



##### 2. What is/are the insight(s) found from the chart?

A significant number of individuals may fall in or near the healthy BMI range (18.5-24.9).

This allows the company to target and reward healthier individuals.

It also highlights the proportion at higher risk, who may need wellness interventions.
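The share of customers in each BMI category can be counted directly rather than estimated from the histogram. The sketch below uses invented BMI values and the standard WHO adult cut-offs, of which the healthy range (18.5-24.9) is the one quoted above; treating these cut-offs as discount thresholds is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy BMI values (invented for illustration).
bmi = pd.Series([17.5, 21.0, 24.9, 25.0, 28.3, 30.0, 36.4, 22.7], name="bmi")

# Standard WHO adult cut-offs: below 18.5 underweight, 18.5-24.9 healthy,
# 25-29.9 overweight, 30 and above obese.
# right=False makes each bin include its lower edge, so 25.0 counts as overweight.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "healthy", "overweight", "obese"]
category = pd.cut(bmi, bins=bins, labels=labels, right=False)

# Fraction of customers in each category, kept in category order.
share = category.value_counts(normalize=True, sort=False)
print(share)
```

The healthy share gives a first estimate of how many customers a BMI-based discount would reach.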



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:  Yes. The insights support:

Offering health-based discounts to low-risk individuals,

Reducing future claims through preventive health incentives,

Encouraging healthier lifestyles, leading to long-term savings.

Negative Impact: Yes, if not handled fairly:

If high-BMI customers feel penalized without support, it may cause:

Customer dissatisfaction or churn,

Negative publicity.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

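The recommendations from this analysis (keep age, BMI, and smoking status central to risk assessment, consider regional pricing adjustments, and offer wellness incentives) will help the company: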

Control risk,

Boost profitability,

Retain healthier customers, and

Offer fair and competitive insurance products in the market.


# **Conclusion**


In conclusion, the analysis provided valuable insights into how demographic and lifestyle factors influence insurance costs. Key recommendations include maintaining a focus on age, BMI, and smoking status for risk assessment, considering regional pricing adjustments, and offering wellness incentives. These steps can improve profitability, customer satisfaction, and operational efficiency for Medibuddy's insurance offerings.



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***