# **Project Name**    -  MediBuddy Insurance Data Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

The MediBuddy Insurance Data Analysis & Risk Assessment project aims to understand the key factors that influence medical insurance charges and to provide data-driven insights that can support better decision-making in insurance policy design and risk management. MediBuddy, being a digital healthcare and insurance platform, deals with diverse customer profiles varying in age, health status, lifestyle, and demographics. This project leverages cleaned and structured insurance data to analyze how these factors impact insurance claim amounts.

The dataset used in this project includes information such as policy number, age, age group, gender, region, number of children (dependents), smoking status, BMI (Body Mass Index), BMI category, and insurance charges in INR. Since the data was already cleaned prior to analysis, the focus of the project was on exploratory data analysis (EDA) and visualization rather than preprocessing. Feature engineering, such as creating age groups and BMI categories, helped in simplifying analysis and drawing more meaningful insights.

The analysis followed a structured UBM approach (Univariate, Bivariate, and Multivariate analysis) to ensure a comprehensive understanding of the data. Univariate analysis was used to study the distribution of individual variables like age group, BMI, and insurance charges. These analyses revealed that most policyholders belong to young adult, adult, and middle-aged categories, while senior citizens form a smaller proportion of the insured population. BMI distribution showed that most individuals fall within the normal to overweight range, indicating a generally moderate health risk profile.

Bivariate analysis examined relationships between two variables at a time, such as age and insurance charges, BMI and insurance charges, number of dependents and charges, and smoking status versus charges. These visualizations clearly indicated that insurance charges increase with age and BMI. Smoking status emerged as one of the most influential factors, with smokers consistently incurring significantly higher insurance costs than non-smokers across all comparisons. In contrast, the number of dependents showed only a marginal impact on insurance charges, suggesting it is not a major cost driver.

Multivariate analysis further strengthened these findings by examining combined effects of multiple variables. Visualizations such as grouped bar charts, box plots, heatmaps, and bubble charts demonstrated that the interaction between age, BMI, and smoking status leads to substantially higher insurance charges. Senior individuals with obese BMI and smoking habits represented the highest-risk group. Heatmap analysis highlighted that insurance charges rise sharply when higher age groups intersect with unhealthy BMI categories. Bubble chart analysis reinforced that smoking amplifies the combined effect of age and BMI on insurance costs.

Overall, the project concludes that age, BMI, and smoking status are the most critical factors influencing insurance charges, while gender, region, and number of dependents play comparatively smaller roles. These insights are highly valuable for insurance companies like MediBuddy, as they can be used to design risk-based premium structures, introduce targeted wellness programs, and offer incentives or discounts for healthier individuals. The findings also support the idea of personalized and preventive healthcare-driven insurance models.

In summary, this project demonstrates how data visualization and exploratory analysis can uncover meaningful patterns in insurance data, enabling smarter business decisions and improved risk assessment in the healthcare insurance domain.

# **GitHub Link -**

https://github.com/krishnavekariya1995/Medibuddy-insurance-by-Labmentix

# **Problem Statement**


The objective of this project is to analyze MediBuddy insurance data to understand how different demographic, health, and lifestyle factors influence insurance charges, based strictly on insights derived from data visualizations. The analysis focuses on identifying key risk drivers by examining patterns observed in charts related to age groups, BMI distribution, smoking status, number of dependents, and their combined effects on insurance charges. From the visual evidence, insurance costs show a consistent increase with age, higher BMI categories, and smoking behavior, while factors such as gender and number of dependents exhibit limited influence. By relying solely on the outcomes observed in the provided charts, this project aims to highlight which customer segments contribute most to higher insurance claims and to support data-driven decision-making for risk assessment, premium structuring, and targeted policy design within the MediBuddy insurance framework.

#### **Define Your Business Objective?**

The primary business objective of this project is to support MediBuddy in identifying key factors that drive insurance charges using insights derived strictly from data visualizations. Based on the observed charts, the objective is to understand how customer attributes such as age, BMI, smoking status, and their combined effects influence insurance costs. The analysis aims to help the company recognize high-risk customer segments, optimize risk-based premium pricing, and make informed decisions regarding policy design, discounts, and preventive health initiatives. By focusing only on visual evidence, the objective is to enable MediBuddy to improve cost control, underwriting accuracy, and strategic planning without relying on assumptions beyond the analyzed charts.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

df_personal = pd.read_excel('/content/drive/MyDrive/Medibuddy Project/Medibuddy insurance data personal details.xlsx')
df_price = pd.read_excel('/content/drive/MyDrive/Medibuddy Project/Medibuddy Insurance Data Price.xlsx')
df_personal.rename(columns={'Policy no.':'policy_no'}, inplace=True)
df_price.rename(columns={'Policy no':'policy_no'}, inplace=True)
# Normalize price column names: trim, lowercase, drop line breaks and dots, spaces -> underscores
df_price.columns = (
    df_price.columns.str.strip()
    .str.lower()
    .str.replace(r'[\n\r.]', '', regex=True)
    .str.replace(' ', '_', regex=False)
)


In [None]:
df_merged=pd.merge(df_personal, df_price, on='policy_no', how='inner')
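
Because the two workbooks are joined with an inner merge, any policy number that appears in only one file is dropped without warning. A minimal sketch of how that could be audited is shown below. The two small frames and their policy numbers are made-up stand-ins for `df_personal` and `df_price`, not project data.

```python
import pandas as pd

# Toy stand-ins for df_personal and df_price (hypothetical values).
personal = pd.DataFrame({"policy_no": ["PLC1", "PLC2", "PLC3"],
                         "age": [25, 41, 58]})
price = pd.DataFrame({"policy_no": ["PLC1", "PLC2", "PLC4"],
                      "charges_in_inr": [4200.0, 9800.0, 15100.0]})

# An outer merge with indicator=True labels each row by where its key was found;
# validate="one_to_one" raises an error if a key repeats on either side.
audit = pd.merge(personal, price, on="policy_no", how="outer",
                 indicator=True, validate="one_to_one")

# Rows that an inner merge would silently discard.
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["policy_no", "_merge"]])
```

If `unmatched` is empty on the real data, the inner merge loses nothing; otherwise the listed policy numbers are worth reviewing before analysis.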

### Dataset First View

In [None]:
# Dataset First Look
df_merged.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df_merged.shape


### Dataset Information

In [None]:
# Dataset Info
df_merged.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df_merged.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df_merged.isnull().sum()


In [None]:
# Visualizing the missing values
sns.heatmap(df_merged.isnull(), cbar=False)
plt.title('Missing Values by Column')
plt.show()

### What did you know about your dataset?

The dataset contains MediBuddy insurance policyholder information, including demographic, health, and lifestyle attributes along with insurance charges. It shows that most policyholders belong to the working-age population and generally have moderate BMI levels. Insurance charges are unevenly distributed, with a small group of individuals accounting for higher costs. Age, BMI, and smoking status strongly influence insurance charges, while gender and number of dependents have minimal impact.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df_merged.columns

In [None]:
# Dataset Describe
df_merged.describe()

### Variables Description

policy_no: Unique identification number assigned to each insurance policyholder.

age: Age of the insured individual in years.

age_group: Categorized age bands created from the age variable to group policyholders into meaningful age ranges for analysis.

sex: Gender of the policyholder (Male/Female).

region: Geographic region where the policyholder resides.

children: Number of dependents covered under the insurance policy.

smoker: Indicates whether the policyholder is a smoker or non-smoker.

bmi: Body Mass Index of the individual, representing overall health condition.

bmi_category: Classification of BMI into health categories such as underweight, normal, overweight, and obese.

charges_in_inr: Total insurance charges incurred by the policyholder, measured in Indian Rupees (INR).
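
The BMI bands depend on how values that sit exactly on a bin edge are assigned. The sketch below probes the bin edges used in this notebook's wrangling step with a handful of hand-picked BMI values (the probe values are illustrative, not taken from the dataset). By default `pd.cut` closes each interval on the right, so an edge value falls into the lower band.

```python
import pandas as pd

# Same edges and labels as the wrangling code in this notebook.
bins_bmi = [0, 18.5, 24.9, 29.9, 100]
labels_bmi = ["Underweight", "Normal", "Overweight", "Obese"]

# Probe values sitting on or next to each edge (hypothetical).
probe = pd.Series([18.5, 18.6, 24.9, 25.0, 29.9, 30.0])
bands = pd.cut(probe, bins=bins_bmi, labels=labels_bmi)

print(pd.DataFrame({"bmi": probe, "bmi_category": bands}))
```

With these edges a BMI of exactly 18.5 is labelled Underweight, whereas the common WHO convention counts 18.5 as the start of the normal range. Passing `right=False` with adjusted edges would be one way to change that, if desired.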

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df_merged.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Prepare the merged dataset for analysis.
# Normalize column names first so the lookups below use the cleaned names
df_merged.columns = (
    df_merged.columns.str.strip().str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('.', '', regex=False)  # regex=False: treat '.' as a literal dot
)

# Remove duplicate records (copy() avoids chained-assignment warnings below)
df_merged = df_merged.drop_duplicates().copy()

# Standardize categorical text values
df_merged['smoker'] = df_merged['smoker'].str.lower().str.strip()
df_merged['sex'] = df_merged['sex'].str.lower().str.strip()
df_merged['region'] = df_merged['region'].str.lower().str.strip()

# Create Age Group column
bins_age = [0, 25, 40, 55, 100]
labels_age = ['Young Adult', 'Adult', 'Middle Aged', 'Senior']
df_merged['age_group'] = pd.cut(df_merged['age'], bins=bins_age, labels=labels_age)

# Create BMI Category column
bins_bmi = [0, 18.5, 24.9, 29.9, 100]
labels_bmi = ['Underweight', 'Normal', 'Overweight', 'Obese']
df_merged['bmi_category'] = pd.cut(df_merged['bmi'], bins=bins_bmi, labels=labels_bmi)

df_merged.info()
df_merged.head()

### What all manipulations have you done and insights you found?

The dataset was first inspected to understand its structure, data types, and completeness. Missing values were counted per column and visualized with a heatmap, and duplicate records were removed to maintain data accuracy. The categorical variables smoker, sex, and region were standardized to trimmed, lowercase text to ensure consistency across the dataset. Feature engineering was performed by deriving age group and BMI category variables to enable better segmentation and analysis. The dataset was then validated to ensure logical value ranges and overall data quality.

From the analysis, it was observed that insurance charges are unevenly distributed, with a small group of individuals contributing to higher costs. Age plays a significant role, as insurance charges generally increase with increasing age. BMI also influences insurance costs, with higher BMI values associated with higher charges. Smoking status emerged as one of the most critical factors, with smokers incurring substantially higher insurance costs than non-smokers. In contrast, gender and number of dependents showed minimal impact on insurance charges. Overall, the manipulations helped uncover key risk drivers and supported meaningful insurance risk and cost analysis.
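
A range validation of the kind mentioned above could be sketched as follows. The bounds are assumptions chosen to mirror the outer bin edges in the wrangling code, the helper name `out_of_range_counts` is hypothetical, and the sample rows are invented so that two of them fail the check.

```python
import pandas as pd

# Hypothetical rows; two of them contain deliberately implausible values.
sample = pd.DataFrame({
    "age": [23, 45, 61, 130],
    "bmi": [21.4, 27.9, -3.0, 33.2],
    "charges_in_inr": [3200.0, 8700.0, 15400.0, 41000.0],
})

# Assumed plausible bounds (inclusive on both ends).
bounds = {"age": (0, 100), "bmi": (0, 100), "charges_in_inr": (0, float("inf"))}

def out_of_range_counts(df, limits):
    """Count values per column that fall outside [low, high]; NaN counts as outside."""
    return {col: int((~df[col].between(low, high)).sum())
            for col, (low, high) in limits.items()}

problems = out_of_range_counts(sample, bounds)
print(problems)
```

Running the same helper on `df_merged` would flag any column whose values fall outside the assumed limits before the charts are drawn.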

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# **Gender vs Insurance Charges (Histogram with Mean)**

In [None]:
# Separate data by gender
male_charges = df_merged[df_merged['sex'] == 'male']['charges_in_inr']
female_charges = df_merged[df_merged['sex'] == 'female']['charges_in_inr']

# Calculate overall mean
mean_charge = df_merged['charges_in_inr'].mean()

# Plot histograms
plt.figure()
plt.hist(male_charges, bins=30, alpha=0.6, label='Male')
plt.hist(female_charges, bins=30, alpha=0.6, label='Female')

# Plot mean line
plt.axvline(mean_charge, linestyle='--', label='Overall Mean')

plt.xlabel('Insurance Charges (INR)')
plt.ylabel('Frequency')
plt.title('Distribution of Insurance Charges by Gender')
plt.legend()
plt.show()

The chart shows that male and female insurance charge distributions largely overlap, indicating similar cost patterns. Both genders have most policyholders in the lower charge range, with a long right tail of high-cost cases. The overall mean lies within the common range for both genders, showing no extreme gender-based difference.

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Q - 1 Does the gender of the person matter for the company as a constraint for
#      extending policies?


import matplotlib.pyplot as plt

gender_charges = df_merged.groupby('sex')['charges_in_inr'].mean()

gender_charges.plot(kind='bar')
plt.xlabel('Gender')
plt.ylabel('Average Insurance Charges (INR)')
plt.title('Average Insurance Charges by Gender')
plt.show()

**Answer:-** The visualization shows that average insurance charges for males and females are very similar, with no extreme deviation. This indicates that gender alone is not a strong constraint for extending insurance policies. MediBuddy does not need gender-specific restrictions while issuing policies.

##### 1. Why did you pick the specific chart?

A bar chart comparing average insurance charges by gender was chosen because it directly shows whether there is a meaningful cost difference between male and female policyholders, which is essential to decide if gender should be treated as a constraint in policy extension.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that although there is a slight difference in average insurance charges between males and females, the difference is not significant. Both genders have comparable insurance costs, indicating that gender alone does not strongly influence insurance claims.
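
Whether a gap between two group means is meaningful can also be checked numerically. The sketch below runs a simple permutation test on synthetic, right-skewed charges. The two samples are generated from the same distribution purely for illustration, so the resulting numbers say nothing about the MediBuddy data. On the real notebook the two arrays would be the `male_charges` and `female_charges` series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, right-skewed charges for two groups (illustrative only).
male = rng.lognormal(mean=9.2, sigma=0.8, size=300)
female = rng.lognormal(mean=9.2, sigma=0.8, size=300)

observed = abs(male.mean() - female.mean())
pooled = np.concatenate([male, female])
n_male = len(male)

# Shuffle the group labels many times and count how often the shuffled
# gap in means is at least as large as the observed gap.
n_perm = 2000
hits = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    gap = abs(shuffled[:n_male].mean() - shuffled[n_male:].mean())
    hits += int(gap >= observed)

# Add-one correction keeps the estimate strictly above zero.
p_value = (hits + 1) / (n_perm + 1)
print(f"observed gap: {observed:.0f} INR, permutation p-value: {p_value:.3f}")
```

A large p-value would support the reading that the gender gap is within the range expected from chance alone.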

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
The insight supports fair and inclusive policy extension, allowing MediBuddy to avoid gender-based restrictions and focus on more relevant risk factors such as age, BMI, and smoking status.

Negative growth insight:
There is no indication of negative growth from this insight, as excluding or restricting policies based on gender would not meaningfully reduce risk and could limit customer acquisition.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Q -2 What is the average amount of money the company spent over each policy
# cover?


average_charge = df_merged['charges_in_inr'].mean()

plt.hist(df_merged['charges_in_inr'], bins=30)
# Dashed reference line marking the average charge per policy
plt.axvline(average_charge, color='red', linestyle='--',
            label=f'Average: {average_charge:,.0f} INR')
plt.xlabel('Insurance Charges (INR)')
plt.ylabel('Number of Policies')
plt.title('Distribution of Insurance Charges with Average Line')
plt.legend()
plt.show()

# Display the exact mean as the cell output
average_charge

**Answer:-** The histogram with the average line shows that the company spends an average amount of approximately ₹13,000 per policy cover. The distribution is right-skewed, indicating that a few high-cost claims significantly increase the overall average spending.

##### 1. Why did you pick the specific chart?

A histogram with an average reference line was chosen because it effectively shows both the distribution of insurance charges and the central tendency (average spending). This helps understand not just the average value, but also how individual policy costs are spread around it.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most insurance charges are concentrated at lower values, while a smaller number of policies incur very high costs. This results in a right-skewed distribution, where a few high-cost claims increase the overall average amount spent per policy.
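
The pull of a few large claims on the average can be made concrete by comparing the mean with the median. A minimal sketch, using ten invented charge values rather than the project data:

```python
import pandas as pd

# Hypothetical charges: many small values and two very large ones.
charges = pd.Series([2500, 3100, 3800, 4200, 5000, 5600, 6400, 7900, 38000, 52000],
                    dtype="float64")

mean_charge = charges.mean()
median_charge = charges.median()
skewness = charges.skew()  # positive values indicate a longer right tail

print(f"mean:   {mean_charge:,.0f}")
print(f"median: {median_charge:,.0f}")
print(f"skew:   {skewness:.2f}")
```

When the mean sits well above the median, as it does for these toy values, the median is often the more representative figure for a "typical" policy.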

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
Knowing the average spending per policy helps MediBuddy plan premium pricing, budgeting, and risk forecasting, ensuring financial sustainability.

Negative growth insight:
The presence of high-cost outliers indicates potential financial risk if such cases increase, which could negatively impact profitability if not managed through risk-based pricing or preventive health measures.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Q - 3 Could you advise if the company needs to offer separate policies based upon the
# geographic location of the person?

region_charges = df_merged.groupby('region')['charges_in_inr'].mean()

region_charges.plot(kind='bar')
plt.xlabel('Region')
plt.ylabel('Average Insurance Charges (INR)')
plt.title('Average Insurance Charges by Region')
plt.show()

**Answer:-** The visualization shows noticeable differences in average insurance charges across different regions. Since certain regions incur significantly higher costs than others, geographic location plays an important role in insurance expenditure. Therefore, MediBuddy should consider offering region-specific or customized policies to better manage risk and pricing.

##### 1. Why did you pick the specific chart?

A bar chart showing average insurance charges by region was chosen because it allows a clear comparison of insurance costs across different geographic locations. This makes it easier to identify whether location-based differences in charges exist.

##### 2. What is/are the insight(s) found from the chart?

The chart shows noticeable variation in average insurance charges across regions. Certain regions have higher average charges compared to others, indicating differences in healthcare costs, lifestyle patterns, or risk exposure based on geographic location.
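
A bar of averages hides how many policies sit behind each bar. The sketch below tabulates count, mean, and median per region and expresses each regional mean relative to the overall mean. The region names and charges are invented for illustration.

```python
import pandas as pd

# Hypothetical regions and charges.
toy = pd.DataFrame({
    "region": ["north", "north", "south", "south", "east", "east", "west", "west"],
    "charges_in_inr": [4000.0, 6000.0, 9000.0, 15000.0, 5000.0, 7000.0, 3000.0, 5000.0],
})

summary = toy.groupby("region")["charges_in_inr"].agg(["count", "mean", "median"])

# Percentage gap between each regional mean and the overall mean.
overall = toy["charges_in_inr"].mean()
summary["pct_vs_overall"] = (summary["mean"] / overall - 1) * 100

print(summary.round(1))
```

Small counts or a large gap between mean and median within a region would suggest that its average is driven by a few policies rather than by location itself.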

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
This insight helps MediBuddy consider region-specific pricing or customized policy offerings, leading to better risk management and more accurate premium calculation.

Negative growth insight:
If regional differences are ignored, the company may underprice high-risk regions or overprice low-risk regions, which could negatively affect profitability or reduce competitiveness in certain areas.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Q- 4 Does the no. of dependents make a difference in the amount claimed?

df_merged.boxplot(column='charges_in_inr', by='children')

plt.xlabel('Number of Children')
plt.ylabel('Insurance Charges (INR)')
plt.title('Insurance Charges by Number of Dependents')
plt.suptitle('')
plt.show()

**Answer:-** The box plot shows that insurance charges tend to increase slightly with the number of dependents; however, the variation across different groups overlaps significantly. This suggests that the number of dependents has only a marginal impact on the amount claimed and is not a strong determining factor compared to other variables such as age, smoking status, or BMI.

##### 1. Why did you pick the specific chart?

A box plot of insurance charges against the number of dependents was chosen because it effectively shows the distribution, median, and variability of claim amounts for each dependent group, making it easier to compare differences across groups.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that insurance charges vary across different numbers of dependents, but there is significant overlap in the distributions. While there is a slight increase in charges for higher numbers of dependents, the difference is not substantial, indicating a weak relationship between dependents and claim amount.
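
The strength of a monotonic relationship such as this one can be summarized with a rank correlation. The sketch below computes Spearman's rho as the Pearson correlation of the ranks, which avoids any extra dependency. The eight rows are invented and are not meant to reproduce the dataset's pattern.

```python
import pandas as pd

# Hypothetical dependents and charges.
toy = pd.DataFrame({
    "children": [0, 0, 1, 1, 2, 2, 3, 3],
    "charges_in_inr": [5200.0, 21000.0, 4800.0, 9900.0, 6100.0, 30500.0, 7400.0, 8800.0],
})

# Spearman's rho: Pearson correlation of the ranks (ties receive average ranks).
rho = toy["children"].rank().corr(toy["charges_in_inr"].rank())

# Median charge per number of dependents, less sensitive to outliers than the mean.
medians = toy.groupby("children")["charges_in_inr"].median()

print(f"Spearman rho: {rho:.2f}")
print(medians)
```

A rho close to zero on the real data would back up the visual impression of a weak relationship.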

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
This insight suggests that MediBuddy does not need to heavily penalize customers based on the number of dependents, allowing more inclusive policy offerings without significantly increasing risk.

Negative growth insight:
Since dependents do not strongly influence claim amounts, relying on this factor for premium differentiation could limit competitiveness without providing meaningful risk reduction.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Q - 5 Does a study of a person's BMI give the company any idea of the insurance claim
# that it would extend?

plt.scatter(df_merged['bmi'], df_merged['charges_in_inr'])
plt.xlabel('BMI')
plt.ylabel('Insurance Charges (INR)')
plt.title('BMI Vs Insurance Charges')
plt.show()

**Answer:-** The scatter plot shows that insurance charges tend to increase with higher BMI values. Individuals with higher BMI exhibit a wider spread and higher upper range of insurance claims. This suggests that BMI is a useful health indicator and provides the company with valuable insight into potential insurance claim amounts.

##### 1. Why did you pick the specific chart?

A scatter plot between BMI and insurance charges was chosen because both variables are continuous, and this chart type helps identify whether changes in BMI are associated with changes in insurance claim amounts.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that insurance charges tend to increase as BMI increases, with higher BMI values associated with a wider spread and higher upper range of charges. This indicates that individuals with higher BMI generally pose a higher insurance risk.
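
The upward tendency seen in a scatter plot can be quantified with a fitted line and a correlation coefficient. A minimal sketch with six invented (BMI, charges) pairs:

```python
import numpy as np

# Hypothetical BMI values and charges.
bmi = np.array([18.0, 22.0, 26.0, 30.0, 34.0, 38.0])
charges = np.array([5100.0, 7300.0, 8000.0, 11200.0, 10900.0, 14500.0])

# Least-squares straight line: charges ~ slope * bmi + intercept.
slope, intercept = np.polyfit(bmi, charges, deg=1)

# Pearson correlation between the two variables.
r = np.corrcoef(bmi, charges)[0, 1]

print(f"estimated change per BMI point: {slope:,.0f} INR")
print(f"Pearson r: {r:.2f}")
```

On real data the correlation is usually far weaker than in this tidy toy example, so the slope should be read together with `r` and the scatter itself.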

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
This insight helps MediBuddy use BMI as a health risk indicator to improve risk assessment, premium pricing, and preventive health initiatives.

Negative growth insight:
Overemphasizing BMI without considering other factors could discourage certain customers, potentially limiting customer acquisition if not balanced with holistic risk evaluation.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Q - 6 Is it needed for the company to understand whether the person covered is a
# smoker or a non-smoker?

df_merged.boxplot(column = 'charges_in_inr', by='smoker')

plt.xlabel('Smoker')
plt.ylabel('Insurance Charges (INR)')
plt.title('Insurance Charges: Smoker Vs Non-Smoker')
plt.suptitle('')
plt.show()

**Answer:-** The box plot clearly shows that smokers incur significantly higher insurance charges compared to non-smokers. The median and overall distribution of claims for smokers are much higher, indicating increased health risk. Therefore, it is essential for the company to understand whether a person is a smoker or non-smoker while extending insurance policies and determining premiums.

##### 1. Why did you pick the specific chart?

A box plot comparing insurance charges for smokers and non-smokers was chosen because it clearly shows differences in median, spread, and extreme values between the two groups, making it ideal for assessing risk associated with smoking status.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that smokers incur significantly higher insurance charges than non-smokers. The entire distribution of charges for smokers is shifted upward, with higher medians and more extreme high-cost claims.
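
The quantities a box plot draws (median and interquartile range) can be tabulated directly. In the sketch below the `yes`/`no` labels and all charge values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical smoker labels and charges.
toy = pd.DataFrame({
    "smoker": ["yes", "yes", "yes", "no", "no", "no", "no"],
    "charges_in_inr": [31000.0, 36000.0, 42000.0, 4000.0, 6000.0, 7000.0, 9000.0],
})

grouped = toy.groupby("smoker")["charges_in_inr"]
medians = grouped.median()
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)  # height of each box

# How many times larger the smoker median is than the non-smoker median.
ratio = medians["yes"] / medians["no"]

print(pd.DataFrame({"median": medians, "iqr": iqr}))
print(f"median ratio (smoker / non-smoker): {ratio:.1f}x")
```

Reporting the median ratio next to the chart gives a single figure that can feed into risk-based pricing discussions.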

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
This insight allows MediBuddy to incorporate smoking status into risk-based pricing, underwriting decisions, and preventive health programs, leading to better cost control.

Negative growth insight:
Higher premiums for smokers may reduce policy uptake among this group, but ignoring smoking status would increase financial risk due to consistently higher claims.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Q - 7 Does age have any barrier on the insurance claimed?

plt.scatter(df_merged['age'], df_merged['charges_in_inr'])
plt.xlabel('Age')
plt.ylabel('Insurance Charges (INR)')
plt.title('Age Vs Insurance Charges')
plt.show()

**Answer:-** The scatter plot indicates that insurance charges increase as age increases. Older individuals tend to have higher and more variable insurance claims compared to younger individuals. This suggests that age is a significant factor, and the higher expected claims of older policyholders must be considered while extending policies.

##### 1. Why did you pick the specific chart?

A scatter plot between age and insurance charges was chosen because it effectively shows how insurance claims vary across different ages and helps identify whether increasing age leads to higher insurance charges.

##### 2. What is/are the insight(s) found from the chart?

The chart shows a clear upward trend, where insurance charges increase as age increases. Older individuals tend to have higher and more variable insurance claims compared to younger individuals.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
This insight enables MediBuddy to apply age-based risk assessment and premium structuring, improving pricing accuracy and financial sustainability.

Negative growth insight:
Higher insurance costs for older individuals may reduce policy uptake in senior age groups if premiums are not carefully balanced with affordability and coverage benefits.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Q - 8 Can the company extend certain discounts after checking the health status
# (BMI) in this case?

bmi_cat_charges = df_merged.groupby('bmi_category', observed=True)['charges_in_inr'].mean()

bmi_cat_charges.plot(kind='bar')
plt.xlabel('BMI Category')
plt.ylabel('Average Insurance Charges (INR)')
plt.title('Average Charges by BMI Category')
plt.show()

**Answer:-** The bar chart shows that individuals with normal and underweight BMI categories incur lower average insurance charges compared to overweight and obese individuals. This indicates that healthier individuals pose lower risk to the company. Therefore, MediBuddy can extend certain discounts or incentives to customers with healthier BMI levels, encouraging preventive health behavior and reducing future claim costs.

##### 1. Why did you pick the specific chart?

A bar chart showing average insurance charges across different BMI categories was chosen because it allows a clear comparison of insurance costs based on health status, making it suitable to evaluate whether healthier individuals incur lower claims.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that individuals with lower BMI levels incur lower average insurance charges, while those with higher BMI levels have significantly higher charges. This indicates a strong relationship between health status and insurance cost.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
This insight supports the idea of offering discounts or incentives to healthier individuals, encouraging preventive healthcare and reducing long-term claim costs.

Negative growth insight:
Higher premiums for individuals with higher BMI could discourage some customers from purchasing insurance if not balanced with wellness support or health improvement programs.

# **Univariate Analysis**

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Distribution of Age Group

import matplotlib.pyplot as plt

age_group_counts = df_merged['age_group'].value_counts().sort_index()

plt.figure()
age_group_counts.plot(kind='bar')
plt.xlabel('Age Group')
plt.ylabel('Number of Policyholders')
plt.title('Distribution of Policyholders by Age Group')
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart was chosen to understand how policyholders are distributed across different age segments. It is appropriate because it clearly compares the number of customers in each age group, helping identify which age segments dominate the customer base.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that middle-aged and adult policyholders form the largest portion of the insured population, followed by young adults. The senior age group has the lowest number of policyholders, indicating lower participation from older individuals.
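
Bar heights are easier to compare when they are also expressed as shares of the total. A minimal sketch using ten invented ages and the same age bins as the wrangling step:

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([19, 23, 28, 33, 37, 42, 46, 51, 54, 60])

age_group = pd.cut(ages, bins=[0, 25, 40, 55, 100],
                   labels=["Young Adult", "Adult", "Middle Aged", "Senior"])

# Percentage of policyholders in each band, kept in band order.
shares = age_group.value_counts(normalize=True).sort_index() * 100

print(shares.round(1))
```

On the notebook's data the equivalent call would be `df_merged['age_group'].value_counts(normalize=True)`.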

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
The insight helps the company focus its offerings, marketing, and retention strategies on adult and middle-aged customers, who contribute most to the policy count.

Negative growth insight:
The low representation of senior policyholders suggests potential growth limitations in this segment, possibly due to higher premiums or stricter eligibility, which may require tailored products to improve inclusion.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Distribution of Insurance Charges

plt.figure()
plt.violinplot(df_merged['charges_in_inr'], showmeans=True)
plt.ylabel('Insurance Charges (INR)')
plt.title('Distribution of Insurance Charges')
plt.show()

##### 1. Why did you pick the specific chart?

A violin plot was chosen to understand the overall distribution and density of insurance charges. Unlike a simple histogram, this chart shows the spread, concentration, and skewness of the data, helping to identify how charges are distributed across policyholders.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most insurance charges are concentrated at lower values, while there is a long upper tail representing a smaller number of high-cost claims. This indicates a right-skewed distribution, where a few expensive policies significantly increase overall costs.
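
How strongly total cost is concentrated in the upper tail can be estimated from the share of charges held by the top decile. The ten values below are invented for illustration.

```python
import pandas as pd

# Hypothetical charges with a long upper tail.
charges = pd.Series([2000, 2600, 3000, 3500, 4100, 4800, 5500, 7000, 24000, 47500],
                    dtype="float64")

# 90th percentile (linear interpolation between neighbouring values).
cutoff = charges.quantile(0.90)

# Share of all charges that comes from policies at or above the cutoff.
top_share = charges[charges >= cutoff].sum() / charges.sum()

print(f"90th percentile: {cutoff:,.0f} INR")
print(f"share of total charges from the top decile: {top_share:.1%}")
```

A high share on the real data would quantify the financial exposure that the long tail of the violin plot suggests.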

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
This insight helps MediBuddy recognize that the majority of customers are low-cost, allowing the company to design affordable base policies while managing high-risk cases separately.

Negative growth insight:
The presence of high-cost outliers poses a financial risk, as an increase in such cases could negatively affect profitability if premiums are not aligned with risk levels.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Distribution of BMI

import seaborn as sns

plt.figure()
sns.kdeplot(df_merged['bmi'], fill=True)
plt.xlabel('BMI')
plt.ylabel('Density')
plt.title('BMI Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

A KDE (density) plot was chosen to understand the overall distribution pattern of BMI values among policyholders. This chart helps visualize where BMI values are most concentrated and whether the distribution is skewed, without the noise of bar boundaries.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most BMI values are concentrated in the normal to overweight range, with fewer individuals at very low or very high BMI levels. The distribution is slightly right-skewed, indicating the presence of some high-BMI individuals.
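
To put numbers on "normal to overweight", BMI values can be bucketed and counted. This sketch assumes the widely used WHO cut-offs (18.5, 25, and 30); the boundaries behind the notebook's own `bmi_category` column are not shown in this section and may differ. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values (illustrative only)
bmi = pd.Series([17.9, 21.4, 24.9, 25.0, 27.3, 29.9, 30.0, 33.8, 41.2], name="bmi")

# Assumed WHO-style boundaries; intervals are closed on the left, e.g. [25, 30)
bins = [0, 18.5, 25, 30, np.inf]
labels = ["Underweight", "Normal", "Overweight", "Obese"]
bmi_category = pd.cut(bmi, bins=bins, labels=labels, right=False)

# Proportion of individuals in each category, in a fixed display order
share = bmi_category.astype(str).value_counts(normalize=True).reindex(labels)
print(share)
```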

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
This insight helps the company understand that most customers fall within a moderate health-risk range, enabling stable pricing and targeted wellness initiatives.

Negative growth insight:
The presence of high-BMI individuals indicates higher potential claim risk, which could increase costs if preventive health measures or risk-based pricing are not applied.

# **Bivariate Analysis**

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Average Insurance Charges by Age Group

age_group_avg = df_merged.groupby('age_group', observed=False)['charges_in_inr'].mean()

plt.figure()
plt.plot(age_group_avg.index, age_group_avg.values, marker='o')
plt.xlabel('Age Group')
plt.ylabel('Average Insurance Charges (INR)')
plt.title('Average Insurance Charges by Age Group')
plt.show()

##### 1. Why did you pick the specific chart?

A line chart was chosen to clearly show the trend of average insurance charges across age groups. This chart type is effective for ordered categories like age groups and helps visualize how insurance costs change progressively with age.

##### 2. What is/are the insight(s) found from the chart?

The chart shows a steady and substantial increase in average insurance charges from young adults to seniors. Young adults incur the lowest average charges, while seniors incur the highest, indicating that insurance risk increases with age.
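
A line chart over age groups only reads correctly if the groups appear in age order. One way to guarantee this, sketched here with hypothetical labels and values, is to store the column as an ordered categorical so that `groupby` returns the groups in that order. The label names are assumptions; the notebook's actual `age_group` labels may be spelled differently.

```python
import pandas as pd

# Hypothetical labels and charges (illustrative only)
order = ["Young Adult", "Adult", "Middle-Aged", "Senior"]
df = pd.DataFrame({
    "age_group": ["Senior", "Adult", "Young Adult", "Middle-Aged", "Adult", "Senior"],
    "charges_in_inr": [90000, 30000, 15000, 55000, 34000, 110000],
})

# An ordered categorical makes groupby follow age order, not alphabetical order
df["age_group"] = pd.Categorical(df["age_group"], categories=order, ordered=True)
avg = df.groupby("age_group", observed=False)["charges_in_inr"].mean()

# True when each age group's average is at least as high as the previous one
increasing = avg.is_monotonic_increasing
print(avg)
print("Averages rise with age:", increasing)
```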

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
This insight supports age-based risk assessment and helps the company design appropriate premium structures and coverage plans for different age segments.

Negative growth insight:
Higher insurance costs for senior customers may reduce policy adoption in this group if premiums become unaffordable, potentially limiting growth unless flexible or subsidized plans are introduced.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Smoker Vs Non-Smoker Proportion by Age Group

smoker_age = pd.crosstab(df_merged['age_group'], df_merged['smoker'], normalize='index')

smoker_age.plot(kind='bar', stacked=True)
plt.xlabel('Age Group')
plt.ylabel('Proportion')
plt.title('Proportion of Smokers and Non-Smokers by Age Group')
plt.legend(title='Smoker')
plt.show()

##### 1. Why did you pick the specific chart?

A stacked bar chart was chosen to compare the composition of smokers and non-smokers within each age group. This chart clearly shows proportional differences and helps understand how smoking behavior varies across age segments.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that non-smokers form the majority across all age groups, while smokers represent a smaller but consistent proportion. The senior age group shows a slightly higher proportion of smokers compared to younger age groups, indicating varying lifestyle risk across ages.
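
The proportions behind a stacked bar chart of this kind come from row-normalizing a cross-tabulation. The small sketch below uses made-up records and assumes the `smoker` column holds the strings `"yes"` and `"no"`; it shows that `normalize='index'` makes each age group's row sum to 1.

```python
import pandas as pd

# Hypothetical records (illustrative only)
df = pd.DataFrame({
    "age_group": ["Adult"] * 4 + ["Senior"] * 5,
    "smoker": ["no", "no", "no", "yes", "no", "no", "no", "yes", "yes"],
})

# Each row is divided by its own total, giving within-group proportions
props = pd.crosstab(df["age_group"], df["smoker"], normalize="index")
print(props)
```

Because every bar has the same height, this view compares composition only; it says nothing about how many policyholders each age group contains.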

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:
This insight helps MediBuddy combine age and smoking status for more accurate risk profiling and premium calculation, improving underwriting decisions.

Negative growth insight:
A higher proportion of smokers in older age groups may increase claim risk and costs, potentially affecting profitability if not managed through appropriate pricing or preventive health programs.

#### Chart - 14

In [None]:
# Chart - 14 visualization code
# Insurance Charges vs Number of Children

plt.figure()
sns.stripplot(x='children', y='charges_in_inr', data=df_merged, jitter=True)
plt.xlabel('Number of Children')
plt.ylabel('Insurance Charges (INR)')
plt.title('Insurance Charges vs Number of Children')
plt.show()

##### 1. Why did you pick the specific chart?

A strip plot was chosen to visualize individual insurance charge values across different numbers of children. This chart helps observe data spread, clustering, and overlap between groups more clearly than summary-based charts.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that insurance charges are widely spread for every number of children, with no clear increasing or decreasing trend as the number of children rises. High-cost policies appear among individuals with few children as well as many, indicating a weak relationship between dependents and insurance charges.
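
A weak relationship like this can be quantified with a rank correlation, which suits a small integer variable such as the number of children. The sketch below uses hypothetical values and computes Spearman's coefficient as the Pearson correlation of the ranks, alongside the median charge per group.

```python
import pandas as pd

# Hypothetical records (illustrative only)
df = pd.DataFrame({
    "children": [0, 0, 1, 1, 2, 2, 3, 3],
    "charges_in_inr": [20000, 90000, 25000, 80000, 30000, 85000, 22000, 95000],
})

# Median charge for each number of children
median_by_children = df.groupby("children")["charges_in_inr"].median()

# Spearman correlation = Pearson correlation computed on ranks
spearman = df["children"].rank().corr(df["charges_in_inr"].rank())

print(median_by_children)
print(f"Spearman correlation: {spearman:.2f}")
```

A coefficient close to zero, together with similar medians across groups, would support the reading of the strip plot.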

# **Multivariate Analysis**

#### Chart - 15

In [None]:
# Chart - 15 visualization code
# Average Insurance Charges by BMI Category and Smoking Status

bmi_smoker_avg = df_merged.groupby(['bmi_category', 'smoker'], observed=True)['charges_in_inr'].mean().unstack()

bmi_smoker_avg.plot(kind='bar')
plt.xlabel('BMI Category')
plt.ylabel('Average Insurance Charges (INR)')
plt.title('Average Insurance Charges by BMI Category and Smoking Status')
plt.legend(title='Smoker')
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart was chosen to compare average insurance charges across BMI categories while simultaneously differentiating between smokers and non-smokers. This chart clearly highlights the combined effect of health status and lifestyle behavior on insurance costs.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that smokers have significantly higher insurance charges than non-smokers across all BMI categories. Among all groups, obese smokers incur the highest average charges, indicating that the combination of high BMI and smoking greatly increases insurance risk.
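
The gap between smokers and non-smokers can be expressed as a ratio of average charges within each BMI category. This is a sketch on hypothetical values, assuming `smoker` is coded as `"yes"`/`"no"`; the numbers are not results from the project data.

```python
import pandas as pd

# Hypothetical records (illustrative only)
df = pd.DataFrame({
    "bmi_category": ["Normal"] * 4 + ["Obese"] * 4,
    "smoker": ["no", "no", "yes", "yes"] * 2,
    "charges_in_inr": [20000, 30000, 70000, 80000, 30000, 40000, 140000, 170000],
})

# Average charges per (BMI category, smoking status), smokers as columns
avg = df.groupby(["bmi_category", "smoker"])["charges_in_inr"].mean().unstack()

# How many times higher the smokers' average is within each BMI category
avg["ratio"] = avg["yes"] / avg["no"]
print(avg)
```

A ratio that grows from one BMI category to the next would indicate that smoking and high BMI compound each other rather than simply adding up.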

#### Chart - 16

In [None]:
# Chart - 16 visualization code
# Insurance Charges by Age Group and Smoking Status

plt.figure()
sns.boxplot(x='age_group', y='charges_in_inr', hue='smoker', data=df_merged)
plt.xlabel('Age Group')
plt.ylabel('Insurance Charges (INR)')
plt.title('Insurance Charges by Age Group and Smoking Status')
plt.show()

##### 1. Why did you pick the specific chart?

A grouped box plot was chosen to compare the distribution of insurance charges across age groups while distinguishing between smokers and non-smokers. This chart clearly shows differences in median, spread, and outliers for each combination of age and smoking status.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that smokers incur significantly higher insurance charges than non-smokers in every age group. Insurance charges increase with age for both groups, but the increase is much steeper for smokers. Senior smokers exhibit the highest median charges and greatest variability, indicating the highest risk.
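
The quantities a box plot displays (median and interquartile range) can also be tabulated per group. The following sketch uses hypothetical values and assumed `"yes"`/`"no"` smoker codes to show one way of doing that.

```python
import pandas as pd

# Hypothetical records (illustrative only)
df = pd.DataFrame({
    "age_group": ["Adult"] * 6 + ["Senior"] * 6,
    "smoker": (["no"] * 3 + ["yes"] * 3) * 2,
    "charges_in_inr": [
        20000, 25000, 30000, 60000, 80000, 100000,
        40000, 50000, 60000, 120000, 160000, 220000,
    ],
})

grouped = df.groupby(["age_group", "smoker"])["charges_in_inr"]

# Median = box centre line; IQR = box height (75th minus 25th percentile)
summary = pd.DataFrame({
    "median": grouped.median(),
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
})
print(summary)
```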

#### Chart - 17

In [None]:
# Chart - 17 visualization code
# Heatmap of Average Insurance Charges by Age Group and BMI Category

pivot_table = df_merged.pivot_table(
    values='charges_in_inr',
    index='age_group',
    columns='bmi_category',
    aggfunc='mean',
    observed=True
)

plt.figure()
sns.heatmap(pivot_table, annot=True, fmt='.0f')
plt.title('Average Insurance Charges by Age Group and BMI Category')
plt.xlabel('BMI Category')
plt.ylabel('Age Group')
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap was chosen to analyze the combined impact of age and BMI on average insurance charges. This chart is effective because it visually highlights cost intensity across different combinations using color gradients, making comparison easy and intuitive.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows that insurance charges increase with both age and BMI. Younger individuals with lower BMI incur the lowest charges, while older individuals with higher BMI, especially senior and obese policyholders, incur the highest insurance costs. The progression across both dimensions indicates a strong combined effect on insurance claims.
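
Cell averages in a heatmap are only as reliable as the number of policyholders behind them, so it can help to build a matching table of counts. This is a sketch on hypothetical records; it is not part of the original notebook.

```python
import pandas as pd

# Hypothetical records (illustrative only)
df = pd.DataFrame({
    "age_group": ["Adult", "Adult", "Adult", "Senior", "Senior", "Senior"],
    "bmi_category": ["Normal", "Obese", "Obese", "Normal", "Normal", "Obese"],
    "charges_in_inr": [20000, 40000, 60000, 50000, 70000, 150000],
})

# Mean charges per (age group, BMI category): the values a heatmap would colour
mean_table = df.pivot_table(
    values="charges_in_inr", index="age_group",
    columns="bmi_category", aggfunc="mean",
)

# Number of records behind each mean; small counts call for cautious reading
count_table = df.pivot_table(
    values="charges_in_inr", index="age_group",
    columns="bmi_category", aggfunc="count",
)

print(mean_table)
print(count_table)
```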

#### Chart - 18

In [None]:
# Chart - 18 visualization code
# Bubble Chart: Age vs BMI with Insurance Charges & Smoking Status

plt.figure()
sns.scatterplot(
    data=df_merged,
    x='age',
    y='bmi',
    size='charges_in_inr',
    hue='smoker',
    sizes=(20, 300),
    alpha=0.6
)
plt.xlabel('Age')
plt.ylabel('BMI')
plt.title('Age vs BMI with Insurance Charges and Smoking Status')
plt.show()

##### 1. Why did you pick the specific chart?

A bubble chart was chosen because it allows several variables to be analyzed at once: age on the x-axis, BMI on the y-axis, insurance charges as bubble size, and smoking status as color. This chart is effective in showing how these factors interact and jointly influence insurance costs.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that larger bubbles (higher insurance charges) are mainly associated with smokers, especially at higher age and higher BMI values. Non-smokers generally have smaller bubbles, even as age and BMI increase. This indicates that smoking significantly amplifies the combined effect of age and BMI on insurance charges.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain briefly.

To achieve the business objective, MediBuddy should adopt a risk-based and data-driven insurance strategy. The company should prioritize age, BMI, and smoking status as primary factors in underwriting and premium pricing, as these variables have the strongest influence on insurance charges. Implementing differentiated premium structures based on these factors will improve pricing accuracy and cost control.

MediBuddy should also introduce preventive healthcare and wellness programs, such as BMI management and smoking cessation initiatives, to reduce long-term claim costs while improving customer health outcomes. Because gender and number of dependents showed little effect on charges, they should not be weighted heavily in pricing, which also helps keep policy offerings fair and inclusive.

Additionally, the company can design targeted insurance products for different risk segments and continuously monitor high-cost customer profiles to manage risk effectively. These steps will help MediBuddy balance profitability, customer satisfaction, and sustainable business growth.
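
As an illustration of how risk segments might be derived from the three primary factors, the sketch below counts risk factors per policyholder and compares average charges by that count. The thresholds (age 50 and above, BMI 30 and above), the `"yes"`/`"no"` smoker coding, and all values are assumptions for illustration, not rules taken from this analysis.

```python
import pandas as pd

# Hypothetical policyholders (illustrative only)
df = pd.DataFrame({
    "age": [25, 32, 55, 61, 45, 58],
    "bmi": [22.0, 31.5, 27.0, 33.0, 24.0, 35.2],
    "smoker": ["no", "no", "no", "yes", "yes", "yes"],
    "charges_in_inr": [15000, 25000, 40000, 210000, 90000, 240000],
})

# Assumed thresholds: each condition that holds adds one risk factor
df["risk_factors"] = (
    (df["age"] >= 50).astype(int)
    + (df["bmi"] >= 30).astype(int)
    + (df["smoker"] == "yes").astype(int)
)

# Average charges for each risk-factor count
avg_by_risk = df.groupby("risk_factors")["charges_in_inr"].mean()
print(avg_by_risk)
```

An equal-weight count is the simplest possible choice; since smoking showed the strongest effect in the charts, a real segmentation would likely weight the factors differently.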

# **Conclusion**

This project analyzed MediBuddy insurance data to understand the key drivers of insurance charges. The analysis revealed that age, BMI, and smoking status are strongly associated with insurance costs: charges rise with age and BMI, and smokers incur markedly higher charges than non-smokers. Customers with multiple risk factors contribute disproportionately to overall insurance expenditure. In contrast, gender and number of dependents showed minimal impact on insurance charges. These insights support a risk-based pricing approach and encourage preventive health initiatives. Overall, the project enables data-driven decision-making for sustainable growth and improved risk management.
