Title

Factors affecting Medical Insurance Costs.


Overview

Health care costs vary across the United States from one individual to another because of demographic and behavioral factors. This project explores, through visualization, how insurance charges are distributed and which factors matter most for assessing risk and setting prices.

Source

Medical Cost Personal Dataset. It is a publicly available dataset from Kaggle for academic use.

Key Attributes

This dataset consists of the following variables:

Continuous

•	Age – Gives the age of the person

•	BMI – Body Mass Index of the person

•	Charges – Yearly Medical Insurance Cost

Categorical

•	Sex – Gender of the person

•	Smoker – Smoking status of the person

•	Region – Geographical region within the USA

Discrete Numeric

•	Children – Specifies the number of dependents in the plan



In [None]:
import pandas as pd

url = "https://raw.githubusercontent.com/nalinichandrasekhar/Factors-Affecting-Medical-Insurance-Costs/main/insurance.csv"

df = pd.read_csv(url)

df.head()

In [None]:
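# Print the shape explicitly: a notebook cell only displays its last expression
print(df.shape)  # (rows, columns)
df.info()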

Distribution of the Medical Insurance Charges

In [None]:
import altair as alt

alt.Chart(df).mark_bar().encode(
    x=alt.X("charges:Q", bin=alt.Bin(maxbins=30), title="Medical Insurance Charges"),
    y=alt.Y("count()", title="Count")
).properties(
    title="Distribution of Medical Insurance Charges"
)

Observation

The x-axis shows charges, and the y-axis shows the count of individuals whose charges fall into each bin.

The distribution is strongly right-skewed: the majority of individuals fall in the lower charge ranges.

Interpretation

The distribution of medical insurance charges is right-skewed (positively skewed): most individuals incur low to moderate costs, while a smaller subset incurs significantly higher expenses.
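The skew seen in the histogram can also be checked numerically. The sketch below is only an illustration: the `charges` values are made up so the cell runs offline, and `skew_summary` is a hypothetical helper name. In the notebook, `df["charges"]` would be passed instead. A mean well above the median and a positive value from `Series.skew()` both point to a long right tail.

```python
import pandas as pd

# Hypothetical toy charges (NOT taken from the Kaggle file):
# many low values and a few very high ones.
charges = pd.Series(
    [1200, 1800, 2500, 3100, 4000, 4700, 5200, 6100, 38000, 47000],
    name="charges",
)

def skew_summary(s: pd.Series) -> dict:
    """Return the mean, median, and sample skewness of a numeric Series."""
    return {"mean": s.mean(), "median": s.median(), "skew": s.skew()}

summary = skew_summary(charges)
print(summary)
```

With the real data, the same call would be `skew_summary(df["charges"])`.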


Bivariate Analysis

Smoking Vs Charges

In [None]:
import plotly.express as px

# Mean charges for each smoking status (one row per group)
avg_by_smoker = df.groupby("smoker", as_index=False)["charges"].mean()

fig = px.bar(avg_by_smoker,
             x="smoker",
             y="charges",
             color="smoker",
             title="Average Insurance Charges by Smoking Status")

fig.show()

The dataset is grouped into smokers and non-smokers, and the average medical insurance charge is calculated for each group.
The plot shows that smokers incur significantly higher average charges than non-smokers.

Sex Vs Charges

In [None]:
fig2 = px.histogram(df,
                    x="sex",
                    y="charges",
                    histfunc="avg",
                    color="sex",
                    title="Average Charges by Sex")
fig2.show()

The data is split by sex, and the average charges of the two groups are compared.

The difference in average insurance charges between males and females is small compared with other factors such as smoking status.

Region Vs Charges

In [None]:
fig3 = px.histogram(df,
                    x="region",
                    y="charges",
                    histfunc="avg",
                    color="region",
                    title="Average Charges by Region")
fig3.show()

Average charges are compared across regions. Regional differences exist but are moderate, and nowhere near the size of the smoker effect.
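One way to put the three bar charts on a common footing is to compute, for each categorical variable, the gap between its highest and lowest group mean. The sketch below is an assumption-laden illustration rather than part of the original analysis: the `toy` rows are invented so the cell runs without the CSV, and `mean_gap` is a hypothetical helper. The toy numbers carry no meaning; only the pattern of the calculation does.

```python
import pandas as pd

# Hypothetical toy rows (NOT the Kaggle data), shaped like the notebook's df.
toy = pd.DataFrame({
    "sex":    ["male", "female", "male", "female", "male", "female", "male", "female"],
    "smoker": ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "region": ["northeast", "northwest", "southeast", "southwest"] * 2,
    "charges": [32000, 4200, 5100, 29500, 6100, 3900, 41000, 7300],
})

def mean_gap(frame: pd.DataFrame, column: str, target: str = "charges") -> float:
    """Largest minus smallest group mean of `target` across the levels of `column`."""
    means = frame.groupby(column)[target].mean()
    return float(means.max() - means.min())

# One gap per categorical variable, largest first.
gaps = pd.Series(
    {col: mean_gap(toy, col) for col in ["smoker", "sex", "region"]}
).sort_values(ascending=False)
print(gaps)
```

Replacing `toy` with `df` would give the gaps for the real dataset.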

Correlation Strength

• Is the relationship between age and charges linear?

• Which factor is most strongly associated with higher insurance costs?

In [None]:
corr = df.corr(numeric_only=True)
corr

In [None]:
fig = px.imshow(corr,
                text_auto=True,
                color_continuous_scale="Spectral",
                title="Correlation Matrix")

fig.show()

Interpretation

Age: r = 0.29

This is a moderate positive correlation. Age is not a dominant linear predictor of insurance cost, but charges tend to increase as age increases.

BMI: r = 0.19

This is a weak positive correlation. Charges tend to increase with BMI, but not strongly.

Children: r = 0.06

This is a very weak correlation. The change in charges with the number of children is negligible.

Correlation analysis indicates that, among the numeric variables, age has the strongest linear association with charges, followed by BMI. The number of children shows minimal association. Smoking status is categorical, so it does not appear in this matrix, but the group averages above suggest a much stronger association with charges than age, BMI, or number of dependents.
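Because `df.corr(numeric_only=True)` skips text columns, smoking status is left out of the matrix. A possible way to compare it on the same scale is to encode it as a 0/1 indicator and correlate that with charges. The sketch below assumes the `smoker` column holds the strings `"yes"` and `"no"`; the `toy` rows are invented for illustration, and `corr_with_charges` and `smoker_flag` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows (NOT the Kaggle data), shaped like the notebook's df.
toy = pd.DataFrame({
    "age":     [19, 23, 31, 38, 45, 52, 58, 63],
    "smoker":  ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "charges": [17000, 2100, 4300, 39000, 8200, 10500, 47000, 14000],
})

def corr_with_charges(frame: pd.DataFrame) -> pd.Series:
    """Pearson r of each numeric column, plus a 0/1 smoker indicator, with charges."""
    # Assumption: smoker is coded as "yes"/"no"; other codings would map to NaN.
    encoded = frame.assign(smoker_flag=frame["smoker"].map({"yes": 1, "no": 0}))
    return encoded.corr(numeric_only=True)["charges"].drop("charges")

r = corr_with_charges(toy)
print(r.round(2))
```

Running `corr_with_charges(df)` on the real data would place smoking status alongside age, BMI, and children in a single ranking.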


Interaction effect

This section explores the relationship between age and charges for smokers and non-smokers.

In [None]:
fig = px.scatter(
    df,
    x="age",
    y="charges",
    color="smoker",
    opacity=0.50,
    trendline="ols",
    title="Interaction View: Age vs Charges (Colored by Smoking Status) with Trendlines"
)

fig.update_layout(
    xaxis_title="Age (years)",
    yaxis_title="Annual Medical Insurance Charges (USD)"
)

fig.show()

Scatter plots are used to examine the relationship between two numeric variables, in this case age and charges. The points are colored by smoking status.

The scatter plot with regression trendlines indicates that charges increase with age for both smokers and non-smokers, but smokers consistently show higher charges across all age groups.

The upward slopes show a positive relationship between age and cost, while the vertical separation between the trendlines shows that smoking status is associated with higher insurance charges.
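The slopes and the separation between the lines can be read off as numbers rather than judged by eye. Plotly's `trendline="ols"` fits one line per color group (and needs `statsmodels` installed); the sketch below reproduces that idea with `numpy.polyfit` as a stand-in, not as the method Plotly uses internally. The `toy` rows are invented, and `fit_lines` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (NOT the Kaggle data), shaped like the notebook's df.
toy = pd.DataFrame({
    "age":     [19, 23, 31, 38, 45, 52, 58, 63],
    "smoker":  ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "charges": [17000, 2100, 4300, 39000, 8200, 10500, 47000, 14000],
})

def fit_lines(frame, x="age", y="charges", by="smoker"):
    """Fit y = slope * x + intercept separately for each level of `by`."""
    rows = {}
    for level, group in frame.groupby(by):
        # Degree-1 least-squares fit; polyfit returns [slope, intercept].
        slope, intercept = np.polyfit(group[x], group[y], deg=1)
        rows[level] = {"slope": slope, "intercept": intercept}
    return pd.DataFrame(rows).T

lines = fit_lines(toy)
print(lines)

# Predicted charges at age 40 for each group, to show the vertical separation.
at_40 = lines["slope"] * 40 + lines["intercept"]
print(at_40)
```

Similar slopes with different intercepts would indicate a roughly constant smoker premium, while clearly different slopes would indicate that the age effect itself depends on smoking status.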

Final Interpretation:

• Most people have moderate medical insurance costs, but a small number of people have very high costs. This makes the data right-skewed.

• Smoking status has the strongest impact on insurance charges. Smokers pay much higher average costs than non-smokers.

• Age has a moderate relationship with insurance cost. In general, as age increases, medical charges also increase.

• BMI has a weak relationship with charges. Higher BMI is slightly linked to higher costs, but the effect is not very strong.

• The number of children has very little impact on insurance charges.

• When looking at age and smoking together, charges increase with age for both smokers and non-smokers.

• However, smokers consistently have higher charges at every age.

• This means smoking plays a major role in explaining differences in insurance costs in this dataset.