# Healthcare Insurance Analysis & Prediction

## Table of Contents


* [Introduction](#Introduction)
* [Data Loading](#Data-Loading)
* [Data Transformation](#Data-Transformation)
* [EDA (Exploratory Data Analysis)](#EDA-(Exploratory-Data-Analysis))
* [Model Prediction](#Model-Prediction)
* [Conclusion](#Conclusion)
* [Future Scope](#Future-Scope)


### **Introduction**

This notebook explores the factors that influence medical insurance costs: personal attributes (age, sex, BMI, number of children), lifestyle habits (smoking), and geographic region. We study how these variables relate to insurance charges, then use machine learning to build models that predict those charges. The findings may be useful to insurance providers and healthcare policymakers.

### **Data Loading**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

warnings.filterwarnings('ignore')


In [None]:
df = pd.read_csv(r'/kaggle/input/healthcare-insurance/insurance.csv')

In [None]:
df.head()

### **Data Transformation**

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
# Show duplicated records
df[df.duplicated()]

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
for item in df:
    print(df[item].unique())

In [None]:
# Creating BMI Status column based on BMI column
bmi_status = []

for item in df.bmi:
    if item < 16:
        bmi_status.append('Severe Thinness')
    elif item >= 16 and item < 17:
        bmi_status.append('Moderate Thinness')
    elif item >= 17 and item < 18.5:
        bmi_status.append('Mild Thinness')
    elif item >= 18.5 and item < 25:
        bmi_status.append('Normal weight')
    elif item >= 25 and item < 30:
        bmi_status.append('Overweight')
    elif item >= 30 and item < 35:
        bmi_status.append('Obese I')
    elif item >= 35 and item < 40:
        bmi_status.append('Obese II')
    elif item >= 40:  # >= so that a BMI of exactly 40 is not labelled 'Invalid'
        bmi_status.append('Obese III')
    else:
        bmi_status.append('Invalid')

df['bmi_status'] = bmi_status
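
The same binning can also be written without an explicit loop. The block below is a minimal sketch using `pd.cut` on toy values; the names `BMI_EDGES`, `BMI_LABELS` and `bmi_to_status` are hypothetical helpers introduced only for illustration. Note that this sketch assigns a BMI of exactly 40 to 'Obese III'.

```python
import numpy as np
import pandas as pd

# Bin edges and labels mirror the thresholds used in the loop above.
# right=False makes each bin left-closed: [16, 17), [17, 18.5), ...
BMI_EDGES = [-np.inf, 16, 17, 18.5, 25, 30, 35, 40, np.inf]
BMI_LABELS = [
    'Severe Thinness', 'Moderate Thinness', 'Mild Thinness', 'Normal weight',
    'Overweight', 'Obese I', 'Obese II', 'Obese III',
]

def bmi_to_status(bmi):
    """Map a Series of BMI values to category labels (hypothetical helper)."""
    return pd.cut(bmi, bins=BMI_EDGES, labels=BMI_LABELS, right=False).astype(str)

# Toy values, not taken from the dataset
toy_bmi = pd.Series([15.0, 18.5, 24.9, 30.0, 40.0])
toy_status = bmi_to_status(toy_bmi)
print(toy_status.tolist())
```

On the real data this would be applied as `df['bmi_status'] = bmi_to_status(df['bmi'])`.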

In [None]:
df['bmi_status'].unique()

In [None]:
df.head()

### **EDA (Exploratory Data Analysis)**

In [None]:
# Count of people as per BMI status -

bmi_status_data = df.groupby('bmi_status').size().reset_index(name='bmi_status_count')
print(bmi_status_data)
print()
plt.barh(bmi_status_data['bmi_status'], bmi_status_data['bmi_status_count'])
plt.xticks(rotation=90)
plt.show()

In [None]:
# Distribution of BMI range-
sns.histplot(df['bmi'], bins=7, kde=True)
plt.show()

In [None]:
# Distribution of Age
sns.histplot(df['age'], bins=5, kde=True)
plt.show()

In [None]:
# Region wise smoker
region_smoker = df.groupby(['region','smoker']).size().reset_index(name='count')
sns.barplot(x='region',y='count',data = region_smoker, hue='smoker')
plt.title('Region wise smoker')
plt.show()

In [None]:
# Region-wise people count by gender
region_sex = df.groupby(['region','sex']).size().reset_index(name='count')
sns.barplot(x='region',y='count',data = region_sex, hue='sex')
plt.title('Gender Count for Each Region')
plt.show()

In [None]:
# Regional BMI Status
region_bmi = df.groupby(['region','bmi_status']).size().reset_index(name='count')
sns.barplot(x='region',y='count',data = region_bmi, hue='bmi_status')
plt.title('Regional BMI Status')
plt.show()

In [None]:
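# Subset of people in the 'Normal weight' BMI category.
# (normal_bmi was not defined earlier; this definition is inferred from how it is used below.)
normal_bmi = df[df['bmi_status'] == 'Normal weight']
smoker = normal_bmi.groupby(['smoker']).size().reset_index(name='count')
smoker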

In [None]:
# Smoking habits of people with a normal BMI, split by sex
normal_bmi_smoker = normal_bmi.groupby(['smoker','sex']).size().reset_index(name='count')
sns.barplot(x='smoker',y='count',data=normal_bmi_smoker, hue='sex')
plt.title('Healthy People Smoking Habit')
plt.show()


In [None]:
# Number of children among people with a normal BMI
normal_bmi_child = normal_bmi.groupby(['children']).size().reset_index(name='count')
plt.pie(normal_bmi_child['count'], labels = normal_bmi_child['children'], autopct='%1.2f%%')
plt.legend(title='Children Count')
plt.title("Healthy People's Children Count")
plt.show()

In [None]:
# Charges by region
region_charges = df.groupby(['region'])['charges'].sum().reset_index()
plt.pie(region_charges['charges'], labels = region_charges['region'], autopct='%1.2f%%')
plt.legend(title='Region')
plt.title('Regionwise Insurance Charges')
plt.show()

In [None]:
# Smoker Sex Charges
smoke_charges = df.groupby(['smoker','sex'])['charges'].sum().reset_index()
sns.barplot(x='sex',y='charges',data=smoke_charges,hue='smoker')
plt.title('Insurance charges based on Gender & Smoking habits')
plt.legend(title='Smoker')
plt.show()

In [None]:
bmi_charges = df.groupby(['bmi_status'])['charges'].sum().reset_index()

sns.barplot(x='bmi_status',y='charges',data=bmi_charges)
plt.xticks(rotation=90)
# No plt.legend() here: the plot has no labelled series, so the call only emits a warning
plt.title('Insurance Charges by BMI Status')
plt.show()

In [None]:
df.groupby(['bmi_status'])['charges'].sum().reset_index()
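
A total per category mixes two effects: how many people fall in the category and how much each of them is charged. The sketch below, on made-up toy data rather than `insurance.csv`, shows one way to report the count and the mean next to the sum so the two effects can be told apart. The column names `total_charges`, `mean_charges` and `n_people` are illustrative choices.

```python
import pandas as pd

# Toy data (made up for illustration, not from insurance.csv)
toy = pd.DataFrame({
    'bmi_status': ['Obese I', 'Obese I', 'Obese I', 'Obese II'],
    'charges':    [1000.0,    2000.0,    3000.0,    5000.0],
})

# Named aggregation: one output column per statistic
summary = (
    toy.groupby('bmi_status')['charges']
       .agg(total_charges='sum', mean_charges='mean', n_people='count')
       .reset_index()
)
print(summary)
```

In this toy example 'Obese I' has the larger total only because it contains more people; its mean charge is lower than that of 'Obese II'.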

### **Model Prediction**

In [None]:
label_encoder = LabelEncoder()

df['sex'] = label_encoder.fit_transform(df['sex'])
df['smoker'] = label_encoder.fit_transform(df['smoker'])

In [None]:
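onehot_encoder = OneHotEncoder(sparse_output=False)
# index=df.index keeps rows aligned in the concat (df's index has a gap after drop_duplicates)
df_encoded = pd.DataFrame(onehot_encoder.fit_transform(df[['region']]),
                          columns=onehot_encoder.get_feature_names_out(["region"]), index=df.index)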

df = pd.concat([df, df_encoded], axis=1)
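
`pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below uses toy frames (all names and values are made up) to show what happens when one frame has a gap in its index, as `df` does after `drop_duplicates`, and the other has a fresh `RangeIndex`.

```python
import pandas as pd

# Toy frame whose index has a gap, as happens after drop_duplicates()
left = pd.DataFrame({'region': ['a', 'b', 'a']}, index=[0, 2, 3])

# Built without an index: gets a fresh RangeIndex 0, 1, 2
right_fresh = pd.DataFrame({'flag': [1.0, 0.0, 1.0]})
# Built with the original index: lines up row for row
right_aligned = pd.DataFrame({'flag': [1.0, 0.0, 1.0]}, index=left.index)

misaligned = pd.concat([left, right_fresh], axis=1)
aligned = pd.concat([left, right_aligned], axis=1)

# Number of rows, and number of rows containing at least one NaN
print(len(misaligned), int(misaligned.isna().any(axis=1).sum()))
print(len(aligned), int(aligned.isna().any(axis=1).sum()))
```

With mismatched indexes the result gains extra rows containing NaN, and some surviving rows pair values that belonged to different original rows, which a later `dropna` does not repair.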

In [None]:
df.info()

In [None]:
df.dropna(inplace=True)

In [None]:
# scaler = StandardScaler()
# df[['age', 'bmi']] = scaler.fit_transform(df[['age', 'bmi']])

In [None]:
# Feature engineering 

df['bmi_smoker'] = df['bmi'] * df['smoker']
df['age_smoker'] = df['age'] * df['smoker']

In [None]:
df_corr = df.drop(['bmi_status','region'], axis=1).corr()
sns.heatmap(df_corr, annot=True,cmap="coolwarm", fmt=".2f", linewidths=.2)

In [None]:
y = df['charges']
x = df.drop(['charges','bmi_status','region'],axis=1)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)

##### Linear Regression

In [None]:
model = LinearRegression()

model.fit(x_train, y_train)

print("Intercept:", model.intercept_)
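# Print every coefficient next to its feature name (coef_[0] alone shows only the first one)
print("Coefficients:", dict(zip(x.columns, model.coef_)))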

In [None]:
y_pred = model.predict(x_test)

results = pd.DataFrame({
    "Actual": y_test,
    "Predicted": y_pred
})

results

In [None]:
# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R2 Score):", r2)
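
MSE is expressed in squared units of the target, which makes it hard to read against the charges themselves. As a minimal sketch on toy numbers (not the notebook's actual predictions), the root mean squared error and mean absolute error put the error back on the same scale as `charges`. The array names are placeholders standing in for `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy values (made up), standing in for y_test and y_pred
y_true_toy = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred_toy = np.array([1100.0, 1900.0, 3200.0, 3800.0])

mse_toy = mean_squared_error(y_true_toy, y_pred_toy)
rmse_toy = np.sqrt(mse_toy)                             # same unit as the target
mae_toy = mean_absolute_error(y_true_toy, y_pred_toy)   # average absolute miss

print("MSE:", mse_toy)
print("RMSE:", rmse_toy)
print("MAE:", mae_toy)
```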


##### XGBoost Regression

In [None]:
model = XGBRegressor()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

results = pd.DataFrame({
    "Actual": y_test,
    "Predicted": y_pred
})

results

In [None]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R2 Score):", r2)
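
Both models above are scored on a single 70/30 split, so the reported R² depends on which rows happen to land in the test set. A possible complement is k-fold cross-validation. The sketch below uses synthetic data and `LinearRegression` purely for illustration; the data, the fold count and the seeds are assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (not the insurance dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# 5 shuffled folds; each fold serves once as the held-out set
cv = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=cv, scoring='r2')

print("R2 per fold:", scores)
print("Mean R2:", scores.mean(), "Std:", scores.std())
```

On the notebook's data, `x` and `y` would take the place of the toy arrays, and since `XGBRegressor` follows the scikit-learn estimator interface it should be usable in place of `LinearRegression()`.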

### **Conclusion**

1. A very small portion of the population falls under the "thin" BMI category, while 225 individuals have a healthy BMI, and the majority fall into overweight or obese categories, increasing their risk for diseases like diabetes and hypertension.  
2. The age distribution is fairly balanced, but a significant proportion of the population is concentrated in the 20-30 age group.  
3. The Southeast region has the highest number of smokers, which likely contributes to the higher total insurance charges in this area.  
4. The population of the Southeast region is slightly higher compared to other regions, further influencing regional insurance cost trends.  
5. Some individuals with a normal BMI are smokers, but far fewer than are non-smokers.  
6. Individuals in the "Obese I" category account for a noticeably larger share of total insurance charges than any other BMI category.  
7. A machine learning model was developed to predict insurance charges, achieving an R² score of approximately 0.85 on the test set (about 85% of the variance in charges explained), indicating strong predictive capability.  

### **Future Scope**

1. Incorporating additional features like physical activity levels, diet, and medical history could enhance model accuracy and provide deeper insights into healthcare expenses.  
2. Developing region-specific models or interventions can help insurance providers design tailored policies to mitigate high-risk factors like smoking and obesity.  

### Credits

This notebook was created by [Pranal Patil](https://www.kaggle.com/Pranal17).  
Feel free to connect with me on Kaggle and explore more of my work!