## 🏥 End-to-End Medical Insurance Cost Prediction Project

## 🔍 Project Overview
This project focuses on building an intelligent system to **predict individual medical insurance costs** based on demographic and lifestyle factors using **Machine Learning**.  
The workflow includes **data preprocessing**, **feature encoding**, **model training**, and **deployment using Gradio** for easy interaction and testing.

---

## ⚙️ Challenge: Non-linear Relationships
The dataset contains mixed numerical and categorical features such as **age**, **BMI**, **region**, **smoking status**, and **number of dependents**.  
The main challenge was modeling the **non-linear relationships** between these factors and the insurance charges.  
To address this, several regression approaches were evaluated, including **Linear Regression**, **Random Forest**, and **Gradient Boosting** (plus bagging, voting, and stacking ensembles), and the best-performing model was selected based on **R²**, **MAE**, and **RMSE** scores.
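
As a reference for how the three selection metrics are defined, here is a minimal sketch that computes them directly with NumPy. The helper name `regression_metrics` and the sample numbers are hypothetical and only serve the illustration; the notebook itself uses the scikit-learn metric functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                    # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))             # penalizes large errors more
    ss_res = np.sum(errors ** 2)                     # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Hypothetical charges, for illustration only
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same unit as the charges (lower is better), while R² is unitless (closer to 1 is better).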

---

## 🛠️ Tools & Technologies
- **Python** (Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn, Joblib)  
- **Gradio** (for model deployment and interactive UI)  
- **Jupyter Notebook / Kaggle** (for experimentation and documentation)  
- **GitHub** (for version control and project sharing)

---

## 🚀 Model Deployment
A **Gradio app** was built where users can enter key personal details (e.g., age, BMI, smoker status) and instantly receive a **predicted insurance cost 💰**.

---

## 📂 Project Structure
- `app.py` → Gradio deployment script
- `requirements.txt` → List of dependencies
- `README.md` → Project documentation
- `screen/` → Screenshots of the Gradio app

---

## 👨‍💻 Developed By
**AI & Data Scientist: Mina Nabil Samir**  
*(Engineering for this notebook project)*

# Import Necessary Libraries 

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt 
from sklearn.preprocessing import LabelEncoder , StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Ensemble Algorithms
#1-Bagging
from sklearn.ensemble import BaggingRegressor
#2-Boosting
from sklearn.ensemble import GradientBoostingRegressor
#3-Voting
from sklearn.ensemble import VotingRegressor
from sklearn.ensemble import RandomForestRegressor
#4-Stacking
from sklearn.ensemble import StackingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings 
warnings.filterwarnings("ignore")

# Data Exploration 

In [None]:
df=pd.read_csv("/kaggle/input/insurance/insurance.csv")
df.head(10)

In [None]:
df.tail(10)

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.info()

# Statistical Insights

In [None]:
df.describe().T

In [None]:
df.var(numeric_only=True)

In [None]:
df.mode().T

In [None]:
df.median(numeric_only=True)


In [None]:
df.skew(numeric_only=True)


# Data Wrangling (Cleaning)

#### Check Null Values  

In [None]:
df.isna().sum()

#### Check Duplicated Values 

In [None]:
df.duplicated().sum()

In [None]:
# Remove the duplicated rows
df.drop_duplicates(inplace=True)


In [None]:
df.duplicated().sum()

# Check Outliers

In [None]:
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers[[column]]

numeric_values = ["age", "bmi", "children", "charges"]

for i in numeric_values:
    outliers = detect_outliers_iqr(df, i)
    print(f"Outliers in {i}: {len(outliers)} values")     
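
To make the IQR rule used above concrete, here is a small worked example on made-up numbers (the series `values` is hypothetical and not taken from the dataset). It follows the same steps as `detect_outliers_iqr`: compute the quartiles, derive the bounds, and flag whatever falls outside them.

```python
import pandas as pd

# Hypothetical values for illustration; 100 is the planted outlier
values = pd.Series([1, 2, 3, 4, 100])

q1 = values.quantile(0.25)          # 2.0
q3 = values.quantile(0.75)          # 4.0
iqr = q3 - q1                       # 2.0
lower_bound = q1 - 1.5 * iqr        # -1.0
upper_bound = q3 + 1.5 * iqr        # 7.0

# Anything outside [lower_bound, upper_bound] is flagged
flagged = values[(values < lower_bound) | (values > upper_bound)]
print(f"bounds: [{lower_bound}, {upper_bound}]")
print(flagged.tolist())
```

Only the value 100 lies outside the bounds, so it is the single flagged point.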



### These outliers are real values that occur in real life, so they are kept. To reduce their effect on the model, the data is scaled before the modeling step.

# EDA

## Univariate Analysis

### Age

In [None]:
plt.boxplot(df["age"])
plt.title("Boxplot of Age")
plt.show()

In [None]:
sns.histplot(df["age"], kde=True, bins=30)
plt.show()

In [None]:
sns.kdeplot(df['age'],fill=True,color='red',alpha=0.7)
plt.title('Density Plot')
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()

### Sex

In [None]:
x=df["sex"].value_counts()
x

In [None]:
plt.figure(figsize=(12,8))
plt.pie(x.values,labels = x.index, startangle = 90,explode =[0.2,0],shadow=True,colors=["c","r"],autopct='%1.1f%%')
plt.legend()
plt.show()

### BMI

In [None]:
plt.boxplot(df["bmi"])
plt.title("Boxplot of BMI")
plt.show()

In [None]:
sns.histplot(df["bmi"], kde=True, bins=30)
plt.show()

In [None]:
sns.kdeplot(df['bmi'],fill=True,color='red',alpha=0.7)
plt.title('Density Plot')
plt.xlabel('BMI')
plt.ylabel('Density')
plt.show()

### Children 

In [None]:
x=df["children"].value_counts()
x

In [None]:
colors = plt.cm.Set2.colors  
plt.figure(figsize=(12,8))
plt.pie(x.values,labels = x.index, startangle = 90,explode =[0,0,0.4,0,0,0],shadow=True,colors=colors,autopct='%1.1f%%')
plt.title("Number Of Children")
plt.legend()
plt.show()

### Smoker 

In [None]:
x=df["smoker"].value_counts()
x

In [None]:
colors = plt.cm.Set2.colors  
plt.figure(figsize=(12,8))
plt.pie(x.values,labels = x.index, startangle = 90,explode =[0,0.2],shadow=True,colors=colors,autopct='%1.1f%%')
plt.title("Smoking Status")
plt.legend()
plt.show()

### Region 

In [None]:
x=df["region"].value_counts()
x

In [None]:
colors = plt.cm.Set2.colors  
plt.figure(figsize=(12,8))
plt.pie(x.values,labels = x.index, startangle = 90,explode =[0,0.2,0,0],shadow=True,colors=colors,autopct='%1.1f%%')
plt.title("Region Distribution")
plt.legend()
plt.show()	

### Charges 

In [None]:
sns.histplot(df["charges"], kde=True, bins=30)
plt.show()

# Bivariate Analysis

In [None]:
sns.scatterplot(x="age", y="charges", data=df)
plt.title("Age vs Charges")
plt.show()

In [None]:
sns.boxplot(x="sex", y="charges", data=df)
plt.title("Sex vs Charges")
plt.show()

In [None]:
stats_sex = df.groupby("sex")["charges"].agg(["mean", "median", "std"]).reset_index()
print(stats_sex)

## Bivariate Analysis: Sex vs. Charges 📊

General Distribution:

Both males and females have a very similar distribution of costs.

The median is slightly higher for males than for females.

Charges:

Most cases in both categories are between $5,000 and $15,000.

Males show more high-end values than females (meaning some male cases are charged more).

Outliers:

Outliers are present in both males and females, but males reach slightly higher values.

This may be related to the confounding of other variables, such as smoking or age, rather than gender itself.

Conclusion:

Gender is not a significant factor in medical costs.

The only slight difference is that males have a higher median and more high-cost cases, but the overall trend is very similar.

Other variables such as smoking, age, and BMI will be much more influential.
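
One way to probe the confounding idea mentioned above is to compare the share of smokers within each sex. The sketch below uses a tiny made-up frame with the notebook's column names (`sex`, `smoker`); the helper `smoker_share_by` is a hypothetical name, and on the real data it would be called as `smoker_share_by(df, "sex")`.

```python
import pandas as pd

def smoker_share_by(data, group_col):
    """Share of smokers and non-smokers within each level of group_col."""
    return pd.crosstab(data[group_col], data["smoker"], normalize="index")

# Tiny hypothetical frame, for illustration only
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "smoker": ["yes", "yes", "no", "no", "yes", "no", "no", "no"],
})
shares = smoker_share_by(toy, "sex")
print(shares)
```

If one group has a clearly larger share of smokers, part of its higher charges may come from smoking rather than from sex itself.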

In [None]:
sns.lmplot(x="bmi", y="charges", data=df, scatter_kws={"alpha":0.5}, line_kws={"color":"red"})
plt.title("BMI vs Charges")
plt.show()

## Bivariate Analysis: BMI vs. Charges 📊

General Relationship:

There is a weak to moderate positive relationship between BMI and costs (meaning that as BMI increases, costs often increase).

The red line (regression line) confirms this trend.

Distribution:

Most of the data is clustered around BMIs between 20 and 35.

There is a clear cluster of cases with BMI ≥ 30 (obesity) and very high costs. This makes sense medically, as obesity is associated with health problems.

Outliers:

Outliers are clearly present among people with very high BMIs (> 40), with charges of roughly $40,000 to $60,000.

This reflects very high treatment costs.

Conclusion:

BMI is a variable that influences medical costs.

The general trend: People with obesity (BMI ≥ 30) have significantly higher costs.

However, not everyone with a high BMI will pay a higher cost. Other factors play a role (such as smoking and age).
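
To see how smoking changes the BMI effect, one option is to measure the BMI-to-charges correlation separately for smokers and non-smokers. This is a sketch on made-up numbers using the notebook's column names (`bmi`, `smoker`, `charges`); the helper `bmi_charge_corr_by_smoker` is hypothetical, and on the real data it would receive `df`.

```python
import pandas as pd

def bmi_charge_corr_by_smoker(data):
    """Pearson correlation between BMI and charges within each smoker group."""
    return {
        group: part["bmi"].corr(part["charges"])
        for group, part in data.groupby("smoker")
    }

# Hypothetical values: charges rise with BMI for smokers only
toy = pd.DataFrame({
    "bmi":     [20, 25, 30, 35, 20, 25, 30, 35],
    "smoker":  ["yes"] * 4 + ["no"] * 4,
    "charges": [20000, 25000, 30000, 35000, 5000, 6000, 5000, 6000],
})
corr = bmi_charge_corr_by_smoker(toy)
print(corr)
```

A much stronger correlation in the smoker group than in the non-smoker group would support the idea that BMI matters mostly in combination with smoking.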

In [None]:
sns.boxplot(x="children", y="charges", data=df)
plt.title("Children vs Charges")
plt.show()

## Bivariate Analysis: Children vs. Charges 📊

General Distribution:

Most of the cases in the data have 0 to 3 children.

The number of cases with 4 or 5 children is very small.

Costs by Number of Children:

The median cost is very close across the main categories (0, 1, 2, and 3 children).

There is a slight increase in the median with 2 and 3 children compared to 0 and 1.

Those with 4 and 5 children have a lower median, but the samples are very small, making the results less accurate.

Outliers:

In almost all categories, there are very high outliers (people paying more than $50,000).

This means that having children is not a strong factor in determining high costs, because the outliers are spread across all categories.

Conclusion:

Number of children is not a strong predictor of insurance costs compared to other variables such as smoking, age, or BMI.

There is a slight difference between some categories, but not a big one.

In [None]:
sns.boxplot(x="smoker", y="charges", data=df)
plt.title("Smoker vs Charges")
plt.show()

In [None]:
stats_smoker = df.groupby("smoker")["charges"].agg(["mean", "median", "std"]).reset_index()
print(stats_smoker)

## Bivariate Analysis: Smoker vs. Charges 📊

A very large difference between the two groups:

Non-smoker: Most costs are under $15,000.

Smoker: Costs are much higher and reach over $60,000.

Median:

Smokers have costs about 3-4 times higher than non-smokers.

Outliers:

Strongly present among smokers (very high numbers of high values).

Non-smokers have very few outliers.

Conclusion:

Smoking is the most influential variable on health insurance costs compared to any other variable (age, gender, BMI, children).

The majority of smokers have very high medical costs due to the health risks associated with smoking (heart disease, cancer, etc.).

In [None]:
sns.boxplot(x="region", y="charges", data=df)
plt.title("Region vs Charges")
plt.show()

## Bivariate Analysis: Region vs. Charges 📊

General Distribution:

The data is divided into four regions: southeast, southwest, northwest, northeast.

They all have roughly the same distribution shape, and there is no significant difference like the one we saw with the smoker.

Differences between Regions:

The southeast appears to have a higher number of cases (more data in the sample).

The mean and median are very close between regions.

No particular region clearly stands out in costs.

Outliers:

These are present in all regions in the same way (cases reaching 50K-60K).

Conclusion:

Region is not a strong predictor of costs.

A slight difference may appear (for example, the southeast is slightly higher), but this is probably not due to the region itself. It could be due to a higher number of smokers or people with a higher BMI.
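
A simple way to test this explanation is to compare regions within each smoker group instead of overall. The sketch below uses a made-up frame with the notebook's column names; the helper `mean_charges_within_smoker` is a hypothetical name, and the numbers are chosen so that the regional gap comes only from the number of smokers.

```python
import pandas as pd

def mean_charges_within_smoker(data, factor):
    """Mean charges per level of `factor`, separately for smokers and non-smokers."""
    return data.groupby(["smoker", factor])["charges"].mean().unstack(factor)

# Hypothetical data: the southeast simply contains more smokers
toy = pd.DataFrame({
    "region":  ["southeast", "southeast", "southeast", "northwest", "northwest", "northwest"],
    "smoker":  ["yes", "yes", "no", "yes", "no", "no"],
    "charges": [30000, 30000, 8000, 30000, 8000, 8000],
})

overall = toy.groupby("region")["charges"].mean()        # gap between regions
stratified = mean_charges_within_smoker(toy, "region")   # gap disappears
print(overall)
print(stratified)
```

In this toy case the overall means differ, but inside each smoker group the regions are identical. On the real data, a similar pattern would suggest the region effect is mostly a smoking effect.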

# Multivariate Analysis

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x="age", y="charges", hue="smoker", data=df, alpha=0.7)
plt.title("Age & Smoker vs Charges")
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x="bmi", y="charges", hue="smoker", data=df, alpha=0.7)
plt.title("BMI & Smoker vs Charges")
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x="children", y="charges", hue="region", data=df)
plt.title("Children & Region vs Charges")
plt.show()

In [None]:
sns.lmplot(x="bmi", y="charges", hue="smoker", col="sex", data=df, scatter_kws={"alpha":0.6})
plt.show()

In [None]:
sns.pairplot(df, hue="smoker")
plt.show()

## Summary 
Age: A direct relationship with costs (older people pay more).

Sex: Slight differences (not a strong factor).

BMI: The higher the BMI, the higher the costs, especially with smoking.

Children: Weak effect.

Smoker: The strongest factor (smokers have a significant increase in costs).

Region: Weak or statistically insignificant effect.

# Data Preprocessing 

## Encoding 

In [None]:
le = LabelEncoder()

In [None]:
df = pd.get_dummies(df, columns=['region'], drop_first=True)
for col in df.filter(like="region_").columns:
    df[col] = df[col].astype(int)
df['sex'] = le.fit_transform(df['sex'])  
df['smoker'] = le.fit_transform(df['smoker']) 
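
Because the deployed app has to rebuild exactly these columns, it helps to check what the encoders produce. The sketch below runs the same two calls on a made-up four-row frame (`regions`, `le_demo`, and `sex_mapping` are illustrative names). `get_dummies` orders the categories alphabetically, so `drop_first=True` removes `northeast`, and `LabelEncoder` also assigns codes in alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One row per region, for illustration only
regions = pd.DataFrame({"region": ["southwest", "southeast", "northwest", "northeast"]})
encoded = pd.get_dummies(regions, columns=["region"], drop_first=True).astype(int)
print(encoded.columns.tolist())
# ['region_northwest', 'region_southeast', 'region_southwest']

# LabelEncoder sorts its classes, so female -> 0 and male -> 1
le_demo = LabelEncoder()
le_demo.fit(["male", "female"])
sex_mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(sex_mapping)
```

The `northeast` row ends up with zeros in all three dummy columns, which makes it the baseline region.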

In [None]:
df

## Define X feature and y target

In [None]:
X = df.drop(columns="charges")
y = df["charges"]

## Split Data To Train , Test

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

# Data Scaling (Standardization)

In [None]:
scaler = StandardScaler()
num_cols = ["age", "bmi"] 
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
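
An alternative worth considering, shown here only as a sketch and not as the approach used in this notebook, is to wrap the scaler and the model in a single scikit-learn `Pipeline`. That way the same scaling is applied automatically at prediction time. The data below is synthetic, and the names `preprocess`, `pipe`, `X_demo`, and `y_demo` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_cols = ["age", "bmi"]

# Scale only the numeric columns; pass the remaining columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), num_cols)],
    remainder="passthrough",
)
pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

# Synthetic stand-in for the insurance features (not the real dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=50),
    "bmi": rng.uniform(18, 40, size=50),
    "smoker": rng.integers(0, 2, size=50),
})
y_demo = 250 * X_demo["age"] + 300 * X_demo["bmi"] + 20000 * X_demo["smoker"]

pipe.fit(X_demo, y_demo)             # the scaler is fitted on training data only
pred = pipe.predict(X_demo.head(3))  # raw inputs are scaled inside the pipeline
print(pred)
```

With this layout, a single saved object carries both the preprocessing and the model.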

## Modelling 

## Linear Regression

### Call Model 

In [None]:
lin_reg = LinearRegression()


### Train Model

In [None]:
lin_reg.fit(X_train, y_train)


### Test

In [None]:
y_pred_lr = lin_reg.predict(X_test)

## Ensemble Models

### 1-Bagging

In [None]:
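# The base model is passed positionally: the keyword is `estimator` since scikit-learn 1.2 (`base_estimator` before)
bagging = BaggingRegressor(LinearRegression(), n_estimators=50, random_state=42)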

In [None]:
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)

### 2-Boosting

In [None]:
boosting = GradientBoostingRegressor(n_estimators=200,learning_rate=0.1,max_depth=3,random_state=42)


In [None]:
boosting.fit(X_train, y_train)
y_pred_boost = boosting.predict(X_test)

### 3-Voting

In [None]:
rf = RandomForestRegressor(n_estimators=200, random_state=42)

voting = VotingRegressor(estimators=[('lr', lin_reg),('rf', rf),('gb', boosting)])

In [None]:
voting.fit(X_train, y_train)
y_pred_vote = voting.predict(X_test)

### 4-Stacking 

In [None]:
stacking = StackingRegressor(
    estimators=[('lr', lin_reg), ('rf', rf), ('gb', boosting)],
    final_estimator=LinearRegression()
)

In [None]:
stacking.fit(X_train, y_train)
y_pred_stack = stacking.predict(X_test)

## Evaluation

In [None]:
def evaluate(y_true, y_pred, model_name):
    print(f"\n{model_name} Performance:")
    print("MAE:", mean_absolute_error(y_true, y_pred))
    # RMSE as the square root of MSE (the `squared` argument is deprecated in recent scikit-learn)
    print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
    print("R2 Score:", r2_score(y_true, y_pred))

In [None]:
evaluate(y_test, y_pred_lr, "Linear Regression")
evaluate(y_test, y_pred_bag, "Bagging")
evaluate(y_test, y_pred_boost, "Boosting")
evaluate(y_test, y_pred_vote, "Voting")
evaluate(y_test, y_pred_stack, "Stacking")
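
The scores above come from a single train/test split. As an optional cross-check that is not part of the original workflow, k-fold cross-validation gives an average score with a spread. The sketch below runs on synthetic data from `make_regression`; `X_demo`, `y_demo`, and the fold settings are assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the insurance data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# 5 shuffled folds; each fold is used once as the validation set
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("R2 per fold:", scores)
print("Mean R2:", scores.mean(), "Std:", scores.std())
```

The same call should work with any of the regressors defined in this notebook in place of `LinearRegression()`.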

# Compare

In [None]:
models_preds = {
    "Linear Regression": y_pred_lr,
    "Bagging": y_pred_bag,
    "Boosting": y_pred_boost,
    "Voting": y_pred_vote,
    "Stacking": y_pred_stack
}

plt.figure(figsize=(15,10))
for i, (name, y_pred) in enumerate(models_preds.items(), 1):
    plt.subplot(2,3,i)
    sns.scatterplot(x=y_test, y=y_pred, alpha=0.6)
    plt.plot([y_test.min(), y_test.max()],
             [y_test.min(), y_test.max()],
             'r--')  # ideal line (perfect predictions)
    plt.title(f"{name}\nActual vs Predicted")
    plt.xlabel("Actual Charges")
    plt.ylabel("Predicted Charges")

plt.tight_layout()
plt.show()

results = []
for name, y_pred in models_preds.items():
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    results.append([name, mae, rmse, r2])

results_df = pd.DataFrame(results, columns=["Model", "MAE", "RMSE", "R2"])

plt.figure(figsize=(10,6))
sns.barplot(x="Model", y="R2", data=results_df)
plt.title("Model Comparison (R² Score)")
plt.ylabel("R² Score")
plt.show()
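
Besides reading the bar chart, the best model can be picked from the results table by sorting it. This is a sketch with made-up scores (`results_demo` and the model names are hypothetical); with the notebook's table the same `sort_values` call would be applied to `results_df`.

```python
import pandas as pd

# Hypothetical scores, for illustration only
results_demo = pd.DataFrame(
    [["Model A", 4200.0, 6100.0, 0.78],
     ["Model B", 2500.0, 4400.0, 0.88],
     ["Model C", 2700.0, 4600.0, 0.87]],
    columns=["Model", "MAE", "RMSE", "R2"],
)

# Highest R2 first; RMSE (lower is better) breaks ties
ranked = results_demo.sort_values(["R2", "RMSE"], ascending=[False, True]).reset_index(drop=True)
best_name = ranked.loc[0, "Model"]
print(ranked)
print("Best model:", best_name)
```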


## Save Best Model

In [None]:
import pickle

best_model = stacking  
with open("best_model.pkl", "wb") as file:
    pickle.dump(best_model, file)
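
Since `age` and `bmi` are scaled before training, a standalone app needs the fitted scaler and the column order as well as the model. One possible way, sketched here with Joblib and small stand-in objects, is to save everything in one dictionary. The names `bundle`, `demo_model`, `demo_scaler`, and the file name `model_bundle.joblib` are assumptions for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Small stand-in data (age, bmi) and charges; not the real dataset
X_demo = np.array([[20.0, 22.0], [30.0, 27.0], [40.0, 31.0], [50.0, 35.0]])
y_demo = np.array([2000.0, 4000.0, 6500.0, 9000.0])

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LinearRegression().fit(demo_scaler.transform(X_demo), y_demo)

# Keep everything the prediction step needs in one object
bundle = {"model": demo_model, "scaler": demo_scaler, "feature_columns": ["age", "bmi"]}

path = os.path.join(tempfile.mkdtemp(), "model_bundle.joblib")
joblib.dump(bundle, path)
loaded = joblib.load(path)

# The restored objects give the same prediction as the originals
new_row = np.array([[35.0, 29.0]])
original_pred = demo_model.predict(demo_scaler.transform(new_row))
restored_pred = loaded["model"].predict(loaded["scaler"].transform(new_row))
print(original_pred, restored_pred)
```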

## Deploy Model

In [None]:
with open("best_model.pkl", "rb") as file:
    model = pickle.load(file)

In [None]:
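def predict_charges(age, sex, bmi, children, smoker, region):
    # Convert sex and smoker to 0/1 (same mapping as LabelEncoder: female/no -> 0, male/yes -> 1)
    sex = 1 if sex == "male" else 0
    smoker = 1 if smoker == "yes" else 0

    # get_dummies(drop_first=True) dropped "northeast", so it is the all-zeros baseline
    row = {
        "age": age, "sex": sex, "bmi": bmi, "children": children, "smoker": smoker,
        "region_northwest": int(region == "northwest"),
        "region_southeast": int(region == "southeast"),
        "region_southwest": int(region == "southwest"),
    }
    # Match the training column order, then apply the scaler fitted on the training set
    input_data = pd.DataFrame([row])[list(X_train.columns)]
    input_data[num_cols] = scaler.transform(input_data[num_cols])

    # Predict
    prediction = model.predict(input_data)[0]
    return f"Predicted Insurance Charges: ${prediction:,.2f}"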

In [None]:
import gradio as gr

demo = gr.Interface(
    fn=predict_charges,
    inputs=[
        gr.Number(label="Age"),
        gr.Radio(["male", "female"], label="Sex"),
        gr.Number(label="BMI"),
        gr.Number(label="Children"),
        gr.Radio(["yes", "no"], label="Smoker"),
        gr.Dropdown(["southwest", "southeast", "northwest", "northeast"], label="Region")
    ],
    outputs="text",
    title="Medical Insurance Charges Prediction",
    description="Enter patient information to predict insurance charges using the trained model."
)

demo.launch()
