# 📊 Insurance Charges Data Analysis
Welcome to this healthcare-focused data analysis project! In this notebook, we explore how age, BMI, smoking habits, and other factors influence medical insurance charges.

**Goals:**
- Explore the dataset and perform basic EDA (Exploratory Data Analysis)
- Visualize key relationships using scatter plots, line plots, and facet grids
- Build a basic predictive model to estimate charges
- Summarize findings and insights

_Dataset source: [Kaggle - Medical Cost Personal Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance)_

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# set_theme is the current name for this call; sns.set remains as an alias
sns.set_theme(style='whitegrid')

In [None]:
# Load the dataset
url = 'https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv'
df = pd.read_csv(url)
df.head()

## 📋 Descriptive Statistics

In [None]:
df.describe(include='all')
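
`describe` summarizes each column but does not flag missing values or repeated rows directly. A minimal sketch of those two checks follows. It uses a tiny made-up frame named `sample` that borrows the dataset's column names; the values are illustrative only, not real records. The same calls could be applied to `df`.

```python
import pandas as pd

# Hypothetical rows with the dataset's column names (values are made up)
sample = pd.DataFrame({
    'age': [19, 18, 28, 18],
    'sex': ['female', 'male', 'male', 'male'],
    'bmi': [27.9, 33.8, None, 33.8],
    'children': [0, 1, 3, 1],
    'smoker': ['yes', 'no', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast', 'southeast'],
    'charges': [16884.92, 1725.55, 4449.46, 1725.55],
})

# Count missing values in each column
missing_per_column = sample.isna().sum()

# Count rows that exactly repeat an earlier row
n_duplicates = sample.duplicated().sum()

# One possible cleanup: drop repeated rows, then rows with any missing value
cleaned = sample.drop_duplicates().dropna()

print(missing_per_column)
print('Duplicate rows:', n_duplicates)
print('Rows after cleanup:', len(cleaned))
```

Whether dropping rows is appropriate depends on the data; imputing missing values is another common choice.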

## 🔥 Correlation Heatmap

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
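
Because `numeric_only=True` skips text columns, `smoker`, `sex`, and `region` do not appear in the heatmap. One option, sketched here as an assumption rather than as part of the original analysis, is to map `smoker` to a 0/1 flag before computing correlations. The frame `sample` and the column name `smoker_flag` are hypothetical and the values are made up.

```python
import pandas as pd

# Hypothetical rows (made-up values) with a subset of the dataset's columns
sample = pd.DataFrame({
    'age': [19, 45, 33, 52, 27, 60],
    'bmi': [27.9, 30.1, 22.7, 35.3, 25.0, 31.4],
    'smoker': ['yes', 'no', 'no', 'yes', 'no', 'yes'],
    'charges': [17000.0, 8000.0, 4500.0, 42000.0, 3000.0, 48000.0],
})

# Add a numeric 0/1 flag so smoking status takes part in the correlation matrix
with_flag = sample.assign(smoker_flag=sample['smoker'].map({'yes': 1, 'no': 0}))

# The text column 'smoker' is still skipped; 'smoker_flag' is included
corr = with_flag.corr(numeric_only=True)
print(corr['charges'].sort_values(ascending=False))
```

The resulting `corr` can be passed to `sns.heatmap` in the same way as above.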

## 📈 Scatterplot: BMI vs Insurance Charges

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(data=df, x='bmi', y='charges', hue='smoker')
plt.title('Scatterplot of BMI vs Insurance Charges')
plt.xlabel('BMI')
plt.ylabel('Insurance Charges ($)')
plt.show()

## 📊 Line Plot: Age vs Avg Charges by Region

In [None]:
# Average charges for each (region, age) pair, flattened back into columns
avg_charges = df.groupby(['region', 'age'])['charges'].mean().reset_index()

sns.relplot(
    data=avg_charges,
    x='age', y='charges', kind='line',
    col='region', col_wrap=2, height=4, aspect=1.2
)
plt.suptitle('Line Plot of Age vs Average Charges by Region', y=1.05)
plt.show()
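
Averaging per exact age within each region can leave only a few rows behind each point, which makes the lines jagged. A possible alternative is to average over age bands with `pd.cut`. This is a sketch under assumptions: the band edges and labels are arbitrary choices, and `sample` is a made-up frame, not the real data.

```python
import pandas as pd

# Hypothetical rows (made-up values) with the columns used by the line plot
sample = pd.DataFrame({
    'age': [19, 24, 31, 38, 45, 52, 59, 64],
    'region': ['southwest'] * 4 + ['northeast'] * 4,
    'charges': [2000.0, 3000.0, 4000.0, 6000.0,
                8000.0, 10000.0, 12000.0, 14000.0],
})

# Arbitrary band edges; intervals are right-inclusive: (17, 30], (30, 45], (45, 65]
bins = [17, 30, 45, 65]
labels = ['18-30', '31-45', '46-65']
sample['age_band'] = pd.cut(sample['age'], bins=bins, labels=labels)

# Mean charges per region and age band; observed=True skips empty combinations
banded = (
    sample.groupby(['region', 'age_band'], observed=True)['charges']
    .mean()
    .reset_index()
)
print(banded)
```

Plotting `banded` with `x='age_band'` in place of `x='age'` would give one point per band in each region panel.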

## 🔍 FacetGrid: Age vs Charges by Smoker and Sex

In [None]:
g = sns.FacetGrid(df, col='smoker', row='sex', height=4, aspect=1.2)
g.map_dataframe(sns.scatterplot, x='age', y='charges', alpha=0.6)
# Recent seaborn releases expose the figure as `figure`; `fig` is the deprecated name
g.figure.subplots_adjust(top=0.9)
g.figure.suptitle('FacetGrid of Age vs Charges by Smoker and Sex')
plt.show()

## 🌐 Interactive Scatter Plot with Plotly

In [None]:
fig = px.scatter(df, x='bmi', y='charges', color='smoker',
                 hover_data=['age', 'region', 'sex'])
fig.show()

## 🧠 Predictive Modeling: Linear Regression

In [None]:
# Encode categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)

X = df_encoded.drop('charges', axis=1)
y = df_encoded['charges']

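# Hold out 20% of the rows for testing; a fixed seed keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
# R² is measured on the held-out test rows, not on the training rows
print('Model Score (R²):', model.score(X_test, y_test))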
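
R² alone does not say how large the errors are in dollars or which features drive the predictions. The sketch below adds mean absolute error and a labeled view of the coefficients. It runs on synthetic data generated from a made-up formula, so the numbers say nothing about the real dataset; `demo`, `demo_model`, and the formula are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the charge formula below is invented for this sketch
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'age': rng.integers(18, 65, size=n),
    'bmi': rng.normal(30, 5, size=n),
    'smoker': rng.choice(['yes', 'no'], size=n),
})
demo['charges'] = (
    250 * demo['age']
    + 300 * demo['bmi']
    + 20000 * (demo['smoker'] == 'yes')
    + rng.normal(0, 1000, size=n)
)

# One-hot encode as in the cell above; dtype=float keeps every column numeric
encoded = pd.get_dummies(demo, drop_first=True, dtype=float)
X = encoded.drop('charges', axis=1)
y = encoded['charges']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

demo_model = LinearRegression().fit(X_train, y_train)

# Average absolute prediction error, in the same units as charges
mae = mean_absolute_error(y_test, demo_model.predict(X_test))

# Pair each coefficient with its column name for easier reading
coefs = pd.Series(demo_model.coef_, index=X.columns)

print('MAE:', round(mae, 2))
print(coefs.sort_values(ascending=False))
```

On the real data, the same two lines for `mae` and `coefs` could be applied to `model`, `X_test`, and `y_test` from the cell above.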

## ✅ Summary & Key Learnings
- **Smokers** tend to have significantly higher insurance charges.
- **BMI** and **age** are positively correlated with medical expenses.
- **FacetGrids** helped show how sex and smoking status combine to affect charges.
- Built a **basic linear regression model** to predict charges, scored with R² on a held-out 20% test set.
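
The smoker finding can be backed with a number rather than read off a plot. A minimal sketch using a grouped summary is shown here on a made-up frame named `sample`; the values are illustrative and the resulting ratio is not a result from the real dataset.

```python
import pandas as pd

# Hypothetical rows (made-up values) with the two columns needed
sample = pd.DataFrame({
    'smoker': ['yes', 'no', 'no', 'yes', 'no', 'yes'],
    'charges': [17000.0, 8000.0, 4500.0, 42000.0, 3000.0, 48000.0],
})

# Mean, median, and row count of charges for each smoking status
by_smoker = sample.groupby('smoker')['charges'].agg(['mean', 'median', 'count'])

# How many times higher the smokers' mean is than the non-smokers' mean
ratio = by_smoker.loc['yes', 'mean'] / by_smoker.loc['no', 'mean']

print(by_smoker)
print('Smoker / non-smoker mean ratio:', round(ratio, 2))
```

Replacing `sample` with `df` would report the figures for the actual data.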

### 💡 What I Learned
- How to load, explore, and visualize a real-world healthcare dataset.
- How to create informative plots with seaborn and interactive visuals with Plotly.
- How to apply basic machine learning to model insurance charges.
- More confidence in using pandas, seaborn, and scikit-learn for analysis.