## Lab 3 ML 

### 1. Setup and Load Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv("insurance.csv")

# Display the first 5 rows 
df.head()

### 2. Check Data Structure

In [None]:
# Check data types and non-null counts
# (info() prints its report directly and returns None, so no print() is needed)
df.info()

# Summary statistics for numerical columns
print(df.describe())

# Check the dimensions of the dataset (rows, columns)
print(f"Dataset shape: {df.shape}")

### 3. Data Cleaning: Handle Missing Values and Duplicates

In [None]:
# Step 1: Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

# Option A: Drop rows with missing values
# df = df.dropna()

# Option B: Fill missing values with the column mean (numerical columns only)
numeric_cols = df.select_dtypes(include=['number']).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Step 2: Handle duplicates
print(f"Duplicates found before: {df.duplicated().sum()}")

# Remove any duplicate rows, keeping the first occurrence
df.drop_duplicates(inplace=True)

print(f"Duplicates found after: {df.duplicated().sum()}")

### 4. Visualization (Charts, Graphs)

In [None]:
# Distribution of Medical Charges
plt.figure(figsize=(10, 6))
sns.histplot(df['charges'], kde=True, color='blue')
plt.title('Distribution of Medical Charges')
plt.show()

# Insight: Smoker vs. Charges
plt.figure(figsize=(10, 6))
sns.boxplot(x='smoker', y='charges', data=df)
plt.title('Medical Charges: Smokers vs Non-Smokers')
plt.show()

# Insight: BMI vs. Charges
plt.figure(figsize=(10, 6))
sns.scatterplot(x='bmi', y='charges', hue='smoker', data=df)
plt.title('Impact of BMI on Charges (Smokers vs Non-Smokers)')
plt.show()

# Correlation Heatmap
plt.figure(figsize=(10, 8))
# Only calculate correlation for numeric columns (any numeric dtype, not just 64-bit)
numeric_df = df.select_dtypes(include=['number'])
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

### 5. Find Patterns and Insights
After analyzing the visualizations, we can identify several key drivers for medical insurance charges:

- **The "Smoker" Premium:** The boxplot and scatterplot show that smoking is the 
    factor most strongly associated with higher medical charges. Smokers consistently face 
    much higher costs, with the smoker median sitting well above the charges of nearly all 
    non-smokers.

- **The BMI-Smoker Interaction:** In the scatterplot, there is a distinct "threshold" effect at a 
    BMI of 30. For smokers, once BMI exceeds 30 (obesity category), charges jump drastically 
    (from approximately 20,000 to over 40,000). For non-smokers, a higher BMI leads to a 
    much more gradual increase in costs.

- **Charge Distribution:** The histogram reveals a **right-skewed distribution**. Most 
    individuals have lower charges (under 15,000), while a smaller group (likely the smokers 
    and those with chronic conditions) forms the secondary "humps" in the 40,000+ 
    range.

- **Weak Linear Correlations:** The heatmap shows that although `age` has the strongest 
    positive linear correlation with `charges` (0.3), the relationship is still fairly weak. 
    Because the heatmap covers only numeric columns, `smoker` does not appear in it at all. 
    This suggests that medical costs are not a simple linear function of any one numeric 
    variable, but reflect an interaction between lifestyle (smoking) and physical 
    attributes (BMI).
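
The observations above are read off the plots, so it can help to back them with numbers. The following is a minimal sketch of one way to do that, not part of the original lab. The helper name `summarize_charge_drivers`, the `"yes"`/`"no"` coding of the `smoker` column, and the toy rows are all assumptions made for illustration; the toy values are invented and are not results from `insurance.csv`.

```python
import pandas as pd


def summarize_charge_drivers(df: pd.DataFrame, bmi_cutoff: float = 30.0) -> dict:
    """Hypothetical helper: numeric checks for the plot-based insights.

    Assumes columns `age`, `bmi`, `smoker`, and `charges`, with `smoker`
    coded as the strings "yes" / "no".
    """
    data = df.copy()

    # Encode smoker as 0/1 so it can take part in a correlation matrix.
    data["smoker_flag"] = (data["smoker"] == "yes").astype(int)

    # Split BMI at the cutoff discussed in the text (30 by default).
    data["bmi_group"] = (data["bmi"] >= bmi_cutoff).map({True: "high", False: "low"})

    numeric = data[["age", "bmi", "smoker_flag", "charges"]]
    return {
        # Smoker premium: compare median charges per group.
        "median_by_smoker": data.groupby("smoker")["charges"].median(),
        # BMI-smoker interaction: mean charges per (smoker, BMI group).
        "mean_by_smoker_bmi": data.groupby(["smoker", "bmi_group"])["charges"].mean(),
        # Linear correlation of each feature with charges.
        "corr_with_charges": numeric.corr()["charges"].drop("charges"),
        # Positive skewness indicates a right-skewed distribution.
        "skewness": data["charges"].skew(),
    }


# Toy rows, invented purely to show the call; replace with the cleaned `df`.
toy = pd.DataFrame({
    "age": [25, 40, 55, 30, 45, 60],
    "bmi": [22.0, 31.0, 28.0, 24.0, 33.0, 35.0],
    "smoker": ["no", "no", "no", "yes", "yes", "yes"],
    "charges": [3000.0, 7000.0, 11000.0, 18000.0, 42000.0, 47000.0],
})

summary = summarize_charge_drivers(toy)
for name, value in summary.items():
    print(f"--- {name} ---")
    print(value)
```

Running `summarize_charge_drivers(df)` on the cleaned dataset would give a median per smoker group, a mean per smoker and BMI group, and a correlation for the encoded `smoker_flag`, which the heatmap cannot show because it is restricted to numeric columns.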