# Modeling & Prediction

---

**Dataset: Insurance Charges**

**Goal: Build a model to predict insurance charges using features like age, BMI, smoker status, etc.**

In [35]:
# Install necessary packages
!pip install pandas numpy scikit-learn joblib



In [36]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os
import joblib

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder

In [37]:
# Load your dataset
df = pd.read_csv("C:/Users/harsh/Downloads/AlmaBetter/EDA-ML-ALGO/insurance-charges-prediction/data/cleaned_insurance.csv")

In [38]:
# Drop 'age_group' column if it exists
if 'age_group' in df.columns:
    print("Dropping age_group column...")
    df.drop('age_group', axis=1, inplace=True)

# Encode categorical features using LabelEncoder
categorical_cols = ['sex', 'smoker', 'region']
label_encoders = {}

for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Save encoders
os.makedirs("models", exist_ok=True)
joblib.dump(label_encoders, "models/label_encoders.pkl")

Dropping age_group column...


['models/label_encoders.pkl']

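* Label encoding for categorical columns: `LabelEncoder` maps each category to an integer (e.g., 'female' becomes 0 and 'male' becomes 1, because labels are sorted alphabetically).
* scikit-learn models require all input features to be numeric, so text columns such as `sex`, `smoker`, and `region` must be encoded before training.
* Saving the fitted encoders lets us apply exactly the same mapping later, which keeps training and prediction consistent.
* Models like Linear Regression cannot handle strings either, but they also read the integer codes as ordered quantities. For a multi-valued column like `region`, one-hot encoding is usually the safer choice there; tree-based models such as Random Forest are far less sensitive to this.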
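To make the mapping concrete, here is a minimal sketch of how a fitted `LabelEncoder` assigns codes and how the same encoder is reused at prediction time. The toy values stand in for the `smoker` column and are an assumption for illustration, not the actual dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the 'smoker' column (illustrative only)
train_values = ["yes", "no", "no", "yes"]

le = LabelEncoder()
encoded = le.fit_transform(train_values)

# classes_ holds the sorted labels; each label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# At prediction time, reuse the fitted encoder instead of fitting a new one,
# so every label keeps the code it had during training
new_codes = le.transform(["no", "yes"])
decoded = le.inverse_transform(new_codes)
print(list(new_codes), list(decoded))
```

Calling `transform` with a label the encoder never saw during fitting raises an error, which is one more reason to persist the encoders alongside the model.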

In [40]:
# Create features and target
X = df.drop("charges", axis=1)
y = df["charges"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

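* Randomly splits the dataset (80% for training, 20% for testing) so we can measure the model's performance on data it hasn't seen before. Fixing `random_state` makes the split reproducible.

* The split does not prevent overfitting by itself, but it exposes it: a model that scores much better on the training set than on the test set has overfit. This helps simulate real-world prediction.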
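The role of the held-out set can be illustrated with a small sketch. It uses synthetic data and a single unpruned decision tree, both assumptions chosen to make the effect obvious; it is not the notebook's model or data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic data (an assumption for illustration, not the insurance dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=2.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A fully grown tree can memorize the training rows, noise included
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = r2_score(y_tr, tree.predict(X_tr))
test_r2 = r2_score(y_te, tree.predict(X_te))
print(f"train R2: {train_r2:.3f}  test R2: {test_r2:.3f}")
```

A large gap between the two scores is the signal that the model has fit noise in the training data; without the test set that gap would be invisible.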

In [41]:
# Train model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Save model
joblib.dump(model, "models/random_forest_model.pkl")

['models/random_forest_model.pkl']

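* Random Forest is an ensemble method.
* It builds multiple decision trees, each on a bootstrap sample of the training data, and averages their predictions.

* No need for scaling because trees split on thresholds that depend only on the ordering of a feature's values, not on their magnitude.

* It is often more accurate than linear regression when the relationship between the features and the target is not linear.

* Handles non-linearity and feature interactions without manual feature engineering.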
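The "averages them" point can be checked directly. The sketch below, on synthetic data (an assumption for illustration), averages the predictions of the individual trees stored in `estimators_` and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic step-shaped data (illustrative, not the insurance dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(100, 2))
y_demo = np.where(X_demo[:, 0] > 0.5, 20.0, 5.0) + rng.normal(scale=1.0, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Predict with every individual tree, then average across trees
per_tree = np.array([t.predict(X_demo) for t in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's prediction should equal the mean of its trees' predictions
matches = np.allclose(manual_average, forest.predict(X_demo))
print(matches)
```

Individual trees vary because each one sees a different bootstrap sample; averaging smooths out that variance, which is where much of the forest's accuracy comes from.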

In [42]:
# Evaluate model
y_pred = model.predict(X_test)
print("MAE :", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R²  :", r2_score(y_test, y_pred))


MAE : 2555.9294270283576
RMSE: 4628.2660316436195
R²  : 0.8834278119122363
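For readers who want to see what the three metrics measure, here is a small sketch that computes them by hand with NumPy. The four values are made up for illustration and are unrelated to the results above.

```python
import numpy as np

# Made-up targets and predictions (illustrative only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_hat

# MAE: average absolute error, in the same units as the target
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; penalizes large errors more
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE :", mae)
print("RMSE:", rmse)
print("R²  :", r2)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE; a wide gap between the two suggests that a few predictions are far off.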
