# #1 Predicting medical insurance expenses using machine learning techniques


**📘 Problem Statement**
The objective of this project is to estimate individuals' medical insurance costs using machine learning models trained on personal and lifestyle features: age, sex, body mass index (BMI), number of children, smoking status, and geographic region. Accurate cost forecasts help insurers evaluate risk and help individuals see how each factor influences their insurance costs.

**📊 Dataset Overview**
The available data comprises detailed information about the demographics and lifestyle habits of people with insurance, paired with their respective yearly medical costs.

Source: https://www.kaggle.com/datasets/rahulvyasm/medical-insurance-cost-prediction

# 🧩 Features Description

# **1 Loading Dataset**

In [28]:
import pandas as pd 
df = pd.read_csv("medical_insurance.csv")
df.head()

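|   | age | sex | bmi | children | smoker | region | charges |
|:--|:----|:----|:----|:---------|:-------|:-------|:--------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |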


# **2 Dataset Exploration**

In [27]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

**🎯 Objective**
Using the insured individuals' records described above, develop and assess predictive regression models with Scikit-Learn to estimate the medical cost (`charges`) from the input variables. The procedure covers:

- Data acquisition and exploratory analysis
- Data preparation and categorical encoding
- Feature standardization
- Model construction and performance measurement
- Deployment of the final model

In [25]:
df.shape

(2772, 7)

In [29]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
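Missing values are only one data-quality check. Exact duplicate rows are worth a look too, because a duplicated record can end up in both the training and the test split and inflate test scores. The sketch below uses a small made-up frame (`sample` is an assumption for illustration); on the real data the same two calls would be applied to `df`.

```python
import pandas as pd

# Tiny illustrative frame (a stand-in, not the real dataset)
sample = pd.DataFrame({
    "age": [19, 18, 19, 28],
    "smoker": ["yes", "no", "yes", "no"],
    "charges": [16884.924, 1725.5523, 16884.924, 4449.462],
})

# Count rows that exactly repeat an earlier row
n_duplicates = int(sample.duplicated().sum())
print("duplicate rows:", n_duplicates)

# Drop the repeats, keeping the first occurrence of each row
deduped = sample.drop_duplicates().reset_index(drop=True)
print("shape after dropping duplicates:", deduped.shape)
```

Whether duplicates should be dropped depends on how the data was collected, so this is a check to run rather than a step to apply blindly.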

## 🔠 Encoding Categorical Features

Before training regression models, categorical variables must be converted into numeric format, since machine learning algorithms in Scikit-Learn work with numerical data only.

### 1. Encoding `smoker`
- The `smoker` column contains two categories: `yes` and `no`.
- It can be converted into a binary numeric variable:
  - `yes` → `1`
  - `no` → `0`

Example:
| smoker | smoker_encoded |
|:-------|:----------------|
| yes | 1 |
| no | 0 |

### 2. Encoding `region`
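- The `region` column has **four categories**: `northeast`, `northwest`, `southeast`, and `southwest`.
- **One-Hot Encoding** creates a separate binary column for each region, so no artificial order is implied between regions.
- The code in this notebook uses a simpler integer mapping instead (`southeast` → `0`, `southwest` → `1`, `northwest` → `2`, `northeast` → `3`). Tree-based models such as Random Forest handle this well, but the arbitrary order can distort Linear Regression coefficients.

Example of one-hot encoding: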
| region | northeast | northwest | southeast | southwest |
|:--------|:-----------|:-----------|:-----------|:-----------|
| northwest | 0 | 1 | 0 | 0 |
| southeast | 0 | 0 | 1 | 0 |

> ✅ After encoding, all features will be in numeric form and ready for model training.
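The binary mapping and the one-hot layout from the tables above can be sketched with pandas as follows. The `demo` frame and the `region_` column prefix are assumptions made for illustration, not part of the notebook's own code.

```python
import pandas as pd

# Small made-up frame standing in for the real `df`
demo = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["northwest", "southeast", "northeast"],
})

# Binary column: map the two categories to 1/0
demo["smoker"] = demo["smoker"].map({"yes": 1, "no": 0})

# One-hot encode `region`: one 0/1 column per category present in the data
encoded = pd.get_dummies(demo, columns=["region"], prefix="region", dtype=int)
print(encoded)
```

For linear models it is common to also pass `drop_first=True` to `pd.get_dummies`, which drops one of the region columns because it is fully determined by the others.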


In [30]:
df['region'].value_counts()

region
southeast    766
southwest    684
northwest    664
northeast    658
Name: count, dtype: int64

In [31]:
# Encode the two binary columns as 0/1
df['sex'] = df['sex'].map({'male': 1, 'female': 0})
df['smoker'] = df['smoker'].map({'yes': 1, 'no': 0})

# Map each region to an integer code (an ordinal mapping, not one-hot)
df['region'] = df['region'].map({'southeast': 0, 'southwest': 1, 'northwest': 2, 'northeast': 3})
df.head()

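|   | age | sex | bmi | children | smoker | region | charges |
|:--|:----|:----|:----|:---------|:-------|:-------|:--------|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 1 | 16884.924 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 0 | 1725.5523 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 0 | 4449.462 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 2 | 21984.47061 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 2 | 3866.8552 |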


## ⚖️ Feature Scaling
After encoding categorical variables, we apply feature scaling to ensure that all numeric features are on a similar scale.
This helps regression models (especially distance-based or gradient-based models) perform better and converge faster.

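### Why Scaling is Important
- Columns such as `age`, `bmi`, and `charges` span very different value ranges.
- Feature magnitudes affect models such as KNN, SVR, and linear models that are regularized or trained by gradient descent. Plain least-squares Linear Regression gives the same predictions with or without scaling, but its coefficients become easier to compare.
- Scaling ensures that no single feature dominates the learning process.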

### Common Scaling Techniques
1. **Standardization (`StandardScaler`)**
   - Transforms data so it has mean 0 and standard deviation 1.
   - Formula:

     $$ z = \frac{x - \mu}{\sigma} $$

   - Works well for most regression models.
2. **Normalization (`MinMaxScaler`)**
   - Scales all values between 0 and 1.
   - Useful when features have different units or ranges.
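To make the two techniques concrete, here is a minimal sketch that applies both scalers to one made-up column and checks the standardization formula by hand. The `ages` values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative ages as a single-feature column (values are made up)
ages = np.array([[18.0], [28.0], [33.0], [61.0]])

# Standardization: z = (x - mean) / std, using the population std (ddof=0)
standardized = StandardScaler().fit_transform(ages)
manual = (ages - ages.mean()) / ages.std()
print("matches the formula:", np.allclose(standardized, manual))

# Min-max normalization: (x - min) / (max - min), mapping values into [0, 1]
normalized = MinMaxScaler().fit_transform(ages)
print("min/max after normalization:", normalized.min(), normalized.max())
```

Standardization keeps the shape of the distribution but is not bounded, while min-max scaling is bounded and more sensitive to outliers.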

In [32]:
df[['age', 'bmi', 'children']].head()


|   | age | bmi | children |
|:--|:----|:----|:---------|
| 0 | 19 | 27.9 | 0 |
| 1 | 18 | 33.77 | 1 |
| 2 | 28 | 33.0 | 3 |
| 3 | 33 | 22.705 | 0 |
| 4 | 32 | 28.88 | 0 |


### Applied in this Project
We will use `StandardScaler` from Scikit-Learn to standardize the numeric columns:

- `age`
- `bmi`
- `children`

> ⚠️ Note: The target variable `charges` is not scaled, since it is the value we want to predict.

In [33]:
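from sklearn.preprocessing import StandardScaler

# Numeric columns to standardize; the target `charges` is left unscaled
features = ['age', 'bmi', 'children']

scaler = StandardScaler()

# Caution: this fits on the full dataset, before the train/test split,
# so test-set statistics influence the scaling
df[features] = scaler.fit_transform(df[features])

df[features].head()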




|   | age | bmi | children |
|:--|:----|:----|:---------|
| 0 | -1.428353 | -0.457114 | -0.907084 |
| 1 | -1.499381 | 0.500731 | -0.083758 |
| 2 | -0.789099 | 0.375085 | 1.562893 |
| 3 | -0.433959 | -1.304814 | -0.907084 |
| 4 | -0.504987 | -0.297201 | -0.907084 |


In [34]:
df.head()

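|   | age | sex | bmi | children | smoker | region | charges |
|:--|:----|:----|:----|:---------|:-------|:-------|:--------|
| 0 | -1.428353 | 0 | -0.457114 | -0.907084 | 1 | 1 | 16884.924 |
| 1 | -1.499381 | 1 | 0.500731 | -0.083758 | 0 | 0 | 1725.5523 |
| 2 | -0.789099 | 1 | 0.375085 | 1.562893 | 0 | 0 | 4449.462 |
| 3 | -0.433959 | 1 | -1.304814 | -0.907084 | 0 | 2 | 21984.47061 |
| 4 | -0.504987 | 1 | -0.297201 | -0.907084 | 0 | 2 | 3866.8552 |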


## Train-Test Split

In [35]:
from sklearn.model_selection import train_test_split

X = df.drop("charges",axis=1)
y = df['charges']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
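A common alternative ordering is to split first and let a pipeline fit the scaler on the training rows only, so no test-set statistics reach the model. The sketch below is one way to do that under stated assumptions: the data is synthetic, and the names `X_demo`, `y_demo`, `preprocess`, and `pipe` are hypothetical, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded, unscaled data (values are made up)
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 5, n),
    "smoker": rng.integers(0, 2, n),
})
y_demo = 250 * X_demo["age"] + 20000 * X_demo["smoker"] + rng.normal(0, 1000, n)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scale only the continuous columns; pass the binary column through unchanged
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["age", "bmi", "children"])],
    remainder="passthrough",
)
pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

# fit() learns the scaling statistics from the training rows only
pipe.fit(X_tr, y_tr)
print("test R^2:", pipe.score(X_te, y_te))
```

A side benefit of this design is that the pipeline is a single object, so saving it stores the scaler and the model together.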

# 3 Model Training and Evaluation


**# 3.1 Linear Regression**

In [36]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error


lr_model = LinearRegression()

lr_model.fit(X_train,y_train)

y_pred = lr_model.predict(X_test)

print("r2 score :", r2_score(y_test, y_pred))
print("MSE :", mean_squared_error(y_test, y_pred))

r2 score : 0.7395439728031499
MSE : 39975040.356450126


**# 3.2 Random Forest Regressor**

In [37]:
from sklearn.ensemble import RandomForestRegressor


rf_model = RandomForestRegressor()

rf_model.fit(X_train,y_train)

y_pred = rf_model.predict(X_test)

print("r2 score :", r2_score(y_test, y_pred))
print("MSE :", mean_squared_error(y_test, y_pred))

r2 score : 0.9527325334957625
MSE : 7254656.002358392
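MSE is expressed in squared units of `charges`, which makes the two numbers hard to read directly. Taking the square root gives RMSE, which is in the same units as the target. The sketch below only converts the MSE values printed by the two cells above; it does not retrain anything.

```python
import math

# MSE values copied from the Linear Regression and Random Forest outputs
mse_reported = {
    "Linear Regression": 39975040.356450126,
    "Random Forest": 7254656.002358392,
}

# RMSE is the square root of MSE, so it is in the same units as `charges`
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_reported.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:,.2f}")
```

This works out to a typical error of roughly 6,300 for Linear Regression and roughly 2,700 for Random Forest on this split.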


**3.4 Save Models**

In [39]:
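import pickle

# Persist the trained model and the fitted scaler; `with` closes each file
with open("rf_model.pkl", "wb") as f:
    pickle.dump(rf_model, f)
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)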
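Before relying on a saved file, it helps to confirm that it loads back and predicts the same values. The following round-trip sketch uses a small synthetic model in a temporary directory; `model`, `restored`, and the demo data are assumptions standing in for the notebook's `rf_model`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the notebook's `rf_model`
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("predictions match:", same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same Scikit-Learn version that created them.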

# 5 Inference

In [40]:
# Passing a one-row DataFrame keeps the feature names the model was fitted with
rf_model.predict(X.iloc[[10]])[0]

np.float64(2713.791250999994)

In [41]:
rf_model.predict(X.iloc[[50]])[0]

np.float64(2313.037812500003)

In [42]:
rf_model.predict(X.iloc[[150]])[0]



np.float64(5438.5857001)
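The predictions above use rows that are already encoded and scaled. A new raw record would first need the same category codes and the same fitted scaler. The sketch below shows one way to do this. `prepare_record` is a hypothetical helper and `demo_scaler` is fitted on made-up values; in practice the scaler loaded from `scaler.pkl` would take its place.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same category codes as the encoding cell earlier in the notebook
SEX_MAP = {"male": 1, "female": 0}
SMOKER_MAP = {"yes": 1, "no": 0}
REGION_MAP = {"southeast": 0, "southwest": 1, "northwest": 2, "northeast": 3}
NUMERIC = ["age", "bmi", "children"]
COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region"]


def prepare_record(record, scaler):
    """Hypothetical helper: encode and scale one raw record for prediction."""
    row = pd.DataFrame([record], columns=COLUMNS)
    row["sex"] = row["sex"].map(SEX_MAP)
    row["smoker"] = row["smoker"].map(SMOKER_MAP)
    row["region"] = row["region"].map(REGION_MAP)
    # Use floats so the scaled values fit the column dtype
    row = row.astype({name: float for name in NUMERIC})
    # transform (not fit_transform): reuse the statistics learned in training
    row[NUMERIC] = scaler.transform(row[NUMERIC])
    return row


# Stand-in scaler fitted on made-up values (assumption for this sketch)
reference = pd.DataFrame({
    "age": [20, 40, 60],
    "bmi": [22.0, 30.0, 38.0],
    "children": [0, 1, 2],
})
demo_scaler = StandardScaler().fit(reference)

raw = {"age": 40, "sex": "female", "bmi": 30.0,
       "children": 1, "smoker": "no", "region": "northwest"}
prepared = prepare_record(raw, demo_scaler)
print(prepared)
```

In the notebook, the result could then be passed to the model as `rf_model.predict(prepare_record(raw, scaler))[0]`, since the column order matches `X`.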