---
<center><h1>Used Car Price Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

The goal of this project is **to develop a machine learning model that can predict the selling price of a used car based on its features**. This falls under **Regression Machine Learning Problem**. The aim is to assist buyers, sellers, and dealers in making informed decisions about used car pricing.

## 2) Understanding Data
---

The project uses **Used Car Price Data**, which contains eight independent variables and one outcome variable (dependent variable) called **Selling_Price**. The variables in the dataset are as follows:

- Car_Name
- Year
- Selling_Price (Target Variable)
- Present_Price
- Kms_Driven
- Fuel_Type
- Seller_Type
- Transmission
- Owner

## 3) Getting System Ready
---
Importing required libraries


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")
%matplotlib inline

## 4) Data Eyeballing
---

### Loading Data

In [2]:
used_car_df = pd.read_csv('Datasets/Day7_Used_Car_Price_Data.csv') 

In [3]:
used_car_df

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.60,6.87,42450,Diesel,Dealer,Manual,0
...,...,...,...,...,...,...,...,...,...
296,city,2016,9.50,11.60,33988,Diesel,Dealer,Manual,0
297,brio,2015,4.00,5.90,60000,Petrol,Dealer,Manual,0
298,city,2009,3.35,11.00,87934,Petrol,Dealer,Manual,0
299,city,2017,11.50,12.50,9000,Diesel,Dealer,Manual,0


In [4]:
print('The size of Dataframe is: ', used_car_df.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
used_car_df.info()
print('-'*100)

The size of Dataframe is:  (301, 9)
----------------------------------------------------------------------------------------------------
The Column Name, Record Count and Data Types are as follows: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Kms_Driven     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Seller_Type    301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB
----------------------------------------------------------------------------------------------------


In [5]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in used_car_df.columns if used_car_df[feature].dtype != 'O']
categorical_features = [feature for feature in used_car_df.columns if used_car_df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 5 numerical features : ['Year', 'Selling_Price', 'Present_Price', 'Kms_Driven', 'Owner']

We have 4 categorical features : ['Car_Name', 'Fuel_Type', 'Seller_Type', 'Transmission']
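
The same split into numerical and categorical columns can also be obtained with `DataFrame.select_dtypes`, which does not depend on comparing each dtype against `'O'`. The snippet below is a minimal sketch on a small stand-in frame (`sample_df` is a hypothetical name; its three rows are copied from the preview above), not the notebook's own cell.

```python
import pandas as pd

# Stand-in frame with the same columns as used_car_df (rows taken from the preview above)
sample_df = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz"],
    "Year": [2014, 2013, 2017],
    "Selling_Price": [3.35, 4.75, 7.25],
    "Present_Price": [5.59, 9.54, 9.85],
    "Kms_Driven": [27000, 43000, 6900],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    "Seller_Type": ["Dealer", "Dealer", "Dealer"],
    "Transmission": ["Manual", "Manual", "Manual"],
    "Owner": [0, 0, 0],
})

# Numeric columns (int and float) versus everything else (here: text columns)
numeric_features = sample_df.select_dtypes(include="number").columns.tolist()
categorical_features = sample_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```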


In [6]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=used_car_df.isnull().sum().sort_values(ascending=False)
percent=(used_car_df.isnull().sum()/used_car_df.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

Missing Value Presence in different columns of DataFrame are as follows : 
----------------------------------------------------------------------------------------------------


Unnamed: 0,Total,Percent
Car_Name,0,0.0
Year,0,0.0
Selling_Price,0,0.0
Present_Price,0,0.0
Kms_Driven,0,0.0
Fuel_Type,0,0.0
Seller_Type,0,0.0
Transmission,0,0.0
Owner,0,0.0


In [7]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
used_car_df.describe()

Summary Statistics of numerical features for DataFrame are as follows:
----------------------------------------------------------------------------------------------------


Unnamed: 0,Year,Selling_Price,Present_Price,Kms_Driven,Owner
count,301.0,301.0,301.0,301.0,301.0
mean,2013.627907,4.661296,7.628472,36947.20598,0.043189
std,2.891554,5.082812,8.644115,38886.883882,0.247915
min,2003.0,0.1,0.32,500.0,0.0
25%,2012.0,0.9,1.2,15000.0,0.0
50%,2014.0,3.6,6.4,32000.0,0.0
75%,2016.0,6.0,9.9,48767.0,0.0
max,2018.0,35.0,92.6,500000.0,3.0


In [8]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
print('-'*100)
used_car_df.describe(include='object').T

Summary Statistics of categorical features for DataFrame are as follows:
----------------------------------------------------------------------------------------------------


Unnamed: 0,count,unique,top,freq
Car_Name,301,98,city,26
Fuel_Type,301,3,Petrol,239
Seller_Type,301,2,Dealer,195
Transmission,301,2,Manual,261


## 5) Data Cleaning & Preprocessing
---

### Encoding the Categorical Data

In [9]:
# encoding "Fuel_Type" Column
used_car_df.replace({'Fuel_Type':{'Petrol':0,'Diesel':1,'CNG':2}},inplace=True)

# encoding "Seller_Type" Column
used_car_df.replace({'Seller_Type':{'Dealer':0,'Individual':1}},inplace=True)

# encoding "Transmission" Column
used_car_df.replace({'Transmission':{'Manual':0,'Automatic':1}},inplace=True)

In [10]:
used_car_df

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,0,0,0,0
1,sx4,2013,4.75,9.54,43000,1,0,0,0
2,ciaz,2017,7.25,9.85,6900,0,0,0,0
3,wagon r,2011,2.85,4.15,5200,0,0,0,0
4,swift,2014,4.60,6.87,42450,1,0,0,0
...,...,...,...,...,...,...,...,...,...
296,city,2016,9.50,11.60,33988,1,0,0,0
297,brio,2015,4.00,5.90,60000,0,0,0,0
298,city,2009,3.35,11.00,87934,0,0,0,0
299,city,2017,11.50,12.50,9000,1,0,0,0
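
The integer codes above give `Fuel_Type` an implied order (Petrol < Diesel < CNG) that linear models and SVR will treat as a numeric scale. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a small hypothetical frame (`raw` and its four rows are made up for illustration); it is an option to consider, not the encoding this notebook uses.

```python
import pandas as pd

# Hypothetical rows covering every category that appears in the dataset
raw = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Seller_Type": ["Dealer", "Individual", "Dealer", "Dealer"],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual"],
})

# One indicator column per category; drop_first removes one level per feature
# so the indicators are not perfectly collinear.
encoded = pd.get_dummies(
    raw,
    columns=["Fuel_Type", "Seller_Type", "Transmission"],
    drop_first=True,
    dtype=int,
)

print(encoded)
```

For the two-level columns (`Seller_Type`, `Transmission`) this is equivalent to the 0/1 encoding already used; the difference only matters for the three-level `Fuel_Type`.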


## 6) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [11]:
# separating the data and labels
X = used_car_df.drop(columns=['Car_Name', 'Selling_Price'])  # Feature matrix (axis=1 is redundant when `columns` is given)
y = used_car_df['Selling_Price']  # Target variable

In [12]:
X

Unnamed: 0,Year,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,2014,5.59,27000,0,0,0,0
1,2013,9.54,43000,1,0,0,0
2,2017,9.85,6900,0,0,0,0
3,2011,4.15,5200,0,0,0,0
4,2014,6.87,42450,1,0,0,0
...,...,...,...,...,...,...,...
296,2016,11.60,33988,1,0,0,0
297,2015,5.90,60000,0,0,0,0
298,2009,11.00,87934,0,0,0,0
299,2017,12.50,9000,1,0,0,0


In [13]:
y

0       3.35
1       4.75
2       7.25
3       2.85
4       4.60
       ...  
296     9.50
297     4.00
298     3.35
299    11.50
300     5.30
Name: Selling_Price, Length: 301, dtype: float64

### Data Standardization

In [14]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [15]:
scaler.fit(X)

In [16]:
standardized_data = scaler.transform(X)

In [17]:
standardized_data

array([[ 0.128897  , -0.23621461, -0.25622446, ..., -0.73728539,
        -0.39148015, -0.17450057],
       [-0.21751369,  0.22150462,  0.1559105 , ..., -0.73728539,
        -0.39148015, -0.17450057],
       [ 1.16812909,  0.25742689, -0.77396901, ..., -0.73728539,
        -0.39148015, -0.17450057],
       ...,
       [-1.60315648,  0.39068691,  1.31334003, ..., -0.73728539,
        -0.39148015, -0.17450057],
       [ 1.16812909,  0.56450434, -0.7198763 , ..., -0.73728539,
        -0.39148015, -0.17450057],
       [ 0.8217184 , -0.20029235, -0.81095812, ..., -0.73728539,
        -0.39148015, -0.17450057]])

In [18]:
X = standardized_data

In [19]:
X

array([[ 0.128897  , -0.23621461, -0.25622446, ..., -0.73728539,
        -0.39148015, -0.17450057],
       [-0.21751369,  0.22150462,  0.1559105 , ..., -0.73728539,
        -0.39148015, -0.17450057],
       [ 1.16812909,  0.25742689, -0.77396901, ..., -0.73728539,
        -0.39148015, -0.17450057],
       ...,
       [-1.60315648,  0.39068691,  1.31334003, ..., -0.73728539,
        -0.39148015, -0.17450057],
       [ 1.16812909,  0.56450434, -0.7198763 , ..., -0.73728539,
        -0.39148015, -0.17450057],
       [ 0.8217184 , -0.20029235, -0.81095812, ..., -0.73728539,
        -0.39148015, -0.17450057]])

### Train-Test Split

In [20]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

In [21]:
print(X.shape, X_train.shape, X_test.shape)

(301, 7) (240, 7) (61, 7)


In [22]:
print(y.shape, y_train.shape, y_test.shape)

(301,) (240,) (61,)
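
Note that the `StandardScaler` above was fit on all 301 rows before the split, so the test rows contribute to the means and standard deviations used for scaling. A common way to avoid this leakage is to split first and let a pipeline fit the scaler on the training rows only. The following is a minimal sketch on synthetic data (`X_demo`, `y_demo`, and the choice of `Ridge` are assumptions for illustration, not part of the original notebook).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the feature matrix (301 rows, 7 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(301, 7))
y_demo = X_demo @ rng.normal(size=7) + rng.normal(scale=0.1, size=301)

# Split first, so the test rows are never seen while fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=45
)

# The pipeline fits StandardScaler on X_tr only, then reuses those
# statistics to transform X_te inside score()/predict().
model = make_pipeline(StandardScaler(), Ridge())
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)

print(X_tr.shape, X_te.shape)
print(test_r2)
```

With only 301 rows the effect of the leakage is probably small, but the pipeline form also keeps scaling and model together when cross-validating.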


### Model Comparison : Training & Evaluation

In [23]:
# For Model Building
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [24]:
models = [LinearRegression, Lasso, Ridge, SVR, DecisionTreeRegressor, RandomForestRegressor]
mae_scores = []
mse_scores = []
rmse_scores = []
r2_scores = []

for model in models:
    regressor = model().fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    
    mae_scores.append(mean_absolute_error(y_test, y_pred))
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    # RMSE as the square root of MSE; the `squared=False` argument is deprecated in newer scikit-learn releases
    rmse_scores.append(np.sqrt(mse))
    r2_scores.append(r2_score(y_test, y_pred))

In [25]:
regression_metrics_df = pd.DataFrame({
    "Model": ["Linear Regression", "Lasso", "Ridge", "SVR", "Decision Tree Regressor", "Random Forest Regressor"],
    "Mean Absolute Error": mae_scores,
    "Mean Squared Error": mse_scores,
    "Root Mean Squared Error": rmse_scores,
    "R-squared (R2)": r2_scores
})

regression_metrics_df.set_index('Model', inplace=True)
regression_metrics_df

Model,Mean Absolute Error,Mean Squared Error,Root Mean Squared Error,R-squared (R2)
Linear Regression,1.201595,2.32136,1.523601,0.893639
Lasso,1.560071,4.372852,2.091136,0.799643
Ridge,1.20103,2.326063,1.525143,0.893423
SVR,1.040184,4.062913,2.015667,0.813844
Decision Tree Regressor,0.524754,0.843621,0.918489,0.961347
Random Forest Regressor,0.494918,0.653351,0.808301,0.970064
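
To make the four columns in the table concrete, the sketch below computes each metric by hand with NumPy and checks it against the scikit-learn function of the same name. The true values are the first four selling prices from the preview; the predictions in `y_hat` are hypothetical numbers chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.35, 4.75, 7.25, 2.85])  # first four selling prices from the preview
y_hat = np.array([3.00, 5.00, 7.00, 3.00])   # hypothetical predictions

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # same units as the target
# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))

print(mae, mse, rmse, r2)
```

Because RMSE is expressed in the units of `Selling_Price`, it is usually the easiest of the error columns to interpret.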


### Inference

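In the context of predicting used car prices,
- **Random Forest Regressor** is the **best-performing model** on this test split (R² ≈ 0.970, RMSE ≈ 0.81), followed closely by the Decision Tree Regressor (R² ≈ 0.961, RMSE ≈ 0.92). Linear Regression and Ridge are clearly behind (R² ≈ 0.894, RMSE ≈ 1.52), and SVR and Lasso, both run with default hyperparameters, explain the least variance (R² ≈ 0.814 and 0.800). These results come from a single test split of 61 cars, so the gaps between closely ranked models are indicative rather than conclusive. The choice of the best model may also depend on other factors such as computational complexity and interpretability.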
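
Since the ranking rests on one train-test split and default hyperparameters, a natural next step is k-fold cross-validation plus a small hyperparameter search. `RandomizedSearchCV` is imported earlier in the notebook but never used. The sketch below shows one way it could be applied. The synthetic `X_demo`/`y_demo`, the fold count, and the parameter grid are all assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score

# Synthetic stand-in data with a mildly non-linear target (7 features, like X)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 7))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=45)

# Mean and spread of R2 across folds for two of the compared models
candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=45),
}
cv_r2 = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="r2")
    cv_r2[name] = (scores.mean(), scores.std())

# Randomized search over a small, hypothetical grid for the random forest
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=45),
    param_distributions={
        "n_estimators": [50, 100],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,
    cv=cv,
    scoring="r2",
    random_state=45,
)
search.fit(X_demo, y_demo)

print(cv_r2)
print(search.best_params_, search.best_score_)
```

On the real data, the search would be run on `X_train`/`y_train` only, keeping the test split for a final, untouched evaluation.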