<a href="https://www.kaggle.com/code/manishkr1754/medical-insurance-cost-prediction?scriptVersionId=144308737" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
<center><h1>Medical Insurance Cost Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

The goal of this project is **to develop a machine learning model that predicts medical insurance cost from the features of the insured person**. This is a **regression problem**. The aim is to help insurance companies, healthcare providers, and individuals make informed decisions about insurance coverage and premium pricing.

## 2) Understanding Data
---

The project uses **Medical Insurance Cost Data**, which contains independent variables including `age`, `sex`, `bmi`, `children`, `smoker`, and `region`, and one dependent (outcome) variable, `charges`.

## 3) Getting System Ready
---
Importing required libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")
%matplotlib inline

## 4) Data Eyeballing
---

### Loading Data

In [None]:
medical_insurance_df = pd.read_csv('Datasets/Day11_Medical_Insurance_Data.csv') 

In [None]:
medical_insurance_df

In [None]:
print('The size of Dataframe is: ', medical_insurance_df.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
medical_insurance_df.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in medical_insurance_df.columns if medical_insurance_df[feature].dtype != 'O']
categorical_features = [feature for feature in medical_insurance_df.columns if medical_insurance_df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=medical_insurance_df.isnull().sum().sort_values(ascending=False)
percent=(medical_insurance_df.isnull().sum()/medical_insurance_df.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
medical_insurance_df.describe()

In [None]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
print('-'*100)
medical_insurance_df.describe(include='object').T

## 5) Data Cleaning & Preprocessing
---

### Distribution of Age

In [None]:
sns.set()
plt.figure(figsize=(6,6))
sns.histplot(medical_insurance_df['age'], kde=True, stat='density')
plt.title('Age Distribution')
plt.show()

### Sex Distribution

In [None]:
plt.figure(figsize=(4,4))
ax = sns.countplot(x='sex', data=medical_insurance_df)
plt.title('Sex Distribution')

# Label each bar with its own height so the counts always match the plotted order
for bar in ax.patches:
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), str(int(bar.get_height())), ha='center', va='bottom')

plt.xlabel('Sex')
plt.ylabel('Count')
plt.show()

### BMI Distribution

In [None]:
plt.figure(figsize=(6,6))
sns.histplot(medical_insurance_df['bmi'], kde=True, stat='density')
plt.title('BMI Distribution')
plt.show()

### Children Count Distribution

In [None]:
plt.figure(figsize=(6,6))
sns.countplot(x='children', data=medical_insurance_df)
plt.title('Children Count Distribution')
plt.show()

In [None]:
children_count = medical_insurance_df['children'].value_counts()
children_count

### Smoker Distribution

In [None]:
plt.figure(figsize=(6,6))
ax = sns.countplot(x='smoker', data=medical_insurance_df)
plt.title('Smoker Distribution')
plt.xlabel('Smoker')
plt.ylabel('Count')
plt.show()

In [None]:
smoker_counts = medical_insurance_df['smoker'].value_counts()
smoker_counts

### Region Distribution

In [None]:
plt.figure(figsize=(6,6))
sns.countplot(x='region', data=medical_insurance_df)
plt.title('Region Distribution')
plt.show()

In [None]:
medical_insurance_df['region'].value_counts()

### Distribution of Insurance Cost

In [None]:
plt.figure(figsize=(6,6))
sns.histplot(medical_insurance_df['charges'], kde=True, stat='density')
plt.title('Medical Cost Distribution')
plt.show()

### Encoding the Categorical Features

In [None]:
# encoding sex column
medical_insurance_df.replace({'sex':{'male':0,'female':1}}, inplace=True)

# encoding 'smoker' column
medical_insurance_df.replace({'smoker':{'yes':0,'no':1}}, inplace=True)

# encoding 'region' column
medical_insurance_df.replace({'region':{'southeast':0,'southwest':1,'northeast':2,'northwest':3}}, inplace=True)

In [None]:
medical_insurance_df
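
The integer codes assigned to `region` above imply an ordering (southeast < southwest < northeast < northwest) that the regions do not really have. One common alternative is one-hot encoding. The sketch below is only an illustration under stated assumptions: `sample_df` is a hypothetical four-row frame with made-up values that reuses the dataset's column names, and it is not the method used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical mini-sample (made-up values) reusing the dataset's column names
sample_df = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "charges": [1000.0, 2000.0, 3000.0, 4000.0],
})

# One indicator column per region; drop_first removes one redundant column
encoded_df = pd.get_dummies(sample_df, columns=["region"], drop_first=True, dtype=int)
print(encoded_df.columns.tolist())
```

With `drop_first=True` the alphabetically first category (`northeast`) becomes the implicit baseline, which avoids perfectly collinear columns in linear models.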

## 6) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [None]:
# separating the data and labels
X = medical_insurance_df.drop(columns=['charges']) # Feature matrix
y = medical_insurance_df['charges'] # Target variable

In [None]:
X

In [None]:
y

### Data Standardization

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

In [None]:
standardized_data

In [None]:
X = standardized_data

In [None]:
X
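
The scaler above is fitted on every row before the data is split, so statistics from the future test rows influence the scaling. A possible way to avoid this is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. This is a minimal sketch, not the workflow this notebook follows: `X_demo` and `y_demo` are synthetic stand-ins for the insurance features and charges, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=40, scale=10, size=(200, 3))
y_demo = 250 * X_demo[:, 0] + 100 * X_demo[:, 1] + rng.normal(scale=50, size=200)

# Split first, so the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=45)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when scoring the test rows
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_tr, y_tr)

fitted_scaler = pipeline.named_steps["standardscaler"]
print("Scaler means (training rows only):", fitted_scaler.mean_)
print("Test R-squared:", pipeline.score(X_te, y_te))
```

Wrapping the scaler and the model together also means the same preprocessing is applied automatically whenever `pipeline.predict` is called on new data.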

### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

In [None]:
print(y.shape, y_train.shape, y_test.shape)

### Model Comparison : Training & Evaluation

In [None]:
# For Model Building
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
models = [LinearRegression, Lasso, Ridge, SVR, DecisionTreeRegressor, RandomForestRegressor]
mae_scores = []
mse_scores = []
rmse_scores = []
r2_scores = []

for model in models:
    regressor = model().fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    
    mae_scores.append(mean_absolute_error(y_test, y_pred))
    mse_scores.append(mean_squared_error(y_test, y_pred))
    rmse_scores.append(np.sqrt(mean_squared_error(y_test, y_pred)))
    r2_scores.append(r2_score(y_test, y_pred))

In [None]:
regression_metrics_df = pd.DataFrame({
    "Model": ["Linear Regression", "Lasso", "Ridge", "SVR", "Decision Tree Regressor", "Random Forest Regressor"],
    "Mean Absolute Error": mae_scores,
    "Mean Squared Error": mse_scores,
    "Root Mean Squared Error": rmse_scores,
    "R-squared (R2)": r2_scores
})

regression_metrics_df.set_index('Model', inplace=True)
regression_metrics_df
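
To make the four metrics in the table concrete, the sketch below computes them by hand with NumPy on a tiny made-up example. The arrays `y_true` and `y_hat` are hypothetical values chosen for easy arithmetic, not results from the insurance data.

```python
import numpy as np

# Made-up actual and predicted values, chosen for easy arithmetic
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 12.0])

residuals = y_true - y_hat

mae = np.mean(np.abs(residuals))   # average absolute error
mse = np.mean(residuals ** 2)      # average squared error
rmse = np.sqrt(mse)                # same units as the target

# R-squared: 1 minus (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
```

Because R-squared compares the model against always predicting the mean of the target, it becomes negative whenever the model's squared error exceeds that of the mean.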

### Inference

Comparing the models on the test set:
- Linear Regression, Lasso, and Ridge perform almost identically, with a Mean Absolute Error (MAE) of about 4443 and an R-squared of about 0.70. They fit the data reasonably well and could serve as a baseline for estimating insurance costs.
- The Support Vector Regressor (SVR) performs far worse, with a much higher MAE and a negative R-squared, meaning it does worse than a constant prediction of the mean charge. This is likely because it was run with default hyperparameters and no tuning, so it is not a suitable choice as configured.
- The Decision Tree Regressor and Random Forest Regressor show promising results, with lower MAE and higher R-squared values, which suggests they capture more complex relationships in the data. They are worth further exploration and fine-tuning.

In summary, the **Random Forest Regressor** is the most promising model for medical insurance cost prediction: it combines a low MAE with a high R-squared, making it a candidate for practical use.
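
As a starting point for the fine-tuning mentioned above, a randomized hyperparameter search over the Random Forest could look like the sketch below. It rests on several assumptions: `X_demo` and `y_demo` are synthetic stand-ins for the real training data, and the search space in `param_distributions` is an arbitrary illustration, not a recommended or tested configuration for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (not the insurance dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=150)

# Hypothetical search space, chosen only for illustration
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 4, 8],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=45),
    param_distributions=param_distributions,
    n_iter=5,                            # number of sampled configurations
    cv=3,                                # 3-fold cross-validation
    scoring="neg_mean_absolute_error",   # scikit-learn maximizes, so MAE is negated
    random_state=45,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Cross-validated MAE:", -search.best_score_)
```

In the notebook itself, `X_train` and `y_train` would take the place of the synthetic arrays, and the tuned model would then be evaluated once on `X_test`.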