
# Medical Insurance Cost Prediction using Supervised Machine Learning

### Problem Statement
Medical insurance companies need to estimate each customer's insurance cost from personal and health details.
In this project, we predict insurance charges using supervised machine learning regression models.

### Dataset
Medical Insurance Cost Dataset (CSV)

Features: age, sex, bmi, children, smoker, region

Target: charges

In [23]:
!pip install -U kaleido



In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [26]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [27]:
data = pd.read_csv('/content/drive/MyDrive/data/insurance.csv')
data.head()


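|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |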


In [28]:
data.info()
data.describe()
data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


| Column | Missing values |
|--------|----------------|
| age | 0 |
| sex | 0 |
| bmi | 0 |
| children | 0 |
| smoker | 0 |
| region | 0 |
| charges | 0 |


In [29]:
# LabelEncoder assigns integer codes in sorted order of the class labels (e.g. female=0, male=1).
le = LabelEncoder()
data['sex'] = le.fit_transform(data['sex'])
data['smoker'] = le.fit_transform(data['smoker'])
data['region'] = le.fit_transform(data['region'])
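
Integer codes give `region` an arbitrary numeric order (northeast < northwest < southeast < southwest), which linear models may read as a real ordering. One common alternative is one-hot encoding. The sketch below is illustrative only: `sample` is a hypothetical toy frame with three of the dataset's columns, not the notebook's `data`, and the 0/1 mappings are chosen to match what `LabelEncoder` would produce.

```python
import pandas as pd

# Hypothetical toy frame mirroring three of the dataset's categorical columns.
sample = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# region gets one indicator column per category instead of a single ordered code.
encoded = pd.get_dummies(sample, columns=["region"], dtype=int)

# Two-valued columns can simply be mapped to 0/1.
encoded["sex"] = encoded["sex"].map({"female": 0, "male": 1})
encoded["smoker"] = encoded["smoker"].map({"no": 0, "yes": 1})

print(encoded)
```

Tree-based models are usually less sensitive to this choice, so the difference, if any, would be expected mainly in the linear models and SVR.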


In [30]:
X = data.drop('charges', axis=1)
y = data['charges']


In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [32]:
X_train.shape, X_test.shape

((1070, 6), (268, 6))

In [33]:
y_train.shape, y_test.shape

((1070,), (268,))

In [34]:
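scaler = StandardScaler()

# Fit the scaler on the training split only, then reuse its statistics on the test split
# so that no test-set information leaks into training.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)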


In [35]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit the model on the training split and print test-set regression metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    print("MAE:", mean_absolute_error(y_test, y_pred))
    print("MSE:", mse)
    print("RMSE:", np.sqrt(mse))  # root of MSE, in the same units as charges
    print("R2 Score:", r2_score(y_test, y_pred))
    print("-"*40)
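
To make the four metrics concrete, here is a minimal hand computation in plain Python. The values in `y_true` and `y_pred` are hypothetical toy numbers chosen to keep the arithmetic easy; they are not taken from the insurance data.

```python
import math

# Hypothetical toy values, chosen only to keep the arithmetic easy to follow.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # back in the units of the target

# R2 compares the model's squared error with that of always predicting the mean.
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r2)
```

Because R2 is measured against a predict-the-mean baseline, it becomes negative whenever a model's squared error exceeds that baseline's.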


In [36]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
evaluate_model(lr, X_train, X_test, y_train, y_test)


MAE: 4186.508898366436
MSE: 33635210.43117845
RMSE: 5799.5870914383595
R2 Score: 0.7833463107364536
----------------------------------------


In [37]:
from sklearn.linear_model import Ridge

ridge = Ridge()
evaluate_model(ridge, X_train, X_test, y_train, y_test)

MAE: 4187.971685427724
MSE: 33641818.58882587
RMSE: 5800.156772780014
R2 Score: 0.7833037457661384
----------------------------------------


In [38]:
from sklearn.linear_model import Lasso

lasso = Lasso()
evaluate_model(lasso, X_train, X_test, y_train, y_test)

MAE: 4186.623542226471
MSE: 33637843.01629289
RMSE: 5799.814050147892
R2 Score: 0.7833293535279202
----------------------------------------


In [39]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()
evaluate_model(dt, X_train, X_test, y_train, y_test)

MAE: 2864.898040432836
MSE: 39517422.85439793
RMSE: 6286.288480049093
R2 Score: 0.7454573543070017
----------------------------------------


In [40]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
evaluate_model(rf, X_train, X_test, y_train, y_test)

MAE: 2517.112512436949
MSE: 21037151.30069391
RMSE: 4586.627442979636
R2 Score: 0.8644938924875618
----------------------------------------


In [41]:
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor()
evaluate_model(gbr, X_train, X_test, y_train, y_test)

MAE: 2446.925720308309
MSE: 18932125.408063125
RMSE: 4351.106228083052
R2 Score: 0.8780529462228404
----------------------------------------


In [42]:
from sklearn.svm import SVR

svr = SVR(kernel='rbf')
evaluate_model(svr, X_train, X_test, y_train, y_test)


MAE: 8599.328962388287
MSE: 165839509.92452022
RMSE: 12877.868997800848
R2 Score: -0.06821813183902203
----------------------------------------
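
A negative R2 means this SVR does worse than always predicting the mean charge. One likely cause, stated here as an assumption and not verified on this dataset, is that SVR's default `C=1.0` and `epsilon=0.1` are tiny relative to a target measured in thousands, so the fitted function can barely move away from a constant. The sketch below shows the effect on synthetic stand-in data and one possible remedy: standardising the target with scikit-learn's `TransformedTargetRegressor`. All `*_demo` and `*d_*` names are hypothetical.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the insurance dataset): a target in the thousands.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (
    10000
    + 5000 * X_demo[:, 0]
    + 3000 * X_demo[:, 1]
    + rng.normal(scale=500, size=300)
)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# SVR fitted directly on the raw target, as in the cell above.
svr_raw = SVR(kernel="rbf").fit(Xd_train, yd_train)
r2_raw = r2_score(yd_test, svr_raw.predict(Xd_test))

# Same SVR, but the target is standardised for fitting and
# transformed back to the original units for predictions.
svr_scaled = TransformedTargetRegressor(
    regressor=SVR(kernel="rbf"), transformer=StandardScaler()
).fit(Xd_train, yd_train)
r2_scaled = r2_score(yd_test, svr_scaled.predict(Xd_test))

print("R2, raw target:   ", r2_raw)
print("R2, scaled target:", r2_scaled)
```

Raising `C` substantially is another option people try; either way, the numbers from this synthetic example say nothing about how much SVR would improve on the insurance data itself.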


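# Final Conclusion
This project shows that supervised regression models can predict medical insurance charges with reasonable accuracy. On the held-out test set, the ensemble models performed best: Gradient Boosting reached the highest R2 (about 0.878, RMSE about 4351), followed by Random Forest (R2 about 0.864, RMSE about 4587). The linear models (Linear Regression, Ridge, Lasso) all scored about 0.783, a single Decision Tree about 0.745, and SVR with default settings failed to fit the data (R2 about -0.07). The ensembles' advantage is consistent with their ability to capture non-linear relationships and interactions between features.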
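
The test-set scores printed above can be collected and ranked directly. This is a small sketch: the numbers are copied (rounded) from the outputs in this notebook, while the `results` dictionary and its labels are introduced here for illustration.

```python
# R2 and RMSE copied (rounded) from the test-set outputs printed above.
results = {
    "Linear Regression": {"r2": 0.78335, "rmse": 5799.59},
    "Ridge": {"r2": 0.78330, "rmse": 5800.16},
    "Lasso": {"r2": 0.78333, "rmse": 5799.81},
    "Decision Tree": {"r2": 0.74546, "rmse": 6286.29},
    "Random Forest": {"r2": 0.86449, "rmse": 4586.63},
    "Gradient Boosting": {"r2": 0.87805, "rmse": 4351.11},
    "SVR (RBF)": {"r2": -0.06822, "rmse": 12877.87},
}

# Rank models from highest to lowest R2.
ranking = sorted(results.items(), key=lambda item: item[1]["r2"], reverse=True)

for name, scores in ranking:
    print(f"{name:<18} R2={scores['r2']:.5f}  RMSE={scores['rmse']:.2f}")

best_model = ranking[0][0]
print("Best by R2:", best_model)
```

The Decision Tree and Gradient Boosting models were created without a `random_state`, so their scores may shift slightly between runs; the ranking is also based on a single train/test split.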