# Medical Insurance Premium Prediction

The premium for a health insurance policy varies from person to person because many factors affect it. Take age: a young person is much less likely to have major health problems than an older person, so treating an older person is generally more expensive. That is why an older person is required to pay a higher premium than a young person.

Just like age, many other factors affect the premium for a health insurance policy. That is, in brief, how the premium is determined. In the sections below, I will take you through the task of health insurance premium prediction with machine learning using Python.

Dataset: https://www.kaggle.com/tejashvi14/medical-insurance-premium-prediction

In [1]:
# import libraries
import numpy as np  # numerical operations on arrays
import pandas as pd  # data loading and analysis

from xgboost import XGBRegressor  # gradient-boosted tree model for regression
from sklearn.model_selection import train_test_split  # splits the data into training and test sets
from sklearn.model_selection import cross_val_score  # estimates model performance with cross-validation

In [2]:
# load the dataset into a pandas DataFrame for manipulating the data
data = pd.read_csv('Medicalpremium.csv', encoding = 'latin-1')

# replace any null values with an empty string so that later steps do not fail on missing data
# the result is stored in the variable called "medical_premium_data"
medical_premium_data = data.where((pd.notnull(data)), '')

# let's check the shape of the dataset
medical_premium_data.shape

(986, 11)
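
The null replacement above is harmless here only because this dataset turns out to contain no missing values. As a minimal sketch on a made-up three-row frame (the `toy` data and the median fill are assumptions for illustration, not part of the original workflow), the following shows what would happen to a numeric column that did contain a gap: filling it with an empty string makes the column non-numeric, while a numeric fill such as the median keeps it usable by a model.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one missing age (not taken from Medicalpremium.csv)
toy = pd.DataFrame({
    "Age": [45.0, np.nan, 36.0],
    "PremiumPrice": [25000, 29000, 23000],
})

# same pattern as in the notebook: nulls are replaced by empty strings
blank_filled = toy.where(pd.notnull(toy), "")

# alternative (assumption): fill numeric gaps with the column median
median_filled = toy.fillna(toy.median(numeric_only=True))

print(blank_filled.dtypes)   # "Age" is no longer a numeric column
print(median_filled.dtypes)  # "Age" stays numeric
```

If a real gap appeared in a feature column, the blank-string version would have to be converted back to numbers before training, so a numeric fill is usually the simpler choice.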

In [3]:
#printing the dataset
print(medical_premium_data)

     Age  Diabetes  BloodPressureProblems  AnyTransplants  AnyChronicDiseases  \
0     45         0                      0               0                   0   
1     60         1                      0               0                   0   
2     36         1                      1               0                   0   
3     52         1                      1               0                   1   
4     38         0                      0               0                   1   
..   ...       ...                    ...             ...                 ...   
981   18         0                      0               0                   0   
982   64         1                      1               0                   0   
983   56         0                      1               0                   0   
984   47         1                      1               0                   0   
985   21         0                      0               0                   0   

     Height  Weight  KnownA

### Printing the head of the dataset to have a look at the dataframe

In [4]:
# let's look at a sample of the dataset
# first 10 rows of the dataset
medical_premium_data.head(10)

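| | Age | Diabetes | BloodPressureProblems | AnyTransplants | AnyChronicDiseases | Height | Weight | KnownAllergies | HistoryOfCancerInFamily | NumberOfMajorSurgeries | PremiumPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 0 | 0 | 0 | 0 | 155 | 57 | 0 | 0 | 0 | 25000 |
| 1 | 60 | 1 | 0 | 0 | 0 | 180 | 73 | 0 | 0 | 0 | 29000 |
| 2 | 36 | 1 | 1 | 0 | 0 | 158 | 59 | 0 | 0 | 1 | 23000 |
| 3 | 52 | 1 | 1 | 0 | 1 | 183 | 93 | 0 | 0 | 2 | 28000 |
| 4 | 38 | 0 | 0 | 0 | 1 | 166 | 88 | 0 | 0 | 1 | 23000 |
| 5 | 30 | 0 | 0 | 0 | 0 | 160 | 69 | 1 | 0 | 1 | 23000 |
| 6 | 33 | 0 | 0 | 0 | 0 | 150 | 54 | 0 | 0 | 0 | 21000 |
| 7 | 23 | 0 | 0 | 0 | 0 | 181 | 79 | 1 | 0 | 0 | 15000 |
| 8 | 48 | 1 | 0 | 0 | 0 | 169 | 74 | 1 | 0 | 0 | 23000 |
| 9 | 38 | 0 | 0 | 0 | 0 | 182 | 93 | 0 | 0 | 0 | 23000 |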


### Printing the tail of the dataset to have a look at the dataframe

In [5]:
#last 10 rows of the dataset
medical_premium_data.tail(10)

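| | Age | Diabetes | BloodPressureProblems | AnyTransplants | AnyChronicDiseases | Height | Weight | KnownAllergies | HistoryOfCancerInFamily | NumberOfMajorSurgeries | PremiumPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 976 | 21 | 0 | 1 | 0 | 0 | 155 | 74 | 0 | 0 | 0 | 39000 |
| 977 | 45 | 0 | 1 | 0 | 1 | 157 | 67 | 0 | 0 | 1 | 25000 |
| 978 | 40 | 0 | 1 | 1 | 0 | 168 | 70 | 0 | 0 | 0 | 17000 |
| 979 | 24 | 0 | 0 | 0 | 0 | 161 | 71 | 0 | 0 | 0 | 15000 |
| 980 | 40 | 0 | 1 | 1 | 0 | 171 | 74 | 0 | 0 | 0 | 38000 |
| 981 | 18 | 0 | 0 | 0 | 0 | 169 | 67 | 0 | 0 | 0 | 15000 |
| 982 | 64 | 1 | 1 | 0 | 0 | 153 | 70 | 0 | 0 | 3 | 28000 |
| 983 | 56 | 0 | 1 | 0 | 0 | 155 | 71 | 0 | 0 | 1 | 29000 |
| 984 | 47 | 1 | 1 | 0 | 0 | 158 | 73 | 1 | 0 | 1 | 39000 |
| 985 | 21 | 0 | 0 | 0 | 0 | 158 | 75 | 1 | 0 | 1 | 15000 |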


In [6]:
# dataset information
medical_premium_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 986 entries, 0 to 985
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Age                      986 non-null    int64
 1   Diabetes                 986 non-null    int64
 2   BloodPressureProblems    986 non-null    int64
 3   AnyTransplants           986 non-null    int64
 4   AnyChronicDiseases       986 non-null    int64
 5   Height                   986 non-null    int64
 6   Weight                   986 non-null    int64
 7   KnownAllergies           986 non-null    int64
 8   HistoryOfCancerInFamily  986 non-null    int64
 9   NumberOfMajorSurgeries   986 non-null    int64
 10  PremiumPrice             986 non-null    int64
dtypes: int64(11)
memory usage: 84.9 KB


In [7]:
# data preprocessing: check whether there are any empty values
# count the missing values in each column
medical_premium_data.isnull().sum()

Age                        0
Diabetes                   0
BloodPressureProblems      0
AnyTransplants             0
AnyChronicDiseases         0
Height                     0
Weight                     0
KnownAllergies             0
HistoryOfCancerInFamily    0
NumberOfMajorSurgeries     0
PremiumPrice               0
dtype: int64

In [8]:
# statistical measures of the dataset
medical_premium_data.describe()

| | Age | Diabetes | BloodPressureProblems | AnyTransplants | AnyChronicDiseases | Height | Weight | KnownAllergies | HistoryOfCancerInFamily | NumberOfMajorSurgeries | PremiumPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 | 986.0 |
| mean | 41.745436 | 0.419878 | 0.46856 | 0.055781 | 0.180527 | 168.182556 | 76.950304 | 0.21501 | 0.117647 | 0.667343 | 24336.713996 |
| std | 13.963371 | 0.493789 | 0.499264 | 0.229615 | 0.384821 | 10.098155 | 14.265096 | 0.411038 | 0.322353 | 0.749205 | 6248.184382 |
| min | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 145.0 | 51.0 | 0.0 | 0.0 | 0.0 | 15000.0 |
| 25% | 30.0 | 0.0 | 0.0 | 0.0 | 0.0 | 161.0 | 67.0 | 0.0 | 0.0 | 0.0 | 21000.0 |
| 50% | 42.0 | 0.0 | 0.0 | 0.0 | 0.0 | 168.0 | 75.0 | 0.0 | 0.0 | 1.0 | 23000.0 |
| 75% | 53.0 | 1.0 | 1.0 | 0.0 | 0.0 | 176.0 | 87.0 | 0.0 | 0.0 | 1.0 | 28000.0 |
| max | 66.0 | 1.0 | 1.0 | 1.0 | 1.0 | 188.0 | 132.0 | 1.0 | 1.0 | 3.0 | 40000.0 |


Splitting the Features and Target

In [9]:
# assigning the features to X
# we drop the target column "PremiumPrice"
# passing columns= tells pandas to drop a column, so axis=1 is not needed as well
X = medical_premium_data.drop(columns='PremiumPrice')

# assigning the target to Y
Y = medical_premium_data['PremiumPrice']

In [10]:
print(X) #printing the features
print("---------------------------------------------------------------------------------------------")
print(Y) #printing the target

     Age  Diabetes  BloodPressureProblems  AnyTransplants  AnyChronicDiseases  \
0     45         0                      0               0                   0   
1     60         1                      0               0                   0   
2     36         1                      1               0                   0   
3     52         1                      1               0                   1   
4     38         0                      0               0                   1   
..   ...       ...                    ...             ...                 ...   
981   18         0                      0               0                   0   
982   64         1                      1               0                   0   
983   56         0                      1               0                   0   
984   47         1                      1               0                   0   
985   21         0                      0               0                   0   

     Height  Weight  KnownA

## Dividing data into train and test data using sklearn's train_test_split()

In [11]:
# splitting the dataset into training and test sets

# test_size --> fraction of the data held out for testing ==> 0.2 ==> 20%

# random_state --> seed for the shuffle; each value of random_state splits the data differently, and any integer can be used
# the same random_state must be passed every time to reproduce the same split
X_train, X_test, Y_train, Y_test = train_test_split(X.values, Y.values, test_size = 0.2, random_state = 2)
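
To make the role of `random_state` concrete, here is a small sketch on synthetic stand-in arrays (`X_demo` and `y_demo` are assumptions; they only share the row count of 986 with the real dataset). It splits the same data three times and compares the resulting test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in arrays with the same number of rows as the dataset
X_demo = np.arange(986 * 2).reshape(986, 2)
y_demo = np.arange(986)

# two splits with the same seed, one with a different seed
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)
split_c = train_test_split(X_demo, y_demo, test_size=0.2, random_state=3)

# index 0 is the training features, index 1 is the test features
same_seed_matches = np.array_equal(split_a[1], split_b[1])
other_seed_matches = np.array_equal(split_a[1], split_c[1])

print(split_a[0].shape, split_a[1].shape)
print(same_seed_matches, other_seed_matches)
```

With 986 rows and `test_size = 0.2`, the test set gets 198 rows and the training set 788. The same seed reproduces the same rows in the test set, while a different seed selects different ones.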

In [12]:
# let's see how many examples there are in each set
# checking dimensions of the features
print(X.shape, X_train.shape, X_test.shape)

(986, 10) (788, 10) (198, 10)


In [13]:
# let's see how many examples there are in each set
# checking dimensions of the targets
print(Y.shape, Y_train.shape, Y_test.shape)

(986,) (788,) (198,)


## Our data is ready for a machine learning algorithm

## XGBoost Regressor

XGBoost is a decision-tree-based ensemble model: instead of relying on a single model, it combines many decision trees, adding them one after another so that each new tree corrects the errors of the trees before it.
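
The idea of an ensemble of trees built by boosting can be sketched in a few lines. The code below is a toy illustration and not XGBoost's actual implementation, which adds regularisation, second-order gradient information and many optimisations. The helper names `fit_boosted_trees` and `predict_boosted_trees`, the synthetic data and all parameter values are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting for squared error: each tree fits the current residuals."""
    base = float(np.mean(y))               # start from the mean of the target
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction         # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_boosted_trees(base, trees, X, learning_rate=0.1):
    """Sum the base value and the scaled contribution of every tree."""
    prediction = np.full(X.shape[0], base)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction

# synthetic one-feature regression problem (not the insurance data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

base, trees = fit_boosted_trees(X_demo, y_demo)
mse_start = np.mean((y_demo - base) ** 2)
mse_end = np.mean((y_demo - predict_boosted_trees(base, trees, X_demo)) ** 2)
print(mse_start, mse_end)
```

The training error after 50 small trees is lower than the error of the constant starting prediction, which is the sense in which each added tree improves on the ensemble built so far.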

In [14]:
# loading the model
# training the model with training data
model = XGBRegressor().fit(X_train, Y_train)

Evaluation

Prediction on training data

In [15]:
# predictions on the training data
training_data_prediction = model.predict(X_train)
training_data_prediction

array([29072.727 , 29099.766 , 28991.11  , 15042.729 , 24935.96  ,
       14968.284 , 14876.2295, 14934.355 , 28059.701 , 22978.86  ,
       28936.896 , 27982.852 , 29043.248 , 15009.639 , 27971.09  ,
       28013.03  , 29034.24  , 34993.633 , 23002.176 , 37919.215 ,
       27908.824 , 23920.936 , 15011.43  , 25048.225 , 29039.893 ,
       15005.502 , 15011.429 , 34980.355 , 22836.924 , 15072.412 ,
       22869.37  , 31550.19  , 22990.285 , 25604.45  , 28012.98  ,
       22903.848 , 22957.54  , 24814.297 , 15040.018 , 25044.486 ,
       23110.758 , 25088.139 , 30999.52  , 15145.787 , 24872.61  ,
       23064.81  , 29029.814 , 29936.385 , 23086.549 , 14963.112 ,
       15071.051 , 15121.25  , 29943.951 , 20977.455 , 23030.13  ,
       14913.5205, 23046.727 , 22993.297 , 14951.567 , 27960.668 ,
       27956.238 , 28072.129 , 34969.12  , 14992.166 , 14997.585 ,
       37952.844 , 15038.15  , 14977.477 , 25006.957 , 23030.691 ,
       37842.125 , 34977.45  , 27974.691 , 39002.55  , 35006.2

Prediction on Test Data 

In [16]:
# predictions on the test data
test_data_prediction = model.predict(X_test)
test_data_prediction

array([31522.592, 34418.977, 22939.62 , 29325.85 , 27299.066, 32133.8  ,
       16277.406, 27771.814, 30520.424, 25124.678, 27165.387, 23610.73 ,
       15442.265, 24848.695, 28449.059, 23090.574, 28731.201, 22938.918,
       30183.084, 14819.499, 40606.508, 23725.55 , 23538.85 , 15160.81 ,
       23222.527, 23665.338, 14916.795, 23837.049, 25534.871, 28327.965,
       28728.133, 14576.701, 27556.139, 28253.668, 26795.512, 27302.924,
       24269.375, 14511.605, 23201.299, 23870.516, 14903.758, 22992.707,
       20332.379, 23119.092, 27744.418, 14815.873, 23401.953, 28691.2  ,
       26494.936, 37648.145, 23266.2  , 15244.201, 28419.104, 26956.568,
       28412.113, 22408.646, 22889.9  , 27215.402, 30084.893, 25092.826,
       22964.139, 14243.423, 29196.65 , 22780.059, 22943.955, 34752.117,
       28880.977, 27959.85 , 26176.28 , 34540.254, 27726.47 , 36155.066,
       22570.07 , 37894.832, 37438.855, 15981.653, 14941.79 , 28913.098,
       24414.393, 25144.031, 14952.204, 23900.447, 

## XGBRegressor ML model Score

In [17]:
# R² score (coefficient of determination); for a regressor, score() does not return an accuracy
# measuring how well the model fits the training data
model.score(X_train, Y_train)

0.999625650447872

In [18]:
# R² score (coefficient of determination); for a regressor, score() does not return an accuracy
# measuring how well the model generalises to the test data
model.score(X_test, Y_test)

0.8099216961227746
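
For a regressor, `score` reports the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. The worked example below uses four made-up premium values (`y_true` and `y_pred` are hypothetical, not model output) and checks the hand calculation against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# hypothetical targets and predictions, chosen only to keep the arithmetic simple
y_true = np.array([25000.0, 29000.0, 23000.0, 28000.0])
y_pred = np.array([24000.0, 30000.0, 23500.0, 27000.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # 3,250,000
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 22,750,000
r2_manual = 1.0 - ss_res / ss_tot                # 1 - 1/7

print(r2_manual)
print(r2_score(y_true, y_pred))
```

Both lines print about 0.857. Read this way, the scores above say the model explains almost all of the variance in the training premiums (about 0.9996) but noticeably less on unseen data (about 0.81), a gap that suggests the model overfits the training set.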

Let's cross-validate and check how the model performs.

In [19]:
# cross-validation
# it gives a more reliable performance estimate and helps to detect overfitting,
# particularly when the amount of data is limited. In cross-validation,
# the data is divided into a fixed number of folds (partitions); the model is trained and scored once per fold, and the scores are then averaged.
# cv = 5 ==> 5 folds: in each round, 4 folds are used for training and 1 for testing
print(cross_val_score(model, X, Y, cv = 5))

[0.8201978  0.72201425 0.78901907 0.82596822 0.63907311]
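
To see what `cv = 5` does with 986 rows, the sketch below reproduces the fold sizes with `KFold` and summarises the five scores printed above. It assumes the scikit-learn behaviour that an integer `cv` for a regressor means a plain, unshuffled `KFold`; `reported_scores` is copied from the output above rather than recomputed.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 986  # number of rows in the dataset

# size of the held-out part in each of the 5 rounds
fold_sizes = [
    len(test_idx)
    for _, test_idx in KFold(n_splits=5).split(np.arange(n_samples))
]

# the five R² scores reported by cross_val_score above
reported_scores = np.array([0.8201978, 0.72201425, 0.78901907, 0.82596822, 0.63907311])
mean_score = reported_scores.mean()
spread = reported_scores.max() - reported_scores.min()

print(fold_sizes)
print(round(mean_score, 4), round(spread, 4))
```

The folds hold 198, 197, 197, 197 and 197 rows. The scores average about 0.759 with a spread of about 0.187 between the best and worst fold, so the single test-set score of about 0.81 sits towards the optimistic end of what the model achieves on unseen data.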


Making a Predictive System

In [20]:
# predict the premium from values entered by the user
print("***Price of Medical Insurance***")
print('---------------------------------')

a = int(input("Enter the Age of the Customer : "))
b = int(input("Does the Customer have Diabetes (1 = Yes, 0 = No) : "))
c = int(input("Does the Customer have Blood Pressure Problems (1 = Yes, 0 = No) : "))
d = int(input("Does the Customer have Any Transplants (1 = Yes, 0 = No) : "))
e = int(input("Does the Customer have Any Chronic Diseases (1 = Yes, 0 = No) : "))
f = float(input("Height of Customer (in cm) : "))
g = float(input("Weight of Customer (in kg) : "))
h = int(input("Does the Customer have Known Allergies (1 = Yes, 0 = No) : "))
i = int(input("Does the Customer have a History Of Cancer In Family (1 = Yes, 0 = No) : "))
j = int(input("Number Of Major Surgeries for Customer : "))

# a single row of shape (1, 10), in the same column order as the training features
input_data = np.array([[a, b, c, d, e, f, g, h, i, j]])

prediction = model.predict(input_data)
print("-----------------------------------------------------------------------------------")

print('The Medical Insurance cost : $', prediction[0])

***Price of Medical Insurance***
---------------------------------
Enter the Age of the Customer : 20
Does the Customer have Diabetes (1 = Yes, 0 = No) : 0
Does the Customer have Blood Pressure Problems (1 = Yes, 0 = No) : 0
Does the Customer have Any Transplants (1 = Yes, 0 = No) : 0
Does the Customer have Any Chronic Diseases (1 = Yes, 0 = No) : 0
Height of Customer (in cm) : 185
Weight of Customer (in kg) : 105
Does the Customer have Known Allergies (1 = Yes, 0 = No) : 1
Does the Customer have a History Of Cancer In Family (1 = Yes, 0 = No) : 0
Number Of Major Surgeries for Customer : 0
-----------------------------------------------------------------------------------
The Medical Insurance cost : $ 15739.655
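
Because the model was trained on plain arrays, it relies entirely on the values arriving in the same column order as the training features. One way to make that harder to get wrong is a small helper that builds the input row from named values and rejects invalid yes/no flags. This is only a sketch: `build_feature_row`, `FEATURE_ORDER` and `BINARY_FEATURES` are hypothetical names introduced here, with the column order taken from the `info()` output above.

```python
import numpy as np

# column order of the training features, as listed by info() earlier
FEATURE_ORDER = [
    "Age", "Diabetes", "BloodPressureProblems", "AnyTransplants",
    "AnyChronicDiseases", "Height", "Weight", "KnownAllergies",
    "HistoryOfCancerInFamily", "NumberOfMajorSurgeries",
]

# features that are yes/no flags encoded as 1/0
BINARY_FEATURES = {
    "Diabetes", "BloodPressureProblems", "AnyTransplants",
    "AnyChronicDiseases", "KnownAllergies", "HistoryOfCancerInFamily",
}

def build_feature_row(customer):
    """Return a (1, 10) array in training column order, or raise ValueError."""
    missing = [name for name in FEATURE_ORDER if name not in customer]
    if missing:
        raise ValueError(f"missing features: {missing}")
    for name in BINARY_FEATURES:
        if customer[name] not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1, got {customer[name]!r}")
    return np.array([[float(customer[name]) for name in FEATURE_ORDER]])

# the same customer as in the session above
customer = {
    "Age": 20, "Diabetes": 0, "BloodPressureProblems": 0, "AnyTransplants": 0,
    "AnyChronicDiseases": 0, "Height": 185, "Weight": 105, "KnownAllergies": 1,
    "HistoryOfCancerInFamily": 0, "NumberOfMajorSurgeries": 0,
}
row = build_feature_row(customer)
print(row.shape)
```

The resulting `row` could then be passed to `model.predict(row)` in place of the hand-assembled array.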


In [21]:
import pickle

# save the trained model to disk so that it can be reused without retraining
with open('medical_insurance_model.pkl', 'wb') as file:
    pickle.dump(model, file)
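
A saved model is only useful if it can be read back, which is done with `pickle.load`. The round trip below is a self-contained sketch: a small `LinearRegression` fitted on three made-up points stands in for the trained `XGBRegressor` (an assumption so that the example runs by itself), and a temporary directory is used instead of `medical_insurance_model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# stand-in model (assumption): replaces the trained XGBRegressor for this demo
X_demo = np.array([[20.0], [40.0], [60.0]])
y_demo = np.array([15000.0, 23000.0, 29000.0])
stand_in_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")

    # write the model to disk, as in the cell above
    with open(path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # read it back; "rb" opens the file for binary reading
    with open(path, "rb") as file:
        loaded_model = pickle.load(file)

# the reloaded model gives the same predictions as the original
predictions_match = np.allclose(
    stand_in_model.predict(X_demo), loaded_model.predict(X_demo)
)
print(predictions_match)
```

The same two-line `open` and `pickle.load` pattern applies to the real file. Pickle files should only be loaded from sources you trust, since loading one can execute arbitrary code.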

# Summary
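So this is how we can build a medical insurance premium prediction model using machine learning and the Python programming language. The XGBoost regressor scored an R² of about 0.81 on the test set against about 0.9996 on the training set, a gap that points to overfitting, and the five cross-validation scores ranged from about 0.64 to 0.83. I hope you liked this project on how to build a model with machine learning.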