
Exploratory Data Analysis on the Medical Health Insurance dataset to determine the contributing factors and predict the health insurance cost using regression models.


mohan-kartik/Health_Insurance_Cost_Prediction


Project Overview

• Gained insight from the dataset through Exploratory Data Analysis
• Performed Data Processing, Feature Engineering and Feature Transformation to prepare the data for modeling
• Built models to predict Insurance Cost based on the features
• Evaluated the models using performance metrics: RMSE, R2, MAE, Training Accuracy and Testing Accuracy

Data source : https://www.kaggle.com/mirichoi0218/insurance

Data Definition

  1. age: age of primary beneficiary
  2. sex: gender of the insurance contractor (female, male)
  3. bmi: body mass index, an objective index of body weight relative to height (kg / m ^ 2), computed as weight divided by height squared; it indicates whether weight is high or low for a given height, ideally 18.5 to 24.9
  4. children: number of children covered by health insurance / number of dependents
  5. smoker: smoking status (yes, no)
  6. region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
  7. charges: Individual medical costs billed by health insurance
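
The BMI definition above can be made concrete with a short sketch. The function names `bmi` and `in_ideal_range` are hypothetical helpers introduced here for illustration; the dataset itself already ships a precomputed `bmi` column.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg / m^2: weight divided by height squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float) -> bool:
    """True when a BMI value falls in the 18.5 to 24.9 range quoted above."""
    return 18.5 <= value <= 24.9


# Illustrative inputs only: 70 kg at 1.75 m, and 95 kg at 1.75 m.
example_ideal = bmi(70.0, 1.75)
example_high = bmi(95.0, 1.75)
print(round(example_ideal, 2), in_ideal_range(example_ideal))
print(round(example_high, 2), in_ideal_range(example_high))
```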

Data Processing

  1. Check missing values - there are none
  2. Check duplicate values - there is 1 duplicate row, which is removed
  3. Feature engineering - make a new column weight_status based on BMI score
  4. Feature transformation:
    A) Encoding the sex, region, & weight_status attributes
    B) Ordinal encoding the smoker attribute
  5. Modeling:
    A) Separating target & features
    B) Splitting train & test data
    C) Modeling using the Linear Regression, Random Forest, Decision Tree, Ridge, & Lasso algorithms
    D) Finding the best algorithm
    E) Tuning hyperparameters
Exploratory Data Analysis

• The sex and region features are almost evenly balanced, while most people are non-smokers and obese

• People who smoke and have a BMI above 30 tend to have higher medical costs

• Older people who smoke have higher charges

• People who smoke and are obese have the highest average charges of all groups
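
A comparison like the last one can be computed with a grouped mean. The sketch below uses invented numbers purely to show the mechanics; the `weight_group` column and the BMI threshold of 30 for "obese" are assumptions for illustration, not the repository's code or results.

```python
import pandas as pd

# Invented rows for illustration; not taken from the dataset.
df = pd.DataFrame({
    "bmi": [32.0, 35.5, 24.0, 28.0, 31.0, 22.0],
    "smoker": ["yes", "yes", "yes", "no", "no", "no"],
    "charges": [40000.0, 44000.0, 20000.0, 7000.0, 9000.0, 5000.0],
})

# Label each person as obese (BMI above 30) or not.
df["weight_group"] = df["bmi"].gt(30).map({True: "obese", False: "not obese"})

# Average charges for every smoker / weight_group combination.
mean_charges = df.groupby(["smoker", "weight_group"])["charges"].mean()

# The combination with the highest average charges.
top_group = mean_charges.idxmax()
print(mean_charges)
print(top_group)
```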

Model Evaluation

| Score | LinearRegression | DecisionTree | RandomForest | Ridge |
|---|---|---|---|---|
| R2 | 0.77 | 0.78 | 0.78 | 0.86 |
| Train Accuracy | 0.74 | 1.0 | 0.97 | 0.74 |
| MAE | 4305.20 | 2798.83 | 2608.55 | 4311.10 |
| Test Accuracy | 0.77 | 0.78 | 0.86 | 0.77 |
| RMSE | 6209.88 | 6067.50 | 4841.88 | 6238.13 |
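
Metrics of this kind can be produced with scikit-learn as in the sketch below. It runs on synthetic data generated in the snippet, so its numbers will not match the table. It also assumes that "Train Accuracy" and "Test Accuracy" refer to the estimator's `score` method, which returns R2 for regressors; the README does not confirm this.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded features (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 300
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 2000, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "train_score": model.score(X_train, y_train),  # R2 on training data
        "test_score": model.score(X_test, y_test),     # R2 on held-out data
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }

for name, scores in results.items():
    print(name, {key: round(value, 2) for key, value in scores.items()})
```

A large gap between `train_score` and `test_score`, as an unpruned decision tree typically shows, is a common sign of overfitting.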

Conclusion

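Based on the predictive modeling, the Linear Regression algorithm is selected as the best-fitted model, with an MAE of 4305.20, an RMSE of 6209.88, and an R2 of 0.77.
The choice rests on the training and testing accuracy: Linear Regression scores 0.74 on training and 0.77 on testing data, whereas Decision Tree (1.0 vs 0.78) and Random Forest (0.97 vs 0.86) score much higher on training than on testing data, which suggests overfitting. Note that Random Forest reports the lowest test errors in the table (MAE 2608.55, RMSE 4841.88).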
