• Sought insights from the dataset through Exploratory Data Analysis
• Performed Data Processing, Data Engineering and Feature Transformation to prepare data before modeling
• Built a model to predict Insurance Cost based on the features
• Evaluated the model using various Performance Metrics like RMSE, R2, Testing Accuracy, Training Accuracy and MAE
Data source: https://www.kaggle.com/mirichoi0218/insurance
- age: age of primary beneficiary
- sex: gender of the insurance contractor (female or male)
- bmi: body mass index, an objective index of body weight computed as weight divided by height squared (kg / m ^ 2); it indicates whether a person's weight is high or low relative to their height, ideally 18.5 to 24.9
- children: Number of children covered by health insurance / Number of dependents
- smoker: whether the beneficiary smokes (yes or no)
- region: the beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)
- charges: Individual medical costs billed by health insurance
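As a quick illustration of the `bmi` definition above, here is a minimal sketch of the formula. The function name `bmi` and the sample weight and height are assumptions made for the example; they are not part of the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg at 1.75 m, which falls inside the 18.5 to 24.9 range.
print(round(bmi(70, 1.75), 1))  # 22.9
```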
- Check missing values - there are none
- Check duplicate values - there is 1 duplicate, which is removed
- Feature engineering - make a new column `weight_status` based on BMI score
- Feature transformation:
  A) Encoding the `sex`, `region`, & `weight_status` attributes
  B) Ordinal encoding the `smoker` attribute
- Modeling:
  A) Separating target & features
  B) Splitting train & test data
  C) Modeling using the Linear Regression, Random Forest, Decision Tree, Ridge, & Lasso algorithms
  D) Finding the best algorithm
  E) Tuning hyperparameters
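The steps above can be sketched roughly as follows, using pandas and scikit-learn. This is not the project's actual code. The data is a synthetic stand-in with the same column names, the `weight_status` cut-offs and the choice of one-hot encoding are assumptions, and tuning Ridge's `alpha` is only an example, since the list does not say which model was tuned.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Kaggle data (assumption: same column names).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.uniform(16, 45, n).round(1),
    "children": rng.integers(0, 5, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(["northeast", "southeast", "southwest", "northwest"], n),
})
# Made-up cost formula, only so the models have something to learn.
df["charges"] = (250 * df["age"] + 300 * df["bmi"]
                 + 20000 * (df["smoker"] == "yes") + rng.normal(0, 2000, n))
# Append one exact duplicate row so the de-duplication step has work to do.
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)


def preprocess(data: pd.DataFrame) -> pd.DataFrame:
    out = data.drop_duplicates().copy()
    # weight_status from BMI; these cut-offs are assumed, not taken from the project.
    out["weight_status"] = pd.cut(
        out["bmi"], bins=[0, 18.5, 25, 30, np.inf], right=False,
        labels=["underweight", "normal", "overweight", "obese"],
    ).astype(str)
    # Ordinal encoding of smoker: no -> 0, yes -> 1.
    out["smoker"] = out["smoker"].map({"no": 0, "yes": 1})
    # One-hot encoding for the nominal attributes (assumed encoding scheme).
    return pd.get_dummies(out, columns=["sex", "region", "weight_status"], dtype=int)


prepared = preprocess(df)
X = prepared.drop(columns="charges")  # features
y = prepared["charges"]               # target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0, max_iter=10000),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, .score returns R2 on the given data.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R2={scores[name][0]:.2f}, test R2={scores[name][1]:.2f}")

# Example hyperparameter search over Ridge's regularisation strength.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5, scoring="r2")
search.fit(X_train, y_train)
print("best Ridge params:", search.best_params_)
```

Comparing the train and test scores side by side is what exposes overfitting, for example a tree model that scores near 1.0 on the training split but noticeably lower on the test split.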
• The sex and region features are almost evenly balanced, while most people are non-smokers and obese
• People who smoke and have a BMI above 30 tend to have higher medical costs
• Older people who smoke have higher charges
• People who smoke and are obese have the highest average charges of any group
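Findings like these typically come from grouped averages. The sketch below shows one way to compute average charges by smoking status and obesity with pandas. The six rows are made up purely for illustration and are not taken from the dataset, and the `obese` column name is hypothetical.

```python
import pandas as pd

# Illustrative rows only (made up); the real analysis would use the Kaggle data.
df = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no", "yes", "no"],
    "bmi": [32.0, 24.0, 31.5, 22.0, 35.1, 27.3],
    "charges": [40000.0, 18000.0, 9000.0, 4000.0, 45000.0, 6000.0],
})
# Flag obesity using the "BMI above 30" threshold mentioned in the findings.
df["obese"] = df["bmi"] > 30

# Mean charges per (smoker, obese) group, highest first.
avg_charges = (
    df.groupby(["smoker", "obese"])["charges"].mean().sort_values(ascending=False)
)
print(avg_charges)
```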
Score | LinearRegression | DecisionTree | RandomForest | Ridge |
---|---|---|---|---|
R2 | 0.77 | 0.78 | 0.78 | 0.86 |
Train Accuracy | 0.74 | 1.0 | 0.97 | 0.74 |
MAE | 4305.20 | 2798.83 | 2608.55 | 4311.10 |
Test Accuracy | 0.77 | 0.78 | 0.86 | 0.77 |
RMSE | 6209.88 | 6067.50 | 4841.88 | 6238.13 |
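Based on the predictive modeling, the Linear Regression algorithm is selected as the best model, with an MAE of 4305.20, an RMSE of 6209.88, and an R2 score of 0.77. Random Forest and Decision Tree report lower test errors, but their training accuracy (0.97 and 1.0) is well above their testing accuracy (0.86 and 0.78), which suggests overfitting.
Linear Regression's training and testing accuracy are close (0.74 and 0.77), and it edges out Ridge on both MAE and RMSE, so it is the best-fitted model based on the training and testing accuracy.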
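For reference, the error metrics reported above can be computed as in this minimal sketch. The sample values are made up for the example, and it assumes the "accuracy" rows are R2 scores, which is what scikit-learn's `.score` method returns for regressors.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up charges and predictions, for illustration only.
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred = [1100.0, 1900.0, 3200.0, 3800.0]

print(mae(y_true, y_pred))             # 150.0
print(round(rmse(y_true, y_pred), 2))  # 158.11
print(round(r2(y_true, y_pred), 2))    # 0.98
```

MAE and RMSE are in the same units as `charges`, so lower is better, while R2 is unitless and higher is better.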