This project explores health insurance premium charges using regression models, specifically Box-Cox and Gamma Regression, implemented in R and SAS. The goal is to understand the factors influencing premium variations and identify an effective model for the given dataset.
Insurance premium refers to the payment made by individuals or businesses for their insurance coverage, whether it's for healthcare, auto, home, or life insurance. When someone signs up for an insurance policy, they are charged a premium by the insurer. This premium is used by the insurer to cover potential costs associated with the policies they offer. Apart from the premium, the total cost of healthcare includes other factors like deductibles, copayments/coinsurance, and expenses for health and drug services. A deductible is the amount an individual pays for covered health services before the insurance company starts covering costs. Copayments and coinsurance are payments made to healthcare providers each time someone receives care. Analytics plays a crucial role in the insurance industry by helping businesses interpret data effectively. These analytic methods often involve using data-mining tools and statistical inference techniques.
The dataset, obtained from Kaggle, a popular platform for data scientists, contains 1338 rows and 7 variables:
- Age: Age of primary beneficiary
- Sex: Gender of the insurance contractor (female or male)
- BMI: Body mass index
- Children: Number of children covered by health insurance / Number of dependents
- Smoker: Whether the beneficiary smokes (yes/no)
- Region: The beneficiary's residential area in the US (southeast, southwest, northeast, northwest)
- Charges: Individual medical costs billed by health insurance
A plot of the individual medical costs billed by health insurance shows that the charges are right-skewed rather than normally distributed, as both the histogram and the normality tests indicate. Many statistical methods assume that the residuals are normally distributed. Right-skewed data can violate this assumption, leading to incorrect inferences and results.
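A quick numerical check of right skew is the moment-based sample skewness, which is zero for symmetric data and positive when the right tail is longer. The sketch below is only illustrative: it is written in Python rather than the project's R or SAS, the helper name `sample_skewness` is hypothetical, and the toy values are not taken from the insurance dataset.

```python
def sample_skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical toy values, not taken from the insurance dataset.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.5, 2.0, 2.5, 20.0]

print(sample_skewness(symmetric))     # zero for perfectly symmetric data
print(sample_skewness(right_skewed))  # positive: long right tail
```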
To test the normality of a distribution, one commonly used method is the Shapiro-Wilk test. The null hypothesis, $H_0$, is that the data come from a normally distributed population.
The test statistic for the Shapiro-Wilk test is denoted by $W$.
If the p-value associated with the Shapiro-Wilk test is less than the chosen significance level $\alpha$, the null hypothesis of normality is rejected.
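For reference, the Shapiro-Wilk statistic is usually written as below. This is the standard textbook definition, not something taken from the project's code: $x_{(i)}$ is the $i$-th smallest observation, $\bar{x}$ is the sample mean, and the constants $a_i$ are derived from the expected values and covariances of standard normal order statistics.

$$W = \frac{\left(\sum_{i=1}^{n} a_i \, x_{(i)}\right)^2}{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}$$

$W$ lies between 0 and 1, and values well below 1 indicate a departure from normality.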
Both Box-Cox and Gamma Regression models were applied to model the insurance premium charges.
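If the density of the response variable $y$ is right-skewed, the Box-Cox power transformation can bring it closer to normality:

$$\tilde{y} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \ln y, & \lambda = 0. \end{cases}$$

The fitted mean for the Box-Cox transformed response is:

$$\hat{E}(\tilde{y}) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_k x_k$$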
| BoxCox.fit.x ($\lambda$) | BoxCox.fit.y (log-likelihood) |
|---|---|
| 0.25 | -3716.47 |
The optimal $\lambda$ reported in the table above is 0.25.
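The search for an optimal $\lambda$ can be sketched as a grid search over the profile log-likelihood. The example below is a minimal illustration under several assumptions: it is written in Python rather than the project's R or SAS, it uses an intercept-only model instead of the full set of predictors, the function names are hypothetical, and the data are synthetic (roughly lognormal), not the insurance charges.

```python
import math
from statistics import NormalDist


def box_cox(y, lam):
    """Box-Cox power transformation of a single positive value."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def profile_log_likelihood(ys, lam):
    """Profile log-likelihood of lambda for an intercept-only normal model
    (additive constants dropped)."""
    n = len(ys)
    transformed = [box_cox(y, lam) for y in ys]
    mean = sum(transformed) / n
    sigma2 = sum((t - mean) ** 2 for t in transformed) / n
    jacobian = (lam - 1.0) * sum(math.log(y) for y in ys)
    return -0.5 * n * math.log(sigma2) + jacobian


# Synthetic, deterministic stand-in for a right-skewed response:
# log(y) follows evenly spaced normal quantiles, so y is roughly lognormal.
n = 200
ys = [math.exp(9.0 + 0.9 * NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]

grid = [-2.0 + 0.25 * k for k in range(17)]  # -2.0, -1.75, ..., 2.0
best_lambda = max(grid, key=lambda lam: profile_log_likelihood(ys, lam))
print(best_lambda)
```

Because the synthetic response is lognormal by construction, the grid search should favor the log transformation. On real data the maximizing value depends on the predictors in the model as well as on the response.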
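A gamma regression, alternatively, was fitted to the positive response with a right-skewed distribution. In this model, the response $y$ is assumed to follow a gamma distribution, and its mean is related to the predictors through a log link function.
The fitted mean response has the form:

$$\hat{E}(y) = \exp\left(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_k x_k\right)$$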
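To make the fitting step concrete, here is a minimal sketch of a gamma regression with a single predictor, assuming a log link (which the percentage interpretations in this write-up suggest). It is written in Python rather than the project's R or SAS, the function name `fit_gamma_log_link` is hypothetical, and the data are made up for illustration. For a gamma response with a log link the iteratively reweighted least squares weights are constant, so each step reduces to an ordinary least-squares fit of a working response.

```python
import math


def fit_gamma_log_link(xs, ys, iterations=50, tol=1e-12):
    """Fit E(y) = exp(b0 + b1 * x) by iteratively reweighted least squares.

    With a gamma response and a log link the IRLS weights are constant,
    so each step is an ordinary least-squares fit of the working response
    z = eta + (y - mu) / mu on x.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    eta = [math.log(y) for y in ys]  # start from the log of the response
    b0 = b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(e) for e in eta]
        z = [e + (y - m) / m for e, y, m in zip(eta, ys, mu)]
        z_mean = sum(z) / n
        new_b1 = sum((x - x_mean) * (zi - z_mean) for x, zi in zip(xs, z)) / sxx
        new_b0 = z_mean - new_b1 * x_mean
        converged = abs(new_b0 - b0) < tol and abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        eta = [b0 + b1 * x for x in xs]  # update the linear predictor
        if converged:
            break
    return b0, b1


# Hypothetical data: an exponential trend with a repeating multiplicative error.
factors = [0.8, 1.1, 1.25, 0.9]
xs = [i / 10 for i in range(40)]
ys = [math.exp(1.0 + 0.5 * x) * factors[i % 4] for i, x in enumerate(xs)]

b0, b1 = fit_gamma_log_link(xs, ys)
print(b0, b1)
```

The sketch estimates only the regression coefficients. Standard errors, the dispersion parameter, and significance tests, which a full statistical package reports, are left out.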
After fitting the gamma model, the significant predictors were age, BMI, smoking status, and the northwest region.
The interpretation for significant predictors is as follows:
- As age increases by one year, the estimated mean premiums increase by 3.51%.
- An increase of one BMI point corresponds to a 1.745% increase in estimated mean premiums.
- Smokers have estimated mean premiums 5.44% higher than nonsmokers.
- People living in the northwest have estimated mean premiums 1.3066% of those living in the southeast.
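Percentage interpretations like these follow from the log link: a coefficient $\beta$ corresponds to a $100\,(e^{\beta} - 1)\%$ change in the estimated mean per one-unit increase in the predictor. The Python sketch below (illustrative only; the project itself uses R and SAS, and the function names are hypothetical) shows the conversion in both directions. The coefficient for age is back-calculated from the 3.51% figure above, not read from the fitted model output.

```python
import math


def percent_change(beta):
    """Percent change in the fitted mean per one-unit increase in a predictor
    under a log link: 100 * (exp(beta) - 1)."""
    return 100.0 * (math.exp(beta) - 1.0)


def implied_coefficient(percent):
    """Inverse of percent_change: the log-link coefficient that corresponds
    to a given percent change."""
    return math.log(1.0 + percent / 100.0)


# 3.51 is the age effect reported above; the coefficient is back-calculated
# here for illustration, not read from the fitted model output.
beta_age = implied_coefficient(3.51)
print(round(beta_age, 4))
print(round(percent_change(beta_age), 2))

# Effects multiply across units: ten extra years of age scale the mean by
# exp(10 * beta_age), which is more than 10 * 3.51 percent.
print(round(percent_change(10 * beta_age), 2))
```

Because the effects are multiplicative, changes over several units compound rather than add.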