# Preliminary Project Proposal

## Motivation and problem statement:

In light of the current situation, I realized that there is a knowledge gap around insurance policies and services, so I wanted to understand more about how individuals are billed for their medical insurance. As a customer, I am only aware of my own details and have no visibility into the internal parameters that insurance companies use. Hence the question: with only these details, can I figure out the reasons for my bill, or estimate what I might be billed in the future?

The project is also a way for me to prepare myself to explore unknown and faulty datasets. It will be great to learn about statistical techniques that work with different kinds of variables, conditions where techniques might not work, and how to deal with them.

## Data Selected for Analysis:

For the analysis, I will be using a publicly available dataset from [Kaggle](https://www.kaggle.com/mirichoi0218/insurance). The dataset consists of individual records and the amounts these individuals were billed by health insurance companies. It does not contain any personally identifiable information (PII). Each individual is described by the following features:
1.	**age**: age of the primary beneficiary
2.	**sex**: gender of the insurance contractor (female, male)
3.	**bmi**: body mass index, an objective index of body weight relative to height, computed as weight in kilograms divided by the square of height in metres (kg/m²); it indicates whether weight is relatively high or low for a given height, with an ideal range of 18.5 to 24.9
4.	**children**: Number of children covered by health insurance / Number of dependents
5.	**smoker**: Smoking (Yes/No)
6.	**region**: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
7.	**charges**: Individual medical costs billed by health insurance

These features represent some of the most common details every customer is required to fill in when applying for medical insurance. An analysis based on these features is easier for everyone to interpret (especially people who are not familiar with the finer details of insurance).
The description of the data on Kaggle reveals that we have a similar number of male and female individuals, an almost equal number of individuals for each age (ages 18-20 are slightly more frequent than the rest), and more individuals with 0 or 1 children than with more than 2 children.
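
The balance claims above come from Kaggle's own summary. The following is a minimal sketch of how they could be checked directly with pandas. To stay self-contained it runs only on the ten rows shown in this notebook's `head(10)` preview, so the counts it prints say nothing about the full dataset; on the real file one would pass the full `insurance_data` frame instead. The helper name `summarize_balance` is hypothetical.

```python
import pandas as pd

# The ten rows below are copied from the head(10) preview in this notebook.
preview = pd.DataFrame(
    {
        "age": [19, 18, 28, 33, 32, 31, 46, 37, 37, 60],
        "sex": ["female", "male", "male", "male", "male",
                "female", "female", "female", "male", "female"],
        "bmi": [27.9, 33.77, 33.0, 22.705, 28.88,
                25.74, 33.44, 27.74, 29.83, 25.84],
        "children": [0, 1, 3, 0, 0, 0, 1, 3, 2, 0],
        "smoker": ["yes", "no", "no", "no", "no",
                   "no", "no", "no", "no", "no"],
        "region": ["southwest", "southeast", "southeast", "northwest", "northwest",
                   "southeast", "southeast", "northwest", "northeast", "northwest"],
        "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552,
                    3756.6216, 8240.5896, 7281.5056, 6406.4107, 28923.13692],
    }
)


def summarize_balance(df):
    """Count individuals per sex and per number of children, plus missing values."""
    return {
        "sex": df["sex"].value_counts().to_dict(),
        "children": df["children"].value_counts().sort_index().to_dict(),
        "missing": df.isna().sum().to_dict(),
    }


summary = summarize_balance(preview)
print("Individuals per sex:", summary["sex"])
print("Individuals per number of children:", summary["children"])
print("Missing values per column:", summary["missing"])
```

The same helper also reports missing values per column, which is a cheap first check when exploring a dataset that may be faulty.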


#### License:
[Database: Open Database, Contents: Database Contents](http://opendatacommons.org/licenses/dbcl/1.0/)

## Unknowns and dependencies:

* Although in the real world these features are not enough to adequately describe the health insurance bills, we can start with these limited columns and then gradually improve our analysis as we get new data
* The data is a small random subset of the whole population and inferences from this data cannot be generalized to the whole population

In [2]:
import pandas as pd
import numpy as np

In [4]:
## Read the data and describe it

insurance_data = pd.read_csv('./data/insurance.csv', header = 0)

In [5]:
insurance_data.head(10)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |
| 5 | 31 | female | 25.74 | 0 | no | southeast | 3756.6216 |
| 6 | 46 | female | 33.44 | 1 | no | southeast | 8240.5896 |
| 7 | 37 | female | 27.74 | 3 | no | northwest | 7281.5056 |
| 8 | 37 | male | 29.83 | 2 | no | northeast | 6406.4107 |
| 9 | 60 | female | 25.84 | 0 | no | northwest | 28923.13692 |


In [6]:
insurance_data.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


# Final Project Proposal

# Title: Medical Insurance Bills

## Data
The dataset is publicly available at **[Kaggle](https://www.kaggle.com/mirichoi0218/insurance)**

## Description
Medical insurance costs are the premiums each individual pays to obtain medical insurance from a private or government institution. These premiums are decided based on the plans and services the individual applies for. According to [eHealthInsurance](https://www.ehealthinsurance.com/resources/affordable-care-act/much-health-insurance-cost-without-subsidy), in 2019 premiums for individual coverage averaged $462 per month for unsubsidized customers and around $199 per month with a subsidy. At those monthly rates, an individual spends roughly $2,388 to $5,544 per year (12 × $199 and 12 × $462). If a person is investing such an amount of money, it is important for them to understand how things work beyond selecting a plan.

## Motivation / Problem Statement
In light of the current situation, I realized that there is a knowledge gap around insurance policies and services, so I wanted to understand more about how individuals are billed for their medical insurance. As a customer, I am only aware of my own details and have no visibility into the internal parameters that insurance companies use. Hence the question: with only these details, can I figure out the reasons for my bill, or estimate what I might be billed in the future?

The project is also a way for me to prepare myself to explore unknown and faulty datasets. It will be great to learn about statistical techniques that work with different kinds of variables, conditions where techniques might not work, and how to deal with them.

## Unknowns and dependencies
- Although in the real world these features are not enough to adequately describe the health insurance bills, we can start with these limited columns and then gradually improve our analysis as we get new data
- The data is a small random subset of the whole population and inferences from this data cannot be generalized to the whole population

## Background
The following bullets highlight related work and previous studies, along with general observations on the type of work done:
- https://www.ehealthinsurance.com/resources/affordable-care-act/much-health-insurance-cost-without-subsidy
  - An article by eHealth describing the amount an individual or family spends yearly on medical insurance premiums
- https://www.kaggle.com/mirichoi0218/insurance/activity
  - The problem has been worked on extensively by the data science community, which has employed numerous techniques including, but not limited to, classification, regression, and clustering. The problem provides dual benefits: it allows people to understand medical bills based on information known to the customer, and it allows new data scientists to test and improve their skills
- https://people.csail.mit.edu/gjw/papers/healthcare.pdf
  - Scientists at MIT worked on identifying health insurance costs using the claims data of about 800k individuals, with about 200k individuals as the out-of-sample test set. The approach used in the paper includes multiple variables beyond the individuals' demographics. They use classification trees and clustering to create groups that can be explained by a set of characteristics. They focus on grouping people with similar claims and information, with the price of the claim as one of the attributes in the process

## Research Questions
- **What are the different variables that impact the individual medical insurance cost?** <br />
**Null Hypothesis** - *A particular feature has no effect on the medical insurance cost (checked for each feature in turn)* <br />
**Significance** - The answer to this question will allow users to see the root causes of high medical insurance costs.


- **Can we create a model that can compute the medical insurance cost of an individual? What average error can be expected on new data?** <br />
**Null Hypothesis** - *No combination of features in the dataset has any impact on the medical insurance cost* <br />
**Significance** - The predictive capability of the model will allow users to get an estimate of the medical insurance cost for a set of observations, whether or not they are present in the current data
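
The proposal does not name a specific statistical test for these null hypotheses. As one possible illustration (an assumption, not the chosen method), the sketch below runs a permutation test on the absolute Pearson correlation between a numeric feature and the cost. The data is a synthetic stand-in generated with NumPy, not the Kaggle dataset, and `permutation_p_value` is a hypothetical helper name.

```python
import numpy as np


def permutation_p_value(feature, target, n_permutations=2000, seed=0):
    """Two-sided permutation test of the null 'feature is unrelated to target'.

    The test statistic is the absolute Pearson correlation. Shuffling the
    target breaks any real association, which simulates the null hypothesis.
    """
    rng = np.random.default_rng(seed)
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target, dtype=float)
    observed = abs(np.corrcoef(feature, target)[0, 1])
    exceed = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(target)
        if abs(np.corrcoef(feature, shuffled)[0, 1]) >= observed:
            exceed += 1
    # The add-one correction keeps the estimate away from exactly zero.
    return (exceed + 1) / (n_permutations + 1)


# Synthetic stand-in data (NOT the Kaggle dataset): charges depend on age only.
rng = np.random.default_rng(42)
age = rng.integers(18, 65, size=300)
unrelated = rng.normal(size=300)
charges = 250.0 * age + rng.normal(scale=2000.0, size=300)

p_age = permutation_p_value(age, charges)
p_unrelated = permutation_p_value(unrelated, charges)
print(f"p-value for age: {p_age:.4f}")
print(f"p-value for the unrelated feature: {p_unrelated:.4f}")
```

A small p-value is evidence against the null hypothesis for that feature. Categorical features such as `sex`, `smoker`, and `region` would need a different test statistic, for example a difference in group means.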

## Methodology

#### First Research Question
  - To address the first question, I plan on running a correlation and collinearity analysis on the variables. The correlation will tell us how strongly each variable is linearly associated with the cost. The collinearity analysis can tell us whether two variables are mutually related and whether we need both of them to accurately understand the medical insurance cost. We will use the packages **pandas** and **statsmodels** to calculate the correlation and collinearity respectively
  - Another method is to use the **scikit-learn** package and implement **Random Forest Regression** with the medical insurance price as the target variable. The random forest regressor comes with the added advantage of generating *feature importance* scores. Random forest is a robust algorithm that controls the high variance of tree-based methods, so its predictions and feature importance scores are more stable than those of a single tree
  - I will also use the permutation feature importance method to analyze the impact on a simple model when one of the features is rendered unusable. The loss in the model's predictive power determines how important that feature is to the model
  - The fourth step is to use different visualizations to understand the relations between the features and the medical insurance cost. Visual aids help us see the *feature importance* in addition to observing the numbers
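
The first three steps above can be sketched as follows. This is a rough illustration under stated assumptions, not the final analysis: the data is a synthetic stand-in with the dataset's column names (`make_synthetic_insurance` is hypothetical, and `smoker` is assumed to be already encoded as 0/1), and the variance inflation factor is computed by hand with NumPy to keep the example short. In the actual project, `variance_inflation_factor` from `statsmodels.stats.outliers_influence` would be the natural choice.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def make_synthetic_insurance(n=500, seed=0):
    """Hypothetical stand-in with the dataset's column names; all values are simulated."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 65, size=n),
        "bmi": rng.normal(30.0, 6.0, size=n).clip(16.0, 53.0),
        "children": rng.integers(0, 6, size=n),
        "smoker": (rng.random(n) < 0.2).astype(int),  # assumed encoding: 1 = yes
    })
    # Assumed cost structure; "children" deliberately has no effect.
    df["charges"] = (250.0 * df["age"] + 300.0 * df["bmi"]
                     + 20000.0 * df["smoker"] + rng.normal(0.0, 3000.0, size=n))
    return df


def vif(features):
    """Variance inflation factor per column: 1 / (1 - R^2) of regressing it on the rest."""
    scores = {}
    for col in features.columns:
        y = features[col].to_numpy(dtype=float)
        others = features.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(y)), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        r2 = 1.0 - (y - X @ coef).var() / y.var()
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores)


data = make_synthetic_insurance()
X, y = data.drop(columns="charges"), data["charges"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: linear association with the cost, and collinearity among the features.
correlations = X.corrwith(y)
vif_scores = vif(X)

# Step 2: impurity-based feature importances from a random forest.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
rf_importance = pd.Series(forest.feature_importances_, index=X.columns)

# Step 3: permutation importance of a simple linear model on held-out data.
simple = LinearRegression().fit(X_train, y_train)
perm = permutation_importance(simple, X_test, y_test, n_repeats=10, random_state=0)
perm_importance = pd.Series(perm.importances_mean, index=X.columns)

print(pd.DataFrame({"correlation": correlations, "vif": vif_scores,
                    "rf_importance": rf_importance, "perm_importance": perm_importance}))
```

Putting the four scores side by side makes it easy to see where the methods agree. A VIF close to 1 suggests a feature is not explained by the others, while large values point to collinearity.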

#### Second Research Question
  - To address the second question, I will implement **4 different models** using the **scikit-learn** package:
    - **Linear Regression Model**: The linear regression model will simply try to use all the input variables to generate the best possible model. It is comparatively easy to understand and gives a good first look at the predictive capability of the data
    - **Polynomial Regression Model**: The polynomial regression model will add higher-degree terms (squared and cubic values of the given inputs) to the linear inputs. We will then use linear regression on all the variables to get the best possible model. Polynomial regression helps us introduce non-linearity in the variables, which increases the number of features and can improve the predictive strength of the model
    - **L1 Polynomial Regression Model**: We will take the polynomial variables generated in the previous step and fit an L1-regularized linear regression model on them. The L1 regularizer drives the coefficients of unnecessary variables to zero, therefore providing models with lower complexity
    - **Random Forest Model**: The random forest regressor is a non-linear regression model which allows us to observe more complex relations between the inputs and outputs
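
A minimal sketch of how the four models could be compared is shown below. Several choices here are assumptions made only for illustration: the data is a synthetic stand-in (not the Kaggle dataset) with an invented smoker-by-BMI interaction, the polynomial degree is 2, the Lasso strength is `alpha=10.0`, and the mean absolute error on a held-out split stands in for "the average error expected on new data".

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in (NOT the Kaggle data); columns: age, bmi, children, smoker.
rng = np.random.default_rng(0)
n = 600
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30.0, 6.0, size=n)
children = rng.integers(0, 6, size=n)
smoker = (rng.random(n) < 0.2).astype(int)
X = np.column_stack([age, bmi, children, smoker]).astype(float)
# Assumed non-linear effect: BMI matters far more for smokers.
y = 250.0 * age + 100.0 * bmi + 1000.0 * smoker * bmi + rng.normal(0.0, 2000.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "polynomial": make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=2, include_bias=False),
        LinearRegression(),
    ),
    "l1_polynomial": make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=2, include_bias=False),
        Lasso(alpha=10.0, max_iter=50_000),
    ),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

# Fit each model on the training split and score it on unseen data.
mae = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name:>14}: MAE = {mae[name]:,.0f}")

# Number of polynomial terms the L1 penalty kept (non-zero coefficients).
lasso = models["l1_polynomial"].named_steps["lasso"]
print("non-zero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)), "of", lasso.coef_.size)
```

Scoring on a held-out split rather than on the training data is what makes the reported error an estimate for new observations. Cross-validation would give a more stable estimate on a dataset of this size.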

### References
- Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. *Journal of Machine Learning Research, 12*, 2825-2830. Retrieved from https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
- Seabold, S., & Perktold, J. (2010). statsmodels: Econometric and statistical modeling with Python. In *Proceedings of the 9th Python in Science Conference*.
- Santhanam, N. (2019, October 14). Explain your machine learning with feature importance. Retrieved from https://towardsdatascience.com/explain-your-machine-learning-with-feature-importance-774cd72abe

## License
Database: [Open Database, Contents: Database Contents](http://opendatacommons.org/licenses/dbcl/1.0/) <br />
Republishing is not prohibited by Kaggle
