# Insurance Forecast
## Objective
This analysis aims to predict the individual medical costs billed by health insurance from a beneficiary's age, sex, BMI, smoking status, and region.
### Team Members
* Jason Zelaya
* Madeleine Merken
* Shannon Chang

### Data Source
Kaggle: https://www.kaggle.com/mirichoi0218/insurance
### Data Content
* Age: age of the primary beneficiary
* Sex: gender of the insurance contractor, male/female
* BMI: body mass index (kg/m²), derived from an individual's weight and height, ideally 18.5 to 24.9
* Children: number of children covered by health insurance (dropped during cleaning)
* Smoker: whether the beneficiary smokes, yes/no
* Region: the beneficiary's residential area in the US, northeast/southeast/southwest/northwest
* Charges: individual medical costs billed by health insurance

## Data Validation
The model is only usable in practice if it conforms to the assumptions of linear regression, so we examine the data set to check that:
* The regression model is linear in its parameters
* The mean of the residuals is zero
* The residuals are homoscedastic (have equal variance)
* The residuals are not autocorrelated
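
The residual checks listed above can be sketched numerically. The example below is only an illustration under stated assumptions: it uses synthetic data rather than the insurance data, fits the line with `np.linalg.lstsq`, and the helper name `residual_diagnostics` is hypothetical. The Durbin-Watson statistic and the correlation between absolute residuals and fitted values are rough indicators, not formal tests.

```python
import numpy as np

def residual_diagnostics(y, y_hat):
    """Summarize residuals for the zero-mean, equal-variance, and autocorrelation checks."""
    residuals = y - y_hat
    # Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation.
    durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
    # Rough heteroscedasticity signal: correlation between |residual| and fitted value.
    spread_corr = np.corrcoef(np.abs(residuals), y_hat)[0, 1]
    return {
        "mean_residual": float(np.mean(residuals)),
        "durbin_watson": float(durbin_watson),
        "abs_resid_vs_fitted_corr": float(spread_corr),
    }

# Synthetic data (an assumption, not the insurance data): y is linear in x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(18, 64, size=500)
y = 250.0 * x + 1000.0 + rng.normal(0.0, 500.0, size=500)

# Ordinary least squares fit with an intercept column.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
diagnostics = residual_diagnostics(y, X @ coef)
print(diagnostics)
```

With an intercept in the model, the mean residual is zero up to rounding error by construction, so the more informative checks are the spread and the autocorrelation of the residuals.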

## ML Algorithm Validation
The model is trained with supervised learning, using linear regression with charges as the target.
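
As a minimal sketch of what a supervised linear regression fit involves, the example below estimates the coefficients on a training split and scores them on held-out rows. It makes several assumptions: the data is synthetic and noise-free, the fit uses `np.linalg.lstsq` rather than whichever library the notebook goes on to use, and the helper names `fit_ols` and `r_squared` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Fit ordinary least squares with an intercept; return (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

def r_squared(y, y_hat):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic features and a noise-free linear target (assumed values for illustration).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(100, 2))
y = 100.0 + X @ np.array([50.0, 20.0])

# Supervised learning: fit on labeled training rows, evaluate on held-out rows.
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]
intercept, coefficients = fit_ols(X_train, y_train)
test_score = r_squared(y_test, intercept + X_test @ coefficients)
print(intercept, coefficients, test_score)
```

Because the synthetic target has no noise, the fit recovers the generating coefficients almost exactly; real data such as the insurance charges would give a lower held-out score.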

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
# Read the csv file into a pandas DataFrame
insurance = pd.read_csv('insurance.csv')
insurance.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [3]:
# Clean the data: drop the "children" column, which is not one of the predictors used in this analysis
insurance = insurance.drop(columns="children")
insurance

,age,sex,bmi,smoker,region,charges
0,19,female,27.900,yes,southwest,16884.92400
1,18,male,33.770,no,southeast,1725.55230
2,28,male,33.000,no,southeast,4449.46200
3,33,male,22.705,no,northwest,21984.47061
4,32,male,28.880,no,northwest,3866.85520
5,31,female,25.740,no,southeast,3756.62160
6,46,female,33.440,no,southeast,8240.58960
7,37,female,27.740,no,northwest,7281.50560
8,37,male,29.830,no,northeast,6406.41070
9,60,female,25.840,no,northwest,28923.13692


In [4]:
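# Use pandas get_dummies to one-hot encode the categorical columns (sex, smoker, region).
# dtype=int keeps the indicators as 0/1; recent pandas versions return booleans by default.
insurance = pd.get_dummies(insurance, dtype=int)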
insurance.head()

,age,bmi,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,16884.924,1,0,0,1,0,0,0,1
1,18,33.77,1725.5523,0,1,1,0,0,0,1,0
2,28,33.0,4449.462,0,1,1,0,0,0,1,0
3,33,22.705,21984.47061,0,1,1,0,0,1,0,0
4,32,28.88,3866.8552,0,1,1,0,0,1,0,0
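
One point to watch with the full one-hot encoding is that the indicators for each category always sum to 1 (for example, `sex_female + sex_male`), so they are perfectly collinear with a regression intercept. The sketch below shows an optional variation, not the encoding used in this notebook: passing `drop_first=True` to `pd.get_dummies` keeps one fewer indicator per categorical column. The `sample` frame is rebuilt by hand from the five rows shown above purely for illustration.

```python
import pandas as pd

# The five rows shown in the head() output, rebuilt by hand for illustration.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# drop_first=True removes the first level (in sorted order) of each categorical
# column, so the remaining indicators no longer sum to a constant.
encoded = pd.get_dummies(sample, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The dropped level becomes the baseline: a row with `sex_male = 0` is female, and a row with every region indicator at 0 belongs to the dropped region.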
