# Insurance Forecast

## Problem Statement
The purpose of this analysis is to predict the individual costs charged by health insurance companies based on the beneficiary's age, sex, BMI, smoking status, and region.

### Team Members
* Jason Zelaya
* Madeleine Merken
* Shannon Chang

### Data Source
Kaggle: https://www.kaggle.com/mirichoi0218/insurance

### Data Content
**Note:** The individual paying for the health insurance is referred to as the "beneficiary" in the following definitions.

* Age: age of the beneficiary in years
* Sex: whether the beneficiary is male or female
* BMI: body mass index, derived from the weight and height of an individual. A healthy BMI is generally considered to be between 18.5 and 24.9
* Smoker: whether or not the beneficiary smokes
* Region: the beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)
* Charges: the price the beneficiary pays the health insurance companies in USD

**Note:** We dropped the "children" column because the youngest beneficiary is 18 years old, which is legally considered an adult.

## Underlying Assumptions
For the model to be usable in practice, it should conform to the assumptions of linear regression. To confirm this, we examined the data set to check that:
* The regression model is linear in parameters
* The mean of residuals is zero
* Homoscedasticity (equal variance) of residuals
* No autocorrelation of residuals
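
The residual checks listed above can be sketched numerically. The example below is a minimal illustration on synthetic data, not on the insurance data set. The helper name `residual_diagnostics`, the Durbin-Watson statistic, and the correlation between absolute residuals and fitted values are assumptions chosen for illustration, not necessarily the checks used in this analysis.

```python
import numpy as np

def residual_diagnostics(X, y):
    """Fit ordinary least squares and return simple residual checks.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Add an intercept column; with an intercept, OLS residuals average to zero
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    resid = y - fitted
    # Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation
    dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
    # Rough homoscedasticity check: |residuals| should not trend with fitted values
    spread_corr = np.corrcoef(np.abs(resid), fitted)[0, 1]
    return {
        "coef": coef,
        "resid_mean": resid.mean(),
        "durbin_watson": dw,
        "spread_corr": spread_corr,
    }

# Synthetic data standing in for the real features (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)
checks = residual_diagnostics(X, y)
print(checks)
```

A residual mean far from zero, a Durbin-Watson value far from 2, or a strong correlation between residual spread and fitted values would each suggest that the corresponding assumption deserves a closer look, for example with residual plots.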

## ML Algorithm Validation
Multiple linear regression (supervised learning)
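
As a sketch, a multiple linear regression model has the general form below, where $x_{ij}$ is the value of predictor $j$ (age, BMI, and the encoded sex, smoker, and region indicators) for beneficiary $i$, $\beta_j$ are the coefficients to estimate, and $\varepsilon_i$ is the residual. The symbols are notation introduced here for illustration.

$$
\text{charges}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i
$$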

## Exploratory Data Analysis

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [4]:
# Read the csv file into a pandas DataFrame
insurance = pd.read_csv('insurance.csv')
insurance.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [5]:
# Cleaning the data

In [6]:
# Check how many null values are in each column
insurance.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [7]:
# Check how many "-" values are in each column
insurance.isin(["-"]).sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [8]:
# Replace any inf values with NaN (assigning the result so the change is kept) and preview the first five rows
insurance = insurance.replace([np.inf, -np.inf], np.nan)
insurance.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [9]:
# Drop all rows with NaN values (assigning the result so the change is kept) and preview the first five rows
insurance = insurance.dropna()
insurance.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [10]:
# Count the male and female beneficiaries at each age. This lets us confirm that there are no
# inconsistent or incorrect age values, such as negative ages
pd.crosstab(insurance.sex, insurance.age)

sex / age,18,19,20,21,22,23,24,25,26,27,...,55,56,57,58,59,60,61,62,63,64
female,33,33,14,13,13,14,14,13,13,14,...,13,13,13,13,13,11,12,12,12,11
male,36,35,15,15,15,14,14,15,15,14,...,13,13,13,12,12,12,11,11,11,11
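
The range check described in the comment above can also be done directly with a boolean mask rather than by reading a cross-tabulation. The sketch below uses a small made-up frame that only borrows the column names of the insurance data. The variable names `sample`, `numeric_cols`, and `invalid_rows` are hypothetical, and the rule "values must be strictly positive" is an assumption for illustration.

```python
import pandas as pd

# Small stand-in frame with the same column names as the insurance data (values are made up)
sample = pd.DataFrame({
    "age": [19, 18, -3, 33],
    "bmi": [27.9, 33.77, 33.0, 0.0],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47],
})

# Flag rows where any of these numeric columns is not strictly positive
numeric_cols = ["age", "bmi", "charges"]
invalid_rows = sample[(sample[numeric_cols] <= 0).any(axis=1)]
print(invalid_rows)
```

An empty `invalid_rows` frame would indicate that no values fall outside the assumed range.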


In [11]:
# Round all values in the charges column to two decimals
decimals = 2
insurance['charges'] = insurance['charges'].round(decimals)
insurance.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.77,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.705,0,no,northwest,21984.47
4,32,male,28.88,0,no,northwest,3866.86


In [12]:
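# Use pandas get_dummies to one-hot encode the categorical columns (sex, smoker, region).
# dtype=int keeps the 0/1 output shown below on pandas 2.x, where the default dtype is bool
insurance = pd.get_dummies(insurance, dtype=int)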
insurance.head()

,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.92,1,0,0,1,0,0,0,1
1,18,33.77,1,1725.55,0,1,1,0,0,0,1,0
2,28,33.0,3,4449.46,0,1,1,0,0,0,1,0
3,33,22.705,0,21984.47,0,1,1,0,0,1,0,0
4,32,28.88,0,3866.86,0,1,1,0,0,1,0,0
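
Because each categorical column is expanded into one indicator per category, the indicators for a column always sum to 1 (for example, `sex_female + sex_male`), which makes them perfectly collinear with the intercept of a linear regression. One common way to avoid this is to drop one level per category. The sketch below shows that option with `drop_first=True` on a small made-up frame; it is an alternative encoding offered as an assumption, not what this notebook does.

```python
import pandas as pd

# Made-up rows that reuse the insurance column names for illustration
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# drop_first=True keeps k-1 indicator columns for a category with k levels;
# dtype=int yields 0/1 integers rather than booleans
encoded = pd.get_dummies(sample, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With this encoding, the dropped level of each category (here `female` and `no`) becomes the baseline against which the remaining coefficients are interpreted.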


In [13]:
# Save the cleaned data to a new CSV file (the row index is written as the first column)
insurance.to_csv('insurance_cleaned.csv')