# Health Insurance Charges: Regression Modeling & Analysis

## Goals

1. Develop a predictive regression model to accurately estimate individual medical insurance charges based on demographic and lifestyle factors in order to support data-driven underwriting decisions.

2. Identify and quantify the key drivers influencing insurance premium costs (e.g., smoking status, BMI, age) to better understand risk factors and inform pricing strategy.

In [1]:
import pandas as pd
import numpy as np

import wrangle as w  # project-local helpers (wrangle.py)
import seaborn as sns
import matplotlib.pyplot as plt

from scipy import stats
from scipy.stats import pearsonr, spearmanr, ttest_ind, f_oneway
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor

import warnings
warnings.filterwarnings("ignore")

-----

## Acquire

- Data acquired from Kaggle
- Wrote a function in my wrangle.py file to read the csv into a dataframe
- Data frame contained 1,338 rows and 7 columns before cleaning
- Each row represents one insured individual (the primary beneficiary)
- Each column represents a feature of the dataset
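
The actual loader lives in wrangle.py and is not shown here. As a rough sketch only, a function with the same call shape as `w.insurance('insurance.csv')` could look like the block below. The name `load_insurance`, the choice to return a raw copy plus a working copy, and the handling of a saved index column are all assumptions made for illustration. The sample rows are the five rows displayed in the brief look at the data.

```python
import io

import pandas as pd


def load_insurance(path_or_buffer):
    """Hypothetical stand-in for w.insurance(); not the real wrangle.py code.

    Returns (df_origin, df): the data as read, and a separate working copy.
    """
    df_origin = pd.read_csv(path_or_buffer)
    # Assumption: if the file carries a saved index column, pandas names it
    # "Unnamed: 0"; drop any such column so only the 7 real features remain.
    keep = ~df_origin.columns.str.startswith("Unnamed")
    df_origin = df_origin.loc[:, keep]
    # Working copy, so later encoding steps never modify the original frame.
    df = df_origin.copy()
    return df_origin, df


# In-memory CSV holding the five rows shown by df_origin.head().
sample_csv = io.StringIO(
    ",age,sex,bmi,children,smoker,region,charges\n"
    "0,19,female,27.9,0,yes,southwest,16884.924\n"
    "1,18,male,33.77,1,no,southeast,1725.5523\n"
    "2,28,male,33.0,3,no,southeast,4449.462\n"
    "3,33,male,22.705,0,no,northwest,21984.47061\n"
    "4,32,male,28.88,0,no,northwest,3866.8552\n"
)

df_origin, df = load_insurance(sample_csv)
print(df_origin.shape)
print(df_origin.dtypes)
```

Passing a file path such as `'insurance.csv'` instead of the in-memory buffer works the same way, since `pd.read_csv` accepts both.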


----

## Prepare

- Checked for nulls in the data (there were none)
- Checked that column data types were appropriate
- Computed descriptive statistics on numerical variables
- Encoded variables to make them usable by my models
- Created dummy variables for categorical variables
- Added dummy variables to dataset
- Split data into train, validate and test (approx. 70/15/15)
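
The preparation steps above can be sketched roughly as follows. This is not the code from wrangle.py: the helper name `split_70_15_15`, the `drop_first=True` option, the random seed, and the shuffle-then-slice split are assumptions, and the rows are synthetic placeholders that only mimic the column layout in the data dictionary.

```python
import numpy as np
import pandas as pd

# Synthetic placeholder rows (NOT the real insurance data); they only
# reproduce the column names and types so the steps can be run end to end.
n_rows = 100
rng = np.random.default_rng(0)
regions = ["northeast", "northwest", "southeast", "southwest"]
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n_rows),
    "sex": ["female" if i % 2 == 0 else "male" for i in range(n_rows)],
    "bmi": rng.normal(30, 6, size=n_rows).round(2),
    "children": rng.integers(0, 4, size=n_rows),
    "smoker": ["yes" if i % 5 == 0 else "no" for i in range(n_rows)],
    "region": [regions[i % 4] for i in range(n_rows)],
    "charges": rng.gamma(2.0, 6000.0, size=n_rows).round(2),
})

# 1. Check for nulls and confirm the column dtypes.
null_counts = df.isna().sum()
print(null_counts.sum())
print(df.dtypes)

# 2. Descriptive statistics on the numerical variables.
print(df.describe())

# 3. Dummy-encode the categorical columns and add them to the dataset.
#    drop_first=True (an assumption) drops one level per variable, which
#    avoids perfectly collinear columns in a linear model.
encoded = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)


def split_70_15_15(frame, seed=42):
    """Shuffle the rows, then slice into roughly 70/15/15 train/validate/test."""
    shuffled = frame.sample(frac=1, random_state=seed)
    n_train = int(round(len(shuffled) * 0.70))
    n_validate = int(round(len(shuffled) * 0.15))
    train = shuffled.iloc[:n_train]
    validate = shuffled.iloc[n_train:n_train + n_validate]
    test = shuffled.iloc[n_train + n_validate:]
    return train, validate, test


train, validate, test = split_70_15_15(encoded)
print(train.shape, validate.shape, test.shape)
```

Fixing the seed keeps the three subsets reproducible between runs, so validation results stay comparable while the model is tuned.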

---- 

## Data Dictionary



| Target Variable |     Definition     |
| --------------- | ------------------ |
|      charges     | Individual medical costs billed by health insurance. (numeric) |

| Feature  | Definition |
| ------------- | ------------- |
| age  | Age of primary beneficiary (numeric)  |
| sex | Insurance contractor gender, female / male (binary) |
| bmi | Body mass index (kg/m²), an indicator of whether body weight is high or low relative to height (numeric) |
| children | Number of children covered by health insurance / Number of dependents (numeric) |
| smoker | Whether the beneficiary smokes, yes / no (binary) |
| region | The beneficiary's residential area in the US: northeast, southeast, southwest, northwest (categorical) |



---

## A brief look at the data

In [3]:
df_origin, df = w.insurance('insurance.csv')

In [5]:
df_origin.head()

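|   | age | sex | bmi | children | smoker | region | charges |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |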


## A summary of the data

In [7]:
train, validate, test = w.data_split(df_origin)
train.shape, validate.shape, test.shape

((936, 7), (201, 7), (201, 7))

In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 936 entries, 463 to 1128
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       936 non-null    int64  
 1   sex       936 non-null    object 
 2   bmi       936 non-null    float64
 3   children  936 non-null    int64  
 4   smoker    936 non-null    object 
 5   region    936 non-null    object 
 6   charges   936 non-null    float64
dtypes: float64(2), int64(2), object(3)
memory usage: 58.5+ KB


-----

## Explore

----

## Preprocessing

-----

## Modeling 

----

## Conclusion