In [1]:
import pandas as pd

# Load data file
file_path = 'insurance.csv'
insurance_data = pd.read_csv(file_path)

insurance_data.head()


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552



# Predicting Health Insurance Premium Cost

## Problem Statement
The high cost of healthcare in the United States makes health insurance a necessity for many residents. However, insurance premiums vary based on personal factors, making it hard to predict individual costs. This notebook aims to predict annual insurance premium costs for individuals based on relevant data.

### SMART Framework
- **Specific**: Predict the annual health insurance premium based on individual data.
- **Measurable**: Use historical data to estimate premiums with error metrics like MAE or RMSE.
- **Achievable**: Given the dataset and model tools, a prediction model is feasible.
- **Relevant**: Accurate premium predictions aid financial planning for health insurance.
- **Time-Bound**: Completion and validation of the model will occur within the project timeline.




## Model Selection: Regression
This problem requires predicting a continuous numerical value (insurance charges), so **Regression** is the most suitable machine learning model type.



## Business Understanding
Accurate predictions of insurance premiums help individuals budget effectively, especially given the rising healthcare costs. By understanding the key factors influencing premium costs, individuals and insurance providers can make informed decisions, potentially driving more customized insurance plans and cost management strategies.



## Problem Statement for EDA
To better understand the variables in the dataset, we will focus on:
- Age, BMI, and Smoking Status: Factors that may strongly influence premium costs.
- Number of Children: Family size could impact premium levels.
- Region: Location-specific costs could vary.

EDA will focus on analyzing these variables, their distributions, correlations, and their potential impact on premium charges.



## Data Understanding


In [2]:
insurance_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [3]:
insurance_data.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801



## Data Cleaning
1. **Check for Missing Values**
2. **Data Type Verification**
3. **Handle Categorical Variables**.


In [4]:
# Check for missing values
insurance_data.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [5]:
# Encode categorical variables
insurance_data_encoded = pd.get_dummies(insurance_data, drop_first=True)
insurance_data_encoded.head()

Unnamed: 0,age,bmi,children,charges,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.924,False,True,False,False,True
1,18,33.77,1,1725.5523,True,False,False,True,False
2,28,33.0,3,4449.462,True,False,False,True,False
3,33,22.705,0,21984.47061,True,False,True,False,False
4,32,28.88,0,3866.8552,True,False,True,False,False
