# **Health Insurance_Load**

## Objectives

* Prepare the DataFrame for the Load phase and capture final comments and next steps

## Inputs

* feature_engineered_insurance.csv file

## Outputs

* Next steps 

---

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns          
import matplotlib.pyplot as plt
import plotly.express as px

# Section 1: Transformed CSV file Check

Loading the transformed dataset and ensuring it is ready for further processing.

In [5]:
df = pd.read_csv("../data/feature_engineered_insurance.csv")
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,high_risk,medium_risk,low-medium_risk,low_risk,age_group
0,19,female,27.900,0,yes,southwest,16884.92400,0,1,0,0,teen
1,18,male,33.770,1,no,southeast,1725.55230,1,0,0,0,teen
2,28,male,33.000,3,no,southeast,4449.46200,1,0,0,0,young_adult
3,33,male,22.705,0,no,northwest,21984.47061,0,0,1,1,young_adult
4,32,male,28.880,0,no,northwest,3866.85520,0,1,0,0,young_adult
...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,1,0,0,0,adult
1334,18,female,31.920,0,no,northeast,2205.98080,1,0,0,0,teen
1335,18,female,36.850,0,no,southeast,1629.83350,1,0,0,0,teen
1336,21,female,25.800,0,no,southwest,2007.94500,0,1,0,0,young_adult
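
The `Unnamed: 0` column in the preview above is usually the row index that was written out when the file was saved. A minimal sketch, assuming that is what happened here, reads that column back as the index with `index_col=0`. The inline text stands in for the real file and reuses the first three rows of the preview; `sample_text`, `raw`, and `indexed` are illustrative names.

```python
import io

import pandas as pd

# First three rows of the preview above, standing in for the real CSV file.
sample_text = (
    ",age,sex,bmi,children,smoker,region,charges,"
    "high_risk,medium_risk,low-medium_risk,low_risk,age_group\n"
    "0,19,female,27.900,0,yes,southwest,16884.92400,0,1,0,0,teen\n"
    "1,18,male,33.770,1,no,southeast,1725.55230,1,0,0,0,teen\n"
    "2,28,male,33.000,3,no,southeast,4449.46200,1,0,0,0,young_adult\n"
)

# Default read: the unlabeled first column comes back as "Unnamed: 0".
raw = pd.read_csv(io.StringIO(sample_text))

# Assumption: the first column is the saved index, so read it as the index.
indexed = pd.read_csv(io.StringIO(sample_text), index_col=0)

print(raw.columns.tolist())
print(indexed.columns.tolist())
```

If the assumption holds for the real file, the same argument could be passed when reading `../data/feature_engineered_insurance.csv`.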


In [6]:
df.isnull().sum()

age                0
sex                0
bmi                0
children           0
smoker             0
region             0
charges            0
high_risk          0
medium_risk        0
low-medium_risk    0
low_risk           0
age_group          0
dtype: int64

After transformation, none of the columns in the newly created dataset contain missing values.
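
The null check can also be expressed as a guard that fails loudly if a later version of the file contains gaps. This is a sketch rather than part of the original notebook; `assert_no_missing` is a hypothetical helper, and the small `demo` frame contains a deliberate gap to show the failure path.

```python
import pandas as pd


def assert_no_missing(frame: pd.DataFrame) -> None:
    """Raise ValueError naming every column that contains nulls."""
    null_counts = frame.isnull().sum()
    columns_with_nulls = null_counts[null_counts > 0]
    if not columns_with_nulls.empty:
        raise ValueError(
            f"Missing values found in: {list(columns_with_nulls.index)}"
        )


# Illustrative frame with one missing bmi value (not taken from the dataset).
demo = pd.DataFrame({"age": [19, 18], "bmi": [27.9, None]})
try:
    assert_no_missing(demo)
except ValueError as err:
    print(err)
```

Calling `assert_no_missing(df)` on the loaded DataFrame would pass silently when the counts are all zero, as they are in the output above.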

In [10]:
print(df.dtypes)

age                  int64
sex                 object
bmi                float64
children             int64
smoker              object
region              object
charges            float64
high_risk            int64
medium_risk          int64
low-medium_risk      int64
low_risk             int64
age_group           object
dtype: object
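
The dtypes printed above can serve as a simple schema to compare against future versions of the file. The sketch below is one way to do that; `EXPECTED_DTYPES` copies the output above, while `dtype_mismatches` and the `demo` frame are hypothetical. Depending on the pandas version, text columns may report a dedicated string dtype instead of `object`, so the expected values may need adjusting.

```python
import pandas as pd

# Expected schema, copied from the dtypes output above.
EXPECTED_DTYPES = {
    "age": "int64",
    "sex": "object",
    "bmi": "float64",
    "children": "int64",
    "smoker": "object",
    "region": "object",
    "charges": "float64",
    "high_risk": "int64",
    "medium_risk": "int64",
    "low-medium_risk": "int64",
    "low_risk": "int64",
    "age_group": "object",
}


def dtype_mismatches(frame: pd.DataFrame, expected: dict) -> dict:
    """Map each mismatching column to (expected dtype, actual dtype or 'missing')."""
    actual = {col: str(dtype) for col, dtype in frame.dtypes.items()}
    return {
        col: (want, actual.get(col, "missing"))
        for col, want in expected.items()
        if actual.get(col) != want
    }


# Illustrative frame: "children" was read as text, so it should be reported.
demo = pd.DataFrame(
    {"age": [19, 18], "bmi": [27.9, 33.77], "children": ["0", "1"]}
)
subset = {col: EXPECTED_DTYPES[col] for col in ("age", "bmi", "children")}
print(dtype_mismatches(demo, subset))
```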


## Column Descriptions

- age: Age of the insured
- sex: Gender of the insured
- bmi: Body Mass Index
- children: Number of dependents
- smoker: Smoking status
- region: Residential region
- charges: Insurance charges
- high_risk, medium_risk, low-medium_risk, low_risk: Risk category flags
- age_group: Age group bucket
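
In the preview in Section 1, the row at index 3 has both low-medium_risk and low_risk set to 1. Whether the risk flags are meant to be mutually exclusive is not stated in this notebook, so the sketch below only counts how many flags each row carries and lists the rows with a count other than one. `flags_per_row` is a hypothetical helper, and `preview` holds the flag values of the first five rows shown in that preview.

```python
import pandas as pd

RISK_COLUMNS = ["high_risk", "medium_risk", "low-medium_risk", "low_risk"]


def flags_per_row(frame: pd.DataFrame) -> pd.Series:
    """Count how many risk flags are set in each row."""
    return frame[RISK_COLUMNS].sum(axis=1)


# Risk flags of rows 0-4 from the preview in Section 1.
preview = pd.DataFrame(
    [
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 1, 0, 0],
    ],
    columns=RISK_COLUMNS,
)

counts = flags_per_row(preview)
# Rows where the number of flags set is not exactly one.
print(preview[counts != 1])
```

Running `flags_per_row(df)` on the full DataFrame would show how common such rows are, which could inform the backlog items below.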

---

# Assumptions

1. The final deliverables are the following:
    
    1. Three Jupyter notebooks: extraction, transformation, and load
    2. Source data: the original dataset, a copy of the original from the extraction phase, and the transformed dataset
    3. README file with notes

2. Any discoveries that require further analysis can be added to the backlog and carried forward to the next sprint.

3. There are no further requirements for ML processing (pipeline creation).

---

# Next steps

1. The gender variable (sex) could be combined with other variables in the dataset to explore its potential for further analysis. I will pick this up in the next sprint.

2. The feature_engineered_insurance.csv file is ready for further analysis.

3. In the next iteration of data collection, the following changes would help balance the dataset:
    1. Increase the sample size of age groups above 19 years to match the number of 18-19-year-old respondents.

    2. Increase the sample size of all regions proportionately. Currently, Southeast has more samples, while the other regions have similar numbers of samples.
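
The balance points above can be checked by looking at each category's share of rows. The following is a minimal sketch; `category_shares` is a hypothetical helper, and `preview` contains only the nine rows visible in the Section 1 preview, so its shares are not representative of the full dataset.

```python
import pandas as pd


def category_shares(frame: pd.DataFrame, column: str) -> pd.Series:
    """Return each category's share of rows, largest share first."""
    return frame[column].value_counts(normalize=True)


# region and age_group values of the nine rows shown in the Section 1 preview.
preview = pd.DataFrame(
    {
        "region": [
            "southwest", "southeast", "southeast", "northwest", "northwest",
            "northwest", "northeast", "southeast", "southwest",
        ],
        "age_group": [
            "teen", "teen", "young_adult", "young_adult", "young_adult",
            "adult", "teen", "teen", "young_adult",
        ],
    }
)

for column in ("region", "age_group"):
    print(category_shares(preview, column).round(3))
```

Applied to the full DataFrame, `category_shares(df, "region")` and `category_shares(df, "age_group")` would quantify how far each group is from an even split.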