# **Health Insurance_Extraction**

## Objectives

* Extract data from Kaggle and check its quality.

## Inputs

* The insurance.csv file from Kaggle.
    - For convenience, I downloaded the file and load it locally.
    - Here is the link: [kaggle_health_insurance](https://www.kaggle.com/datasets/willianoliveiragibin/healthcare-insurance/data?select=insurance.csv)

## Outputs

* By the end of this notebook, I will have:
    - The data extracted from the CSV file
    - The data saved as a new CSV (with any changes I require)
    - A correlation matrix for the DataFrame, showing how the variables interact
    - Information for further analysis

---

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Section 1: Loading CSV

### Reviewing the insurance.csv dataset:
- For convenience, I've downloaded the CSV and stored it locally

In [4]:
df = pd.read_csv("../data/insurance.csv")
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


Variables in this dataset:

1. Age
2. Sex
3. BMI
4. Children
5. Smoker
6. Region
7. Charges

I am interested in discovering how age, sex, BMI, smoking habits, and region impact insurance charges.

In [4]:
print(df.dtypes)  # Checking the dtypes to prepare for encoding and further interpretation. I print them because the output reads more clearly than evaluating the expression on its own.


age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object


In [5]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

There are no missing values in the dataset.

---

### Duplication analysis

Assumption: I am treating rows with identical values as distinct respondents, because two people can share the same BMI or insurance expenditure. In a workplace setting, I would confirm this with the data manager or data engineer to understand how they handle duplicates.

In [None]:
df["bmi"].nunique()  # I do not think duplicates can be detected reliably at this level of detail. We would need a customer identifier or another field that makes each row unique. Without that, searching for duplicates is not a valuable exercise.

548
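If an exact-duplicate check is still wanted, a minimal sketch is shown here. It assumes that a "duplicate" means a row whose every column matches an earlier row. The `sample` frame is hypothetical stand-in data with the same columns as insurance.csv; it is not a claim about which rows repeat in the real file.

```python
import pandas as pd

# Hypothetical stand-in for the insurance data; the last row repeats the first.
sample = pd.DataFrame({
    "age": [19, 18, 28, 19],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 33.77, 33.0, 27.9],
    "children": [0, 1, 3, 0],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southeast", "southwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 16884.924],
})

# duplicated() flags a row only when every column matches an earlier row.
n_exact_duplicates = int(sample.duplicated().sum())

# keep=False returns every copy, which is useful when reviewing them with a data owner.
duplicate_rows = sample[sample.duplicated(keep=False)]

print(n_exact_duplicates)
print(duplicate_rows)
```

Even when this check reports matches, it cannot tell a repeated record from two different people with identical attributes, which is why a unique identifier is still needed.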

---

# Section 2: Data exploration

In [None]:
# Installed nbformat because Plotly needs it to render figures in the notebook. I want to explore the data and check for anything interesting before I transform it.
%pip install nbformat

In [17]:
for col in df.select_dtypes(include="object").columns:  # Loop adapted from the LMS: Pandas Topic 17
    fig = px.histogram(df, x=col, title=f"Distribution of {col}",
                       category_orders={"region": ["northwest", "northeast", "southeast", "southwest"]},
                       width=800, height=400)
    print("\n\n")
    fig.show()

**Distribution of sex**: The dataset has an almost equal number of male and female respondents.

**Distribution of smoker**: There are far fewer smokers than non-smokers.

**Distribution of region**: Three regions (northwest, northeast, and southwest) have a similar number of respondents, while southeast has slightly more.

These observations give me context before I generalize any finding or attempt a prediction. For instance, because southeast has more respondents than the other regions, it contributes proportionally more data points. Additionally, these visuals help the audience understand how the data is structured without having to deal with the details.
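The category balance described above can also be expressed as proportions. This is a minimal sketch, assuming the three categorical columns are named `sex`, `smoker`, and `region` as in the dataset; the `sample` frame is hypothetical, and in the notebook the loaded `df` would take its place.

```python
import pandas as pd

# Hypothetical stand-in data; replace `sample` with the loaded df in the notebook.
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

# value_counts(normalize=True) returns each category's share of the rows.
shares = {}
for col in ["sex", "smoker", "region"]:
    shares[col] = sample[col].value_counts(normalize=True)
    print(shares[col], "\n")
```

Proportions make it easier to compare groups of different sizes than raw counts read off a histogram.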

In [20]:
fig = px.histogram(data_frame=df, x="age", color="sex", title="Distribution of Age", barmode="group", width=1000, height=600)
fig.show()

**Distribution of Age**

This histogram shows how respondents are distributed by age. There are more 18-19-year-old respondents than in any other age group. The remaining groups are fairly similar in size, except for the 64-65 group, which is almost half the size of the others.

In [None]:
fig = px.histogram(data_frame=df, x="charges", title="Distribution of Charges", width=1000, height=600)
fig.show()

**Distribution of Charges**

This histogram shows how charges are distributed. The majority of the respondents pay between 2,000 and 14,000. In the next stage, I will probe how charges are affected by age, BMI, and smoking habits.

In [21]:
fig = px.scatter(data_frame=df, x="age", y="bmi",
                        color="sex", size= "charges", animation_frame="smoker",
                        title="Age vs BMI by Smoker Status and Charges", width=1000, height=600)

fig.show()

**Age vs BMI by Smoker Status and Charges**

This graph suggests that smoking affects the insurance charges respondents pay; the bubble size indicates the charge amount.

I don't see a direct connection between BMI and charges. However, I will probe in this direction by grouping data points in the next stage.

Questions I want to answer:
1. How does smoking impact insurance charges?
2. How does region impact charges?
3. How does gender impact charges?
4. How does age impact BMI and charges?

As I progress to the next stage, I will find answers to these questions.
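As a starting point for the first question, grouping is one way to compare charges across categories. This is a minimal sketch under the assumption that a per-group count, mean, and median are enough for a first look; the `sample` values are hypothetical and do not come from insurance.csv.

```python
import pandas as pd

# Hypothetical stand-in data; in the notebook the loaded df would be used instead.
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
    "charges": [16000.0, 2000.0, 4000.0, 20000.0],
})

# Summarise charges within each smoker group.
charges_by_smoker = sample.groupby("smoker")["charges"].agg(["count", "mean", "median"])
print(charges_by_smoker)
```

Swapping `"smoker"` for `"region"` or `"sex"` gives the same summary for the second and third questions.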

---

# Section 3: Data interaction

I want to understand how the variables interact, so I am going to compute their correlation. Before doing that, I need to transform the string values. I will start by creating a copy of the original dataset.

In [40]:
df_new_copy = df.copy()
df_new_copy.to_csv("../data/modified_insurance.csv", index=False)  # to_csv returns None when given a path, so the result is not reassigned. The copy is saved to ../data/, where the next cell reads it.


In [36]:
df1 = pd.read_csv("../data/modified_insurance.csv")
df1.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [38]:
df1.describe(include="all")  # describe() gives a summary of the dataset; include="all" adds count, unique, top, and freq for the categorical columns.

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
count,1338.0,1338,1338.0,1338.0,1338,1338,1338.0
unique,,2,,,2,4,
top,,male,,,no,southeast,
freq,,676,,,1064,364,
mean,39.207025,,30.663397,1.094918,,,13270.422265
std,14.04996,,6.098187,1.205493,,,12110.011237
min,18.0,,15.96,0.0,,,1121.8739
25%,27.0,,26.29625,0.0,,,4740.28715
50%,39.0,,30.4,1.0,,,9382.033
75%,51.0,,34.69375,2.0,,,16639.912515


In [43]:
df1_corr = df1.corr(method="pearson", numeric_only=True)  # This call fails for me, probably because my pandas version predates the numeric_only parameter of corr (added in pandas 1.5). I will convert the string columns to numbers before computing the correlation between the variables.
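One option that does not depend on `numeric_only` is to select the numeric columns before calling `corr`. This is a minimal sketch with hypothetical `sample` data; it covers only the numeric columns, so the categorical variables still need encoding to be included.

```python
import pandas as pd

# Hypothetical stand-in data mixing numeric and string columns.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061],
})

# Keep only the numeric columns, then compute Pearson correlations.
numeric_df = sample.select_dtypes(include="number")
corr = numeric_df.corr(method="pearson")
print(corr)
```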

In [44]:
df1_encoded = pd.get_dummies(df1, drop_first=False) #GitHub Copilot suggested this line to convert categorical variables into numerical ones. This is necessary for correlation analysis.
df1_corr = df1_encoded.corr()
df1_corr

Unnamed: 0,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
age,1.0,0.109272,0.042469,0.299008,0.020856,-0.020856,0.025019,-0.025019,0.002475,-0.000407,-0.011642,0.010016
bmi,0.109272,1.0,0.012759,0.198341,-0.046371,0.046371,-0.00375,0.00375,-0.138156,-0.135996,0.270025,-0.006205
children,0.042469,0.012759,1.0,0.067998,-0.017163,0.017163,-0.007673,0.007673,-0.022808,0.024806,-0.023066,0.021914
charges,0.299008,0.198341,0.067998,1.0,-0.057292,0.057292,-0.787251,0.787251,0.006349,-0.039905,0.073982,-0.04321
sex_female,0.020856,-0.046371,-0.017163,-0.057292,1.0,-1.0,0.076185,-0.076185,0.002425,0.011156,-0.017117,0.004184
sex_male,-0.020856,0.046371,0.017163,0.057292,-1.0,1.0,-0.076185,0.076185,-0.002425,-0.011156,0.017117,-0.004184
smoker_no,0.025019,-0.00375,-0.007673,-0.787251,0.076185,-0.076185,1.0,-1.0,-0.002811,0.036945,-0.068498,0.036945
smoker_yes,-0.025019,0.00375,0.007673,0.787251,-0.076185,0.076185,-1.0,1.0,0.002811,-0.036945,0.068498,-0.036945
region_northeast,0.002475,-0.138156,-0.022808,0.006349,0.002425,-0.002425,-0.002811,0.002811,1.0,-0.320177,-0.345561,-0.320177
region_northwest,-0.000407,-0.135996,0.024806,-0.039905,0.011156,-0.011156,0.036945,-0.036945,-0.320177,1.0,-0.346265,-0.320829


## Here are some key observations from the correlation exercise:
1. There is a weak positive relationship between:
    a. age and charges (0.30)
    b. bmi and charges (0.20)

2. There is a strong positive relationship between smoking and charges (0.79)
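To read observations like these directly rather than scanning the full matrix, the `charges` column of the correlation matrix can be sorted by absolute value. This is a minimal sketch with hypothetical `sample` data. It uses `drop_first=True`, which is an assumption on my part and differs from the cell above: it drops one dummy per category so that mirrored pairs such as `smoker_no` and `smoker_yes` do not both appear.

```python
import pandas as pd

# Hypothetical stand-in data; the values are illustrative only.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 46],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88, 30.1],
    "smoker": ["yes", "no", "no", "no", "no", "yes"],
    "charges": [16884.0, 1725.0, 4449.0, 21984.0, 3866.0, 30000.0],
})

# Encode smoker as a single 0/1 column (smoker_yes); dtype=float keeps corr() straightforward.
encoded = pd.get_dummies(sample, columns=["smoker"], drop_first=True, dtype=float)

# Take the charges column of the matrix, drop its self-correlation, and rank by strength.
charges_corr = (
    encoded.corr()["charges"]
    .drop("charges")
    .sort_values(key=abs, ascending=False)
)
print(charges_corr.round(2))
```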

---

# Section 4: Summary

1. The insurance.csv dataset is quite clean, with no missing values.
2. I've assumed that there are no duplicates, given the nature of the dataset.
3. Initial analysis reveals that smoking and insurance charges have the highest correlation of the variables examined.
4. Initial graphs indicate that age, sex, and region may each be related to charges in some manner.

I will close this notebook here and continue with the transformation. I have a good understanding of how I want the data to be transformed. 