# Hackathon 1: Health Insurance ETL Notebook

This notebook focuses on the ETL process:
- Extracting the dataset
- Cleaning and transforming it
- Saving the cleaned version for analysis

All visualisation steps are documented separately in 'viz.ipynb'.

## Objectives
- Work with the health insurance dataset provided for the Hackathon.
- Perform basic cleaning (check for missing values, round numbers).
- Save a cleaned dataset.


## Inputs
- Dataset: insurance.csv
- The dataset has 1,338 rows and 7 columns: age, sex, bmi, children, smoker, region, charges.

## Outputs

- A cleaned dataset saved as cleaned_healthcare_insurance.csv.
- Summary statistics and initial exploratory analysis (only where they are part of the ETL process).

## Additional Comments
- This work was carried out using a template Notebook provided during the Hackathon. 
- The code has been adapted to work with my own dataset and reflects my own cleaning and analysis process.
- This notebook is focused only on the ETL process (extracting, cleaning and saving the dataset).
- All data visualisation steps are documented separately in the viz.ipynb notebook. 



---

# Data Cleaning

### Data Cleaning 
In this section, I perform basic cleaning and preparation of the dataset before analysis. Steps include:
- Importing the required Python libraries ('pandas', 'numpy', 'matplotlib', 'seaborn').
- Reading the raw dataset from the CSV file.
- Displaying the first few rows to inspect the structure and values.

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Load Raw Data
I will load the dataset ('insurance.csv') and check the first few rows to understand its structure.

In [2]:
df = pd.read_csv("../Data/insurance.csv")

In [3]:
df.head()

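| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |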


# Check data types (object, float, etc.) and view a summary of the DataFrame

## Inspect Data 
Check the dataset's shape, info and summary statistics to identify potential cleaning needs (missing values, incorrect formats, etc.).

In [12]:
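print(df.shape)   # (rows, columns)
df.info()         # column names, non-null counts and dtypes
df.describe()     # summary statistics for the numeric columns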


(1338, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663572 | 1.094918 | 13270.42228 |
| std | 14.04996 | 6.097951 | 1.205493 | 12110.011259 |
| min | 18.0 | 15.96 | 0.0 | 1121.87 |
| 25% | 27.0 | 26.2975 | 0.0 | 4740.2875 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.03 |
| 75% | 51.0 | 34.6925 | 2.0 | 16639.915 |
| max | 64.0 | 53.13 | 5.0 | 63770.43 |
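
The inspection step can be extended to look for incorrect formats in the text columns and for repeated rows. The following is only a sketch on a small hypothetical sample that reuses the column names of insurance.csv; the sample values, the `sample` frame and the variable names are assumptions for illustration, not results from the real dataset.

```python
import pandas as pd

# Hypothetical sample with the same columns as insurance.csv (illustrative values only).
sample = pd.DataFrame(
    {
        "age": [19, 18, 28, 28],
        "sex": ["female", "male", "male", "male"],
        "bmi": [27.9, 33.77, 33.0, 33.0],
        "children": [0, 1, 3, 3],
        "smoker": ["yes", "no", "no", "no"],
        "region": ["southwest", "southeast", "southeast", "southeast"],
        "charges": [16884.924, 1725.5523, 4449.462, 4449.462],
    }
)

# Distinct labels per text column: unexpected spellings or casing would show up here.
categories = {col: sorted(sample[col].unique()) for col in ["sex", "smoker", "region"]}
print(categories)

# Number of fully duplicated rows (in this sample the last row repeats the third one).
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)
```

Running the same two checks on `df` would show whether the real columns contain any unexpected labels or repeated rows.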


# Sample of first few rows

In [5]:
# df.head(10)

# Count of missing values per column

### Check for Missing Values
We check each column to see if there are missing entries. The result shows **0 missing values** across all columns.

In [6]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
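
Because the real dataset has no gaps, the output above is all zeros. To show what the same check reports when values are missing, here is a small sketch on a made-up frame; `toy` and its values are hypothetical and are not part of insurance.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
toy = pd.DataFrame({"bmi": [27.9, np.nan, 33.0], "smoker": ["yes", "no", None]})

# Same check as in the notebook: count of missing entries per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value, useful for deciding how to handle them.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```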

# Round values for bmi and charges to 2 decimal places

### Round Numeric Values
We round 'bmi' and 'charges' to two decimal places to standardise the data for easier readability and analysis.

In [7]:
df['bmi'] = df['bmi'].round(2)
df['charges'] = df['charges'].round(2)
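
As an alternative to rounding each column separately, `DataFrame.round` also accepts a dictionary that maps column names to decimal places. The sketch below applies it to three rows taken from the preview above; the frame name `raw` is an assumption, and the notebook itself uses the per-column form.

```python
import pandas as pd

# Three raw rows copied from the first preview of the dataset.
raw = pd.DataFrame(
    {
        "age": [19, 18, 32],
        "bmi": [27.9, 33.77, 28.88],
        "charges": [16884.924, 1725.5523, 3866.8552],
    }
)

# Round only the listed columns; columns not named in the dict are left unchanged.
rounded = raw.round({"bmi": 2, "charges": 2})
print(rounded)
```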

### Preview Cleaned Data 
Display the first 10 rows of the cleaned dataset to confirm changes. 

In [21]:
df.head(10)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.92 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.55 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.46 |
| 3 | 33 | male | 22.7 | 0 | no | northwest | 21984.47 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.86 |
| 5 | 31 | female | 25.74 | 0 | no | southeast | 3756.62 |
| 6 | 46 | female | 33.44 | 1 | no | southeast | 8240.59 |
| 7 | 37 | female | 27.74 | 3 | no | northwest | 7281.51 |
| 8 | 37 | male | 29.83 | 2 | no | northeast | 6406.41 |
| 9 | 60 | female | 25.84 | 0 | no | northwest | 28923.14 |


# Save Cleaned Data

### Saving and Reloading Cleaned Data 
- The cleaned dataset is saved as 'cleaned_healthcare_insurance.csv'.
- Reloading the dataset ensures it has been saved correctly and can be reused in future analysis.

In [8]:
df.to_csv('cleaned_healthcare_insurance.csv', index=False)


In [9]:
import pandas as pd
import numpy as np

In [10]:
df_cleaned = pd.read_csv('cleaned_healthcare_insurance.csv')


In [25]:
df_cleaned.head(10)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.92 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.55 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.46 |
| 3 | 33 | male | 22.7 | 0 | no | northwest | 21984.47 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.86 |
| 5 | 31 | female | 25.74 | 0 | no | southeast | 3756.62 |
| 6 | 46 | female | 33.44 | 1 | no | southeast | 8240.59 |
| 7 | 37 | female | 27.74 | 3 | no | northwest | 7281.51 |
| 8 | 37 | male | 29.83 | 2 | no | northeast | 6406.41 |
| 9 | 60 | female | 25.84 | 0 | no | northwest | 28923.14 |
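
Comparing the first rows by eye works for a quick check. A stricter option is to compare the saved and reloaded frames programmatically. The sketch below does this with `pandas.testing.assert_frame_equal` on a small stand-in frame written to a temporary folder; the frame `cleaned` and the file name `cleaned_sample.csv` are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd
from pandas.testing import assert_frame_equal

# Small stand-in for the cleaned dataset (values copied from the preview above).
cleaned = pd.DataFrame(
    {"age": [19, 18], "smoker": ["yes", "no"], "charges": [16884.92, 1725.55]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned_sample.csv"  # hypothetical file name
    cleaned.to_csv(out_path, index=False)            # same call pattern as the notebook
    reloaded = pd.read_csv(out_path)

# Raises AssertionError if columns, dtypes or values differ after the round trip.
assert_frame_equal(cleaned, reloaded)
round_trip_ok = True
print(round_trip_ok)
```

In the notebook, the equivalent check would be `assert_frame_equal(df, df_cleaned)` after the reload cell.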


# Change working directory

When working with notebooks stored in a subfolder, it's often necessary to change the working directory.
This allows us to access files (like datasets) that are stored in the parent directory.
Steps:
- Use 'os.getcwd()' to check the current directory.
- Use 'os.path.dirname()' to get the parent directory and 'os.chdir()' to set it as the new current directory.

import os
current_dir = os.getcwd()
current_dir

In [22]:

# Change to parent directory 
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")



You set a new current directory


Confirm the new current directory

In [21]:
import os
# Confirm the new working directory
current_dir = os.getcwd()
current_dir

'c:\\Users\\ifrah\\OneDrive\\Pictures\\Documents\\vscode-projects\\da-project1\\notebooks'
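
One thing to watch with the approach in this section is that re-running the `os.getcwd()` and `os.chdir()` cells moves the working directory up one more level each time. A possible alternative, sketched below, is to leave the working directory alone and build the file path with `pathlib` instead. The `Data` folder name is assumed from the `"../Data/insurance.csv"` path used earlier, and the variable names are illustrative.

```python
from pathlib import Path

# Directory the notebook is currently running from.
notebook_dir = Path.cwd().resolve()

# Parent folder, computed without changing the working directory.
project_root = notebook_dir.parent

# Assumed layout: a "Data" folder next to the notebooks folder.
data_file = project_root / "Data" / "insurance.csv"
print(data_file)
```

The resulting `data_file` can be passed straight to `pd.read_csv`, and running the cell several times always gives the same path.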