# Healthcare Data Cleaning Notebook
## Introduction
This notebook takes the healthcare datasets produced by an earlier cleaning pass and applies a second round of cleaning and transformation steps based on our earlier analysis. The goal is to prepare the data for loading into a database or for further analysis. Each step is broken down into a separate cell for clarity.
### 1. Imports and Setup
First, we import the necessary libraries. pandas is for data manipulation and os is for interacting with the file system (to create directories and manage file paths).

In [5]:
import pandas as pd
import os

### 2. Configuration
Design Decision: Define input and output file paths as variables in a dedicated configuration cell.
Why? This separates configuration from the core logic. Instead of hard-coding file paths deep inside the script, this design allows anyone (including your future self) to easily reuse this notebook for different data by only changing these two variables. It makes the code more maintainable and reusable.

In [6]:
INPUT_DATA_DIR = os.path.join('..', 'cleaned_data')
OUTPUT_DATA_DIR = os.path.join('..', 'v2cleaned_data')
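Since os was imported partly to create directories, the output folder can be created before anything is written to it. The following is a minimal sketch; the helper name `ensure_output_dir` is hypothetical and is not part of the original notebook.

```python
import os

def ensure_output_dir(path):
    """Create `path` (and any missing parent folders) if needed, then return it."""
    # exist_ok=True makes the call safe to re-run when the folder already exists.
    os.makedirs(path, exist_ok=True)
    return path

# In the notebook this would be called as: ensure_output_dir(OUTPUT_DATA_DIR)
```

Keeping this next to the path definitions means a later write step does not fail just because the folder is missing.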

### 3. Load Data

Load all datasets

In [8]:
try:
    observations = pd.read_csv(os.path.join(INPUT_DATA_DIR, "observations_cleaned.csv"))
    patients = pd.read_csv(os.path.join(INPUT_DATA_DIR, "patients_cleaned.csv"))
    procedures = pd.read_csv(os.path.join(INPUT_DATA_DIR, "procedures_cleaned.csv"))
    diagnoses = pd.read_csv(os.path.join(INPUT_DATA_DIR, "diagnoses_cleaned.csv"))
    encounters = pd.read_csv(os.path.join(INPUT_DATA_DIR, "encounters_cleaned.csv"))
    medications = pd.read_csv(os.path.join(INPUT_DATA_DIR, "medications_cleaned.csv"))
    print("All datasets loaded successfully.")
except FileNotFoundError as e:
    print(f"[ERROR] A data file was not found. Please check the input directory path. Details: {e}")
    # Re-raise so the notebook stops here instead of failing later with a NameError.
    raise

All datasets loaded successfully.
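The six reads above follow one naming pattern, so they could also be written as a loop. This is only a sketch of that alternative: `TABLE_NAMES` and `load_tables` are hypothetical names, and the function returns a dict of DataFrames rather than six separate variables.

```python
import os
import pandas as pd

# Table names mirror the six files loaded above.
TABLE_NAMES = ["observations", "patients", "procedures",
               "diagnoses", "encounters", "medications"]

def load_tables(input_dir, names=TABLE_NAMES):
    """Read <name>_cleaned.csv for each name into a dict of DataFrames."""
    tables = {}
    for name in names:
        path = os.path.join(input_dir, f"{name}_cleaned.csv")
        try:
            tables[name] = pd.read_csv(path)
        except FileNotFoundError:
            print(f"[ERROR] Missing data file: {path}")
            raise  # stop rather than continue with a partial set of tables
    return tables
```

With this version, `tables = load_tables(INPUT_DATA_DIR)` followed by `patients = tables["patients"]` would give the same DataFrame as the cell above.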


Let's double-check the load by previewing the first few rows of observations.

In [10]:
observations.head()

Unnamed: 0,encounter_id,observation_code,observation_datetime,observation_description,observation_id,patient_id,units,value_numeric,value_text
0,f5f83a54-5883-413d-9bb4-c859fa6b8cde,4548-4,2025-04-14,Hemoglobin A1c/Hemoglobin.total in Blood,c70dc224-4c15-43ec-89b6-ed7821d80df2,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,%,7.6,
1,f5f83a54-5883-413d-9bb4-c859fa6b8cde,2345-7,2025-04-14,Glucose [Mass/Vol],065df109-6962-496e-82a7-ab975746f265,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mg/dL,210.0,
2,f5f83a54-5883-413d-9bb4-c859fa6b8cde,2160-0,2025-04-14,Creatinine [Mass/Vol],ea1a0317-d4cf-4f4c-9d3b-9e87700f67bc,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mg/dL,1.0,
3,a4345130-e167-45b5-9e60-75a1815d3ae0,8480-6,2026-04-08,Systolic blood pressure,8cd3eab8-0a7b-4e49-856c-ba2a081f969f,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mmHg,101.0,
4,a4345130-e167-45b5-9e60-75a1815d3ae0,8462-4,2026-04-08,Diastolic blood pressure,31cb9c28-2bad-431a-b59a-6be7750e3184,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mmHg,68.0,
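The preview includes an `Unnamed: 0` column. pandas gives that name to a CSV column with a blank header, which usually means an earlier step saved the row index with `to_csv`. Assuming that is the case here and the column carries no information, it could be dropped with something like the sketch below (`drop_saved_index` is a hypothetical helper).

```python
import pandas as pd

def drop_saved_index(df):
    """Return a copy of df without columns named 'Unnamed: <n>'."""
    leftover = [c for c in df.columns if str(c).startswith("Unnamed:")]
    return df.drop(columns=leftover)

# Example on a small made-up frame:
demo = pd.DataFrame({"Unnamed: 0": [0, 1], "units": ["%", "mg/dL"]})
cleaned_demo = drop_saved_index(demo)
```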


### 4. Cleaning the patients DataFrame
Design Decisions:
* Use pd.to_datetime with errors='coerce'.
* Convert zip_code to a string type using .astype(str).
#### Why?
* pd.to_datetime: This is the standard pandas function for converting columns to a proper date format, which is essential for correct sorting and calculations.
* errors='coerce': This is a crucial safety feature. If any date in the column is malformed (e.g., '2020-02-30'), this option will turn it into NaT (Not a Time) instead of crashing the script. This allows the process to complete, and we can handle the invalid entries later if needed.
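* .astype(str) for zip codes: Zip codes are identifiers, not numbers meant for arithmetic. Storing them as numbers drops leading zeros (e.g., 07960 becomes 7960), so we keep them as strings. Note that .astype(str) only preserves zeros that are still present: if read_csv already parsed the column as integers, the zero was lost at load time, and the column should instead be read with dtype={'zip_code': str} or padded back with .str.zfill(5).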
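To see what errors='coerce' does in practice, here is a small sketch on made-up values (the sample strings are illustrative, not taken from the patients file). It also shows one way to count how many entries were turned into NaT by parsing.

```python
import pandas as pd

# Made-up sample: one valid date, one impossible date, one non-date string.
raw = pd.Series(["1980-05-17", "2020-02-30", "not a date"])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce")

# Entries that were present in the input but could not be parsed.
newly_invalid = parsed.isna() & raw.notna()
n_invalid = int(newly_invalid.sum())
```

Checking a count like `n_invalid` after conversion is a cheap way to notice silently coerced values before they reach the database.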

In [12]:
patients['date_of_birth'] = pd.to_datetime(patients['date_of_birth'], errors='coerce')
patients['zip_code'] = patients['zip_code'].astype(str)
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   address        150 non-null    object        
 1   city           150 non-null    object        
 2   date_of_birth  150 non-null    datetime64[ns]
 3   first_name     150 non-null    object        
 4   gender         150 non-null    object        
 5   last_name      150 non-null    object        
 6   patient_id     150 non-null    object        
 7   phone_number   150 non-null    object        
 8   state          150 non-null    object        
 9   zip_code       150 non-null    object        
dtypes: datetime64[ns](1), object(9)
memory usage: 11.8+ KB
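The info() output confirms that zip_code is now an object (string) column, but it does not show whether leading zeros survived the earlier load. The sketch below uses a made-up two-row CSV to illustrate the difference between casting after the read and reading the column as a string. The zfill repair assumes plain five-digit US ZIP codes with no ZIP+4 suffix.

```python
import io
import pandas as pd

# Made-up sample data, not from the patients file.
csv_text = "patient_id,zip_code\np1,07960\np2,10001\n"

# Default parsing infers an integer column, so the leading zero is lost
# before .astype(str) ever runs.
as_int = pd.read_csv(io.StringIO(csv_text))
after_cast = as_int["zip_code"].astype(str).tolist()

# Reading the column as a string keeps the digits exactly as written.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})
preserved = as_str["zip_code"].tolist()

# If the zeros are already gone, zfill can pad back to five digits.
repaired = as_int["zip_code"].astype(str).str.zfill(5).tolist()
```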
