# Healthcare Data Cleaning Notebook
## Introduction
This notebook takes the raw (but previously cleaned) healthcare datasets and applies a series of cleaning and transformation steps based on our earlier analysis. The goal is to prepare the data for loading into a database or for further analysis. Each step is broken down into a separate cell for clarity.
### Imports and Setup
First, we import the necessary libraries. pandas is for data manipulation and os is for interacting with the file system (to create directories and manage file paths).

In [19]:
import pandas as pd
import os

### Configuration
Design Decision: Define input and output file paths as variables in a dedicated configuration cell.
Why? This separates configuration from the core logic. Instead of hard-coding file paths deep inside the script, this design allows anyone (including your future self) to easily reuse this notebook for different data by only changing these two variables. It makes the code more maintainable and reusable.

In [20]:
INPUT_DATA_DIR = os.path.join('..', 'cleaned_data')
OUTPUT_DATA_DIR = os.path.join('..', 'v2cleaned_data')

### Load Data

Load all datasets

In [21]:
try:
    observations = pd.read_csv(os.path.join(INPUT_DATA_DIR, "observations_cleaned.csv"))
    patients = pd.read_csv(os.path.join(INPUT_DATA_DIR, "patients_cleaned.csv"))
    procedures = pd.read_csv(os.path.join(INPUT_DATA_DIR, "procedures_cleaned.csv"))
    diagnoses = pd.read_csv(os.path.join(INPUT_DATA_DIR, "diagnoses_cleaned.csv"))
    encounters = pd.read_csv(os.path.join(INPUT_DATA_DIR, "encounters_cleaned.csv"))
    medications = pd.read_csv(os.path.join(INPUT_DATA_DIR, "medications_cleaned.csv"))
    print("All datasets loaded successfully.")
except FileNotFoundError as e:
    print(f"[ERROR] A data file was not found. Please check the input directory path. Details: {e}")

All datasets loaded successfully.


Lets double check

In [22]:
observations.head()

Unnamed: 0,encounter_id,observation_code,observation_datetime,observation_description,observation_id,patient_id,units,value_numeric,value_text
0,f5f83a54-5883-413d-9bb4-c859fa6b8cde,4548-4,2025-04-14,Hemoglobin A1c/Hemoglobin.total in Blood,c70dc224-4c15-43ec-89b6-ed7821d80df2,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,%,7.6,
1,f5f83a54-5883-413d-9bb4-c859fa6b8cde,2345-7,2025-04-14,Glucose [Mass/Vol],065df109-6962-496e-82a7-ab975746f265,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mg/dL,210.0,
2,f5f83a54-5883-413d-9bb4-c859fa6b8cde,2160-0,2025-04-14,Creatinine [Mass/Vol],ea1a0317-d4cf-4f4c-9d3b-9e87700f67bc,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mg/dL,1.0,
3,a4345130-e167-45b5-9e60-75a1815d3ae0,8480-6,2026-04-08,Systolic blood pressure,8cd3eab8-0a7b-4e49-856c-ba2a081f969f,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mmHg,101.0,
4,a4345130-e167-45b5-9e60-75a1815d3ae0,8462-4,2026-04-08,Diastolic blood pressure,31cb9c28-2bad-431a-b59a-6be7750e3184,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mmHg,68.0,


### Basic Data Type Cleaning (patients)
Design Decisions:
* Use pd.to_datetime with errors='coerce'.
* Convert zip_code to a string type using .astype(str).
#### Why?
* pd.to_datetime: This is the standard pandas function for converting columns to a proper date format, which is essential for correct sorting and calculations. errors='coerce' is a crucial safety feature that will turn malformed dates into NaT (Not a Time) instead of crashing the script.
* .astype(str) for zip codes: Zip codes are identifiers, not numbers meant for mathematical operations. Storing them as numbers can cause problems, like automatically removing leading zeros (e.g., 07960 would become 7960). Converting to a string preserves the exact format.

In [23]:
patients['date_of_birth'] = pd.to_datetime(patients['date_of_birth'], errors='coerce')
patients['zip_code'] = patients['zip_code'].astype(str)
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   address        150 non-null    object        
 1   city           150 non-null    object        
 2   date_of_birth  150 non-null    datetime64[ns]
 3   first_name     150 non-null    object        
 4   gender         150 non-null    object        
 5   last_name      150 non-null    object        
 6   patient_id     150 non-null    object        
 7   phone_number   150 non-null    object        
 8   state          150 non-null    object        
 9   zip_code       150 non-null    object        
dtypes: datetime64[ns](1), object(9)
memory usage: 11.8+ KB


### Advanced Cleaning: Standardizing Text and Categorical Data (patients)
Design Decisions:
* Use .str.title() for names and cities to enforce proper case.
* Use .str.upper() for state abbreviations.
* Convert the gender column to the category data type.
### Why?
* Case Standardization (.str.title()/.str.upper()): Real-world data is messy. You might have "new york", "New York", and "NEW YORK" all referring to the same city. Standardizing the case makes the data consistent, which is essential for accurate grouping and analysis. We use title case for names/cities and upper case for state codes, as is standard convention.
* Categorical Data (.astype('category')): For columns with a small, fixed number of unique values (like gender), converting to the category type is more memory-efficient than leaving it as object (string). It can also speed up some operations. This is a best practice for clean, optimized data.

In [24]:
patients['first_name'] = patients['first_name'].str.title()
patients['last_name'] = patients['last_name'].str.title()
patients['city'] = patients['city'].str.title()
patients['state'] = patients['state'].str.upper()

Convert gender to a more efficient categorical type

In [25]:
patients['gender'] = patients['gender'].astype('category')

print("\nSample of cleaned patient names and locations:")
patients.head()


Sample of cleaned patient names and locations:


Unnamed: 0,address,city,date_of_birth,first_name,gender,last_name,patient_id,phone_number,state,zip_code
0,26236 Nunez Road Apt. 527,Sharpchester,1985-01-11,Juan,Male,Calderon,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,+4246662958x202,MD,9173
1,90829 Thomas Summit,East Christophermouth,1981-12-11,Paul,Male,Price,0eeb5541-d0b3-47fe-839c-a2227526b751,+15154233703,ND,62828
2,3939 Sarah Ridges,Jeffreyburgh,1950-12-17,Julie,Female,Brown,83f30300-2873-49f7-8fe4-06903a75db73,+0016517844153,NH,80694
3,5559 Walton Inlet,West Holly,1963-01-22,Sarah,Female,Dillon,3a707a9a-00b9-40f1-90bf-1a4ff74fcb61,+0019383725030x5868,AK,30233
4,4609 Reginald Plaza Apt. 985,Megantown,1943-01-06,Laura,Female,Brown,825e3f21-ca2a-442a-8d95-7f3dd64c3c6a,+8447233702,FL,40222


### Feature Engineering: Creating an age Column (patients)
Design Decision: Create a new age column by calculating the difference between a fixed current date and the patient's date of birth.
Why? This is a form of feature engineering—creating a new, useful piece of information from existing data. An age column is much more useful for analysis than a date_of_birth column. For example, we can now easily analyze health trends by age group. We use a fixed date for reproducibility; using pd.Timestamp.now() would cause the age to change every time the script is run.

Use a fixed date for reproducibility of the age calculation

In [26]:
analysis_date = pd.to_datetime('2025-06-27')

Calculate age in years

In [27]:
patients['age'] = (analysis_date - patients['date_of_birth']).dt.days // 365
print("'age' column created.")
print("\nSample of patients with new 'age' column:")
print(patients[['first_name', 'last_name', 'date_of_birth', 'age']].head())

'age' column created.

Sample of patients with new 'age' column:
  first_name last_name date_of_birth  age
0       Juan  Calderon    1985-01-11   40
1       Paul     Price    1981-12-11   43
2      Julie     Brown    1950-12-17   74
3      Sarah    Dillon    1963-01-22   62
4      Laura     Brown    1943-01-06   82


### Data Validation and Feature Engineering (encounters)
Design Decisions:
* Validate that discharge_date is not before admission_date.
* Create a new visit_duration_days column.
Why?
* Logical Validation: It's logically impossible for a patient to be discharged before they are admitted. This check ensures data integrity. We will flag these rows rather than deleting them, as they may require manual investigation.
* Feature Engineering: visit_duration_days is a much more valuable feature for analysis than the two separate date columns. It allows us to quickly find the length of each visit. We calculate this only for valid date ranges.

First, convert dates to the correct type

In [28]:
encounters['admission_date'] = pd.to_datetime(encounters['admission_date'], errors='coerce')
encounters['discharge_date'] = pd.to_datetime(encounters['discharge_date'], errors='coerce')

Validation check

In [30]:
invalid_dates = encounters[encounters['discharge_date'] < encounters['admission_date']]
if not invalid_dates.empty:
    print(f"[WARNING] Found {len(invalid_dates)} encounters with discharge date before admission date.")
else:
    print("Encounter date logic is valid.")

Encounter date logic is valid.


Feature Engineering

In [38]:
encounters['visit_duration_days'] = (encounters['discharge_date'] - encounters['admission_date']).dt.days
print(encounters[['admission_date', 'discharge_date', 'visit_duration_days']].describe())

                      admission_date                 discharge_date  \
count                            291                            291   
mean   2011-07-02 12:51:57.525773312  2011-07-02 13:31:32.783505152   
min              1970-09-02 00:00:00            1970-09-02 00:00:00   
25%              2003-05-24 12:00:00            2003-05-24 12:00:00   
50%              2015-11-09 00:00:00            2015-11-09 00:00:00   
75%              2023-09-14 00:00:00            2023-09-14 00:00:00   
max              2027-05-12 00:00:00            2027-05-12 00:00:00   
std                              NaN                            NaN   

       visit_duration_days  
count           291.000000  
mean              0.027491  
min               0.000000  
25%               0.000000  
50%               0.000000  
75%               0.000000  
max               1.000000  
std               0.163792  


### Advanced Cleaning and Outlier Detection (observations)
Design Decisions:
* Create a reusable function to detect outliers using the IQR method.
* Apply this function to the value_numeric column.
* Create a new boolean column is_outlier instead of removing the data.
### Why?
* IQR Method: The Interquartile Range (IQR) method is a standard and robust statistical technique for identifying outliers. It is less sensitive to a few extreme values than methods based on standard deviation.
* Non-Destructive Flagging: In medical data, an extreme value might be a data entry error, or it could be a critical, life-threatening event. Deleting potential outliers is dangerous because it involves throwing away potentially vital information. Flagging them in a new column is a much safer approach. It preserves all original data while allowing analysts to easily include or exclude these values during analysis.

First, convert date to the correct type

In [39]:
observations['observation_datetime'] = pd.to_datetime(observations['observation_datetime'], errors='coerce')

In [40]:
def flag_outliers(df, column):
    """Flags outliers in a specified column of a DataFrame using the IQR method."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (df[column] < lower_bound) | (df[column] > upper_bound)

Apply the function to the numeric values

In [41]:
observations['is_outlier'] = flag_outliers(observations, 'value_numeric')

Fill NaN in the new column with False, as missing values are not outliers

In [42]:
observations['is_outlier'].fillna(False, inplace=True)

outlier_count = observations['is_outlier'].sum()
if outlier_count > 0:
    print(f"Flagged {outlier_count} potential outliers in 'value_numeric'.")
else:
    print("No outliers detected in 'value_numeric'.")

print("\nSample of observations with outlier flag:")
print(observations[['observation_description', 'value_numeric', 'is_outlier']].tail())

Flagged 9 potential outliers in 'value_numeric'.

Sample of observations with outlier flag:
      observation_description  value_numeric  is_outlier
881   Systolic blood pressure         136.00       False
882  Diastolic blood pressure          94.00       False
883     Creatinine [Mass/Vol]           0.91       False
884   Systolic blood pressure         132.00       False
885  Diastolic blood pressure          84.00       False


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  observations['is_outlier'].fillna(False, inplace=True)


### Cleaning Remaining DataFrames (procedures, diagnoses, medications)
Design Decision: Process the remaining tables, handling their specific data types and missing values as determined in our analysis.
#### Why? 
Each table has unique characteristics. We apply the specific logic needed for each one: converting dates, and for medications, correctly handling missing dosage and end_date values to preserve their meaning.

In [45]:
procedures['date_performed'] = pd.to_datetime(procedures['date_performed'], errors='coerce')
diagnoses['date_recorded'] = pd.to_datetime(diagnoses['date_recorded'], errors='coerce')

medications['start_date'] = pd.to_datetime(medications['start_date'], errors='coerce')
medications['end_date'] = pd.to_datetime(medications['end_date'], errors='coerce')

Fill missing dosage with 'Unknown' - this is a data quality issue

In [55]:
medications['dosage'].fillna('Unknown', inplace=True)
print(medications[medications['dosage'] == 'Unkown'])

Empty DataFrame
Columns: [medication_order_id, patient_id, encounter_id, drug_code, drug_name, dosage, route, frequency, start_date, end_date]
Index: []


### Preparing Data for PostgreSQL
Design Decision: Save the final DataFrames as Parquet files instead of CSV.
#### Why? This is a critical data engineering best practice.
* CSV (The Alternative): CSV files are just plain text. They do not store metadata. When you save a DataFrame as a CSV, all data types are lost. 2025-06-27 becomes a simple string "2025-06-27", and the number 50 becomes the string "50". Your teammate would have to guess the correct data types when creating the PostgreSQL table, which often leads to errors.
* Parquet (Our Choice): Parquet is a modern, columnar file format designed for efficiency and data-aware systems. It saves the schema (the column names and their data types) along with the data. When your teammate reads the Parquet file, it will know that age is an integer and admission_date is a timestamp. This makes the data handoff much more reliable and efficient.

If you dont have pyarrow: pip install pyarrow 

In [60]:
print("\nSaving cleaned dataframes to Parquet format...")

dataframes = {
"observations": observations, "patients": patients, "procedures": procedures,
"diagnoses": diagnoses, "encounters": encounters, "medications": medications
}

for name, df in dataframes.items():
# Trim whitespace from all object (string) columns before saving
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].notna().any():
            df[col] = df[col].str.strip()

# Define the output path
output_path = os.path.join(OUTPUT_DATA_DIR, f"{name}.parquet")

# Save the cleaned dataframe to a new Parquet file
df.to_parquet(output_path, index=False)
print(f"Successfully saved cleaned data to: {output_path}")

print("\nData cleaning process finished successfully!")


Saving cleaned dataframes to Parquet format...
Successfully saved cleaned data to: ..\v2cleaned_data\medications.parquet

Data cleaning process finished successfully!
