# Data Engineering Healthcare Project
---
### Data Analysis 

In this notebook, we analyze the provided datasets and compare three database systems.
We also motivate our final database choice based on the structure and properties of the data.

## I'm going to try and answer these key questions:

---

### Types of Data
- [ ] Is the data structured, semi-structured, or unstructured?
- [ ] What kind of data is it? (e.g., patient info, vital signs, diagnoses)
- [ ] What data formats are used? (e.g., CSV, JSON)

---

### Metadata & Schema
- [ ] Are column names clear and meaningful?
- [ ] What is the data type of each column? (int, float, string, datetime)
- [ ] Are there unique identifiers? (e.g., patient_id)
- [ ] Are there relationships between tables? (e.g., foreign keys?)

---

### Data Overview
- [ ] How many files or tables are there?
- [ ] How many rows and columns in each?

---

### Missing Values
- [ ] Which columns have missing values?
- [ ] What percentage of each column is missing?

---

### Data Quality
- [ ] Are there duplicate rows or IDs?
- [ ] Are category labels consistent? (e.g., "Male", "male", "M")
- [ ] Are there logical errors? (e.g., birthdate after visit date)

---

### Privacy & Sensitivity
- [ ] Are there any personal identifiers? (e.g., name, address)
- [ ] Should any data be anonymized or masked?

---

### Summary & Next Steps
- [ ] What kind of cleaning might be needed?
- [ ] Is the data ready for processing and storage?


Let's start by importing the relevant libraries. I will use pandas and NumPy for the data, and matplotlib and seaborn for visualisation. Seaborn builds on matplotlib, works well with pandas DataFrames, and makes my life a lot easier.

In [75]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

Now we're going to import the data of set1 (`observations_cleaned.csv`, `patients_cleaned.csv`, `procedures_cleaned.csv`). They are all in CSV format.

In [64]:
observations = pd.read_csv("../cleaned_data/observations_cleaned.csv")
patients = pd.read_csv("../cleaned_data/patients_cleaned.csv")
procedures = pd.read_csv("../cleaned_data/procedures_cleaned.csv")

Let's try to answer the questions:
### 1. Is the data structured, semi-structured, or unstructured? Spoiler: structured

In [65]:
observations.head()

Unnamed: 0,encounter_id,observation_code,observation_datetime,observation_description,observation_id,patient_id,units,value_numeric,value_text
0,f5f83a54-5883-413d-9bb4-c859fa6b8cde,4548-4,2025-04-14,Hemoglobin A1c/Hemoglobin.total in Blood,c70dc224-4c15-43ec-89b6-ed7821d80df2,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,%,7.6,
1,f5f83a54-5883-413d-9bb4-c859fa6b8cde,2345-7,2025-04-14,Glucose [Mass/Vol],065df109-6962-496e-82a7-ab975746f265,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mg/dL,210.0,
2,f5f83a54-5883-413d-9bb4-c859fa6b8cde,2160-0,2025-04-14,Creatinine [Mass/Vol],ea1a0317-d4cf-4f4c-9d3b-9e87700f67bc,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mg/dL,1.0,
3,a4345130-e167-45b5-9e60-75a1815d3ae0,8480-6,2026-04-08,Systolic blood pressure,8cd3eab8-0a7b-4e49-856c-ba2a081f969f,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mmHg,101.0,
4,a4345130-e167-45b5-9e60-75a1815d3ae0,8462-4,2026-04-08,Diastolic blood pressure,31cb9c28-2bad-431a-b59a-6be7750e3184,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,mmHg,68.0,


In [66]:
patients.head()

Unnamed: 0,address,city,date_of_birth,first_name,gender,last_name,patient_id,phone_number,state,zip_code
0,26236 Nunez Road Apt. 527,Sharpchester,1985-01-11,Juan,Male,Calderon,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,+4246662958x202,MD,9173
1,90829 Thomas Summit,East Christophermouth,1981-12-11,Paul,Male,Price,0eeb5541-d0b3-47fe-839c-a2227526b751,+15154233703,ND,62828
2,3939 Sarah Ridges,Jeffreyburgh,1950-12-17,Julie,Female,Brown,83f30300-2873-49f7-8fe4-06903a75db73,+0016517844153,NH,80694
3,5559 Walton Inlet,West Holly,1963-01-22,Sarah,Female,Dillon,3a707a9a-00b9-40f1-90bf-1a4ff74fcb61,+0019383725030x5868,AK,30233
4,4609 Reginald Plaza Apt. 985,Megantown,1943-01-06,Laura,Female,Brown,825e3f21-ca2a-442a-8d95-7f3dd64c3c6a,+8447233702,FL,40222


In [67]:
procedures.head()

Unnamed: 0,date_performed,encounter_id,patient_id,procedure_code,procedure_description,procedure_id
0,2026-04-08,a4345130-e167-45b5-9e60-75a1815d3ae0,ea3a68f6-ecf9-46aa-be97-7ecbfc7e7fcb,90686,"Influenza virus vaccine, quadrivalent",5f2e0689-60e3-47b3-a38f-d28003acd902
1,2000-06-21,946bb28d-741a-4e03-8a1f-7a8d96e75a4e,83f30300-2873-49f7-8fe4-06903a75db73,90686,"Influenza virus vaccine, quadrivalent",8f3434ea-7fa8-4fee-91e5-67117def181f
2,2024-03-08,c0736418-c4a8-4201-8c4c-0b24c923886f,5b2fa6df-d688-44cd-9c4a-9016eeb2989d,92014,"Ophthalmological examination and evaluation, c...",7036a518-b594-4409-a939-06676e0cffa2
3,2025-07-03,4dd9bf3c-3308-45a2-91cb-92f59842abff,5b2fa6df-d688-44cd-9c4a-9016eeb2989d,90686,"Influenza virus vaccine, quadrivalent",186297e1-c8a8-44b2-93ec-b3751cae31ea
4,2008-10-14,a63d2106-a907-42c8-9f4a-7124eeeed185,2b091a02-499d-498e-8f35-e5f6c24d54d3,92014,"Ophthalmological examination and evaluation, c...",06662205-7ef7-4949-917e-9067acf3c8ba


Looks pretty good to me. Let's check the columns.

In [68]:
observations.columns.tolist()

['encounter_id',
 'observation_code',
 'observation_datetime',
 'observation_description',
 'observation_id',
 'patient_id',
 'units',
 'value_numeric',
 'value_text']

In [69]:
patients.columns.tolist()

['address',
 'city',
 'date_of_birth',
 'first_name',
 'gender',
 'last_name',
 'patient_id',
 'phone_number',
 'state',
 'zip_code']

In [70]:
procedures.columns.tolist()

['date_performed',
 'encounter_id',
 'patient_id',
 'procedure_code',
 'procedure_description',
 'procedure_id']

Let's check whether any of the column names are null.

In [71]:
observations.columns.isnull()

array([False, False, False, False, False, False, False, False, False])

In [72]:
patients.columns.isnull()

array([False, False, False, False, False, False, False, False, False,
       False])

In [73]:
procedures.columns.isnull()

array([False, False, False, False, False, False])

Nope
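Note that `columns.isnull()` only catches headers that are literally missing. As a minimal sketch, a hypothetical helper `suspicious_headers` (not part of the original notebook) could also flag blank headers and the `Unnamed: N` names that pandas assigns when a CSV header field is empty. The toy frame below is made up for illustration.

```python
import pandas as pd

def suspicious_headers(df: pd.DataFrame) -> list:
    """Return headers that are null, blank, or auto-named 'Unnamed: N' by pandas."""
    flagged = []
    for col in df.columns:
        # Treat a missing header as an empty name; strip whitespace-only names
        name = "" if pd.isna(col) else str(col).strip()
        if name == "" or name.startswith("Unnamed:"):
            flagged.append(col)
    return flagged

# Toy frame (illustrative only): two good headers, two suspicious ones
toy = pd.DataFrame(columns=["patient_id", "Unnamed: 0", " ", "gender"])
flagged = suspicious_headers(toy)
print(flagged)
```

Running it on `observations`, `patients`, and `procedures` should return an empty list for each if the headers are as clean as they look.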

Now let's check whether the data types are consistent. We can write a function and run it on each DataFrame.

In [77]:
def data_types(df):
    """Report, per column, whether every non-missing value parses as a number."""
    count = 0  # number of fully numeric columns
    for col in df.columns:
        non_null = df[col].dropna()
        if non_null.empty:
            print(f'Column {col} is empty or all missing')
            continue

        # Unparseable values become NaN, so a single NaN marks the column as non-numeric or mixed
        is_numeric = pd.to_numeric(non_null, errors='coerce').notnull().all()
        if is_numeric:
            print(f"Column '{col}' contains numeric data.")
            count += 1
        else:
            print(f"Column '{col}' contains non-numeric or mixed data.")

    print(f'{count} out of {len(df.columns)} full numerical data columns')

In [None]:
data_types(observations)

Column 'encounter_id' contains non-numeric or mixed data.
Column 'observation_code' contains non-numeric or mixed data.
Column 'observation_datetime' contains non-numeric or mixed data.
Column 'observation_description' contains non-numeric or mixed data.
Column 'observation_id' contains non-numeric or mixed data.
Column 'patient_id' contains non-numeric or mixed data.
Column 'units' contains non-numeric or mixed data.
Column 'value_numeric' contains numeric data.
Column 'value_text' contains non-numeric or mixed data.
1 out of 9 full numerical data columns


In [None]:
data_types(patients)

Column 'address' contains non-numeric or mixed data.
Column 'city' contains non-numeric or mixed data.
Column 'date_of_birth' contains non-numeric or mixed data.
Column 'first_name' contains non-numeric or mixed data.
Column 'gender' contains non-numeric or mixed data.
Column 'last_name' contains non-numeric or mixed data.
Column 'patient_id' contains non-numeric or mixed data.
Column 'phone_number' contains non-numeric or mixed data.
Column 'state' contains non-numeric or mixed data.
Column 'zip_code' contains numeric data.
1 out of 10 full numerical data columns


In [None]:
data_types(procedures)

Column 'date_performed' contains non-numeric or mixed data.
Column 'encounter_id' contains non-numeric or mixed data.
Column 'patient_id' contains non-numeric or mixed data.
Column 'procedure_code' contains numeric data.
Column 'procedure_description' contains non-numeric or mixed data.
Column 'procedure_id' contains non-numeric or mixed data.
1 out of 6 full numerical data columns
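The check above puts dates, IDs, and free text in the same "non-numeric" bucket. As a minimal sketch, assuming the date columns use the ISO `YYYY-MM-DD` format seen in the previews, a hypothetical helper `infer_kind` could separate datetime columns from plain strings. The toy frame stands in for the real data.

```python
import pandas as pd

def infer_kind(series: pd.Series) -> str:
    """Classify a column as 'empty', 'numeric', 'datetime', or 'string'."""
    non_null = series.dropna()
    if non_null.empty:
        return "empty"
    if pd.to_numeric(non_null, errors="coerce").notna().all():
        return "numeric"
    # Assumption: dates are ISO formatted (YYYY-MM-DD), as in the head() previews
    as_dates = pd.to_datetime(non_null.astype(str), format="%Y-%m-%d", errors="coerce")
    if as_dates.notna().all():
        return "datetime"
    return "string"

# Toy frame (illustrative only) with one column of each kind
toy = pd.DataFrame({
    "date_performed": ["2026-04-08", "2000-06-21"],
    "procedure_code": [90686, 92014],
    "procedure_description": ["Influenza virus vaccine, quadrivalent", "Eye exam"],
})
kinds = {col: infer_kind(toy[col]) for col in toy.columns}
print(kinds)
```

A column that comes back as `string` where a date was expected would point to inconsistent date formats worth cleaning.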


The columns are consistent and the headers are clean. Let's now check for missing values.

Let's write a simple function for this.

In [81]:
def missing(df):
    missing_counts = df.isnull().sum()
    print("Missing values per column:")
    print(missing_counts)

In [82]:
missing(observations)

Missing values per column:
encounter_id                 0
observation_code             0
observation_datetime         0
observation_description      0
observation_id               0
patient_id                   0
units                       25
value_numeric               25
value_text                 861
dtype: int64


In [85]:
observations.shape

(886, 9)

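There are a lot of missing values in value_text: 861 of the 886 rows (about 97%) are empty, so only 25 rows are filled. That is the same count as the missing values in units and value_numeric, which suggests (but does not prove) that each observation carries either a numeric value or a text value.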
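The checklist also asks what percentage of each column is missing, which raw counts do not show directly. A minimal sketch, using a hypothetical helper `missing_summary` and a made-up toy frame, puts counts and percentages side by side and tests whether `value_text` is filled exactly where `value_numeric` is empty:

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing count and missing percentage for every column."""
    counts = df.isna().sum()
    percent = (df.isna().mean() * 100).round(1)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Toy frame (illustrative only): one text observation among numeric ones
toy = pd.DataFrame({
    "value_numeric": [7.6, 210.0, np.nan, 101.0],
    "value_text": [None, None, "Never smoker", None],
})
summary = missing_summary(toy)
print(summary)

# True when every row has exactly one of the two value columns filled
complementary = bool((toy["value_numeric"].isna() == toy["value_text"].notna()).all())
print(complementary)
```

If `complementary` is also true on the real `observations` frame, the gaps in `value_text` are by design rather than a data quality problem.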

In [83]:
missing(patients)

Missing values per column:
address          0
city             0
date_of_birth    0
first_name       0
gender           0
last_name        0
patient_id       0
phone_number     0
state            0
zip_code         0
dtype: int64


Nice

In [84]:
missing(procedures)

Missing values per column:
date_performed           0
encounter_id             0
patient_id               0
procedure_code           0
procedure_description    0
procedure_id             0
dtype: int64


Also nice

Based on format and consistency, we can confidently say:  
**All three datasets are structured**.
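
The schema questions about unique identifiers and relationships between tables can be checked in the same spirit. The following is only a sketch on toy frames: the helper `orphan_keys` is hypothetical, and the column names are taken from the previews above.

```python
import pandas as pd

def orphan_keys(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> set:
    """Return key values that occur in child but not in parent."""
    return set(child[key].dropna()) - set(parent[key].dropna())

# Toy frames (illustrative only): 'p9' has no matching patient
patients_toy = pd.DataFrame({"patient_id": ["p1", "p2", "p3"]})
procedures_toy = pd.DataFrame({
    "procedure_id": ["x1", "x2", "x3"],
    "patient_id": ["p1", "p1", "p9"],
})

# A primary-key candidate should contain no duplicates
ids_unique = patients_toy["patient_id"].is_unique
# A foreign key should only reference existing parent rows
orphans = orphan_keys(procedures_toy, patients_toy, "patient_id")
print(ids_unique, orphans)
```

On the real data, the same two checks would apply to `patient_id` in `procedures` and `observations` against `patients`.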