# 📊 Data Profiling and Quality Assessment

**[ ➡️ JUMP TO: Data Quality Summary and Cleaning Strategy ](#summary)**

# Import Libraries

In [15]:
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport

# Load Datasets

#### Data description

**1. df_appnt (Appointments):** Central fact table for all patient interactions. It contains the transactional detail of every visit.

**2. df_dept (Departments):** Master dimension table for organizational structure. It lists all hospital departments and their operational units.

**3. df_pat (Patients):** Master dimension table for patient details. It stores static patient demographic and registration data.

**4. df_clinic (Clinicians):** Master dimension table for staff resources. It provides details for all clinicians.

In [16]:

df_appnt = pd.read_csv('C:/Users/Pranav/Desktop/Portfolio Projects/Hospital Patient Flow/Data raw/appointments.csv')
df_dept = pd.read_csv('C:/Users/Pranav/Desktop/Portfolio Projects/Hospital Patient Flow/Data raw/departments.csv')
df_pat = pd.read_csv('C:/Users/Pranav/Desktop/Portfolio Projects/Hospital Patient Flow/Data raw/patients.csv') 
df_clinic = pd.read_csv('C:/Users/Pranav/Desktop/Portfolio Projects/Hospital Patient Flow/Data raw/clinicians.csv')

# 1. Data Profiling: Appointments

In [17]:
######################################################## Table-level profiling ########################################################################


# Appointments table

print("Table level profiling:")

# First 5 rows
print("\n# First 5 rows")
display(df_appnt.head())


#1.Table Dimensions (Rows and columns)
print("\n#1. Table Dimensions")
print("Shape (Rows, Columns):", df_appnt.shape)
print("Total Rows:", len(df_appnt))


#2. Column Names (Check headers)
print("\n#2. Column Headers:")
print(df_appnt.columns.tolist())


#3. Primary Key Integrity (appointment_id)
pk_duplicates = df_appnt['appointment_id'].duplicated().sum()
print(f"\n#3. Duplicate Appointment IDs(PK): {pk_duplicates}")
if pk_duplicates > 0:
    print("Duplicate primary keys found: ")


#4. Duplicate Rows
print("\n#4. Duplicate rows in table:", df_appnt.duplicated().sum())


#5. Initial Data Types
print("\n#5. Initial Table Info (Raw Data Types):")
df_appnt.info(memory_usage='deep')


######################################################## Column-level profiling ########################################################################


print("\nColumn level profiling: Appointments table")


# Convert time columns data type, object to datetime
print("\n#6. Time Conversion(to datetime)")
appnt_time_cols = ['appointment_datetime', 'check_in_time', 'consultation_start_time',
             'consultation_end_time', 'check_out_time']

df_appnt[appnt_time_cols] = df_appnt[appnt_time_cols].apply(pd.to_datetime, errors='coerce')

print("Updated Table Info After Time Conversion:")
df_appnt.info(memory_usage='deep')


# 6. Missing Data Summary (The "Null" Check)
print("\n#6. Missing Data Summary")

# Total missing values in the whole table
print("\nTotal Missing Values (Cells):", df_appnt.isnull().sum().sum())

# Null count and Null % in each column
missing_summary = df_appnt.isnull().sum()  
missing_pct = (df_appnt.isnull().sum() / len(df_appnt)) * 100
print(pd.DataFrame({'Missing Count': missing_summary, 'Missing %': missing_pct}))


#7. High-Level Statistics
print("\n#7. Descriptive Statistics:")
display(df_appnt.describe(include='all').T)


# 8. Categorical column check

print("\n#8. Categorical column values check")

print("\nAppointment type Value Check:")
display(df_appnt['appointment_type'].value_counts(dropna=False))

print("\nAppointment Status Value Check:")
# value_counts() exposes misspelled categories such as "Cancleld"
display(df_appnt['appointment_status'].value_counts(dropna=False))

print("\nEncounter type Value Check:")
display(df_appnt['encounter_type'].value_counts(dropna=False))


Table level profiling:

# First 5 rows


Unnamed: 0,appointment_id,patient_id,clinician_id,appointment_type,appointment_datetime,check_in_time,consultation_start_time,consultation_end_time,check_out_time,appointment_status,no_show_flag,room_or_area,department_id,encounter_type
0,500001,30626,20162,Emergency,2025-04-20 08:35:51.381719,2025-04-20 08:40:51.381719,2025-04-20 08:57:51.381719,2025-04-20 09:08:51.381719,2025-04-20 09:21:51.381719,Completed,False,Room-225,1,OPD
1,500002,31635,20296,Emergency,2025-02-08 18:08:57.923913,2025-02-08 18:00:57.923913,2025-02-08 18:18:57.923913,2025-02-08 18:34:57.923913,2025-02-08 18:41:57.923913,Completed,False,Room-402,6,ED
2,500003,21618,20115,Procedure,2025-07-15 21:41:44.657416,2025-07-15 21:39:44.657416,2025-07-15 21:51:44.657416,2025-07-15 22:02:44.657416,2025-07-15 22:09:44.657416,Completed,False,Room-211,8,IPD
3,500004,28946,20297,Emergency,2024-12-15 08:01:53.339738,2024-12-15 07:53:53.339738,2024-12-15 08:35:53.339738,2024-12-15 08:46:53.339738,2024-12-15 08:57:53.339738,Completed,False,Room-466,6,ED
4,500005,23377,20009,Follow-up,2025-09-18 10:33:01.055795,,,,,Cancelled,False,Room-432,2,OPD



#1. Table Dimensions
Shape (Rows, Columns): (120000, 14)
Total Rows: 120000

#2. Column Headers:
['appointment_id', 'patient_id', 'clinician_id', 'appointment_type', 'appointment_datetime', 'check_in_time', 'consultation_start_time', 'consultation_end_time', 'check_out_time', 'appointment_status', 'no_show_flag', 'room_or_area', 'department_id', 'encounter_type']

#3. Duplicate Appointment IDs(PK): 0

#4. Duplicate rows in table: 0

#5. Initial Table Info (Raw Data Types):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120000 entries, 0 to 119999
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   appointment_id           120000 non-null  int64 
 1   patient_id               120000 non-null  int64 
 2   clinician_id             120000 non-null  int64 
 3   appointment_type         120000 non-null  object
 4   appointment_datetime     120000 non-null  object
 5   check_in_time            90

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
appointment_id,120000.0,,,,560000.5,500001.0,530000.75,560000.5,590000.25,620000.0,34641.160489
patient_id,120000.0,,,,22531.492625,10001.0,16283.75,22518.0,28798.0,35000.0,7216.895461
clinician_id,120000.0,,,,20151.03835,20001.0,20076.0,20151.0,20226.0,20300.0,86.500932
appointment_type,120000.0,4.0,Follow-up,30094.0,,,,,,,
appointment_datetime,120000.0,,,,2024-12-09 16:10:22.996804096,2023-12-10 07:34:35.055826,2024-06-08 16:44:30.070737408,2024-12-10 07:15:05.557103104,2025-06-12 01:01:56.934441472,2025-12-09 18:56:45.918334,
check_in_time,90005.0,,,,2024-12-09 13:05:12.285185024,2023-12-10 07:33:35.055826,2024-06-08 04:45:10.119846912,2024-12-09 20:41:16.110216960,2025-06-12 09:10:58.942427904,2025-12-09 18:49:00.943553,
consultation_start_time,90005.0,,,,2024-12-09 13:40:13.914427648,2023-12-10 08:00:35.055826,2024-06-08 05:11:19.656241920,2024-12-09 21:32:16.110216960,2025-06-12 09:53:44.096625920,2025-12-09 19:31:00.943553,
consultation_end_time,90005.0,,,,2024-12-09 14:07:42.248186880,2023-12-10 08:34:35.055826,2024-06-08 05:43:40.174373120,2024-12-09 22:16:16.110216960,2025-06-12 10:21:52.680775936,2025-12-09 19:56:00.943553,
check_out_time,90005.0,,,,2024-12-09 14:20:11.663219712,2023-12-10 08:47:35.055826,2024-06-08 05:50:40.174373120,2024-12-09 22:33:55.932089088,2025-06-12 10:27:52.680775936,2025-12-09 20:08:00.943553,
appointment_status,120000.0,4.0,Completed,90005.0,,,,,,,



#8. Categorical column values check

Appointment type Value Check:


appointment_type
Follow-up    30094
Emergency    30086
Procedure    29960
New Visit    29860
Name: count, dtype: int64


Appointment Status Value Check:


appointment_status
Completed    90005
Cancelled    18097
No-show      11897
Cancleld         1
Name: count, dtype: int64


Encounter type Value Check:


encounter_type
OPD    81672
IPD    23536
ED     14792
Name: count, dtype: int64

## Data Profiling Report: Appointments

In [4]:
#Profiling Report 
appnt_profile = ProfileReport(df_appnt, title="Appointments Data Profiling Report")
appnt_profile.to_notebook_iframe()


## Business Logic Checks: Appointments

#### 1. Timeline Integrity:
Validates the patient journey sequence: **Check-in ≤ Consult Start < Consult End ≤ Check-out**

**Impact:** This integrity check is essential for ensuring the accurate calculation of core **operational KPIs**, including **Pre-Consult Wait Time**, **Consultation Duration**, and **Total Patient Throughput**.
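
To make the link between the timestamps and the KPIs concrete, here is a minimal sketch of how the three metrics could be derived once the timeline is trusted. The toy rows, the `minutes_between` helper, and the KPI column names are illustrative assumptions, not part of the notebook, and treating check-in to check-out time as the throughput measure is one possible definition.

```python
import pandas as pd

# Toy rows (hypothetical values) that reuse the notebook's timestamp column names.
appts = pd.DataFrame({
    "appointment_id": [1, 2, 3],
    "check_in_time": pd.to_datetime(["2025-01-01 09:00", "2025-01-01 10:00", None]),
    "consultation_start_time": pd.to_datetime(["2025-01-01 09:20", "2025-01-01 10:05", None]),
    "consultation_end_time": pd.to_datetime(["2025-01-01 09:50", "2025-01-01 10:35", None]),
    "check_out_time": pd.to_datetime(["2025-01-01 10:00", "2025-01-01 10:40", None]),
})


def minutes_between(start: pd.Series, end: pd.Series) -> pd.Series:
    """Elapsed minutes from start to end; missing timestamps stay NaN."""
    return (end - start).dt.total_seconds() / 60


# Hypothetical KPI column names, one per metric named in the text.
kpis = pd.DataFrame({
    "appointment_id": appts["appointment_id"],
    "wait_mins": minutes_between(appts["check_in_time"], appts["consultation_start_time"]),
    "consult_mins": minutes_between(appts["consultation_start_time"], appts["consultation_end_time"]),
    "total_visit_mins": minutes_between(appts["check_in_time"], appts["check_out_time"]),
})

# mean() skips NaN, so the row without timestamps does not affect the averages.
summary = kpis[["wait_mins", "consult_mins", "total_visit_mins"]].mean()
print(kpis)
print(summary)
```

A single out-of-order timestamp would produce a negative value in one of these columns, which is why the sequence is validated before any averages are reported.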

In [5]:
print("Timeline Integrity: Appointments table")

# Logic1: Consultation Start Time MUST be >= Check-in Time
checkin_start_violation = df_appnt[
    (df_appnt['consultation_start_time'] < df_appnt['check_in_time'])
]

print("\n--- 1. Check-in/Consult Start Violation (Start < Check-in) ---")
print(f"Records where Consult Start < Check-in: {len(checkin_start_violation)}")
if len(checkin_start_violation) > 0:
    print("Sample of violation records:")
    print(checkin_start_violation[['appointment_id', 'check_in_time', 'consultation_start_time']].head())


# Logic2: Consultation End Time MUST be > Consultation Start Time
consult_duration_violation = df_appnt[
    (df_appnt['consultation_end_time'] <= df_appnt['consultation_start_time'])
]
print("\n--- 2. Consultation Duration Violation (End <= Start) ---")
print(f"Records where Consult End <= Consult Start: {len(consult_duration_violation)}")
if len(consult_duration_violation) > 0:
    print("Sample of violation records:")
    print(consult_duration_violation[['appointment_id', 'consultation_start_time', 'consultation_end_time']].head())


# Logic3: Check-out Time MUST be >= Consultation End Time
checkout_violation = df_appnt[
    (df_appnt['check_out_time'] < df_appnt['consultation_end_time'])
]

print("\n--- 3. Patient Departure Violation (Check-out < Consult End) ---")
print(f"Records where Check-out < Consult End: {len(checkout_violation)}")
if len(checkout_violation) > 0:
    print("Sample of violation records:")
    print(checkout_violation[['appointment_id', 'consultation_end_time', 'check_out_time']].head())


Timeline Integrity: Appointments table

--- 1. Check-in/Consult Start Violation (Start < Check-in) ---
Records where Consult Start < Check-in: 0

--- 2. Consultation Duration Violation (End <= Start) ---
Records where Consult End <= Consult Start: 0

--- 3. Patient Departure Violation (Check-out < Consult End) ---
Records where Check-out < Consult End: 0


#### 2. Consultation Duration Outliers:
Validates the plausibility of recorded service duration by flagging appointments exceeding the established maximum clinical boundary (120 minutes).

**Impact:** Identifying these outliers prevents the artificial inflation of Clinician Utilization rates and **ensures that average consultation duration metrics are not corrupted by logging errors**.

In [6]:

print("Consultation Duration Outliers: Appointments table")

# Logic: Maximum acceptable duration for a single consultation (2 hours)
MAX_CONSULT_MINS = 120

# 1. Calculate the consultation duration in minutes
df_appnt['consultation_duration_mins'] = (
    df_appnt['consultation_end_time'] - df_appnt['consultation_start_time']
).dt.total_seconds() / 60

# 2. Identify records exceeding the threshold
outlier_duration_records = df_appnt[
    df_appnt['consultation_duration_mins'] > MAX_CONSULT_MINS
]
print(f"\nRecords with Duration > {MAX_CONSULT_MINS} minutes: {len(outlier_duration_records)}")

if len(outlier_duration_records) > 0:
    print("Sample of violation records (Largest durations first):")
    display(outlier_duration_records[[
        'appointment_id',
        'consultation_duration_mins'
    ]].sort_values(by='consultation_duration_mins', ascending=False).head())

Consultation Duration Outliers: Appointments table

Records with Duration > 120 minutes: 0


#### 3. No-Show Flag vs. Status:

Identifies records where `no_show_flag` is `True` but `appointment_status` is anything other than 'No-show' (for example 'Completed' or 'Cancelled'). The check verifies the reliability of the automated logging system behind revenue and capacity metrics.

**Impact:** Such conflicts corrupt the No-Show Rate and misclassify completed or cancelled appointments, leading to an incorrect calculation of lost revenue and capacity.

In [7]:
print("No-Show Flag vs. Status: Appointments table")

# Logic: Find records where the no_show_flag is TRUE, but the status is NOT 'No-show'.

conflicting_records = df_appnt[
    (df_appnt['no_show_flag'] == True) &
    (df_appnt['appointment_status'] != 'No-show')
].copy()

conflict_count = len(conflicting_records)
print(f"\nConflicting records Found: {conflict_count} record(s)")

# Display the conflicting record(s)
if conflict_count > 0:
    display(conflicting_records[[
        'appointment_id',
        'appointment_status',
        'no_show_flag',
        'check_in_time',
        'consultation_start_time' 
    ]])

No-Show Flag vs. Status: Appointments table

Conflicting records Found: 1 record(s)


| | appointment_id | appointment_status | no_show_flag | check_in_time | consultation_start_time |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 14 | 500015 | Completed | True | 2024-09-26 07:11:38.784186 | 2024-09-26 07:39:38.784186 |


#### 4. Foreign Key Integrity Check

This check validates the referential integrity of the appointments fact table by ensuring that all linking identifiers **(Clinician, Patient, Department)** successfully match a primary key in their respective master dimension tables.

**Impact:** Failures in this check result in **orphaned records** that cannot be correctly attributed in analytical roll-ups.

In [8]:

# --- Foreign Key Integrity Profiling (df_appnt) ---
print("--- Foreign Key Integrity: Appointments ---\n")

# --- 1. Appointments -> Clinicians Check ---
# Master Column: clinician_id (from df_clinic) & FK Column: clinician_id (from df_appnt)

# Unique Clinician IDs
valid_clinician_ids = df_clinic['clinician_id'].astype(str).unique()

# Appointment records where the clinician_id is NOT in valid_clinicians_ids
orphaned_clinician_records = df_appnt[
    df_appnt['clinician_id'].notna() &
    ~df_appnt['clinician_id'].astype(str).isin(valid_clinician_ids)
]
print(f"1. Orphaned Clinician Records Found: {len(orphaned_clinician_records)}")

# --- 2. Appointments -> Patient Check ---
# Master Column: patient_id (from df_pat) & FK Column: patient_id (from df_appnt)

# Unique Patients IDs
valid_patient_ids = df_pat['patient_id'].astype(str).unique() 

# Appointment records where the patient_id is NOT in the valid_patient_ids
orphaned_patient_records = df_appnt[
    df_appnt['patient_id'].notna() &
    ~df_appnt['patient_id'].astype(str).isin(valid_patient_ids)
]
print(f"2. Orphaned Patient Records Found: {len(orphaned_patient_records)}")

# --- 3. Appointments -> Departments Check ---
# Master Column: department_id (from df_dept) &  FK Column: department_id (from df_appnt)

# Unique Departments IDs
valid_department_ids = df_dept['department_id'].astype(str).unique()

# Appointment records where the department_id is NOT in the valid_department_ids
orphaned_dept_records = df_appnt[
    df_appnt['department_id'].notna() &
    ~df_appnt['department_id'].astype(str).isin(valid_department_ids)
]
print(f"3. Orphaned Department Records Found: {len(orphaned_dept_records)}")

# --- Orphaned records summary
total_orphans = (
    len(orphaned_clinician_records) + 
    len(orphaned_patient_records) + 
    len(orphaned_dept_records)
)

print("\n--- Summary ---")
print(f"Total Orphaned Records (Sum of above): {total_orphans}")

--- Foreign Key Integrity: Appointments ---

1. Orphaned Clinician Records Found: 0
2. Orphaned Patient Records Found: 0
3. Orphaned Department Records Found: 393

--- Summary ---
Total Orphaned Records (Sum of above): 393


### Further Investigation of the 393 Orphaned Records

The following metrics confirm the analytical viability of imputing all orphaned records (source: **Invalid Department ID 999**) to the confirmed target, **Department ID 1 (General Practice)**.

| Metric | Value | Justification |
| :--- | :--- | :--- |
| **Source of Invalid Data** | **Department ID 999** | The orphaned Foreign Key being corrected. |
| **Target Department** | **ID 1 (General Practice)** | Based on cross-referenced evidence (Clinician GP designation $\rightarrow$ General Practice department). |
| **Current Volume (Dept 1)** | 11,703 appointments | Volume before correction. |
| **Records to Impute** | 393 appointments | Orphaned appointments recovered. |
| **New Volume (After Imputation)** | 12,096 appointments | Calculation: $11,703 + 393 = 12,096$. |
| **Percentage Increase (Skew)** | $3.36\%$ | Calculation: $\frac{393}{11,703} \times 100$. |


#### Conclusion: 

**Low Skew:** The resulting $3.36\%$ increase in volume for Department ID 1 is within the analytical tolerance adopted for this project ($5\%$ or less). This is not an artificial skew; it is a data recovery effort.

**Evidence 1 (Direct):** The clinician's designation (GP) matches the name of Department 1 (General Practice).

**Evidence 2 (High Confidence):** Because the clinician is a GP, the corrected foreign key should point to the General Practice department.

**Evidence 3 (Assumption Check):** It is logical to assume that the $393$ associated appointments belong to the same category as the only clinician linked to the error (`clinician_id = 20010`, clinician name: Katie Pierce).
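
The notebook does not show the code behind this investigation, so the following is only a sketch of how the breakdown and the skew figure could be reproduced. The toy frames and the variable names (`orphan_breakdown`, `target_dept`, `pct_increase`) are assumptions for illustration; only the final arithmetic uses the figures reported in the table above.

```python
import pandas as pd

# Toy master and fact tables (hypothetical rows, notebook column names).
departments = pd.DataFrame({
    "department_id": [1, 2],
    "department_name": ["General Practice", "Cardiology"],
})
appts = pd.DataFrame({
    "appointment_id": range(1, 9),
    "clinician_id": [20010, 20010, 20001, 20001, 20001, 20002, 20002, 20001],
    "department_id": [999, 999, 1, 1, 1, 2, 2, 1],
})

# Orphans: department_id values with no match in the master table.
orphans = appts[~appts["department_id"].isin(departments["department_id"])]

# Which invalid IDs occur, and which clinicians are attached to them?
orphan_breakdown = (
    orphans.groupby(["department_id", "clinician_id"])
    .size()
    .reset_index(name="appointments")
)
print(orphan_breakdown)

# Relative volume increase if the orphans are moved to the target department.
target_dept = 1
current_volume = int((appts["department_id"] == target_dept).sum())
pct_increase = len(orphans) / current_volume * 100
print(f"Toy data increase: {pct_increase:.2f}%")

# The same formula applied to the figures reported in the table above.
reported_pct = round(393 / 11_703 * 100, 2)
print(f"Reported increase: {reported_pct}%")
```

If the breakdown returned more than one clinician or more than one invalid ID, the single-target imputation argued for here would need to be revisited.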

## 🔎 Findings: Appointments
**(Analysis of 120,000 Records)**

### 1. Business Logic Violations
* **Status vs. Flag Conflict (1 Record):**
    * **Finding:** 1 record (ID: `500015`) is marked as `appointment_status = 'Completed'` but also has `no_show_flag = TRUE`.
    * **Impact:** This contradiction corrupts both *No-Show Rate* (revenue loss metric) and *Completion Rate* (productivity metric).
    * **Action:** This record will be corrected to `no_show_flag = FALSE` based on the presence of valid timestamps.
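
A minimal sketch of the correction described above, assuming the fix is applied only where the status is 'Completed' and both early timestamps are present. The toy rows and the `conflict` mask are illustrative; the actual cleaning step is not shown in this section.

```python
import pandas as pd

# Toy rows (hypothetical values): one conflicting record, one true no-show, one clean record.
appts = pd.DataFrame({
    "appointment_id": [1, 2, 3],
    "appointment_status": ["Completed", "No-show", "Completed"],
    "no_show_flag": [True, True, False],
    "check_in_time": pd.to_datetime(["2024-09-26 07:11", None, "2024-09-26 08:00"]),
    "consultation_start_time": pd.to_datetime(["2024-09-26 07:39", None, "2024-09-26 08:10"]),
})

# A flagged row counts as a conflict only if the visit demonstrably took place.
conflict = (
    appts["no_show_flag"]
    & (appts["appointment_status"] == "Completed")
    & appts["check_in_time"].notna()
    & appts["consultation_start_time"].notna()
)

# Reset the flag on conflicting rows; genuine no-shows are left untouched.
appts.loc[conflict, "no_show_flag"] = False
print(appts[["appointment_id", "appointment_status", "no_show_flag"]])
```

Requiring the timestamps keeps the rule conservative: a 'Completed' row without any recorded check-in would be left for manual review rather than corrected automatically.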

### 2. Foreign key Integrity Failures  
* **Orphaned Department Keys (393 Records):**
    * **Finding:** **393 appointments** reference a `department_id` that does not exist in the Department Master table.
    * **Impact:** These appointments would be excluded from service line reporting, leading to significant under-reporting in Departmental Performance.
    * **Initial Action (Not Recommended):** These IDs would typically be mapped to a placeholder ID (`999 - 'Unknown Department'`) to preserve data volume.
    * **Business Context:** Cross-referencing the Clinicians table revealed that the invalid ID `999` was associated with a single clinician (`clinician_id`: 20010) whose designation is **'GP'**. Department ID **1** is named **'General Practice'**.
    * **Refined Action (Recommended):** Based on this logical link, the $393$ orphaned IDs will be imputed to **Department ID 1 (General Practice)**. This high-confidence imputation strategy maximizes data accuracy and minimizes analytical skew ($\approx 3.36\%$ increase in volume).
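
The recommended imputation could look like the sketch below. The constants and the `department_id_imputed` audit column are assumptions added for illustration; keeping such a flag is a design choice that lets the remapped rows be excluded or reviewed later.

```python
import pandas as pd

# Toy appointments (hypothetical rows); 999 is the invalid ID discussed above.
appts = pd.DataFrame({
    "appointment_id": [1, 2, 3, 4],
    "department_id": [999, 1, 6, 999],
})

INVALID_DEPT_ID = 999  # orphaned foreign key
TARGET_DEPT_ID = 1     # General Practice

# Record which rows are about to change before overwriting the key.
appts["department_id_imputed"] = appts["department_id"] == INVALID_DEPT_ID

# Remap only the invalid ID; every other department_id is left as-is.
appts["department_id"] = appts["department_id"].replace({INVALID_DEPT_ID: TARGET_DEPT_ID})
print(appts)
```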

...
### 3. Data Completeness & Quality
* **High Missing Values in Timestamps (~25%):**
    * **Finding:** `check_in_time`, `consultation_start_time`, `consultation_end_time`, and `check_out_time` are missing for **29,995 rows** (approx. 25%).
    * **Context:** These nulls align perfectly with non-completed statuses (`Cancelled`, `No-show`), which is expected behavior.
    * **Action:** **No cleaning required** for these specific nulls, but analysis filters must explicitly exclude these rows when calculating *Wait Time* and *Length of Stay*.
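
One way to confirm that the nulls line up with non-completed statuses, and to build the filtered frame used for duration metrics, is sketched below. The toy data and the names `null_by_status` and `completed` are illustrative assumptions.

```python
import pandas as pd

time_cols = [
    "check_in_time",
    "consultation_start_time",
    "consultation_end_time",
    "check_out_time",
]

# Toy rows (hypothetical values): timestamps exist only for completed visits.
appts = pd.DataFrame({
    "appointment_status": ["Completed", "Cancelled", "No-show", "Completed"],
})
stamps = pd.to_datetime(["2025-01-01 09:00", None, None, "2025-01-02 09:00"])
for offset, col in enumerate(time_cols):
    # Shift each stage by 10 minutes; NaT stays NaT.
    appts[col] = stamps + pd.Timedelta(minutes=10 * offset)

# Count missing timestamps per status; 'Completed' should show zeros everywhere.
null_by_status = appts[time_cols].isna().groupby(appts["appointment_status"]).sum()
print(null_by_status)

# Duration metrics should be computed on completed visits only.
completed = appts[appts["appointment_status"] == "Completed"]
print(len(completed), "completed rows retained")
```

Any non-zero count in the 'Completed' row of `null_by_status` would indicate a genuine completeness problem rather than expected behavior.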
  
### 4. Categorical Standardization
* **Finding:** The `appointment_status` column contains the misspelled value **"Cancleld"** (1 record).
* **Action:** Standardize value to **"Cancelled"**.
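
A short sketch of this standardization follows, together with a guard that reports any value outside the expected set. The mapping name `STATUS_FIXES` and the `allowed` set are assumptions based on the three valid statuses seen in the profiling output.

```python
import pandas as pd

# Toy status column (hypothetical rows) containing the misspelling found above.
status = pd.Series(
    ["Completed", "Cancelled", "Cancleld", "No-show"],
    name="appointment_status",
)

# Explicit mapping keeps the fix auditable and easy to extend.
STATUS_FIXES = {"Cancleld": "Cancelled"}
clean = status.replace(STATUS_FIXES)

# Guard: list any category that is still outside the expected domain.
allowed = {"Completed", "Cancelled", "No-show"}
unexpected = sorted(set(clean.dropna()) - allowed)
print(clean.value_counts())
print("Unexpected values:", unexpected)
```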


# 2. Data Profiling: Departments

In [9]:
# Departments table

print("Data profiling: Departments table")


print("\n# Departments table")
display(df_dept)


#1.Table Dimensions (Rows and columns)
print("\n#1. Table Dimensions")
print("Shape (Rows, Columns):", df_dept.shape)
print("Total Rows:", len(df_dept))


#2. Column Names (Check headers)
print("\n#2. Column Headers:")
print(df_dept.columns.tolist())


#3. Primary Key Integrity
# Verify if 'department_id' is truly unique
pk_duplicates = df_dept['department_id'].duplicated().sum()
print(f"\n#3. Duplicate Department IDs (PK): {pk_duplicates}")
if pk_duplicates > 0:
    print("Duplicate primary keys found:")

#4. Duplicate Rows
print("\n#4. Duplicate rows in table:", df_dept.duplicated().sum())


#5. Initial Data Types & Memory (Is the computer reading it right?)
# This is the raw status: dtypes exactly as read from the CSV.
print("\n#5. Initial Table Info (Raw Data Types):")
df_dept.info(memory_usage='deep')

# 6. Categorical column check

print("\n#6. Categorical column values check")

print("\nDepartment Group Value Check:")
display(df_dept['department_group'].value_counts(dropna=False))


Data profiling: Departments table

# Departments table


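| | department_id | department_name | department_group |
| :--- | :--- | :--- | :--- |
| 0 | 1 | General Practice | OPD |
| 1 | 2 | Cardiologgy | OPD |
| 2 | 3 | Pediatrics | OPD |
| 3 | 4 | Orthopedics | OPD |
| 4 | 5 | Dermatology | OPD |
| 5 | 6 | Emergency | ED |
| 6 | 7 | ICU | IPD |
| 7 | 8 | Surgery | IPD |
| 8 | 9 | Neurology | OPD |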



#1. Table Dimensions
Shape (Rows, Columns): (9, 3)
Total Rows: 9

#2. Column Headers:
['department_id', 'department_name', 'department_group']

#3. Duplicate Department IDs (PK): 0

#4. Duplicate rows in table: 0

#5. Initial Table Info (Raw Data Types):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   department_id     9 non-null      int64 
 1   department_name   9 non-null      object
 2   department_group  9 non-null      object
dtypes: int64(1), object(2)
memory usage: 1.2 KB

#6. Categorical column values check

Department Group Value Check:


department_group
OPD    6
IPD    2
ED     1
Name: count, dtype: int64

## 🔎 Findings: Departments
**(Analysis of 9 Records)**
### 1. Domain Integrity Violation (Name Misspelling)
* **Finding:** The `department_name` column contains a spelling error: **'Cardiologgy'** (Department ID 2). 
* **Impact:** This single error prevents accurate cross-system matching and will cause appointments related to Cardiology to be **misclassified or missed** if filtering relies on exact name matching. This breaks the **Domain Constraint** for department naming.
* **Action:** This is a simple standardization fix: Correct the name from 'Cardiologgy' to **'Cardiology'** in the cleaning step.
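
The impact of the misspelling on exact-name filtering, and the effect of the fix, can be illustrated with a small sketch. The toy table and the `before` / `after` names are assumptions for demonstration only.

```python
import pandas as pd

# Toy department master (subset of the rows shown above).
departments = pd.DataFrame({
    "department_id": [1, 2, 3],
    "department_name": ["General Practice", "Cardiologgy", "Pediatrics"],
})

# Exact-name filter before the fix: the misspelled row is silently missed.
before = departments[departments["department_name"] == "Cardiology"]

# Apply the correction, then run the same filter again.
departments["department_name"] = departments["department_name"].replace(
    {"Cardiologgy": "Cardiology"}
)
after = departments[departments["department_name"] == "Cardiology"]

print(f"Matches before fix: {len(before)}, after fix: {len(after)}")
```

Joining on `department_id` rather than on the name avoids this class of problem altogether, but the name still needs correcting for reports that display it.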


# 3. Data Profiling: Patients

In [10]:
######################################################## Table-level profiling ########################################################################

# Patients

print("Table level profiling: Patients table")


print("\n#First & Last 5 rows")
display(df_pat)


#1.Table Dimensions (Rows and columns)
print("\n#1. Table Dimensions")
print("Shape (Rows, Columns):", df_pat.shape)
print("Total Rows:", len(df_pat))


#2. Column Names (Check headers)
print("\n#2. Column Headers:")
print(df_pat.columns.tolist())


#3. Primary Key Integrity
# Verify if 'patient_id' is truly unique
pk_duplicates = df_pat['patient_id'].duplicated().sum()
print(f"\n#3. Duplicate Patient IDs (PK): {pk_duplicates}")
if pk_duplicates > 0:
    print("Duplicate primary keys found:")

#4. Duplicate Rows
print("\n#4. Duplicate rows in table:", df_pat.duplicated().sum())


#5. Initial Data Types & Memory (Is the computer reading it right?)
# This is the raw status, where time columns might be 'object' (string).
print("\n#5. Initial Table Info (Raw Data Types):")
df_pat.info(memory_usage='deep')



######################################################## Column-level profiling ########################################################################



print("Column level profiling: Patients table")

#Convert time columns data type, object to datetime
print("\n#6. Time Conversion(to datetime)")
pat_time_cols = ['date_of_birth', 'registration_date']

df_pat[pat_time_cols] = df_pat[pat_time_cols].apply(pd.to_datetime, errors='coerce')

print("Updated Table Info After Time Conversion Attempt:")
df_pat.info(memory_usage='deep')


# 6. Missing Data Summary (The "Null" Check)

print("\n#6. Missing Data Summary")

# Total missing values in the whole table
print("\nTotal Missing Values (Cells):", df_pat.isnull().sum().sum())

missing_summary = df_pat.isnull().sum() 
missing_pct = (df_pat.isnull().sum() / len(df_pat)) * 100
print(pd.DataFrame({'Missing Count': missing_summary, 'Missing %': missing_pct}))


#7. High-Level Statistics
print("\n#7. Descriptive Statistics:")
display(df_pat.describe(include='all').T)


# 8. Categorical column check

print("\n#8. Categorical column values check")

print("\nGender Value Check:")
display(df_pat['gender'].value_counts(dropna=False))

print("\nNationality Value Check:")
display(df_pat['nationality'].value_counts(dropna=False))

print("\nInsurance provider Value Check:")
display(df_pat['insurance_provider'].value_counts(dropna=False))


Table level profiling: Patients table

#First & Last 5 rows


Unnamed: 0,patient_id,date_of_birth,gender,nationality,insurance_provider,registration_date,patient_name,patient_email,patient_phone
0,10001,1993-02-20,Male,Pakistan,MetLife,2021-03-07,Mary Lopez,kevin42@example.org,+1-538-801-0428
1,10002,1937-03-24,Female,UAE,Thiqa,2025-03-16,Claire Vance,jenniferolson@example.com,001-893-656-5741
2,10003,1959-12-24,Male,India,Self-Pay,2024-08-23,William Lee,rodriguezrebecca@example.org,+1-853-266-2351x454
3,10004,1955-04-07,Male,Egypt,Daman,2024-12-06,Crystal Owens,ysanders@example.com,(576)366-2410x746
4,10005,2001-12-20,Male,India,AXA,2022-10-24,Sydney Fisher,gillespiebrittany@example.com,809.363.4398x183
...,...,...,...,...,...,...,...,...,...
24995,34996,2005-08-27,Female,Philippines,Daman,2021-07-15,Daniel Walls,vrivera@example.com,554-990-0692x543
24996,34997,1940-05-10,Female,India,Oman Insurance,2022-01-31,Dr. Mason Russell,thompsonjanice@example.org,527-766-4133x783
24997,34998,1988-06-28,Male,India,AXA,2023-05-05,Ashley Lopez,hansonjudy@example.org,001-547-230-2071x05253
24998,34999,2006-09-24,Female,Philippines,Self-Pay,2022-11-29,Andrew Barnes,kevin15@example.org,6669596093



#1. Table Dimensions
Shape (Rows, Columns): (25000, 9)
Total Rows: 25000

#2. Column Headers:
['patient_id', 'date_of_birth', 'gender', 'nationality', 'insurance_provider', 'registration_date', 'patient_name', 'patient_email', 'patient_phone']

#3. Duplicate Patient IDs (PK): 0

#4. Duplicate rows in table: 0

#5. Initial Table Info (Raw Data Types):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   patient_id          25000 non-null  int64 
 1   date_of_birth       25000 non-null  object
 2   gender              25000 non-null  object
 3   nationality         25000 non-null  object
 4   insurance_provider  25000 non-null  object
 5   registration_date   25000 non-null  object
 6   patient_name        25000 non-null  object
 7   patient_email       25000 non-null  object
 8   patient_phone       25000 non-null  object
dtypes:

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
patient_id,25000.0,,,,22500.5,10001.0,16250.75,22500.5,28750.25,35000.0,7217.022701
date_of_birth,25000.0,,,,1980-05-20 09:51:22.752000,1934-12-15 00:00:00,1957-08-11 18:00:00,1980-06-02 12:00:00,2003-03-24 00:00:00,2025-12-10 00:00:00,
gender,25000.0,4.0,Female,12501.0,,,,,,,
nationality,25000.0,9.0,India,7479.0,,,,,,,
insurance_provider,25000.0,6.0,Daman,4223.0,,,,,,,
registration_date,25000.0,,,,2023-06-14 17:45:04.895999744,2020-12-13 00:00:00,2022-03-20 00:00:00,2023-06-09 12:00:00,2024-09-11 00:00:00,2025-12-13 00:00:00,
patient_name,25000.0,21899.0,Michael Jones,12.0,,,,,,,
patient_email,25000.0,23772.0,fsmith@example.org,6.0,,,,,,,
patient_phone,25000.0,25000.0,(524)206-8264x0498,1.0,,,,,,,



#8. Categorical column values check

Gender Value Check:


gender
Female    12501
Male      12497
M             1
FEMALE        1
Name: count, dtype: int64


Nationality Value Check:


nationality
India          7479
Philippines    3732
UAE            3713
Egypt          2577
Pakistan       2473
UK             1290
Lebanon        1271
Jordan         1235
USA            1230
Name: count, dtype: int64


Insurance provider Value Check:


insurance_provider
Daman             4223
Thiqa             4204
Self-Pay          4171
Oman Insurance    4163
MetLife           4137
AXA               4102
Name: count, dtype: int64

## Data Profiling Report: Patients

In [11]:
#Profiling Report 
pat_profile = ProfileReport(df_pat, title="Patients Data Profiling Report")
pat_profile.to_notebook_iframe()


## Business Logic Checks: Patients

#### Date Logic Check:

Validates the fundamental business rule that **a patient's registration date cannot logically precede their date of birth.**

**Impact:** A violation signifies **a critical data entry error** that corrupts the chronological sequence necessary for accurate patient analysis.

In [12]:

# Logic: Identify records where registration occurred before the patient's birth
date_logic_violations = df_pat[
    df_pat['registration_date'] < df_pat['date_of_birth']
]

print("\n--- Date Logic Check (Registration Before Birth) ---")
print(f"\nRecords where Registration Date < Date of Birth: {len(date_logic_violations)}")

if len(date_logic_violations) > 0:
    print("\nSample of Violation Records:")
    display(date_logic_violations[[
        'patient_id',
        'date_of_birth',
        'registration_date'
    ]].head())


--- Date Logic Check (Registration Before Birth) ---

Records where Registration Date < Date of Birth: 673

Sample of Violation Records:


| | patient_id | date_of_birth | registration_date |
| :--- | :--- | :--- | :--- |
| 60 | 10061 | 2024-12-30 | 2023-07-04 |
| 86 | 10087 | 2025-09-23 | 2023-06-07 |
| 96 | 10097 | 2025-08-07 | 2021-03-23 |
| 182 | 10183 | 2024-07-05 | 2021-05-29 |
| 198 | 10199 | 2025-11-20 | 2022-10-19 |


## 🔎 Findings: Patients
**(Analysis of 25,000 Records)**

### 1. Business Logic Violation
* **Check:** Date Logic (Registration Before Birth)
* **Finding:** **673 records** (approx. 2.7%) violate the fundamental temporal integrity rule where the `registration_date` is recorded as being before the `date_of_birth`.
* **Impact:** This signifies a critical data entry error. These records can cause downstream failures in time-series analysis (e.g., calculation of age at registration) and invalidate all age-based cohort reporting.
* **Action:** In the cleaning phase, the conflicting `date_of_birth` for these 673 records will be invalidated (set to `NaT`) and the records will be flagged with a `date_logic_error_flag` to isolate them from age-based segmentation.

### 2. Categorical Standardization (Gender)
* **Finding:** The `gender` column contains inconsistent formatting values: **'M'** and **'FEMALE'** (ALL CAPS) alongside the standard 'Male' and 'Female'.
* **Impact:** Inconsistent categories split the data during aggregation (e.g., 'Male' and 'M' would be counted as two different genders), corrupting demographic reporting.
* **Action:** Map these variants to the standard values:
    * `'M'` $\rightarrow$ `'Male'`
    * `'FEMALE'` $\rightarrow$ `'Female'`
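
One possible way to apply this mapping is sketched below. It assumes the two variants listed above are the only ones present; the `GENDER_MAP` constant and the `standardize_gender` helper are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Only the variants identified in profiling; standard values are left untouched
GENDER_MAP = {'M': 'Male', 'FEMALE': 'Female'}

def standardize_gender(series: pd.Series) -> pd.Series:
    # replace() matches whole values, so 'Male' is not affected by the 'M' key
    return series.replace(GENDER_MAP)

sample_gender = pd.Series(['Male', 'M', 'FEMALE', 'Female'])
standardized = standardize_gender(sample_gender)
```

Re-running `value_counts(dropna=False)` on the result is a quick way to confirm that only the two standard categories remain.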

# 4. Data Profiling: Clinicians

In [13]:
######################################################## Table-level profiling ########################################################################

# Clinicians


print("Table level profiling: Clinicians table")

# First 5 rows
print("\n# First 5 rows")
display(df_clinic.head())


#1.Table Dimensions (Rows and columns)
print("\n#1. Table Dimensions")
print("Shape (Rows, Columns):", df_clinic.shape)
print("Total Rows:", len(df_clinic))


#2. Column Names (Check headers)
print("\n#2. Column Headers:")
print(df_clinic.columns.tolist())


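#3. Primary Key Integrity
# Verify if 'clinician_id' is truly unique
pk_duplicates = df_clinic['clinician_id'].duplicated().sum()
print(f"\n#3. Duplicate Clinician IDs (PK): {pk_duplicates}")
if pk_duplicates > 0:
    print("Duplicate primary keys found:")
    # Show every row that shares a clinician_id with another row
    display(df_clinic[df_clinic['clinician_id'].duplicated(keep=False)])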

#4. Duplicate Rows
print("\n#4. Duplicate rows in table:", df_clinic.duplicated().sum())


#5. Initial Data Types & Memory (Is the computer reading it right?)
# This is the raw status as read by pandas; text columns appear as 'object' (string).
print("\n#5. Initial Table Info (Raw Data Types):")
df_clinic.info(memory_usage='deep')


######################################################## Column-level profiling ########################################################################



# 6. Missing Data Summary (The "Null" Check)


print("\nColumn level profiling: Clinicians table")

print("\n#6. Missing Data Summary")

# Total missing values in the whole table (across all columns)
print("\nTotal Missing Values (Cells):", df_clinic.isnull().sum().sum())

missing_summary = df_clinic.isnull().sum()   # Missing values per column
missing_pct = (df_clinic.isnull().sum() / len(df_clinic)) * 100
print(pd.DataFrame({'Missing Count': missing_summary, 'Missing %': missing_pct}))


#7. High-Level Statistics
# include='all' summarizes numeric and text (object) columns in one table
print("\n#7. Descriptive Statistics:")
display(df_clinic.describe(include='all').T)


# 8. Categorical column check

print("\n#8. Categorical column values check")

print("\nDesignation Value Check:")
display(df_clinic['designation'].value_counts(dropna=False))

Table level profiling: Clinicians table

# First 5 rows


Unnamed: 0,clinician_id,clinician_name,department_id,designation,years_experience
0,20001,Mary Lopez,3,GP,29
1,20002,Claire Vance,6,GP,9
2,20003,William Lee,9,Specialist,10
3,20004,Crystal Owens,6,Specialist,16
4,20005,Sydney Fisher,4,Specialist,11



#1. Table Dimensions
Shape (Rows, Columns): (300, 5)
Total Rows: 300

#2. Column Headers:
['clinician_id', 'clinician_name', 'department_id', 'designation', 'years_experience']

#3. Duplicate Clinician IDs (PK): 0

#4. Duplicate rows in table: 0

#5. Initial Table Info (Raw Data Types):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   clinician_id      300 non-null    int64 
 1   clinician_name    300 non-null    object
 2   department_id     300 non-null    int64 
 3   designation       300 non-null    object
 4   years_experience  300 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 42.0 KB

Column level profiling: Clinicians table

#6. Missing Data Summary

Total Missing Values (Cells): 0
                  Missing Count  Missing %
clinician_id                  0        0.0
clinician_name                0        0.0
depa

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
clinician_id,300.0,,,,20150.5,86.746758,20001.0,20075.75,20150.5,20225.25,20300.0
clinician_name,300.0,300.0,Gary Yates,1.0,,,,,,,
department_id,300.0,,,,8.153333,57.453222,1.0,3.0,5.0,7.0,999.0
designation,300.0,3.0,Specialist,150.0,,,,,,,
years_experience,300.0,,,,15.41,8.661716,1.0,8.0,15.0,23.0,30.0



#8. Categorical column values check

Designation Value Check:


designation
Specialist    150
GP             95
Consultant     55
Name: count, dtype: int64

## Foreign Key Integrity Check

In [14]:
# Relationship Integrity Check(FK)

# Unique department_id values from the departments master table (df_dept)
valid_department_ids = df_dept['department_id'].astype(str).unique()

# Check if the department_id in df_clinic IS NOT IN the list of valid_department_ids.
orphaned_clinicians = df_clinic[
    df_clinic['department_id'].notna() &
    ~df_clinic['department_id'].astype(str).isin(valid_department_ids)
]

print("--- Orphaned Department Check (Clinicians) ---")
print(f"\nClinicians assigned to an Invalid Department ID: {len(orphaned_clinicians)}")

if len(orphaned_clinicians) > 0:
    print("\nSample of Orphaned Clinicians (Requires FK Correction):")
    display(orphaned_clinicians[['clinician_id', 'department_id', 'clinician_name']].head())

--- Orphaned Department Check (Clinicians) ---

Clinicians assigned to an Invalid Department ID: 1

Sample of Orphaned Clinicians (Requires FK Correction):


Unnamed: 0,clinician_id,department_id,clinician_name
9,20010,999,Katie Pierce


## 🔎 Findings: Clinicians
**(Analysis of 300 Records)**

### Referential Integrity Failure (FK)
* **Check:** Orphaned Department ID (Clinicians $\rightarrow$ Departments)
* **Finding:** **1 Clinician record** (Clinician ID: `20010`) is assigned to an invalid `department_id` of **999**. 
* **Impact:** Although minimal in volume (1 record), this is a critical data linkage failure. This clinician's work will be **excluded from departmental roll-up metrics** and performance summaries until the link is corrected.
* **Action:** This record must be corrected by either:
    * Imputing the correct department ID (if known).
    * Mapping the invalid ID (999) to a designated 'Unknown/General' Department ID in the `df_dept` table if that department is meant to exist but was not defined.
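
The second option might look roughly like the sketch below. It is only an illustration under stated assumptions: the fallback ID `0` for an 'Unknown/General' department is a hypothetical placeholder, the helper name `remap_orphaned_departments` is invented here, and the toy frames carry only the columns needed for the check.

```python
import pandas as pd

UNKNOWN_DEPT_ID = 0  # hypothetical ID for an 'Unknown/General' department

def remap_orphaned_departments(clinicians: pd.DataFrame,
                               departments: pd.DataFrame,
                               fallback_id: int) -> pd.DataFrame:
    """Point clinicians with an unknown department_id at a fallback department."""
    valid_ids = departments['department_id']
    if fallback_id not in valid_ids.values:
        raise ValueError("fallback_id must exist in the departments table")

    out = clinicians.copy()
    # Orphaned = has a department_id that is absent from the master table
    orphaned = out['department_id'].notna() & ~out['department_id'].isin(valid_ids)
    out.loc[orphaned, 'department_id'] = fallback_id
    return out

# Toy data: clinician 20010 carries the invalid ID 999 found above
toy_departments = pd.DataFrame({'department_id': [UNKNOWN_DEPT_ID, 3, 6]})
toy_clinicians = pd.DataFrame({
    'clinician_id': [20001, 20010],
    'department_id': [3, 999],
})
fixed_clinicians = remap_orphaned_departments(
    toy_clinicians, toy_departments, UNKNOWN_DEPT_ID
)
```

Requiring the fallback ID to exist in the departments table keeps the fix from introducing a new orphan.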

<a id='summary'></a>
# 🛑 Data Quality Summary and Cleaning Strategy
This summary consolidates all critical data quality issues identified across the four datasets, defining the scope and prioritized action plan for the subsequent data cleaning phase.

| Dataset / Dimension | Issue Type | Records Affected | Business Impact | Cleaning Action |
| :--- | :--- | :--- | :--- | :--- |
| **Appointments (Fact)** | **Logical Conflict** (Status/Flag) | 1 Record | Corrupts **No-Show Rate** and Completion Rate KPIs. | Correct `no_show_flag` to FALSE for that record. |
| **Appointments (Fact)** | **Referential Integrity** (FK) | 393 Records | Excluded from Departmental performance reports. | Impute all invalid IDs to **Department ID: 1**  |
| **Appointments (Fact)** | **Domain/Standardization** | 1 Record | Misspelled status ("Cancleld") is counted as a separate category, understating cancellations. | Standardize value to **"Cancelled"**. |
| **Patients (Dimension)** | **Temporal Integrity** (Logic) | 673 Records | Invalidates all **Age-Based Cohort** and time-series analysis. | Invalidate conflicting `date_of_birth` to `NaT` and flag error. |
| **Patients (Dimension)** | **Domain/Standardization** | Low | Splits reporting categories (e.g., 'Male' vs. 'M'). | Map inconsistent Gender values (`M`, `FEMALE`) to standard values. |
| **Departments (Dimension)** | **Domain/Standardization** | 1 Record | Prevents accurate data linkage for Cardiology appointments. | Correct spelling error: **'Cardiologgy'** $\rightarrow$ **'Cardiology'**. |
| **Clinicians (Dimension)** | **Referential Integrity** (FK) | 1 Record | Excludes clinician's work from department reports. | Re-assign from `department_id` 999 to `department_id` 1. |


---

### Conclusion

Overall data quality is high: no primary key violations were found in any table. The cleaning phase will prioritize the **temporal integrity (673 patient records)** and **referential integrity (393 appointment records)** issues, which pose the largest risk to analytical accuracy.
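
The primary key claim can be re-checked after cleaning with a small helper along these lines. This is a sketch rather than part of the original notebook: `count_pk_duplicates` is a hypothetical name, and the toy frames below are illustrative data containing one deliberate duplicate.

```python
import pandas as pd

def count_pk_duplicates(tables: dict[str, tuple[pd.DataFrame, str]]) -> dict[str, int]:
    """Return the number of repeated key values for each (table, key column) pair."""
    return {
        name: int(df[key].duplicated().sum())
        for name, (df, key) in tables.items()
    }

# Toy frames (assumption): one deliberate duplicate in the clinicians key
toy_tables = {
    'clinicians': (pd.DataFrame({'clinician_id': [20001, 20002, 20002]}), 'clinician_id'),
    'patients': (pd.DataFrame({'patient_id': [10061, 10087]}), 'patient_id'),
}
dup_counts = count_pk_duplicates(toy_tables)
```

In the notebook, the same helper could be called with `df_appnt`, `df_dept`, `df_pat`, and `df_clinic` paired with `appointment_id`, `department_id`, `patient_id`, and `clinician_id` respectively.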

**Action:** Proceed to the cleaning script to execute the standardization and imputation strategy outlined above.