**Step 1: Data Exploration**
- Examine what data files you have
- Look at structure (columns, format, size)
- Identify data quality issues

In [1]:
from pathlib import Path
import os
import pandas as pd
from graphviz import Digraph


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)    


%load_ext autoreload
%autoreload 2

## The Data Files 

In [2]:
data_dir = Path('../data/raw/')
data_files = list(data_dir.glob('*.csv'))

data_dict = {
    file.stem.split('_')[0]: pd.read_csv(file) for file in data_files
}
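
To make the keying rule concrete, here is a minimal sketch of how a dictionary key is derived from a file name. The file name `healthcare_dataset.csv` is an assumption for illustration; the notebook only confirms that a `healthcare` key exists.

```python
from pathlib import Path

# Hypothetical file name; only the resulting key 'healthcare' is confirmed by the notebook.
example_file = Path('../data/raw/healthcare_dataset.csv')

# .stem drops the extension; split('_')[0] keeps the text before the first underscore.
key = example_file.stem.split('_')[0]
print(key)
```

One consequence of this rule is that two files sharing the same prefix before the first underscore would map to the same key, so the later file would silently replace the earlier one.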


## 1.1 The Data Structure  

In [3]:
df = data_dict['healthcare']
df.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal


### **1.2 Structure (Columns, Format, Size)**

In [4]:
df_profile = pd.DataFrame({
    'Column': df.columns,
    'Data Type': df.dtypes.astype(str).values,
    'Non-Null Count': df.notnull().sum().values,
    'Null Count': df.isnull().sum().values,
    'Missing (%)': round(df.isnull().mean() * 100, 2),
    'Unique Values': df.nunique().values,
    'Unique (%)': round((df.nunique() / len(df)) * 100, 2),
    'Memory Usage': df.memory_usage(deep=True).values[1:]  # Exclude index
})

display(df_profile)


Unnamed: 0,Column,Data Type,Non-Null Count,Null Count,Missing (%),Unique Values,Unique (%),Memory Usage
Name,Name,object,55500,0,0.0,49992,90.08,3456123
Age,Age,int64,55500,0,0.0,77,0.14,444000
Gender,Gender,object,55500,0,0.0,2,0.0,2996952
Blood Type,Blood Type,object,55500,0,0.0,8,0.01,2844392
Medical Condition,Medical Condition,object,55500,0,0.0,6,0.01,3163733
Date of Admission,Date of Admission,object,55500,0,0.0,1827,3.29,3274500
Doctor,Doctor,object,55500,0,0.0,40341,72.69,3456165
Hospital,Hospital,object,55500,0,0.0,39876,71.85,3638177
Insurance Provider,Insurance Provider,object,55500,0,0.0,5,0.01,3208132
Billing Amount,Billing Amount,float64,55500,0,0.0,50000,90.09,444000


Dataset Size: 55,500 patient records x 15 columns
Data Quality: Excellent (0% missing values)
Memory Footprint: 38.33 MB (efficient storage)
Data Integrity: High (0.96% duplicate records)

HIGHLIGHTS:
---
* Complete dataset - no null values across any column
* High name cardinality (49,992 unique names, ~90%); names are not guaranteed unique identifiers
* Diverse provider network (40,341 doctors, 39,876 hospitals)
* Comprehensive temporal tracking (~1,800 unique admission/discharge dates)
* Data types: 12 object (string) columns, 3 numerical columns

COLUMN BREAKDOWN:
---

1. Patient Identifiers & Demographics (3 columns):
   - Name: 49,992 unique values (~90% unique identifiers)
   - Age: 77 unique values (int64)
   - Gender: 2 categories (Male/Female)

2. Medical Classification (4 columns):
   - Blood Type: 8 categories (complete ABO/Rh system)
   - Medical Condition: 6 categories
   - Medication: 5 treatment options
   - Test Results: 3 outcome categories

3. Healthcare Providers (2 columns):
   - Doctor: 40,341 unique physicians
   - Hospital: 39,876 unique facilities

4. Administrative & Facilities (2 columns):
   - Room Number: 400 unique rooms (int64)
   - Admission Type: 3 categories (Emergency/Elective/Urgent)

5. Temporal Data (2 columns):
   - Date of Admission: 1,827 unique dates (object - needs datetime conversion)
   - Discharge Date: 1,856 unique dates (object - needs datetime conversion)

6. Financial Data (2 columns):
   - Billing Amount: 50,000 unique values (float64)
   - Insurance Provider: 5 coverage options

KEY OBSERVATIONS:
---

Data Completeness:
* 100% data availability - exceptional data collection quality
* Consistent 55,500 records across all fields
* No missing data imputation required

Cardinality Analysis:
* High cardinality fields: Name, Doctor, Hospital, Billing Amount
  (suitable for granular analysis and pattern detection)
* Low cardinality fields: Gender, Blood Type, Medical Condition, Test Results
  (ideal for categorical analysis and classification)
* Medium cardinality: Room Number (400), Admission/Discharge dates (~1,800)
  (enables temporal and facility utilization analysis)

Data Type Distribution:
* Numerical: Age, Room Number, Billing Amount (3 columns)
* Categorical: All other fields (12 columns - requires encoding for ML models)
* Temporal: Date fields stored as objects (optimization needed)

Data Integrity:
* 534 duplicate records identified (0.96% of total)
* Recommendation: Review duplicates to identify legitimate repeated visits vs. data entry errors

RECOMMENDED NEXT STEPS:
---
1. Convert Date of Admission and Discharge Date to datetime format
2. Investigate 534 duplicate records (same patient, multiple admissions?)
3. Encode categorical variables for machine learning applications
4. Analyze temporal patterns in admission/discharge dates
5. Examine billing amount distribution and relationship with medical conditions
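
As a possible first step toward item 3, low-cardinality string columns can be converted to the pandas `category` dtype, which reduces memory and exposes integer codes. The sketch below uses a toy frame with illustrative values rather than the real dataset, so the byte counts it prints are not the notebook's figures.

```python
import pandas as pd

# Toy frame reusing two low-cardinality column names from the notebook (values are illustrative).
toy = pd.DataFrame({
    'Gender': ['Male', 'Female'] * 500,
    'Admission Type': ['Urgent', 'Emergency', 'Elective', 'Urgent'] * 250,
})

before = toy.memory_usage(deep=True).sum()

# Store each distinct string once and keep small integer codes per row.
compact = toy.astype('category')
after = compact.memory_usage(deep=True).sum()

print(f"object: {before} bytes, category: {after} bytes")
print(compact['Gender'].cat.codes.head().tolist())
```

The integer codes are assigned in sorted category order, so they are a storage detail rather than a meaningful ordinal encoding; a model-specific encoder may still be needed later.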

### Convert to datetime

In [5]:
# Convert to datetime
df['Date of Admission'] = pd.to_datetime(df['Date of Admission'])
df['Discharge Date'] = pd.to_datetime(df['Discharge Date'])
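
A slightly more defensive variant is sketched below. It assumes the dates follow the `YYYY-MM-DD` layout seen in the `df.head()` output, and it uses `errors='coerce'` so that any value that does not parse becomes `NaT` and can be counted instead of raising. The input values are illustrative.

```python
import pandas as pd

# Illustrative values; the last one is deliberately malformed.
raw = pd.Series(['2024-01-31', '2019-08-20', 'not a date'])

# An explicit format avoids per-value inference; errors='coerce' turns failures into NaT.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

n_failed = int(parsed.isna().sum())
print(f"Unparseable values: {n_failed}")
```

A non-zero count after conversion would point to malformed dates that the null check on the raw strings cannot detect.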

### **1.3 Identify Data Quality Issues**

In [6]:
# Identify potential issues
high_missing_cols = df.columns[df.isnull().mean() > 0.5].tolist()
constant_cols = df.columns[df.nunique() == 1].tolist()

df_quality = pd.DataFrame({
    'Metric': [
        'Total Rows',
        'Total Columns',
        'Duplicate Rows',
        'Duplicate Rows (%)',
        'Overall Missing Values (%)',
        'Columns with >50% Missing',
        'Constant Columns (Single Value)',
        'Total Memory Usage'
    ],
    'Value': [
        len(df),
        len(df.columns),
        df.duplicated().sum(),
        round(df.duplicated().mean() * 100, 2),
        round(df.isnull().mean().mean() * 100, 2),
        ', '.join(high_missing_cols) if high_missing_cols else 'None',
        ', '.join(constant_cols) if constant_cols else 'None',
        f"{df.memory_usage(deep=True).sum() / 1024**2:.2f} MB"
    ]
})

display(df_quality)

Unnamed: 0,Metric,Value
0,Total Rows,55500
1,Total Columns,15
2,Duplicate Rows,534
3,Duplicate Rows (%),0.96
4,Overall Missing Values (%),0.0
5,Columns with >50% Missing,
6,Constant Columns (Single Value),
7,Total Memory Usage,32.94 MB


Dataset Size: 55,500 records x 15 columns
Data Quality: Excellent (0% missing values)
Memory Footprint: 32.94 MB (down from 38.33 MB after the datetime conversion)
Data Integrity: High (0.96% duplicate records)

HIGHLIGHTS:
---
* Complete dataset - no null values across any column
* Minimal duplication (534 records) - manageable for cleaning
* No constant columns - all features contribute variability
* Ready for analysis with minimal preprocessing required

KEY OBSERVATIONS:
---

Data Completeness:
* 100% data availability across all 15 features
* Zero missing values eliminates need for imputation strategies

Data Integrity:
* 534 duplicate records identified (0.96% of total)
* Duplication rate below 1% threshold - considered acceptable
* No single-value columns detected - all features provide analytical value

Storage Efficiency:
* 32.94 MB total memory usage (~622 bytes per record)
* Suitable for in-memory processing on standard hardware

RECOMMENDED NEXT STEPS:
---
1. Investigate 534 duplicate records (determine true duplicates vs. valid repeats)
2. Perform exploratory data analysis to understand feature distributions
3. Validate data types match expected formats for each column
4. Consider deduplication if records are confirmed as errors

### Duplicate Investigation

In [7]:
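# 1. Overview of duplicates
# keep=False marks every member of a duplicated group, including the first occurrence
duplicates = df[df.duplicated(keep=False)]
print(f"Total duplicate records: {len(duplicates)}")
# Count distinct rows rather than assuming each group has exactly two members
print(f"Unique records involved in duplication: {len(duplicates.drop_duplicates())}\n")


# keep='first' marks only the redundant copies (the rows drop_duplicates() would remove)
exact_duplicates = df[df.duplicated(keep='first')]
print(f"Exact duplicate rows: {len(exact_duplicates)}")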

Total duplicate records: 1068
Unique records involved in duplication: 534

Exact duplicate rows: 534


Finding:
--------
* 534 exact duplicate rows identified (0.96% of dataset)
* Each duplicate appears exactly twice - no variation in any fields
* Pattern indicates data entry/import errors, not repeated patient visits

Recommendation:
---------------
- Safe to remove all duplicates using drop_duplicates()
- Dataset will reduce from 55,500 to 54,966 records
- No manual review required - all are identical copies
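
The claim that each duplicate appears exactly twice can be checked directly by counting the size of every group of identical rows. The sketch below shows the idea on a toy frame with illustrative values; on the real data the same grouping would be applied to `df`.

```python
import pandas as pd

# Toy data: rows 0 and 2 are identical, row 1 is unique (values are illustrative).
toy = pd.DataFrame({
    'Name': ['A', 'B', 'A'],
    'Billing Amount': [10.0, 20.0, 10.0],
})

# Count how many times each fully identical row occurs.
group_sizes = toy.groupby(list(toy.columns), dropna=False).size()
dup_group_sizes = group_sizes[group_sizes > 1]

# A single entry for size 2 means every duplicated row occurs exactly twice.
print(dup_group_sizes.value_counts())
```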

### 2. Show sample of duplicate groups

In [8]:
# 2. Show sample of duplicate groups
print("Sample of Duplicate Records:")
print("=" * 100)
sample_duplicates = duplicates.sort_values('Name').head(10)
display(sample_duplicates[['Name', 'Age', 'Gender', 'Medical Condition', 
                            'Date of Admission', 'Discharge Date', 'Billing Amount']])

Sample of Duplicate Records:


Unnamed: 0,Name,Age,Gender,Medical Condition,Date of Admission,Discharge Date,Billing Amount
54285,ABIgaIL YOung,41,Female,Hypertension,2022-12-15,2023-01-13,1983.568297
42407,ABIgaIL YOung,41,Female,Hypertension,2022-12-15,2023-01-13,1983.568297
26025,ALIcia taYLoR,78,Male,Asthma,2022-09-18,2022-10-15,31465.274979
53104,ALIcia taYLoR,78,Male,Asthma,2022-09-18,2022-10-15,31465.274979
50151,AMy GREEN,79,Female,Obesity,2021-03-30,2021-04-27,23402.358491
42323,AMy GREEN,79,Female,Obesity,2021-03-30,2021-04-27,23402.358491
51695,ANDREA HansEN,61,Male,Cancer,2021-07-02,2021-07-22,40026.763948
21675,ANDREA HansEN,61,Male,Cancer,2021-07-02,2021-07-22,40026.763948
51916,ANDrEA fREnCH,73,Male,Arthritis,2021-02-08,2021-02-24,9981.590235
36207,ANDrEA fREnCH,73,Male,Arthritis,2021-02-08,2021-02-24,9981.590235


Pattern Observed:
-----------------
All duplicate pairs are IDENTICAL across all fields:
- Same patient name, age, gender, and medical condition
- Same admission and discharge dates
- Same billing amount (down to the cent)

Examples from Sample:
---------------------
- ABIgaIL YOung: Two identical records for hypertension admission (Dec 2022)
- ALIcia taYLoR: Two identical records for asthma admission (Sep 2022)
- AMy GREEN: Two identical records for obesity admission (Mar 2021)
- ANDREA HansEN: Two identical records for cancer admission (Jul 2021)
- ANDrEA fREnCH: Two identical records for arthritis admission (Feb 2021)

Observation:
------------
- Names show inconsistent capitalization (mixed case patterns)
- All other data fields are perfectly identical between pairs
- No temporal progression or different visits - pure duplicates

Conclusion:
-----------
These are systematic duplicate entries, not separate patient visits. 

### Check if same patients have multiple visits

In [9]:
# Check if same patients have multiple visits
duplicate_names = duplicates['Name'].unique()
print(f"Unique patient names with duplicates: {len(duplicate_names)}")



# Show example of three patients with multiple visits
for name in duplicate_names[:3]:
    patient_visits = df[df['Name'] == name].sort_values('Date of Admission')
    print(f"\nPatient Name: {name}")
    display(patient_visits[['Date of Admission', 'Discharge Date', 'Medical Condition', 'Billing Amount']])
    

Unique patient names with duplicates: 534

Patient Name: Nancy glOVeR


Unnamed: 0,Date of Admission,Discharge Date,Medical Condition,Billing Amount
64,2020-05-08,2020-06-01,Hypertension,19183.168885
54972,2020-05-08,2020-06-01,Hypertension,19183.168885



Patient Name: DAVid higgInS


Unnamed: 0,Date of Admission,Discharge Date,Medical Condition,Billing Amount
107,2021-03-05,2021-03-20,Arthritis,24948.477824
54404,2021-03-05,2021-03-20,Arthritis,24948.477824



Patient Name: RoBErt hIGGInS


Unnamed: 0,Date of Admission,Discharge Date,Medical Condition,Billing Amount
148,2021-05-06,2021-05-29,Asthma,13355.782085
53813,2021-05-06,2021-05-29,Asthma,13355.782085


In [10]:
# Show patients with more than 2 visits (3 or more rows under the same name)
frequent_visitors = df['Name'].value_counts()
frequent_visitors = frequent_visitors[frequent_visitors > 2]
print("\nPatients with more than 2 visits:")
for name in frequent_visitors.index:
    patient_visits = df[df['Name'] == name].sort_values('Date of Admission')
    print(f"\nPatient Name: {name} - Total Visits: {frequent_visitors[name]}")
    display(patient_visits[['Date of Admission', 'Discharge Date', 'Medical Condition', 'Billing Amount']])



Patients with more than 2 visits:

Patient Name: DAvId muNoZ - Total Visits: 3


Unnamed: 0,Date of Admission,Discharge Date,Medical Condition,Billing Amount
46648,2022-08-04,2022-08-17,Cancer,32454.950609
2491,2022-12-20,2023-01-09,Diabetes,25773.328236
53672,2022-12-20,2023-01-09,Diabetes,25773.328236


## Data Quality Issue: Complete Record Duplication

### Finding Summary
- **Scale**: 534 unique patients affected (1,068 duplicate records total)
- **Pattern**: Each affected patient has exactly 2 identical entries
- **Impact**: Approximately 50% data redundancy in affected records

### Root Cause Analysis
**Type**: Exact duplicate records (100% field-level match)
- Identical across ALL dimensions: dates, amounts, conditions, facilities
- Not patient readmissions or separate visits
- Likely caused by: double data loading, ETL pipeline error, or source file duplication

### Evidence Example: Patient "Nancy glOVeR"

Row 64 & 54972 - Completely Identical:
├─ Admission Date: 2020-05-08
├─ Discharge Date: 2020-06-01  
├─ Hospital: Khan, and Rodriguez Fischer
├─ Medical Condition: Hypertension
└─ Billing Amount: $19,183.17


### Business Impact
- **Revenue Calculations**: Risk of double-counting (~$XXX,XXX in duplicated billing)
- **Patient Metrics**: Inflated visit counts affecting capacity planning
- **Clinical Analysis**: Skewed condition prevalence and treatment duration statistics
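
The double-counted billing figure left open above can be computed by summing the billing amount over the redundant copies only. The sketch below shows the calculation on a toy frame with illustrative values; it does not reproduce the notebook's actual total.

```python
import pandas as pd

# Toy data: the second 'A' row is an exact copy of the first (values are illustrative).
toy = pd.DataFrame({
    'Name': ['A', 'B', 'A', 'C'],
    'Billing Amount': [100.0, 250.0, 100.0, 75.5],
})

# keep='first' flags only the redundant copies, so each duplicated bill is counted once.
redundant = toy[toy.duplicated(keep='first')]
double_counted = redundant['Billing Amount'].sum()
print(f"Double-counted billing: ${double_counted:,.2f}")
```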

### Recommendation
**Action**: Apply deduplication using `drop_duplicates()` on all fields
**Justification**: Low risk - duplicate pairs match on every field
**Verification**: Retain one record per duplicate set; log removed row IDs for audit trail

### Post-Cleaning Validation
- Verify record count reduces from 55,500 to 54,966
- Confirm no unique patient data lost
- Document deduplication in data lineage

## Dropping Duplicate Rows
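
A minimal sketch of the deduplication step, following the recommendation to log removed row IDs for the audit trail. It runs on a toy frame with illustrative values and writes the result to a new variable, so it is an outline of the approach rather than the notebook's actual cleaning code.

```python
import pandas as pd

# Toy data: rows 2 and 4 are exact copies of rows 0 and 1 (values are illustrative).
toy = pd.DataFrame({
    'Name': ['A', 'B', 'A', 'C', 'B'],
    'Billing Amount': [100.0, 250.0, 100.0, 75.5, 250.0],
})

# Record the index labels of the rows that will be removed, for the audit trail.
removed_ids = toy.index[toy.duplicated(keep='first')].tolist()

# Work on a new frame so the original stays available for comparison.
deduped = toy.drop_duplicates(keep='first')

# Post-cleaning validation: row count drops by the number of removed rows, no duplicates remain.
assert len(deduped) == len(toy) - len(removed_ids)
assert not deduped.duplicated().any()
print(f"Removed rows: {removed_ids}; {len(toy)} -> {len(deduped)} records")
```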

### 1. Statistical Summary for Numerical Columns

In [11]:
# Descriptive statistics
print("NUMERICAL VARIABLES SUMMARY")
print("=" * 100)
display(df[['Age', 'Room Number', 'Billing Amount']].describe())

print("\nAge Distribution:")
print(f"Range: {df['Age'].min()} - {df['Age'].max()} years")
print(f"Mean: {df['Age'].mean():.1f} years")
print(f"Median: {df['Age'].median():.1f} years")

print("\nBilling Amount Distribution:")
print(f"Range: ${df['Billing Amount'].min():.2f} - ${df['Billing Amount'].max():.2f}")
print(f"Mean: ${df['Billing Amount'].mean():.2f}")
print(f"Median: ${df['Billing Amount'].median():.2f}")

NUMERICAL VARIABLES SUMMARY


Unnamed: 0,Age,Room Number,Billing Amount
count,55500.0,55500.0,55500.0
mean,51.539459,301.134829,25539.316097
std,19.602454,115.243069,14211.454431
min,13.0,101.0,-2008.49214
25%,35.0,202.0,13241.224652
50%,52.0,302.0,25538.069376
75%,68.0,401.0,37820.508436
max,89.0,500.0,52764.276736



Age Distribution:
Range: 13 - 89 years
Mean: 51.5 years
Median: 52.0 years

Billing Amount Distribution:
Range: $-2008.49 - $52764.28
Mean: $25539.32
Median: $25538.07


Age Distribution:
-----------------
- Range: 13 to 89 years (76-year span)
- Mean: 51.5 years
- Median: 52.0 years
- Standard Deviation: 19.6 years

Key Insights:
- Roughly symmetric distribution (mean ≈ median); symmetry alone does not establish normality
- Broad age coverage from adolescents (13) to elderly patients (89)
- 50% of patients fall between 35 and 68 years (IQR)
- All adult age groups are represented

Room Number Distribution:
-------------------------
- Range: 101 to 500 (400 available rooms)
- Mean: 301.1
- Median: 302.0
- Standard Deviation: 115.2

Key Insights:
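- Consistent with a uniform distribution across room numbers (standard deviation 115.2 vs. about 115.5 expected for a uniform spread over 400 values)
- Sequential room numbering system (101-500)
- No apparent room allocation bias
- Room numbers alone say nothing about utilization, since they are shared across many hospitals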

Billing Amount Distribution:
-----------------------------
- Range: -$2,008.49 to $52,764.28
- Mean: $25,539.32
- Median: $25,538.07
- Standard Deviation: $14,211.45

Key Insights:
- High variability in treatment costs (56% coefficient of variation)
- Negative billing detected (-$2,008.49) - may reflect refunds/adjustments or a data error
- Wide cost spectrum reflects diverse medical procedures
- Symmetrical distribution (mean ≈ median)
- 50% of bills fall between $13,241 - $37,821

Critical Observations:
----------------------
1. Negative billing amount requires investigation (potential data quality issue)
2. Age distribution covers adolescents through elderly patients (13-89), centered on middle age
3. Billing variability aligns with expected healthcare cost diversity
4. All variables show complete data coverage (55,500 records)
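
To follow up on observation 1, the negative bills can be isolated and counted before deciding whether they are refunds or errors. The sketch below uses a toy frame; the amounts are rounded versions of values appearing in this notebook and the pairing with conditions is illustrative.

```python
import pandas as pd

# Toy data with one negative bill (pairing of conditions and amounts is illustrative).
toy = pd.DataFrame({
    'Medical Condition': ['Cancer', 'Asthma', 'Obesity'],
    'Billing Amount': [18856.28, -2008.49, 33643.33],
})

# Select the rows with a negative billing amount and report their share of all records.
negative = toy[toy['Billing Amount'] < 0]
share = len(negative) / len(toy) * 100
print(f"Negative bills: {len(negative)} ({share:.2f}% of records)")
```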

### 2. Categorical Distribution Analysis

In [12]:
print("CATEGORICAL VARIABLES DISTRIBUTION")
print("=" * 100)

print("\nMedical Condition Distribution:")
print(df['Medical Condition'].value_counts())
print(f"\nMost common: {df['Medical Condition'].mode()[0]}")

print("\n\nAdmission Type Distribution:")
print(df['Admission Type'].value_counts())

print("\n\nInsurance Provider Distribution:")
print(df['Insurance Provider'].value_counts())

print("\n\nGender Distribution:")
print(df['Gender'].value_counts())

print("\n\nBlood Type Distribution:")
print(df['Blood Type'].value_counts())

print("\n\nTest Results Distribution:")
print(df['Test Results'].value_counts())

print("\n\nMedication Distribution:")
print(df['Medication'].value_counts())

CATEGORICAL VARIABLES DISTRIBUTION

Medical Condition Distribution:
Medical Condition
Arthritis       9308
Diabetes        9304
Hypertension    9245
Obesity         9231
Cancer          9227
Asthma          9185
Name: count, dtype: int64

Most common: Arthritis


Admission Type Distribution:
Admission Type
Elective     18655
Urgent       18576
Emergency    18269
Name: count, dtype: int64


Insurance Provider Distribution:
Insurance Provider
Cigna               11249
Medicare            11154
UnitedHealthcare    11125
Blue Cross          11059
Aetna               10913
Name: count, dtype: int64


Gender Distribution:
Gender
Male      27774
Female    27726
Name: count, dtype: int64


Blood Type Distribution:
Blood Type
A-     6969
A+     6956
AB+    6947
AB-    6945
B+     6945
B-     6944
O+     6917
O-     6877
Name: count, dtype: int64


Test Results Distribution:
Test Results
Abnormal        18627
Normal          18517
Inconclusive    18356
Name: count, dtype: int64


Medication Dist

### CATEGORICAL VARIABLES ANALYSIS
==============================

Medical Condition Distribution:
-------------------------------
- 6 conditions: Arthritis (9,308), Diabetes (9,304), Hypertension (9,245), 
  Obesity (9,231), Cancer (9,227), Asthma (9,185)
- Near-uniform distribution (16.5-16.8% each)
- Maximum spread: 123 cases between the most and least common condition (~1.3% of a category's count)

Admission Type Distribution:
----------------------------
- Elective: 33.6% | Urgent: 33.5% | Emergency: 32.9%
- Balanced across all types

Insurance Provider Distribution:
--------------------------------
- 5 providers ranging from 19.7% to 20.3%
- Cigna (20.3%), Medicare (20.1%), UnitedHealthcare (20.0%), 
  Blue Cross (19.9%), Aetna (19.7%)

Gender Distribution:
-------------------
- Male: 50.0% (27,774) | Female: 50.0% (27,726)
- Near-perfect balance (48-case difference)

Blood Type Distribution:
-----------------------
- 8 blood types: 12.4-12.6% each
- Uniform distribution (does not reflect natural population distribution where O+ dominates at ~37%)

Test Results Distribution:
--------------------------
- Abnormal: 33.6% | Normal: 33.4% | Inconclusive: 33.1%
- Near-equal distribution

Medication Distribution:
-----------------------
- 5 medications: 19.9-20.1% each
- Lipitor, Ibuprofen, Aspirin, Paracetamol, Penicillin

Key Observation:
----------------
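- All categorical variables show near-uniform distributions (category shares within each variable differ by less than 1 percentage point), a pattern more typical of synthetically generated data than of real patient records.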
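
The uniformity claim can be quantified as the gap between the largest and smallest category share. The sketch below does this for Admission Type, using the counts printed by `value_counts()` in the cell output.

```python
import pandas as pd

# Admission Type counts copied from the value_counts() output.
counts = pd.Series({'Elective': 18655, 'Urgent': 18576, 'Emergency': 18269})

shares = counts / counts.sum() * 100          # percentage share per category
spread_pp = shares.max() - shares.min()       # gap between largest and smallest share
print(shares.round(2).to_dict(), f"spread: {spread_pp:.2f} pp")
```

The same two lines applied to each `value_counts()` result give one comparable spread figure per variable.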

In [13]:
print("DATA QUALITY CHECKS")
print("=" * 100)

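# Name formatting issues
print("\nName Formatting:")
# A lowercase letter directly followed by an uppercase letter marks a mixed-case name.
# Names without such a pair (e.g. "AMy GREEN") are not counted by this pattern.
inconsistent_caps = df['Name'].str.contains(r'[a-z][A-Z]').sum()
print(f"Names with inconsistent capitalization: {inconsistent_caps} ({inconsistent_caps/len(df)*100:.1f}%)")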

# Date validation
print("\nDate Range Validation:")
print(f"Admission dates: {df['Date of Admission'].min()} to {df['Date of Admission'].max()}")
print(f"Discharge dates: {df['Discharge Date'].min()} to {df['Discharge Date'].max()}")

# Logical errors
invalid_dates = (df['Discharge Date'] < df['Date of Admission']).sum()
print(f"\nLogical errors - Discharge before admission: {invalid_dates}")

# Length of stay
df['Length of Stay'] = (df['Discharge Date'] - df['Date of Admission']).dt.days
print(f"\nLength of Stay:")
print(f"Range: {df['Length of Stay'].min()} - {df['Length of Stay'].max()} days")
print(f"Mean: {df['Length of Stay'].mean():.1f} days")
print(f"Median: {df['Length of Stay'].median():.1f} days")

DATA QUALITY CHECKS

Name Formatting:
Names with inconsistent capitalization: 54396 (98.0%)

Date Range Validation:
Admission dates: 2019-05-08 00:00:00 to 2024-05-07 00:00:00
Discharge dates: 2019-05-09 00:00:00 to 2024-06-06 00:00:00

Logical errors - Discharge before admission: 0

Length of Stay:
Range: 1 - 30 days
Mean: 15.5 days
Median: 15.0 days


Name Formatting Issue:
----------------------
- 54,396 names (98.0%) have inconsistent capitalization
- Examples: "Bobby JacksOn", "andrEw waTtS"
- Action Required: Standardize to title case before migration

Date Validation:
----------------
- Admission span: May 2019 to May 2024 (5 years)
- Discharge span: May 2019 to June 2024
- Zero logical errors (no discharge before admission)
- Clean temporal integrity

Length of Stay:
---------------
- Range: 1-30 days (no stay exceeds 30 days)
- Mean: 15.5 days | Median: 15.0 days
- Symmetric distribution

Critical Finding:
-----------------
Name capitalization affects 98% of records - must be cleaned.
All date logic validated with no errors.
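
A minimal sketch of the title-case standardization recommended above, applied to three names from the `df.head()` output. Using `str.title()` is an assumption about how the cleaning would be done; the notebook does not show the actual implementation.

```python
import pandas as pd

# Names taken from the df.head() output.
names = pd.Series(['Bobby JacksOn', 'LesLie TErRy', 'andrEw waTtS'])

# Trim surrounding whitespace, then capitalize the first letter of each word.
cleaned = names.str.strip().str.title()

# Comparing against the original also gives a stricter count of affected names.
changed = int((cleaned != names).sum())
print(cleaned.tolist(), f"changed: {changed}")
```

Title-casing is a blunt rule: a name such as "McDonald" would become "Mcdonald", so names with internal capitals may need separate handling.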

# Architecture Drawings 

In [14]:
from graphviz import Digraph
import os

def create_professional_docker_diagram():
    """Professional Docker Infrastructure diagram"""
    
    docker = Digraph(comment='Docker Infrastructure', format='png')
    
    # Simpler, cleaner settings
    docker.attr(rankdir='LR', nodesep='1', ranksep='1.5')
    
    # Simple rectangular nodes
    docker.attr('node', shape='rectangle', style='filled', fontname='Arial', 
                fontsize='10', fillcolor='#E8EAF6', color='#3F51B5', penwidth='2')
    docker.attr('edge', color='#424242', fontsize='9', penwidth='1.5')
    
    # Containers
    docker.node('MongoDB', 'healthcare_mongodb\nMongoDB 8.2\n:27017')
    docker.node('Migration', 'healthcare_migration\nPython 3.13')
    docker.node('UI', 'healthcare_mongo_ui\nMongo Express\n:8081')
    
    # Volume
    docker.node('Volume', 'mongo_data\n(Volume)', shape='cylinder', 
                fillcolor='#FFF9C4', color='#F57C00')
    
    # Network
    docker.node('Network', 'healthcare_network', shape='ellipse', 
                fillcolor='#E0E0E0', color='#616161')
    
    # Connections with protocols
    docker.edge('Migration', 'MongoDB', label='mongodb://\n:27017')
    docker.edge('UI', 'MongoDB', label='mongodb://\n:27017')
    docker.edge('MongoDB', 'Volume', label='mounts', style='dashed')
    
    # Network connections
    docker.edge('Migration', 'Network', dir='none', style='dotted')
    docker.edge('MongoDB', 'Network', dir='none', style='dotted')
    docker.edge('UI', 'Network', dir='none', style='dotted')
    
    return docker


if __name__ == '__main__':
    # Create the output directory that render() writes to
    os.makedirs('../docs/images', exist_ok=True)
    
    # Generate Docker diagram
    docker = create_professional_docker_diagram()
    docker.render('../docs/images/docker_infrastructure', view=False, cleanup=True)
    print("Generated: ../docs/images/docker_infrastructure.png")

Generated: ../docs/images/docker_infrastructure.png


In [15]:
from graphviz import Digraph
import os

def create_compact_etl_pipeline_diagram():
    """Compact Professional ETL Pipeline Flow diagram"""
    
    etl = Digraph(comment='ETL Pipeline Flow', format='png')
    
    # Tighter graph settings
    etl.attr(rankdir='TB', nodesep='0.5', ranksep='0.4')
    
    # Slightly smaller nodes
    etl.attr('node', shape='rectangle', style='filled,rounded', fontname='Arial', 
             fontsize='9', fillcolor='#E8EAF6', color='#3F51B5', penwidth='1.5',
             height='0.5', width='2')
    etl.attr('edge', color='#424242', fontsize='8', penwidth='1.2', arrowsize='0.7')
    
    # Pipeline steps
    etl.node('Step1', '[1] Load Raw CSV\n55,500 rows', fillcolor='#BBDEFB')
    etl.node('Step2', '[2] Data Cleaning', fillcolor='#FFF9C4')
    etl.node('Step3', '[3] MongoDB Connection', fillcolor='#C8E6C9')
    etl.node('Step4', '[4] Load Cleaned Data\n54,966 rows', fillcolor='#BBDEFB')
    etl.node('Step5', '[5] Bulk Insert', fillcolor='#C8E6C9')
    etl.node('Step6', '[6] Integrity Validation', fillcolor='#FFCCBC')
    etl.node('Step7', '[7] Success', fillcolor='#C5E1A5')
    
    # Compact cleaning operations cluster
    with etl.subgraph(name='cluster_cleaning') as c:
        c.attr(style='filled,rounded', color='#E0E0E0', fillcolor='#F5F5F5',
               label='Cleaning Operations', fontsize='8', fontname='Arial',
               margin='8')
        c.attr('node', shape='note', fillcolor='white', color='#757575', 
               fontsize='8', penwidth='1', width='1.4', height='0.4')
        c.node('C1', 'Name Std.')
        c.node('C2', 'Dup. Removal\n(-534)')
        c.node('C3', 'Type Opt.\n(-20.96%)')
        c.node('C4', 'Quality Reports')
    
    # Main flow
    etl.edge('Step1', 'Step2')
    etl.edge('Step2', 'Step3')
    etl.edge('Step3', 'Step4')
    etl.edge('Step4', 'Step5')
    etl.edge('Step5', 'Step6')
    etl.edge('Step6', 'Step7')
    
    # Cleaning details
    etl.edge('Step2', 'C1', style='dashed', color='#9E9E9E', arrowhead='none')
    etl.edge('Step2', 'C2', style='dashed', color='#9E9E9E', arrowhead='none')
    etl.edge('Step2', 'C3', style='dashed', color='#9E9E9E', arrowhead='none')
    etl.edge('Step2', 'C4', style='dashed', color='#9E9E9E', arrowhead='none')
    
    return etl


if __name__ == '__main__':
    # Create output directory
    os.makedirs('../docs/images', exist_ok=True)
    
    # Generate compact ETL diagram
    etl = create_compact_etl_pipeline_diagram()
    etl.render('../docs/images/etl_pipeline_flow', view=False, cleanup=True)
    print("Generated: ../docs/images/etl_pipeline_flow.png")

Generated: ../docs/images/etl_pipeline_flow.png


In [None]:
from graphviz import Digraph
import os

def create_professional_mongodb_schema_diagram():
    """Professional MongoDB Patient Schema (Architecture Style)"""

    dot = Digraph(comment='MongoDB Patient Schema', format='png')

    # Graph settings – disciplined and predictable
    dot.attr(
        rankdir='TB',
        nodesep='0.35',
        ranksep='0.45',
        splines='ortho'
    )

    # Node styling – neutral, professional
    dot.attr(
        'node',
        shape='rectangle',
        style='rounded,filled',
        fontname='Arial',
        fontsize='9',
        color='#37474F',
        fillcolor='#ECEFF1',
        penwidth='1.2',
        width='2.4',
        height='0.5'
    )

    # Edge styling – minimal, consistent
    dot.attr(
        'edge',
        color='#455A64',
        arrowsize='0.6',
        penwidth='1.0'
    )

    # Root collection
    dot.node(
        'Patient',
        'Patient\n(Collection)',
        fillcolor='#CFD8DC'
    )

    # -------------------- Core Document --------------------

    dot.node(
        'PatientInfo',
        'patient_info\nname\nage\ngender\nblood_type'
    )

    dot.node(
        'MedicalDetails',
        'medical_details\nmedical_condition (IDX)\nmedication\ntest_results'
    )

    dot.node(
        'AdmissionDetails',
        'admission_details\nadmission_date (IDX)\nadmission_type\nroom_number\ndischarge_date'
    )

    dot.node(
        'HospitalInfo',
        'hospital_info\nhospital (IDX*)\ndoctor'
    )

    dot.node(
        'Billing',
        'billing\ninsurance_provider\nbilling_amount'
    )

    dot.node(
        'Metadata',
        'metadata\ncreated_at\nupdated_at\ndata_source\nmigrated_by'
    )

    # -------------------- Layout Control --------------------

    # Place PatientInfo and AdmissionDetails on the same rank (side by side under rankdir='TB')
    with dot.subgraph() as s:
        s.attr(rank='same')
        s.node('PatientInfo')
        s.node('AdmissionDetails')

    # -------------------- Relationships --------------------

    dot.edge('Patient', 'PatientInfo')
    dot.edge('Patient', 'AdmissionDetails')

    dot.edge('PatientInfo', 'MedicalDetails')
    dot.edge('AdmissionDetails', 'HospitalInfo')

    dot.edge('HospitalInfo', 'Billing')

    # Metadata as cross-cutting concern (dashed, no arrows)
    dot.edge('PatientInfo', 'Metadata', style='dashed', arrowhead='none')
    dot.edge('MedicalDetails', 'Metadata', style='dashed', arrowhead='none')
    dot.edge('AdmissionDetails', 'Metadata', style='dashed', arrowhead='none')
    dot.edge('Billing', 'Metadata', style='dashed', arrowhead='none')

    return dot






if __name__ == '__main__':
    # Create the output directory that render() writes to
    os.makedirs('../docs/images', exist_ok=True)

    diagram = create_professional_mongodb_schema_diagram()
    diagram.render(
        '../docs/images/mongodb_patient_schema',
        view=False,
        cleanup=True
    )

    print("Generated: ../docs/images/mongodb_patient_schema.png")


Generated: ../docs/images/mongodb_patient_schema.png
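
To show what a document following the diagram might look like, the sketch below maps one flat CSV row to the nested groups named in the schema nodes. The function name `row_to_document`, the `data_source` value, and the way metadata timestamps are set are all assumptions for illustration; the notebook does not show the migration code, and `migrated_by` is left out because its value is unknown.

```python
from datetime import datetime, timezone


def row_to_document(row: dict, data_source: str) -> dict:
    """Map one flat CSV row to the nested layout drawn in the schema diagram."""
    now = datetime.now(timezone.utc)
    return {
        'patient_info': {
            'name': row['Name'],
            'age': row['Age'],
            'gender': row['Gender'],
            'blood_type': row['Blood Type'],
        },
        'medical_details': {
            'medical_condition': row['Medical Condition'],
            'medication': row['Medication'],
            'test_results': row['Test Results'],
        },
        'admission_details': {
            'admission_date': row['Date of Admission'],
            'admission_type': row['Admission Type'],
            'room_number': row['Room Number'],
            'discharge_date': row['Discharge Date'],
        },
        'hospital_info': {
            'hospital': row['Hospital'],
            'doctor': row['Doctor'],
        },
        'billing': {
            'insurance_provider': row['Insurance Provider'],
            'billing_amount': row['Billing Amount'],
        },
        # Metadata values are placeholders; the notebook does not show how they are set.
        'metadata': {
            'created_at': now,
            'updated_at': now,
            'data_source': data_source,
        },
    }


# First row of the df.head() output, written out as a plain dict.
sample_row = {
    'Name': 'Bobby JacksOn', 'Age': 30, 'Gender': 'Male', 'Blood Type': 'B-',
    'Medical Condition': 'Cancer', 'Date of Admission': '2024-01-31',
    'Doctor': 'Matthew Smith', 'Hospital': 'Sons and Miller',
    'Insurance Provider': 'Blue Cross', 'Billing Amount': 18856.281306,
    'Room Number': 328, 'Admission Type': 'Urgent',
    'Discharge Date': '2024-02-02', 'Medication': 'Paracetamol',
    'Test Results': 'Normal',
}

doc = row_to_document(sample_row, data_source='example.csv')  # hypothetical source label
print(doc['patient_info'], doc['billing'])
```

Grouping fields this way keeps each record self-contained in a single document, which matches the diagram's single `Patient` collection.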
