# Generate Cleaned Dataset

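This notebook selects the 11 columns needed for analysis from `final_data.csv` (15,000 rows, 36 columns), renames four of them for clarity, and saves the result as `final_data_cleaned.csv`.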

**Columns Included:**
- Patient ID
- Visit ID
- Triage End Timestamp
- Doctor Seen Timestamp
- Exit Timestamp
- Doctors On Duty
- Nurses On Duty
- Specialists On Call
- Shift Type
- Triage Level
- Disposition

## Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

## Load the Final Data

In [2]:
# Define file paths
input_file = '/Users/mukeshravichandran/Datathon/final_data.csv'
output_file = '/Users/mukeshravichandran/Datathon/final_data_cleaned.csv'

# Read the final_data.csv
df = pd.read_csv(input_file)

print(f"Original dataset shape: {df.shape}")
print("\nColumn names in original data:")
print(df.columns.tolist())

Original dataset shape: (15000, 36)

Column names in original data:
['Visit ID', 'Patient ID', 'Hospital ID', 'Facility Size (Beds)', 'ICU Beds', 'Regular Beds', 'Fast Track Beds', 'Arrival Time', 'Registration Start', 'Registration End', 'Triage Start', 'Triage End', 'Doctor Seen', 'Exit Time', 'Triage Level', 'Visit Date', 'Visit Time', 'WaitTime for Reg', 'Registration process time', 'Triage process time', 'WaitTime after Triage', 'DoctorVisit to Exit', 'TotalTime(Arrival To Exit)', 'Disposition', 'Satisfaction', 'Age', 'Gender', 'Insurance', 'Staff Date', 'Shift', 'ShiftStart', 'ShiftEnd', 'Nurses On Duty', 'Doctors On Duty', 'Specialists On Call', 'Fast Tracks Beds on shift']


## Select and Clean Columns

In [3]:
# Define columns needed
columns_needed = [
    'Patient ID',
    'Visit ID',
    'Triage End',
    'Doctor Seen',
    'Exit Time',
    'Doctors On Duty',
    'Nurses On Duty',
    'Specialists On Call',
    'Shift',
    'Triage Level',
    'Disposition'
]

# Create the cleaned dataset
cleaned_df = df[columns_needed].copy()

# Rename columns for clarity
cleaned_df = cleaned_df.rename(columns={
    'Triage End': 'Triage End Timestamp',
    'Doctor Seen': 'Doctor Seen Timestamp',
    'Exit Time': 'Exit Timestamp',
    'Shift': 'Shift Type'
})

print(f"Cleaned dataset shape: {cleaned_df.shape}")
print("\nNew column names:")
print(cleaned_df.columns.tolist())

Cleaned dataset shape: (15000, 11)

New column names:
['Patient ID', 'Visit ID', 'Triage End Timestamp', 'Doctor Seen Timestamp', 'Exit Timestamp', 'Doctors On Duty', 'Nurses On Duty', 'Specialists On Call', 'Shift Type', 'Triage Level', 'Disposition']
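
The selection step above fails with a bare `KeyError` if the source file ever lacks one of the expected columns. A minimal sketch of a more explicit check follows. `select_and_rename` is a hypothetical helper, not part of the original notebook, and the small frame in the usage lines is made-up illustration data.

```python
import pandas as pd


def select_and_rename(df, columns, rename_map):
    """Return a copy of df restricted to `columns`, renamed via `rename_map`.

    Hypothetical helper: fails early with a readable message if any
    requested column is absent from the input frame.
    """
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df[columns].copy().rename(columns=rename_map)


# Illustration only: a toy frame, not the real dataset
toy = pd.DataFrame({"Visit ID": ["V1"], "Shift": ["DAY"], "Age": [40]})
toy_cleaned = select_and_rename(toy, ["Visit ID", "Shift"], {"Shift": "Shift Type"})
print(toy_cleaned.columns.tolist())
```

In the notebook itself, the same call could presumably take `df`, `columns_needed`, and the rename mapping from cell 3.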


## Data Quality Check

In [4]:
# Display first few rows
print("First 10 rows of cleaned data:")
print(cleaned_df.head(10))

print("\n" + "="*80)
print("Data Types:")
print(cleaned_df.dtypes)

print("\n" + "="*80)
print("Missing Values:")
print(cleaned_df.isnull().sum())

print("\n" + "="*80)
print("Summary Statistics:")
print(cleaned_df.describe())

First 10 rows of cleaned data:
      Patient ID Visit ID Triage End Timestamp Doctor Seen Timestamp  \
0  MC180325-0433  V112722  2025-03-07 12:23:00   2025-03-07 12:38:00   
1  MC180325-2621  V103705  2025-03-07 10:40:00   2025-03-07 11:05:00   
2  MC180325-2621  V109897  2025-03-07 10:26:00   2025-03-07 11:00:00   
3  MC180325-3511  V107132  2025-03-07 12:07:00   2025-03-07 12:46:00   
4  MC180325-0427  V112438  2025-03-07 14:18:00   2025-03-07 14:55:00   
5  MC180325-0695  V113018  2025-03-07 12:28:00   2025-03-07 13:00:00   
6  MC180325-2987  V109087  2025-03-07 13:46:00   2025-03-07 14:07:00   
7  MC180325-3836  V114603  2025-03-07 07:32:00   2025-03-07 08:05:00   
8  MC180325-3382  V112183  2025-03-07 12:20:00   2025-03-07 12:38:00   
9  MC180325-0749  V107699  2025-03-07 11:45:00   2025-03-07 12:14:00   

        Exit Timestamp  Doctors On Duty  Nurses On Duty  Specialists On Call  \
0  2025-03-07 14:09:00                4               8                    2   
1  2025-03-07 12
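
Because `pd.read_csv` was called without `parse_dates`, the three timestamp columns are loaded as strings rather than datetimes. The sketch below parses them and flags rows whose events are out of order (triage end, then doctor seen, then exit) or fail to parse. `check_timestamp_order` is a hypothetical helper, and the second sample row is invented for illustration; whether such rows exist in the real data is not known from this notebook.

```python
import pandas as pd

TIMESTAMP_COLS = ["Triage End Timestamp", "Doctor Seen Timestamp", "Exit Timestamp"]


def check_timestamp_order(df, cols=TIMESTAMP_COLS):
    """Parse timestamp columns and flag rows that look inconsistent.

    Returns (parsed_df, bad_mask). A row is flagged when any timestamp
    cannot be parsed (NaT) or when the three events are out of order.
    """
    parsed = df.copy()
    for col in cols:
        # errors="coerce" turns unparseable strings into NaT instead of raising
        parsed[col] = pd.to_datetime(parsed[col], errors="coerce")
    has_nat = parsed[cols].isna().any(axis=1)
    out_of_order = (parsed[cols[0]] > parsed[cols[1]]) | (parsed[cols[1]] > parsed[cols[2]])
    return parsed, has_nat | out_of_order


# Row 0 mirrors the first row of the output above; row 1 is deliberately broken
sample = pd.DataFrame({
    "Triage End Timestamp": ["2025-03-07 12:23:00", "2025-03-07 10:40:00"],
    "Doctor Seen Timestamp": ["2025-03-07 12:38:00", "2025-03-07 10:05:00"],
    "Exit Timestamp": ["2025-03-07 14:09:00", "not a date"],
})
parsed_sample, bad_rows = check_timestamp_order(sample)
print(bad_rows.tolist())
```

Flagged rows are only reported here, not dropped; how to handle them is left as an analysis decision.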

## Verify Unique Values

In [5]:
print("Unique Shift Types:")
print(cleaned_df['Shift Type'].unique())
print(f"Count: {cleaned_df['Shift Type'].nunique()}\n")

print("Unique Triage Levels:")
print(sorted(cleaned_df['Triage Level'].unique()))
print(f"Count: {cleaned_df['Triage Level'].nunique()}\n")

print("Unique Dispositions:")
print(cleaned_df['Disposition'].unique())
print(f"Count: {cleaned_df['Disposition'].nunique()}")

Unique Shift Types:
['DAY' 'EVENING' 'NIGHT']
Count: 3

Unique Triage Levels:
[1, 2, 3, 4]
Count: 4

Unique Dispositions:
['TRANSFERRED' 'ADMITTED' 'DISCHARGED']
Count: 3
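
The unique-value listing above can be turned into a repeatable check. The sketch below compares each categorical column against the values observed in this run. Treating those observed values as the complete allowed set is an assumption, and `find_unexpected_values` is a hypothetical helper; the toy frame is illustration data only.

```python
import pandas as pd

# Values observed in this notebook's output; assumed (not confirmed) to be exhaustive
EXPECTED_VALUES = {
    "Shift Type": {"DAY", "EVENING", "NIGHT"},
    "Triage Level": {1, 2, 3, 4},
    "Disposition": {"TRANSFERRED", "ADMITTED", "DISCHARGED"},
}


def find_unexpected_values(df, expected=EXPECTED_VALUES):
    """Map each checked column to the set of values outside its expected set."""
    problems = {}
    for col, allowed in expected.items():
        # tolist() converts NumPy scalars to plain Python values before comparing
        observed = set(df[col].dropna().unique().tolist())
        extra = observed - allowed
        if extra:
            problems[col] = extra
    return problems


# Illustration only: one bad shift label and one bad triage level
toy_categories = pd.DataFrame({
    "Shift Type": ["DAY", "night"],
    "Triage Level": [1, 5],
    "Disposition": ["ADMITTED", "DISCHARGED"],
})
unexpected = find_unexpected_values(toy_categories)
print(unexpected)
```

An empty dictionary means every checked column contains only expected values.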


## Save Cleaned Dataset

In [6]:
# Save the cleaned dataset
cleaned_df.to_csv(output_file, index=False)

print(f"✅ Cleaned dataset saved to: {output_file}")
print(f"\nTotal rows: {len(cleaned_df):,}")
print(f"Total columns: {len(cleaned_df.columns)}")
print(f"\nFile size: {Path(output_file).stat().st_size / (1024**2):.2f} MB")

✅ Cleaned dataset saved to: /Users/mukeshravichandran/Datathon/final_data_cleaned.csv

Total rows: 15,000
Total columns: 11

File size: 1.51 MB
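
One way to confirm the save succeeded is to read the file back and compare it with the in-memory frame. The sketch below does this for a toy frame written to a temporary directory. `verify_saved_csv` is a hypothetical helper, and it only compares shape and column names, not cell values.

```python
import tempfile
from pathlib import Path

import pandas as pd


def verify_saved_csv(df, path):
    """Re-read a CSV written with index=False and compare shape and columns."""
    reloaded = pd.read_csv(path)
    same_shape = reloaded.shape == df.shape
    same_columns = reloaded.columns.tolist() == df.columns.tolist()
    return same_shape and same_columns


# Illustration only: round-trip a toy frame through a temporary file
toy_saved = pd.DataFrame({"Visit ID": ["V1", "V2"], "Triage Level": [3, 1]})
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_cleaned.csv"
    toy_saved.to_csv(toy_path, index=False)
    roundtrip_ok = verify_saved_csv(toy_saved, toy_path)
print(roundtrip_ok)
```

In the notebook, the equivalent call would presumably be `verify_saved_csv(cleaned_df, output_file)`.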


## Summary

In [7]:
# Display final summary
print("\n" + "="*80)
print("CLEANED DATASET SUMMARY")
print("="*80)
print(f"\n📊 Dataset Info:")
print(f"   Rows: {len(cleaned_df):,}")
print(f"   Columns: {len(cleaned_df.columns)}")
print(f"\n📋 Columns Included:")
for idx, col in enumerate(cleaned_df.columns, 1):
    print(f"   {idx}. {col}")
print(f"\n✅ Output file: {output_file}")
print("="*80)


CLEANED DATASET SUMMARY

📊 Dataset Info:
   Rows: 15,000
   Columns: 11

📋 Columns Included:
   1. Patient ID
   2. Visit ID
   3. Triage End Timestamp
   4. Doctor Seen Timestamp
   5. Exit Timestamp
   6. Doctors On Duty
   7. Nurses On Duty
   8. Specialists On Call
   9. Shift Type
   10. Triage Level
   11. Disposition

✅ Output file: /Users/mukeshravichandran/Datathon/final_data_cleaned.csv
