<a href="https://colab.research.google.com/github/kpsalida/HospitalDischarges_Visualization/blob/Katerina/viz_eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset Acquisition

The analysis utilizes the 2019 Hospital Inpatient Discharges (SPARCS De-Identified) dataset, publicly available through the New York State Open Data Portal (https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/4ny4-j5zv).  

SPARCS (Statewide Planning and Research Cooperative System) is a comprehensive administrative database that captures patient-level inpatient discharge records across New York State. The dataset includes clinical, operational, and financial variables such as length of stay (LOS), total charges, payer type, severity of illness, risk of mortality, and facility identifiers.

The dataset was restricted to discharges occurring in calendar year 2019 to avoid inconsistencies arising from the COVID-19 pandemic. The scope was further narrowed to three hospitals, both to allow cross-facility comparisons and to maintain computational efficiency within the Power BI environment. Two of the hospitals are major institutions in NYC, while the third is more regional and may show different LOS patterns and a different cost structure.
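The filtering step described above can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the exact code used to build the project file: the rows are made up, and the names `Hospital B`, `Hospital C`, and `Hospital X` are hypothetical placeholders (only Maimonides Medical Center appears in this notebook's output).

```python
import pandas as pd

# Toy stand-in for the full SPARCS extract (hypothetical rows).
raw = pd.DataFrame({
    "Facility Name": ["Maimonides Medical Center", "Hospital B", "Hospital X", "Hospital C"],
    "Discharge Year": [2019, 2019, 2019, 2018],
    "Length of Stay": ["3", "7", "2", "5"],
})

# Hypothetical selection of three facilities; the last two names are placeholders.
selected_facilities = ["Maimonides Medical Center", "Hospital B", "Hospital C"]

# Keep 2019 discharges from the selected facilities only.
mask = (raw["Discharge Year"] == 2019) & raw["Facility Name"].isin(selected_facilities)
subset = raw.loc[mask].reset_index(drop=True)
print(subset)
```

Combining both conditions in one boolean mask keeps the filter readable and avoids creating intermediate copies of a large extract.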

With this filtering, we aim for a more standardized performance comparison using normalized key performance indicators (KPIs), such as average LOS and average charges per discharge, which are independent of hospital size.

Each row in the dataset represents one discharge, not one patient. The unit of analysis is therefore the discharge, and the same patient may account for more than one row.

Differences in discharge volume across hospitals were addressed using normalized performance metrics to ensure comparability.
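As a rough sketch of what "normalized" means here, per-discharge averages can be computed with a group-by on the facility. The data below is invented for illustration, and the output column names (`n_discharges`, `avg_los`, `avg_charges`) are assumptions rather than names used elsewhere in this project. The sketch also assumes Length of Stay and Total Charges have already been converted to numeric.

```python
import pandas as pd

# Hypothetical discharges; values are made up for illustration only.
toy = pd.DataFrame({
    "Facility Name": ["A", "A", "A", "B"],
    "Length of Stay": [2, 4, 6, 5],
    "Total Charges": [1000.0, 3000.0, 5000.0, 4000.0],
})

# One row per facility: volume plus per-discharge averages.
kpis = toy.groupby("Facility Name").agg(
    n_discharges=("Length of Stay", "size"),
    avg_los=("Length of Stay", "mean"),
    avg_charges=("Total Charges", "mean"),
)
print(kpis)
```

Because the averages are per discharge, facility A (three discharges) and facility B (one discharge) can be compared directly despite the difference in volume.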

In [27]:
df.columns

Index(['Hospital Service Area', 'Hospital County',
       'Operating Certificate Number', 'Permanent Facility Id',
       'Facility Name', 'Age Group', 'Zip Code - 3 digits', 'Gender', 'Race',
       'Ethnicity', 'Length of Stay', 'Type of Admission',
       'Patient Disposition', 'Discharge Year', 'CCSR Diagnosis Code',
       'CCSR Diagnosis Description', 'CCSR Procedure Code',
       'CCSR Procedure Description', 'APR DRG Code', 'APR DRG Description',
       'APR MDC Code', 'APR MDC Description', 'APR Severity of Illness Code',
       'APR Severity of Illness Description', 'APR Risk of Mortality',
       'APR Medical Surgical Description', 'Payment Typology 1',
       'Payment Typology 2', 'Payment Typology 3', 'Birth Weight',
       'Emergency Department Indicator', 'Total Charges', 'Total Costs'],
      dtype='object')

# Explanation of Data Fields

**Hospital County**
Type is Char. Length is 11. A description of the county in which the hospital is located. Blank for records with enhanced de-identification.

**Facility Name**
Type is Char. Length is 112. The name of the facility where services were performed based on the Permanent Facility Identifier (PFI), as maintained by the NYSDOH Division of Health Facility Planning. For records with enhanced de-identification, ‘Redacted for Confidentiality’ appears.  

**Age Group**
Type is Char. Length is 11. Age in years at time of discharge. Grouped into the following age groups: 0 to 17, 18 to 29, 30 to 49, 50 to 69, and 70 or Older  

**Gender**
Type is Char. Length is 1. Patient gender: (M) Male, (F) Female, (U) Unknown.  

**Race**
Type is Char. Length is 32. Patient race. Black/African American, Multi-racial, Other Race, White. Other Race includes Native Americans and Asian/Pacific Islander.  

**Ethnicity**
Type is Char. Length is 20. Patient ethnicity: Spanish/Hispanic Origin, Not of Spanish/Hispanic Origin, Multi-ethnic, Unknown.  

**Length of Stay**
Type is Char. Length is 5. The total number of patient days at an acute level and/or other than acute care level (excluding leave of absence days), computed as (Discharge Date - Admission Date) + 1. Length of Stay greater than or equal to 120 days has been aggregated to 120+ days.  

**Type of Admission**
Type is Char. Length is 15. A description of the manner in which the patient was admitted to the health care facility: Elective, Emergency, Newborn, Not Available, Trauma, Urgent   

**Patient Disposition**
Type is Char. Length is 37. The patient's destination or status upon discharge  

**APR DRG Description**
Type is Char. Length is 500. The APR-DRG Classification Code Description in Calendar Year 2019, Version 36 of the APR-DRG Grouper. http://www.health.ny.gov/statistics/sparcs/sysdoc/appy.htm  

**APR Severity of Illness Description**
Type is Char. Length is 8. All Patient Refined Severity of Illness (APR SOI) Description: Undetermined (0), Minor (1), Moderate (2), Major (3), Extreme (4)  

**APR Risk of Mortality**
Type is Char. Length is 8. All Patient Refined Risk of Mortality (APR ROM) Description: Undetermined (0), Minor (1), Moderate (2), Major (3), Extreme (4).  

**Payment Typology 1**
Type is Char. Length is 25. A description of the type of payment for this occurrence.  

**Emergency Department Indicator**
Type is Char. Length is 1. The Emergency Department Indicator is set based on the submitted revenue codes. If the record contained an Emergency Department revenue code of 045X, the indicator is set to "Y"; otherwise it is set to "N".  

**Total Charges**
Type is Char. Length is 8. Total charges for the discharge.  

**Total Costs**
Type is Char. Length is 8. Total estimated cost for the discharge.
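
The severity and risk-of-mortality fields are stored as text but carry a natural order (Undetermined 0 through Extreme 4). One possible way to preserve that order for sorting and plotting is an ordered categorical, sketched below on a toy Series. This is an illustrative assumption about how the field could be encoded, not a step taken from the original analysis.

```python
import pandas as pd

# Order taken from the field description: Undetermined (0) ... Extreme (4).
levels = ["Undetermined", "Minor", "Moderate", "Major", "Extreme"]

# Toy values standing in for 'APR Severity of Illness Description'.
severity = pd.Series(["Extreme", "Major", "Minor", "Minor"])

# An ordered categorical keeps the clinical order instead of alphabetical order.
severity_cat = pd.Categorical(severity, categories=levels, ordered=True)

# Integer codes follow the category order, so they line up with the 0-4 scale.
# Any label not listed in `levels` would receive the code -1.
codes = pd.Series(severity_cat.codes, name="severity_code")
print(codes.tolist())
```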


# Column Selection

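The following columns were excluded:  
CCSR diagnosis and procedure fields, since in-depth clinical analysis is outside the scope of this study.  
Identifier columns (for example, Operating Certificate Number and Permanent Facility Id), which could be confusing.  
Discharge Year, since the dataset covers 2019 only.  

The following features were selected according to business logic: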

**Identification**  
Hospital County  
Facility Name  

**Demographics**  
Age Group  
Gender  
Race / Ethnicity (kept to support social analysis; can be dropped otherwise)  

**Operations**  
Length of Stay (to be converted to numeric)  
Type of Admission  
Emergency Department Indicator  
Patient Disposition  

**Financial**  
Total Charges (to be converted to numeric)  
Total Costs (to be converted to numeric)  

**Clinical Complexity**  
APR Severity of Illness Description  
APR Risk of Mortality  
APR DRG Description  

**Payer**  
Payment Typology 1 (Payment Typology 2 and 3 are dropped)  
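
The numeric conversions noted in the list can be sketched as follows. This is a minimal example on invented rows, assuming the capped length-of-stay bucket is stored as text such as `120+` and that charges carry a `$` sign and thousands separators, as in the `df.head()` output shown in this notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the text formats seen in the raw file.
sample = pd.DataFrame({
    "Length of Stay": ["21", "3", "120+"],
    "Total Charges": ["$401,807.05", "$46,725.01", "$3,393"],
})

# Keep digits only, so the capped "120+" bucket becomes 120
# (a lower bound for those stays, not an exact value).
sample["Length of Stay"] = pd.to_numeric(
    sample["Length of Stay"].astype(str).str.replace(r"\D", "", regex=True),
    errors="coerce",
)

# Drop the currency symbol and thousands separators before converting.
sample["Total Charges"] = pd.to_numeric(
    sample["Total Charges"].astype(str).str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(sample.dtypes)
```

Using `errors="coerce"` turns anything that still cannot be parsed into `NaN`, which makes unexpected formats visible in a later missing-value check instead of raising midway through the conversion.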


# **About Low Variance Features**

In [35]:
# Import necessary libraries
import pandas as pd              # Data manipulation and analysis
import numpy as np               # Numerical operations, arrays, and math functions
import matplotlib.pyplot as plt  # Basic plotting library
import seaborn as sns            # Advanced statistical visualizations built on matplotlib

# Set pandas display option to show up to 500 columns when printing a DataFrame
pd.set_option('display.max_columns', 500)

# Colab helper for mounting Google Drive
from google.colab import drive
import os

# Mount Google Drive to access files.

In [4]:
# Mounts the drive; if already mounted, it just continues
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Change into the project folder on the mounted drive
%cd /content/drive/MyDrive/Colab Notebooks/FinalProject_Visualization

/content/drive/MyDrive/Colab Notebooks/FinalProject_Visualization


In [13]:
df = pd.read_csv('raw_hospital_discharges_2019.csv', low_memory = False)

# Initial Inspection of the DataFrame


In [38]:
df.head()

| | Hospital County | Facility Name | Age Group | Gender | Race | Ethnicity | Length of Stay | Type of Admission | Patient Disposition | APR DRG Description | APR Severity of Illness Description | APR Risk of Mortality | Payment Typology 1 | Emergency Department Indicator | Total Charges | Total Costs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kings | Maimonides Medical Center | 50 to 69 | F | Other Race | Not Span/Hispanic | 21 | Urgent | Skilled Nursing Home | CARDIAC VALVE PROCEDURES W AMI OR COMPLEX PDX | Extreme | Extreme | Medicaid | False | $401,807.05 | $100,294.11 |
| 1 | Kings | Maimonides Medical Center | 0 to 17 | F | Black/African American | Not Span/Hispanic | 3 | Emergency | Home or Self Care | ASTHMA | Major | Moderate | Medicaid | True | $46,725.01 | $23,058.16 |
| 2 | Kings | Maimonides Medical Center | 0 to 17 | F | White | Unknown | 1 | Newborn | Home or Self Care | NEONATE BIRTHWT >2499G, NORMAL NEWBORN OR NEON... | Minor | Minor | Medicaid | False | $3,393 | $490.45 |
| 3 | Kings | Maimonides Medical Center | 18 to 29 | F | White | Unknown | 2 | Urgent | Home or Self Care | VAGINAL DELIVERY | Minor | Minor | Medicaid | True | $20,858.02 | $5,445.28 |
| 4 | Kings | Maimonides Medical Center | 18 to 29 | F | White | Not Span/Hispanic | 2 | Urgent | Home or Self Care | VAGINAL DELIVERY | Minor | Minor | Medicaid | False | $21,116.01 | $5,295.36 |


In [45]:
print("Number of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])

Number of Rows: 53390
Number of Columns: 16


In [40]:
df.columns

Index(['Hospital County', 'Facility Name', 'Age Group', 'Gender', 'Race',
       'Ethnicity', 'Length of Stay', 'Type of Admission',
       'Patient Disposition', 'APR DRG Description',
       'APR Severity of Illness Description', 'APR Risk of Mortality',
       'Payment Typology 1', 'Emergency Department Indicator', 'Total Charges',
       'Total Costs'],
      dtype='object')

In [42]:
df.describe().transpose()

| | count | unique | top | freq |
|---|---|---|---|---|
| Hospital County | 53390 | 2 | Kings | 41129 |
| Facility Name | 53390 | 3 | Maimonides Medical Center | 41129 |
| Age Group | 53390 | 5 | 70 or Older | 15611 |
| Gender | 53390 | 2 | F | 30447 |
| Race | 53390 | 4 | White | 29307 |
| Ethnicity | 53390 | 4 | Not Span/Hispanic | 41826 |
| Length of Stay | 53390 | 109 | 2 | 15630 |
| Type of Admission | 53390 | 5 | Emergency | 32419 |
| Patient Disposition | 53390 | 16 | Home or Self Care | 36907 |
| APR DRG Description | 53377 | 308 | NEONATE BIRTHWT >2499G, NORMAL NEWBORN OR NEON... | 7555 |
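
One way to read the `top` and `freq` columns of this summary is as a dominance share: `freq / count` gives the fraction of rows taken by the most common value. For Hospital County that is 41129 / 53390, or roughly 77%. The sketch below computes the share for every column on a small invented DataFrame. The column name `top_share` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical categorical frame standing in for df.
toy = pd.DataFrame({
    "Hospital County": ["Kings", "Kings", "Kings", "Other"],
    "Gender": ["F", "M", "F", "M"],
})

# describe() on text columns returns count, unique, top, and freq.
summary = toy.describe().transpose()

# freq / count = share of rows taken by the most common value in each column.
summary["top_share"] = summary["freq"].astype(float) / summary["count"].astype(float)
print(summary[["top", "freq", "top_share"]])
```

A share close to 1 flags a column that barely varies and is unlikely to separate groups in a chart.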


In [43]:
df.value_counts('APR DRG Description',dropna=False)

| APR DRG Description | count |
|---|---|
| NEONATE BIRTHWT >2499G, NORMAL NEWBORN OR NEONATE W OTHER PROBLEM | 7555 |
| VAGINAL DELIVERY | 6066 |
| SEPTICEMIA & DISSEMINATED INFECTIONS | 3204 |
| CESAREAN DELIVERY | 2071 |
| HEART FAILURE | 1595 |
| ... | ... |
| UNGROUPABLE | 1 |
| GASTRIC FUNDOPLICATION | 1 |
| CRANIOTOMY FOR MULTIPLE SIGNIFICANT TRAUMA | 1 |
| BRAIN CONTUSION/LACERATION & COMPLICATED SKULL FX, COMA < 1 HR OR NO COMA | 1 |


In [None]:
df.value_counts('Facility Name',dropna=False)

In [None]:
df.value_counts('Permanent Facility Id',dropna=False)

In [None]:
df.value_counts('Payment Typology 3', dropna=False)

In [None]:
df.value_counts('Type of Admission', dropna=False)

In [None]:
df.value_counts('Emergency Department Indicator',dropna=False)

In [None]:
df.value_counts('Patient Disposition',dropna=False)

In [None]:
df.info()
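
Alongside `df.info()`, a per-column count of missing values can make gaps easier to spot. The `describe()` summary in this notebook reports 53377 non-null values for APR DRG Description out of 53390 rows, which implies 13 missing entries. A minimal sketch on invented data follows. The names `report`, `missing`, and `missing_pct` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing description.
toy = pd.DataFrame({
    "APR DRG Description": ["ASTHMA", np.nan, "HEART FAILURE"],
    "Gender": ["F", "M", "F"],
})

# Absolute and relative number of missing values per column.
missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(2)
report = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})

# Show only the columns that actually have gaps.
print(report[report["missing"] > 0])
```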

# Keep only selected columns

In [36]:

# columns to KEEP

columns_to_keep = [
    "Hospital County",
    "Facility Name",
    "Age Group",
    "Gender",
    "Race",
    "Ethnicity",
    "Length of Stay",
    "Type of Admission",
    "Patient Disposition",
    "APR DRG Description",
    "APR Severity of Illness Description",
    "APR Risk of Mortality",
    "Payment Typology 1",
    "Emergency Department Indicator",
    "Total Charges",
    "Total Costs"
]


df = df[columns_to_keep]
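
Selecting with a list of names raises a `KeyError` if any name is absent, for example when the cell is re-run on a file that has already been reduced. A defensive variant, sketched here on an in-memory CSV with invented content, checks the names first. The variables `wanted`, `present`, and `absent` are hypothetical helpers, not part of the original notebook.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the raw file (hypothetical content).
csv_text = "Facility Name,Gender,Discharge Year\nMaimonides Medical Center,F,2019\n"
frame = pd.read_csv(io.StringIO(csv_text))

# Columns we would like to keep; 'Total Costs' is deliberately absent here.
wanted = ["Facility Name", "Gender", "Total Costs"]

# Split the wish list into names that exist and names that do not.
present = [col for col in wanted if col in frame.columns]
absent = [col for col in wanted if col not in frame.columns]

if absent:
    print("Columns not found and skipped:", absent)

frame = frame[present]
print(frame.columns.tolist())
```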

## Overview and Structure of the DataFrame

In [37]:
df.head()

| | Hospital County | Facility Name | Age Group | Gender | Race | Ethnicity | Length of Stay | Type of Admission | Patient Disposition | APR DRG Description | APR Severity of Illness Description | APR Risk of Mortality | Payment Typology 1 | Emergency Department Indicator | Total Charges | Total Costs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kings | Maimonides Medical Center | 50 to 69 | F | Other Race | Not Span/Hispanic | 21 | Urgent | Skilled Nursing Home | CARDIAC VALVE PROCEDURES W AMI OR COMPLEX PDX | Extreme | Extreme | Medicaid | False | $401,807.05 | $100,294.11 |
| 1 | Kings | Maimonides Medical Center | 0 to 17 | F | Black/African American | Not Span/Hispanic | 3 | Emergency | Home or Self Care | ASTHMA | Major | Moderate | Medicaid | True | $46,725.01 | $23,058.16 |
| 2 | Kings | Maimonides Medical Center | 0 to 17 | F | White | Unknown | 1 | Newborn | Home or Self Care | NEONATE BIRTHWT >2499G, NORMAL NEWBORN OR NEON... | Minor | Minor | Medicaid | False | $3,393 | $490.45 |
| 3 | Kings | Maimonides Medical Center | 18 to 29 | F | White | Unknown | 2 | Urgent | Home or Self Care | VAGINAL DELIVERY | Minor | Minor | Medicaid | True | $20,858.02 | $5,445.28 |
| 4 | Kings | Maimonides Medical Center | 18 to 29 | F | White | Not Span/Hispanic | 2 | Urgent | Home or Self Care | VAGINAL DELIVERY | Minor | Minor | Medicaid | False | $21,116.01 | $5,295.36 |


In [44]:
print("Number of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])

Number of Rows: 53390
Number of Columns: 16
