# Patient Administrative Outcomes Predictive Model

## 0. Project Init

Install required dependencies

In [2]:
# !pip install pyspark


Start a PySpark session

In [63]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

import json

spark = SparkSession.builder.appName("FhirDataApplication").getOrCreate()
display(spark)

## 1. Load Datasets

### 1.1 Patient Resource
- `patientId`: Unique identifier for the patient.
- `gender`: Gender of the patient (male/female).
- `birthDate`: Date of birth.
- `maritalStatus`: Marital status code (for example, `M` or `UNK`).
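The mapping from a nested FHIR Patient resource to these four flat attributes can be sketched in plain Python. This is a minimal illustration, not the notebook's Spark code: the sample resource is hypothetical and heavily trimmed, and `flatten_patient` is a helper name introduced here.

```python
import json

# Hypothetical, heavily trimmed FHIR Patient resource (not a real MIMIC record).
raw = """
{
  "resourceType": "Patient",
  "id": "example-patient-1",
  "gender": "female",
  "birthDate": "2090-01-15",
  "maritalStatus": {"coding": [{"code": "M"}]}
}
"""

def flatten_patient(resource: dict) -> dict:
    """Pick the four attributes listed above from one Patient resource."""
    codings = (resource.get("maritalStatus") or {}).get("coding") or []
    return {
        "patientId": resource.get("id"),
        "gender": resource.get("gender"),
        "birthDate": resource.get("birthDate"),
        # Only the first coding is used, mirroring maritalStatus.coding[0].code.
        "maritalStatus": codings[0].get("code") if codings else None,
    }

patient_row = flatten_patient(json.loads(raw))
print(patient_row)
```

A resource without `maritalStatus` yields `None` for that attribute, which corresponds to a SQL `NULL` in the flattened table.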

In [7]:
# Load the explicit Spark schema for the FHIR Patient resource
with open("schema/Patient.json") as f:
    schema = T.StructType.fromJson(json.load(f))

# Read the gzipped NDJSON export with that schema instead of inferring one
df_MimicPatient = spark.read.schema(schema).json("_dataset/MimicPatient.ndjson.gz")

df_MimicPatient.createOrReplaceTempView("df_patient")

24/10/28 00:08:59 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


In [28]:
fm_patient = spark.sql("""
select 
    id AS patientId,
    gender AS gender,
    to_date(birthDate) birthDate,
    maritalStatus.coding[0].code AS maritalStatus
from df_patient
""")
fm_patient.createOrReplaceTempView('fm_patient')
fm_patient.show(5, truncate=False)

+------------------------------------+------+----------+-------------+
|patientId                           |gender|birthDate |maritalStatus|
+------------------------------------+------+----------+-------------+
|c1a51556-8f1b-5ca1-a8b5-ece5e7bb8602|male  |2048-07-11|M            |
|363f1c3f-87b7-54f9-b22e-d9ed1bc03906|female|2131-12-18|UNK          |
|e0461572-bda1-5fc1-9703-607152294019|male  |2091-06-06|UNK          |
|d16c5811-06c5-5f7f-87d7-fe2adb7bd10c|male  |2105-04-05|UNK          |
|7458d66f-3fd9-5208-9223-4fdf0bacbfb8|male  |2129-02-16|UNK          |
+------------------------------------+------+----------+-------------+
only showing top 5 rows



In [15]:
# fm_patient.write.mode("overwrite").parquet("_output/fm_patient")


### 1.2 Encounter

The Encounter is a resource that represents an interaction between a patient and healthcare provider(s) for the purpose of providing healthcare services or assessing the patient's health status.

It records the full span of a hospital stay, including admission, stay and discharge. It includes details such as admission start and end time, context for the admission and patient movements within the hospital.


**Table Attributes**

- `encounterId`: Unique identifier for the encounter.
- `patientId`: Unique identifier for the patient.
- `ref_encounterId`: Reference to a related encounter (from `partOf`); it can point to either a past or a future encounter.
- `periodStart`: Start timestamp of the encounter period.
- `periodEnd`: End timestamp of the encounter period.
- `duration`: Total duration of the encounter, in days.
- `status`: Current status of the encounter (e.g., planned, in-progress, finished).
- `encounterClass`: Classification of the encounter (e.g., inpatient, outpatient); helps categorize the nature of the healthcare service provided.
- `codedType`: Code representing the specific type of encounter.
- `displayType`: Display name for the type of encounter.
- `systemType`: System from which the type code is derived.
- `priority`: Urgency of the encounter, such as routine, urgent, or emergency; missing values default to `emergency`.
- `nextEncounterId`: Identifier of the same patient's next encounter, ordered by start time.
- `readmissionStatus`: `Readmission` if the next encounter starts within 30 days of this encounter's end, otherwise `No Readmission`.
- `sourceName`: Name of the source file the record was read from.
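The readmission label that the encounter query derives with `LEAD(...)` can be sketched in plain Python. This is an illustration under assumptions, not the notebook's Spark code: the input shape (a list of dicts with `datetime` values), the sample encounters, and the helper name `flag_readmissions` are hypothetical.

```python
from datetime import datetime

def flag_readmissions(encounters, window_days=30):
    """Return {encounterId: (nextEncounterId, readmissionStatus)}.

    `encounters` is a list of dicts with keys patientId, encounterId,
    start and end (datetime objects); this shape is an assumption.
    """
    by_patient = {}
    for enc in encounters:
        by_patient.setdefault(enc["patientId"], []).append(enc)

    result = {}
    for rows in by_patient.values():
        rows.sort(key=lambda e: e["start"])  # ORDER BY period.start
        for i, current in enumerate(rows):
            # Equivalent of LEAD(...): the patient's next encounter, if any.
            nxt = rows[i + 1] if i + 1 < len(rows) else None
            if nxt is not None and (nxt["start"] - current["end"]).days <= window_days:
                status = "Readmission"
            else:
                status = "No Readmission"
            result[current["encounterId"]] = (nxt["encounterId"] if nxt else None, status)
    return result

# Hypothetical encounters for two patients.
encounters = [
    {"patientId": "p1", "encounterId": "a", "start": datetime(2130, 1, 1), "end": datetime(2130, 1, 5)},
    {"patientId": "p1", "encounterId": "b", "start": datetime(2130, 1, 20), "end": datetime(2130, 1, 22)},
    {"patientId": "p1", "encounterId": "d", "start": datetime(2130, 6, 1), "end": datetime(2130, 6, 3)},
    {"patientId": "p2", "encounterId": "c", "start": datetime(2131, 3, 1), "end": datetime(2131, 3, 2)},
]
flags = flag_readmissions(encounters)
print(flags)
```

The label sits on the earlier encounter of each pair: it answers "was this stay followed by another one within 30 days?", which is the form a prediction target needs.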

In [17]:
with open("schema/Encounter.json") as f:
    schema_read = json.loads(f.read())
schema =  T.StructType.fromJson(schema_read)
df_MimicEncounter =  spark.read.schema(schema).json("_dataset/MimicEncounter.ndjson.gz") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_MimicEncounterED =  spark.read.schema(schema).json("_dataset/MimicEncounterED.ndjson.gz") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_encounter = df_MimicEncounter.union(df_MimicEncounterED)
df_encounter.createOrReplaceTempView("df_encounter")

In [29]:
fm_encounter = spark.sql("""
SELECT
    id AS encounterId,
    replace(subject.reference, "Patient/", "") AS patientId,
    replace(partOf.reference, "Encounter/", "") AS ref_encounterId,
    CAST(period.start AS timestamp) periodStart,
    CAST(period.end AS timestamp) periodEnd,
    date_diff(day, periodStart, periodEnd) duration,
    status AS status,
    class.code AS encounterClass,
    type[0].coding[0].display AS displayType,
    nvl(priority.coding[0].display, 'emergency') AS priority,
    -- Next EncounterID & Readmission Status
    LEAD(id) OVER (PARTITION BY subject.reference ORDER BY period.start) AS nextEncounterId,
    CASE 
        WHEN DATEDIFF(day, period.end, LEAD(period.start) OVER (PARTITION BY subject.reference ORDER BY period.start)) <= 30 THEN 'Readmission'
        ELSE 'No Readmission'
    END AS readmissionStatus,
    sourceName
FROM df_encounter
""")
fm_encounter.createOrReplaceTempView('fm_encounter')
fm_encounter.show(5, truncate = False)



+------------------------------------+------------------------------------+------------------------------------+-------------------+-------------------+--------+--------+--------------+---------------------------+---------+------------------------------------+-----------------+--------------------------+
|encounterId                         |patientId                           |ref_encounterId                     |periodStart        |periodEnd          |duration|status  |encounterClass|displayType                |priority |nextEncounterId                     |readmissionStatus|sourceName                |
+------------------------------------+------------------------------------+------------------------------------+-------------------+-------------------+--------+--------+--------------+---------------------------+---------+------------------------------------+-----------------+--------------------------+
|c2c18e9c-1177-5b71-aa93-e1eae57868b5|0000ad47-7103-5f54-970d-dafc42fd12f9|NULL   


In [19]:
# fm_encounter.write.mode("overwrite").parquet("_output/fm_encounter")


### 1.3 Condition

The Condition resource in FHIR is used to record detailed information about a patient’s health state, including diagnoses, problems, or other clinical concerns. Here are the key points:

- Scope and Usage: It captures conditions that have risen to a level of concern, such as diseases, health issues, or post-procedure states.
- Clinical Context: Conditions can be recorded based on a clinician’s assessment or expressed by the patient or care team members.
- Examples: Conditions like pregnancy, post-surgical states, or chronic illnesses can be documented. It can also include social determinants of health like unemployment or lack of transportation.

Data preprocessing: the condition codes come from two code systems, ICD-9 and ICD-10, so they need to be standardized to a single system. GEMs (General Equivalence Mappings) are crosswalks between ICD-9-CM and ICD-10-CM codes; they support mapping in both directions.

https://www.cms.gov/Medicare/Coding/ICD10/Downloads/ICD-10_GEM_fact_sheet.pdf

**Table Attributes**

- `conditionId`: The unique identifier for the condition.
- `patientId`: The unique identifier for the patient.
- `encounterId`: The unique identifier for the encounter.
- `categoryCode`: Condition category.
- `conditionCode`: The code representing the specific condition.
- `conditionDisplay`: The display name for the condition.
- `conditionSystem`: The system from which the condition code is derived.

In [20]:
with open("schema/Condition.json") as f:
    schema_read = json.loads(f.read())
schema =  T.StructType.fromJson(schema_read)
df_MimicCondition =  spark.read.schema(schema).json("_dataset/MimicCondition.ndjson.gz") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_MimicConditionED =  spark.read.schema(schema).json("_dataset/MimicConditionED.ndjson.gz") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_condition = df_MimicCondition.union(df_MimicConditionED)
# Remove corrupt records (rows without an id)
df_condition = df_condition.filter(df_condition["id"].isNotNull())
df_condition.createOrReplaceTempView("df_condition")

In [30]:
fm_condition = spark.sql("""
SELECT
    id AS conditionId,
    replace(subject.reference, "Patient/", "") AS patientId,
    replace(encounter.reference, "Encounter/", "") AS encounterId,
    category[0].coding[0].code categoryCode,
    code.coding[0].code AS conditionCode,
    code.coding[0].display AS conditionDisplay,
    code.coding[0].system AS conditionSystem,
    sourceName
FROM df_condition
""")
fm_condition.createOrReplaceTempView('fm_condition')
fm_condition.show(5, truncate=False)

+------------------------------------+------------------------------------+------------------------------------+-------------------+-------------+--------------------------------------------------+---------------------------------------------------------------+------------------------+
|conditionId                         |patientId                           |encounterId                         |categoryCode       |conditionCode|conditionDisplay                                  |conditionSystem                                                |sourceName              |
+------------------------------------+------------------------------------+------------------------------------+-------------------+-------------+--------------------------------------------------+---------------------------------------------------------------+------------------------+
|10ce1ed4-c6cc-59b7-99b3-f16d355642c3|00000027-c5e0-554f-8e85-b097c3b177d4|c8816dcb-47f0-55f0-933f-122d0caac629|encounter-diagnosis|7455   

In [23]:
# fm_condition.write.mode("overwrite").parquet("_output/fm_condition")


### 1.4 Procedure

The Procedure resource in FHIR is used to record details of current and historical procedures performed on or for a patient.

**Table Attributes**

- `procedureId`: The unique identifier for the procedure.
- `patientId`: The unique identifier for the patient.
- `encounterId`: The unique identifier for the encounter.
- `status`: The current status of the procedure (e.g., completed, in-progress, not-done).
- `performedDateTime`: The date and time when the procedure was performed.
- `procedureCode`: The code representing the specific procedure.
- `procedureDisplay`: The display name for the procedure.
- `procedureSystem`: The system from which the procedure code is derived.
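A FHIR Procedure records its timing either as `performedDateTime` or as `performedPeriod`, and the query in this section falls back from the first to the start of the second. A minimal plain-Python sketch of that fallback follows; the helper name `performed_date` and the sample resources are hypothetical, and the sketch assumes full ISO dates and ignores time zone conversion.

```python
from datetime import date

def performed_date(resource: dict):
    """Return the procedure date, preferring performedDateTime over performedPeriod.start."""
    raw = resource.get("performedDateTime")
    if raw is None:
        raw = (resource.get("performedPeriod") or {}).get("start")
    if raw is None:
        return None
    # Assumption: the value starts with a full YYYY-MM-DD date.
    return date.fromisoformat(raw[:10])

# Hypothetical, trimmed Procedure resources.
with_datetime = {"performedDateTime": "2130-03-04T10:00:00-05:00"}
with_period = {"performedPeriod": {"start": "2131-07-09T08:30:00-04:00", "end": "2131-07-09T09:00:00-04:00"}}

print(performed_date(with_datetime), performed_date(with_period), performed_date({}))
```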

In [24]:
with open("schema/Procedure.json") as f:
    schema_read = json.loads(f.read())
schema =  T.StructType.fromJson(schema_read)
df_MimicProcedure =  spark.read.schema(schema).json("_dataset/MimicProcedure.ndjson.gz") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_MimicProcedureED =  spark.read.schema(schema).json("_dataset/MimicProcedureED.ndjson.gz") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_procedure = df_MimicProcedure.union(df_MimicProcedureED)
df_procedure.createOrReplaceTempView("df_procedure")

In [31]:
fm_procedure = spark.sql("""
SELECT
    id AS procedureId,
    replace(subject.reference, "Patient/", "") AS patientId,
    replace(encounter.reference, "Encounter/", "") AS encounterId,
    status AS status,
    nvl(to_date(performedDateTime), to_date(performedPeriod.start)) AS performedDateTime,
    code.coding[0].code AS procedureCode,
    code.coding[0].display AS procedureDisplay,
    code.coding[0].system AS procedureSystem,
    sourceName
FROM df_procedure
""")
fm_procedure.createOrReplaceTempView('fm_procedure')
fm_procedure.show(5, truncate = False)

+------------------------------------+------------------------------------+------------------------------------+---------+-----------------+-------------+--------------------------------------------------------------+---------------------------------------------------------------+------------------------+
|procedureId                         |patientId                           |encounterId                         |status   |performedDateTime|procedureCode|procedureDisplay                                              |procedureSystem                                                |sourceName              |
+------------------------------------+------------------------------------+------------------------------------+---------+-----------------+-------------+--------------------------------------------------------------+---------------------------------------------------------------+------------------------+
|06572895-c093-5678-a2ee-52025f3fe961|00000027-c5e0-554f-8e85-b097c3b177d4|c881

In [26]:
# fm_procedure.write.mode("overwrite").parquet("_output/fm_procedure")


## 2. Data Preparation 

### 2.1. Standardize Condition Coding System

The condition domain uses two coding systems: ICD-9 and ICD-10.

To facilitate the transition, CMS and CDC developed the GEMs, which act as crosswalks that translate ICD-9 codes to their ICD-10 equivalents.

Learn more about this crosswalk at [ICD-9-CM to and from ICD-10-CM and ICD-10-PCS Crosswalk or General Equivalence Mappings](https://www.nber.org/research/data/icd-9-cm-and-icd-10-cm-and-icd-10-pcs-crosswalk-or-general-equivalence-mappings).
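A crosswalk lookup of this kind can be sketched in plain Python. The rows below are a toy table, not the GEM file itself: the `TOY9`/`TOY10A`/`TOY10B` entries are made-up placeholders that show a one-to-many mapping, and the helper names are hypothetical. As with an inner join on the crosswalk, an ICD-9 code with several targets produces several rows, and an ICD-9 code with no entry is dropped.

```python
from collections import defaultdict

# Toy crosswalk rows (icd9, icd10); the TOY* codes are placeholders.
gem_rows = [
    ("4019", "I10"),
    ("25000", "E119"),
    ("TOY9", "TOY10A"),
    ("TOY9", "TOY10B"),
]

def build_crosswalk(rows):
    """Group ICD-10 targets by their ICD-9 source code."""
    crosswalk = defaultdict(list)
    for icd9, icd10 in rows:
        crosswalk[icd9].append(icd10)
    return crosswalk

def to_icd10(conditions, crosswalk):
    """conditions: list of (conditionId, system, code) tuples."""
    out = []
    for condition_id, system, code in conditions:
        if system.endswith("icd10"):
            out.append((condition_id, code))  # already ICD-10, keep as is
        elif system.endswith("icd9"):
            for target in crosswalk.get(code, []):  # one row per mapped target
                out.append((condition_id, target))
    return out

conditions = [
    ("c1", "mimic-diagnosis-icd9", "4019"),
    ("c2", "mimic-diagnosis-icd10", "E119"),
    ("c3", "mimic-diagnosis-icd9", "TOY9"),
    ("c4", "mimic-diagnosis-icd9", "UNMAPPED"),
]
mapped = to_icd10(conditions, build_crosswalk(gem_rows))
print(mapped)
```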

In [35]:
spark.sql("""
SELECT conditionSystem, count(*) FROM fm_condition
group by 1
order by 1 desc
""").show(truncate=False)



+----------------------------------------------------------------+--------+
|conditionSystem                                                 |count(1)|
+----------------------------------------------------------------+--------+
|http://mimic.mit.edu/fhir/mimic/CodeSystem/mimic-diagnosis-icd9 |3209891 |
|http://mimic.mit.edu/fhir/mimic/CodeSystem/mimic-diagnosis-icd10|2445484 |
+----------------------------------------------------------------+--------+




ICD-10 code structure:
- Characters 1-3: category of the diagnosis
- Characters 4-6: etiology, anatomic site, severity, or other clinical detail
- Character 7: extension
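The structure above can be illustrated with a small plain-Python helper that slices a code into its three parts. The name `split_icd10` is hypothetical, and the sketch assumes codes may arrive with or without the decimal point.

```python
def split_icd10(code: str) -> dict:
    """Split an ICD-10 code into category, detail, and extension parts."""
    code = code.replace(".", "").upper()  # tolerate dotted codes such as "E11.9"
    return {
        "category": code[:3],    # characters 1-3
        "detail": code[3:6],     # characters 4-6, may be shorter or empty
        "extension": code[6:7],  # character 7, empty when absent
    }

short_code = split_icd10("E11.9")
long_code = split_icd10("S52521A")
print(short_code, long_code)
```

Grouping on the `category` part alone is one way to reduce the number of distinct values when condition codes are used as model features.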

In [40]:
ClinicalModificationGem_df = spark.read.csv("_dataset/icd9toicd10cmgem.csv", header = True)
ClinicalModificationGem_df.createOrReplaceTempView("ClinicalModificationGem")

In [45]:
standardize_condition_df = spark.sql("""
    SELECT
        *
    FROM fm_condition condition 
    WHERE condition.conditionSystem LIKE '%icd10'
    
    UNION
    
    SELECT
        condition.conditionId,
        condition.patientId,
        condition.encounterId,
        condition.categoryCode,
        ClinicalModificationGem.icd10cm AS conditionCode,
        CAST(NULL AS STRING) conditionDisplay,
        CAST(NULL AS STRING) AS conditionSystem,
        condition.sourceName
    FROM fm_condition condition 
    JOIN ClinicalModificationGem
    ON condition.conditionCode = ClinicalModificationGem.icd9cm
    WHERE condition.conditionSystem LIKE '%icd9'
""")

standardize_condition_df.createOrReplaceTempView("ICD10Condition")

In [46]:
condition_pivot_df = spark.sql("""
    WITH RankedConditions AS (
        SELECT 
            patientId,
            encounterId,
            conditionCode,
            COUNT(*) AS conditionCount,
            ROW_NUMBER() OVER (PARTITION BY encounterId ORDER BY COUNT(*) DESC) AS rank
        FROM ICD10Condition
        GROUP BY patientId, encounterId, conditionCode
    )
    SELECT
        patientId,
        encounterId,
        MAX(CASE WHEN rank = 1 THEN conditionCode END) AS condition_1,
        MAX(CASE WHEN rank = 2 THEN conditionCode END) AS condition_2,
        MAX(CASE WHEN rank = 3 THEN conditionCode END) AS condition_3
    FROM RankedConditions
    GROUP BY patientId, encounterId
""")

In [81]:
# condition_pivot_df.write.mode("overwrite").parquet("_output/fm2_condition")
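The ranking and pivot step (count each code per encounter, then keep the three most frequent as `condition_1` to `condition_3`) can be sketched in plain Python. The helper name `top_codes` and the sample rows are hypothetical. The sketch breaks ties alphabetically by code, which is an added assumption: the SQL orders by count only, so the order of tied codes is not fixed there.

```python
from collections import Counter, defaultdict

def top_codes(rows, k=3):
    """rows: iterable of (encounterId, code). Returns {encounterId: [code_1, ..., code_k]}."""
    counts = defaultdict(Counter)
    for encounter_id, code in rows:
        counts[encounter_id][code] += 1

    result = {}
    for encounter_id, counter in counts.items():
        # Highest count first; ties broken by code (assumption, see above).
        ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        codes = [code for code, _ in ranked[:k]]
        # Pad with None so every encounter has exactly k slots, like NULL columns.
        result[encounter_id] = codes + [None] * (k - len(codes))
    return result

rows = [("e1", "A"), ("e1", "B"), ("e1", "B"), ("e2", "C")]
pivot = top_codes(rows)
print(pivot)
```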

### 2.2. Standardize Procedure Coding System

In [48]:
spark.sql("""
SELECT procedureSystem, count(*) FROM fm_procedure
group by 1
order by 1 desc
""").show(truncate=False)



+----------------------------------------------------------------+--------+
|procedureSystem                                                 |count(1)|
+----------------------------------------------------------------+--------+
|http://snomed.info/sct                                          |1989697 |
|http://mimic.mit.edu/fhir/mimic/CodeSystem/mimic-procedure-icd9 |446079  |
|http://mimic.mit.edu/fhir/mimic/CodeSystem/mimic-procedure-icd10|223107  |
+----------------------------------------------------------------+--------+




In [52]:
spark.sql("""
select distinct procedureDisplay from fm_procedure
where procedureSystem like '%sct'       
""").show(5, truncate=False)



+-------------------------------------------------+
|procedureDisplay                                 |
+-------------------------------------------------+
|Taking patient vital signs assessment (procedure)|
|Triage: emergency center (procedure)             |
+-------------------------------------------------+




In [49]:
ProcedureCodingSystemGem_df = spark.read.csv("_dataset/icd9toicd10pcsgem.csv", header = True)
ProcedureCodingSystemGem_df.createOrReplaceTempView("ProcedureCodingSystemGem")

In [57]:
standardize_procedure_df = spark.sql("""
    SELECT
        *
    FROM fm_procedure procedure 
    WHERE procedure.procedureSystem LIKE '%icd10'
    
    UNION
    
    SELECT
        procedure.procedureId,
        procedure.patientId,
        procedure.encounterId,
        procedure.status,
        procedure.performedDateTime,
        ProcedureCodingSystemGem.icd10cm AS procedureCode,
        CAST(NULL AS STRING) procedureDisplay,
        CAST(NULL AS STRING) AS procedureSystem,
        procedure.sourceName
    FROM fm_procedure procedure 
    JOIN ProcedureCodingSystemGem
    ON procedure.procedureCode = ProcedureCodingSystemGem.icd9cm
    WHERE procedure.procedureSystem LIKE '%icd9'
""")

standardize_procedure_df.createOrReplaceTempView("ICD10procedure")

In [59]:
procedure_pivot_df = spark.sql("""
    WITH Rankedprocedures AS (
        SELECT 
            patientId,
            encounterId,
            procedureCode,
            COUNT(*) AS procedureCount,
            ROW_NUMBER() OVER (PARTITION BY encounterId ORDER BY COUNT(*) DESC) AS rank
        FROM ICD10procedure
        GROUP BY patientId, encounterId, procedureCode
    )
    SELECT
        patientId,
        encounterId,
        MAX(CASE WHEN rank = 1 THEN procedureCode END) AS procedure_1,
        MAX(CASE WHEN rank = 2 THEN procedureCode END) AS procedure_2,
        MAX(CASE WHEN rank = 3 THEN procedureCode END) AS procedure_3
    FROM Rankedprocedures
    GROUP BY patientId, encounterId
""")

In [61]:
# procedure_pivot_df.write.mode("overwrite").parquet("_output/fm2_procedure")

### 2.3. Consolidated Encounter

In [64]:
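# These reads expect the Parquet outputs of the write cells above, which are
# commented out in this notebook; re-enable those writes on a first run.
fm_encounter = spark.read.parquet("_output/fm_encounter/")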
fm_encounter.createOrReplaceTempView('fm_encounter')

fm_patient = spark.read.parquet("_output/fm_patient/")
fm_patient.createOrReplaceTempView('fm_patient')

fm2_condition = spark.read.parquet("_output/fm2_condition/")
fm2_condition.createOrReplaceTempView('fm2_condition')

fm2_procedure = spark.read.parquet("_output/fm2_procedure/")
fm2_procedure.createOrReplaceTempView('fm2_procedure')


In [70]:
encounter_df = spark.sql("""
SELECT
    encounter.encounterId,
    encounter.patientId,
    patient.gender,
    -- Age in years at encounter start. datediff(endDate, startDate) returns
    -- endDate - startDate, so the encounter start must come first.
    CASE 
        WHEN ROUND(datediff(encounter.periodStart, patient.birthDate)/365, 0) BETWEEN 18 AND 29 THEN 'young adults'
        WHEN ROUND(datediff(encounter.periodStart, patient.birthDate)/365, 0) BETWEEN 30 AND 39 THEN 'young adulthood'
        WHEN ROUND(datediff(encounter.periodStart, patient.birthDate)/365, 0) BETWEEN 40 AND 49 THEN 'early-middle age'
        WHEN ROUND(datediff(encounter.periodStart, patient.birthDate)/365, 0) BETWEEN 50 AND 59 THEN 'late-middle age'
        WHEN ROUND(datediff(encounter.periodStart, patient.birthDate)/365, 0) BETWEEN 60 AND 69 THEN 'mid-old age'
        WHEN ROUND(datediff(encounter.periodStart, patient.birthDate)/365, 0) BETWEEN 70 AND 79 THEN 'senior-old age'
        WHEN ROUND(datediff(encounter.periodStart, patient.birthDate)/365, 0) BETWEEN 80 AND 89 THEN 'very senior-old'
        WHEN ROUND(datediff(encounter.periodStart, patient.birthDate)/365, 0) BETWEEN 90 AND 115 THEN 'centenarians'
        ELSE 'other age groups'
    END AS ageGroup,
    patient.maritalStatus,
    encounter.duration encounterDuration,
    encounter.status encounterStatus,
    encounter.encounterClass,
    encounter.displayType encounterType,
    encounter.priority,
    -- Conditions
    condition.condition_1,
    condition.condition_2,
    condition.condition_3,
    -- Procedures
    procedure.procedure_1,
    procedure.procedure_2,
    procedure.procedure_3,
    encounter.readmissionStatus,
    encounter.sourceName
FROM fm_encounter encounter
LEFT JOIN fm_patient patient
    ON encounter.patientId = patient.patientId
LEFT JOIN fm2_condition condition
    ON encounter.encounterId = condition.encounterId
    AND encounter.patientId = condition.patientId
LEFT JOIN fm2_procedure procedure
    ON encounter.encounterId = procedure.encounterId
    AND encounter.patientId = procedure.patientId
""")
# encounter_df.write.mode("overwrite").parquet("_output/encounter")
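
The `ageGroup` bucketing can be sketched in plain Python. The labels and bounds are copied from the `CASE` expression; the helper name `age_group` and the sample dates are hypothetical, and the sketch assumes age is measured from the birth date to the encounter start, in days divided by 365 and rounded.

```python
from datetime import date

# (lower bound, upper bound, label), copied from the CASE expression in the query.
AGE_GROUPS = [
    (18, 29, "young adults"),
    (30, 39, "young adulthood"),
    (40, 49, "early-middle age"),
    (50, 59, "late-middle age"),
    (60, 69, "mid-old age"),
    (70, 79, "senior-old age"),
    (80, 89, "very senior-old"),
    (90, 115, "centenarians"),
]

def age_group(birth_date: date, encounter_start: date) -> str:
    """Bucket a patient's approximate age at encounter start."""
    age = round((encounter_start - birth_date).days / 365)
    for low, high, label in AGE_GROUPS:
        if low <= age <= high:
            return label
    return "other age groups"  # under 18, over 115, or a negative age

adult = age_group(date(2048, 7, 11), date(2100, 7, 11))
child = age_group(date(2100, 1, 1), date(2110, 1, 1))
print(adult, child)
```

If the two dates are swapped, the age becomes negative and every row falls into `other age groups`, so the argument order is worth checking against the output distribution.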
