# Patient Administrative Outcomes Predictive Model

Data Source: https://physionet.org/content/mimic-iv-fhir/1.0/

- [Methods for De-identification of PHI | Health and Human Services](https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html)
- [General Equivalence Mappings GEMS](https://society.asco.org/practice-policy/billing-coding-reporting/icd-10/general-equivalence-mappings-gems)

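- [ICD-9-CM to and from ICD-10-CM and ICD-10-PCS Crosswalk or General Equivalence Mappings (download)](https://www.nber.org/research/data/icd-9-cm-and-icd-10-cm-and-icd-10-pcs-crosswalk-or-general-equivalence-mappings)

- ICD-10 is the newer revision, so ICD-9 codes are converted to ICD-10.
- The first three characters of an ICD-10 code define its category.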

**FHIR Resources**
- Patient: For demographic information.
- Encounter: To track patient visits and interactions.
- Condition: For diagnoses and health conditions.
- Procedure: For medical procedures performed.
- Observation: For lab results and vital signs.

## 0. Project Init

In [None]:
import os
import pandas as pd
import json
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.appName("FhirDataApplication").getOrCreate()
display(spark)

## 1. Load Datasets

### 1.1 Patient
- `patientId`: Unique identifier for the patient.
- `gender`: Gender of the patient (male/female).
- `birthDate`: Date of birth.
- `maritalStatus`: Marital status code.

In [3]:
with open("schema/Patient.json") as f:
    schema_read = json.loads(f.read())
schema =  T.StructType.fromJson(schema_read)
df_MimicPatient =  spark.read.schema(schema).json("_dataset/MimicPatient.ndjson")
    
df_MimicPatient.createOrReplaceTempView("df_patient")

In [None]:
fm_patient = spark.sql("""
select 
    id AS patientId,
    gender AS gender,
    to_date(birthDate) birthDate,
    maritalStatus.coding[0].code AS maritalStatus
from df_patient
""")
print('Total Patient:', fm_patient.count())
fm_patient.createOrReplaceTempView("patient")
# fm_patient.toPandas().to_parquet('_dataset/MartPatient.parquet', index=False)
display(fm_patient.toPandas().head(5))
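To make the field paths in the query above concrete, here is a minimal plain-Python sketch of the same flattening applied to NDJSON text. The helper name `flatten_patient` and the sample records are hypothetical and only illustrate the shape of a FHIR Patient resource; they are not taken from the MIMIC files. Unlike the query, the sketch leaves `birthDate` as a string.

```python
import json

def flatten_patient(resource):
    """Pick the four columns used in this notebook from one FHIR Patient resource."""
    # maritalStatus is optional; fall back to an empty coding so the lookup is safe.
    codings = (resource.get("maritalStatus") or {}).get("coding") or [{}]
    return {
        "patientId": resource.get("id"),
        "gender": resource.get("gender"),
        "birthDate": resource.get("birthDate"),
        "maritalStatus": codings[0].get("code"),
    }

# Hypothetical NDJSON content: one JSON document per line.
sample_ndjson = "\n".join([
    json.dumps({"resourceType": "Patient", "id": "p-001", "gender": "female",
                "birthDate": "2080-05-17",
                "maritalStatus": {"coding": [{"code": "M"}]}}),
    json.dumps({"resourceType": "Patient", "id": "p-002", "gender": "male",
                "birthDate": "2101-11-02"}),
])

patients = [
    flatten_patient(json.loads(line))
    for line in sample_ndjson.splitlines()
    if line.strip()
]
```

A record without `maritalStatus` yields `None` in that column, which matches the NULL that Spark SQL returns for a missing nested field.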

### 1.2 Encounter

The Encounter is a resource that represents an interaction between a patient and healthcare provider(s) for the purpose of providing healthcare services or assessing the patient's health status.

It records the full span of a hospital stay, including admission, stay and discharge. It includes details such as admission start and end time, context for the admission and patient movements within the hospital.


**Table Attributes**

- `encounterId`: Unique identifier for the encounter.
- `patientId`: Unique identifier for the patient.
- `ref_encounterId`: Reference to a related encounter; can reference both future and past encounters.
- `periodStart`: Start timestamp of the encounter period.
- `periodEnd`: End timestamp of the encounter period.
- `duration`: Total duration of the encounter in days.
- `status`: Current status of the encounter (e.g., planned, in-progress, finished).
- `encounterClass`: Classification of the encounter (e.g., inpatient, outpatient); Helps categorize the nature of the healthcare service provided.
- `codedType`: Code representing the specific type of encounter.
- `displayType`: Display name for the type of encounter.
- `systemType`: System from which the type code is derived.
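- `priority`: Urgency of the encounter, such as routine, urgent, or emergency. Defaults to `emergency` when the source record has no priority.
- `nextEncounterId`: Identifier of the patient's next encounter, ordered by start time.
- `readmissionStatus`: `Readmission` if the patient's next encounter starts within 30 days of this encounter's end, otherwise `No Readmission`.
- `sourceName`: Name of the source file the record was read from.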

In [3]:
with open("schema/Encounter.json") as f:
    schema_read = json.loads(f.read())
schema =  T.StructType.fromJson(schema_read)
df_MimicEncounter =  spark.read.schema(schema).json("_dataset/MimicEncounter.ndjson") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_MimicEncounterED =  spark.read.schema(schema).json("_dataset/MimicEncounterED.ndjson") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

# df_MimicEncounterICU =  spark.read.schema(schema).json("_dataset/MimicEncounterICU.ndjson") \
#     .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_encounter = df_MimicEncounter.union(df_MimicEncounterED) #.union(df_MimicEncounterICU)
df_encounter.createOrReplaceTempView("df_encounter")

In [None]:
fm_encounter = spark.sql("""
SELECT
    id AS encounterId,
    replace(subject.reference, "Patient/", "") AS patientId,
    replace(partOf.reference, "Encounter/", "") AS ref_encounterId,
    CAST(period.start AS timestamp) periodStart,
    CAST(period.end AS timestamp) periodEnd,
    date_diff(day, periodStart, periodEnd) duration,
    status AS status,
    class.code AS encounterClass,
    -- type[0].coding[0].code AS codedType,
    type[0].coding[0].display AS displayType,
    -- type[0].coding[0].system AS systemType,
    nvl(priority.coding[0].display, 'emergency') AS priority,
    -- Next EncounterID & Readmission Status
    LEAD(id) OVER (PARTITION BY subject.reference ORDER BY period.start) AS nextEncounterId,
    CASE 
        WHEN DATEDIFF(day, period.end, LEAD(period.start) OVER (PARTITION BY subject.reference ORDER BY period.start)) <= 30 THEN 'Readmission'
        ELSE 'No Readmission'
    END AS readmissionStatus,
    sourceName
FROM df_encounter
""")
fm_encounter.createOrReplaceTempView("encounter")
print('Total Encounter:', fm_encounter.count())
# fm_encounter.toPandas().to_parquet('_dataset/MartEncounter.parquet', index=False)
display(fm_encounter.toPandas().head(5))
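The `LEAD` window logic in the query above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's implementation: the function name `flag_readmissions` and the sample encounters are hypothetical, every encounter is assumed to have both timestamps, and `timedelta.days` only approximates Spark's day difference.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

def flag_readmissions(encounters, window_days=30):
    """Add nextEncounterId and readmissionStatus to each encounter dict."""
    ordered = sorted(encounters, key=itemgetter("patientId", "periodStart"))
    flagged = []
    for _, group in groupby(ordered, key=itemgetter("patientId")):
        visits = list(group)
        # Pair every visit with the one after it; the last visit is paired with None.
        for current, following in zip(visits, visits[1:] + [None]):
            gap_days = None
            if following is not None:
                gap_days = (following["periodStart"] - current["periodEnd"]).days
            readmitted = gap_days is not None and gap_days <= window_days
            flagged.append({
                **current,
                "nextEncounterId": following["encounterId"] if following else None,
                "readmissionStatus": "Readmission" if readmitted else "No Readmission",
            })
    return flagged

# Hypothetical encounters for two patients.
sample_encounters = [
    {"encounterId": "e1", "patientId": "p1",
     "periodStart": datetime(2130, 1, 1), "periodEnd": datetime(2130, 1, 5)},
    {"encounterId": "e2", "patientId": "p1",
     "periodStart": datetime(2130, 1, 20), "periodEnd": datetime(2130, 1, 22)},
    {"encounterId": "e3", "patientId": "p1",
     "periodStart": datetime(2130, 6, 1), "periodEnd": datetime(2130, 6, 3)},
    {"encounterId": "e4", "patientId": "p2",
     "periodStart": datetime(2131, 3, 10), "periodEnd": datetime(2131, 3, 11)},
]

flagged = flag_readmissions(sample_encounters)
```

Note that a negative gap (the next encounter starts before the current one ends) also satisfies `<= 30`, both in this sketch and in the query, so overlapping encounters are labelled `Readmission`.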

### 1.3 Condition

The Condition resource in FHIR is used to record detailed information about a patient’s health state, including diagnoses, problems, or other clinical concerns. Here are the key points:

- Scope and Usage: It captures conditions that have risen to a level of concern, such as diseases, health issues, or post-procedure states.
- Clinical Context: Conditions can be recorded based on a clinician’s assessment or expressed by the patient or care team members.
- Examples: Conditions like pregnancy, post-surgical states, or chronic illnesses can be documented. It can also include social determinants of health like unemployment or lack of transportation.

**Data Preprocessing**

Condition codes in this dataset use both ICD-9 and ICD-10, so they need to be standardized to a single coding system. GEMs (General Equivalence Mappings) are crosswalks between ICD-9-CM and ICD-10-CM that map codes in either direction.

https://www.cms.gov/Medicare/Coding/ICD10/Downloads/ICD-10_GEM_fact_sheet.pdf

**Table Attributes**

- `conditionId`: The unique identifier for the condition.
- `patientId`: The unique identifier for the patient.
- `encounterId`: The unique identifier for the encounter.
- `categoryCode`: Condition category.
- `conditionCode`: The code representing the specific condition.
- `conditionDisplay`: The display name for the condition.
- `conditionSystem`: The system from which the condition code is derived.

In [3]:
with open("schema/Condition.json") as f:
    schema_read = json.loads(f.read())
schema =  T.StructType.fromJson(schema_read)
df_MimicCondition =  spark.read.schema(schema).json("_dataset/MimicCondition.ndjson") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_MimicConditionED =  spark.read.schema(schema).json("_dataset/MimicConditionED.ndjson") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_condition = df_MimicCondition.union(df_MimicConditionED)
# Remove corrupt records (rows without an id)
df_condition = df_condition.filter(df_condition["id"].isNotNull())
df_condition.createOrReplaceTempView("df_condition")

In [None]:
fm_condition = spark.sql("""
SELECT
    id AS conditionId,
    replace(subject.reference, "Patient/", "") AS patientId,
    replace(encounter.reference, "Encounter/", "") AS encounterId,
    category[0].coding[0].code categoryCode,
    code.coding[0].code AS conditionCode,
    code.coding[0].display AS conditionDisplay,
    code.coding[0].system AS conditionSystem,
    sourceName
FROM df_condition
""")
fm_condition.createOrReplaceTempView("condition")
print('Total Condition:', fm_condition.count())

In [None]:
# partition_size = 1_000_000
# total_records = fm_condition.count()
# num_partitions = (total_records // partition_size) + 1

# for i in range(num_partitions):
#     partition_df = fm_condition.offset(i * partition_size).limit(partition_size)  # skip first, then take one partition
#     pandas_df = partition_df.toPandas()
#     pandas_df.to_parquet(f'_dataset/MartCondition{i}.parquet', engine='pyarrow')
#     print(f'Condition Partition {i} saved successfully')

### 1.4 Procedure

The Procedure resource in FHIR is used to record details of current and historical procedures performed on or for a patient.

**Table Attributes**

- `procedureId`: The unique identifier for the procedure.
- `patientId`: The unique identifier for the patient.
- `encounterId`: The unique identifier for the encounter.
- `status`: The current status of the procedure (e.g., completed, in-progress, not-done).
- `performedDateTime`: The date and time when the procedure was performed.
- `procedureCode`: The code representing the specific procedure.
- `procedureDisplay`: The display name for the procedure.
- `procedureSystem`: The system from which the procedure code is derived.

In [9]:
with open("schema/Procedure.json") as f:
    schema_read = json.loads(f.read())
schema =  T.StructType.fromJson(schema_read)
df_MimicProcedure =  spark.read.schema(schema).json("_dataset/MimicProcedure.ndjson") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_MimicProcedureED =  spark.read.schema(schema).json("_dataset/MimicProcedureED.ndjson") \
    .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

# df_MimicProcedureICU =  spark.read.schema(schema).json("_dataset/MimicProcedureICU.ndjson") \
#     .withColumn("sourceName", F.split_part(F.input_file_name(), F.lit("/"), F.lit(-1)))

df_procedure = df_MimicProcedure.union(df_MimicProcedureED) #.union(df_MimicProcedureICU)
df_procedure.createOrReplaceTempView("df_procedure")

In [None]:
fm_procedure = spark.sql("""
SELECT
    id AS procedureId,
    replace(subject.reference, "Patient/", "") AS patientId,
    replace(encounter.reference, "Encounter/", "") AS encounterId,
    status AS status,
    nvl(to_date(performedDateTime), to_date(performedPeriod.start)) AS performedDateTime,
    code.coding[0].code AS procedureCode,
    code.coding[0].display AS procedureDisplay,
    code.coding[0].system AS procedureSystem,
    sourceName
FROM df_procedure
""")
fm_procedure.createOrReplaceTempView("procedure")
print('Total Procedure:', fm_procedure.count())
# fm_procedure.toPandas().to_parquet('_dataset/procedure.parquet', index=False)

In [None]:
# partition_size = 1_000_000
# total_records = fm_procedure.count()
# num_partitions = (total_records // partition_size) + 1

# for i in range(num_partitions):
#     partition_df = fm_procedure.offset(i * partition_size).limit(partition_size)  # skip first, then take one partition
#     pandas_df = partition_df.toPandas()
#     pandas_df.to_parquet(f'_dataset/MartProcedure{i}.parquet')
#     print(f'Procedure Partition {i} saved successfully')

## 2. Exploratory Data Analysis

### 2.1. Patient

#### BirthYear Distribution Line Chart

In [None]:
patient_birthYear_df = spark.sql("select date_format(birthDate, 'yyyy') birthYear, count(*) count  from patient group by 1 order by 1").toPandas()

fig = px.line(
    patient_birthYear_df, 
    x='birthYear', 
    y='count', 
    title='Patient BirthYear Distribution [De-Identified]'
)

fig.update_layout(
    xaxis_title='Birth Year',
    yaxis_title='Number of Patients',
    template='plotly_white'  
)
fig.show()

#### Gender Distribution Pie Chart

In [None]:
gender_count_df = fm_patient.groupBy("gender").count()
fig_patient_gender = px.pie(gender_count_df.toPandas(), values='count', names='gender', title='Gender Distribution')
fig_patient_gender.show()

#### Gender & Marital Status Grouped Bar Chart

In [None]:
gender_marital_status_count_df = fm_patient.groupBy("gender", "maritalStatus").count()
fig_patient_gender_marital_status = px.bar(
    gender_marital_status_count_df.toPandas(), 
    x='maritalStatus', 
    y='count', 
    color='gender', 
    title='Count by Gender and Marital Status', 
    barmode='group'
)
fig_patient_gender_marital_status.show()

### 2.2. Encounter

#### Encounter Year Distribution Line Chart

In [None]:
encounter_year_df = spark.sql("select date_format(periodStart, 'yyyy') encounterYear, count(*) count  from encounter group by 1 order by 1").toPandas()

fig = px.line(
    encounter_year_df, 
    x='encounterYear', 
    y='count', 
    title='Encounter Period Year Distribution [De-Identified]'
)

fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Number of Encounters',
    template='plotly_white'  
)
fig.show()

#### Source Distribution Pie Chart

In [None]:
px.pie(fm_encounter.groupBy("sourceName").count().toPandas(), values='count', names='sourceName', title='Source Distribution').show()

#### Encounter Class Distribution Pie Chart

In [None]:
px.pie(fm_encounter.groupBy("encounterClass").count().toPandas(), values='count', names='encounterClass', title='Encounter Class Distribution').show()

#### Encounter Priority Distribution Pie Chart

In [None]:
px.pie(fm_encounter.groupBy("priority").count().toPandas(), values='count', names='priority', title='Encounter Priority Distribution').show()

#### Encounter Class by Priority Grouped Bar Chart

In [None]:
grouped_data = fm_encounter.groupBy("encounterClass", "priority").count().toPandas()

fig = px.bar(
    grouped_data, 
    x='priority', 
    y='count', 
    color='encounterClass', 
    title='Encounter Class by Priority', 
    labels={'count': 'Count'},
    barmode='group'
)

fig.update_layout(xaxis_title='Priority',yaxis_title='Count',legend_title='Encounter Class')

fig.show()

#### Patient Readmission Pie Chart

In [None]:
px.pie(fm_encounter.groupBy("readmissionStatus").count().toPandas(), values='count', names='readmissionStatus', title='Patient Readmission').show()

#### Encounter Over the Years by ReadmissionStatus

In [None]:
readmission_df = fm_encounter.groupBy(
    "readmissionStatus", F.date_format("periodStart", "yyyy").alias("year")
).count().orderBy("year")

fig = px.line(readmission_df.toPandas(), 
              x='year', 
              y='count', 
              color='readmissionStatus',  # Differentiate lines by readmissionStatus
              title='Encounter Over the Years by ReadmissionStatus', 
              labels={'year': 'Year', 'count': 'Number of Encounters'})

fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Number of Encounters',
    legend_title='Readmission Status',
    template='plotly_white'  
)
fig.show()

#### Readmission Encounters: Priority Transitions Sankey Diagram

In [None]:
priority_encounter_df = spark.sql("""
SELECT
    e1.priority priority1,
    e2.priority priority2,
    count(*) count
FROM encounter e1
JOIN encounter e2
ON e1.nextEncounterId = e2.encounterID
WHERE e1.readmissionStatus == 'Readmission'
GROUP BY all
ORDER by 3 desc
""")

priority_encounter_pd = priority_encounter_df.toPandas()
all_priorities = list(set(priority_encounter_pd['priority1'].tolist() + 
                          priority_encounter_pd['priority2'].tolist()))
node_map = {priority: idx for idx, priority in enumerate(all_priorities)}
priority_encounter_pd['source'] = priority_encounter_pd['priority1'].map(node_map)
priority_encounter_pd['target'] = priority_encounter_pd['priority2'].map(node_map)

table_trace = go.Table(
    header=dict(values=["Priority 1", "Priority 2", "Count"], fill_color='lightgrey', align='center'),
    cells=dict(values=[priority_encounter_pd['priority1'], priority_encounter_pd['priority2'], priority_encounter_pd['count']],
               fill_color='white', align='center')
)

sankey_trace = go.Sankey(
    node=dict(
        pad=15,  # Padding between nodes
        thickness=20,  # Node thickness
        line=dict(color="black", width=0.5),  # Node border settings
        label=all_priorities  # Node labels
    ),
    link=dict(
        source=priority_encounter_pd['source'],  # Source nodes (indices)
        target=priority_encounter_pd['target'],  # Target nodes (indices)
        value=priority_encounter_pd['count']     # Flow values (counts)
    )
)

fig = make_subplots(
    rows=1, cols=2, 
    column_widths=[0.7, 0.3],  # 70% Sankey (left), 30% table (right)
    specs=[[{"type": "sankey"}, {"type": "table"}]],  # Subplot types, in the order the traces are added below
)


fig.add_trace(sankey_trace, row=1, col=1)
fig.add_trace(table_trace, row=1, col=2)

fig.update_layout(
    title_text="Readmission Encounters: Priority Transitions Sankey Diagram and Table",
    font_size=12,
    height=500
)
fig.show()

#### Encounter Count by Source and Duration Group

In [None]:
%%sparksql --l 100
SELECT 
    sourceName, 
    CASE 
        WHEN duration BETWEEN 0 AND 5 THEN '0 to 5 days'
        WHEN duration BETWEEN 6 AND 10 THEN '6 to 10 days'
        WHEN duration BETWEEN 11 AND 20 THEN '11 to 20 days'
        WHEN duration BETWEEN 21 AND 30 THEN '21 to 30 days'
        WHEN duration BETWEEN 31 AND 50 THEN '31 to 50 days'
        WHEN duration BETWEEN 51 AND 100 THEN '51 to 100 days'
        WHEN duration BETWEEN 101 AND 150 THEN '101 to 150 days'
        WHEN duration BETWEEN 151 AND 200 THEN '151 to 200 days'
        WHEN duration BETWEEN 201 AND 250 THEN '201 to 250 days'
        WHEN duration BETWEEN 251 AND 300 THEN '251 to 300 days'
        ELSE 'More than 300 days'
    END AS duration_group,
    COUNT(*) AS encounter_count
FROM encounter
GROUP BY sourceName, duration_group
ORDER BY sourceName, double(split_part(duration_group, ' ', 1));
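The bucketing in the `CASE` expression above can be expressed as a small lookup, sketched here in plain Python. The function name `duration_group` is hypothetical. One deliberate difference, labelled as an assumption: in the query a NULL or negative `duration` matches no `BETWEEN` branch and therefore falls into `'More than 300 days'`, whereas this sketch reports such values as `'unknown'`.

```python
from bisect import bisect_left

# Inclusive bounds of each bucket, mirroring the CASE expression in the query.
LOWER_BOUNDS = [0, 6, 11, 21, 31, 51, 101, 151, 201, 251]
UPPER_BOUNDS = [5, 10, 20, 30, 50, 100, 150, 200, 250, 300]

def duration_group(days):
    """Return the duration bucket label for a stay length in whole days."""
    if days is None or days < 0:
        # Assumption: keep missing or negative durations apart from long stays.
        return "unknown"
    index = bisect_left(UPPER_BOUNDS, days)
    if index == len(UPPER_BOUNDS):
        return "More than 300 days"
    return f"{LOWER_BOUNDS[index]} to {UPPER_BOUNDS[index]} days"
```

Keeping the bounds in two parallel lists makes it easy to check that the buckets are contiguous and do not overlap.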

In [None]:
encounter_duration_df = spark.sql("""SELECT 
    sourceName,
    duration AS duration_group,
    COUNT(*) AS encounter_count
FROM encounter
where duration BETWEEN 1 AND 30
GROUP BY sourceName, duration_group
ORDER BY sourceName, duration_group;""").toPandas()

px.line(encounter_duration_df, x='duration_group', y='encounter_count', color='sourceName',
              markers=True, title="Encounter Count by Source and Duration Group").show()

### 2.3. Condition

In [None]:
%%sparksql
select * from condition limit 3

In [None]:
condition_system_df = spark.sql("select split_part(conditionSystem, '/', -1) conditionSystem, count(*) count from condition group by 1").toPandas()
px.pie(condition_system_df, values='count', names='conditionSystem', title='Condition System Count Distribution').show()

Based on the `Condition System Count Distribution` chart, the condition table contains both ICD-9 and ICD-10 codes.

The differences between these two coding systems introduce challenges that may affect model performance and interpretability.

1. Inconsistent Coding:
    - ICD-10 is more granular in specifying disease type, injury location, and severity.
    - ICD-9 has fewer and less specific codes than ICD-10, so a single ICD-9 code can map to multiple ICD-10 codes.
    - If the same condition is coded differently depending on the coding system, the dataset gains noisy features, which can reduce predictive accuracy.
2. Feature Engineering Complexity:
    - ICD-9 and ICD-10 are structured differently, both in the number of codes and in their specificity.
    - This complicates the creation of features related to diagnosis categories or comorbidities.
    - The resulting features might not capture the full clinical picture, potentially leading to underfitting or overfitting.
3. Data Heterogeneity:
    - Which coding system a record uses depends on when it was coded, so mixing the two can introduce temporal bias into the model.
    - The model could overestimate or underestimate readmission risk if it learns to associate the newer coding system with better or worse outcomes.
4. Model Interpretability:
    - Mixed coding systems make model interpretation harder, especially with interpretable models such as decision trees or logistic regression.
    - Features may not be comparable between ICD-9 and ICD-10, which complicates explaining the model's predictions.


In [None]:
%%sparksql
select conditionCode, conditionDisplay, count(*) count from condition
where conditionSystem like '%mimic-diagnosis-icd10'
group by all order by 3 desc 
limit 10

In [None]:
%%sparksql
select conditionCode, conditionDisplay, count(*) count from condition
where conditionSystem like '%mimic-diagnosis-icd9'
group by all order by 3 desc 
limit 10

Comparing the top conditions in both coding systems shows several codes that describe similar conditions.

For example:

- ICD-9 code 4019 `Unspecified essential hypertension` is similar to ICD-10 code I10 `Essential (primary) hypertension`.
- ICD-9 code V1582 `Personal history of tobacco use` is similar to ICD-10 code Z87891 `Personal history of nicotine dependence`.
- ICD-9 code 2724 `Other and unspecified hyperlipidemia` is similar to ICD-10 code E785 `Hyperlipidemia, unspecified`.

The dataset should be standardized to improve the accuracy of the predictive model.

One solution is to map ICD-9 codes to their ICD-10 equivalents using the General Equivalence Mappings (GEMs), which standardizes the dataset on ICD-10.
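The GEM-based mapping described above can be sketched as a one-to-many lookup. This is a minimal illustration, not the notebook's implementation: `build_crosswalk` and `standardize` are hypothetical names, and `SAMPLE_ROWS` holds only the three pairs named in the examples above rather than rows copied from the GEM file.

```python
from collections import defaultdict

# Illustrative excerpt only: the three pairs come from the examples in the text,
# not from the actual GEM file.
SAMPLE_ROWS = [
    ("4019", "I10"),
    ("V1582", "Z87891"),
    ("2724", "E785"),
]

def build_crosswalk(rows):
    """Group ICD-10 targets by ICD-9 source; one source may have several targets."""
    crosswalk = defaultdict(list)
    for icd9, icd10 in rows:
        crosswalk[icd9].append(icd10)
    return dict(crosswalk)

def standardize(code, system, crosswalk):
    """Return the ICD-10 code(s) for a condition, or an empty list if unmapped."""
    if system.endswith("icd10"):
        return [code]  # already ICD-10, keep as-is
    return crosswalk.get(code, [])

crosswalk = build_crosswalk(SAMPLE_ROWS)
```

Returning a list keeps the one-to-many case explicit: an ICD-9 code with several ICD-10 targets expands into several rows, and an unmapped code yields none.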


### 2.4. Procedure

In [None]:
%%sparksql
select * from procedure limit 3

In [None]:
procedure_system_df = spark.sql("select split_part(procedureSystem, '/', -1) procedureSystem, count(*) count from procedure group by 1").toPandas()
px.pie(procedure_system_df, values='count', names='procedureSystem', title='Procedure System Count Distribution').show()

Compared to the Condition domain, the Procedure domain has an extra coding system, `http://snomed.info/sct`.
- The ICD-9 and ICD-10 mismatch can be resolved by converting the codes to a single system.
- The SNOMED CT (SCT) coding system introduces a new issue.

In [None]:
%%sparksql
select distinct procedureCode, procedureDisplay from procedure
where procedureSystem = 'http://snomed.info/sct'

The coding system `http://snomed.info/sct` contains only two procedure codes (386478007 and 410188000).
- These procedures are common and performed for nearly every patient in an emergency or hospital setting.
- They are generic and vary little between patients, so they may not help differentiate patients at high risk of readmission from those who are not.
- Including them may add noise to the model and clutter the dataset.

## 3. Data Preparation 

### 3.1 Condition Domain

#### Standardize Coding System

There are two coding systems in the condition domain: ICD-9 and ICD-10.

To facilitate the transition, CMS and CDC developed GEMs, which act as crosswalks that translate ICD-9 codes to their ICD-10 equivalents.

Learn more about this crosswalk on [ICD-9-CM to and from ICD-10-CM and ICD-10-PCS Crosswalk or General Equivalence Mappings](https://www.nber.org/research/data/icd-9-cm-and-icd-10-cm-and-icd-10-pcs-crosswalk-or-general-equivalence-mappings).

In [None]:
%%sparksql
SELECT conditionSystem, count(*) FROM condition group by conditionSystem order by 1 desc

ICD-10 Code Structure
- Characters 1:3 = Indicate the category of the diagnosis
- Characters 4:6 = Indicate etiology, anatomic site, severity or other clinical detail
- Character 7 = Extension
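The structure above maps directly onto string slices. A minimal sketch, assuming codes may arrive with or without the decimal point; the function name `split_icd10` is hypothetical.

```python
def split_icd10(code):
    """Split an ICD-10 code into (category, detail, extension)."""
    normalized = code.replace(".", "").strip().upper()
    category = normalized[:3]    # characters 1-3
    detail = normalized[3:6]     # characters 4-6, may be empty or shorter
    extension = normalized[6:7]  # character 7, may be empty
    return category, detail, extension
```

Grouping by the `category` part is one way to reduce the number of distinct condition values before encoding them as features.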

In [5]:
df = spark.read.csv("_dataset/icd9toicd10cmgem.csv", header = True)
df.createOrReplaceTempView("GemMapping")

In [6]:
standardize_condition_df = spark.sql("""
    SELECT
        *
    FROM condition 
    WHERE condition.conditionSystem LIKE '%icd10'
    
    UNION
    
    SELECT
        condition.conditionId,
        condition.patientId,
        condition.encounterId,
        condition.categoryCode,
        GemMapping.icd10cm AS conditionCode,
        CAST(NULL AS STRING) conditionDisplay,
        CAST(NULL AS STRING) AS conditionSystem,
        condition.sourceName
    FROM condition 
    JOIN GemMapping
    ON condition.conditionCode = GemMapping.icd9cm
    WHERE condition.conditionSystem LIKE '%icd9'
""")

standardize_condition_df.createOrReplaceTempView("ICD10Condition")

#### Pivot Condition Table based on Encounter

In [7]:
condition_pivot_df = spark.sql("""
    WITH RankedConditions AS (
        SELECT 
            patientId,
            encounterId,
            conditionCode,
            COUNT(*) AS conditionCount,
            ROW_NUMBER() OVER (PARTITION BY encounterId ORDER BY COUNT(*) DESC) AS rank
        FROM ICD10Condition
        GROUP BY patientId, encounterId, conditionCode
    )
    SELECT
        patientId,
        encounterId,
        MAX(CASE WHEN rank = 1 THEN conditionCode END) AS condition_1,
        MAX(CASE WHEN rank = 2 THEN conditionCode END) AS condition_2,
        MAX(CASE WHEN rank = 3 THEN conditionCode END) AS condition_3
    FROM RankedConditions
    GROUP BY patientId, encounterId
""")
# Register the view so the following SQL cell can query it
condition_pivot_df.createOrReplaceTempView('condition_pivot_df')
# condition_pivot_df.toPandas().to_parquet('_dataset/PivotCondition.parquet', engine='pyarrow', index=False)

In [None]:
%%sparksql
select * from condition_pivot_df limit 10
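The rank-then-pivot pattern used in the query above can be sketched in plain Python with `collections.Counter`. The function name `pivot_top_codes` and the sample rows are hypothetical. Ties are broken by first appearance here, whereas `ROW_NUMBER()` over equal counts gives no guaranteed order.

```python
from collections import Counter, defaultdict

def pivot_top_codes(rows, top_n=3):
    """Keep the top_n most frequent codes per encounter as condition_1..condition_n."""
    counts = defaultdict(Counter)
    for patient_id, encounter_id, code in rows:
        counts[(patient_id, encounter_id)][code] += 1

    pivoted = []
    for (patient_id, encounter_id), counter in counts.items():
        ranked = [code for code, _ in counter.most_common(top_n)]
        ranked += [None] * (top_n - len(ranked))  # pad encounters with fewer codes
        row = {"patientId": patient_id, "encounterId": encounter_id}
        row.update({f"condition_{i}": code for i, code in enumerate(ranked, start=1)})
        pivoted.append(row)
    return pivoted

# Hypothetical (patientId, encounterId, conditionCode) rows.
sample_rows = [
    ("p1", "e1", "I10"), ("p1", "e1", "I10"), ("p1", "e1", "E785"),
    ("p2", "e2", "Z87891"),
]

pivoted = pivot_top_codes(sample_rows)
```

Encounters with fewer than three distinct codes get `None` in the remaining columns, matching the NULLs produced by `MAX(CASE WHEN ...)`.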

### 3.1. Pivot Condition Table

In [None]:
%%dql
SELECT count(*) FROM read_parquet("_dataset/MartCondition2.parquet")

In [5]:
condition_pivot_df = spark.sql("""
    WITH RankedConditions AS (
        SELECT 
            patientId,
            encounterId,
            conditionDisplay,
            COUNT(*) AS conditionCount,
            ROW_NUMBER() OVER (PARTITION BY encounterId ORDER BY COUNT(*) DESC) AS rank
        FROM condition
        GROUP BY patientId, encounterId, conditionDisplay
    )
    SELECT
        patientId,
        encounterId,
        MAX(CASE WHEN rank = 1 THEN conditionDisplay END) AS condition_1,
        -- MAX(CASE WHEN rank = 1 THEN conditionCount END) AS condition_count_1,
        MAX(CASE WHEN rank = 2 THEN conditionDisplay END) AS condition_2,
        -- MAX(CASE WHEN rank = 2 THEN conditionCount END) AS condition_count_2,
        MAX(CASE WHEN rank = 3 THEN conditionDisplay END) AS condition_3
        -- MAX(CASE WHEN rank = 3 THEN conditionCount END) AS condition_count_3
    FROM RankedConditions
    GROUP BY patientId, encounterId
""").toPandas()
# condition_pivot_df.to_parquet('_dataset/PivotCondition.parquet', index=False)

### 3.2. Pivot Procedure Table

In [6]:
procedure_pivot_df = spark.sql("""
    WITH RankedProcedures AS (
        SELECT 
            patientId,
            encounterId,
            procedureDisplay,
            COUNT(*) AS procedureCount,
            ROW_NUMBER() OVER (PARTITION BY encounterId ORDER BY COUNT(*) DESC) AS rank
        FROM procedure
        GROUP BY patientId, encounterId, procedureDisplay
    )
    SELECT
        patientId,
        encounterId,
        MAX(CASE WHEN rank = 1 THEN procedureDisplay END) AS procedure_1,
        -- MAX(CASE WHEN rank = 1 THEN procedureCount END) AS procedure_count_1,
        MAX(CASE WHEN rank = 2 THEN procedureDisplay END) AS procedure_2,
        -- MAX(CASE WHEN rank = 2 THEN procedureCount END) AS procedure_count_2,
        MAX(CASE WHEN rank = 3 THEN procedureDisplay END) AS procedure_3
        -- MAX(CASE WHEN rank = 3 THEN procedureCount END) AS procedure_count_3
    FROM RankedProcedures
    GROUP BY patientId, encounterId
""").toPandas()
# procedure_pivot_df.to_parquet('_dataset/PivotProcedure.parquet', index=False)

### 3.3. Consolidate Encounter

In [None]:
%%dql
COPY (
    SELECT
        encounter.encounterId,
        encounter.patientId,
        patient.gender,
        CASE 
            WHEN date_diff('year', patient.birthDate,encounter.periodStart) BETWEEN 18 AND 29 THEN 'young adults'
            WHEN date_diff('year', patient.birthDate,encounter.periodStart) BETWEEN 30 AND 39 THEN 'young adulthood'
            WHEN date_diff('year', patient.birthDate,encounter.periodStart) BETWEEN 40 AND 49 THEN 'early-middle age'
            WHEN date_diff('year', patient.birthDate,encounter.periodStart) BETWEEN 50 AND 59 THEN 'late-middle age'
            WHEN date_diff('year', patient.birthDate,encounter.periodStart) BETWEEN 60 AND 69 THEN 'mid-old age'
            WHEN date_diff('year', patient.birthDate,encounter.periodStart) BETWEEN 70 AND 79 THEN 'senior-old age'
            WHEN date_diff('year', patient.birthDate,encounter.periodStart) BETWEEN 80 AND 89 THEN 'very senior-old'
            WHEN date_diff('year', patient.birthDate,encounter.periodStart) BETWEEN 90 AND 115 THEN 'centenarians'
            ELSE 'other age groups'
        END AS ageGroup,
        patient.maritalStatus,
        encounter.duration encounterDuration,
        encounter.status encounterStatus,
        encounter.encounterClass,
        encounter.displayType encounterType,
        encounter.priority,
        -- Conditions
        condition.condition_1,
        condition.condition_2,
        condition.condition_3,
        -- Procedures
        procedure.procedure_1,
        procedure.procedure_2,
        procedure.procedure_3,
        encounter.readmissionStatus,
        encounter.sourceName
    FROM read_parquet("_dataset/encounter.parquet") encounter
    LEFT JOIN read_parquet("_dataset/patient.parquet") patient
        ON encounter.patientId = patient.patientID
    LEFT JOIN read_parquet("_dataset/PivotCondition.parquet") condition
        ON encounter.encounterId = condition.encounterId
        AND encounter.patientID = condition.patientID
    LEFT JOIN read_parquet("_dataset/PivotProcedure.parquet") procedure
        ON encounter.encounterId = procedure.encounterId
        AND encounter.patientID = procedure.patientID
)
TO '_dataset/ConsolidatedEncounter.parquet'
(FORMAT 'parquet');
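The age bucketing in the query above can be sketched as a table lookup. This is an illustration under an assumption: it treats `date_diff('year', ...)` as the difference between calendar years, so age is computed from the two year numbers alone. The function name `age_group` is hypothetical; the labels are copied from the query.

```python
# (lowest age, highest age, label), copied from the CASE expression in the query.
AGE_GROUPS = [
    (18, 29, "young adults"),
    (30, 39, "young adulthood"),
    (40, 49, "early-middle age"),
    (50, 59, "late-middle age"),
    (60, 69, "mid-old age"),
    (70, 79, "senior-old age"),
    (80, 89, "very senior-old"),
    (90, 115, "centenarians"),
]

def age_group(birth_year, encounter_year):
    """Return the age-group label for a patient at the time of an encounter."""
    age = encounter_year - birth_year  # assumption: calendar-year difference
    for low, high, label in AGE_GROUPS:
        if low <= age <= high:
            return label
    return "other age groups"
```

Ages below 18 or above 115 fall through to `other age groups`, as in the `ELSE` branch.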

## 4. Model Training

Data Source: `_dataset/ConsolidatedEncounter.parquet`

- `encounterId`: Unique Identifier for the Encounter
- `patientId`: Patient Unique Identifier
- `gender`: Patient Gender
- `ageGroup`: Patient Age Group
- `maritalStatus`: Patient Marital Status
- `encounterDuration`: Total Encounter Duration in days
- `encounterStatus`: Current status of the encounter (e.g., planned, in-progress, finished).
- `encounterClass`: Classification of the encounter (e.g., inpatient, outpatient); Helps categorize the nature of the healthcare service provided.
- `encounterType`: Specific type of Encounter
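- `priority`: Urgency of the encounter, such as routine, urgent, or emergency.
- `condition_1`: Primary condition (ranked first by record count within the encounter).
- `condition_2`: Secondary condition (ranked second).
- `condition_3`: Other condition (ranked third).
- `procedure_1`: Primary procedure (ranked first by record count within the encounter).
- `procedure_2`: Secondary procedure (ranked second).
- `procedure_3`: Other procedure (ranked third).
- `readmissionStatus`: `Readmission` if the patient's next encounter starts within 30 days of this encounter's end, otherwise `No Readmission`.
- `sourceName`: Source file of the encounter.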

In [97]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_parquet("_dataset/ConsolidatedEncounter.parquet")
df.head(10)

In [99]:
# Cast the categorical features and the target to pandas' category dtype
categorical_columns = [
    'gender', 'ageGroup', 'maritalStatus',
    'encounterStatus', 'encounterClass', 'encounterType', 'priority',
    'condition_1', 'condition_2', 'condition_3',
    'procedure_1', 'procedure_2', 'procedure_3',
    'readmissionStatus',
]
for column in categorical_columns:
    df[column] = df[column].astype('category')

In [None]:
df.columns

In [None]:
df.info()

In [None]:
# Checking the unique values in the dataset
df.nunique()

In [None]:
# Checking the null value
df.isnull().sum()

In [None]:
# Checking the duplicate values
df.duplicated().sum()

In [None]:
es = df['readmissionStatus'].map({'Readmission': 1, 'No Readmission': 0})

x = df['ageGroup'].cat.codes
y = df['gender'].cat.codes
z = df['maritalStatus'].cat.codes

c = es
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111, projection='3d')

img = ax.scatter(x, y, z, c=c, cmap='coolwarm', marker='o')
fig.colorbar(img)

ax.set_xlabel('Age Group')
ax.set_ylabel('Gender')
ax.set_zlabel('Marital Status')

plt.title('Readmission Status by Patient Demographics')
plt.legend(*img.legend_elements(), title='Readmitted')
plt.show()

### 4.1. LabelEncoder 

Convert categorical string labels (or values) into numeric labels. This is often required for machine learning algorithms that expect numerical input rather than categorical text data.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
le = LabelEncoder()
# Apply the LabelEncoder to the categorical columns.
# The same encoder is refit on every column, so afterwards `le` only
# holds the mapping of the last column (readmissionStatus).
encoded_columns = [
    'gender', 'ageGroup', 'maritalStatus',
    'encounterStatus', 'encounterClass', 'encounterType', 'priority',
    'condition_1', 'condition_2', 'condition_3',
    'procedure_1', 'procedure_2', 'procedure_3',
    'readmissionStatus',
]
for col in encoded_columns:
    df[col] = le.fit_transform(df[col])

df.head(3)
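Reusing one encoder across columns discards the label mapping of every column except the last. The sketch below is one possible alternative, not the notebook's method: it keeps a separate fitted `LabelEncoder` per column so encoded values can be turned back into their original strings. The `toy` frame and the `encoders` dictionary are hypothetical names used only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the consolidated encounter data
toy = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "priority": ["urgent", "routine", "urgent"],
})

# Keep one fitted encoder per column so each mapping can be inverted later
encoders = {}
for col in ["gender", "priority"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Recover the original strings for one column
gender_labels = encoders["gender"].inverse_transform(toy["gender"])
print(list(gender_labels))
```

`LabelEncoder` assigns integers in sorted order of the class names, so the codes carry no ordinal meaning; tree-based models tolerate this better than linear models do.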

In [106]:
# Drop the identifier and source columns, which are not used as features
df.drop(columns=['encounterId', 'patientId', 'sourceName'], inplace=True)

In [None]:
# Show correlation of numerical columns
df_numeric = df.select_dtypes(include=['number'])

# Calculate correlation matrix between selected columns
correlation_matrix = df_numeric.corr()

# Visualizing correlation between variables
plt.figure(figsize=(16,12))
sns.heatmap(correlation_matrix, cmap='YlGnBu', fmt='.2g', annot=True)

plt.title('Heatmap showing correlation amongst variables',y=1.03)
plt.show()

### 4.2. Split Train and Test Dataset

In [108]:
from sklearn.model_selection import train_test_split

In [None]:
# Split the data into X (features) and y (target variable)
X = df.drop(columns=['readmissionStatus'])  # All remaining columns are features
y = df["readmissionStatus"]  # Encoded readmission status is the target

# Split the data into train and test subsets
# You can adjust the test size and random state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.275, random_state=123
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
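If the readmission classes are imbalanced, a purely random split can leave the test set with a different positive rate than the training set. As a minimal sketch, assuming an imbalanced binary target, the `stratify` argument of `train_test_split` keeps the class proportions the same in both subsets. The arrays `X_demo` and `y_demo` are synthetic stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives, 20 positives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo preserves the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```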

### 4.3. Random Forest Classifier

In [110]:
from sklearn.ensemble import RandomForestClassifier

In [111]:
# Define parameters: these will need to be tuned to prevent overfitting and underfitting
params = {
    "n_estimators": 97,  # Number of trees in the forest
    "max_depth": 10,  # Max depth of the tree
    "min_samples_split": 3,  # Min number of samples required to split a node
    "min_samples_leaf": 1,  # Min number of samples required at a leaf node
    "ccp_alpha": 0,  # Cost complexity parameter for pruning
    "random_state": 123,
}
#{'n_estimators': 97, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_depth': 10, 'ccp_alpha': 0}

# Create a RandomForestClassifier object with the parameters above
rf = RandomForestClassifier(**params)

# Train the random forest on the train set
rf = rf.fit(X_train, y_train)

# Predict the outcomes on the test set
y_pred = rf.predict(X_test)
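The parameters above are noted as needing tuning. One common way to do that, shown here only as a sketch on synthetic data, is a randomized search with cross-validation. The search ranges, `n_iter`, `cv`, and the `roc_auc` scoring choice are all assumptions made for illustration and are not taken from the notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=123)

# Hypothetical search ranges; widen or narrow them for real data
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(3, 12),
    "min_samples_split": randint(2, 6),
    "min_samples_leaf": randint(1, 4),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=123),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    cv=3,                # 3-fold cross-validation per combination
    scoring="roc_auc",
    random_state=123,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The resulting `search.best_params_` dictionary can be unpacked into `RandomForestClassifier(**search.best_params_)` in the same way as the `params` dictionary.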

In [112]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

In [None]:
y_pred = rf.predict(X_test)
# Probability of the positive class (label 1), used for the ROC AUC
y_proba = rf.predict_proba(X_test)[:, 1]
print('Training Set Accuracy:', rf.score(X_train,y_train))
print('Test Set Accuracy:', rf.score(X_test,y_test))

# Calculate the accuracy, precision, recall, and ROC AUC scores
print("Accuracy:", "{:.2f}%".format(accuracy_score(y_test, y_pred)*100))
print("Precision:", "{:.2f}%".format(precision_score(y_test, y_pred)*100))
print("Recall:", "{:.2f}%".format(recall_score(y_test, y_pred)*100))
print("Area Under the Curve:", "{:.2f}%".format(roc_auc_score(y_test, y_proba)*100))
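Precision and recall are easier to interpret next to the raw counts they are derived from. The sketch below uses small hand-made label lists (`y_true_demo` and `y_pred_demo`, both hypothetical) to show how a confusion matrix relates to the two scores.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = readmission)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # share of predicted readmissions that are real
recall = tp / (tp + fn)     # share of real readmissions that were found

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall)
```

Applied to the notebook's `y_test` and `y_pred`, the same call would show how many readmissions the model misses (`fn`) versus how many it flags incorrectly (`fp`).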