## Patient Pathways Sankey Diagram

In this section, we build a Sankey diagram to visualize **patient journeys** from  
**Condition → Visit Type → Outcome (Alive vs Deceased)**.

Each block in the diagram represents a category (condition, visit type, or outcome),  
and each flow represents **how many visit records** move from one stage to the next  
(a patient with several visits is counted once per visit).  
Thicker bands = more records following that route.

This view helps answer:

- Which conditions are most common?
- How often do they lead to outpatient, inpatient, or ER visits?
- How frequently do those pathways end in death vs survival?


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

In [2]:
#  Setup 
sns.set(style="whitegrid", context="talk")
plt.rcParams["figure.figsize"] = (10, 6)

# Seed NumPy's global random number generator for reproducibility
# (pandas draws from NumPy's RNG, not Python's built-in "random" module)
np.random.seed(42)

In [3]:
# Load data
df = pd.read_csv("final_with_deceased.csv")
df

Unnamed: 0,person_id,birth_datetime,race_source_value,ethnicity_source_value,gender_source_value,visit_occurrence_id,visit_start_date,visit_end_date,visit_type,condition,...,systolic,diastolic,heart_rate_bpm,oxygen_saturation_percent,respiratory_rate_per_minute,flu_last_administered,tdap_last_administered,mmr_last_administered,polio_last_administered,deceased
0,1,1958-12-02,white,nonhispanic,F,1,2020-03-11,2020-04-01,Inpatient Visit,Dyspnea:Pneumonia:Respiratory distress:Wheezing,...,132.0,81.0,178.9,84.8,37.0,2019-09-11,2010-12-02,1962-12-02,1962-12-02,Y
1,2,1945-10-02,white,nonhispanic,F,28,2020-05-07,2020-05-07,Outpatient Visit,Viral sinusitis,...,,,,,,2019-12-04,2017-10-02,1949-10-02,1949-10-02,N
2,3,1968-04-20,white,nonhispanic,M,188,2020-03-15,2020-03-15,Outpatient Visit,Sore throat symptom:Dyspnea:Wheezing,...,108.0,76.0,57.1,78.4,32.1,2019-11-19,2010-04-20,1972-04-20,1972-04-20,N
3,5,1988-08-09,white,nonhispanic,F,198,1992-08-15,1992-08-29,Outpatient Visit,Perennial allergic rhinitis,...,,,,,,1991-10-24,,1992-08-09,1992-08-09,N
4,5,1988-08-09,white,nonhispanic,F,206,2020-03-10,2020-03-10,Outpatient Visit,Cough,...,130.0,84.0,132.8,88.4,14.4,2019-10-23,2010-08-09,1992-08-09,1992-08-09,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
156025,124148,2017-08-11,white,nonhispanic,M,3139386,2020-03-08,2020-03-08,Outpatient Visit,Cough,...,108.0,83.0,170.0,84.8,32.6,2019-12-27,,2018-08-11,2018-02-11,N
156026,124149,1948-12-16,black,nonhispanic,F,3139391,2019-10-23,2019-10-23,Outpatient Visit,Viral sinusitis,...,,,,,,2018-12-08,2010-12-16,1952-12-16,1952-12-16,N
156027,124149,1948-12-16,black,nonhispanic,F,3139397,2020-02-14,2020-03-06,Inpatient Visit,Acute respiratory failure:Pneumonia:Respirator...,...,119.0,73.0,149.8,87.5,13.6,2019-12-12,2010-12-16,1952-12-16,1952-12-16,N
156028,124149,1948-12-16,black,nonhispanic,F,3139393,2020-03-17,2020-03-17,Outpatient Visit,Viral sinusitis,...,,,,,,2019-12-04,2010-12-16,1952-12-16,1952-12-16,N


### Step 1 — Standardize Date Columns

First, we convert all date-like columns into proper datetime objects.

This doesn’t directly drive the Sankey, but it makes the dataset **consistent and safe** for  
joins, filters, and time-aware analysis later on (e.g., visit order, timing, follow-up work).


In [4]:
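date_cols = [
    "visit_start_date", "visit_end_date", "birth_datetime", "measurement_Date",
    "flu_last_administered", "tdap_last_administered",
    "mmr_last_administered", "polio_last_administered",
]

for c in date_cols:
    # skip any name that is not present in this extract
    if c in df.columns:
        # unparseable values become NaT instead of raising
        df[c] = pd.to_datetime(df[c], errors="coerce")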
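
Because `errors="coerce"` silently turns bad values into `NaT`, it can be worth checking how many non-missing entries were lost in the conversion. The following is a minimal sketch on a toy Series (the values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy column: one valid date, one malformed string, one genuinely missing value
raw = pd.Series(["2020-03-11", "not a date", None])

parsed = pd.to_datetime(raw, errors="coerce")

# Entries that were present in the raw data but could not be parsed
coerced = raw.notna() & parsed.isna()
n_coerced = int(coerced.sum())

print(f"{n_coerced} value(s) could not be parsed:", raw[coerced].tolist())
```

Running the same comparison per column on `df` before overwriting it would show whether any date column contains unexpected formats.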

### Step 2 — Make Outcome Labels Human-Readable

The raw dataset encodes death status as `"Y"` and `"N"`.  
Here we map those into clear categories: **Alive** and **Deceased**.  
Any missing or unexpected code is labelled **Unknown**.

This becomes the **final node** in the Sankey flow: the outcome for each visit.


In [5]:

# Modify labels for deceased column
df["deceased_flag"] = df["deceased"].map({"Y": "Deceased", "N": "Alive"}).fillna("Unknown").astype("category")
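
To illustrate how the mapping treats values outside `"Y"`/`"N"`, here is a small sketch on made-up codes. Note that `Series.map` with a dict is case-sensitive, so a lowercase `"y"` (a hypothetical dirty value) falls through to `Unknown`:

```python
import pandas as pd

# Toy codes: two clean values, one missing, one with unexpected casing
codes = pd.Series(["Y", "N", None, "y"])

labels = codes.map({"Y": "Deceased", "N": "Alive"}).fillna("Unknown")

# Tally of the resulting labels
counts = labels.value_counts().to_dict()
print(counts)
```

A large `Unknown` count on the real data would suggest normalizing the raw column (for example with `.str.upper()`) before mapping.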


### Step 3 — Split Multi-Condition Strings

A single visit can list multiple conditions in one text field (e.g.,  
`"Sinusitis : Cough : Respiratory infection"`).

We turn those colon-separated strings into a **clean list of individual conditions**, so that:
- Each condition is treated separately
- The Sankey can count flows **per condition**, not per raw text string


In [6]:
import re

# robust split on ":" allowing extra spaces; missing or blank input gives an empty list
def split_conditions(s):
    if pd.isna(s) or str(s).strip() == "":
        return []
    # split on ":" with optional surrounding spaces
    parts = re.split(r"\s*:\s*", str(s))
    # normalize: strip whitespace and drop empty pieces (original casing is kept)
    parts = [p.strip() for p in parts if p and p.strip()]
    return parts

# apply once to create a list-typed column
df["condition_list"] = df["condition"].map(split_conditions)
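
A quick way to see what the splitter does is to run it on a few edge cases. The sketch below repeats the same logic in compact form so that it runs on its own; the input strings are illustrative:

```python
import re
import pandas as pd

def split_conditions(s):
    """Same logic as the cell above, repeated so this example is self-contained."""
    if pd.isna(s) or str(s).strip() == "":
        return []
    return [p.strip() for p in re.split(r"\s*:\s*", str(s)) if p.strip()]

examples = [
    "Sinusitis : Cough : Respiratory infection",  # spaces around the separator
    "Cough::",                                     # trailing empty pieces
    "   ",                                         # blank string
    float("nan"),                                  # missing value
]

for s in examples:
    print(repr(s), "->", split_conditions(s))
```

Missing and blank inputs both yield an empty list, which `explode` later turns into a `NaN` row that can be dropped.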

### Step 4 — Reshape to One-Condition-Per-Row

We now **explode** the condition lists so that:

- Each row = one visit–condition pair  
- Empty or duplicated conditions are removed

This “long” format is what allows us to later count:

- how often each condition leads to each **visit type**, and  
- how visit types split into **Alive vs Deceased**.


In [7]:
cond_long = (
    df[["visit_occurrence_id", "person_id", "visit_start_date"]]
      .assign(condition_item=df["condition_list"])
      .explode("condition_item", ignore_index=True)
)

# drop rows where no condition exists after cleaning
cond_long = cond_long.dropna(subset=["condition_item"])

# (optional) dedupe within visit in case the same condition appears twice
cond_long = cond_long.drop_duplicates(subset=["visit_occurrence_id", "condition_item"])
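
The effect of the three steps (explode, drop empties, dedupe) is easiest to see on a tiny frame. This is a sketch with made-up visits, not the real data:

```python
import pandas as pd

# Toy input: visit 1 repeats "Cough", visit 3 has no conditions at all
toy = pd.DataFrame({
    "visit_occurrence_id": [1, 2, 3],
    "condition_list": [["Cough", "Wheezing", "Cough"], ["Viral sinusitis"], []],
})

toy_long = (
    toy.explode("condition_list", ignore_index=True)       # one row per list element
       .rename(columns={"condition_list": "condition_item"})
       .dropna(subset=["condition_item"])                   # empty lists explode to NaN
       .drop_duplicates(subset=["visit_occurrence_id", "condition_item"])
)

print(toy_long)
```

Visit 3 disappears entirely because it has no condition, and the repeated `Cough` in visit 1 is counted only once.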


### Step 5 — Build the Core Table for the Sankey

Here we assemble a working table that holds, for each visit–condition pair:

- the **condition** (from `cond_long`)
- the **visit type** (outpatient / inpatient / ER)
- the **outcome** (Alive / Deceased)

This merged table is the **foundation** of the Sankey: it links condition → visit type → outcome.


In [8]:
# Sankey (Condition → Visit Type → Deceased): Plotly imports, renderer, and base table

import plotly.graph_objects as go
import plotly.io as pio

pio.renderers.default = "notebook_connected"   # works in Jupyter classic/JLab

base = (cond_long[["visit_occurrence_id", "condition_item"]]
        .merge(df[["visit_occurrence_id", "visit_type", "deceased_flag"]],
               on="visit_occurrence_id", how="inner")
        .dropna(subset=["condition_item", "visit_type", "deceased_flag"]))
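
An inner merge drops unmatched rows silently, and a duplicated `visit_occurrence_id` on the visit side would multiply rows. One way to guard against both, sketched here on toy frames, is to use the `validate` and `indicator` parameters of `DataFrame.merge`. In the notebook both sides come from `df`, so no unmatched rows are expected; the unmatched ID `9` below is made up to show what a failure would look like:

```python
import pandas as pd

# Toy long table of conditions; visit 9 has no matching visit record
cond = pd.DataFrame({
    "visit_occurrence_id": [1, 1, 2, 9],
    "condition_item": ["Cough", "Wheezing", "Cough", "Fever"],
})

# Toy visit table with one row per visit
visits = pd.DataFrame({
    "visit_occurrence_id": [1, 2, 3],
    "visit_type": ["Inpatient Visit", "Outpatient Visit", "Outpatient Visit"],
    "deceased_flag": ["Deceased", "Alive", "Alive"],
})

checked = cond.merge(
    visits,
    on="visit_occurrence_id",
    how="left",
    validate="many_to_one",  # raises if a visit ID is duplicated in `visits`
    indicator=True,          # adds a `_merge` column describing each row's origin
)

# Condition rows that found no visit record
unmatched = checked.loc[checked["_merge"] == "left_only", "visit_occurrence_id"].tolist()
print("Unmatched visit IDs:", unmatched)
```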


### Step 6 — Focus on the Top Conditions

To keep the Sankey readable, we restrict the plot to the **top 12 most frequent conditions**.

That way, the diagram highlights the most important and common pathways,  
instead of being cluttered by rare edge cases.


In [9]:

TOP_K = 12

top_conditions = base["condition_item"].value_counts().head(TOP_K).index
base = base[base["condition_item"].isin(top_conditions)]
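
When truncating to the top conditions, it helps to know what share of rows survives the filter. A minimal sketch on made-up condition labels, with `k` standing in for `TOP_K`:

```python
import pandas as pd

# Toy condition column with clearly different frequencies
items = pd.Series(["Cough"] * 5 + ["Wheezing"] * 3 + ["Fever"] * 2 + ["Rash"])

k = 2
top = items.value_counts().head(k).index   # value_counts sorts by frequency, descending
kept = items[items.isin(top)]

coverage = len(kept) / len(items)
print(f"Top {k} conditions cover {coverage:.0%} of rows")
```

If coverage on the real data is low, the diagram describes only a minority of records and `TOP_K` may need to be raised.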


### Step 7 — Define Nodes for Condition, Visit Type, and Outcome

A Sankey diagram is defined in terms of:

- **Nodes**: blocks (conditions, visit types, outcomes)
- **Links**: flows between those nodes

Here we:
- collect all distinct condition labels,
- all visit types, and
- the outcome labels (Alive / Deceased),

and assign each one a numeric ID so Plotly can connect them.


In [10]:
# Build node list
conditions   = sorted(base["condition_item"].unique().tolist())
visit_types  = sorted(base["visit_type"].unique().tolist())
deceased_flg = sorted(base["deceased_flag"].unique().tolist())

nodes = conditions + visit_types + deceased_flg
node_idx = {name: i for i, name in enumerate(nodes)}
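
Because the index is a plain dict keyed by label, a label that appeared in two stages (say, a condition named the same as a visit type) would collapse into a single node. The sketch below shows the layout on toy labels and adds a simple uniqueness check; the labels are illustrative only:

```python
# Toy labels for the three stages
conditions = ["Cough", "Wheezing"]
visit_types = ["Inpatient Visit", "Outpatient Visit"]
outcomes = ["Alive", "Deceased"]

nodes = conditions + visit_types + outcomes
node_idx = {name: i for i, name in enumerate(nodes)}

# If two stages shared a label, the dict would have fewer keys than the list
labels_are_unique = len(node_idx) == len(nodes)

print(node_idx)
print("Labels unique across stages:", labels_are_unique)
```

Conditions occupy the first indices, visit types the next block, and outcomes the last, which is the order Plotly receives in `label=nodes`.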


### Step 8 — Measure Flows from Condition → Visit Type

We now count how many visits go from each **condition** to each **visit type**.

These counts become the **first set of links** in the Sankey, representing the left half  
of the diagram (Condition → Visit Type).


In [11]:

# Links: condition - visit_type
cv = (base.groupby(["condition_item", "visit_type"], observed=True)
           .size().reset_index(name="count"))
src1 = [node_idx[c] for c in cv["condition_item"]]
tgt1 = [node_idx[v] for v in cv["visit_type"]]
val1 = cv["count"].tolist()
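
To make the three parallel lists concrete, here is the same pattern on a four-row toy table (labels and counts are made up). Each position `i` describes one link: from node `src[i]` to node `tgt[i]` with width `val[i]`:

```python
import pandas as pd

# Toy base table: one row per visit-condition pair
toy_base = pd.DataFrame({
    "condition_item": ["Cough", "Cough", "Cough", "Wheezing"],
    "visit_type": ["Outpatient Visit", "Outpatient Visit",
                   "Inpatient Visit", "Inpatient Visit"],
})

toy_nodes = ["Cough", "Wheezing", "Inpatient Visit", "Outpatient Visit"]
toy_idx = {name: i for i, name in enumerate(toy_nodes)}

# Count rows per (condition, visit type) pair; groupby sorts the keys
toy_cv = (toy_base.groupby(["condition_item", "visit_type"])
                  .size().reset_index(name="count"))

src = [toy_idx[c] for c in toy_cv["condition_item"]]
tgt = [toy_idx[v] for v in toy_cv["visit_type"]]
val = toy_cv["count"].tolist()

print(list(zip(src, tgt, val)))
```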




### Step 9 — Measure Flows from Visit Type → Outcome

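Next, we count how many rows of `base` go from each **visit type** to each **outcome** (Alive vs Deceased).
Because `base` holds one row per visit–condition pair, a visit that lists several top conditions is counted
once per condition, which keeps each visit-type node's outflow equal to its inflow.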

These counts form the **second set of links**, representing the right half of the diagram  
(Visit Type → Outcome).


In [12]:
# Links: visit_type - deceased
vd = (base.groupby(["visit_type", "deceased_flag"], observed=True)
           .size().reset_index(name="count"))
src2 = [node_idx[v] for v in vd["visit_type"]]
tgt2 = [node_idx[d] for d in vd["deceased_flag"]]
val2 = vd["count"].tolist()
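
Since both sets of links are counted from the same table, the total flowing into each visit-type node should equal the total flowing out of it. The following sketch checks that property on a toy table (values are illustrative); the same comparison could be run on `cv` and `vd`:

```python
import pandas as pd

# Toy base table: one row per visit-condition pair
toy_base = pd.DataFrame({
    "condition_item": ["Cough", "Cough", "Wheezing", "Cough"],
    "visit_type": ["Outpatient Visit", "Inpatient Visit",
                   "Inpatient Visit", "Outpatient Visit"],
    "deceased_flag": ["Alive", "Deceased", "Alive", "Alive"],
})

# Left half: condition -> visit type, summed per visit type
inflow = (toy_base.groupby(["condition_item", "visit_type"]).size()
                  .groupby(level="visit_type").sum())

# Right half: visit type -> outcome, summed per visit type
outflow = (toy_base.groupby(["visit_type", "deceased_flag"]).size()
                   .groupby(level="visit_type").sum())

balanced = inflow.to_dict() == outflow.to_dict()
print("Inflow equals outflow for every visit type:", balanced)
```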



### Step 10 — Build and Render the Sankey Diagram

Finally, we pass:

- our list of **nodes** (labels), and  
- our two sets of **links** (Condition → Visit Type and Visit Type → Outcome)

into Plotly’s `go.Sankey` trace, wrapped in a `go.Figure`.

This produces a single interactive diagram that summarizes:

- which conditions dominate the dataset,
- how they distribute across visit types, and
- how those visit types split into Alive vs Deceased outcomes.


In [13]:
fig = go.Figure(data=[go.Sankey(
    arrangement="snap",
    node=dict(
        pad=16, thickness=16, line=dict(width=0.5, color="gray"),
        label=nodes
    ),
    link=dict(
        source=src1 + src2,
        target=tgt1 + tgt2,
        value=val1 + val2
    )
)])

fig.update_layout(
    title=f"Sankey: Condition → Visit Type → Deceased (Top {TOP_K} conditions)",
    font=dict(size=12)
)
fig.show()
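
If the figure renders empty or looks wrong, a common cause is inconsistent link lists. The helper below is a hypothetical sanity check (`check_links` is not part of Plotly or of this notebook) that could be run on `nodes`, `src1 + src2`, `tgt1 + tgt2`, and `val1 + val2` before building the figure:

```python
def check_links(nodes, source, target, value):
    """Hypothetical helper: raise ValueError if the Sankey link lists are inconsistent."""
    if not (len(source) == len(target) == len(value)):
        raise ValueError("source, target, and value must have the same length")
    n = len(nodes)
    for s, t, v in zip(source, target, value):
        # every index must point at an existing node label
        if not (0 <= s < n and 0 <= t < n):
            raise ValueError(f"link ({s}, {t}) points outside the node list")
        # flows are counts, so they should be strictly positive
        if v <= 0:
            raise ValueError(f"link ({s}, {t}) has non-positive value {v}")
    return True

# Toy example: Cough -> Outpatient Visit -> Alive, three records on each link
toy_nodes = ["Cough", "Outpatient Visit", "Alive"]
ok = check_links(toy_nodes, source=[0, 1], target=[1, 2], value=[3, 3])
print("Links are consistent:", ok)
```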
