# Focusing On Sankey Graph

**- Build a profile of providers referring patients to the major hospitals in Nashville**<br>
**- Are certain specialties more likely to refer to a particular hospital over the others?**

In [1]:
import pandas as pd
import plotly.graph_objects as go

In [2]:
pd.set_option("display.max_columns", 500)
pd.set_option('display.max_rows', 500)

## Reading-in Datasets

In [3]:
nashville_referrals = pd.read_csv("../data/nashville_referrals_normalised_only_hospitals_6436.csv")
# for_testing = pd.read_csv("../data/nashville-referrals-summarized-6436.csv")

## Cleaning Datasets

- `from_npi` - The provider seen first in sequence, coded by NPI
- `to_npi` - The provider seen second in sequence, coded by NPI
- `patient_count` - The total number of patients shared between the two providers over the entire time period (the time period is typically one year)
- `transaction_count` - The count of times that a patient switched between the two providers, in the from-to direction.
- `average_day_wait` - The average number of days it took for a "HOP" to occur, that is, the time in days for a patient to switch to the second provider after having seen the first provider.
- `std_day_wait` - The standard deviation of days it took for a HOP to occur.

For our purpose, we only need the following columns:
- from_npi
- from_npi_specialty
- to_npi
- to_facility_group
- to_facility_name_normalised
- patient_count

In [4]:
nashville_referrals = nashville_referrals[[
    "from_npi",
    "from_npi_specialty",
    "to_npi",
    "to_facility_group",
    "to_facility_name_normalised",
    "patient_count"
]].copy()  # Copy so the dtype fixes below modify an independent frame, not a slice of the original

In [5]:
# Fix data types
nashville_referrals["from_npi"] = nashville_referrals["from_npi"].astype(str)
nashville_referrals["to_npi"] = nashville_referrals["to_npi"].astype(str)
nashville_referrals["from_npi_specialty"] = nashville_referrals["from_npi_specialty"].astype(str)

In [6]:
# Rename columns
nashville_referrals = nashville_referrals.rename(columns={
    "from_npi": "referrer.npi", 
    "from_npi_specialty": "referrer.specialty",
    "to_npi": "hospital.npi",
    "to_facility_group": "hospital.facility_group", 
    "to_facility_name_normalised": "hospital.facility_name",
    "patient_count": "referral.patient_count"
})

In [7]:
display(nashville_referrals.shape)
display(nashville_referrals.head())
display(nashville_referrals.info())

(6436, 6)

Unnamed: 0,referrer.npi,referrer.specialty,hospital.npi,hospital.facility_group,hospital.facility_name,referral.patient_count
0,1013179860,Internal Medicine,1417938846,Macon County General Hospital,Macon County General Hospital,71
1,1336126887,Urology,1417938846,Macon County General Hospital,Macon County General Hospital,55
2,1336230424,Internal Medicine,1417938846,Macon County General Hospital,Macon County General Hospital,38
3,1346288966,Specialist,1417938846,Macon County General Hospital,Macon County General Hospital,146
4,1326086653,Internal Medicine,1417938846,Macon County General Hospital,Macon County General Hospital,43


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6436 entries, 0 to 6435
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   referrer.npi             6436 non-null   object
 1   referrer.specialty       6436 non-null   object
 2   hospital.npi             6436 non-null   object
 3   hospital.facility_group  6436 non-null   object
 4   hospital.facility_name   6436 non-null   object
 5   referral.patient_count   6436 non-null   int64 
dtypes: int64(1), object(5)
memory usage: 301.8+ KB


None

In [8]:
nashville_referrals['referrer.specialty'].unique()

array(['Internal Medicine', 'Urology', 'Specialist', 'Family Medicine',
       'Physician Assistant', 'Radiology',
       'Physical Medicine & Rehabilitation', 'Surgery', 'Chiropractor',
       'Emergency Medicine', 'Nurse Practitioner', 'General Practice',
       'Podiatrist', 'Orthopaedic Surgery',
       'Thoracic Surgery (Cardiothoracic Vascular Surgery)',
       'Psychiatry & Neurology', 'Anesthesiology',
       'Nurse Anesthetist, Certified Registered', 'Neurological Surgery',
       'Counselor', 'Ophthalmology', 'Obstetrics & Gynecology',
       'Hospitalist', 'Optometrist', 'Otolaryngology', 'Dermatology',
       'Clinical Nurse Specialist', 'Social Worker', 'Psychologist',
       'Physical Therapist', 'Pain Medicine', 'Allergy & Immunology',
       'Pediatrics', 'nan', 'Pathology', 'Registered Nurse',
       'Plastic Surgery', 'Independent Medical Examiner',
       'Legal Medicine', 'Clinical Neuropsychologist',
       'Transplant Surgery',
       'Student in an Organized Heal

Make a backup copy of the `nashville_referrals` dataset so we can re-use it later.

In [9]:
nash_ref_backup = nashville_referrals.copy(deep=True)

## Sankey Diagram 1: Specialties to Hospitals

**`referrer.specialty` OPTION 1**

This is the list of `referrer.specialty` values that are not analysed individually. With this option, every specialty beyond the top 7 by total patient count is moved under the `Other Specialities` category.

In [10]:
# Specializations: List of those beyond top 7 to change to "Other Specialties"
to_others_list = nashville_referrals.groupby([
    "referrer.specialty"
]).agg({
    "referral.patient_count": "sum"
}).sort_values(by=["referral.patient_count"], ascending=(False)).index[7:] # Keep Top 7 Specializations: Beyond that ==> Other Specialties
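As a minimal sketch of the Option 1 idea on made-up data, the top-N selection and the relabelling can be done in two vectorised steps. The frame `toy`, the constant `TOP_N` and the counts are assumptions for illustration only; the notebook itself keeps the top 7.

```python
import pandas as pd

# Toy stand-in for nashville_referrals (values are made up for illustration)
toy = pd.DataFrame({
    "referrer.specialty": ["Internal Medicine", "Radiology", "Urology", "Radiology", "Counselor"],
    "referral.patient_count": [100, 80, 20, 40, 5],
})

TOP_N = 2  # hypothetical; the notebook keeps 7, 2 keeps the toy example readable

# Total patients per specialty, then the N largest totals
totals = toy.groupby("referrer.specialty")["referral.patient_count"].sum()
top_specialties = totals.nlargest(TOP_N).index

# Keep values that are in the top N, replace everything else in one step
toy["referrer.specialty"] = toy["referrer.specialty"].where(
    toy["referrer.specialty"].isin(top_specialties), "Other Specialities"
)
print(toy["referrer.specialty"].tolist())
```

`Series.where` keeps a value where the condition is true and substitutes the replacement elsewhere, so no explicit loop over rows is needed.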

**`referrer.specialty` OPTION 2**

In other graphs, we manually override the list and keep only the specialties below. Anything else goes under `Other Specialities`.

In [11]:
# # This is in order of importance: If you need to remove, start from the bottom
original_to_keep = [
    'Internal Medicine',
    'Radiology',
    'Nurse Practitioner',
#     'Family Medicine',
    'Pathology',
    'Orthopaedic Surgery',
    'Psychiatry & Neurology',
    'Surgery',
#     'Specialist',
    'Physician Assistant',
    'Ophthalmology',
    'Urology',
#     'Otolaryngology'
]

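Replace specialties with `Other Specialities` in `nashville_referrals`. Option 1 (commented out) replaces those in `to_others_list`; Option 2, the one that is actually run, replaces everything not in `original_to_keep`.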

In [12]:
# # OPTION 1: Replace them in nashville_referrals
# for spec in nashville_referrals["referrer.specialty"]:
#     if spec in to_others_list:
#         nashville_referrals.loc[nashville_referrals["referrer.specialty"] == spec, "referrer.specialty"] = "Other Specialities"
#         continue

In [13]:
# OPTION 2: Keep only the specialties listed in original_to_keep
# Everything else is relabelled "Other Specialities" in a single vectorised step
is_kept = nashville_referrals["referrer.specialty"].isin(original_to_keep)
nashville_referrals.loc[~is_kept, "referrer.specialty"] = "Other Specialities"

**`hospital.facility_name`**

This is the list of `hospital.facility_name` values that would not be analysed individually and would be moved under the `Others` category. The cells below are commented out, so every facility is kept.

In [14]:
# # hospital.facility_name: List of those beyond top 7 to change to "Others"
# to_others_list = nashville_referrals.groupby([
#     "hospital.facility_name"
# ]).agg({
#     "referral.patient_count": "sum"
# }).sort_values(by=["referral.patient_count"], ascending=(False)).index[7:] # Keep Top 7 Facilities: Beyond that ==> Others

Replace the facility names in `to_others_list` with `Others` in `nashville_referrals`

In [15]:
# # Replace them in nashville_referrals
# for name in nashville_referrals["hospital.facility_name"]:
#     if name in to_others_list:
#         nashville_referrals.loc[nashville_referrals["hospital.facility_name"] == name, "hospital.facility_name"] = "Others"
#         continue

This is a good point to check the state of the data.

In [16]:
display(nashville_referrals.shape)
display(nashville_referrals[nashville_referrals["referrer.specialty"] == "Other Specialities"])
display(nashville_referrals["referrer.specialty"].unique())
display(nashville_referrals[nashville_referrals["hospital.facility_name"] == "Others"])
display(nashville_referrals["hospital.facility_name"].unique())

(6436, 6)

Unnamed: 0,referrer.npi,referrer.specialty,hospital.npi,hospital.facility_group,hospital.facility_name,referral.patient_count
3,1346288966,Other Specialities,1417938846,Macon County General Hospital,Macon County General Hospital,146
6,1508916586,Other Specialities,1417938846,Macon County General Hospital,Macon County General Hospital,96
9,1104837327,Other Specialities,1417938846,Macon County General Hospital,Macon County General Hospital,52
16,1437107265,Other Specialities,1417938846,Macon County General Hospital,Macon County General Hospital,33
18,1407836356,Other Specialities,1417938846,Macon County General Hospital,Macon County General Hospital,208
...,...,...,...,...,...,...
6424,1386878957,Other Specialities,1447639398,Ascension Saint Thomas,Saint Thomas Stones River Hospital,75
6425,1376544767,Other Specialities,1447639398,Ascension Saint Thomas,Saint Thomas Stones River Hospital,285
6427,1922278126,Other Specialities,1447639398,Ascension Saint Thomas,Saint Thomas Stones River Hospital,107
6432,1770521577,Other Specialities,1447639398,Ascension Saint Thomas,Saint Thomas Stones River Hospital,91


array(['Internal Medicine', 'Urology', 'Other Specialities',
       'Physician Assistant', 'Radiology', 'Surgery',
       'Nurse Practitioner', 'Orthopaedic Surgery',
       'Psychiatry & Neurology', 'Ophthalmology', 'Pathology'],
      dtype=object)

Unnamed: 0,referrer.npi,referrer.specialty,hospital.npi,hospital.facility_group,hospital.facility_name,referral.patient_count


array(['Macon County General Hospital', 'Maury Regional Medical Center',
       'StoneCrest Medical Center HCA',
       'Southern Hills Medical Center HCA',
       'Ashland City Medical Center HCA',
       'Hendersonville Medical Center HCA',
       'TriStar Skyline Medical Center HCA', 'Summit Medical Center HCA',
       'Saint Thomas West Hospital', 'Centennial Medical Center HCA',
       'TriStar Horizon Medical Center HCA',
       'Vanderbilt University Lebanon Medical Center',
       'Williamson County Hospital', 'Saint Thomas Midtown Hospital',
       'NorthCrest Medical Center', 'Nashville General Hosptial',
       'Saint Thomas Rutherford Hospital',
       'Vanderbilt University Medical Center',
       'Sumner Regional Medical Center',
       'Riverview Regional Medical Center', 'Trousdale Medical Center',
       'Saint Thomas Stones River Hospital'], dtype=object)

### Base Dataset

Here, we build the aggregated dataset on which the final diagram is based: one row per specialty-hospital pair, with the summed patient count.

In [17]:
nash_referrals_by_speciality = nashville_referrals.groupby([
    "referrer.specialty",
    "hospital.facility_name",
]).agg({
    "referral.patient_count": "sum"
}).reset_index().rename(columns={
    "hospital.facility_name": "hospital.name",
    "referral.patient_count": "sum_patient_count"
}).sort_values(by=["sum_patient_count"], ascending=(False))\
  .reset_index(drop=True)

In [18]:
display(nash_referrals_by_speciality)

Unnamed: 0,referrer.specialty,hospital.name,sum_patient_count
0,Internal Medicine,Vanderbilt University Medical Center,138232
1,Other Specialities,Vanderbilt University Medical Center,96452
2,Radiology,Vanderbilt University Medical Center,69766
3,Internal Medicine,Centennial Medical Center HCA,50122
4,Internal Medicine,Saint Thomas West Hospital,46091
5,Radiology,Centennial Medical Center HCA,40110
6,Radiology,Saint Thomas West Hospital,34746
7,Nurse Practitioner,Vanderbilt University Medical Center,33901
8,Other Specialities,Saint Thomas West Hospital,29437
9,Radiology,TriStar Skyline Medical Center HCA,27040


### Prepping the Diagram

This is the list of every *node* on the diagram (specialties first, then hospitals). Each label appears exactly once.

In [19]:
# Series.append was removed in pandas 2.0, so concatenate with pd.concat instead
all_nodes = list(pd.concat([
    nash_referrals_by_speciality["referrer.specialty"],
    nash_referrals_by_speciality["hospital.name"],
]).unique())
len(all_nodes)

33

This is the list of all the individual sources (referrers) in a specific order.<br>
There will be duplicates in this order but that is ok.

In [20]:
source_data = nash_referrals_by_speciality["referrer.specialty"]
len(source_data)

213

These are the matching indices in `all_nodes` for each of the sources (referrers) above.

In [21]:
source_data_idx = []

for s in source_data:
    for i, n in enumerate(all_nodes):
        if s == n:
            source_data_idx.append(i)
            break

print(len(source_data_idx))
print(source_data_idx)

213
[0, 1, 2, 0, 0, 2, 2, 3, 1, 2, 2, 1, 0, 0, 2, 1, 0, 1, 2, 2, 0, 2, 1, 0, 0, 1, 1, 1, 4, 5, 0, 2, 2, 1, 0, 6, 1, 5, 2, 2, 2, 1, 2, 1, 7, 0, 1, 8, 3, 1, 0, 3, 3, 0, 0, 0, 9, 2, 7, 3, 1, 7, 5, 7, 10, 5, 0, 8, 3, 1, 3, 5, 10, 3, 3, 8, 8, 3, 10, 5, 7, 8, 3, 4, 4, 3, 4, 1, 5, 9, 3, 8, 10, 3, 10, 0, 5, 5, 6, 7, 7, 4, 8, 0, 10, 9, 5, 7, 7, 2, 9, 5, 3, 6, 7, 9, 10, 9, 4, 1, 10, 3, 9, 1, 7, 7, 4, 2, 9, 10, 3, 3, 10, 8, 9, 8, 1, 7, 8, 10, 10, 10, 2, 5, 10, 10, 1, 8, 2, 7, 8, 5, 7, 4, 8, 4, 5, 6, 2, 4, 8, 7, 8, 0, 6, 4, 9, 3, 4, 4, 9, 9, 6, 2, 8, 4, 9, 3, 9, 9, 5, 3, 0, 10, 0, 7, 4, 4, 9, 7, 6, 10, 6, 3, 8, 4, 6, 6, 6, 7, 9, 9, 6, 4, 9, 10, 4, 8, 10, 8, 6, 7, 7]
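The label-to-index lookup above can also be sketched with a dictionary, which avoids rescanning the node list for every link. The names `toy_nodes`, `toy_sources`, `toy_targets` and `node_index` are hypothetical and the labels are made up.

```python
# Toy node list and link endpoints (made-up labels)
toy_nodes = ["Radiology", "Urology", "Hospital A", "Hospital B"]
toy_sources = ["Radiology", "Urology", "Radiology"]
toy_targets = ["Hospital A", "Hospital A", "Hospital B"]

# Map each label to its position in the node list
node_index = {label: i for i, label in enumerate(toy_nodes)}

# Translate labels into positions, one entry per link
toy_source_idx = [node_index[s] for s in toy_sources]
toy_target_idx = [node_index[t] for t in toy_targets]

print(toy_source_idx)
print(toy_target_idx)
```

One behavioural difference from the nested loop: a label missing from the node list raises a `KeyError` here, whereas the loop would silently skip it and leave the index list shorter than the link list.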


This is the list of all the individual targets (hospitals) in a specific order.<br>
There will be duplicates in this order but that is ok.

In [22]:
target_data = nash_referrals_by_speciality["hospital.name"]
len(target_data)

213

These are the matching indices in `all_nodes` for each of the targets (hospitals) above.

In [23]:
target_data_idx = []

for t in target_data:
    for i, n in enumerate(all_nodes):
        if t == n:
            target_data_idx.append(i)
            break
            
print(len(target_data_idx))
print(target_data_idx)

213
[11, 11, 11, 12, 13, 12, 13, 11, 13, 14, 15, 15, 16, 15, 17, 12, 18, 18, 18, 16, 19, 19, 19, 20, 14, 17, 16, 14, 11, 11, 17, 21, 22, 20, 21, 11, 21, 12, 23, 24, 25, 22, 26, 23, 11, 23, 24, 11, 13, 26, 22, 15, 12, 24, 25, 26, 11, 20, 15, 14, 25, 19, 13, 12, 15, 24, 27, 13, 16, 28, 20, 16, 12, 21, 18, 16, 12, 19, 11, 25, 17, 15, 22, 12, 15, 23, 14, 27, 20, 12, 26, 19, 21, 17, 22, 28, 22, 23, 12, 13, 16, 13, 18, 29, 17, 19, 17, 23, 21, 27, 15, 19, 24, 13, 18, 13, 16, 20, 19, 30, 20, 25, 18, 29, 20, 14, 18, 31, 21, 14, 28, 27, 19, 17, 17, 26, 31, 26, 21, 23, 27, 18, 29, 21, 25, 26, 32, 14, 28, 25, 20, 14, 24, 20, 22, 16, 18, 22, 30, 23, 23, 22, 25, 32, 15, 24, 24, 32, 22, 17, 14, 16, 19, 32, 24, 25, 22, 30, 26, 25, 30, 31, 30, 24, 31, 28, 21, 26, 28, 27, 20, 29, 26, 29, 27, 29, 18, 28, 21, 29, 23, 29, 24, 28, 27, 28, 27, 28, 31, 29, 25, 31, 30]


This is the value to represent on the diagram: Here, we are using **Total Patient Count** for each referral

In [24]:
total_patient_count = list(nash_referrals_by_speciality["sum_patient_count"])
len(total_patient_count)

213
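The three link lists have to line up position by position, which is why each length is printed above. A small check along these lines could catch a mismatch before plotting; `check_sankey_inputs` is a hypothetical helper, not part of pandas or plotly.

```python
def check_sankey_inputs(nodes, source_idx, target_idx, values):
    """Raise ValueError if the link lists are inconsistent with the node list."""
    # One source, one target and one value per link
    if not (len(source_idx) == len(target_idx) == len(values)):
        raise ValueError("source, target and value must have the same length")
    # Duplicate labels would make label-to-index lookups ambiguous
    if len(set(nodes)) != len(nodes):
        raise ValueError("node labels must be unique")
    # Every index must point at an existing node
    for idx in list(source_idx) + list(target_idx):
        if not 0 <= idx < len(nodes):
            raise ValueError(f"index {idx} does not point at a node")
    # Link widths are counts, so they should not be negative
    if any(v < 0 for v in values):
        raise ValueError("link values must be non-negative")
    return True


# Toy example: two links out of node 0
print(check_sankey_inputs(["a", "b", "c"], [0, 0], [1, 2], [5, 3]))
```

In the notebook this would be called as `check_sankey_inputs(all_nodes, source_data_idx, target_data_idx, total_patient_count)`.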

### Building The Diagram

In [25]:
fig = go.Figure(data=[go.Sankey(
        
    # Define all the nodes
    node = dict(
      pad = 30, # The spacing size between the separations
      thickness = 50, # Thickness of the nodes
      line = dict(color = "black", width = 0.5), # Margin Line of the nodes
      label = all_nodes, # Label of each node: specialties first, then hospital names
      #color = "green" # Color of the nodes: 1 value for overall color, or a list to match the label order
    ),
    
    # Define how the nodes will be linked to each other
    link = dict(
      source = source_data_idx, # From indices: each index is a position in the label list
      target = target_data_idx, # To indices: each index is a position in the label list
      value = total_patient_count, # Link width: total patient count for each source-target pair
      #label = data['data'][0]['link']['label'], # Custom Hover label on the connector
      #color = data['data'][0]['link']['color'] # Color of the connector
))])

# Styling
fig.update_layout(
    title_text="Total Patient Count Referrals:<br>From Referrer Specialities to Individual Hospital",
    font_size=14
)

# Display figure
fig.show()

# Export to HTML
fig.write_html("./speciality-to-individual-hospital.html")

## Sankey Diagram 2: Specialties to Facility Groups

Restore `nashville_referrals` from backup

In [26]:
nashville_referrals = nash_ref_backup.copy(deep=True)

Now repeat the previous steps, grouping by facility group instead of by individual hospital.
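Since the same preparation is repeated for each diagram, it could be wrapped in one function. The sketch below is one possible shape, not the notebook's actual code: `build_sankey_inputs` and the toy frame are hypothetical, and only pandas is used so it runs without plotly.

```python
import pandas as pd


def build_sankey_inputs(df, source_col, target_col, value_col):
    """Aggregate df into Sankey links and translate labels into node indices."""
    # One row per source-target pair, largest flows first
    links = (
        df.groupby([source_col, target_col], as_index=False)[value_col]
        .sum()
        .sort_values(by=value_col, ascending=False)
        .reset_index(drop=True)
    )
    # Unique node labels: sources first, then targets
    nodes = list(pd.concat([links[source_col], links[target_col]]).unique())
    node_index = {label: i for i, label in enumerate(nodes)}
    return {
        "nodes": nodes,
        "source": [node_index[s] for s in links[source_col]],
        "target": [node_index[t] for t in links[target_col]],
        "value": links[value_col].tolist(),
    }


# Toy data with made-up counts
toy = pd.DataFrame({
    "referrer.specialty": ["Radiology", "Radiology", "Urology"],
    "hospital.facility_group": ["Group A", "Group A", "Group B"],
    "referral.patient_count": [10, 5, 7],
})
inputs = build_sankey_inputs(
    toy, "referrer.specialty", "hospital.facility_group", "referral.patient_count"
)
print(inputs)
```

The returned lists map directly onto the `label`, `source`, `target` and `value` arguments of `go.Sankey` used in the diagram cells.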

**Specialties**

In [27]:
# # OPTION 1: Specialties: List of those beyond top 7 to change to "Other Specialties"
# to_others_list = nashville_referrals.groupby([
#     "referrer.specialty"
# ]).agg({
#     "referral.patient_count": "sum"
# }).sort_values(by=["referral.patient_count"], ascending=(False)).index[7:] # Keep Top 7 Specializations: Beyond that ==> Other Specialties

In [28]:
# # OPTION 1: Specialties: Replace them in nashville_referrals
# for spec in nashville_referrals["referrer.specialty"]:
#     if spec in to_others_list:
#         nashville_referrals.loc[nashville_referrals["referrer.specialty"] == spec, "referrer.specialty"] = "Other Specialities"
#         continue

In [29]:
# OPTION 2: Specialties to keep
# This is in order of importance: If you need to remove, start from the bottom
original_to_keep = [
    'Internal Medicine',
    'Radiology',
    'Nurse Practitioner',
#     'Family Medicine',
    'Pathology',
    'Orthopaedic Surgery',
    'Psychiatry & Neurology',
    'Surgery',
#     'Specialist',
    'Physician Assistant',
    'Ophthalmology',
    'Urology',
#     'Otolaryngology'
]

In [30]:
# OPTION 2: Specialties: Keep only those listed in original_to_keep
# Everything else is relabelled "Other Specialities" in a single vectorised step
is_kept = nashville_referrals["referrer.specialty"].isin(original_to_keep)
nashville_referrals.loc[~is_kept, "referrer.specialty"] = "Other Specialities"

**Facility Group**

The cells below are commented out so that every facility group is kept. Uncomment them to collapse the groups beyond the top 7 into `Others`.

In [31]:
# # hospital.facility_group: List of those beyond top 7 to change to "Others"
# to_others_list = nashville_referrals.groupby([
#     "hospital.facility_group"
# ]).agg({
#     "referral.patient_count": "sum"
# }).sort_values(by=["referral.patient_count"], ascending=(False)).index[7:] # Keep Top 7 Facilities: Beyond that ==> Others

In [32]:
# # Replace them in nashville_referrals
# for el in nashville_referrals["hospital.facility_group"]:
#     if el in to_others_list:
#         nashville_referrals.loc[nashville_referrals["hospital.facility_group"] == el, "hospital.facility_group"] = "Others"
#         continue

### Base Dataset

In [33]:
nash_referrals_by_speciality = nashville_referrals.groupby([
    "referrer.specialty",
    "hospital.facility_group",
]).agg({
    "referral.patient_count": "sum"
}).reset_index().rename(columns={
    "hospital.facility_group": "facility_group",
    "referral.patient_count": "sum_patient_count"
}).sort_values(by=["sum_patient_count"], ascending=(False))\
  .reset_index(drop=True)

### Prepping the Diagram

In [34]:
# Series.append was removed in pandas 2.0, so concatenate with pd.concat instead
all_nodes = list(pd.concat([
    nash_referrals_by_speciality["referrer.specialty"],
    nash_referrals_by_speciality["facility_group"],
]).unique())
len(all_nodes)

22

In [35]:
source_data = nash_referrals_by_speciality["referrer.specialty"]
len(source_data)

107

In [36]:
source_data_idx = []

for s in source_data:
    for i, n in enumerate(all_nodes):
        if s == n:
            source_data_idx.append(i)
            break

print(len(source_data_idx))
print(source_data_idx)

107
[0, 1, 0, 2, 0, 2, 1, 1, 2, 3, 1, 2, 0, 4, 0, 1, 2, 0, 3, 4, 5, 3, 2, 6, 7, 7, 1, 8, 9, 2, 4, 3, 9, 9, 5, 0, 10, 1, 10, 7, 7, 7, 8, 0, 5, 8, 2, 3, 10, 3, 6, 9, 5, 8, 2, 4, 3, 9, 0, 0, 6, 10, 1, 10, 4, 10, 5, 8, 2, 7, 3, 3, 8, 9, 7, 8, 1, 8, 2, 1, 9, 5, 0, 6, 3, 6, 1, 10, 7, 5, 10, 7, 6, 8, 6, 3, 9, 5, 6, 7, 10, 5, 10, 8, 5, 9, 9]


In [37]:
target_data = nash_referrals_by_speciality["facility_group"]
len(target_data)

107

In [38]:
target_data_idx = []

for t in target_data:
    for i, n in enumerate(all_nodes):
        if t == n:
            target_data_idx.append(i)
            break
            
print(len(target_data_idx))
print(target_data_idx)

107
[11, 12, 12, 11, 13, 12, 11, 13, 13, 11, 14, 14, 14, 12, 15, 15, 15, 16, 12, 11, 11, 13, 16, 11, 11, 12, 17, 12, 11, 17, 13, 14, 13, 12, 12, 17, 11, 16, 12, 14, 13, 15, 14, 18, 13, 11, 19, 16, 13, 15, 12, 14, 14, 13, 18, 16, 17, 15, 19, 20, 13, 15, 18, 14, 15, 16, 15, 16, 20, 16, 19, 18, 15, 17, 17, 18, 20, 17, 21, 19, 16, 16, 21, 14, 21, 15, 21, 17, 19, 17, 19, 18, 16, 20, 17, 20, 18, 20, 19, 20, 20, 19, 18, 19, 18, 19, 20]


In [39]:
total_patient_count = list(nash_referrals_by_speciality["sum_patient_count"])
len(total_patient_count)

107

### Building The Diagram

In [40]:
fig = go.Figure(data=[go.Sankey(
        
    # Define all the nodes
    node = dict(
      pad = 30, # The spacing size between the separations
      thickness = 50, # Thickness of the nodes
      line = dict(color = "black", width = 0.5), # Margin Line of the nodes
      label = all_nodes, # Label of each node: specialties first, then facility groups
      #color = "green" # Color of the nodes: 1 value for overall color, or a list to match the label order
    ),
    
    # Define how the nodes will be linked to each other
    link = dict(
      source = source_data_idx, # From indices: each index is a position in the label list
      target = target_data_idx, # To indices: each index is a position in the label list
      value = total_patient_count, # Link width: total patient count for each source-target pair
      #label = data['data'][0]['link']['label'], # Custom Hover label on the connector
      #color = data['data'][0]['link']['color'] # Color of the connector
))])

# Styling
fig.update_layout(
    title_text="Total Patient Count Referrals:<br>From Referrer Specialities to Facility Groups",
    font_size=14
)

# Display figure
fig.show()

# Export to HTML
fig.write_html("./speciality-to-facility-group.html")