# Using Synthea

Synthea is a synthetic data set that models the population of the state of [Massachusetts](https://en.wikipedia.org/wiki/Massachusetts) in the USA. The [source code](https://github.com/synthetichealth/synthea) used to generate this data is open source and can be adapted to other populations.

The data we are using is a small subset from the overall Synthea data set.

In this notebook we will use [Pandas](https://pandas.pydata.org/) to read in and visualize some of the data. We are going to take a very simple approach; much more thorough analyses could be conducted.


In [None]:
import pandas as pd
import itables
from venn import venn
%matplotlib inline

def report_alert_frac(alerts, reference):
    # Ratio of len(alerts) to len(reference), formatted as a percentage string such as "12.3%"
    return "%3.1f%%" % (100 * (len(alerts) / len(reference)))

#### Read in the Synthea data

In [None]:
encdata = pd.read_csv("encounters.csv.gz")
encdata.shape # how many observations and variables we have

## Examining tEMR activity

In a separate activity, we designed an alert to notify physicians in outpatient clinics that they should consider ordering an [A1C test](https://en.wikipedia.org/wiki/Glycated_hemoglobin#Measurement) for the current patient as a screening for diabetes.

Our alert was limited to:

- AGE > 45
- BMI > 25

## What kind of encounters do we have?

In [None]:
encdata.drop_duplicates(subset='Id_ENC', keep='first')["ENCOUNTERCLASS"].value_counts().plot.bar()

### Before going further let us limit data to outpatient encounters


In [None]:
encdata = encdata[encdata["ENCOUNTERCLASS"].isin(["wellness", "ambulatory", "outpatient"])]
encdata.drop_duplicates(subset='Id_ENC', keep='first')["ENCOUNTERCLASS"].value_counts().plot.bar()

### What is the distribution of ages in the data set?

In [None]:
encdata.drop_duplicates(subset='Id_ENC', keep='first')["AGE@ENC"].plot.hist(bins=50)

### What is the distribution of BMI in the data set?

In [None]:
encdata[encdata["DESCRIPTION_OBS"]=="Body Mass Index"].drop_duplicates(subset='Id_ENC', keep='first')["VALUENUMERIC"].plot.hist(bins=50)

### Other features

In [None]:
encdata.drop_duplicates(subset='Id_ENC', keep='first')["RACE"].value_counts().plot.bar()

In [None]:
encdata.drop_duplicates(subset='Id_ENC', keep='first')["GENDER"].value_counts().plot.bar()

### Use itables to explore the data

- `nan` indicates a missing value
- The data set is too large to explore in its entirety, so the cell below shows a random sample of 200 rows. Rerun the cell to draw a new sample, or change `n` (but do not make it too large).

In [None]:
itables.show(encdata.sample(n=200), maxColumns=0)

### How many encounters and patients do we have?

In [None]:
len(encdata['Id_ENC'].unique())

In [None]:
len(encdata['PATIENT'].unique())

### Select BMI Data

Limit data to encounters with a BMI observation

In [None]:
bmi=encdata[encdata['DESCRIPTION_OBS']=='Body Mass Index']

In [None]:
from collections import Counter
cs = Counter(bmi.Id_ENC)
cs.most_common(10)

In [None]:
itables.show(bmi[bmi.Id_ENC=='8a3a36ec-980e-449e-b21c-2b73c8980cd3'], maxColumns=0)

In [None]:
itables.show(bmi.sample(n=100), maxColumns=0)

### Make sure there are not any duplicate encounters

In [None]:
bmi = bmi.drop_duplicates(subset='Id_ENC', keep='first')
bmi.shape
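
To make the effect of `drop_duplicates` concrete, here is a minimal sketch on a made-up data frame (the encounter IDs and BMI values are hypothetical, not Synthea data). It assumes, as in the cell above, that one row per `Id_ENC` is wanted and that the first row encountered is the one to keep.

```python
import pandas as pd

# Toy data (hypothetical): encounter e1 has two BMI rows, e2 has one
toy_bmi = pd.DataFrame({
    "Id_ENC": ["e1", "e1", "e2"],
    "VALUENUMERIC": [27.1, 27.4, 23.0],
})

# Keep only the first row seen for each encounter ID
deduped = toy_bmi.drop_duplicates(subset="Id_ENC", keep="first")

print(deduped.shape)                     # rows, columns after de-duplication
print(deduped["VALUENUMERIC"].tolist())  # the BMI value retained per encounter
```

Note that `keep='first'` retains whichever row happens to come first, so if an encounter carries several different BMI readings, the retained value depends on row order.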

## Filter based on our alert

- Within Pandas we can combine conditions with
    - __|__: OR
    - __&__: AND
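
As a small illustration of the two operators, the sketch below applies them to a made-up data frame (four hypothetical encounters, not Synthea data) that reuses the column names `AGE@ENC` and `VALUENUMERIC` from this notebook.

```python
import pandas as pd

# Toy data (hypothetical): four encounters with an age and a BMI value
toy = pd.DataFrame({
    "Id_ENC": ["e1", "e2", "e3", "e4"],
    "AGE@ENC": [30, 50, 60, 40],
    "VALUENUMERIC": [22.0, 24.0, 31.0, 28.0],
})

# Each comparison needs its own parentheses: | and & bind more tightly than >
either = toy[(toy["AGE@ENC"] > 45) | (toy["VALUENUMERIC"] > 25)]  # OR
both = toy[(toy["AGE@ENC"] > 45) & (toy["VALUENUMERIC"] > 25)]    # AND

print(list(either["Id_ENC"]))  # encounters meeting at least one condition
print(list(both["Id_ENC"]))    # encounters meeting both conditions
```

With OR, an encounter qualifies on age alone or BMI alone; with AND, only encounters that satisfy both conditions remain.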

In [None]:
alerts = bmi[(bmi["AGE@ENC"]>45 ) | (bmi["VALUENUMERIC"]>25)]
itables.show(alerts, maxBytes=0)

## What Fraction of the Encounters Generate Our Alert?

- We will use the shape of the data frames to get the proportion

In [None]:
alerts.shape[0]/encdata.drop_duplicates(subset='Id_ENC', keep='first').shape[0]

In [None]:
encdata.drop_duplicates(subset='Id_ENC', keep='first').shape[0], bmi.shape[0]

### How Many Unique Patients Did We Generate an Alert for?

In [None]:
len(alerts["PATIENT"].unique())

## What other data could we filter on?

### What are our conditions?

In [None]:
cons = list(encdata["DESCRIPTION_CON"].dropna().unique())
cons.sort()
for c in cons:
    print(c)

### Diabetes conditions

In [None]:
for d in encdata["DESCRIPTION_CON"].dropna().unique():
    if 'diabetes' in d.lower():
        print(d)

### What are our observations?

In [None]:
obs = list(encdata["DESCRIPTION_OBS"].dropna().unique())
obs.sort()
for o in obs:
    print(o)

#### Potentially useful

- Body mass index (BMI) [Percentile] Per age and gender
- Glucose
- Hemoglobin A1c/Hemoglobin.total in Blood

### What are our medications?

In [None]:
meds = list(encdata["DESCRIPTION_MED"].dropna().unique())
meds.sort()
for m in meds:
    print(m)

In [None]:
for m in encdata["DESCRIPTION_MED"].dropna().unique():
    if 'insul' in m.lower():
        print(m)

### What are our procedures?

In [None]:
pros = list(encdata["DESCRIPTION_PRO"].dropna().unique())
pros.sort()
for p in pros:
    print(p)

In [None]:
for p in encdata["DESCRIPTION_PRO"].dropna().unique():
    if 'diabetes' in p.lower():
        print(p)

# Simple Exploration of Cohorts and Alert Frequency

The following portion of the notebook uses a (relatively) simple method to explore the cohort we would identify and the alerts we would generate. This exploration ignores all temporal information (e.g., we cannot look back at values/observations/procedures from a previous encounter).

We will use [sets](https://en.wikipedia.org/wiki/Set_(mathematics)) to create unique collections of encounter IDs and then use set operations to combine these. Sets are useful because they do not contain duplicate values. Thus $\left\{ 1, 1, 2, 3, 3 \right\} = \left\{1, 2, 3 \right\}$

- Union ($A \cup B$): The set of all elements that are in $A$ OR $B$.
- Intersection ($A \cap B$): The set of all elements that are in $A$ AND $B$.
- Difference ($A \setminus B$): The set of all elements that are in $A$ and are NOT in $B$.

#### Examples



In [None]:
A = {"Brian", "Wendy", "Susan", "Daniel"}
B = {"Brian", "Marta", "Matt", "Dennis", "Chris"}
C = {"Daniel", "Javeria", "Kathleen"}

In [None]:
A.union(B)

In [None]:
A.intersection(B)

In [None]:
(A.union(B)).difference(C)

In [None]:
(A.intersection(B)).difference(C)

In [None]:
(A.intersection(B)).union(C)
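
Python sets also support operator shorthand for the same operations. The sketch below repeats the example sets from above and checks that `|`, `&`, and `-` give the same results as the `union`, `intersection`, and `difference` methods; results are sorted only to make the printed order predictable.

```python
A = {"Brian", "Wendy", "Susan", "Daniel"}
B = {"Brian", "Marta", "Matt", "Dennis", "Chris"}
C = {"Daniel", "Javeria", "Kathleen"}

# Duplicates collapse when a set is built
print({1, 1, 2, 3, 3} == {1, 2, 3})

# | is union, & is intersection, - is difference
print((A | B) == A.union(B))
print((A & B) == A.intersection(B))
print(((A | B) - C) == (A.union(B)).difference(C))

# Everyone in A or B who is not in C
print(sorted((A | B) - C))
```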

## Create a set with all encounters

In [None]:
all_enc = set(encdata["Id_ENC"])

## Potential exclusions

In [None]:
diabetes_conds= [
"Diabetes",
"Neuropathy due to type 2 diabetes mellitus (disorder)",
"Diabetic retinopathy associated with type II diabetes mellitus (disorder)",
"Nonproliferative diabetic retinopathy due to type 2 diabetes mellitus (disorder)",
"Microalbuminuria due to type 2 diabetes mellitus (disorder)",
"Macular edema and retinopathy due to type 2 diabetes mellitus (disorder)",
"Proliferative diabetic retinopathy due to type II diabetes mellitus (disorder)"
]

insulin = [
"insulin human  isophane 70 UNT/ML / Regular Insulin  Human 30 UNT/ML Injectable Suspension [Humulin]",
"Insulin Lispro 100 UNT/ML Injectable Solution [Humalog]"
]

screen = ["Urine screening test for diabetes"]

In [None]:
diabetes_patients = set(encdata[encdata["DESCRIPTION_CON"].isin(diabetes_conds)]["PATIENT"])
diabetes = set(encdata[encdata["PATIENT"].isin(diabetes_patients)]["Id_ENC"])
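
The two-step lookup above first finds patients with a diabetes condition and then collects every encounter for those patients. A minimal sketch on made-up data (hypothetical patients `p1` and `p2`, not Synthea data) shows what that means in practice.

```python
import pandas as pd

# Toy data (hypothetical): two patients with two encounters each
toy = pd.DataFrame({
    "Id_ENC": ["e1", "e2", "e3", "e4"],
    "PATIENT": ["p1", "p1", "p2", "p2"],
    "DESCRIPTION_CON": ["Diabetes", None, None, "Hyperlipidemia"],
})
toy_conds = ["Diabetes"]

# Step 1: patients with a diabetes condition recorded at any encounter
toy_patients = set(toy[toy["DESCRIPTION_CON"].isin(toy_conds)]["PATIENT"])

# Step 2: every encounter belonging to those patients
toy_diabetes = set(toy[toy["PATIENT"].isin(toy_patients)]["Id_ENC"])

print(sorted(toy_patients))
print(sorted(toy_diabetes))
```

Encounter `e2` has no diabetes condition recorded on it, yet it lands in the exclusion set because it belongs to patient `p1`. Since this exploration ignores temporal information, that would hold even if `e2` took place before the diagnosis.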

## Potential inclusions

In [None]:
age_o = set(encdata[encdata["AGE@ENC"] > 45]["Id_ENC"]) # old age 😀 (same AGE > 45 threshold as the alert definition)
bmi_h = set(encdata[(encdata['DESCRIPTION_OBS']=='Body Mass Index') & (encdata['VALUENUMERIC']>25)]["Id_ENC"])
hyperlipidemia = set(encdata[(encdata['DESCRIPTION_CON']=='Hyperlipidemia')]["Id_ENC"])

In [None]:
len(bmi_h), len(age_o), len(diabetes), len(hyperlipidemia)

### How do these features/sets relate to each other?

- Use a Venn Diagram to visualize
- __Note__: ellipses are not scaled by the set size

In [None]:
features = {"diabetes":diabetes, "age":age_o, "bmi":bmi_h, "hyperlipidemia":hyperlipidemia}
venn(features)

### Let's generate a variety of alerts and see how they work

- We will evaluate performance by the percentage of encounters that generate an alert

#### Our original alert

In [None]:
a0 = bmi_h.union(age_o)
report_alert_frac(a0, all_enc)

#### Let's generate an alert for everyone over 45 OR (union) with a BMI > 25 who does NOT have a diabetes condition

In [None]:
a1 = bmi_h.union(age_o).difference(diabetes)
report_alert_frac(a1, all_enc)

#### How about AND (intersection) instead of OR?

In [None]:
a2 = bmi_h.intersection(age_o).difference(diabetes)
report_alert_frac(a2, all_enc)
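
To see how the choice between union and intersection changes the alert rate, here is a minimal sketch on made-up sets of encounter IDs (all `toy_*` names and values are hypothetical, not Synthea data). It restates the `report_alert_frac` helper so the block runs on its own.

```python
def report_alert_frac(alerts, reference):
    # Ratio of len(alerts) to len(reference) as a percentage string
    return "%3.1f%%" % (100 * (len(alerts) / len(reference)))

# Toy sets of encounter IDs (hypothetical)
toy_all = {"e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8"}
toy_age = {"e1", "e2", "e3", "e4"}   # encounters meeting the age criterion
toy_bmi = {"e3", "e4", "e5"}         # encounters meeting the BMI criterion
toy_dia = {"e4", "e8"}               # encounters of patients with diabetes

or_alert = (toy_bmi | toy_age) - toy_dia    # age OR BMI, excluding diabetes
and_alert = (toy_bmi & toy_age) - toy_dia   # age AND BMI, excluding diabetes

print(report_alert_frac(or_alert, toy_all))
print(report_alert_frac(and_alert, toy_all))
```

In this toy case the OR alert fires on half of the encounters while the AND alert fires on one in eight, which is the kind of trade-off between coverage and alert burden that the cells above measure on the real data.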

In [None]:
len(all_enc.difference(diabetes))  # encounters from patients without a diabetes condition

In [None]:
num_diabetes_patients = len(encdata[encdata["Id_ENC"].isin(diabetes)]["PATIENT"].unique())

In [None]:
num_diabetes_patients / len(encdata["PATIENT"].unique())