# SyntheticMass
Analysis of [Synthea SyntheticMass](https://synthea.mitre.org/downloads), an open-source synthetic patient population.

## Predicting Hypertension

The goal of this exercise is to predict the probability that a patient has hypertension (SNOMED CT diagnosis code 38341003) on a given date.

On inspecting the data, we find that 98% of patients diagnosed with hypertension "developed" it between the ages of 18 and 20. The likely reason can be found in the SyntheticMass source code: the [clinical care map](https://github.com/synthetichealth/synthea/blob/master/src/main/resources/modules/hypertension.json) used to model hypertension specifies that "onset" and progression take place within several months of the first wellness encounter, and anyone under 18 is excluded.

My prediction is this: for anyone over 20 years old, the probability of having hypertension on a given date is 15%. Because of how the data is generated, there are no other features to engineer, so the proposed exercise is not very meaningful on the provided dataset.
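
A minimal sketch of this constant-rate baseline follows. The function and parameter names are hypothetical, and treating everyone aged 20 or younger as probability 0 is a simplifying assumption about the 18 to 20 onset window, not something the data confirms:

```python
from datetime import date


def hypertension_probability(birthdate: date, on_date: date,
                             prevalence: float = 0.15,
                             min_age: float = 20.0) -> float:
    """Constant-rate baseline: `prevalence` for anyone older than `min_age`.

    The defaults (15%, 20 years) are the figures quoted in the text; the
    hard cut-off at `min_age` is an illustrative assumption.
    """
    age_years = (on_date - birthdate).days / 365.25
    return prevalence if age_years > min_age else 0.0


print(hypertension_probability(date(1980, 1, 1), date(2015, 6, 1)))  # adult
print(hypertension_probability(date(2000, 1, 1), date(2015, 6, 1)))  # minor
```

The prediction depends only on age, which is the point: under this generator no other patient attribute should improve on it.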

Inspecting `observations.csv` corroborates this: 85% of systolic blood pressure measurements are uniformly distributed between 100 and 139 mmHg, and the other 15% are uniformly distributed between 140 and 200 mmHg (leading to a diagnosis of hypertension). Looking longitudinally, we can see that a patient does not "develop" hypertension; they either have it or they do not:

```
cat observations.csv | grep Systolic | awk -F, '{print $2","$1","$(NF-1)}' | sort | head -1000

0005923b-f692-4608-ac67-ed12ca51e596,2011-05-12,166.0
0005923b-f692-4608-ac67-ed12ca51e596,2012-06-08,142.0
0005923b-f692-4608-ac67-ed12ca51e596,2013-05-28,167.0
0005923b-f692-4608-ac67-ed12ca51e596,2014-04-09,188.0
0005923b-f692-4608-ac67-ed12ca51e596,2015-03-13,196.0
0005923b-f692-4608-ac67-ed12ca51e596,2016-01-25,200.0

0005fffa-4b7c-4952-91ac-19d854e089eb,2011-06-15,127.0
0005fffa-4b7c-4952-91ac-19d854e089eb,2013-12-08,136.0
0005fffa-4b7c-4952-91ac-19d854e089eb,2016-11-25,105.0

0006b2a0-f823-460f-9682-9a52fa7aec34,2011-06-01,141.0
0006b2a0-f823-460f-9682-9a52fa7aec34,2013-02-18,166.0
0006b2a0-f823-460f-9682-9a52fa7aec34,2015-03-21,187.0
0006b2a0-f823-460f-9682-9a52fa7aec34,2016-02-21,173.0
0006b2a0-f823-460f-9682-9a52fa7aec34,2017-03-11,143.0

0007f2f0-f663-4455-a438-d3e06b4590b6,2012-05-11,105.0
0007f2f0-f663-4455-a438-d3e06b4590b6,2014-05-18,125.0
0007f2f0-f663-4455-a438-d3e06b4590b6,2015-11-06,119.0

000863b8-0dd6-43b0-8f34-47277da5cd7c,2010-06-05,191.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2011-05-29,173.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2012-05-06,146.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2013-05-05,162.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2014-04-10,163.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2015-04-05,158.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2016-04-04,183.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2017-01-12,199.0

0009366f-5e37-423d-9c28-ff61f93ebdf8,2010-11-01,121.0
0009366f-5e37-423d-9c28-ff61f93ebdf8,2014-04-26,115.0
0009366f-5e37-423d-9c28-ff61f93ebdf8,2017-04-05,129.0

00094a4e-5624-4e0e-a4be-991a4072ad0a,2011-11-24,102.0
00094a4e-5624-4e0e-a4be-991a4072ad0a,2013-11-29,101.0
00094a4e-5624-4e0e-a4be-991a4072ad0a,2015-11-11,108.0
00094a4e-5624-4e0e-a4be-991a4072ad0a,2017-05-14,139.0
```
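
The same longitudinal check can be sketched in Python with only the standard library. The column names (`DATE`, `PATIENT`, `DESCRIPTION`, `VALUE`) and the description text `Systolic Blood Pressure` are assumptions about the `observations.csv` layout, chosen to match the fields the `awk` command picks out. The inline sample reuses four readings from the listing above, and `systolic_by_patient` is a hypothetical helper:

```python
import csv
import io
from collections import defaultdict


def systolic_by_patient(rows, threshold=140.0):
    """Return {patient: (n_readings, n_high)} for systolic readings."""
    stats = defaultdict(lambda: [0, 0])
    for row in rows:
        if "Systolic" not in row["DESCRIPTION"]:
            continue  # same filter as `grep Systolic`
        stats[row["PATIENT"]][0] += 1
        stats[row["PATIENT"]][1] += float(row["VALUE"]) >= threshold
    return {patient: tuple(s) for patient, s in stats.items()}


sample = io.StringIO(
    "DATE,PATIENT,DESCRIPTION,VALUE\n"
    "2011-05-12,0005923b-f692-4608-ac67-ed12ca51e596,Systolic Blood Pressure,166.0\n"
    "2012-06-08,0005923b-f692-4608-ac67-ed12ca51e596,Systolic Blood Pressure,142.0\n"
    "2011-06-15,0005fffa-4b7c-4952-91ac-19d854e089eb,Systolic Blood Pressure,127.0\n"
    "2013-12-08,0005fffa-4b7c-4952-91ac-19d854e089eb,Systolic Blood Pressure,136.0\n"
)
stats = systolic_by_patient(csv.DictReader(sample))

# A patient is "consistent" if every reading is on the same side of the threshold.
consistent = {p: n_high in (0, n) for p, (n, n_high) in stats.items()}
print(consistent)
```

Run against the full file, the share of consistent patients would show how strictly the "have it or not" pattern holds.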


## Boilerplate

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
import sys

formatter = logging.Formatter(
    fmt='%(asctime)s.%(msecs)03d %(levelname)s [%(name)s] %(message)s',
    datefmt='%y%m%d@%H:%M:%S',
)

# logger = logging.getLogger('synthea')
# logger.setLevel(logging.DEBUG)  # change to INFO or higher for fewer messages
# h = logging.StreamHandler(stream=sys.stdout)
# h.setFormatter(formatter)

# if not logger.hasHandlers():
#     logger.addHandler(h)  # log to STDOUT or Jupyter
    
from datetime import datetime
import os
import pandas as pd
pd.options.display.max_rows = 96
pd.options.display.max_columns = 96
pd.options.display.max_colwidth = 160

In [3]:
def top_items(df, col, n=1) -> tuple:
    """Returns the cardinality of `col` in `df` and its top `n` items as [value, count, percent]."""
    rel = 100 / len(df)
    counts = df[col].value_counts(dropna=False)
    c = len(counts)
    keys = counts.index[:n]
    vals = counts.values[:n]
    pcts = counts.values[:n] * rel
    return c, [[x, y, round(z, 1)] for x, y, z in zip(keys, vals, pcts)]


def ntop(df, n=3) -> pd.DataFrame:
    """Overview of top `n` items in all columns of a `df`."""
    df = df.loc[:, ~df.columns.duplicated()]
    dftop = pd.DataFrame(
        index=df.columns,
        columns=['cardinality','top_items','coverage'],
    )
    for col in df.columns:
        top = top_items(df, col, n=n)
        dftop.loc[col, 'cardinality'] = top[0]
        dftop.loc[col, 'coverage'] = sum([i[2] for i in top[1]])
        dftop.loc[col, 'top_items'] = top[1]
    return dftop


def age(birth, death, unit='y'):
    """Calculates age, rounded to the nearest year (or day if `unit` is not 'y').  If no death date, returns age as of today."""
    if pd.isnull(death):
        death = datetime.now()  # I'm a glass half-full kinda guy
    if unit == 'y':
        denom = 365.25
    else:
        denom = 1
    return round((death - birth).days / denom)
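
Because `age` rounds to the nearest unit, someone aged 18 years and 10 months is reported as 19. Where age at last birthday is needed instead, something along these lines could be used. `completed_years` is a hypothetical helper, not part of the notebook, and the dates are one birth/diagnosis pair from the hypertension sample in the Hypertension section:

```python
from datetime import date


def completed_years(birth: date, on: date) -> int:
    """Whole years elapsed between `birth` and `on` (age at last birthday)."""
    had_birthday = (on.month, on.day) >= (birth.month, birth.day)
    return on.year - birth.year - (0 if had_birthday else 1)


birth, diagnosed = date(1963, 9, 11), date(1982, 7, 14)
print(round((diagnosed - birth).days / 365.25))  # nearest year, as `age` computes it
print(completed_years(birth, diagnosed))         # age at last birthday
```

The two conventions differ by one year for roughly half of all dates, which matters when reading the split between ages 18 and 19 in `AGE_START`.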

In [4]:
folder='/Users/Alexander/Desktop/code/synthea/data'
os.listdir(folder)

['conditions.csv.gz',
 'patients.csv.gz',
 'medications.csv.gz',
 'procedures.csv.gz']

In [5]:
dfs = {}
for f in os.listdir(folder):
    file = os.path.join(folder, f)
    # key each table by the first four letters of its file name; cap rows per table
    dfs[f[:4]] = pd.read_csv(file, nrows=700_000)
    print(f, dfs[f[:4]].shape)

conditions.csv.gz (483462, 6)
patients.csv.gz (132607, 17)
medications.csv.gz (397877, 8)
procedures.csv.gz (628785, 7)


In [6]:
dfs['pati']['AGE'] = dfs['pati'].apply(
    lambda x: age(pd.Timestamp(x.BIRTHDATE), pd.Timestamp(x.DEATHDATE)), axis=1)

*Only patient ids found in the medication and procedure tables will be put into the model.*  

Clarification needed: patients found in *either* table or in *both* tables?  We'll assume that we only want patients with records in *both* tables.
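
The two readings correspond to a set union and a set intersection. A toy illustration with made-up patient IDs (`p1` to `p4` are hypothetical, not taken from the data):

```python
medication_patients = {"p1", "p2", "p3"}
procedure_patients = {"p2", "p3", "p4"}

either = medication_patients | procedure_patients  # in at least one table
both = medication_patients & procedure_patients    # in both tables

print(sorted(either))  # ['p1', 'p2', 'p3', 'p4']
print(sorted(both))    # ['p2', 'p3']
```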

In [7]:
# in either table
patient_pop = pd.concat([
    dfs['medi'].PATIENT,
    dfs['proc'].PATIENT
]).drop_duplicates()
len(patient_pop)

116465

In [8]:
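# in both tables
# (filter with isin rather than an inner merge, which would build a
# many-to-many join on PATIENT before de-duplicating)
medi_patients = dfs['medi'].PATIENT.drop_duplicates()
patient_pop = medi_patients[medi_patients.isin(dfs['proc'].PATIENT)]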

# we lost a few records while cleaning patients.csv
# this ensures that we keep only records existing in all three tables
patient_pop = patient_pop[patient_pop.isin(dfs['pati'].ID)]
len(patient_pop)

93565

In [9]:
patient_pop.name='patient_pop'
for k, df in dfs.items():
    left_key = 'PATIENT' if 'PATIENT' in df.columns else 'ID'
    dfs[k] = pd.merge(
        df,
        patient_pop,
        left_on=left_key,
        right_on='patient_pop',
        how='inner'
    )[df.columns]
    print(k, dfs[k].shape)

cond (407152, 6)
pati (93565, 18)
medi (356161, 8)
proc (593170, 7)


In [10]:
# add DOB, age and gender
for k in ['cond','medi','proc']:
    dfs[k] = dfs[k].set_index('PATIENT').merge(
        dfs['pati'].set_index('ID')[['GENDER','BIRTHDATE','AGE']],
        left_index=True,
        right_index=True,
        how='left',
    )

# add age at diagnosis (age on the condition's START date)
dfs['cond']['AGE_START'] = dfs['cond'].apply(
    lambda x: age(pd.Timestamp(x.BIRTHDATE), pd.Timestamp(x.START)), axis=1)

In [11]:
ntop(dfs['pati'])

column,cardinality,top_items,coverage
ID,93565,"[[511bc323-42d3-48b6-9dd1-88b04e5d7c8e, 1, 0.0], [1539ac10-5615-4704-b864-9910a612869c, 1, 0.0], [7ff2fd0d-6cc8-4ba5-aef9-e433f0c7ff71, 1, 0.0]]",0.0
BIRTHDATE,30297,"[[1962-08-30, 12, 0.0], [1959-02-12, 12, 0.0], [1999-03-25, 12, 0.0]]",0.0
DEATHDATE,2349,"[[nan, 86783, 92.8], [2017-01-20, 9, 0.0], [2016-01-21, 9, 0.0]]",92.8
SSN,88379,"[[999-20-9796, 4, 0.0], [999-72-1056, 4, 0.0], [999-55-2306, 4, 0.0]]",0.0
DRIVERS,52239,"[[nan, 15291, 16.3], [S99946879, 7, 0.0], [S99998656, 7, 0.0]]",16.3
PASSPORT,36641,"[[false, 36685, 39.2], [nan, 20234, 21.6], [X59769894X, 2, 0.0]]",60.8
PREFIX,4,"[[Mr., 35540, 38.0], [Mrs., 27234, 29.1], [nan, 17779, 19.0]]",86.1
FIRST,3008,"[[Lemuel35, 51, 0.1], [Jaunita995, 50, 0.1], [Lilyan484, 50, 0.1]]",0.3
LAST,474,"[[Beier837, 256, 0.3], [West921, 244, 0.3], [Armstrong275, 241, 0.3]]",0.9
SUFFIX,4,"[[nan, 92562, 98.9], [MD, 352, 0.4], [PhD, 340, 0.4]]",99.7


In [12]:
ntop(dfs['proc'])

column,cardinality,top_items,coverage
DATE,2558,"[[2015-03-24, 304, 0.1], [2015-10-06, 301, 0.1], [2011-07-01, 300, 0.1]]",0.3
ENCOUNTER,539288,"[[59306077-aec0-4908-8f0d-aa2bdc86cd8c, 71, 0.0], [9d94db7c-9e5f-4f85-b162-27a33571701c, 71, 0.0], [57a3854d-f4f4-49d6-af08-c60d9c4a230b, 70, 0.0]]",0.0
CODE,78,"[[428191000124101, 238610, 40.2], [180256009, 91335, 15.4], [76601001, 41767, 7.0]]",62.6
DESCRIPTION,80,"[[Documentation of current medications, 238610, 40.2], [Subcutaneous immunotherapy, 91335, 15.4], [Intramuscular injection, 41767, 7.0]]",62.6
REASONCODE,49,"[[nan, 437271, 73.7], [72892002.0, 31116, 5.2], [10509002.0, 26402, 4.5]]",83.4
REASONDESCRIPTION,49,"[[nan, 437271, 73.7], [Normal pregnancy, 31116, 5.2], [Acute bronchitis (disorder), 26402, 4.5]]",83.4
GENDER,2,"[[F, 344282, 58.0], [M, 248888, 42.0]]",100.0
BIRTHDATE,30297,"[[2000-04-10, 294, 0.0], [1999-10-03, 283, 0.0], [1999-03-11, 279, 0.0]]",0.0
AGE,102,"[[21, 19744, 3.3], [23, 19309, 3.3], [24, 18538, 3.1]]",9.7


In [13]:
ntop(dfs['medi'])

column,cardinality,top_items,coverage
START,27345,"[[2014-03-02, 127, 0.0], [2017-03-06, 125, 0.0], [2015-03-05, 125, 0.0]]",0.0
STOP,4109,"[[nan, 179756, 50.5], [2010-09-08, 114, 0.0], [2010-06-11, 109, 0.0]]",50.5
ENCOUNTER,206158,"[[59306077-aec0-4908-8f0d-aa2bdc86cd8c, 139, 0.0], [1c9de4f9-bf5c-4650-84bd-61e147db01a7, 133, 0.0], [9d94db7c-9e5f-4f85-b162-27a33571701c, 133, 0.0]]",0.0
CODE,98,"[[834060, 52443, 14.7], [608680, 26981, 7.6], [834101, 25289, 7.1]]",29.4
DESCRIPTION,99,"[[Penicillin V Potassium 250 MG, 52443, 14.7], [Acetaminophen 160 MG, 26981, 7.6], [Penicillin V Potassium 500 MG, 25289, 7.1]]",29.4
REASONCODE,41,"[[nan, 114598, 32.2], [43878008.0, 77262, 21.7], [10509002.0, 35396, 9.9]]",63.8
REASONDESCRIPTION,41,"[[nan, 114598, 32.2], [Streptococcal sore throat (disorder), 77262, 21.7], [Acute bronchitis (disorder), 35396, 9.9]]",63.8
GENDER,2,"[[F, 190365, 53.4], [M, 165796, 46.6]]",100.0
BIRTHDATE,30297,"[[1951-12-15, 187, 0.1], [1960-04-23, 171, 0.0], [1962-07-26, 171, 0.0]]",0.1
AGE,102,"[[56, 7080, 2.0], [58, 6892, 1.9], [57, 6882, 1.9]]",5.8


In [14]:
ntop(dfs['cond'])

column,cardinality,top_items,coverage
START,24879,"[[2015-09-28, 272, 0.1], [2011-02-28, 271, 0.1], [2014-08-28, 263, 0.1]]",0.3
STOP,7208,"[[nan, 158345, 38.9], [2015-05-12, 130, 0.0], [2012-02-22, 129, 0.0]]",38.9
ENCOUNTER,309865,"[[137b2704-d4ef-440d-b8c7-ea9b0c6bc514, 21, 0.0], [4fb78c49-1fa1-4147-a423-50ab77479cea, 20, 0.0], [bb9c020f-f30f-4b31-9860-639e2e40b7ce, 19, 0.0]]",0.0
CODE,127,"[[444814009, 69559, 17.1], [195662009, 39088, 9.6], [10509002, 35593, 8.7]]",35.4
DESCRIPTION,127,"[[Viral sinusitis (disorder), 69559, 17.1], [Acute viral pharyngitis (disorder), 39088, 9.6], [Acute bronchitis (disorder), 35593, 8.7]]",35.4
GENDER,2,"[[F, 222019, 54.5], [M, 185133, 45.5]]",100.0
BIRTHDATE,30222,"[[1959-02-12, 81, 0.0], [1957-02-19, 66, 0.0], [1963-10-20, 61, 0.0]]",0.0
AGE,102,"[[58, 7367, 1.8], [61, 7101, 1.7], [56, 6987, 1.7]]",5.2
AGE_START,100,"[[19, 17711, 4.3], [18, 15746, 3.9], [3, 8444, 2.1]]",10.3


## Hypertension
I set out to study the age of onset, but the distribution looked odd: nearly every diagnosis falls at age 18 or 19.

In [15]:
ht = dfs['cond'].CODE == 38341003
dfs['cond'][ht].AGE_START.value_counts().sort_index() #.plot(kind='bar')

18    10951
19    11157
20       14
21        7
23       38
24       62
25        7
26       53
27       76
28       17
29       71
30       53
31       41
32       40
33       45
34       50
35       30
36       46
37       48
38       37
39       53
40       50
41       44
42       51
43       53
44       35
45       38
46       43
47       48
48       29
49       28
50       32
51       32
52       34
53       14
54        9
55        8
56        1
57        5
58        3
59        1
60        3
61        2
62        5
63        4
64        2
65        5
66        6
67        3
68        3
69        2
70        2
71        2
72        3
73        1
74        3
75        3
76        1
78        3
79        2
81        1
Name: AGE_START, dtype: int64
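
To quantify how concentrated the onset ages are, the counts printed above can be summed directly. The sketch below hard-codes that output as a plain dictionary; `onset_counts` and `share_between` are illustrative names. On this extract, ages 18 and 19 account for roughly 94% of hypertension diagnoses, somewhat below the 98% quoted in the introduction, possibly because this notebook loads only part of the data.

```python
# AGE_START -> number of hypertension diagnoses, copied from the output above.
onset_counts = {
    18: 10951, 19: 11157, 20: 14, 21: 7, 23: 38, 24: 62, 25: 7, 26: 53,
    27: 76, 28: 17, 29: 71, 30: 53, 31: 41, 32: 40, 33: 45, 34: 50,
    35: 30, 36: 46, 37: 48, 38: 37, 39: 53, 40: 50, 41: 44, 42: 51,
    43: 53, 44: 35, 45: 38, 46: 43, 47: 48, 48: 29, 49: 28, 50: 32,
    51: 32, 52: 34, 53: 14, 54: 9, 55: 8, 56: 1, 57: 5, 58: 3,
    59: 1, 60: 3, 61: 2, 62: 5, 63: 4, 64: 2, 65: 5, 66: 6,
    67: 3, 68: 3, 69: 2, 70: 2, 71: 2, 72: 3, 73: 1, 74: 3,
    75: 3, 76: 1, 78: 3, 79: 2, 81: 1,
}


def share_between(counts: dict, lo: int, hi: int) -> float:
    """Fraction of all diagnoses with lo <= age <= hi."""
    total = sum(counts.values())
    return sum(n for age, n in counts.items() if lo <= age <= hi) / total


print(f"{share_between(onset_counts, 18, 19):.1%}")  # 94.0%
```

Keep in mind that `AGE_START` is rounded to the nearest year, so the split between 18 and 19 reflects rounding as much as the generator.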

In [19]:
dfs['cond'][ht].sample(8)

PATIENT,START,STOP,ENCOUNTER,CODE,DESCRIPTION,GENDER,BIRTHDATE,AGE,AGE_START
c7347306-fd7a-43a9-893b-e2c238f61911,1960-11-10,,54727938-71f1-46a7-96fd-82251294a262,38341003,Hypertension,F,1942-10-16,78,18
0e191a02-1be7-4a3a-a99f-0c14e74cc240,1967-05-28,,579f9391-5132-42cd-9647-4824bae16780,38341003,Hypertension,F,1949-05-06,66,18
49dfe831-2a34-42fe-8e9a-4b209ab3cc8e,1990-12-06,,2410c600-842a-4083-b2de-d148f2e36e84,38341003,Hypertension,M,1972-07-18,49,18
3336777d-cfd8-46e3-a5b1-a4b06dc41016,1973-12-17,,c35ef5e9-6e2a-41c5-815f-13a2b63cb7dd,38341003,Hypertension,F,1955-09-07,66,18
26ff67e8-50c0-4f15-a3a1-f3ec40da6a71,2012-11-10,,839be0ca-79a9-451c-90a8-c9e4061eeb86,38341003,Hypertension,M,1994-06-17,27,18
21be165a-f494-4bfa-91d2-f133987561de,1982-07-14,,c75efdc7-c3fe-4a20-9b93-177915565d52,38341003,Hypertension,F,1963-09-11,58,19
541ecac4-bfb4-4e86-a301-8112114d6f32,1977-01-02,,5d35d7cd-45d8-423b-8e27-ba1579511351,38341003,Hypertension,M,1958-03-28,63,19
f807b9ec-628e-40d1-a175-9c36931684eb,1952-05-31,,ecdb043c-e23f-4b54-91aa-f8d1313e68c4,38341003,Hypertension,F,1933-10-03,82,19
