# SyntheticMass
Analysis of [Synthea SyntheticMass](https://synthea.mitre.org/downloads), an open-source synthetic patient population.

## Predicting Hypertension

The goal of this exercise is to predict the probability that a patient has hypertension (diagnosis code 38341003) on a given date.

On inspecting the data, we find that 98% of patients "developed" hypertension between the ages of 18 and 20. The likely reason can be found in the source code of Synthea, the generator behind SyntheticMass: the [clinical care map](https://github.com/synthetichealth/synthea/blob/master/src/main/resources/modules/hypertension.json) used to model hypertension specifies that "onset" and progression take place within several months of the first wellness encounter, and anyone under 18 is excluded.

My prediction is this: for people over 20 years old, the probability of having hypertension on any given date is 15%. Because of how this data is generated, there are no other features to engineer, so the proposed exercise is not very meaningful on the provided dataset.
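
The baseline described above fits in a few lines. This is a minimal sketch, assuming a hard cut-off at age 20 and zero probability below it. The names `p_hypertension`, `age_in_years`, `PREVALENCE` and `ONSET_AGE` are illustrative and not part of the original analysis:

```python
from datetime import date

PREVALENCE = 0.15  # rate quoted in the text above
ONSET_AGE = 20     # assumption: "onset" is complete by this age


def age_in_years(birth: date, on: date) -> int:
    """Completed years between `birth` and `on` (calendar age)."""
    had_birthday = (on.month, on.day) >= (birth.month, birth.day)
    return on.year - birth.year - (0 if had_birthday else 1)


def p_hypertension(birth: date, on: date) -> float:
    """Constant-rate baseline: no feature other than age is used."""
    return PREVALENCE if age_in_years(birth, on) >= ONSET_AGE else 0.0


print(p_hypertension(date(1980, 1, 1), date(2010, 6, 1)))  # 0.15
print(p_hypertension(date(2000, 1, 1), date(2010, 6, 1)))  # 0.0
```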

Inspection of `observations.csv` corroborates this: 85% of systolic blood pressure measurements are uniformly distributed between 100 and 139 mmHg, and the other 15% are uniformly distributed between 140 and 200 mmHg (leading to a diagnosis of hypertension). Looking longitudinally, we can see that patients do not "develop" hypertension; they either have it or they do not:

```
grep Systolic observations.csv | awk -F, '{print $2","$1","$(NF-1)}' | sort | head -1000

0005923b-f692-4608-ac67-ed12ca51e596,2011-05-12,166.0
0005923b-f692-4608-ac67-ed12ca51e596,2012-06-08,142.0
0005923b-f692-4608-ac67-ed12ca51e596,2013-05-28,167.0
0005923b-f692-4608-ac67-ed12ca51e596,2014-04-09,188.0
0005923b-f692-4608-ac67-ed12ca51e596,2015-03-13,196.0
0005923b-f692-4608-ac67-ed12ca51e596,2016-01-25,200.0

0005fffa-4b7c-4952-91ac-19d854e089eb,2011-06-15,127.0
0005fffa-4b7c-4952-91ac-19d854e089eb,2013-12-08,136.0
0005fffa-4b7c-4952-91ac-19d854e089eb,2016-11-25,105.0

0006b2a0-f823-460f-9682-9a52fa7aec34,2011-06-01,141.0
0006b2a0-f823-460f-9682-9a52fa7aec34,2013-02-18,166.0
0006b2a0-f823-460f-9682-9a52fa7aec34,2015-03-21,187.0
0006b2a0-f823-460f-9682-9a52fa7aec34,2016-02-21,173.0
0006b2a0-f823-460f-9682-9a52fa7aec34,2017-03-11,143.0

0007f2f0-f663-4455-a438-d3e06b4590b6,2012-05-11,105.0
0007f2f0-f663-4455-a438-d3e06b4590b6,2014-05-18,125.0
0007f2f0-f663-4455-a438-d3e06b4590b6,2015-11-06,119.0

000863b8-0dd6-43b0-8f34-47277da5cd7c,2010-06-05,191.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2011-05-29,173.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2012-05-06,146.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2013-05-05,162.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2014-04-10,163.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2015-04-05,158.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2016-04-04,183.0
000863b8-0dd6-43b0-8f34-47277da5cd7c,2017-01-12,199.0

0009366f-5e37-423d-9c28-ff61f93ebdf8,2010-11-01,121.0
0009366f-5e37-423d-9c28-ff61f93ebdf8,2014-04-26,115.0
0009366f-5e37-423d-9c28-ff61f93ebdf8,2017-04-05,129.0

00094a4e-5624-4e0e-a4be-991a4072ad0a,2011-11-24,102.0
00094a4e-5624-4e0e-a4be-991a4072ad0a,2013-11-29,101.0
00094a4e-5624-4e0e-a4be-991a4072ad0a,2015-11-11,108.0
00094a4e-5624-4e0e-a4be-991a4072ad0a,2017-05-14,139.0
```
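
The same longitudinal check can be done in pandas. This is a minimal sketch on a handful of the readings listed above. The column names `PATIENT`, `DATE` and `VALUE` are assumptions about how the data would be loaded, and the 140 mmHg threshold comes from the text:

```python
import pandas as pd

# A few of the readings listed above (patient IDs shortened for readability).
obs = pd.DataFrame(
    {
        "PATIENT": ["0005923b"] * 3 + ["0005fffa"] * 3 + ["0006b2a0"] * 3,
        "DATE": [
            "2011-05-12", "2012-06-08", "2013-05-28",
            "2011-06-15", "2013-12-08", "2016-11-25",
            "2011-06-01", "2013-02-18", "2015-03-21",
        ],
        "VALUE": [166.0, 142.0, 167.0, 127.0, 136.0, 105.0, 141.0, 166.0, 187.0],
    }
)

# Flag each reading as hypertensive (>= 140 mmHg) or not.
obs["high"] = obs["VALUE"] >= 140

# A patient "develops" hypertension only if both flags occur in their history.
flags_per_patient = obs.groupby("PATIENT")["high"].nunique()
crossers = flags_per_patient[flags_per_patient > 1]

print(len(crossers))  # 0: nobody in this sample crosses the threshold
```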


## Boilerplate

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
import sys

formatter = logging.Formatter(
    fmt='%(asctime)s.%(msecs)03d %(levelname)s [%(name)s] %(message)s',
    datefmt='%y%m%d@%H:%M:%S',
)

# logger = logging.getLogger('synthea')
# logger.setLevel(logging.DEBUG)  # change to INFO or lower for fewer messages
# h = logging.StreamHandler(stream=sys.stdout)
# h.setFormatter(formatter)

# if not logger.hasHandlers():
#     logger.addHandler(h)  # log to STDOUT or Jupyter
    
from datetime import datetime
import os
import pandas as pd
pd.options.display.max_rows = 96
pd.options.display.max_columns = 96
pd.options.display.max_colwidth = 160

In [3]:
def top_items(df, col, n=1) -> tuple:
    """Returns cardinality of `col` and its top `n` items as [value, count, percent]."""
    rel = 100 / len(df)
    counts = df[col].value_counts(dropna=False)
    c = len(counts)
    keys = counts.index[:n]
    vals = counts.values[:n]
    pcts = counts.values[:n] * rel
    return c, [[x, y, round(z, 1)] for x, y, z in zip(keys, vals, pcts)]


def ntop(df, n=3) -> pd.DataFrame:
    """Overview of top `n` items in all columns of a `df`."""
    df = df.loc[:, ~df.columns.duplicated()]
    dftop = pd.DataFrame(
        index=df.columns,
        columns=['cardinality','top_items','coverage'],
    )
    for col in df.columns:
        top = top_items(df, col, n=n)
        dftop.loc[col, 'cardinality'] = top[0]
        dftop.loc[col, 'coverage'] = sum([i[2] for i in top[1]])
        dftop.loc[col, 'top_items'] = top[1]
    return dftop


def age(birth, death, unit='y'):
    """Age at `death`, or at any other reference date.  If null, uses today.

    Rounds to the nearest unit, so 17.6 years is reported as 18.
    """
    if pd.isnull(death):
        death = datetime.now()  # I'm a glass half-full kinda guy
    # unit='y' gives years; any other value gives days
    denom = 365.25 if unit == 'y' else 1
    return round((death - birth).days / denom)
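
Note that `age` rounds to the nearest year, so it can report one year more than the number of completed years. Here is a minimal sketch of the difference; `rounded_age` repeats the arithmetic of `age`, and `calendar_age` is a hypothetical helper added for comparison that is not used elsewhere in the notebook:

```python
import pandas as pd


def rounded_age(birth, ref):
    """Same arithmetic as `age` above: days / 365.25, rounded to nearest."""
    return round((ref - birth).days / 365.25)


def calendar_age(birth, ref):
    """Completed years: subtract one if the birthday has not yet occurred."""
    before_birthday = (ref.month, ref.day) < (birth.month, birth.day)
    return ref.year - birth.year - int(before_birthday)


birth = pd.Timestamp("2000-01-01")
ref = pd.Timestamp("2018-09-01")
print(rounded_age(birth, ref), calendar_age(birth, ref))  # 19 18
```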

In [4]:
folder='../data'
os.listdir(folder)

['patients.csv.gz',
 'conditions.csv.gz',
 'medications.csv.gz',
 'procedures.csv.gz']

In [5]:
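dfs = {}
for f in os.listdir(folder):
    file = os.path.join(folder, f)
    key = f[:4]  # short table key: 'pati', 'cond', 'medi', 'proc'
    dfs[key] = pd.read_csv(file, nrows=700_000)  # read at most 700,000 rows per table
    print(f, dfs[key].shape)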

patients.csv.gz (132607, 17)
conditions.csv.gz (483462, 6)
medications.csv.gz (397877, 8)
procedures.csv.gz (628785, 7)


In [6]:
dfs['pati']['AGE'] = dfs['pati'].apply(
    lambda x: age(pd.Timestamp(x.BIRTHDATE), pd.Timestamp(x.DEATHDATE)), axis=1)

*Only patient ids found in the medication and procedure tables will be put into the model.*  

Clarification needed: does this mean patients found in *either* table or in *both* tables?  We'll assume that we only want patients with records in *both* tables.

In [7]:
# in either table
patient_pop = pd.concat([
    dfs['medi'].PATIENT,
    dfs['proc'].PATIENT
]).drop_duplicates()
len(patient_pop)

116465

In [8]:
# in both tables
# (filter with isin instead of merging, which avoids a many-to-many join)
medi_patients = dfs['medi'].PATIENT.drop_duplicates()
patient_pop = medi_patients[medi_patients.isin(dfs['proc'].PATIENT)]

# we lost a few records while cleaning patients.csv
# this ensures that we keep only records existing in all three tables
patient_pop = patient_pop[patient_pop.isin(dfs['pati'].ID)]
len(patient_pop)

93565
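
The "either" and "both" populations above can also be counted in one pass with an outer merge and `indicator=True`. This is a minimal sketch on toy data; the patient IDs below are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for dfs['medi'].PATIENT and dfs['proc'].PATIENT.
medi = pd.DataFrame({"PATIENT": ["a", "a", "b", "c"]})
proc = pd.DataFrame({"PATIENT": ["b", "c", "c", "d"]})

# De-duplicate first so the merge is one-to-one rather than many-to-many.
merged = pd.merge(
    medi.drop_duplicates(),
    proc.drop_duplicates(),
    on="PATIENT",
    how="outer",
    indicator=True,  # adds a `_merge` column: left_only / right_only / both
)

either = len(merged)
both = int((merged["_merge"] == "both").sum())
print(either, both)  # 4 2
```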

In [9]:
patient_pop.name='patient_pop'
for k, df in dfs.items():
    left_key = 'PATIENT' if 'PATIENT' in df.columns else 'ID'
    dfs[k] = pd.merge(
        df,
        patient_pop,
        left_on=left_key,
        right_on='patient_pop',
        how='inner'
    )[df.columns]
    print(k, dfs[k].shape)

pati (93565, 18)
cond (407152, 6)
medi (356161, 8)
proc (593170, 7)


In [10]:
# add patient demographics (DOB, age, gender and the other patient columns)
for k in ['cond','medi','proc']:
    dfs[k] = dfs[k].set_index('PATIENT').merge(
        dfs['pati'].set_index('ID'),
        left_index=True,
        right_index=True,
        how='left',
    )

# add age at diagnosis
dfs['cond']['AGE_START'] = dfs['cond'].apply(
    lambda x: age(pd.Timestamp(x.BIRTHDATE), pd.Timestamp(x.START)), axis=1)

In [11]:
ntop(dfs['pati'])

column,cardinality,top_items,coverage
ID,93565,"[[993930ed-8960-4dc0-b817-a8d306e53977, 1, 0.0], [1ce9a0ea-ae81-4782-8a0f-8c5f66e23ba4, 1, 0.0], [be8958b1-9a12-4a64-bcfd-759a7e22cbe6, 1, 0.0]]",0.0
BIRTHDATE,30297,"[[1962-08-30, 12, 0.0], [1957-02-19, 12, 0.0], [1999-03-25, 12, 0.0]]",0.0
DEATHDATE,2349,"[[nan, 86783, 92.8], [2016-01-21, 9, 0.0], [2016-02-04, 9, 0.0]]",92.8
SSN,88379,"[[999-55-2306, 4, 0.0], [999-71-4492, 4, 0.0], [999-72-1056, 4, 0.0]]",0.0
DRIVERS,52239,"[[nan, 15291, 16.3], [S99998656, 7, 0.0], [S99946879, 7, 0.0]]",16.3
PASSPORT,36641,"[[false, 36685, 39.2], [nan, 20234, 21.6], [X68719242X, 2, 0.0]]",60.8
PREFIX,4,"[[Mr., 35540, 38.0], [Mrs., 27234, 29.1], [nan, 17779, 19.0]]",86.1
FIRST,3008,"[[Lemuel35, 51, 0.1], [Effie723, 50, 0.1], [Jaunita995, 50, 0.1]]",0.3
LAST,474,"[[Beier837, 256, 0.3], [West921, 244, 0.3], [Armstrong275, 241, 0.3]]",0.9
SUFFIX,4,"[[nan, 92562, 98.9], [MD, 352, 0.4], [PhD, 340, 0.4]]",99.7


In [12]:
ntop(dfs['proc'])

column,cardinality,top_items,coverage
DATE,2558,"[[2015-03-24, 304, 0.1], [2015-10-06, 301, 0.1], [2011-07-01, 300, 0.1]]",0.3
ENCOUNTER,539288,"[[59306077-aec0-4908-8f0d-aa2bdc86cd8c, 71, 0.0], [9d94db7c-9e5f-4f85-b162-27a33571701c, 71, 0.0], [57a3854d-f4f4-49d6-af08-c60d9c4a230b, 70, 0.0]]",0.0
CODE,78,"[[428191000124101, 238610, 40.2], [180256009, 91335, 15.4], [76601001, 41767, 7.0]]",62.6
DESCRIPTION,80,"[[Documentation of current medications, 238610, 40.2], [Subcutaneous immunotherapy, 91335, 15.4], [Intramuscular injection, 41767, 7.0]]",62.6
REASONCODE,49,"[[nan, 437271, 73.7], [72892002.0, 31116, 5.2], [10509002.0, 26402, 4.5]]",83.4
REASONDESCRIPTION,49,"[[nan, 437271, 73.7], [Normal pregnancy, 31116, 5.2], [Acute bronchitis (disorder), 26402, 4.5]]",83.4
BIRTHDATE,30297,"[[2000-04-10, 294, 0.0], [1999-10-03, 283, 0.0], [1999-03-11, 279, 0.0]]",0.0
DEATHDATE,2349,"[[nan, 551689, 93.0], [2016-06-07, 199, 0.0], [2016-03-15, 114, 0.0]]",93.0
SSN,88379,"[[999-79-7879, 118, 0.0], [999-93-9792, 108, 0.0], [999-94-1038, 106, 0.0]]",0.0
DRIVERS,52239,"[[nan, 101862, 17.2], [S99919181, 146, 0.0], [S99934754, 144, 0.0]]",17.2


In [13]:
ntop(dfs['medi'])

column,cardinality,top_items,coverage
START,27345,"[[2014-03-02, 127, 0.0], [2015-03-05, 125, 0.0], [2011-06-28, 125, 0.0]]",0.0
STOP,4109,"[[nan, 179756, 50.5], [2010-09-08, 114, 0.0], [2010-06-11, 109, 0.0]]",50.5
ENCOUNTER,206158,"[[59306077-aec0-4908-8f0d-aa2bdc86cd8c, 139, 0.0], [1c9de4f9-bf5c-4650-84bd-61e147db01a7, 133, 0.0], [9d94db7c-9e5f-4f85-b162-27a33571701c, 133, 0.0]]",0.0
CODE,98,"[[834060, 52443, 14.7], [608680, 26981, 7.6], [834101, 25289, 7.1]]",29.4
DESCRIPTION,99,"[[Penicillin V Potassium 250 MG, 52443, 14.7], [Acetaminophen 160 MG, 26981, 7.6], [Penicillin V Potassium 500 MG, 25289, 7.1]]",29.4
REASONCODE,41,"[[nan, 114598, 32.2], [43878008.0, 77262, 21.7], [10509002.0, 35396, 9.9]]",63.8
REASONDESCRIPTION,41,"[[nan, 114598, 32.2], [Streptococcal sore throat (disorder), 77262, 21.7], [Acute bronchitis (disorder), 35396, 9.9]]",63.8
BIRTHDATE,30297,"[[1951-12-15, 187, 0.1], [1962-07-26, 171, 0.0], [1960-04-23, 171, 0.0]]",0.1
DEATHDATE,2349,"[[nan, 302535, 84.9], [2016-06-07, 254, 0.1], [2014-03-27, 185, 0.1]]",85.1
SSN,88379,"[[999-82-4187, 143, 0.0], [999-58-2167, 135, 0.0], [999-35-8705, 134, 0.0]]",0.0


In [14]:
ntop(dfs['cond'])

column,cardinality,top_items,coverage
START,24879,"[[2015-09-28, 272, 0.1], [2011-02-28, 271, 0.1], [2014-08-28, 263, 0.1]]",0.3
STOP,7208,"[[nan, 158345, 38.9], [2015-05-12, 130, 0.0], [2012-02-22, 129, 0.0]]",38.9
ENCOUNTER,309865,"[[137b2704-d4ef-440d-b8c7-ea9b0c6bc514, 21, 0.0], [4fb78c49-1fa1-4147-a423-50ab77479cea, 20, 0.0], [27731b55-09ec-4fdc-b371-c8a174902b00, 19, 0.0]]",0.0
CODE,127,"[[444814009, 69559, 17.1], [195662009, 39088, 9.6], [10509002, 35593, 8.7]]",35.4
DESCRIPTION,127,"[[Viral sinusitis (disorder), 69559, 17.1], [Acute viral pharyngitis (disorder), 39088, 9.6], [Acute bronchitis (disorder), 35593, 8.7]]",35.4
BIRTHDATE,30222,"[[1959-02-12, 81, 0.0], [1957-02-19, 66, 0.0], [1963-10-20, 61, 0.0]]",0.0
DEATHDATE,2343,"[[nan, 372529, 91.5], [2017-01-20, 61, 0.0], [2016-02-04, 60, 0.0]]",91.5
SSN,87291,"[[999-69-2966, 31, 0.0], [999-92-1028, 26, 0.0], [999-14-3166, 26, 0.0]]",0.0
DRIVERS,51794,"[[nan, 52430, 12.9], [S99998656, 37, 0.0], [S99929484, 37, 0.0]]",12.9
PASSPORT,36195,"[[false, 168301, 41.3], [nan, 70013, 17.2], [X55867787X, 22, 0.0]]",58.5


## Hypertension
I set out to study the age of onset, and the distribution looked odd: nearly all diagnoses fall at age 18 or 19.

In [15]:
ht = dfs['cond'].CODE == 38341003
dfs['cond'][ht].AGE_START.value_counts().sort_index() #.plot(kind='bar')

18    10951
19    11157
20       14
21        7
23       38
24       62
25        7
26       53
27       76
28       17
29       71
30       53
31       41
32       40
33       45
34       50
35       30
36       46
37       48
38       37
39       53
40       50
41       44
42       51
43       53
44       35
45       38
46       43
47       48
48       29
49       28
50       32
51       32
52       34
53       14
54        9
55        8
56        1
57        5
58        3
59        1
60        3
61        2
62        5
63        4
64        2
65        5
66        6
67        3
68        3
69        2
70        2
71        2
72        3
73        1
74        3
75        3
76        1
78        3
79        2
81        1
Name: AGE_START, dtype: int64
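
The concentration at ages 18 to 20 can be quantified directly from these counts. This is a minimal sketch: the counts are copied from the output above, with all ages from 21 upward lumped into one bucket (1,388 rows in total), and `share_between` is an illustrative helper:

```python
import pandas as pd


def share_between(counts: pd.Series, lo: int, hi: int) -> float:
    """Fraction of rows whose index value (age) lies in [lo, hi]."""
    in_range = (counts.index >= lo) & (counts.index <= hi)
    return float(counts[in_range].sum() / counts.sum())


# Index 21 stands for every age from 21 to 81 combined.
counts = pd.Series({18: 10951, 19: 11157, 20: 14, 21: 1388})
print(f"{share_between(counts, 18, 20):.1%}")  # 94.1%
```

On this filtered subset the share is about 94%, a little below the 98% quoted in the introduction. The gap may reflect the row cap, the patient filter, or the rounding in `age`.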

In [16]:
dfs['cond'][ht].sample(8)

PATIENT,START,STOP,ENCOUNTER,CODE,DESCRIPTION,BIRTHDATE,DEATHDATE,SSN,DRIVERS,PASSPORT,PREFIX,FIRST,LAST,SUFFIX,MAIDEN,MARITAL,RACE,ETHNICITY,GENDER,BIRTHPLACE,ADDRESS,AGE,AGE_START
9d22a244-b88f-442b-b621-1b8e662df744,2006-08-17,,12a8cc0d-c59f-418f-8bc4-5d94463aa9b9,38341003,Hypertension,1988-04-18,,999-93-8394,S99962025,false,Mrs.,Cassie602,Ward255,,Kilback494,M,black,african,F,Natick MA US,5823 Reichert Port Apt. 694 Agawam MA 01001 US,33,18
fff00723-105c-4db3-8e16-fe4c2100ae36,1971-06-07,,0fdae1d6-6953-4c2e-9d9c-e03a2fa32c89,38341003,Hypertension,1953-05-19,,999-77-7420,S99971875,X49211427X,Mrs.,Anissa110,Herman911,,Shields195,M,black,dominican,F,Mendon MA US,61866 Towne Camp Northborough MA 01532 US,68,18
acb8a9c8-7d23-4c5f-9fb1-4708bb12832d,2006-06-05,,00257fcb-9b27-49db-8f03-7c16f4e9af21,38341003,Hypertension,1973-11-24,,999-21-5873,S99967368,false,Mrs.,Pinkie661,VonRueden892,,Ankunding471,M,asian,asian_indian,F,Newton MA US,332 Berge Lodge Suite 419 Easthampton MA 01027 US,47,33
500c491d-3d26-4bb9-ad3a-e33c26eaa9ab,1997-02-10,,5a9ae8d6-1e16-4dda-8973-5c0539fb8f3b,38341003,Hypertension,1978-11-07,,999-68-9940,S99932552,false,Mr.,Johathan542,Hamill808,,,S,black,dominican,M,Maynard MA US,14761 Bauch Brooks Springfield MA 01108 US,42,18
10d21b1f-a075-4417-bb49-c4a34a6e8efd,1985-11-09,,fc628817-c16b-4854-8059-e075a4e61c97,38341003,Hypertension,1967-06-05,2012-07-09,999-81-9353,S99921158,X31556854X,Mr.,Keshawn545,Spencer391,,,M,white,german,M,Boston MA US,226 Antonetta Green Apt. 146 New Bedford MA 02744 US,45,18
25487cba-48de-4816-92e1-cb0469645e0c,1997-02-09,,fc019f76-246e-4cb1-a136-d352a77eea15,38341003,Hypertension,1978-03-08,,999-23-4977,S99980789,false,Mrs.,Carlos269,Christiansen946,,Fisher354,M,white,irish,F,Webster MA US,69309 Tierra Views Ludlow MA 01056 US,43,19
735e3dfc-b62e-4368-b89b-baf416276834,1993-11-12,,a4c956dc-9e9d-471d-8fbe-94585e6440e1,38341003,Hypertension,1974-11-26,,999-88-9233,S99929935,false,Mrs.,Zackery101,Miller112,,Kovacek705,M,white,english,F,Acushnet MA US,7284 Hane Radial Newburyport MA 01950 US,46,19
758dfc38-576c-4eb7-a162-a170ebe852ae,1959-12-10,,27dd8aae-7d28-4d52-959f-0e3b68cc6e3a,38341003,Hypertension,1941-07-18,,999-83-1720,S99929121,false,Ms.,Vergie449,Nolan207,,,S,white,russian,F,Boston MA US,13598 Jaime Plains Suite 683 Marshfield MA 02050 US,80,18


In [17]:
dfs['pati'].head()

index,ID,BIRTHDATE,DEATHDATE,SSN,DRIVERS,PASSPORT,PREFIX,FIRST,LAST,SUFFIX,MAIDEN,MARITAL,RACE,ETHNICITY,GENDER,BIRTHPLACE,ADDRESS,AGE
0,660bec03-9e58-47f2-98b9-2f1c564f3838,1996-07-26,,999-70-3315,S99945940,false,Ms.,Geovany567,Reichert456,,,,white,irish,F,Fitchburg MA US,20810 Bart Inlet Eastham MA 02642 US,25
1,5125d2b2-3aef-4ae2-aa5c-335f7e206b92,1996-09-24,,999-89-6289,S99991246,false,Ms.,Tianna156,Kuphal267,,,,white,french_canadian,F,Westborough MA US,295 Walter Mill Dennis MA 02638 US,25
2,26626faf-cbd5-48d5-a3bf-a7b21ae08e4b,1944-09-01,2015-09-04,999-79-2204,S99913823,X19963891X,Mr.,Ryleigh341,Mraz432,,,M,white,irish,M,Fall River MA US,23401 Gerhold Fords Eastham MA 02642 US,71
3,f509a0f0-77ef-477f-977d-e2784a241b52,1964-05-14,2010-07-11,999-70-3377,S99930834,false,Mrs.,Amparo640,Bergstrom813,,Green619,M,white,french,F,Cambridge MA US,55368 Suzanne Viaduct Barnstable MA 02630 US,46
4,6e9f8b3e-5a21-401e-868d-2d62e0e7f452,1979-11-03,,999-93-5171,S99927177,X29737332X,Mrs.,Dovie975,Ritchie555,,Hirthe52,M,white,irish,F,Wilmington MA US,603 Mathilde Skyway Barnstable MA 02630 US,41


In [18]:
dfs['pati']['ht'] = dfs['pati'].ID.isin(dfs['cond'][ht].index)
dfs['pati'].groupby(['ht']).RACE.value_counts()

ht     RACE    
False  white       54115
       hispanic     6789
       black        5233
       asian        3917
       native          1
True   white       17937
       hispanic     2407
       black        1683
       asian        1481
       native          2
Name: RACE, dtype: int64

In [19]:
pd.crosstab(dfs['pati'].ht, dfs['pati'].RACE, normalize=0)

ht \ RACE,asian,black,hispanic,native,white
False,0.055913,0.074698,0.09691,1.4e-05,0.772464
True,0.062994,0.071587,0.102382,8.5e-05,0.762952


In [20]:
pd.crosstab(dfs['pati'].ht, dfs['pati'].GENDER, normalize=0)

ht \ GENDER,F,M
False,0.524574,0.475426
True,0.533815,0.466185


In [21]:
pd.crosstab(dfs['pati'].ht, dfs['pati'].MARITAL, normalize=0)

ht \ MARITAL,M,S
False,0.798038,0.201962
True,0.802868,0.197132
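
The crosstabs above normalise each row (`normalize=0`), so they show the demographic mix within each hypertension group. To read off the hypertension rate within each demographic group instead, one can normalise by column. This is a minimal sketch on made-up data; the `GROUP` column and the `"yes"`/`"no"` labels are placeholders for illustration:

```python
import pandas as pd

# Toy data: group "a" has 1 of 4 patients flagged, group "b" has 2 of 4.
df = pd.DataFrame(
    {
        "ht": ["yes", "no", "no", "no", "yes", "yes", "no", "no"],
        "GROUP": ["a", "a", "a", "a", "b", "b", "b", "b"],
    }
)

# Each column sums to 1, so the "yes" row is the rate within that group.
rates = pd.crosstab(df["ht"], df["GROUP"], normalize="columns")
print(rates.loc["yes"].to_dict())  # {'a': 0.25, 'b': 0.5}
```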
