The aim of [Synthea](https://github.com/synthetichealth/synthea) is to generate rich, high quality, representative patient records.

Hand-built [modules](https://synthetichealth.github.io/module-builder/) reflect epidemiology with respect to prevalence, disease heterogeneity, prognosis, etc.

These are based on US statistics.

Data can be generated or downloaded directly, including [specialized](https://synthea.mitre.org/downloads) datasets.

I haven't made any attempt to understand the output FHIR resources.

The authors also wrote a [paper](https://academic.oup.com/jamia/article/25/3/230/4098271?login=true), though it is fairly old now.

In [None]:
#if working in google colab
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import os

os.chdir("PATH_TO/synthea patient data/csv")

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Mounted at /content/drive


In [None]:
#read interesting files
#others relate to billing, organizations
patient_df = pd.read_csv('patients-sample.csv')
allergy_df = pd.read_csv('allergies-sample.csv')
careplan_df = pd.read_csv('careplans-sample.csv')
conditions_df = pd.read_csv('conditions-sample.csv')
device_df = pd.read_csv('devices-sample.csv')
encounter_df = pd.read_csv('encounters-sample.csv')
imaging_study_df = pd.read_csv('imaging_studies-sample.csv')
immunization_df = pd.read_csv('immunizations-sample.csv')
medication_df = pd.read_csv('medications-sample.csv')
observation_df = pd.read_csv('observations-sample.csv')
procedure_df = pd.read_csv('procedures-sample.csv')

In [None]:
#a population subset can be generated from demographic or clinical features
#e.g. only women with hypertension

#join patients to conditions by patient id
patient_df = patient_df.rename(columns={"Id":"PATIENT"})
df = conditions_df.merge(patient_df, how="left", on="PATIENT")

#filter by features
#patient id (PATIENT) can then be used to link patient encounters
df[(df["GENDER"] == "F") & (df["DESCRIPTION"] == 'Essential hypertension (disorder)')].head()


Unnamed: 0,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION,BIRTHDATE,DEATHDATE,SSN,DRIVERS,...,CITY,STATE,COUNTY,FIPS,ZIP,LAT,LON,HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE,INCOME
95,2014-08-14,,29f970ce-bddd-ff20-7925-3a762b94dcb9,6c18a773-f9dd-bd5a-1c8f-ab42e2413d61,59621000,Essential hypertension (disorder),1968-07-25,,999-62-7437,S99928841,...,Franklin,Massachusetts,Norfolk County,25021.0,2038,42.134645,-71.391318,248893.1,487477.96,29597
280,1989-12-01,,08f62b6a-7154-48ea-2641-2e48efdf193c,1846ba3c-fb25-693d-16d0-c3fbab243796,59621000,Essential hypertension (disorder),1961-09-22,,999-75-9391,S99933372,...,Adams,Massachusetts,Berkshire County,25003.0,1220,42.628794,-73.098499,162473.83,1188524.06,34616
356,1998-05-29,,444a32d6-1da4-6d86-1910-8927315faf4a,fa9d3a09-1044-c8c4-e59b-3ee5900847cf,59621000,Essential hypertension (disorder),1964-03-13,,999-46-9375,S99987125,...,Dover,Massachusetts,Norfolk County,25021.0,2030,42.191099,-71.249693,343244.56,461313.96,133018
435,1992-01-13,,57994850-86b9-a639-2213-1fcb7e23863b,1323cb93-4e48-f5f8-36ad-696b9c53761b,59621000,Essential hypertension (disorder),1966-11-07,,999-65-2481,S99939859,...,Cambridge,Massachusetts,Middlesex County,25017.0,2142,42.346561,-71.128471,119116.05,52126.8,170481
502,2013-08-17,,cfe56cd5-98dd-c6a1-004f-056494149bd0,4eb94aa1-724b-7b3e-39db-90a7ee6f1b5a,59621000,Essential hypertension (disorder),1991-06-15,,999-58-2066,S99916724,...,Revere,Massachusetts,Suffolk County,25025.0,2151,42.430337,-70.989395,12003.31,865032.67,20963
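The comment above says the PATIENT id can then be used to link encounters. A minimal sketch of that step, assuming the same column names as the CSV export (`PATIENT`, `GENDER`, `DESCRIPTION`, `Id`). The `toy_*` tables and their rows are invented for illustration and stand in for the real files:

```python
import pandas as pd

# Toy stand-ins for the Synthea CSV tables; rows are invented for illustration.
toy_patients = pd.DataFrame({
    "PATIENT": ["p1", "p2", "p3"],
    "GENDER": ["F", "M", "F"],
})
toy_conditions = pd.DataFrame({
    "PATIENT": ["p1", "p2", "p3"],
    "DESCRIPTION": [
        "Essential hypertension (disorder)",
        "Essential hypertension (disorder)",
        "Viral sinusitis (disorder)",
    ],
})
toy_encounters = pd.DataFrame({
    "Id": ["e1", "e2", "e3", "e4"],
    "PATIENT": ["p1", "p1", "p2", "p3"],
})

# Build the cohort: women with essential hypertension.
joined = toy_conditions.merge(toy_patients, how="left", on="PATIENT")
mask = (joined["GENDER"] == "F") & (
    joined["DESCRIPTION"] == "Essential hypertension (disorder)"
)
cohort_ids = joined.loc[mask, "PATIENT"].unique()

# Keep only the encounters that belong to patients in the cohort.
cohort_encounters = toy_encounters[toy_encounters["PATIENT"].isin(cohort_ids)]
print(cohort_encounters)
```

The same `isin` filter should work on any of the breakdown tables, since they all carry a PATIENT column.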


All patient records (recorded medical events) are organized by Encounter as an overarching container.

Encounter provides a high-level view of a medical event for a given patient. The encounter is then broken down into its component details, which are stored in separate tables linked by the encounter Id.

A single encounter may involve dealing with multiple conditions, different procedures, observations and medications all generating data.

Although highly structured, the encounter may not explicitly describe how this data should be interpreted: which diagnostic tests and treatments are associated with which conditions, patient criteria, etc., both within a single encounter and across the patient history.

In [None]:
#single encounter row
#snomed reason is often empty
encounter_df.iloc[995]

Id                     2bbe616e-da23-1243-0bfc-e94f729cc7fc
START                                  2014-05-12T12:17:56Z
STOP                                   2014-05-12T12:35:23Z
PATIENT                a2dd404f-2f02-e3e3-cc2b-d6329308afd2
ORGANIZATION           6aae7a31-90df-3455-ad8d-81f8cf2d21e8
PROVIDER               c63a1f19-be21-3201-a898-9391d5a0b725
PAYER                  a735bf55-83e9-331a-899d-a82a60b9f60c
ENCOUNTERCLASS                                   outpatient
CODE                                              308335008
DESCRIPTION                     Patient encounter procedure
BASE_ENCOUNTER_COST                                  142.58
TOTAL_CLAIM_COST                                    2705.35
PAYER_COVERAGE                                      2164.28
REASONCODE                                      389095005.0
REASONDESCRIPTION       Contraception care (regime/therapy)
Name: 995, dtype: object
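The note that the SNOMED reason is often empty can be quantified. A rough sketch, assuming `REASONCODE` and `ENCOUNTERCLASS` columns as in the row above. The `toy_encounters` table is invented for illustration, so the numbers say nothing about the real sample:

```python
import pandas as pd

# Toy encounters table; only the two reason codes are taken from the notebook output.
toy_encounters = pd.DataFrame({
    "Id": ["e1", "e2", "e3", "e4"],
    "ENCOUNTERCLASS": ["outpatient", "wellness", "ambulatory", "wellness"],
    "REASONCODE": [389095005.0, None, 88805009.0, None],
})

# Boolean flag: True where the encounter has no reason code.
missing = toy_encounters["REASONCODE"].isna()

# Overall share of encounters without a reason code.
share_missing = missing.mean()

# Same share broken down by encounter class.
by_class = missing.groupby(toy_encounters["ENCOUNTERCLASS"]).mean()

print(f"share of encounters without a reason code: {share_missing:.0%}")
print(by_class)
```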

For the purpose of testing integration in the shorter term, the scale, richness and structure of the data are probably most useful.

But it is also interesting to look at the longitudinal record and how coherent the patient history seems to be:

In [None]:
#files from big data generation filtered for a single patient with an interesting history
#patients with long term conditions can generate hundreds of encounters
patient_df = pd.read_csv('patient-patient-sample.csv')
allergy_df = pd.read_csv('patient-allergy-sample.csv')
careplan_df = pd.read_csv('patient-careplan-sample.csv')
condition_df = pd.read_csv('patient-condition-sample.csv')
device_df = pd.read_csv('patient-device-sample.csv')
encounter_df = pd.read_csv('patient-encounter-sample.csv')
imaging_study_df = pd.read_csv('patient-imaging-sample.csv')
immunization_df = pd.read_csv('patient-immunization-sample.csv')
medication_df = pd.read_csv('patient-medication-sample.csv')
observation_df = pd.read_csv('patient-observation-sample.csv')
procedure_df = pd.read_csv('patient-procedure-sample.csv')

In [None]:
#summary patient info
print(f"number of encounters: {len(encounter_df)}")
print("")
print("conditions over lifetime:\n")
#use the single-patient condition_df loaded above, not the full-sample conditions_df
print("\n".join(list(condition_df["DESCRIPTION"].unique()[:20])))
print("...\n\n")

print("demographic data:\n")
patient_df.loc[0]

number of encounters: 69

conditions over lifetime:

Medication review due (situation)
Sprain (morphologic abnormality)
Sprain of ankle
Risk activity involvement (finding)
Fracture of bone (disorder)
Fracture subluxation of wrist
Housing unsatisfactory (finding)
Received higher education (finding)
Part-time employment (finding)
Social isolation (finding)
Stress (finding)
Reports of violence in the environment (finding)
Viral sinusitis (disorder)
Full-time employment (finding)
Normal pregnancy
Burn injury (morphologic abnormality)
Second degree burn
Patient transfer to skilled nursing facility (procedure)
Body mass index 30+ - obesity (finding)
Received certificate of high school equivalency (finding)
...


demographic data:



PATIENT                9f169b69-7c4e-7b34-ec7a-326ff6313205
BIRTHDATE                                        1950-07-29
DEATHDATE                                        2013-10-10
SSN                                             999-10-3611
DRIVERS                                           S99951019
PASSPORT                                           X256672X
PREFIX                                                 Mrs.
FIRST                                             Temeka297
LAST                                          Rutherford999
SUFFIX                                                  NaN
MAIDEN                                           Waelchi213
MARITAL                                                   M
RACE                                                  asian
ETHNICITY                                       nonhispanic
GENDER                                                    F
BIRTHPLACE                       Belmont  Massachusetts  US
ADDRESS                                 
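One simple way to probe how coherent a history is: check that every encounter falls between birth and death, then order the encounters into a timeline. This is only a sketch. The birth and death dates come from the output above, while the `toy_encounters` rows are invented for illustration:

```python
import pandas as pd

# Birth and death dates as shown in the demographic output above.
toy_patient = pd.Series({"BIRTHDATE": "1950-07-29", "DEATHDATE": "2013-10-10"})

# Invented encounter rows using the same timestamp format as the CSV files.
toy_encounters = pd.DataFrame({
    "Id": ["e1", "e2", "e3"],
    "START": [
        "2011-01-22T10:37:35Z",
        "1995-03-02T09:00:00Z",
        "2013-10-10T08:15:00Z",
    ],
})

start = pd.to_datetime(toy_encounters["START"], utc=True)
birth = pd.Timestamp(toy_patient["BIRTHDATE"], tz="UTC")
# Allow encounters on the day of death (e.g. a death certificate).
death_end = pd.Timestamp(toy_patient["DEATHDATE"], tz="UTC") + pd.Timedelta(days=1)

# Encounters that start before birth or after the day of death would be incoherent.
out_of_range = toy_encounters[(start < birth) | (start >= death_end)]

# Chronological timeline of the encounters.
timeline = toy_encounters.assign(START=start).sort_values("START")

print(f"encounters outside lifetime: {len(out_of_range)}")
print(timeline)
```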

In [None]:
#list of all encounters by Id
#some encounters may not have any more granular data (routine checks, etc.)
encounters = encounter_df["Id"]

Some interesting encounters in this patient's history:

index 19:

patient put on medication for Hyperlipidemia

description and reason for encounter given snomed code

medication given rxnorm code

index 26:

patient has stress and prediabetes

medical device given unique device identifier

index 36:

follow up for Hyperlipidemia generates loads of structured data

observations given loinc code

index 38:

immunization given CVX vaccine code

general exam generates text and structured data

survey questions given loinc code

not all breakdown tables have a reason code. reason may be deduced if encounter covers single issue but not otherwise.

could be nice challenge for reasoner?

index 54:

patient has chronic congestive heart failure

imaging procedure and site given snomed codes

procedures given snomed code

symptom observations don't have a snomed code, but the text looks like an exact snomed term match, so codes can hopefully be appended

index 67:

patient moved into hospice

index 68:

patient death certificate
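The note under index 38 suggests a reason may be deduced when an encounter covers a single issue. A minimal sketch of that heuristic, assuming "single issue" means exactly one row in the conditions table for that encounter. The `toy_*` tables are invented, `REASONCODE_INFERRED` is a hypothetical column name, and the codes and descriptions are copied from the outputs in this notebook:

```python
import pandas as pd

# Toy conditions: encounter e1 has one condition, e2 has two.
toy_conditions = pd.DataFrame({
    "ENCOUNTER": ["e1", "e2", "e2"],
    "CODE": [88805009, 59621000, 88805009],
    "DESCRIPTION": [
        "Chronic congestive heart failure (disorder)",
        "Essential hypertension (disorder)",
        "Chronic congestive heart failure (disorder)",
    ],
})

# Toy procedures: two rows have no reason code.
toy_procedures = pd.DataFrame({
    "ENCOUNTER": ["e1", "e1", "e2"],
    "DESCRIPTION": [
        "Electrocardiographic procedure",
        "Plain chest X-ray (procedure)",
        "Plain chest X-ray (procedure)",
    ],
    "REASONCODE": [88805009.0, None, None],
})

# Keep only encounters with exactly one recorded condition.
single = toy_conditions.groupby("ENCOUNTER").filter(lambda g: len(g) == 1)
encounter_to_code = single.set_index("ENCOUNTER")["CODE"]

# Fill missing reason codes from the single condition; ambiguous encounters stay empty.
inferred = toy_procedures["ENCOUNTER"].map(encounter_to_code)
toy_procedures["REASONCODE_INFERRED"] = toy_procedures["REASONCODE"].fillna(inferred)

print(toy_procedures[["ENCOUNTER", "DESCRIPTION", "REASONCODE_INFERRED"]])
```

Encounters with several conditions are deliberately left unfilled, which is the part that might make a nice challenge for a reasoner.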

In [None]:
#declare encounter by index
encounter_index = 54

#print encounter-level data
print(f"encounter {encounter_index}:\n")
encounter_df.loc[encounter_index]
print("")

#loop through tables and collect any rows linked to that encounter
from IPython.display import display

encounter_id = encounters[encounter_index]
breakdown = []
for name, table in [
    ("allergy", allergy_df),
    ("careplan", careplan_df),
    ("condition", condition_df),
    ("device", device_df),
    ("imaging", imaging_study_df),
    ("immunization", immunization_df),
    ("medication", medication_df),
    ("procedure", procedure_df),
    ("observation", observation_df),
]:
    rows = table[table["ENCOUNTER"] == encounter_id]
    if len(rows) > 0:
        breakdown.append((name, rows))

#display each table that has data for this encounter
for name, rows in breakdown:
    print("")
    print(name + ":\n")
    display(rows)
    print("")

encounter 54:



Id                            72150543-1944-2ceb-67fa-7ab1b241cfdd
START                                         2011-01-22T10:37:35Z
STOP                                          2011-01-22T11:14:23Z
PATIENT                       9f169b69-7c4e-7b34-ec7a-326ff6313205
ORGANIZATION                  74ab949d-17ac-3309-83a0-13b4405c66aa
PROVIDER                      a6835180-1a9f-31d9-8632-53b5cefa59e5
PAYER                         df166300-5a78-3502-a46a-832842197811
ENCOUNTERCLASS                                          ambulatory
CODE                                                     185347001
DESCRIPTION                                  Encounter for problem
BASE_ENCOUNTER_COST                                          85.55
TOTAL_CLAIM_COST                                          11344.96
PAYER_COVERAGE                                            11294.96
REASONCODE                                              88805009.0
REASONDESCRIPTION      Chronic congestive heart failure (disor



condition:



Unnamed: 0,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION
51,2011-01-22,,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,88805009,Chronic congestive heart failure (disorder)




imaging:



Unnamed: 0,Id,DATE,PATIENT,ENCOUNTER,SERIES_UID,BODYSITE_CODE,BODYSITE_DESCRIPTION,MODALITY_CODE,MODALITY_DESCRIPTION,INSTANCE_UID,SOP_CODE,SOP_DESCRIPTION,PROCEDURE_CODE
0,0398c9d0-e7cb-4d1b-ea7c-6776d8469d22,2011-01-22T10:44:23Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,1.2.840.99999999.1.48712235.1295693063795,51185008,Thoracic structure (body structure),CR,Computed Radiography,1.2.840.99999999.1.1.50625298.1295693063795,1.2.840.10008.5.1.4.1.1.1.1,Digital X-Ray Image Storage – for Presentation,399208008




procedure:



Unnamed: 0,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION,BASE_COST,REASONCODE,REASONDESCRIPTION
42,2011-01-22T10:37:35Z,2011-01-22T10:44:23Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,29303009,Electrocardiographic procedure,763.82,88805009.0,Chronic congestive heart failure (disorder)
43,2011-01-22T10:44:23Z,2011-01-22T11:14:23Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,399208008,Plain chest X-ray (procedure),9542.13,,
44,2011-01-22T10:44:23Z,2011-01-22T11:00:02Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,762998009,Assessment using New York Heart Association Cl...,431.4,88805009.0,Chronic congestive heart failure (disorder)




observation:



Unnamed: 0,DATE,PATIENT,ENCOUNTER,CATEGORY,CODE,DESCRIPTION,VALUE,UNITS,TYPE
454,2011-01-22T10:37:35Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,exam,75325-1,Symptom,Dyspnea on exertion (finding),,text
455,2011-01-22T10:37:35Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,exam,75325-1,Symptom,Dyspnea (finding),,text
456,2011-01-22T10:37:35Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,exam,75325-1,Symptom,Paroxysmal dyspnea (finding),,text
457,2011-01-22T10:37:35Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,exam,75325-1,Symptom,Orthopnea (finding),,text
458,2011-01-22T10:37:35Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,exam,75325-1,Symptom,Edema (finding),,text
459,2011-01-22T10:37:35Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,exam,75325-1,Symptom,Rales (finding),,text
460,2011-01-22T10:37:35Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,vital-signs,29463-7,Body Weight,61.0,kg,numeric
461,2011-01-22T10:37:35Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,laboratory,33762-6,NT-proBNP,1739.2,pg/mL,numeric
462,2011-01-22T10:37:35Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,laboratory,6690-2,Leukocytes [#/volume] in Blood by Automated count,7.7,10*3/uL,numeric
463,2011-01-22T10:37:35Z,9f169b69-7c4e-7b34-ec7a-326ff6313205,72150543-1944-2ceb-67fa-7ab1b241cfdd,laboratory,789-8,Erythrocytes [#/volume] in Blood by Automated ...,4.1,10*6/uL,numeric
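The symptom rows above carry the term as text in VALUE with no code. A sketch of appending codes by exact term match (case-insensitive here), assuming a term-to-code lookup is available from some SNOMED CT source. The `term_to_code` dictionary and the `VALUE_CODE` column are hypothetical, and the codes in it are placeholders, not real SNOMED CT identifiers:

```python
import pandas as pd

# Toy observations in the same shape as the output above.
toy_observations = pd.DataFrame({
    "DESCRIPTION": ["Symptom", "Symptom", "Body Weight"],
    "VALUE": ["Dyspnea (finding)", "Unlisted symptom text", "61.0"],
    "TYPE": ["text", "text", "numeric"],
})

# Hypothetical lookup; the codes are placeholders, NOT real SNOMED CT identifiers.
term_to_code = {
    "dyspnea (finding)": "PLACEHOLDER-0001",
    "orthopnea (finding)": "PLACEHOLDER-0002",
}

# Only text observations are candidates for term matching.
is_text = toy_observations["TYPE"] == "text"
normalized = toy_observations["VALUE"].str.strip().str.lower()

# Exact match on the normalized term; anything unmatched becomes NaN.
toy_observations["VALUE_CODE"] = normalized.where(is_text).map(term_to_code)

# Text values that found no match, for manual review.
unmatched = toy_observations[is_text & toy_observations["VALUE_CODE"].isna()]

print(toy_observations)
print(unmatched["VALUE"].tolist())
```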



