# Extract HT/WT from Synthea CSV

For a directory of CSV files generated by Synthea, extract a single file of height and weight observations for use in growthcleanr data testing.

## Synthea modifications and patient generation

This requires two changes to Synthea. First, the growth data error generator is hard-coded to stop generating errors at age 20. Because we are looking at adults, raise `MAX_AGE` as shown in the diff below.

```bash
11:59:12 ❯ git diff src/main/java/org/mitre/synthea/editors/GrowthDataErrorsEditor.java

src/main/java/org/mitre/synthea/editors/GrowthDataErrorsEditor.java
───────────────────────────────────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────┐
public class GrowthDataErrorsEditor implements HealthRecordEditor { │
────────────────────────────────────────────────────────────────────┘
 34 ⋮ 34 │
 35 ⋮ 35 │  public GrowthDataErrorsEditor() { }
 36 ⋮ 36 │
 37 ⋮    │  public static int MAX_AGE = 20;
    ⋮ 37 │  // public static int MAX_AGE = 20;
    ⋮ 38 │  public static int MAX_AGE = 70;
 38 ⋮ 39 │  public static double POUNDS_PER_KG = 2.205;
 39 ⋮ 40 │  public static double INCHES_PER_CM = 0.394;
```

Second, enable `growtherrors` and CSV export in `synthea.properties`:

```bash
11:59:23 ❯ git diff src/main/resources/synthea.properties

src/main/resources/synthea.properties
───────────────────────────────────────────────────────────────────────────────────────────────────

─────────────────────────────────────────┐
exporter.practitioner.fhir.export = true │
─────────────────────────────────────────┘
 25 ⋮ 25 │exporter.practitioner.fhir_stu3.export = false
 26 ⋮ 26 │exporter.practitioner.fhir_dstu2.export = false
 27 ⋮ 27 │exporter.encoding = UTF-8
 28 ⋮    │exporter.csv.export = false
    ⋮ 28 │# exporter.csv.export = false
    ⋮ 29 │exporter.csv.export = true
 29 ⋮ 30 │# if exporter.csv.append_mode = true, then each run will add new data to any existing CSVs. if false, each run will clear out the files and start fresh
 30 ⋮ 31 │exporter.csv.append_mode = false
 31 ⋮ 32 │# if exporter.csv.folder_per_run = true, then each run will have CSVs placed into a unique subfolder. if false, each run will only use the top-level csv folder

─────────────────────────────────┐
physiology.state.enabled = false │
─────────────────────────────────┘
226 ⋮227 │
227 ⋮228 │# set to true to introduce errors in height, weight and BMI observations for people
228 ⋮229 │# under 20 years old
229 ⋮    │growtherrors = false
    ⋮230 │# growtherrors = false
    ⋮231 │growtherrors = true
```
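Before rebuilding, it can help to confirm that both properties are actually set. The following is a minimal sketch, not part of Synthea: `read_properties` and `check_synthea_settings` are hypothetical helper names, and the parser only handles simple `key = value` lines and `#` comments, which is all the diff above shows.

```python
from pathlib import Path

# The two settings changed in the diff above.
REQUIRED = {"exporter.csv.export": "true", "growtherrors": "true"}

def read_properties(text):
    """Parse simple `key = value` lines, skipping blanks and `#` comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def check_synthea_settings(text, required=REQUIRED):
    """Return {key: actual_value} for each required key not set as expected."""
    props = read_properties(text)
    return {k: props.get(k) for k, v in required.items() if props.get(k) != v}

# Usage, assuming the working directory is the Synthea checkout:
# text = Path("src/main/resources/synthea.properties").read_text(encoding="utf-8")
# print(check_synthea_settings(text) or "settings look right")
```

An empty result means both settings have the expected values; anything else lists the keys that still need editing.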

Finally, recompile and run Synthea:

```bash
% ./gradlew build
```

...wait for it... then:

```bash
% ./run_synthea -s 42 -p 1000 -a 18-70
```

This runs Synthea with a random seed of 42 and generates 1000 patients aged 18 to 70. The output directory should look something like this:

In [1]:
!exa --long ~/projects/synthea/output/csv

.[1;33mr[31mw[0m[38;5;244m-[33mr[38;5;244m--[33mr[38;5;244m--[0m  [1;32m50[0m[32mk[0m [1;33mdlchudnov[0m [34m 9 Oct 12:57[0m allergies.csv
.[1;33mr[31mw[0m[38;5;244m-[33mr[38;5;244m--[33mr[38;5;244m--[0m [1;32m993[0m[32mk[0m [1;33mdlchudnov[0m [34m 9 Oct 12:57[0m careplans.csv
.[1;33mr[31mw[0m[38;5;244m-[33mr[38;5;244m--[33mr[38;5;244m--[0m [1;32m1.8[0m[32mM[0m [1;33mdlchudnov[0m [34m 9 Oct 12:57[0m conditions.csv
.[1;33mr[31mw[0m[38;5;244m-[33mr[38;5;244m--[33mr[38;5;244m--[0m  [1;32m35[0m[32mk[0m [1;33mdlchudnov[0m [34m 9 Oct 12:57[0m devices.csv
.[1;33mr[31mw[0m[38;5;244m-[33mr[38;5;244m--[33mr[38;5;244m--[0m  [1;32m14[0m[32mM[0m [1;33mdlchudnov[0m [34m 9 Oct 12:57[0m encounters.csv
.[1;33mr[31mw[0m[38;5;244m-[33mr[38;5;244m--[33mr[38;5;244m--[0m  [1;32m87[0m[32mk[0m [1;33mdlchudnov[0m [34m 9 Oct 12:57[0m imaging_studies.csv
.[1;33mr[31mw[0m[38;5;244m-[33mr[38;5;244m--[33mr[

Note that the Synthea CSV export schema is detailed at https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.

## Extracting HT/WT for growthcleanr

The files `patients.csv` and `observations.csv` should have everything we need.
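A quick way to check that assumption before loading anything is to read only the header row of each file. This is a sketch using the standard library; `missing_columns` and `NEEDED` are hypothetical names, and the column lists are the ones this notebook goes on to use.

```python
import csv

# Columns this notebook reads from each file.
NEEDED = {
    "patients.csv": ["Id", "BIRTHDATE", "GENDER"],
    "observations.csv": ["DATE", "PATIENT", "DESCRIPTION", "VALUE", "UNITS"],
}

def missing_columns(path, required):
    """Return the required column names absent from the CSV header at `path`."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return [name for name in required if name not in header]

# Usage, assuming the export directory used in this notebook:
# import os
# base = os.path.expanduser("~/projects/synthea/output/csv")
# for name, cols in NEEDED.items():
#     print(name, missing_columns(os.path.join(base, name), cols))
```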

In [2]:
from datetime import timedelta
import numpy as np
import pandas as pd

In [3]:
patients = pd.read_csv("~/projects/synthea/output/csv/patients.csv")
patients.head()

Unnamed: 0,Id,BIRTHDATE,DEATHDATE,SSN,DRIVERS,PASSPORT,PREFIX,FIRST,LAST,SUFFIX,...,BIRTHPLACE,ADDRESS,CITY,STATE,COUNTY,ZIP,LAT,LON,HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE
0,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,1980-01-30,,999-32-1574,S99934551,X26770241X,Ms.,Adriana394,Delatorre612,,...,Juarez Chihuahua MX,704 Spinka View,Boston,Massachusetts,Suffolk County,2114.0,42.39131,-71.016923,983906.1,4696.24
1,3829c803-1f4c-74ed-0d8f-36e502cadd0f,1977-03-13,,999-21-2332,S99919628,X54784958X,Mr.,Cordell41,Eichmann909,,...,Chelmsford Massachusetts US,560 Ritchie Way Suite 68,Swansea,Massachusetts,Bristol County,,41.748125,-71.182914,999629.9,3603.8
2,a074203a-4773-9330-fc6a-06307ed6b3d7,1999-04-03,,999-19-8874,S99921918,X16490456X,Ms.,Melodi744,Aufderhar910,,...,Hanoi Hà Đông VN,685 Balistreri Mall Apt 21,Weymouth,Massachusetts,Norfolk County,2191.0,42.287965,-70.969244,524335.14,3732.2
3,a3795ec8-54f3-e99e-a4b1-4c067f3141d7,1959-01-13,,999-62-4431,S99950943,X4238287X,Mr.,Dick869,Streich926,,...,Swansea Massachusetts US,1064 Hickle View Apt 7,Chicopee,Massachusetts,Hampden County,1020.0,42.198239,-72.554752,20974.02,0.0
4,79981661-8e0a-e0ba-6c1d-9b7f58ce8ec3,1964-08-09,,999-60-5682,S99933038,X28305188X,Mrs.,Earnestine14,Corwin846,,...,Fall River Massachusetts US,206 Stokes Lane,Dartmouth,Massachusetts,Bristol County,,41.553624,-70.931731,1295872.75,13882.82


In [4]:
patients.columns

Index(['Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS', 'PASSPORT', 'PREFIX',
       'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL', 'RACE', 'ETHNICITY',
       'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE', 'COUNTY', 'ZIP',
       'LAT', 'LON', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'],
      dtype='object')

In [5]:
patients['GENDER'].describe()

count     1097
unique       2
top          F
freq       560
Name: GENDER, dtype: object

From `patients` we need only `Id`, `BIRTHDATE`, and `GENDER`.

In [6]:
p = patients[["Id", "BIRTHDATE", "GENDER"]]
p.head()

Unnamed: 0,Id,BIRTHDATE,GENDER
0,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,1980-01-30,F
1,3829c803-1f4c-74ed-0d8f-36e502cadd0f,1977-03-13,M
2,a074203a-4773-9330-fc6a-06307ed6b3d7,1999-04-03,F
3,a3795ec8-54f3-e99e-a4b1-4c067f3141d7,1959-01-13,M
4,79981661-8e0a-e0ba-6c1d-9b7f58ce8ec3,1964-08-09,F


In [7]:
observations = pd.read_csv("~/projects/synthea/output/csv/observations.csv")
observations.head()

Unnamed: 0,DATE,PATIENT,ENCOUNTER,CODE,DESCRIPTION,VALUE,UNITS,TYPE
0,2011-04-20T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,31253c44-4abb-47ff-a956-c42d581be22e,8302-2,Body Height,158.1,cm,numeric
1,2011-04-20T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,31253c44-4abb-47ff-a956-c42d581be22e,72514-3,Pain severity - 0-10 verbal numeric rating [Sc...,4.0,{score},numeric
2,2011-04-20T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,31253c44-4abb-47ff-a956-c42d581be22e,29463-7,Body Weight,50.6,kg,numeric
3,2011-04-20T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,31253c44-4abb-47ff-a956-c42d581be22e,39156-5,Body Mass Index,20.2,kg/m2,numeric
4,2011-04-20T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,31253c44-4abb-47ff-a956-c42d581be22e,8462-4,Diastolic Blood Pressure,70.0,mm[Hg],numeric


Verify that there's only one code for each of Body Height and Body Weight, and that units are consistent.

In [8]:
observations.loc[observations["DESCRIPTION"] == "Body Weight"].groupby(["CODE", "DESCRIPTION", "UNITS"])["UNITS"].describe()

CODE,DESCRIPTION,UNITS,count,unique,top,freq
29463-7,Body Weight,kg,16403,1,kg,16403


In [9]:
observations.loc[observations["DESCRIPTION"] == "Body Height"].groupby(["CODE", "DESCRIPTION", "UNITS"])["UNITS"].describe()

CODE,DESCRIPTION,UNITS,count,unique,top,freq
8302-2,Body Height,cm,12622,1,cm,12622
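The same check can be made to fail loudly rather than relying on reading the table by eye. This is a sketch only; `codes_and_units` and `assert_consistent` are hypothetical helpers, and they assume the `CODE`, `DESCRIPTION`, and `UNITS` columns shown in `observations.head()`.

```python
import pandas as pd

def codes_and_units(observations, description):
    """Return the distinct (CODE, UNITS) pairs recorded for one DESCRIPTION."""
    rows = observations.loc[observations["DESCRIPTION"] == description]
    return sorted(set(zip(rows["CODE"], rows["UNITS"])))

def assert_consistent(observations, description):
    """Raise ValueError unless exactly one (CODE, UNITS) pair is present."""
    pairs = codes_and_units(observations, description)
    if len(pairs) != 1:
        raise ValueError(f"{description}: expected one code/unit pair, found {pairs}")
    return pairs[0]

# Usage with the frame loaded above:
# assert_consistent(observations, "Body Height")
# assert_consistent(observations, "Body Weight")
```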


Looks right! Now slim that down to just the height and weight rows, and rename columns to match the target format.

In [10]:
o = observations.loc[observations["DESCRIPTION"].isin(["Body Height", "Body Weight"])].copy()

In [11]:
o["id"] = np.arange(len(o)) + 1
o = o.assign(param=lambda r: np.where(r["DESCRIPTION"] == "Body Height", "HEIGHTCM", "WEIGHTKG"))
o = o.rename(columns={"PATIENT": "subjid", "VALUE": "measurement"})
o = o[["id", "DATE", "subjid", "param", "measurement"]]
o.head()

Unnamed: 0,id,DATE,subjid,param,measurement
0,1,2011-04-20T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1
2,2,2011-04-20T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,WEIGHTKG,50.6
22,3,2014-04-23T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1
24,4,2014-04-23T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,WEIGHTKG,50.6
55,5,2017-04-26T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1


### Target format

```csv
id   subjid  sex  age_years  param     measurement
1    1       1    18         HEIGHTCM  212.491019752261
2    1       1    18.8       HEIGHTCM  208.312323310149
```

In [12]:
p.head()

Unnamed: 0,Id,BIRTHDATE,GENDER
0,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,1980-01-30,F
1,3829c803-1f4c-74ed-0d8f-36e502cadd0f,1977-03-13,M
2,a074203a-4773-9330-fc6a-06307ed6b3d7,1999-04-03,F
3,a3795ec8-54f3-e99e-a4b1-4c067f3141d7,1959-01-13,M
4,79981661-8e0a-e0ba-6c1d-9b7f58ce8ec3,1964-08-09,F


In [13]:
c = o.merge(p, left_on="subjid", right_on="Id").drop(columns=["Id"])
c.head()

Unnamed: 0,id,DATE,subjid,param,measurement,BIRTHDATE,GENDER
0,1,2011-04-20T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1,1980-01-30,F
1,2,2011-04-20T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,WEIGHTKG,50.6,1980-01-30,F
2,3,2014-04-23T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1,1980-01-30,F
3,4,2014-04-23T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,WEIGHTKG,50.6,1980-01-30,F
4,5,2017-04-26T05:26:45Z,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1,1980-01-30,F
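The merge above is an inner join, so an observation whose patient is missing from `p` would be dropped silently, and a repeated `Id` would duplicate rows. A hedged variant that guards against both is sketched below; `merge_checked` is a hypothetical helper, while `validate="many_to_one"` is a standard `DataFrame.merge` option.

```python
import pandas as pd

def merge_checked(obs, pats):
    """Inner-join observations to patients, failing loudly on surprises."""
    merged = obs.merge(
        pats,
        left_on="subjid",
        right_on="Id",
        validate="many_to_one",  # raises MergeError if a patient Id repeats
    ).drop(columns=["Id"])
    if len(merged) != len(obs):
        raise ValueError(f"{len(obs) - len(merged)} observations had no matching patient")
    return merged

# Usage with the frames built above:
# c = merge_checked(o, p)
```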


In [14]:
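# Parse both columns as UTC so the "Z"-suffixed DATE and the date-only BIRTHDATE can be subtracted.
c = c.assign(age_years=lambda r: (pd.to_datetime(r["DATE"], utc=True) - pd.to_datetime(r["BIRTHDATE"], utc=True)) / timedelta(days=365.25))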
c = c.drop(columns=["DATE", "BIRTHDATE"])
c.head()

Unnamed: 0,id,subjid,param,measurement,GENDER,age_years
0,1,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1,F,31.220334
1,2,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,WEIGHTKG,50.6,F,31.220334
2,3,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1,F,34.229232
3,4,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,WEIGHTKG,50.6,F,34.229232
4,5,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1,F,37.23813
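To make the age arithmetic concrete, here is the first row worked through with only the standard library. `age_in_years` is a hypothetical helper; it assumes the `DATE` and `BIRTHDATE` string formats shown in the outputs above and treats the birth date as midnight UTC.

```python
from datetime import datetime, timedelta, timezone

def age_in_years(observed, birthdate):
    """Age at observation, in units of 365.25 days.

    `observed` looks like "2011-04-20T05:26:45Z"; `birthdate` like "1980-01-30".
    """
    seen = datetime.strptime(observed, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    born = datetime.strptime(birthdate, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return (seen - born) / timedelta(days=365.25)

# 11403 days plus 5:26:45 between the two timestamps, divided by 365.25.
print(round(age_in_years("2011-04-20T05:26:45Z", "1980-01-30"), 6))  # 31.220334
```

Dividing by 365.25 days is an approximation of calendar age, so a value can differ slightly from the age in whole calendar years around a birthday.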


In [15]:
c = c.assign(sex=lambda r: np.where(r["GENDER"] == "M", 0, 1))
c = c.drop(columns=["GENDER"])
c.head()

Unnamed: 0,id,subjid,param,measurement,age_years,sex
0,1,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1,31.220334,1
1,2,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,WEIGHTKG,50.6,31.220334,1
2,3,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1,34.229232,1
3,4,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,WEIGHTKG,50.6,34.229232,1
4,5,e74107cd-69f6-56b0-b7b2-70d82d50ad4e,HEIGHTCM,158.1,37.23813,1
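Note that the `np.where` call codes every value other than `"M"` as 1, including missing or unexpected values. If that matters for a given dataset, an explicit mapping is one alternative; this is a sketch, with `SEX_CODES` and `encode_sex` as hypothetical names.

```python
import pandas as pd

SEX_CODES = {"M": 0, "F": 1}  # same coding as the np.where call above

def encode_sex(gender):
    """Map GENDER strings to 0/1, raising on anything outside SEX_CODES."""
    coded = gender.map(SEX_CODES)
    if coded.isna().any():
        bad = sorted(gender[coded.isna()].astype(str).unique())
        raise ValueError(f"unexpected GENDER values: {bad}")
    return coded.astype(int)

# Usage, in place of the np.where call and before dropping GENDER:
# c = c.assign(sex=encode_sex(c["GENDER"]))
```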


In [16]:
c.to_csv("/tmp/synthetic-observations.csv", index=False)
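The file written here has the same columns as the target format but in a different order. Readers that look columns up by name will not care, but if a consumer depends on position (an assumption, not something this notebook establishes), the frame can be reordered first. `TARGET_COLUMNS` and `to_target_layout` are hypothetical names in this sketch.

```python
import pandas as pd

# Column order taken from the "Target format" example above.
TARGET_COLUMNS = ["id", "subjid", "sex", "age_years", "param", "measurement"]

def to_target_layout(frame):
    """Return `frame` with exactly the target columns, in the target order."""
    missing = [name for name in TARGET_COLUMNS if name not in frame.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return frame[TARGET_COLUMNS]

# Usage:
# to_target_layout(c).to_csv("/tmp/synthetic-observations.csv", index=False)
```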