Global Burden of Disease (GBD) Study data is a fundamental data source for our simulation models. Understanding what data is available in the GBD and what modeling processes produced it is a difficult task. Some helpful resources for understanding the GBD study are listed below:
- IHME onboarding trainings
- GBD capstone papers and their methods appendices, such as:
- The GBD compare tool, which allows you to visualize GBD estimates
- Your simulation science team members!
- Talking to GBD modelers directly
IHME central computation maintains functions for accessing GBD data, referred to as "Shared Functions." The main HUB page for shared functions can be found here.
Note that central computation maintains a conda environment called :code:`gbd_env` that is guaranteed to have the latest version of all GBD shared functions, as described on the Shared Functions HUB page.
- Note that archived GBD rounds (for example, GBD 2017) may require archived GBD environments to access - see the "Current and Archive GBD environments" subpage for more details.
- Also note that while the :code:`gbd_env` environment is guaranteed to have the most up-to-date versions of shared functions, it is unlikely to include additional packages you may want to use, which is a downside of using this environment.
If you wish to use your own environment and add shared functions to that environment, you may do so using :code:`pip`, but you will first need to add :code:`artifactory.ihme.washington.edu` as a trusted host in your :code:`~/.pip/pip.conf` file, as described on this HUB page. See :ref:`the computing onboarding resource page <computing>` for more information on managing conda environments.
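As a rough illustration of the trusted-host step, the sketch below uses Python's standard-library :code:`configparser` to add the host to the text of a pip configuration file. It assumes the setting belongs under pip's :code:`[global]` section as :code:`trusted-host`; the helper name :code:`add_trusted_host` is hypothetical, and the HUB page remains the authoritative reference for the exact setup.

```python
import configparser
import io

# Host named in the text above.
TRUSTED_HOST = "artifactory.ihme.washington.edu"


def add_trusted_host(pip_conf_text, host=TRUSTED_HOST):
    """Return pip.conf text with `host` listed under [global] trusted-host.

    Hypothetical helper: it only transforms text and never touches the
    real ~/.pip/pip.conf file.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(pip_conf_text)
    if not parser.has_section("global"):
        parser.add_section("global")
    # Existing hosts may be listed on several lines; split on whitespace.
    hosts = parser.get("global", "trusted-host", fallback="").split()
    if host not in hosts:
        hosts.append(host)
    parser.set("global", "trusted-host", "\n".join(hosts))
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(add_trusted_host(""))
```

Writing the returned text to the configuration file is left to the reader, since the appropriate file location can differ between setups.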
The packages most relevant to pulling GBD data using shared functions include :code:`db_queries` and :code:`get_draws`.
Documentation for db_queries can be found here.
Some particularly helpful functions in :code:`db_queries` include:
- :code:`get_ids`: Returns a list of GBD IDs for any entities in GBD (age groups, locations, causes, etc.)
- :code:`get_outputs`: Returns mean value and uncertainty interval for GBD results
- :code:`get_population`: Returns population size estimates
- :code:`get_covariate_estimates`: Returns mean value and uncertainty interval for GBD covariates
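To picture how ID tables are typically paired with other results, the sketch below joins a lookup table of IDs and names onto rows of estimates. No shared functions are called: the rows are made-up stand-ins rather than real GBD values, the column names :code:`cause_id` and :code:`cause_name` are assumptions about the returned tables, and :code:`attach_names` is a hypothetical helper.

```python
# Stand-in for an ID table such as one returned for causes (assumed columns).
cause_ids = [
    {"cause_id": 1, "cause_name": "Example cause A"},
    {"cause_id": 2, "cause_name": "Example cause B"},
]

# Stand-in for estimates keyed by ID (illustrative numbers only).
estimates = [
    {"cause_id": 2, "mean": 0.12, "lower": 0.08, "upper": 0.17},
    {"cause_id": 1, "mean": 0.03, "lower": 0.01, "upper": 0.05},
]


def attach_names(rows, id_table, id_column, name_column):
    """Add a human-readable name to each row by looking up its ID."""
    lookup = {entry[id_column]: entry[name_column] for entry in id_table}
    # IDs missing from the lookup get None rather than raising.
    return [{**row, name_column: lookup.get(row[id_column])} for row in rows]


named = attach_names(estimates, cause_ids, "cause_id", "cause_name")
for row in named:
    print(row["cause_id"], row["cause_name"], row["mean"])
```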
Documentation for get_draws can be found here.
:code:`get_draws` differs from :code:`db_queries.get_outputs` in that, rather than returning a mean estimate and uncertainty interval, it returns draw-level estimates from which a mean value and uncertainty interval can be estimated. Unlike :code:`db_queries.get_outputs`, :code:`get_draws` does not automatically aggregate results from the most detailed estimates (for instance, it returns sex-specific values and will not automatically return values for sex_id=3, "both" sexes combined).
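The two points above can be sketched with plain Python lists standing in for draw-level values. The first helper summarizes draws as a mean with the 2.5th and 97.5th percentiles as the uncertainty interval; the second combines sex-specific rates into a both-sexes rate, draw by draw, using population weights. The function names and the linear-interpolation percentile rule are assumptions of this sketch, not the behavior of any shared function.

```python
from statistics import mean


def percentile(values, q):
    """Linearly interpolated percentile, with q between 0 and 100."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize_draws(draws):
    """Collapse draw-level values to a mean and a 95% uncertainty interval."""
    return {
        "mean": mean(draws),
        "lower": percentile(draws, 2.5),
        "upper": percentile(draws, 97.5),
    }


def combine_sexes(male_rates, female_rates, male_population, female_population):
    """Population-weighted both-sexes rate, computed separately for each draw."""
    total = male_population + female_population
    return [
        (m * male_population + f * female_population) / total
        for m, f in zip(male_rates, female_rates)
    ]


# Illustrative values only, not GBD estimates.
print(summarize_draws([0.10, 0.12, 0.11, 0.13, 0.09]))
print(combine_sexes([0.1, 0.2], [0.3, 0.4], 100, 300))
```

Aggregating at the draw level before summarizing keeps the uncertainty interval of the combined estimate consistent with the draws it came from.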
Additionally, certain intermediate values used in GBD, such as risk exposures and relative risks, are not available in GBD's final results found in :code:`db_queries.get_outputs` and can only be pulled using :code:`get_draws`. The various data sources available in :code:`get_draws` are summarized in the table below and also described in more detail on the get_draws documentation page here.
Source | Description | GBD ID type | Note |
---|---|---|---|
epi | Dismod and custom epi models. This source contains data that is computed by GBD modelers and often used as inputs to central GBD processes. | modelable_entity_id | |
codcorrect | Deaths and YLLs | cause_id | Returns counts only |
como | YLDs, incidence, and comorbidity-adjusted prevalence | cause_id, sequela_id, rei_id | Returns rates only |
dalynator | DALYs | cause_id | |
exposure | Risk factor exposure | rei_id | Can be continuous (like mean BMI) or categorical (like stunting prevalence) |
exposure_sd | Risk exposure standard deviation | rei_id | Only for continuous risks |
rr | Risk factor relative risk | rei_id, cause_id | Will return values for all affected causes unless a cause_id is specified |
burdenator | Risk attributable burden (deaths/DALYs/YLLs/YLDs) and mediated/aggregated PAFs | cause_id, rei_id | |
paf | Pre-burdenator (non-finalized) PAF estimates | rei_id, cause_id | |
sev | Summary exposure values | rei_id | |
tmrel | Risk factor theoretical minimum exposure level | rei_id | |
rr_max | Relative risk maximum value | rei_id, cause_id | |
codem | CODEm models and custom cod models | cause_id | |
stgpr | ST-GPR models | modelable_entity_id | If you pass an MEID with a dismod model type but try to use the ST-GPR source, get_draws will use the epi source instead. |
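One way to keep the table above close at hand is to encode it as a lookup and check a planned source and ID type pairing before making a call. The sketch below does only that: the dictionary simply transcribes the "GBD ID type" column, :code:`check_id_type` is a hypothetical helper, and the table does not say whether a source with several ID types takes them together or as alternatives, so the check treats each listed type as acceptable.

```python
# Transcription of the "Source" and "GBD ID type" columns from the table above.
SOURCE_ID_TYPES = {
    "epi": ("modelable_entity_id",),
    "codcorrect": ("cause_id",),
    "como": ("cause_id", "sequela_id", "rei_id"),
    "dalynator": ("cause_id",),
    "exposure": ("rei_id",),
    "exposure_sd": ("rei_id",),
    "rr": ("rei_id", "cause_id"),
    "burdenator": ("cause_id", "rei_id"),
    "paf": ("rei_id", "cause_id"),
    "sev": ("rei_id",),
    "tmrel": ("rei_id",),
    "rr_max": ("rei_id", "cause_id"),
    "codem": ("cause_id",),
    "stgpr": ("modelable_entity_id",),
}


def check_id_type(source, gbd_id_type):
    """Raise ValueError if the ID type is not listed for the source."""
    if source not in SOURCE_ID_TYPES:
        raise ValueError(f"Unknown source: {source!r}")
    allowed = SOURCE_ID_TYPES[source]
    if gbd_id_type not in allowed:
        raise ValueError(
            f"Source {source!r} is listed with {allowed}, not {gbd_id_type!r}"
        )
    return True


print(check_id_type("codcorrect", "cause_id"))
```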
Decomposition (or "decomp") steps are a versioning scheme used in some GBD rounds that allowed updates to GBD results based on iterative updates to certain parts of the computation process. For instance, the first step may be equivalent to the prior GBD round in all aspects except for an updated demographic model; the second step may be equivalent to the prior steps, but with updated risk exposures; and so on. This process allowed GBD researchers to evaluate how individual components of the many changes included in a GBD round advancement influenced the main results of the GBD study, rather than updating the entire pipeline at once.
When pulling GBD data from GBD rounds that used decomp step versioning, you are required to specify a :code:`decomp_step` value in your shared functions call.
Unfortunately, the steps are not necessarily equivalent between GBD rounds. For this reason, we advise consulting the HUB space specific to the GBD round you are interested in, which often contains information about that round's "Decomposition rules."
For reference, the decomposition rules for GBD 2021 can be found here.
Additionally, you may be required to specify a :code:`version_id`, :code:`release_id`, and/or :code:`status` when pulling GBD results from certain GBD rounds. The HUB space for a given GBD round is a good resource on where to obtain this information, but do not hesitate to open a helpdesk ticket to inquire or confirm whether you are using appropriate versioning IDs for your GBD shared functions call.
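Because the required versioning arguments differ between rounds, it can help to assemble them in one place before making a call. The sketch below is a hypothetical helper, not part of any shared function: the rule that :code:`release_id` is not mixed with :code:`gbd_round_id` and :code:`decomp_step` is an assumption of this sketch, the caller states whether the round uses decomp versioning, and the ID values in the usage lines are placeholders rather than recommendations.

```python
def versioning_kwargs(
    release_id=None,
    gbd_round_id=None,
    decomp_step=None,
    round_uses_decomp=False,
):
    """Collect versioning arguments into a dict, failing early on gaps.

    Hypothetical helper; check the HUB space for the round you need to
    learn which arguments and values actually apply.
    """
    if release_id is not None:
        if gbd_round_id is not None or decomp_step is not None:
            raise ValueError(
                "Pass either release_id or gbd_round_id/decomp_step, not both."
            )
        return {"release_id": release_id}
    if gbd_round_id is None:
        raise ValueError("Provide a release_id or a gbd_round_id.")
    if round_uses_decomp and decomp_step is None:
        raise ValueError("This round uses decomp versioning: decomp_step is required.")
    kwargs = {"gbd_round_id": gbd_round_id}
    if decomp_step is not None:
        kwargs["decomp_step"] = decomp_step
    return kwargs


# Placeholder values for illustration only.
print(versioning_kwargs(gbd_round_id=0, decomp_step="placeholder", round_uses_decomp=True))
print(versioning_kwargs(release_id=0))
```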
.. todo:: Discuss release_id as preferred alternative to gbd_round_id + decomp_step.
There are two main packages within the Vivarium software framework that are especially useful for interacting with GBD data: gbd_mapping and vivarium_inputs.
Both of these packages translate ID numbers used in GBD to human-readable text.
:code:`gbd_mapping` provides a convenient way to access all of the metadata associated with a given GBD entity (e.g., the diarrheal diseases cause or the child growth failure risk factor), but does not return any estimates associated with that entity (e.g., prevalence or relative risks).
:code:`vivarium_inputs` provides simplified functions to query GBD data and reformats the data to be compatible with the data structure required for building Vivarium Artifact objects. :code:`vivarium_inputs` generally returns data for the most up-to-date complete GBD round/release and does not allow users to specify prior rounds/releases; ask the software engineers if you have questions about which GBD round/release is active in :code:`vivarium_inputs` at any given time. Additionally, if there is any doubt as to which GBD versioning is being returned by a given :code:`vivarium_inputs` call, you can use :code:`get_raw_data`, which will return full data including GBD versioning IDs for a given call.
For documentation on Vivarium Inputs, click here.
Some important notes and considerations not included in the documentation above are listed below:
- All data returned is filtered to 500 draws (draw 0 through 499), even if more draws are available
- Returns data for all most-detailed age groups and sexes; if any such data is missing in GBD, NaNs will be filled with zeros
- Returns default version IDs within the GBD system
- Returns data specific to the most recent published year unless the user specifies to return all available years
  - Note: will return data specific to 2021 for GBD 2021, despite estimates being available for 2022, because the 2022 year was not published as part of GBD 2021
  - Note: for non-log-linear relative risks, GBD returns data for a single year only. In these cases, vivarium inputs will return that data and label it as 2021 data (even if GBD does not claim it to be specific to 2021; notably, however, relative risks in GBD do not vary by year)
Note
The above notes and considerations were written in May of 2024. Updates to vivarium inputs may affect these notes and they should be updated accordingly.
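The draw filtering and zero-filling behavior listed above can be pictured with a small sketch on a single row of data. This is an illustration of the described behavior, not the vivarium inputs implementation: the column naming :code:`draw_0`, :code:`draw_1`, and so on is an assumption, and :code:`apply_input_conventions` is a hypothetical name.

```python
import math


def apply_input_conventions(row, n_draws=500):
    """Keep draw columns 0..n_draws-1 and replace missing draw values with 0."""
    kept = {}
    for column, value in row.items():
        if column.startswith("draw_"):
            draw_number = int(column.split("_")[1])
            if draw_number >= n_draws:
                continue  # drop draws beyond the first n_draws
            if isinstance(value, float) and math.isnan(value):
                value = 0.0  # missing estimate filled with zero
        kept[column] = value
    return kept


# Illustrative row only, not real GBD data.
row = {"sex": "Female", "draw_0": math.nan, "draw_499": 0.2, "draw_500": 0.3}
print(apply_input_conventions(row))
```

Zero-filling means a missing estimate and a true zero are indistinguishable afterwards, which is worth keeping in mind when a returned value looks implausibly low.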
Measure | Data returned | Note |
---|---|---|
'incidence' | GBD_incidence / (1 - GBD_prevalence) | By default, get_measures automatically converts GBD's "population-level incidence rates" to "susceptible population incidence rates" using the GBD estimate of prevalence. Note that if a model is using an alternative value for prevalence, this rescaling should be done separately using that prevalence value. |
'raw_incidence_rate' | GBD_incidence | |
'cause_specific_mortality' | GBD_death_count / GBD_population_counts | |
'excess_mortality' | cause_specific_mortality / GBD_prevalence | By default, get_measures calculates excess mortality rates in accordance with the GBD estimate of prevalence. If a model is using an alternative value for cause prevalence, excess mortality rates should likely be calculated separately using that prevalence value. |
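The formulas in the table above can be written out directly, which is useful when a model substitutes its own prevalence value and needs to redo the rescaling. The function names below are hypothetical and the inputs are illustrative numbers, not GBD estimates.

```python
def susceptible_incidence_rate(gbd_incidence, prevalence):
    """Incidence among the susceptible population: incidence / (1 - prevalence)."""
    return gbd_incidence / (1 - prevalence)


def cause_specific_mortality_rate(death_count, population_count):
    """Deaths due to the cause divided by the total population."""
    return death_count / population_count


def excess_mortality_rate(csmr, prevalence):
    """Cause-specific mortality rate divided by prevalence of the cause."""
    return csmr / prevalence


# Illustrative inputs only.
prevalence = 0.2
csmr = cause_specific_mortality_rate(death_count=50, population_count=10_000)
print(susceptible_incidence_rate(0.02, prevalence))
print(csmr)
print(excess_mortality_rate(csmr, prevalence))
```

Passing a model-specific prevalence to the first and third functions reproduces the separate rescaling that the notes in the table recommend.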
.. todo:: Link notebook that shows examples of using these functions.
Generally, GBD shared functions offer greater flexibility in querying GBD data than Vivarium Inputs, but require specification of detailed IDs that are not human-readable and require translation with get_ids. Vivarium Inputs offers less flexibility in favor of the convenience of returning a human-readable version of the most relevant data for running Vivarium simulations and compatibility with required Vivarium Artifact formatting. Therefore, GBD shared functions may be the code base to use when taking deep dives into GBD data, and Vivarium Inputs when preparing GBD data for Vivarium simulations. Some additional specific considerations about the differences between the two options are summarized in the table below.
Topic | GBD Shared Functions | Vivarium Inputs |
---|---|---|
GBD round | Able to specify any GBD round/release; useful for noting and comparing major changes between rounds | Returns most recent complete GBD round/release only |
DALYs | Returns YLD, YLL, DALY estimates | Does not return YLD, YLL, or DALY estimates |
Metrics | Returns counts, rates, and prevalence estimates | Conveniently returns rate estimates, with the exception of population structure, which is returned as counts |
Summary values | Can return mean, upper, and lower estimates using get_outputs | Returns draw-level estimates only |
Age/sex/location specificity | Allows for specification across all these parameters, allows for grouping (via get_outputs) and/or aggregation (via make_custom_aggregates) across demographic categories | Returns all most-detailed age and sex estimates. Supports only one location at a time. |
Format | Generally uses ID numbers that are not human-readable before pairing with get_ids information | Converts to human readable entity names rather than IDs and is compatible with formatting required for vivarium Artifacts and simulations |
Note
Generally, to convert a GBD shared function entity name (such as cause_name) to the corresponding entity name in Vivarium Inputs, convert the GBD shared function entity name to all lower case and replace spaces with underscores. Python code to do this is shown below:
vivarium_inputs_entity_name = gbd_entity_name.lower().replace(' ', '_')
There are some exceptions to this code that will require additional conversion, which can be viewed in the :code:`clean_entity_list` method in the vivarium inputs source code, found here.
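As a usage example, the conversion can be wrapped in a small function and applied to the two entities mentioned earlier on this page. The function name is hypothetical, and the exceptions handled in the vivarium inputs source code are deliberately not covered here.

```python
def to_vivarium_inputs_name(gbd_entity_name):
    """Lower-case a GBD entity name and replace spaces with underscores.

    Covers the general rule only; names needing extra conversion are not handled.
    """
    return gbd_entity_name.lower().replace(" ", "_")


for name in ["Diarrheal diseases", "Child growth failure"]:
    print(name, "->", to_vivarium_inputs_name(name))
```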