Understanding and Pulling GBD Data

Global Burden of Disease (GBD) Study data is a fundamental data source for our simulation models. Understanding what data is available in the GBD and what modeling processes produced it is a difficult task. Some helpful resources for understanding the GBD study are listed below:

Pulling GBD Data using Shared Functions

IHME central computation maintains functions for accessing GBD data, referred to as "Shared Functions." The main HUB page for shared functions can be found here.

Central computation maintains a conda environment called gbd_env that is guaranteed to have the latest version of all GBD shared functions, as described on the Shared Functions HUB page.

  • Note that archived GBD rounds (for example, GBD 2017) may require archived GBD environments to access - see the "Current and Archive GBD environments" subpage for more details.
  • Also note that while the gbd_env environment is guaranteed to have the most up to date versions of shared functions, it is unlikely to include additional packages you may want to use, which is a downside of using this environment.

If you wish to use your own environment and add shared functions to that environment, you may do so using pip, but you will first need to add artifactory.ihme.washington.edu as a trusted host in your ~/.pip/pip.conf file, as described on this HUB page. See :ref:`the computing onboarding resource page <computing>` for more information on managing conda environments.

The packages most relevant to pulling GBD data using shared functions include db_queries and get_draws.

Overview of db_queries

Documentation for db_queries can be found here.

Some particularly helpful functions in db_queries include:

  • get_ids: Returns a list of GBD IDs for any entities in GBD (age groups, locations, causes, etc.)
  • get_outputs: Returns mean value and uncertainty interval for GBD results
  • get_population: Returns population size estimates
  • get_covariate_estimates: Returns mean value and uncertainty interval for GBD covariates

Overview of get_draws

Documentation for get_draws can be found here.

get_draws differs from db_queries.get_outputs in that, rather than returning a mean estimate and uncertainty interval, it returns draw-level estimates from which a mean value and uncertainty interval can be calculated. Unlike db_queries.get_outputs, get_draws does not automatically aggregate results from the most detailed estimates. For instance, it returns sex-specific values and will not automatically return values for sex_id=3 ("both" sexes combined).
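As a rough illustration of the two points above, the sketch below first combines sex-specific draws into a both-sexes estimate and then summarizes the draws as a mean and 95% uncertainty interval. It uses made-up numbers rather than a real get_draws call. The population weights and the helper names ``combine_sexes`` and ``summarize_draws`` are assumptions for illustration only. Taking the 2.5th and 97.5th percentiles of the draws as the interval bounds is the usual GBD convention, but confirm that it suits your use case. Note that population weighting applies to rates; counts would simply be summed.

```python
from statistics import mean, quantiles

# Toy sex-specific draws of a rate (hypothetical values, 8 draws for brevity;
# real GBD pulls typically contain hundreds of draws).
male_draws = [0.010, 0.012, 0.011, 0.013, 0.009, 0.010, 0.012, 0.011]
female_draws = [0.020, 0.018, 0.021, 0.019, 0.022, 0.020, 0.018, 0.019]

# Hypothetical population counts used as weights (e.g. from get_population).
male_pop, female_pop = 1000.0, 3000.0


def combine_sexes(rates_a, rates_b, pop_a, pop_b):
    """Population-weighted average of two rate vectors, draw by draw."""
    total = pop_a + pop_b
    return [(a * pop_a + b * pop_b) / total for a, b in zip(rates_a, rates_b)]


def summarize_draws(draws):
    """Return (mean, lower, upper) using the 2.5th and 97.5th percentiles."""
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%; 'inclusive' interpolates
    # linearly between the observed minimum and maximum.
    cuts = quantiles(draws, n=40, method="inclusive")
    return mean(draws), cuts[0], cuts[-1]


both_sex_draws = combine_sexes(male_draws, female_draws, male_pop, female_pop)
both_mean, both_lower, both_upper = summarize_draws(both_sex_draws)
print(f"both sexes: {both_mean:.5f} ({both_lower:.5f}, {both_upper:.5f})")
```

Aggregating draw by draw before summarizing keeps the uncertainty of the combined estimate tied to the draws themselves, rather than averaging two already-summarized intervals.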

Additionally, certain intermediate values used in GBD, such as risk exposures and relative risks, are not available in the final GBD results returned by db_queries.get_outputs and can only be pulled using get_draws. The various data sources available in :code:`get_draws` are summarized in the table below and described in more detail on the get_draws documentation page here.

.. list-table:: Sources of draws
   :header-rows: 1

   * - Source
     - Description
     - GBD ID type
     - Note
   * - epi
     - Dismod and custom epi models. This source contains data that is computed by GBD modelers and often used as inputs to central GBD processes.
     - modelable_entity_id
     -
   * - codcorrect
     - Deaths and YLLs
     - cause_id
     - Returns counts only
   * - como
     - YLDs, incidence, and comorbidity-adjusted prevalence
     - cause_id, sequela_id, rei_id
     - Returns rates only
   * - dalynator
     - DALYs
     - cause_id
     -
   * - exposure
     - Risk factor exposure
     - rei_id
     - Can be continuous (like mean BMI) or categorical (like stunting prevalence)
   * - exposure_sd
     - Risk exposure standard deviation
     - rei_id
     - Only for continuous risks
   * - rr
     - Risk factor relative risk
     - rei_id, cause_id
     - Will return values for all affected causes unless a cause_id is specified
   * - burdenator
     - Risk attributable burden (deaths/dalys/ylls/ylds) and mediated/aggregated PAFs
     - cause_id, rei_id
     -
   * - paf
     - Pre-burdenator (non-finalized) PAF estimates
     - rei_id, cause_id
     -
   * - sev
     - Summary exposure values
     - rei_id
     -
   * - tmrel
     - Risk factor theoretical minimum exposure level
     - rei_id
     -
   * - rr_max
     - Relative risk maximum value
     - rei_id, cause_id
     -
   * - codem
     - codem models and custom cod models
     - cause_id
     -
   * - stgpr
     - ST-GPR models
     - modelable_entity_id
     - If you pass an MEID with a dismod model type but try to use the ST-GPR source, get_draws will use the epi source instead.

Handling GBD versioning

Decomposition (or "decomp") steps are a versioning scheme used in some GBD rounds that allowed updates to GBD results based on iterative updates to certain parts of the computation process. For instance, the first step may be equivalent to the prior GBD round in all aspects except for an updated demographic model; the second step may be equivalent to the prior steps, but with updated risk exposures; and so on. This process allowed GBD researchers to evaluate how individual components of the many changes included in a GBD round advancement influenced the main results of the GBD study, rather than updating the entire pipeline at once.

When pulling GBD data from GBD rounds that used decomp step versioning, you are required to specify a decomp_step value in your shared functions call.

Unfortunately, the steps are not necessarily equivalent between GBD rounds. For this reason, we advise consulting the HUB space specific to the GBD round you are interested in, which often contains information about that round's "Decomposition rules."

For reference, the decomposition rules for GBD 2021 can be found here.

Additionally, you may be required to specify a version_id, release_id, and/or status when pulling GBD results from certain GBD rounds. The HUB space for a given GBD round is a good resource on where to obtain this information, but do not hesitate to open a helpdesk ticket to inquire or confirm whether you are using appropriate versioning IDs for your GBD shared functions call.
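One way to keep versioning arguments consistent across several shared functions calls is to build them once and reuse them. The sketch below is a hypothetical helper, not part of any shared functions package. It assumes a call is versioned either by release_id alone or by gbd_round_id together with decomp_step, and the ID values shown are placeholders to be replaced with values from the HUB page for your round.

```python
def build_version_kwargs(release_id=None, gbd_round_id=None, decomp_step=None):
    """Return versioning keyword arguments for a shared functions call.

    Hypothetical helper: assumes a call is versioned either by release_id
    alone or by gbd_round_id together with decomp_step, never both.
    """
    uses_release = release_id is not None
    uses_round = gbd_round_id is not None or decomp_step is not None
    if uses_release and uses_round:
        raise ValueError("Pass release_id or gbd_round_id + decomp_step, not both.")
    if uses_release:
        return {"release_id": release_id}
    if gbd_round_id is None or decomp_step is None:
        raise ValueError("gbd_round_id and decomp_step must be given together.")
    return {"gbd_round_id": gbd_round_id, "decomp_step": decomp_step}


# Placeholder values; look up the real IDs on the HUB page for your round.
round_kwargs = build_version_kwargs(gbd_round_id=6, decomp_step="step4")
release_kwargs = build_version_kwargs(release_id=9)
print(round_kwargs)
print(release_kwargs)
```

The resulting dictionary could then be unpacked into each call (for example ``some_shared_function(..., **release_kwargs)``), so that every pull in an analysis uses the same versioning.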

.. todo::

   Discuss release_id as preferred alternative to gbd_round_id + decomp_step.

Pulling GBD Data using Vivarium Inputs

There are two main packages within the Vivarium software framework that are especially useful for interacting with GBD data: gbd_mapping and vivarium_inputs.

Both of these packages translate ID numbers used in GBD to human-readable text.

Overview of gbd_mapping

gbd_mapping provides a convenient way to access all of the metadata associated with a given GBD entity (ex: diarrheal diseases cause or child growth failure risk factor), but does not return any estimates associated with that entity (ex: prevalence or relative risks).

Overview of vivarium_inputs

vivarium_inputs provides simplified functions to query GBD data and reformats the data to be compatible with the data structure required for building Vivarium Artifact objects. vivarium_inputs generally returns data for the most up-to-date complete GBD round/release and does not allow for user-specification of prior rounds/releases -- ask the software engineers if you have questions about which GBD round/release is active in vivarium_inputs at any given time. Additionally, if there is any doubt as to which GBD versioning is being returned by a given vivarium_inputs call, you can utilize get_raw_data, which will return full data including GBD versioning IDs for a given call.

For documentation on Vivarium Inputs, click here.

Some important notes and considerations not included in the documentation above are listed below:

  • All data returned is filtered to 500 draws (draw 0 through 499), even if more draws are available

  • Returns data for all most-detailed age groups and sexes - if any such data is missing in GBD, NaNs will be filled with zeros

  • Returns default version IDs within GBD system

  • Returns data specific to most recent published year unless user specifies to return all available years

    • Note: will return data specific to 2021 for GBD 2021, despite estimates being available for 2022 because the 2022 year was not published as part of GBD 2021
    • Note: for non-log-linear relative risks, GBD returns data for a single year only. In these cases, vivarium inputs will return that data and label it as 2021 data (even if GBD does not claim it to be specific to 2021 - notably, however, relative risks in GBD do not vary by year)
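To make the first two notes concrete, the sketch below mimics the draw filtering and zero-filling on a toy record. It illustrates the described behavior only and is not the vivarium_inputs implementation. The row layout, the ``draw_<n>`` column naming, and the helper names are assumptions.

```python
import math

N_DRAWS = 500  # per the notes above: draws 0 through 499 are kept


def keep_first_draws(row, n_draws=N_DRAWS):
    """Drop draw columns numbered n_draws or higher; keep everything else."""
    kept = {}
    for key, value in row.items():
        if key.startswith("draw_") and int(key.split("_", 1)[1]) >= n_draws:
            continue
        kept[key] = value
    return kept


def fill_missing_with_zero(row):
    """Replace NaN values with 0.0, mirroring the zero-fill behavior."""
    return {
        key: 0.0 if isinstance(value, float) and math.isnan(value) else value
        for key, value in row.items()
    }


# Toy row with 1000 draws and one missing value (hypothetical data).
raw_row = {"age_group": "1_to_4", "sex": "Female"}
raw_row.update({f"draw_{i}": 0.01 for i in range(1000)})
raw_row["draw_3"] = float("nan")

clean_row = fill_missing_with_zero(keep_first_draws(raw_row))
draw_columns = [key for key in clean_row if key.startswith("draw_")]
print(len(draw_columns), clean_row["draw_3"])
```

Because missing values come back as zeros rather than NaNs, a zero in the returned data may mean "not estimated in GBD" rather than a true zero, which is worth checking when a value looks surprising.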

Note

The above notes and considerations were written in May of 2024. Updates to vivarium inputs may affect these notes and they should be updated accordingly.

.. list-table:: Notable default behavior of get_measures
   :header-rows: 1

   * - Measure
     - Data returned
     - Note
   * - 'incidence'
     - GBD_incidence / (1 - GBD_prevalence)
     - By default, get_measures automatically converts GBD's "population-level incidence rates" to "susceptible population incidence rates" using the GBD estimate of prevalence. Note that if a model is using an alternative value for prevalence, this rescaling should be done separately using that prevalence value.
   * - 'raw_incidence_rate'
     - GBD_incidence
     -
   * - 'cause_specific_mortality'
     - GBD_death_count / GBD_population_counts
     -
   * - 'excess_mortality'
     - cause_specific_mortality / GBD_prevalence
     - By default, get_measures calculates excess mortality rates in accordance with the GBD estimate of prevalence. If a model is using an alternative value for cause prevalence, excess mortality rates should likely be calculated separately using that prevalence value.
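The formulas in the table above can be worked through with a few lines of arithmetic. The sketch below uses made-up inputs, not GBD estimates, and hypothetical function names. It also shows the rescaling a model would redo if it used its own prevalence value instead of the GBD estimate.

```python
def susceptible_incidence(gbd_incidence, prevalence):
    """Convert a population-level incidence rate to a susceptible-population rate."""
    if not 0.0 <= prevalence < 1.0:
        raise ValueError("prevalence must be in [0, 1)")
    return gbd_incidence / (1.0 - prevalence)


def cause_specific_mortality(death_count, population_count):
    """Deaths due to the cause divided by the total population."""
    return death_count / population_count


def excess_mortality(csmr, prevalence):
    """Cause-specific mortality rate divided by prevalence of the cause."""
    if prevalence <= 0.0:
        raise ValueError("prevalence must be positive")
    return csmr / prevalence


# Toy inputs (hypothetical numbers, not GBD estimates).
gbd_incidence = 0.02        # population-level incidence rate
gbd_prevalence = 0.20
deaths, population = 50.0, 100_000.0

incidence = susceptible_incidence(gbd_incidence, gbd_prevalence)
csmr = cause_specific_mortality(deaths, population)
emr = excess_mortality(csmr, gbd_prevalence)

# If the model uses its own prevalence, redo the rescaling with that value.
model_prevalence = 0.25
model_incidence = susceptible_incidence(gbd_incidence, model_prevalence)
model_emr = excess_mortality(csmr, model_prevalence)

print(incidence, csmr, emr, model_incidence, model_emr)
```

With these inputs, a higher model prevalence raises the susceptible-population incidence rate (fewer people are at risk) and lowers the excess mortality rate (the same deaths are spread over more prevalent cases).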

Applied examples

.. todo::

   Link notebook that shows examples of using these functions.

Considerations of each approach

Generally, GBD shared functions offer greater flexibility in querying GBD data than Vivarium Inputs, but require specification of detailed IDs that are not human-readable and require translation with get_ids. Vivarium Inputs offers less flexibility in favor of the convenience of returning a human-readable version of the most relevant data for running Vivarium simulations and compatibility with required Vivarium Artifact formatting. Therefore, GBD shared functions may be the code base to use when taking deep dives into GBD data, and Vivarium Inputs when preparing GBD data for Vivarium simulations. Some additional specific considerations about the differences between the two options are summarized in the table below.

.. list-table::
   :header-rows: 1

   * - Topic
     - GBD Shared Functions
     - Vivarium Inputs
   * - GBD round
     - Able to specify any GBD round/release; useful for noting and comparing major changes between rounds
     - Returns most recent complete GBD round/release only
   * - DALYs
     - Returns YLD, YLL, DALY estimates
     - Does not return YLD, YLL, or DALY estimates
   * - Metrics
     - Returns counts, rates, and prevalence estimates
     - Returns rate estimates with the exception of population structure, which are in counts; convenient
   * - Summary values
     - Can return mean, upper, and lower estimates using get_outputs
     - Returns draw-level estimates only
   * - Age/sex/location specificity
     - Allows for specification across all these parameters, allows for grouping (via get_outputs) and/or aggregation (via make_custom_aggregates) across demographic categories
     - Returns all most-detailed age and sex estimates. Supports only one location at a time.
   * - Format
     - Generally uses ID numbers that are not human-readable before pairing with get_ids information
     - Converts to human readable entity names rather than IDs and is compatible with formatting required for vivarium Artifacts and simulations

Note

Generally, to convert a GBD shared functions entity name (such as cause_name) to the corresponding entity name in Vivarium Inputs, convert the name to all lower case and replace spaces with underscores. Python code to do this is shown below:

vivarium_inputs_entity_name = gbd_entity_name.lower().replace(' ', '_')

There are some exceptions to this rule that require additional conversion. These can be viewed in the clean_entity_list method in the Vivarium Inputs source code, found here.
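As a small worked example of the general rule, the sketch below wraps the conversion in a function and applies it to two entity names mentioned earlier on this page. The function name ``to_vivarium_name`` is hypothetical, and the sketch covers only the general rule, not the exceptions handled by clean_entity_list.

```python
def to_vivarium_name(gbd_entity_name):
    """Lower-case a GBD entity name and replace spaces with underscores."""
    return gbd_entity_name.lower().replace(" ", "_")


# Entity names taken from the examples mentioned earlier on this page.
examples = ["Diarrheal diseases", "Child growth failure"]
converted = [to_vivarium_name(name) for name in examples]
print(converted)  # ['diarrheal_diseases', 'child_growth_failure']
```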