Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vitamin A Deficiency: Start Vivarium modeling section #149

Merged
merged 13 commits into from
Mar 16, 2020
Merged

Conversation

NathanielBlairStahn
Copy link
Contributor

It would be great to get some feedback about the following:

  1. Is my description of the "propensity model" accurate? What about the different terms I used to refer to it ("exposure model" and "prevalence-only model")?
  2. Does my rationale for using this model sound accurate?
  3. What needs to go in the data tables and what should the table format be?

the population with serum retinol concentration <0·7 μmol/L. Like iron
deficiency, the cause VAD is also a population attributable fraction (PAF) of 1
cause with the VAD risk factor. That is, 100% of the VAD cases are attributable
to the VAD risk factor. VAD Risk exposure and VAD cause prevalence data are the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that all data that goes into the GBD exposure model is the same as the data that goes into the GBD prevalence model, or that everyone with VAD exposure also has the VAD cause? (i.e., is this about GBD model inputs or outputs?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely address. PAF of one can mean many distinct things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means the GBD outputs are the same. There is only one VAD model, not separate models for the risk and cause. I can add some wording to clarify this.

of vitamin A deficiency at each time step to determine whether the simulant has
VAD during that time step. Each simulant's propensity is assigned only once, but
the underlying prevalence distribution can change throughout the course of the
simulation, which may result in a change in the simulant's vitamin A status.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the underlying prevalence distribution stayed the same over the course of the simulation, why wouldn't the simulants' VAD statuses randomly hop around, if there were multiple "solutions" to fit one set of propensities and a prevalence distribution? (Or do we not care if this happens, because on average the population looks the same?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i just read your algorithm; nevermind!


In Global Burden of Disease (GBD) 2017, VAD exposure definition is proportion of
the population with serum retinol concentration <0·7 μmol/L. Like iron
deficiency, the cause VAD is also a population attributable fraction (PAF) of 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not bring iron deficiency into this model since iron is a more complex example of a paf of 1 relationship and looking there is not going to help people understand this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I can edit this.

Copy link
Member

@aflaxman aflaxman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V&V: this model should get prevalence and YLDs right (meaning the prevalence and YLDs in the sim should match that in GBD). It will not necessarily get incidence and remission right, and we should take a look at how wrong it is, to note in the limitations section.

Limitations: In addition to probably not getting incidence and remission of VAD right, this model has a particular implication about who does not get VAD. GBD has estimated that the prevalence of VAD is around 30% and the duration until remission is around 1 year. GBD has not estimated what fraction of the population will have ever had VAD over a time longer than a year, however. Will most kids have experienced VAD by the time they are five? Or are the same 60% cycling in and out of VAD to mantain the 30% prevalence and 1 year duration? Probably something in between these extremes, but we have no data on yet, and we don't have guidance from GBD about how to do it. So it is hard to even know how wrong our model is when we don't get remission right, let alone how much it matters for quantifying the impact of LSFF.

* - Measure
- ID
- Data source
* - Remission
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Incidence and remission are measures for the cause.
  2. I thought we decided not to use them at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a todo to address this.

:ref:`lower respiratory infections <2017_cause_lower_respiratory_infections>`,
:ref:`diarrhoeal diseases <2017_cause_diarrhea>`, :ref:`measles
:ref:`diarrhoeal diseases <2017_cause_diarrhea>`, and :ref:`measles
<2017_cause_measles>`. The relative risks for these causes appear in Table 4 on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would put a big note here that LRI is dropped in the most recent round of GBD due to insufficient causal criteria and mention it again in your limitations. It's worth pointing out to the client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a note about this.

:widths: 15 13 15 15
:header-rows: 1

* - Cause
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're welcome to include the actual data means and uncertainties here, but unless you want to do something differently that use the draw level rrs from GBD, put the rei_id and cause_id associated with the risk outcome pair you're using in the table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a todo to reformat the RR table.

- 3.65 (2.23 - 5.97)
- No (only one study)

The above relative risks for GBD 2017 should be interpreted as rate ratios for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good. I think you should have the interpretation as a column in the table though. A single risk factor may have different kinds of effects on different outcomes and should be specified pairwise. Also include the numerator and denominator or a link to what rate ratio means (I think we made a glossary in the documentation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a todo to reformat the RR table.

Vivarium Modeling Strategy
--------------------------

We will use an **exposure model** (or **prevalence-only model** or **propensity
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think exposure model (vs. cause model) is an appropriate way to refer to this. We should put a description of this in the general risk documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this brings up another point I wanted to check in about: Does this mean the VAD documentation should not include a cause model diagram because the underlying algorithm is different from a standard cause model? Or is a cause model diagram still useful?

I think a simple SIS diagram actually still makes sense in terms of representing what states a simulant can be in (I = has VAD / S = does not have VAD) and what transitions are possible (incidence = simulant develops VAD due to change in exposure distribution / remission = simulant recovers from VAD due to change in exposure distribution).

Is it useful to have a diagram to visualize this, or would that be confusing because it's not a rate-based model?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a diagram with risk categories that map into disease states makes sense. I would hesitate to put transitions on the diagram (or if we put them, we need to be super clear that they're implicit and based on resampling risk exposure).

Vivarium Modeling Strategy
--------------------------

We will use an **exposure model** (or **prevalence-only model** or **propensity
model**) for a vitamin A deficiency, in which each simulant is initialized with a "propensity" for vitamin A deficiency, and the simulant's vitamin A status is determined by comparing this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why we call this propensity except out of habit. I think it might be clearer just to call it a percentile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For use with our standard sampling technique: https://en.wikipedia.org/wiki/Inverse_transform_sampling

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, it's not a percentile -- it's the inverse of the percentile/quantile under the quantile function, i.e. the probability of lying below the corresponding quantile. I can't find a standard terminology for this. Perhaps "continuous rank statistic"?

See https://en.wikipedia.org/wiki/Quantile and https://en.wikipedia.org/wiki/Quantile_function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Man, I've been thinking about percentile wrong for a while. My stats professors would be ashamed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There really isn't a name for this. Huh.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha! Apparently it's called the percentile rank (or presumably quantile rank if we're talking about values in [0,1] instead of [0,100]).

So if Q is the quantile function and p is a probability, then q = Q(p) is a quantile with quantile rank p, i.e. a p-quantile. Or in reverse, if F is the CDF and q is an arbitrary value of the random variable, then the probability p = F(q) is the quantile rank of q, so q is a p-quantile. (I'm ignoring some subtleties about nonuniqueness when F is discontinuous or has constant pieces.)

I've been confused about this too. But it's the statistics professors' fault! For not laying out the definitions clearly.

model**) for a vitamin A deficiency, in which each simulant is initialized with a "propensity" for vitamin A deficiency, and the simulant's vitamin A status is determined by comparing this
propensity to the overall VAD exposure/prevalence in the population.
Such
propensity/exposure models have been used in Vivarium for other risk factors and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are used for all other risk models. It is the standard. They are infrequently used for cause models because we usually trust the dynamic disease parameters more or we care about counting cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a todo to reword this.


In more detail, the basic strategy is to initialize each simulant with a
propensity score distributed uniformly in [0,1], then compare this propensity
score with the (location/age/sex/year/intervention-status)-dependent prevalence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use exposure throughout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? Do you mean I should call this the "(location/age/sex/year/intervention-status)-dependent exposure distribution," which in this case happens to represent prevalence of VAD?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think in a risk-based model, we should explicitly use exposure since the disease prevalence is derived from it. As opposed to a PAF of 1 model of VitA in which the disease model was the source of truth and the risk exposure was derived from prevalence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some wording and a todo to reword everything to ensure consistent terminology.

I guess I was thinking about this from the GBD perspective - my impression is that they ran a DisMod model to get the cause data, and just copied the prevalence data over to the risk factor to get exposure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. As long as your consistent, it seems fine either way.

remission data for vitamin A deficiency, but only *prevalence* (which is the
same as the exposure data for the VAD risk factor). The rationale for this approach is twofold:

1. We want to guarantee that the simulated baseline prevalence of vitamin A
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are only doing this because of option 2. We expect the cause version of the model to get incidence and remission correct in addition to getting prevalence correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, it is more important for us to use the intervention data correctly than it is to get the dynamic parameters of the disease correct. Details about the limitations and the expected impact to be found in .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With regard to (1), there's also the concern mentioned in @aflaxman 's comment above: If we use the incidence and remission data from GBD, then basically the whole population will end up getting VAD if we run the simulation for 5 years, but we suspect this would not be representative of reality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a todo to clarify this.


Explain why the prevalence-only model is a reasonable strategy, citing
incidence, remission, and prevalence data, as well as expert opinions about
VAD. (Perhaps this explanation should come later, e.g. in the Assumptions and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assumptions and limitations is right, I think. That's where I'd look for the justification.


1. **Initialize:** When simulant :math:`i` enters the simulation (e.g. at the
start of the simulation or at the time step when the simulant is born),
assign the simulant a random number :math:`v_i \sim
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should standardize the notation for these things. I follow scipy conventions in the code and I think they're reasonable:

x_i - the random variable (the exposure)
q_i - the percentile or propensity of the exposure x_i in the distribution (called q because the inverse cdf is the quantile function, though also called the percent point function).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initialize should also describe how we actually get x_i.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@collijk In this case x_i would be a binary variable, "has VAD"/"does not have VAD", correct? How would you initialize this before following the procedure in the "Update" step? I assumed the "update" part would first happen during the same time step as initialization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The procedure is the same as in update, but it has to happen before a time step takes place. The value of other attributes the simulant is initialized with (e.g. whether or not they are receiving vitamin a fortification) may be dependent on their initial vitamin a deficiency status. This is a fencepost error (https://en.wikipedia.org/wiki/Off-by-one_error#Fencepost_error). All attributes of a simulant must be assigned an initial value before the first time step starts or you introduce extremely hairy issues into the order you must update simulant state each time step.

Here's the procedure:

  1. Initialization:
  • Sample propensity and store it for all time.
  • Use attributes initial distribution is conditional on (age, sex, year) to construct probability mass function
  • Use propensity to sample pmf and assign initial status.
  1. Update:
  • Your procedure looks good here.

c) If :math:`v_i < p_\text{VAD}(\text{subpop}(i,t))`, the simulant has
vitamin A deficiency on the next time step; otherwise, they don't.

To address a point of potential confusion in the above algorithm, note that a
Copy link
Member

@collijk collijk Mar 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leaving notes. I think this section is great. I think we want to pull it out into the general risk model for the standard way to sample from categorical distributions. In order to do so, we have to generalize this section a bit. Technically what we do is start with a distribution

p = [P_1, P_2, ..., P_N, 1 - (P_1 + ... +P_n)]

where P_j is the probability that an individual is in category j.

We then take the cumulative sum over the distribution

p_cum = [P_1, P_1 + P_2, ..., P_1 + ...+ P_N, 1]

to form the right bounds of a partition of the interval [0, 1] with each subinterval mapping to a risk exposure category with probability p_cum[right] - p_cum[left].

We then select the exposure category k by arg_max_k (q_i < p_cum[k]) and assign that category as x_i.

This means the interpretation of propensity is dependent on the sort order of the categories.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default sort order is worst to best.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

w/r/t risk effect. We have not had to deal with unordered categorical risks.

Copy link
Member

@collijk collijk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a general comment here: this PR is too big and stayed open too long. We need to come up with some strategies and guidance about putting these things together and getting fast reviews.

@NathanielBlairStahn NathanielBlairStahn merged commit 2754b54 into master Mar 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants