-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vitamin A Deficiency: Start Vivarium modeling section #149
Conversation
the population with serum retinol concentration <0·7 μmol/L. Like iron | ||
deficiency, the cause VAD is also a population attributable fraction (PAF) of 1 | ||
cause with the VAD risk factor. That is, 100% of the VAD cases are attributable | ||
to the VAD risk factor. VAD Risk exposure and VAD cause prevalence data are the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that all data that goes into the GBD exposure model is the same as the data that goes into the GBD prevalence model, or that everyone with VAD exposure also has the VAD cause? (i.e., is this about GBD model inputs or outputs?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely address. PAF of one can mean many distinct things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means the GBD outputs are the same. There is only one VAD model, not separate models for the risk and cause. I can add some wording to clarify this.
of vitamin A deficiency at each time step to determine whether the simulant has | ||
VAD during that time step. Each simulant's propensity is assigned only once, but | ||
the underlying prevalence distribution can change throughout the course of the | ||
simulation, which may result in a change in the simulant's vitamin A status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if the underlying prevalence distribution stayed the same over the course of the simulation, why wouldn't the simulants' VAD statuses randomly hop around, if there were multiple "solutions" to fit one set of propensities and a prevalence distribution? (Or do we not care if this happens, because on average the population looks the same?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i just read your algorithm; nevermind!
|
||
In Global Burden of Disease (GBD) 2017, VAD exposure definition is proportion of | ||
the population with serum retinol concentration <0·7 μmol/L. Like iron | ||
deficiency, the cause VAD is also a population attributable fraction (PAF) of 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not bring iron deficiency into this model since iron is a more complex example of a paf of 1 relationship and looking there is not going to help people understand this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I can edit this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
V&V: this model should get prevalence and YLDs right (meaning the prevalence and YLDs in the sim should match that in GBD). It will not necessarily get incidence and remission right, and we should take a look at how wrong it is, to note in the limitations section.
Limitations: In addition to probably not getting incidence and remission of VAD right, this model has a particular implication about who does not get VAD. GBD has estimated that the prevalence of VAD is around 30% and the duration until remission is around 1 year. GBD has not estimated what fraction of the population will have ever had VAD over a time longer than a year, however. Will most kids have experienced VAD by the time they are five? Or are the same 60% cycling in and out of VAD to mantain the 30% prevalence and 1 year duration? Probably something in between these extremes, but we have no data on yet, and we don't have guidance from GBD about how to do it. So it is hard to even know how wrong our model is when we don't get remission right, let alone how much it matters for quantifying the impact of LSFF.
* - Measure | ||
- ID | ||
- Data source | ||
* - Remission |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Incidence and remission are measures for the cause.
- I thought we decided not to use them at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a todo to address this.
:ref:`lower respiratory infections <2017_cause_lower_respiratory_infections>`, | ||
:ref:`diarrhoeal diseases <2017_cause_diarrhea>`, :ref:`measles | ||
:ref:`diarrhoeal diseases <2017_cause_diarrhea>`, and :ref:`measles | ||
<2017_cause_measles>`. The relative risks for these causes appear in Table 4 on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would put a big note here that LRI is dropped in the most recent round of GBD due to insufficient causal criteria and mention it again in your limitations. It's worth pointing out to the client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a note about this.
:widths: 15 13 15 15 | ||
:header-rows: 1 | ||
|
||
* - Cause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're welcome to include the actual data means and uncertainties here, but unless you want to do something differently that use the draw level rrs from GBD, put the rei_id and cause_id associated with the risk outcome pair you're using in the table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a todo to reformat the RR table.
- 3.65 (2.23 - 5.97) | ||
- No (only one study) | ||
|
||
The above relative risks for GBD 2017 should be interpreted as rate ratios for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good. I think you should have the interpretation as a column in the table though. A single risk factor may have different kinds of effects on different outcomes and should be specified pairwise. Also include the numerator and denominator or a link to what rate ratio means (I think we made a glossary in the documentation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a todo to reformat the RR table.
Vivarium Modeling Strategy | ||
-------------------------- | ||
|
||
We will use an **exposure model** (or **prevalence-only model** or **propensity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think exposure model (vs. cause model) is an appropriate way to refer to this. We should put a description of this in the general risk documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this brings up another point I wanted to check in about: Does this mean the VAD documentation should not include a cause model diagram because the underlying algorithm is different from a standard cause model? Or is a cause model diagram still useful?
I think a simple SIS diagram actually still makes sense in terms of representing what states a simulant can be in (I = has VAD / S = does not have VAD) and what transitions are possible (incidence = simulant develops VAD due to change in exposure distribution / remission = simulant recovers from VAD due to change in exposure distribution).
Is it useful to have a diagram to visualize this, or would that be confusing because it's not a rate-based model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a diagram with risk categories that map into disease states makes sense. I would hesitate to put transitions on the diagram (or if we put them, we need to be super clear that they're implicit and based on resampling risk exposure).
Vivarium Modeling Strategy | ||
-------------------------- | ||
|
||
We will use an **exposure model** (or **prevalence-only model** or **propensity | ||
model**) for a vitamin A deficiency, in which each simulant is initialized with a "propensity" for vitamin A deficiency, and the simulant's vitamin A status is determined by comparing this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why we call this propensity except out of habit. I think it might be clearer just to call it a percentile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For use with our standard sampling technique: https://en.wikipedia.org/wiki/Inverse_transform_sampling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, it's not a percentile -- it's the inverse of the percentile/quantile under the quantile function, i.e. the probability of lying below the corresponding quantile. I can't find a standard terminology for this. Perhaps "continuous rank statistic"?
See https://en.wikipedia.org/wiki/Quantile and https://en.wikipedia.org/wiki/Quantile_function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Man, I've been thinking about percentile wrong for a while. My stats professors would be ashamed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There really isn't a name for this. Huh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha! Apparently it's called the percentile rank (or presumably quantile rank if we're talking about values in [0,1] instead of [0,100]).
So if Q is the quantile function and p is a probability, then q = Q(p) is a quantile with quantile rank p, i.e. a p-quantile. Or in reverse, if F is the CDF and q is an arbitrary value of the random variable, then the probability p = F(q) is the quantile rank of q, so q is a p-quantile. (I'm ignoring some subtleties about nonuniqueness when F is discontinuous or has constant pieces.)
I've been confused about this too. But it's the statistics professors' fault! For not laying out the definitions clearly.
model**) for a vitamin A deficiency, in which each simulant is initialized with a "propensity" for vitamin A deficiency, and the simulant's vitamin A status is determined by comparing this | ||
propensity to the overall VAD exposure/prevalence in the population. | ||
Such | ||
propensity/exposure models have been used in Vivarium for other risk factors and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are used for all other risk models. It is the standard. They are infrequently used for cause models because we usually trust the dynamic disease parameters more or we care about counting cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a todo to reword this.
|
||
In more detail, the basic strategy is to initialize each simulant with a | ||
propensity score distributed uniformly in [0,1], then compare this propensity | ||
score with the (location/age/sex/year/intervention-status)-dependent prevalence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use exposure throughout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? Do you mean I should call this the "(location/age/sex/year/intervention-status)-dependent exposure distribution," which in this case happens to represent prevalence of VAD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think in a risk-based model, we should explicitly use exposure since the disease prevalence is derived from it. As opposed to a PAF of 1 model of VitA in which the disease model was the source of truth and the risk exposure was derived from prevalence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some wording and a todo to reword everything to ensure consistent terminology.
I guess I was thinking about this from the GBD perspective - my impression is that they ran a DisMod model to get the cause data, and just copied the prevalence data over to the risk factor to get exposure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. As long as your consistent, it seems fine either way.
remission data for vitamin A deficiency, but only *prevalence* (which is the | ||
same as the exposure data for the VAD risk factor). The rationale for this approach is twofold: | ||
|
||
1. We want to guarantee that the simulated baseline prevalence of vitamin A |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are only doing this because of option 2. We expect the cause version of the model to get incidence and remission correct in addition to getting prevalence correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, it is more important for us to use the intervention data correctly than it is to get the dynamic parameters of the disease correct. Details about the limitations and the expected impact to be found in .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With regard to (1), there's also the concern mentioned in @aflaxman 's comment above: If we use the incidence and remission data from GBD, then basically the whole population will end up getting VAD if we run the simulation for 5 years, but we suspect this would not be representative of reality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a todo to clarify this.
|
||
Explain why the prevalence-only model is a reasonable strategy, citing | ||
incidence, remission, and prevalence data, as well as expert opinions about | ||
VAD. (Perhaps this explanation should come later, e.g. in the Assumptions and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assumptions and limitations is right, I think. That's where I'd look for the justification.
|
||
1. **Initialize:** When simulant :math:`i` enters the simulation (e.g. at the | ||
start of the simulation or at the time step when the simulant is born), | ||
assign the simulant a random number :math:`v_i \sim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should standardize the notation for these things. I follow scipy
conventions in the code and I think they're reasonable:
x_i - the random variable (the exposure)
q_i - the percentile or propensity of the exposure x_i in the distribution (called q because the inverse cdf is the quantile function, though also called the percent point function).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialize should also describe how we actually get x_i
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@collijk In this case x_i
would be a binary variable, "has VAD"/"does not have VAD", correct? How would you initialize this before following the procedure in the "Update" step? I assumed the "update" part would first happen during the same time step as initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The procedure is the same as in update, but it has to happen before a time step takes place. The value of other attributes the simulant is initialized with (e.g. whether or not they are receiving vitamin a fortification) may be dependent on their initial vitamin a deficiency status. This is a fencepost error (https://en.wikipedia.org/wiki/Off-by-one_error#Fencepost_error). All attributes of a simulant must be assigned an initial value before the first time step starts or you introduce extremely hairy issues into the order you must update simulant state each time step.
Here's the procedure:
- Initialization:
- Sample propensity and store it for all time.
- Use attributes initial distribution is conditional on (age, sex, year) to construct probability mass function
- Use propensity to sample pmf and assign initial status.
- Update:
- Your procedure looks good here.
c) If :math:`v_i < p_\text{VAD}(\text{subpop}(i,t))`, the simulant has | ||
vitamin A deficiency on the next time step; otherwise, they don't. | ||
|
||
To address a point of potential confusion in the above algorithm, note that a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just leaving notes. I think this section is great. I think we want to pull it out into the general risk model for the standard way to sample from categorical distributions. In order to do so, we have to generalize this section a bit. Technically what we do is start with a distribution
p = [P_1, P_2, ..., P_N, 1 - (P_1 + ... +P_n)]
where P_j is the probability that an individual is in category j.
We then take the cumulative sum over the distribution
p_cum = [P_1, P_1 + P_2, ..., P_1 + ...+ P_N, 1]
to form the right bounds of a partition of the interval [0, 1] with each subinterval mapping to a risk exposure category with probability p_cum[right] - p_cum[left].
We then select the exposure category k by arg_max_k (q_i < p_cum[k]) and assign that category as x_i.
This means the interpretation of propensity is dependent on the sort order of the categories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default sort order is worst to best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
w/r/t risk effect. We have not had to deal with unordered categorical risks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a general comment here: this PR is too big and stayed open too long. We need to come up with some strategies and guidance about putting these things together and getting fast reviews.
It would be great to get some feedback about the following: