Adding this so I can experiment with the entropic model.
For risk-averse models, we want to sample bad trajectories more frequently. The main motivation is that if you're using a risk measure, you care about the tails. So we want a good policy in the tails, which means adding more cuts in the tails. But if we sample from the nominal distribution, the tails aren't going to get many cuts!
The standard way around this is to use some sort of importance sampling on the forward pass (e.g., https://arxiv.org/pdf/1901.01302.pdf, https://arxiv.org/pdf/2001.06026.pdf, and probably others; msppy calls it "biased sampling").
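To make the importance-sampling idea concrete, here is a minimal sketch (not the method of either paper, and not SDDP.jl's API): at each node, sample realizations from a biased, tail-favouring distribution `q` instead of the nominal `p`, and keep the likelihood ratio `p/q` so that estimates along the forward pass can be corrected to stay unbiased.

```python
import random

def biased_sample(realizations, p, q, rng=random):
    """Sample a realization under the biased distribution q.

    Returns (realization, likelihood_ratio), where the ratio p[i]/q[i]
    is the correction factor needed to keep estimates unbiased.
    """
    i = rng.choices(range(len(realizations)), weights=q, k=1)[0]
    return realizations[i], p[i] / q[i]

# Usage: the nominal distribution is uniform, but q puts extra mass on
# the "bad" realization (100.0) so the tail gets visited, and cut, more.
realizations = [10.0, 20.0, 100.0]  # hypothetical stage objectives
p = [1 / 3, 1 / 3, 1 / 3]           # nominal probabilities
q = [0.2, 0.2, 0.6]                 # biased towards the tail
outcome, ratio = biased_sample(realizations, p, q)
```

The catch, discussed below, is that wiring the likelihood ratios through the forward and backward passes is where the implementation pain lives.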
This PR takes a different approach: we just periodically resample bad trajectories that we have already seen, drawn according to the risk-adjusted probability of their cumulative objective values. (I don't know what it will do if you have a cyclic policy graph? Repeatedly sample the longest trajectories?)
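As a rough sketch of that idea (the function names and the choice of an entropic/softmax weighting are illustrative assumptions, not this PR's code): keep the trajectories seen so far with their cumulative objectives, turn the objectives into risk-adjusted probabilities so that worse trajectories get more mass, and periodically draw the next forward pass from that distribution instead of simulating a fresh one.

```python
import math
import random

def risk_adjusted_probabilities(cumulative_objectives, gamma=0.1):
    """Softmax over cumulative cost: higher cost => higher probability.

    gamma controls how aggressively the weighting favours the tail;
    gamma -> 0 recovers uniform resampling.
    """
    m = max(cumulative_objectives)  # subtract max for numerical stability
    w = [math.exp(gamma * (z - m)) for z in cumulative_objectives]
    total = sum(w)
    return [wi / total for wi in w]

def resample_trajectory(trajectories, cumulative_objectives, rng=random):
    """Draw a previously seen trajectory, favouring bad (costly) ones."""
    probs = risk_adjusted_probabilities(cumulative_objectives)
    i = rng.choices(range(len(trajectories)), weights=probs, k=1)[0]
    return trajectories[i]

# Usage with placeholder trajectory objects and cumulative costs.
trajectories = [["t1"], ["t2"], ["t3"]]
costs = [10.0, 20.0, 100.0]  # ["t3"] is the bad tail trajectory
probs = risk_adjusted_probabilities(costs)
traj = resample_trajectory(trajectories, costs)
```

The point of the sketch is only the shape of the mechanism: no per-node probability bookkeeping is needed, just the list of visited trajectories and their totals.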
This is potentially better than importance sampling, because it refines the things we actually care about (bad trajectories).
The importance sampling approach could be overly conservative, because it assumes at each time step that a bad thing is more likely to happen, when in reality it's unlikely to go bad-bad-bad (if this were true, you're modeling it wrong; use a Markovian policy graph). There is also evidence that resampling trajectories can be a good thing (http://www.optimization-online.org/DB_HTML/2021/05/8397.html).
The other reason not to do importance sampling is that it's hard to implement because the data structures aren't set up to do it :(. We'd need a way for each node to track a single vector of probabilities, instead of node->realization pairs, and a way for the backward pass to communicate the risk-adjusted probabilities to the forward pass.