Polish tutorial 2 #72

Open
patrickmineault opened this issue Apr 6, 2022 · 1 comment
Labels
documentation Improvements or additions to documentation

Comments

@patrickmineault

I really enjoyed this tutorial, overall. It shows the power of the `Agent` class. I think it's a little hard to read in some places, though, and I wanted to flag some of these issues:

  1. It's a non-standard bandit task, because there's a third action, HINT. It would be nice to indicate this in the intro, or to refer to a version of the task that includes the hint action, since the only reference given (the Wikipedia link) gives no indication of it.
  2. Going from tutorial 1 to tutorial 2, there's an unexplained shift in the definition of A. In tutorial 1 it was a two-dimensional array p(o|s), but now it's a four-dimensional array (see the sketch after this list). I realized rather late in the process that this is all explained in the pymdp fundamentals part of the docs - a backlink would be nice.
  3. Generally I find using raw arrays to define states difficult to read. I presume this is an artifact of the history of the package (Matlab cell arrays). It would be helpful to remind people of the definition of the axes as we go along. OTOH, I did find the diagrams very helpful.
  4. I found the last part, about manipulating the reward, very confusing. I had to go back to the definitions to figure out that the last manipulation turned the loss into a reward (+5) - looking only at the output, it appears the agent always picks the arm resulting in a loss.
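
To make point 2 concrete, here's a minimal sketch of the array convention as I understand it (plain NumPy, with made-up modality/state sizes, not the tutorial's exact ones): each observation modality gets its own likelihood array p(o_m | s_1, s_2, ...), and these are collected in an object array.

```python
import numpy as np

# Hypothetical sizes, purely to illustrate the indexing convention
num_obs    = [3, 3]   # two observation modalities, e.g. hint and reward
num_states = [2, 3]   # two hidden-state factors, e.g. context and choice

# A is an object array with one entry per observation modality.
# A[m] has shape (num_obs[m], num_states[0], num_states[1], ...),
# i.e. p(o_m | s_1, s_2) rather than the flat 2-D p(o|s) of tutorial 1.
A = np.empty(len(num_obs), dtype=object)
for m, n_obs in enumerate(num_obs):
    A[m] = np.zeros((n_obs, *num_states))

print(A[0].shape)  # (3, 2, 3): outcome levels x context states x choice states
```

A short reminder along these lines next to each array definition would go a long way.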

One thing that's not explained here is how the agent's inference mechanism differs from those of other, better-known agents in the RL literature. Having read the first few chapters of Richard Sutton's book eons ago, I wondered whether the inference is equivalent to a finite-horizon dynamic programming solution, or similar in spirit to an agent that uses a UCB heuristic or Thompson sampling. It would be great to have a few references in the tutorial about this.

@conorheins
Collaborator

conorheins commented Apr 9, 2022

Hi @patrickmineault, thanks a lot for these helpful suggestions. I've addressed points 1-4.

That bit you discovered in Point 4 with the reward manipulation was a mistake on my part: I didn't mean to have the user flip the 'valence' of reward and punishment like that, although the outcome you observed makes perfect sense. In that case the 'meaning' of the reward and loss observations is simply switched, in terms of the agent's prior over observations: the agent now "expects itself" to see Loss more than Reward, meaning the observation formerly referred to as "Loss" is technically the reward now. I've now changed that section of the demo so that the manipulation makes the agent more risk-averse, rather than actively "Loss-seeking" as it was before.
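
To spell out the effect with a toy example (illustrative numbers only, not the demo's actual values): the prior preferences C over the reward modality's outcomes are what define which observation counts as "rewarding", so flipping the signs swaps the roles of Reward and Loss.

```python
import numpy as np

# Preferences over the reward modality's outcomes [Null, Loss, Reward]
# (illustrative values, not the demo's)
C_reward  = np.array([0.0, -4.0, 2.0])   # agent prefers Reward, avoids Loss

# What the old manipulation effectively did:
C_flipped = np.array([0.0, 5.0, -2.0])   # "Loss" is now the preferred outcome

# Since C is the agent's prior over observations, the flipped vector just
# relabels which outcome plays the role of reward - so the agent seeking the
# "Loss" arm is exactly the behaviour you'd expect.
```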

I'm going to leave this issue open for now, because I haven't added text to address your last point, which is also a great suggestion and should be incorporated: "I wondered whether the inference was equivalent to a finite horizon dynamic programming solution, or similar in spirit to an agent that uses a UCB heuristic or Thompson sampling. If you could have a few references in the tutorial about this, it would be great." I think this is related to this paper: https://arxiv.org/abs/2009.08111.

I will have to do some more research on these exact relationships and then incorporate that into the text of that demo.

@conorheins added the documentation (Improvements or additions to documentation) label on Jun 27, 2022