# Monte Carlo Tree Search for Teaching

Monte Carlo Tree Search (MCTS) is a popular method for finding good decision strategies in large decision spaces (i.e., those with a large branching factor). This notebook contains a small example for learning the basics of the algorithm. It draws heavily on the tutorial by Jeff Bradberry published here: https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/
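MCTS grows a search tree by repeating four phases: selection, expansion, simulation (rollout), and backpropagation. A common selection rule, used by the UCT variant of MCTS, is UCB1. It picks the child that maximizes its average reward plus an exploration bonus that shrinks as the child is visited more often. The sketch below assumes each child stores a `(total_reward, visits)` pair. The helper names `ucb1` and `select_child` and the default constant `c = sqrt(2)` are illustrative assumptions.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score of a child: mean reward plus an exploration bonus.

    total_reward  -- sum of rewards backed up through this child
    visits        -- number of times this child has been visited
    parent_visits -- number of times the parent has been visited
    c             -- exploration constant (sqrt(2) is a common default)
    """
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """Return the key of the child with the highest UCB1 score.

    `children` maps an action to a (total_reward, visits) pair.
    """
    return max(children, key=lambda a: ucb1(*children[a], parent_visits))
```

For example, `select_child({"left": (3.0, 5), "right": (1.0, 1)}, 6)` prefers `"right"`: its mean reward is higher and its bonus is larger because it has been visited only once.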

---
**How can this be turned into a teaching situation?**

The `Student` class below gives the basic structure of this teacher-student interaction. The teacher provides instructions to the student, and these instructions influence the student's state. The state is directly observable. The `Student` class is only a model, in the sense that it is a fiction in the teacher's mind; the real student might not behave exactly this way. However, if the student *did* act this way, then the MCTS planning algorithm should be able to find the best sequence of teaching actions.

1. **Uncooperative student in a grid world**: The student is an agent in a grid world. The objective is to reach a particular goal state for a large positive reward. What instruction should the teacher give? Assume also that the student is imperfect and might not follow directions exactly; the teacher should anticipate this. (Does the teacher know the layout of the maze? If not, the teacher simply gives instructions, observes the outcomes, and decides the policy directly during planning.)

1. **Growing a plant**: The student is a plant. The objective is to keep it alive until it reaches the fruiting stage. One action is to ignore the plant (which can actually be good). Others include applying water, fertilizer, sunlight, etc. It is somewhat like a two-player game, because the teacher does not know what the other agent (the plant) will do. Technically there are no partial observability issues here yet, because you can simply look at the plant's state (say, its height).

1. **Transparent sequence learner**: The student is a person learning a digit string. The objective is to get the student to know the sequence. Actions are <?>. The student can forget. There is no partial observability, because after each teaching episode the student gives a complete read-out of their current memory (as the state), so you always know where the student stands in terms of knowledge.
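As a concrete illustration of scenario 1, here is a minimal sketch of a noisy grid-world student that exposes the same four methods as the `Student` class (`start`, `next_state`, `teaching_actions`, `reward`). The class name `GridStudent`, the grid size, the goal position, the slip probability, and the reward of 100 are all hypothetical choices for illustration, not part of the original design.

```python
import random


class GridStudent:
    """Hypothetical noisy grid-world student for scenario 1.

    The teacher instructs a compass move; with probability `slip` the
    student ignores it and moves in a random direction instead.
    """

    MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self, width=4, height=4, goal=(3, 3), slip=0.2, seed=None):
        self.width, self.height = width, height
        self.goal, self.slip = goal, slip
        self.rng = random.Random(seed)

    def start(self):
        # The student starts in the top-left corner.
        return (0, 0)

    def next_state(self, state, instruction):
        if self.rng.random() < self.slip:
            # The student does not follow the instruction this time.
            instruction = self.rng.choice(list(self.MOVES))
        dx, dy = self.MOVES[instruction]
        # Moves into a wall leave the student where it is.
        x = min(max(state[0] + dx, 0), self.width - 1)
        y = min(max(state[1] + dy, 0), self.height - 1)
        return (x, y)

    def teaching_actions(self, state_history):
        # Every compass instruction is legal until the goal is reached.
        if state_history[-1] == self.goal:
            return []
        return list(self.MOVES)

    def reward(self, state_history):
        # Large positive reward at the goal, zero while still en route.
        return 100.0 if state_history[-1] == self.goal else 0.0
```

With `slip=0.0` the student is perfectly obedient; raising `slip` is one way to model the imperfect student whose mistakes the teacher should anticipate.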



```python
class Student(object):
    """The teacher's model of how a learner responds to instruction."""

    def start(self):
        # Returns a representation of the learner's starting state.
        pass

    def next_state(self, state, instruction):
        # Takes the current learner state and the instruction to apply.
        # Returns the new learner state.
        pass

    def teaching_actions(self, state_history):
        # Takes a sequence of learner states representing the full
        # teaching history, and returns the list of legal teaching
        # actions.
        # Question: why does the full teaching history influence the actions?
        pass

    def reward(self, state_history):
        # Takes a sequence of learner states representing the full
        # teaching history. If the "game" is now won, returns a large
        # positive reward. If the game is still ongoing, returns zero.
        pass
```
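To see how a teacher could plan with this interface, here is a minimal sketch of a UCT-style MCTS planner. It assumes tree statistics are keyed by the full state history (matching the `state_history` arguments above), selection uses UCB1, each simulation adds at most one new edge to the tree, rollouts pick teaching actions uniformly at random, and discounted returns are backed up along the visited path. It also assumes an episode ends when `teaching_actions` returns an empty list, so the win reward is collected only once. `MCTSTeacher`, the toy model `LineStudent`, and the parameters `n_simulations`, `max_depth`, `c`, and `gamma` are hypothetical names introduced for illustration.

```python
import math
import random


class LineStudent:
    """Hypothetical toy student on a line who must reach position `goal`.

    "forward"/"back" move one step; with probability `slip` the student
    does the opposite. Positions below 0 are clamped to 0.
    """

    def __init__(self, goal=3, slip=0.1, seed=0):
        self.goal, self.slip = goal, slip
        self.rng = random.Random(seed)

    def start(self):
        return 0

    def next_state(self, state, instruction):
        step = 1 if instruction == "forward" else -1
        if self.rng.random() < self.slip:
            step = -step  # the student misunderstands the instruction
        return max(state + step, 0)

    def teaching_actions(self, state_history):
        # No legal actions once the goal is reached (episode over).
        return [] if state_history[-1] == self.goal else ["forward", "back"]

    def reward(self, state_history):
        return 1.0 if state_history[-1] == self.goal else 0.0


class MCTSTeacher:
    """UCT planner for any model exposing the Student methods."""

    def __init__(self, model, n_simulations=500, max_depth=20,
                 c=math.sqrt(2), gamma=0.95, seed=0):
        self.model, self.n_simulations = model, n_simulations
        self.max_depth, self.c, self.gamma = max_depth, c, gamma
        self.rng = random.Random(seed)
        self.visits = {}  # (history tuple, action) -> visit count
        self.totals = {}  # (history tuple, action) -> summed return

    def best_action(self, state_history):
        actions = self.model.teaching_actions(state_history)
        if not actions:
            return None
        for _ in range(self.n_simulations):
            self._simulate(list(state_history))
        key = tuple(state_history)
        # Recommend the most-visited action at the root.
        return max(actions, key=lambda a: self.visits.get((key, a), 0))

    def _ucb1(self, key, action, parent_visits):
        n = self.visits[(key, action)]
        if n == 0:
            return math.inf
        mean = self.totals[(key, action)] / n
        return mean + self.c * math.sqrt(math.log(parent_visits) / n)

    def _simulate(self, history):
        path, rewards, expanded = [], [], False
        for _ in range(self.max_depth):
            actions = self.model.teaching_actions(history)
            if not actions:
                break  # terminal: the episode is over
            key = tuple(history)
            if all((key, a) in self.visits for a in actions):
                # Selection: every action has statistics, so use UCB1.
                n = sum(self.visits[(key, a)] for a in actions)
                action = max(actions, key=lambda a: self._ucb1(key, a, n))
            else:
                # Node not fully expanded (or outside the tree): random pick.
                action = self.rng.choice(actions)
            if not expanded and (key, action) not in self.visits:
                # Expansion: add at most one new edge per simulation.
                self.visits[(key, action)] = 0
                self.totals[(key, action)] = 0.0
                expanded = True
            if (key, action) in self.visits:
                path.append((len(rewards), key, action))
            history.append(self.model.next_state(history[-1], action))
            rewards.append(self.model.reward(history))
        # Discounted return-to-go from each step of the simulation.
        returns, g = [0.0] * len(rewards), 0.0
        for i in reversed(range(len(rewards))):
            g = rewards[i] + self.gamma * g
            returns[i] = g
        # Backpropagation: update every tree edge on the path.
        for step, key, action in path:
            self.visits[(key, action)] += 1
            self.totals[(key, action)] += returns[step]
```

For example, `MCTSTeacher(LineStudent(goal=3, slip=0.1)).best_action([0])` asks which instruction to give a student standing at position 0. The discount `gamma` makes faster teaching plans worth more, and recommending the most-visited root action (rather than the one with the highest mean) is a common convention in MCTS implementations.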