# Warming up to OCaml through Reinforcement Learning

In this tutorial, you are going to implement a few basic OCaml functions (see section "Student's mission" below) that we will pass in as input to a “reinforcement learning” functor implementing a simple game of tic-tac-toe (see also [full code](https://github.com/ghennequin/pkp-tutorials/blob/master/src/reinforcement_learning.ml)). Thus, most of the logic is already pre-implemented, but you are going to contribute key sub-parts. I'll assume everyone is familiar with tic-tac-toe ─ if not, take 1 min now to play a game with your tutorial buddy. Hint: I am not talking about [_that one_](https://www.youtube.com/watch?v=00JwptZb2Wk).

Going through a simple example of reinforcement learning (RL) is a nice way of beginning a neuroscience course, since RL is perhaps one of the most fundamental of all brain computations. Living organisms must learn for themselves how to behave in an environment that can sometimes be rewarding, often hostile, and certainly mysterious (it doesn't come with a user manual!). A few basic considerations:

1. No amount of sensory, motor, or memory abilities will ever compensate for poor action selection (or, put another way, the only reason your brain can interpret your surroundings, drive movement, or memorise stuff is to allow you to choose appropriate actions).
2. Learning to select good actions is difficult, as nobody will ever teach you exactly what to do in every detail ─ btw, that includes me :)
3. Thus, agents must “try things out”, evaluate outcomes in light of the rewards or punishments they receive from the environment, remember (“learn”), and do better next time round. 

The goal of this session is to code up a player that plays “optimally well”. The mathematics of RL are well established, and anyone interested in going deeper should take a look at the classic [textbook by Sutton and Barto](http://incompleteideas.net/book/the-book-2nd.html). Let me give you a quick tour, which should be enough for you to understand the context of what is primarily a programming tutorial.

## RL basics 

Say it's our turn to play, and the board is in a certain configuration $B$. Where shall we put our `X` mark? Ideally, we would put it in the square that maximises our _expected return_:

- “return”: say 1 point for a win, 0.5 for a tie, and 0 for a loss.
- “expected”, because we don't know how the opponent will play, so we need to average over all possible future “paths” in the game (see below).

In RL, the _expected return_ is called the **value function**, denoted by $Q(B,a)$. It is the value of taking action $a$ (writing our `X` in a given available square) when the board is in state $B$. Given a board configuration $B$, if we know $Q(B,a)$ as a function of $a$, it is straightforward to enumerate all possible actions and pick the one with highest $Q$ value. But how do we compute $Q(B, a)$? The answer lies in the so-called **Bellman equation**, which looks pretty scary but is easy to grasp intuitively:

$$ Q(B, a) = \sum_{B'} p(B'|a) \left[ \sum_{a'} \pi(a'|B') Q(B', a') \right] $$

The outer sum, $\sum_{B'} p(B'|a) \left[ \cdots \right]$, simply means that we are averaging $[\cdots]$ over all possible configurations $B'$ we might face in our next turn if we take action $a$ now (note that $[\cdots]$ is a function of $B'$). This will of course depend on how the opponent plays. Next, we need to compute $[\cdots]$, which is an average of the next expected return $Q(B',a')$ over all possible next actions $a'$ our playing strategy might cause us to take. Here, $\pi(a'|B')$ is the probability of us picking $a'$ as our next move if our next board is $B'$. It basically describes our playing strategy.
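
To see the two nested averages at work, here is a tiny numerical sketch. The probabilities and values are made up for illustration, and `weighted_sum`, `q_of_action` and `q_example` are hypothetical names that do not come from the `Pkp` library.

```ocaml
(* Sum of weight * value over a list of (weight, value) pairs. *)
let weighted_sum pairs =
  List.fold_left (fun acc (w, x) -> acc +. (w *. x)) 0. pairs

(* next_boards has one entry per possible next board B':
   (p(B'|a), [ (pi(a'|B'), Q(B',a')); ... ]) *)
let q_of_action next_boards =
  next_boards
  |> List.map (fun (p, policy_and_values) -> p, weighted_sum policy_and_values)
  |> weighted_sum

(* Made-up numbers: two equally likely next boards. *)
let q_example =
  q_of_action
    [ 0.5, [ 1.0, 1.0 ]             (* first B': a single move, worth 1 *)
    ; 0.5, [ 0.5, 0.0; 0.5, 1.0 ] ] (* second B': two equiprobable moves, worth 0 and 1 *)
```

Here the inner averages are 1 and 0.5, and the outer average over the two boards gives `q_example = 0.75`.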

If we decide to play “optimally”, i.e. always pick actions that maximise value, then $a'$ is a deterministic function of $B'$ and the inner average collapses to a maximum. It then suffices to track the value of each board under optimal play, $Q(B) = \max_a Q(B, a)$, which obeys a simpler version of the Bellman equation:

$$ Q(B) = \max_{a} \sum_{B'} p(B'|a) \, Q(B') $$

As mentioned above, we have "termination conditions" such as $Q(B)=1$ for a winning board, $0.5$ for a tie, and $0$ for a loss.
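
As a quick sanity check of the simplified equation, the following sketch evaluates one step of it on made-up numbers. The names `expected`, `best_value` and `q_b` are hypothetical, and the next-board values are assumed to be known already.

```ocaml
(* Expected value of one action, given its possible next boards
   as (probability, value) pairs. *)
let expected outcomes =
  List.fold_left (fun acc (p, q) -> acc +. (p *. q)) 0. outcomes

(* Q(B): the best expected value over all available actions. *)
let best_value actions =
  List.fold_left
    (fun best outcomes -> max best (expected outcomes))
    neg_infinity
    actions

(* Made-up example with two available actions. *)
let q_b =
  best_value
    [ [ 0.5, 1.0; 0.5, 0.0 ]   (* action 1: win or lose with equal probability *)
    ; [ 0.5, 0.5; 0.5, 1.0 ] ] (* action 2: tie or win with equal probability *)
```

Action 1 has expected value 0.5 and action 2 has 0.75, so `q_b` is 0.75: the optimal player picks action 2.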

## FP implementation

When I said FP (functional programming) makes it easy to write in code what we mean in the maths, I literally meant easy. Here is a skeleton:

```ocaml
let rec q_value board =
  match evaluate board with
  | Win -> 1.0
  | Tie -> 0.5
  | Lose -> 0.0
  | Unfinished ->
    (* best expected value over all actions available in this board *)
    max_over_possible_actions board (fun a ->
        let outcomes = outcomes_of board a in
        average_over_outcomes q_value outcomes)
```

in which `let rec` simply expresses the recursion that is apparent in the Bellman equation.
Now, even though code like this would work, it's horrendously inefficient, as we are constantly re-computing values for the same boards. For example, this board here:

<div style="margin-top:1em; text-align:center; font-family:'Liberation Mono'">
<table style="border: 1px solid;">
    <tr><td>X</td><td>O</td><td>_</td></tr>
    <tr><td>O</td><td>X</td><td>_</td></tr>
    <tr><td>_</td><td>_</td><td>_</td></tr>
</table>
</div>

is a possible “future outcome” for both this board

<div style="margin-top:1em; text-align:center; font-family:'Liberation Mono'">
<table style="border: 1px solid;">
    <tr><td>X</td><td>O</td><td>_</td></tr>
    <tr><td>_</td><td>_</td><td>_</td></tr>
    <tr><td>_</td><td>_</td><td>_</td></tr>
</table>
</div>

and that one

<div style="margin-top:1em; text-align:center; font-family:'Liberation Mono'">
<table style="border: 1px solid;">
    <tr><td>_</td><td>_</td><td>_</td></tr>
    <tr><td>O</td><td>X</td><td>_</td></tr>
    <tr><td>_</td><td>_</td><td>_</td></tr>
</table>
</div>

and so its value would be evaluated as part of computing the value of each of these two boards: so at least one time too many! This only gets worse for "fuller boards", which can be arrived at through very, very many paths, and so their values will be computed many, many times. To fix this issue, we use an FP technique called “memoization”, also known in applied maths as “dynamic programming”. This simply consists in caching (remembering) each $Q(B)$, in such a way as to not have to compute it again when the same board $B$ arises again. This is pretty straightforward to implement; see the code of `module Pkp.Reinforcement_learning` on [github](https://github.com/pkp-neuro/pkp-tutorials/blob/master/src/reinforcement_learning.ml).
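
To make the caching idea concrete, here is a minimal self-contained sketch; it is not the `Pkp` implementation. It assumes a board encoded as a 9-character string (`'X'`, `'O'` or `'_'`, row by row), that it is X's turn on any board passed to `q_value`, and that the opponent O plays uniformly at random. The module name `Sketch` and all helpers inside it are made up for illustration.

```ocaml
(* Illustrative sketch only: none of these names come from Pkp. *)
module Sketch = struct
  type status = Win | Tie | Lose | Unfinished (* from X's point of view *)

  let lines =
    [ [ 0; 1; 2 ]; [ 3; 4; 5 ]; [ 6; 7; 8 ] (* rows *)
    ; [ 0; 3; 6 ]; [ 1; 4; 7 ]; [ 2; 5; 8 ] (* columns *)
    ; [ 0; 4; 8 ]; [ 2; 4; 6 ] ]            (* diagonals *)

  let has_won mark board =
    List.exists (List.for_all (fun i -> board.[i] = mark)) lines

  (* indices of the empty squares *)
  let free board = List.filter (fun i -> board.[i] = '_') (List.init 9 Fun.id)

  let evaluate board =
    if has_won 'X' board then Win
    else if has_won 'O' board then Lose
    else if free board = [] then Tie
    else Unfinished

  (* new board with [mark] written in square [i] *)
  let place mark i board =
    String.mapi (fun j c -> if j = i then mark else c) board

  let mean xs = List.fold_left ( +. ) 0. xs /. float_of_int (List.length xs)
  let maximum xs = List.fold_left max neg_infinity xs

  (* the memoization table: board -> value *)
  let cache : (string, float) Hashtbl.t = Hashtbl.create 1024

  (* value of [board] when it is X's turn; looked up in the cache first *)
  let rec q_value board =
    match Hashtbl.find_opt cache board with
    | Some v -> v
    | None ->
      let v =
        match evaluate board with
        | Win -> 1.0
        | Tie -> 0.5
        | Lose -> 0.0
        | Unfinished ->
          free board
          |> List.map (fun a -> after_x (place 'X' a board))
          |> maximum
      in
      Hashtbl.add cache board v;
      v

  (* expected value once X has moved: average over O's random replies *)
  and after_x board =
    match evaluate board with
    | Win -> 1.0
    | Tie -> 0.5
    | Lose -> 0.0
    | Unfinished ->
      free board |> List.map (fun a -> q_value (place 'O' a board)) |> mean
end
```

For instance, `Sketch.q_value "XX_OO____"` evaluates to `1.0`, since X can complete the top row immediately. After a call on the empty board `"_________"`, `Hashtbl.length Sketch.cache` tells you how many distinct boards were actually evaluated, each of them exactly once.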

## In real life...

That's it for the intro. Just a quick note before we proceed to the actual programming tutorial: I certainly don't want to leave you with the impression that RL is as simple as using recursion & memoization. In many ways, tic-tac-toe is an easy case for RL; in real-world scenarios, the following features make our brain's life way more difficult:

1. We never quite know what state the environment is in, as we only have partial observations; we'll talk about this in our lecture on perception.
2. The reward function is stochastic, so that if we take the same action in the same state several times, we won't always get identical rewards; we'll talk about that in our lecture on decision making.
3. We might get rewards / punishment “along the way”, not necessarily just at the “end of the game” ─ this means the value function must be defined in terms of “total cumulative future reward”.
4. Perhaps more importantly, the state space might be very large or continuous, making deep exploration of the future (and even representing the Q function) practically impossible. The same is true when the action space is large or continuous, precluding direct enumeration of all possible actions.

These problems, in addition to what's at stake in solving RL, make RL a very hot area of machine learning research these days.

In [None]:
;;
#require "pkp"

In [None]:
open Pkp.Reinforcement_learning

# Student's mission

Your mission is to implement a module of type `Mission` ([documented here](https://pkp-neuro.github.io/pkp-tutorials/pkp/Pkp/Reinforcement_learning/index.html)). I recommend defining a module that includes the provided `Solution` module, and progressively adding your own functions underneath, which will shadow those in `Solution`:

In [None]:
module MySolution = struct
  include Solution

  (* now include your own functions here, before "end" *)
end

Now you can pass in your solution module to the `Make` functor (a functor is a module that takes another module as parameter); this yields a new module which you can then use to actually play the game and test your solutions:

In [None]:
module M = Make (MySolution)

Take a moment to look at the functions provided in that new module, [documented here](https://pkp-neuro.github.io/pkp-tutorials/pkp/Pkp/Reinforcement_learning/Make/index.html). Let's use them now to test your implementation (or indeed to test the `Solution` module, for a start :)).
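
If the include-and-shadow pattern or the functor application feels abstract, here is a toy analogue that has nothing to do with tic-tac-toe. All names in it (`Toy_mission`, `Toy_solution`, `My_toy_solution`, `Toy_make`, `T`) are made up and unrelated to the `Pkp` library.

```ocaml
(* The signature the functor expects, playing the role of Mission. *)
module type Toy_mission = sig
  val greet : string -> string
  val farewell : string -> string
end

(* A reference implementation, playing the role of Solution. *)
module Toy_solution = struct
  let greet name = "hello " ^ name
  let farewell name = "bye " ^ name
end

(* Our own module: start from the reference and override one function. *)
module My_toy_solution = struct
  include Toy_solution

  (* shadows Toy_solution.greet; farewell is inherited unchanged *)
  let greet name = "hi " ^ name
end

(* A functor: given any module matching Toy_mission, build a new module. *)
module Toy_make (S : Toy_mission) = struct
  let conversation name = S.greet name ^ ", " ^ S.farewell name
end

module T = Toy_make (My_toy_solution)
```

Here `T.conversation "Ada"` evaluates to `"hi Ada, bye Ada"`: the shadowed `greet` is used, while `farewell` still comes from the reference module.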

# Optimal player vs. random player

Let's run an example game between the optimal X player, and a random O player:

In [None]:
let _ =
  let opt = M.optimal X in
  let dumb = M.random O in
  M.play (opt, dumb)

Try running it several times to convince yourself that the optimal player does pretty well! Try also reversing the order of the players, so that the dumb player gets to start. What do you notice? Does the optimal player ever lose?

# Optimal player vs. sub-optimal player

Now define a suboptimal player that interpolates between the optimal player and a random player: in each turn, it plays like a random player with probability $p$, and like the optimal player with probability $(1-p)$:

```ocaml
let suboptimal p mark = [...]
```
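
The probabilistic mixing itself can be sketched independently of the game. The snippet below is only a hint, not the solution: `mix`, `random_move` and `best_move` are hypothetical names, and both strategies are assumed to be plain functions from some state to a move. How this maps onto the player type expected by `M.play` is for you to work out.

```ocaml
(* With probability p follow random_move, otherwise follow best_move.
   Both strategies are hypothetical functions from a state to a move. *)
let mix p ~random_move ~best_move state =
  if Random.float 1.0 < p then random_move state else best_move state
```

With `p = 0.` the mixed strategy always follows `best_move`, and as `p` approaches 1 it follows `random_move` more and more often.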

In [None]:
(* go ahead here: *)

Once you are done, you can try out such an intermediate player ─ you may want to play with parameter $p$, and swap the player ordering:

In [None]:
let _ =
  let opt = M.optimal X in
  let subopt = suboptimal 0.5 O in
  M.play (opt, subopt)

Let us now run many games, collect some statistics, and make pretty plots!

In [None]:
(* collect statistics of results between the optimal player and the (suboptimal p) player, for some p *)
let stats p =
  let opt = M.optimal X in
  let subopt = suboptimal p O in
  let n_games = 1000 in
  let games = List.init n_games (fun _ -> M.play ~display:false (subopt, opt)) in
  let n_wins = games |> List.filter (fun winner -> winner = Some opt.mark) |> List.length in
  let n_ties = games |> List.filter (fun winner -> winner = None) |> List.length in
  float n_wins /. float n_games, float n_ties /. float n_games

In [None]:
let () =
  let open Owl in
  let ps = Mat.linspace 0. 1. 20 in
  let results = ps |> Mat.to_array |> Array.map stats in
  let wins = results |> Array.map fst |> fun v -> Mat.of_array v 1 (-1) in
  let ties = results |> Array.map snd |> fun v -> Mat.of_array v 1 (-1) in
  let lose = Mat.(1. $- (wins + ties)) in
  let open Gp in
  let figure (module P : Plot) =
    P.plots
      [ item (L [ ps; wins ]) ~style:"lp pt 7 lc 7 ps 0.6" ~legend:"win"
      ; item (L [ ps; ties ]) ~style:"lp pt 7 lc 8 ps 0.6" ~legend:"tie"
      ; item (L [ ps; lose ]) ~style:"lp pt 7 lc 3 ps 0.6" ~legend:"lose"
      ]
      [ barebone
      ; set "key at graph 1.1, graph 1 top left"
      ; tics "out nomirror"
      ; borders [ `bottom; `left ]
      ; xlabel "probability of opponent playing randomly"
      ; ylabel "win / tie / lose probabilities"
      ; margins [ `right 0.6 ]
      ]
  in
  Juplot.draw ~fmt:`svg ~size:(500, 200) figure

Take-home exercise: redo the stats above, in the case where the order of play gets decided randomly (50-50) at the beginning of every game!
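
One possible building block for this exercise is a helper that shuffles a pair. `random_order` is a hypothetical name, and this is only a sketch of the coin flip, not of the full statistics.

```ocaml
(* Return the pair unchanged or swapped, with probability 1/2 each. *)
let random_order (a, b) = if Random.bool () then (a, b) else (b, a)
```

Assuming both players have the same type, as they do in the calls to `M.play` above, you could try passing `random_order (subopt, opt)` to `M.play` inside `stats`, so that a fresh coin is flipped for every game.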