# Chapter 6: Temporal-Difference Learning

## 6.1 TD Predictions

**Non-stationary Monte Carlo** prediction

$$
V(S_t) \gets V(S_t) + \alpha \left[G_t - V(S_t)\right]
$$

**Temporal Difference Prediction**

$$
$$

_TD error_

$$
\begin{align*}
V(S_t) &\gets V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]\\\\
\delta_t &\;\dot{=}\; R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
\end{align*}
$$

Relationship between TD-Error and Monte-Carlo error

$$
\begin{align*}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t)\\\\
&= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1})\\\\
&= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \gamma G_{t+1} - \gamma V(S_{t+1})\\\\
&= \delta_t + \gamma\left(G_{t+1} - V(S_{t+1})\right)\\\\
&= \delta_t + \gamma(\delta_{t+1} + \gamma(G_{t+2} - V(S_{t+2}))\\\\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2(G_{t+2} - V(S_{t+2}))\\\\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2(\delta_{t+2} + \gamma(G_{t+3} - V(S_{t+3}))\\\\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-1}(G_{T} - V(S_{T}))\\\\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-1}(0 - 0)\\\\
&= \sum_{k=t}{\gamma^{k-t}\delta_k}
\end{align*}
$$

TD prediction makes use both of sampling _and_ bootstrapping.

1. DP makes use of the model and bootstrapping.
2. Monte Carlo makes use of sampling returns
3. TD prediction makes use of sampled returns as well as bootstrapping, without a model.

> _Exercise 6.1_ If $V$ changes during the episode, then (6.6) only holds approximately; what would the difference be between the two sides? Let $V_t$ denote the array of state values used at time $t$ in the TD error (6.5) and in the TD update (6.2). Redo the derivation above to determine the additional amount that must be added to the sum of TD errors in order to equal the Monte Carlo error.

$$
V_{t+1}(S_t) = V_t(S_t) + \alpha\left[R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)\right] = V_t(S_t) + \alpha \delta_t
$$

$$
\delta_t = R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)
$$

$$
\begin{align*}
G_t - V_t(S_t) &= R_{t+1} + \gamma G_{t+1} - V_t(S_t)\\\\
&= R_{t+1} + \gamma G_{t+1} - V_t(S_t) + \gamma V_t(S_{t+1}) - \gamma V_{t+1}(S_{t+1}) + \gamma\alpha\delta_t\mathbb{1}[S_t = S_{t+1}]\\\\
&= R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)  + \gamma\alpha\delta_t\mathbb{1}[S_t = S_{t+1}] + \gamma G_{t+1} - \gamma V_{t+1}(S_{t+1})\\\\
&= (1 + \gamma\alpha\mathbb{1}[S_t = S_{t+1}])\delta_t + \gamma\left(G_{t+1} - V(S_{t+1})\right)\\\\
&= (1 + \gamma\alpha\mathbb{1}[S_t = S_{t+1}])\delta_t + \gamma\left((1 + \gamma\alpha\mathbb{1}[S_{t+1} = S_{t+2}])\delta_{t+1} + \gamma\left(G_{t+2} - V_{t+2}(S_{t+2}))\right)\right)\\\\
&= \tilde{\delta}_t + \gamma\tilde{\delta}_{t+1} + \gamma^2\tilde{\delta}_{t+3} + \cdots + \gamma^{T-t-1}\tilde{\delta}_{T-1} + \gamma(0 - 0)\\\\
&= \sum_{k=t}^{T-1}{\gamma^{k-t}\tilde{\delta}_{k}}
\end{align*}
$$

Where

$$
\tilde{\delta}_t = (1 + \alpha\gamma\mathbb{1}[S_t = S_{t+1}])\delta_t
$$

We can compute the excess error as

$$
\sum_{k=t}^{T-1}{\alpha\gamma^{k-t+1}\mathbb{1}[S_k = S_{k+1}]\delta_t}
$$

> _Exercise 6.2_ This is an exercise to help develop your intuition about why TD methods are often more effcient than Monte Carlo methods. Consider the driving home example and how it is addressed by TD and Monte Carlo methods. Can you imagine a scenario in which a TD update would be better on average than a Monte Carlo update? Give an example scenario-a description of past experience and a current state—in which you would expect the TD update to be better. Here’s a hint: Suppose you have lots of experience driving home from work. Then you move to a new building and a new parking lot (but you still enter the highway at the same place). Now you are starting to learn predictions for the new building. Can you see why TD updates are likely to be much better, at least initially, in this case? Might the same sort of thing happen in the original scenario?

Imagine a scenario that has a zero variance reward $R_t = r$ at some fixed state $S_t = s$ that occurs in the middle of most trajectories, but much noisier reward before and after. Then, learning using a Monte Carlo estimator of future reward for $S_t$ totally disregards this structure in the reward signal, since we still need to predict all of the variance associated with the discounted sum of rewards subsequent to $R_t$. Therefore, our Monte Carlo estimate of will continue to be noisy until we have high quality MC estimates for all possible states.

However, if we used TD error with $\alpha = 1$, then we can immediately take advantage of this structure in the reward and calculate that $V(s) = r + \gamma\mathbb{E}_{S_{t+1}\sim \pi(\cdot|s)}[V(S_{t+1})]$, regardless of the values of the other states are. This will propagate information more efficiently through this state to earlier states in trajectories, allowing updates to subsequent state values to more quickly update prior state values.

We can consider a concrete example of a deterministic MRP with four states, $s_1, s_2, s_3$ and $s_\verb|terminal|$. First we transition from $s_1$ to $s_2$ and obtain a reward of $R_2 \sim \mathcal{N}(10, \sigma_2)$, then we transition from $s_2$ to $s_3$ and obtain a deterministic reward $R_3 = 10$, and then lastly, we obtain another stochastic reward when transitioning from $s_3$ to $s_\verb|terminal|$.



## 6.2 Advantages of TD Prediction Methods

> _Exercise 6.3_ From the results shown in the left graph of the random walk example it appears that the first episode results in a change in only $V(A)$. What does this tell you about what happened on the first episode? Why was only the estimate for this one state changed? By exactly how much was it changed?

This means the episode terminated by exiting through the left end node with 0 reward. Due to the initialization of $V_1(s) = 0.5$ for all $s$, all intermediate steps before the end for that trajectory would update:

$$
V_{t+1}(s) = V_t(s) + \alpha(0 + V_t(s') - V_t(s)) = V_t(s) + \alpha(0 + 0.5 - 0.5) = V(s)
$$

However, for the transition from $\verb|A|$ to the terminal state, we have the update:

$$
V_{t+1}(\verb|A|) = V_t(\verb|A|) + \alpha(0 + V_t(s_{\verb|terminal|}) - V_t(\verb|A|)) = V_t(\verb|A|) - 0.5\alpha = 0.45
$$

> _Exercise 6.4_ The specific results shown in the right graph of the random walk example are dependent on the value of the step-size parameter, $\alpha$. Do you think conclusions about which algorithm is better would be affected if a wider range of $\alpha$ values were used? Is there a different, fixed value of $\alpha$ at which either algorithm would have performed significantly better than shown? Why or why not?

A lower value of alpha presumably lowers the baseline error for the TD method since it will integrate information over a longer time horizon. However, a lower value of $\alpha$ would also slow convergence of the method. I would hypothesize that there is no value of $\alpha$ that minimizes the the value error (i.e there will always be a smaller $\alpha$ that eventually converges to a lower value error).

For the MC methods, it seems that smaller $\alpha$ values also have a smoothing effect at the cost of slower convergence. Additionally the MC methods probably have the same property of not having a single $\alpha$ that will result in an optimal value error.

> _Exercise 6.5_ In the right graph of the random walk example, the RMS error of the TD method seems to go down and then up again, particularly at high $\alpha$s. What could have caused this? Do you think this always occurs, or might it be a function of how the approximate value function was initialized?

This likely always happens, since the TD method will eventually reach a point at which the the estimated values oscillate around the true value, with a variance related to the choice of $\alpha$. In particular, it seems that as alpha increases, the estimate converges faster to it's "baseline" error rate, and then stays mostly level or increases as further experience causes old experience to be downweighted while new experience is overweighted. However, for lower $\alpha$, it seems that the oscillations settle at a lower baseline error and oscillate less.

> _Exercise 6.6_ In Example 6.2 we stated that the true values for the random walk example are $\frac{1}{6}, \frac{2}{6}, \frac{3}{6}, \frac{4}{6}, \frac{5}{6}$, for states A through E. Describe at least two different ways that these could have been computed. Which would you guess we actually used? Why?

There are two ways to calculate each of these values. First, we can write the Bellman equations for the set of variables $V(s)$ for $s \in \{ \verb|A|, \verb|B|, \verb|C|, \verb|D|, \verb|E|\}$. 

$$
\begin{align*}
v(\verb|A|) &= \mathbb{E}_{A_t\sim\pi}[R_t + \gamma v(S_{t+1}) | S_t = s ] = \frac{1}{2}\left(0 + v(s_{\verb|terminal|}\right) + \frac{1}{2}\left(0 + v(\verb|B|)\right) = \frac{1}{2}v(\verb|B|)\\\\
v(\verb|B|) &= \frac{1}{2}v(\verb|A|) + \frac{1}{2}v(\verb|C|)\\\\
v(\verb|C|) &= \frac{1}{2}v(\verb|B|) + \frac{1}{2}v(\verb|D|)\\\\
v(\verb|D|) &= \frac{1}{2}v(\verb|C|) + \frac{1}{2}v(\verb|E|)\\\\
v(\verb|E|) &= \mathbb{E}_{A_t\sim\pi}[R_t + \gamma v(S_{t+1}) | S_t = s ] = \frac{1}{2}\left(1 + v(s_{\verb|terminal|})\right) + \frac{1}{2}\left(0 + v(\verb|D|)\right) = \frac{1}{2} + \frac{1}{2}v(\verb|D|)
\end{align*}
$$

Solving these equations from top to bottom, we can find:

$$
\begin{align*}
v(\verb|B|) &= \frac{2}{3}v(\verb|C|)\\\\
v(\verb|C|) &= \frac{3}{4}v(\verb|D|)\\\\
v(\verb|D|) &= \frac{4}{5}v(\verb|E|)\\\\
v(\verb|E|) &= \frac{5}{6}
\end{align*}
$$

And then back-substituting:
$$
\begin{align*}
v(\verb|D|) &= \frac{4}{6}\\\\
v(\verb|C|) &= \frac{3}{6}\\\\
v(\verb|B|) &= \frac{2}{6}\\\\
v(\verb|A|) &= \frac{1}{6}
\end{align*}
$$

This is presumably the way the authors' have solved this problem because it is the simplest to show.

The alternative technique would be to use a dynamic programming technique like policy evaluation in order to numerically compute the solution.

## 6.3 Optimality of TD(0)

> _Exercise 6.7_ Design an off-policy version of the TD(0) update that can be used with arbitrary target policy $\pi$ and covering behavior policy $b$, using at each step $t$ the importance sampling ratio $\rho_{t:t}$

If the standard TD(0) update is:

$$
\begin{align*}
V_{t+1}(S_t) &= V(S_t) + \alpha\left[R_t + \gamma V(S_{t+1}) - V(S_t)\right]\\
&= V(S_t) + \alpha \delta_t
\end{align*}
$$

We can update this to be off policy by using an importance sampling adjustment on $\delta_t$

$$
\mathbb{E}_\pi\left[\delta_t\right] = \mathbb{E}_b\left[\frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}\delta_t\right]
$$

And so we arrive at

$$
\begin{align*}
V_{t+1}(S_t) &= V(S_t) + \alpha \rho_{t:t}\delta_t\\
&= V(S_t) + \alpha\left[\frac{\pi(A_t\mid S_t)}{b(A_t\mid S_t)}\left(R_t + \gamma V(S_{t+1}) - V(S_t)\right)\right]
\end{align*}
$$

## 6.4 Sarsa: On-policy TD Control

> _Exercise 6.8_ Show that an action-value version of (6.6) holds for the action-value form of the TD error $\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$, again assuming that the values don’t change from step to step.

$$
\begin{align*}
G_t - Q(S_t, A_t) &= R_{t+1} + \gamma G_{t+1} - Q(S_t, A_t)\\\\
&= R_{t+1} + \gamma G_{t+1} - Q(S_t, A_t) + \gamma Q(S_{t+1}, A_{t+1}) - \gamma Q(S_{t+1}, A_{t+1})\\\\
&= R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) + \gamma G_{t+1} - \gamma Q(S_{t+1}, A_{t+1})\\\\
&= \delta_t + \gamma\left(G_{t+1} - Q(S_{t+1}, A_{t+1})\right)\\\\
&= \delta_t + \gamma(\delta_{t+1} + \gamma(G_{t+2} - Q(S_{t+2}, A_{t+2}))\\\\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2(G_{t+2} - Q(S_{t+2}, A_{t+2}))\\\\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2(\delta_{t+2} + \gamma(G_{t+3} -Q(S_{t+3}, A_{t+3}))\\\\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-1}(G_{T} - Q(S_{T}, A_{T}))\\\\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-1}(0 - 0)\\\\
&= \sum_{k=t}{\gamma^{k-t}\delta_k}
\end{align*}
$$

> _Exercise 6.9: Windy Gridworld with King's Moves (programming)_ Re-solve the windy gridworld assuming eight possible actions, including the diagonal moves, rather than four. How much better can you do with the extra actions? Can you do even better by including a ninth action that causes no movement at all other than that caused by the wind?

In [100]:
function sarsa(p0, p, 𝓢, 𝓐; γ=0.99, N=1000, α=0.1, ε=0.05)
    ε_greedy(s) = rand() < ε ? rand(𝓐) : argmax(a -> Q[(s, a)], 𝓐)
    Q = Dict((s, a) => 0.0 for s in 𝓢, a in 𝓐)
    for i in 1:N
        S = p0()
        A = ε_greedy(S)
        while true
            R, Sp = p(S, A)
            if Sp == nothing
                break
            end
            Ap = ε_greedy(Sp)
            Q[(S, A)] += α * (R + γ*Q[(Sp, Ap)] - Q[(S, A)])
            S, A = Sp, Ap
        end
    end

    return Q, s -> argmax(a -> Q[(s, a)], 𝓐)
end

𝓢 = [ (i, j) for i in 1:7, j in 1:10 ]
𝓐 = [ (dy, dx) for dx in -1:1 for dy in -1:1 if dx != 0 || dy != 0 ]
wind = [ 0, 0, 0, 1, 1, 1, 2, 2, 1, 0 ]
function p(S, A)
    y, x = S
    nS = (S .+ A) .- (wind[x], 0)
    nS = (clamp(nS[1], 1, 7), clamp(nS[2], 1, 10))

    return nS == (4, 8) ? (0, nothing) : (-1, nS)
end
p0() = (4, 1)

Q_star, π_star = sarsa(p0, p, 𝓢, 𝓐; γ=1, ε=0.1, α=0.5, N=8000);

(Dict(((1, 4), (-1, -1)) => -11.685766570219627, ((4, 5), (1, -1)) => -11.536480528242041, ((7, 4), (-1, 0)) => -11.27053175197489, ((4, 1), (1, 0)) => -11.029200017347137, ((2, 7), (-1, -1)) => -10.005228583350872, ((6, 9), (-1, -1)) => 0.0, ((2, 5), (-1, 1)) => -10.201741766457037, ((3, 8), (0, -1)) => -9.391290216375333, ((6, 4), (0, -1)) => -9.49021558026956, ((5, 4), (-1, -1)) => -11.976464789213786…), var"#334#339"{Vector{Tuple{Int64, Int64}}}([(-1, -1), (0, -1), (1, -1), (-1, 0), (1, 0), (-1, 1), (0, 1), (1, 1)], Core.Box(Dict(((1, 4), (-1, -1)) => -11.685766570219627, ((4, 5), (1, -1)) => -11.536480528242041, ((7, 4), (-1, 0)) => -11.27053175197489, ((4, 1), (1, 0)) => -11.029200017347137, ((2, 7), (-1, -1)) => -10.005228583350872, ((6, 9), (-1, -1)) => 0.0, ((2, 5), (-1, 1)) => -10.201741766457037, ((3, 8), (0, -1)) => -9.391290216375333, ((6, 4), (0, -1)) => -9.49021558026956, ((5, 4), (-1, -1)) => -11.976464789213786…))))

In [102]:
arrow = Dict(
    (-1, -1) => "↖️",
    (-1, 0) => "⬆️",
    (-1, 1) => "↗️",
    (0, -1) => "⬅️",
    (0, 1) => "➡️",
    (1, -1) => "↙️",
    (1, 0) => "⬇️",
    (1, 1) => "↘️"
)

M = vcat([ arrow[π_star((i, j))] for i in 1:7, j in 1:10 ], (x->"$(x)").(wind'))
M[4, 8] = "G"
for i in 1:8
    println(join(M[i,:]))
end

⬇️⬇️↖️↘️➡️↘️↘️↗️↘️⬇️
↖️➡️↘️➡️↗️➡️↗️↗️↘️↘️
⬇️⬇️⬅️↗️↘️➡️➡️↘️↘️↙️
↘️⬇️↙️↙️➡️↘️➡️G↙️↘️
↘️⬇️⬇️↙️↘️↘️↘️⬇️⬅️↙️
➡️↘️↙️➡️➡️↘️➡️↙️↖️⬅️
↘️↘️↘️↘️↘️↘️↗️⬆️➡️⬆️
0001112210


In [103]:
S = (4, 1)
τ = []
while S != nothing
    push!(τ, S)
    S = p(S, π_star(S))[2]
end
τ

8-element Vector{Any}:
 (4, 1)
 (5, 2)
 (6, 2)
 (7, 3)
 (7, 4)
 (7, 5)
 (7, 6)
 (7, 7)

In [104]:
for s in τ
    M[s[1], s[2]] = "✅"
end
for i in 1:7
    println(join(M[i,:]))
end

⬇️⬇️↖️↘️➡️↘️↘️↗️↘️⬇️
↖️➡️↘️➡️↗️➡️↗️↗️↘️↘️
⬇️⬇️⬅️↗️↘️➡️➡️↘️↘️↙️
✅⬇️↙️↙️➡️↘️➡️G↙️↘️
↘️✅⬇️↙️↘️↘️↘️⬇️⬅️↙️
➡️✅↙️➡️➡️↘️➡️↙️↖️⬅️
↘️↘️✅✅✅✅✅⬆️➡️⬆️


> _Exercise 6.10: Stochastic Wind (programming)_ Re-solve the windy gridworld task with King’s moves, assuming that the effect of the wind, if there is any, is stochastic, sometimes varying by 1 from the mean values given for each column. That is, a third of the time you move exactly according to these values, as in the previous exercise, but also a third of the time you move one cell above that, and another third of the time you move one cell below that. For example, if you are one cell to the right of the goal and you move left, then one-third of the time you move one cell above the goal, one-third of the time you move two cells above the goal, and one-third of the time you move to the goal.

In [105]:
function p_stochastic(S, A)
    y, x = S
    nS = (S .+ A) .- (wind[x] + rand(-1:1), 0)
    nS = (clamp(nS[1], 1, 7), clamp(nS[2], 1, 10))

    return nS == (4, 8) ? (0, nothing) : (-1, nS)
end

p_stochastic (generic function with 1 method)

In [106]:
Q_star, π_star = sarsa(p0, p_stochastic, 𝓢, 𝓐; γ=1, ε=0.1, α=0.5, N=8000);

(Dict(((1, 4), (-1, -1)) => -962.73615293127, ((4, 5), (1, -1)) => -959.9420573316884, ((7, 4), (-1, 0)) => -958.4275043410613, ((4, 1), (1, 0)) => -960.7688247275237, ((2, 7), (-1, -1)) => -962.3534426586518, ((6, 9), (-1, -1)) => -958.703889965327, ((2, 5), (-1, 1)) => -962.6895819999335, ((3, 8), (0, -1)) => -962.2086808714864, ((6, 4), (0, -1)) => -959.3756457312688, ((5, 4), (-1, -1)) => -959.5744685009267…), var"#334#339"{Vector{Tuple{Int64, Int64}}}([(-1, -1), (0, -1), (1, -1), (-1, 0), (1, 0), (-1, 1), (0, 1), (1, 1)], Core.Box(Dict(((1, 4), (-1, -1)) => -962.73615293127, ((4, 5), (1, -1)) => -959.9420573316884, ((7, 4), (-1, 0)) => -958.4275043410613, ((4, 1), (1, 0)) => -960.7688247275237, ((2, 7), (-1, -1)) => -962.3534426586518, ((6, 9), (-1, -1)) => -958.703889965327, ((2, 5), (-1, 1)) => -962.6895819999335, ((3, 8), (0, -1)) => -962.2086808714864, ((6, 4), (0, -1)) => -959.3756457312688, ((5, 4), (-1, -1)) => -959.5744685009267…))))

In [107]:
M = vcat([ arrow[π_star((i, j))] for i in 1:7, j in 1:10 ], (x->"$(x)").(wind'))
M[4, 8] = "G"
for i in 1:7
    println(join(M[i,:]))
end

↘️⬇️↙️↖️⬅️➡️↘️↘️↘️↘️
⬅️⬇️⬇️⬆️⬆️↘️↖️↙️⬆️↘️
↘️↗️↖️⬇️⬅️➡️↙️↙️↘️↘️
↙️↘️↘️↙️↖️➡️➡️G↘️⬇️
↙️↗️⬆️↖️⬆️↙️⬆️↙️↙️↙️
↘️↘️↘️↘️↙️↘️↗️⬇️⬅️⬇️
➡️↗️↗️↘️↖️⬇️⬅️↙️↗️⬅️


In [110]:
𝓐p = [ (0, -1), (-1, 0), (1, 0), (0, 1) ]
Q_star, π_star = sarsa(p0, p, 𝓢, 𝓐p; γ=1, ε=0.1, α=0.5, N=8000)
M = vcat([ arrow[π_star((i, j))] for i in 1:7, j in 1:10 ], (x->"$(x)").(wind'))
M[4, 8] = "G"
for i in 1:7
    println(join(M[i,:]))
end

➡️➡️⬇️➡️➡️➡️➡️➡️➡️⬇️
⬆️➡️⬇️➡️➡️⬆️⬇️➡️⬆️⬇️
➡️⬆️➡️⬆️➡️➡️⬇️➡️➡️⬇️
➡️➡️➡️⬆️➡️➡️➡️G➡️⬇️
➡️➡️➡️➡️➡️➡️⬅️⬇️⬅️⬅️
➡️➡️➡️➡️➡️⬅️⬅️⬇️⬇️⬇️
➡️➡️➡️➡️⬅️⬅️⬅️⬅️⬆️⬅️


In [111]:
S = (4, 1)
τ = []
while S != nothing
    push!(τ, S)
    S = p(S, π_star(S))[2]
end
τ

16-element Vector{Any}:
 (4, 1)
 (4, 2)
 (4, 3)
 (4, 4)
 (2, 4)
 (1, 5)
 (1, 6)
 (1, 7)
 (1, 8)
 (1, 9)
 (1, 10)
 (2, 10)
 (3, 10)
 (4, 10)
 (5, 10)
 (5, 9)

In [112]:
for s in τ
    M[s[1], s[2]] = "✅"
end
for i in 1:7
    println(join(M[i,:]))
end

➡️➡️⬇️➡️✅✅✅✅✅✅
⬆️➡️⬇️✅➡️⬆️⬇️➡️⬆️✅
➡️⬆️➡️⬆️➡️➡️⬇️➡️➡️✅
✅✅✅✅➡️➡️➡️G➡️✅
➡️➡️➡️➡️➡️➡️⬅️⬇️✅✅
➡️➡️➡️➡️➡️⬅️⬅️⬇️⬇️⬇️
➡️➡️➡️➡️⬅️⬅️⬅️⬅️⬆️⬅️
