# Introduction

1. This book represents a _computational_ approach to learning from _interaction_.
2. Focus on idealized situations
3. Goal directed learning from interaction
4. >Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal.
5. > actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.
6. > We formalize the problem of reinforcement ... as the optimal control of incompletely-known Markov decision processes
7. Reinforcement learning is not supervised learning, nor unsupervised learning.
   1. Supervised learning requires labels for all possible situations, obviating the need for delayed credit assignment
   2. Unsupervised learning is finding structure in data, which dos not necessarily imply maximization of a reward signal
8. Exploratation vs Exploitation
   1. Explore and find new high value actions, and better estimate the value of existing actions
   2. Exploit existing actions for additional reward
9. Reinforcement learning does _not_ require a model of the environment to start with
10. Reinforcement learning inspired by biological and psychological models
11. Terminology
   1. _policy_: Mapping from states to actions
      1. Stimulus/response rules
      2. Can be simple function / lookup table, or involve computation, search or planning
      3. Can be stochastic
   2. _reward signal_: A scalar received by the agent after each time step

## Implementation of simple TD learner

In [40]:
using DataStructures

@enum Tile X O blank

Board = Array{Tile, 2}

initial_board() = fill(blank, (3, 3))

function is_win(board::Board, tile::Tile)
    for i in 1:3
        if board[i, 1] == board[i, 2] == board[i, 3] == tile
            return true
        elseif board[1, i] == board[2, i] == board[3, i] == tile
            return true
        end
    end
    return (board[1, 1] == board[2, 2] == board[3, 3] == tile) || (board[1,3] == board[2,2] == board[3,1] == tile)
end

is_cats(board::Board) = isempty(findall(x -> x == blank, board))

function play(board::Board, x::Int, y::Int, tile::Tile)
    board = copy(board)
    board[x, y] = tile
    board
end

function play(board::Board, idx::CartesianIndex{2}, tile::Tile)
    board = copy(board)
    board[idx] = tile
    board
end

function random_policy(board::Board)
    moves = findall(x -> x == blank, board)
    rand(moves)
end

function greedy_policy(board, value, tile; ϵ=1e-2)
    if rand() < ϵ
        return random_policy(board), false
    else
        max_move = nothing
        max_value = -Inf
        for pos in findall(x -> x == blank, board)
            next_board = play(board, pos, tile)
            max_move, max_value = value[next_board] > max_value ? (pos, value[next_board]) : (max_move, max_value)
        end
        return max_move, true
    end
end

function default_val(board; tile=X)
    if is_win(board, tile) 
        return 1 
    elseif is_win(board, tile == X ? O : X) || is_cats(board)
        return 0 
    else
        return 0.5 
    end
end

function learn_policy(;iter=100, α=0.1, ϵ=1e-2)
    value = DefaultDict{Board, Float64}(;passkey=true) do key
        default_val(key)
    end
    for i in 1:iter
        board = initial_board()
        while true
            max_move, greedy = greedy_policy(board, value, X; ϵ=ϵ)
            next_board = play(board, max_move, X)
            if greedy
                value[board] = value[board] + α*(value[next_board] - value[board])
            end
                
            if is_win(next_board, X) || is_cats(next_board)
                break
            end
            board = next_board
            
            next_board = play(board, random_policy(board), O)
            if is_win(next_board, O) || is_cats(next_board)
                break
            end
            board = next_board
        end
    end
    return value
end

function play_game(value; ϵ=0.0)
    board = initial_board()
    current_player = X
    while true
        max_move, _ = greedy_policy(board, value, X; ϵ=ϵ)
        next_board = play(board, max_move, X)
        if is_win(next_board, X)
            return true
        elseif is_cats(next_board)
            return 0
        end
        board = next_board
            
        next_board = play(board, random_policy(board), O)
        if is_win(next_board, O) || is_cats(next_board)
            return 0
        end
        board = next_board
    end
end

play_game (generic function with 1 method)

In [41]:
agent = learn_policy(;iter=200000, α=0.5)

DefaultDict{Matrix{Tile}, Float64, var"#56#57"} with 1837 entries:
  [O X blank; X O O; X blank X]                 => 0.5
  [X X blank; X O O; blank O blank]             => 0.999878
  [X X blank; blank blank O; blank O blank]     => 0.999939
  [X O X; blank blank blank; O blank blank]     => 0.5
  [X blank O; X blank O; blank X blank]         => 0.5
  [X blank blank; O X O; X blank O]             => 0.996094
  [X O blank; blank O blank; blank X X]         => 0.5
  [X blank X; O X O; O blank blank]             => 0.75
  [O blank O; X X O; X blank blank]             => 0.5
  [blank X O; blank X blank; blank X O]         => 1.0
  [X O O; blank O blank; X X X]                 => 1.0
  [X blank blank; blank blank X; O blank O]     => 0.5
  [X O blank; X O blank; X X O]                 => 1.0
  [X blank O; blank O blank; blank blank X]     => 0.5
  [blank blank blank; blank X blank; blank X O] => 0.5
  [blank X blank; O X blank; blank blank blank] => 0.5
  [X blank blank; blank O O; X blank 

In [42]:
sum([ play_game(agent; ϵ=0.0) for _ in 1:100_000 ]) / 100_000

0.87464

# Exercises

_Exercise 1.1: Self-Play_

> Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?

Consider first the case where we know that the value function for both players has converged to some final value for each state. Then, necessarily, these players are in a Nash-equilibrium.

Specifically, we can consider each player's strategy as being fully determined by a vector in lying in the $n$-dimensional hyper cube, where $n$ is the number of valid Tic-Tac-Toe states. Assuming that all updates to the value vector move it in a direction of strictly increasing utility w.r.t the current values of the other strategy, it follows that it would only converge when the two vectors are in a Nash equilibrium

In [25]:
struct TDLearner
    value::DefaultDict{Board, Float64}
    α::Float64
    ϵ::Float64
    tile::Tile
end

function (learner::TDLearner)(board::Board)
    action, greedy = greedy_policy(board, learner.value, learner.tile; ϵ=learner.ϵ)
    next_board = play(board, action, learner.tile)
    if greedy
        learner.value[board] = learner.value[board] + learner.α*(learner.value[next_board] - learner.value[board])
    end
    return next_board
end

In [36]:
function learn_self_play(;iter=1000, α=0.1, ϵ=1e-2)
    x_value = DefaultDict{Board, Float64}(;passkey=true) do key
        default_val(key)
    end
    o_value = DefaultDict{Board, Float64}(;passkey=true) do key
        default_val(key)
    end
    learner_x = TDLearner(x_value, α, ϵ, X)
    learner_o = TDLearner(o_value, α, ϵ, O)
    for i in 1:iter
        board = initial_board()
        while true
            next_board = learner_x(board)
            if is_win(next_board, X) || is_cats(next_board)
                break
            end
            board = next_board
            
            next_board = learner_o(board)
            if is_win(next_board, O) || is_cats(next_board)
                break
            end
            board = next_board
        end
    end
    return learner_x, learner_o
end

learn_self_play (generic function with 1 method)

In [37]:
learner_x, learner_o = learn_self_play()

(TDLearner(DefaultDict{Matrix{Tile}, Float64, var"#43#45"}([X X blank; blank blank O; blank blank blank] => 0.5, [X X O; O O blank; X blank X] => 0.5, [X O blank; X X O; O blank blank] => 0.55, [X blank X; blank blank blank; blank blank O] => 0.5, [X blank blank; O X blank; X O blank] => 0.5, [O blank X; X blank blank; O X blank] => 0.5, [X X blank; blank blank blank; blank blank O] => 0.5, [X blank X; O blank blank; blank blank blank] => 0.5, [X blank blank; X X O; O blank blank] => 0.5, [X blank blank; O blank blank; X X O] => 0.5…), 0.1, 0.01, X), TDLearner(DefaultDict{Matrix{Tile}, Float64, var"#44#46"}([X blank O; X blank O; blank blank blank] => 0.5, [X O blank; X blank blank; blank O blank] => 0.5, [O O blank; X blank blank; blank blank X] => 0.5, [X O O; O blank blank; X X blank] => 0.5, [X O blank; X X O; O blank blank] => 0.5, [O blank X; X blank blank; blank O blank] => 0.5, [X blank O; O blank blank; blank X blank] => 0.5, [X blank X; O blank blank; blank blank blank] => 0.

In [39]:
learner_o.value

DefaultDict{Matrix{Tile}, Float64, var"#44#46"} with 280 entries:
  [X blank O; X blank O; blank blank blank]             => 0.5
  [X O blank; X blank blank; blank O blank]             => 0.5
  [O O blank; X blank blank; blank blank X]             => 0.5
  [X O O; O blank blank; X X blank]                     => 0.5
  [X O blank; X X O; O blank blank]                     => 0.5
  [O blank X; X blank blank; blank O blank]             => 0.5
  [X blank O; O blank blank; blank X blank]             => 0.5
  [X blank X; O blank blank; blank blank blank]         => 0.5
  [O O blank; X blank X; blank blank blank]             => 0.5
  [blank blank blank; blank O blank; blank blank X]     => 0.5
  [O X blank; X O blank; O X X]                         => 0.5
  [X X blank; O blank blank; blank blank O]             => 0.5
  [O X X; X O blank; O X O]                             => 0.0
  [X blank blank; blank blank blank; blank blank blank] => 0.5
  [O X blank; X O blank; O blank X]                 

_Exercise 1.2: Symmetries_

> Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the larning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

Rather than having a map of all states to values, the agent could have a map from canonical states to values, where there is one canonical state for each set of states that are symmetric with respect to rotation and flipping.

This may improve the learning algorithm, because then we will visit each canonicalized state more frequently during training, and therefore gain a better estimate of it's value.

However, if the opponent is not symmetric, then it would make sense for us to continue to disregard symmetries in our value table. This is because if the opponent assigns different values to states that are equivalent, then we could benefit by learning and understanding that behavior.

As an example, consider an opponent that always maps the following state to moving to the center:

In [134]:
[ X blank blank ; blank blank blank ; blank blank blank]

3×3 Matrix{Tile}:
 X::Tile = 0      blank::Tile = 2  blank::Tile = 2
 blank::Tile = 2  blank::Tile = 2  blank::Tile = 2
 blank::Tile = 2  blank::Tile = 2  blank::Tile = 2

while mapping the following state to always moving to the "opposite" corner

In [135]:
[ blank blank X ; blank blank blank ; blank blank blank ]

3×3 Matrix{Tile}:
 blank::Tile = 2  blank::Tile = 2  X::Tile = 0
 blank::Tile = 2  blank::Tile = 2  blank::Tile = 2
 blank::Tile = 2  blank::Tile = 2  blank::Tile = 2

Despite the two states being symmetric, the asymmetric agent treats them differently. Therefore, the policy's value should be different between these two positions.

_Exercise 1.3: Greedy Play_

> Suppose the reinforcement learning player was _greedy_, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?

It would likely learn to play worse than a non-greedy player, because it may obtain a high estimate of a particular state, and then never attempt to explore other states, leading it to forgo learning more optimal states.

_Exercise 1.4: Learning From Exploration_

> Suppose learning updates occurred after _all_ moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

The two sets of probabilities are as follows:

1. When we _do not_ update after exploratory moves, we are learning the probability of winning the game in the current state _if we only take greedy moves from now on_.
2. When we _do_ update after exploratory moves, we are learning the probability of winning the game in the current state _if we continue to make greedy moves with probability $\epsilon$_.

_Exercise 1.5: Other Improvements_

> Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem posed?

1. You could attempt to have an agent that planned further ahead, but considered _all_ possible opponent moves. This way, the agent could potentially detect if another move leads, regardless of the actions of the other player, to victory.