# Play tic-tac-toe against various agents

In [None]:
from tic_tac_toe.agent import RandomAgent, SarsaAgent, QLearningAgent
from tic_tac_toe.main import run_agent_against_human

## Random agent

The random agent chooses uniformly among allowable next moves.
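As a rough illustration of uniform move selection, the sketch below picks a move from a flat 9-cell board. The board representation (`None` for an empty cell) and the helper names `legal_moves` and `choose_random_move` are assumptions made for this example; they are not taken from the `tic_tac_toe` package, whose `RandomAgent` may be implemented differently.

```python
import random


def legal_moves(board):
    """Return the indices of empty cells (hypothetical encoding: None = empty)."""
    return [i for i, cell in enumerate(board) if cell is None]


def choose_random_move(board, rng=random):
    """Pick one legal move, each with equal probability."""
    moves = legal_moves(board)
    if not moves:
        raise ValueError("no legal moves left")
    return rng.choice(moves)


# Example position: cells 1, 3, 5, 6 and 7 are still free.
board = ["X", None, "O",
         None, "X", None,
         None, None, "O"]
move = choose_random_move(board, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the choice reproducible, which can be handy when debugging a game.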

In [None]:
zufall = RandomAgent()
run_agent_against_human(zufall)

## Sarsa agent

Sarsa learning is an on-policy temporal difference control algorithm with update formula

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ \mathrm{reward}(s') + \gamma \, Q(s',a') - Q(s,a) \right]$$

where $s'$ is the next state after taking action $a$ in state $s$, $a'$ is the action that the agent's policy (e.g. epsilon-greedy) selects in $s'$, $\alpha$ is the learning rate, and $\gamma$ is the discount factor.
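One way to read the Sarsa update is as a single step on a tabular $Q$-function. The sketch below is a minimal illustration under assumptions of its own: the table is a `defaultdict` keyed by `(state, action)`, the states are placeholder strings, and the names `epsilon_greedy` and `sarsa_update` are hypothetical rather than part of the `tic_tac_toe` package.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """With probability epsilon explore at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def sarsa_update(q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one Sarsa update in place and return the new Q(s, a)."""
    td_target = reward + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]


rng = random.Random(0)
q = defaultdict(float)           # unseen (state, action) pairs default to 0.0
q[("s1", 4)] = 0.5

# On-policy: a_next is the action the policy itself picks in s_next,
# and that same action is what the update bootstraps from.
a_next = epsilon_greedy(q, "s1", [4, 7], epsilon=0.1, rng=rng)
new_value = sarsa_update(q, "s0", 0, reward=1.0, s_next="s1", a_next=a_next)
```

Because the target uses the action actually chosen by the policy, exploratory moves influence the learned values.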

For more information, see
* the [Wikipedia SARSA entry](https://en.wikipedia.org/wiki/State%E2%80%93action%E2%80%93reward%E2%80%93state%E2%80%93action)
* Section 6.4 of [Reinforcement Learning, by Sutton and Barto](https://mitpress.mit.edu/books/reinforcement-learning-second-edition)

In [None]:
sarsa = SarsaAgent()
run_agent_against_human(sarsa)

## Q-learning agent

Q-learning is an off-policy temporal difference control algorithm with update formula

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ \mathrm{reward}(s') + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

where $s'$ is the next state after taking action $a$ in state $s$ and $\max_{a'} Q(s',a')$ is the largest $Q$-value over all actions $a'$ available in state $s'$.
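The Q-learning update can be sketched as a single tabular step in the same spirit. The `defaultdict` table keyed by `(state, action)`, the placeholder state names, and the function name `q_learning_update` are assumptions for illustration only, not the `tic_tac_toe` package's implementation. Treating a state with no available actions as having value 0 is likewise a choice made for this sketch.

```python
from collections import defaultdict


def q_learning_update(q, s, a, reward, s_next, next_actions,
                      alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # Off-policy: bootstrap from the best next action, whichever action
    # the agent goes on to play. An empty action list (terminal state)
    # contributes 0.0 here by assumption.
    best_next = max((q[(s_next, a2)] for a2 in next_actions), default=0.0)
    q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
    return q[(s, a)]


q = defaultdict(float)           # unseen (state, action) pairs default to 0.0
q[("s1", 4)] = 0.5
q[("s1", 7)] = 0.8

new_value = q_learning_update(q, "s0", 0, reward=1.0,
                              s_next="s1", next_actions=[4, 7])
```

Here the target uses the value 0.8 of action 7 even if an exploring agent would play action 4 next, which is the difference from the Sarsa target.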

For more information, see
* the [Wikipedia Q-learning entry](https://en.wikipedia.org/wiki/Q-learning)
* Section 6.7 of [Reinforcement Learning, by Sutton and Barto](https://mitpress.mit.edu/books/reinforcement-learning-second-edition)

In [None]:
qlearn = QLearningAgent()
run_agent_against_human(qlearn)