# Reinforcement Learning

# 4. Online control

This notebook presents the **online control** of an agent by SARSA and Q-learning.

In [14]:
import numpy as np

In [15]:
from model import TicTacToe, Nim, ConnectFour
from agent import Agent, OnlineControl
from dynamic import ValueIteration

## To do

* Complete the class ``SARSA`` and test it on Tic-Tac-Toe.
* Complete the class ``QLearning`` and test it on Tic-Tac-Toe.
* Compare these algorithms on Tic-Tac-Toe (play first) and Nim (play second), using a random adversary, then a perfect adversary. Comment on your results.
* Test these algorithms on Connect 4 against a random adversary. Comment on your results.
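
For reference, the textbook updates that the two classes are meant to implement can be written as below. The notation is an assumption made for this note: $s, a$ are the current state and action, $r$ the reward, $s'$ the next state, $\gamma$ the discount factor and $N(s, a)$ the visit count, which the code uses as step size through `diff / count`.

$$
\begin{aligned}
\text{SARSA:} \quad & Q(s, a) \leftarrow Q(s, a) + \frac{1}{N(s, a)} \Big( r + \gamma\, Q(s', a') - Q(s, a) \Big), \quad a' \text{ chosen } \varepsilon\text{-greedily in } s' \\
\text{Q-learning:} \quad & Q(s, a) \leftarrow Q(s, a) + \frac{1}{N(s, a)} \Big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big)
\end{aligned}
$$

When $s'$ is terminal, the target reduces to $r$ in both cases.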

### SARSA

In [16]:
class SARSA(OnlineControl):
    """Online control by SARSA."""
        
    def update_values(self, state=None, horizon=100, epsilon=0.5):
        """Learn the action-value function online."""
        self.model.reset(state)
        state = self.model.state
        if not self.model.is_terminal(state):
            action = self.randomize_best_action(state, epsilon=epsilon)
            for t in range(horizon):
                self.add_state(state)
                code = self.model.encode(state)
                self.action_count[code][action] += 1
                reward, stop = self.model.step(action)
                # to be modified (get sample gain)
                # begin
                gain = reward
                new_state = self.model.state
                if not stop:
                    new_action = self.randomize_best_action(new_state, epsilon=epsilon)
                    # SARSA target: reward plus discounted value of the action actually selected next
                    gain += self.gamma * self.action_value[self.model.encode(new_state)][new_action]
                # end
                diff = gain - self.action_value[code][action]
                count = self.action_count[code][action]
                self.action_value[code][action] += diff / count
                if stop:
                    break
                # to be modified (update state and action)
                # begin
                state = new_state
                action = new_action
                # end
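
The update performed inside the loop can be checked in isolation. The following is a minimal sketch, not the notebook's implementation: it replaces the `OnlineControl` tables with plain dictionaries, and the function name `sarsa_update` and the states `"s0"`, `"s1"` are hypothetical.

```python
from collections import defaultdict


def sarsa_update(q, counts, state, action, reward, next_state, next_action, gamma, done):
    """Apply one SARSA update with step size 1 / visit count (illustrative helper)."""
    counts[state][action] += 1
    # Target: immediate reward, plus the discounted value of the next
    # state-action pair when the episode continues.
    target = reward
    if not done:
        target += gamma * q[next_state][next_action]
    q[state][action] += (target - q[state][action]) / counts[state][action]
    return q[state][action]


# Tables default to 0 for pairs never visited.
q = defaultdict(lambda: defaultdict(float))
counts = defaultdict(lambda: defaultdict(int))
q["s1"]["right"] = 1.0

# First visit: the value moves all the way to the target 0 + 0.9 * 1.0.
first = sarsa_update(q, counts, "s0", "left", 0.0, "s1", "right", gamma=0.9, done=False)

# Second visit ends the episode with reward -1: the value becomes the
# average of the two targets seen so far, (0.9 - 1) / 2.
second = sarsa_update(q, counts, "s0", "left", -1.0, None, None, gamma=0.9, done=True)
print(first, second)
```

With a step size of `1 / count`, each value is the running average of the targets observed for that state-action pair.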

### Q-learning

In [17]:
class QLearning(OnlineControl):
    """Online control by Q-learning."""
        
    def update_values(self, state=None, horizon=100, epsilon=0.5):
        """Learn the action-value function online."""
        self.model.reset(state)
        state = self.model.state
        # to be completed
        if not self.model.is_terminal(state):
            action = self.randomize_best_action(state, epsilon=epsilon)
            for t in range(horizon):
                self.add_state(state)
                code = self.model.encode(state)
                self.action_count[code][action] += 1
                reward, stop = self.model.step(action)
                # to be modified (get sample gain)
                # begin
                gain = reward
                new_state = self.model.state
                if not stop:
                    new_actions = self.get_best_actions(new_state)
                    i = np.random.choice(len(new_actions))
                    # Q-learning target: reward plus discounted value of a best action in the new state
                    gain += self.gamma * self.action_value[self.model.encode(new_state)][new_actions[i]]
                # end
                diff = gain - self.action_value[code][action]
                count = self.action_count[code][action]
                self.action_value[code][action] += diff / count
                if stop:
                    break
                # to be modified (update state and action)
                # begin
                state = new_state
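                # behave epsilon-greedily; only the target above is greedy (off-policy)
                action = self.randomize_best_action(new_state, epsilon=epsilon)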
                # end
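
The only conceptual difference between the two algorithms is the target used in the update. The sketch below contrasts the two on a single transition; it is self-contained and does not use the notebook's classes, and the function names and the action labels `"a"` and `"b"` are hypothetical.

```python
def sarsa_target(reward, gamma, next_values, next_action, done):
    """Target built from the action actually selected in the next state."""
    if done:
        return reward
    return reward + gamma * next_values[next_action]


def q_learning_target(reward, gamma, next_values, done):
    """Target built from the best action in the next state, whatever is played."""
    if done or not next_values:
        return reward
    return reward + gamma * max(next_values.values())


# Action values in the next state; suppose exploration picked the worse action "a".
next_values = {"a": 0.2, "b": 0.8}

on_policy = sarsa_target(0.0, 0.9, next_values, next_action="a", done=False)
off_policy = q_learning_target(0.0, 0.9, next_values, done=False)
print(on_policy, off_policy)
```

The two targets coincide whenever the selected action is a greedy one, which with `epsilon=0.05` is the case on most steps.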
        

### TicTacToe against random player

In [18]:
# With random policy

Game = TicTacToe
game = Game()

agent = Agent(game)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  0,  1]), array([24, 16, 60], dtype=int64))

In [19]:
np.mean(agent.get_gains())

0.33

In [20]:
# With SARSA

Control = SARSA
algo = Control(game)

n_games = 1000
for i in range(n_games):
    algo.update_values(epsilon=0.05)
    
policy = algo.get_policy()
agent = Agent(game, policy)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  1]), array([ 7, 93], dtype=int64))

In [21]:
np.mean(agent.get_gains())

0.75

In [22]:
# With QLearning

Control = QLearning
algo = Control(game)

n_games = 1000
for i in range(n_games):
    algo.update_values(epsilon=0.05)
    
policy = algo.get_policy()
agent = Agent(game, policy)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  0,  1]), array([ 6,  4, 90], dtype=int64))

In [23]:
np.mean(agent.get_gains())

0.76

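Against a random player at Tic-Tac-Toe, SARSA and Q-learning give similar results: both win about 90% of the games (93 and 90 out of 100), against 60% for the random policy, and their mean gains are close (0.75 and 0.76, against 0.33 for the random policy).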
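
Each evaluation here relies on 100 games, so the figures carry sampling noise. As a rough guide, the sketch below computes the mean gain and its standard error from a table of outcome counts, using the counts of the random-policy cell above; the helper name `gain_summary` is hypothetical.

```python
import numpy as np


def gain_summary(values, counts):
    """Return the mean gain and its standard error from outcome counts."""
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    n_games = counts.sum()
    mean = (values * counts).sum() / n_games
    variance = (counts * (values - mean) ** 2).sum() / n_games
    return mean, np.sqrt(variance / n_games)


# Outcomes of the random policy at Tic-Tac-Toe: 24 losses, 16 draws, 60 wins.
mean, stderr = gain_summary([-1, 0, 1], [24, 16, 60])
print(f"mean gain = {mean:.2f} +/- {stderr:.2f}")  # mean gain = 0.36 +/- 0.08
```

A standard error of about 0.08 explains why two successive calls to `get_gains` can return visibly different means for the same policy, and why small gaps between two algorithms should not be over-interpreted.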

### Nim against random player (play second)

In [24]:
# With random policy

Game = Nim
game = Game(play_first=False)

agent = Agent(game)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  1]), array([59, 41], dtype=int64))

In [25]:
np.mean(agent.get_gains())

0.02

In [26]:
# With SARSA

Control = SARSA
algo = Control(game)

n_games = 1000
for i in range(n_games):
    algo.update_values(epsilon=0.05)
    
policy = algo.get_policy()
agent = Agent(game, policy)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  1]), array([12, 88], dtype=int64))

In [27]:
np.mean(agent.get_gains())

0.56

In [28]:
# With QLearning

Control = QLearning
algo = Control(game)

n_games = 1000
for i in range(n_games):
    algo.update_values(epsilon=0.05)
    
policy = algo.get_policy()
agent = Agent(game, policy)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  1]), array([25, 75], dtype=int64))

In [29]:
np.mean(agent.get_gains())

0.44

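Playing second at Nim against a random adversary, both algorithms clearly improve on the random policy: SARSA wins 88 games out of 100 and Q-learning 75, against 41 for the random policy. SARSA does better in these runs (mean gain of 0.56 against 0.44), but the gap is small enough that it may come from the randomness of training and evaluation.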

### TicTacToe against a perfect player

In [30]:
algo = ValueIteration(TicTacToe(), gamma=0.9)
policy, adversary_policy = algo.get_perfect_players()

Game = TicTacToe
game = Game(adversary_policy=adversary_policy)

In [31]:
# With SARSA

Control = SARSA
algo = Control(game)

n_games = 1000
for i in range(n_games):
    algo.update_values(epsilon=0.05)
    
policy = algo.get_policy()
agent = Agent(game, policy)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  0]), array([50, 50], dtype=int64))

In [32]:
np.mean(agent.get_gains())

-0.49

In [33]:
# With QLearning

Control = QLearning
algo = Control(game)

n_games = 1000
for i in range(n_games):
    algo.update_values(epsilon=0.05)
    
policy = algo.get_policy()
agent = Agent(game, policy)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  0]), array([39, 61], dtype=int64))

In [34]:
np.mean(agent.get_gains())

-0.42

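Neither SARSA nor Q-learning wins a single game against a perfect player. Tic-Tac-Toe ends in a draw under perfect play, so a draw is the best achievable outcome. After 1000 training games, both policies still lose a large share of the games (50 and 39 out of 100), so they remain far from optimal.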

### Nim against perfect player (play second)

In [47]:
algo = ValueIteration(Nim(play_first=False), gamma=0.9)
policy, adversary_policy = algo.get_perfect_players()

Game = Nim
game = Game(adversary_policy=adversary_policy, play_first=False)

In [48]:
# With SARSA

Control = SARSA
algo = Control(game)

n_games = 1000
for i in range(n_games):
    algo.update_values(epsilon=0.05)
    
policy = algo.get_policy()
agent = Agent(game, policy)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  1]), array([59, 41], dtype=int64))

In [49]:
np.mean(agent.get_gains())

0.08

In [50]:
# With QLearning

Control = QLearning
algo = Control(game)

n_games = 1000
for i in range(n_games):
    algo.update_values(epsilon=0.05)
    
policy = algo.get_policy()
agent = Agent(game, policy)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  1]), array([ 4, 96], dtype=int64))

In [51]:
np.mean(agent.get_gains())

0.98

Against a perfect player in Nim, playing second, the SARSA policy wins only part of the games (41 out of 100, mean gain of 0.08).

The Q-learning policy wins almost all of them (96 out of 100, mean gain of 0.98). This suggests that the first player cannot force a win from the initial position: even a perfect adversary loses if the second player replies correctly, and Q-learning has learned these replies. Nim involves no chance, so the outcome depends only on the moves played.

### ConnectFour

In [52]:
# With random policy

Game = ConnectFour
game = Game()

agent = Agent(game)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  1]), array([46, 54], dtype=int64))

In [53]:
np.mean(agent.get_gains())

0.14

In [54]:
# With SARSA

Control = SARSA
algo = Control(game)

n_games = 1000
for i in range(n_games):
    algo.update_values(epsilon=0.05)
    
policy = algo.get_policy()
agent = Agent(game, policy)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  1]), array([41, 59], dtype=int64))

In [55]:
np.mean(agent.get_gains())

0.06

In [56]:
# With QLearning

Control = QLearning
algo = Control(game)

n_games = 1000
for i in range(n_games):
    algo.update_values(epsilon=0.05)
    
policy = algo.get_policy()
agent = Agent(game, policy)

np.unique(agent.get_gains(), return_counts=True)

(array([-1,  1]), array([37, 63], dtype=int64))

In [57]:
np.mean(agent.get_gains())

0.02

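Against a random player, the SARSA and Q-learning policies win slightly more games than the random policy (59 and 63 out of 100, against 54), but the mean gains measured on a separate sample (0.06 and 0.02, against 0.14) do not confirm this. After 1000 training games, the improvement over the random policy is within the noise of the evaluation.

These algorithms can be applied to ConnectFour because they only store the states actually visited, unlike dynamic programming, which must enumerate all states. Training remains slow, however, and 1000 games explore only a very small part of the state space.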