# Reinforcement Learning

# Markov Decision Process

This notebook presents some examples of Markov Decision Processes, where an **agent** interacts with its **environment** and collects **rewards**.

In all models considered here, rewards are attached to **states**. In games, for instance, the reward of a state is +1 if you win, -1 if you lose, and 0 in all other cases.
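
To make the agent/environment/reward loop concrete, here is a minimal self-contained sketch that does not use the `model` or `agent` modules. The 1-D environment, the array `state_rewards` and the function `run_episode` are hypothetical names chosen for illustration only. The sketch follows the conventions used later in this notebook: an episode includes the initial state, and the initial reward is 0.

```python
import numpy as np

# Hypothetical 1-D environment: states 0..4, one reward attached to each state.
state_rewards = np.array([0, 0, 0, 0, 1])

def run_episode(n_steps, seed=0):
    """Random agent moving left or right; returns visited states and collected rewards."""
    rng = np.random.default_rng(seed)
    state = 0
    states, collected = [state], [0]  # initial reward = 0 by convention
    for _ in range(n_steps):
        action = int(rng.choice([-1, 1]))  # random policy
        # the environment applies the action and keeps the agent inside the grid
        state = int(np.clip(state + action, 0, len(state_rewards) - 1))
        states.append(state)
        collected.append(int(state_rewards[state]))  # reward of the new state
    return states, collected

states, collected = run_episode(20)
gain = sum(collected)  # gain = cumulative reward of the episode
```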

In [1]:
import numpy as np

In [3]:
from model import Walk, Maze

Each model is an object of the class ``Environment`` storing the current state. 

In [None]:
from model import Environment

In [None]:
methods_environment = [method for method in dir(Environment) if '__' not in method]

In [None]:
methods_environment

In [None]:
from agent import Agent

In [None]:
methods_agent = [method for method in dir(Agent) if '__' not in method]

In [None]:
methods_agent

## Walk

We start with a walk in a square. Some states (to be found) have positive rewards.

In [None]:
# environment
model = Walk()

In [None]:
model.display()

In [None]:
model.Size

In [None]:
# rewards (to be discovered)
model.Rewards

In [None]:
state = model.state

In [None]:
state

In [None]:
model.get_actions(state)

In [None]:
action = (0, 1)

In [None]:
model.step(action)

In [None]:
model.state

In [None]:
model.display()

In [None]:
# agent with random policy (default)
agent = Agent(model)

In [None]:
state = model.state
action = agent.get_action(state)

In [None]:
state

In [None]:
action

In [None]:
reward, stop = model.step(action)

In [None]:
reward

In [None]:
stop

In [None]:
# all possible actions
agent.get_actions(state)

In [None]:
# policy
probs, actions = agent.policy(state)

In [None]:
print(probs)

In [None]:
print(actions)

In [None]:
# an episode
stop, states, rewards = agent.get_episode(100)

In [None]:
# the episode includes the initial state
len(states)

In [None]:
# display
animation = model.display(states)

In [None]:
animation

In [None]:
# initial reward = 0 by convention
len(rewards)

In [None]:
np.sum(rewards)

In [None]:
# gains = cumulative rewards
gains = agent.get_gains(n_runs=10)

In [None]:
gains

## To do

* Test the weighted random policy where the probability of each move is proportional to its weight.
* Is this policy better than the (pure) random policy?
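
As a hedged illustration of "probability proportional to weight", the sketch below normalizes the weights of the available moves and samples one move. The helper names `normalize_weights` and `sample_action` are hypothetical and independent of the `Agent` class; only NumPy is used.

```python
import numpy as np

# Hypothetical weights, one per move (same form as the dictionary used in this notebook).
example_weights = {(0, 1): 2, (1, 0): 2, (0, -1): 1, (-1, 0): 1}

def normalize_weights(actions, weights):
    """Probabilities proportional to the weights of the available actions."""
    probs = np.array([weights[action] for action in actions], dtype=float)
    return probs / probs.sum()

def sample_action(actions, probs, rng):
    """Draw one action according to the given probabilities."""
    index = rng.choice(len(actions), p=probs)
    return actions[index]

# Suppose only three moves are available in the current state (weights 2, 2, 1).
available = [(0, 1), (1, 0), (0, -1)]
probs = normalize_weights(available, example_weights)
move = sample_action(available, probs, np.random.default_rng(0))
```

Normalizing over the available moves only (rather than over all four) keeps the probabilities summing to 1 near walls, where some moves are not allowed.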

In [None]:
weights = {(0, 1): 2, (1, 0): 2, (0, -1): 1, (-1, 0): 1}

In [None]:
def weighted_random_policy(state, weights=weights):
    actions = Walk().get_actions(state)
    # to be modified
    probs = []
    return probs, actions

## Maze

Now let's try to escape a maze!

In [None]:
maze_map = np.load('maze.npy')

In [None]:
model = Maze()
init_state = (1, 0)
exit_state = (1, 20)
model.set_parameters(maze_map, init_state, [exit_state])

In [None]:
model.display()

In [None]:
model.state

In [None]:
state = model.state
reward = model.get_reward(state)

In [None]:
reward

In [None]:
model.get_actions(state)

In [None]:
action = (0, 1)

In [None]:
model.step(action)

In [None]:
model.display()

In [None]:
# agent with random policy
agent = Agent(model)

In [None]:
stop, states, rewards = agent.get_episode()

In [None]:
animation = model.display(states)

In [None]:
animation

In [None]:
# time in the Maze
-np.sum(rewards)

In [None]:
agent.get_gains(n_steps=500, n_runs=10)

## To do

In [None]:
weights = {(0, 1): 2, (1, 0): 2, (0, -1): 1, (-1, 0): 1}

In [None]:
def weighted_random_policy(state, weights=weights):
    actions = Maze().get_actions(state)
    probs = np.array([weights[action] for action in actions]).astype(float)
    probs /= np.sum(probs)
    return probs, actions

In [None]:
agent = Agent(model, policy=weighted_random_policy)

In [None]:
agent.get_gains(n_runs=10, n_steps=500)

## Games

Finally, let's play games!<br>
Note that in most games:
* you play against an adversary (which is part of the environment),
* you may play first or second,
* when your adversary plays, you have only one possible action (let your adversary play :-),
* you can also impose an action on your adversary (useful for training).
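
The "only one possible action" rule can be sketched as follows. This is an assumption about how such a rule might look, not the code of the `Agent` class: the function `agent_actions` is hypothetical, the state is taken to be a `(player, board)` pair, and the "pass" action is represented by `None`.

```python
def agent_actions(state, agent_player, all_actions):
    """Moves offered to the agent: every legal move on its turn, a single 'pass' otherwise."""
    player, _ = state  # assumed state layout: (next player, board)
    if player == agent_player:
        return all_actions
    return [None]  # hypothetical 'pass' action: let the adversary play

legal_moves = [(0, 0), (1, 1)]
my_turn = agent_actions((1, None), 1, legal_moves)
not_my_turn = agent_actions((-1, None), 1, legal_moves)
```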

Here we consider [Tic-Tac-Toe](https://en.wikipedia.org/wiki/Tic-tac-toe), [Nim](https://en.wikipedia.org/wiki/Nim), [Connect Four](https://en.wikipedia.org/wiki/Connect_Four) and [Five in a row](https://en.wikipedia.org/wiki/Gomoku).
Feel free to add more :-)

In [None]:
from model import TicTacToe, Nim, ConnectFour, FiveInRow

Each game is an object of the class ``Game``. 

In [None]:
from model import Game

Note that the method ``get_actions`` gives all possible moves (even if it is not your turn).

In [None]:
methods_game = [method for method in dir(Game) if '__' not in method]

In [None]:
methods_game

The method ``get_next_state`` allows you to get the next state for any (state, action) pair, without modifying the current state.
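
The idea of computing a next state without side effects can be sketched as below. The function `next_state` is a hypothetical stand-in, not the actual `get_next_state` method; it assumes a Tic-Tac-Toe state of the form `(player, board)` with players 1 and -1 and a NumPy board where 0 marks an empty cell.

```python
import numpy as np

def next_state(state, action):
    """Return the state reached by playing `action`; the input state is left untouched."""
    player, board = state
    new_board = board.copy()      # work on a copy so the current board is not modified
    new_board[action] = player    # the current player marks the chosen cell
    return -player, new_board     # the other player moves next

state = (1, np.zeros((3, 3), dtype=int))
new_state = next_state(state, (1, 1))
```

Copying the board is what makes look-ahead safe: several candidate moves can be tried from the same state without undoing anything.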

In [None]:
set(methods_game) - set(methods_environment)

## Tic-Tac-Toe

We start with Tic-Tac-Toe.

### Play first

In [None]:
# game against a random player (default)
game = TicTacToe()

In [None]:
game.display()

In [None]:
# next player, board
game.state

In [None]:
game.get_actions(game.state)

In [None]:
game.get_next_state(game.state, (1,1))

In [None]:
# the state is not modified
game.state

In [None]:
# you play at random (default)
agent = Agent(game)

In [None]:
# you play as player 1 (default)
agent.player

In [None]:
# your adversary plays as player -1
game.adversary.player

In [None]:
state = game.state
action = agent.get_action(state)

In [None]:
action

In [None]:
reward, stop = game.step(action)

In [None]:
# you're blue
game.display()

In [None]:
game.state

In [None]:
# all possible moves
game.get_actions(game.state)

In [None]:
# your moves (not your turn -> pass)
agent.get_actions(game.state)

In [None]:
action = agent.get_action(game.state)

In [None]:
print(action)

In [None]:
reward, stop = game.step(action)

In [None]:
game.display()

In [None]:
stop, states, rewards = agent.get_episode()

In [None]:
animation = game.display(states)

In [None]:
animation

In [None]:
rewards

In [None]:
gains = agent.get_gains()
np.unique(gains, return_counts=True)

### Play second

In [None]:
# your adversary starts
game = TicTacToe(play_first=False)

In [None]:
game.first_player

In [None]:
# you still play at random
agent = Agent(game)

In [None]:
stop, states, rewards = agent.get_episode()

In [None]:
animation = game.display(states)

In [None]:
# you're still blue, red starts
animation

In [None]:
rewards

In [None]:
gains = agent.get_gains()
np.unique(gains, return_counts=True)

### Control your adversary

You can force the actions of your adversary (useful for training).

In [None]:
game = TicTacToe()

In [None]:
actions = [(0, 0), (1, 1), (0, 2),  (2, 2), (0, 1)]

In [None]:
for action in actions:
    game.step(action)

In [None]:
game.display()

### One step ahead
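
The exact behavior of the `'one_step'` policy is defined in the `agent` module and is not shown in this notebook. The sketch below is one plausible reading, stated as an assumption: play an immediately winning move if there is one, otherwise play at random. The functions `has_won` and `one_step_move` are hypothetical and work on a plain 3x3 NumPy board.

```python
import numpy as np

def has_won(board, player):
    """True if `player` fills a full row, column or diagonal of a 3x3 board."""
    lines = list(board) + list(board.T)
    lines += [board.diagonal(), np.fliplr(board).diagonal()]
    return any(np.all(line == player) for line in lines)

def one_step_move(board, player, rng):
    """Hypothetical one-step look-ahead: win now if possible, else move at random.

    Assumes the board has at least one empty cell (value 0).
    """
    moves = [tuple(int(i) for i in cell) for cell in np.argwhere(board == 0)]
    for move in moves:
        trial = board.copy()  # try the move on a copy of the board
        trial[move] = player
        if has_won(trial, player):
            return move
    return moves[rng.integers(len(moves))]

board = np.array([[1, 1, 0],
                  [-1, -1, 0],
                  [0, 0, 0]])
move = one_step_move(board, 1, np.random.default_rng(0))
```

In this position player 1 can complete the top row, so the look-ahead picks that cell instead of a random one.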

In [None]:
# your adversary is random
game = TicTacToe()

In [None]:
# you play with one-step ahead policy
agent = Agent(game, 'one_step')

In [None]:
stop, states, rewards = agent.get_episode()

In [None]:
animation = game.display(states)

In [None]:
animation

In [None]:
gains = agent.get_gains()
np.unique(gains, return_counts=True)

In [None]:
# your adversary also looks one-step ahead
game = TicTacToe(adversary_policy='one_step')

In [None]:
agent = Agent(game, 'one_step')

In [None]:
stop, states, rewards = agent.get_episode()

In [None]:
animation = game.display(states)

In [None]:
animation

In [None]:
np.unique(agent.get_gains(), return_counts=True)

## Nim

### Random players

In [None]:
# game against a random player (default)
game = Nim()

In [None]:
game.state

In [None]:
game.display()

In [None]:
# player, board
game.state

In [None]:
# you play at random
agent = Agent(game)

In [None]:
state = game.state
action = agent.get_action(state)

In [None]:
action

In [None]:
reward, stop = game.step(action)

In [None]:
game.display()

In [None]:
stop, states, rewards = agent.get_episode()

In [None]:
animation = game.display(states)

In [None]:
animation

In [None]:
rewards

In [None]:
np.unique(agent.get_gains(), return_counts=True)

### One step ahead

In [None]:
game = Nim('one_step')

In [None]:
agent = Agent(game, 'one_step')

In [None]:
np.unique(agent.get_gains(), return_counts=True)

## Connect Four

### Random players

In [None]:
# game against a random player
game = ConnectFour()

In [None]:
game.display()

In [None]:
game.state

In [None]:
# you play at random
agent = Agent(game)

In [None]:
stop, states, rewards = agent.get_episode()

In [None]:
animation = game.display(states)

In [None]:
# you play yellow
animation

In [None]:
np.unique(agent.get_gains(n_runs=10), return_counts=True)

### One step ahead

In [None]:
game = ConnectFour('one_step')

In [None]:
agent = Agent(game, 'one_step')

In [None]:
stop, states, rewards = agent.get_episode()

In [None]:
animation = game.display(states)

In [None]:
animation

In [None]:
np.unique(agent.get_gains(n_runs=10), return_counts=True)

## Five-in-a-row

### Random players

In [None]:
game = FiveInRow()

In [None]:
game.display()

In [None]:
agent = Agent(game)

In [None]:
stop, states, rewards = agent.get_episode()

In [None]:
animation = game.display(states)

In [None]:
animation

In [None]:
np.unique(agent.get_gains(n_runs=10), return_counts=True)

### One step ahead

In [None]:
game = FiveInRow()

In [None]:
agent = Agent(game, 'one_step')

In [None]:
np.unique(agent.get_gains(n_runs=10), return_counts=True)

In [None]:
# a better adversary
game = FiveInRow('one_step')

In [None]:
agent = Agent(game, 'one_step')

In [None]:
np.unique(agent.get_gains(n_runs=10), return_counts=True)

## Value function

The value function of a policy can be computed from Bellman's equation, provided the state space is not too large.
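
For a small state space, Bellman's equation is a linear system that can be solved directly. The sketch below is not the implementation of `PolicyEvaluation` in the `dp` module. It assumes that the policy has already been folded into a transition matrix `P` (with `P[s, t]` the probability of moving from state `s` to state `t`), that the reward of a state is collected on entering it, and that terminal states have an all-zero row. Under these assumptions the equation reads $V = P\,(r + \gamma V)$. The function name `solve_bellman` and the discount factor `gamma` are hypothetical.

```python
import numpy as np

def solve_bellman(P, rewards, gamma=0.9):
    """Solve V = P (rewards + gamma V), i.e. (I - gamma P) V = P rewards."""
    n = len(rewards)
    return np.linalg.solve(np.eye(n) - gamma * P, P @ rewards)

# Hypothetical 3-state chain under a random policy; state 2 is terminal with reward 1.
P = np.array([[0.5, 0.5, 0.0],   # state 0: stay (bump into the wall) or move right
              [0.5, 0.0, 0.5],   # state 1: move left or right
              [0.0, 0.0, 0.0]])  # state 2: terminal, no further transition
rewards = np.array([0.0, 0.0, 1.0])

values = solve_bellman(P, rewards, gamma=0.9)
```

The cost of the direct solve grows roughly with the cube of the number of states, which is why this approach is only practical when the state space is not too large.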

In [None]:
from dp import PolicyEvaluation

You can check this condition by listing all states.

In [None]:
model = Walk()

In [None]:
len(model.get_states())

## To do

* Evaluate the random policy in the ``Walk`` model and display the value function.
* Are there weights for which the weighted random policy is better than the (pure) random policy?
* Do the same for the maze.

In [None]:
model = Walk()
agent = Agent(model)
random_policy = agent.policy

In [None]:
algo = PolicyEvaluation(model, random_policy)

In [None]:
values = algo.evaluate_policy()

In [None]:
values

In [None]:
model.display_values(values)

## To do

* Evaluate the random policy and the one-step policy in some games, when possible.
* Use this to predict (first) good moves.