## 3x3 grid example

In [1]:
#Load packages
from NashQLearn import Player, Grid, NashQLearning
import warnings
warnings.filterwarnings('ignore')

This notebook applies the Nash Q-learning algorithm to the following multi-agent problem: two robots placed on a grid must each reach the reward tile. At every step, a robot may move up, down, left, or right, or stay at its current position.
The two robots may not occupy the same tile unless it is the reward tile.



### Prepare the game environment

In [2]:
#Initialize the two players
player1 = Player([0,0])
player2 = Player([2,0])

In [3]:
#Initialize the grid
grid = Grid(length = 3,
            width = 3,
            players = [player1,player2],
            obstacle_coordinates = [[1,1]], #A single obstacle in the middle of the grid
            reward_coordinates = [1,2],
            reward_value = 20,
            collision_penalty = -1)

In [4]:
joint_states = grid.joint_states()
print('Available joint states : %s'%len(joint_states))
print(joint_states)

Available joint states : 57
[[[0, 0], [0, 1]], [[0, 0], [0, 2]], [[0, 0], [1, 0]], [[0, 0], [1, 2]], [[0, 0], [2, 0]], [[0, 0], [2, 1]], [[0, 0], [2, 2]], [[0, 1], [0, 0]], [[0, 1], [0, 2]], [[0, 1], [1, 0]], [[0, 1], [1, 2]], [[0, 1], [2, 0]], [[0, 1], [2, 1]], [[0, 1], [2, 2]], [[0, 2], [0, 0]], [[0, 2], [0, 1]], [[0, 2], [1, 0]], [[0, 2], [1, 2]], [[0, 2], [2, 0]], [[0, 2], [2, 1]], [[0, 2], [2, 2]], [[1, 0], [0, 0]], [[1, 0], [0, 1]], [[1, 0], [0, 2]], [[1, 0], [1, 2]], [[1, 0], [2, 0]], [[1, 0], [2, 1]], [[1, 0], [2, 2]], [[1, 2], [0, 0]], [[1, 2], [0, 1]], [[1, 2], [0, 2]], [[1, 2], [1, 0]], [[1, 2], [2, 0]], [[1, 2], [2, 1]], [[1, 2], [2, 2]], [[2, 0], [0, 0]], [[2, 0], [0, 1]], [[2, 0], [0, 2]], [[2, 0], [1, 0]], [[2, 0], [1, 2]], [[2, 0], [2, 1]], [[2, 0], [2, 2]], [[2, 1], [0, 0]], [[2, 1], [0, 1]], [[2, 1], [0, 2]], [[2, 1], [1, 0]], [[2, 1], [1, 2]], [[2, 1], [2, 0]], [[2, 1], [2, 2]], [[2, 2], [0, 0]], [[2, 2], [0, 1]], [[2, 2], [0, 2]], [[2, 2], [1, 0]], [[2, 2], [1, 2]],
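
The count of 57 can be cross-checked without the library. The sketch below is not the `Grid.joint_states` implementation; it only assumes the rules stated earlier (3x3 grid, obstacle at `[1, 1]`, reward at `[1, 2]`, shared tile allowed only on the reward) and counts the ordered pairs of positions that satisfy them. The names `free_tiles` and `count_joint_states` are hypothetical.

```python
from itertools import product

# Values copied from the Grid(...) call above.
LENGTH, WIDTH = 3, 3
OBSTACLES = [[1, 1]]
REWARD = [1, 2]

# Every tile that is not an obstacle can hold a robot.
free_tiles = [[x, y] for x in range(LENGTH) for y in range(WIDTH)
              if [x, y] not in OBSTACLES]

def count_joint_states(tiles, reward):
    """Count ordered (player 1, player 2) position pairs.

    A pair is valid when the robots stand on different tiles,
    or when both stand on the reward tile.
    """
    count = 0
    for a, b in product(tiles, repeat=2):
        if a != b or a == reward:
            count += 1
    return count

n_states = count_joint_states(free_tiles, REWARD)
print(n_states)  # 8 * 7 pairs of distinct tiles + 1 shared reward state = 57
```

Under these assumptions the 8 free tiles give 56 ordered pairs of distinct positions, and the single state in which both robots sit on the reward tile brings the total to 57.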

In [5]:
walls = grid.identify_walls()
walls

[['left', [0, 0]],
 ['down', [0, 0]],
 ['left', [0, 1]],
 ['right', [0, 1]],
 ['left', [0, 2]],
 ['up', [0, 2]],
 ['up', [1, 0]],
 ['down', [1, 0]],
 ['up', [1, 2]],
 ['down', [1, 2]],
 ['right', [2, 0]],
 ['down', [2, 0]],
 ['left', [2, 1]],
 ['right', [2, 1]],
 ['right', [2, 2]],
 ['up', [2, 2]]]
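
The wall list above can be reproduced with a simple rule: a move from a free tile is blocked when it would leave the grid or land on an obstacle. The following is a hypothetical re-implementation, not the code behind `grid.identify_walls()`; the `[x, y]` coordinate convention and the direction names are inferred from the output shown above, and `find_walls` and `MOVES` are illustrative names.

```python
# Assumed convention: [x, y], with 'right' increasing x and 'up' increasing y.
MOVES = {'left': (-1, 0), 'right': (1, 0), 'up': (0, 1), 'down': (0, -1)}

def find_walls(length, width, obstacles):
    """Return [direction, [x, y]] for every blocked move from a free tile."""
    walls = []
    for x in range(length):
        for y in range(width):
            if [x, y] in obstacles:
                continue  # robots never stand on an obstacle
            for name, (dx, dy) in MOVES.items():
                nx, ny = x + dx, y + dy
                outside = not (0 <= nx < length and 0 <= ny < width)
                if outside or [nx, ny] in obstacles:
                    walls.append([name, [x, y]])
    return walls

walls_sketch = find_walls(3, 3, [[1, 1]])
print(len(walls_sketch))  # 16
```

Note that the obstacle at `[1, 1]` acts as a wall for its four neighbours, which is why `[0, 1]` is blocked on the right and `[1, 0]` is blocked upwards.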

### Run the Nash Q Learning algorithm

How well the algorithm performs depends on a set of parameters (the maximum number of iterations, the discount factor, the learning rate, ...) that may require some tuning. In general, the epsilon-greedy decision strategy outperforms both the purely random and the purely greedy strategies.
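
To make the three decision strategies concrete, here is a minimal single-agent sketch of epsilon-greedy selection. It is an illustration only, not the `NashQLearning` internals: in Nash Q-learning the greedy branch would normally follow a Nash equilibrium of the stage game rather than a plain maximum, and the function name `epsilon_greedy` and the dictionary layout are assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict mapping action name -> estimated value.

    With probability epsilon a uniformly random action is explored;
    otherwise the action with the highest estimate is exploited.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)

q = {'up': 1.5, 'down': -0.2, 'left': 0.0, 'right': 0.7, 'stay': 0.1}
print(epsilon_greedy(q, epsilon=0.0))  # always greedy: 'up'
print(epsilon_greedy(q, epsilon=0.5))  # explores about half of the time
```

Setting `epsilon` to 0 recovers the greedy strategy and setting it to 1 recovers the random one, so the intermediate value used in this notebook trades exploration against exploitation.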

In [11]:
nashQ = NashQLearning(grid,
                      max_iter = 2000,
                      discount_factor = 0.7,
                      learning_rate = 0.7,
                      epsilon = 0.5,
                      decision_strategy = 'epsilon-greedy')
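
For reference, the learning rate and discount factor set above enter the standard two-player Nash Q-learning update as $\alpha$ and $\gamma$. Whether `NashQLearning` implements exactly this form is an assumption; the rule is given here in its usual textbook shape for player $i$:

$$
Q_i(s, a^1, a^2) \leftarrow (1 - \alpha)\, Q_i(s, a^1, a^2) + \alpha \left[ r_i + \gamma \, \mathrm{Nash}Q_i(s') \right]
$$

Here $s$ is the joint state, $a^1$ and $a^2$ are the two players' actions, $r_i$ is the reward received by player $i$, and $\mathrm{Nash}Q_i(s')$ is player $i$'s payoff at a Nash equilibrium of the stage game defined by the current Q-values in the next joint state $s'$. With the values above, $\alpha = 0.7$ and $\gamma = 0.7$.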

In [12]:
#Retrieve the updated Q matrices (one per player) after fitting the algorithm
Q0, Q1 = nashQ.fit(return_history = False)

100%|████████████████████████████████████████████████████████████████████████████████| 57/57 [00:00<00:00, 2716.77it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 2000/2000 [08:18<00:00,  4.01it/s]


In [13]:
#Best path followed by each player given the values in the q tables
p0, p1 = nashQ.get_best_policy(Q0,Q1)

[[0, 0], [2, 0]]
[[0, 1], [2, 1]]
[[0, 2], [2, 2]]
[[1, 2], [1, 2]]


In [14]:
print('Player 0 follows the policy : %s of length %s' %('-'.join(p0),len(p0)))
print('Player 1 follows the policy : %s of length %s'%('-'.join(p1),len(p1)))

Player 0 follows the policy : up-up-right of length 3
Player 1 follows the policy : up-up-left of length 3


In this experiment, both players have found a shortest path to the reward: each reaches the reward tile [1, 2] in three moves, which is the minimum number of moves from its starting tile.
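
The two action sequences can be replayed by hand to confirm that they produce the joint positions printed by `get_best_policy`. The sketch below is independent of the library; the `[x, y]` convention and the action names (including `'stay'`) are assumptions inferred from the outputs above, and `replay` is a hypothetical helper.

```python
# Assumed convention: [x, y], with 'right' increasing x and 'up' increasing y.
MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0),
         'right': (1, 0), 'stay': (0, 0)}

def replay(start, actions):
    """Return the list of positions visited when applying actions from start."""
    x, y = start
    positions = [[x, y]]
    for action in actions:
        dx, dy = MOVES[action]
        x, y = x + dx, y + dy
        positions.append([x, y])
    return positions

path0 = replay([0, 0], ['up', 'up', 'right'])
path1 = replay([2, 0], ['up', 'up', 'left'])

# Print the joint position at each time step, one line per step.
for joint in zip(path0, path1):
    print(list(joint))
```

Both replayed paths end on `[1, 2]`, the reward tile, and the robots never share any other tile along the way.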
