In [1]:
from Covid19Environment import GraphicCovid19Environment, Covid19Environment, States, Actions
import numpy as np
import random
import math

# Reinforcement Learning

To explain it, I picked an important survival skill we all need today: the ability to go to Kiwi and back without meeting another person. We therefore create a survivor that stands on one tile and sees only the one tile directly in front of it. The survivor has 3 actions: look left, look right and move forward.

We create a Q-table of possible states. Its shape is 5 x 5 x 3: the first dimension is the state of the tile the survivor stands on, the second is the state of the tile it sees, and the third holds the 3 possible actions.

### Q Table
- Stand_states: null, someone, kiwi, house, border (5)
- See_states: null, someone, kiwi, house, border (5)
- Actions: left, right, forward (3)
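
As a minimal sketch of how such a table can be indexed: the enums below are hypothetical stand-ins for `States` and `Actions` from `Covid19Environment`. Only `NOTHING`, `KIWI`, `BORDER` and `FORWARD` appear by name in this notebook, and only `FORWARD` being the last action index can be read off the printed Q rows; the other names and all the numeric values are assumptions made for illustration.

```python
from enum import IntEnum

import numpy as np


# Hypothetical stand-ins for the enums in Covid19Environment; the real
# member names and integer values may differ.
class DemoStates(IntEnum):
    NOTHING = 0
    SOMEONE = 1
    KIWI = 2
    HOUSE = 3
    BORDER = 4


class DemoActions(IntEnum):
    LEFT = 0
    RIGHT = 1
    FORWARD = 2


def greedy_action(q_table, stand, see):
    """Return the action with the highest Q value for the (stand, see) pair."""
    return DemoActions(int(np.argmax(q_table[stand, see])))


# One row of 3 action values for every (stand, see) combination.
demo_q = np.zeros((len(DemoStates), len(DemoStates), len(DemoActions)))
# Example row for "standing on nothing, seeing Kiwi".
demo_q[DemoStates.NOTHING, DemoStates.KIWI] = [-0.91, -0.78, 35.52]

print(greedy_action(demo_q, DemoStates.NOTHING, DemoStates.KIWI).name)
```

With that row, the greedy choice is to move forward onto Kiwi.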

### Rewards:
- Taking 1 action: -1 (faster is better)
- Standing on Kiwi: +50 if it is first time, then 0 (it is safe to look around whilst inside)
- Standing with person: -10 (you might get infected)
- Getting home: +30, or +200 if the survivor has been to Kiwi, since that is the goal
- Stand on border: -1000 (do not want it to get out of the radius)
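
The reward scheme can be sketched as a small function. This is not the code inside `Covid19Environment`; the function name, the string labels and the assumption that each step earns exactly one of the rewards above (rather than the step cost plus a tile bonus) are mine.

```python
def tile_reward(stand, visited_kiwi):
    """Reward for one action that leaves the survivor on tile type `stand`.

    `stand` is one of "nothing", "someone", "kiwi", "house", "border"
    (hypothetical labels). Assumes the rewards are mutually exclusive.
    """
    if stand == "border":
        return -1000                          # left the allowed radius
    if stand == "someone":
        return -10                            # risk of infection
    if stand == "kiwi":
        return 0 if visited_kiwi else 50      # bonus only on the first visit
    if stand == "house":
        return 200 if visited_kiwi else 30    # home after Kiwi is the goal
    return -1                                 # plain step cost: faster is better


# Five steps over empty tiles, then straight home without visiting Kiwi.
episode = ["nothing"] * 5 + ["house"]
total = sum(tile_reward(tile, visited_kiwi=False) for tile in episode)
print(total)
```

Under this reading the short trip sums to 25, which is consistent with the episodes in the training log that finish at step 5 with a reward of 25.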

In [2]:
def generate_random_others(num):
    possibilities = []
    for i in range(world_size[0]):
        for j in range(world_size[1]):
            possibilities.append([i, j])
    possibilities.remove([0, 0])
    possibilities.remove([world_size[0]-1, world_size[1]-1])
    possibilities.remove([math.floor(world_size[0]/2), math.floor(world_size[1]/2)]) # remove start pos

    others = []
    for i in range(num):
        rand = random.choice(possibilities)
        others.append(rand)
        possibilities.remove(rand)
    return others

In [17]:
# Parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.01             # Exponential decay rate for exploration prob

learning_rate = 0.7
discount_rate = 0.618

total_episodes = 300
steps = 300

world_size = (5, 5)

# Init Q table
q_table = np.zeros((5, 5, 3))
# If you see border, then you should not go forward
for i in range(5):
    q_table[i][States.BORDER.value][Actions.FORWARD.value] = -math.inf
#print(q_table)

for episode in range(total_episodes):
    done = False
    total_reward = 0
    env = Covid19Environment(q_table, epsilon, generate_random_others(2), world_size)
    for s in range(steps):
        done, new_reward = env.step()
        total_reward += new_reward
        # Stop the episode as soon as the environment reports it is done
        if done:
            break
    print(episode, " Finished: ", s, ", reward: ", total_reward, "Random/total actions: ", (env.random_actions*100)/env.total_actions)
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)

np.set_printoptions(precision=2, suppress=True)
print("\n\nQ Table:\n", q_table)

House reached
0  Finished:  63 , reward:  -51 Random/total actions:  100.0
House reached
1  Finished:  18 , reward:  12 Random/total actions:  100.0
Wooow! House reached after Kiwi!
2  Finished:  237 , reward:  -192 Random/total actions:  99.15966386554622
Wooow! House reached after Kiwi!
3  Finished:  89 , reward:  126 Random/total actions:  94.44444444444444
Wooow! House reached after Kiwi!
4  Finished:  222 , reward:  -52 Random/total actions:  98.20627802690584
House reached
5  Finished:  29 , reward:  1 Random/total actions:  96.66666666666667
House reached
6  Finished:  79 , reward:  -139 Random/total actions:  96.25
Wooow! House reached after Kiwi!
7  Finished:  233 , reward:  -91 Random/total actions:  91.88034188034187
House reached
8  Finished:  99 , reward:  -132 Random/total actions:  96.0
House reached
9  Finished:  51 , reward:  -120 Random/total actions:  92.3076923076923
Wooow! House reached after Kiwi!
10  Finished:  218 , reward:  -116 Random/total actions:  90.410958

Wooow! House reached after Kiwi!
136  Finished:  180 , reward:  91 Random/total actions:  25.414364640883978
Wooow! House reached after Kiwi!
137  Finished:  36 , reward:  217 Random/total actions:  18.91891891891892
Wooow! House reached after Kiwi!
138  Finished:  106 , reward:  158 Random/total actions:  18.69158878504673
House reached
139  Finished:  57 , reward:  -27 Random/total actions:  22.413793103448278
Wooow! House reached after Kiwi!
140  Finished:  139 , reward:  118 Random/total actions:  22.857142857142858
House reached
141  Finished:  5 , reward:  25 Random/total actions:  33.333333333333336
Wooow! House reached after Kiwi!
142  Finished:  217 , reward:  18 Random/total actions:  25.229357798165136
Wooow! House reached after Kiwi!
143  Finished:  55 , reward:  202 Random/total actions:  25.0
Wooow! House reached after Kiwi!
144  Finished:  62 , reward:  197 Random/total actions:  23.80952380952381
House reached
145  Finished:  5 , reward:  25 Random/total actions:  0.0
W

House reached
275  Finished:  17 , reward:  13 Random/total actions:  11.11111111111111
House reached
276  Finished:  11 , reward:  19 Random/total actions:  8.333333333333334
Wooow! House reached after Kiwi!
277  Finished:  87 , reward:  178 Random/total actions:  6.818181818181818
House reached
278  Finished:  9 , reward:  21 Random/total actions:  10.0
House reached
279  Finished:  91 , reward:  -61 Random/total actions:  6.521739130434782
Wooow! House reached after Kiwi!
280  Finished:  53 , reward:  237 Random/total actions:  3.7037037037037037
House reached
281  Finished:  65 , reward:  -35 Random/total actions:  6.0606060606060606
Wooow! House reached after Kiwi!
282  Finished:  111 , reward:  149 Random/total actions:  4.464285714285714
Wooow! House reached after Kiwi!
283  Finished:  21 , reward:  235 Random/total actions:  13.636363636363637
Wooow! House reached after Kiwi!
284  Finished:  186 , reward:  98 Random/total actions:  8.02139037433155
House reached
285  Finished: 
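
The falling share of random actions in the log follows the exponential epsilon schedule from the training cell. The sketch below recomputes that schedule on its own; `epsilon_after` is a helper name introduced here, and its defaults copy the parameters set above.

```python
import numpy as np


def epsilon_after(episode, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.01):
    """Exploration rate produced by the decay formula after `episode`."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)


# Print the exploration rate as a percentage at a few points in training.
for episode in (0, 136, 280):
    print(episode, round(100 * float(epsilon_after(episode)), 1))
```

The schedule starts at 100 % and drops to roughly a quarter around episode 136 and to well under 10 % around episode 280, in line with the "Random/total actions" column.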

Here you can run a graphical representation of what is happening (**see pop-up window**). It uses the learned Q table to make decisions. You can move the survivor yourself by pressing forward, left and right, or **press down to let the Q table make the decision**.

- <span style="color:green">Kiwi is a green rectangle</span>
- <span style="color:brown">House is a brown rectangle</span>
- <span style="color:orange">Others are orange rectangles</span>
- The body of the survivor is white, and the tile it sees is shown in a different colour: blue

<img src="covid_graphics1.png" alt="Graphical representation of the game" width="200"/>

<img src="covid_graphics2.png" alt="Graphical representation near the house" width="200"/>

In [None]:
# Graphical representation of the game
world_size = (10, 10)
env = GraphicCovid19Environment(generate_random_others(2), True, q_table)

PlottingGrid
FinishedPlotting
House added
Kiwi added
Survivor added
Others are added
[-2.09 -2.19  0.22]
States.NOTHING States.NOTHING
[-2.09 -2.19  0.22]
States.NOTHING States.NOTHING
[-2.09 -2.19  0.22]
States.NOTHING States.NOTHING
[-2.09 -2.19  0.22]
States.NOTHING States.BORDER
[-2.56 -0.76  -inf]
States.NOTHING States.NOTHING
[-2.09 -2.19  0.22]
States.NOTHING States.NOTHING
[-2.09 -2.19  0.22]
States.NOTHING States.NOTHING
[-2.09 -2.19  0.22]
States.NOTHING States.KIWI
[-0.91 -0.78 35.52]
States.KIWI States.BORDER
[  0.   0. -inf]
States.KIWI States.NOTHING
[ 0.    0.   -0.11]
States.KIWI States.NOTHING
[ 0.    0.   -0.11]
States.KIWI States.BORDER
[  0.   0. -inf]
States.KIWI States.BORDER
[  0.   0. -inf]
States.KIWI States.NOTHING
[ 0.    0.   -0.11]
States.KIWI States.NOTHING
[ 0.    0.   -0.11]
States.KIWI States.BORDER
[  0.   0. -inf]
States.KIWI States.BORDER
[  0.   0. -inf]
States.KIWI States.NOTHING
[ 0.    0.   -0.11]
States.KIWI States.NOTHING
[ 0.    0.   -0.11]
St

A surprising observation for me was that it is better to train the model on a 5x5 environment, because the survivor is then more likely to find Kiwi before coming home, and to apply that knowledge to a bigger world afterwards. It makes sense that the size of the world does not matter much, since the agent only knows 2 things: what it is standing on and what is in front of it.

## Deep QL

Using a Q-table shows its weaknesses in the covid19 environment, because the survivor can easily end up going in circles when the board gets bigger. The table keeps no memory of where the survivor has been. It would therefore have been more beneficial to use Deep QL.

In this case one could use the environment table as the view of a camera looking down on the grid, updating the survivor's position in it at each step. To make it more interesting we could let the "others" move around and send 2 tables at each step into a neural network that outputs 3 values (one estimated Q value for each of the 3 actions).
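
A minimal NumPy sketch of such a network follows. Everything in it is an assumption made for illustration: the helper names, the single hidden layer of 16 units, the random initial weights and the reading of "2 tables" as the previous and the current grid. It only shows the shapes involved, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_params(grid_shape=(5, 5), hidden=16, n_actions=3):
    """Random weights for a network with one hidden layer (hypothetical sizes)."""
    n_inputs = 2 * grid_shape[0] * grid_shape[1]  # two stacked tables
    return {
        "w1": rng.normal(0.0, 0.1, size=(n_inputs, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, size=(hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }


def action_values(params, previous_table, current_table):
    """Forward pass: two grids in, one value per action out."""
    x = np.concatenate([previous_table.ravel(), current_table.ravel()]).astype(float)
    hidden = np.maximum(0.0, x @ params["w1"] + params["b1"])  # ReLU
    return hidden @ params["w2"] + params["b2"]


# A 5x5 grid with a few non-zero marker cells; the survivor is marked 9.
table = np.array([
    [1, 0, 0, 0, 0],
    [0, 2, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 2, 0, 9, 0],
    [0, 0, 0, 0, 3],
])

params = init_params()
values = action_values(params, table, table)
print(values.shape)
```

Feeding two consecutive tables lets the network see how the "others" moved between steps, which a single snapshot cannot show.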

In [16]:
env.environment_table[3, 3] = 9 # place holder for survivor position
env.environment_table

array([[1, 0, 0, 0, 0],
       [0, 2, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 9, 0],
       [0, 0, 0, 0, 3]])

Updating a Q table and updating network weights share the first three steps and differ only in the last one:

 1. Choose an action from the current state, either at random or greedily, depending on epsilon
 2. Take that action and observe the reward and the next state
 3. Find the highest Q value available from that next state

For Q-table:

 4. Update the Q table with the new Q value, which is:
 $$Q_{\text{current}} + \text{learning\_rate} \cdot (\text{reward} + \text{discount\_rate} \cdot Q_{\text{highest}} - Q_{\text{current}})$$

For Deep QL:

 4. Update the weights by adding:
 $$\text{learning\_rate} \cdot (\text{reward} + \text{discount\_rate} \cdot Q_{\text{highest}} - Q_{\text{current}}) \cdot \nabla_{\text{weights}} Q_{\text{current}}$$
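
Both updates can be sketched in NumPy. The function names and argument layout are assumptions, not the code inside `Covid19Environment`. For the weight update, a linear model with one weight vector per action stands in for the neural network, so the gradient of the current Q value is simply the feature vector. The update uses the full temporal-difference error (target minus current estimate), as in the standard semi-gradient form.

```python
import numpy as np


def q_table_update(q_table, state, action, reward, next_state,
                   learning_rate=0.7, discount_rate=0.618):
    """Tabular update; `state` and `next_state` are (stand, see) index pairs."""
    stand, see = state
    highest_next = np.max(q_table[next_state])
    td_error = reward + discount_rate * highest_next - q_table[stand, see, action]
    q_table[stand, see, action] += learning_rate * td_error
    return td_error


def linear_q_update(weights, features, action, reward, next_features,
                    learning_rate=0.01, discount_rate=0.618):
    """Weight update for a linear stand-in model: Q(s, a) = weights[a] @ features."""
    current = weights[action] @ features
    highest_next = np.max(weights @ next_features)
    td_error = reward + discount_rate * highest_next - current
    # For a linear model the gradient of Q(s, a) w.r.t. weights[a] is `features`.
    weights[action] += learning_rate * td_error * features
    return td_error


# Tabular example: first time stepping onto Kiwi from an untrained table.
q = np.zeros((5, 5, 3))
q_table_update(q, state=(0, 2), action=2, reward=50, next_state=(2, 0))
print(q[0, 2, 2])

# Linear example with 4 made-up features.
w = np.zeros((3, 4))
f = np.ones(4)
linear_q_update(w, f, action=1, reward=10.0, next_features=f, learning_rate=0.1)
print(w[1])
```

In the tabular case only one cell of the table changes per step, while the weight update shifts every weight that contributed to the chosen action's estimate.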