In [1]:
from Covid19Environment import GraphicCovid19Environment, Covid19Environment, States, Actions
import numpy as np
import random
import math

# Reinforcement Learning

To explain it, I came up with an important survival skill we all need today: the ability to go to Kiwi and back without meeting another person. So we create a survivor that stands on one tile and sees only the one tile in front of it. The survivor can take 3 actions: look left, look right, and move forward.

We create a Q-table over the possible states. It has shape 5 x 5 x 3: the first dimension is what the survivor is standing on, the second is what it sees in front of it, and the third holds the 3 possible actions.

### Q Table
- Stand_states: null, someone, kiwi, house, border (5)
- See_states: null, someone, kiwi, house, border (5)
- Actions: left, right, forward (3)
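
As a rough illustration of how a table with this layout can be indexed, here is a minimal sketch. The `States` and `Actions` enums below are hypothetical stand-ins for the ones imported from `Covid19Environment` (only `NOTHING`, `BORDER` and `FORWARD` appear in this notebook; the other member names and all numeric values are assumptions), and `choose_action` is an illustrative epsilon-greedy helper, not the environment's own code.

```python
from enum import Enum
import random

import numpy as np


# Hypothetical stand-ins for the enums in Covid19Environment;
# the real member names and values may differ.
class States(Enum):
    NOTHING = 0
    SOMEONE = 1
    KIWI = 2
    HOUSE = 3
    BORDER = 4


class Actions(Enum):
    LEFT = 0
    RIGHT = 1
    FORWARD = 2


# One row of three action values for every (stand, see) pair.
q_table = np.zeros((len(States), len(States), len(Actions)))


def choose_action(q_table, stand, see, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(list(Actions))
    action_values = q_table[stand.value, see.value]
    return Actions(int(np.argmax(action_values)))


# Made-up value: standing on an empty tile, Kiwi straight ahead.
q_table[States.NOTHING.value, States.KIWI.value, Actions.FORWARD.value] = 5.0
best = choose_action(q_table, States.NOTHING, States.KIWI, epsilon=0.0)
print(best)
```

With `epsilon=0.0` the helper always exploits, so it picks whichever action has the largest value in the row for that (stand, see) pair.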

### Rewards:
- Taking 1 action: -1 (faster is better)
- Standing on Kiwi: +50 the first time, then 0 (it is safe to look around whilst inside)
- Standing with person: -10 (you might get infected)
- Getting home: +30, or +200 if the survivor has been to Kiwi first, since that is the goal
- Standing on border: -1000 (we do not want it to leave the radius)
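
The reward list above could be turned into code roughly as follows. This is a sketch under assumptions: the tile names are plain strings rather than the real `States` enum, tile rewards are assumed to replace the -1 step cost rather than add to it (consistent with the logged totals further down, for example 30 - 6 = 24, but not confirmed by the environment source), and `q_update` is the standard tabular Q-learning rule, which may or may not be what `Covid19Environment.step()` does internally.

```python
import numpy as np

# Reward values copied from the list above.
STEP_COST = -1
KIWI_FIRST_VISIT = 50
PERSON_PENALTY = -10
HOME_PLAIN = 30
HOME_AFTER_KIWI = 200
BORDER_PENALTY = -1000


def step_reward(tile, visited_kiwi):
    """Return (reward, visited_kiwi) after one action that ends on `tile`."""
    if tile == "kiwi":
        # +50 on the first visit, 0 on every later action inside Kiwi.
        reward = 0 if visited_kiwi else KIWI_FIRST_VISIT
        return reward, True
    if tile == "someone":
        return PERSON_PENALTY, visited_kiwi
    if tile == "house":
        return (HOME_AFTER_KIWI if visited_kiwi else HOME_PLAIN), visited_kiwi
    if tile == "border":
        return BORDER_PENALTY, visited_kiwi
    return STEP_COST, visited_kiwi


def q_update(q_table, state, action, reward, next_state,
             learning_rate=0.7, discount_rate=0.618):
    """Standard tabular Q-learning update (assumed, not taken from the source)."""
    stand, see = state
    next_stand, next_see = next_state
    best_next = np.max(q_table[next_stand, next_see])
    td_target = reward + discount_rate * best_next
    q_table[stand, see, action] += learning_rate * (td_target - q_table[stand, see, action])


# Example: first step onto Kiwi from an empty tile, using made-up indices
# (0 = nothing, 2 = kiwi for states; 2 = forward for actions).
demo_table = np.zeros((5, 5, 3))
reward, visited = step_reward("kiwi", visited_kiwi=False)
q_update(demo_table, state=(0, 2), action=2, reward=reward, next_state=(2, 0))
print(reward, visited, demo_table[0, 2, 2])
```

The default `learning_rate` and `discount_rate` mirror the parameter values set in the training cell.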

In [4]:
# Parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.01             # Exponential decay rate for exploration prob

learning_rate = 0.7
discount_rate = 0.618

total_episodes = 300
steps = 300

world_size = (5, 5)

# Init Q table
q_table = np.zeros((5, 5, 3))
# If you see border, then you should not go forward
for i in range(5):
    q_table[i][States.BORDER.value][Actions.FORWARD.value] = -math.inf
#print(q_table)

for episode in range(total_episodes):
    done = False
    total_reward = 0
    env = Covid19Environment(q_table, epsilon, world_size)
    for s in range(steps):
        done, new_reward = env.step()
        total_reward += new_reward
        # Stop the episode as soon as the environment reports it is done
        if done:
            break
    print(episode, " Finished: ", s, ", reward: ", total_reward, "Random/total actions: ", (env.random_actions*100)/env.total_actions)
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)

np.set_printoptions(precision=2, suppress=True)
print("\n\nQ Table:\n", q_table)

House reached
0  Finished:  58 , reward:  -28 Random/total actions:  100.0
1  Finished:  299 , reward:  -248 Random/total actions:  100.0
Wooow! House reached after Kiwi!
2  Finished:  208 , reward:  52 Random/total actions:  99.52153110047847
House reached
3  Finished:  208 , reward:  -178 Random/total actions:  99.04306220095694
House reached
4  Finished:  6 , reward:  24 Random/total actions:  85.71428571428571
House reached
5  Finished:  15 , reward:  15 Random/total actions:  93.75
House reached
6  Finished:  132 , reward:  -102 Random/total actions:  95.48872180451127
House reached
7  Finished:  109 , reward:  -79 Random/total actions:  94.54545454545455
Wooow! House reached after Kiwi!
8  Finished:  194 , reward:  58 Random/total actions:  94.87179487179488
Wooow! House reached after Kiwi!
9  Finished:  110 , reward:  143 Random/total actions:  93.69369369369369
10  Finished:  299 , reward:  -244 Random/total actions:  92.33333333333333
Wooow! House reached after Kiwi!
11  Finis

House reached
131  Finished:  61 , reward:  -31 Random/total actions:  30.64516129032258
Wooow! House reached after Kiwi!
132  Finished:  85 , reward:  172 Random/total actions:  26.74418604651163
Wooow! House reached after Kiwi!
133  Finished:  145 , reward:  122 Random/total actions:  23.972602739726028
Wooow! House reached after Kiwi!
134  Finished:  101 , reward:  154 Random/total actions:  34.31372549019608
Wooow! House reached after Kiwi!
135  Finished:  193 , reward:  73 Random/total actions:  19.587628865979383
Wooow! House reached after Kiwi!
136  Finished:  18 , reward:  238 Random/total actions:  15.789473684210526
House reached
137  Finished:  12 , reward:  18 Random/total actions:  30.76923076923077
Wooow! House reached after Kiwi!
138  Finished:  22 , reward:  230 Random/total actions:  13.043478260869565
Wooow! House reached after Kiwi!
139  Finished:  44 , reward:  208 Random/total actions:  20.0
Wooow! House reached after Kiwi!
140  Finished:  14 , reward:  238 Random/
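
The falling share of random actions in the log follows from the epsilon schedule in the training loop. The sketch below only re-evaluates that same formula with the same parameter values; `epsilon_after` is a hypothetical helper name introduced here for illustration.

```python
import numpy as np

# Same values as in the training cell.
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.01


def epsilon_after(episode):
    """Epsilon computed at the end of `episode`, i.e. used by the next episode."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)


for episode in (0, 50, 140, 299):
    print(episode, round(float(epsilon_after(episode)), 3))
```

Around episode 140 the schedule gives an exploration rate of about 0.25, in the same range as the logged 13% to 34% of random actions. With only 300 episodes, epsilon ends near 0.06 and never gets close to `min_epsilon`.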

Here you can run a graphical representation of what is happening (**see pop-up window**). It uses the learned Q-table to make decisions. You can move the survivor yourself by pressing forward, left and right, or **press down to let it make a decision based on the Q-table**.

In [5]:
# Graphical representation of the game
env = GraphicCovid19Environment(True, q_table)

PlottingGrid
FinishedPlotting
add_house
add_house
add_kiwi
add_kiwi
add_survivor
add_survivor
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.BORDER
[-0.79 -2.61  -inf]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.BORDER
[-0.79 -2.61  -inf]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -2.57 12.4 ]
States.NOTHING States.NOTHING
[-2.57 -

A surprising observation for me was that it is better to train the model on a 5x5 environment and then apply that knowledge to a bigger world, since in a small world the agent has a better chance of finding Kiwi before coming home. It makes sense that changing the size of the world does not matter much, since the agent only knows 2 things: what it is standing on and what is in front of it.
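
To make the size-independence concrete, here is a small sketch of how an observation could be built from a grid. The `observe` function, the string tile names and the direction encoding are all hypothetical; the real environment's observation code is not shown in this notebook.

```python
def observe(grid, row, col, d_row, d_col):
    """Return (standing tile, tile in front); anything off the grid is 'border'."""
    def tile_at(r, c):
        if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
            return grid[r][c]
        return "border"

    return tile_at(row, col), tile_at(row + d_row, col + d_col)


# Two worlds of different sizes, each with Kiwi one tile east of the survivor.
small = [["nothing"] * 5 for _ in range(5)]
large = [["nothing"] * 20 for _ in range(20)]
small[2][3] = "kiwi"
large[10][11] = "kiwi"

# Facing east means d_row=0, d_col=1.
obs_small = observe(small, 2, 2, 0, 1)
obs_large = observe(large, 10, 10, 0, 1)
print(obs_small, obs_large)
```

Both calls return the same (stand, see) pair, so the same Q-table row is looked up regardless of how large the world is.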