In [5]:
from Covid19Environment import GraphicCovid19Environment, Covid19Environment, States, Actions
import numpy as np
import random
import math

# 9. Reinforcement Learning (Candidate no.: 10114)

To explain reinforcement learning, I came up with an important survival skill we all need today: the ability to go to Kiwi and back without meeting another person. We therefore create a survivor that stands on one tile and sees only the one tile in front of it. The survivor can take 3 actions: look left, look right, and move forward.

We create a Q-table over the possible states. Its shape is 5 x 5 x 3: the first dimension is the state of the tile the survivor stands on, the second is the state of the tile it sees, and the third holds the 3 possible actions.

### Q Table
- Stand_states: null, someone, kiwi, house, border (5)
- See_states: null, someone, kiwi, house, border (5)
- Actions: left, right, forward (3)
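
As a minimal sketch of how such a table can be indexed, the snippet below builds a 5 x 5 x 3 array and reads out the best action for one (stand, see) pair. `DemoState` and `DemoAction` are hypothetical stand-ins for the `States` and `Actions` enums imported from `Covid19Environment`; their member names and integer values are assumptions and may not match the real module.

```python
from enum import Enum
import numpy as np

# Hypothetical stand-ins for States / Actions; names and values are assumed.
class DemoState(Enum):
    NOTHING = 0
    SOMEONE = 1
    KIWI = 2
    HOUSE = 3
    BORDER = 4

class DemoAction(Enum):
    LEFT = 0
    RIGHT = 1
    FORWARD = 2

# One Q value per (stand state, see state, action) triple
demo_q = np.zeros((len(DemoState), len(DemoState), len(DemoAction)))

# Pretend the agent has learned that walking into Kiwi is good
demo_q[DemoState.NOTHING.value, DemoState.KIWI.value, DemoAction.FORWARD.value] = 5.0

# Greedy lookup: standing on nothing, seeing Kiwi
row = demo_q[DemoState.NOTHING.value, DemoState.KIWI.value]
best_action = DemoAction(int(np.argmax(row)))
print(demo_q.shape, best_action)
```

Because the state is only this pair of tile types, the table has 25 rows of 3 values no matter how large the world is.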

### Rewards:
- Taking 1 action: -1 (faster is better)
- Standing on Kiwi: +50 if it is first time, then 0 (it is safe to look around whilst inside)
- Standing with person: -10 (you might get infected)
- Getting home: +200 if the survivor has been to Kiwi (since that is the goal), otherwise +30
- Standing on border: -1000 (we do not want it to leave the radius)
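
The reward list above could be written as a single function, sketched below. This is not the code inside `Covid19Environment`: the function name, the string tile labels, and the choice to add the -1 step cost on top of the tile reward are all assumptions made for illustration.

```python
def reward_for_tile(tile, visited_kiwi):
    """Return (reward, visited_kiwi) after an action that ends on `tile`.

    Hypothetical helper: `tile` is one of "nothing", "someone", "kiwi",
    "house", "border". The step cost is assumed to be additive.
    """
    reward = -1  # every action costs one point, so faster is better
    if tile == "kiwi":
        if not visited_kiwi:
            reward += 50  # only the first visit pays off
        visited_kiwi = True
    elif tile == "someone":
        reward -= 10  # risk of infection
    elif tile == "house":
        reward += 200 if visited_kiwi else 30
    elif tile == "border":
        reward -= 1000  # leaving the allowed radius
    return reward, visited_kiwi

print(reward_for_tile("kiwi", False))
print(reward_for_tile("house", True))
```

The `visited_kiwi` flag is what makes the home reward depend on history, even though the Q-table state itself does not record it.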

In [6]:
def generate_random_others(num):
    """Return `num` distinct random [row, col] positions for the other people.

    Reads the global `world_size`. The two opposite corners and the centre
    tile (the start position) are never used.
    """
    possibilities = [[i, j] for i in range(world_size[0]) for j in range(world_size[1])]
    possibilities.remove([0, 0])
    possibilities.remove([world_size[0]-1, world_size[1]-1])
    possibilities.remove([world_size[0]//2, world_size[1]//2]) # remove start pos

    # Sample without replacement so that no two people share a tile
    return random.sample(possibilities, num)

In [7]:
# Parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.01             # Exponential decay rate for exploration prob

learning_rate = 0.7
discount_rate = 0.618

total_episodes = 300
steps = 300

world_size = (5, 5)

# Init Q table
q_table = np.zeros((5, 5, 3))
# If you see border, then you should not go forward
for i in range(5):
    q_table[i][States.BORDER.value][Actions.FORWARD.value] = -math.inf
#print(q_table)

for episode in range(total_episodes):
    done = False
    total_reward = 0
    env = Covid19Environment(q_table, epsilon, generate_random_others(2), world_size)
    for s in range(steps):
        done, new_reward = env.step()
        total_reward += new_reward
        # If done: finish episode
        if done:
            break
    #print(episode, " Finished: ", s, ", reward: ", total_reward, "Random/total actions: ", (env.random_actions*100)/env.total_actions)
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)

np.set_printoptions(precision=2, suppress=True)
print("\n\nQ Table:\n", q_table)



Q Table:
 [[[ -2.52  -2.51  32.85]
  [  4.02   0.76 162.92]
  [ -2.18   4.76 -14.19]
  [ -2.1   -0.89  48.66]
  [ -1.97   3.41   -inf]]

 [[  0.     0.     0.  ]
  [  0.     0.     0.  ]
  [  0.     0.     0.  ]
  [  0.     0.     0.  ]
  [  0.     0.     -inf]]

 [[-14.36 -13.82   1.74]
  [ -6.33   0.35  44.92]
  [-13.88  -8.74  -7.72]
  [ -7.66 -10.41   1.37]
  [-10.52 -13.84   -inf]]

 [[  0.     0.    11.29]
  [  0.     0.     0.  ]
  [  0.     0.01  -7.02]
  [  0.     0.     0.  ]
  [  0.     2.43   -inf]]

 [[  0.     0.     0.  ]
  [  0.     0.     0.  ]
  [  0.     0.     0.  ]
  [  0.     0.     0.  ]
  [  0.     0.     -inf]]]
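
To see how quickly exploration fades during training, the decay formula from the training loop can be evaluated on its own. The sketch below reuses the parameter values from the cell above; `epsilon_at` is a helper name introduced here for illustration.

```python
import numpy as np

max_epsilon, min_epsilon, decay_rate = 1.0, 0.01, 0.01

def epsilon_at(episode):
    """Exploration rate set after finishing the given episode (same formula as the loop)."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

for episode in (0, 100, 299):
    print(episode, round(float(epsilon_at(episode)), 3))
```

With these settings the agent still acts randomly in roughly 37% of its steps after 100 episodes and in roughly 6% by the last episode, so exploration never fully stops within the 300 training episodes.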


### Play it
Here you can run a graphical representation of what is happening (**see pop-up window**). It uses the learned Q table to make decisions. You can move the survivor yourself by pressing forward, left and right, or **press down to let the Q table choose the action**.

- <span style="color:green">Kiwi is a green rectangle</span>
- <span style="color:brown">House is a brown rectangle</span>
- <span style="color:orange">Others are orange rectangles</span>
- The body of the survivor is white, and the tile it sees is marked in a different colour (blue)

<img src="covid_graphics1.png" alt="Graphical representation of the game" width="200"/>

<img src="covid_graphics2.png" alt="Graphical representation near the house" width="200"/>

In [8]:
# Graphical representation of the game
world_size = (10, 10)
env = GraphicCovid19Environment(generate_random_others(2), True, q_table)

PlottingGrid
FinishedPlotting
House added
Kiwi added
Survivor added
Others are added
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.BORDER
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.KIWI
States.KIWI States.BORDER
States.KIWI States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.BORDER
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.NOTHING
States.NOTHING States.HOUSE
States.HOUSE States.BORDER
Wooow! House reached after Kiwi! 224
Total reward:  224


A surprising observation for me was that it is better to train the model in a 5x5 environment and then apply that knowledge in a bigger world, since in the small world the agent has a bigger chance of finding Kiwi before coming home. It makes sense that changing the size of the world does not matter much, since the agent only knows 2 things: what is on the tile it stands on and what is on the tile in front of it.

## Deep QL

Using a Q-table shows its weaknesses in the covid19 environment, because the survivor can easily end up going in circles when the board gets bigger: its state carries no memory of where it has been. Thus it would have been more beneficial to use Deep QL.

In this case one could easily use the environment table as the input, as if a camera were looking at the grid from above. One would need to update the survivor's position in the table at each step. To make it more interesting we could let the "others" move around, and thus send 2 tables at each step into a neural network that outputs 3 values (one estimated Q value per action).

In [16]:
env.environment_table[3, 3] = 9 # placeholder for survivor position
env.environment_table

array([[1, 0, 0, 0, 0],
       [0, 2, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 9, 0],
       [0, 0, 0, 0, 3]])
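
As a rough sketch of how a table like the one above could be fed to a network, the snippet below one-hot encodes the tile codes and pushes the result through a tiny two-layer network with random, untrained weights. The layer sizes, the one-hot encoding, and the plain NumPy forward pass are all assumptions chosen for illustration; a real Deep QL agent would use a trained model from a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Copy of the environment table printed above (9 marks the survivor)
grid = np.array([[1, 0, 0, 0, 0],
                 [0, 2, 0, 0, 0],
                 [0, 0, 0, 0, 0],
                 [0, 2, 0, 9, 0],
                 [0, 0, 0, 0, 3]])

# One-hot encode the tile codes so the network does not treat them as magnitudes
codes = np.array([0, 1, 2, 3, 9])
one_hot = (grid[..., None] == codes).astype(np.float32)  # shape (5, 5, 5)
x = one_hot.reshape(-1)                                  # flat input vector

# Untrained two-layer network: input -> 16 hidden units (ReLU) -> 3 action values
hidden_size = 16
w1 = rng.normal(scale=0.1, size=(hidden_size, x.size))
w2 = rng.normal(scale=0.1, size=(3, hidden_size))
hidden = np.maximum(0.0, w1 @ x)
action_values = w2 @ hidden

print(one_hot.shape, action_values.shape)
```

Unlike the Q-table state, this input contains the survivor's absolute position, which is the information the tabular agent lacks on bigger boards.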

Updating a Q table and updating network weights share the first three steps and differ only in the fourth:

 1. Choosing an action from the current state: a random one with probability epsilon, otherwise the one with the highest Q value
 2. Taking that action, finding reward and next state after that
 3. Finding the highest Q value available from that new state

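For Q-table:

4. Update the Q table with the new Q value, which is:
 $$Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$
 where $Q(s, a)$ is the current Q value, $\alpha$ the learning rate, $r$ the reward, $\gamma$ the discount rate and $\max_{a'} Q(s', a')$ the highest Q value in the new state.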
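
The tabular steps above can be sketched as two small functions. This is a minimal illustration, not the code inside `Covid19Environment.step()`: the function names and the `(stand, see)` tuple representation of a state are assumptions.

```python
import random
import numpy as np

learning_rate = 0.7    # same values as in the parameter cell
discount_rate = 0.618

def choose_action(q_table, state, epsilon):
    """Step 1: explore with probability epsilon, otherwise pick the best known action."""
    if random.random() < epsilon:
        return random.randrange(q_table.shape[2])
    return int(np.argmax(q_table[state]))

def q_update(q_table, state, action, reward, next_state):
    """Steps 3 and 4: look up the best next Q value and update the table in place."""
    stand, see = state
    highest_q_value = np.max(q_table[next_state])
    current_q_value = q_table[stand, see, action]
    new_q_value = current_q_value + learning_rate * (
        reward + discount_rate * highest_q_value - current_q_value)
    q_table[stand, see, action] = new_q_value
    return new_q_value

# Worked example with made-up numbers
demo_table = np.zeros((5, 5, 3))
demo_table[0, 2, 2] = 10.0                                  # a good action in the next state
action = choose_action(demo_table, (0, 0), epsilon=0.0)     # epsilon 0 means purely greedy
new_value = q_update(demo_table, (0, 0), action, reward=-1, next_state=(0, 2))
print(action, new_value)
```

In the example the updated entry becomes 0.7 * (-1 + 0.618 * 10) = 3.626, which shows how value found in a later state leaks back to the state before it.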
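
For Deep QL:

4. Update the weights $w$ by:
 $$\Delta w = \alpha \left( r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w) \right) \nabla_w Q(s, a; w)$$
 where $Q(s, a; w)$ is the network's current Q value, $\alpha$ the learning rate, $r$ the reward, $\gamma$ the discount rate and $\max_{a'} Q(s', a'; w)$ the highest Q value in the new state.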
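
To make the weight update concrete, here is a minimal sketch that swaps the neural network for a linear model, so that the gradient of the Q value with respect to the weights can be written by hand (it is simply the feature vector). The function names, the feature vectors and the learning rate of 0.1 are assumptions for illustration; a real Deep QL implementation would rely on automatic differentiation.

```python
import numpy as np

def q_values(weights, features):
    """Linear stand-in for the network: one Q value per action."""
    return weights @ features  # weights: (3, n_features), features: (n_features,)

def semi_gradient_update(weights, features, action, reward, next_features,
                         learning_rate=0.1, discount_rate=0.618):
    """Apply the weight update for one transition and return the TD error."""
    td_target = reward + discount_rate * np.max(q_values(weights, next_features))
    td_error = td_target - q_values(weights, features)[action]
    # For a linear model, the gradient of Q(s, a) w.r.t. row `action` is `features`
    weights[action] += learning_rate * td_error * features
    return td_error

# Worked example with made-up feature vectors
weights = np.zeros((3, 4))
features = np.array([1.0, 0.0, 0.0, 0.0])
next_features = np.array([0.0, 1.0, 0.0, 0.0])
td_error = semi_gradient_update(weights, features, action=2, reward=1.0,
                                next_features=next_features)
print(td_error, weights[2])
```

Only the weights that contributed to the chosen action's Q value move, and they move in proportion to the same error term that appears in the tabular update.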

Sources:
- [Free code camp](https://www.freecodecamp.org/news/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe/)
- [Taxi](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Taxi-v2/Q%20Learning%20with%20OpenAI%20Taxi-v2%20video%20version.ipynb)
- [Q Tables](https://itnext.io/reinforcement-learning-with-q-tables-5f11168862c8)