| <p style="text-align: left;">Name</p>               | Matr.Nr. | <p style="text-align: right;">Date</p> |
| --------------------------------------------------- | -------- | ------------------------------------- |
| <p style="text-align: left">Lion DUNGL</p> | 01553060 | 18.06.2020                            |

<h1 style="color:rgb(0,120,170)">Hands-on AI II</h1>
<h2 style="color:rgb(0,120,170)">Unit 9 (Assignment) -- Introduction to Reinforcement Learning -- Part II </h2>

<b>Authors</b>: Brandstetter, Schäfl <br>
<b>Date</b>: 08-06-2020

This file is part of the "Hands-on AI II" lecture material. The following copyright statement applies 
to all code within this file.

<b>Copyright statement</b>: <br>
This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

<h2>Exercise 0</h2>

- Import the same modules as discussed in the lecture notebook.
- Check if your module versions are correct.

In [1]:
import u9_utils as u9
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import gym

from typing import Any, Dict, Tuple
from gym.envs.toy_text import FrozenLakeEnv

# Set Seaborn plotting style.
sns.set()

In [2]:
u9.check_module_versions()

Installed Python version: 3.7 (✓)
Installed matplotlib version: 3.1.3 (✓)
Installed Pandas version: 1.0.3 (✓)
Installed Seaborn version: 0.10.1 (✓)
Installed OpenAI Gym version: 0.17.2 (✓)


All exercises in this assignment refer to the <i>FrozenLake-v0</i> environment of <a href="https://gym.openai.com"><i>OpenAI Gym</i></a>. This environment is described on its official <a href="https://gym.openai.com/envs/FrozenLake-v0/">OpenAI Gym website</a> as follows:<br>
<cite>Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.</cite>


There are <i>four</i> types of surfaces described in this environment:
<ul>
    <li><code>S</code> $\rightarrow$ starting point (<span style="color:rgb(0,255,0)"><i>safe</i></span>)</li>
    <li><code>F</code> $\rightarrow$ frozen surface (<span style="color:rgb(0,255,0)"><i>safe</i></span>)</li>
    <li><code>H</code> $\rightarrow$ hole (<span style="color:rgb(255,0,0)"><i>fall to your doom</i></span>)</li>
    <li><code>G</code> $\rightarrow$ goal (<span style="color:rgb(255,0,255)"><i>frisbee location</i></span>)</li>
</ul>


If not already done, more information on how to <i>install</i> and <i>import</i> the <code>gym</code> module is available in the lecture's notebook.

<h3 style="color:rgb(0,120,170)">States and actions</h3>
Experiment with the <i>FrozenLake-v0</i> environment as discussed during the lecture and explained in the accompanying notebook.

In [3]:
lake_environment = FrozenLakeEnv()
u9.set_seed(environment=lake_environment, seed=42)

In [4]:
lake_environment.render(mode=r'human')
current_state_id = lake_environment.s
print(f'\nCurrent state ID: {current_state_id}')


[41mS[0mFFF
FHFH
FFFH
HFFG

Current state ID: 0


The current position of the <i>disc retrieving</i> entity is displayed as a filled <span style="color:rgb(255,0,0)"><i>red</i></span> rectangle.

As we want to tackle this problem using our renowned <i>random search</i> approach, we have to analyse its applicability beforehand. Hence, the number of possible <i>actions</i> and <i>states</i> is of utmost importance, as we don't want to get lost in the depths of combinatorial explosion.
<ul>
    <li>Query the number of <i>actions</i> using the appropriate property of the lake environment.</li>
    <li>Query the number of <i>states</i> using the appropriate property of the lake environment.</li>
</ul>

In [5]:
num_actions = lake_environment.action_space.n
num_states = lake_environment.observation_space.n
print(f'The FrozenLake-v0 environment comprises <{num_actions}> actions and <{num_states}> states.')

The FrozenLake-v0 environment comprises <4> actions and <16> states.
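
To get a feeling for the combinatorial explosion mentioned above, the following minimal sketch counts the deterministic policies (one fixed action per state) for the sizes just queried. The values 4 and 16 are copied from the output above; treating random search as sampling from this policy space is an assumption made purely for illustration.

```python
# Sizes copied from the output above.
num_actions = 4
num_states = 16

# A deterministic policy assigns exactly one action to every state,
# so there are num_actions ** num_states different policies.
num_policies = num_actions ** num_states
print(f'Number of deterministic policies: {num_policies:,}')
# prints: Number of deterministic policies: 4,294,967,296
```

Even for this tiny lake, enumerating all policies is impractical, which motivates learning a Q-table instead.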


<h2>Exercise 1</h2>

- Create a q_table for the frozen lake environment.
- Apply $Q$-learning as it was done in the lecture to solve the environment.
- Test the learned policy and animate one (or more) exemplary episode.
- What do you observe? Does the agent learn anything useful? Discuss if something strange happens. Hint: print the q_table during training to better understand what is going on during learning.

In [6]:
q_table = np.zeros([lake_environment.observation_space.n, lake_environment.action_space.n])
q_table

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [7]:
print(f'Shape of Q-Table: {q_table.shape}')

Shape of Q-Table: (16, 4)


In [8]:
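from IPython.display import clear_output

def apply_q_learning(environment: FrozenLakeEnv, alpha: float = 0.1) -> None:
    """
    Solve the given environment by applying greedy Q-learning (no exploration).
    Updates the global q_table in place.
    """
    for i in range(1, 10001):
        state = environment.reset()
        done = False

        while not done:
            # always take the action with the highest current Q-value
            action = np.argmax(q_table[state])
            next_state, reward, done, info = environment.step(action)

            # Q-learning update without discounting (discount factor of 1)
            old_value = q_table[state, action]
            next_max = np.max(q_table[next_state])
            q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + next_max)

            state = next_state

        # print the Q-table once every 100 episodes instead of after every step
        if i % 100 == 0:
            clear_output(wait=True)
            print(f"Episode: {i}")
            print(q_table)

    print("Training finished.\n")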

In [9]:
%%time
from IPython.display import clear_output
apply_q_learning(lake_environment, 0.1)

Episode: 10000
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
Training finished.

CPU times: user 5.84 s, sys: 1.3 s, total: 7.14 s
Wall time: 5.84 s


In [10]:
total_epochs, total_dives = 0, 0
episodes = 100

captured_frames = [[] for _ in range(episodes)]

for episode in range(episodes):
    
    # reset variables
    epochs, dives = 0, 0
    state = lake_environment.reset()
    done = False
    
    while not done:
        epochs += 1
        
        # take best step with regard to the Q-table
        action = np.argmax(q_table[state])
        state, reward, done, info = lake_environment.step(action)
        
        # a dive ends the episode in Gym; count it and continue from the start state
        if done and lake_environment.s != 15:
            dives += 1
            state = lake_environment.reset()
            done = False
            
        captured_frames[episode].append({
            r'frame': lake_environment.render(mode=r'ansi'),
            r'state': state,
            r'action': action,
            r'reward': reward
        })
        
        # safety switch; abort if number of epochs exceeds the number of steps it takes the random search method to reach the goal
        if epochs == 260:
            break
        
    total_epochs += epochs
    total_dives += dives
    

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average dives per episode: {total_dives / episodes}")

Results after 100 episodes:
Average timesteps per episode: 260.0
Average dives per episode: 14.6


In [11]:
u9.animate_environment_search(frames=captured_frames[4], verbose=True, delay=0.1)

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG

Step No.: 260
State ID: 0
Action ID: 0
Reward: 0.0


In [12]:
u9.animate_environment_search(frames=captured_frames[40], verbose=True, delay=0.1)

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG

Step No.: 260
State ID: 0
Action ID: 0
Reward: 0.0


In [13]:
u9.animate_environment_search(frames=captured_frames[99], verbose=True, delay=0.1)

  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG

Step No.: 260
State ID: 8
Action ID: 0
Reward: 0.0


# Discussion

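<b>The agent always takes the same action: action (0).</b> Because the ice is slippery, this can result in a step to the left, up or down (see Assignment 8). From the start state, a step to the left or up runs into the border, so the agent only moves if the step goes down.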
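
The effect of the slippery ice on action (0) can be sketched without Gym. The snippet below is a simplified model and not the Gym code itself: it assumes that the executed direction is the intended one or one of its two perpendicular neighbours, each with probability 1/3, and it ignores holes and the goal. The helper names `move` and `slippery_outcomes` are hypothetical.

```python
from collections import Counter

# Action IDs as used by FrozenLake: 0 = Left, 1 = Down, 2 = Right, 3 = Up.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
GRID_SIZE = 4


def move(state: int, direction: int) -> int:
    """Deterministic move on the 4x4 grid; walking into a border keeps the position."""
    row, col = divmod(state, GRID_SIZE)
    if direction == LEFT:
        col = max(col - 1, 0)
    elif direction == DOWN:
        row = min(row + 1, GRID_SIZE - 1)
    elif direction == RIGHT:
        col = min(col + 1, GRID_SIZE - 1)
    elif direction == UP:
        row = max(row - 1, 0)
    return row * GRID_SIZE + col


def slippery_outcomes(state: int, action: int) -> Counter:
    """Count the next states over the three equally likely executed directions."""
    executed = [(action - 1) % 4, action, (action + 1) % 4]
    return Counter(move(state, direction) for direction in executed)


outcomes = slippery_outcomes(state=0, action=LEFT)
for next_state, count in sorted(outcomes.items()):
    print(f'next state {next_state}: probability {count}/3')
# prints:
# next state 0: probability 2/3
# next state 4: probability 1/3
```

Under these assumptions, action (0) never produces a step to the right, so an agent that only ever chooses action (0) cannot leave the first column.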

The reason for this is that the Q-table never gets updated during the learning process (see above). Let's look at a few training steps:
1. The initial state is state (0), and the action taken is the one with the highest estimated value in that state. We initialized the Q-table with zeros only. Therefore, action (0) is performed (if all values in an array are equal, np.argmax() returns index 0).
2. Action (0) in state (0) leads to no reward (the agent can only get a reward at the end).
3. The Q-table entry at [0, 0] is updated to (1 - alpha) * q_table[0, 0] + alpha * (reward + max_reward_at_next_state) := (1 - alpha) * <b>0</b> + alpha * (<b>0</b> + <b>0</b>) = <b>0</b>. Therefore, it stays the same.
4. Now the agent either stays at state (0) or goes to state (4) (one step down). Either way, the process practically starts again at step 1.

### So the Q-table never gets updated and remains in its initial state. This can be seen in the training output above.
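
The argument above can be replayed in isolation. The following sketch does not use the environment; it only assumes a zero-initialised table of shape (16, 4), alpha = 0.1 and a reward of 0, as in the training run, and applies the update for the two possible successor states of state (0).

```python
import numpy as np

alpha = 0.1
q_table = np.zeros([16, 4])

# With ties, np.argmax returns the first index, so action 0 is chosen.
state = 0
action = int(np.argmax(q_table[state]))

# Every transition that does not reach the goal yields a reward of 0.
reward = 0.0
for next_state in (0, 4):
    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])
    q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + next_max)

print(f'Greedy action: {action}')
print(f'Any non-zero entry: {bool(q_table.any())}')
# prints:
# Greedy action: 0
# Any non-zero entry: False
```

An all-zero table is a fixed point of the update as long as no reward is observed, which is exactly what happens without exploration.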

<h2>Exercise 2</h2>
Very likely your training in Exercise 1 was not successful. Try to add exploration to your algorithm (you might have to write a new function):
<ul>
    <li><code>I</code> $\rightarrow$ Draw a uniform random number between 0 and 1.</li>
    <li><code>II</code> $\rightarrow$ If the number is smaller than 0.1, sample a random action.</li>
    <li><code>III</code> $\rightarrow$ Otherwise, choose your action as usual.</li>
</ul>
    
- Apply the modified $Q$-learning again to solve the environment.
- Test the learned policy and animate one (or more) exemplary episode.
- What do you observe? Does the agent learn now?

In [14]:
q_table = np.zeros([lake_environment.observation_space.n, lake_environment.action_space.n])
q_table

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [15]:
from IPython.display import clear_output

def apply_q_learning_exploration(environment: FrozenLakeEnv, alpha: float = 0.1) -> None:
    """
    Solve the given environment by applying Q-learning with exploration.
    Updates the global q_table in place.
    """
    for i in range(1, 10001):
        state = environment.reset()
        done = False

        while not done:

            # <I>: Throw a random uniform number between 0 and 1.
            random_number = np.random.uniform()

            # <II>: If the number is smaller than 0.1, sample a random action.
            if random_number < 0.1:
                action = environment.action_space.sample()

            # <III>: Else: Choose your action as usual.
            else:
                action = np.argmax(q_table[state])

            next_state, reward, done, info = environment.step(action)

            # Q-learning update without discounting (discount factor of 1)
            old_value = q_table[state, action]
            next_max = np.max(q_table[next_state])
            q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + next_max)

            state = next_state

        # print the Q-table once every 100 episodes instead of after every step
        if i % 100 == 0:
            clear_output(wait=True)
            print(f"Episode: {i}")
            print(q_table)

    print("Training finished.\n")

In [16]:
%%time
from IPython.display import clear_output
apply_q_learning_exploration(lake_environment, 0.1)

Episode: 10000
[[0.78764564 0.73993716 0.74464231 0.71572907]
 [0.46824384 0.4432702  0.39433624 0.65194753]
 [0.5316228  0.44112318 0.43590215 0.47005476]
 [0.22563011 0.08538446 0.09458473 0.13342553]
 [0.78835366 0.383867   0.3947618  0.52974961]
 [0.         0.         0.         0.        ]
 [0.42253322 0.19569666 0.2605268  0.19158155]
 [0.         0.         0.         0.        ]
 [0.40168451 0.49888638 0.53086618 0.78894771]
 [0.52914774 0.7768856  0.48894954 0.52660474]
 [0.73318008 0.55043344 0.50000362 0.31364192]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.58640306 0.46298261 0.87415881 0.57793201]
 [0.84360624 0.94469809 0.86407091 0.88331754]
 [0.         0.         0.         0.        ]]
Training finished.

CPU times: user 9.08 s, sys: 1.4 s, total: 10.5 s
Wall time: 8.76 s


In [17]:
total_epochs, total_dives = 0, 0
episodes = 100

captured_frames_exploration = [[] for _ in range(episodes)]

for episode in range(episodes):
    # reset variables
    epochs, dives = 0, 0
    state = lake_environment.reset()
    done = False
    
    while not done:
        epochs += 1
        
        # take best step with regard to the Q-table
        action = np.argmax(q_table[state])
        state, reward, done, info = lake_environment.step(action)
        
        # a dive ends the episode in Gym; count it and continue from the start state
        if done and lake_environment.s != 15:
            dives += 1
            state = lake_environment.reset()
            done = False
            
        captured_frames_exploration[episode].append({
            r'frame': lake_environment.render(mode=r'ansi'),
            r'state': state,
            r'action': action,
            r'reward': reward
        })
        
        # safety switch; abort if number of epochs exceeds the number of steps it takes the random search method to reach the goal
        if epochs == 260:
            break
        
    total_epochs += epochs
    total_dives += dives

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average dives per episode: {total_dives / episodes}")

Results after 100 episodes:
Average timesteps per episode: 57.08
Average dives per episode: 0.31


In [18]:
u9.animate_environment_search(frames=captured_frames_exploration[4], verbose=True, delay=0.1)

  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m

Step No.: 31
State ID: 15
Action ID: 1
Reward: 1.0


In [19]:
u9.animate_environment_search(frames=captured_frames_exploration[40], verbose=True, delay=0.1)

  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m

Step No.: 62
State ID: 15
Action ID: 1
Reward: 1.0


In [20]:
u9.animate_environment_search(frames=captured_frames_exploration[99], verbose=True, delay=0.1)

  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m

Step No.: 36
State ID: 15
Action ID: 1
Reward: 1.0


# Observation

Now the Q-table is updated during learning and the agent actually learns. As a result, it rarely takes an involuntary dive (0.31 dives per episode on average, compared to 14.6 without exploration). Moreover, the average number of steps per episode (57.08) is far lower than with the random search algorithm from U8.

Because of the added exploration, the agent doesn't choose the same action (0) every time. Therefore, at some point the agent reaches the goal and receives a reward of 1. The corresponding Q-table entry is then updated and is no longer 0. This starts a chain reaction: in subsequent episodes, further entries can be updated, because next_max is now non-zero for every state from which an already updated state is reached, and so on. The agent is learning.
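
The chain reaction can be made visible on a toy problem. The sketch below is not FrozenLake: it assumes a hypothetical deterministic corridor of five states with a single "step right" action and a reward of 1 on entering the last state, and it reuses the same update rule with alpha = 0.1.

```python
import numpy as np

alpha = 0.1
num_states = 5                   # states 0..4, state 4 is the goal
q_values = np.zeros(num_states)  # one action ("step right") per state

for episode in range(1, 5):
    for state in range(num_states - 1):
        next_state = state + 1
        # only the transition into the goal is rewarded
        reward = 1.0 if next_state == num_states - 1 else 0.0
        q_values[state] = (1 - alpha) * q_values[state] + alpha * (reward + q_values[next_state])
    print(f'Episode {episode}: non-zero states = {np.flatnonzero(q_values).tolist()}')
# prints:
# Episode 1: non-zero states = [3]
# Episode 2: non-zero states = [2, 3]
# Episode 3: non-zero states = [1, 2, 3]
# Episode 4: non-zero states = [0, 1, 2, 3]
```

In this toy setting the non-zero values move one state further away from the goal per episode, which mirrors how the first rewarded episode in the lake gradually informs the states closer to the start.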