# The Agent-Environment Interaction

In this exercise, you will implement the interaction of a reinforcement learning agent with its environment. We will use the gridworld environment from the second lecture. You will find a description of the environment below, along with two pieces of relevant material from the lectures: the agent-environment interface and the Q-learning algorithm.

1. Create an agent that chooses its actions at random in this environment.

2. Create an agent that uses Q-learning. You can use initial Q values of 0, a stochasticity parameter for the $\epsilon$-greedy policy function $\epsilon=0.05$, and a learning rate $\alpha = 0.1$. But feel free to experiment with other settings of these three parameters.

3. Plot the mean total reward (i.e. the undiscounted return) obtained by the two agents for each episode. This kind of graph is called a **learning curve**, and it gives us an idea of how our agent's performance changes during training.
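
As a rough sketch of the averaging behind step 3, assuming that each agent is trained in several independent runs and that the return of every episode is recorded per run. The function name `mean_learning_curve` and the sample numbers are hypothetical and only serve as an illustration.

```python
def mean_learning_curve(returns_per_run):
    """Average the total reward of each episode across independent runs.

    returns_per_run: list of runs, each a list holding one undiscounted
    return per episode (all runs must have the same number of episodes).
    """
    num_runs = len(returns_per_run)
    num_episodes = len(returns_per_run[0])
    return [
        sum(run[episode] for run in returns_per_run) / num_runs
        for episode in range(num_episodes)
    ]


# Two made-up runs of three episodes each (illustrative numbers only).
curve = mean_learning_curve([[-20, -10, 4], [-30, -6, 6]])
print(curve)
```

The resulting list has one entry per episode and could be drawn with, for example, `matplotlib.pyplot.plot(curve)`, once per agent.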


## The agent-environment interface

<img src="img/agent-environment.png" style="width: 500px;" align="left"/> 

<br><br><br>

The interaction of the agent with its environment starts at decision stage $t=0$ with the observation of the current state $s_0$. (Notice that there is no reward at this initial stage.) The agent then chooses an action to execute at decision stage $t=1$. The environment responds by changing its state to $s_1$ and returning the numerical reward signal $r_1$. 
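
The loop described above can be sketched as follows. This is only one possible arrangement: the `reset`/`step` method names, the `CoinFlipEnvironment` stand-in, and `run_episode` are assumptions made for illustration and are not part of the exercise material.

```python
import random


class CoinFlipEnvironment:
    """Hypothetical stand-in environment: each step costs -1 and ends
    the episode with probability 0.5."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        # Initial state s_0; no reward is returned at this stage.
        return 0

    def step(self, action):
        next_state = 0
        reward = -1
        done = self.rng.random() < 0.5
        return next_state, reward, done


def run_episode(env, choose_action):
    """Run one episode and return the undiscounted return."""
    state = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = choose_action(state)           # agent acts on the current state
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
    return total_reward


rng = random.Random(1)
actions = ["UP", "RIGHT", "DOWN", "LEFT"]
episode_return = run_episode(CoinFlipEnvironment(), lambda state: rng.choice(actions))
print(episode_return)
```

Passing the policy as a function keeps the loop identical for the random agent and for a learning agent; only `choose_action` (and, for Q-learning, an update after each step) would differ.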


## The environment: Navigation in a gridworld

<img src="img/gold.png" style="width: 250px;" align="left"/>

The agent has four possible actions in each state (grid square): west, north, south, and east. The actions are unreliable. They move the agent in the intended direction with probability 0.8, and with probability 0.2, they move the agent in a random other direction. If the direction of movement is blocked, the agent remains in the same grid square. The initial state of the agent is one of the five grid squares at the bottom, selected randomly. The grid squares with the gold and the bomb are **terminal states**. If the agent finds itself in one of these squares, the episode ends. Then a new episode begins with the agent at a randomly selected initial state.
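
A minimal sketch of these dynamics, under several assumptions that the description leaves open: the grid is 5 by 5 with cells numbered 0 to 24 row by row starting at the bottom left (as in the code cells further down), only the outer edges block movement, and the three "other" directions are equally likely when an action fails. The names `actual_action` and `move` are hypothetical.

```python
import random

NUM_ROWS, NUM_COLS = 5, 5
ACTIONS = ["UP", "RIGHT", "DOWN", "LEFT"]  # north, east, south, west


def actual_action(intended, rng, p_intended=0.8):
    """Return the executed action: the intended one with probability
    p_intended, otherwise one of the other three (assumed equally likely)."""
    if rng.random() < p_intended:
        return intended
    return rng.choice([a for a in ACTIONS if a != intended])


def move(position, action):
    """Deterministic move on the grid; a blocked move leaves the position unchanged."""
    row, col = divmod(position, NUM_COLS)
    if action == "UP" and row < NUM_ROWS - 1:
        row += 1
    elif action == "DOWN" and row > 0:
        row -= 1
    elif action == "RIGHT" and col < NUM_COLS - 1:
        col += 1
    elif action == "LEFT" and col > 0:
        col -= 1
    return row * NUM_COLS + col


rng = random.Random(0)
start = rng.randrange(NUM_COLS)  # one of the five bottom squares
position = move(start, actual_action("UP", rng))
print(start, position)
```

Separating the noise (`actual_action`) from the geometry (`move`) makes each part easy to check on its own.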

You will use a reinforcement learning algorithm to compute the best policy for finding the gold in as few steps as possible while avoiding the bomb. For this, we will use the following reward function: $-1$ for each navigation action, an additional $+10$ for finding the gold, and an additional $-10$ for hitting the bomb. For example, the immediate reward for transitioning into the square with the gold is $-1 + 10 = +9$. Do not use discounting (that is, set $\gamma=1$).
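
The reward function can be written down in a few lines. This sketch assumes the cell indices used in the code cells further down (gold at 23, bomb at 18); `reward_for` and `is_terminal` are hypothetical helper names.

```python
GOLD_POSITION = 23  # cell indices as used in the notebook's code cells
BOMB_POSITION = 18


def reward_for(new_position):
    """Immediate reward for a navigation action that ends in new_position."""
    reward = -1  # cost of every navigation action
    if new_position == GOLD_POSITION:
        reward += 10
    elif new_position == BOMB_POSITION:
        reward -= 10
    return reward


def is_terminal(position):
    """The episode ends on the gold and bomb squares."""
    return position in (GOLD_POSITION, BOMB_POSITION)


print(reward_for(23), reward_for(18), reward_for(0))
```

With these definitions, reaching the gold yields $-1 + 10 = +9$ and hitting the bomb yields $-1 - 10 = -11$.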

## Q-Learning
For your reference, the pseudocode for the Q-learning algorithm is reproduced below (Reinforcement Learning, Sutton & Barto, 2018, Section 6.5, p. 131).
<img src="img/q.png" style="width: 720px;"/>
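
The two core pieces of the algorithm, the $\epsilon$-greedy action choice and the update $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$, might look as follows in Python. This is a sketch rather than a reference solution: the list-of-lists Q-table, the random tie-breaking, and the names `epsilon_greedy` and `q_learning_update` are assumptions.

```python
import random

ALPHA, EPSILON, GAMMA = 0.1, 0.05, 1.0
NUM_STATES, NUM_ACTIONS = 25, 4  # actions indexed as UP, RIGHT, DOWN, LEFT

# Q-table with all values initialised to 0.
q_table = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]


def epsilon_greedy(q_table, state, rng, epsilon=EPSILON):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(NUM_ACTIONS)
    best = max(q_table[state])
    # Break ties between equally good actions at random (an assumption).
    return rng.choice([a for a, q in enumerate(q_table[state]) if q == best])


def q_learning_update(q_table, state, action, reward, next_state, terminal):
    """Apply one Q-learning update in place."""
    # The value of a terminal state is 0, so the target is just the reward.
    target = reward if terminal else reward + GAMMA * max(q_table[next_state])
    q_table[state][action] += ALPHA * (target - q_table[state][action])


# Example: moving RIGHT from cell 22 into the gold at cell 23 (reward +9).
q_learning_update(q_table, state=22, action=1, reward=9, next_state=23, terminal=True)
print(q_table[22][1])
```

After this single update, the value of the chosen state-action pair moves a fraction $\alpha = 0.1$ of the way from 0 towards the target of 9.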


## Example of a learning curve

<img src="img/lc_example.png" style="width: 550px;" align="left"/>

<br><br><br><br>

This sample learning curve shows the reward obtained by a Q-learning agent across 500 episodes. Do not try to replicate this exact curve! It was computed using a different environment than the one described here.

In [18]:
# YOUR CODE HERE!

import random
import numpy as np

In [19]:
learning_rate = 0.1
epsilon = 0.05
q = 0  # initial Q value (a full Q-table needs one entry per state-action pair)

bomb_position = 18
gold_position = 23

actions = ["UP", "DOWN", "LEFT", "RIGHT"]

probability_intended = 0.8

In [20]:
num_rows = 5
num_cols = 5
num_cells = num_rows*num_cols
random_move_probability = 0.2

agent_position = np.random.randint(0, 5)
bomb_positions = np.array([18])
gold_positions = np.array([23])

terminal_states = np.array([bomb_positions, gold_positions])

# Additional rewards for entering a cell; the -1 cost per navigation action is not included here
rewards = np.zeros(num_cells)
rewards[bomb_positions] = -10
rewards[gold_positions] = 10

actions = ["UP", "RIGHT", "DOWN", "LEFT"]
num_actions = len(actions)

In [21]:
print(bomb_positions)
print(gold_positions)
print(terminal_states)
print(rewards)

[18]
[23]
[[18]
 [23]]
[  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.   0.   0. -10.   0.   0.   0.   0.  10.   0.]


In [23]:
agent_position = 1

In [13]:
old_position = agent_position
new_position = agent_position

In [14]:
action = "UP"

In [15]:
if action == "UP":
    candidate_position = old_position + num_cols
    if candidate_position < num_cells:
        new_position = candidate_position
elif action == "RIGHT":
    candidate_position = old_position + 1
    if candidate_position % num_cols > 0:
        new_position = candidate_position
elif action == "DOWN":
    candidate_position = old_position - num_cols
    if candidate_position >= 0:
        new_position = candidate_position
elif action == "LEFT":
    candidate_position = old_position - 1
    if candidate_position % num_cols < num_cols - 1:
        new_position = candidate_position
    
        
agent_position = new_position  

In [16]:
new_position

8

In [17]:
old_position

3

In [None]:
q_1_n = 0


In [24]:

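reward, next_q_values = -1, [0, 0, 0, 0]  # placeholder transition: step cost and zero-initialised Q(s', a')
q_1_n = (1 - learning_rate) * q_1_n + learning_rate * (reward + max(next_q_values))  # Q-learning update with gamma = 1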

In [25]:
max(-1,2,4,5)

5