# RL Framework

- Environment
- Agent

<div>
<img src="img/rl_base.png" width="500"/>
</div>

## Environment
### GridWorld
GridWorld is a 2D rectangular grid of size NxM. It has an **agent** starting in one of the grid squares and possible **rewards** in other grid squares.

In our initial setup, the GridWorld is a 3x4 grid with the agent starting in the bottom left corner. The world contains one blocking state, one positive reward and one negative reward.

The agent's **goal** is to receive the positive reward by moving up, down, left or right. The game ends as soon as any reward, positive or negative, is received.

<div>
<img src="img/grid_example.png" width="250"/>
</div>


In [1]:
# dimensions of the GridWorld
world_shape = (3, 4)

# initial position of the agent
agent_init_pos = (2, 0)

# list of blocking state positions
blocking_states = [(1, 1)]

# dictionary of rewards with key: position and value: reward
reward_states = {
    (0,3): 1,
    (1,3): -1
}

### Visualization
For now, only a human agent will interact with our environment, so we need a way to display the current state of the world.

The environment is represented by a 2D array. We will label
- empty states with 0,
- the agent with 4,
- blocking states with 8,
- rewards with their respective values.

In [5]:
import numpy as np

legend = {
    'empty': 0,
    'agent': 4,
    'blocking': 8
}

def render_environment(world_shape, agent_pos, blocking_states, reward_states):
    # initialize empty states
    states = np.ones(world_shape) * legend['empty']
    
    # add agent
    states[tuple(agent_pos)] = legend['agent']
    
    # add blocking states
    for blocking_state in blocking_states:
        states[blocking_state] = legend['blocking']
    
    # add rewards
    for state, reward in reward_states.items():
        states[state] = reward
    return states
    
render = render_environment(world_shape, agent_init_pos, blocking_states, reward_states)
print(render)


[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]
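
For a human player, the numeric array can be hard to read at a glance. As a minimal sketch, assuming a hypothetical symbol table `SYMBOLS` and helper `render_as_text` (neither is part of the original code), the same array could be shown with one character per state:

```python
import numpy as np

# Hypothetical symbol table; the characters are illustrative choices.
# Keys are the numeric labels used in the rendered array.
SYMBOLS = {0: '.', 4: 'A', 8: '#', 1: '+', -1: '-'}

def render_as_text(states):
    """Return the grid as a multi-line string, one character per state."""
    rows = []
    for row in states:
        # unknown labels are shown as '?'
        rows.append(' '.join(SYMBOLS.get(int(v), '?') for v in row))
    return '\n'.join(rows)

# the initial observation from above, written out by hand
states = np.array([[0., 0., 0., 1.],
                   [0., 8., 0., -1.],
                   [4., 0., 0., 0.]])
print(render_as_text(states))
# . . . +
# . # . -
# A . . .
```

Note that this only works while reward values do not collide with the labels 0, 4 and 8.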


### Actions
Possible actions in the GridWorld:
- up
- down
- right
- left

The agent cannot leave the bounds of the GridWorld or enter a blocking state; in either case it stays where it is.

In [6]:
possible_actions = {
    'up': np.array([-1, 0]),
    'down': np.array([1, 0]),
    'right': np.array([0, 1]),
    'left': np.array([0, -1])
}

def move_agent(agent_pos, action, world_shape, blocking_states):
    # move agent
    new_agent_pos = np.array(agent_pos) + possible_actions[action]
    
    # check if new position is blocked
    if tuple(new_agent_pos) in blocking_states:
        return agent_pos
    
    # check if new position is out of bounds
    if (new_agent_pos < 0).any() or (new_agent_pos >= world_shape).any():
        return agent_pos
        
    return tuple(new_agent_pos)

# Test some actions
actions = ['down', 'up', 'right', 'left', 'up', 'up']
new_agent_pos = agent_init_pos
render = render_environment(world_shape, new_agent_pos, blocking_states, reward_states)
print(render)
for action in actions:
    print(f'going {action}')
    new_agent_pos = move_agent(new_agent_pos, action, world_shape, blocking_states)
    render = render_environment(world_shape, new_agent_pos, blocking_states, reward_states)
    print(render)


[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]
going down
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]
going up
[[ 0.  0.  0.  1.]
 [ 4.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
going right
[[ 0.  0.  0.  1.]
 [ 4.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
going left
[[ 0.  0.  0.  1.]
 [ 4.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
going up
[[ 4.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
going up
[[ 4.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
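
The bounds check in `move_agent` relies on NumPy comparing the position with the world shape element by element. A minimal sketch of that check in isolation, where the helper name `is_out_of_bounds` is hypothetical and not part of the original code:

```python
import numpy as np

world_shape = (3, 4)

def is_out_of_bounds(pos, shape):
    """True if any coordinate is negative or reaches the grid size."""
    pos = np.array(pos)
    # both comparisons are element-wise: (row, col) against 0 and against shape
    return bool((pos < 0).any() or (pos >= np.array(shape)).any())

print(is_out_of_bounds((2, 0), world_shape))   # False: bottom left corner
print(is_out_of_bounds((3, 0), world_shape))   # True: one row below the grid
print(is_out_of_bounds((0, -1), world_shape))  # True: one column left of the grid
print(is_out_of_bounds((2, 3), world_shape))   # False: bottom right corner
```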


### Putting it together
We now have everything we need for the environment.

We implement it as a class that holds all required information (dimensions, agent position, ...) and can
- <code>reset</code> the GridWorld to its initial state
- <code>step</code> the environment by executing actions, returning new observations, calculating rewards and deciding whether the game has ended

In [7]:
class GridWorld:
    def __init__(self, world_shape, agent_init_pos, blocking_states, reward_states):
        self.world_shape = world_shape
        self.agent_init_pos = agent_init_pos
        self.blocking_states = blocking_states
        self.reward_states = reward_states
        
        self.agent_current_pos = self.agent_init_pos
        
    def reset(self):
        # reset agent position
        self.agent_current_pos = self.agent_init_pos
            
        # render initial observation
        observation = render_environment(self.world_shape, 
                                         self.agent_current_pos, 
                                         self.blocking_states, 
                                         self.reward_states)
        return observation
        
    def step(self, action):
        # execute action
        self.agent_current_pos = move_agent(self.agent_current_pos, 
                                            action, 
                                            self.world_shape, 
                                            self.blocking_states)
        
        # check if there is any reward and whether the game ended
        if self.agent_current_pos in self.reward_states:
            done = True
            reward = self.reward_states[self.agent_current_pos]
        else:
            done = False
            reward = 0
            
        # render observation
        observation = render_environment(self.world_shape, 
                                         self.agent_current_pos, 
                                         self.blocking_states, 
                                         self.reward_states)
        return observation, reward, done
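
The `reset`/`step` interface can be exercised without any keyboard input by replaying a fixed list of actions. The sketch below is self-contained and therefore uses a compact stand-in, `TinyGridWorld`, that returns the agent position instead of a rendered array; both `TinyGridWorld` and `run_scripted_episode` are hypothetical names introduced only for this illustration.

```python
MOVES = {'up': (-1, 0), 'down': (1, 0), 'right': (0, 1), 'left': (0, -1)}

class TinyGridWorld:
    """Compact stand-in for the GridWorld class (assumption: observations are positions)."""
    def __init__(self, world_shape, agent_init_pos, blocking_states, reward_states):
        self.world_shape = world_shape
        self.agent_init_pos = agent_init_pos
        self.blocking_states = blocking_states
        self.reward_states = reward_states
        self.agent_current_pos = agent_init_pos

    def reset(self):
        self.agent_current_pos = self.agent_init_pos
        return self.agent_current_pos

    def step(self, action):
        d_row, d_col = MOVES[action]
        row = self.agent_current_pos[0] + d_row
        col = self.agent_current_pos[1] + d_col
        # only move if the target is inside the grid and not blocked
        inside = 0 <= row < self.world_shape[0] and 0 <= col < self.world_shape[1]
        if inside and (row, col) not in self.blocking_states:
            self.agent_current_pos = (row, col)
        done = self.agent_current_pos in self.reward_states
        reward = self.reward_states.get(self.agent_current_pos, 0)
        return self.agent_current_pos, reward, done

def run_scripted_episode(env, actions):
    """Replay actions until the list is exhausted or the game ends; return the rewards."""
    env.reset()
    rewards = []
    for action in actions:
        _, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards

env = TinyGridWorld((3, 4), (2, 0), [(1, 1)], {(0, 3): 1, (1, 3): -1})
print(run_scripted_episode(env, ['up', 'up', 'right', 'right', 'right']))  # [0, 0, 0, 0, 1]
print(run_scripted_episode(env, ['right', 'right', 'right', 'up']))        # [0, 0, 0, -1]
```

The first action list walks along the top row to the positive reward; the second walks along the bottom row and steps up into the negative reward.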

## Agent
We use Python's built-in <code>input()</code> function to receive actions from the human agent.

In [8]:
def receive_action_input():
    # map keys to actions; 'q' is an additional action to exit from GridWorld
    key_to_action = {'w': 'up', 'a': 'left', 's': 'down', 'd': 'right', 'q': 'exit'}

    # keep prompting until a valid key is entered
    while True:
        key = input('move with w, a, s, d; exit with q - ')
        if key in key_to_action:
            return key_to_action[key]
    
receive_action_input()

move with w, a, s, d; exit with q -  ljkj
move with w, a, s, d; exit with q -  w


'up'

To act, the agent is shown the current observation and is then prompted to input an action.

In [10]:
def act(observation):
    # present observation to human agent
    print(observation)
    
    # obtain action from human agent
    action = receive_action_input()
    return action

## Agent-Environment Interaction

In [11]:
# initialize environment
env = GridWorld(world_shape, agent_init_pos, blocking_states, reward_states)

# reset environment and receive initial observation
obs = env.reset()
while True:
    # get action from agent
    action = act(obs)
    
    # exit loop if exit action
    if action == 'exit':
        break
    
    # execute action in environment
    obs, reward, done = env.step(action)
    print(f'went {action}, received reward {reward}')
    
    # reset environment if game ended
    if done:
        print('============= game over =============')
        obs = env.reset()

[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]


move with w, a, s, d; exit with q -  w


went up, received reward 0
[[ 0.  0.  0.  1.]
 [ 4.  8.  0. -1.]
 [ 0.  0.  0.  0.]]


move with w, a, s, d; exit with q -  s


went down, received reward 0
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 0
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  4.  0.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 0
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  4.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 0
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  4.]]


move with w, a, s, d; exit with q -  w


went up, received reward -1
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]


move with w, a, s, d; exit with q -  w


went up, received reward 0
[[ 0.  0.  0.  1.]
 [ 4.  8.  0. -1.]
 [ 0.  0.  0.  0.]]


move with w, a, s, d; exit with q -  w


went up, received reward 0
[[ 4.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 0
[[ 0.  4.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 0
[[ 0.  0.  4.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 1
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]


move with w, a, s, d; exit with q -  q


## Value
The value of a timestep is the cumulative future reward: the sum of all rewards received after that timestep, up to the end of the game.
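
Written as a formula, under the assumption that $r_t$ denotes the reward received for the action taken at timestep $t$ and $T$ is the number of steps in the game (this notation is introduced here for illustration):

$$
v_t = \sum_{k=t+1}^{T-1} r_k, \qquad v_{T-1} = 0
$$

For example, for the reward sequence $[0, 1, 2, 3, 4]$ this gives $v_0 = 1 + 2 + 3 + 4 = 10$, $v_3 = 4$ and $v_4 = 0$.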

In [19]:
def calculate_values(rewards):
    # create a copy of rewards, to not change the original input
    reversed_rewards = list(rewards)
    
    # reverse order, since we are interested in future rewards
    reversed_rewards.reverse()
    
    # shift rewards to start cumulation from the next timestep
    reversed_rewards = [0] + reversed_rewards[:-1]
    
    # calculate cumulative sum and reverse again, to obtain the original order
    values = np.cumsum(reversed_rewards)
    values = values[::-1]
    return values

rewards = [0, 1, 2, 3, 4]
calculate_values(rewards)



array([10,  9,  7,  4,  0])
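
As a cross-check, the same values can be computed with an explicit backward loop instead of reversing and shifting the list. This is a minimal sketch; the function name `calculate_values_loop` is hypothetical and not part of the original code.

```python
import numpy as np

def calculate_values_loop(rewards):
    """Value at timestep t = sum of all rewards received after t."""
    values = [0] * len(rewards)
    running_sum = 0
    # walk backwards: store the sum of later rewards, then add the current one
    for t in range(len(rewards) - 1, -1, -1):
        values[t] = running_sum
        running_sum += rewards[t]
    return np.array(values)

print(calculate_values_loop([0, 1, 2, 3, 4]))  # [10  9  7  4  0]
```

Both variants exclude the reward of the current timestep, so the last value is always 0.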

### Calculate values from a game 

In [31]:
env = GridWorld(world_shape, agent_init_pos, blocking_states, reward_states)
obs = env.reset()
done = False

# collect rewards from game
rewards = []
# also record agent position for later
agent_positions = []
while not done:
    action = act(obs)
    if action == 'exit':
        break
    agent_positions.append(env.agent_current_pos)
    obs, reward, done = env.step(action)
    rewards.append(reward)
    print(f'went {action}, received reward {reward}')
    
print(rewards)
values = calculate_values(rewards)
print(values)

[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 0
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  4.  0.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 0
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  4.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 0
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  4.]]


move with w, a, s, d; exit with q -  w


went up, received reward -1
[0, 0, 0, -1]
[-1 -1 -1  0]


### Visualize obtained values

In [32]:
def visualize_values(values, positions, world_shape):
    value_vis = np.zeros(world_shape)
    for p, v in zip(positions, values):
        value_vis[tuple(p)] = v
    return value_vis

visualize_values(values, agent_positions, env.world_shape)
    

array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [-1., -1., -1.,  0.]])
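
If the agent visits the same position more than once in a game, `visualize_values` keeps only the value of the last visit, because later entries overwrite earlier ones. One possible variant, sketched here under the assumption that the value of the first visit is the one of interest, skips positions that were already filled in; the function name `visualize_first_visit_values` is hypothetical.

```python
import numpy as np

def visualize_first_visit_values(values, positions, world_shape):
    """Like visualize_values, but keeps the value of the first visit to each position."""
    value_vis = np.zeros(world_shape)
    seen = set()
    for p, v in zip(positions, values):
        p = tuple(p)
        if p in seen:
            continue  # position already filled by an earlier visit
        seen.add(p)
        value_vis[p] = v
    return value_vis

# made-up trajectory that returns to the start position once
positions = [(2, 0), (1, 0), (2, 0), (2, 1)]
values = [3, 2, 1, 0]
print(visualize_first_visit_values(values, positions, (3, 4)))
```

Here the start position `(2, 0)` keeps the value 3 from its first visit rather than the value 1 from the second.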