# Agent-Environment Loop
In reinforcement learning (RL), an agent interacts with an environment. At every timestep $t$, the agent observes the state $s_t$ of the environment and chooses an action $a_t$ to perform. Given the agent's action, the environment gives the agent a reward $r_t$ and transitions to a new state $s_{t+1}$. This interaction is known as the agent-environment interface and is shown in the following diagram.

![The Agent-Environment Interface](../images/agent-environment.png)
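
Unrolling this loop over time produces a sequence of states, actions, and rewards. Using the indexing convention above, where the reward received for action $a_t$ is written $r_t$, the sequence can be written as:

$$
s_0, a_0, r_0, s_1, a_1, r_1, s_2, \dots
$$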

We can implement this interface in code by building an Agent class and an Environment class. The Agent class should have a method `choose_action(state)` which, given a `state`, returns a valid `action`. The Environment class should have a method `step(action)` which performs the given `action` in the environment and returns the `next_state`. In the implementation below, `step` also returns the reward and a `done` flag.
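
As a minimal sketch of this interface, the two classes below are hypothetical toy examples: a counter environment whose state is a single integer, and an agent that always picks the action `1`. They are assumptions made purely to illustrate the two method signatures and are not the grid-world developed later.

```python
class CounterEnvironment:
    """Hypothetical toy environment: the state is a single integer."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Perform the action (add it to the counter) and return the next state.
        self.state = self.state + action
        return self.state


class AlwaysIncrementAgent:
    """Hypothetical toy agent: ignores the state and always returns action 1."""

    def choose_action(self, state):
        return 1


env = CounterEnvironment()
agent = AlwaysIncrementAgent()

state = env.state
for t in range(3):
    action = agent.choose_action(state)  # agent picks an action
    state = env.step(action)             # environment transitions
    print("t =", t, "action =", action, "next state =", state)
```

The loop body has the same shape as the full agent-environment loop: the agent maps a state to an action, and the environment maps an action to the next state.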

One way to represent the `state` of an environment is to use a feature vector, where each element represents a feature of the environment. As an example, let's implement a simple grid-world environment where an agent needs to navigate from the bottom left of the grid-world to the top right. We can represent the `state` of the environment by a vector of the agent's x- and y-coordinates. In Python we can implement vectors using the `numpy` library.

In [1]:
# import the numpy library
import numpy as np

# We can convert lists in Python to numpy arrays.
bottom_left = [0,0] # [x, y]
print("List:", bottom_left)
bottom_left = np.array(bottom_left) # numpy array
print("Numpy array:", bottom_left)

# Numpy arrays are powerful because they behave like vectors.
# We can do vector addition and scalar multiplication using numpy arrays.
move_right = np.array([1,0])
print("Move right twice", bottom_left + 2 * move_right) # move right twice

List: [0, 0]
Numpy array: [0 0]
Move right twice [2 0]
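
To see why the conversion to a numpy array matters, the short comparison below applies the same `+` and `*` operators to plain Python lists and to numpy arrays. The variable names are chosen for illustration only.

```python
import numpy as np

position = [0, 0]
move_right = [1, 0]

# With plain lists, + concatenates and * repeats the list.
list_sum = position + move_right
list_twice = 2 * move_right
print("List +:", list_sum)
print("List *:", list_twice)

# With numpy arrays, the same operators work element-wise, like vectors.
array_sum = np.array(position) + np.array(move_right)
array_twice = 2 * np.array(move_right)
print("Array +:", array_sum)
print("Array *:", array_twice)
```

Only the numpy versions give the vector results `[1 0]` and `[2 0]`; the list versions produce the four-element lists `[0, 0, 1, 0]` and `[1, 0, 1, 0]`.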


Using numpy vectors, we can now build our simple grid-world environment as a Python class with methods and attributes. We need two methods. The `reset()` method puts the agent back on the bottom-left tile and sets the environment's `done` flag to `False`. The `step(action)` method, given an action, moves the agent to a new state in the environment.

In [2]:
class GridWorld():

    # The initialiser function.
    def __init__(self):
        # Current internal state of the environment.
        self.state = np.array([1,1])

        # We can use a dictionary to store all the actions.
        # So, actions[0] returns the numpy array [0 1].
        # Similarly actions[3] returns [-1 0]
        self.actions = {
            0: np.array([0,1]), # move up
            1: np.array([0,-1]), # move down
            2: np.array([1,0]),  # move right
            3: np.array([-1,0]) # move left
        }

        # Flag indicating if the environment is done. 
        self.done = False

    # A private function to check if the given state
    # is a terminal (death) state.
    def _is_terminal_state(self, state):
        if (state[0] <= 0 or state[0] >= 4 or
            state[1] <= 0 or state[1] >= 4):
            return True
        else:
            return False

    # A function to check if a state is the goal state.
    def _is_goal_state(self, state):
        if state[0] == 3 and state[1] == 3:
            return True
        else:
            return False

    # A function to step the environment.
    def step(self, action_id):
        action = self.actions[action_id]

        # Transition to next state.
        next_state = self.state + action

        # Check if action resulted in death.
        if self._is_terminal_state(next_state):
            self.done = True

        # Check if next state is goal state.
        if self._is_goal_state(next_state):
            self.done = True
            reward = 1
        else:
            reward = 0

        # Set environment internal state to next_state.
        self.state = next_state

        # Return tuple of (next_state, reward, done) for the agent.
        return (self.state, reward, self.done)

    def reset(self):
        # Reset done flag
        self.done = False

        # Reset agent position.
        self.state = np.array([1,1])

        return self.state
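
To visualise the layout these rules imply, the standalone sketch below copies the bounds from `_is_terminal_state` and `_is_goal_state` and prints a small map of the grid. The helper names `is_terminal`, `is_goal`, and `tile_symbol` are hypothetical and not part of the class.

```python
# Hypothetical standalone helpers that copy the bounds used in the class above.
def is_terminal(x, y):
    return x <= 0 or x >= 4 or y <= 0 or y >= 4


def is_goal(x, y):
    return x == 3 and y == 3


def tile_symbol(x, y):
    if is_terminal(x, y):
        return "#"  # death tile
    if is_goal(x, y):
        return "G"  # goal tile
    if (x, y) == (1, 1):
        return "S"  # start tile
    return "."      # ordinary tile


# Build rows from the top (y = 4) down to the bottom (y = 0).
rows = ["".join(tile_symbol(x, y) for x in range(5)) for y in range(4, -1, -1)]
for row in rows:
    print(row)
```

Under these bounds the playable area is the 3×3 block of tiles with coordinates 1 to 3, surrounded by a border of death tiles, with the start in the bottom-left corner and the goal in the top-right corner of that block.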


Next we can create an agent that randomly chooses actions in the environment and hopes for the best. Our agent needs a function `choose_action(state)` that returns the id of the agent's chosen action, in this case a random integer from the set {0,1,2,3}.

In [3]:
class RandomAgent():

    def __init__(self):
        pass

    def choose_action(self, state):
        # Use numpy to choose a random action.
        action_id = np.random.randint(4)

        return action_id
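
Because the agent is random, each run can produce a different episode. If reproducible runs are wanted, one option is a seeded variant like the sketch below. The class name `SeededRandomAgent` and the `seed` parameter are assumptions for illustration; the sketch uses numpy's `default_rng` generator rather than the global `np.random.randint` call above.

```python
import numpy as np


class SeededRandomAgent:
    """Hypothetical variant of a random agent with a fixed seed."""

    def __init__(self, seed=0):
        # default_rng returns a numpy Generator with its own random state.
        self.rng = np.random.default_rng(seed)

    def choose_action(self, state):
        # integers(4) draws uniformly from {0, 1, 2, 3}.
        return int(self.rng.integers(4))


# Two agents built with the same seed choose the same sequence of actions.
agent_a = SeededRandomAgent(seed=0)
agent_b = SeededRandomAgent(seed=0)
state = np.array([1, 1])
actions_a = [agent_a.choose_action(state) for _ in range(10)]
actions_b = [agent_b.choose_action(state) for _ in range(10)]
print(actions_a == actions_b)
```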

We can then create the agent-environment loop as follows:

In [4]:
def agent_environment_loop(agent, env):
    # Human-readable action names, used only for printing.
    action_strings = {
        0: "up.",
        1: "down.",
        2: "right.",
        3: "left."
    }
    
    # Reset the environment.
    state = env.reset()
    print("Agent's starting position:", state)

    while True:
        # Agent chooses an action.
        action = agent.choose_action(state)

        # Print the agent's action.
        print("Agent moved", action_strings[action])

        # Step the environment.
        next_state, reward, done = env.step(action)

        # Very important: update the current state so the agent's
        # next action is chosen from its new position.
        state = next_state

        if done:
            if reward > 0:
                print("Agent reached the goal!")
            else:
                print("Agent died.")
                
            # Exit the agent-environment loop.
            break

    print("Agent's final position:", state)

In [5]:
# Initialize agent and environment.
agent = RandomAgent()
env = GridWorld()

agent_environment_loop(agent, env)

Agent's starting position: [1 1]
Agent moved down.
Agent died.
Agent's final position: [1 0]


Let's make an agent that follows the optimal policy in this simple grid-world.

In [6]:
class OptimalAgent():

    def __init__(self):
        pass

    def choose_action(self, state):
        if state[0] == 3:
            action = 0 # move up
        else:
            action = 2 # move right

        return action

In [7]:
# Initialize agent and environment.
agent = OptimalAgent()
env = GridWorld()

agent_environment_loop(agent, env)

Agent's starting position: [1 1]
Agent moved right.
Agent moved right.
Agent moved up.
Agent moved up.
Agent reached the goal!
Agent's final position: [3 3]
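
One way to sanity-check that this policy takes the fewest possible steps: each action moves the agent one tile along one axis, so getting from the start `[1 1]` to the goal `[3 3]` needs at least $|3-1| + |3-1| = 4$ moves. The standalone sketch below copies the two action vectors and the policy rule from above and compares the step count against that bound. The names `policy`, `start`, and `goal` are assumptions for illustration.

```python
import numpy as np

# The two action vectors the policy uses, copied from the environment above.
actions = {
    0: np.array([0, 1]),  # move up
    2: np.array([1, 0]),  # move right
}


def policy(state):
    # Same rule as OptimalAgent.choose_action: go right until x == 3, then up.
    return 0 if state[0] == 3 else 2


start = np.array([1, 1])
goal = np.array([3, 3])

state = start.copy()
steps = 0
while not np.array_equal(state, goal):
    state = state + actions[policy(state)]
    steps += 1

# Manhattan distance: a lower bound on the number of single-tile moves.
min_steps = int(np.abs(goal - start).sum())
print("Steps taken:", steps)
print("Lower bound:", min_steps)
```

Both numbers come out as 4, matching the four moves printed in the episode above.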
