
Extending Q-learning with Function Approximation

In this exercise, you will implement a function approximator for the Python Q-learning agent. It will abstract a more fine-grained version of the problem space, making the problem tractable for the Q-Learner.

Downloading Exercise Files

Start by downloading the skeleton files for the Q-learning exercise. Copy the folder from the archive to your OpenNERO installation folder. At this point, a new mod should show up in OpenNERO whenever you run it.

The file agent.py contains 3 classes:

  1. MyTabularRLAgent: This class has the same implementation as QLearningAgent from Maze/agent.py. The mod lists this agent twice: once with the coarse-grained maze environment and once with the fine-grained maze environment.
  2. MyTilingRLAgent: This class has an empty implementation which just subclasses MyTabularRLAgent, making it behave like the tabular Q-Learning agent. You will need to complete the implementation of this class so that it implements a tiling approximator.
  3. MyNearestNeighborsRLAgent: This class has an empty implementation which just subclasses MyTabularRLAgent, making it behave like the tabular Q-Learning agent. You will need to complete the implementation of this class so that it implements a nearest neighbors approximator.

Learning parameters

The following parameters are available to you to tweak during training:

Epsilon parameter for the epsilon-greedy policy (varies between 0 and 1): The value of this parameter is controlled by the "Explore-Exploit" slider in the user interface. You can modify the value of epsilon during an episode and it will take effect immediately for every subsequent action. A high epsilon means more exploration. Keep it high at the start (around 0.9) and slowly reduce it as the agent learns more. Once the agent is fully trained, you can set this parameter to 0 so that it always follows the best known action.
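As a quick illustration of what epsilon controls, here is a generic epsilon-greedy action-selection sketch (this is not the code from agent.py; the function and variable names are assumptions):

    import random

    def epsilon_greedy(q_values, epsilon):
        """Pick a random action with probability epsilon, otherwise the greedy one.

        q_values: dict mapping each available action to its current Q-value estimate.
        """
        if random.random() < epsilon:
            return random.choice(list(q_values.keys()))  # explore
        return max(q_values, key=q_values.get)           # exploit the best known action

With epsilon near 1 the agent explores on almost every step; with epsilon at 0 it always takes the best known action.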

Starting Offset: The value of this parameter is controlled using another slider in the user interface. This parameter controls the spawn location of the agent at the beginning of the next episode: the agent will start from a random tile whose Manhattan distance from the source is equal to the Starting Offset. For example, if you set this value to 0, the agent will always start from the source. If you set it to 1, it will start from one of the two tiles that are adjacent to the source. A new value set for this parameter takes effect when the current episode ends and a new episode starts.
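To make the Manhattan-distance rule concrete, here is a small illustrative helper (hypothetical, not part of the skeleton) that lists the tiles a given Starting Offset could select from:

    def spawn_candidates(source, offset, rows=8, cols=8):
        """Return all coarse tiles whose Manhattan distance from the source equals offset.

        This simple sketch ignores walls; it only illustrates the distance rule.
        """
        sr, sc = source
        return [(r, c)
                for r in range(rows) for c in range(cols)
                if abs(r - sr) + abs(c - sc) == offset]

    # With the source at a corner: offset 0 gives only the source itself,
    # and offset 1 gives its two adjacent tiles.
    assert spawn_candidates((0, 0), 0) == [(0, 0)]
    assert spawn_candidates((0, 0), 1) == [(0, 1), (1, 0)]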

Problem statement

Tabular Q-Learning

You don't need to change the implementation of MyTabularRLAgent for this assignment. Instead, you will use this agent to familiarize yourself with the behavior of tabular Q-Learning as the baseline for implementing the function approximators in the next sections.

Try training the tabular Q-Learning agent in the coarse-grained maze environment. Start with very high values for Starting Offset so the agent starts very close to the goal. Also set the value of epsilon close to 1.0 so that the agent can explore different actions.

The visited tiles are marked by yellow or green markers, and the Q-values by blue squares in the four directions around the central markers. The distance of each blue square from the yellow/green tile indicates how large the Q-value is. Also, the size of the central yellow/green marker in each maze tile corresponds to the highest Q-value in that tile. The values are initially small, and you can see how they gradually change as a result of learning. The central marker turns green when the value of the tile is above 0, which means the agent has discovered some action in that tile which can eventually take it to the goal. The agent receives a reward of 100 on the final move when it reaches the goal and a reward of -1 when it moves from a cell to another cell that is not the goal. This reward structure encourages reaching the goal with the minimum number of moves (each non-goal move costs 1).
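For reference, the standard tabular Q-learning update that produces these values looks roughly like this (a generic sketch with assumed names and parameter values, not the exact code in MyTabularRLAgent):

    def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.8):
        """One tabular Q-learning step; Q is a dict keyed by (state, action).

        reward is -1 for an ordinary move and 100 for the move that reaches the goal.
        alpha (learning rate) and gamma (discount factor) are assumed values here.
        """
        best_next = max(Q.get((next_state, a), 0.0) for a in actions)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)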

Learning progresses especially when the agent is exploring a yellow tile and suddenly discovers an action that takes it to an adjacent green tile. At that point, the yellow marker becomes green, indicating that Steve can reach the goal from that tile by following the highest Q-values. In other words, to increase the learning speed you should make Steve explore the boundary between the green and yellow regions so that more yellow cells can become green. When enough tiles turn green in the vicinity of the spawn location, gradually lower the value of Starting Offset so that the agent starts farther from the goal and explores more of the maze. Also, gradually lower the value of epsilon so that the agent tries to reach the goal from those farther locations. In the end, the agent should be able to reach the goal starting from the source (Starting Offset = 0).

Now try running the tabular Q-Learning agent on the fine-grained maze environment. This environment divides the world into much smaller steps (64x64), resulting in a substantially larger state space. You'll likely notice that your agent is not performing as well in this larger state space. In such cases, it is helpful to implement a function approximator that abstracts the large state space into something more manageable. You will implement two such approximators.

Tiling Approximator

Begin by implementing a simple tiling approximator. This function should simply map the current state back to the standard 8x8 space.

For the simple tiling approximator, think of it this way: suppose you have only 8x8 memory, i.e., you can store just 64 value-function entries. Because of this limited memory, you divide the 64x64 world grid into 64 tiles of size 8x8 (just like in the figure below, where each of the four quadrants encapsulates 8x8 states). With such a coarse tiling, all of the 8x8 states falling into a single tile share the same value function. Of course, this does not help you much in terms of learning (and your agent might wander around aimlessly), because nearby states lying in a single tile are not distinguishable. But such a coarse tiling does save you space. However, do not spend too much time training your agent. For further details, refer to the FAQ page. As a next step, you will implement a nearest-neighbors approximator in order to distinguish the states falling in a single tile.
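A minimal sketch of the coordinate mapping this implies, assuming states are (row, column) pairs in the 64x64 grid (the helper name is mine, not the skeleton's):

    def to_coarse_tile(row, col, fine_size=64, coarse_size=8):
        """Map a fine-grained (64x64) cell onto the coarse (8x8) tile containing it.

        Each coarse tile covers an 8x8 block of fine-grained cells, so all 64 cells
        in that block share a single set of Q-values.
        """
        scale = fine_size // coarse_size  # 8 fine cells per coarse tile
        return (row // scale, col // scale)

    # Example: fine-grained cells (0,0) through (7,7) all map to coarse tile (0,0).
    assert to_coarse_tile(0, 0) == (0, 0)
    assert to_coarse_tile(7, 7) == (0, 0)
    assert to_coarse_tile(8, 0) == (1, 0)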

Nearest-Neighbors Approximator

Once this is completed and verified, implement a nearest-neighbors approximator. This function should interpolate between the values of the three nearest 8x8 locations. Note that each location should only be interpolated if it is reachable; don't interpolate across walls! Below is an example of how the nearest-neighbors approximator should work.

Fig. 1. Nearest-neighbors approximator. The value of a given state/location is computed by taking a weighted sum of its 3 nearest neighbors. The equations for the tiling approximator can be derived by considering only one nearest neighbor. Note that since the OpenNERO environment is deterministic, Value(Y, Left) = Value(X).

In the figure, your agent is currently at state Y and is considering taking the action to move left to X. The function approximator would calculate the Q-value at X by interpolating the values of A, B, and C based on their Euclidean distance to X. For instance, X is 3.5 rows up and 1.5 columns to the left of A, so its distance is sqrt(3.5^2 + 1.5^2). Note that since the tiling of the approximator does not line up perfectly with the fine-grained space, every node in the fine-grained (64x64) space will be at least 0.5 rows and columns away from the nearest approximator (8x8) tile.
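One plausible way to turn those distances into an interpolated Q-value is inverse-distance weighting; the sketch below illustrates that idea and is not a required interface (your weighting scheme may differ):

    import math

    def interpolate(neighbors):
        """Inverse-distance-weighted average of the nearest reachable tiles' values.

        neighbors: list of (q_value, distance) pairs for tiles such as A, B and C.
        Distances are never zero, because every fine-grained cell is at least
        0.5 rows and columns away from the nearest coarse tile.
        """
        weights = [1.0 / d for _, d in neighbors]
        return sum(w * q for w, (q, _) in zip(weights, neighbors)) / sum(weights)

    # Distance of X from A as in the figure: 3.5 rows and 1.5 columns away.
    d_a = math.sqrt(3.5 ** 2 + 1.5 ** 2)  # about 3.81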

If X has a larger value than the tiles below or to the right of Y, (Y, Left) will be chosen as the epsilon-greedy (state, action) pair, and that is the value we must use to update the approximator tiles (A, B, and C). To do this, we simply take the reward received at X and perform the standard Q-learning update to all the approximator tiles.
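As a rough sketch of that update step (again an assumption about how you might organize it, building on the tabular update shown earlier rather than the skeleton's actual interface):

    def update_neighbor_tiles(Q, tiles, action, reward, best_next_value,
                              alpha=0.1, gamma=0.8):
        """Apply the standard Q-learning update to every approximator tile backing
        the chosen (state, action) pair, e.g. tiles A, B and C for (Y, Left).

        best_next_value: the interpolated value of the best action from the next state.
        alpha and gamma are assumed values, not the skeleton's defaults.
        """
        td_target = reward + gamma * best_next_value
        for tile in tiles:
            old = Q.get((tile, action), 0.0)
            Q[(tile, action)] = old + alpha * (td_target - old)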

In order to detect walls, your approximator can call get_environment().maze.walls to access the set of tuples containing the walls. The format for the walls set is ((r1,c1), (r2,c2)). For instance, to check if there is a wall between (3,4) and (3,5), your approximator could ask:

if ((3,4), (3,5)) in get_environment().maze.walls:
   print "There is a wall between (3,4) and (3,5)!"

The walls set uses the coarse-grained (8x8) mapping for rows and columns, which should work out well for your approximator.
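For example, a small helper along these lines (the name is hypothetical, and it assumes it runs inside agent.py where get_environment() is already available) could decide whether a neighbor tile should take part in the interpolation:

    def reachable(tile_a, tile_b):
        """True if no wall separates two adjacent coarse (8x8) tiles.

        Checks both orderings, since the walls set stores ((r1,c1), (r2,c2)) tuples.
        """
        walls = get_environment().maze.walls
        return (tile_a, tile_b) not in walls and (tile_b, tile_a) not in walls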

Frequently Asked Questions

You may check out the FAQ accumulated from previous classes.

Debugging

If you run into any bugs in your program, you can find the error log file for OpenNERO at one of the following locations:

  • Linux or Mac: ~/.opennero/nero_log.txt
  • Windows: "AppData\Local\OpenNERO\nero_log.txt" or "Local Settings\Application Data\OpenNERO\nero_log.txt" depending on the version of Windows you have.