skyline_rl_lab

We are going to implement and experiment with RL algorithms in this repo for our research and tutoring purposes. Below we explain how to use this repo with a simple example.

Environment

For RL (reinforcement learning) to work, we need an environment to interact with. From the Skyline lab, we can list the supported environments as below:

>>> from skyline import lab
>>> lab.list_env()
===== GridWorld =====
This is an environment to showcase Skyline lab. The environment is a grid world where you can move up, down, right and left if you don't encounter an obstacle. When you obtain a reward (-1, 1, 2), the game is over. You can use env.info() to learn more.

Then we use the function make to create the desired environment, e.g.:

>>> grid_env = lab.make(lab.Env.GridWorld)
>>> grid_env.info()
- environment is a grid world
- x means you can't go there
- s means start position
- number means reward at that state
===========
.  .  .  1
.  x  . -1
.  .  .  x
s  x  .  2
===========

Available actions are indicated as follows:

>>> grid_env.available_actions()
['U', 'D', 'L', 'R']

To get the current state of an environment:

>>> grid_env.current_state
GridState(i=3, j=0)

The starting position (s) is at coordinates (3, 0) in this case.

Let's take an action and check how the state changes in the environment:

>>> grid_env.step('U')  # Take action 'Up'
ActionResult(action='U', state=GridState(i=2, j=0), reward=0, is_done=False, is_truncated=False, info=None)

>>> grid_env.current_state  # Get current state
GridState(i=2, j=0)

After taking action U, we expect the i coordinate to go from 3 to 2 (one row up), and we can confirm this from the returned ActionResult. Let's reset the environment by calling the reset method, which brings the environment back to its initial state GridState(i=3, j=0):

>>> grid_env.reset()
>>> grid_env.current_state
GridState(i=3, j=0)

Experiments of RL algorithms

Here we are going to test some well-known RL algorithms and demonstrate the usage of this lab. Every RL method we implement must satisfy the protocol RLAlgorithmProto defined in rl_protos.py. We will take a look at the implementation of a few RL methods to see how they are used.
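
For intuition, the protocol boils down to the two calls used throughout this README: fit and play. The sketch below is only a hypothetical illustration (names and signatures are assumptions); the authoritative definition lives in rl_protos.py:

from typing import Any, Protocol


class RLAlgorithmProto(Protocol):
    """Hypothetical sketch; see rl_protos.py for the real definition."""

    def fit(self, env: Any) -> None:
        """Learns a policy by interacting with the given environment."""
        ...

    def play(self, env: Any) -> Any:
        """Takes one step in env using the learned policy and returns the step result."""
        ...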

Monte Carlo Method

In this method, we simply simulate many trajectories (decision processes), and calculate the average returns. (wiki page)
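
To make the idea concrete, below is a minimal, self-contained sketch of (every-visit) Monte Carlo value estimation over pre-collected trajectories. It only illustrates the idea; it is not the code in monte_carlo.py:

import collections


def mc_evaluate(episodes, gamma=0.9):
    """Estimates V(s) as the average discounted return observed after visiting s.

    episodes: list of trajectories, each a list of (state, reward) pairs.
    """
    returns = collections.defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the trajectory backwards, accumulating the discounted return.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


# Toy usage: two short episodes over states 'a' -> 'b'.
print(mc_evaluate([[('a', 0), ('b', 1)], [('a', 0), ('b', 2)]]))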

We implement this algorithm in monte_carlo.py. The code snippet below will initialize this RL method:

>>> from skyline.lab.alg import monte_carlo
>>> mc_alg = monte_carlo.MonteCarlo()

Each RL method object supports the method fit to learn from the given environment object. For example:

>>> mc_alg.fit(grid_env)

Then we can leverage the utility module gridworld_utils.py to print out the learned knowledge (we assume here it is importable as skyline.lab.gridworld_utils). Below is the learned value function from the Monte Carlo method:

>>> from skyline.lab import gridworld_utils  # import path assumed
>>> gridworld_utils.print_values(mc_alg._state_2_value, grid_env)
---------------------------
 1.18| 1.30| 1.46| 1.00|
---------------------------
 1.31| 0.00| 1.62|-1.00|
---------------------------
 1.46| 1.62| 1.80| 0.00|
---------------------------
 1.31| 0.00| 2.00| 2.00|

Then let's check the learned policy:

>>> gridworld_utils.print_policy(mc_alg._policy, grid_env)
---------------------------
  D  |  R  |  D  |  ?  |
---------------------------
  D  |  x  |  D  |  ?  |
---------------------------
  R  |  R  |  D  |  x  |
---------------------------
  U  |  x  |  R  |  ?  |

Finally, we can use the trained Monte Carlo object to interact with the environment. Below is sample code for reference:

# Play the game until done
grid_env.reset()

print(f'Begin state={grid_env.current_state}')
step_count = 0
while not grid_env.is_done:
    result = mc_alg.play(grid_env)
    step_count += 1
    print(result)

print(f'Final reward={result.reward} with {step_count} step(s)')

The execution would look like:

Begin state=GridState(i=3, j=0)
ActionResult(action='U', state=GridState(i=2, j=0), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='R', state=GridState(i=2, j=1), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='R', state=GridState(i=2, j=2), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='D', state=GridState(i=3, j=2), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='R', state=GridState(i=3, j=3), reward=2, is_done=True, is_truncated=False, info=None)
Final reward=2 with 5 step(s)

Random Method

This method takes random action(s) in the given environment. It is often used as a baseline to evaluate other RL methods. The code below will instantiate a Random RL method:

from skyline.lab.alg import random_rl

random_alg = random_rl.RandomRL()

The Random RL method doesn't require any training, so calling the fit method of random_alg returns immediately:

# Training
random_alg.fit(grid_env)

Since this is a random process, each time you play the game you will very likely get a different result:

# Play the game until done
grid_env.reset()

print(f'Begin state={grid_env.current_state}')
step_count = 0
while not grid_env.is_done:
    result = random_alg.play(grid_env)
    step_count += 1
    print(result)
print(f'Final reward={result.reward} with {step_count} step(s)')

Below is one execution example:

Begin state=GridState(i=3, j=0)
ActionResult(action='U', state=GridState(i=2, j=0), reward=0, is_done=False, is_truncated=False, info=None)
...
ActionResult(action='R', state=GridState(i=0, j=3), reward=1, is_done=True, is_truncated=False, info=None)
Final reward=1 with 16 step(s)

From the result above, the random RL method took more steps and is not guaranteed to obtain the best reward. Clearly, the Monte Carlo method performs much better than the Random RL method!
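
For reference, a baseline of this kind only needs the environment calls shown earlier (available_actions and step). The sketch below is a hypothetical illustration, not the code in random_rl.py:

import random


class UniformRandomAgent:
    """Baseline agent: fit is a no-op, play takes a uniformly random action."""

    def fit(self, env):
        # Nothing to learn for a random policy.
        pass

    def play(self, env):
        # Pick uniformly among the actions the environment currently allows.
        action = random.choice(env.available_actions())
        return env.step(action)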

How to rank RL methods

Before we introduce how the scoreboard works, we need to understand RLExaminer first. Basically, the scoreboard is designed to help you rank different RL methods.

RLExaminer

Every environment can have more than one examiner to calculate the score of an RL method, and each examiner may evaluate its own aspect (time, reward, etc.). Let's check the one used to score the grid environment:

# This examiner considers both reward and number of steps.
# (Import path assumed here.)
from skyline.lab import gridworld_env

examiner = gridworld_env.GridWorldExaminer()

Then, what is the score of the Monte Carlo method?

# Monte Carlo will get reward 2 by taking 5 steps.
# So the score will be reward / steps: 2 / 5 = 0.4
examiner.score(mc_alg, grid_env)

The Monte Carlo method gets a score of 0.4. Let's check another RL method, the Random method:

# The number of steps required by random RL method is unknown.
# Also the best reward is not guaranteed. So the score here will be random.
examiner.score(random_alg, grid_env)

The Random RL method usually gets a lower score than the Monte Carlo method.
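
Under the hood, an examiner of this kind essentially rolls the method out once and divides the final reward by the number of steps taken. A rough sketch under that assumption (not the actual GridWorldExaminer):

class RewardPerStepExaminer:
    """Scores an RL method as final reward divided by the number of steps taken."""

    def score(self, rl_method, env):
        # Assumes the environment needs at least one step to finish.
        env.reset()
        step_count = 0
        result = None
        while not env.is_done:
            result = rl_method.play(env)
            step_count += 1
        return result.reward / step_count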

Scoreboard

The Scoreboard calculates the scores of the given RL methods according to a specific examiner and then ranks those RL methods accordingly:

score_board = lab.Scoreboard()
sorted_scores = score_board.rank(
    examiner=examiner, env=grid_env, rl_methods=[random_alg, mc_alg])

The following output will be produced:

+-------+------------+---------------------+
| Rank. |  RL Name   |        Score        |
+-------+------------+---------------------+
|   1   | MonteCarlo |         0.4         |
|   2   |  RandomRL  | 0.13333333333333333 |
+-------+------------+---------------------+
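
Conceptually, the scoreboard just computes every method's score with the examiner and sorts the results in descending order (the real Scoreboard also renders the table above). A minimal sketch of that idea:

def rank(examiner, env, rl_methods):
    # Score each method, then sort from best (highest score) to worst.
    scores = [(method.__class__.__name__, examiner.score(method, env))
              for method in rl_methods]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)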
