## <font color='darkblue'>Preface</font>
Here we are going to use a toy testing environment `GridWorld` to demonstrate the usage of this lab.

### <font color='darkgreen'>Importing Packages</font>
Firstly, let's import all the necessary packages:

In [1]:
from skyline import lab
from skyline.lab import gridworld_env
from skyline.lab import gridworld_utils

### <font color='darkgreen'>Make Lab Environment</font>
We can list the supported environments as below:

In [2]:
lab.list_env()

===== GridWorld =====
This is a environment to show case of Skyline lab. The environment is a grid world where you can move up, down, right and leftif you don't encounter obstacle. When you obtain the reward (-1, 1, 2), the game is over. You can use env.info() to learn more.




Then we use the function <font color='blue'>make</font> to obtain the desired environment, e.g.:

In [3]:
grid_env = lab.make(lab.Env.GridWorld)

In [4]:
# Check what our environment looks like:
grid_env.info()

- environment is a grid world
- x means you can't go there
- s means start position
- number means reward at that state
.  .  .  1
.  x  . -1
.  .  .  x
s  x  .  2



In [5]:
# Show available actions
grid_env.available_actions()

['U', 'D', 'L', 'R']

In [6]:
# Get current state
grid_env.current_state

GridState(i=3, j=0)

Let's take an action and check the state change:

In [7]:
# Take action 'Up'
grid_env.step('U')

# Check current state
grid_env.current_state

GridState(i=2, j=0)

After taking action `U`, we expect axis `i` to move up from 3 to 2, and the output state confirms it. Let's reset the environment by calling the method <font color='blue'>reset</font>, which brings the environment back to its initial state `GridState(i=3, j=0)`:

In [8]:
# Reset environment
grid_env.reset()

# Check current state
grid_env.current_state

GridState(i=3, j=0)
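
To make the `step`/`reset` behaviour above concrete, here is a minimal, self-contained sketch of a grid world with the same layout. It is not the `skyline` implementation: `ToyGridWorld`, `ToyState`, `LAYOUT`, and `MOVES` are hypothetical names, and the rule that a blocked move leaves the agent in place is an assumption.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the lab's GridWorld; not the skyline implementation.
@dataclass(frozen=True)
class ToyState:
    i: int  # row index, 0 is the top row
    j: int  # column index, 0 is the leftmost column


# Same layout as printed by grid_env.info(): 'x' is an obstacle, 's' the start.
LAYOUT = [
    ['.', '.', '.', 1],
    ['.', 'x', '.', -1],
    ['.', '.', '.', 'x'],
    ['s', 'x', '.', 2],
]
MOVES = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}


class ToyGridWorld:
    def __init__(self):
        self.reset()

    def reset(self):
        """Puts the agent back on the start cell."""
        self.current_state = ToyState(3, 0)
        self.is_done = False

    def step(self, action):
        """Moves one cell; assumed to stay in place if the move is blocked."""
        di, dj = MOVES[action]
        i, j = self.current_state.i + di, self.current_state.j + dj
        if 0 <= i < 4 and 0 <= j < 4 and LAYOUT[i][j] != 'x':
            self.current_state = ToyState(i, j)
        cell = LAYOUT[self.current_state.i][self.current_state.j]
        # A numeric cell carries a reward and ends the game.
        self.is_done = isinstance(cell, int)
        return cell if self.is_done else 0


toy_env = ToyGridWorld()
toy_env.step('U')
print(toy_env.current_state)  # ToyState(i=2, j=0)
toy_env.reset()
print(toy_env.current_state)  # ToyState(i=3, j=0)
```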

## <font color='darkblue'>Experiments of RL algorithms</font>
Here we are going to test some well-known RL algorithms and demonstrate the usage of this lab:

<a id='monte_carlo_method'></a>
### <font color='darkgreen'>Monte Carlo Method</font>
<b><font size='3ptx'>In this method, we simply simulate many trajectories (<font color='darkbrown'>decision processes</font>), and calculate the average returns.</font></b> ([wiki page](https://en.wikiversity.org/wiki/Reinforcement_Learning#Monte_Carlo_policy_evaluation))
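
As a rough illustration of "simulate trajectories and average the returns", the sketch below implements first-visit Monte Carlo policy evaluation over pre-recorded episodes. It is independent of `monte_carlo.py`, whose internals are not shown here; the function names, the episode format (a list of `(state, reward)` pairs), and the discount factor `gamma=0.9` are all assumptions made for the example.

```python
from collections import defaultdict


def discounted_returns(rewards, gamma):
    """Returns G_t for every step t, where G_t = r_t + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
        returns.append(g)
    return returns[::-1]


def mc_state_values(episodes, gamma=0.9):
    """First-visit Monte Carlo: average the return seen after the first visit to each state."""
    samples = defaultdict(list)
    for episode in episodes:
        states = [state for state, _ in episode]
        rewards = [reward for _, reward in episode]
        returns = discounted_returns(rewards, gamma)
        seen = set()
        for state, g in zip(states, returns):
            if state not in seen:
                seen.add(state)
                samples[state].append(g)
    return {state: sum(gs) / len(gs) for state, gs in samples.items()}


# Two hand-written episodes starting in state 'A'.
episodes = [
    [('A', 0), ('B', 2)],  # A -> B -> reward 2
    [('A', -1)],           # A -> reward -1
]
print(mc_state_values(episodes))
```

With these two episodes, state `'A'` sees returns of 1.8 and -1, so its estimate is their average, about 0.4. More episodes give a less noisy average.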

We implement this algorithm in `monte_carlo.py`. The code below will demonstrate the usage of it:

In [9]:
from skyline.lab.alg import monte_carlo

In [10]:
mc_alg = monte_carlo.MonteCarlo()

In [11]:
grid_env.info()

- environment is a grid world
- x means you can't go there
- s means start position
- number means reward at that state
.  .  .  1
.  x  . -1
.  .  .  x
s  x  .  2



In [12]:
grid_env.random_action(gridworld_env.GridState(1, 0))

'U'

#### Training

In [13]:
%%time
# Training
mc_alg.fit(grid_env)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:07<00:00, 1295.75it/s]

CPU times: user 7.71 s, sys: 385 ms, total: 8.09 s
Wall time: 7.75 s





Let's check what value function we get:

In [14]:
gridworld_utils.print_values(mc_alg._state_2_value, grid_env)

---------------------------
 1.18| 1.31| 1.46| 1.00|
---------------------------
 1.31| 0.00| 1.62|-1.00|
---------------------------
 1.46| 1.62| 1.80| 0.00|
---------------------------
 1.31| 0.00| 2.00| 2.00|
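
The numbers in this table are consistent with a discount factor of 0.9 applied to the reward of 2 in the bottom-right corner. That is an inference from the printed values, not something the source states, so treat `GAMMA` below as an assumption. The sketch recomputes the value of a state that is `k` moves away from the move that collects the reward.

```python
# Assumed, not confirmed by the source: discount factor 0.9 and a single reward of 2.
GAMMA = 0.9
GOAL_REWARD = 2


def value_at_distance(k):
    """Discounted value when the rewarding move happens after k earlier moves."""
    return GOAL_REWARD * GAMMA ** k


for k in range(6):
    print(k, round(value_at_distance(k), 2))
# 0 2.0
# 1 1.8
# 2 1.62
# 3 1.46
# 4 1.31
# 5 1.18
```

The start cell is five moves from the goal (`U`, `R`, `R`, `D`, `R`), so `k=4` and the value is 1.31, which matches the bottom-left entry of the table.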


Then let's print the learned policy:

In [15]:
gridworld_utils.print_policy(mc_alg._policy, grid_env)

---------------------------
  R  |  R  |  D  |  ?  |
---------------------------
  D  |  x  |  D  |  ?  |
---------------------------
  R  |  R  |  D  |  x  |
---------------------------
  U  |  x  |  R  |  ?  |


#### Prior run

Finally, let's reset the environment and play the game:

In [16]:
# Play game until done
grid_env.reset()

print(f'Begin state={grid_env.current_state}')
step_count = 0
while not grid_env.is_done:
    result = mc_alg.play(grid_env)
    step_count += 1
    print(result)
print(f'Final reward={result.reward} with {step_count} step(s)')

Begin state=GridState(i=3, j=0)
ActionResult(action='U', state=GridState(i=2, j=0), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='R', state=GridState(i=2, j=1), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='R', state=GridState(i=2, j=2), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='D', state=GridState(i=3, j=2), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='R', state=GridState(i=3, j=3), reward=2, is_done=True, is_truncated=False, info=None)
Final reward=2 with 5 step(s)


In [17]:
# Show learned value function
# mc_alg._state_2_value

In [18]:
# Show learned Q table
# mc_alg._q

<a id='random_method'></a>
### <font color='darkgreen'>Random Method</font>
This method takes random actions in the given environment. It is often used as a baseline to evaluate other RL methods.
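
A random baseline needs very little code. The sketch below is a hypothetical stand-in, not the `random_rl.RandomRL` class: the name `ToyRandomAgent`, its `choose` method, and the `seed` parameter are assumptions introduced for illustration.

```python
import random


class ToyRandomAgent:
    """Hypothetical baseline that ignores the state entirely."""

    def __init__(self, actions, seed=None):
        self._actions = list(actions)
        # A private generator keeps runs reproducible when a seed is given.
        self._rng = random.Random(seed)

    def fit(self, env=None):
        """Nothing to learn, so training is a no-op."""
        return self

    def choose(self, state=None):
        """Picks one of the available actions uniformly at random."""
        return self._rng.choice(self._actions)


agent = ToyRandomAgent(['U', 'D', 'L', 'R'], seed=0).fit()
print([agent.choose() for _ in range(5)])
```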

In [19]:
from skyline.lab.alg import random_rl

In [20]:
random_alg = random_rl.RandomRL()

#### Train
The random method doesn't require any training, so the call below should finish almost instantly.

In [21]:
%%time
# Training
random_alg.fit(grid_env)

CPU times: user 12 µs, sys: 4 µs, total: 16 µs
Wall time: 30.3 µs


#### Prior run
Since this is a random process, each play of the game will give a different result:

In [22]:
# Play game until done
grid_env.reset()

print(f'Begin state={grid_env.current_state}')
step_count = 0
while not grid_env.is_done:
    result = random_alg.play(grid_env)
    step_count += 1
    print(result)
print(f'Final reward={result.reward} with {step_count} step(s)')

Begin state=GridState(i=3, j=0)
ActionResult(action='U', state=GridState(i=2, j=0), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='D', state=GridState(i=3, j=0), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='U', state=GridState(i=2, j=0), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='D', state=GridState(i=3, j=0), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='U', state=GridState(i=2, j=0), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='U', state=GridState(i=1, j=0), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='U', state=GridState(i=0, j=0), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='R', state=GridState(i=0, j=1), reward=0, is_done=False, is_truncated=False, info=None)
ActionResult(action='R', state=GridState(i=0, j=2), reward=0, is_done=False, is_truncated=False, info=No

From the result above, it is obvious that the <a href='#monte_carlo_method'><b>Monte Carlo Method</b></a> performs much better than the <a href='#random_method'><b>Random Method</b></a>!

## <font color='darkblue'>Scoreboard</font>
Before looking at how the scoreboard works, we need to understand <b><font color='blue'>RLExaminer</font></b> first.

### <font color='darkgreen'>RLExaminer</font>
Every environment can have more than one examiner to calculate the score of an RL method. Each examiner may evaluate the RL method from its own aspect (time, reward, etc.). Let's check the one used to calculate the average reward in the grid environment:

In [23]:
examiner = gridworld_env.GridWorldExaminer()

So, what is the score of `Monte Carlo Method`?

In [24]:
examiner.score(mc_alg, grid_env)

0.4

`Monte Carlo Method` got a score of 0.4. Let's check another RL method, `Random Method`:

In [25]:
examiner.score(random_alg, grid_env)

0.05

`Random Method` got a score of 0.05, which is less than that of `Monte Carlo Method`.
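
The source does not show how `GridWorldExaminer` computes its score. One reading that is consistent with the numbers above is final reward divided by the number of steps, averaged over several rounds: the Monte Carlo run shown earlier collected a reward of 2 in 5 steps, and 2 / 5 = 0.4. The sketch below implements that reading purely as an assumption; `reward_per_step_score` is a hypothetical helper, not part of the lab.

```python
def reward_per_step_score(episodes):
    """Averages final_reward / step_count over (final_reward, step_count) pairs."""
    ratios = [reward / steps for reward, steps in episodes]
    return sum(ratios) / len(ratios)


# One deterministic episode: reward 2 reached in 5 steps.
print(reward_per_step_score([(2, 5)]))  # 0.4
```

Under this reading, a random agent scores lower both because it wanders for more steps and because it sometimes ends on the -1 cell.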

### <font color='darkgreen'>Score Board</font>
<b><font color='blue'>Scoreboard</font></b> calculates the scores of the given RL methods with the specified examiner and then ranks those RL methods accordingly:

In [26]:
score_board = lab.Scoreboard()

In [27]:
sorted_scores = score_board.rank(
    examiner=examiner, env=grid_env, rl_methods=[random_alg, mc_alg])

+-------+------------+---------------------+
| Rank. |  RL Name   |        Score        |
+-------+------------+---------------------+
|   1   | MonteCarlo |         0.4         |
|   2   |  RandomRL  | 0.13333333333333333 |
+-------+------------+---------------------+


In [28]:
sorted_scores

[('MonteCarlo', 0.4), ('RandomRL', 0.13333333333333333)]
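
Conceptually, the ranking step is a sort by descending score. The sketch below shows that idea on the pairs returned above. It is a simplified illustration, not the `Scoreboard` implementation, and the helper name `rank` is chosen here only to mirror the method name.

```python
def rank(scores):
    """Sorts (name, score) pairs from best to worst."""
    return sorted(scores, key=lambda item: item[1], reverse=True)


ranked = rank([('RandomRL', 0.13333333333333333), ('MonteCarlo', 0.4)])
for position, (name, score) in enumerate(ranked, start=1):
    print(f'{position}. {name}: {score:.2f}')
# 1. MonteCarlo: 0.40
# 2. RandomRL: 0.13
```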

## <font color='darkblue'>A Real World RL problem (BCST test case selection)</font>
Here we are going to use a real-world example to explain how this lab works:

### <font color='darkgreen'>Explore the environment</font>

In [29]:
from skyline.lab import bcst_tc_env

In [30]:
bcst_env = bcst_tc_env.BCSTEnvironment()
bcst_examiner = bcst_tc_env.BCSTRewardCountExaminer()

In [31]:
bcst_env.available_actions()

['rl_test_case3',
 'rl_test_case5',
 'rl_test_case2',
 'rl_test_case4',
 'rl_test_case1']

In [32]:
bcst_env.available_states()[:10]

[(),
 ('rl_test_case3',),
 ('rl_test_case5',),
 ('rl_test_case2',),
 ('rl_test_case4',),
 ('rl_test_case1',),
 ('rl_test_case3', 'rl_test_case3'),
 ('rl_test_case3', 'rl_test_case5'),
 ('rl_test_case3', 'rl_test_case2'),
 ('rl_test_case3', 'rl_test_case4')]

In [33]:
bcst_env.step('rl_test_case1')
bcst_env.step('rl_test_case2')
bcst_env.current_state

('rl_test_case1', 'rl_test_case2')

In [34]:
bcst_env.step('rl_test_case3')
bcst_env.step('rl_test_case4')
bcst_env.current_state

('rl_test_case2', 'rl_test_case3', 'rl_test_case4')
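
The output above suggests that the state keeps only the three most recently selected test cases, since `rl_test_case1` drops out once a fourth action is taken. That window size is inferred from this single example rather than stated in the source. Under that assumption, the state update behaves like a bounded queue, which can be sketched with `collections.deque`:

```python
from collections import deque

MAX_HISTORY = 3  # Assumed window size, inferred from the output above.


def state_after(actions, max_history=MAX_HISTORY):
    """Returns the state tuple after taking `actions` in order."""
    history = deque(maxlen=max_history)  # Oldest entry is dropped automatically.
    for action in actions:
        history.append(action)
    return tuple(history)


print(state_after(['rl_test_case1', 'rl_test_case2']))
# ('rl_test_case1', 'rl_test_case2')
print(state_after(['rl_test_case1', 'rl_test_case2', 'rl_test_case3', 'rl_test_case4']))
# ('rl_test_case2', 'rl_test_case3', 'rl_test_case4')
```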

### <font color='darkgreen'>Random RL</font>
Let's check our baseline:

#### Train

In [35]:
%%time
# Training
random_alg.fit(bcst_env)

CPU times: user 8 µs, sys: 3 µs, total: 11 µs
Wall time: 16.7 µs


#### Prior run

In [36]:
bcst_env.reset()
random_alg.play(bcst_env)

ActionResult(action='rl_test_case2', state=('rl_test_case2',), reward=0, is_done=False, is_truncated=False, info=None)

#### Score

In [37]:
score = bcst_examiner.score(random_alg, bcst_env, play_round=10)
print(f'Score={score:.02f}')

Score=62.00


### <font color='darkgreen'>Monte Carlo Method</font>
Next, let's check `Monte Carlo Method` with two settings: one with `round_num=1000` and another with `round_num=5000` (the default is `round_num=10000`).

In [44]:
mc_alg_r1000 = monte_carlo.MonteCarlo(name='r1000', round_num=1000)
mc_alg_r5000 = monte_carlo.MonteCarlo(name='r5000', round_num=5000)

#### Train

In [45]:
%%time
# Training of Monte Carlo Method for smaller `round_num`
mc_alg_r1000.fit(bcst_env)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 189.74it/s]

CPU times: user 5.24 s, sys: 357 ms, total: 5.6 s
Wall time: 5.28 s





In [47]:
%%time
# Training of Monte Carlo Method for larger `round_num`
mc_alg_r5000.fit(bcst_env)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:35<00:00, 140.75it/s]

CPU times: user 35.1 s, sys: 2.51 s, total: 37.6 s
Wall time: 35.5 s





#### Score

In [48]:
score = bcst_examiner.score(mc_alg_r1000, bcst_env, play_round=10)
print(f'Monte Carlo Method ({mc_alg_r1000.name}) with Score={score:.02f}')

Monte Carlo Method (r1000) with Score=59.40


In [49]:
score = bcst_examiner.score(mc_alg_r5000, bcst_env, play_round=10)
print(f'Monte Carlo Method ({mc_alg_r5000.name}) with Score={score:.02f}')

Monte Carlo Method (r5000) with Score=139.50


### <font color='darkgreen'>Score Board</font>
Finally, let's check the ranking among supported RL methods:

In [50]:
sorted_scores = score_board.rank(
    examiner=bcst_examiner, env=bcst_env,
    rl_methods=[random_alg, mc_alg_r1000, mc_alg_r5000])

+-------+----------+-------+
| Rank. | RL Name  | Score |
+-------+----------+-------+
|   1   |  r5000   | 135.0 |
|   2   |  r1000   |  58.0 |
|   3   | RandomRL |  52.0 |
+-------+----------+-------+


## <font color='darkblue'>Supplement</font>
* [Udemy - Artificial Intelligence: Reinforcement Learning in Python](https://www.udemy.com/course/artificial-intelligence-reinforcement-learning-in-python/)