# Reinforcement Learning Tutorial Using OpenAI Gym toolkit

## Part I - Gym toolkit

### Frozen Lake Problem

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.

In [24]:
import numpy as np
import gym
import universe

In [25]:
lake = gym.make('FrozenLake8x8-v0') #loading the environment

[2019-02-18 16:47:06,890] Making new env: FrozenLake8x8-v0


In [26]:
lake.reset() #reset the environment to the initial state
lake.render() #render one frame of the environment for visualization


[41mS[0mFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG


In [27]:
print('action space:', lake.action_space) #LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3
print('state space:', lake.observation_space)

action space: Discrete(4)
state space: Discrete(64)


We can manually assign a state through `lake.env.s` and render one frame to inspect the result. Here, state 8 places the agent on the first cell of the second row.

In [28]:
lake.env.s = 8
lake.render()


SFFFFFFF
[41mF[0mFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG
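
The state set above is consistent with states being numbered row by row, starting from 0 in the top-left corner. The following is a minimal sketch of that mapping under this row-major assumption; `state_to_cell` and `cell_to_state` are hypothetical helper names, not part of Gym.

```python
N_COLS = 8  # width of the 8x8 map (row-major numbering is an assumption)

def state_to_cell(state, n_cols=N_COLS):
    """Return the (row, col) grid cell for a flat state index."""
    return divmod(state, n_cols)

def cell_to_state(row, col, n_cols=N_COLS):
    """Return the flat state index for a (row, col) grid cell."""
    return row * n_cols + col

print(state_to_cell(8))     # (1, 0): second row, first column
print(cell_to_state(7, 7))  # 63: bottom-right corner, where G sits
```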


The parameters $P$, $nS$, $nA$ and $\gamma$ are defined as follows:
* $nS$ `int`: number of states in the environment
* $nA$ `int`: number of actions in the environment
* $\gamma$ `float`: discount factor in the range $[0,1)$
* $P$: nested dictionary exposed by the environment (here `lake.env.P`) where, for each state in $[0, nS-1]$ and each action in $[0, nA-1]$, $P[state][action]$ is a list of tuples of the form *(probability, nextstate, reward, terminal)*, one tuple per possible outcome, where:
    * *probability* `float` is the probability of transitioning from *state* to *nextstate*
    * *nextstate* `int` denotes the state we transition to
    * *reward* `float` is the reward, either $0$ or $1$, for transitioning from *state* to *nextstate* with *action*
    * *terminal* `bool` is $True$ when *nextstate* is a terminal state *(H, G)* and $False$ otherwise

Its skeleton, with the transition lists left empty, is:

``` P = {s : {a : [] for a in range(nA)} for s in range(nS)} ```
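
One common use of this format is a one-step expected-return computation that combines $P$ and $\gamma$. The sketch below is illustrative only: `action_value`, `toy_P` and `toy_V` are hypothetical names, and the two-state model is made up for the example rather than taken from the lake.

```python
def action_value(P, V, state, action, gamma):
    """Expected one-step return of taking `action` in `state`."""
    total = 0.0
    for prob, next_state, reward, terminal in P[state][action]:
        # A terminal next state contributes no future value.
        future = 0.0 if terminal else V[next_state]
        total += prob * (reward + gamma * future)
    return total

# Hypothetical two-state, one-action model in the same tuple format.
toy_P = {
    0: {0: [(0.5, 0, 0.0, False), (0.5, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)]},
}
toy_V = [2.0, 0.0]  # assumed value estimates for states 0 and 1

# 0.5 * (0 + 0.5 * 2.0) + 0.5 * (1 + 0) = 1.0
print(action_value(toy_P, toy_V, state=0, action=0, gamma=0.5))
```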


In [40]:
lake.env.P[8] #shows P[8][.]

{0: [(0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 8, 0.0, False),
  (0.3333333333333333, 16, 0.0, False)],
 1: [(0.3333333333333333, 8, 0.0, False),
  (0.3333333333333333, 16, 0.0, False),
  (0.3333333333333333, 9, 0.0, False)],
 2: [(0.3333333333333333, 16, 0.0, False),
  (0.3333333333333333, 9, 0.0, False),
  (0.3333333333333333, 0, 0.0, False)],
 3: [(0.3333333333333333, 9, 0.0, False),
  (0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 8, 0.0, False)]}
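
The entries above are consistent with a slippery rule in which the intended direction and its two perpendicular neighbors are each taken with probability 1/3, and a move into a wall leaves the agent in place. This rule is inferred from the output rather than quoted from the library, so the sketch below is an assumption; `move` and `slippery_next_states` are hypothetical helpers.

```python
N = 8  # side length of the 8x8 map
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

def move(state, direction):
    """Move one cell in `direction`, staying in place at the border."""
    row, col = divmod(state, N)
    if direction == LEFT:
        col = max(col - 1, 0)
    elif direction == DOWN:
        row = min(row + 1, N - 1)
    elif direction == RIGHT:
        col = min(col + 1, N - 1)
    elif direction == UP:
        row = max(row - 1, 0)
    return row * N + col

def slippery_next_states(state, action):
    """Next states for the directions action - 1, action, action + 1 (mod 4)."""
    return [move(state, d % 4) for d in (action - 1, action, action + 1)]

for action in range(4):
    print(action, slippery_next_states(8, action))
```

Under this assumption, the lists printed for state 8 match the `nextstate` entries of `lake.env.P[8]` in the same order.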

In [30]:
# Step the environment by one timestep.
# Returns a tuple: (observation, reward, done, info)
lake.step(1) #take action 1 ("down")

(16, 0.0, False, {'prob': 0.3333333333333333})
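
The returned tuple can be read as one draw from the transition list `P[8][1]` shown earlier, with the `prob` entry reporting the probability of the outcome that was drawn. The sketch below mimics that behavior with the standard library only; `sample_step` is a hypothetical helper and not how Gym itself is necessarily implemented.

```python
import random

# P[8][1] copied from the output above (action 1 = DOWN from state 8).
P_8_DOWN = [
    (0.3333333333333333, 8, 0.0, False),
    (0.3333333333333333, 16, 0.0, False),
    (0.3333333333333333, 9, 0.0, False),
]

def sample_step(transitions, rng=random):
    """Draw one (next_state, reward, done, info) tuple from a transition list."""
    weights = [t[0] for t in transitions]
    prob, next_state, reward, done = rng.choices(transitions, weights=weights, k=1)[0]
    return next_state, reward, done, {'prob': prob}

print(sample_step(P_8_DOWN))  # one of the three outcomes, chosen at random
```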

In [31]:
lake.step(lake.action_space.sample()) #take one random action of all possible actions

(24, 0.0, False, {'prob': 0.3333333333333333})

In [32]:
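lake.reset() #start a new episode from the initial state
epochs = 0 #number of timesteps taken in this episode
total_reward = 0.0
done = False
frames = [] #rendered frames, kept for the animation below

while not done:
    action = lake.action_space.sample() #pick a random action
    state, reward, done, info = lake.step(action)
    total_reward += reward #accumulate the reward over the whole episode
    frames.append({
        'frame': lake.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        })
    epochs += 1
    
print("Timesteps taken:", epochs)
print('Total reward:', total_reward)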

Timesteps taken: 22
Total reward: 0.0


In [33]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'].getvalue())
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)

  (Up)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFF[41mH[0mFFFF
FHHFFFHF
FHFFHFHF
FFFHFFFG

Timestep: 22
State: 35
Action: 3
Reward: 0.0


Here is a demonstration of reaching the goal by following a uniformly random policy, repeating episodes until one of them ends at the goal:

In [36]:
iteration = 0
goal = 0
while goal==0:
    lake.reset()
    epochs = 0
    done = False
    frames = []
    while not done:
        action = lake.action_space.sample()
        state, reward, done, info = lake.step(action)
        goal += reward #becomes 1.0 once the goal is reached
        frames.append({
            'frame': lake.render(mode='ansi'),
            'state': state,
            'action': action,
            'reward': reward
            })
        epochs += 1
    iteration += 1

print('Total number of episodes:', iteration)
print("Timesteps taken:", epochs)
print('Total reward:', goal)

Total number of episodes: 620
Timesteps taken: 41
Total reward: 1.0


In [37]:
print_frames(frames)

  (Down)
SFFFFFFF
FFFFFFFF
FFFHFFFF
FFFFFHFF
FFFHFFFF
FHHFFFHF
FHFFHFHF
FFFHFFF[41mG[0m

Timestep: 41
State: 63
Action: 1
Reward: 1.0


As demonstrated above, the agent needed 620 episodes, and therefore thousands of time steps in total, to reach the goal once. This is because the policy is far from optimal (it takes a random action at each time step) and the agent is not learning how to improve it.
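
To get a rough feel for how rarely a random policy succeeds, the episode loop can be imitated without Gym. The sketch below is a stand-in simulation, not the Gym environment: the map is copied from the rendered output, the slippery rule (intended direction or one of its two perpendicular neighbors, each equally likely) is an assumption, and `random_episode`, `success_rate` and the `max_steps` cap are hypothetical.

```python
import random

# Map copied from the rendered output: S start, F frozen, H hole, G goal.
LAKE = [
    "SFFFFFFF",
    "FFFFFFFF",
    "FFFHFFFF",
    "FFFFFHFF",
    "FFFHFFFF",
    "FHHFFFHF",
    "FHFFHFHF",
    "FFFHFFFG",
]
N = len(LAKE)
# (row, col) offsets for LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3.
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]

def random_episode(rng, max_steps=200):
    """Run one random-policy episode; return True if the goal was reached."""
    row, col = 0, 0
    for _ in range(max_steps):
        action = rng.randrange(4)
        # Assumed slippery rule: intended direction or a perpendicular one.
        actual = (action + rng.choice((-1, 0, 1))) % 4
        d_row, d_col = MOVES[actual]
        row = min(max(row + d_row, 0), N - 1)
        col = min(max(col + d_col, 0), N - 1)
        if LAKE[row][col] == "G":
            return True
        if LAKE[row][col] == "H":
            return False
    return False  # step cap reached without finishing

def success_rate(episodes, seed=0):
    """Fraction of random-policy episodes that end at the goal."""
    rng = random.Random(seed)
    wins = sum(random_episode(rng) for _ in range(episodes))
    return wins / episodes

print(f"Random-policy success rate: {success_rate(10_000):.4f}")
```

The exact figure depends on the seed and on the assumed dynamics, so it should be read as an order-of-magnitude estimate only.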