FrozenLake-v0
====

![frozenlake.jpg](frozenlake.jpg)

https://github.com/yandexdataschool/Practical_RL/blob/master/week0/frozenlake.ipynb

https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0

https://gym.openai.com/evaluations/eval_BuxTzFMwTfKQr2mCwos1uA


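The code at the links above all uses a tabular Q-function and generates random exploratory actions with something like epsilon-greedy. Scores come out at roughly 30 to 80 and are hard to push higher.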

    SFFF
    FHFH
    FFFH
    HFFG

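This layout is fixed. S is in the top-left corner and G is in the bottom-right corner. H marks a hole and F marks a frozen tile. The goal is to get from S to G, stepping only on F tiles and never falling into an H. The available actions are up, down, left, and right, but the outcome of an action is partly random.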
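The randomness of the moves can be sketched without gym. This is a minimal sketch, assuming the default slippery dynamics of gym's FrozenLake (the intended direction and the two perpendicular directions are each taken with probability 1/3) and gym's action encoding (0=left, 1=down, 2=right, 3=up). The helper names `move` and `slippery_outcomes` are hypothetical and not part of gym.

```python
# Sketch of slippery moves on the 4x4 grid (assumed dynamics, not gym's own code).
SIZE = 4
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # assumed to match gym's action encoding
DELTAS = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def move(state, direction):
    """Deterministic move; walking into the border leaves the state unchanged."""
    row, col = divmod(state, SIZE)
    d_row, d_col = DELTAS[direction]
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    return row * SIZE + col


def slippery_outcomes(state, action):
    """Map each reachable next state to its probability under the assumed 1/3 split."""
    outcomes = {}
    for direction in ((action - 1) % 4, action, (action + 1) % 4):
        nxt = move(state, direction)
        outcomes[nxt] = outcomes.get(nxt, 0.0) + 1.0 / 3.0
    return outcomes


# From the start state, choosing RIGHT may also slip DOWN or UP (UP hits the border).
print(slippery_outcomes(0, RIGHT))
```

Under these assumptions, no single action from the start state is guaranteed to move the agent where it was aimed, which is why even a good policy does not reach G every time.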

In [1]:
import gym
import numpy as np
from tqdm import tqdm

env = gym.make('FrozenLake-v0')
env.render()
print()

# Some information about this env
n_states = env.observation_space.n
n_actions = env.action_space.n
max_steps = env.spec.tags.get('wrapper_config.TimeLimit.max_episode_steps')
print('n_states:', n_states, 'n_actions:', n_actions, 'max_steps:', max_steps)

[2017-07-21 13:44:23,562] Making new env: FrozenLake-v0

SFFF
FHFH
FFFH
HFFG

n_states: 16 n_actions: 4 max_steps: 100


# evaluate


Q is also called the action-value function. Given a state and an action, it returns the expected value: Q(state, action) -> v
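Written out, one common definition (an assumption here, since the text only gives the signature) is the expected discounted return when starting in state $s$, taking action $a$, and following a policy $\pi$ afterwards. The discount factor $\gamma$ corresponds to the `gamma` argument of the training code further down.

$$
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s,\ a_0 = a\right]
$$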

First, write a function to judge how good a given Q is. It runs 100 episodes and averages the reward; a higher average means a better Q.

The tutorial linked above has code for evaluating a policy; it only needs a small adjustment. policy(state) -> action
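An average over 100 episodes is a noisy estimate, which matters when comparing two Q tables. The sketch below assumes that episodes are independent and that each episode's total reward is either 0 or 1 (FrozenLake pays 1 only for reaching G), so the average behaves like a sample proportion. The function name `success_rate_stderr` and the example success rates are hypothetical.

```python
import math


def success_rate_stderr(p, n_episodes=100):
    """Standard error of an average of n_episodes rewards that are each 0 or 1."""
    return math.sqrt(p * (1.0 - p) / n_episodes)


# Hypothetical true success rates, chosen only for illustration.
for p in (0.4, 0.7):
    print(p, round(success_rate_stderr(p), 3))
```

With a standard error of around 0.05, two evaluations of the same Q can easily differ by 0.1, so small differences between runs should not be over-interpreted.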

In [2]:
def sample_reward(env, policy, t_max=100):
    """
    Interact with an environment, return sum of all rewards.
    If game doesn't end on t_max (e.g. agent walks into a wall),
    force end the game and return whatever reward you got so far.
    Tip: see signature of env.step(...) method above.
    """
    s = env.reset()
    total_reward = 0

    for _ in range(t_max):
        s, r, is_done, _ = env.step(policy[s])
        total_reward += r
        if is_done:
            break

    return total_reward


def evaluate(env, policy, n_times=100):
    """Run several evaluations and average the score the policy gets."""
    rewards = [sample_reward(env, policy) for _ in range(n_times)]
    return float(np.mean(rewards))


def q_to_policy(env, q):
    p = {}
    for s in range(env.observation_space.n):
        p[s] = np.argmax(q[s])
    return p


def evaluate_q(env, q, n_times=100):
    return evaluate(env, q_to_policy(env, q), n_times)

In [3]:
# evaluate a random q function

def random_q():
    return np.random.random((n_states, n_actions))

evaluate_q(env, random_q())

0.0

In [7]:
def train(env, n=500, lr=.81, gamma=.96, steps=100):
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for i in tqdm(range(n)):
        s = env.reset()
        for _ in range(steps):
            # Add random noise to the values; the noise shrinks as training goes on (noise greedy)
            a = np.argmax(Q[s,:] + np.random.randn(env.action_space.n) * (1. / (i+1)))
            s1, r, d, _ = env.step(a)
            qsa = Q[s, a]
            Q[s, a] += lr * (r + gamma * np.max(Q[s1,:]) - qsa)

            if d:
                break
            s = s1

    return Q

evaluate_q(env, train(env))

100%|██████████| 500/500 [00:00<00:00, 709.52it/s] 


0.7
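The update inside `train` is the tabular Q-learning rule: move `Q[s, a]` toward the target `r + gamma * max(Q[s1, :])` by a fraction `lr`. Here is a minimal sketch of two single updates with the same `lr` and `gamma` defaults. A plain list of lists stands in for the NumPy array, the two transitions are made up for illustration, and `q_update` is a hypothetical helper.

```python
def q_update(Q, s, a, r, s1, lr=0.81, gamma=0.96):
    """Apply one tabular Q-learning update in place and return the TD error."""
    td_error = r + gamma * max(Q[s1]) - Q[s][a]
    Q[s][a] += lr * td_error
    return td_error


Q = [[0.0] * 4 for _ in range(16)]
# Made-up transition: action 2 in state 14 reaches the goal (state 15) with reward 1.
q_update(Q, s=14, a=2, r=1.0, s1=15)
# Made-up transition: action 2 in state 13 reaches state 14 with no reward.
q_update(Q, s=13, a=2, r=0.0, s1=14)
print(Q[14][2], Q[13][2])
```

The second update shows how value propagates backwards: state 13 receives no reward itself, but its estimate becomes positive because the best action in state 14 already has a positive value.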

In [8]:
# This form of epsilon greedy gives poor results more often

def train(env, n=500, lr=.81, gamma=.96, steps=100, epsilon=1., epsilon_decay=.98):
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for i in tqdm(range(n)):
        s = env.reset()
        for _ in range(steps):
            if np.random.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = np.argmax(Q[s, :])
            s1, r, d, _ = env.step(a)
            qsa = Q[s, a]
            Q[s, a] += lr * (r + gamma * np.max(Q[s1,:]) - qsa)

            if d:
                break
            s = s1
        epsilon *= epsilon_decay

    return Q

evaluate_q(env, train(env))

100%|██████████| 500/500 [00:00<00:00, 1298.34it/s]


0.39
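To compare how quickly the two training loops stop exploring, the two schedules can be tabulated directly. This is a small sketch using the defaults from the two `train` functions (noise scale `1 / (i + 1)` in the first, epsilon equal to `0.98 ** i` in the second). The helper names `noise_scale` and `epsilon_at` are hypothetical.

```python
def noise_scale(i):
    """Scale of the Gaussian noise added to Q[s, :] in episode i (0-indexed)."""
    return 1.0 / (i + 1)


def epsilon_at(i, epsilon=1.0, epsilon_decay=0.98):
    """Probability of a uniformly random action in episode i (0-indexed)."""
    return epsilon * epsilon_decay ** i


for i in (0, 9, 99, 499):
    print(i, noise_scale(i), epsilon_at(i))
```

The two schemes also treat ties differently. While a row of Q is still all zeros, the noisy argmax picks a uniformly random action whatever the noise scale, whereas the greedy branch of epsilon greedy always picks action 0 because `np.argmax` returns the first maximum. That may be one reason the second loop scores lower more often, though the single runs shown here are not enough to confirm it.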