https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/8_Actor_Critic_Advantage/AC_CartPole.py

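Actor Critic combines Policy Gradient (the Actor) with Function Approximation (the Critic). The `Actor` picks actions according to probabilities, the `Critic` scores the actions the `Actor` takes, and the `Actor` adjusts its action probabilities according to the `Critic`'s scores.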

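**Advantage of Actor Critic**: it can update at every single step, which is faster than traditional Policy Gradient, where updates wait until the episode ends.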

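**Disadvantage of Actor Critic**: it depends on the Critic's value estimates, but the Critic is hard to converge, and with the Actor updating at the same time, convergence gets even harder. To address this, Google DeepMind proposed an upgraded version of `Actor Critic`, `Deep Deterministic Policy Gradient`, which incorporates the strengths of DQN to address the convergence difficulty. We will cover Deep Deterministic Policy Gradient later. It builds on `Actor Critic`, so once `Actor Critic` is understood, it is much easier to follow.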

In [5]:
import numpy as np
import tensorflow as tf
import gym

In [6]:
print(gym.__version__)
print(tf.__version__)

0.17.2
1.15.0


In [7]:
np.random.seed(2)
tf.set_random_seed(2)  # reproducible

# Hyperparameters
OUTPUT_GRAPH = False
MAX_EPISODE = 1000
DISPLAY_REWARD_THRESHOLD = 200  # renders environment if the running reward is greater than this threshold
MAX_EP_STEPS = 1000   # maximum time step in one episode
RENDER = False  # rendering wastes time
GAMMA = 0.9     # reward discount in TD error
LR_A = 0.001    # learning rate for actor
LR_C = 0.01     # learning rate for critic

env = gym.make('CartPole-v0')
env.seed(1)  # reproducible
env = env.unwrapped

N_F = env.observation_space.shape[0]
N_A = env.action_space.n

In [8]:
print(N_F)
print(N_A)


4
2


The `Actor` wants to maximize the expected `reward`. In the `Actor Critic` algorithm, we use "how much better than usual" (the `TD error`) as that `reward`.

When the `Actor` performs gradient ascent with the Policy Gradient method, the `Critic` tells it whether this ascent step is in the right direction. If the score for this step is poor, the `Actor` should not ascend as far.
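To make the role of the TD error concrete, here is a minimal sketch of one actor update on a two-action softmax policy. It is not the notebook's TensorFlow code: the helper names `softmax` and `actor_step`, the direct update of the logits, and the step size `lr=0.1` are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_step(logits, action, td_error, lr=0.1):
    """One gradient-ascent step on log pi(action) * td_error w.r.t. the logits.

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    (Hypothetical helper; the notebook lets Adam update network weights instead.)
    """
    probs = softmax(logits)
    return [
        z + lr * td_error * ((1.0 if k == action else 0.0) - probs[k])
        for k, z in enumerate(logits)
    ]

logits = [0.0, 0.0]  # two actions, initially equally likely
better = softmax(actor_step(logits, action=0, td_error=+1.0))
worse = softmax(actor_step(logits, action=0, td_error=-1.0))
print(better[0] > 0.5, worse[0] < 0.5)  # prints: True True
```

A positive TD error raises the probability of the action that was taken, a negative one lowers it, and a TD error near zero leaves the policy almost unchanged.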

In [4]:
# Selects actions according to probabilities
class Actor(object):
    def __init__(self, sess, n_features, n_actions, lr=0.001):
        '''
        Build the actor network with tf and set up the training graph
        '''
        self.sess = sess

        self.s = tf.placeholder(tf.float32, [1, n_features], "state")
        self.a = tf.placeholder(tf.int32, None, "act")
        self.td_error = tf.placeholder(tf.float32, None, "td_error")  # TD_error

        with tf.variable_scope('Actor'):
            l1 = tf.layers.dense(
                inputs=self.s,
                units=20,    # number of hidden units
                activation=tf.nn.relu,
                kernel_initializer=tf.random_normal_initializer(0., .1),    # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='l1'
            )

            self.acts_prob = tf.layers.dense(
                inputs=l1,
                units=n_actions,    # output units
                activation=tf.nn.softmax,   # get action probabilities
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='acts_prob'
            )

        with tf.variable_scope('exp_v'):
            log_prob = tf.log(self.acts_prob[0, self.a])
            self.exp_v = tf.reduce_mean(log_prob * self.td_error)  # advantage (TD_error) guided loss

        with tf.variable_scope('train'):
            self.train_op = tf.train.AdamOptimizer(lr).minimize(-self.exp_v)  # minimize(-exp_v) = maximize(exp_v)

    def learn(self, s, a, td):
        '''
        s and a give the direction of gradient ascent; td comes from the critic and tells the actor whether that direction is right
        '''
        s = s[np.newaxis, :]
        feed_dict = {self.s: s, self.a: a, self.td_error: td}
        _, exp_v = self.sess.run([self.train_op, self.exp_v], feed_dict)
        return exp_v

    def choose_action(self, s):
        '''
        Choose action a given state s
        '''
        s = s[np.newaxis, :]
        probs = self.sess.run(self.acts_prob, {self.s: s})   # get probabilities for all actions
        return np.random.choice(np.arange(probs.shape[1]), p=probs.ravel())   # return an int

Updating the `Critic` is simple: as in Q learning, it just reduces the error between the target ("reality") and its own estimate, i.e. the TD error.
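As a rough illustration of that update, the sketch below runs TD(0) on a made-up two-state value table instead of a neural network. The table `V`, the state names, the helper `td0_update`, and the step size `ALPHA` are all hypothetical; only `GAMMA = 0.9` is taken from the notebook.

```python
GAMMA = 0.9   # same discount as in the notebook
ALPHA = 0.5   # hypothetical step size for this toy example

def td0_update(V, s, r, s_next):
    """Move V[s] toward the one-step target r + GAMMA * V[s_next]; return the TD error."""
    td_error = r + GAMMA * V[s_next] - V[s]
    V[s] += ALPHA * td_error
    return td_error

V = {"A": 0.0, "B": 1.0}  # made-up value table for two states
first = td0_update(V, "A", r=1.0, s_next="B")
second = td0_update(V, "A", r=1.0, s_next="B")
print(first, second)
```

Repeating the same transition shrinks the TD error each time, because `V["A"]` moves closer to the target `r + GAMMA * V["B"]`.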

In [5]:
class Critic(object):
    def __init__(self, sess, n_features, lr=0.01):
        self.sess = sess

        self.s = tf.placeholder(tf.float32, [1, n_features], "state")
        self.v_ = tf.placeholder(tf.float32, [1, 1], "v_next")
        self.r = tf.placeholder(tf.float32, None, 'r')

        with tf.variable_scope('Critic'):
            l1 = tf.layers.dense(
                inputs=self.s,
                units=20,  # number of hidden units
                activation=tf.nn.relu,  # None
                # In theory the critic should be linear to guarantee the actor's convergence,
                # but a linear approximator seems to have trouble learning the correct value.
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='l1'
            )

            self.v = tf.layers.dense(
                inputs=l1,
                units=1,  # output units
                activation=None,
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='V'
            )

        with tf.variable_scope('squared_TD_error'):
            # immediate reward + discounted value of the state after the action - value of the state before the action
            self.td_error = self.r + GAMMA * self.v_ - self.v
            self.loss = tf.square(self.td_error)    # TD_error = (r+gamma*V_next) - V_eval
        with tf.variable_scope('train'):
            self.train_op = tf.train.AdamOptimizer(lr).minimize(self.loss)

    def learn(self, s, r, s_):
        s, s_ = s[np.newaxis, :], s_[np.newaxis, :]

        v_ = self.sess.run(self.v, {self.s: s_})
        td_error, _ = self.sess.run([self.td_error, self.train_op],
                                          {self.s: s, self.v_: v_, self.r: r})
        return td_error

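As mentioned in the `Policy Gradients` video, real rewards and penalties steer how the `Actor` updates, and `Policy Gradients` relies on them to get suitable updates. So can the information about when rewards and penalties occur itself be learned? That looks like exactly what value-based reinforcement learning methods do. So we use a `Critic` to learn the reward structure. Once it has learned, the `Actor` makes the moves and the `Critic` tells the `Actor` which of those moves were good and which were bad.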

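By learning the relationship between the environment and the rewards, the `Critic` can see the potential reward `v_` of the state `s_` the agent has just reached. Using it to guide the `Actor` lets the `Actor` update at every step. With plain `Policy Gradients`, the `Actor` can only start updating after the episode ends.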
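The difference between the two learning signals can be sketched as follows. The function names and the example numbers are assumptions for illustration; `GAMMA = 0.9` matches the notebook.

```python
GAMMA = 0.9

def discounted_returns(rewards, gamma=GAMMA):
    """Monte Carlo returns G_t, as plain Policy Gradients uses; needs the whole episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def td_target(r, v_next, gamma=GAMMA):
    """One-step target, available right after a single environment step."""
    return r + gamma * v_next

episode_rewards = [1.0, 1.0, 1.0]              # made-up three-step episode
mc = discounted_returns(episode_rewards)       # only computable once the episode is over
one_step = td_target(1.0, v_next=2.0)          # v_next stands in for the critic's estimate
print(mc, one_step)
```

The Monte Carlo return for the first step depends on every later reward, while the one-step target only needs the current reward and the critic's estimate for the next state.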

https://www.jianshu.com/p/25c09ae3d206

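The `Actor` takes three inputs: a state, an action, and a reward signal, where the `vt` used in Policy Gradient is replaced by `td_error`. Action selection is the same as in Policy Gradient: the action is sampled according to the computed softmax probabilities.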

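The `Critic` feeds a temporal-difference value, `td_error`, back to the `Actor` to judge how good the chosen action was. A large TD error means the action the `Actor` just chose was more of a surprise (it turned out better than expected), so it should be chosen more often, which in turn shrinks the TD error.
Consider how the TD error is computed:
`TD = r + gamma * f(s') - f(s)`, where `f(s)`, i.e. `self.v`, is the value the `Critic` network outputs for state `s`. This is a state value `V(s)` rather than a `Q` value, since the network takes no action as input.
So the `Critic` also has three inputs: the current state `self.s`, the immediate reward `self.r`, and the value of the next state `self.v_`, which is discounted by `GAMMA` inside the TD error. Why is there no action `a`? Because the action has already been chosen by the `Actor`, and its effect reaches the `Critic` through `r` and `s'`. And why is the next state's value passed in rather than the next state itself? Because `learn` has already run `s_` through the network to obtain `v_` before computing the TD error.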
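One consequence of feeding `v_` in as a plain number is that the gradient only flows through the estimate for the current state, not through the target. Below is a minimal sketch of such a semi-gradient step, assuming a linear critic in place of the notebook's small network; the helpers `value` and `critic_step`, the feature vectors, and `lr=0.1` are hypothetical.

```python
GAMMA = 0.9

def value(w, s):
    """Linear critic V(s) = w . s (a stand-in for the notebook's network)."""
    return sum(wi * si for wi, si in zip(w, s))

def critic_step(w, s, r, s_next, lr=0.1):
    """One semi-gradient step on the squared TD error.

    v_next is evaluated first and then used as a plain number, so the
    update differentiates only V(s), not the target.
    """
    v_next = value(w, s_next)                      # plays the role of self.v_
    td_error = r + GAMMA * v_next - value(w, s)
    # Gradient of 0.5 * td_error**2 w.r.t. w, target held fixed, is -td_error * s
    new_w = [wi + lr * td_error * si for wi, si in zip(w, s)]
    return new_w, td_error

w0 = [0.0, 0.0]
s, s_next = [1.0, 0.0], [0.0, 1.0]                 # made-up feature vectors
w1, td = critic_step(w0, s, r=1.0, s_next=s_next)
print(w1, td)
```

Only the weight tied to the current state's feature changes; the weight that determines the next state's value is left alone in this step.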


In [6]:
sess = tf.Session()

In [7]:
actor = Actor(sess, n_features=N_F, n_actions=N_A, lr=LR_A)
critic = Critic(sess, n_features=N_F, lr=LR_C)     # we need a good teacher, so the teacher should learn faster than the actor

sess.run(tf.global_variables_initializer())

Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.


In [8]:

if OUTPUT_GRAPH:
    tf.summary.FileWriter("logs/", sess.graph)

for i_episode in range(MAX_EPISODE):
    s = env.reset()
    t = 0
    track_r = []
    while True:
        if RENDER: env.render()

        a = actor.choose_action(s)

        s_, r, done, info = env.step(a)  # s_: next state, r: immediate reward, done: flag indicating whether the episode has ended (True or False)

        if done: r = -20

        track_r.append(r)
        
        # In AC, the critic network replaces the manual computation of vt used in PG
        td_error = critic.learn(s, r, s_)  # gradient = grad[r + gamma * V(s_) - V(s)]
        actor.learn(s, a, td_error)     # true_gradient = grad[logPi(s,a) * td_error]

        s = s_
        t += 1

        if done or t >= MAX_EP_STEPS:
            ep_rs_sum = sum(track_r)

            if 'running_reward' not in globals():
                running_reward = ep_rs_sum
            else:
                running_reward = running_reward * 0.95 + ep_rs_sum * 0.05
            if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True  # rendering
            print("episode:", i_episode, "  reward:", int(running_reward))
            break

episode: 0   reward: -7
episode: 1   reward: -6
episode: 2   reward: -5
episode: 3   reward: -5
episode: 4   reward: -5
episode: 5   reward: -5
episode: 6   reward: -6
episode: 7   reward: -6
episode: 8   reward: -6
episode: 9   reward: -6
episode: 10   reward: -6
episode: 11   reward: -6
episode: 12   reward: -5
episode: 13   reward: -5
episode: 14   reward: -5
episode: 15   reward: -5
episode: 16   reward: -5
episode: 17   reward: -5
episode: 18   reward: -5
episode: 19   reward: -4
episode: 20   reward: -4
episode: 21   reward: -3
episode: 22   reward: -3
episode: 23   reward: -3
episode: 24   reward: -4
episode: 25   reward: -3
episode: 26   reward: -3
episode: 27   reward: -4
episode: 28   reward: -3
episode: 29   reward: -3
episode: 30   reward: -2
episode: 31   reward: -2
episode: 32   reward: -2
episode: 33   reward: -1
episode: 34   reward: -1
episode: 35   reward: -2
episode: 36   reward: -1
episode: 37   reward: -2
episode: 38   reward: 0
episode: 39   reward: 0
...

episode: 318   reward: 161
episode: 319   reward: 158
episode: 320   reward: 154
episode: 321   reward: 151
episode: 322   reward: 144
episode: 323   reward: 141
episode: 324   reward: 138
episode: 325   reward: 133
episode: 326   reward: 128
episode: 327   reward: 126
episode: 328   reward: 123
episode: 329   reward: 119
episode: 330   reward: 115
episode: 331   reward: 115
episode: 332   reward: 115
episode: 333   reward: 115
episode: 334   reward: 119
episode: 335   reward: 120
episode: 336   reward: 121
episode: 337   reward: 136
episode: 338   reward: 137
episode: 339   reward: 144
episode: 340   reward: 174
episode: 341   reward: 215
episode: 342   reward: 255
episode: 343   reward: 292
episode: 344   reward: 327
episode: 345   reward: 331
episode: 346   reward: 364
episode: 347   reward: 396
episode: 348   reward: 426
episode: 349   reward: 424
episode: 350   reward: 453
episode: 351   reward: 480
episode: 352   reward: 481
episode: 353   reward: 464
episode: 354   reward: 444
...

episode: 632   reward: 116
episode: 633   reward: 117
episode: 634   reward: 118
episode: 635   reward: 118
episode: 636   reward: 117
episode: 637   reward: 117
episode: 638   reward: 116
episode: 639   reward: 117
episode: 640   reward: 119
episode: 641   reward: 119
episode: 642   reward: 119
episode: 643   reward: 119
episode: 644   reward: 118
episode: 645   reward: 117
episode: 646   reward: 118
episode: 647   reward: 118
episode: 648   reward: 116
episode: 649   reward: 115
episode: 650   reward: 111
episode: 651   reward: 109
episode: 652   reward: 108
episode: 653   reward: 114
episode: 654   reward: 114
episode: 655   reward: 111
episode: 656   reward: 110
episode: 657   reward: 112
episode: 658   reward: 122
episode: 659   reward: 122
episode: 660   reward: 138
episode: 661   reward: 153
episode: 662   reward: 166
episode: 663   reward: 161
episode: 664   reward: 156
episode: 665   reward: 154
episode: 666   reward: 153
episode: 667   reward: 153
episode: 668   reward: 150
...

episode: 936   reward: 117
episode: 937   reward: 115
episode: 938   reward: 115
episode: 939   reward: 115
episode: 940   reward: 114
episode: 941   reward: 113
episode: 942   reward: 112
episode: 943   reward: 112
episode: 944   reward: 112
episode: 945   reward: 113
episode: 946   reward: 113
episode: 947   reward: 112
episode: 948   reward: 113
episode: 949   reward: 113
episode: 950   reward: 114
episode: 951   reward: 118
episode: 952   reward: 126
episode: 953   reward: 129
episode: 954   reward: 130
episode: 955   reward: 130
episode: 956   reward: 131
episode: 957   reward: 131
episode: 958   reward: 132
episode: 959   reward: 133
episode: 960   reward: 136
episode: 961   reward: 137
episode: 962   reward: 136
episode: 963   reward: 136
episode: 964   reward: 135
episode: 965   reward: 133
episode: 966   reward: 131
episode: 967   reward: 130
episode: 968   reward: 126
episode: 969   reward: 123
episode: 970   reward: 122
episode: 971   reward: 122
episode: 972   reward: 121
...

# D4PG

https://www.linkresearcher.com/theses/c9789137-8f12-4ff8-9ff4-48d21145b09c

https://github.com/deepmind/acme

https://github.com/deepmind/acme/blob/master/docs/index.md

https://github.com/deepmind/acme/blob/master/examples/quickstart.ipynb

https://zhkmxx9302013.github.io/post/dad17569.html