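# Reinforcement Learning: Policy Gradient
## Reinforcement Learning
- Reinforcement learning (RL) is a branch of machine learning. Unlike supervised and unsupervised learning, it focuses on how an agent should act in an environment so as to maximize its expected cumulative reward.
- Basic loop: the `agent` learns inside an `environment`. Based on the environment's `state` (or the `observation` it receives), it executes an `action`, and the environment's feedback (`reward`) guides it toward better actions.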

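In this project's CartPole game, for example, the `agent` controls the cart shown in the animation. It has two `action`s, pushing the cart left or right, and its goal is to keep the pole upright.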

![](https://ai-studio-static-online.cdn.bcebos.com/17c106ff3b724082a9405eca76adaa712a40d46936bf4f59887f3c41bbdf976f)
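
The agent-environment loop described above can be sketched without any RL library. The `ToyCorridor` environment and the `run_episode` helper below are hypothetical names invented for this illustration; they are not part of gym or of this project, and the environment is far simpler than CartPole.

```python
class ToyCorridor:
    """Hypothetical 1-D environment: start at position 0, finish on reaching `goal`."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self):
        # Return the initial state
        self.pos = 0
        self.steps = 0
        return self.pos

    def step(self, action):
        # action 1 moves right, any other action moves left (never below 0)
        self.pos = self.pos + 1 if action == 1 else max(self.pos - 1, 0)
        self.steps += 1
        reached_goal = self.pos >= self.goal
        done = reached_goal or self.steps >= self.max_steps
        reward = 1.0 if reached_goal else 0.0
        return self.pos, reward, done


def run_episode(env, policy):
    """Run one episode: observe the state, act, collect the reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # the agent picks an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
    return total_reward


always_right = lambda state: 1
print(run_episode(ToyCorridor(), always_right))
```

A policy that always moves right reaches the goal in three steps, while one that always moves left runs until the step limit without earning any reward.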

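## Introduction to Policy Gradient
* Reinforcement learning methods fall into two broad families: value-based (`Value-based`) and policy-based (`Policy-based`).
    * Typical `Value-based` algorithms are `Q-learning` and `SARSA`: they optimize the `Q` function and then derive the policy by acting greedily with respect to it.
    * The typical `Policy-based` algorithm is `Policy Gradient`, which optimizes the policy function directly.
* The policy function is approximated by a neural network, and the policy gradient is computed to optimize this policy network.
    * The optimization objective is the expected return under the policy `π(s,a)`: the sum of the return `R` of every trajectory, weighted by that trajectory's probability `p`. When N is large enough, it can be approximated by sampling N episodes and averaging their returns.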
    
    ![](https://ai-studio-static-online.cdn.bcebos.com/eb184ddf8dcc4dc3b528a105f8d8e3ea6487d4905bc04cdebd7725f2d6a2752f)
    
    * Differentiating the objective with respect to the parameters `θ` gives the policy gradient:
    
    ![](https://ai-studio-static-online.cdn.bcebos.com/326d8abe040347cea25e4c0be3e09015e85cb818a02c445483381540ab1d238c)
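
For readers who cannot load the formula images, the standard textbook form of these two expressions is written out below. This is an assumption about what the images show. The symbols $\tau^n$ (the n-th sampled trajectory), $T_n$ (its length), and $\pi_\theta(a \mid s)$ (the policy's action probability) are notation introduced here, not taken from the source.

$$
\bar{R}_\theta = \sum_{\tau} R(\tau)\, p_\theta(\tau) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)
$$

$$
\nabla_\theta \bar{R}_\theta
= \sum_{\tau} R(\tau)\, p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)
\approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n)
$$

The first equality in the gradient uses the identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$. The environment's transition probabilities do not depend on $\theta$, so only the policy terms survive in $\nabla_\theta \log p_\theta(\tau)$. The code later in this notebook weights each step by the reward-to-go $G_t$ rather than by the full-trajectory return $R(\tau^n)$.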
    


## Installing Dependencies


In [1]:
!pip install pygame
!pip install gym
!pip install atari_py


## Importing Libraries

The code in this section uses PaddlePaddle 2.0.

In [2]:
import gym
import os
import random
import collections

import paddle
import paddle.nn as nn
import numpy as np
import paddle.nn.functional as F

## Model

The network architecture here can be chosen to suit your needs.

`PolicyGradient` defines the forward (`Forward`) network; its structure can be customized freely.

In [3]:
class PolicyGradient(nn.Layer):
    def __init__(self, act_dim, obs_dim=4):
        super(PolicyGradient, self).__init__()
        # CartPole observations have 4 features, hence the default obs_dim
        hid1_size = act_dim * 10

        self.linear1 = nn.Linear(in_features=obs_dim, out_features=hid1_size)
        self.linear2 = nn.Linear(in_features=hid1_size, out_features=act_dim)

    def forward(self, obs):
        out = self.linear1(obs)
        out = paddle.tanh(out)
        out = self.linear2(out)
        # Softmax over the action axis turns the logits into action probabilities
        out = F.softmax(out, axis=-1)
        return out
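
The final softmax is what makes the network's output usable as a policy: it maps arbitrary real-valued logits to non-negative numbers that sum to 1. A minimal pure-Python sketch of that operation follows. It is an illustration only, not Paddle's implementation.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    # Subtracting the maximum keeps exp() from overflowing and does not change the result
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 0.0])
print(probs, sum(probs))
```

Equal logits give a uniform distribution, and the larger a logit is relative to the others, the more probability its action receives.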

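## The Agent's Learning Functions
This part covers both exploration (sampling actions) and training.

The `Agent` handles the interaction between the algorithm and the environment. During that interaction it passes the generated data to the `Algorithm`, which updates the model (`Model`). Data preprocessing is usually defined here as well.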

In [4]:
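_OPTIMIZERS = {}  # one Adam optimizer per model, so its moment estimates persist across updates


def sample(obs, MODEL):
    # Add a batch dimension and run the policy network
    obs = np.expand_dims(obs, axis=0)
    obs = paddle.to_tensor(obs, dtype='float32')
    act_prob = MODEL(obs).numpy()[0]
    # Sample an action according to the predicted probabilities
    act = np.random.choice(len(act_prob), p=act_prob)
    return act


def learn(obs, action, reward, MODEL):
    obs = paddle.to_tensor(np.array(obs).astype('float32'))
    act_prob = MODEL(obs)
    action = paddle.to_tensor(action.astype('int32'))
    # Negative log-probability of each action that was actually taken
    neg_log_prob = paddle.sum(-1.0 * paddle.log(act_prob) * F.one_hot(action, act_prob.shape[1]), axis=1)
    reward = paddle.to_tensor(reward.astype('float32'))
    # Weight each step by its return G_t; minimizing this cost follows the policy gradient
    cost = paddle.sum(neg_log_prob * reward)

    # Create the optimizer (dynamic graph mode) once per model and reuse it afterwards
    opt = _OPTIMIZERS.get(id(MODEL))
    if opt is None:
        opt = paddle.optimizer.Adam(learning_rate=LEARNING_RATE,
                                    parameters=MODEL.parameters())
        _OPTIMIZERS[id(MODEL)] = opt
    cost.backward()
    opt.step()
    opt.clear_grad()
    return cost.numpy()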
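
The cost computed in `learn` is the sum over steps of the negative log-probability of the chosen action, weighted by that step's return. The same arithmetic can be sketched in plain Python. The function name `policy_gradient_cost` is hypothetical and used only for this illustration.

```python
import math


def policy_gradient_cost(action_probs, actions, returns):
    """Sum of -log pi(a_t | s_t) * G_t over the steps of one episode."""
    cost = 0.0
    for probs, action, g in zip(action_probs, actions, returns):
        cost += -math.log(probs[action]) * g
    return cost


# Two steps, two actions per step
probs = [[0.5, 0.5], [0.9, 0.1]]
print(policy_gradient_cost(probs, actions=[0, 0], returns=[2.0, 1.0]))
```

Minimizing this cost raises the probability of actions that were followed by large returns, which is why the returns act as per-step weights.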

## Model Gradient Update Algorithm


In [5]:
def run_train(env, MODEL):
    MODEL.train()
    obs_list, action_list, reward_list = [], [], []
    obs = env.reset()

    while True:
        # Record the observation, then sample an action from the policy
        obs_list.append(obs)
        action = sample(obs, MODEL)
        action_list.append(action)

        # Step the environment (classic gym API, which returns a 4-tuple)
        obs, reward, isOver, info = env.step(action)
        reward_list.append(reward)

        # Stop when the episode ends
        if isOver:
            break
    return obs_list, action_list, reward_list


def evaluate(model, env, render=False):
    model.eval()
    eval_reward = []
    # Average the total reward over 5 evaluation episodes
    for i in range(5):
        obs = env.reset()
        episode_reward = 0
        while True:
            obs = np.expand_dims(obs, axis=0)
            obs = paddle.to_tensor(obs, dtype='float32')
            action = model(obs)
            # Act greedily: pick the most probable action instead of sampling
            action = np.argmax(action.numpy())
            obs, reward, done, _ = env.step(action)
            episode_reward += reward
            if render:
                env.render()
            if done:
                break
        eval_reward.append(episode_reward)
    return np.mean(eval_reward)
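
The two functions above choose actions differently: `run_train` samples from the predicted probabilities, which keeps the agent exploring, while `evaluate` always takes the most probable action. A minimal standard-library sketch of the two selection rules is shown below; the helper names are hypothetical.

```python
import random


def sample_action(probs):
    """Draw an action index at random, weighted by its probability (used in training)."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_action(probs):
    """Pick the index of the most probable action (used in evaluation)."""
    return max(range(len(probs)), key=lambda i: probs[i])


probs = [0.3, 0.7]
counts = [0, 0]
for _ in range(1000):
    counts[sample_action(probs)] += 1
print(counts, greedy_action(probs))
```

Sampling visits both actions roughly in proportion to their probabilities, whereas the greedy rule returns the same action every time for a given input.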

## Training and Evaluation Functions

Set the hyperparameters.

In [6]:
LEARNING_RATE = 0.001  # learning rate

OBS_DIM = None
ACTION_DIM = None

In [7]:
# Given the list of per-step rewards for one episode, compute the return G_t of every step
def calc_reward_to_go(reward_list, gamma=1.0):
    # Work on a copy so the caller's list is not modified
    returns = np.array(reward_list, dtype='float64')
    for i in range(len(returns) - 2, -1, -1):
        # G_t = r_t + γ·r_{t+1} + ... = r_t + γ·G_{t+1}
        returns[i] += gamma * returns[i + 1]
    return returns
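
A small worked example makes the recurrence concrete. The function below is a pure-Python restatement of the same backward pass, written without NumPy so it runs on its own. The name `reward_to_go` is hypothetical.

```python
def reward_to_go(rewards, gamma=1.0):
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed back to front."""
    returns = list(rewards)  # copy, so the input list is left untouched
    for t in range(len(returns) - 2, -1, -1):
        returns[t] += gamma * returns[t + 1]
    return returns


print(reward_to_go([1.0, 1.0, 1.0]))             # [3.0, 2.0, 1.0]
print(reward_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

With `gamma=1.0`, as used in this notebook, each step's return is simply the number of reward units still to come, so early steps in a long CartPole episode receive the largest weights.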


In [None]:
def main():
    global OBS_DIM
    global ACTION_DIM

    # Create the environment
    env = gym.make('CartPole-v0')
    # Observation and action dimensions
    action_dim = env.action_space.n
    obs_dim = env.observation_space.shape[0]
    OBS_DIM = obs_dim
    ACTION_DIM = action_dim

    # Create the policy network
    MODEL = PolicyGradient(ACTION_DIM)

    # Start training
    print("start training...")
    # Train for 1000 episodes; evaluation episodes are not counted
    for i in range(1000):
        obs_list, action_list, reward_list = run_train(env, MODEL)
        if i % 10 == 0:
            print("Episode {}, Reward Sum {}.".format(i, sum(reward_list)))

        batch_obs = np.array(obs_list)
        batch_action = np.array(action_list)
        batch_reward = calc_reward_to_go(reward_list)
        cost = learn(batch_obs, batch_action, batch_reward, MODEL)

        if (i + 1) % 100 == 0:
            # Set render=True to watch the rendering; this requires running locally, since AI Studio cannot display it
            total_reward = evaluate(MODEL, env, render=False)
            print("Test reward: {}".format(total_reward))


if __name__ == '__main__':
    main()

start training...
Episode 0, Reward Sum 14.0.
Episode 10, Reward Sum 25.0.
Episode 20, Reward Sum 12.0.
Episode 30, Reward Sum 10.0.
Episode 40, Reward Sum 22.0.
Episode 50, Reward Sum 15.0.
Episode 60, Reward Sum 15.0.
Episode 70, Reward Sum 16.0.
Episode 80, Reward Sum 60.0.
Episode 90, Reward Sum 15.0.
Test reward: 33.6
Episode 100, Reward Sum 24.0.
Episode 110, Reward Sum 18.0.
Episode 120, Reward Sum 18.0.
Episode 130, Reward Sum 30.0.
Episode 140, Reward Sum 42.0.
Episode 150, Reward Sum 35.0.
Episode 160, Reward Sum 33.0.
Episode 170, Reward Sum 13.0.
Episode 180, Reward Sum 14.0.
Episode 190, Reward Sum 37.0.
Test reward: 50.6
Episode 200, Reward Sum 31.0.
Episode 210, Reward Sum 24.0.
Episode 220, Reward Sum 18.0.
Episode 230, Reward Sum 20.0.
Episode 240, Reward Sum 48.0.
Episode 250, Reward Sum 46.0.
Episode 260, Reward Sum 58.0.
Episode 270, Reward Sum 57.0.
Episode 280, Reward Sum 17.0.
Episode 290, Reward Sum 35.0.
Test reward: 55.0
Episode 300, Reward Sum 21.0.
Episode 3