# Reinforce-Prompt

## BaseLine

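As a first example, let Q1 and Q2 both equal some small positive number and
let Q3 be a large negative number. The actions at the first and second steps
thus earned small rewards, while the third step was not very successful. The
combined gradient from these three steps will try to push the policy away from
the action taken at the third step and slightly toward the actions taken at
the first and second steps, which is entirely reasonable.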


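Now imagine that the rewards are always positive and only their values differ.
This corresponds to adding some constant to each of the rewards (Q1, Q2, and
Q3). In this case, Q1 and Q2 become large positive numbers, while Q3 becomes a
small positive value. The policy update, however, will be different! We now
push the policy strongly toward the actions taken at the first and second
steps, and slightly toward the action taken at the third step. So, strictly
speaking, even though the relative rewards are the same, we no longer try to
avoid the action taken at the third step.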

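This dependence of the policy update on the constant added to the rewards can
slow training down considerably, since we may need many more samples to
average out the effect of this shift in the policy gradient. Even worse,
because the total discounted reward changes over time as the agent learns to
act better and better, the variance of the policy gradient may change as well.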
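The effect described above can be checked with a few numbers. The following is a minimal sketch, not part of the original notebook: the values `0.5`, `0.5`, `-5.0`, the shift of `10.0`, and the helper `subtract_baseline` are all assumptions chosen for illustration. It shows that adding a constant changes the sign of the weight on the third step, and that subtracting the mean (one simple choice of baseline) makes the weights independent of that constant.

```python
def subtract_baseline(q_values):
    """Center the per-step weights by subtracting their mean (one simple baseline)."""
    mean_q = sum(q_values) / len(q_values)
    return [q - mean_q for q in q_values]


# Hypothetical weights for the three steps: two small wins, one big failure.
q_original = [0.5, 0.5, -5.0]

# The same episode if every reward is shifted by a positive constant.
shift = 10.0
q_shifted = [q + shift for q in q_original]

print("original:          ", q_original)
print("shifted:           ", q_shifted)  # the third weight is now positive
print("original, centered:", subtract_baseline(q_original))
print("shifted, centered: ", subtract_baseline(q_shifted))
```

After centering, both versions give the third step a negative weight again, so the update once more pushes the policy away from that action.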


## Monte Carlo policy gradient (REINFORCE)
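- Step 1: Initialize the policy network with random weights.
- Step 2: Play N complete episodes, saving their $(s, a, r, s')$ transitions.
- Step 3: For every step $t$ of every episode $k$, compute the discounted total reward of the subsequent steps, minus a baseline $b_k$: $Q_{k,t}=\sum_{i=t}^{T_k}\gamma^{i-t}r_{k,i} - b_k$. The code in this notebook uses the mean per-step reward of episode $k$ as $b_k$.
- Step 4: Compute the loss function over all transitions: $L=-\sum_{k,t}Q_{k,t}\ln\pi(a_{k,t}|s_{k,t})$
- Step 5: Perform an SGD update of the weights to minimize the loss.
- Step 6: Repeat from step 2 until convergence.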
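Step 3 can be traced by hand on a tiny episode. The sketch below is an illustration in plain Python, not the notebook's implementation: the function name `returns_to_go`, the three rewards of `1.0`, and `gamma = 0.5` are assumptions picked so that the arithmetic is easy to follow. The baseline used here is the mean per-step reward, which is only one possible choice.

```python
def returns_to_go(rewards, gamma):
    """Discounted sum of rewards from each step t to the end of the episode."""
    returns = []
    running = 0.0
    # Walk backwards so that each return reuses the one that follows it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


rewards = [1.0, 1.0, 1.0]  # hypothetical episode of three steps
gamma = 0.5

g = returns_to_go(rewards, gamma)
print(g)  # [1.75, 1.5, 1.0]

# Subtract a baseline: here, the mean per-step reward of the episode.
baseline = sum(rewards) / len(rewards)
q = [x - baseline for x in g]
print(q)  # [0.75, 0.5, 0.0]
```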

In [1]:
import collections
import copy
import math
import random
import time
from collections import defaultdict

import gym
import gym.spaces
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from gym.envs.toy_text import frozen_lake
from torch.utils.tensorboard import SummaryWriter

In [2]:
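# Step 1: initialize the policy network with random weights
class PolicyNet(nn.Module):
    def __init__(self, obs_n, hidden_num, act_n):
        super().__init__()
        # Maps an observation to action probabilities pi(a|s)
        self.net = nn.Sequential(
            nn.Linear(obs_n, hidden_num),
            nn.ReLU(),
            nn.Linear(hidden_num, act_n),
            nn.Softmax(dim=1),
        )

    def forward(self, state):
        # Accept a single observation as well as a batch of observations
        state = torch.as_tensor(state, dtype=torch.float32)
        if state.dim() == 1:
            state = state.reshape(1, -1)
        return self.net(state)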

In [3]:
def discount_reward(R, gamma):
    # R is the list of rewards collected so far
    n = len(R)
    dr = 0
    for i in range(n):
        dr += gamma**i * R[i]
    return dr

In [4]:
# Step 2: play N complete episodes and save their (s, a, s', r, terminated) transitions
def generate_episode(env, n_steps, net, predict=False):
    # Note: n_steps is the number of episodes to play, not the number of environment steps
    episode_history = dict()
    episode_lengths = []

    for k in range(n_steps):
        episode = []
        state, info = env.reset()
        while True:
            # Sample an action from the current policy pi(.|state)
            p = net(torch.Tensor(state)).detach().numpy().reshape(-1)
            action = np.random.choice(list(range(env.action_space.n)), p=p)
            next_state, reward, terminated, truncated, info = env.step(action)
            episode.append([state, action, next_state, reward, terminated])
            state = next_state
            if terminated or truncated:
                episode_history[k] = episode
                episode_lengths.append(len(episode))
                break
    if predict:
        # Evaluation mode: return the mean episode length
        return np.mean(episode_lengths)
    return episode_history

In [5]:
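# Step 3: for every step t of an episode, compute the discounted total reward of the subsequent steps
def calculate_t_discount_reward(reward_list, gamma, baseline=False):
    returns = []
    total_reward = 0
    # With baseline=True, the mean per-step reward of the episode is subtracted from every return
    mean_reward = np.mean(reward_list) if baseline else 0
    # Walk backwards through the episode so that each return reuses the following one
    for r in reward_list[::-1]:
        total_reward = total_reward * gamma + r
        returns.append(total_reward - mean_reward)
    return returns[::-1]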

- Step 4: Compute the loss function over all transitions: $L=-\sum_{k,t}Q_{k,t}\ln\pi(a_{k,t}|s_{k,t})$
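To see why minimizing this loss moves the policy in the intended direction, it helps to look at a single transition with a softmax policy. The sketch below is an illustration in plain Python under stated assumptions: the helper names (`softmax`, `step_loss`, `analytic_grad`, `numeric_grad`) and the logits `[0.2, -0.1]` are made up for this example. For one transition the gradient of $-Q\ln\pi(a|s)$ with respect to the logits is $Q(\pi - \mathrm{onehot}(a))$, which the code checks against a finite-difference estimate.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step_loss(logits, action, q):
    """-Q * ln pi(a|s) for a single transition."""
    return -q * math.log(softmax(logits)[action])


def analytic_grad(logits, action, q):
    """d(step_loss)/d(logits) = Q * (pi - onehot(action))."""
    pi = softmax(logits)
    return [q * (p - (1.0 if j == action else 0.0)) for j, p in enumerate(pi)]


def numeric_grad(logits, action, q, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for j in range(len(logits)):
        up, down = list(logits), list(logits)
        up[j] += eps
        down[j] -= eps
        grads.append((step_loss(up, action, q) - step_loss(down, action, q)) / (2 * eps))
    return grads


logits = [0.2, -0.1]  # hypothetical network outputs before the softmax
action = 0            # the action that was actually taken

for q in (2.0, -2.0):
    print("Q =", q, "grad =", analytic_grad(logits, action, q))
```

With a positive `q`, the gradient on the chosen action's logit is negative, so a gradient-descent step raises that logit and makes the action more likely. With a negative `q`, the sign flips and the action becomes less likely.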

In [6]:
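def loss(batch, gamma):
    # Note: relies on the global policy network `net`
    l = 0
    for episode in batch.values():
        reward_list = [
            reward for state, action, next_state, reward, terminated in episode
        ]
        state = [state for state, action, next_state, reward, terminated in episode]
        action = [action for state, action, next_state, reward, terminated in episode]
        qt = calculate_t_discount_reward(reward_list, gamma, True)
        # Stack the observations into one array first; building a tensor from a
        # list of numpy arrays is very slow and triggers a PyTorch warning
        pi = net(torch.Tensor(np.array(state)))
        # Keep only the probability of the action that was actually taken
        pi = pi.gather(dim=1, index=torch.LongTensor(action).reshape(-1, 1))
        l -= torch.Tensor(qt) @ torch.log(pi)
    # Average the loss over the episodes in the batch
    return l / len(batch.values())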

## Training

In [7]:
# Initialize the environment
env = gym.make("CartPole-v1", max_episode_steps=200)
# env = gym.make("CartPole-v1", render_mode = "human")

state, info = env.reset()

obs_n = env.observation_space.shape[0]
hidden_num = 64
act_n = env.action_space.n
net = PolicyNet(obs_n, hidden_num, act_n)

# Define the optimizer
opt = optim.Adam(net.parameters(), lr=0.01)

# Logging
writer = SummaryWriter(
    log_dir="logs/PolicyGradient/reinforce-baseline", comment="test1"
)

In [8]:
epochs = 200
batch_size = 20
gamma = 0.9

for epoch in range(epochs):
    batch = generate_episode(env, batch_size, net)
    l = loss(batch, gamma)

    # Backpropagation
    opt.zero_grad()
    l.backward()
    opt.step()

    # Evaluate once per epoch so that the logged and printed values agree;
    # "max_steps" is the mean episode length over 10 evaluation episodes
    mean_steps = generate_episode(env, 10, net, predict=True)

    writer.add_scalars(
        "Loss",
        {"loss": l.item(), "max_steps": mean_steps},
        epoch,
    )

    print(
        "epoch:{},  Loss: {}, max_steps: {}".format(epoch, l.detach(), mean_steps)
    )



epoch:0,  Loss: tensor([87.4521]), max_steps: 26.1
epoch:1,  Loss: tensor([123.9768]), max_steps: 32.9
epoch:2,  Loss: tensor([195.3206]), max_steps: 35.2
epoch:3,  Loss: tensor([134.8634]), max_steps: 34.5
epoch:4,  Loss: tensor([163.0333]), max_steps: 45.9
epoch:5,  Loss: tensor([203.7186]), max_steps: 49.0
epoch:6,  Loss: tensor([240.0397]), max_steps: 47.4
epoch:7,  Loss: tensor([265.6391]), max_steps: 83.6
epoch:8,  Loss: tensor([257.5919]), max_steps: 63.5
epoch:9,  Loss: tensor([287.2999]), max_steps: 71.1
epoch:10,  Loss: tensor([359.1578]), max_steps: 83.0
epoch:11,  Loss: tensor([401.5188]), max_steps: 88.4
epoch:12,  Loss: tensor([477.2301]), max_steps: 94.7
epoch:13,  Loss: tensor([393.0710]), max_steps: 115.1
epoch:14,  Loss: tensor([691.4105]), max_steps: 133.6
epoch:15,  Loss: tensor([592.6606]), max_steps: 108.6
epoch:16,  Loss: tensor([711.7249]), max_steps: 151.7
epoch:17,  Loss: tensor([663.7057]), max_steps: 132.0
epoch:18,  Loss: tensor([686.0579]), max_steps: 136.

## entropy bonus

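Even when the policy is represented as a probability distribution, the agent
is quite likely to converge to some locally optimal policy and stop exploring
the environment. In DQN, we solved this with ε-greedy action selection: with
probability epsilon, the agent takes a random action instead of the action
dictated by the current policy. We could of course use the same approach here,
but policy gradient methods allow a better one, called the entropy bonus.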

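In information theory, entropy is a measure of the uncertainty in a system.
Applied to the agent's policy, it shows how uncertain the agent is about which
action to take. The entropy of the policy is defined as
$H(\pi) = -\sum_a \pi(a|s)\log\pi(a|s)$. Entropy is always non-negative and
reaches its maximum when the policy is uniform (in other words, when all
actions have the same probability). It reaches its minimum of zero when the
policy assigns probability 1 to one action and 0 to all others, which means
that the agent is completely certain about what to do. To keep the agent from
getting stuck in a local minimum, we subtract the entropy from the loss
function, penalizing the agent for being too certain about the action to take.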
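The behaviour of the entropy can be verified on a two-action policy, which matches the two actions of CartPole. This is a small sketch in plain Python, not the notebook's code: the function `entropy`, the three example distributions, and the placeholder value `policy_loss = 1.0` are assumptions made for illustration.

```python
import math


def entropy(probs):
    """H = -sum(p * ln p), treating 0 * ln 0 as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.5, 0.5]        # maximally uncertain policy
peaked = [0.99, 0.01]       # almost certain policy
deterministic = [1.0, 0.0]  # fully certain policy

print(entropy(uniform))  # ln 2, about 0.693
print(entropy(peaked))   # much smaller, but still positive

# Entropy-regularized loss for a hypothetical policy loss of 1.0:
# the more certain policy receives the smaller bonus and so the larger loss.
entropy_beta = 0.01
policy_loss = 1.0
for probs in (uniform, peaked, deterministic):
    print(probs, policy_loss - entropy_beta * entropy(probs))
```

Because the bonus is subtracted, a policy that spreads probability over several actions gets a slightly lower loss than a nearly deterministic one with the same policy loss.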

In [9]:
def loss(batch, gamma, entropy_beta):
    l = 0
    for episode in batch.values():
        reward_list = [
            reward for state, action, next_state, reward, terminated in episode
        ]
        state = [state for state, action, next_state, reward, terminated in episode]
        action = [action for state, action, next_state, reward, terminated in episode]
        qt = calculate_t_discount_reward(reward_list, gamma)
        # Stack the observations into one array before building the tensor
        pi = net(torch.Tensor(np.array(state)))
        # Entropy bonus: entropy_beta times the mean entropy of pi(.|s) over the episode
        entropy_loss = -torch.sum((pi * torch.log(pi)), dim=1).mean() * entropy_beta
        pi = pi.gather(dim=1, index=torch.LongTensor(action).reshape(-1, 1))
        l_policy = -torch.Tensor(qt) @ torch.log(pi)
        # Subtracting the bonus penalizes overly confident policies
        l += l_policy - entropy_loss
    return l / len(batch.values())

## Training with entropy bonus

In [10]:
# Initialize the environment
env = gym.make("CartPole-v1", max_episode_steps=200)
# env = gym.make("CartPole-v1", render_mode = "human")

state, info = env.reset()

obs_n = env.observation_space.shape[0]
hidden_num = 64
act_n = env.action_space.n
net = PolicyNet(obs_n, hidden_num, act_n)

# Define the optimizer
opt = optim.Adam(net.parameters(), lr=0.01)

# Logging
writer = SummaryWriter(
    log_dir="logs/PolicyGradient/reinforce-entropy-bonus", comment="test1"
)

In [11]:
epochs = 200
batch_size = 20
gamma = 0.9
entropy_beta= 0.01

for epoch in range(epochs):
    batch = generate_episode(env, batch_size, net)
    l = loss(batch, gamma, entropy_beta)

    # Backpropagation
    opt.zero_grad()
    l.backward()
    opt.step()

    # Evaluate once per epoch so that the logged and printed values agree;
    # "max_steps" is the mean episode length over 10 evaluation episodes
    mean_steps = generate_episode(env, 10, net, predict=True)

    writer.add_scalars(
        "Loss",
        {"loss": l.item(), "max_steps": mean_steps},
        epoch,
    )

    print(
        "epoch:{},  Loss: {}, max_steps: {}".format(epoch, l.detach(), mean_steps)
    )

epoch:0,  Loss: tensor([121.6083]), max_steps: 27.6
epoch:1,  Loss: tensor([186.1725]), max_steps: 38.7
epoch:2,  Loss: tensor([193.4650]), max_steps: 41.4
epoch:3,  Loss: tensor([156.9653]), max_steps: 26.2
epoch:4,  Loss: tensor([275.4433]), max_steps: 47.6
epoch:5,  Loss: tensor([195.5026]), max_steps: 40.6
epoch:6,  Loss: tensor([246.6728]), max_steps: 36.0
epoch:7,  Loss: tensor([183.4123]), max_steps: 40.8
epoch:8,  Loss: tensor([247.6472]), max_steps: 51.5
epoch:9,  Loss: tensor([212.2267]), max_steps: 67.8
epoch:10,  Loss: tensor([222.9993]), max_steps: 58.3
epoch:11,  Loss: tensor([367.2082]), max_steps: 63.4
epoch:12,  Loss: tensor([265.2997]), max_steps: 74.0
epoch:13,  Loss: tensor([294.9002]), max_steps: 72.0
epoch:14,  Loss: tensor([305.4070]), max_steps: 53.2
epoch:15,  Loss: tensor([348.8931]), max_steps: 47.7
epoch:16,  Loss: tensor([252.6658]), max_steps: 55.5
epoch:17,  Loss: tensor([299.4608]), max_steps: 46.8
epoch:18,  Loss: tensor([304.1024]), max_steps: 101.2

## entropy_beta&baseline

In [12]:
def loss(batch, gamma, entropy_beta=False, baseline=False):
    l = 0
    for episode in batch.values():
        reward_list = [
            reward for state, action, next_state, reward, terminated in episode
        ]
        state = [state for state, action, next_state, reward, terminated in episode]
        action = [action for state, action, next_state, reward, terminated in episode]
        qt = calculate_t_discount_reward(reward_list, gamma, baseline)
        # Stack the observations into one array before building the tensor
        pi = net(torch.Tensor(np.array(state)))
        entropy_loss = -torch.sum((pi * torch.log(pi)), dim=1).mean() * entropy_beta
        pi = pi.gather(dim=1, index=torch.LongTensor(action).reshape(-1, 1))
        l_policy = -torch.Tensor(qt) @ torch.log(pi)
        # Apply the entropy bonus only when entropy_beta is non-zero
        if entropy_beta:
            l += l_policy - entropy_loss
        else:
            l += l_policy
    return l / len(batch.values())

In [13]:
# Initialize the environment
env = gym.make("CartPole-v1", max_episode_steps=200)
# env = gym.make("CartPole-v1", render_mode = "human")

state, info = env.reset()

obs_n = env.observation_space.shape[0]
hidden_num = 64
act_n = env.action_space.n
net = PolicyNet(obs_n, hidden_num, act_n)

# Define the optimizer
opt = optim.Adam(net.parameters(), lr=0.01)

# Logging
writer = SummaryWriter(
    log_dir="logs/PolicyGradient/reinforce-entropy-bonus&baseline", comment="test1"
)

In [14]:
epochs = 200
batch_size = 20
gamma = 0.9
entropy_beta= 0.01
baseline=True

for epoch in range(epochs):
    batch = generate_episode(env, batch_size, net)
    l = loss(batch, gamma, entropy_beta, baseline)

    # Backpropagation
    opt.zero_grad()
    l.backward()
    opt.step()

    # Evaluate once per epoch so that the logged and printed values agree;
    # "max_steps" is the mean episode length over 10 evaluation episodes
    mean_steps = generate_episode(env, 10, net, predict=True)

    writer.add_scalars(
        "Loss",
        {"loss": l.item(), "max_steps": mean_steps},
        epoch,
    )

    print(
        "epoch:{},  Loss: {}, max_steps: {}".format(epoch, l.detach(), mean_steps)
    )

epoch:0,  Loss: tensor([82.1014]), max_steps: 29.7
epoch:1,  Loss: tensor([98.1543]), max_steps: 26.5
epoch:2,  Loss: tensor([96.1294]), max_steps: 29.7
epoch:3,  Loss: tensor([114.7237]), max_steps: 26.6
epoch:4,  Loss: tensor([145.7323]), max_steps: 36.7
epoch:5,  Loss: tensor([159.2324]), max_steps: 51.4
epoch:6,  Loss: tensor([170.9406]), max_steps: 39.4
epoch:7,  Loss: tensor([216.6581]), max_steps: 50.6
epoch:8,  Loss: tensor([211.0787]), max_steps: 52.1
epoch:9,  Loss: tensor([326.0906]), max_steps: 43.6
epoch:10,  Loss: tensor([311.7147]), max_steps: 71.7
epoch:11,  Loss: tensor([324.1201]), max_steps: 47.7
epoch:12,  Loss: tensor([263.7476]), max_steps: 69.8
epoch:13,  Loss: tensor([298.4491]), max_steps: 73.9
epoch:14,  Loss: tensor([362.7023]), max_steps: 78.4
epoch:15,  Loss: tensor([352.8200]), max_steps: 87.1
epoch:16,  Loss: tensor([413.0023]), max_steps: 100.4
epoch:17,  Loss: tensor([463.8265]), max_steps: 104.3
epoch:18,  Loss: tensor([552.8887]), max_steps: 113.7

# Prediction

In [15]:
# env = gym.make("CartPole-v1", render_mode="human")
# env = gym.wrappers.RecordVideo(env, video_folder="video")

# state, info = env.reset()
# total_rewards = 0

# while True:
#     p = net(torch.Tensor(state)).detach().numpy().reshape(-1)
#     action = np.random.choice(list(range(env.action_space.n)), p=p)
#     state, reward, terminated, truncated, info = env.step(action)
#     if terminated or truncated:
#         break