# 03 Algorithms

## 3.10 Dyna Q-Planning

### Dyna Q-Planning
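Dyna Q-Planning is a classic model-based reinforcement learning algorithm. It uses a method called Q-Planning to generate simulated data from a learned model, and combines that simulated data with real experience to update the policy. At each planning step, Q-Planning picks a previously visited state $s$ and an action $a$ that was previously taken in that state, queries the model for the resulting next state $s'$ and reward $r$, and then applies the Q-learning update rule to the simulated transition $(s,a,r,s')$ to update the action value.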
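A single Q-Planning step can be sketched in a few lines. This is only an illustration under assumed values: the helper name `q_planning_step`, the 3-state/2-action table, and the one remembered transition are hypothetical and do not come from the notebook below.

```python
import random

import numpy as np


def q_planning_step(q, model, alpha, gamma, rng=random):
    """Replay one remembered transition and apply a Q-learning update to it."""
    # Sample a (state, action) pair that has actually been tried before.
    state, action = rng.choice(list(model.keys()))
    # Ask the model, not the environment, for the outcome.
    reward, next_state = model[(state, action)]
    td_target = reward + gamma * np.max(q[next_state])
    q[state, action] += alpha * (td_target - q[state, action])
    return state, action


# Toy setup (assumed values): 3 states, 2 actions, one remembered transition.
q = np.zeros((3, 2))
q[2, 1] = 2.0               # state 2 already has some estimated value
model = {(0, 1): (1.0, 2)}  # (s, a) -> (r, s')

state, action = q_planning_step(q, model, alpha=0.1, gamma=0.9)
# TD target = 1.0 + 0.9 * 2.0 = 2.8, so q[0, 1] moves from 0 to about 0.28.
print(state, action, q[state, action])
```

The update is the ordinary Q-learning rule; the only difference from direct learning is that $(r, s')$ is read from the model rather than observed from the environment.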

<center>
<img src='../Images/Dyna Q-Planning.png'>
</center>


### Dyna Q-Planning Algorithms
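- Initialize the Q-table $q_0(s, a)$, the model $M_0(s, a)$, the $\epsilon$-greedy policy $\pi_0(a|s)$, the learning rate $\alpha$, the discount factor $\gamma$, and the exploration probability $\epsilon$;
- for episode in episodes:
  - Get the initial state $s_0$;
  - for t=0 to T-1 do:
    - Select the action $a_t$ for the current state with the $\epsilon$-greedy policy, execute it, and observe the reward $r_{t+1}$ and the next state $s_{t+1}$;
    - Update the Q-value: $q_{t+1}(s_t, a_t) \leftarrow q_t(s_t, a_t) - \alpha [q_t(s_t, a_t) - (r_{t+1} + \gamma \max_a q_t(s_{t+1}, a))]$;
    - Update the model: $M_{t+1}(s_t, a_t) \leftarrow (r_{t+1}, s_{t+1})$;
    - for k=0 to K-1 do:  (Q-Planning part)
      - Randomly select a previously visited state $s$ and an action $a$ previously taken in it;
      - Sample a transition $(r, s')$ from the model $M(s, a)$;
      - Update the Q-value: $q(s, a) \leftarrow q(s, a) - \alpha [q(s, a) - (r + \gamma \max_a q(s', a))]$;
    - end for
  - end for
- end for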
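The loop above can be sketched end to end on a toy problem. This is a minimal sketch, not the notebook's implementation: `ChainEnv` is a hypothetical 5-state corridor with its own simplified `reset`/`step` interface (not the Gymnasium API), and all hyperparameter values are assumptions chosen for illustration. Planning runs after every real step, as in the pseudocode.

```python
import random

import numpy as np


class ChainEnv:
    """Hypothetical corridor: start in state 0, reward 1 on reaching state 4."""
    n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done


def dyna_q(env, episodes=50, n_planning=10, alpha=0.1, gamma=0.9,
           epsilon=0.1, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = np.zeros((env.n_states, env.n_actions))
    model = {}  # (s, a) -> (r, s'), overwritten with the latest observation
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            # Epsilon-greedy action selection, breaking ties at random.
            if rng.random() < epsilon:
                action = rng.randrange(env.n_actions)
            else:
                best = np.flatnonzero(q[state] == q[state].max())
                action = int(rng.choice(list(best)))
            next_state, reward, done = env.step(action)
            # Direct RL: Q-learning update from the real transition.
            q[state, action] += alpha * (reward + gamma * q[next_state].max() - q[state, action])
            # Model learning: remember what this (s, a) led to.
            model[(state, action)] = (reward, next_state)
            # Q-Planning: K extra updates on transitions replayed from the model.
            for _ in range(n_planning):
                (s, a), (r, s2) = rng.choice(list(model.items()))
                q[s, a] += alpha * (r + gamma * q[s2].max() - q[s, a])
            state = next_state
            if done:
                break
    return q, model


q, model = dyna_q(ChainEnv())
greedy_policy = q.argmax(axis=1)
print(greedy_policy[:-1])  # greedy action for each non-terminal state
```

In this deterministic corridor the greedy action in every non-terminal state should settle on 1 (move right). Storing a single `(r, s')` per `(s, a)` is only exact for deterministic dynamics; in a stochastic environment such as slippery FrozenLake, the model keeps just the most recent outcome.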


### Example

In [1]:
import time
import random

import numpy as np
import gymnasium as gym
from tqdm import tqdm

In [2]:
class DynaQPlanning:
    """ Dyna Q-Planning Algorithms """

    def __init__(self, env, alpha=0.1, gamma=0.95, epsilon=0.1, epsilon_decay=0.99, n_planning=5):

        self.env = env
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        self.epsilon = epsilon  # exploration rate
        self.epsilon_decay = epsilon_decay  # decay rate for exploration
        self.n_planning = n_planning  # number of planning steps

        self.returns = []
        self.q_tables = np.zeros((env.observation_space.n, env.action_space.n))
        self.policy = np.ones((env.observation_space.n, env.action_space.n)) / env.action_space.n  # uniform policy
        self.model = {}  # model of the environment: (s, a) -> (r, s')

    @staticmethod
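    def custom_reward(done, reward):
        """ Reshape the sparse reward: +10 for reaching the goal, -5 for any other episode end, -0.1 per step. """

        if done and reward == 1:
            return 10
        elif done and reward == 0:
            return -5
        else:
            return -0.1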

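    def take_action(self, state):
        """ Take an epsilon-greedy action: explore uniformly with probability epsilon, otherwise follow the current policy. """

        if np.random.random() < self.epsilon:
            return np.random.randint(self.env.action_space.n)
        return np.random.choice(range(self.env.action_space.n), p=self.policy[state])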

    def best_action(self, state):
        """ Return the best action based on the Q-table """

        return np.argmax(self.q_tables[state])

    def generate_episode(self, state):
        episode = []
        done = False
        while not done:
            action = self.take_action(state)
            next_state, reward, terminated, truncated, info = self.env.step(action)

            done = terminated or truncated
            reward = self.custom_reward(done, reward)
            episode.append((state, action, reward, next_state))
            state = next_state
        return episode

    def q_learning(self, state, action, reward, next_state):
        """ Update the Q-table using the generated episode. """

        td_target = reward + self.gamma * np.max(self.q_tables[next_state])
        td_error = self.q_tables[state][action] - td_target
        self.q_tables[state][action] -= self.alpha * td_error

    def update_policy(self, state):
        """ Update the policy based on the Q-table. """

        policy = np.zeros(self.env.action_space.n)
        policy[self.best_action(state)] = 1.
        self.policy[state] = policy

    def update_policy_and_values(self, episode):
        """ Update the policy and values using the generated episode. """

        episode_return = 0
        for state, action, reward, next_state in reversed(episode):
            self.q_learning(state, action, reward, next_state)
            self.update_policy(state)

            self.model[(state, action)] = (reward, next_state)  # Update model with transition

            # Walking backwards, G_t = r_{t+1} + gamma * G_{t+1} yields the discounted return from the start state
            episode_return = reward + self.gamma * episode_return

        return episode_return

    def q_planning(self):
        """ Perform Q-Planning for a number of episodes. """

        for _ in range(self.n_planning):
            (state, action), (reward, next_state) = random.choice(list(self.model.items()))
            self.q_learning(state, action, reward, next_state)
            self.update_policy(state)

    def train(self, episodes=500):
        """ Train the agent for a number of episodes. """

        for i in range(10):
            with tqdm(total=episodes // 10, desc=f'Episode {i + 1}') as pbar:
                for idx in range(episodes // 10):
                    state, info = self.env.reset()

                    # Generate an episode using the current policy
                    episode = self.generate_episode(state)
                    episode_return = self.update_policy_and_values(episode)
                    self.returns.append(episode_return)

                    # Run n_planning simulated updates (here once per episode, not after every real step)
                    self.q_planning()

                    # Update progress bar
                    if (idx + 1) % 10 == 0:
                        pbar.set_postfix(
                            {
                                'epoch': episodes / 10 * i + idx + 1,
                                'return': np.mean(self.returns),
                                'epsilon': self.epsilon
                            }
                        )
                    pbar.update(1)

                    self.epsilon *= self.epsilon_decay
                    self.epsilon = max(self.epsilon, 0.01)

    def visualize_policy(self, delay=0.5):
        """ Visualize the policy learned by the agent """
        state, info = self.env.reset()
        done = False

        while not done:
            self.env.render()
            action = np.argmax(self.policy[state])
            state, reward, terminated, truncated, info = self.env.step(action)
            done = terminated or truncated
            time.sleep(delay)

        self.env.render()
        self.env.close()

In [3]:
environment = gym.make('FrozenLake-v1', desc=None, map_name='4x4', is_slippery=True, render_mode='rgb_array')
environment.reset()

(0, {'prob': 1})

In [4]:
agent = DynaQPlanning(environment, gamma=0.9, epsilon=0.99, alpha=0.1, epsilon_decay=0.99, n_planning=5)


In [5]:
agent.train(100)
print(f"Optimal policy: {agent.policy}")
print(f"Optimal Q-tables: {agent.q_tables}")

Episode 1: 100%|██████████| 10/10 [00:00<00:00, 1111.02it/s, epoch=10, return=-2.38, epsilon=0.904]
Episode 2: 100%|██████████| 10/10 [00:00<00:00, 769.51it/s, epoch=20, return=-4, epsilon=0.818]
Episode 3: 100%|██████████| 10/10 [00:00<00:00, 322.53it/s, epoch=30, return=-4.56, epsilon=0.74]
Episode 4: 100%|██████████| 10/10 [00:00<00:00, 500.10it/s, epoch=40, return=-4.83, epsilon=0.669]
Episode 5: 100%|██████████| 10/10 [00:00<00:00, 357.13it/s, epoch=50, return=-3.81, epsilon=0.605]
Episode 6: 100%|██████████| 10/10 [00:00<00:00, 322.61it/s, epoch=60, return=-3.65, epsilon=0.547]
Episode 7: 100%|██████████| 10/10 [00:00<00:00, 416.65it/s, epoch=70, return=-3.32, epsilon=0.495]
Episode 8: 100%|██████████| 10/10 [00:00<00:00, 296.95it/s, epoch=80, return=-2.5, epsilon=0.448]
Episode 9: 100%|██████████| 10/10 [00:00<00:00, 416.52it/s, epoch=90, return=-2.37, epsilon=0.405]
Episode 10: 100%|██████████| 10/10 [00:00<00:00, 333.35it/s, epoch=100, return=-2.27, epsilon=0.366]

Optimal policy: [[0.   0.   0.   1.  ]
 [0.   0.   0.   1.  ]
 [1.   0.   0.   0.  ]
 [0.   0.   0.   1.  ]
 [1.   0.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   0.   1.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.   0.   0.   1.  ]
 [0.   1.   0.   0.  ]
 [1.   0.   0.   0.  ]
 [0.25 0.25 0.25 0.25]
 [0.25 0.25 0.25 0.25]
 [0.   0.   1.   0.  ]
 [0.   0.   0.   1.  ]
 [0.25 0.25 0.25 0.25]]
Optimal Q-tables: [[-0.69977049 -0.75717105 -0.70720194 -0.69079698]
 [-3.85616038 -3.56704926 -3.2566078  -0.94977619]
 [-0.96776166 -2.16247353 -1.71105487 -0.97011691]
 [-1.06173258 -3.85616038 -1.02604526 -0.93041163]
 [-0.6506329  -3.01717736 -4.24952682 -3.85844806]
 [ 0.          0.          0.          0.        ]
 [-3.68261113 -4.07133943 -3.32616564 -3.81561641]
 [ 0.          0.          0.          0.        ]
 [-3.44607882 -3.86002434 -2.84766395 -0.52481377]
 [-0.45296685 -0.19197845 -2.84766395 -3.97445627]
 [-1.22863274 -2.2595655  -4.39211673 -2.71878562]
 [ 0.          0.          0.       




In [6]:
environment = gym.make('FrozenLake-v1', desc=None, map_name='4x4', is_slippery=True, render_mode='human')
environment.reset()

agent.env = environment
agent.visualize_policy(delay=0.005)