## Chapter 8: Integrating Learning and Planning

### Model-based Reinforcement Learning

This chapter covers RL algorithms that learn a **model** from experience and use it to construct a value function or policy.

The RL algorithms in previous chapters are **model-free**: they learn a value function and/or policy directly from experience.

**Model-based RL**, by contrast, learns a model from experience and then plans a value function and/or policy from that model.

### What is a Model?

A *model* $\mathcal{M}$ is a representation of an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, parametrized by $\eta$. Assuming the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known, a model $\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ represents the state transitions and rewards:
</br>
</br>
<font size="3">
$$\begin{align}
S_{t+1} \sim \mathcal{P}_\eta(S_{t+1}|S_t, A_t) \\
R_{t+1} \sim \mathcal{R}_\eta(R_{t+1}|S_t, A_t)
\end{align}$$
</font>

The model $\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ is learned from experience $\{S_1, A_1, R_2, \dots, S_T\}$.

A model can be parametrized in many ways, from lookup tables to deep neural networks.

For example, a lookup table model can be constructed from experience as follows:
</br>
</br>
<font size="3">
$$\begin{align}
\hat{\mathcal{P}}_{s, s'}^a = \dfrac{1}{N(s, a)} \sum_{t=1}^T \mathbb{1}(S_t, A_t, S_{t+1} = s, a, s') \\
\hat{\mathcal{R}}_{s}^a = \dfrac{1}{N(s, a)} \sum_{t=1}^T \mathbb{1}(S_t, A_t = s, a)R_{t+1}
\end{align}$$
</font>

where $N(s, a)$ is the number of visits to the state-action pair $(s, a)$.
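The count-based estimates above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's own implementation: the function name `fit_table_model`, the `(s, a, r, s_next)` tuple layout, and the toy transitions are all assumptions made for the example.

```python
import numpy as np


def fit_table_model(transitions, n_states, n_actions):
    """Estimate P_hat[s, a, s'] and R_hat[s, a] from (s, a, r, s_next) tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))

    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r

    n_sa = counts.sum(axis=2)   # N(s, a): visits to each state-action pair
    visited = n_sa > 0          # unvisited pairs keep all-zero estimates

    p_hat = np.zeros_like(counts)
    r_hat = np.zeros_like(reward_sums)
    p_hat[visited] = counts[visited] / n_sa[visited][:, None]
    r_hat[visited] = reward_sums[visited] / n_sa[visited]
    return p_hat, r_hat, n_sa


# Hypothetical experience from a toy problem with 2 states and 2 actions
transitions = [(0, 1, 0.0, 1), (0, 1, 1.0, 0), (0, 1, 0.0, 1), (1, 0, 2.0, 1)]
p_hat, r_hat, n_sa = fit_table_model(transitions, n_states=2, n_actions=2)
print(p_hat[0, 1], r_hat[0, 1])
```

Pairs that were never visited are left at zero here. A real implementation would need some policy for them, for example never sampling them during planning.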

If the model is accurate, we can learn a value function or policy from experience sampled from the model. If it is inaccurate, learning from simulated experience can lead to a suboptimal policy.
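To make "learning from experience sampled from the model" concrete, here is a small sketch of sample-based planning: transitions are drawn from a table model and fed to an ordinary Q-learning update. The three-state chain, its hand-written model, and the hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned model of a 3-state chain with 2 actions (0 = left, 1 = right).
# State 2 is terminal, and the transition into it has reward 1.
n_states, n_actions, terminal = 3, 2, 2
p_hat = np.zeros((n_states, n_actions, n_states))
r_hat = np.zeros((n_states, n_actions))
p_hat[0, 0, 0] = 1.0   # left from state 0 stays in state 0
p_hat[0, 1, 1] = 1.0   # right from state 0 moves to state 1
p_hat[1, 0, 0] = 1.0   # left from state 1 moves to state 0
p_hat[1, 1, 2] = 1.0   # right from state 1 reaches the terminal state
r_hat[1, 1] = 1.0

q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

for _ in range(5000):
    # Sample a non-terminal state and an action, then query the model
    s = rng.integers(0, terminal)
    a = rng.integers(0, n_actions)
    s_next = rng.choice(n_states, p=p_hat[s, a])
    r = r_hat[s, a]

    # Q-learning update on the simulated transition; q[terminal] stays at zero
    target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])

greedy = np.argmax(q[:terminal], axis=1)
print(greedy)
```

No real environment is touched in this loop, so the quality of the resulting greedy policy depends entirely on how well `p_hat` and `r_hat` match the true dynamics.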

### Integrating Planning, Acting and Learning

Dyna-Q is a reinforcement learning algorithm that learns a value function and/or policy from both real experience and simulated experience generated by a learned model.

![dyna.png](attachment:dyna.png) 

Learning from both direct and simulated experience can improve sample efficiency.
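For reference, the tabular form of Dyna-Q applies the same Q-learning update to both kinds of experience. The notation below is an assumption layered on this chapter's symbols: $\alpha$ is a step size, $\gamma$ the discount factor, and $\hat{r}$, $\hat{s}'$ are the reward and next state that the model predicts for a previously visited pair $(s, a)$.

$$\begin{align}
Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right) \\
Q(s, a) &\leftarrow Q(s, a) + \alpha \left( \hat{r} + \gamma \max_{a'} Q(\hat{s}', a') - Q(s, a) \right)
\end{align}$$

The first line is the direct RL step on a real transition. The second is the planning step, repeated several times on transitions drawn from the model.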

In [1]:
import numpy as np
import gym  # used when creating the environment; the classic Gym API (gym < 0.26) is assumed


class Dyna_Q:
    def __init__(self, env):
        self.env = env
        self.state_dim = env.observation_space.n
        self.action_dim = env.action_space.n

        self.alpha = 0.1           # learning rate
        self.gamma = 0.8           # discount factor
        self.eps = 0.1             # exploration probability (epsilon-greedy)
        self.model_training = 10   # planning updates per episode

        self.q = np.zeros([self.state_dim, self.action_dim])
        # Deterministic table model: last observed reward and next state per (s, a)
        self.model_r = np.zeros([self.state_dim, self.action_dim])
        self.model_ns = np.zeros([self.state_dim, self.action_dim], dtype=int)

    def action(self, s):
        if np.random.random() < self.eps:
            # randint excludes `high`, so pass action_dim to cover every action
            action = np.random.randint(low=0, high=self.action_dim)
        else:
            action = np.argmax(self.q[s, :])

        return action

    def run(self):
        states = []
        actions = []
        success = 0

        for episode in range(10000):
            observation = self.env.reset()
            done = False
            episode_reward = 0
            local_step = 0

            while not done:
                action = self.action(observation)
                next_observation, reward, done, _ = self.env.step(action)
                # Reward shaping: small step penalty, large penalty for a hole
                if reward == 0:
                    reward = -0.001
                # State 15 is the goal of the 4x4 FrozenLake map
                if done and next_observation != 15:
                    reward = -1
                episode_reward += reward
                local_step += 1

                # Direct RL: Q-learning update from the real transition
                self.q[observation, action] = self.q[observation, action] + self.alpha*(reward + self.gamma*np.max(self.q[next_observation, :]) - self.q[observation, action])

                # Model learning: store the observed reward and next state
                self.model_r[observation, action] = reward
                self.model_ns[observation, action] = next_observation

                states.append(observation)
                actions.append(action)

                observation = next_observation

            # Planning: Q-learning updates on transitions replayed from the model
            if episode >= 100:
                for _ in range(self.model_training):
                    # randint excludes `high`, so len(states) covers every sample
                    sample = np.random.randint(low=0, high=len(states))
                    s = states[sample]
                    a = actions[sample]

                    r = self.model_r[s, a]
                    ns = self.model_ns[s, a]
                    self.q[s, a] = self.q[s, a] + self.alpha*(r + self.gamma*np.max(self.q[ns, :]) - self.q[s, a])

            print("Episode: {}, Step: {}, Episode_reward: {}".format(episode, local_step, episode_reward))
            if episode_reward >= 0:
                success += 1
            print("Success: ", success)
            print(self.q)
            print(self.model_r)
            print(self.model_ns)

    def eval(self):
        # Run one greedy episode and render each step
        for i in range(1):
            observation = self.env.reset()
            done = False
            local_step = 0

            while not done:
                local_step += 1

                action = np.argmax(self.q[observation, :])
                next_observation, reward, done, _ = self.env.step(action)

                self.env.render()
                observation = next_observation


In [None]:
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3

env = gym.make("FrozenLake-v1", is_slippery=False)
#env = gym.make("FrozenLake8x8-v0", is_slippery=False)
'''
q_learning = Q_learing(env)
q_learning.run()
q_learning.eval()
'''
dyna_q = Dyna_Q(env)
dyna_q.run()
dyna_q.eval()

#model = Model(env)
#model.run()
#model.eval()
# sarsa = SARSA(env)
# sarsa.run()
# sarsa.eval()