# Planning and Learning with Tabular Methods

## Models

A __model__ can be anything that an agent can use to predict how the environment will respond to its actions. It can be used to simulate experiences.

Dynamic Programming assumes full knowledge of the environment dynamics, whereas TD and MC methods do not need a model of the environment and learn from sampled experience. Methods of the first kind are called *model-based* reinforcement learning methods, and methods of the second kind are called *model-free*.

__Model-based__ reinforcement learning methods make use of a model and rely on *planning* as their primary component, whereas *model-free* methods rely primarily on *learning*.

A model can take one of two forms:

* __Sample models__ return a single outcome sampled from the probability distribution over possible outcomes.
* __Distribution models__ describe the full probability distribution over all possible outcomes.
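
To make the difference concrete, here is a small sketch that models the sum of a few fair dice in both ways. The function names `distribution_model` and `sample_model` are made up for this illustration and do not come from any library.

```python
import random
from collections import Counter
from itertools import product


def distribution_model(n_dice=2):
    """Return {sum: probability} for every possible sum of n_dice fair dice."""
    counts = Counter(sum(faces) for faces in product(range(1, 7), repeat=n_dice))
    total = 6 ** n_dice
    return {s: c / total for s, c in counts.items()}


def sample_model(n_dice=2, rng=random):
    """Return one possible sum, drawn according to the same probabilities."""
    return sum(rng.randint(1, 6) for _ in range(n_dice))


dist = distribution_model(2)   # every outcome with its probability
one_roll = sample_model(2)     # a single simulated outcome
```

A distribution model can always be used to produce samples, but a sample model is often much cheaper to build: enumerating all outcomes grows quickly with the number of dice, while drawing one sample stays cheap.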

### State-space Planning

*State-space planning* methods compute value functions as a key intermediate step in policy improvement.

Both planning and learning methods estimate value functions with update operations (also called *back-up operations*).
Whereas in learning we use real experiences, in planning we use simulated experiences.
Learning methods that learn from real experiences can in many cases be applied to planning with simulated experiences.
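
As a rough illustration of planning with a learning method, the sketch below applies one-step tabular Q-learning updates to transitions drawn from a model instead of the real environment. The tiny deterministic `model` dictionary, the state numbering, and the parameter defaults are all assumptions made up for this example.

```python
import random
from collections import defaultdict


def q_planning(model, actions, n_updates=1000, alpha=0.5, gamma=0.9, seed=0):
    """Random-sample one-step Q-planning on a deterministic model.

    model maps (state, action) -> (reward, next_state). States that never
    appear as keys are treated as terminal (their values stay at 0).
    """
    rng = random.Random(seed)
    q = defaultdict(float)
    pairs = list(model)
    for _ in range(n_updates):
        s, a = rng.choice(pairs)              # pick a state-action pair at random
        r, s_next = model[(s, a)]             # simulated experience from the model
        best_next = max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])  # same update as Q-learning
    return q


# Hypothetical two-state chain: 0 -> 1 -> 2 (terminal), reward 1 on reaching 2.
actions = ("stay", "right")
model = {
    (0, "stay"): (0.0, 0), (0, "right"): (0.0, 1),
    (1, "stay"): (0.0, 1), (1, "right"): (1.0, 2),
}
q = q_planning(model, actions)
```

The update line is identical to the one a learning agent would use; only the source of the transition differs.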

## Dyna Architecture

Within an agent, there can be two roles for real experiences:
* Improve the model (model learning)
* Improve value function estimates (direct RL)

The interactions between policy, value functions, model and experiences are shown in the diagram.

![Value/policy - Model - Experience interaction](http://incompleteideas.net/book/first/ebook/figtmp63.png)

Indirect methods, which improve value functions through a learned model, make fuller use of a limited amount of experience and therefore achieve better policies with fewer interactions with the environment.

Direct methods, on the other hand, are unaffected by bias in the model.

![Dyna Architecture diagram](http://incompleteideas.net/book/first/ebook/figtmp64.png)

The diagram shows experience being used for both direct RL and model learning.

* *Search control* is the process by which the starting state-action pairs for simulated experiences are chosen.
* Planning is achieved by applying learning methods to simulated experiences as if they were real.
* Learning and planning share the same RL methods and differ only in the source of their experience.
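
The following is a minimal sketch of a tabular Dyna-Q style agent, showing how direct RL, model learning, and planning can share one update rule. The corridor environment `step`, the constants, and the hyperparameter defaults are assumptions invented for this example, and the model is assumed to be deterministic.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)
GOAL = 4


def step(state, action):
    """Hypothetical corridor: states 0..4, reward 1 on reaching GOAL."""
    next_state = min(max(state + action, 0), GOAL)
    return (1.0 if next_state == GOAL else 0.0), next_state, next_state == GOAL


def update(q, s, a, r, s2, done, alpha, gamma):
    """One-step Q-learning update, used for both real and simulated experience."""
    target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (target - q[(s, a)])


def greedy(q, state, rng):
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def dyna_q(episodes=30, n_planning=10, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    model = {}  # (s, a) -> (r, s', done), filled in from real experience
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            explore = rng.random() < epsilon
            action = rng.choice(ACTIONS) if explore else greedy(q, state, rng)
            reward, next_state, done = step(state, action)        # real experience
            update(q, state, action, reward, next_state, done, alpha, gamma)  # direct RL
            model[(state, action)] = (reward, next_state, done)   # model learning
            for _ in range(n_planning):                           # planning
                s, a = rng.choice(list(model))                    # search control
                r, s2, d = model[(s, a)]
                update(q, s, a, r, s2, d, alpha, gamma)
            state = next_state
    return q, model


q, model = dyna_q()
```

Here search control simply picks previously observed state-action pairs uniformly at random; setting `n_planning=0` reduces the agent to plain Q-learning.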

## Changing Environments

When the environment is stochastic and only a few experiences have been recorded, the model may be incomplete. If the environment is dynamic and its state transition probabilities vary over time, the model becomes inaccurate. When either happens, the planning process will likely compute a suboptimal policy.

### Dyna-Q+ agent

The Dyna-Q+ agent addresses the problem of inaccurate and incomplete models by encouraging exploration.

Since exploring after an optimal policy has been computed results in lower return, exploration must be balanced against exploitation.

The Dyna-Q+ agent achieves this by keeping track of the time that has elapsed since each state-action pair was last visited, and during planning it adds the following bonus to the modeled reward for that pair:

$$ \kappa \sqrt{\tau} $$

where

* $\tau$ is the number of time steps since the last visit to the state-action pair $(s, a)$
* $\kappa$ is a small positive constant that controls the amount of exploration.
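
A small sketch of how this bonus could be computed is shown below. The helper name `bonus_reward` and the default value of `kappa` are assumptions chosen for illustration.

```python
import math


def bonus_reward(reward, current_step, last_visit_step, kappa=1e-3):
    """Return the reward used in a planning update: r + kappa * sqrt(tau)."""
    tau = current_step - last_visit_step   # time steps since (s, a) was last tried
    return reward + kappa * math.sqrt(tau)


# A pair that has not been tried for a long time looks more attractive.
recent = bonus_reward(0.0, current_step=1000, last_visit_step=999)
stale = bonus_reward(0.0, current_step=1000, last_visit_step=0)
```

Because the bonus grows with $\sqrt{\tau}$, pairs that have gone untested for a long time eventually look attractive enough for the agent to try them again, which is how it can notice that the environment has changed.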

### Prioritized Sweeping

Sampling experiences uniformly at random during planning is not optimal, since some states have more information or are more relevant to the problem than others. 

When the value of a state changes, the states that precede it are the ones whose values are most likely to change in a subsequent update, so it is advantageous to focus planning on them. This idea is called *backward focusing* of planning computations.

In *prioritized sweeping*, we use backward focusing of planning computations and prioritize updates by the magnitude of the change they would produce in the value estimate.
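
The sketch below illustrates the idea for a deterministic model, using a priority queue keyed on the size of the pending value change. The function name, the toy `model`, and the threshold `theta` are assumptions for this example. For simplicity it allows duplicate queue entries rather than keeping only the highest priority per pair.

```python
import heapq
from collections import defaultdict


def prioritized_sweeping(model, actions, start_pair, n_updates=100,
                         alpha=1.0, gamma=0.9, theta=1e-4):
    """model maps (state, action) -> (reward, next_state); deterministic."""
    q = defaultdict(float)
    predecessors = defaultdict(set)          # next_state -> {(state, action)}
    for (s, a), (_, s2) in model.items():
        predecessors[s2].add((s, a))

    def td_error(s, a):
        r, s2 = model[(s, a)]
        return r + gamma * max(q[(s2, b)] for b in actions) - q[(s, a)]

    # heapq is a min-heap, so priorities are stored negated.
    queue = [(-abs(td_error(*start_pair)), start_pair)]
    for _ in range(n_updates):
        if not queue:
            break
        _, (s, a) = heapq.heappop(queue)     # pair with the largest pending change
        q[(s, a)] += alpha * td_error(s, a)
        for ps, pa in predecessors[s]:       # backward focusing
            p = abs(td_error(ps, pa))
            if p > theta:
                heapq.heappush(queue, (-p, (ps, pa)))
    return q


# Hypothetical two-state chain: 0 -> 1 -> 2 (terminal), reward 1 on reaching 2.
actions = ("stay", "right")
model = {
    (0, "stay"): (0.0, 0), (0, "right"): (0.0, 1),
    (1, "stay"): (0.0, 1), (1, "right"): (1.0, 2),
}
q = prioritized_sweeping(model, actions, start_pair=(1, "right"))
```

Starting from the rewarding transition, updates spread backward only to the pairs that lead into a state whose value has just changed, and the queue empties once no pending change exceeds `theta`.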
