# Reinforcement Learning for Newbies

## Train a Video Game Playing Agent

In this tutorial, we will build a learning agent to play video games. To follow along, you will need to install the `gym` package provided by OpenAI. Please follow the installation instructions [here](https://gym.openai.com). Note that we will only need the `atari` environment.

In this project, I provide some templates that simplify and unify the wrappers for the gym-atari environment in `atari_a3c/atari.py`. Basically, one
- creates a game environment using `env = create_atari_env(GAME_NAME)`
- can then ask `env` for the current observation of screen pixels as a standard numpy array
- and can give `env` an integer representing a button being pressed on a game console

The task is to design an observation-to-action mapping, called a _policy_, so as to win a game.

### Problem Def
We build a program that can play video games:
```
Input: Game-Environment (env)
Output: Game-Policy (policy)
```

__env__:
```
Input: action (0~k, say, 2)
Output: screen-image, reward, game-is-over
```

__policy__:
```
Input: screen-image
Output: action
```
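To make the two interfaces above concrete, here is a minimal sketch in plain Python. `ToyEnv`, `random_policy` and `run_episode` are hypothetical names made up for illustration; they are not part of `gym` or `atari_a3c`. The toy `step()` returns exactly the three outputs listed above, whereas the real `env.step()` used later in this tutorial returns one extra value.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a game; not part of gym or atari_a3c."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]  # a one-number "screen image"

    def step(self, action):
        self.t += 1
        observation = [float(self.t)]
        reward = 1.0 if action == 1 else 0.0  # action 1 is the "good" button
        done = self.t >= self.episode_length
        return observation, reward, done


def random_policy(observation, num_actions=3):
    """Ignores the observation and picks any action uniformly at random."""
    return random.randrange(num_actions)


def run_episode(env, policy):
    """Play one episode: observe, choose an action, commit it, repeat."""
    observation = env.reset()
    done, total_reward, steps = False, 0.0, 0
    while not done:
        action = policy(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


total_reward, steps = run_episode(ToyEnv(), random_policy)
print(steps, total_reward)
```

A policy that always presses button 1, e.g. `run_episode(ToyEnv(3), lambda obs: 1)`, collects the maximum reward in this toy game.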

### Start playing

Let us see a concrete example of a policy. __REMINDER__: a state is a stack of 4 consecutive observed and processed frames, with 42x42 pixels per frame and 1 or 3 channels (mono/RGB colours) per pixel. In total, a state is a $\Big[[12(=3\times4) | 4] \times 42 \times 42\Big]$ array. In decision-making scenarios, the output of a policy should be the likelihood of taking each possible action given the observation.
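As a rough illustration of what "stacking 4 frames" means, the sketch below keeps the most recent frames in a fixed-length queue. This is only an assumption about how such a wrapper could work, not a description of `atari_a3c/atari.py`; the names `STACK` and `make_state`, the string placeholders for images, and the choice to pad with the first frame are all made up here.

```python
from collections import deque

STACK = 4  # number of frames per state, as in the text


def make_state(frames):
    """Return the stacked frames as a list, oldest first."""
    return list(frames)


frames = deque(maxlen=STACK)
first = "frame0"  # stands in for a 42x42 image
for _ in range(STACK):
    frames.append(first)  # pad with the first frame at episode start
for t in range(1, 4):
    frames.append("frame{}".format(t))  # the newest frame pushes out the oldest

state = make_state(frames)
print(state)
```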

This is a good case to apply neural network models to process the states. 

* Let's try to play with our environment.

In [None]:
# First check the state, i.e. the input to our model
import torch
from atari_a3c import atari
env = atari.create_atari_env('PongDeterministic-v4')
s_ = env.reset()
print(s_.shape) # This is a 4-frame mono-colour observation. 

# the state s_ is now a numpy array, let's convert it to a tensor to process
state_tensor = torch.from_numpy(s_).unsqueeze(0)
# unsqueeze(0) adds a "batch" dimension, i.e. in this batch, we only have 1 sample
print(type(state_tensor))
print(state_tensor.shape)

* Let's process the observation using convolutional neural networks
-- We will worry about the decision making at a later stage

In [None]:
import torch
import torch.nn as nn

num_inputs = 4 # Corresponding to s_.shape[0]
conv1 = nn.Conv2d(num_inputs, 32, 3, stride=2, padding=1)
conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
conv_layers = nn.Sequential(
    conv1, nn.ELU(), 
    conv2, nn.ELU(), 
    conv3, nn.ELU(), 
    conv4, nn.ELU()) # means when processing an x
# the computation is 
#  - elu(conv1(x)) -> x'
#  - elu(conv2(x')) -> x''
# ...

Now let's try to process our `state-tensor` using the layers.

In [None]:
y = conv_layers(state_tensor)
print(y.shape)

Good. The image is now processed, and we are about to make decisions out of these _features_ of the observation. The simplest way to make use of those features is to build linear models from them for each possible action's probability. To be specific, see how many features we have:

In [None]:
y.view(1, -1).shape # 1: we have 1 sample; 
# -1 means unravelling all features per sample into one row

So we will make a fully connected linear layer of $288 \times 3$, where the 3 outputs correspond to the UP/DOWN/STAY actions. (Let's consider a simple Pong game for now. To play more sophisticated games, more actions are needed, of course.)
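The number 288 can be checked by hand. For a convolution with kernel size $k$, stride $s$ and padding $p$ (and no dilation), an input of width $n$ produces an output of width $\lfloor (n + 2p - k)/s \rfloor + 1$. The sketch below applies this formula four times, once per layer; `conv_out` is a helper name made up for this example.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output width/height of one Conv2d layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 42
sizes = [size]
for _ in range(4):  # four stride-2 convolutions
    size = conv_out(size)
    sizes.append(size)

num_features = 32 * size * size  # 32 output channels in the last layer
print(sizes, num_features)
```

So each 42x42 frame stack shrinks to a 3x3 map with 32 channels, which unravels into 288 features per sample.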

In [None]:
actor_linear = nn.Linear(288, 3)

In [None]:
action_likelihood = actor_linear(y.view(1, -1))

In [None]:
print(action_likelihood)

It looks OK. The next steps are to
- convert the likelihood to a proper probability distribution -- so every action has a probability in [0, 1] of being chosen, and all the probabilities add up to 1.0. We do this using [softmax], which is often used to describe the probability of classes in multi-class classification problems.
- draw a random sample from the above probability distribution
- commit the action, and so we move to the next time step 

[softmax]: https://www.youtube.com/watch?v=LLux1SW--oM

In [None]:
action_prob = nn.functional.softmax(action_likelihood, dim=1) 
print(action_prob)

`dim=1` means we perform softmax among all numbers in each ROW. This is not relevant in our case here -- we have only one ROW and it is meaningless to take softmax along a COLUMN. Generally speaking, the `dim` should refer to the dimension of an array, along which we want to get probability among the elements. E.g. here we want the probability among the ACTIONS, and all the actions of an observation (in this case we have only one) are collected in the corresponding ROW.
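To see what "softmax within each ROW" does, here is a small plain-Python sketch with two made-up rows of scores (two samples, three actions each). The `softmax` helper below is written for illustration only; in the notebook the work is done by `nn.functional.softmax`.

```python
import math


def softmax(row):
    """Softmax over one row of scores."""
    m = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


scores = [[0.1, -0.3, 0.2],  # sample 0
          [2.0, 0.0, -1.0]]  # sample 1
probs = [softmax(row) for row in scores]  # one distribution per row, like dim=1
for row in probs:
    print(row, sum(row))
```

Each row becomes its own probability distribution over the actions, and the largest score in a row still receives the largest probability.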

In [None]:
action = action_prob.multinomial(1) # take one sample according to the distribution
print(action)

# Try to run this cell MULTIPLE TIMES and see what you got.
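Running the sampling cell many times is, in effect, the experiment sketched below in plain Python: with made-up probabilities for 3 actions, the observed frequencies of the drawn actions approach those probabilities. The seed is fixed only to make the run repeatable.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable
action_prob = [0.2, 0.5, 0.3]  # made-up probabilities for 3 actions
draws = [rng.choices(range(3), weights=action_prob)[0] for _ in range(10000)]
freq = Counter(draws)
empirical = [freq[a] / len(draws) for a in range(3)]
print(empirical)
```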

Now let's commit the selected action to the environment. Say
- Action-0: UP
- Action-1: STAY
- Action-2: DOWN

__NOTE__: the environment generally has a DIFFERENT idea about which number represents which action, and some environments may not accept numbers at all, but strings to represent actions! This has no effect on our learning problem. We just make a trivial mapping to tell the game environment to perform "UP" when our `action==0`.

You can use the following cell to try out the code of each action used by our environment.

In [None]:
import time
tmp_env = atari.create_atari_env('PongDeterministic-v4')
tmp_env.reset()
tmp_done_ = False
tmp_steps_ = 0
TRY_ACT = 3
while not tmp_done_ and tmp_steps_ < 300:
    tmp_env.render()
    _, _, tmp_done_, _ = tmp_env.step(TRY_ACT) 
    # step is the way to commit an action
    time.sleep(0.01)
    tmp_steps_ += 1
    
tmp_env.close()
del tmp_env
    

# NOTE: 0-STAY, 2-UP, 3-DOWN

Putting together what we have tried so far, we are ready to make our AI game-player.

In [None]:
# Neural network AI playing game.
ACTION_CODE = [0, 2, 3]
state = env.reset()
done = False
steps = 0
TRY_ACT = 3
while not done and steps < 500:
    env.render()
    state_tensor = torch.from_numpy(state).unsqueeze(0)
    features = conv_layers(state_tensor)
    features = features.view(1, -1)
    action_likelihood = actor_linear(features)
    action_prob = nn.functional.softmax(action_likelihood, dim=1) 
    action = action_prob.multinomial(1)
    
    new_state, reward, done, _ = env.step(ACTION_CODE[action]) 
    state = new_state
    # step is the way to commit an action
    time.sleep(0.05)
    steps += 1
    
env.close()

Note that although our agent looks random and stupid now, there is a key difference from a truly random player.
```
new_state, reward, done, _ = env.step(ACTION_CODE[action])
```
It collects observations and acts ACCORDING to what has been observed in the next step. It is also aware of how well or poorly it has performed, by collecting the `reward`. In other words, it has a brain, which opens the door to learning.

So we are ready to set up our learning problem:
* create a Python file `models.py` in our `atari_a3c` module folder.

Put the definitions of the convolutional representation, the decision-making component (called `Actor`) and the Agent there as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from .utils import weights_init


class StateRepresentor(torch.nn.Module):
    def __init__(self, in_channels):
        super(StateRepresentor, self).__init__()
        conv1 = nn.Conv2d(in_channels, 32, 3, stride=2, padding=1)
        conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv_layers = nn.Sequential(
            conv1, nn.ELU(),
            conv2, nn.ELU(),
            conv3, nn.ELU(),
            conv4, nn.ELU())
        self.apply(weights_init)

    def forward(self, state_tensor):
        num_samples = state_tensor.shape[0]
        feats = self.conv_layers(state_tensor)
        feats = feats.view(num_samples, -1)  # unravel features, one row per sample
        return feats


class Actor(torch.nn.Module):
    def __init__(self, feat_num, action_num):
        super(Actor, self).__init__()
        self.h_linear = nn.Linear(feat_num, 256)
        self.a_linear = nn.Linear(256, action_num)

    def forward(self, feats):
        h = F.elu(self.h_linear(feats))
        lp = self.a_linear(h)
        return lp


class Agent(torch.nn.Module):
    def __init__(self, input_channels, feat_num, actions):
        super(Agent, self).__init__()
        self.repnet = StateRepresentor(input_channels)
        self.actor = Actor(feat_num, actions)

    def forward(self, state_tensor):
        return self.actor(self.repnet(state_tensor))

```

Also put the playing routine into a separate file `learn.py`.

```python
import torch
import torch.nn as nn
import time


def play(env, agent, first_state,
         max_steps=20, render=False, action_code=[0, 2, 3]):
    done = False
    steps = 0
    state = first_state
    while not done and steps < max_steps:
        if render:
            env.render()

        state_tensor = torch.from_numpy(state).unsqueeze(0)
        action_likelihood = agent(state_tensor)
        action_prob = nn.functional.softmax(action_likelihood, dim=1)
        action = action_prob.multinomial(1)

        new_state, reward, done, _ = env.step(action_code[action])
        state = new_state
        if render:
            time.sleep(0.05)
        steps += 1
```
Note that I added some initialisation steps when defining the net, which are optional. Such helper functions are in `utils.py`.

Also, let's put useful functions/classes in `atari_a3c/__init__.py` to clean up our imports.

Start over and check the following conceptually clean program:

In [None]:
from atari_a3c import play, Agent, create_atari_env

env = create_atari_env('PongDeterministic-v4')
agent = Agent(input_channels=4, feat_num=288, actions=3)
state = env.reset()
play(env, agent, state, max_steps=50, render=True)
env.close()

## Learning the policy

Next we perform learning. By "learning", we mean

- play the game using the current agent
- collect relevant information from the environment along the way, i.e. the information returned by the `env.step()` operation
- adjust the agent's parameters, so the behaviour of the agent improves a bit in each step
- by _improve_, we mean to collect more `reward` in each _episode_
- by _episode_, we mean the period from `env.reset()` to the step where the returned `done` flag is True.

Immediately, we find it is necessary to enhance our `play` function a bit to collect the information for learning:

We add a record book called `trajectory`, and add an entry to each list in the trajectory at each playing step:
```python
trajectory = {'states': [],
              'rewards': [],
              'actions_logprob': [],
              'actions': []}
              
     ...
     
trajectory['states'].append(state)
trajectory['rewards'].append(reward)
trajectory['actions'].append(action)  # extract single-num
trajectory['actions_logprob'].append(action_logprob[0, action])
```

The information `state`, `reward` and `action` is easy to understand. You might be confused by the log-probability of an action. This is a key to our learning. We will revisit it below; for now, the calculation is:

> `action_logprob = nn.functional.log_softmax(action_likelihood, dim=1)`

which consists of $3$ numbers corresponding to the (log) probability weights of the "UP", "STAY" and "DOWN" actions computed by our agent. Note
1. For our learning, we need only one of the 3 numbers, the one corresponding to the action we actually took. So we `append(action_logprob[0, action])`, where the 0 refers to the first (and only) sample -- the shape of `action_logprob` is $1 \times 3$ in each step. For example, if the action randomly drawn is 1, then `action_logprob[0, 1]` is recorded.
2. Unlike other values, such as `action`, which are simple numbers, we need to keep the `tensor` data structure. I.e. in our record, we do not only keep the result of this log-probability as a value, but also its entire computational history -- how it is computed from the states and particularly from the __model parameters__ is also saved. This is therefore the key to figuring out how to adjust our model parameters.




In [1]:
from atari_a3c import play, Agent, create_atari_env

In [2]:
env = create_atari_env('PongDeterministic-v4')
agent = Agent(input_channels=4, feat_num=288, actions=3)
state = env.reset()
trj, state, done = play(env, agent, state, max_steps=20)

Please check `trj`. For now, it is strongly recommended that you attempt to invent your own scheme to adjust the agent's parameters. Recall that the typical learning procedure for a torch model is

1. to allocate an `optimiser` to hold and handle all the parameters
2. to design some __objective__ to __minimise__
3. to __back-propagate__ the desired change of the objective
4. to let the optimiser update the model parameters

E.g. 

In [None]:
# 1. optimiser
from torch.optim import Adam
optim = Adam(agent.parameters())

# 2. objective
accumu_reward = 0
for r_ in trj['rewards']:
    accumu_reward += r_
L = -accumu_reward  # we want to maximise reward, 
# but optimisers only minimise things, so we negate the objective

# 3. back-prop
L.backward()

# 4. update parameters
optim.step()

Does the above scheme (and/or your own trial) work? If not, what is in the way?

Below is a more formal discussion of the difficulty of finding directions in which to adjust the parameters. You can skip it on a first reading.

__THEORY STARTS__

Formally, now we are at the core of the problem: we have collected a "trajectory" of 

> $[s_0, a_0, r_0]$, $[s_1, a_1, r_1]$, ...

using our current policy $\pi_{\theta}$. The key question is how we should adjust $\theta$ to improve our policy? I.e. we need a direction $\nabla_\theta$ in the space of $\Theta$, so if we move $\theta$ along $\nabla_\theta$, hopefully our policy $\pi_{\theta_{New}}$ would be better at collecting rewards.

In standard supervised learning, we compare the model output $\hat{y}(\theta)$ with some ground-truth $y$, and get a loss $L(\theta)$. Then the back-propagation algorithm applies the chain rule to compute a direction along which to change $\theta$ so as to minimise $L(\theta)$. Note we intentionally denote the dependency on the parameters $\theta$ in the former statement.

However, in the above learning scheme, the objective is to maximise the reward, which is __returned by the environment__. Since the model of how the world works is unknown, we cannot directly compute the influence of $\theta$ on the reward. Moreover, although the action probability is directly influenced by $\theta$, the action itself is the result of a stochastic process: the chain of derivative computations does not pass through the stochastic operation of "drawing samples from a multinomial distribution" back to the parameters of the distribution.

In a seminal [paper][1], Sutton et al. established that the influence of the parameters of the agent model on the __expected accumulated return__ has the form

$$
\frac{\partial J(\theta)}{\partial \theta_k} = 
\mathbf{E}_{s\sim d^\pi, a\sim \pi(\cdot|s)}\big[ 
  \frac{\partial \log \pi(a|s)}{\partial \theta_k} A^{\pi}(s, a)
\big]
$$

[1]: https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf

* $J$ represents the target, expected total reward.

* $d^\pi$ is the __stationary probability distribution__ of the states following policy $\pi$, which is a fancy name for the concept: 
> If one follows policy $\pi$ (our agent) and plays with the game environment for infinitely many steps, collecting all states encountered and discarding the time/order in which they arrived, what would the distribution of the different states in her collection be?

* $a\sim \pi(\cdot|s)$ simply means following $\pi$ at each state, as we did in the steps drawing a multinomial random number.

* $A^\pi(s, a)$ is a real number, representing an assessment of how good or poor the outcome is when following $\pi$ from the state-action pair $(s, a)$.
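A tiny worked example may help to see that such an expectation can be estimated from sampled actions alone. The sketch below is a deliberately simplified setting, assumed for illustration: a single state (so $d^\pi$ plays no role), three actions, a softmax policy whose logits act as the parameters $\theta$, and made-up rewards standing in for $A^\pi$. For a softmax policy, $\partial \log\pi(a)/\partial\theta_j = \mathbb{1}[j=a] - \pi(j)$. The code compares the exact gradient of the expected reward with a Monte Carlo estimate that weights this derivative by the reward of each sampled action.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def grad_log_pi(probs, a):
    """d log pi(a) / d z_j for a softmax policy: 1[j == a] - pi(j)."""
    return [(1.0 if j == a else 0.0) - p for j, p in enumerate(probs)]


theta = [0.5, 0.0, -0.5]  # logits, playing the role of the parameters
rewards = [1.0, 0.0, -1.0]  # made-up reward for each action
probs = softmax(theta)

# Exact gradient of J = sum_a pi(a) r(a), written as an expectation over a ~ pi.
exact = [sum(probs[a] * rewards[a] * grad_log_pi(probs, a)[j] for a in range(3))
         for j in range(3)]

# Monte Carlo estimate that only uses sampled actions and their rewards.
rng = random.Random(0)
n = 20000
estimate = [0.0, 0.0, 0.0]
for _ in range(n):
    a = rng.choices(range(3), weights=probs)[0]
    g = grad_log_pi(probs, a)
    for j in range(3):
        estimate[j] += rewards[a] * g[j] / n

print(exact)
print(estimate)
```

The two printed vectors should agree closely, and the gradient points towards raising the logit of the rewarding action and lowering that of the penalised one.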


__THEORY ENDS__

The idea is intuitive: instead of computing the influence of the model parameters on the total reward (because we cannot), we compute the influence of the model parameters on the action taken, and weigh that influence by a factor related to the outcome of the agent's interaction with the environment. So
- if an action leads to a great reward, we adjust our agent so that in future, in similar states, the chance of taking the same action increases
- on the contrary, if an action leads to a penalty, we try to avoid it when similar states are encountered in future.

Note that this is the reason behind our practice of recording only the log-probability of the action that was actually taken:
> `trajectory['actions_logprob'].append(action_logprob[0, action])`

rather than

> `trajectory['actions_logprob'].append(action_logprob)`

because we don't know what to do with our calculations about hypothetical decisions. We can only assess decisions we have actually committed.

All of the analysis is theoretical so far. If we want to realise the scheme, here is a list of immediate issues and quick-and-dirty solutions:
- we don't know how to compute the expectation over an unknown distribution (i.e. all possible outcomes of following our agent $\pi$)
    - use just one trajectory to "estimate" (actually, to represent) the expectation
- we don't know the assessment of a (state, action) pair if we let our agent start from there
    - sum up the received rewards until we reach `done==True`, i.e. the end of an _episode_
- (following the above) what if an episode never ends, or runs so long that it is effectively endless for my computer?
    - just take some maximum number of steps, and cut the computation there (so you see the reason why we have the `max_steps` argument in `play()`)
    
Let's try the simple idea:
```python
from torch.optim import Adam


def estimate_policy_gradient_v0(trajectory, optim):
    total_reward = 0
    total_objective = 0.0
    for r, logp in zip(reversed(trajectory['rewards']),
                       reversed(trajectory['actions_logprob'])):  # traverse the experience
                       # backwards, it is more convenient to get the total reward at each 
                       # time step this way
        total_reward += r  # the accumulated reward starting from this time-step
        # until the end of the episode
        total_objective += total_reward * logp

    total_objective = - total_objective
    optim.zero_grad()
    total_objective.backward()
    optim.step()
    

def policy_gradient(env, agent):
    optim = Adam(agent.parameters(), lr=5e-5)
    state = env.reset()

    while True:  # never stops learning!
        trj, state, done = play(env, agent, state, max_steps=20)
        if done:
            state = env.reset()  # start over when game-over
            # ... do evaluation to check progress ...

        estimate_policy_gradient_v0(trj, optim)
```

There are some useful things to do during training, such as evaluating the policy periodically to see if everything works, or saving the latest model parameters from time to time:
```python
            if report_train_steps > report_every_n_steps:
                # let's check the agent's performance
                # use a separate name so the training trajectory `trj` is kept
                eval_trj, _, _ = play(env, agent, state, max_steps=1000000)
                state = env.reset() # this game has been consumed by testing, so 
                # we need to start-over again.
                print("Train {}: Test score {}".format(
                    train_steps, sum(eval_trj['rewards'])))
                report_train_steps = 0
                
                # optional: save intermediate models
                state_dict = agent.state_dict()
                torch.save(state_dict,
                           'atari_a3c/checkpoints/'
                           'vanilla_{}.pth'.format(train_steps))
```

And add a testing script:
```python
from atari_a3c import Agent, create_atari_env
from atari_a3c.learn import policy_gradient

if __name__ == '__main__':
    g_env = create_atari_env('PongDeterministic-v4')
    g_agent = Agent(input_channels=4, feat_num=288, actions=3)
    policy_gradient(g_env, g_agent)
```

If you follow this scheme, don't expect much. I saved this version of the implementation as `learn_v1.py` and `test_v1.py` respectively. You can try it yourself, but on my MacBook Pro, the progress was so slow that I didn't see much happening at all.

The main difficulty lies in the fact that we are doing stochastic estimation in a very large space using very few samples:
- actually, the expectation over $s\sim d^\pi$ is performed using a single sample! (after each update, the policy $\pi$ changes, and how often a state is visited under $\pi$ changes along with $\pi$, so for each particular $\pi$, the estimation is done using a single sample)
- the space is large -- every possible combination of individual image pixels (4-stacked) is a member of the state space
- the estimation of the expected future return is too crude -- we simply look forward 20 steps and see how well the policy performed during that short period

So one must expect very high variance in the estimation, so much so that failure is almost certain. And it is those challenges that make Artificial Intelligence fascinating: how does our brain manage to learn anything despite the difficulties?

We will adopt several techniques to improve our learning agent. First and foremost,

### Variance reduction by improving advantage estimation

Generally, it is relatively easy to calculate how to change the parameters to increase or decrease the chance of taking a particular action in future. The difficult part, however, is to assess how good or poor the state is that a particular decision has led us to. And since we are considering the probability of choosing the actions, it is the relative advantage of one action over another that matters, not the absolute value. So it is proposed (in the same paper we mentioned above) to introduce a critic model, which estimates the overall value of a state.

By subtracting the average expected future return from the evaluation of each (state, action) pair, we are looking at their relative fitness for the task. This does not change the expectation, but can significantly reduce the variance of the estimation. Given the fact that we have only one sample, the method is known to speed up training.
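The claim "same expectation, smaller variance" can be checked exactly in a toy case. The sketch below assumes a single state with three actions, a softmax policy and made-up returns that are all large but differ only slightly. It enumerates the three possible single-sample gradient estimates (for the first logit only) and computes their exact mean and variance, once without a baseline and once with the average return subtracted. `grad_stats` is a helper name made up for this example.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


probs = softmax([0.5, 0.0, -0.5])
returns = [10.0, 11.0, 12.0]  # made-up returns: all large, differing slightly


def grad_stats(baseline):
    """Exact mean and variance of the single-sample estimator
    (return - baseline) * d log pi(a) / d z_0, taken over a ~ pi."""
    samples = [(returns[a] - baseline) * ((1.0 if a == 0 else 0.0) - probs[0])
               for a in range(3)]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var


value = sum(p * r for p, r in zip(probs, returns))  # average return under pi
mean_raw, var_raw = grad_stats(0.0)
mean_base, var_base = grad_stats(value)
print(mean_raw, var_raw)
print(mean_base, var_base)
```

The means coincide because the expected derivative of the log-probability is zero, so any constant baseline drops out of the expectation; only the spread of the individual estimates changes.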

Specifically, let us allocate a critic net, which has a very similar structure to the agent itself, but generates one real number as the estimate of the long-term return starting from some state $s$ and following a policy $\pi$.
```python
class Critic(torch.nn.Module):
    def __init__(self, feat_num):
        super(Critic, self).__init__()
        self.h_linear = nn.Linear(feat_num, 256)
        self.q_linear = nn.Linear(256, 1)

    def forward(self, feats):
        h = F.elu(self.h_linear(feats))
        v = self.q_linear(h)
        return v
```

When playing and collecting information, the critic's assessment of each state is saved as well:
```
def play(...):
    ...
    trajectory = {...
                  'critic_values': []}
    while not done and steps < max_steps:
        ...        
        action_likelihood, critic_val = agent(state_tensor)
        ...
        trajectory['critic_values'].append(critic_val)
```

When training the model, we make several updates to the gradient estimation:
- __[A]__ computing the advantage by subtracting the predicted value from the accumulated return

- __[B]__ including training for the critic, i.e. by requiring the estimated value of a state to be close to the actual outcome

- __[C]__ including a discount factor $\gamma$, so when accumulating the future reward, we are effectively looking at a finite time horizon (when a reward is too remote in time, you may not want it to affect the current decision -- such influence, if it arises from data, is more likely a coincidence than a reflection of any pattern)

```python
def estimate_policy_gradient(trajectory, optim, gamma=0.99):
    total_reward = 0
    total_objective = 0.0
    actor_objective = 0.0
    critic_objective = 0.0
    for r, logp, cv in zip(reversed(trajectory['rewards']),
                       reversed(trajectory['actions_logprob']),
                       reversed(trajectory['critic_values'])):
        # the accumulated reward starting from this time-step
        # until the end of the episode
        total_reward = total_reward*gamma + r  # [C]
        advantage = total_reward - cv  # [A]
        actor_objective += advantage.detach() * logp  # for actor,
        # we treat advantage as a coefficient, not some value to adjust

        critic_objective += advantage ** 2  # [B]

    total_objective = critic_objective - actor_objective
    optim.zero_grad()
    total_objective.backward()
    optim.step()
```
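The backward accumulation in steps [C] and [A] can be checked by hand on a short made-up trajectory. The sketch below uses plain Python numbers instead of tensors, and a small $\gamma=0.5$ so that the arithmetic is easy to follow; `discounted_returns` is a helper name made up for this example.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go at every step, computed by walking backwards."""
    running = 0.0
    out = []
    for r in reversed(rewards):
        running = running * gamma + r  # same update as step [C]
        out.append(running)
    out.reverse()  # restore chronological order
    return out


rewards = [0.0, 0.0, 1.0]
critic_values = [0.5, 0.5, 0.5]  # made-up critic predictions
returns = discounted_returns(rewards, gamma=0.5)
advantages = [g - v for g, v in zip(returns, critic_values)]  # step [A]
print(returns)     # [0.25, 0.5, 1.0]
print(advantages)  # [-0.25, 0.0, 0.5]
```

With `gamma=1.0` the same function gives the plain accumulated reward, `[1.0, 1.0, 1.0]` for this trajectory.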

This algorithm works, but it is very slow. You can see it making progress (such as staying in the game for longer), but it takes a very long time to learn any winning policy.

### Variance reduction by parallel learning

As the current Monte Carlo estimation scheme employs only one sample to represent the expectation, an obvious improvement is to collect more samples. However, in reinforcement learning, data are generated within the training process, so the process is inherently serial. To have more samples, a natural way is to run multiple game sessions simultaneously, and make a gradient estimation in each of the sessions. Then we use the average of the estimated gradients to update our model.
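Why averaging helps can be seen in a small simulation. The sketch below makes strong simplifying assumptions: each worker's estimate of one gradient component is modelled as a made-up true value plus independent Gaussian noise, which ignores the fact that real workers share and change the same model. Under these assumptions, averaging 8 workers shrinks the variance by roughly a factor of 8.

```python
import random
import statistics

rng = random.Random(0)
TRUE_GRADIENT = 1.0  # made-up "true" value of one gradient component


def noisy_estimate():
    """One worker's single-trajectory estimate: right on average, but noisy."""
    return TRUE_GRADIENT + rng.gauss(0.0, 2.0)


def averaged_estimate(num_workers):
    return sum(noisy_estimate() for _ in range(num_workers)) / num_workers


single = [averaged_estimate(1) for _ in range(2000)]
pooled = [averaged_estimate(8) for _ in range(2000)]
print(statistics.pvariance(single), statistics.pvariance(pooled))
```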

There are mainly two ways to implement the scheme:
1. Server-Client Mode
    - Build a "master-model", which is like a centralised server. 
    - Each learning worker uses the up-to-date model "pulled" from the server to generate trajectory data.
    - Each worker applies the gradient it computed to the server model. 

2. Shared Model Mode
    - Create a shared model
    - Each learning worker uses this shared model 
    - The individually computed gradients are applied to the shared model
    
Traditionally, method-1 is easier to implement, but fortunately, `pytorch` provides a very nice parallelisation method that automatically handles issues such as one worker using the shared model to compute the action probability while another is trying to update the parameters of the nets. So we choose the second one. 

The change of code is then minimal. 

```python
import torch.multiprocessing as mp

def start_learning(num_workers, env_name):
    shared_agent = Agent(input_channels=4, feat_num=288, actions=3)
    shared_agent.share_memory()  # move the parameters into shared memory,
    # so every worker process sees and updates the same tensors.
    processes = []
    for rank in range(num_workers):
        p = mp.Process(target=policy_gradient,
                       args=(rank, env_name, shared_agent))
        p.start()
        processes.append(p)

    # Separate evaluation process
    p = mp.Process(target=evaluate_policy, args=(env_name, shared_agent))
    p.start()
    processes.append(p)

    for p in processes:
        p.join()
```

Note that multiprocess learning also allows me to move evaluation (along with all the scheduling of evaluation, saving training progress, etc.) to a separate process. So we can afford much nicer, more informative evaluation during training without introducing clutter and bugs into the learning code. The new learner is `learn_mp.py`, and a simple test is available via `test_02.py`.

The efficiency is better, but this program is still very slow to converge and shows strong instability.

If you are interested, please consider devising your own method to improve it (of course, based on existing techniques, this is a viable path for Assignment 2 and 3).

There is a lot going on in this challenging and exciting area. We will introduce more work in the class following the stuvac (and put a reference list here afterwards).

To further help your study, below is a glossary that might be useful when studying the literature. Also, a faster working implementation of the A3C algorithm is  provided [here](https://github.com/junjy007/aifoundation/tree/master/utils/a3c).

## Math Glossary
Some mathematical notation for useful concepts:

- $s_t$: the state observed at time $t$. This is usually but NOT always the stuff returned to the agent at the time of taking actions. The most prominent exception is the screen-image-based states in our video-game playing examples, where one applies some simple preprocessing to the states.

- $a_t$: the action taken at time $t$. Generally, it is an integer in $\{0, 1, ..., K-1\}$ if there are $K$ different actions. Keep in mind that the actual action could be represented differently, such as "press A-button". For a decision-making agent, given all possible action choices, choosing actions is equivalent to choosing the indexes.

- $r_t$: the immediate reward received at time $t$. Note some authors let $r_t$ refer to the reward received __after__ taking action $a_t$ in state $s_t$, while others take $r_t$ as the reward received __at the beginning__ of time $t$, after taking action $a_{t-1}$ in state $s_{t-1}$. Either way, the procedure of taking action $a_t$ in $s_t$ according to some policy $\pi$, arriving at the next state $s_{t+1}$ and receiving a reward $r_{t}$ (or $r_{t+1}$, subject to your choice of notation) is called a __transition step__.

- $\pi(\cdot|s)$: given the state $s$, a policy $\pi(a|s)$ (not to be confused with $\pi\approx3.1416$) assigns a non-negative real number to each action -- it dictates the probability of choosing the action given $s$. If a policy is deterministic, rather than stochastic, it attributes all the probability to one particular action, so that the corresponding $\pi(a|s)=1$ and $\pi(a'|s)=0$ for all other actions $a'$.

- $Q^\pi(s, a)$: the evaluation of the __long term__ return for taking action $a$ in state $s$. Since it considers future effects, it relies on the on-going policy, $\pi$. Note that the very first step of this evaluation, taking $a$ at the current state $s$, is not necessarily chosen according to $\pi$. Consider this $Q$-evaluation as answering a hypothetical question: what would the long term reward have been if she took action $a$ at $s$ and followed $\pi$ thereafter? Of course, if this evaluation is known, it is wise to take the action maximising this $Q$ at each $s$.

- Note in neural network implementation, $Q$- and $\pi$-nets share the same structure: map states to $K$ numbers, where $K$ is the number of actions.
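As a small illustration of how the last few entries fit together, the sketch below turns $K=3$ made-up $Q$ values for one state into a deterministic policy, i.e. a distribution that puts probability 1 on the action with the largest $Q$. `greedy_policy` is a name made up for this example.

```python
def greedy_policy(q_values):
    """Deterministic policy: probability 1 on the action with the largest Q."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


q_values = [0.2, 1.5, -0.3]  # made-up Q(s, a) for K = 3 actions
pi = greedy_policy(q_values)
print(pi)  # [0.0, 1.0, 0.0]
```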