# Introduction

## Reinforcement Learning
![rl loop](refs/rl_loop.png)
> An RL algorithm seeks to maximize the agent’s total reward, given a previously unknown environment, through a trial-and-error learning process.

## Deep Reinforcement Learning
Deep reinforcement learning solves RL problems using deep neural networks.

## Methods
![methods](refs/methods_overview.png)

# Environment (OpenAI Gym)

In [None]:
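import gym

# Written against the classic Gym API: reset() returns an observation and
# step() returns a 4-tuple. Newer Gym/Gymnasium releases changed both.
env = gym.make('Pong-v0').unwrapped
observation = env.reset()

done = False
while not done:
    env.render()
    # Random action as a placeholder until a policy chooses the action
    action = env.action_space.sample()
    observation, reward, done, _ = env.step(action)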

# Input

![before](refs/before_img.png)

![after](refs/after_img.png)

In [None]:
import numpy as np


def preprocess(img):
    """ Preprocess 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
    I = img[35:195]     # crop
    I = I[::2, ::2, 0]  # downsample by factor of 2
    I[I == 144] = 0     # erase background (background type 1)
    I[I == 109] = 0     # erase background (background type 2)
    I[I != 0] = 1       # everything else (paddles, ball) just set to 1

    # np.float was removed from NumPy; np.float64 is the same type
    return I.astype(np.float64).ravel()

# Model

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyGradient(nn.Module):
    """
    Our model class: one hidden layer mapping a preprocessed frame
    to a probability distribution over 3 actions.
    """

    def __init__(self, in_dim):
        super(PolicyGradient, self).__init__()
        self.hidden = nn.Linear(in_dim, 200)
        self.out = nn.Linear(200, 3)

        self.rewards = []
        self.actions = []

        # Weights initialization
        for m in self.modules():
            if isinstance(m, nn.Linear):
                # 'n' is number of inputs to each neuron
                n = m.in_features
                # He initialization (std = sqrt(2 / n)), suited to ReLU units
                m.weight.data.normal_(0, np.sqrt(2. / n))
                m.bias.data.zero_()

    def forward(self, x):
        h = F.relu(self.hidden(x))
        logits = self.out(h)
        # Softmax over the action dimension (dim 0 is the batch)
        return F.softmax(logits, dim=1)

    def reset(self):
        del self.rewards[:]
        del self.actions[:]

# Stochastic policy

We use a stochastic policy: the model outputs a probability distribution over all actions, _π(a | s) = probability of action a given state s_. We then sample from this distribution to pick an action.

Why a stochastic policy?

* We can use the score function gradient estimator, which tries to make good actions more probable.
* Stochastic environments.
* Partially observable states.
* The randomness inherent in the policy leads to exploration, which is crucial for most learning problems.
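Sampling from the policy's output can be sketched without any framework. The snippet below is a minimal illustration, assuming a hypothetical three-action distribution `probs`; `sample_action` is an illustrative helper and not part of the notebook's code.

```python
import random
from collections import Counter


def sample_action(probs, rng=random):
    """Draw one action index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical policy output pi(a | s) for a single state
probs = [0.7, 0.2, 0.1]

rng = random.Random(0)  # fixed seed so the run is repeatable
n_draws = 10_000
counts = Counter(sample_action(probs, rng) for _ in range(n_draws))
frequencies = [counts[a] / n_draws for a in range(len(probs))]

# Frequencies land close to probs, yet the unlikely actions are still tried
print(frequencies)
```

Because low-probability actions are still drawn occasionally, the agent keeps exploring even when the policy already prefers one action.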

In [None]:
def get_action(policy, observation):
    # The network input is the difference between the current and previous frames
    cur_state = preprocess(observation)
    state = cur_state - get_action.prev_state \
        if get_action.prev_state is not None else np.zeros(len(cur_state))
    get_action.prev_state = cur_state

    # Make a float tensor from the numpy array and add a batch dimension
    var_state = torch.from_numpy(state).float().unsqueeze(0)
    probabilities = policy(var_state)
    # Stochastic policy: roll a biased die to get an action
    action = probabilities.multinomial(num_samples=1)
    # Record action for future training
    policy.actions.append(action)
    # '+ 1' converts action to valid Pong env action
    return action.item() + 1


# No previous frame exists before the first call
get_action.prev_state = None

# Learning

## Supervised Learning

![sl](refs/sl_figure.png)

## Reinforcement Learning

![rl](refs/rl_figure.png)

# Discounted reward

> In a more general RL setting we would receive some reward \\(r_t\\) at every time step. One common choice is to use a discounted reward, so the “eventual reward” in the diagram above would become:

\begin{align}
R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}
\end{align}
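For a finite episode the sum stops at the last step, and every $R_t$ can be computed in a single backward pass. The sketch below assumes a plain list of per-step rewards; `discounted_returns` is an illustrative name rather than a function from this notebook.

```python
def discounted_returns(rewards, gamma):
    """Return [R_0, R_1, ...] where R_t = sum_k gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses R_{t+1}: R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: no reward until a final +1
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Steps closer to the final reward receive more credit, and the credit decays geometrically for earlier steps.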

But why discounted?

## The maximum expected utility (MEU) principle says...

A rational agent should choose the action that maximizes its expected utility, given its knowledge.

### Expected utility in state s with respect to policy:

$$
U^{\pi}(s) = E\left[\sum_{t = 0}^{\infty}\gamma^t R(S_t)\right]
$$

Here $S_t$ is the random variable for the state reached at time $t$ when following policy $\pi$, and the expectation is taken with respect to the probability distribution over state sequences determined by $s$ and $\pi$.
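One reason the discount factor helps can be shown with a standard geometric-series argument. The bound below is a sketch under two assumptions that are not stated above: rewards are bounded, $|R(s)| \le R_{\max}$ for every state (the symbol $R_{\max}$ is introduced here), and $0 \le \gamma < 1$.

\begin{align}
\left| \sum_{t=0}^{\infty} \gamma^t R(S_t) \right|
\le \sum_{t=0}^{\infty} \gamma^t R_{\max}
= \frac{R_{\max}}{1 - \gamma}
\end{align}

Under those assumptions the utility of an infinite sequence stays finite, so different policies can still be compared.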

### Optimal policy:

$$
\pi^*(s) = \operatorname*{arg\,max}_{a \in A(s)} \sum_{s^\prime}P(s^\prime \vert s,a)U^{\pi^*}(s^\prime)
$$

where $P(s^\prime \vert s,a)$ is the transition model.
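The arg max can be read as a one-step lookahead: for each action, average the utilities of the successor states under the transition model, then keep the best action. The sketch below assumes the utilities are already known and uses a made-up two-state MDP; `greedy_policy`, `P`, `U`, and the state and action names are all hypothetical.

```python
def greedy_policy(states, actions, P, U):
    """For each state pick the action maximizing sum_s' P(s' | s, a) * U(s')."""
    policy = {}
    for s in states:
        def expected_utility(a):
            return sum(p * U[s2] for s2, p in P[(s, a)].items())
        policy[s] = max(actions(s), key=expected_utility)
    return policy


def actions(s):
    """A(s): the same two actions are available in every state."""
    return ['stay', 'move']


# Hypothetical two-state MDP: 'stay' is safe, 'move' usually switches state
states = ['low', 'high']
P = {
    ('low', 'stay'): {'low': 1.0},
    ('low', 'move'): {'high': 0.8, 'low': 0.2},
    ('high', 'stay'): {'high': 1.0},
    ('high', 'move'): {'low': 0.8, 'high': 0.2},
}
U = {'low': 0.0, 'high': 1.0}  # assumed utilities, not computed here

print(greedy_policy(states, actions, P, U))  # {'low': 'move', 'high': 'stay'}
```

In Pong neither the transition model nor the utilities are available up front, which is why the policy is learned from sampled experience instead.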

# Deriving Policy Gradients

## Score function gradient estimator

\begin{align}
\nabla_{\theta} E_x[f(x)] &= \nabla_{\theta} \sum_x p(x) f(x) & \text{definition of expectation} \\
& = \sum_x \nabla_{\theta} p(x) f(x) & \text{swap sum and gradient} \\
& = \sum_x p(x) \frac{\nabla_{\theta} p(x)}{p(x)} f(x) & \text{both multiply and divide by } p(x) \\
& = \sum_x p(x) \nabla_{\theta} \log p(x) f(x) & \text{use the fact that } \nabla_{\theta} \log(z) = \frac{1}{z} \nabla_{\theta} z \\
& = E_x[f(x) \nabla_{\theta} \log p(x) ] & \text{definition of expectation}
\end{align}
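The identity can be checked numerically on a small discrete distribution. The sketch below assumes $p(x)$ is a softmax over three outcomes with logits `theta`, for which $\nabla_{\theta} \log p(x)$ equals the one-hot vector of $x$ minus $p$. It compares the sampled estimate with a finite-difference gradient of the exact expectation. All names and numbers are illustrative.

```python
import math
import random


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def expected_f(theta, f):
    """Exact E_x[f(x)] for x ~ softmax(theta)."""
    return sum(p * fx for p, fx in zip(softmax(theta), f))


def score_function_estimate(theta, f, n_samples, rng):
    """Monte Carlo average of f(x) * grad log p(x) over samples x ~ p."""
    p = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        x = rng.choices(range(len(theta)), weights=p, k=1)[0]
        for j in range(len(theta)):
            # For a softmax, d log p(x) / d theta_j = 1[j == x] - p_j
            grad_log_p = (1.0 if j == x else 0.0) - p[j]
            grad[j] += f[x] * grad_log_p / n_samples
    return grad


def finite_difference_grad(theta, f, eps=1e-6):
    """Central-difference gradient of E_x[f(x)], used only as a reference."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((expected_f(up, f) - expected_f(down, f)) / (2 * eps))
    return grad


theta = [0.2, -0.1, 0.5]  # hypothetical parameters (logits)
f = [1.0, -1.0, 0.5]      # hypothetical score for each outcome

estimate = score_function_estimate(theta, f, 20_000, random.Random(0))
reference = finite_difference_grad(theta, f)
print(estimate)
print(reference)  # the two should agree up to sampling noise
```

The estimator only needs samples of $x$ and the value $f(x)$. It never differentiates through $f$, which is what makes it usable when $f$ is a reward returned by an environment.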

## Interpretation

![policy distribution](refs/pg_distribution_plot.png)
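One way to read the estimator: a sample with a high score has its log-probability pushed up, and a sample with a low score has it pushed down. The sketch below applies a single gradient-ascent step to a softmax policy over three actions. `reinforce_step`, the learning rate, and the rewards are illustrative assumptions.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(theta, action, reward, lr):
    """One gradient-ascent step on reward * log p(action)."""
    p = softmax(theta)
    return [
        t + lr * reward * ((1.0 if j == action else 0.0) - p[j])
        for j, t in enumerate(theta)
    ]


theta = [0.0, 0.0, 0.0]  # uniform policy over 3 actions
before = softmax(theta)

# Hypothetical experience: action 2 was sampled and earned reward +1
after_good = softmax(reinforce_step(theta, action=2, reward=1.0, lr=0.5))
# Same action, but this time it earned reward -1
after_bad = softmax(reinforce_step(theta, action=2, reward=-1.0, lr=0.5))

# Probability of action 2: unchanged baseline, after +1, after -1
print(before[2], after_good[2], after_bad[2])
```

The probability of the sampled action rises after a positive reward and falls after a negative one, while the other actions absorb the difference.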

# Weights visualization

![weights](refs/weights.png)

# Improvements
* Hyperparameter tuning
* ConvNets
* Move penalty

_References:_
* Andrej Karpathy's blog post, [Deep Reinforcement Learning: Pong from Pixels](http://karpathy.github.io/2016/05/31/rl/)
* John Schulman's thesis, [Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs](http://joschu.net/docs/thesis.pdf)
* Stuart Russell and Peter Norvig, _Artificial Intelligence: A Modern Approach, 3rd Edition_
* Richard S. Sutton and Andrew G. Barto, _Reinforcement Learning: An Introduction, 2nd Edition_ (draft, in progress)

_Also:_
* https://www.quora.com/How-do-we-benefit-from-stochastic-policies-in-reinforcement-learning