-----
## PacMan Gym
### Evgeny Egorov, Ruslan Kostoev, Vladislav Ishimtsev


#### SkolTech, Spring 2017
-----

$${\huge Data}$$
![pacman](./pacman.PNG)
**Input:** $(210 \times 160 \times 3)$ RGB image

**Output:** action $(\leftarrow, \uparrow, \downarrow, \rightarrow, \ldots)$

## The agent–environment interaction in reinforcement learning (AgentNet)

![agent](./agent.PNG)

* An agent implementation may contain these parts:
 * Observation(s)
 * Memory layer(s)
 * Policy layer
 * Resolver 

$${\huge Resolvers}$$

## Greedy Play 
The agent always plays the move that leads to the position it rates best,
i.e. the action with the highest estimated value.
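
As a minimal sketch of the greedy rule above, assuming the policy layer outputs one value estimate per action (the function name `greedy_action` is illustrative and not part of AgentNet):

```python
import numpy as np

def greedy_action(q_values):
    """Return the index of the action with the highest estimated value.

    Ties are broken in favour of the lowest index (np.argmax behaviour).
    """
    return int(np.argmax(q_values))

# Hypothetical value estimates for four actions
q = np.array([0.1, 0.7, 0.3, 0.7])
action = greedy_action(q)  # index 1: first of the two best actions
```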

## NonGreedy
Selecting a nongreedy action is exploring: it lets the agent
improve its estimate of the nongreedy action’s value.

## $\varepsilon$-greedy
Behave greedily most of the time, but every
once in a while, with small probability $\varepsilon$, select an action uniformly at random from
all the actions, independently of the action-value estimates.
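
The rule above could be sketched as follows. This is an illustration under assumed names (`epsilon_greedy_action`, an explicit NumPy random generator), not the resolver used in the experiments:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        # Exploration: every action is equally likely, regardless of its value
        return int(rng.integers(len(q_values)))
    # Exploitation: the action with the highest estimated value
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.array([0.2, 1.5, -0.3])
always_greedy = epsilon_greedy_action(q, epsilon=0.0, rng=rng)   # greedy choice, index 1
always_random = epsilon_greedy_action(q, epsilon=1.0, rng=rng)   # some index in {0, 1, 2}
```

Setting `epsilon=0` recovers greedy play, and `epsilon=1` gives a purely random policy.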

## Probabilistic
Samples actions with the probabilities given by the input layer.
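
A minimal sketch of probabilistic action selection, assuming the incoming layer yields non-negative scores that can be normalised to probabilities (the name `probabilistic_action` is hypothetical):

```python
import numpy as np

def probabilistic_action(probs, rng):
    """Sample an action index according to the given action probabilities."""
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()  # guard against small normalisation errors
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
sampled = probabilistic_action([0.1, 0.6, 0.3], rng)  # an index in {0, 1, 2}
certain = probabilistic_action([0.0, 1.0, 0.0], rng)  # always index 1
```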

$${\huge RL~techniques}$$

# Q-learning
One-step Q-learning is defined by
$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1}+\gamma \max\limits_aQ(S_{t+1},a) - Q(S_t, A_t)]
$$
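
To make the update concrete, here is a tabular sketch of one Q-learning step. The table layout (`Q[state, action]`) and the numbers are assumptions for illustration; the models in this project approximate $Q$ with a neural network instead of a table.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one one-step Q-learning update to the table Q in place."""
    # Off-policy target: bootstrap from the best action in the next state
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((3, 2))      # 3 states, 2 actions (hypothetical sizes)
Q[1] = [1.0, 2.0]         # pretend state 1 already has value estimates
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
# Q[0, 1] is now approximately 0.5 * (1.0 + 0.9 * 2.0) = 1.4
```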


# SARSA
$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t, A_t)]
$$

If $S_{t+1}$ is terminal, then $Q(S_{t+1}, A_{t+1})$ is defined as zero.
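
A tabular sketch of the SARSA step, including the terminal-state convention. As before, the table layout and numbers are illustrative assumptions:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """Apply one SARSA update to the table Q in place."""
    # On-policy target: bootstrap from the action actually taken next;
    # the value of a terminal state is defined as zero
    next_value = 0.0 if terminal else Q[s_next, a_next]
    Q[s, a] += alpha * (r + gamma * next_value - Q[s, a])
    return Q

Q = np.zeros((3, 2))
Q[1] = [1.0, 2.0]
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
# Q[0, 1] is now approximately 0.5 * (1.0 + 0.9 * 1.0) = 0.95

sarsa_update(Q, s=2, a=0, r=1.0, s_next=1, a_next=1, alpha=0.5, gamma=0.9, terminal=True)
# Q[2, 0] is now 0.5 * 1.0 = 0.5, because the bootstrap term is dropped
```

Compared with Q-learning, the only difference is that the target uses $Q(S_{t+1}, A_{t+1})$ for the action the policy actually chose rather than the maximum over actions.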


# Actor Critic
$$
\begin{aligned}
\delta_t &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t),\\
\pi_t(a|s) &= \dfrac{e^{h(a|s)}}{\sum_b e^{h(b|s)}} \quad \text{(Gibbs softmax method)},\\
h(A_t|S_t) &\leftarrow h(A_t|S_t) + \beta\delta_t[1 - \pi_t(A_t|S_t)]
\end{aligned}
$$
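
The three formulas can be combined into one tabular step, sketched below. The critic update $V(S_t) \leftarrow V(S_t) + \alpha\delta_t$ is not stated on this slide; it is the usual TD(0) rule and is included here as an assumption, as are the table shapes and the function names.

```python
import numpy as np

def softmax(h):
    """Gibbs softmax over action preferences, shifted for numerical stability."""
    e = np.exp(h - np.max(h))
    return e / e.sum()

def actor_critic_step(V, H, s, a, r, s_next, alpha, beta, gamma, terminal=False):
    """One actor-critic update on state values V[s] and preferences H[s, a]."""
    next_v = 0.0 if terminal else V[s_next]
    delta = r + gamma * next_v - V[s]         # TD error from the critic
    pi = softmax(H[s])                        # policy before the update
    V[s] += alpha * delta                     # critic update (assumed TD(0) rule)
    H[s, a] += beta * delta * (1.0 - pi[a])   # actor update from the slide
    return delta

V = np.zeros(2)
H = np.zeros((2, 3))
delta = actor_critic_step(V, H, s=0, a=2, r=1.0, s_next=1,
                          alpha=0.1, beta=0.5, gamma=0.9)
# delta == 1.0; the policy was uniform (1/3 each), so H[0, 2] grows by 0.5 * (1 - 1/3)
```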


$${\huge Models}$$

## Baseline (CNN)
* Observations: Grayscale input image $(84, 110)$
* Memory layer: element-wise maximum over the last 5 images (to avoid flickering)
* Policy layer: 
    * Convolution with 16 $8\times8$ filters with stride 4
    * Convolution with 32 $4\times4$ filters with stride 2
    * Dense Layer with 256 units
    * Dense Layer with 10 units (number of actions)
* Resolver:
    * $\varepsilon$-greedy resolver
* Learning target:
    * Q-value
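
The observation and memory steps listed above might look roughly like this in plain NumPy. This is a sketch under assumptions: the grayscale conversion is a simple channel average, resizing to $(84, 110)$ is left out, and the class name `MaxFrameMemory` is hypothetical rather than an AgentNet layer.

```python
from collections import deque

import numpy as np

def to_grayscale(frame_rgb):
    """Convert an (H, W, 3) uint8 frame to an (H, W) float32 image in [0, 1]."""
    return frame_rgb.astype(np.float32).mean(axis=2) / 255.0

class MaxFrameMemory:
    """Keeps the last n frames and returns their element-wise maximum."""

    def __init__(self, n_frames=5):
        self.frames = deque(maxlen=n_frames)  # oldest frame is dropped automatically

    def update(self, frame):
        self.frames.append(frame)
        return np.max(np.stack(self.frames), axis=0)

memory = MaxFrameMemory(n_frames=5)
raw = np.zeros((210, 160, 3), dtype=np.uint8)   # placeholder Atari-sized frame
observation = memory.update(to_grayscale(raw))  # shape (210, 160)
```

Taking the maximum means an object drawn only on some frames still appears in the observation passed to the network.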

## Fat CNN
* Observations: Grayscale input image $(84, 110)$
* Memory layer: element-wise maximum over the last 5 images (to avoid flickering)
* Policy layer: 
    * Convolution with 16 $8\times8$ filters with stride 2
    * Convolution with 32 $4\times4$ filters with stride 2
    * Convolution with 64 $4\times4$ filters with stride 2
    * Convolution with 64 $4\times4$ filters with stride 2
    * Dense Layer with 256 units
    * Dense Layer with 10 units (number of actions)
* Resolver:
    * $\varepsilon$-greedy resolver
* Learning target:
    * Q-value
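
To compare the Baseline and Fat CNN stacks, it can help to work out the feature-map size entering the first dense layer. The sketch below assumes unpadded ("valid") convolutions with floor rounding, which the slides do not state; the helper names are illustrative.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1

def feature_map_shape(input_hw, layers):
    """Propagate an (h, w) input through (n_filters, kernel, stride) layers."""
    h, w = input_hw
    n_filters = 1
    for n_filters, kernel, stride in layers:
        h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    return n_filters, h, w

baseline = [(16, 8, 4), (32, 4, 2)]
fat = [(16, 8, 2), (32, 4, 2), (64, 4, 2), (64, 4, 2)]

baseline_shape = feature_map_shape((84, 110), baseline)  # (32, 9, 12) -> 3456 features
fat_shape = feature_map_shape((84, 110), fat)            # (64, 3, 4)  -> 768 features
```

Under these assumptions the Fat CNN has more convolutional layers and filters, yet hands a smaller flattened vector to its 256-unit dense layer than the Baseline does.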

## CNN+LSTM
* Observations: Grayscale input image $(84, 84)$
* Memory layer: element-wise maximum over the last 4 images (to avoid flickering)
* Policy layer: 
    * Convolution with 16 $8\times8$ filters with stride 4
    * Convolution with 32 $4\times4$ filters with stride 2
    * Dense Layer with 256 units
    * LSTM Layer with 256 units
    * Dense Layer with 10 units (number of actions)
* Resolver:
    * $\varepsilon$-greedy resolver
* Learning target:
    * Q-value
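
For readers unfamiliar with the recurrent part, a single LSTM step can be sketched in NumPy as below. This is a generic textbook cell, not the layer implementation used in the experiments; the weight layout and the gate order (input, forget, output, candidate) are arbitrary choices made for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step.

    x: (d,) input; h_prev, c_prev: (n,) previous hidden and cell state;
    W: (4n, d), U: (4n, n), b: (4n,) stacked gate parameters.
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:n])            # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    o = sigmoid(z[2 * n:3 * n])    # output gate
    g = np.tanh(z[3 * n:4 * n])    # candidate cell update
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

d, n = 256, 256                    # dense output feeding an LSTM with 256 units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4 * n, d))
U = rng.normal(scale=0.01, size=(4 * n, n))
b = np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), W, U, b)
```

The hidden state `h` is carried from frame to frame, which gives the agent memory beyond the few frames merged by the memory layer.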

## Actor Critic
* Observations: Grayscale input image $(84, 84)$
* Memory layer: element-wise maximum over the last 4 images (to avoid flickering)
* Policy layer: 
    * Convolution with 16 $8\times8$ filters with stride 4
    * Convolution with 32 $4\times4$ filters with stride 2
    * Dense Layer with 256 units
    * Dense Layer with 10 units (number of actions)
* Resolver:
    * $\varepsilon$-greedy resolver
* Learning target:
    * Actor Critic

$${\huge Comparison~of~models}$$

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

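def plot_npz(file_name, label, is_pkl=False):
    """Plot a logged {x: reward} dict stored as a pickled object or in an .npz archive."""
    if is_pkl:
        # Pickled log: NumPy loads it as a 0-d object array, [()] unwraps the dict
        a = np.load(file_name, allow_pickle=True)[()]
    else:
        # .npz archive: take the first stored array and unwrap the dict inside it
        data = np.load(file_name, allow_pickle=True)
        a = data[data.files[0]][()]
    x = sorted(a)
    y = [a[k] for k in x]
    plt.plot(x, y, label=label)
    plt.legend(loc='best')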

plt.figure(figsize=(12,8))
plt.subplot(2,1,1)
plot_npz('stuff/greedy.npz', 'Baseline')
plot_npz('stuff/greedy_fat_log.pcl', 'Fat CNN', True)
plot_npz('stuff/greedy_lstm_log.pcl', 'CNN+LSTM', True)
plt.xlabel('Epoch')
plt.ylabel('MA reward')
plt.title('Greedy', fontsize=20);
plt.subplot(2,1,2)
plot_npz('stuff/egreedy.npz', 'Baseline')
plot_npz('stuff/egreedy_fat_log.pcl', 'Fat CNN', True)
plot_npz('stuff/egreedy_lstm_log.pcl', 'CNN+LSTM', True)
plt.xlabel('Epoch')
plt.ylabel('MA reward')
plt.title('$\\varepsilon$-Greedy', fontsize=20);
plt.tight_layout()
plt.savefig('stuff/models', dpi=300);

## Baseline vs Fat CNN vs CNN+LSTM

![models](./stuff/models.png)
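
The y-axis of these plots is labelled "MA reward", which presumably stands for a moving average of the reward. A smoothing step of that kind could be sketched as follows; the window length and the function name are assumptions, and the input values are placeholders rather than experimental data.

```python
import numpy as np

def moving_average(rewards, window):
    """Average each run of `window` consecutive rewards (no padding at the ends)."""
    rewards = np.asarray(rewards, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode='valid')

smoothed = moving_average([1, 2, 3, 4, 5], window=3)  # approximately [2, 3, 4]
```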

In [9]:
plt.figure(figsize=(12,5))
plot_npz('stuff/greedy_fat_log.pcl', 'Fat CNN', True)
plot_npz('stuff/greedy_sarsa_log.pcl', 'Fat CNN (SARSA)', True)
plt.xlabel('Epoch')
plt.ylabel('MA reward')
plt.title('Fat CNN', fontsize=20);
plt.savefig('stuff/sarsa', dpi=300);

## Q-Learning vs SARSA
![sarsa](./stuff/sarsa.png)

In [3]:
plt.figure(figsize=(12,5))
plot_npz('stuff/greedy_qlstm_log.pcl', 'LSTM', True)
plot_npz('stuff/greedy_lstm_log.pcl', 'LSTM (SARSA)', True)
plt.xlabel('Epoch')
plt.ylabel('MA reward')
plt.title('LSTM', fontsize=20);
plt.savefig('stuff/lstm', dpi=300);

![lstm](./stuff/lstm.png)

| Model (reward statistics) |   Min   |   Mean    |  Median   |   Max    |   Std    |
|---------------------------|:-------:|:---------:|:---------:|:--------:|:--------:|
| Fat CNN (Proba resolver)  |   80    |   201.0   |    200    |   310    | **56.9** |
| Actor Critic              |   60    |   207.8   |    110    |   830    |  196.1   |
| Baseline (CNN)            |   70    |   333.5   |    245    |   900    |  208.0   |
| Fat CNN                   |   70    |   391.2   |    440    |   1730   |  228.8   |
| Fat CNN (SARSA)           |   70    |   419.5   |    440    |   2010   |  240.1   |
| CNN+LSTM                  |   140   |   447.6   |    510    |   1330   |  311.1   |
| CNN+LSTM (SARSA)          | **140** | **480.8** |  **510**  | **1860** |  320.3   |
$${\huge Demo}$$

$$\large \text{Greedy/}\varepsilon\text{-Greedy}$$

In [1]:
import io
import base64
from IPython.display import HTML

video = io.open('stuff/lstm.mp4', 'rb').read()  # read-only binary mode is enough
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii')))

$$\large \text{Probabilistic}$$

In [2]:
video = io.open('stuff/proba.mp4', 'rb').read()  # read-only binary mode is enough
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii')))

Comparison with others on [OpenAI Gym](https://gym.openai.com/envs/MsPacman-v0)