# Deep Q-Networks and Experience Replay

Gradient methods are intuitively appealing: you just move a little bit in the downhill direction. They have a serious drawback, however: they are not sample-efficient. To see this, observe that we update our function approximation in the direction of each experience and then throw the experience away. It would make more sense to find the best-fitting value function given all of the agent's experience. We would be better off processing this experience in batches instead of discarding it after a single update.

An **experience** or **replay memory** $\mathcal D$ is a collection of transitions $(s,a,r,s')$. To train the agent, we randomly choose a minibatch from $\mathcal D$, replay that experience, and update the parameters $w$. Why choose randomly? This helps break correlations in the data. For example, if you spent the second half of an episode doing something completely useless from the reward point of view, you would not learn much from a batch made of, say, the last 10 moves. This idea was popularised for deep reinforcement learning by a [*Nature* paper in 2015](https://www.nature.com/nature/journal/v518/n7540/full/nature14236.html), which shaped much of the research in the years that followed, at least where video game playing is concerned. The DQN algorithm works as follows:

* Take action $a$ according to an $\epsilon$-greedy policy.
* Store the transition $(s,a,r,s')$ in the replay memory $\mathcal D$.
* Choose a random sample (minibatch) from $\mathcal D$.
* Compute the $Q$-learning targets with old, fixed parameters $w^-$.
* Choose the new parameters $w$ that minimize the squared error over the minibatch:
$$ \sum_{s,a,r,s'}\left ( r+ \gamma \max_{a'} Q(s',a',w^-)-Q(s,a,w)\right )^2$$
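
The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the linear form `Q(s, ., w) = w @ s` with one weight row per action, the discount `GAMMA = 0.9`, the made-up transitions, and the names `q_values` and `dqn_loss` are all hypothetical and stand in for the deep network of the paper.

```python
import random
import numpy as np

GAMMA = 0.9  # discount factor (assumed value, for illustration only)

def q_values(w, s):
    """Hypothetical linear Q-function: row a of w gives Q(s, a, w) = w[a] . s."""
    return w @ s

def dqn_loss(w, w_minus, batch, gamma=GAMMA):
    """Sum of squared errors over a minibatch, with targets from frozen w_minus."""
    total = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * np.max(q_values(w_minus, s_next))
        total += (target - q_values(w, s)[a]) ** 2
    return total

rng = np.random.default_rng(0)
replay_memory = []                       # the replay memory D
for _ in range(100):                     # fill D with made-up transitions
    s, s_next = rng.normal(size=4), rng.normal(size=4)
    replay_memory.append((s, int(rng.integers(2)), 1.0, s_next))

random.seed(0)
batch = random.sample(replay_memory, 32)  # uniform random minibatch
w = rng.normal(size=(2, 4))               # current parameters w
w_minus = w.copy()                        # old, fixed parameters w^-
loss = dqn_loss(w, w_minus, batch)        # the quantity minimized over w
```

Only `w` would change while this loss is being minimized; `w_minus` stays fixed, so the targets are constants during the fit.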


The Atari corpus consists of a number of Atari games. For each of them, the setup is as follows:
- Input: stack of raw pixels from the last 4 frames.
- Reward: change in score for the step.
- Output: $Q(s,a)$ for each of the 18 joystick/button positions.
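
As a rough illustration of the input and reward described above, the sketch below keeps the last 4 frames in a fixed-length queue and computes the reward as a score difference. The class `FrameStack`, the helper `score_reward`, and the 84x84 frame shape are assumptions made for the example, not details taken from these notes.

```python
from collections import deque
import numpy as np

N_FRAMES = 4  # number of stacked frames, as in the list above

class FrameStack:
    """Hypothetical helper that turns single frames into a stacked state."""

    def __init__(self, n_frames=N_FRAMES):
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_frame):
        # At the start of an episode, repeat the first frame to fill the stack.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        # The oldest frame is dropped automatically by the bounded deque.
        self.frames.append(frame)
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=0)

def score_reward(prev_score, new_score):
    """Reward for a step is the change in game score."""
    return new_score - prev_score

stack = FrameStack()
state = stack.reset(np.zeros((84, 84)))   # assumed frame shape
state = stack.push(np.ones((84, 84)))     # newest frame goes last
```

Stacking several frames lets the agent see motion (for example the direction of a ball), which a single frame cannot convey.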

The reported training time with this setup is 2 weeks on a GPU to reach human-level performance. A remarkable fact is that the same architecture was used for all the games.


Two tricks make DQN work (recall the non-convergence discussion from last lecture):
- Experience replay: random minibatches break the correlations between consecutive samples, as discussed above.
- Fixed Q-targets: the targets are computed with frozen parameters $w^-$ that are refreshed only occasionally, so for a while we optimise towards a target that does not move with every update.
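
The fixed-target trick can be made concrete with a small sketch. It assumes a linear Q-function with one weight row per action, a single made-up transition that is replayed repeatedly, and a hypothetical refresh period `SYNC_EVERY`; none of these come from the notes.

```python
import numpy as np

def td_target(w_minus, r, s_next, gamma=0.9):
    """Q-learning target computed with the frozen parameters w_minus."""
    return r + gamma * np.max(w_minus @ s_next)

def sgd_step(w, w_minus, transition, alpha=0.01, gamma=0.9):
    """One gradient step on the squared error; only w moves, w_minus does not."""
    s, a, r, s_next = transition
    error = td_target(w_minus, r, s_next, gamma) - (w @ s)[a]
    w = w.copy()
    w[a] += alpha * error * s   # gradient of Q(s, a, w) w.r.t. w[a] is s
    return w

SYNC_EVERY = 50                 # hypothetical refresh period for w_minus
rng = np.random.default_rng(0)
w = rng.normal(size=(2, 4))
w_minus = w.copy()
transition = (rng.normal(size=4), 0, 1.0, rng.normal(size=4))

targets = []
for step in range(1, 101):
    targets.append(td_target(w_minus, transition[2], transition[3]))
    w = sgd_step(w, w_minus, transition)
    if step % SYNC_EVERY == 0:
        w_minus = w.copy()      # refresh the frozen parameters
```

Within each block of `SYNC_EVERY` steps the recorded target is constant even though `w` changes at every step, which is the stabilising effect described above.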


## Improvements since the original DQN



### Double DQN
- **Issue**: overestimation of action values, caused by the $\max$ in the $Q$-learning target.
- It remains an open question whether this overestimation is a problem in practice.
	- What can go wrong? The agent can be too optimistic about bad actions.
- This is not a "deep learning" issue: the same happens in tabular methods (NIPS 2010).
- The current network $w$ is used to select actions.
- The older network $w^-$ is used to evaluate them.
- The error to minimize is:

$$ \sum_{s,a,r,s'}\left ( r+ \gamma Q(s', \mathrm{argmax}_{a'}Q(s',a',w),w^-)-Q(s,a,w)\right )^2$$
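
To compare the two targets side by side, here is a small sketch that assumes a linear Q-function with one weight row per action. The function names and the numbers are illustrative only.

```python
import numpy as np

def dqn_target(w, w_minus, r, s_next, gamma=0.9):
    """Standard target: w_minus both selects and evaluates the next action."""
    return r + gamma * np.max(w_minus @ s_next)

def double_dqn_target(w, w_minus, r, s_next, gamma=0.9):
    """Double DQN target: w selects the action, w_minus evaluates it."""
    a_star = int(np.argmax(w @ s_next))            # selection with current w
    return r + gamma * (w_minus @ s_next)[a_star]  # evaluation with older w^-

w = np.array([[1.0, 0.0], [0.0, 1.0]])        # current parameters
w_minus = np.array([[0.5, 0.0], [0.0, 2.0]])  # older parameters
s_next = np.array([1.0, 0.5])

y_dqn = dqn_target(w, w_minus, 1.0, s_next)           # 1 + 0.9 * 1.0
y_double = double_dqn_target(w, w_minus, 1.0, s_next)  # 1 + 0.9 * 0.5
```

Because the evaluated value $Q(s',a^*,w^-)$ can never exceed $\max_{a'}Q(s',a',w^-)$, the Double DQN target is never larger than the standard one for the same $w^-$.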

 
### Prioritised replay
- Transitions can be more or less surprising, and some are not relevant at the agent's current level.
- Replay transitions with high expected learning progress more often.
- Store the experience in a priority queue, ordered by the magnitude of the DQN error
	$$|r+\gamma \max_{a'}Q(s',a',w^-)-Q(s,a,w)|$$
- Some noise in the selection is needed to reduce bias and loss of diversity.
- Results are similar to those in the DQN paper, but are reached faster.
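
One way to combine priorities with some randomness is to sample transitions with probability proportional to a power of their error. The sketch below assumes a linear Q-function, and the constants `exponent` and `eps` are illustrative values rather than settings taken from these notes.

```python
import numpy as np

def td_error(w, w_minus, transition, gamma=0.9):
    """Magnitude of the DQN error for one transition (linear Q assumed)."""
    s, a, r, s_next = transition
    return abs(r + gamma * np.max(w_minus @ s_next) - (w @ s)[a])

def sampling_probabilities(errors, exponent=0.6, eps=1e-3):
    """Probabilities proportional to (error + eps) ** exponent.

    eps keeps zero-error transitions selectable; an exponent below 1
    flattens the distribution, which preserves some diversity.
    """
    priorities = (np.asarray(errors, dtype=float) + eps) ** exponent
    return priorities / priorities.sum()

def sample_indices(errors, batch_size, rng):
    """Draw a minibatch of indices, favouring high-error transitions."""
    p = sampling_probabilities(errors)
    return rng.choice(len(errors), size=batch_size, p=p)

errors = [0.0, 1.0, 10.0]
p = sampling_probabilities(errors)
idx = sample_indices(errors, 1000, np.random.default_rng(0))
counts = np.bincount(idx, minlength=3)   # how often each transition was drawn
```

A strict priority queue would always replay the highest-error transitions first; sampling stochastically instead is one simple way to add the noise mentioned above.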



### Duelling network

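- Split the $Q$-network into two channels, a state-value stream $V$ with parameters $u$ and an action-advantage stream $A$ with parameters $w$:
	$$Q(s,a) = V(s,u)+ A(s,a,w).$$
- Learning is more efficient because the updates of the value function $V$ do not depend on the action, so experience gathered with one action also informs the estimates for the others.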
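
A minimal sketch of the two channels, assuming linear streams (a vector `u` for $V$ and one weight row per action in `w` for $A$). The function names are hypothetical. The second function shows the mean-subtracted variant used in the duelling-network paper, which pins down how $Q$ is split between $V$ and $A$.

```python
import numpy as np

def duelling_q(s, u, w):
    """Q(s, a) = V(s, u) + A(s, a, w) for every action a."""
    value = u @ s           # scalar V(s, u), shared by all actions
    advantage = w @ s       # vector A(s, ., w), one entry per action
    return value + advantage

def duelling_q_centred(s, u, w):
    """Variant that subtracts the mean advantage before adding V."""
    advantage = w @ s
    return u @ s + advantage - advantage.mean()

s = np.array([1.0, 2.0])
u = np.array([0.5, 0.5])
w = np.array([[1.0, 0.0], [0.0, 1.0]])

q_plain = duelling_q(s, u, w)            # [2.5, 3.5]
q_centred = duelling_q_centred(s, u, w)  # [1.0, 2.0]
```

Both variants rank the actions in the same way, since they differ only by a term that is constant across actions.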



## Q-Learning with experience replay (Linear Approximator)

In the following code sample, we use the experience replay idea, although with a twist for pedagogical purposes: instead of using neural networks, we use a simple linear function approximator.


```python
import random
import numpy as np
import gym
from scipy.optimize import minimize


class LinearEstimator:
    """Linear Q-function Q(s,a,w) = x(s,a) . w, fitted on replayed minibatches."""

    def __init__(self, gamma=0.9):
        np.random.seed(1)
        self.DIM = 5
        self.w = 2*np.random.random(self.DIM)-1
        self.gamma = gamma
        self.D = []              # replay memory
        self.batch_size = 32

    def featurize(self,s,a):
        # Four CartPole state variables plus a +/-1 encoding of the action
        x = np.zeros(self.DIM)
        x[:4] = s[:4]
        x[4] = -1 if a==0 else 1
        return x

    def predict(self,s,a):
        return np.matmul(self.featurize(s,a),self.w)

    def remember(self,s,a,r,s1,done):
        self.D.append((s,a,r,s1,done))

    def train_model(self):
        if len(self.D) < self.batch_size:
            return
        batch = random.sample(self.D,self.batch_size)

        def Q(s,a,w):
            return np.matmul(self.featurize(s,a),w)

        # Q-learning targets use the old parameters w^-, frozen during the fit.
        # The max is taken in the next state s1; terminal states bootstrap from 0.
        w_ = self.w.copy()
        targets = []
        for s,a,r,s1,done in batch:
            q_max = 0.0 if done else max(Q(s1,0,w_), Q(s1,1,w_))
            targets.append(r + self.gamma*q_max)

        def total_error(w):
            return sum((y - Q(s,a,w))**2
                       for y,(s,a,_,_,_) in zip(targets,batch))

        # Start the search from the current weights and keep the result
        res = minimize(total_error, w_)
        self.w = res.x


def epsilon_greedy_policy(estimator, epsilon, actions):
    """Return a policy that picks a random action with probability epsilon
    and otherwise the action with the highest estimated Q-value."""

    def policy_fn(state):
        if np.random.rand()>epsilon:
            best = np.argmax([estimator.predict(state,a)
                              for a in actions])
            action = actions[int(best)]
        else:
            action = int(np.random.choice(actions))
        return action
    return policy_fn


estimator = LinearEstimator(gamma=0.9)

# Written against the classic gym API (gym < 0.26): reset() returns the
# observation and step() returns (obs, reward, done, info).
env = gym.make("CartPole-v0")

n_episodes = 50000
update_freq = 100      # train every 100 environment steps
initial_train = 1000   # episodes of pure data collection before training
actions = list(range(env.action_space.n))
policy = epsilon_greedy_policy(estimator, epsilon=0.1, actions=actions)

score = []             # rewards of the last 100 episodes
total_steps = 0        # counted across episodes, since early episodes are short
for e in range(n_episodes):
    done = False
    state = env.reset()
    ep_reward = 0

    ### Generate sample episode
    while not done:
        total_steps += 1
        action = policy(state)
        new_state, reward, done, _ = env.step(action)
        ep_reward += reward

        estimator.remember(state,action,reward,new_state,done)

        if total_steps % update_freq == 0 and e>initial_train:
            estimator.train_model()

        state = new_state

    if len(score) < 100:
        score.append(ep_reward)
    else:
        score[e % 100] = ep_reward
    #print("\rEpisode {} / {}. 100 ep score: {}".format(
    #    e+1, n_episodes, np.mean(score)), end="")

    # Stop training when reaching milestone
    if len(score) == 100 and np.mean(score)>195:
        print("SOLVED")
        break

env.close()
```


## Exercises

- Beat the benchmark! But this time with a different function approximator. Linear functions won't cut it in this case.