**<font size="6"><center>Deep Q-Learning</center></font>**

Répéter :

1) Rassembler et stocker des échantillons en mémoire avec la politique actuelle

2) Echantillonner aléatoirement des batch d'expériences de la mémoire (connus sous le nom de **Experience Replay**)

3) Utiliser les expériences échantillonnées pour mettre à jour le Q-network

**Expérience Replay**

*Pourquoi échantillonner des expériences au hasard, au lieu d'utiliser simplement des expériences séquentielles passées ?*

Les expériences séquentielles sont fortement corrélées (temporellement) entre elles. Dans les tâches d'apprentissage et d'optimisation statistique, on veut que les données soient distribuées indépendamment. Autrement dit, on ne veut pas qu'elles soient corrélées entre elles. L'échantillonnage aléatoire des expériences permet d'éviter cette corrélation temporelle du comportement et la répartit sur un bon nombre de ses états précédents. En faisant cela, on évite des oscillations ou des divergences importantes dans notre modèle - problèmes qui peuvent survenir à partir de données corrélées.

**Mise à jour du Q-network**

Pour mettre à jour le réseau Q, on veut minimiser l'erreur quadratique moyenne entre la Q-valeur cible (selon l'équation de Bellman) et la sortie Q-valeur actuelle:

$$ L(\theta_i) = \underset{s,a \sim p(.)}{\mathbb{E}} [y_i - Q_{\theta_i} (s,a)]^2 $$

où 
- $ Q_{\theta_i} (s,a)$ : l'actuelle Q-valeur calculée par le réseau de neurone    
- $ y_i = R_{t+1}+\gamma \underset{a'}{\max} Q(s_{t+1},a') $: la Q-valeur cible donné par l'équation de Bellman 

# Implementation

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.autograd as autograd
import numpy as np
import math
import gym
from collections import deque
import random

le **réseau de neurone** qui approxime la Q-valeur

In [2]:
class DQN(nn.Module):
    
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.input_dim  = input_dim
        self.output_dim = output_dim
        
        self.fc = nn.Sequential(
            nn.Linear(self.input_dim[0], 128),
            nn.ReLU(),
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, self.output_dim) )

    def forward(self, state):
        qvals = self.fc(state)
        return qvals

le **replay buffer** qui sauvegarde les expériences et échantillonne des expériences aléatoirement pour l'apprentissage

In [3]:
class BasicBuffer:

    def __init__(self, max_size):
        self.max_size = max_size
        self.buffer = deque(maxlen=max_size)

    def push(self, state, action, reward, next_state, done):
        experience = (state, action, np.array([reward]), next_state, done)
        self.buffer.append(experience)

    def sample(self, batch_size):
        state_batch = []
        action_batch = []
        reward_batch = []
        next_state_batch = []
        done_batch = []

        batch = random.sample(self.buffer, batch_size)

        for experience in batch:
            state, action, reward, next_state, done = experience
            state_batch.append(state)
            action_batch.append(action)
            reward_batch.append(reward)
            next_state_batch.append(next_state)
            done_batch.append(done)

        return (state_batch, action_batch, reward_batch, next_state_batch, done_batch)

    def __len__(self):
        return len(self.buffer)

l'**agent** DQN

In [4]:
class DQNAgent:

    def __init__(self, env, learning_rate=3e-4, gamma=0.99, buffer_size=10000):
        self.env           = env
        self.learning_rate = learning_rate
        self.gamma         = gamma
        self.replay_buffer = BasicBuffer(max_size=buffer_size)
        self.model         = DQN(env.observation_space.shape, env.action_space.n)
        self.optimizer     = torch.optim.Adam(self.model.parameters())
        self.MSE_loss      = nn.MSELoss()

    def get_action(self, state, eps=0.20):
        state  = autograd.Variable(torch.from_numpy(state).float().unsqueeze(0))
        qvals  = self.model.forward(state)
        action = np.argmax(qvals.detach().numpy())
        if(np.random.randn() < eps):
            action = self.env.action_space.sample()
        
        return action

    def compute_loss(self, batch):
        states, actions, rewards, next_states, dones = batch
        states      = torch.FloatTensor(states)
        actions     = torch.LongTensor(actions)
        rewards     = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones       = torch.FloatTensor(dones)

        curr_Q     = self.model.forward(states).gather(1, actions.unsqueeze(1))
        curr_Q     = curr_Q.squeeze(1)
        next_Q     = self.model.forward(next_states)
        max_next_Q = torch.max(next_Q, 1)[0]
        expected_Q = rewards.squeeze(1) + (1 - dones) * self.gamma * max_next_Q

        loss = self.MSE_loss(curr_Q, expected_Q.detach())
        return loss

    def update(self, batch_size):
        batch = self.replay_buffer.sample(batch_size)
        loss  = self.compute_loss(batch)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

Test on environment

In [5]:
def mini_batch_train(env, agent, max_episodes, max_steps, batch_size):
    episode_rewards = []

    for episode in range(max_episodes):
        state = env.reset()
        episode_reward = 0

        for step in range(max_steps):
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.replay_buffer.push(state, action, reward, next_state, done)
            episode_reward += reward

            if len(agent.replay_buffer) > batch_size:
                agent.update(batch_size)   

            if done or step == max_steps-1:
                episode_rewards.append(episode_reward)
                print("Episode " + str(episode) + ": " + str(episode_reward))
                break

            state = next_state

    return episode_rewards

In [6]:
env_id = "CartPole-v0"
MAX_EPISODES = 1000
MAX_STEPS    = 500
BATCH_SIZE   = 32

env   = gym.make(env_id)
agent = DQNAgent(env)
episode_rewards = mini_batch_train(env, agent, MAX_EPISODES, MAX_STEPS, BATCH_SIZE)

Episode 0: 18.0
Episode 1: 21.0
Episode 2: 11.0
Episode 3: 11.0
Episode 4: 27.0
Episode 5: 13.0
Episode 6: 51.0
Episode 7: 39.0
Episode 8: 21.0
Episode 9: 35.0
Episode 10: 88.0
Episode 11: 23.0
Episode 12: 21.0
Episode 13: 20.0
Episode 14: 20.0
Episode 15: 15.0
Episode 16: 51.0
Episode 17: 20.0
Episode 18: 32.0
Episode 19: 20.0
Episode 20: 28.0
Episode 21: 27.0
Episode 22: 40.0
Episode 23: 55.0
Episode 24: 25.0
Episode 25: 18.0
Episode 26: 85.0
Episode 27: 87.0
Episode 28: 40.0
Episode 29: 132.0
Episode 30: 104.0
Episode 31: 19.0
Episode 32: 66.0
Episode 33: 93.0
Episode 34: 127.0
Episode 35: 81.0
Episode 36: 169.0
Episode 37: 23.0
Episode 38: 126.0
Episode 39: 43.0
Episode 40: 13.0
Episode 41: 108.0
Episode 42: 171.0
Episode 43: 66.0
Episode 44: 130.0
Episode 45: 42.0
Episode 46: 52.0
Episode 47: 51.0
Episode 48: 97.0
Episode 49: 170.0
Episode 50: 95.0
Episode 51: 93.0
Episode 52: 126.0
Episode 53: 112.0
Episode 54: 21.0
Episode 55: 20.0
Episode 56: 30.0
Episode 57: 80.0
Episode 58: 1

Episode 457: 67.0
Episode 458: 10.0
Episode 459: 115.0
Episode 460: 12.0
Episode 461: 16.0
Episode 462: 36.0
Episode 463: 45.0
Episode 464: 76.0
Episode 465: 11.0
Episode 466: 17.0
Episode 467: 18.0
Episode 468: 15.0
Episode 469: 18.0
Episode 470: 74.0
Episode 471: 11.0
Episode 472: 16.0
Episode 473: 50.0
Episode 474: 51.0
Episode 475: 28.0
Episode 476: 14.0
Episode 477: 27.0
Episode 478: 20.0
Episode 479: 24.0
Episode 480: 21.0
Episode 481: 17.0
Episode 482: 30.0
Episode 483: 39.0
Episode 484: 127.0
Episode 485: 14.0
Episode 486: 25.0
Episode 487: 51.0
Episode 488: 141.0
Episode 489: 60.0
Episode 490: 147.0
Episode 491: 200.0
Episode 492: 68.0
Episode 493: 47.0
Episode 494: 30.0
Episode 495: 16.0
Episode 496: 15.0
Episode 497: 62.0
Episode 498: 16.0
Episode 499: 52.0
Episode 500: 19.0
Episode 501: 18.0
Episode 502: 37.0
Episode 503: 136.0
Episode 504: 72.0
Episode 505: 16.0
Episode 506: 70.0
Episode 507: 52.0
Episode 508: 125.0
Episode 509: 15.0
Episode 510: 20.0
Episode 511: 20.0
Epi

Episode 910: 200.0
Episode 911: 79.0
Episode 912: 84.0
Episode 913: 124.0
Episode 914: 14.0
Episode 915: 72.0
Episode 916: 155.0
Episode 917: 71.0
Episode 918: 83.0
Episode 919: 17.0
Episode 920: 70.0
Episode 921: 90.0
Episode 922: 41.0
Episode 923: 47.0
Episode 924: 104.0
Episode 925: 15.0
Episode 926: 28.0
Episode 927: 19.0
Episode 928: 19.0
Episode 929: 29.0
Episode 930: 163.0
Episode 931: 21.0
Episode 932: 161.0
Episode 933: 109.0
Episode 934: 18.0
Episode 935: 17.0
Episode 936: 16.0
Episode 937: 87.0
Episode 938: 37.0
Episode 939: 150.0
Episode 940: 15.0
Episode 941: 12.0
Episode 942: 42.0
Episode 943: 103.0
Episode 944: 33.0
Episode 945: 15.0
Episode 946: 41.0
Episode 947: 21.0
Episode 948: 17.0
Episode 949: 20.0
Episode 950: 143.0
Episode 951: 46.0
Episode 952: 78.0
Episode 953: 26.0
Episode 954: 9.0
Episode 955: 12.0
Episode 956: 114.0
Episode 957: 15.0
Episode 958: 72.0
Episode 959: 112.0
Episode 960: 12.0
Episode 961: 29.0
Episode 962: 51.0
Episode 963: 164.0
Episode 964: 33.