# Deep Q Learning

## Loss
$$J(\theta) = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=0}^{T} \text{SmoothL1}\Big( q(s_{i,t}, a_{i,t}),\ r(s_{i,t}, a_{i,t}) + \gamma \max_{a'} q_{target}(s_{i,t+1}, a') \Big)$$

Note that $q(s_{i,t}, a_{i,t})$ is produced by the Q Network, while $q_{target}(s_{i,t+1}, a')$ is given by the Q_target Network. SmoothL1 stands for `smooth_l1_loss`, and the $\gamma \max_{a'} q_{target}$ term is dropped when $s_{i,t+1}$ is terminal.
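
To make the target concrete, here is a minimal pure-Python sketch of the loss for a single transition. The numbers are made up for illustration, and `smooth_l1` and `td_target` are hypothetical helpers: `smooth_l1` is a hand-written stand-in for `F.smooth_l1_loss` that assumes a threshold of 1, while `gamma` and `reward_div` reuse the constants from section 2.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for one scalar pair; beta=1.0 is an assumed threshold."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta   # quadratic near zero
    return diff - 0.5 * beta              # linear for large errors


def td_target(reward, next_q_values, done, gamma=0.98, reward_div=100):
    """One-step TD target: scaled reward plus the discounted best next value."""
    bootstrap = 0.0 if done else max(next_q_values)
    return reward / reward_div + gamma * bootstrap


# Illustrative numbers only (not taken from a real run)
q_pred = 0.6          # q(s, a) from the Q network
next_q = [0.5, 0.7]   # q_target(s', .) from the target network

target = td_target(1.0, next_q, done=False)          # 0.01 + 0.98 * 0.7
loss = smooth_l1(q_pred, target)                     # quadratic branch
terminal_target = td_target(1.0, next_q, done=True)  # bootstrap term dropped
```

For a terminal transition the target reduces to the scaled reward alone, which is what the `done_mask` multiplication does in `fit`.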

## Techniques

+ Replay Buffer: stores either whole samples, i.e. complete episodes (this code), or single transitions (see [here](https://github.com/seungeunrho/minimalRL/blob/master/dqn.py))

+ Update the Q Network every epoch (10 gradient steps per epoch in this code), and update the Q_target Network by copying the Q Network parameters into it every 20 epochs

+ Start training the Q Network only after the Replay Buffer has collected enough samples (more than 40 in this code)

+ Q-learning selects the action with the highest Q-value instead of sampling from a softmax over the outputs, because the values output by the Q Network represent the expected future reward of each action rather than the probability of taking it

+ $\epsilon$-greedy: $\epsilon$ is reduced linearly as training progresses, following $$\epsilon = \max\{0.01,\ 0.08 - 0.01 \cdot \frac{\text{epoch}}{200}\}$$
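
The $\epsilon$-greedy rule and its schedule can be sketched in plain Python as follows. `epsilon_at` and `epsilon_greedy` are hypothetical helper names, and the Q-values are passed as a plain list rather than coming from the network; only the constants 0.08, 0.01 and 200 come from the formula above.

```python
import random


def epsilon_at(epoch, start=0.08, floor=0.01, step=0.01, every=200):
    """Linearly anneal epsilon from `start` down to `floor`."""
    return max(floor, start - step * (epoch / every))


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the argmax action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Print the schedule at a few epochs, rounded for readability
for epoch in (0, 200, 1400, 3000):
    print(epoch, round(epsilon_at(epoch), 4))
```

With these constants the floor of 0.01 is reached at epoch 1400, so for the remaining epochs the agent explores with a fixed 1% probability.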

## 1. Import packages

In [0]:
import gym
import random
import collections
import torch
import torch.nn.functional as F
import torch.nn as nn

## 2. Define constants

In [0]:
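gamma = 0.98       # discount factor
num_epochs = 3000  # number of epochs (one episode is collected per epoch)
reward_div = 100   # rewards are divided by this before computing TD targets
max_buffer = 500   # maximum number of episodes kept in the replay buffer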

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 3. Prepare data

In [0]:
env = gym.make("CartPole-v0")

def get_sample(env, policy, max_iter=600, epsilon=0.01):
    done = False
    s = env.reset() # (state_size, )

    ss, aa, rr, s_primes, done_masks = list(), list(), list(), list(), list()
    for t in range(max_iter):
        a = policy.sample_action(s, epsilon=epsilon)
        s_prime, r, done, _ = env.step(a) # a is 0 or 1
        ss.append(s)
        aa.append(a)
        rr.append(r)
        s_primes.append(s_prime)
        done_mask = 0.0 if done else 1.0
        done_masks.append(done_mask)
        s = s_prime
        if done:
            break

    # Pack the episode into tensors: states, actions, rewards, next states, done masks
    sample = (
        torch.Tensor(ss).to(device),
        torch.LongTensor(aa).to(device),
        torch.Tensor(rr).to(device),
        torch.Tensor(s_primes).to(device),
        torch.Tensor(done_masks).to(device),
    )
    return sample
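
`get_sample` relies on the older Gym API, where `env.reset()` returns only the observation and `env.step()` returns four values. As far as I know, Gym 0.26 and later (and Gymnasium) return `(observation, info)` from `reset()` and `(observation, reward, terminated, truncated, info)` from `step()`. The sketch below shows one way to support both; `reset_env` and `step_env` are hypothetical helpers, not part of Gym, and the tuple checks are an assumption about how to tell the two APIs apart.

```python
def reset_env(env):
    """Return only the observation from env.reset(), for old and new APIs."""
    out = env.reset()
    # Newer API returns (observation, info); the older one returns the observation
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
        return out[0]
    return out


def step_env(env, action):
    """Return (s_prime, reward, done) from env.step(), for old and new APIs."""
    out = env.step(action)
    if len(out) == 5:
        # Newer API: (obs, reward, terminated, truncated, info)
        s_prime, reward, terminated, truncated, _ = out
        return s_prime, reward, terminated or truncated
    # Older API: (obs, reward, done, info)
    s_prime, reward, done, _ = out
    return s_prime, reward, done
```

Inside `get_sample`, `s = reset_env(env)` and `s_prime, r, done = step_env(env, a)` would then replace the direct calls.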

## 4. Build model

In [0]:
class ReplayBuffer(object):
    def __init__(self):
        self.buffer = collections.deque(maxlen=max_buffer)
    
    def append(self, sample): # each entry is one whole episode (sample), while minimalRL stores one transition per entry
        self.buffer.append(sample)
    
    def sample(self):
        return random.choice(self.buffer)
    
    def __len__(self):
        return len(self.buffer)

      
class DQN(nn.Module):
    def __init__(self):
        super(DQN, self).__init__()
        self.value = nn.Sequential(
            nn.Linear(4, 256), # states: (B, state_size=4)
            nn.ReLU(),
            nn.Linear(256, 2) # values: (B, action_size=2)
        )
        
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.0002, betas=(0.9, 0.99))

      
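    def sample_action(self, state, epsilon=0.01):
        # Explore: pick a random action with probability epsilon
        if random.random() < epsilon:
            return random.choice([0, 1])
        # Exploit: pick the action with the highest predicted value
        state = torch.Tensor(state).to(device)
        with torch.no_grad():
            values = self.value(state) # (action_size, )
        return values.argmax().item()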

    def fit(self, sample, q_target):
        (s, a, r, s_prime, done_mask) = sample
        
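        # Scale rewards out of place: `r /= reward_div` would modify the tensor
        # stored in the replay buffer and shrink it again every time it is reused
        r = r / reward_div
        
        # The target network only provides fixed targets, so no gradient is needed
        with torch.no_grad():
            q_prime = q_target.value(s_prime).max(dim=1)[0]
        
        # done_mask is 0 for terminal transitions, which removes the bootstrap term
        td_target = r + gamma * q_prime * done_mask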
        q_pred = self.value(s) # (B, action_size=2)
        q_pred = q_pred.gather(1, a.unsqueeze(1))
        
        loss = F.smooth_l1_loss(q_pred, td_target.unsqueeze(1))
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
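
The `ReplayBuffer` above stores one whole episode per entry. For comparison, a transition-level buffer in the spirit of the linked minimalRL code could look like the sketch below. `TransitionBuffer` is a hypothetical class, `maxlen=50000` is an assumed capacity, and the code returns plain Python lists so that it runs without PyTorch.

```python
import collections
import random


class TransitionBuffer:
    """Hypothetical replay buffer that stores single transitions, not episodes."""

    def __init__(self, maxlen=50000):
        # A bounded deque drops the oldest transitions once it is full
        self.buffer = collections.deque(maxlen=maxlen)

    def append(self, transition):
        # transition = (s, a, r, s_prime, done_mask)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniformly sample a minibatch without replacement
        batch = random.sample(list(self.buffer), batch_size)
        # Regroup into five parallel lists, one per field
        s, a, r, s_prime, done_mask = map(list, zip(*batch))
        return s, a, r, s_prime, done_mask

    def __len__(self):
        return len(self.buffer)
```

Sampling transitions from many different episodes decorrelates the minibatch more than replaying one full episode at a time, at the cost of one extra conversion step to tensors before calling `fit`.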

## 5. Train

In [5]:
q = DQN().to(device)

q_target = DQN().to(device)
q_target.load_state_dict(q.state_dict())

buffer = ReplayBuffer()

score = 0.0

for epoch in range(num_epochs):
    # ------------------------- Get sample ---------------------------------------------
    epsilon = max(0.01, 0.08 - 0.01* (epoch/200)) #Linear annealing from 8% to 1%
    sample = get_sample(env, q, epsilon=epsilon)
    rewards = sample[2]
    score += rewards.sum().item()
    
    # ------------------------- Append sample to Replay Buffer ---------------------------------------------
    buffer.append(sample)
    
    # ------------------------- Train Q Network using sample randomly chosen from Replay Buffer ---------------------------------------------
    if len(buffer) > 40:
        for i in range(10):
            sample = buffer.sample()
            q.fit(sample, q_target)
            
    # ------------------------- Update Q_target Network ---------------------------------------------
    if epoch != 0 and epoch % 20 == 0:
        # pass the parameters from q to q_target
        q_target.load_state_dict(q.state_dict())

    if epoch % 100 == 0:
        print('Epoch %d || Average Score: %.6f'%(epoch, score / (epoch + 1)))

Epoch 0 || Average Score: 10.000000
Epoch 100 || Average Score: 15.980198
Epoch 200 || Average Score: 26.079601
Epoch 300 || Average Score: 62.142857
Epoch 400 || Average Score: 82.022446
Epoch 500 || Average Score: 94.522957
Epoch 600 || Average Score: 104.623955
Epoch 700 || Average Score: 112.870186
Epoch 800 || Average Score: 114.408234
Epoch 900 || Average Score: 111.486130
Epoch 1000 || Average Score: 116.703293
Epoch 1100 || Average Score: 119.029976
Epoch 1200 || Average Score: 120.970856
Epoch 1300 || Average Score: 121.441200
Epoch 1400 || Average Score: 119.381157
Epoch 1500 || Average Score: 117.480347
Epoch 1600 || Average Score: 117.738297
Epoch 1700 || Average Score: 118.125214
Epoch 1800 || Average Score: 118.739029
Epoch 1900 || Average Score: 119.277222
Epoch 2000 || Average Score: 120.924538
Epoch 2100 || Average Score: 121.871971
Epoch 2200 || Average Score: 122.861427
Epoch 2300 || Average Score: 123.503693
Epoch 2400 || Average Score: 124.530197
Epoch 2500 || Aver