# Deep Q-Networks exploration 
The terms Q-network and Q-learning are used interchangeably here.

Q-learning only works for discrete action spaces; continuous action spaces require different algorithms.
theory:
https://www.youtube.com/watch?v=0bt0SjbS3xc&list=PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv&index=13&ab_channel=deeplizard

code:
https://www.youtube.com/watch?v=NP8pXZdU-5U&ab_channel=brthor (simpler)
https://www.youtube.com/watch?v=wc-FxNENg9U&ab_channel=MachineLearningwithPhil (more advanced)

Instead of using value iteration, use a deep neural network.
The network approximates the Q-function using the Bellman equation.
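
For reference, the Bellman optimality equation that the network is trained to satisfy can be written as follows. This is the standard textbook form, with $\gamma$ as the discount factor and $s_t, a_t, r_{t+1}, s_{t+1}$ denoting state, action, reward and next state; the symbols are a notational assumption, not something fixed by the videos linked above.

$$
Q^*(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \right]
$$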

n_outputs of DQN = n_actions
each node represents the q value for a particular action given the state (which is the input to the DQN) 
output layer is without activation function so we can see the raw q values

experience replay 
With deep Q-networks, we often use a technique called experience replay during training. With experience replay, we store the agent's experiences at each time step in a data set called the replay memory. We represent the agent's experience at time $t$ as $e_t$. At time $t$, the agent's experience $e_t$ is defined as this tuple:
$e_t=(s_t,a_t,r_{t+1},s_{t+1})$


A key reason for using replay memory is to break the correlation between consecutive samples.
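
A minimal sketch of a replay memory, using only the standard library. The names `Experience` and `ReplayMemory` are hypothetical and chosen for illustration; states are plain integers here so the behaviour is easy to inspect.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one experience e_t = (s_t, a_t, r_{t+1}, s_{t+1})
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

class ReplayMemory:
    """Fixed-capacity store; the oldest experience is dropped when full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the time ordering
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

memory = ReplayMemory(capacity=5)
for t in range(8):  # push 8 experiences into a memory of capacity 5
    memory.push(state=t, action=0, reward=1.0, next_state=t + 1)

batch = memory.sample(batch_size=3)
print(len(memory), [e.state for e in batch])
```

Because the batch is drawn at random, consecutive time steps rarely end up next to each other in a training batch.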

DQN Pseudocode without Target Network
1. Initialize replay memory capacity.
2. Initialize the network with random weights.
3. For each episode:
- Initialize the starting state.
- For each time step:
    -Select an action.
        -Via exploration or exploitation
    -Execute selected action in an emulator.
    -Observe reward and next state.
    -Store experience in replay memory.
    -Sample random batch from replay memory.
    -Preprocess states from batch.
    -Pass batch of preprocessed states to policy network.
    -Calculate loss between output Q-values and target Q-values.
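        -Requires a second pass through the network for the next state to get its max Q-value, which gives the target Q-value [the same weights produce both the output and the target, so the target shifts with every update and training becomes unstable; a Target Network fixes this by freezing a copy of the policy network's weights and syncing it only every few steps]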
    -Gradient descent updates weights in the policy network to minimize loss.
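
The "select an action via exploration or exploitation" step is usually done with an epsilon-greedy rule. A small sketch follows; `select_action` is a hypothetical helper and the Q-values are made up.

```python
import random

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration: uniform random action
    # exploitation: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.7, 0.2]  # made-up Q-values for 3 actions
greedy = select_action(q_values, epsilon=0.0)   # always exploits
explore = select_action(q_values, epsilon=1.0)  # always explores
print(greedy, explore)
```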
    
    
DQN Pseudocode with Target Network
1. Initialize replay memory capacity.
2. Initialize the network with random weights.
3. Clone the policy network, and call it the target network.
4. For each episode:
- Initialize the starting state.
- For each time step:
    -Select an action.
        -Via exploration or exploitation
    -Execute selected action in an emulator.
    -Observe reward and next state.
    -Store experience in replay memory.
    -Sample random batch from replay memory.
    -Preprocess states from batch.
    -Pass batch of preprocessed states to policy network.
    -Calculate loss between output Q-values and target Q-values.
        -Requires a pass to the target network for the next state
    -Gradient descent updates weights in the policy network to minimize loss.
    -After a fixed number of time steps, weights in the target network are updated to the weights in the policy network.
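
Written out for a single sampled experience, the loss with a target network can be stated as below. Here $\theta$ stands for the policy network weights and $\theta^-$ for the target network weights; this notation is an assumption borrowed from the DQN literature, and squared error is shown for simplicity.

$$
L(\theta) = \left( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta) \right)^2
$$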
    

# algo 1

implementation of the DeepMind paper
https://training.incf.org/sites/default/files/2023-05/Human-level%20control%20through%20deep%20reinforcement%20learning.pdf
    
code from
https://www.youtube.com/watch?v=NP8pXZdU-5U&ab_channel=brthor (simpler)


In [1]:
import gym
import torch
import torch.nn as nn
from collections import deque
import itertools
import numpy as np
import random
import time

from collections import deque
A deque (double-ended queue) is a data structure that allows insertion and removal of elements at both ends. This is different from a queue, which only allows insertion at one end and removal from the other end, following a first-in, first-out (FIFO) order.
CPython implements it as a doubly linked list of fixed-size blocks, so appends and pops at either end are O(1).
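
A short illustration of both-ended access and of the `maxlen` argument, which is what lets a deque act as a bounded buffer:

```python
from collections import deque

d = deque([1, 2, 3])
d.append(4)         # insert on the right
d.appendleft(0)     # insert on the left
left = d.popleft()  # remove from the left
right = d.pop()     # remove from the right

# With maxlen set, appending to a full deque silently drops the item at the other end
buf = deque(maxlen=3)
for i in range(5):
    buf.append(i)

print(list(d), left, right, list(buf))
```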

In [2]:
GAMMA=0.99
BATCH_SIZE=32 #num transitions to sample from replay buffer
BUFFER_SIZE=50000 #max num of transitions to store before the oldest transitions are overwritten
MIN_REPLAY_SIZE=1000 #min num of transitions to store before computing gradients
EPSILON_START=1.0
EPSLION_END=0.02
EPSILON_DECAY=10000 #num of steps to decay epsilon from start to end, this is NOT the decay value itself but num of steps
TARGET_UPDATE_FREQ=1000 #num steps to set target params (target network) to online params (main network)
LR=5e-4
MAX_STEPS=100000
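
To make the meaning of `EPSILON_DECAY` concrete, here is a plain-Python sketch of the linear schedule that `np.interp` computes in the training loop. The constants are restated locally and `epsilon_at` is a hypothetical helper.

```python
EPSILON_START = 1.0
EPSILON_END = 0.02
EPSILON_DECAY = 10000  # number of steps over which epsilon is annealed

def epsilon_at(step):
    """Linear interpolation from EPSILON_START to EPSILON_END, then constant."""
    fraction = min(max(step / EPSILON_DECAY, 0.0), 1.0)
    return EPSILON_START + fraction * (EPSILON_END - EPSILON_START)

for step in (0, 5000, 10000, 50000):
    print(step, round(epsilon_at(step), 4))
```

Halfway through the decay window epsilon is about 0.51, and from step 10000 onwards it stays at 0.02.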

In [3]:
class Network(nn.Module):
    def __init__(self,env):
        super().__init__()
        in_features=int(np.prod(env.observation_space.shape))
        self.net=nn.Sequential(
            nn.Linear(in_features,64),
            nn.Tanh(),
            nn.Linear(64,env.action_space.n)
        )
        
    def forward(self,x):
        return self.net(x)
    
    def act(self,obs):
        obs_t=torch.as_tensor(obs,dtype=torch.float32) #_t indicates that it is a tensor, an easy trick for debugging
        q_values=self(obs_t.unsqueeze(0)) #the network expects a batch dim, so unsqueeze(0) turns shape (obs_dim,) into (1,obs_dim)
        
        max_q_index=torch.argmax(q_values,dim=1)[0]
        action=max_q_index.detach().item()
        return action

In [8]:
env=gym.make('CartPole-v0')

replay_buffer=deque(maxlen=BUFFER_SIZE)
rew_buffer=deque([0.0],maxlen=100)

episode_reward=0.0

online_network=Network(env)
target_network=Network(env)
#copy the online network params into the target network, since the two were initialised with different random weights
target_network.load_state_dict(online_network.state_dict()) 

optimizer=torch.optim.Adam(online_network.parameters(),lr=LR)

#Initialise replay buffer - only run once at the start of algo for initialisation purpose
state=env.reset()
start_time=time.time()
for _ in range(MIN_REPLAY_SIZE):
    action=env.action_space.sample()
    new_state,reward,done,_=env.step(action)
    transition=(state,action,reward,done,new_state)
    replay_buffer.append(transition)
    state=new_state
    
    if done:
        state=env.reset()
        
#main training loop
state=env.reset()
print('################# START TRAINING #################')
for step in range(MAX_STEPS): #epsilon greedy method - need to calc epsilon value
    epsilon=np.interp(step,[0,EPSILON_DECAY],[EPSILON_START,EPSLION_END])
    
    random_sample=random.random()
    
    if random_sample<=epsilon:
        action=env.action_space.sample()
    else:
        action=online_network.act(state)
        
    new_state,reward,done,_=env.step(action)
    transition=(state,action,reward,done,new_state)
    replay_buffer.append(transition)
    state=new_state
    episode_reward+=reward
    
    if done:
        state=env.reset()
        rew_buffer.append(episode_reward)
        episode_reward=0.0
        
#     #After task is solved, test algo on env
#     if len(rew_buffer)>=100:
#         if np.mean(rew_buffer)>=195:
#             while True: #infinite loop
#                 action=online_network.act(state)
#                 state,_,done,_=env.step(action)
#                 env.render()
#                 if done:
#                     env.reset()
        
    #start gradient step 
    transitions=random.sample(replay_buffer,BATCH_SIZE)
    
    states=np.asarray([t[0] for t in transitions])
    actions=np.asarray([t[1] for t in transitions])
    rewards=np.asarray([t[2] for t in transitions])    
    dones=np.asarray([t[3] for t in transitions])    
    new_states=np.asarray([t[4] for t in transitions])    
    
    states_t=torch.tensor(states,dtype=torch.float32)
    actions_t=torch.tensor(actions,dtype=torch.int64).unsqueeze(-1) #unsqueeze(-1) since the values are already batched, so -1 adds a dim at the end rather than the beginning
    rewards_t=torch.tensor(rewards,dtype=torch.float32).unsqueeze(-1)
    dones_t=torch.tensor(dones,dtype=torch.float32).unsqueeze(-1)
    new_states_t=torch.tensor(new_states,dtype=torch.float32)   
    
    #compute targets
    target_q_values=target_network(new_states_t) #get the q values for each state in new_states in the target network
    max_target_q_values=target_q_values.max(dim=1,keepdim=True)[0] #get only the highest q values for each state in new state
    targets=rewards_t+GAMMA*(1-dones_t)*max_target_q_values #formula from paper
    
    #compute loss
    q_values=online_network(states_t) #get the q values for each state in states in online network
    action_q_values=torch.gather(input=q_values,dim=1,index=actions_t) #get q values for specific actions
    loss=nn.functional.smooth_l1_loss(action_q_values,targets) #Huber loss
    
    #gradient descent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    #update target network
    if step%TARGET_UPDATE_FREQ==0:
        target_network.load_state_dict(online_network.state_dict())
        
    #logging
    if step%5000==0:
        print()
        print(f'Step {step}, Avg Rew : {np.mean(rew_buffer)}, Time elapsed: {(time.time()-start_time):.2f}s')
        
print('################# END TRAINING #################')

################# START TRAINING #################

Step 0, Avg Rew : 0.0, Time elapsed: 0.02s

Step 5000, Avg Rew : 32.47, Time elapsed: 9.02s

Step 10000, Avg Rew : 68.38, Time elapsed: 19.27s

Step 15000, Avg Rew : 109.28, Time elapsed: 28.91s

Step 20000, Avg Rew : 150.3, Time elapsed: 38.81s

Step 25000, Avg Rew : 184.7, Time elapsed: 48.47s

Step 30000, Avg Rew : 198.21, Time elapsed: 58.12s

Step 35000, Avg Rew : 198.65, Time elapsed: 68.43s

Step 40000, Avg Rew : 194.37, Time elapsed: 79.20s

Step 45000, Avg Rew : 181.65, Time elapsed: 90.12s

Step 50000, Avg Rew : 170.05, Time elapsed: 100.35s

Step 55000, Avg Rew : 162.28, Time elapsed: 111.53s

Step 60000, Avg Rew : 166.59, Time elapsed: 122.31s

Step 65000, Avg Rew : 174.38, Time elapsed: 133.05s

Step 70000, Avg Rew : 182.3, Time elapsed: 143.36s

Step 75000, Avg Rew : 188.02, Time elapsed: 153.18s

Step 80000, Avg Rew : 190.01, Time elapsed: 162.95s

Step 85000, Avg Rew : 192.81, Time elapsed: 173.07s

Step 90000, Avg Rew
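
The target line in the training loop, `targets=rewards_t+GAMMA*(1-dones_t)*max_target_q_values`, can be checked by hand on scalars. The sketch below uses made-up numbers and a hypothetical helper `td_target`; it shows how the `(1-done)` factor removes the bootstrap term for terminal transitions.

```python
GAMMA = 0.99

def td_target(reward, done, next_q_values):
    """r + gamma * max_a' Q_target(s', a'), with the bootstrap term dropped when done."""
    return reward + GAMMA * (1.0 - float(done)) * max(next_q_values)

# Made-up numbers for two transitions
non_terminal = td_target(reward=1.0, done=False, next_q_values=[2.0, 5.0])
terminal = td_target(reward=1.0, done=True, next_q_values=[2.0, 5.0])
print(non_terminal, terminal)
```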

In [17]:
max_steps=10090
n_eps=5
for _ in range(n_eps): #test 5 episodes
    state=env.reset()
    for _ in range(max_steps):
        action=online_network.act(state)
        state,_,done,_=env.step(action)
        env.render() #draw every frame, not only the last one
        if done:
            break #episode finished, move on to the next one

env.close()

KeyboardInterrupt: 

In [18]:
env.close()

# algo 2

code from
https://www.youtube.com/watch?v=wc-FxNENg9U&ab_channel=MachineLearningwithPhil 

Uses only the online network (DQN); it does not implement a target network, but the model can still learn.

In [9]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

In [11]:
#put this in a python file .py

class DeepQNetwork(nn.Module):
    def __init__(self,lr,input_dims,fc1_dims,fc2_dims,n_actions):
        super(DeepQNetwork,self).__init__()
        self.input_dims=input_dims
        self.fc1_dims=fc1_dims
        self.fc2_dims=fc2_dims
        self.n_actions=n_actions
        self.fc1=nn.Linear(*self.input_dims,self.fc1_dims)
        self.fc2=nn.Linear(self.fc1_dims,self.fc2_dims)
        self.fc3=nn.Linear(self.fc2_dims,self.n_actions)
        self.optimizer=optim.Adam(self.parameters(),lr=lr)
        self.loss=nn.MSELoss()
        self.device=torch.device('cuda:0' if torch.cuda.is_available() else "cpu")
        self.to(self.device) #move the network weights to the chosen device
        
    def forward(self,state):
        x=F.relu(self.fc1(state))
        x=F.relu(self.fc2(x))
        actions=self.fc3(x)
        return actions
    
class Agent():
    def __init__(self,gamma,epsilon,lr,input_dims,batch_size,n_actions,max_mem_size=100000,eps_min=0.01,eps_decay=5e-4):
        self.gamma=gamma
        self.epsilon=epsilon
        self.lr=lr
        self.input_dims=input_dims
        self.batch_size=batch_size
        self.n_actions=n_actions
        self.mem_size=max_mem_size
        self.eps_min=eps_min
        self.eps_decay=eps_decay
        self.action_space=[i for i in range(n_actions)]
        self.mem_counter=0 #keep track of first avail memory for Agent
        
        self.Q_eval=DeepQNetwork(lr=self.lr,n_actions=self.n_actions,input_dims=self.input_dims,fc1_dims=256,fc2_dims=256)
        self.state_memory=np.zeros((self.mem_size,*input_dims),dtype=np.float32)
        self.new_state_memory=np.zeros((self.mem_size,*input_dims),dtype=np.float32)
        self.action_memory=np.zeros(self.mem_size,dtype=np.int32)
        self.reward_memory=np.zeros(self.mem_size,dtype=np.float32)        
        self.terminal_memory=np.zeros(self.mem_size,dtype=np.bool_) #np.bool was removed in NumPy 1.24
        
    def store_transition(self,state,action,reward,new_state,done):
        index=self.mem_counter%self.mem_size
        self.state_memory[index]=state
        self.new_state_memory[index]=new_state        
        self.action_memory[index]=action 
        self.reward_memory[index]=reward       
        self.terminal_memory[index]=done
        
        self.mem_counter+=1
        
    def choose_action(self,observation):
        if np.random.random()>self.epsilon:
            state=torch.tensor(np.array([observation]),dtype=torch.float32).to(self.Q_eval.device)
            actions=self.Q_eval.forward(state)
            action=torch.argmax(actions).item()
        else:
            action=np.random.choice(self.action_space)
        return action
    
    def learn(self):
        if self.mem_counter<self.batch_size:
            return 
        self.Q_eval.optimizer.zero_grad()
        max_mem=min(self.mem_counter,self.mem_size)
        batch=np.random.choice(max_mem,self.batch_size,replace=False)
        batch_index=np.arange(self.batch_size,dtype=np.int32)
        
        state_batch=torch.tensor(self.state_memory[batch]).to(self.Q_eval.device)
        new_state_batch=torch.tensor(self.new_state_memory[batch]).to(self.Q_eval.device)        
        reward_batch=torch.tensor(self.reward_memory[batch]).to(self.Q_eval.device)        
        terminal_batch=torch.tensor(self.terminal_memory[batch]).to(self.Q_eval.device)       
        action_batch=self.action_memory[batch]
        
        q_eval=self.Q_eval.forward(state_batch)[batch_index,action_batch]
        q_next=self.Q_eval.forward(new_state_batch)
        q_next[terminal_batch]=0.0
        q_target=reward_batch+self.gamma*torch.max(q_next,dim=1)[0] #discounted max Q-value of the next state
        
        loss=self.Q_eval.loss(q_target,q_eval).to(self.Q_eval.device)
        loss.backward()
        self.Q_eval.optimizer.step()
        
        self.epsilon=self.epsilon-self.eps_decay if self.epsilon>self.eps_min else self.eps_min
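
The agent's memory is a set of preallocated arrays written in a circular fashion through `mem_counter%mem_size`. A toy sketch of that indexing, with a hypothetical capacity of 3 and strings standing in for transitions:

```python
mem_size = 3  # hypothetical tiny capacity
memory = [None] * mem_size
mem_counter = 0

for transition in ["t0", "t1", "t2", "t3", "t4"]:
    index = mem_counter % mem_size  # wraps back to 0 once the buffer is full
    memory[index] = transition
    mem_counter += 1

filled = min(mem_counter, mem_size)  # number of valid slots to sample from
print(memory, filled)
```

The fourth and fifth transitions overwrite slots 0 and 1, which is the same overwrite-the-oldest behaviour as a deque with `maxlen`.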

In [None]:
# then run this code

import gym
from simple_dqn_torch_2020 import Agent
from utils import plotLearning
import numpy as np

if __name__ == '__main__':
    env = gym.make('LunarLander-v2')
    agent = Agent(gamma=0.99, epsilon=1.0, batch_size=64, n_actions=4, eps_min=0.01,
                  input_dims=[8], lr=0.001)
    scores, eps_history = [], []
    n_games = 500
    
    for i in range(n_games):
        score = 0
        done = False
        observation = env.reset()
        while not done:
            action = agent.choose_action(observation)
            observation_, reward, done, info = env.step(action)
            score += reward
            agent.store_transition(observation, action, reward, 
                                    observation_, done)
            agent.learn()
            observation = observation_
        scores.append(score)
        eps_history.append(agent.epsilon)

        avg_score = np.mean(scores[-100:])

        print('episode ', i, 'score %.2f' % score,
                'average score %.2f' % avg_score,
                'epsilon %.2f' % agent.epsilon)
    x = [i+1 for i in range(n_games)]
    filename = 'lunar_lander.png'
    plotLearning(x, scores, eps_history, filename)

In [None]:
#utils.py

import matplotlib.pyplot as plt
import numpy as np
import gym

def plotLearning(x, scores, epsilons, filename, lines=None):
    fig=plt.figure()
    ax=fig.add_subplot(111, label="1")
    ax2=fig.add_subplot(111, label="2", frame_on=False)

    ax.plot(x, epsilons, color="C0")
    ax.set_xlabel("Game", color="C0")
    ax.set_ylabel("Epsilon", color="C0")
    ax.tick_params(axis='x', colors="C0")
    ax.tick_params(axis='y', colors="C0")

    N = len(scores)
    running_avg = np.empty(N)
    for t in range(N):
        running_avg[t] = np.mean(scores[max(0, t-20):(t+1)])

    ax2.scatter(x, running_avg, color="C1")
    #ax2.xaxis.tick_top()
    ax2.axes.get_xaxis().set_visible(False)
    ax2.yaxis.tick_right()
    #ax2.set_xlabel('x label 2', color="C1")
    ax2.set_ylabel('Score', color="C1")
    #ax2.xaxis.set_label_position('top')
    ax2.yaxis.set_label_position('right')
    #ax2.tick_params(axis='x', colors="C1")
    ax2.tick_params(axis='y', colors="C1")

    if lines is not None:
        for line in lines:
            plt.axvline(x=line)

    plt.savefig(filename)

class SkipEnv(gym.Wrapper):
    def __init__(self, env=None, skip=4):
        super(SkipEnv, self).__init__(env)
        self._skip = skip

    def step(self, action):
        t_reward = 0.0
        done = False
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            t_reward += reward
            if done:
                break
        return obs, t_reward, done, info

    def reset(self):
        self._obs_buffer = []
        obs = self.env.reset()
        self._obs_buffer.append(obs)
        return obs

class PreProcessFrame(gym.ObservationWrapper):
    def __init__(self, env=None):
        super(PreProcessFrame, self).__init__(env)
        self.observation_space = gym.spaces.Box(low=0, high=255,
                                                shape=(80,80,1), dtype=np.uint8)
    def observation(self, obs):
        return PreProcessFrame.process(obs)

    @staticmethod
    def process(frame):

        new_frame = np.reshape(frame, frame.shape).astype(np.float32)

        new_frame = 0.299*new_frame[:,:,0] + 0.587*new_frame[:,:,1] + \
                    0.114*new_frame[:,:,2]

        new_frame = new_frame[35:195:2, ::2].reshape(80,80,1)

        return new_frame.astype(np.uint8)

class MoveImgChannel(gym.ObservationWrapper):
    def __init__(self, env):
        super(MoveImgChannel, self).__init__(env)
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0,
                            shape=(self.observation_space.shape[-1],
                                   self.observation_space.shape[0],
                                   self.observation_space.shape[1]),
                            dtype=np.float32)

    def observation(self, observation):
        return np.moveaxis(observation, 2, 0)

class ScaleFrame(gym.ObservationWrapper):
    def observation(self, obs):
        return np.array(obs).astype(np.float32) / 255.0

class BufferWrapper(gym.ObservationWrapper):
    def __init__(self, env, n_steps):
        super(BufferWrapper, self).__init__(env)
        self.observation_space = gym.spaces.Box(
                             env.observation_space.low.repeat(n_steps, axis=0),
                             env.observation_space.high.repeat(n_steps, axis=0),
                             dtype=np.float32)

    def reset(self):
        self.buffer = np.zeros_like(self.observation_space.low, dtype=np.float32)
        return self.observation(self.env.reset())

    def observation(self, observation):
        self.buffer[:-1] = self.buffer[1:]
        self.buffer[-1] = observation
        return self.buffer

def make_env(env_name):
    env = gym.make(env_name)
    env = SkipEnv(env)
    env = PreProcessFrame(env)
    env = MoveImgChannel(env)
    env = BufferWrapper(env, 4)
    return ScaleFrame(env)
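
A quick sanity check of the arithmetic in `PreProcessFrame.process`, in plain Python. It assumes raw frames of 210 rows by 160 columns (the usual Atari frame size, which is an assumption here since the notes do not state it); `luminance` is a hypothetical helper using the same weights.

```python
# Assumption: raw Atari frames are 210 rows x 160 columns x 3 channels (RGB)
FRAME_ROWS, FRAME_COLS = 210, 160

rows_kept = len(range(FRAME_ROWS)[35:195:2])  # crop rows 35..194, keep every 2nd
cols_kept = len(range(FRAME_COLS)[::2])       # keep every 2nd column

def luminance(r, g, b):
    """Weighted grayscale conversion with the same weights as process()."""
    return 0.299 * r + 0.587 * g + 0.114 * b

print(rows_kept, cols_kept, luminance(255, 255, 255))
```

Both counts come out to 80, which is why the slice can be reshaped to `(80,80,1)`.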