# Deep Q-learning 
### This notebook contains the code for creating a DQN agent that solves the Pong environment

#### Install modules [IMPORTANT]
At the time of writing, the latest versions of gym & ale-py do not work with this notebook \
Please install the versions below to ensure Pong v5 runs successfully 

In [None]:
# pip install gym==0.24.1
# pip install ale-py==0.7.4

#### Logic

Mostly following [this paper](https://arxiv.org/abs/1312.5602) & the [code here](https://towardsdatascience.com/deep-q-network-dqn-ii-b6bf911b6b2c) \
Idea is as follows:
1) We pass a stack of 4 frames[including the frame for the current state] to the model to predict the state's action values \
2) At the start of every episode, the frame stack is reset to zeros, so the first observation is the current frame preceded by 3 blank dummy frames \
3) From that starting stack, the agent selects an action & steps the environment to obtain the next frame, reward & terminate \
4) Each transition is added to agent's memory until a certain length is achieved \
5) After that, we keep collecting experience & concurrently update weights by randomly selecting batch_size experiences from agent's memory for model training

After looping through all episodes, save model weights

#### Import modules

In [None]:
import torch
import copy
import gym
import cv2
import ale_py
import collections
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
import numpy as np
from PIL import Image
from tqdm import tqdm
from torch import optim

#### Define functions & classes 

We create 4 wrapper classes:

1)  ProcessFrame84() \
&nbsp; This redefines the observation returned from each state as a grayscaled image of shape 84x84x1 

2) ImageToPyTorch() \
&nbsp; This changes the dimension order of the array(from H,W,C to C,H,W, to facilitate PyTorch training)

3) BufferWrapper() \
&nbsp; This redefines observation returned as a stack of 4 images \
&nbsp; The observation method ensures return of latest 4 arrays

4) ScaledFloatFrame() \
&nbsp; Rescales pixel values to range 0-1

Finally, we add everything under function make_env() \
This ensures that a stack of latest 4 grayscaled images will be returned at every state
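
The stacking behaviour of BufferWrapper can be sketched in isolation. The snippet below is a minimal illustration only: it assumes toy 2x2 frames instead of 84x84, and push_frame is a hypothetical helper name that mirrors the shift-and-append logic of the observation method rather than the wrapper itself.

```python
import numpy as np

n_steps = 4                                            # number of stacked frames, as in make_env()
buffer = np.zeros((n_steps, 2, 2), dtype=np.float32)   # toy 2x2 frames (assumption)

def push_frame(buffer, frame):
    """Hypothetical helper: drop the oldest frame and append the newest one."""
    buffer[:-1] = buffer[1:]      # shift every frame one slot towards index 0
    buffer[-1] = frame            # the newest frame always sits in the last slot
    return buffer

# Feed 5 frames whose pixels all equal the frame number
for t in range(1, 6):
    stack = push_frame(buffer, np.full((2, 2), t, dtype=np.float32))

frame_ids = stack[:, 0, 0].tolist()   # one pixel per stacked frame
print(frame_ids)                      # [2.0, 3.0, 4.0, 5.0]
```

After 5 frames, frame 1 has been pushed out and only the latest 4 remain, which is what the model sees at every state.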

We create a class for storing agent experience: \
ExperienceReplay() creates a deque with maxlen set to capacity. Its sample method randomly picks batch_size indices without replacement, then zips the selected experiences into their respective arrays[state, action, reward, done, new_state]

We also create a namedtuple called Experience to hold the experience for every state before it is appended to the deque created above
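
To make the deque & zip behaviour concrete, here is a small standalone sketch. It assumes plain integers as states and uses Python's random module instead of numpy; the names Transition and memory are illustrative and not part of the notebook.

```python
import collections
import random

Transition = collections.namedtuple(
    "Transition", ["state", "action", "reward", "done", "new_state"])

capacity = 5
memory = collections.deque(maxlen=capacity)   # oldest entries are evicted automatically

# Store 8 toy transitions; states are plain integers here (assumption)
for t in range(8):
    memory.append(Transition(t, t % 2, 0.0, False, t + 1))

stored_states = [tr.state for tr in memory]
print(stored_states)                          # [3, 4, 5, 6, 7]

# Sample 3 distinct transitions, then regroup them field by field
random.seed(0)
batch = random.sample(list(memory), 3)
states, actions, rewards, dones, new_states = zip(*batch)
print(len(states), len(set(states)))          # 3 3
```

Only the 5 most recent transitions survive, and zip(*batch) turns a list of transitions into one tuple per field, which is the shape the training step expects.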

In [None]:
class ProcessFrame84(gym.ObservationWrapper):
    def __init__(self, env=None):
        super(ProcessFrame84, self).__init__(env)
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(84, 84, 1), dtype=np.uint8)
            
    def observation(self, obs):
        return self.process(obs)

    def process(self,frame):
        img = frame[:, :, 0] * 0.299 + frame[:, :, 1] * 0.587 + frame[:, :, 2] * 0.114
        resized_screen = cv2.resize(img, (84, 110))
        x_t = resized_screen[18:102, :]
        x_t = np.reshape(x_t, [84, 84, 1])
        return x_t.astype(np.uint8)
    
class ImageToPyTorch(gym.ObservationWrapper):
    def __init__(self, env):
        super(ImageToPyTorch, self).__init__(env)
        old_shape = self.observation_space.shape
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(old_shape[-1], old_shape[0], old_shape[1]),dtype=np.float32)
    
    def observation(self, observation):
        return np.moveaxis(observation, 2, 0)    
    
class BufferWrapper(gym.ObservationWrapper):
    def __init__(self, env, n_steps, dtype=np.float32):
        super(BufferWrapper, self).__init__(env)
        self.dtype = dtype
        old_space = env.observation_space
        self.observation_space = gym.spaces.Box(old_space.low.repeat(n_steps, axis=0),old_space.high.repeat(n_steps, axis=0),dtype=dtype)
    
    def reset(self):
        self.buffer = np.zeros_like(self.observation_space.low, dtype=self.dtype)
        return self.observation(self.env.reset())
    
    def observation(self, observation):
        self.buffer[:-1] = self.buffer[1:]
        self.buffer[-1] = observation
        return self.buffer   
    
class ScaledFloatFrame(gym.ObservationWrapper):
    def observation(self, obs):
        return np.array(obs).astype(np.float32) / 255.0    
    
def make_env(env_name):
    env = gym.make(env_name,render_mode='human',full_action_space=False)
    env = ProcessFrame84(env)
    env = ImageToPyTorch(env) 
    env = BufferWrapper(env, 4)
    return ScaledFloatFrame(env) 

Experience = collections.namedtuple('Experience',field_names=['state', 'action', 'reward', 'done', 'new_state'])

class ExperienceReplay:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)
        
    def __len__(self):
        return len(self.buffer)
    
    def append(self, experience):
        self.buffer.append(experience)
  
    def sample(self, batch_size):
        indices = np.random.choice(len(self.buffer),batch_size,replace=False)
        states, actions, rewards, dones, next_states = zip(*[self.buffer[idx] for idx in indices])
        return np.array(states), np.array(actions,dtype=np.int64), np.array(rewards,dtype=np.float32), np.array(dones, dtype=np.uint8), np.array(next_states)

We define our hyperparameters here:
1) episodes is the number of training episodes. Each episode terminates when either player reaches 21 points \
2) gamma is the discount factor multiplied to the action-value of the next state \
3) batch_size is the number of experiences sampled from agent's memory for each weights update \
4) min_memory_len is the minimum number of experiences the agent must collect before model training begins. Below, it is also used as the capacity of the experience buffer \
5) learning_rate is the learning rate used for model training \
6) epsilon_start is the starting value for epsilon. Recall that agent selects a random action whenever a uniformly-drawn random number falls below epsilon \
7) epsilon_decay is the value multiplied to epsilon at every step. This ensures continuous decay of epsilon, resulting in agent acting more greedily as it gets better \
8) epsilon_min is the minimum epsilon value allowed. This ensures that agent will always carry out exploration with a small probability \
9) device is where our training data & models are stored. Here, we use the gpu to accelerate training if one is available, else the cpu
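
As a rough sanity check on the exploration schedule, the sketch below estimates how many environment steps it takes for epsilon to fall from epsilon_start to epsilon_min. It assumes, as in the training loop further down, that the decay is applied once per step; steps_to_min is an illustrative name.

```python
import math

epsilon_start = 1.0
epsilon_decay = 0.99999
epsilon_min = 0.03

# Closed form: smallest n with epsilon_start * epsilon_decay**n <= epsilon_min
steps_to_min = math.ceil(
    math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay))

# Same update rule as the training loop, applied once per environment step
epsilon = epsilon_start
for _ in range(steps_to_min):
    epsilon = max(epsilon * epsilon_decay, epsilon_min)

print(steps_to_min)   # roughly 350,000 steps
print(epsilon)        # clamped at epsilon_min
```

With these values the agent keeps exploring noticeably for the first few hundred thousand steps, so a slower or faster decay directly changes how long training stays mostly random.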

In [None]:
episodes = 800
gamma = 0.99    
batch_size = 32
min_memory_len = 10000
learning_rate = 0.0001
epsilon_start = 1.0
epsilon_decay = 0.99999
epsilon_min = 0.03
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

We also create 2 classes for function approximation & agent-training:
1) DQN() \
&nbsp; This defines the neural network for estimating action values \
&nbsp; Largely based on the architecture described in [this paper](https://arxiv.org/abs/1312.5602)

2) Agent() \
&nbsp; This defines the agent \
&nbsp; a) __init__() initialises parameters used for agent-training. It assigns the input env to the env variable, stores the experience buffer as exp_buffer, sets variable learns to 0 & calls the _reset() method \
&nbsp; b) _reset() resets env, timestep & total_reward \
&nbsp; c) select_action() is epsilon-greedy. It draws a uniform random number; if it is < epsilon, we take a random action. Else, act greedily \
&nbsp; d) select_optimal_action() is used to observe agent's behavior after training. The agent always takes the action that maximises its action-value at every state \
&nbsp; e) get_experience() adds replay experiences to agent memory. Logic is as follows: \
&emsp; 1) Create episode_reward variable & set as None \
&emsp; 2) Call select_action() with model, epsilon & device to choose an action, then pass that action to env.step() to produce next_state, reward, terminate & info \
&emsp; 3) Create Experience namedtuple using self.state,action,reward,terminate,next_state \
&emsp; 4) Append Experience to experience buffer \
&emsp; 5) Set state as next_state, increment timestep & total_reward \
&emsp; 6) IF next state is terminal, set episode_reward as total_reward. Print termination timestep & episode_reward. Call _reset() method, return True with episode_reward \
&emsp; 7) ELSE, IF we have collected enough memory, call update_weights() \
&emsp; 8) Return False & episode_reward[with None value] \
&nbsp; f) update_weights() is used to update model weights. Logic is as follows: \
&emsp; 1) Call sample method of experience buffer. Assign result as batch variable \
&emsp; 2) Assign resulting arrays to states, actions, rewards, dones, next_states\
&emsp; 3) Create tensors states_t,next_states_t, actions_t, rewards_t & done_mask \
&emsp; 4) Pass states_t to model to obtain action value predictions \
&emsp; 5) Use gather method to obtain action values for every state based on their respective actions. Call this action_values \
&emsp; 6) Pass next_states_t to target_model to obtain next state action values \
&emsp; 7) Find the max action value of each next state. Call this next_action_values \
&emsp; 8) Use done_mask to set terminal next state action values to 0 \
&emsp; 9) Calculate expected_action_values by multiplying next_action_values by gamma before adding rewards_t \
&emsp; 10) Calculate loss using nn.MSELoss()(action_values, expected_action_values) & do backward propagation \
&emsp; 11) Step the optimizer & increment learns \
&emsp; 12) On every 1000th learn, copy the model weights to target_model
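
The target calculation in steps 5 to 10 of update_weights() can be traced by hand on a tiny batch. The sketch below uses NumPy instead of PyTorch and made-up numbers purely for illustration; q_values and next_q_values stand in for the outputs of model and target_model.

```python
import numpy as np

gamma = 0.99

# Toy batch of 3 transitions with 2 actions each (all numbers are made up)
q_values = np.array([[1.0, 2.0], [0.5, 0.1], [3.0, 4.0]])       # stands in for model(states)
next_q_values = np.array([[2.0, 5.0], [1.0, 3.0], [7.0, 6.0]])  # stands in for target_model(next_states)
actions = np.array([1, 0, 1])
rewards = np.array([0.0, 1.0, -1.0])
dones = np.array([False, False, True])

# Step 5: keep the predicted value of the action that was actually taken
action_values = q_values[np.arange(len(actions)), actions]

# Steps 6-8: best next-state value, set to 0 where the episode has ended
next_action_values = next_q_values.max(axis=1)
next_action_values[dones] = 0.0

# Step 9: bootstrapped target; step 10: mean squared error
expected_action_values = rewards + gamma * next_action_values
loss = np.mean((action_values - expected_action_values) ** 2)

print(action_values)            # [2.  0.5 4. ]
print(expected_action_values)   # [ 4.95  3.97 -1.  ]
```

The third transition is terminal, so its target is just the reward of -1; the other two add the discounted best value of the next state.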

In [None]:
class DQN(nn.Module):
    def __init__(self):
        super().__init__()
        self.Conv1 = nn.Conv2d(4,32,8,stride=4)
        self.Conv2 = nn.Conv2d(32,64,4,stride=2)
        self.Conv3 = nn.Conv2d(64,64,3,stride=1)
        self.Linear1 = nn.Linear(3136,512)
        self.Linear2 = nn.Linear(512, 6)
        
        
    def forward(self,x):
        x = F.relu(self.Conv1(x))
        x = F.relu(self.Conv2(x))
        x = F.relu(self.Conv3(x))
        x = torch.flatten(x,1,3)
        x = F.relu(self.Linear1(x))
        x = self.Linear2(x)
        return x
    
class Agent():
    def __init__(self,env,exp_buffer):
        self.env = env
        self.exp_buffer = exp_buffer
        self.learns = 0
        self._reset()
        
    def _reset(self):
        self.state = self.env.reset()  # use the agent's own env, not the global one
        self.timestep = 0
        self.total_reward = 0    
            
    def select_action(self,model,epsilon,device=device):
        # Explore with probability epsilon, otherwise act greedily
        if np.random.random() < epsilon:
            action = self.env.action_space.sample()
        else:
            action = self.select_optimal_action(model,device)
        return action    
    
    def select_optimal_action(self,model,device=device):
        # Add a batch dimension, then pick the action with the highest predicted value
        state = torch.tensor(np.array([self.state])).to(device)
        with torch.no_grad():
            action = int(model(state).argmax(dim=1).item())
        return action 
            
    def get_experience(self,model,target_model,epsilon,device=device):
        episode_reward = None
        action = self.select_action(model,epsilon,device)
        next_state, reward, terminate, info = self.env.step(action)
        exp = Experience(self.state,action,reward,terminate,next_state)
        self.exp_buffer.append(exp)
        self.state = next_state
        self.timestep += 1
        self.total_reward += reward
        
        if terminate:
            episode_reward = self.total_reward
            print(f"Episode ended at timestep {self.timestep}, score: {episode_reward}")
            self._reset()
            return True, episode_reward
            
        if len(self.exp_buffer) >= min_memory_len:
            self.update_weights(model,target_model)    

        return False, episode_reward   
        
    def update_weights(self,model,target_model):
        batch = self.exp_buffer.sample(batch_size)
        states, actions, rewards, dones, next_states = batch
                
        states_t = torch.tensor(states).to(device)
        next_states_t = torch.tensor(next_states).to(device)
        actions_t = torch.tensor(actions).to(device)
        rewards_t = torch.tensor(rewards).to(device)
        # Boolean mask: indexing with uint8 tensors is deprecated in PyTorch
        done_mask = torch.tensor(dones, dtype=torch.bool).to(device)
        
        # Predicted value of the action actually taken in each state
        action_values = model(states_t).gather(1,actions_t.unsqueeze(-1)).squeeze(-1)
        # No gradients should flow through the target network
        with torch.no_grad():
            next_action_values = target_model(next_states_t).max(1)[0]
            next_action_values[done_mask] = 0.0
        
        expected_action_values = next_action_values*gamma + rewards_t
        loss_t = nn.MSELoss()(action_values, expected_action_values)
        
        optimizer.zero_grad()
        loss_t.backward()
        optimizer.step()
            
        self.learns += 1    
        if self.learns%1000 == 0:
            target_model.load_state_dict(model.state_dict())
            print(f"learns {self.learns}: target_model weights updated")
        

#### Train agent
We train for 800 episodes \
First, we initialise env, model & target_model \
Then, we create experience buffer using ExperienceReplay() \
Also, initialise agent, optimizer \
Finally, set epsilon as epsilon_start & create empty list episode_rewards

For every episode: \
&emsp; reset terminate as False \
&emsp; while not terminate, decay epsilon & call get_experience() to add experience to agent's buffer \
&emsp; IF terminate, append reward to episode_rewards & calculate mean of latest 100 rewards[mean_reward]. Print episode & mean_reward. Print message 'weights updated' if required memory length is achieved.

In [None]:
env = make_env("ALE/Pong-v5")
net = DQN().to(device)
target_net = copy.deepcopy(net).to(device)

# net = DQN().to(device)
# net.load_state_dict(torch.load("pong_agent.pth"))
# target_net = copy.deepcopy(net).to(device)
buffer = ExperienceReplay(min_memory_len)
agent = Agent(env, buffer)
optimizer = optim.Adam(net.parameters(), lr=learning_rate)
epsilon = epsilon_start
episode_rewards = []

for episode in tqdm(range(episodes)):
    terminate = False
    while not terminate:
        epsilon = max(epsilon*epsilon_decay,epsilon_min)
#         epsilon = epsilon_min
        terminate, reward = agent.get_experience(net,target_net,epsilon,device=device)
        if terminate:
            episode_rewards.append(reward)
            mean_reward = round(np.mean(episode_rewards[-100:]),3)
            print(f"episode {episode}, mean reward: {mean_reward}")
            if len(buffer) == min_memory_len:
                print('weights updated')
env.reset()        
env.close() 

#### Save model weights

Save model weights to "pong_agent.pth" \
We can load trained weights for a new agent if we do not want to retrain

In [None]:
torch.save(net.state_dict(), "pong_agent.pth")

### Run this if you only want to use pre-trained weights & observe agent in action

#### Observe agent
Let's see how our agent performs for 10 episodes 

In [None]:
trained_net = DQN().to(device)
trained_net.load_state_dict(torch.load("pong_agent.pth", map_location=device))

In [None]:
episodes = 10
env = make_env("ALE/Pong-v5")
buffer = ExperienceReplay(min_memory_len)
agent = Agent(env, buffer)
episode_rewards = []

for episode in tqdm(range(episodes)):
    terminate = False
    episode_reward = 0
    agent._reset()
    while not terminate:
        action = agent.select_optimal_action(trained_net,device=device)
        next_state, reward, terminate, info = agent.env.step(action)
        episode_reward += reward
        agent.state = next_state
        if terminate:
            episode_rewards.append(episode_reward)
mean_reward = sum(episode_rewards)/len(episode_rewards)            
print(f"mean reward: {mean_reward:.3f}")

env.reset()        
env.close() 

#### Conclusion

Our agent was able to win most of the 10 games after training for 800 episodes \
One can observe certain interesting traits of the agent's behavior:
1) Most wins are scored by hitting the ball such that the return angle cannot be reached by the opponent's paddle \
2) If the ball starts by going towards the bottom of the agent's paddle, it wins mostly by doing a "header"[hitting with top part of paddle to return ball at high speed, towards upper end of opponent] \
3) Fast returns by the opponent often result in a loss 

There are definitely ways to improve this agent, so feel free to try out your own methods