# In the name of God
### HW6
### Deep Q-Learning



**Name:** ...

**Std. No.:** ...


### Deep Q-Learning (DQN)

Deep Q-Learning (DQN) is a reinforcement learning algorithm that combines Q-learning, a classic tabular method, with deep neural networks. The goal is to train an agent to make decisions by estimating the optimal action-value function Q, which gives the expected cumulative discounted reward for taking a particular action in a given state and acting optimally afterwards.
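
For reference, the definition above can be written as the Bellman optimality equation. This is a sketch in standard notation; the symbols $s'$, $a'$, and $r$ (next state, next action, and reward) are assumptions of this note, and $\gamma$ is the discount factor:

$$
Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]
$$

DQN replaces the table of Q-values with a neural network $Q(s, a; \theta)$ and trains it by regressing its predictions toward the right-hand side of this equation.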

Key components of DQN:

- **Experience Replay:** To break the temporal correlation in sequential data and improve sample efficiency, we use an experience replay buffer to store and sample past experiences.
- **Target Networks:** The use of two separate networks, the main network and a target network, helps stabilize training by decoupling the update targets from the online network's constantly changing values.
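
The two components above can be illustrated with a toy sketch that uses only the standard library. The transition values, the `SYNC_EVERY` constant, and the scalar "weights" are hypothetical stand-ins chosen for illustration; they are not the buffer or networks you will implement in this notebook.

```python
import collections
import random

# --- Experience replay: a bounded buffer with uniform random sampling ---
buffer = collections.deque(maxlen=5)  # oldest transitions are evicted first

# Toy transitions in the order (state, action, reward, done, next_state).
for t in range(8):
    buffer.append((t, t % 4, 1.0, False, t + 1))

oldest_state = buffer[0][0]  # only the 5 most recent transitions remain

random.seed(0)
batch = random.sample(list(buffer), 3)  # sampling without replacement
states, actions, rewards, dones, next_states = zip(*batch)  # split into columns

# --- Target network: a periodically refreshed copy of the online weights ---
SYNC_EVERY = 3      # hypothetical sync interval
online_w = 0.0      # stand-in for the online network's parameters
target_w = online_w # stand-in for the target network's parameters
target_history = []
for step in range(1, 7):
    online_w += 0.1                 # stand-in for one gradient update
    if step % SYNC_EVERY == 0:
        target_w = online_w         # hard update: copy online -> target
    target_history.append(round(target_w, 1))
```

The sampled batch mixes transitions from different time steps, and the target value changes only at the sync points while the online value moves at every step. With PyTorch modules, the hard update is typically done by loading the online network's `state_dict()` into the target network.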

### The Lunar Lander Problem

The task is to control a lunar lander and guide it to a safe landing on the moon's surface. The agent must learn a policy that maps the lander's state (position, velocity, angle, angular velocity, and leg-contact flags) to one of four discrete actions: fire the left orientation engine, fire the right orientation engine, fire the main engine, or do nothing.
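
As a quick reference, the layout below follows the Gym documentation for the discrete LunarLander-v2 environment. The names `ACTIONS` and `OBSERVATION` are labels made up for this sketch and are not part of the Gym API.

```python
# Action indices as documented for the discrete LunarLander-v2 environment.
ACTIONS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}

# Components of the 8-dimensional observation vector, in order.
OBSERVATION = [
    "x position",
    "y position",
    "x velocity",
    "y velocity",
    "angle",
    "angular velocity",
    "left leg contact",
    "right leg contact",
]

in_features = len(OBSERVATION)  # input size of the Q-network
n_actions = len(ACTIONS)        # output size of the Q-network
```

These two sizes correspond to `env.observation_space.shape[0]` and `env.action_space.n`, which are passed to the network constructor later in the notebook.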

### Overview

- **Environment:** LunarLander-v2 from OpenAI Gym.
- **Objective:** Train an agent to learn a policy for landing the lunar lander safely.
- **Techniques:** Deep Q-Learning, Experience Replay, Target Networks.

### Instructions

1. Follow the instructions and comments in the code cells to implement and understand each component.
2. Replace the `#####TO DO#####` placeholders with your code.
3. Experiment with hyperparameters and observe how they affect the training process.
4. Run the notebook to train the agent and play the game with the trained model.
5. Answer any provided questions or tasks to reinforce your understanding.

### Prerequisites

Make sure you have the following libraries installed:


In [None]:
!pip install --upgrade setuptools wheel

In [None]:
!pip install swig
!pip install "gym[box2d]"

# Imports

In [None]:
import numpy as np
import gym
import time
import torch
import torch.nn as nn
import torch.optim as optim
import os
import collections
import matplotlib.pyplot as plt


env = gym.make('LunarLander-v2')

In [None]:

class DQN(nn.Module):
    def __init__(self, in_features, n_actions):
        """
        Initialize the Deep Q-Network (DQN).

        Parameters:
        - in_features (int): Number of input features (dimension of the state).
        - n_actions (int): Number of possible actions in the environment.
        """
        super(DQN, self).__init__()

        # TODO: Implement the neural network architecture
        # Use Linear layers with ReLU
        # Number of hidden units in each layer:
        # - Layer 1: 256 units
        # - Layer 2: 128 units
        # - Layer 3: 64 units


    def forward(self, x):
        """
        Define the forward pass of the neural network.

        Parameters:
        - x (torch.Tensor): Input tensor representing the state.

        Returns:
        - torch.Tensor: Output tensor representing Q-values for each action.
        """
        # TODO: Implement the forward pass
        return


In [None]:


class ExperienceBuffer():
    def __init__(self, capacity):
        """
        Initialize the Experience Replay Buffer.

        Parameters:
        - capacity (int): Maximum capacity of the buffer.
        """
        self.exp_buffer = collections.deque(maxlen=capacity)

    def append(self, exp):
        """
        Append a new experience to the buffer.

        Parameters:
        - exp (tuple): Tuple representing a single experience (state, action, reward, done, next_state).
        """
        self.exp_buffer.append(exp)

    def __len__(self):
        """
        Get the current size of the buffer.

        Returns:
        - int: Number of experiences currently stored in the buffer.
        """
        return len(self.exp_buffer)

    def clear(self):
        """Clear all experiences from the buffer."""
        self.exp_buffer.clear()

    def sample(self, batch_size):
        """
        TODO: Sample a batch of experiences from the buffer.

        Parameters:
        - batch_size (int): Size of the batch to be sampled.

        Returns:
        - tuple: Batch of experiences (states, actions, rewards, dones, next_states).
        """
        # TODO: Implement the sampling logic


        # TODO: Convert to NumPy arrays with appropriate data types
        return


In [None]:
class Agent():
    def __init__(self, env, buffer):
        """
        Initialize the agent.

        Parameters:
        - env: The environment the agent interacts with.
        - buffer: Experience replay buffer to store agent experiences.
        """
        self.env = env
        self.buffer = buffer
        self._reset()

    def _reset(self):
        """
        Reset the agent's state and total rewards to the initial state.
        """
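        # Use the agent's own environment rather than the global `env`.
        # Newer Gym releases return (observation, info) from reset();
        # older ones return only the observation.
        result = self.env.reset()
        self.state = result[0] if isinstance(result, tuple) else result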
        self.total_rewards = 0.0

    def step(self, net, eps, device="cpu"):
        """
        TODO: Implement the exploration-exploitation strategy (epsilon-greedy) here.

        Take a step in the environment using the provided neural network.

        Parameters:
        - net: The neural network representing the agent's policy.
        - eps (float): Epsilon value for epsilon-greedy exploration.
        - device (str): Device for neural network computations.

        Returns:
        - done_reward: Total rewards obtained in the episode if it is finished, otherwise None.
        """
        done_reward = None

        # TODO: Implement exploration-exploitation strategy here

        # TODO: Take the selected action for 4 time steps (adjustable)

        # TODO: Append the experience to the buffer

        return done_reward


In [None]:
# Hyperparameters
GAMMA = 0.99  # Discount factor for future rewards
EPSILON_START = 1.0  # Initial exploration probability (epsilon-greedy)
EPSILON_FINAL = 0.01  # Final exploration probability (epsilon-greedy)
EPSILON_DECAY_OBS = 10**5  # Number of observations for epsilon decay
BATCH_SIZE = 32  # Size of the experience replay batch
MEAN_GOAL_REWARD = 250  # Mean reward goal for solving the environment
REPLAY_BUFFER_SIZE = 10000  # Maximum capacity of the experience replay buffer
REPLAY_MIN_SIZE = 10000  # Minimum size of the experience replay buffer before training begins
LEARNING_RATE = 1e-4  # Learning rate for the neural network optimizer
SYNC_TARGET_OBS = 1000  # Number of observations before synchronizing target and online networks

In [None]:
import torch
import torch.nn as nn

def cal_loss(batch, net, tgt_net, device='cpu'):
    """
    TODO: Implement the loss calculation for Deep Q-Learning.

    Calculate the loss for Deep Q-Learning.

    Parameters:
    - batch (tuple): Batch of experiences (states, actions, rewards, dones, next_states).
    - net: The neural network representing the online Q-network.
    - tgt_net: The neural network representing the target Q-network.
    - device (str): Device for neural network computations (default is "cpu").

    Returns:
    - torch.Tensor: Loss value calculated using Mean Squared Error (MSE) loss.
    """

    states, actions, rewards, dones, next_states = batch
    states_v = torch.tensor(states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    dones_v = torch.BoolTensor(dones).to(device)
    next_states_v = torch.tensor(next_states).to(device)

    # TODO: Calculate Q-values for the current states and selected actions

    # TODO: Calculate the maximum Q-value for the next states using the target network

    # TODO: Zero out Q-values for terminal states

    # TODO: Detach Q-values for the next states to avoid gradient flow

    # TODO: Calculate the expected return for the current states

    # TODO: Implement the Mean Squared Error (MSE) loss calculation

    return loss



# Learning Curves

Plot learning curves showing key metrics (e.g., total rewards, loss) over the course of training. Analyze the trends and identify key points in the learning process.
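
Raw episode rewards are noisy, so a running mean usually makes the trend easier to read. The helper below is a minimal sketch, assuming a trailing window like the 100-episode mean used in the training loop; `moving_average` and the reward values are hypothetical and only illustrate the idea.

```python
def moving_average(values, window=100):
    """Trailing mean over the last `window` values (shorter at the start)."""
    averaged = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        averaged.append(running_sum / min(i + 1, window))
    return averaged


# Made-up episode rewards, used only to show the smoothing.
episode_rewards = [-200.0, -150.0, -100.0, 0.0, 50.0]
smoothed = moving_average(episode_rewards, window=2)
```

Plotting `episode_rewards` and `smoothed` on the same axes (for example with `plt.plot`) shows both the per-episode variance and the underlying trend.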

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')

net = DQN(env.observation_space.shape[0], env.action_space.n).to(device)
tgt_net = DQN(env.observation_space.shape[0], env.action_space.n).to(device)

buffer = ExperienceBuffer(REPLAY_BUFFER_SIZE)

agent = Agent(env, buffer)

epsilon = EPSILON_START

optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)

# Lists to track total rewards and losses over training
total_rewards = []
losses = []

# Initialize time variables for tracking training time
ts = time.time()
best_mean_reward = None
obs_id = 0
loss_t = None  # Most recent training loss; stays None until the first optimization step

while True:
    obs_id += 1

    # Update exploration rate based on epsilon decay schedule
    epsilon = max(EPSILON_FINAL, EPSILON_START - obs_id/EPSILON_DECAY_OBS)

    # Agent takes a step in the environment; returns the episode's total
    # reward when the episode ends, otherwise None
    reward = agent.step(net, epsilon, device=device)

    if reward is not None:
        # Store total rewards and update game time
        total_rewards.append(reward)
        game_time = time.time() - ts
        ts = time.time()
        mean_reward = np.mean(total_rewards[-100:])

        # Record the latest loss only once training has produced one
        if loss_t is not None:
            losses.append(loss_t.item())

        if best_mean_reward is None or best_mean_reward < mean_reward:
            torch.save(net.state_dict(), './lunar_lander-best.dat')

            if best_mean_reward is None:
                last = mean_reward
                best_mean_reward = mean_reward

            if best_mean_reward is not None and best_mean_reward - last > 10:
                last = best_mean_reward
                print("GAME : {}, TIME ELAPSED : {}, EPSILON : {}, MEAN_REWARD : {}"
                      .format(obs_id, game_time, epsilon, mean_reward))
                print("Reward {} -> {} Model Saved".format(best_mean_reward, mean_reward))

            best_mean_reward = mean_reward

        if mean_reward > MEAN_GOAL_REWARD:
            print("SOLVED in {} obs".format(obs_id))
            break

    # Skip the training step until the replay buffer holds at least REPLAY_MIN_SIZE experiences
    if len(buffer) < REPLAY_MIN_SIZE:
        continue

    # Synchronize target network with the Q-network at regular intervals
    if obs_id % SYNC_TARGET_OBS == 0:
        tgt_net.load_state_dict(net.state_dict())

    # TODO: Implement the training process (calculating loss, backpropagation, and optimizer step)


    # TODO: Plot learning curves every few episodes or steps



# Visual Comparison:

Write a function to render and display the environment before and after training. What visual differences do you observe in the agent's behavior? Discuss them. Also, upload the videos with your notebook. You can use the following library for rendering and saving videos.

In [None]:
import imageio

# Helper function for rendering and saving a video
def render_and_save_video(env, net, episodes=10, save_path="./render_video.mp4", device="cpu"):
    # TODO: Render and display the environment
    pass  # Placeholder so the cell parses; replace with your implementation


# Render and save a video before training
print("### BEFORE TRAINING ###")
render_and_save_video(env, net, device=device, save_path='./before.mp4')

# Render and save a video after training
print("### AFTER TRAINING ###")
render_and_save_video(env, net, device=device, save_path='./after.mp4')


# Question:

Exploration (Epsilon-Greedy):

Discuss the significance of the exploration strategy, specifically the Epsilon-Greedy approach, in balancing exploration and exploitation during training.
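
To ground the discussion, here is a minimal sketch of the mechanism. The linear schedule and its constants mirror the training loop in this notebook; the helper names `epsilon_at` and `epsilon_greedy`, and the example Q-values, are hypothetical and not the required implementation of `Agent.step`.

```python
import random

EPSILON_START = 1.0
EPSILON_FINAL = 0.01
EPSILON_DECAY_OBS = 10**5


def epsilon_at(obs_id):
    """Linearly decayed exploration probability, floored at EPSILON_FINAL."""
    return max(EPSILON_FINAL, EPSILON_START - obs_id / EPSILON_DECAY_OBS)


def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps pick a random action, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Example Q-values for the four actions (made-up numbers).
q_values = [0.1, 0.9, 0.3, 0.2]
greedy_action = epsilon_greedy(q_values, eps=0.0)
random_action = epsilon_greedy(q_values, eps=1.0)
```

Early in training `epsilon_at` is close to 1, so nearly every action is random and the buffer fills with diverse experience. As the value decays toward `EPSILON_FINAL`, the agent increasingly follows its own Q-estimates while still taking an occasional random action.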