<a href="https://colab.research.google.com/github/mayank0290/Deep-Q-Learning-For-Lunar-Lander/blob/main/Deep_Q_Learning_for_Lunar_Lander.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Q-Learning for Lunar Lander

## Part 0 - Installing the required packages and importing the libraries

In [49]:
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!apt-get install -y swig
!pip install gymnasium[box2d]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig is already the newest version (4.0.2-1ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.


### Importing the libraries

In [50]:
import os
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.autograd as autograd
from torch.autograd import Variable
from collections import deque, namedtuple

## Part 1 - Building the AI

### Creating the architecture of the Neural Network

**Brain:** the neural network at the core of the agent.

`state_size`: the number of values describing the game state (position, speed, etc.) that the brain receives.

`action_size`: the number of different actions the brain can choose from (fire the left engine, fire the right engine, etc.).

`seed`: fixes the random initialization so the brain behaves the same way after a restart.

`nn.Linear`: creates three fully connected layers in our brain:

1. `First layer`: Takes in the game information and processes it into 64 numbers
2. `Second layer`: Takes those 64 numbers and processes them into another 64 numbers
3. `Third layer`: Takes those 64 numbers and turns them into one value for each possible action

**2nd Part** of the code (the `forward` method) shows how information flows through our brain:

* The game information goes into the first layer.
* `F.relu(x)` is like an on/off switch: it turns negative numbers to zero and keeps positive numbers as they are.
* The information passes through the second layer.
* Finally, the third layer gives us a score for each possible action.



In [51]:
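class Brain(nn.Module):
  def __init__(self, state_size, action_size, seed = 42) -> None:
    super(Brain, self).__init__()
    # Seed PyTorch's RNG so the initial weights are reproducible
    self.seed = torch.manual_seed(seed)
    self.fc1 = nn.Linear(state_size, 64)   # state -> 64 hidden units
    self.fc2 = nn.Linear(64, 64)           # 64 -> 64 hidden units
    self.fc3 = nn.Linear(64, action_size)  # 64 -> one Q-value per action

  def forward(self, state):
    x = self.fc1(state)
    x = F.relu(x)
    x = self.fc2(x)
    # Activation after the second layer; without it fc2 and fc3
    # would collapse into a single linear map
    x = F.relu(x)
    return self.fc3(x)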
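
As a rough illustration of the flow described above (linear layer, ReLU, linear layer), here is a minimal pure-Python sketch with hand-picked toy weights. The helper names `linear` and `relu` and all the numbers are illustrative assumptions, not part of the notebook; the real network uses `nn.Linear` with learned weights and 64 hidden units.

```python
def linear(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Replace negative values with zero and keep positive values unchanged."""
    return [max(0.0, v) for v in values]

# Toy 2-value "state" (the real LunarLander state has 8 values)
state = [1.0, -2.0]

# Hidden layer with 2 units, followed by ReLU
hidden = relu(linear(state, weights=[[1.0, 1.0], [0.5, -0.5]], biases=[0.0, 0.0]))

# Output layer with 3 units: one score per (toy) action
action_scores = linear(hidden,
                       weights=[[1.0, 2.0], [-1.0, 0.0], [0.0, 1.0]],
                       biases=[0.1, 0.0, 0.0])

# The greedy choice is the action with the highest score
best_action = max(range(len(action_scores)), key=action_scores.__getitem__)
print(hidden, action_scores, best_action)
```

The first hidden unit would be negative before the ReLU, so it is switched off and contributes nothing to the action scores.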


## Part 2 - Training the AI

### Setting up the environment

In [52]:
import gymnasium as gym
env = gym.make('LunarLander-v3')
state_shape = env.observation_space.shape
state_size = env.observation_space.shape[0]
number_actions = env.action_space.n
print('State shape: ', state_shape)
print('State size: ', state_size)
print('Number of actions: ', number_actions)

State shape:  (8,)
State size:  8
Number of actions:  4


### Initializing the hyperparameters

**Learning Parameters**

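**Learning Rate (0.0005):** The step size the optimizer uses when adjusting the network weights. It works like a volume knob: small turns give fine control but take longer to reach the right level.

**Minibatch Size (100):** The agent learns from 100 sampled experiences per update.

**Discount Factor (0.99):** The agent values future rewards almost as much as immediate ones.

**Replay Buffer (100,000):** The agent remembers up to the 100,000 most recent experiences to learn from.

**Tau (0.001):** The soft-update rate. At each learning step the target network moves 0.1% of the way toward the local network, which keeps the learning targets stable.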

In [53]:
learning_rate = 5e-4
minibatch_size = 100
discount_factor = 0.99
replay_buffer_size = int(1e5)
tau = 1e-3
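
To make the discount factor and tau more concrete, the following sketch works through the arithmetic for the values used above. It is only an illustration; the variable names `effective_horizon` and `remaining_after_1000` are introduced here and do not appear elsewhere in the notebook.

```python
discount_factor = 0.99  # same value as in the cell above
tau = 1e-3              # same value as in the cell above

# Weight given to a reward that arrives k steps in the future: gamma ** k
for k in (1, 10, 100, 500):
    print(f"reward {k:>3} steps ahead is weighted by {discount_factor ** k:.3f}")

# A common rule of thumb for how far ahead the agent effectively looks:
# the discounted weights sum to 1 / (1 - gamma)
effective_horizon = 1.0 / (1.0 - discount_factor)
print(f"effective horizon: about {effective_horizon:.0f} steps")

# Each soft update keeps (1 - tau) of the old target weights, so after n
# updates the original target weights still contribute (1 - tau) ** n
remaining_after_1000 = (1.0 - tau) ** 1000
print(f"share of the original target weights after 1000 updates: {remaining_after_1000:.3f}")
```

With these settings a reward 100 steps away still counts for roughly a third of its value, and the target network needs on the order of a thousand updates to mostly catch up with the local network.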

### Implementing Experience Replay



* **Random Sampling:** We pick experiences randomly to break correlations and provide diverse learning examples.
* **Batch Processing:** Processing multiple experiences at once is much more efficient than one at a time.
* **Data Organization:** We separate each component (states, actions, etc.) to make them easier to process in the learning algorithm.
* **Tensor Conversion:** Converting to PyTorch tensors allows the neural network to process the data more efficiently.
* **Filtering:** The `if e is not None` checks ensure we only process valid experiences.


**Replay Memory: The Agent's Learning Journal**

This `ReplayMemory` class serves as the agent's memory system, like a journal of past experiences. It stores a limited number of game moments (state, action, reward, next state, done status) and removes the oldest entry whenever the capacity is exceeded. When it's time to learn, the agent randomly samples a batch of these experiences rather than learning from them in the order they happened. This random sampling is crucial: consecutive game steps are highly correlated, and sampling at random breaks up those patterns and provides diverse situations to learn from. The class also handles the technical work of converting these memories into PyTorch tensors and moving them to the GPU when one is available, or the CPU otherwise. This memory system matters for deep reinforcement learning because it lets the agent revisit and learn from the same past experience multiple times.





In [54]:
class ReplayMemory(object):
  def __init__(self, capacity):
    # Sampled batches are moved to the GPU when one is available
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.capacity = capacity
    self.memory = []

  def push(self, event):
    # event is a (state, action, reward, next_state, done) tuple
    self.memory.append(event)
    if len(self.memory) > self.capacity:
      del self.memory[0]  # drop the oldest experience

  def sample(self, batch_size):
    # Uniform random sampling without replacement
    experiences = random.sample(self.memory, k = batch_size)
    # Stack each field into one array per batch, then convert it to a tensor
    states = torch.from_numpy(np.vstack([e[0] for e in experiences if e is not None])).float().to(self.device)
    actions = torch.from_numpy(np.vstack([e[1] for e in experiences if e is not None])).long().to(self.device)
    rewards = torch.from_numpy(np.vstack([e[2] for e in experiences if e is not None])).float().to(self.device)
    next_states = torch.from_numpy(np.vstack([e[3] for e in experiences if e is not None])).float().to(self.device)
    dones = torch.from_numpy(np.vstack([e[4] for e in experiences if e is not None]).astype(np.uint8)).float().to(self.device)
    # Note the order: next_states comes second, matching the unpacking in Agent.learn
    return states, next_states, actions, rewards, dones
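
The bounded-memory behaviour can also be sketched with the standard library alone. The example below is an assumption-laden illustration rather than the notebook's implementation: it swaps the hand-trimmed list for `collections.deque` with `maxlen`, uses a toy capacity of 5, and fills the buffer with placeholder tuples instead of real game states.

```python
import random
from collections import deque

capacity = 5                      # toy value; the notebook uses 100,000
buffer = deque(maxlen=capacity)   # the oldest item is dropped automatically when full

for step in range(8):
    # (state, action, reward, next_state, done) with placeholder values
    buffer.append((step, 0, 1.0, step + 1, False))

# Steps 0, 1 and 2 have been evicted, so the oldest stored state is 3
oldest_state = buffer[0][0]

# Uniform random sampling without replacement, as in ReplayMemory.sample
random.seed(0)
batch = random.sample(list(buffer), k=3)

# Regroup the batch field by field, mirroring the per-field stacking above
states, actions, rewards, next_states, dones = zip(*batch)
print(oldest_state, states)
```

A `deque` avoids the linear-time cost of deleting the first element of a list, at the price of slower indexing in the middle of the buffer.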

### Implementing the DQN class

This code implements a reinforcement learning agent using the Deep Q-Network (DQN) algorithm. Here's what it does:

**Core Components:**

* Two neural networks: a primary "local" network for decision-making and a more stable "target" network
* Experience replay memory to store and learn from past interactions
* Adam optimizer for network training


**Key Methods:**

* `step()`: Stores experiences and triggers learning every 4 steps
* `act()`: Makes decisions using an epsilon-greedy strategy (balancing exploration vs. exploitation)
* `learn()`: Updates network weights using the Bellman equation
* `soft_update()`: Gradually updates the target network for stability


**Reinforcement Learning Principles:**

* Learns from experience through trial and error
* Uses batch learning from randomly sampled past experiences
* Optimizes for long-term rewards using a discount factor
* Stabilizes learning through target networks and experience replay



This agent gradually improves its decision-making ability by learning which actions in which states lead to maximum long-term rewards.
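
The epsilon-greedy rule mentioned above can be sketched without any deep learning library. The function name `epsilon_greedy` and the example scores are hypothetical and only meant to show the decision rule: with probability epsilon pick a random action, otherwise pick the action with the highest estimated value.

```python
import random

def epsilon_greedy(action_values, epsilon, rng=random):
    """Return the index of the chosen action under an epsilon-greedy policy."""
    if rng.random() > epsilon:
        # Exploit: pick the action with the highest estimated value
        return max(range(len(action_values)), key=action_values.__getitem__)
    # Explore: pick any action uniformly at random
    return rng.randrange(len(action_values))

rng = random.Random(0)                 # local, seeded generator for repeatability
scores = [0.1, 0.9, 0.3, 0.2]          # made-up Q-values for 4 actions

greedy_choices = [epsilon_greedy(scores, epsilon=0.0, rng=rng) for _ in range(20)]
random_choices = [epsilon_greedy(scores, epsilon=1.0, rng=rng) for _ in range(20)]
print(greedy_choices)
print(random_choices)
```

With `epsilon=1.0` every action is random, which matches the start of training; as epsilon shrinks, the best-scoring action is chosen more and more often.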

In [55]:
class Agent():

  def __init__(self, state_size, action_size):
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.state_size = state_size
    self.action_size = action_size
    self.local_qnetwork = Brain(state_size, action_size).to(self.device)
    self.target_qnetwork = Brain(state_size, action_size).to(self.device)
    self.optimizer = optim.Adam(self.local_qnetwork.parameters(), lr = learning_rate)
    self.memory = ReplayMemory(replay_buffer_size)
    self.t_step = 0

  def step(self, state, action, reward, next_state, done):
    self.memory.push((state, action, reward, next_state, done))
    self.t_step = (self.t_step + 1) % 4
    if self.t_step == 0:
      if len(self.memory.memory) > minibatch_size:
        experiences = self.memory.sample(minibatch_size)
        self.learn(experiences, discount_factor)

  def act(self, state, epsilon = 0.):
    state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
    self.local_qnetwork.eval()
    with torch.no_grad():
      action_values = self.local_qnetwork(state)
    self.local_qnetwork.train()
    if random.random() > epsilon:
      return np.argmax(action_values.cpu().data.numpy())
    else:
      return random.choice(np.arange(self.action_size))

  def learn(self, experiences, discount_factor):
    states, next_states, actions, rewards, dones = experiences
    next_q_targets = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)
    q_targets = rewards + discount_factor * next_q_targets * (1 - dones)
    q_expected = self.local_qnetwork(states).gather(1, actions)
    loss = F.mse_loss(q_expected, q_targets)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    self.soft_update(self.local_qnetwork, self.target_qnetwork, tau)

  def soft_update(self, local_model, target_model, interpolation_parameter):
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
      target_param.data.copy_(interpolation_parameter * local_param.data + (1.0 - interpolation_parameter) * target_param.data)
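
The target computed in `learn()` follows the Bellman equation: reward plus the discounted best value of the next state, with the future term dropped when the episode has ended. The sketch below reproduces that arithmetic for a single transition in plain Python; the helper `td_target` and the example numbers are made up for illustration.

```python
discount_factor = 0.99  # same value as in the hyperparameters cell

def td_target(reward, next_q_values, done):
    """reward + gamma * max over next-state Q-values, with no future term once the episode is over."""
    return reward + discount_factor * max(next_q_values) * (1.0 - float(done))

next_q_values = [1.0, 2.0, 0.5, -1.0]  # made-up target-network outputs for 4 actions

# Mid-episode step: the best next-state value (2.0) is discounted and added
mid_episode = td_target(reward=-0.5, next_q_values=next_q_values, done=False)

# Final step: only the immediate reward counts
final_step = td_target(reward=100.0, next_q_values=next_q_values, done=True)

print(mid_episode, final_step)
```

The local network is then trained to move its estimate for the action actually taken toward this target, which is what the mean squared error loss in `learn()` measures.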

### Initializing the DQN agent

In [56]:
agent = Agent(state_size, number_actions)

### Training the DQN agent

**Reinforcement Learning Training Loop Summary**

This code implements the core training loop for reinforcement learning, where an agent learns through repeated interactions with an environment. Key components include:

**Training Parameters:**

* At most 2000 episodes, each up to 1000 timesteps long
* Epsilon starts at 1.0 and is multiplied by 0.995 after each episode until it reaches the floor of 0.01
* Performance is tracked via a rolling average of the last 100 episode scores


**Episode Structure:**

* Environment resets at start of each episode
* For each timestep: agent selects action, environment responds, agent learns
* Episode ends when environment signals completion or maximum timesteps reached


**Training Progress:**

* Epsilon gradually decreases, shifting from exploration to exploitation
* Regular progress updates display the average performance
* Training stops early once the 100-episode average score reaches 200 points
* When that happens, the local network's weights are saved to `checkpoint.pth`



This loop represents the iterative process where an agent improves its decision-making ability through repeated trial-and-error interactions with an environment.
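
The rolling average relies on `deque(maxlen=...)`, which silently discards the oldest score once the window is full. A small standalone sketch, using a toy window of 3 instead of 100 and invented scores:

```python
from collections import deque
from statistics import mean

window = deque(maxlen=3)  # toy size; the training loop uses maxlen = 100
rolling_means = []

for score in [-200.0, -100.0, 0.0, 100.0, 250.0, 250.0]:
    window.append(score)                 # the oldest score drops out once the window is full
    rolling_means.append(mean(window))   # average over at most the last 3 scores

print(rolling_means)
```

Until the window fills up, the average covers fewer episodes, so early values can swing sharply.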

In [57]:
number_episodes = 2000
maximum_number_timesteps_per_episode = 1000
epsilon_starting_value  = 1.0
epsilon_ending_value  = 0.01
epsilon_decay_value  = 0.995
epsilon = epsilon_starting_value
scores_on_100_episodes = deque(maxlen = 100)

for episode in range(1, number_episodes + 1):
  state, _ = env.reset()
  score = 0
  for t in range(maximum_number_timesteps_per_episode):
    action = agent.act(state, epsilon)
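    next_state, reward, terminated, truncated, _ = env.step(action)
    # Store only true termination (landing or crash) as "done", so the
    # learning target still bootstraps when the time limit cuts an episode short
    agent.step(state, action, reward, next_state, terminated)
    state = next_state
    score += reward
    if terminated or truncated:
      break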
  scores_on_100_episodes.append(score)
  epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)
  print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)), end = "")
  if episode % 100 == 0:
    print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)))
  if np.mean(scores_on_100_episodes) >= 200.0:
    print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode - 100, np.mean(scores_on_100_episodes)))
    torch.save(agent.local_qnetwork.state_dict(), 'checkpoint.pth')
    break

Episode 100	Average Score: -169.05
Episode 200	Average Score: -124.37
Episode 300	Average Score: -68.18
Episode 400	Average Score: 10.27
Episode 500	Average Score: 4.17
Episode 600	Average Score: 85.98
Episode 689	Average Score: 200.04
Environment solved in 589 episodes!	Average Score: 200.04
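
For a sense of how quickly exploration fades under these settings, the sketch below replays only the epsilon schedule, without any environment or network. The variable names `episodes_to_floor` and `epsilon_in_episode_689` are introduced here for illustration.

```python
epsilon_starting_value = 1.0
epsilon_ending_value = 0.01
epsilon_decay_value = 0.995

# Count how many per-episode decays it takes to hit the floor
epsilon = epsilon_starting_value
episodes_to_floor = 0
while epsilon > epsilon_ending_value:
    epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)
    episodes_to_floor += 1

# Epsilon in effect during episode 689, i.e. after 688 decays
epsilon_in_episode_689 = epsilon_starting_value * epsilon_decay_value ** 688

print(episodes_to_floor, round(epsilon_in_episode_689, 4))
```

Under this schedule the floor of 0.01 is reached only after roughly 900 episodes, so in the run shown above the agent was still taking a few percent of its actions at random when training stopped.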


## Part 3 - Visualizing the results

**Agent Visualization Code Summary**

This code creates and displays a video recording of a trained reinforcement learning agent in action. The workflow consists of two main functions:

**show_video_of_model():**

* Creates a Gymnasium environment with `render_mode='rgb_array'` so that frames can be captured
* Records each frame while the agent completes one full episode
* Saves the collected frames as an MP4 video file


**show_video():**

* Looks for MP4 files in the current directory and takes the first match
* Converts the video to base64 encoding
* Displays it directly in a Jupyter notebook using HTML



Together, these functions provide a complete visualization pipeline that allows researchers to observe and share how their trained agent performs in the environment.

In [58]:
import glob
import io
import base64
import imageio
from IPython.display import HTML, display

def show_video_of_model(agent, env_name):
    env = gym.make(env_name, render_mode='rgb_array')
    state, _ = env.reset()
    done = False
    frames = []
    while not done:
        frame = env.render()
        frames.append(frame)
        action = agent.act(state)
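        state, reward, terminated, truncated, _ = env.step(action.item())
        # Stop on a landing/crash or when the time limit ends the episode
        done = terminated or truncated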
    env.close()
    imageio.mimsave('video.mp4', frames, fps=30)

show_video_of_model(agent, 'LunarLander-v3')

def show_video():
    mp4list = glob.glob('*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        # Read-only binary mode is enough; the context manager closes the file
        with io.open(mp4, 'rb') as f:
            video = f.read()
        encoded = base64.b64encode(video)
        display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

show_video()
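
The embedding step in `show_video()` works by turning the raw file bytes into a base64 `data:` URI that the browser can play without a separate file request. A minimal sketch of that conversion, using a hypothetical helper `to_data_uri` and placeholder bytes instead of a real video:

```python
import base64

def to_data_uri(raw_bytes, mime_type="video/mp4"):
    """Encode raw bytes as a data URI suitable for an HTML src attribute."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

fake_video = b"\x00\x01not-a-real-video"   # placeholder bytes for illustration
uri = to_data_uri(fake_video)

# Decoding the payload gives back the original bytes
payload = uri.split(",", 1)[1]
round_trip = base64.b64decode(payload)
print(uri[:30], round_trip == fake_video)
```

Base64 inflates the data by about a third, so this approach suits short clips such as a single episode.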

