# In the name of God
## HW6
### Practical Section: TRPO Algorithm


**Name:** Javad Razi

**Std. No.:** 401204354

### PPO Algorithm

# Importing Required Libraries

First, we need to import the necessary libraries. We will be using OpenAI's `gym` for the Lunar Lander environment, `numpy` for numerical operations, and `torch` for implementing the neural network and optimization.


In [4]:
!pip install --upgrade setuptools wheel



In [5]:
!pip install swig
!pip install gym[box2d]


In [1]:
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Creating the Environment

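We create the Lunar Lander environment with the `gym.make()` function and pass `enable_wind=True`, so that wind is enabled as mentioned.


In [6]:
# Pass enable_wind when constructing the environment. Assigning env.enable_wind
# afterwards only sets an attribute on the wrapper returned by gym.make() and
# does not reach the underlying environment.
env = gym.make('LunarLander-v2', enable_wind=True)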



# Defining the Policy Network

We define a simple policy network using PyTorch. It takes the state of the environment as input and outputs a probability for each action. State values are estimated by a separate value network, defined in the next section.


In [7]:
class PolicyNetwork(nn.Module):
    def __init__(self, num_inputs, num_actions, hidden_size, learning_rate=3e-4):
        super(PolicyNetwork, self).__init__()

        self.num_actions = num_actions
        self.linear1 = nn.Linear(num_inputs, hidden_size)
        self.linear2 = nn.Linear(hidden_size, num_actions)
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)

    def forward(self, state):
        x = torch.tanh(self.linear1(state))
        x = self.linear2(x)
        action_probs = torch.softmax(x, dim=1)
        return action_probs
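
To illustrate how the action probabilities produced by such a network are used, here is a minimal sketch in plain Python. The `softmax` and `sample_action` helpers and the example logits are hypothetical and only mirror what `torch.softmax` followed by sampling would do; they are not part of the notebook's code.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs):
    """Draw an action index according to the given probabilities."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for the four discrete Lunar Lander actions
logits = [0.5, 1.5, -1.0, 0.0]
probs = softmax(logits)
action = sample_action(probs)
```

Sampling from the probabilities, rather than always taking the most likely action, is what keeps the policy exploring during training.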


# Defining the Value Network

Next, we define a value network that estimates the value of a state. This network is separate from the policy network and has its own parameters.


In [8]:
class ValueNetwork(nn.Module):
    def __init__(self, num_inputs, hidden_size, learning_rate=3e-4):
        super(ValueNetwork, self).__init__()

        self.linear1 = nn.Linear(num_inputs, hidden_size)
        self.linear2 = nn.Linear(hidden_size, 1)
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)

    def forward(self, state):
        x = torch.tanh(self.linear1(state))
        x = self.linear2(x)
        return x
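
A value network like this is commonly trained by regressing its output onto discounted returns, and the difference between return and value estimate then serves as the advantage. The sketch below shows that computation in plain Python; the `discounted_returns` helper, the `dones` flags, and the example numbers are assumptions made for illustration.

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Compute G_t = r_t + gamma * mask_t * G_{t+1}, with mask_t = 0 at episode ends."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0  # do not carry returns across episode boundaries
        running = rewards[t] + gamma * mask * running
        returns[t] = running
    return returns

rewards = [1.0, 1.0, 1.0]
dones = [False, False, True]
returns = discounted_returns(rewards, dones, gamma=0.5)  # [1.75, 1.5, 1.0]

# Hypothetical value estimates; the advantage is the return minus the estimate
values = [1.0, 1.0, 1.0]
advantages = [g - v for g, v in zip(returns, values)]  # [0.75, 0.5, 0.0]
```

The `mask` here plays the same role as the `masks` argument of the update functions: it is 0 at a terminal step and 1 otherwise.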


# Implementing the TRPO Algorithm

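Now we implement the TRPO update, using PyTorch's automatic differentiation to compute the gradients. The objective function and the constraint follow the task description. Note that this implementation is simplified: instead of solving the constrained optimization problem, it adds the KL divergence to the loss as a penalty weighted by `epsilon`.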
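
The task description is not reproduced in this notebook, so for reference, the standard TRPO formulation is sketched here, on the assumption that it matches the task. The symbols $\hat{A}_t$ (advantage estimate), $\delta$ (trust-region size), and $\beta$ (penalty weight) are introduced only for this illustration.

$$
\max_{\theta}\; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\right)\right] \le \delta
$$

A common relaxation moves the constraint into the objective as a penalty:

$$
\max_{\theta}\; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\right)\right]
$$

The code in the next cell follows this penalized form, with its `epsilon` argument playing the role of the penalty weight.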


In [9]:
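def trpo_step(policy_net, value_net, states, actions, rewards, masks, epsilon=0.2, gamma=0.99):
    # Simplified TRPO-style update: the KL term is a penalty weighted by epsilon.
    # Expected shapes: states (T, obs_dim); actions (T, 1) int64; rewards, masks (T, 1) float.
    # gamma (discount factor) is an added assumption; masks are 0 at terminal steps.

    # Compute the discounted returns, working backwards through the trajectory
    returns = torch.zeros_like(rewards)
    running_return = torch.zeros_like(rewards[0])
    for t in reversed(range(rewards.size(0))):
        running_return = rewards[t] + gamma * masks[t] * running_return
        returns[t] = running_return

    # Compute the old action probabilities (a fixed reference, so no gradient)
    with torch.no_grad():
        old_probs = policy_net(states)
    old_action_probs = old_probs.gather(1, actions)

    # Compute the value function
    values = value_net(states)

    # Compute the advantages (detached so the policy loss does not train the critic)
    advantages = returns - values.detach()

    # Compute the new action probabilities
    new_probs = policy_net(states)
    new_action_probs = new_probs.gather(1, actions)

    # Compute the surrogate function.
    # The old probabilities are recomputed on every call, so the ratio equals 1 here;
    # to reuse a batch for several updates, compute them once and pass them in.
    ratio = new_action_probs / old_action_probs
    surrogate = ratio * advantages

    # Compute the KL divergence KL(old || new), summed over all actions
    kl_divergence = (old_probs * (torch.log(old_probs) - torch.log(new_probs))).sum(dim=1, keepdim=True)

    # Compute the loss (reduced to a scalar, as backward() requires)
    loss = (-surrogate + epsilon * kl_divergence).mean()

    # Update the policy network
    policy_net.optimizer.zero_grad()
    loss.backward()
    policy_net.optimizer.step()

    # Update the value network by regressing its estimates onto the returns
    value_loss = (values - returns).pow(2).mean()
    value_net.optimizer.zero_grad()
    value_loss.backward()
    value_net.optimizer.step()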
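
The KL divergence used in the update above compares two action distributions for the same state. As a small worked example, the plain-Python sketch below computes it over all actions; the `categorical_kl` helper and the two example distributions are hypothetical.

```python
import math

def categorical_kl(p, q):
    """KL(p || q) for two discrete distributions over the same set of actions."""
    # Terms with p_i == 0 contribute nothing, so they are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

old_probs = [0.25, 0.25, 0.25, 0.25]
new_probs = [0.4, 0.3, 0.2, 0.1]

kl_same = categorical_kl(old_probs, old_probs)  # identical policies: zero divergence
kl_diff = categorical_kl(old_probs, new_probs)  # different policies: positive divergence
```

Summing over every action matters: the term for a single chosen action can be negative, whereas the full sum is never negative.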

# Implementing the PPO Algorithm

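Next, we implement the PPO algorithm. PPO is similar to TRPO, but it replaces the original surrogate objective with a clipped one: the probability ratio is clipped to the interval `[1 - epsilon, 1 + epsilon]`, and the minimum of the clipped and unclipped terms is used. This implementation additionally keeps a KL penalty weighted by `beta`.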
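
To make the effect of clipping concrete, here is a minimal single-sample sketch in plain Python. The `clipped_surrogate` helper and the example ratios and advantages are illustrative assumptions, not part of the notebook's implementation.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: pushing the ratio beyond 1 + epsilon earns no extra objective
capped = clipped_surrogate(ratio=1.5, advantage=2.0)        # 1.2 * 2.0 = 2.4

# Negative advantage: the unclipped term is kept because it is the smaller one
pessimistic = clipped_surrogate(ratio=1.5, advantage=-2.0)  # min(-3.0, -2.4) = -3.0

# Ratio inside the clipping interval: the objective is unchanged
inside = clipped_surrogate(ratio=1.1, advantage=2.0)        # 1.1 * 2.0 = 2.2
```

Taking the minimum makes the objective a pessimistic bound, so the policy gains nothing from moving far away from the old policy in a single update.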


In [10]:
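# Note: This is a simplified version of the PPO algorithm and may need adjustments based on the specific requirements of your task.

def ppo_step(policy_net, value_net, states, actions, rewards, masks, epsilon=0.2, beta=3.0, gamma=0.99):
    # Expected shapes: states (T, obs_dim); actions (T, 1) int64; rewards, masks (T, 1) float.
    # gamma (discount factor) is an added assumption; masks are 0 at terminal steps.

    # Compute the discounted returns, working backwards through the trajectory
    returns = torch.zeros_like(rewards)
    running_return = torch.zeros_like(rewards[0])
    for t in reversed(range(rewards.size(0))):
        running_return = rewards[t] + gamma * masks[t] * running_return
        returns[t] = running_return

    # Compute the old action probabilities (a fixed reference, so no gradient)
    with torch.no_grad():
        old_probs = policy_net(states)
    old_action_probs = old_probs.gather(1, actions)

    # Compute the value function
    values = value_net(states)

    # Compute the advantages (detached so the policy loss does not train the critic)
    advantages = returns - values.detach()

    # Compute the new action probabilities
    new_probs = policy_net(states)
    new_action_probs = new_probs.gather(1, actions)

    # Compute the surrogate function.
    # The old probabilities are recomputed on every call, so the ratio equals 1 here;
    # to reuse a batch for several updates, compute them once and pass them in.
    ratio = new_action_probs / old_action_probs
    surrogate = ratio * advantages

    # Compute the KL divergence KL(old || new), summed over all actions
    kl_divergence = (old_probs * (torch.log(old_probs) - torch.log(new_probs))).sum(dim=1, keepdim=True)

    # Compute the clipped surrogate function
    clipped_surrogate = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages

    # Compute the loss (reduced to a scalar, as backward() requires)
    loss = (-torch.min(surrogate, clipped_surrogate) + beta * kl_divergence).mean()

    # Update the policy network
    policy_net.optimizer.zero_grad()
    loss.backward()
    policy_net.optimizer.step()

    # Update the value network by regressing its estimates onto the returns
    value_loss = (values - returns).pow(2).mean()
    value_net.optimizer.zero_grad()
    value_loss.backward()
    value_net.optimizer.step()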