# CA 4, Interactive Learning, Fall 2024
- **Name**: Majid Faridfar
- **Student ID**: 810199569

## Part 1: Analytical Questions

### Problem 1

Deep RL algorithms are either based on value learning or direct policy learning. Compare these two categories. In general, why is there a need for direct policy learning as long as value can be learned? Is there a third category that uses both approaches? If so, explain.

### Problem 2

One algorithm that uses the DQN idea is Dueling DQN. Research this algorithm and explain its main differences from DQN. It is important to explain how the advantage function works.

### Problem 3

Examine the A2C algorithm. How is the idea of advantage used in this algorithm? How is this method different from the Dueling DQN method?

## Part 2: Implementation Environment

### Problem 4

Explore the CartPole-v1 environment in the Gym library. Provide complete information about the Action Space, Observation Space, and Rewards of this environment.

Action space:
The action can take values {0, 1} indicating the direction of the fixed force the cart is pushed with.

- 0: Push cart to the left
- 1: Push cart to the right

The change in cart velocity produced by the applied force is not fixed; it depends on the angle of the pole, because the pole's center of gravity changes the amount of energy needed to move the cart underneath it.
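The dependence on the pole angle can be illustrated with a small sketch of the cart's acceleration. This is not taken from the report: the physical constants and the equations of motion below are assumptions based on the commonly cited Gym classic-control CartPole source, and `cart_acceleration` is a hypothetical helper name.

```python
import math

# Assumed constants (as commonly listed in Gym's CartPole source, not stated in this report).
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
HALF_POLE_LENGTH = 0.5
POLE_MASS_LENGTH = MASS_POLE * HALF_POLE_LENGTH
FORCE_MAG = 10.0  # magnitude of the fixed push


def cart_acceleration(action, theta, theta_dot=0.0):
    """Return the cart's linear acceleration for action 0 (left) or 1 (right)."""
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + POLE_MASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_POLE_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS)
    )
    return temp - POLE_MASS_LENGTH * theta_acc * cos_t / TOTAL_MASS


# The same push to the right yields a different acceleration at each pole angle.
for angle in (0.0, 0.1, 0.2):
    print(f"theta={angle:.1f} rad -> cart acceleration {cart_acceleration(1, angle):.3f}")
```

Under these assumptions the force has a fixed magnitude, but the resulting acceleration (and so the velocity change per step) varies with the pole angle.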

Observation space:
The observation is an array of four values corresponding to the following positions and velocities:

| Num | Observation | Min | Max |
|-----|-------------|-----|-----|
| 0 | Cart Position | -4.8 | 4.8 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 3 | Pole Angular Velocity | -Inf | Inf |

Note: While the ranges above denote the possible values of each element of the observation space, they do not reflect the values allowed in the state space of an unterminated episode. In particular:

- The cart x-position (index 0) can take values in (-4.8, 4.8), but the episode terminates if the cart leaves the (-2.4, 2.4) range.
- The pole angle (index 2) can be observed in (-0.418, 0.418) radians (or ±24°), but the episode terminates if the pole angle leaves the (-0.2095, 0.2095) range (or ±12°).
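The two termination conditions can be written as a small check on an observation. This is a minimal sketch using only the thresholds quoted in the note; `is_terminated` is a hypothetical helper, not a Gym function, and it ignores any episode time limit.

```python
import math

X_THRESHOLD = 2.4                      # cart position limit
THETA_THRESHOLD = 12 * math.pi / 180   # 12 degrees, about 0.2095 rad


def is_terminated(observation):
    """Return True if the cart position or the pole angle is outside its allowed range."""
    x, _x_dot, theta, _theta_dot = observation
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD


print(is_terminated([0.0, 0.0, 0.05, 0.0]))  # inside both ranges
print(is_terminated([2.5, 0.0, 0.0, 0.0]))   # cart outside (-2.4, 2.4)
print(is_terminated([0.0, 0.0, 0.3, 0.0]))   # pole beyond 12 degrees
```

An observation such as x = 3.0 is still inside the observation bounds of the table, yet it already belongs to a terminated episode.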

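Rewards:
The goal is to keep the pole upright for as long as possible, so a reward of +1 is given for every step taken, including the termination step. The reward threshold for CartPole-v1 is 475.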


### Problem 5

Why is it not good to use classical RL algorithms in this environment?

## Part 3: From Scratch

### Problem 6



import gym
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
from collections import deque

# Hyperparameters
GAMMA = 0.99
LEARNING_RATE = 1e-3
BATCH_SIZE = 64
TARGET_UPDATE_FREQ = 1000
MEMORY_SIZE = 10000
EPSILON_START = 1.0
EPSILON_END = 0.01
EPSILON_DECAY = 500

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        samples = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*samples)
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(dones))
    
    def __len__(self):
        return len(self.buffer)

def epsilon_greedy_policy(state, epsilon, q_network, action_dim):
    if random.random() < epsilon:
        return random.randint(0, action_dim - 1)
    else:
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            q_values = q_network(state_tensor)
        return q_values.argmax().item()

def train():
    env = gym.make('CartPole-v0')
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    q_network = DQN(state_dim, action_dim)
    target_network = DQN(state_dim, action_dim)
    target_network.load_state_dict(q_network.state_dict())
    target_network.eval()

    optimizer = optim.Adam(q_network.parameters(), lr=LEARNING_RATE)
    memory = ReplayBuffer(MEMORY_SIZE)
    
    epsilon = EPSILON_START
    step_count = 0
    
    for episode in range(300):
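        # Classic Gym API (< 0.26): reset() returns only the observation.
        state = env.reset()
        total_reward = 0

        for t in range(200):
            action = epsilon_greedy_policy(state, epsilon, q_network, action_dim)
            # Classic Gym API (< 0.26): step() returns (obs, reward, done, info).
            next_state, reward, done, _ = env.step(action)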
            memory.push(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward

            if len(memory) >= BATCH_SIZE:
                states, actions, rewards, next_states, dones = memory.sample(BATCH_SIZE)
                states = torch.FloatTensor(states)
                actions = torch.LongTensor(actions).unsqueeze(1)
                rewards = torch.FloatTensor(rewards)
                next_states = torch.FloatTensor(next_states)
                dones = torch.FloatTensor(dones)

                # Q(s, a) for the actions actually taken; squeeze only the action dimension.
                q_values = q_network(states).gather(1, actions).squeeze(1)
                next_q_values = target_network(next_states).max(1)[0]
                target_q_values = rewards + GAMMA * next_q_values * (1 - dones)

                loss = nn.MSELoss()(q_values, target_q_values.detach())
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                
            if step_count % TARGET_UPDATE_FREQ == 0:
                target_network.load_state_dict(q_network.state_dict())
            
            step_count += 1
            if done:
                break
        
        epsilon = max(EPSILON_END, EPSILON_START - episode / EPSILON_DECAY)
        print(f"Episode {episode}, Total Reward: {total_reward}")
    
    env.close()

if __name__ == "__main__":
    train()
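Two scalar formulas drive the training loop above: the one-step TD target and the linear epsilon schedule. The sketch below restates both in plain Python for a single transition, reusing the hyperparameter values from the listing. `td_target` and `linear_epsilon` are hypothetical helper names introduced only for illustration.

```python
GAMMA = 0.99
EPSILON_START = 1.0
EPSILON_END = 0.01
EPSILON_DECAY = 500


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step target r + gamma * max_a' Q_target(s', a'); no bootstrap when the episode ended."""
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward + bootstrap


def linear_epsilon(episode):
    """Epsilon set at the end of the given episode under the linear schedule."""
    return max(EPSILON_END, EPSILON_START - episode / EPSILON_DECAY)


print(td_target(1.0, [2.0, 3.0], done=False))  # bootstraps from the larger next Q-value
print(td_target(1.0, [2.0, 3.0], done=True))   # terminal transition: target is the reward only
print(linear_epsilon(299))                     # epsilon after the last of the 300 episodes
```

With `EPSILON_DECAY = 500` and only 300 episodes, this schedule lowers epsilon to about 0.40 by the end of training, so the agent is still exploring on roughly two out of five steps in the final episodes.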


### Problem 7

### Problem 8

## Part 4: Using Library

### Problem 9

### Problem 10