<a href="https://colab.research.google.com/github/kimsooyoung/rl_oc_python/blob/main/oc_lec2_reinforce/REINFORCE_basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install Requirements

In [None]:
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!pip install swig
!pip install renderlab
!pip install gymnasium
!pip install gymnasium[box2d]

### Import the Necessary Packages

In [None]:
import gymnasium as gym
import collections
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

### `torch.distributions.categorical.Categorical` example

In [None]:
logits = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
c_result = Categorical(logits=logits)
c_result.probs

tensor([0.0117, 0.0317, 0.0861, 0.2341, 0.6364])

This is exactly the same as applying softmax to the logits by hand:

In [None]:
torch.exp(logits) / torch.sum(torch.exp(logits))

tensor([0.0117, 0.0317, 0.0861, 0.2341, 0.6364])
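Besides `probs`, the training code later needs to sample an action and take the log of its probability. The following is a minimal sketch of both steps, assuming the same five logits as above. The names `dist`, `log_p`, and `manual_log_p` and the fixed seed are illustrative choices only.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

logits = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
dist = Categorical(logits=logits)

# Draw one action index; the result is a 0-dim integer tensor.
action = dist.sample()

# log_prob returns log pi(a) for the sampled index, which is the
# quantity a policy gradient loss multiplies by the return.
log_p = dist.log_prob(action)

# The same value computed by hand from the softmax probabilities.
manual_log_p = torch.log(torch.softmax(logits, dim=0)[action])

print(action.item(), log_p.item(), manual_log_p.item())
```

The two log-probabilities should agree up to floating-point error, so `dist.log_prob(action)` can stand in for `torch.log(prob[action])` wherever that is more convenient.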

## Render Test

In [None]:
import renderlab as rl

env = gym.make("CartPole-v1", render_mode = "rgb_array")
env = rl.RenderFrame(env, "./output")

observation, info = env.reset()
score = 0

while True:
  action = env.action_space.sample()
  observation, reward, terminated, truncated, info = env.step(action)
  score += reward

  if terminated or truncated:
    print("Score : ", score)
    break

env.play()

Score :  22.0
Moviepy - Building video temp-{start}.mp4.
Moviepy - Writing video temp-{start}.mp4

Moviepy - Done !
Moviepy - video ready temp-{start}.mp4




## Initialize Hyper Params

In [None]:
learning_rate = 0.0002
gamma         = 0.98

## Define Policy Class

- input: a length-4 observation tensor
- layer structure: `Linear(4, 128)` -> ReLU -> `Linear(128, 2)` -> softmax
- **[Caution]** the final layer must be a softmax, because the action probabilities of a policy must sum to 1


- `put_data` method: appends one transition to `self.data`. Only the **reward and the probability of the chosen action** are stored for each step.
- `train_net` method: optimizes the network with the policy gradient loss

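$\quad \quad \nabla_\theta J = G \, \nabla_\theta \log \pi(s, a)$

Here $G$ is the discounted return from the current step onward. Treating $G$ as a constant, an objective with this gradient is

$\quad \quad J = G \, \log \pi(s, a)$


> However, we want to maximize $J$, while the optimizer minimizes its loss. Hence the loss $L$ is the objective with a negative sign.

$\quad \quad L = - G \, \log \pi(s, a)$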
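The return $G$ in the formulas above is accumulated backwards over an episode, using $G_t = r_t + \gamma G_{t+1}$. Below is a minimal sketch of that recursion in plain Python; the helper name `discounted_returns` and the three-step reward list are hypothetical and serve only as an illustration.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


rewards = [1.0, 1.0, 1.0]  # hypothetical 3-step episode
print(discounted_returns(rewards, gamma=0.98))
```

With $\gamma = 0.98$ the returns are approximately 2.9404, 1.98, and 1.0: earlier steps receive larger returns because more reward still lies ahead of them.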

In [None]:
class Policy(nn.Module):
  def __init__(self):
    super(Policy, self).__init__()
    self.data = []

    self.fc1 = nn.Linear(4, 128)
    self.fc2 = nn.Linear(128, 2)
    self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)

  def forward(self, obs):
    x1 = F.relu(self.fc1(obs))
    x2 = F.softmax(self.fc2(x1), dim=-1)  # normalize over the action dimension
    return x2

  def put_data(self, transition):
    self.data.append(transition)

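  def train_net(self):
    R = 0  # discounted return, accumulated from the last step backwards
    self.optimizer.zero_grad()

    for r, prob in self.data[::-1]:
      R = r + gamma * R              # G_t = r_t + gamma * G_{t+1}
      loss = - R * torch.log(prob)   # negated objective, so minimizing it maximizes J
      loss.backward()                # gradients accumulate over all steps

    self.optimizer.step()
    self.data = []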

## Main loop

- create the environment and the policy, and reset the score
- for each training episode
  - reset the environment
  - for each step of the episode
    - get the action probabilities from the policy and sample an action
    - step the environment
    - put the transition into the dataset
    - update the state and the score
  - train the policy after the episode ends

In [None]:
env = gym.make("CartPole-v1")
policy = Policy()

score = 0.0
print_interval = 200

for epi in range(2600):
  s, _ = env.reset()
  done = False

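  while not done:
    prob = policy( torch.from_numpy(s).float() )
    action = Categorical(prob).sample()
    # `done` receives only the `terminated` flag; `truncated` is ignored, so an
    # episode can run past the 500-step time limit of CartPole-v1.
    sp, r, done, truncated, info = env.step(action.item())
    policy.put_data( (r, prob[action]) )

    s = sp
    score += r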

  policy.train_net()

  if (epi % print_interval == 0) and (epi != 0):
    print(f"epi: {epi} / avg_score: {score / print_interval}")
    score = 0.0

env.close()

epi: 200 / avg_score: 27.53
epi: 400 / avg_score: 35.47
epi: 600 / avg_score: 49.87
epi: 800 / avg_score: 63.59
epi: 1000 / avg_score: 77.36
epi: 1200 / avg_score: 146.89
epi: 1400 / avg_score: 254.225
epi: 1600 / avg_score: 362.245
epi: 1800 / avg_score: 530.89
epi: 2000 / avg_score: 882.45
epi: 2200 / avg_score: 1511.955
epi: 2400 / avg_score: 2171.51


## Test Result with Rendered Animation

[test video](https://github.com/kimsooyoung/rl_oc_python/assets/12381733/9cd16567-d910-4ee8-ad8f-f4f1f1fa0f94)

In [None]:
import renderlab as rl

env = gym.make("CartPole-v1", render_mode = "rgb_array")
env = rl.RenderFrame(env, "./output")
s, info = env.reset()

while True:
  with torch.no_grad():  # inference only, no gradients are needed
    prob = policy( torch.from_numpy(s).float() )
  action = Categorical(prob).sample()
  sp, r, terminated, truncated, info = env.step(action.item())

  s = sp

  if terminated or truncated:
    break

env.play()

Moviepy - Building video temp-{start}.mp4.
Moviepy - Writing video temp-{start}.mp4


Moviepy - Done !
Moviepy - video ready temp-{start}.mp4


