# PFRL Quickstart Guide

This is a quickstart guide for users who just want to try PFRL for the first time.

If you have not yet installed PFRL, run the command below to install it:
```
pip install pfrl
```

If you have already installed PFRL, let's begin!

First, you need to import the necessary modules. The module name of PFRL is `pfrl`. Let's import `torch`, `gymnasium`, and `numpy` as well, since they are used later.

In [1]:
import pfrl
import torch
import torch.nn
import gymnasium
import numpy

PFRL can be applied to any problem that is modeled as an "environment". [Gymnasium](https://github.com/Farama-Foundation/Gymnasium) provides various benchmark environments and defines a common interface among them. PFRL uses a subset of that interface. Specifically, an environment must define its observation space and action space and have at least two methods: `reset` and `step`.

- `env.reset` will reset the environment to the initial state and return the initial observation together with an info dictionary.
- `env.step` will execute a given action, move to the next state and return five values:
  - a next observation
  - a scalar reward
  - a boolean value indicating whether the current state is terminal or not
  - a boolean value indicating whether the episode has been truncated or not
  - additional information
- `env.render` will render the current state. (optional)
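
To make the interface above concrete, here is a minimal sketch of a toy environment written with the standard library only. The class `CountdownEnv` and its reward rule are hypothetical and exist purely for illustration. A real environment would also define `observation_space` and `action_space`, which are left out here to keep the sketch free of dependencies.

```python
import random


class CountdownEnv:
    """Toy environment (hypothetical, for illustration only).

    The observation is the number of steps left. Action 1 earns a reward
    of 1.0 and action 0 earns 0.0. The episode terminates when the counter
    reaches 0.
    """

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.remaining = horizon

    def reset(self):
        # Return the initial observation and an (empty) info dictionary.
        self.remaining = self.horizon
        return self.remaining, {}

    def step(self, action):
        self.remaining -= 1
        reward = 1.0 if action == 1 else 0.0
        terminated = self.remaining == 0  # reached a terminal state
        truncated = False                 # this toy env has no time limit
        return self.remaining, reward, terminated, truncated, {}


demo_env = CountdownEnv(horizon=3)
demo_obs, demo_info = demo_env.reset()
total_reward = 0.0
while True:
    demo_action = random.choice([0, 1])  # random policy
    demo_obs, reward, terminated, truncated, demo_info = demo_env.step(demo_action)
    total_reward += reward
    if terminated or truncated:
        break
print('final observation:', demo_obs, 'total reward:', total_reward)
```

The loop at the bottom shows the usual interaction pattern: reset once, then call `step` until either flag ends the episode.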

Let's try `CartPole-v0`, which is a classic control problem. You can see below that its observation space consists of four real numbers while its action space consists of two discrete actions.

In [2]:
env = gymnasium.make('CartPole-v0')
print('observation space:', env.observation_space)
print('action space:', env.action_space)

# reset returns the initial observation and an info dictionary
obs, info = env.reset()
print('initial observation:', obs)

action = env.action_space.sample()
obs, r, terminated, truncated, info = env.step(action)
print('next observation:', obs)
print('reward:', r)
print('terminated:', terminated)
print('truncated:', truncated)
print('info:', info)

# Uncomment to open a GUI window rendering the current state of the environment
# env.render()

observation space: Box(4,)
action space: Discrete(2)
initial observation: [ 0.03923832  0.00510645 -0.03804804  0.00186333]
next observation: [ 0.03934045  0.20075283 -0.03801078 -0.30257726]
reward: 1.0
done: False
info: {}


Now you have defined your environment. Next, you need to define an agent, which will learn through interactions with the environment.

PFRL provides various agents, each of which implements a deep reinforcement learning algorithm.

Let's try the DoubleDQN algorithm (https://arxiv.org/abs/1509.06461), which is implemented by `pfrl.agents.DoubleDQN`. This algorithm trains a Q-function that receives an observation and returns the expected future return for each action the agent can take. In PFRL, you can define your Q-function as a `torch.nn.Module` as below. Note that the outputs are wrapped by `pfrl.action_value.DiscreteActionValue`. By wrapping the outputs of Q-functions, PFRL can support not only discrete-action Q-functions like this but also continuous-action Q-functions (via [Normalized Advantage Functions](https://arxiv.org/abs/1603.00748)) in the same way.
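
As background, the following sketch illustrates the idea behind the Double DQN training target from the paper linked above: the online network selects the next action and the target network evaluates it. This is a plain-Python illustration with made-up numbers, not PFRL's internal implementation. The function name `double_dqn_target` is hypothetical.

```python
def double_dqn_target(reward, terminated, q_online_next, q_target_next, gamma):
    """Compute the Double DQN target for a single transition.

    q_online_next and q_target_next are lists of Q-values for the next
    state (one entry per action) from the online and target networks.
    """
    if terminated:
        # No future return after a terminal state.
        return reward
    # Select the next action with the online network...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ...but evaluate that action with the target network.
    return reward + gamma * q_target_next[best_action]


# Made-up numbers for illustration.
y = double_dqn_target(
    reward=1.0,
    terminated=False,
    q_online_next=[2.0, 3.0],
    q_target_next=[2.5, 1.5],
    gamma=0.9,
)
print('target:', y)
```

In this example the online network prefers action 1, so the target uses the target network's value for action 1 (1.5), whereas standard DQN would take the maximum over the target network's values (2.5).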

In [3]:
class QFunction(torch.nn.Module):

    def __init__(self, obs_size, n_actions):
        super().__init__()
        self.l1 = torch.nn.Linear(obs_size, 50)
        self.l2 = torch.nn.Linear(50, 50)
        self.l3 = torch.nn.Linear(50, n_actions)

    def forward(self, x):
        h = x
        h = torch.nn.functional.relu(self.l1(h))
        h = torch.nn.functional.relu(self.l2(h))
        h = self.l3(h)
        return pfrl.action_value.DiscreteActionValue(h)

obs_size = env.observation_space.low.size
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)

It is also possible to define the same model using `torch.nn.Sequential`. `pfrl.q_functions.DiscreteActionValueHead` is just a `torch.nn.Module` that wraps its input in a `pfrl.action_value.DiscreteActionValue`.

In [4]:
q_func2 = torch.nn.Sequential(
    torch.nn.Linear(obs_size, 50),
    torch.nn.ReLU(),
    torch.nn.Linear(50, 50),
    torch.nn.ReLU(),
    torch.nn.Linear(50, n_actions),
    pfrl.q_functions.DiscreteActionValueHead(),
)

As usual in PyTorch, `torch.optim.Optimizer` is used to optimize a model.

In [5]:
# Use Adam to optimize q_func. eps=1e-2 is for stability.
optimizer = torch.optim.Adam(q_func.parameters(), eps=1e-2)

To create a DoubleDQN agent with this Q-function and optimizer, you need to specify a few more parameters and configurations.

In [6]:
# Set the discount factor that discounts future rewards.
gamma = 0.9

# Use epsilon-greedy for exploration
explorer = pfrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

# DQN uses Experience Replay.
# Specify a replay buffer and its capacity.
replay_buffer = pfrl.replay_buffers.ReplayBuffer(capacity=10 ** 6)

# Observations from CartPole-v0 are numpy.float64, while
# PyTorch only accepts numpy.float32 by default, so specify
# a converter as a feature extractor function phi.
phi = lambda x: x.astype(numpy.float32, copy=False)

# Set the device id to use GPU. To use CPU only, set it to -1.
gpu = -1

# Now create an agent that will interact with the environment.
agent = pfrl.agents.DoubleDQN(
    q_func,
    optimizer,
    replay_buffer,
    gamma,
    explorer,
    replay_start_size=500,
    update_interval=1,
    target_update_interval=100,
    phi=phi,
    gpu=gpu,
)
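
To illustrate what the epsilon-greedy exploration configured above does, here is a minimal standard-library sketch. It shows the general technique only and is not the implementation of `pfrl.explorers.ConstantEpsilonGreedy`. The function name `epsilon_greedy` and the Q-values are made up.

```python
import random


def epsilon_greedy(q_values, epsilon, random_action_func, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return random_action_func()
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
demo_q_values = [0.2, 0.8]  # made-up Q-values for two actions
chosen = [
    epsilon_greedy(demo_q_values, 0.3, lambda: rng.randrange(2), rng)
    for _ in range(10000)
]
greedy_fraction = chosen.count(1) / len(chosen)
print('fraction of greedy actions:', greedy_fraction)
```

With `epsilon=0.3` and two actions, the greedy action should be chosen in roughly 85% of the steps: 70% from the greedy branch plus half of the 30% random picks.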

Now you have an agent and an environment. It's time to start reinforcement learning!

During training, two methods of `agent` must be called: `agent.act` and `agent.observe`. `agent.act(obs)` takes the current observation as input and returns an exploratory action. Once the returned action is processed in the env, `agent.observe(obs, reward, done, reset)` then observes the consequences:
- `obs`: next observation.
- `reward`: an immediate reward.
- `done`: a boolean value set to True if it reached a terminal state.
- `reset`: a boolean value set to True if an episode is interrupted at a non-terminal state, typically by a time limit.
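
When the environment's `step` returns separate `terminated` and `truncated` flags, one possible way to derive the `done` and `reset` arguments is sketched below. The helper name `observe_flags` is hypothetical and not part of PFRL. It assumes that a truncation reported by the environment and a manual step limit should both count as a reset.

```python
def observe_flags(terminated, truncated, t, max_episode_len):
    """Map one step's outcome to the (done, reset) pair for agent.observe.

    done  -> the environment reached a terminal state
    reset -> the episode is cut off, either by the environment itself
             (truncated) or by our own step limit
    """
    done = bool(terminated)
    reset = bool(truncated) or t >= max_episode_len
    return done, reset


# Ordinary step in the middle of an episode: keep going.
print(observe_flags(False, False, t=10, max_episode_len=200))
# The pole fell: terminal state.
print(observe_flags(True, False, t=37, max_episode_len=200))
# Step limit reached without failure: reset without a terminal state.
print(observe_flags(False, False, t=200, max_episode_len=200))
```

An episode loop would then stop as soon as either returned flag is true.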

Optionally, you can get training statistics of the agent via `agent.get_statistics`.

In [7]:
n_episodes = 300
max_episode_len = 200
for i in range(1, n_episodes + 1):
    # reset returns the observation and an info dictionary
    obs, _ = env.reset()
    R = 0  # return (sum of rewards)
    t = 0  # time step
    while True:
        # Uncomment to watch the behavior in a GUI window
        # env.render()
        action = agent.act(obs)
        obs, reward, terminated, _, _ = env.step(action)
        R += reward
        t += 1
        reset = t == max_episode_len
        agent.observe(obs, reward, terminated, reset)
        if terminated or reset:
            break
    if i % 10 == 0:
        print('episode:', i, 'R:', R)
    if i % 50 == 0:
        print('statistics:', agent.get_statistics())
print('Finished.')

episode: 10 R: 12.0
episode: 20 R: 10.0
episode: 30 R: 9.0
episode: 40 R: 12.0
episode: 50 R: 10.0
statistics: [('average_q', 0.86276), ('average_loss', 0.18341776728630066), ('cumulative_steps', 565), ('n_updates', 66), ('rlen', 565)]
episode: 60 R: 10.0
episode: 70 R: 14.0
episode: 80 R: 16.0
episode: 90 R: 10.0
episode: 100 R: 10.0
statistics: [('average_q', 5.4785624), ('average_loss', 0.23100754196755588), ('cumulative_steps', 1300), ('n_updates', 801), ('rlen', 1300)]
episode: 110 R: 16.0
episode: 120 R: 34.0
episode: 130 R: 20.0
episode: 140 R: 20.0
episode: 150 R: 38.0
statistics: [('average_q', 8.537258), ('average_loss', 0.29845759100979197), ('cumulative_steps', 2633), ('n_updates', 2134), ('rlen', 2633)]
episode: 160 R: 65.0
episode: 170 R: 144.0
episode: 180 R: 200.0
episode: 190 R: 200.0
episode: 200 R: 200.0
statistics: [('average_q', 10.152343), ('average_loss', 0.10948933256324381), ('cumulative_steps', 9775), ('n_updates', 9276), ('rlen', 9775)]
episode: 210 R: 200.0
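
One possible reading of the `average_q` values in the statistics above, which the guide itself does not state: with `gamma = 0.9` and a reward of 1.0 per step, the discounted return of a long episode approaches 1 / (1 - 0.9) = 10. The sketch below checks this with plain Python. The helper `discounted_return` is hypothetical.

```python
demo_gamma = 0.9  # same value as the discount factor used above


def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per elapsed step."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# CartPole gives a reward of 1.0 on every step of an episode.
for n_steps in (10, 50, 200):
    print(n_steps, discounted_return([1.0] * n_steps, demo_gamma))

# Limit of the geometric series 1 + gamma + gamma**2 + ...
print('limit:', 1.0 / (1.0 - demo_gamma))
```

The logged `average_q` values of roughly 10 late in training are in line with this limit.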


You have now finished training the DoubleDQN agent for 300 episodes. How good is the agent now? You can evaluate it by using `with agent.eval_mode()`. In this mode, exploration such as epsilon-greedy is no longer used.

In [8]:
with agent.eval_mode():
    for i in range(10):
        # reset returns the observation and an info dictionary
        obs, _ = env.reset()
        R = 0
        t = 0
        while True:
            # Uncomment to watch the behavior in a GUI window
            # env.render()
            action = agent.act(obs)
            obs, r, terminated, _, _ = env.step(action)
            R += r
            t += 1
            reset = t == 200
            agent.observe(obs, r, terminated, reset)
            if terminated or reset:
                break
        print('evaluation episode:', i, 'R:', R)

evaluation episode: 0 R: 200.0
evaluation episode: 1 R: 200.0
evaluation episode: 2 R: 171.0
evaluation episode: 3 R: 173.0
evaluation episode: 4 R: 200.0
evaluation episode: 5 R: 200.0
evaluation episode: 6 R: 200.0
evaluation episode: 7 R: 198.0
evaluation episode: 8 R: 200.0
evaluation episode: 9 R: 200.0


For your information, `CartPole-v0`'s maximum achievable return is 200. If the agent could not achieve 200, it was unlucky! You can train the agent longer by running the training loop again.
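
The ten evaluation returns printed above can be summarized with a few lines of standard-library Python. This is only a convenience sketch for reading the results and is not part of PFRL.

```python
import statistics

# Returns of the ten evaluation episodes printed above.
eval_returns = [200.0, 200.0, 171.0, 173.0, 200.0,
                200.0, 200.0, 198.0, 200.0, 200.0]

print('mean:', statistics.mean(eval_returns))
print('stdev:', statistics.stdev(eval_returns))
print('min:', min(eval_returns), 'max:', max(eval_returns))
print('episodes at the maximum of 200:', eval_returns.count(200.0))
```

In this run, seven of the ten episodes reach the maximum return and the mean is 194.2.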

If the results are good enough, the only remaining task is to save the agent so that you can reuse it. Simply call `agent.save` to save the agent, and later `agent.load` to load the saved agent.

In [9]:
# Save an agent to the 'agent' directory
agent.save('agent')

# Uncomment to load an agent from the 'agent' directory
# agent.load('agent')

RL completed!

But writing code like this every time you use RL might be tedious. So, PFRL has utility functions that do these things.

In [10]:
# Set up the logger to print info messages for understandability.
import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=2000,                 # Train the agent for 2000 steps
    eval_n_steps=None,          # We evaluate for episodes, not time
    eval_n_episodes=10,         # 10 episodes are sampled for each evaluation
    train_max_episode_len=200,  # Maximum length of each episode
    eval_interval=1000,         # Evaluate the agent after every 1000 steps
    outdir='result',            # Save everything to 'result' directory
)

outdir:result step:180 episode:0 R:180.0
statistics:[('average_q', 9.803836), ('average_loss', 0.05234420951426728), ('cumulative_steps', 26570), ('n_updates', 26071), ('rlen', 26570)]
outdir:result step:344 episode:1 R:164.0
statistics:[('average_q', 9.851653), ('average_loss', 0.07100017969845794), ('cumulative_steps', 26734), ('n_updates', 26235), ('rlen', 26734)]
outdir:result step:474 episode:2 R:130.0
statistics:[('average_q', 9.863388), ('average_loss', 0.05790386739885434), ('cumulative_steps', 26864), ('n_updates', 26365), ('rlen', 26864)]
outdir:result step:584 episode:3 R:110.0
statistics:[('average_q', 9.890334), ('average_loss', 0.05864107925037388), ('cumulative_steps', 26974), ('n_updates', 26475), ('rlen', 26974)]
outdir:result step:712 episode:4 R:128.0
statistics:[('average_q', 9.924887), ('average_loss', 0.07561466434504836), ('cumulative_steps', 27102), ('n_updates', 26603), ('rlen', 27102)]
outdir:result step:878 episode:5 R:166.0
statistics:[('average_q', 9.868885

That concludes the PFRL quickstart guide. To learn more about PFRL, look into the `examples` directory, then read and run the examples. Thank you!