# Reinforcement Learning - Policy Gradient

<img src="https://raw.githubusercontent.com/jeremiedecock/polytechnique-inf581-2023/master/logo.jpg" style="float: left; width: 15%" />

[INF581-2023](https://moodle.polytechnique.fr/course/view.php?id=14259) Lab session #6

2019-2023 Jérémie Decock, Mohamed Alami

[![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jeremiedecock/polytechnique-inf581-2023/blob/master/lab6_rl3_reinforce.ipynb)

[![My Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/jeremiedecock/polytechnique-inf581-2023/master?filepath=lab6_rl3_reinforce.ipynb)

[![NbViewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/jeremiedecock/polytechnique-inf581-2023/blob/master/lab6_rl3_reinforce.ipynb)

[![Local](https://img.shields.io/badge/Local-Save%20As...-blue)](https://github.com/jeremiedecock/polytechnique-inf581-2023/raw/master/lab6_rl3_reinforce.ipynb)

## Setup the Python environment

**Notice**: this notebook requires the following libraries: *PyTorch*, *Gymnasium*, NumPy, Pandas, Seaborn, imageio, pygame, ipywidgets, ipython and tqdm.

You can install them with the following command (the next cells do this for you if you use the Google Colab environment):

```
pip install gymnasium imageio ipython ipywidgets nnfigs numpy pandas pygame seaborn torch tqdm
```

In [None]:
%matplotlib inline

import subprocess

try:
    from inf581 import *
except ModuleNotFoundError:
    process = subprocess.Popen("pip install nnfigs inf581".split(), stdout=subprocess.PIPE)
    for line in process.stdout:
        print(line.decode().strip())
    from inf581 import *

from inf581.lab7 import *

import matplotlib.pyplot as plt
import gymnasium as gym
import numpy as np
import seaborn as sns

from IPython.display import Image   # To display GIF images in the notebook

$
\newcommand{\vs}[1]{\mathbf{#1}} % vector symbol (\boldsymbol, \textbf or \vec)
\newcommand{\ms}[1]{\mathbf{#1}} % matrix symbol (\boldsymbol, \textbf)
\def\U{V}
\def\action{\vs{a}}       % action
\def\A{\mathcal{A}}        % TODO
\def\actionset{\mathcal{A}} %%%
\def\discount{\gamma}  % discount factor
\def\state{\vs{s}}         % state
\def\S{\mathcal{S}}         % TODO
\def\stateset{\mathcal{S}}  %%%
%
\def\E{\mathbb{E}}
%\newcommand{transition}{T(s,a,s')}
%\newcommand{transitionfunc}{\mathcal{T}^a_{ss'}}
\newcommand{transitionfunc}{P}
\newcommand{transitionfuncinst}{P(\nextstate|\state,\action)}
\newcommand{transitionfuncpi}{\mathcal{T}^{\pi_i(s)}_{ss'}}
\newcommand{rewardfunc}{r}
\newcommand{rewardfuncinst}{r(\state,\action,\nextstate)}
\newcommand{rewardfuncpi}{r(s,\pi_i(s),s')}
\newcommand{statespace}{\mathcal{S}}
\newcommand{statespaceterm}{\mathcal{S}^F}
\newcommand{statespacefull}{\mathcal{S^+}}
\newcommand{actionspace}{\mathcal{A}}
\newcommand{reward}{R}
\newcommand{statet}{S}
\newcommand{actiont}{A}
\newcommand{newstatet}{S'}
\newcommand{nextstate}{\state'}
\newcommand{newactiont}{A'}
\newcommand{stepsize}{\alpha}
\newcommand{discount}{\gamma}
\newcommand{qtable}{Q}
\newcommand{finalstate}{\state_F}
%
\newcommand{\vs}[1]{\boldsymbol{#1}} % vector symbol (\boldsymbol, \textbf or \vec)
\newcommand{\ms}[1]{\boldsymbol{#1}} % matrix symbol (\boldsymbol, \textbf)
\def\vit{Value Iteration}
\def\pit{Policy Iteration}
\def\discount{\gamma}  % discount factor
\def\state{\vs{s}}         % state
\def\S{\mathcal{S}}         % TODO
\def\stateset{\mathcal{S}}  %%%
\def\cstateset{\mathcal{X}} %%%
\def\x{\vs{x}}                    % TODO cstate
\def\cstate{\vs{x}}               %%%
\def\policy{\pi}
\def\piparam{\vs{\theta}}         % TODO pparam
\def\action{\vs{a}}       % action
\def\A{\mathcal{A}}        % TODO
\def\actionset{\mathcal{A}} %%%
\def\caction{\vs{u}}       % action
\def\cactionset{\mathcal{U}} %%%
\def\decision{\vs{d}}       % decision
\def\randvar{\vs{\omega}}       %%%
\def\randset{\Omega}       %%%
\def\transition{T}       %%%
\def\immediatereward{r}    %%%
\def\strategichorizon{s}    %%% % TODO
\def\tacticalhorizon{k}    %%%  % TODO
\def\operationalhorizon{h}    %%%
\def\constalpha{a}    %%%
\def\U{V}              % utility function
\def\valuefunc{V}
\def\X{\mathcal{X}}
\def\meu{Maximum Expected Utility}
\def\finaltime{T}
\def\timeindex{t}
\def\iterationindex{i}
\def\decisionfunc{d}       % action
\def\mdp{\text{MDP}}
$

## Introduction

In a previous lab, we have dealt with reinforcement learning in discrete state and action spaces.
To do so, we used methods based on action-value function and, especially, $Q$-function estimation.
The $Q$-function was stored in a table and updated with on- or off- policy algorithms (namely SARSA and $Q$-Learning). 

Yet, these methods do not scale to large state spaces and especially not to the case of continuous state spaces.
To address these issues one can either extend value-based methods making use of value-function approximation or directly search in policy spaces.
In this lab, we will explore both solutions. 

The first part of this lab presents the problem to solve: the CartPole environment. 

In the second part, we will search in a family of parameterized policies $\pi_\theta(s, a)$ using a policy gradient method.

In the third part of this lab, we will apply value-function approximation methods (namely DQN) to solve the CartPole problem.

# Part 1: Hands on Cart Pole environment

For the purpose of focusing on the algorithms, we will use standard environments provided by the Gymnasium suite.
Gymnasium provides controllable environments (https://gymnasium.farama.org/environments/classic_control/) for research in reinforcement learning.
In particular, we will try to solve the CartPole-v1 environment (cf. https://gymnasium.farama.org/environments/classic_control/cart_pole/), which offers a continuous state space and a discrete action space.
The Cart Pole task consists of keeping a pole in a vertical position by moving the cart to which the pole is attached by a joint.
No friction is considered.
In this lab, the task is considered solved if the pole stays upright (within 15 degrees) for 195 steps on average over 100 episodes while the cart position stays within reasonable bounds.
The state is given by $\{x,\frac{\partial x}{\partial t},\omega,\frac{\partial \omega}{\partial t}\}$ where $x$ is the position of the cart and $\omega$ is the angle between the pole and vertical position.
There are only two possible actions: $a \in \{0, 1\}$ where $a = 0$ means "push the cart to the LEFT" and $a = 1$ means "push the cart to the RIGHT".
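
The success criterion above can be written as a small check. This is a minimal sketch, not part of the lab code: the helper name `is_solved` and its parameters are hypothetical, and it relies on the fact that CartPole gives a reward of +1 per step, so an episode's return equals its number of steps.

```python
def is_solved(episode_returns, threshold=195.0, window=100):
    """Return True if the mean of the last `window` returns reaches `threshold`."""
    if len(episode_returns) < window:
        return False  # not enough episodes to evaluate the criterion
    recent = episode_returns[-window:]
    return sum(recent) / window >= threshold


# Hypothetical returns: 100 episodes lasting 200 steps each
print(is_solved([200.0] * 100))
# Hypothetical returns: 50 episodes of 200 steps, then 50 episodes of 150 steps
print(is_solved([200.0] * 50 + [150.0] * 50))
```

The second call averages 175 steps per episode, which is below the threshold used in this lab.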

## Exercise 1: Hands on Cart Pole

**Task 1:** read https://gymnasium.farama.org/environments/classic_control/cart_pole/ to discover the CartPole environment.

**Notice:** A reminder of Gymnasium main concepts is available at https://gymnasium.farama.org/content/basic_usage/.

Print some information about the environment:

In [None]:
env = gym.make('CartPole-v1')
print("State space dimension is:", env.observation_space.shape[0])
print("State upper bounds:", env.observation_space.high)
print("State lower bounds:", env.observation_space.low)
print("Actions are: {" + ", ".join([str(a) for a in range(env.action_space.n)]) + "}")
env.close()

**Task 2:** Run the following cells and try different basic policies (for instance constant actions or randomly drawn actions) to discover the CartPole environment.
Although this environment has simple dynamics that can be computed analytically, we will solve this problem with policy gradient based reinforcement learning.

### Test the CartPole environment with a constant policy

In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [None]:
env = gym.make('CartPole-v1', render_mode='rgb_array')
RenderWrapper.register(env, force_gif=True)

observation, info = env.reset()
done = False

for t in range(50):
    env.render_wrapper.render()

    if not done:
        print(observation)
    else:
        print("x", end="")

    # TODO


print()
env.close()

env.render_wrapper.make_gif("rl3_ex1left")

In [None]:
env = gym.make('CartPole-v1', render_mode='rgb_array')
RenderWrapper.register(env, force_gif=True)

observation, info = env.reset()
done = False

for t in range(50):
    env.render_wrapper.render()

    if not done:
        print(observation)
    else:
        print("x", end="")

    # TODO

print()
env.close()

env.render_wrapper.make_gif("rl3_ex1right")

### Test the CartPole environment with a random policy

In [None]:
env = gym.make('CartPole-v1', render_mode='rgb_array')
RenderWrapper.register(env, force_gif=True)

for episode_index in range(5):
    observation, info = env.reset()
    done = False

    for t in range(70):
        env.render_wrapper.render()

        if not done:
            print(observation)
        else:
            print("x", end="")
        
        # TODO

    print()

env.close()

env.render_wrapper.make_gif("rl3_ex1random")

# Part 2: Implement REINFORCE

We will solve the CartPole environment using a policy gradient method which directly searches in a family of parameterized policies $\pi_\theta(s, a)$ for the optimal policy.
This method performs gradient ascent in the policy space so that the total return is maximized.
We will restrict our work to episodic tasks, *i.e.* tasks that have a starting state and last for a finite and fixed number of steps $H$, called the horizon. 

More formally, we define an optimization criterion that we want to maximize:

$$J(\theta) = \E_{\pi_\theta}\left[\sum_{t=1}^H r(s_t,a_t)\right],$$

where $\E_{\pi_\theta}$ means $a \sim \pi_\theta(s,.)$ and $H$ is the horizon of the episode.
In other words, we want to maximize the value of the starting state: $V^{\pi_\theta}(s)$.
The policy gradient theorem tells us that:

$$
\nabla_\theta J(\theta) = \nabla_\theta V^{\pi_\theta}(s) = \E_{\pi_\theta} \left[\nabla_\theta \log \pi_\theta (s,a) ~ Q^{\pi_\theta}(s,a) \right],
$$
with

$$Q^\pi(s,a) = \E^\pi \left[\sum_{t=1}^H r(s_t,a_t)|s=s_1, a=a_1\right].$$
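
To see where the $\nabla_\theta \log \pi_\theta$ term comes from, consider the simplified one-step case ($H = 1$) with a fixed state $s$ and a finite action set. This is only a sketch of the key step (the "log-derivative trick") under these assumptions, not a full proof of the theorem:

$$
\begin{aligned}
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta(s,\cdot)}\left[r(s,a)\right]
&= \sum_a \nabla_\theta \pi_\theta(s,a) ~ r(s,a) \\
&= \sum_a \pi_\theta(s,a) ~ \nabla_\theta \log \pi_\theta(s,a) ~ r(s,a) \\
&= \mathbb{E}_{a \sim \pi_\theta(s,\cdot)}\left[\nabla_\theta \log \pi_\theta(s,a) ~ r(s,a)\right],
\end{aligned}
$$

using $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$. In this one-step case the reward plays the role of $Q^{\pi_\theta}(s,a)$, and the last expectation can be estimated from sampled actions alone.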

The policy gradient theorem is extremely powerful because it says that one doesn't need to know the dynamics of the system to compute the gradient, provided one can compute the $Q$-function of the current policy.
Applying the policy and observing the one-step transitions is enough.
Using a stochastic gradient ascent and replacing $Q^{\pi_\theta}(s_t,a_t)$ by a Monte Carlo estimate $R_t = \sum_{t'=t}^H r(s_{t'},a_{t'})$ over one single trajectory, we end up with a special case of the REINFORCE algorithm (see Algorithm below).
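
The Monte Carlo estimate $R_t$ can be computed for all time steps of an episode in a single backward pass. The following is a minimal sketch in plain Python; the function name `rewards_to_go` is hypothetical and the rewards are made-up values.

```python
def rewards_to_go(rewards):
    """Compute R_t = r_t + r_{t+1} + ... + r_H for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running_sum = 0.0
    # Walk backwards so that each R_t reuses R_{t+1}
    for t in reversed(range(len(rewards))):
        running_sum += rewards[t]
        returns[t] = running_sum
    return returns


# Hypothetical episode of 4 steps with a reward of 1 per step
print(rewards_to_go([1.0, 1.0, 1.0, 1.0]))  # [4.0, 3.0, 2.0, 1.0]
```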

---
REINFORCE with Policy Gradient theorem Algorithm
---

Initialize $\theta^0$ as random<br>
Initialize step-size $\alpha_0$<br>
$n \leftarrow 0$<br>
<b>WHILE</b> no convergence<br>
	$\quad$ Generate rollout $h_n \leftarrow \{s_1^n,a_1^n,r_1^n, \ldots, s_H^n, a_H^n, r_H^n\} \sim \pi_{\theta^n}$<br>
	$\quad$ $PG_\theta \leftarrow 0$<br>
	$\quad$ <b>FOR</b> $t=1$ to $H$<br>
		$\quad\quad$ $R_t \leftarrow \sum_{t'=t}^H r_{t'}^n$<br>
		$\quad\quad$ $PG_\theta \leftarrow PG_\theta + \nabla_\theta \log \pi_{\theta^{n}}(s_t,a_t) ~ R_t$<br>
	$\quad$ $n \leftarrow n + 1$ <br>
	$\quad$ $\theta^n \leftarrow \theta^{n-1} + \alpha_n PG_\theta$<br>
	$\quad$  update $\alpha_n$ (if step-size scheduling)<br>

<b>RETURN</b> $\theta^n$ 

**Notice**: by replacing the $Q$-function by a Monte-Carlo estimate, we get rid of the Markov assumption and this algorithm is expected to work even in non-Markovian systems. 
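
To make the algorithm above concrete without solving the CartPole exercises, here is a minimal sketch on a made-up toy task. Everything in it is an assumption chosen for illustration: the state is a constant feature equal to 1 (so $\theta$ is a scalar), each RIGHT action gives a reward of 1, and the first LEFT action ends the episode with a reward of 0. The step size is kept constant.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_episode(theta, horizon, rng):
    """Toy task: +1 for each RIGHT (1); the first LEFT (0) ends the episode with reward 0."""
    actions, rewards = [], []
    for _ in range(horizon):
        a = 1 if rng.random() < sigmoid(theta) else 0
        actions.append(a)
        rewards.append(1.0 if a == 1 else 0.0)
        if a == 0:
            break
    return actions, rewards


def reinforce(num_episodes=2000, horizon=5, alpha=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.0  # pi(RIGHT) = 0.5 at the start
    for _ in range(num_episodes):
        actions, rewards = run_episode(theta, horizon, rng)
        p_right = sigmoid(theta)
        pg, reward_to_go = 0.0, 0.0
        # Backward pass: accumulate R_t and the policy gradient estimate
        for a, r in zip(reversed(actions), reversed(rewards)):
            reward_to_go += r
            # Gradient of log pi for a sigmoid policy with constant feature s = 1
            grad_log_pi = (1.0 - p_right) if a == 1 else -p_right
            pg += grad_log_pi * reward_to_go
        theta += alpha * pg  # gradient ascent step
    return theta


theta_final = reinforce()
print("P(RIGHT) after training:", sigmoid(theta_final))
```

In this toy task a LEFT action always has $R_t = 0$, so every update pushes $\theta$ upwards and the probability of choosing RIGHT approaches 1. CartPole does not have this property, which is why its training curve is much noisier.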

## Exercise 2: Implement a sigmoid policy

As the number of actions is $2$ (push the cart to the left or push it to the right), one can see the problem of controlling the Cart Pole as a binary classification problem.
Binary classification can be easily solved thanks to logistic regression which transforms the classification problem into a regression problem using the sigmoid function.

**Task 1**: Implement the `sigmoid` function defined by:

$$\sigma(x) = \frac{1}{1+e^{-x}}.$$

In [None]:
def sigmoid(x):
    return ... # TODO

**Task 2**: Complete the `logistic_regression` function that implements logistic regression. This function returns the probability of drawing action 1 ("push right") given the parameter vector $\theta$ and the input vector $s$ (the 4-dimensional state vector).

In [None]:
plot_logistic_regression_fig()

In [None]:
def logistic_regression(s, theta):
    prob_push_right = ... # TODO

    return prob_push_right

**Task 3**: Complete the `draw_action` function that draws an action according to the current policy, i.e. that selects the *RIGHT* action with probability $\sigma(\theta^\top s)$, where $\theta$ is the parameter vector and $s$ is the 4-dimensional state vector.

In [None]:
def draw_action(s, theta):
    prob_push_right = logistic_regression(s, theta)

    return ... # TODO

## Exercise 3: Compute $\nabla_{\theta} \log \pi_\theta (s,a)$

Verify that, for a sigmoid policy:
- $\nabla_\theta \log \pi_\theta (s,\text{RIGHT}) = \pi_\theta (s, \text{LEFT}) \times s$
- $\nabla_\theta \log \pi_\theta (s,\text{LEFT}) = - \pi_\theta (s, \text{RIGHT}) \times s$ 
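
Before (or after) deriving these identities by hand, they can be checked numerically with finite differences. The sketch below uses plain Python lists; the helper names and the values of `theta` and `s` are arbitrary choices made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_pi(theta, s, action):
    """Log-probability of `action` (1 = RIGHT, 0 = LEFT) under the sigmoid policy."""
    p_right = sigmoid(sum(th * x for th, x in zip(theta, s)))
    return math.log(p_right if action == 1 else 1.0 - p_right)


def analytic_grad(theta, s, action):
    """Closed-form gradient given by the two identities above."""
    p_right = sigmoid(sum(th * x for th, x in zip(theta, s)))
    coeff = (1.0 - p_right) if action == 1 else -p_right
    return [coeff * x for x in s]


def numeric_grad(theta, s, action, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((log_pi(plus, s, action) - log_pi(minus, s, action)) / (2 * eps))
    return grad


theta = [0.3, -0.2, 0.5, 0.1]   # arbitrary parameters
s = [0.02, -0.4, 0.03, 0.6]     # arbitrary 4-dimensional state
max_error = max(
    abs(a - n)
    for action in (0, 1)
    for a, n in zip(analytic_grad(theta, s, action), numeric_grad(theta, s, action))
)
print("Largest gap between analytic and numeric gradients:", max_error)
```

A gap close to machine precision suggests the closed-form expressions are right; it does not replace the derivation asked for in this exercise.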

## Exercise 4: Implement REINFORCE

Fill the following cells to implement the REINFORCE algorithm (defined in the introduction of this notebook).

In [None]:
ENV_NAME = "CartPole-v1"

# Since the goal is to attain an average return of 195, horizon should be larger than 195 steps (say 300 for instance)
EPISODE_DURATION = 300

ALPHA_INIT = 0.1
SCORE = 195.0
NUM_EPISODES = 100
LEFT = 0
RIGHT = 1

VERBOSE = True

**Task 1**: Implement the `play_one_episode` function that plays an episode with the given policy $\pi_\theta$ (for fixed horizon $H$) and returns its rollout (the visited states, the actions taken and the rewards received).

In [None]:
# Generate an episode
def play_one_episode(env, theta, max_episode_length=EPISODE_DURATION, render=False):
    s_t, info = env.reset()

    episode_states = []
    episode_actions = []
    episode_rewards = []
    episode_states.append(s_t)

    for t in range(max_episode_length):

        if render:
            env.render_wrapper.render()

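        a_t = ... # TODO
        # Gymnasium's env.step() returns 5 values: the episode ends when it is terminated or truncated
        s_t, r_t, terminated, truncated, info = ... # TODO
        done = terminated or truncated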

        episode_states.append(s_t)
        episode_actions.append(a_t)
        episode_rewards.append(r_t)

        if done:
            break

    return episode_states, episode_actions, episode_rewards

**Task 2**: Implement the `score_on_multiple_episodes` function that tests the given policy $\pi_\theta$ on `num_episodes` episodes (for fixed horizon $H$) and returns:
- `success`: `True` if the agent got an average reward greater than or equal to 195 over 100 consecutive trials, `False` otherwise
- `num_success`: the number of episodes in which the agent got a total reward greater than or equal to 195
- `average_return`: the average total reward over the `num_episodes` episodes

In [None]:
def score_on_multiple_episodes(env, theta, score=SCORE, num_episodes=NUM_EPISODES, max_episode_length=EPISODE_DURATION, render=False):
    
    # TODO

    return success, num_success, average_return

**Task 3**: Implement the `compute_policy_gradient` function that returns the policy gradient for a given episode: policy gradient = $\sum_{t=1}^H \nabla_\theta \log \pi_\theta(s_t,a_t) R_t$.

In [None]:
# Returns Policy Gradient for a given episode
def compute_policy_gradient(episode_states, episode_actions, episode_rewards, theta):

    # TODO

    return PG

**Task 4**: Implement the `train` function that updates the parameters $\theta$ with gradient ascent until the agent gets an average reward greater than or equal to 195 over 100 consecutive trials.

In [None]:
# Train the agent until it gets an average reward greater than or equal to 195 over 100 consecutive trials
def train(env, theta_init, max_episode_length = EPISODE_DURATION, alpha_init = ALPHA_INIT):

    theta = theta_init
    episode_index = 0
    average_returns = []

    success, _, R = score_on_multiple_episodes(env, theta)
    average_returns.append(R)

    # Train until success
    while (not success):

        # TODO

    return theta, episode_index, average_returns

### Train the agent

In [None]:
#np.random.seed(1234)
env = gym.make(ENV_NAME, render_mode='rgb_array')
RenderWrapper.register(env, force_gif=True)
#env.seed(1234)

In [None]:
dim = env.observation_space.shape[0]

# Init parameters to random
theta_init = np.random.randn(1, dim)

# Train the agent
theta, i, average_returns = train(env, theta_init)

print("Solved after {} iterations".format(i))

### Test final policy

In [None]:
score_on_multiple_episodes(env, theta, num_episodes=10, render=True)
env.render_wrapper.make_gif("rl3_ex4")

### Display the evolution of the average reward w.r.t. PG iterations

In [None]:
# Show training curve
plt.plot(range(len(average_returns)),average_returns)
plt.title("Average reward on 100 episodes")
plt.xlabel("Training Steps")
plt.ylabel("Reward")

plt.show()

env.close()

# Part 3: Implement DQN

In a previous lab, we have dealt with reinforcement learning in discrete state and action spaces. To do so, we used methods based on action-value function and, especially, Q-function estimation. The Q-function was stored in a table and updated with on- or off- policy algorithms (namely SARSA and Q-Learning).

Yet, these methods do not scale to large state spaces. The main idea behind Q-learning is that if we had a function
$Q^*: State \times Action \rightarrow \mathbb{R}$, that could tell us what our return would be, if we were to take an action in a given state, then we could easily construct a policy that maximizes our rewards:

\begin{align}\pi^*(s) = \arg\!\max_a \ Q^*(s, a)\end{align}
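
For a discrete action set, this greedy policy is just an argmax over the Q-values of the current state. A minimal sketch, with a hypothetical helper name and made-up Q-values:

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for one state and two actions (0 = LEFT, 1 = RIGHT)
print(greedy_action([0.7, 1.3]))  # 1
```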

However, we don't know everything about the world, so we don't have access to $Q^*$. But, since neural networks are universal function approximators, we can create one and train it to resemble $Q^*$. A deep neural network takes a state as input and estimates the Q-value of each possible action in that state, which lets it generalize to state-action pairs that were never encountered before. 

We recall the $Q$-learning update rule, on which DQN is based:
$$Q(S_t,A_t)\gets Q(S_t,A_t)+\alpha\left[R_{t+1}+\gamma\max_aQ(S_{t+1},a)-Q(S_t,A_t)\right]$$
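
As a reminder of how this rule is applied in the tabular case, here is a minimal sketch of a single update. The function name, the dictionary-based table, the step size and the `done` flag (which disables bootstrapping from a terminal state) are assumptions made for this example; the discount factor matches the `GAMMA` value used later in this notebook.

```python
def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.95, done=False):
    """Apply one update of the rule above to a dict-of-lists Q-table, in place."""
    # No bootstrapping from a terminal state
    best_next = 0.0 if done else max(q_table[s_next])
    td_error = r + gamma * best_next - q_table[s][a]
    q_table[s][a] += alpha * td_error
    return td_error


# Hypothetical table with two states and two actions
q_table = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}
q_learning_update(q_table, "s0", a=1, r=1.0, s_next="s1")
print(q_table["s0"])  # Q(s0, 1) moves from 0 towards 1 + 0.95 * 2 = 2.9
```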


---
DQN Algorithm
---

Initialize $\mathcal{A}$ the agent neural network<br>
Copy $\mathcal{A}$ weights to initialize the target neural network $\mathcal{T}$<br>
Initialize batch_size B<br>
<b>FOR</b> i in nb_episodes:<br>
    $\quad$ <b>WHILE</b> not done:<br>
    $\quad$ $\quad$ Generate experience $exp\gets[\{s_1^1,a^1,r^1,s_2^1,done^1,\dots, s_1^n,a^n,r^n,s_2^n,done^n\}]$<br>
    $\quad$ $\quad$ <b>IF</b> len(exp)>B:<br>
        $\quad$ $\quad$ $\quad$ sample minibatch $\sim$ exp<br>
        $\quad$ $\quad$ $\quad$ $q_1 \gets \mathcal{A}(s_1)^B$<br>
        $\quad$ $\quad$ $\quad$ $q_2 \gets \mathcal{T}(s_2)^B$<br>
        $\quad$ $\quad$ $\quad$ q_target = reward_batch + $\gamma * \max_a q_2$ * (1-done_batch)<br>
        $\quad$ $\quad$ $\quad$ Update $\mathcal{A}$<br>
    $\quad$ $\quad$ Every j iterations, copy the weights of $\mathcal{A}$ into $\mathcal{T}$: $\mathcal{T} \gets \mathcal{A}$
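
The target computation in the algorithm above can be illustrated without any neural network. The sketch below assumes that the target network outputs one Q-value per action for each next state and that the target uses the largest of them; the function name and the minibatch values are made up.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.95):
    """q_target = r + gamma * max_a Q_target(s', a) * (1 - done), for each transition."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        # A terminal transition has no future value to bootstrap from
        bootstrap = 0.0 if done else max(q_next)
        targets.append(r + gamma * bootstrap)
    return targets


# Hypothetical minibatch of two transitions; the second one is terminal
targets = td_targets([1.0, 1.0], [[0.5, 2.0], [3.0, 4.0]], [False, True])
print(targets)
```

For the terminal transition the target reduces to the immediate reward, which is what the `(1-done_batch)` factor achieves in the batched version.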

## Exercise 5: Implement DQN

Fill the following cells to implement the DQN algorithm.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from collections import deque
import copy
from tqdm.notebook import tqdm
import random

In [None]:
# Here we define some hyperparameters and the agent's architecture

env = gym.make('CartPole-v1', render_mode='rgb_array')
observation_space = env.observation_space.shape[0]
action_space = env.action_space.n

EPISODES = 150
LR = 0.0001
MEM_SIZE = 10000
BATCH_SIZE = 64
GAMMA = 0.95
EXPLORATION_MAX = 1.0
EXPLORATION_DECAY = 0.999
EXPLORATION_MIN = 0.001
sync_freq = 10

class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.input_shape = env.observation_space.shape
        self.action_space = action_space

        self.fc1 = nn.Linear(*self.input_shape, 1024)
        self.fc2 = nn.Linear(1024, 512)
        self.fc3 = nn.Linear(512, self.action_space)

        self.optimizer = optim.Adam(self.parameters(), lr=LR)
        self.loss = nn.MSELoss()
        #self.to(DEVICE)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x

In [None]:
class ReplayBuffer:
    def __init__(self):
        self.memory = deque(maxlen=MEM_SIZE)
    
    def add(self, experience):
        self.memory.append(experience)
    
    def sample(self):
        minibatch = random.sample(self.memory, BATCH_SIZE)

        state1_batch = torch.stack([s1 for (s1,a,r,s2,d) in minibatch])
        action_batch = torch.tensor([a for (s1,a,r,s2,d) in minibatch])
        reward_batch = torch.tensor([r for (s1,a,r,s2,d) in minibatch])
        state2_batch = torch.stack([s2 for (s1,a,r,s2,d) in minibatch])
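        # Stored as floats so that (1 - done_batch) can be used as a mask in the target computation
        done_batch = torch.tensor([d for (s1,a,r,s2,d) in minibatch], dtype=torch.float32)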

        return (state1_batch, action_batch, reward_batch, state2_batch, done_batch)

In [None]:
class DQN:
    def __init__(self):
        self.replay = ReplayBuffer()
        self.exploration_rate = EXPLORATION_MAX
        self.network = Network()
        self.network2 = copy.deepcopy(self.network)  # Target network, initialized as a copy of the agent network
        self.network2.load_state_dict(self.network.state_dict())


    def choose_action(self, observation):
        if random.random() < self.exploration_rate:
            return env.action_space.sample()

        # Convert observation to PyTorch Tensor (works for both NumPy arrays and tensors)
        state = torch.as_tensor(observation, dtype=torch.float32).detach()
        #state = state.to(DEVICE)
        state = state.unsqueeze(0)
            
        ### BEGIN TODO ###

        # Get Q(s,.)
        q_values = ... # TODO

        # Choose the action to play
        action = ... # TODO

        ### END TODO ###

        return action


    def learn(self):
        if len(self.replay.memory)< BATCH_SIZE:
            return

        ### BEGIN TODO ###

        # Sample minibatch s1, a1, r1, s1', done_1, ... , sn, an, rn, sn', done_n
        state1_batch, action_batch, reward_batch, state2_batch, done_batch = ... # TODO

        # Compute Q values (call self.network and apply the squeeze method on the result)
        q_values = ... # TODO

        with torch.no_grad():
            # Compute next Q values (call the target network self.network2 and apply the squeeze method on the result)
            next_q_values = ... # TODO

        batch_indices = np.arange(BATCH_SIZE, dtype=np.int64)

        predicted_value_of_now = ... # TODO
        predicted_value_of_future = ... # TODO

        # Compute the q_target
        q_target = ... # TODO

        # Compute the loss (cf. self.network.loss())
        loss = ... # TODO

        ### END TODO ###

        # Compute the gradient of the loss and update the network weights
        self.network.optimizer.zero_grad()
        loss.backward()
        self.network.optimizer.step()

        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)


    def returning_epsilon(self):
        return self.exploration_rate
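
The exploration schedule implemented in `learn` is a geometric decay with a floor. The sketch below reuses the hyperparameter values defined earlier to show how quickly $\epsilon$ shrinks; the helper name `epsilon_after` is hypothetical, and it counts only the calls to `learn` that actually perform an update (calls made before the replay buffer holds `BATCH_SIZE` transitions return early and do not decay $\epsilon$).

```python
import math

EXPLORATION_MAX = 1.0
EXPLORATION_DECAY = 0.999
EXPLORATION_MIN = 0.001


def epsilon_after(num_updates):
    """Exploration rate after `num_updates` learning updates."""
    return max(EXPLORATION_MIN, EXPLORATION_MAX * EXPLORATION_DECAY ** num_updates)


# Smallest number of updates after which the floor EXPLORATION_MIN is reached
updates_to_floor = math.ceil(
    math.log(EXPLORATION_MIN / EXPLORATION_MAX) / math.log(EXPLORATION_DECAY)
)

print("epsilon after 1000 updates:", epsilon_after(1000))
print("updates needed to reach the floor:", updates_to_floor)
```

With these values the agent still explores in roughly a third of its decisions after 1000 updates, and the floor is only reached after several thousand updates.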

In [None]:
agent = DQN()

env = gym.make('CartPole-v1', render_mode='rgb_array')
best_reward = 0
average_reward = 0
episode_number = []
average_reward_number = []

j=0
for i in tqdm(range(1, EPISODES)):
    state, info = env.reset()
    state = np.reshape(state, [1, observation_space])
    score = 0

    while True:
        j+=1

        action = agent.choose_action(state)
        state_, reward, done, truncated, info = env.step(action)
        state_ = np.reshape(state_, [1, observation_space])
        state = torch.tensor(state).float()
        state_ = torch.tensor(state_).float()

        exp = (state, action, reward, state_, done)
        agent.replay.add(exp)
        agent.learn()

        state = state_
        score += reward

        if j % sync_freq == 0:
            agent.network2.load_state_dict(agent.network.state_dict())

        if done or truncated:  # the episode also ends when the time limit is reached
            if score > best_reward:
                best_reward = score
            average_reward += score 
            if i%10==0:
                print("Episode {} Average Reward {} Best Reward {} Last Reward {} Epsilon {}".format(i, average_reward/i, best_reward, score, agent.returning_epsilon()))
                #test_model(agent,10, observation_space)
            break
  
    # Record the running average once per episode
    episode_number.append(i)
    average_reward_number.append(average_reward/i)

plt.plot(episode_number, average_reward_number)
plt.show()

In [None]:
def test_model(model, num_episodes, observation_space):
    # Note: `choose_action` is still epsilon-greedy here, so some actions may be random

    for i_episode in range(num_episodes):
        G = 0
        observation, info = env.reset()
        done = False
        while not done:
            observation = np.reshape(observation, [1, observation_space])
            action = model.choose_action(observation)
            observation, reward, terminated, truncated, info = env.step(action)
            # The episode ends when the pole falls (terminated) or the time limit is hit (truncated)
            done = terminated or truncated
            G += reward
        print(G)

In [None]:
env = gym.make('CartPole-v1', render_mode='rgb_array')
RenderWrapper.register(env, force_gif=True)

observation, info = env.reset()
observation = np.reshape(observation, [1, observation_space])
done = False

for t in range(50):
    env.render_wrapper.render()

    if not done:
        print(observation)
    else:
        print("x", end="")

    action = agent.choose_action(observation)
    observation, reward, terminated, truncated, info = env.step(action)
    observation = np.reshape(observation, [1, observation_space])
    done = terminated or truncated


print()
env.close()

env.render_wrapper.make_gif("rl3_ex5")

In [None]:
torch.save(agent.network.state_dict(), "rl3_dqn_cartpole.zip")