# Deep Deterministic Policy Gradients (DDPG)
---
In this notebook, we train DDPG on the BipedalWalker-v3 environment from Gymnasium (the maintained successor to OpenAI Gym).

### 1. Import the Necessary Packages

In [1]:
# render ai gym environment
# import gym
import gymnasium as gym  # new version of gym

import random
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

from ddpg_agent import Agent

### 2. Instantiate the Environment and Agent

In [2]:
# https://gymnasium.farama.org/environments/box2d/bipedal_walker/
# env = gym.make('BipedalWalker-v2')
env = gym.make('BipedalWalker-v3', render_mode="rgb_array")
# env.seed(10)
agent = Agent(state_size=env.observation_space.shape[0], action_size=env.action_space.shape[0], random_seed=10)

In [3]:
# load checkpoint if desired.
agent.actor_local.load_state_dict(torch.load('checkpoint_actor.pth'))
agent.critic_local.load_state_dict(torch.load('checkpoint_critic.pth'))


<All keys matched successfully>

### 3. Train the Agent with DDPG

Run the code cell below to train the agent from scratch.  Alternatively, you can skip to the next code cell to load the pre-trained weights from file.

In [None]:
def ddpg(n_episodes=2000, max_t=700, print_every=100):
    scores_deque = deque(maxlen=100)
    scores = []
    max_score = -np.inf                                           # np.Inf was removed in NumPy 2.0
    for i_episode in range(1, n_episodes+1):
        # state = env.reset()
        state, _ = env.reset()                                    # new gymnasium
        agent.reset()
        score = 0
        action = np.array([0., 0., 0., 0.])
        for t in range(max_t):
            prev_action = action
            action = agent.act(state)
            # next_state, reward, done, _ = env.step(action)
            # new gymnasium: step() returns separate terminated / truncated flags
            next_state, reward, terminated, truncated, _ = env.step(action)
            # only a true terminal state should stop bootstrapping in the critic target
            agent.step(state, action, reward, next_state, terminated)
            # the agent is stuck in a "not done" state: its actions have stopped changing
            # and they are not effecting any change on the environment state.
            # this must be checked before `state` is overwritten; otherwise
            # state - next_state is always zero and the state test is meaningless
            stuck = (np.max(np.abs(prev_action - action)) < 0.0001
                     and np.max(np.abs(state - next_state)) < 0.0001)
            state = next_state
            score += reward
            if terminated or truncated or stuck:
                break
            
        scores_deque.append(score)
        scores.append(score)
        print('\rEpisode {}\tAverage Score: {:.2f}\tScore: {:.2f}\t\t Memory Size {}                  '.format(i_episode, np.mean(scores_deque), score, len(agent.memory.memory)), end="")
        if i_episode % print_every == 0:
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
            print('\rEpisode {}\tAverage Score: {:.2f}\t\t Memory Size {}                 '.format(i_episode, np.mean(scores_deque), len(agent.memory.memory)))   
    return scores

scores = ddpg()

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

Episode 100	Average Score: -37.68	 Memory Size 66478
Episode 200	Average Score: -56.31	 Memory Size 108468
Episode 300	Average Score: -57.00	 Memory Size 116864
Episode 400	Average Score: -21.83	 Memory Size 134118
Episode 500	Average Score: -11.53	 Memory Size 151002
Episode 600	Average Score: -34.70	 Memory Size 185723
Episode 700	Average Score: -17.17	 Memory Size 198774
Episode 800	Average Score: -26.63	 Memory Size 200000
Episode 900	Average Score: -57.27	 Memory Size 200000
Episode 1000	Average Score: -77.28	 Memory Size 200000
Episode 1100	Average Score: -25.54	 Memory Size 200000
Episode 1200	Average Score: -52.80	 Memory S

In [None]:
Episode 100	Average Score: -100.43	Score: -99.22
Episode 200	Average Score: -100.50	Score: -97.22
Episode 300	Average Score: -90.31	Score: -97.31
Episode 400	Average Score: -94.41	Score: -97.18
Episode 500	Average Score: -93.65	Score: -96.80
Episode 600	Average Score: -94.38	Score: -96.97
Episode 700	Average Score: -97.19	Score: -97.86
Episode 800	Average Score: -93.05	Score: -97.32
Episode 900	Average Score: -98.05	Score: -101.53
Episode 1000	Average Score: -95.61	Score: -100.68
Episode 1100	Average Score: -76.18	Score: -97.06
Episode 1200	Average Score: -60.73	Score: -37.78
Episode 1300	Average Score: -39.58	Score: -41.27
Episode 1314	Average Score: -40.41	Score: -31.60
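
The training loop above ends an episode early when the agent appears stuck, and its comment suggests that the comparison should use a tolerance. A minimal sketch of that check as a standalone helper follows. The name `is_stuck` and the `atol` parameter are hypothetical and are not part of `ddpg_agent`; the default tolerance simply mirrors the `0.0001` threshold used in the loop.

```python
import numpy as np

def is_stuck(prev_action, action, state, next_state, atol=1e-4):
    """Return True when neither the action nor the observation changed
    by more than `atol` (element-wise, absolute tolerance only)."""
    # rtol=0.0 makes np.allclose a pure absolute-difference test
    action_unchanged = np.allclose(prev_action, action, rtol=0.0, atol=atol)
    state_unchanged = np.allclose(state, next_state, rtol=0.0, atol=atol)
    return bool(action_unchanged and state_unchanged)

# Illustrative vectors only: 4 action values and 24 observation values
a = np.zeros(4)
s = np.zeros(24)
print(is_stuck(a, a + 1e-6, s, s + 1e-6))  # tiny changes: treated as stuck
print(is_stuck(a, a + 0.5, s, s))          # the action moved: not stuck
```

The helper has to be called with the observation from before the step and the one returned by `env.step`, so the call belongs before `state` is reassigned to `next_state`.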

### 4. Watch a Smart Agent!

In the next code cell, you will load the trained weights from file to watch a smart agent!

In [None]:
# should already be installed in my docker file
# !python -m pip install pyvirtualdisplay
from pyvirtualdisplay import Display
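# named virtual_display so it is not shadowed by the IPython `display` module imported below
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()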

is_ipython = 'inline' in plt.get_backend()
if is_ipython:
    from IPython import display

plt.ion()

In [None]:
agent.actor_local.load_state_dict(torch.load('checkpoint_actor.pth'))
agent.critic_local.load_state_dict(torch.load('checkpoint_critic.pth'))

# state = env.reset()
state, _ = env.reset()

# img = plt.imshow(env.render(mode='rgb_array'))
img = plt.imshow(env.render())

agent.reset()   
while True:
    action = agent.act(state)
    #
    # save to output
    #
    # env.render()
    img.set_data(env.render()) 
    plt.axis('off')
    display.display(plt.gcf())
    display.clear_output(wait=True)

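    # new gymnasium: step() returns five values, with separate terminated / truncated flags
    next_state, reward, terminated, truncated, _ = env.step(action)
    state = next_state
    if terminated or truncated:
        break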
        
env.close()

### 5. Explore

In this exercise, we have provided a sample DDPG agent and demonstrated how to use it to solve an OpenAI Gym environment.  To continue your learning, you are encouraged to complete any (or all!) of the following tasks:
- Amend the various hyperparameters and network architecture to see if you can get your agent to solve the environment faster than this benchmark implementation.  Once you build intuition for the hyperparameters that work well with this environment, try solving a different OpenAI Gym task!
- Write your own DDPG implementation.  Use this code as reference only when needed -- try as much as you can to write your own algorithm from scratch.
- You may also like to implement prioritized experience replay, to see if it speeds learning.  
- The current implementation adds Ornstein-Uhlenbeck noise to the action space.  However, it has [been shown](https://blog.openai.com/better-exploration-with-parameter-noise/) that adding noise to the parameters of the neural network policy can improve performance.  Make this change to the code, to verify it for yourself!
- Write a blog post explaining the intuition behind the DDPG algorithm and demonstrating how to use it to solve an RL environment of your choosing.  
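
As a starting point for the exploration-noise task above, here is a minimal NumPy sketch of an Ornstein-Uhlenbeck process. It is not the code from `ddpg_agent.py`: the class name `OUNoise`, its constructor arguments, and the default values `theta=0.15` and `sigma=0.2` are assumptions chosen for illustration, and the update uses a unit time step with Gaussian increments.

```python
import numpy as np

class OUNoise:
    """Discretised Ornstein-Uhlenbeck process (unit time step):
    x <- x + theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, size, seed=0, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)   # long-run mean the noise reverts to
        self.theta = theta             # strength of the pull toward mu
        self.sigma = sigma             # scale of the random increments
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (e.g. at the start of an episode)."""
        self.state = self.mu.copy()

    def sample(self):
        """Advance the process one step and return a copy of the new state."""
        dx = (self.theta * (self.mu - self.state)
              + self.sigma * self.rng.standard_normal(self.state.shape))
        self.state = self.state + dx
        return self.state.copy()

# Usage: perturb a placeholder 4-dimensional action, then clip to [-1, 1]
noise = OUNoise(size=4, seed=10)
samples = np.array([noise.sample() for _ in range(5)])
noisy_action = np.clip(np.zeros(4) + samples[-1], -1.0, 1.0)
print(samples.shape, noisy_action)
```

Because each sample depends on the previous one, consecutive perturbations are correlated in time, which is the usual motivation for choosing this process over independent Gaussian noise in continuous-control tasks. Parameter-space noise would instead perturb the actor's weights before acting.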