# Balance task in dm_control


The goal of this notebook is to train a neural network on the dm_control cartpole balance task. Because the action input of the cartpole environment is continuous, a simple DQN agent will not work, so I first use a DDPG model with an actor and a critic.

## About DDPG

DDPG (Deep Deterministic Policy Gradient) is used when dealing with continuous action spaces, while DQN (Deep Q-Network) is used for discrete action spaces. In problems with continuous action spaces, the number of possible actions is not finite, and actions can take any value within a certain range. DQN, which is based on estimating the Q-values for each action, is not well-suited for continuous action spaces because it would require evaluating the Q-value for an infinite number of actions. DDPG, on the other hand, directly learns a policy that maps states to actions, making it more suitable for continuous action spaces.
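
To make the contrast concrete, here is a minimal sketch in plain Python. The Q-values are made up and `num_discrete_actions` is a hypothetical helper, not part of this notebook. It shows the greedy step DQN performs over a finite action list, and how fast the number of actions grows if a continuous range is discretized instead.

```python
# Greedy action selection as DQN does it: one Q-value per discrete action.
q_values = [0.1, 0.7, 0.3]  # illustrative values, not from a trained model
greedy_action = max(range(len(q_values)), key=lambda a: q_values[a])


def num_discrete_actions(bins_per_dim, action_dims):
    """Number of Q-values needed after discretizing every action dimension."""
    return bins_per_dim ** action_dims


print(greedy_action)                # index of the largest Q-value
print(num_discrete_actions(11, 1))  # one action dimension, as in cartpole
print(num_discrete_actions(11, 6))  # a hypothetical 6-dimensional action space
```

With a single action dimension a coarse grid is still manageable, but the count grows exponentially with the number of dimensions, which is one reason to learn a policy that outputs the action directly.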

DDPG is an off-policy, model-free algorithm that combines ideas from DQN and policy gradient methods. It uses two neural networks: an actor network and a critic network.

**Actor**: The actor network learns a deterministic policy that maps states to actions. Given a state, the actor network directly outputs the best action to take in that state, according to the current policy. The actor network's goal is to maximize the expected return (the sum of rewards) following the policy.

**Critic**: The critic network learns the Q-value function, which estimates the expected return for taking an action in a state following the current policy. The critic's goal is to learn an accurate estimate of the Q-values for state-action pairs. It is used to update the policy (actor network) by providing feedback on the quality of the chosen actions.

The DDPG algorithm works as follows:

1. Initialize the actor and critic networks, as well as their target networks (used to stabilize learning).
2. At each time step, the agent takes an action based on the current policy (output of the actor network) and explores the environment using noise added to the actions.
3. Store the observed transitions (state, action, reward, next_state, done) in a replay buffer.
4. Randomly sample a batch of transitions from the replay buffer.
5. Train the critic network using the sampled transitions and the target Q-values, which are calculated using the target networks.
6. Train the actor network using the sampled transitions and the gradients of the Q-values with respect to the actions. Update the policy to maximize the Q-values.
7. Softly update the target networks (actor and critic) using a mix of the current and target network weights.

By learning a policy directly through the actor network and using the critic network to improve the policy, DDPG can handle continuous action spaces efficiently.
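
Steps 5 and 7 reduce to two small formulas. The sketch below writes them out in plain Python for a single transition and a flat list of weights. The function names `td_target` and `soft_update` are illustrative assumptions, and the default `gamma` and `tau` are the ones used by the agent in this notebook.

```python
def td_target(reward, done, next_q, gamma=0.99):
    """Critic target for one transition: r + gamma * Q_target(s', actor_target(s')).

    When done is 1.0 the bootstrap term is dropped.
    """
    return reward + (1.0 - done) * gamma * next_q


def soft_update(target_weights, online_weights, tau=0.005):
    """Polyak averaging: move each target weight a fraction tau toward the online weight."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]


y_mid = td_target(reward=1.0, done=0.0, next_q=10.0)   # bootstraps from the next state
y_last = td_target(reward=1.0, done=1.0, next_q=10.0)  # terminal: reward only
new_target = soft_update([0.0, 1.0], [1.0, 1.0])

print(y_mid, y_last, new_target)
```

With `tau=1.0` the same update becomes a hard copy of the online weights, which is how target networks are usually initialized.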

This text is copied from ChatGPT.

In [7]:
# imports
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense
from dm_control import suite
from dm_control import viewer
from dm_env import StepType

import random
from collections import deque
from statistics import mean, stdev
import time

In [2]:
class Actor(Model):
    def __init__(self, state_size, action_size):
        super(Actor, self).__init__()
        self.fc1 = Dense(24, activation='relu')
        self.fc2 = Dense(48, activation='relu')
        self.fc3 = Dense(96, activation='relu')
        self.fc4 = Dense(48, activation='relu')
        self.fc5 = Dense(24, activation='relu')
        self.fc6 = Dense(action_size, activation='tanh')

    def call(self, state):
        x = self.fc1(state)
        x = self.fc2(x)
        x = self.fc3(x)
        x = self.fc4(x)
        x = self.fc5(x)
        return self.fc6(x)

class Critic(Model):
    def __init__(self, state_size, action_size):
        super(Critic, self).__init__()
        self.fc1 = Dense(24, activation='relu')
        self.fc2 = Dense(48, activation='relu')
        self.fc3 = Dense(96, activation='relu')
        self.fc4 = Dense(48, activation='relu')
        self.fc5 = Dense(24, activation='relu')
        self.fc6 = Dense(1)

    def call(self, state, action):
        x = tf.concat([state, action], axis=-1)
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        x = self.fc4(x)
        x = self.fc5(x)
        return self.fc6(x)

class DDPGAgent:
    def __init__(self, state_size, action_size, gamma=0.99, tau=0.005, lr_actor=1e-4, lr_critic=1e-3):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        self.tau = tau

        self.actor = Actor(state_size, action_size)
        self.actor_target = Actor(state_size, action_size)
        self.actor_optimizer = tf.keras.optimizers.Adam(learning_rate=lr_actor)

        self.critic = Critic(state_size, action_size)
        self.critic_target = Critic(state_size, action_size)
        self.critic_optimizer = tf.keras.optimizers.Adam(learning_rate=lr_critic)

        self.noise_std_dev = 0.2
        self.noise_clip = 0.5

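        # Subclassed Keras models create their weights on the first call, so build
        # all four networks on dummy inputs; otherwise the hard copy below would
        # have no weights to copy.
        dummy_state = tf.zeros((1, state_size))
        dummy_action = tf.zeros((1, action_size))
        self.actor(dummy_state)
        self.actor_target(dummy_state)
        self.critic(dummy_state, dummy_action)
        self.critic_target(dummy_state, dummy_action)

        self.update_target_networks(tau=1.0)  # Initialize the target networks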

    def update_target_networks(self, tau=None):
        tau = self.tau if tau is None else tau

        weights_actor = self.actor.get_weights()
        weights_actor_target = self.actor_target.get_weights()
        for i in range(len(weights_actor)):
            weights_actor_target[i] = tau * weights_actor[i] + (1 - tau) * weights_actor_target[i]
        self.actor_target.set_weights(weights_actor_target)

        weights_critic = self.critic.get_weights()
        weights_critic_target = self.critic_target.get_weights()
        
        for i in range(len(weights_critic)):
            weights_critic_target[i] = tau * weights_critic[i] + (1 - tau) * weights_critic_target[i]
        self.critic_target.set_weights(weights_critic_target)

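    def act(self, state, noise=True):
        # Accept a state of shape (state_size,) or (1, state_size) and feed the
        # actor a batch of shape (1, state_size)
        state = np.asarray(state, dtype=np.float32).reshape(1, -1)
        action = self.actor(state).numpy()[0]  # shape (action_size,)

        if noise:
            # Clipped Gaussian exploration noise
            noise_sample = np.random.normal(loc=0, scale=self.noise_std_dev, size=self.action_size).clip(-self.noise_clip, self.noise_clip)
            action = np.clip(action + noise_sample, -1, 1)

        return action

    def train(self, replay_buffer, batch_size=64):
        state_batch, action_batch, reward_batch, next_state_batch, done_batch = replay_buffer.sample(batch_size)

        # Flatten every field to (batch, features). Rewards and done flags become
        # (batch, 1) so they combine row-wise with the critic output instead of
        # broadcasting to a (batch, batch) matrix.
        state_batch = tf.convert_to_tensor(state_batch.reshape(batch_size, -1), dtype=tf.float32)
        action_batch = tf.convert_to_tensor(action_batch.reshape(batch_size, -1), dtype=tf.float32)
        next_state_batch = tf.convert_to_tensor(next_state_batch.reshape(batch_size, -1), dtype=tf.float32)
        reward_batch = tf.convert_to_tensor(reward_batch.reshape(batch_size, 1), dtype=tf.float32)
        done_batch = tf.convert_to_tensor(done_batch.reshape(batch_size, 1), dtype=tf.float32)

        # Train Critic
        with tf.GradientTape() as tape:
            next_actions = self.actor_target(next_state_batch)
            target_q_values = self.critic_target(next_state_batch, next_actions)
            y = reward_batch + (1 - done_batch) * self.gamma * target_q_values
            q_values = self.critic(state_batch, action_batch)
            critic_loss = tf.reduce_mean(tf.square(y - q_values))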

        critic_grad = tape.gradient(critic_loss, self.critic.trainable_variables)
        self.critic_optimizer.apply_gradients(zip(critic_grad, self.critic.trainable_variables))

        # Train Actor
        with tf.GradientTape() as tape:
            actions = self.actor(state_batch)
            actor_loss = -tf.reduce_mean(self.critic(state_batch, actions))

        actor_grad = tape.gradient(actor_loss, self.actor.trainable_variables)
        self.actor_optimizer.apply_gradients(zip(actor_grad, self.actor.trainable_variables))

        # Update target networks
        self.update_target_networks()

class ReplayBuffer:
    def __init__(self, buffer_size=100000):
        self.buffer = deque(maxlen=buffer_size)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = map(np.array, zip(*batch))
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)


In [3]:
env = suite.load(domain_name='cartpole', task_name='balance')
action_size = env.action_spec().shape[0]
action_low = env.action_spec().minimum[0]
action_high = env.action_spec().maximum[0]
states = env.observation_spec()
state_size = states['position'].shape[0] + states['velocity'].shape[0]

agent = DDPGAgent(state_size, action_size)
replay_buffer = ReplayBuffer()

num_episodes = 200
max_steps = 1000
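
`action_low` and `action_high` are read above but never used by the agent, whose actor ends in `tanh` and therefore outputs values in [-1, 1]. That is fine as long as the environment's action bounds are also [-1, 1], which appears to be the case for cartpole. For other bounds, a linear rescaling along these lines would be needed. `scale_action` is a hypothetical helper, not part of the agent defined in this notebook.

```python
def scale_action(tanh_output, low, high):
    """Map a value in [-1, 1] linearly onto [low, high]."""
    return low + 0.5 * (tanh_output + 1.0) * (high - low)


print(scale_action(0.5, -1.0, 1.0))   # bounds already [-1, 1]: value is unchanged
print(scale_action(-1.0, -2.0, 2.0))  # lower end of the tanh range maps to low
print(scale_action(1.0, 0.0, 4.0))    # upper end of the tanh range maps to high
```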

In [4]:
# pretraining test

test_episodes = 10
test_average = 0
for e in range(test_episodes):
    timestep, reward, discount, observation = env.reset()
    state = np.concatenate((observation['position'],observation['velocity']))
    state = np.reshape(state, [1, state_size])
    t = 0

    for step in range(max_steps):
        action = agent.act(state)
        timestep, reward, discount, observation = env.step(action)
        state = np.concatenate((observation['position'],observation['velocity']))
        state = np.reshape(state, [1, state_size])
        t = step  # last step reached in this episode

        # state[0][1] is cos(pole angle); cos(angle) < 0.9 means a tilt of more than about 26 degrees
        if timestep == StepType.LAST or state[0][1] < 0.9:
            break

    print("Test Episode: {}/{}, Score: {}".format(e + 1, test_episodes, t))
    test_average += t

test_average/=test_episodes
print()
print('pretraining: score average ', test_average)

Test Episode: 1/10, Score: 74
Test Episode: 2/10, Score: 75
Test Episode: 3/10, Score: 88
Test Episode: 4/10, Score: 62
Test Episode: 5/10, Score: 98
Test Episode: 6/10, Score: 99
Test Episode: 7/10, Score: 76
Test Episode: 8/10, Score: 97
Test Episode: 9/10, Score: 87
Test Episode: 10/10, Score: 95

pretraining: score average  85.1
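
The early-stop checks in this notebook compare the second position entry, the cosine of the pole angle, against a threshold (0.9 here, 0.978 in the later test cells). A quick way to see which tilt angle a threshold corresponds to is to invert the cosine. The helper name below is illustrative only.

```python
import math


def cos_threshold_to_degrees(cos_value):
    """Tilt angle (in degrees) at which cos(angle) equals the given threshold."""
    return math.degrees(math.acos(cos_value))


angle_loose = cos_threshold_to_degrees(0.9)
angle_strict = cos_threshold_to_degrees(0.978)

print(f"cos = 0.900 -> {angle_loose:.1f} deg")
print(f"cos = 0.978 -> {angle_strict:.1f} deg")
```

The 0.9 threshold therefore tolerates roughly twice the tilt of the 0.978 threshold, so scores from the two kinds of loops are not directly comparable.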


In [5]:
# training
for episode in range(num_episodes):
    timestep, reward, discount, observation = env.reset()
    state = np.concatenate((observation['position'],observation['velocity']))
    state = np.reshape(state, [1, state_size])
    episode_reward = 0

    for step in range(max_steps):
        action = agent.act(state)
        timestep, reward, discount, observation = env.step(action)
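        next_state = np.concatenate((observation['position'],observation['velocity']))
        # Reshape the new observation (the original reshaped the old state, so the state never changed)
        next_state = np.reshape(next_state, [1, state_size])

        # timestep holds the StepType here; only LAST ends the episode
        done = 1 if timestep == StepType.LAST else 0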

        replay_buffer.add(state, action, reward, next_state, float(done))
        state = next_state
        episode_reward += reward

        if len(replay_buffer) >= 64:
            agent.train(replay_buffer)

        if done or state[0][1] < 0.9:
            break

    print(f"Episode {episode + 1}/{num_episodes}, Reward: {episode_reward}")


Episode 1/200, Reward: 236.3871735683071
Episode 2/200, Reward: 141.61476140906447
Episode 3/200, Reward: 141.23328609274046
Episode 4/200, Reward: 142.22228071592923
Episode 5/200, Reward: 140.2418062902733
Episode 6/200, Reward: 142.05827656501148
Episode 7/200, Reward: 142.20981142773326
Episode 8/200, Reward: 141.9345349872028
Episode 9/200, Reward: 142.41715954032716
Episode 10/200, Reward: 141.9734951733018
Episode 11/200, Reward: 142.23837333320353
Episode 12/200, Reward: 142.8993916525479
Episode 13/200, Reward: 140.9960288319469
Episode 14/200, Reward: 139.90230452343772
Episode 15/200, Reward: 141.4556160026598
Episode 16/200, Reward: 140.08364648995266
Episode 17/200, Reward: 140.47708357572822
Episode 18/200, Reward: 141.3656992479838
Episode 19/200, Reward: 141.9421089401938
Episode 20/200, Reward: 141.93559042106827
Episode 21/200, Reward: 142.78570918159485
Episode 22/200, Reward: 140.71775585973245
Episode 23/200, Reward: 142.55179212421538
Episode 24/200, Reward: 142.4

In [7]:
# Test
env = suite.load(domain_name='cartpole', task_name='balance')

# Test
test_episodes = 30
test_scores = []
test_rewards = []
start_time = time.time()


for e in range(test_episodes):
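    # Build the initial state from the reset observation instead of reusing a stale one
    timestep, reward, discount, observation = env.reset()
    state = np.concatenate((observation['position'],observation['velocity']))
    state = np.reshape(state, [1, state_size])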
    sum_rewards = 0

    for t in range(1000): # 1000 steps (half delta t of gym)
        action = agent.act(state)
        timestep, reward, discount, observation = env.step(action)
        state = np.concatenate((observation['position'],observation['velocity']))
        state = np.reshape(state, [1, state_size])
        
        sum_rewards += reward
    

        if observation['position'][1] < 0.978 or timestep == StepType.LAST or t == 999 : # cos of angle < cos of 12 degrees--> angle > 12deg or 1000 steps (half delta t of gym) 
            print("Test Episode: {}/{}, Score: {}".format(e + 1, test_episodes, t))
            test_scores.append(t)
            test_rewards.append(sum_rewards)
            break

test_average = mean(test_scores)
test_sigma = stdev(test_scores)
reward_average = mean(test_rewards)
reward_sigma = stdev(test_rewards)
end_time = time.time()
total_time = end_time - start_time
total_steps = sum(test_scores)
average_time_per_step = total_time / total_steps

print()
print('Score average: {:.2f}, Sigma: {:.2f}'.format(test_average, test_sigma))
print('Average time per step: {:.4f} seconds'.format(average_time_per_step))
print('Reward average: {:.2f}, Sigma: {:.2f}'.format(reward_average, reward_sigma))

env.close()

Test Episode: 1/30, Score: 16
Test Episode: 2/30, Score: 19
Test Episode: 3/30, Score: 16
Test Episode: 4/30, Score: 15
Test Episode: 5/30, Score: 17
Test Episode: 6/30, Score: 19
Test Episode: 7/30, Score: 16
Test Episode: 8/30, Score: 16
Test Episode: 9/30, Score: 18
Test Episode: 10/30, Score: 17
Test Episode: 11/30, Score: 16
Test Episode: 12/30, Score: 16
Test Episode: 13/30, Score: 16
Test Episode: 14/30, Score: 19
Test Episode: 15/30, Score: 15
Test Episode: 16/30, Score: 17
Test Episode: 17/30, Score: 16
Test Episode: 18/30, Score: 17
Test Episode: 19/30, Score: 18
Test Episode: 20/30, Score: 17
Test Episode: 21/30, Score: 17
Test Episode: 22/30, Score: 16
Test Episode: 23/30, Score: 17
Test Episode: 24/30, Score: 17
Test Episode: 25/30, Score: 16
Test Episode: 26/30, Score: 18
Test Episode: 27/30, Score: 19
Test Episode: 28/30, Score: 17
Test Episode: 29/30, Score: 17
Test Episode: 30/30, Score: 18

Score average: 16.93, Sigma: 1.14
Average time per step: 0.0019 seconds
Reward

In [8]:
#create video

def ddpg_policy(timestep):
    state = np.concatenate((timestep.observation['position'],timestep.observation['velocity']))
    state = np.reshape(state, [1, state_size])
    action = agent.act(state)
    return action
    
from moviepy.editor import ImageSequenceClip

# Load the cartpole environment
env = suite.load(domain_name='cartpole', task_name='balance')

# Visualization and video creation
def save_video(policy):
    frames = []

    def policy_with_frame_grab(time_step):
        pixels = env.physics.render(height=480, width=640, camera_id=0)
        frames.append(pixels)
        return policy(time_step)

    # Create the viewer application
    viewer.launch(env, policy=policy_with_frame_grab)

    # Save the frames as a video
    clip = ImageSequenceClip(frames, fps=100)
    clip.write_videofile("video/DDPG_dm_balance.mp4", codec="libx264")

# Call the save_video function with your policy function
save_video(ddpg_policy)

Moviepy - Building video video/DDPG_dm_balance.mp4.
Moviepy - Writing video video/DDPG_dm_balance.mp4

Moviepy - Done !
Moviepy - video ready video/DDPG_dm_balance.mp4




In [9]:
env = suite.load(domain_name='cartpole', task_name='balance')
env.step(1)
obs_shape = sum([value.shape[0] for value in env.observation_spec().values()])
obs_shape

5

## DDPG with Stable-Baselines3

In [3]:
import numpy as np
from dm_control import suite
#from dm_control.rl.wrappers import pixels
import gym
from gym.spaces import Box
from stable_baselines3 import DDPG

class DMControlWrapper(gym.Env):
    def __init__(self, domain_name, task_name):
        self.env = suite.load(domain_name=domain_name, task_name=task_name)
        obs_shape = sum([value.shape[0] for value in self.env.observation_spec().values()])
        self.observation_space = Box(low=-np.inf, high=np.inf, shape=(obs_shape,), dtype=np.float32)
        self.action_space = Box(low=self.env.action_spec().minimum[0], high=self.env.action_spec().maximum[0], shape=self.env.action_spec().shape, dtype=np.float32)

    def reset(self):
        time_step = self.env.reset()
        return np.array(self.get_obs(time_step))

    def step(self, action):
        time_step = self.env.step(action)
        return np.array(self.get_obs(time_step)), time_step.reward, time_step.last(), {}

    def get_obs(self, time_step):
        return np.concatenate([value for value in time_step.observation.values()])
    
    def action_spec(self):
        return self.env.action_spec()



In [4]:
env = DMControlWrapper("cartpole", "balance")
env.reset()
env.step(1)

(array([-0.03781814,  0.99992663,  0.01211301,  0.10266851, -0.15240272]),
 0.7987870816984224,
 False,
 {})
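
The five numbers in the observation above can be read as follows, assuming the wrapper concatenates `position` (cart position, cosine and sine of the pole angle) followed by `velocity` (cart velocity, angular velocity). This layout is an assumption based on how the earlier cells index `position[1]` as a cosine. The sketch checks it on the printed values and recovers the pole angle.

```python
import math

# Observation copied from the cell output above.
obs = [-0.03781814, 0.99992663, 0.01211301, 0.10266851, -0.15240272]

# Assumed layout: position = (cart x, cos(angle), sin(angle)),
# velocity = (cart velocity, angular velocity).
cart_x, cos_angle, sin_angle, cart_velocity, angular_velocity = obs

pole_angle_deg = math.degrees(math.atan2(sin_angle, cos_angle))
unit_norm = cos_angle ** 2 + sin_angle ** 2  # close to 1 if entries 1 and 2 are cos and sin

print(f"pole angle: {pole_angle_deg:.2f} deg, cos^2 + sin^2 = {unit_norm:.6f}")
```

The squared entries sum to 1 within rounding, which is consistent with the assumed layout.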

In [6]:
env = DMControlWrapper("cartpole", "balance")
DDPG_model = DDPG("MlpPolicy", env, verbose=1)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [7]:
DDPG_model.learn(total_timesteps=100000)

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | 173      |
| time/              |          |
|    episodes        | 4        |
|    fps             | 103      |
|    time_elapsed    | 38       |
|    total timesteps | 4000     |
| train/             |          |
|    actor_loss      | -2.39    |
|    critic_loss     | 0.00291  |
|    learning_rate   | 0.001    |
|    n_updates       | 3000     |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | 174      |
| time/              |          |
|    episodes        | 8        |
|    fps             | 88       |
|    time_elapsed    | 90       |
|    total timesteps | 8000     |
| train/             |          |
|    actor_loss      | -11      |
|    critic_loss     | 0.158    |
|    learning_rate   | 0.001    |
|    n_updates       | 7000     |
--------------

<stable_baselines3.ddpg.ddpg.DDPG at 0x29cb5b1f0>

In [None]:
print(DDPG_model.predict(observation)[0][0])

0.34260428


In [8]:
# Test
env = DMControlWrapper("cartpole", "balance")

# Test
test_episodes = 30
test_scores = []
test_rewards = []
start_time = time.time()


for e in range(test_episodes):
    state = env.reset()
    sum_rewards = 0

    for t in range(1000): # 1000 steps (half delta t of gym)
        action = DDPG_model.predict(state)[0][0]
        state, reward, done, _ = env.step(action)
        
        
        sum_rewards += reward
    

        if state[1] < 0.978 or done or t == 999 : # cos of angle < cos of 12 degrees--> angle > 12deg or 1000 steps (half delta t of gym) 
            print("Test Episode: {}/{}, Score: {}".format(e + 1, test_episodes, t))
            test_scores.append(t)
            test_rewards.append(sum_rewards)
            break

test_average = mean(test_scores)
test_sigma = stdev(test_scores)
reward_average = mean(test_rewards)
reward_sigma = stdev(test_rewards)
end_time = time.time()
total_time = end_time - start_time
total_steps = sum(test_scores)
average_time_per_step = total_time / total_steps

print()
print('Score average: {:.2f}, Sigma: {:.2f}'.format(test_average, test_sigma))
print('Average time per step: {:.4f} seconds'.format(average_time_per_step))
print('Reward average: {:.2f}, Sigma: {:.2f}'.format(reward_average, reward_sigma))


Test Episode: 1/30, Score: 519
Test Episode: 2/30, Score: 524
Test Episode: 3/30, Score: 525
Test Episode: 4/30, Score: 573
Test Episode: 5/30, Score: 508
Test Episode: 6/30, Score: 609
Test Episode: 7/30, Score: 611
Test Episode: 8/30, Score: 497
Test Episode: 9/30, Score: 484
Test Episode: 10/30, Score: 452
Test Episode: 11/30, Score: 598
Test Episode: 12/30, Score: 582
Test Episode: 13/30, Score: 565
Test Episode: 14/30, Score: 556
Test Episode: 15/30, Score: 510
Test Episode: 16/30, Score: 592
Test Episode: 17/30, Score: 515
Test Episode: 18/30, Score: 565
Test Episode: 19/30, Score: 573
Test Episode: 20/30, Score: 542
Test Episode: 21/30, Score: 584
Test Episode: 22/30, Score: 536
Test Episode: 23/30, Score: 530
Test Episode: 24/30, Score: 604
Test Episode: 25/30, Score: 553
Test Episode: 26/30, Score: 465
Test Episode: 27/30, Score: 514
Test Episode: 28/30, Score: 511
Test Episode: 29/30, Score: 539
Test Episode: 30/30, Score: 492

Score average: 540.93, Sigma: 42.85
Average time

In [9]:
#create video
from moviepy.editor import ImageSequenceClip

def ddpg_policy(time_step):
    timestep, reward, discount, observation = time_step
    state = np.concatenate((observation['position'],observation['velocity']))
    action = DDPG_model.predict(state)[0][0]
    return action

# Load the cartpole environment
env = suite.load(domain_name='cartpole', task_name='balance')

# Visualization and video creation
def save_video(policy):
    frames = []

    def policy_with_frame_grab(time_step):
        pixels = env.physics.render(height=480, width=640, camera_id=0)
        frames.append(pixels)
        return policy(time_step)

    # Create the viewer application
    viewer.launch(env, policy=policy_with_frame_grab)

    # Save the frames as a video
    clip = ImageSequenceClip(frames, fps=100)
    clip.write_videofile("video/DDPG_sb3_dm_balance.mp4", codec="libx264")

# Call the save_video function with your policy function
save_video(ddpg_policy)

Moviepy - Building video video/DDPG_sb3_dm_balance.mp4.
Moviepy - Writing video video/DDPG_sb3_dm_balance.mp4

Moviepy - Done !
Moviepy - video ready video/DDPG_sb3_dm_balance.mp4




## PPO with Stable-Baselines3

In [10]:
from stable_baselines3 import PPO
env = DMControlWrapper("cartpole", "balance")
PPO_model = PPO("MlpPolicy", env, verbose=1)
PPO_model.learn(total_timesteps=100000)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | 343      |
| time/              |          |
|    fps             | 3655     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 1e+03        |
|    ep_rew_mean          | 304          |
| time/                   |              |
|    fps                  | 2384         |
|    iterations           | 2            |
|    time_elapsed         | 1            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0024196494 |
|    clip_fraction        | 0.00376      |
|    clip_range           | 0.2          |
|    en

<stable_baselines3.ppo.ppo.PPO at 0x2b279b7c0>

In [11]:
# information about the model
print("Policy network architecture:", PPO_model.policy)


Policy network architecture: ActorCriticPolicy(
  (features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (mlp_extractor): MlpExtractor(
    (shared_net): Sequential()
    (policy_net): Sequential(
      (0): Linear(in_features=5, out_features=64, bias=True)
      (1): Tanh()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): Tanh()
    )
    (value_net): Sequential(
      (0): Linear(in_features=5, out_features=64, bias=True)
      (1): Tanh()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): Tanh()
    )
  )
  (action_net): Linear(in_features=64, out_features=1, bias=True)
  (value_net): Linear(in_features=64, out_features=1, bias=True)
)


In [12]:
# Test
env = DMControlWrapper("cartpole", "balance")

# Test
test_episodes = 30
test_scores = []
test_rewards = []
start_time = time.time()


for e in range(test_episodes):
    state = env.reset()
    sum_rewards = 0

    for t in range(1000): # 1000 steps (half delta t of gym)
        action = PPO_model.predict(state)[0][0]
        state, reward, done, _ = env.step(action)
        
        
        sum_rewards += reward
    

        if state[1] < 0.978 or done or t == 999 : # cos of angle < cos of 12 degrees--> angle > 12deg or 1000 steps (half delta t of gym) 
            print("Test Episode: {}/{}, Score: {}".format(e + 1, test_episodes, t))
            test_scores.append(t)
            test_rewards.append(sum_rewards)
            break

test_average = mean(test_scores)
test_sigma = stdev(test_scores)
reward_average = mean(test_rewards)
reward_sigma = stdev(test_rewards)
end_time = time.time()
total_time = end_time - start_time
total_steps = sum(test_scores)
average_time_per_step = total_time / total_steps

print()
print('Score average: {:.2f}, Sigma: {:.2f}'.format(test_average, test_sigma))
print('Average time per step: {:.4f} seconds'.format(average_time_per_step))
print('Reward average: {:.2f}, Sigma: {:.2f}'.format(reward_average, reward_sigma))


Test Episode: 1/30, Score: 999
Test Episode: 2/30, Score: 999
Test Episode: 3/30, Score: 999
Test Episode: 4/30, Score: 999
Test Episode: 5/30, Score: 999
Test Episode: 6/30, Score: 999
Test Episode: 7/30, Score: 999
Test Episode: 8/30, Score: 999
Test Episode: 9/30, Score: 999
Test Episode: 10/30, Score: 999
Test Episode: 11/30, Score: 999
Test Episode: 12/30, Score: 999
Test Episode: 13/30, Score: 999
Test Episode: 14/30, Score: 999
Test Episode: 15/30, Score: 999
Test Episode: 16/30, Score: 999
Test Episode: 17/30, Score: 999
Test Episode: 18/30, Score: 999
Test Episode: 19/30, Score: 999
Test Episode: 20/30, Score: 999
Test Episode: 21/30, Score: 999
Test Episode: 22/30, Score: 999
Test Episode: 23/30, Score: 999
Test Episode: 24/30, Score: 999
Test Episode: 25/30, Score: 999
Test Episode: 26/30, Score: 999
Test Episode: 27/30, Score: 999
Test Episode: 28/30, Score: 999
Test Episode: 29/30, Score: 445
Test Episode: 30/30, Score: 999

Score average: 980.53, Sigma: 101.15
Average tim
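
The large sigma for PPO comes from a single episode. The sketch below recomputes the statistics from the scores printed above (29 episodes at 999, one at 445) with the same `mean` and `stdev` functions used in the test cell.

```python
from statistics import mean, stdev

# Scores from the 30 PPO test episodes above.
scores = [999] * 29 + [445]

score_mean = mean(scores)
score_sigma = stdev(scores)  # sample standard deviation, as in the test cell

# Same statistic without the one short episode.
sigma_without_outlier = stdev([s for s in scores if s == 999])

print(f"mean = {score_mean:.2f}, sigma = {score_sigma:.2f}")
print(f"sigma without episode 29 = {sigma_without_outlier:.2f}")
```

With so few failures, the fraction of episodes that reach step 999 may describe the result better than the mean and sigma.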

In [12]:
#create video
from moviepy.editor import ImageSequenceClip

def ppo_policy(time_step):
    timestep, reward, discount, observation = time_step
    state = np.concatenate((observation['position'],observation['velocity']))
    action = PPO_model.predict(state)[0][0]
    return action

# Load the cartpole environment
env = suite.load(domain_name='cartpole', task_name='balance')

# Visualization and video creation
def save_video(policy):
    frames = []

    def policy_with_frame_grab(time_step):
        pixels = env.physics.render(height=480, width=640, camera_id=0)
        frames.append(pixels)
        return policy(time_step)

    # Create the viewer application
    viewer.launch(env, policy=policy_with_frame_grab)

    # Save the frames as a video
    clip = ImageSequenceClip(frames, fps=100)
    clip.write_videofile("video/PPO_sb3_dm_balance.mp4", codec="libx264")

# Call the save_video function with your policy function
save_video(ppo_policy)

Moviepy - Building video video/PPO_sb3_dm_balance.mp4.
Moviepy - Writing video video/PPO_sb3_dm_balance.mp4

Moviepy - Done !
Moviepy - video ready video/PPO_sb3_dm_balance.mp4




## Testing the PPO model trained on balance in the swingup task

In [16]:
# Test
env = DMControlWrapper("cartpole", "swingup")

# Test
test_episodes = 30
test_scores = []
test_rewards = []
start_time = time.time()


for e in range(test_episodes):
    state = env.reset()
    sum_rewards = 0

    for t in range(1000): # 1000 steps (half delta t of gym)
        action = PPO_model.predict(state)[0][0]
        state, reward, done, _ = env.step(action)
        
        
        sum_rewards += reward
    
    test_rewards.append(sum_rewards)


reward_average = mean(test_rewards)
reward_sigma = stdev(test_rewards)
end_time = time.time()
total_time = end_time - start_time
# Every episode runs the full 1000 steps here, so the step count is fixed
average_time_per_step = total_time / (test_episodes * 1000)


print()
print('Average time per step: {:.4f} seconds'.format(average_time_per_step))
print('Reward average: {:.2f}, Sigma: {:.2f}'.format(reward_average, reward_sigma))


Average time per step: 0.0002 seconds
Reward average: 323.45, Sigma: 96.18


In [13]:
# visualize
def PPO_policy(time_step):
    timestep, reward, discount, observation = time_step
    state = np.concatenate((observation['position'],observation['velocity']))
    action = PPO_model.predict(state)[0][0]
    return action

env = suite.load(domain_name='cartpole', task_name='swingup')

viewer.launch(env, policy=PPO_policy)