# Project: Furuta Pendulum

This notebook provides a starting point for the project, introducing the simulation environment. 

In [17]:
# Load the autoreload extension to automatically reload modules when they are modified
%load_ext autoreload

# Set autoreload to reload all modules before executing code
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Setting Up Dependencies

The simulation and reinforcement learning algorithms used in this project require a few libraries. To install all the necessary dependencies, follow these steps:

1. Open a terminal.
2. Navigate to this directory.
3. Run the following commands (assuming you have already set up the `conda` environment from the exercise sessions):

```
conda activate py13roboticscourse
python -m pip install -r requirements.txt
```

Installing PyTorch on Windows may require additional steps. If you encounter any issues, refer to the official PyTorch installation guide. Alternatively, you can work in a Windows Subsystem for Linux (WSL) environment, which is recommended for this project.
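
Before running the remaining cells, it can help to confirm that the packages imported in this notebook are visible to the active interpreter. The sketch below is one way to do that with the standard library. The helper name `find_missing` is hypothetical, and the module list is taken from the imports used in this notebook rather than from `requirements.txt`, whose contents are not shown here.

```python
import importlib.util


def find_missing(module_names):
    """Return the names in module_names that the import system cannot find."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Top-level modules imported later in this notebook (an assumption about what
# requirements.txt installs; adjust the list to match your setup).
required = ["yaml", "gymnasium", "torch"]
missing = find_missing(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules were found.")
```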

### Load the pendulum configuration

In [23]:
import yaml 
import os

with open('pendulum_description/simulation_pendulum.yaml', 'r') as file:
    config = yaml.safe_load(file)

parameters_model = config["parameters_model"]
urdf_path = os.path.join("pendulum_description", config["urdf_filename"])
forward_dynamics_casadi_path = os.path.join("pendulum_description", config["forward_dynamics_casadi_filename"])
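
If the YAML file is missing one of the keys read above, the failure shows up as a bare `KeyError`. A minimal sketch of a more explicit check is given below. The helper `resolve_paths` and the example dictionary values are hypothetical; only the three key names and the `pendulum_description` directory come from the cell above.

```python
import os

REQUIRED_KEYS = ("parameters_model", "urdf_filename", "forward_dynamics_casadi_filename")


def resolve_paths(config, base_dir="pendulum_description"):
    """Check the required keys and return (parameters_model, urdf_path, casadi_path)."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing keys in pendulum configuration: {missing}")
    urdf_path = os.path.join(base_dir, config["urdf_filename"])
    casadi_path = os.path.join(base_dir, config["forward_dynamics_casadi_filename"])
    return config["parameters_model"], urdf_path, casadi_path


# Placeholder values for illustration only; the real ones come from the YAML file.
example_config = {
    "parameters_model": {},
    "urdf_filename": "example.urdf",
    "forward_dynamics_casadi_filename": "example.casadi",
}
params, urdf, casadi = resolve_paths(example_config)
print(urdf)
```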

### Create the environment

In [24]:
import gymnasium as gym
from furuta_torque_env import FurutaPendulumTorqueEnv

gym.register(
    id="FurutaPendulumTorque-v0",
    entry_point=FurutaPendulumTorqueEnv,
)

# Create an instance of the FurutaPendulumTorqueEnv environment
env = gym.make(
    "FurutaPendulumTorque-v0",
    urdf_model_path=urdf_path,
    forward_dynamics_casadi_path=forward_dynamics_casadi_path,
    parameters_model=parameters_model,
    render=True,
    swingup=True,
)

# Click on the URL that appears in the output to visualize the environment in a browser

model name: furuta_pendulum
q: [0. 0.]
T_pin:   R =
           1            0            0
           0           -1  1.22465e-16
           0 -1.22465e-16           -1
  p =    0.093525 1.57979e-17       0.055

You can open the visualizer by visiting the following URL:
http://127.0.0.1:7009/static/


### Execute some control actions on the environment

Random actions are sampled and executed on the pendulum. In the browser visualization, you should see the pendulum move until the episode terminates or is truncated.

In [25]:
# Reset the environment to its initial state
observation, info = env.reset()

terminated = False
truncated = False
print(f"Initial Observation: {observation}")
# Apply a sequence of actions
while not terminated and not truncated:
    action = env.action_space.sample()  # Sample a random action
    observation, reward, terminated, truncated, info = env.step(action)
    print(f"Observation: {observation}, Reward: {reward}, Terminated: {terminated}, Truncated: {truncated}")

Initial Observation: [0. 1. 0. 1. 0. 0. 0.]
Swingup reward function not implemented yet
Observation: [-3.62470280e-03  9.99993431e-01  3.01842274e-03  9.99995445e-01
 -3.29494450e-02  1.20640447e-02 -5.76890630e-04], Reward: 0.0, Terminated: False, Truncated: False
Swingup reward function not implemented yet
Observation: [-0.01563694  0.99987774  0.01299554  0.99991555 -0.07624959  0.02781452
 -0.0024888 ], Reward: 0.0, Terminated: False, Truncated: False
Swingup reward function not implemented yet
Observation: [-0.04194762  0.99911981  0.03477375  0.99939521 -0.16301705  0.05926215
 -0.00667813], Reward: 0.0, Terminated: False, Truncated: False
Swingup reward function not implemented yet
Observation: [-0.08994594  0.99594665  0.0743799   0.99722998 -0.27424041  0.09932241
 -0.01433471], Reward: 0.0, Terminated: False, Truncated: False
Swingup reward function not implemented yet
Observation: [-0.14621262  0.98925319  0.12053068  0.9927096  -0.24091728  0.08606527
 -0.02335418], Reward:
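
The loop above can be wrapped in a small helper that also tracks the episode return and length, which is handy when comparing policies. The following is a sketch under assumptions: `run_episode` and the stand-in `CountdownEnv` are hypothetical names, and the stand-in only mimics the `reset`/`step` return signature of a Gymnasium environment so the snippet runs without the simulator.

```python
def run_episode(env, policy, max_steps=10_000):
    """Run one episode and return (total_reward, number_of_steps)."""
    observation, info = env.reset()
    total_reward, steps = 0.0, 0
    terminated = truncated = False
    while not (terminated or truncated) and steps < max_steps:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)
        steps += 1
    return total_reward, steps


class CountdownEnv:
    """Hypothetical stand-in: truncates the episode after a fixed number of steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        self.t += 1
        truncated = self.t >= self.horizon
        # observation, reward, terminated, truncated, info
        return float(self.t), -1.0, False, truncated, {}


total, length = run_episode(CountdownEnv(horizon=5), policy=lambda obs: 0.0)
print(total, length)
```

With the real environment, `policy` could be `lambda obs: env.action_space.sample()` to reproduce the random rollout.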

### Load the trained model and test it
We provide a model of an agent that has already been trained to stabilize the pendulum in the upright position.


This model was trained for 1 million timesteps using a slightly modified version of the [CleanRL implementation of PPO for continuous actions](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_continuous_actionpy). You can find the script used for training (`ppo_continuous_action.py`) in this same directory.
In the following cells, we demonstrate how to load the trained model and visualize its behavior in the environment.

In [21]:
import torch
from ppo_continuous_action import Agent, make_env

saved_model_path = 'furuta_pendulum_tensorboard/furutaTorque__ppo_continuous_action__42__1744890705/ppo_continuous_action.cleanrl_model' # no swingup

# Set to True if the task includes swingup (the model loaded above was trained without it)
swingup = False

envs = gym.vector.SyncVectorEnv(
        [make_env(urdf_path=urdf_path, 
                  parameters_model=parameters_model, 
                  forward_dynamics_casadi_path=forward_dynamics_casadi_path,
                  render=True, swingup=swingup) for _ in range(1)]
    )

model = Agent(envs=envs)

model.load_state_dict(torch.load(saved_model_path, map_location="cpu"))
model.eval()

model name: furuta_pendulum
q: [0. 0.]
T_pin:   R =
           1            0            0
           0           -1  1.22465e-16
           0 -1.22465e-16           -1
  p =    0.093525 1.57979e-17       0.055

You can open the visualizer by visiting the following URL:
http://127.0.0.1:7006/static/


Agent(
  (critic): Sequential(
    (0): Linear(in_features=7, out_features=64, bias=True)
    (1): Tanh()
    (2): Linear(in_features=64, out_features=64, bias=True)
    (3): Tanh()
    (4): Linear(in_features=64, out_features=1, bias=True)
  )
  (actor_mean): Sequential(
    (0): Linear(in_features=7, out_features=64, bias=True)
    (1): Tanh()
    (2): Linear(in_features=64, out_features=64, bias=True)
    (3): Tanh()
    (4): Linear(in_features=64, out_features=1, bias=True)
  )
)
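
The printed layer sizes are enough to count the weights of each network. As a quick arithmetic check, the sketch below does this for one network. The helper `mlp_parameter_count` is hypothetical, and it counts only the `Linear` layers shown in the printout, not any other parameters the `Agent` class may hold.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases of fully connected layers with the given sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


# Sizes read from the printout above: 7 inputs, two hidden layers of 64, 1 output.
sizes = [7, 64, 64, 1]
print(mlp_parameter_count(sizes))  # 4737 for the critic, and the same for actor_mean
```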

In [22]:
# Test the trained model
observation, info = envs.reset()
terminated = False
truncated = False
print(f"Initial Observation: {observation}")
# Apply a sequence of actions
while not terminated and not truncated:
    action, _, _, _ = model.get_action_and_value(torch.Tensor(observation), deterministic=True)  # Use the trained model to predict the action
    action = action.cpu().detach().numpy()
    observation, reward, terminated, truncated, info = envs.step(action)
    print(f"Observation: {observation}, Action: {action}, Reward: {reward}, Terminated: {terminated}, Truncated: {truncated}")
# Close the environment
envs.close()


Initial Observation: [[ 0.0000000e+00  1.0000000e+00  1.2246468e-16 -1.0000000e+00
   0.0000000e+00  0.0000000e+00  0.0000000e+00]]
Observation: [[-3.04630994e-03  9.99995360e-01  2.54047117e-03 -9.99996773e-01
  -2.76957928e-02 -1.01700594e-02 -4.84836035e-04]], Action: [[-0.22170198]], Reward: [-0.00133922], Terminated: [False], Truncated: [False]
Observation: [[-5.95719134e-03  9.99982256e-01  4.99028733e-03 -9.99987548e-01
   1.23094172e-03  3.62842387e-04 -9.48122056e-04]], Action: [[0.23175299]], Reward: [-0.00190787], Terminated: [False], Truncated: [False]
Observation: [[-0.00667149  0.99997775  0.00565195 -0.99998403 -0.00772522 -0.00301165
  -0.00106181]], Action: [[-0.07150369]], Reward: [-0.00025007], Terminated: [False], Truncated: [False]
Observation: [[-7.33780979e-03  9.99973078e-01  6.32304792e-03 -9.99980009e-01
   1.66706801e-03  3.25046895e-04 -1.16785918e-03]], Action: [[0.0754331]], Reward: [-0.00030516], Terminated: [False], Truncated: [False]
Observation: [[-7.2
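
In the printed observations, the first two pairs of entries each satisfy sin² + cos² ≈ 1, which suggests the layout (sin θ1, cos θ1, sin θ2, cos θ2, ...). Under that assumption, which is inferred from the numbers rather than documented here, the joint angles can be recovered with `atan2`. The helper name `decode_angles` is hypothetical, and the remaining three entries are left uninterpreted.

```python
import math


def decode_angles(observation):
    """Recover (theta1, theta2) in radians, assuming the first four entries are
    (sin theta1, cos theta1, sin theta2, cos theta2)."""
    sin1, cos1, sin2, cos2 = observation[:4]
    return math.atan2(sin1, cos1), math.atan2(sin2, cos2)


# Initial observations copied from the outputs of the two rollouts in this notebook.
swingup_start = [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
stabilization_start = [0.0, 1.0, 1.2246468e-16, -1.0, 0.0, 0.0, 0.0]

print(decode_angles(swingup_start))
print(decode_angles(stabilization_start))
```

Under this reading, the second angle is 0 at the start of the swing-up task and π at the start of the stabilization task, so the two tasks start half a turn apart.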

### TensorBoard

Below you can see the charts of the training evolution, visualized through TensorBoard. These charts are generated automatically when training a model using the `ppo_continuous_action.py` script.


In [9]:
%load_ext tensorboard

%tensorboard --logdir furuta_pendulum_tensorboard
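
Each run appears in TensorBoard under its folder name inside `furuta_pendulum_tensorboard`. The run folder referenced earlier in this notebook looks like four fields joined by double underscores. Assuming that pattern (environment id, experiment name, seed, Unix timestamp), a sketch for splitting it is shown below. The helper `parse_run_name` is hypothetical.

```python
from datetime import datetime, timezone


def parse_run_name(run_name):
    """Split '<env_id>__<exp_name>__<seed>__<unix_time>' into a dictionary."""
    env_id, exp_name, seed, unix_time = run_name.split("__")
    return {
        "env_id": env_id,
        "exp_name": exp_name,
        "seed": int(seed),
        # Interpret the last field as seconds since the Unix epoch (assumption)
        "started": datetime.fromtimestamp(int(unix_time), tz=timezone.utc),
    }


run = parse_run_name("furutaTorque__ppo_continuous_action__42__1744890705")
print(run["env_id"], run["seed"], run["started"].isoformat())
```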