# Example of applying `REVIVE` to gym-pendulum
***

## [Abstract](#abstract)

Gym-pendulum is one of the classic control problems in reinforcement learning (RL). As shown in the animation below, one end of the pendulum is attached to a fixed point and the other end is free to swing. The goal is to apply torque to the free end so that the pendulum swings into the upright position and stays balanced above the fixed point. A detailed description of the problem can be found in the __[gym-pendulum](https://www.gymlibrary.dev/environments/classic_control/pendulum/)__ documentation. 

Here we illustrate how to use `REVIVE` to build a simulated gym-pendulum environment and to learn a policy based on that simulated environment. Most importantly, we compare the performance of the policy learned with `REVIVE` against that of the original (old) policy, so that readers interested in `REVIVE` can see directly how much the policy improves.
<div>
<center>
<img src="url/pendulum.gif" width="400"/>
</center>
</div>

***

## [Methodology](#method)

0. Sampling

    In this step, we use the old policy to sample many trajectories and stitch them into one long track. Each step of the resulting `buffer` stores `[state, action, next_state]`, in that order. `index` lists the end points of all trajectories on the long track.

    > <b>Tip:</b> 
    > This step is not required by `REVIVE` itself. It is described here only to make the construction of the `REVIVE` data in the next step easier to follow. 
    <!-- *** -->
    > <b>Example:</b> 
    > ```python
    > import warnings
    > warnings.filterwarnings('ignore')
    > 
    > """
    > import pickle
    > import Sampling as SAmpling
    > 
    > Old_policy = SAmpling.Expert_policy()
    > Old_policy.enport_net("url/Expert_pendulum_v1.pt")
    > with open('url/Old_policy.pkl', 'wb') as f:
    >     pickle.dump(Old_policy, f)
    > """
    > 
    > import pickle
    > Old_policy = pickle.load(open('url/Old_policy.pkl', 'rb'))
    > buffer, index = Old_policy.sampling()
    > 
    > # output:
    > # Model's state_dict:
    > # fc1.weight 	 torch.Size([64, 3])
    > # fc1.bias 	 torch.Size([64])
    > # fc2.weight 	 torch.Size([64, 64])
    > # fc2.bias 	 torch.Size([64])
    > # fc3.weight 	 torch.Size([1, 64])
    > # fc3.bias 	 torch.Size([1])
    > 
    > ```
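To make the layout of `buffer` and `index` concrete, here is a minimal sketch with made-up data. It assumes, which the `Sampling` module above does not confirm, that `index` holds the exclusive end offset of each trajectory on the long track. The helpers `stitch`, `split` and `make_trajectory` are hypothetical and are not part of `REVIVE`.

```python
import random

def stitch(trajectories):
    """Concatenate trajectories into one long track.

    Assumption: `index` stores the cumulative length after each trajectory,
    i.e. the exclusive end offset of that trajectory on the long track.
    """
    buffer, index = [], []
    for trajectory in trajectories:
        buffer.extend(trajectory)
        index.append(len(buffer))
    return buffer, index

def split(buffer, index):
    """Recover the individual trajectories from the long track."""
    starts = [0] + index[:-1]
    return [buffer[start:end] for start, end in zip(starts, index)]

def make_trajectory(length, rng):
    """Toy trajectory: 3-dimensional states and 1-dimensional actions."""
    return [
        ([rng.random() for _ in range(3)],   # state
         [rng.random()],                     # action
         [rng.random() for _ in range(3)])   # next_state
        for _ in range(length)
    ]

rng = random.Random(0)
trajectories = [make_trajectory(4, rng), make_trajectory(6, rng)]
toy_buffer, toy_index = stitch(trajectories)
recovered = split(toy_buffer, toy_index)

print(len(toy_buffer), toy_index)       # 10 [4, 10]
print([len(t) for t in recovered])      # [4, 6]
```

Storing only the end points keeps the data as flat arrays while still allowing each trajectory to be recovered exactly.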


1. Construction of `.npz` data

    In this step, we build the data file in `.npz` format. As shown in the following code, the state data is stored under the name `states` and the action data under `actions`. `index` holds the end points of the trajectories in the data. Once these arrays are saved in compressed form in the `data` folder, the data file is complete.

    > <b>Tip:</b> 
    > Only the end points must be named `index`. Other data, such as states and actions, can be given any name, as long as the same names are used in the `.yml` file constructed in the next step.

    > <b>Example:</b>
    > ```python
    > """
    > import pickle
    > Old_policy = pickle.load(open('url/Old_policy.pkl', 'rb'))
    > buffer, index = Old_policy.sampling()
    > """
    > import numpy as np
    > states, actions, next_states = zip(*buffer)
    > 
    > transition_dic = {'states': np.array(states),
    >                   'actions': np.array(actions),
    >                   'index': np.array(index)}
    > 
    > np.savez_compressed("data/expert_data.npz", **transition_dic)
    > 
    > ``` 
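Before moving on, it can help to reload the saved file and check its keys and shapes. The sketch below does this on toy arrays written to a temporary folder, so it does not touch `data/expert_data.npz`. The array sizes and the interpretation of `index` as exclusive end offsets are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for the sampled data: 10 steps, 3-dim states, 1-dim actions.
toy_states = np.zeros((10, 3), dtype=np.float32)
toy_actions = np.zeros((10, 1), dtype=np.float32)
toy_index = np.array([4, 10])  # assumed: exclusive end offsets of two trajectories

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_data.npz")
    np.savez_compressed(path, states=toy_states, actions=toy_actions, index=toy_index)

    # Reload the archive and collect the name and shape of every stored array.
    with np.load(path) as data:
        keys = sorted(data.files)
        shapes = {key: data[key].shape for key in keys}

print(keys)    # ['actions', 'index', 'states']
print(shapes)

# Every array except `index` should have one row per step on the long track.
rows_match = all(shapes[key][0] == toy_index[-1] for key in keys if key != "index")
print(rows_match)  # True
```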

2. Construction of `.yml`
    
    The full `.yml` file for this example is shown below. In general, a `.yml` file consists of two parts, `metadata` and `columns`. The annotations in the example show how its names correspond to the names in the `.npz` file. Note that, because `states` has three dimensions, its columns must be defined consecutively in the `columns` part. As described in the __[gym-pendulum](https://www.gymlibrary.dev/environments/classic_control/pendulum/)__ documentation, the state and action variables are continuous, so every column uses the type `continuous`.

    > <b>Tip:</b> 
    > Column names such as `obs_*` can be chosen freely. Note that the dimensions of a multi-dimensional array in the `.npz` file must be defined consecutively and in order in the `columns` part.

    > <b>Example of `.yml`:</b> 
    > ``` 
    > metadata:                     # 'metadata' part
    > graph:
    >     actions:                  # corresponds to `actions` in '.npz'
    >     - states                  # corresponds to `states` in '.npz'
    >     next_states:
    >     - states                  # corresponds to `states` in '.npz'
    >     - actions                 # corresponds to `actions` in '.npz'
    > columns:                      # 'columns' part
    > - obs_0:                      # 'obs_*' stands for the *-th dimension of 'states';
    >     dim: states               # 'dim: states' corresponds to 'states' in '.npz'.
    >     type: continuous          # As there are three dimensions in 'states', they are
    > - obs_1:                      # defined in order in the 'columns' part.
    >     dim: states
    >     type: continuous
    > - obs_2:
    >     dim: states
    >     type: continuous
    > - action:
    >     dim: actions
    >     type: continuous
    >
    > ```
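When an array has many dimensions, writing one column entry per dimension by hand becomes tedious. The following sketch generates the entries of the `columns` part from the array widths. The helper `build_columns` is hypothetical and not part of `REVIVE`; it prints plain Python dictionaries that mirror the entries above rather than writing a `.yml` file.

```python
def build_columns(nodes):
    """Build one column entry per array dimension, in order.

    `nodes` maps an array name in the `.npz` file to a pair
    (column name prefix, number of dimensions).
    """
    columns = []
    for node, (prefix, width) in nodes.items():
        for i in range(width):
            # A single-dimension array keeps the bare prefix, as `action` does above.
            name = prefix if width == 1 else f"{prefix}_{i}"
            columns.append({name: {"dim": node, "type": "continuous"}})
    return columns

columns = build_columns({"states": ("obs", 3), "actions": ("action", 1)})
for column in columns:
    print(column)

column_names = [next(iter(column)) for column in columns]
print(column_names)  # ['obs_0', 'obs_1', 'obs_2', 'action']
```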

3. Construction of `reward.py`

    `reward.py` is used in the last step of `REVIVE` to learn a policy. In this example, we apply torque to the pendulum so that it stands upright above the fixed point. The reward is given by the following equation:
    
    $r = -(\theta^{2} + 0.1 \, \dot{\theta}^{2} + 0.001 \, \mathrm{torque}^{2})$
    
    where $\theta$ is the angle of the pendulum measured from the upright position and $\dot{\theta}$ is its angular velocity. The reward ranges from 0, when the pendulum stands upright and motionless above the fixed point with no torque applied, down to about -16, when it hangs straight down below the fixed point.
    
    
    
    > <b>Tip:</b> 
    > In general, `reward.py` defines a function named `get_reward`, whose computation should operate on `torch.Tensor` objects.

    > <b>Example of `reward.py`:</b> 
    > ```python 
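    > import math
    > from typing import Dict
    > 
    > import torch
    > 
    > def get_reward(data : Dict[str, torch.Tensor]) -> torch.Tensor:
    >     # The torque is limited to [-2, 2].
    >     action = data['actions'][...,0:1]
    >     u = torch.clamp(action, -2, 2)
    > 
    >     # The state holds cos(theta), sin(theta) and the angular velocity.
    >     state = data['states'][...,0:3]
    >     costheta = state[:,0].view(-1,1)
    >     sintheta = state[:, 1].view(-1,1)  # not needed for the reward
    >     thdot = state[:, 2].view(-1,1)
    > 
    >     # Recover the angle; the clamp guards acos against inputs slightly outside [-1, 1].
    >     x = torch.acos(torch.clamp(costheta, -1, 1))
    >     theta = ((x + math.pi) % (2 * math.pi)) - math.pi
    >     costs = theta ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)
    >     
    >     return -costs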
    > 
    > ```
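The reward bounds quoted above can be checked with a few lines of plain arithmetic. The sketch below uses only the standard library and is independent of `reward.py`. The limits of 8 for the angular velocity and 2 for the torque are taken from the gym-pendulum documentation, and the function name `pendulum_reward` is made up for illustration.

```python
import math

MAX_SPEED = 8.0    # angular velocity limit, from the gym-pendulum documentation
MAX_TORQUE = 2.0   # torque limit, from the gym-pendulum documentation

def pendulum_reward(theta, theta_dot, torque):
    """Scalar version of the reward equation."""
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)

# Best case: upright, motionless, no torque.
best = pendulum_reward(0.0, 0.0, 0.0)
# Worst case: hanging down at maximum speed under maximum torque.
worst = pendulum_reward(math.pi, MAX_SPEED, MAX_TORQUE)

print(best == 0)        # True
print(round(worst, 4))  # -16.2736

# acos only recovers |theta|, but the reward uses theta squared,
# so the lost sign does not change the result.
theta = -2.5
recovered = math.acos(math.cos(theta))
print(math.isclose(recovered ** 2, theta ** 2))  # True
```

The last check shows why `get_reward` can ignore `sintheta`: only the magnitude of the angle enters the cost.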


4. Run in `REVIVE`

    We have now constructed everything needed to run `REVIVE`: the `.npz` data, the `.yml` file and the `reward.py` function file. In this example, these three files are located in the `data` folder, together with another required file, `config.json`, which describes the parameters of `REVIVE`. We also need a `train.py` file as the launch script for `REVIVE`. From the directory that contains `train.py`, we can now launch `REVIVE` on the command line.

    > <b>Example:</b> 
    > ```
    >line 1 python train.py                           
    >line 2        -df data/expert_data.npz 
    >line 3        -cf data/Env-GAIL-pendulum.yaml 
    >line 4        -rf data/pendulum-reward.py 
    >line 5        -rcf data/config.json 
    >line 6        -vm once 
    >line 7        -pm once 
    >line 8        --run_id pendulum-v1 
    >line 9        --revive_epoch 1000 
    >line 10       --ppo_epoch 5000
    >line 11       --venv_rollout_horizon 100
    >line 12       --ppo_rollout_horizon 50 
    > ``` 
    
    Line 1 calls the launch script, and lines 2 to 5 give the paths of the required files. In lines 6 and 7, `-vm` and `-pm` set the `REVIVE` training modes for the environment model and the policy model, respectively. Here, `once` means training with the default hyper-parameters of `REVIVE`. We name this job `pendulum-v1`, so `REVIVE` creates a folder with the same name in the `logs` folder, which holds all the results needed to assess the virtual environment and the policy. We train the virtual environment with `REVIVE` for 1000 epochs. Based on the virtual environment, we then train the control policy with the PPO method for 5000 epochs, a number that can be adjusted by the user. Lines 11 and 12 tell `REVIVE` to roll out 100 steps for virtual-environment learning and 50 steps for policy learning, respectively. These two settings also appear in `data/config.json`.
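Because the listing above carries line labels, it cannot be pasted into a shell as is. As a convenience, the sketch below assembles the same arguments into a single command string. It uses only the flags and values shown in the listing and does not launch anything.

```python
import shlex

# Same arguments as in the numbered listing, one flag per row.
command = [
    "python", "train.py",
    "-df", "data/expert_data.npz",
    "-cf", "data/Env-GAIL-pendulum.yaml",
    "-rf", "data/pendulum-reward.py",
    "-rcf", "data/config.json",
    "-vm", "once",
    "-pm", "once",
    "--run_id", "pendulum-v1",
    "--revive_epoch", "1000",
    "--ppo_epoch", "5000",
    "--venv_rollout_horizon", "100",
    "--ppo_rollout_horizon", "50",
]

command_line = shlex.join(command)
print(command_line)
```

To start training from Python instead of the shell, the same list could be passed to `subprocess.run(command, check=True)` from the directory that contains `train.py`.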

***
## [Results](#results)

In the end, we obtain the final policy from `REVIVE`, which is saved as 
`policy.pkl` in the `logs/pendulum-v1` folder. Since the gym environment can be treated as the real environment for testing learned policies, we compare the `REVIVE` policy with the old policy that we used for sampling. 

In the following, we first run each policy in the same `pendulum-v1` environments 50 times, rolling out 300 steps each time, and print the mean return (accumulated reward) over these 50 rollouts. The `REVIVE` policy achieves -137.66, which is much higher than the -848.88 of the old policy. Learning the control policy with `REVIVE` thus improves the mean return by about 84% relative to the old policy. 
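The improvement figure follows directly from the two mean returns. A minimal check, assuming the improvement is measured relative to the magnitude of the old policy's return:

```python
revive_return = -137.66
old_return = -848.88

# Gain in mean return, relative to the magnitude of the old policy's return.
improvement = (revive_return - old_return) / abs(old_return)
print(f"{improvement:.1%}")  # 83.8%
```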

In [2]:
import warnings
warnings.filterwarnings('ignore')

from Results import get_results
import pickle

result = get_results('logs/pendulum-v1/policy.pkl', 'url/Old_policy.pkl')
r_revive, r_old, video_revive, video_old = result.roll_out(50, step=300)

with open('url/results.pkl', 'wb') as f:
    pickle.dump([video_revive, video_old], f)

# output
# mean return of REVIVE: -137.66
# mean return of    old: -848.88    


mean return of REVIVE: -137.66
mean return of    old: -848.88


To compare the two policies more intuitively, we select, from these 50 rollouts, the environment in which the old policy achieved its maximum return, and apply the `REVIVE` policy to the same environment with the same initial state. The animation below shows each step of the pendulum's movement. Under the `REVIVE` policy, the pendulum stands upright above the fixed point after about 3 seconds, whereas under the old policy it still has not reached the upright position by the end of the rollout.

<div>
<center>
<img src="url/result.gif" width="1000"/>
</center>
</div>

In [2]:
from Video import get_video
from IPython import display
import pickle
%matplotlib notebook

video_revive, video_old = pickle.load(open('url/results.pkl', 'rb'))
html = get_video(video_revive, video_old)
display.display(html)

