# Decision Transformers Replication Report

This project seeks to replicate the Gym results of the following paper:

https://arxiv.org/abs/2106.01345

The offline RL data, data parsing code, and some model parameters are taken from the authors' GitHub repository:

https://github.com/kzl/decision-transformer

More information on using the datasets can be found here:

https://github.com/rail-berkeley/d4rl 

This project contains the following files:

- `RLagents.py`: general framework for RL agents
- `jonathans_experiment.py`: code for running experiments and sampling from datasets
- `DTagents.py`: framework for decision transformer agents
- `models.py`: the neural networks used
- `data`: directory containing the offline RL datasets, which can be obtained by following the directions in the authors' GitHub repository or from d4rl

In addition, make sure that the PyTorch, Hugging Face, and MuJoCo libraries are installed in your environment. Installation instructions can be found on their respective websites.
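
As a quick sanity check before running the notebook, the sketch below reports which of these libraries cannot be found. The import names (`torch`, `transformers`, `mujoco_py`) are my assumption about how the three libraries are installed; adjust them if your setup differs.

```python
import importlib.util

# Assumed import names: "torch" for PyTorch, "transformers" for Hugging Face,
# and "mujoco_py" for the MuJoCo Python bindings.
REQUIRED_MODULES = ("torch", "transformers", "mujoco_py")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names in `names` that cannot be located in this environment."""
    # find_spec only locates a top-level module; it does not import it.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
print("missing libraries:", ", ".join(missing) if missing else "none")
```

Because `find_spec` does not import anything, this check does not trigger the `mujoco_py` extension build.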

Below are step-by-step instructions on how to use these files.


In [4]:
# import libraries
%load_ext autoreload
%autoreload 2
from sql.jonathans_experiment import prepare_experiment
from sql.DTagents import DecisionTransformerAgent
import torch
import time



In [5]:
# make the experiment environment and dt agent
new_batch, env, max_ep_len, scale, env_target, state_mean, state_std = prepare_experiment('gym-experiment', device='cpu')
dta = DecisionTransformerAgent(env, scale=scale, target_return=env_target, warmup_steps=100, warmup_method=1, 
                               lr=0.01, state_mean=state_mean, state_std=state_std)

running build_ext
building 'mujoco_py.cymj' extension

/home-nfs/doctorduality/mc3/envs/ml/compiler_compat/ld: cannot find -lOSMesa
/home-nfs/doctorduality/mc3/envs/ml/compiler_compat/ld: cannot find -lGL
collect2: error: ld returned 1 exit status


LinkError: command '/usr/bin/gcc' failed with exit code 1

In [18]:
# train the agent
start_time = time.time()
for i in range(100):
    s, a, r, d, rtg, timesteps, mask = new_batch(64)
    dta.offline_train(s, a, r, d, rtg, timesteps, mask, per_batch=True)
print('training time', time.time() - start_time)

training time 48.58893084526062


In [19]:
# evaluate the agent and compute statistics
start_time = time.time()
returns, lengths = dta.online_evaluate(100)
print('evaluation time', time.time() - start_time)
print('mean return', returns.mean())
print('std returns', returns.std())
# episode-length statistics (values not captured in the recorded output below)
print('mean lengths', lengths.mean())
print('std lengths', lengths.std())

evaluation time 24.278470516204834
mean return 49.69007607395537
std returns 1.8752346636348485
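
To keep return and length statistics from being mixed up, the reporting can go through one small helper. This is a minimal sketch using only the standard library; the sample values are hypothetical, and I assume the population standard deviation is the one intended.

```python
import statistics


def summarize(values):
    """Return (mean, population standard deviation) of a sequence of numbers."""
    return statistics.fmean(values), statistics.pstdev(values)


# Hypothetical evaluation results, for illustration only.
returns = [48.2, 51.0, 49.9]
lengths = [80, 85, 81]

for name, values in (("return", returns), ("length", lengths)):
    mean, std = summarize(values)
    print(f"mean {name} {mean:.2f}, std {name} {std:.2f}")
```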


In [24]:
# evaluate the agent using the authors' evaluation code and compute statistics
from sql.evaluation.DTevaluation import BM_Evaluater
start_time = time.time()
eva = BM_Evaluater()
print(eva.evaluate(dta.env, dta.model))
# dta.bm_online_evaluate(10)
print('evaluation time', time.time() - start_time)

{'target_3600_return_mean': 63.37425210199192, 'target_3600_return_std': 1.8183312052112475, 'target_3600_length_mean': 82.0, 'target_3600_length_std': 5.92452529743945}
evaluation time 26.04992365837097


In [29]:
# evaluate with DT_Evaluater, falling back to the CPU when CUDA is unavailable
from sql.evaluation.DTevaluation import DT_Evaluater
eva1 = DT_Evaluater()
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eva1.evaluate(dta.env, dta.model, device=device)

In [27]:
start_time = time.time()
returns, lengths = dta.online_evaluate(10)


In [34]:
t = torch.tensor([[3,4,5]])
t.reshape(-1).shape

torch.Size([3])

### Benchmark
For comparison, below are the results of running the experiment with the authors' code for 100 iterations with the same parameters:

- time/training: 41.14713501930237
- evaluation/target_3600_return_mean: 42.6457553131806
- evaluation/target_3600_return_std: 1.6768748119913417
- evaluation/target_3600_length_mean: 27.79
- evaluation/target_3600_length_std: 0.8401785524517987

- time/total: 63.860310554504395
- time/evaluation: 22.713170051574707
- training/train_loss_mean: 0.6614395618438721
- training/train_loss_std: 0.02305582663767387
- training/action_error: 0.6420342326164246

The authors seem to have used methods beyond those mentioned in the paper to reduce the variance, but otherwise the results look similar.

### Results for various jobs

| Dataset | Environment | Mean | Std | Training Steps |
| --- | --- | --- | --- | --- |
| Medium | HalfCheetah | 40.15 | 6.92 | 1e4 |
| | Hopper | 94.53 | 4.97 | 1e5 |
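
For reference, the paper's Gym tables report D4RL-normalized scores, which rescale a raw return so that a random policy scores 0 and an expert policy scores 100. A minimal sketch of that rescaling follows; the reference returns used here are hypothetical placeholders, not the actual D4RL values.

```python
def normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so that random maps to 0 and expert maps to 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, for illustration only.
RANDOM_RETURN = 0.0
EXPERT_RETURN = 3000.0

print(normalized_score(1500.0, RANDOM_RETURN, EXPERT_RETURN))
```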


### Results for larger experiments

Below are the results of running the experiment for 1000 iterations with 1000 warmup steps and lr=0.0001, holding everything else constant:
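- Training time: 408.6037516593933 s
- Training time per batch: 0.40860375165939333 s
- Mean return: 86.40965465081138
- Std of returns: 11.182001940365332
- Mean and std of episode lengths: not available (the recorded values, 86.40965465081138 and 11.182001940365332, are identical to the return statistics, so they appear to be the returns printed a second time)
- Testing time: 24.500950574874878 s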

These results are very similar to those presented in the paper, where the authors run the experiment for 100000 iterations with 100000 warmup steps. However, we do see a higher std than the paper reports, as expected given our smaller sample size.

Unfortunately, I do not currently have the computational resources to run the experiment for 100000 iterations, but I expect the results to be similar. I also could not run the benchmark for 1000 iterations, since it required significantly more computation.
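
To make the role of the warmup steps concrete, here is a minimal sketch of a linear learning-rate warmup, in which the learning rate ramps up to its base value over the first `warmup_steps` updates. That the warmup is linear is my assumption for illustration; the report does not say which schedule `warmup_method=1` selects.

```python
def warmup_factor(step, warmup_steps):
    """Multiplier for the base learning rate at 0-based optimizer step `step`."""
    return min((step + 1) / warmup_steps, 1.0)


base_lr = 1e-4       # lr used in the larger experiment above
warmup_steps = 1000  # warmup steps used in the larger experiment above

for step in (0, 499, 999, 5000):
    print(step, base_lr * warmup_factor(step, warmup_steps))
```

In PyTorch, a function of this shape can be passed to `torch.optim.lr_scheduler.LambdaLR` as `lr_lambda`.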

## Notes for further research
- The authors do not seem to have normalized their per-episode returns using the d4rl method, as they claimed.
- In their episode evaluation, the authors do not seem to scale the returns-to-go, although scaled returns-to-go were used during training. In our replication we scale the rtg during evaluation.
- Currently the transformer processes return-to-go, state, and action tokens similarly. I think that an architecture that differentiates between them would improve performance.
- Using a better prediction layer than a single linear layer might result in better action predictions.
- In evaluation, the unknown next action is currently padded with a zero vector. This does not indicate to the model that we are trying to predict the unknown action rather than translate. I think modifying this might be useful.
- It might be interesting to have the model also predict future action/state/reward sequences, in order to create context that can then be used to predict the current action.
- The loss is currently based on how similar the predicted actions are to the actual actions. This means that the model is incentivized to stick to existing action sequences, and also that the loss is not based on the reward earned. Including rewards in the loss might incentivize the model to discover new sequences and thus improve performance.
- I tried adding more warmup methods, but they did not seem to have much effect.

I think that these would be interesting improvements to the model.
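
To illustrate the note on scaled returns-to-go, the sketch below computes undiscounted returns-to-go for a reward sequence, divides them by a `scale` factor, and shows the matching evaluation-time update of the target. The function names and the choice of no discounting are my assumptions for illustration; they are not taken from the project code.

```python
def returns_to_go(rewards, scale=1.0):
    """Undiscounted return-to-go at every timestep, divided by `scale`."""
    rtg = []
    running = 0.0
    for reward in reversed(rewards):
        running += reward
        rtg.append(running / scale)
    rtg.reverse()
    return rtg


def next_target(target, reward, scale=1.0):
    """Evaluation-time update: subtract the scaled reward just received."""
    return target - reward / scale


# Hypothetical rewards, for illustration only.
print(returns_to_go([1.0, 2.0, 3.0], scale=2.0))
print(next_target(3.0, 1.0, scale=2.0))
```

Using the same `scale` in both functions keeps the targets seen at evaluation in the same range as those seen during training.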