# Decision Transformers Replication Report

This project attempts to replicate the Gym results of the following paper:

https://arxiv.org/abs/2106.01345

The offline RL data, the data parsing code, and some model parameters are taken from the authors' GitHub repository:

https://github.com/kzl/decision-transformer

This project contains the following files:

- `RLagents.py`: general framework for RL agents
- `jonathans_experiment.py`: code for running experiments and sampling from datasets
- `DTagents.py`: framework for decision transformer agents
- `models.py`: the neural networks used
- `data`: directory containing the offline RL datasets, which can be obtained by following the directions in the authors' GitHub repo or from D4RL

In addition, make sure that the PyTorch, Hugging Face, and MuJoCo libraries are installed in your environment. Installation instructions can be found on their respective websites.
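Before running the notebook, a quick check like the following can confirm that the dependencies are importable. This is only a minimal sketch: the module names `torch`, `transformers`, and `mujoco_py` are assumptions about which packages the project imports (newer setups may use `mujoco` instead), and `missing_packages` is a hypothetical helper that is not part of this project.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed module names; adjust them to match your installation.
REQUIRED = ["torch", "transformers", "mujoco_py"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks the modules up without importing them, so it is fast and does not trigger MuJoCo compilation or GPU initialization.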

Below are step-by-step instructions on how to use these files.


In [1]:
# import libraries
%load_ext autoreload
%autoreload 2
from jonathans_experiment import *
from DTagents import *
import torch
import time

In [12]:
# make the experiment environment and dt agent
new_batch, env, max_ep_len, scale, env_target, state_mean, state_std = prepare_experiment('gym-experiment', device='cpu')
dta = DecisionTransformerAgent(env, scale=scale, target_return=env_target, warmup_steps=100, warmup_method=1, 
                               lr=0.01, state_mean=state_mean, state_std=state_std)

Starting new experiment: hopper medium
2186 trajectories, 999906 timesteps found
Average return: 1422.06, std: 378.95
Max return: 3222.36, min: 315.87
(11,)
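
The `warmup_steps` and `lr` arguments above control a learning-rate warmup. How `DecisionTransformerAgent` implements it, and what `warmup_method=1` selects, is not shown in this report. The sketch below assumes a plain linear warmup, which is a common choice; `warmup_lr` is a hypothetical helper used only for illustration.

```python
def warmup_lr(step, base_lr=0.01, warmup_steps=100):
    """Learning rate at a 0-indexed optimizer step under linear warmup.

    Assumption: the rate ramps linearly from base_lr / warmup_steps up to
    base_lr over the first `warmup_steps` steps and stays constant afterwards.
    """
    scale = min((step + 1) / warmup_steps, 1.0)
    return base_lr * scale


# Print the rate at a few steps to see the ramp and the plateau.
for step in (0, 49, 99, 500):
    print(step, warmup_lr(step))
```

With only 10 training batches and `warmup_steps=100`, as in the short run in this notebook, a schedule like this would never reach the full learning rate, which is worth keeping in mind when reading the small-scale results.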


In [13]:
# train the agent
start_time = time.time()
for i in range(10):
    s, a, r, d, rtg, timesteps, mask = new_batch(64)
    dta.offline_train(s, a, r, d, rtg, timesteps, mask, per_batch=True)
print('training time', time.time() - start_time)

training time 13.977554559707642


In [13]:
# evaluate the agent and compute statistics
start_time = time.time()
returns, lengths = dta.online_evaluate(100)
print('evaluation time', time.time() - start_time)
print('mean return', returns.mean())
print('std returns', returns.std())
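# the length statistics should come from `lengths`, not `returns`; the recorded output
# below predates this fix, so its "lengths" lines repeat the return statistics
print('mean lengths', lengths.mean())
print('std lengths', lengths.std())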

eps 0


  a = torch.cat((a, torch.tensor(action, dtype=self.dtype, device=self.device).reshape(1, -1)))


eps 1
...
eps 99
evaluation time 1367.6661880016327
mean return 136.6401590305296
std returns 6.777549149085109
mean lengths 136.6401590305296
std lengths 6.777549149085109
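
When summarizing an evaluation run, it helps to compute the return and length statistics from their own sequences in one place, so the two cannot be mixed up. This is a minimal sketch, assuming `returns` and `lengths` can be iterated as numbers; `summarize` is a hypothetical helper, and the sample values are made up for illustration.

```python
from statistics import mean, pstdev


def summarize(returns, lengths):
    """Return mean and population std for episode returns and episode lengths.

    pstdev is the population standard deviation (ddof=0), which matches the
    default of NumPy's `.std()`. Whether `online_evaluate` returns NumPy
    arrays is an assumption.
    """
    returns = [float(x) for x in returns]
    lengths = [float(x) for x in lengths]
    return {
        "return_mean": mean(returns),
        "return_std": pstdev(returns),
        "length_mean": mean(lengths),
        "length_std": pstdev(lengths),
    }


# Illustrative values only, not results from this project.
stats = summarize([100.0, 120.0, 140.0], [50, 60, 70])
for key, value in stats.items():
    print(key, value)
```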


In [14]:
# evaluate the agent using the original authors' evaluation code and compute statistics
# state_mean
dta.bm_online_evaluate(10)

IndexError: index out of range in self

### Benchmark
For comparison, below are the results of running the experiment with the authors' code for 100 iterations with the same parameters:

- time/training: 41.14713501930237
- evaluation/target_3600_return_mean: 42.6457553131806
- evaluation/target_3600_return_std: 1.6768748119913417
- evaluation/target_3600_length_mean: 27.79
- evaluation/target_3600_length_std: 0.8401785524517987

- time/total: 63.860310554504395
- time/evaluation: 22.713170051574707
- training/train_loss_mean: 0.6614395618438721
- training/train_loss_std: 0.02305582663767387
- training/action_error: 0.6420342326164246

It seems that the authors used additional methods beyond those mentioned in the paper to reduce the variance, but otherwise the results look similar.

# Results for larger experiments

Below are the results from running the experiment for 1000 iterations with warmup steps = 1000 and lr = 0.0001, holding everything else constant:
- Training time: 408.6037516593933 s
- Training time per batch: 0.40860375165939333 s
- mean return: 86.40965465081138
- std returns: 11.182001940365332
- mean lengths: 86.40965465081138 (identical to the mean return, so this was likely printed from `returns` rather than `lengths`)
- std lengths: 11.182001940365332 (identical to the std of returns, for the same likely reason)
- Testing time: 24.500950574874878 s

These results are very similar to those presented in the paper, where the experiment is run for 100000 iterations with 100000 warmup steps. However, we do see a higher std than in the paper, as expected with our smaller sample size.

Unfortunately, I did not have the computational resources to run the experiment for 100000 iterations at this time, but I expect the results to be similar. I also could not run the benchmark for 1000 iterations, since it required significantly more computation.
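
The paper reports D4RL-normalized scores, where 0 corresponds to a random policy and 100 to an expert policy. If the returns in this report are raw environment returns, they would need to be converted before a direct comparison with the paper's tables. The sketch below shows that conversion. The Hopper reference scores are quoted from memory of the D4RL repository and are an assumption that should be checked against it; `normalized_score` is a hypothetical helper.

```python
def normalized_score(raw_return, random_score, expert_score):
    """Map a raw return to the D4RL scale: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw_return - random_score) / (expert_score - random_score)


# Assumed Hopper reference scores; verify against the D4RL repository before use.
HOPPER_RANDOM = -20.272305
HOPPER_EXPERT = 3234.3

# Example: the average return of the hopper-medium dataset printed earlier (1422.06).
print(normalized_score(1422.06, HOPPER_RANDOM, HOPPER_EXPERT))
```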

## Notes for further research
- The authors do not seem to have normalized their returns per episode using the method from D4RL, as they claimed
- In their episode evaluation, the authors do not seem to have used scaled returns-to-go, even though scaled returns-to-go were used during training. In our replication we scale the returns-to-go in evaluation
- Currently the transformer processes return-to-go, state, and action tokens similarly. I think that an architecture that differentiates between them would improve performance
- Using a better prediction layer than a single linear layer might result in better action predictions
- In evaluation, the unknown next action is currently padded with an all-zero vector. This does not indicate to the model that we are trying to predict the unknown rather than translate. I think that modifying this might be useful
- It might be interesting to have the model predict future action/state/reward sequences as well, in order to create context that can then be used to predict the current action
- The loss is currently based on how similar the predicted actions are to the actual actions. This means that the model is incentivized to stick to existing action sequences, and also that the loss is not based on the reward earned. Having the loss include rewards might incentivize the model to find new sequences and thus improve performance
- I tried adding more warmup methods, but they did not seem to have much effect

I think that these would be interesting improvements to the model
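
To make the note about scaled returns-to-go concrete, the sketch below computes undiscounted returns-to-go for one episode, divides them by a scale, and shows how a scaled target would be updated after each step during evaluation. The scale of 1000 is an illustrative assumption, and the target of 3600 is taken from the benchmark log keys in this report. `returns_to_go` and `next_target` are hypothetical helpers, not the code in `DTagents.py`.

```python
def returns_to_go(rewards, scale=1.0):
    """Undiscounted return-to-go at every timestep, divided by `scale`."""
    rtg = []
    running = 0.0
    # Accumulate rewards from the end of the episode backwards.
    for reward in reversed(rewards):
        running += reward
        rtg.append(running / scale)
    rtg.reverse()
    return rtg


def next_target(current_target, reward, scale=1.0):
    """Scaled target return-to-go after observing `reward` during evaluation."""
    return current_target - reward / scale


rewards = [1.0, 2.0, 3.0]
print(returns_to_go(rewards))             # [6.0, 5.0, 3.0]
print(returns_to_go(rewards, scale=2.0))  # [3.0, 2.5, 1.5]

# Evaluation: start from the scaled target and subtract each scaled reward.
target = 3600 / 1000  # assumed scale of 1000
target = next_target(target, reward=3.0, scale=1000)
print(target)
```

The point of the sketch is that the same `scale` appears in both functions: if training divides returns-to-go by the scale but evaluation conditions on the unscaled target, the model sees return tokens far outside its training range.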