Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Urgent question about data aggregates #4

Closed
slerman12 opened this issue Nov 18, 2021 · 1 comment
Closed

Urgent question about data aggregates #4

slerman12 opened this issue Nov 18, 2021 · 1 comment

Comments

@slerman12
Copy link

slerman12 commented Nov 18, 2021

Hi, we compiled the Atari 100k results from DrQ, CURL, and DER, and the mean/median human-norm scores are well below those reported in prior works, including from co-authors of the rliable paper.

We have median human-norm scores all around 0.10 - 0.12.

Is this accurate? Of all of these, DER (the oldest of the algs) has the highest mean human-norm score.

@agarwl
Copy link
Collaborator

agarwl commented Nov 19, 2021

That doesn't seem right -- the aggregate scores should match as in figure below (uses 10 runs), which can be done using the colab at bit.ly/statistical_precipice_colab:

image.

@agarwl agarwl closed this as completed Nov 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants