
The role of the test data during the evaluation #3

Closed
TongZhangTHU opened this issue Nov 28, 2022 · 4 comments
Labels
question Further information is requested

Comments

@TongZhangTHU commented Nov 28, 2022

Hi Mohit,

Thank you again for your wonderful implementation!

I have a question about the implementation: what is the role of the test data during evaluation? As far as I know, a typical evaluation has the agent interact with the environment, which doesn't require test data. Does your implementation differ in this respect? And how do you tell whether an episode succeeds or not with the test data?

@MohitShridhar (Collaborator)

@TongZhangTHU, the test data serves two purposes:

  1. One-to-one comparisons between two agents. We can take an episode from the test dataset and use its random seed to spawn the exact same objects and object pose configurations every time. An agent trained with 100 demos then faces the same initial condition as another agent trained with 10 demos. This way we can do one-to-one comparisons between PerAct, C2FARM, Image-BC, different numbers of demos, etc. (see the sketch after this list).
  2. Checking if the task is actually solvable, at least by an expert. We don't want to evaluate on unsolvable task instances.
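
As a concrete illustration of point 1, here is a minimal sketch of seed-based reproducibility; `make_task_env` is a hypothetical stand-in for the repo's actual RLBench environment setup:

```python
import numpy as np

# Hypothetical helper standing in for the repo's RLBench setup: the episode
# seed fully determines which objects spawn and where they are placed.
def make_task_env(seed: int):
    rng = np.random.default_rng(seed)
    block_pose = rng.uniform(low=-0.2, high=0.2, size=3)  # deterministic given seed
    return {"block_pose": block_pose}

# Re-using a test episode's stored seed reproduces the exact same initial
# configuration, so two agents can be compared on identical task instances.
episode_seed = 42  # in the repo this would come from the test dataset
env_a = make_task_env(episode_seed)  # evaluated with agent A
env_b = make_task_env(episode_seed)  # evaluated with agent B
assert np.allclose(env_a["block_pose"], env_b["block_pose"])
```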

> And how do you tell whether an episode succeeds or not with the test data?

Did you check out the quickstart guide? When you run `eval.py`, you should see success rates printed out like:

Evaluating slide_block_to_color_target | Episode 0 | Score: 0.0 | Lang Goal: slide the block to blue target
Evaluating slide_block_to_color_target | Episode 1 | Score: 100.0 | Lang Goal: slide the block to yellow target
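
For reference, the per-episode scores above (0.0 or 100.0) average into a task-level success rate; a trivial sketch (the log lines are from `eval.py`, but the aggregation code here is only illustrative):

```python
# Illustrative only: aggregate per-episode scores (0.0 = failure,
# 100.0 = success) into a task-level success rate.
scores = [0.0, 100.0]  # the two episodes printed above
success_rate = sum(scores) / len(scores)
print(f"slide_block_to_color_target success rate: {success_rate:.1f}%")  # 50.0%
```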

@TongZhangTHU (Author)

@MohitShridhar Thanks for your reply. Is the following understanding correct: only the random seed of the test dataset is used? In other words, is only the initial observation the same as in the test dataset, with subsequent observations generated by the environment rather than read from the test dataset?

@MohitShridhar (Collaborator)

@TongZhangTHU, yes, only the random seed is used, along with the check that the episode is solvable.

All observations during evaluation, including the initial one, come from the simulator. The agent runs in an observe-act loop.
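
A self-contained sketch of that observe-act loop, with stub classes standing in for the actual RLBench environment and PerAct agent:

```python
# Minimal observe-act loop (illustrative only; the real repo uses RLBench
# environments and a PerAct agent). Every observation, including the first,
# comes from the simulator; the test dataset only supplied the seed.

class StubEnv:
    """Stands in for the simulator."""
    def __init__(self, seed: int):
        self.seed = seed
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {"step": self.steps}  # initial observation from the simulator

    def step(self, action):
        self.steps += 1
        done = self.steps >= 3
        reward = 1.0 if done else 0.0  # pretend the episode succeeds
        return {"step": self.steps}, reward, done

class StubAgent:
    def act(self, obs):
        return "noop"  # a real agent would predict an action from obs

env, agent = StubEnv(seed=42), StubAgent()
obs = env.reset()            # only the seed came from the test dataset
done, reward = False, 0.0
while not done:
    action = agent.act(obs)
    obs, reward, done = env.step(action)
print("Score:", 100.0 * reward)  # 100.0 = success, matching eval.py's logging
```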

@TongZhangTHU (Author)

@MohitShridhar I get it now. Thank you so much!

@MohitShridhar MohitShridhar added the question Further information is requested label Dec 21, 2022