
The role of the test data during the evaluation #3

Closed
TongZhangTHU opened this issue Nov 28, 2022 · 4 comments
Labels
question Further information is requested

Comments

@TongZhangTHU commented Nov 28, 2022

Hi Mohit,

Thank you again for your wonderful implementation!

I have a question about the implementation: what is the role of the test data during evaluation? As far as I know, a typical evaluation has the agent interact with the environment, which doesn't require test data. Does your implementation differ in this respect? And how do you tell whether an episode succeeds or not with the test data?

@MohitShridhar (Collaborator)

@TongZhangTHU, the test data serves two purposes:

  1. One-to-one comparisons between two agents. We can take an episode from the test dataset and use its random seed to spawn the exact same objects and object pose configurations every time. An agent trained with 100 demos then faces the same initial condition as another agent trained with 10 demos. This way we can do one-to-one comparisons between PerAct, C2FARM, Image-BC, different numbers of demos, etc. (see the sketch after this list).
  2. Checking if the task is actually solvable, at least by an expert. We don't want to evaluate on unsolvable task instances.
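
As a concrete illustration of point 1, here is a minimal sketch of seed-based reproducibility; `make_task_env` is a hypothetical stand-in for the repo's actual RLBench environment setup:

```python
import numpy as np

# Hypothetical helper standing in for the repo's RLBench setup: the episode
# seed fully determines which objects spawn and where they are placed.
def make_task_env(seed: int):
    rng = np.random.default_rng(seed)
    block_pose = rng.uniform(low=-0.2, high=0.2, size=3)  # deterministic given seed
    return {"block_pose": block_pose}

# Re-using a test episode's stored seed reproduces the exact same initial
# configuration, so two agents can be compared on identical task instances.
episode_seed = 42  # in the repo this would come from the test dataset
env_a = make_task_env(episode_seed)  # evaluated with agent A
env_b = make_task_env(episode_seed)  # evaluated with agent B
assert np.allclose(env_a["block_pose"], env_b["block_pose"])
```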

> And how do you tell whether an episode succeeds or not with the test data?

Did you check out the quickstart guide? When you run `eval.py`, you should see success rates printed out like:

Evaluating slide_block_to_color_target | Episode 0 | Score: 0.0 | Lang Goal: slide the block to blue target
Evaluating slide_block_to_color_target | Episode 1 | Score: 100.0 | Lang Goal: slide the block to yellow target
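
For reference, the per-episode scores above (0.0 or 100.0) average into a task-level success rate; a trivial sketch (the log lines are from `eval.py`, but the aggregation code here is only illustrative):

```python
# Illustrative only: aggregate per-episode scores (0.0 = failure,
# 100.0 = success) into a task-level success rate.
scores = [0.0, 100.0]  # the two episodes printed above
success_rate = sum(scores) / len(scores)
print(f"slide_block_to_color_target success rate: {success_rate:.1f}%")  # 50.0%
```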

@TongZhangTHU (Author)

@MohitShridhar Thanks for your reply. Is the following understanding correct: only the random seed of the test dataset is used? In other words, is only the initial observation the same as in the test dataset, with subsequent observations generated by the environment rather than read from the test dataset?

@MohitShridhar (Collaborator)

@TongZhangTHU, yes, only the random seed is used, along with the check that the episode is solvable.

All observations during evaluation, including the initial one, come from the simulator. The agent runs in an observe-act loop.
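
A self-contained sketch of that observe-act loop, with stub classes standing in for the actual RLBench environment and PerAct agent:

```python
# Minimal observe-act loop (illustrative only; the real repo uses RLBench
# environments and a PerAct agent). Every observation, including the first,
# comes from the simulator; the test dataset only supplied the seed.

class StubEnv:
    """Stands in for the simulator."""
    def __init__(self, seed: int):
        self.seed = seed
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {"step": self.steps}  # initial observation from the simulator

    def step(self, action):
        self.steps += 1
        done = self.steps >= 3
        reward = 1.0 if done else 0.0  # pretend the episode succeeds
        return {"step": self.steps}, reward, done

class StubAgent:
    def act(self, obs):
        return "noop"  # a real agent would predict an action from obs

env, agent = StubEnv(seed=42), StubAgent()
obs = env.reset()            # only the seed came from the test dataset
done, reward = False, 0.0
while not done:
    action = agent.act(obs)
    obs, reward, done = env.step(action)
print("Score:", 100.0 * reward)  # 100.0 = success, matching eval.py's logging
```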

@TongZhangTHU (Author)

@MohitShridhar I get it now. Thank you so much!

@MohitShridhar MohitShridhar added the question Further information is requested label Dec 21, 2022