Finetune with image goal. #74
Hi, I checked through the code and could not find where the image goal is loaded in the dataset-loading part. It seems this has not been implemented yet.
Image goals are being loaded and are returned as part of the task dictionary (see Line 97 in bd930f9). The model then uses task_stack_keys as the mechanism for goal-image conditioning.
Thanks to this, I was able to successfully train the model using an image goal. However, I'm not sure whether I'm performing inference with the image goal correctly. During inference we don't actually have the future image goal, so what kind of image goal should we use? Should it be one selected from the training set? (Here the train and test sets are defined as variations of the same task and scene.)
If you want to evaluate a policy with image goal specification, you need to collect a goal image for your evaluation task. We usually collect this image right before running the evaluation to make sure it's in-distribution with your current scene layout.
Thanks! Could you explain this in more detail? Given the early-fusion strategy for the goal image, I'd guess the model is sensitive to the alignment between the goal image and the test scene, and I'm curious how close that alignment needs to be. Is the alignment requirement satisfied if the goal image and the test sample involve the same task, in the same scene, targeting the same object, but with the object in a different location?
During training we have always used future images from the same trajectory as goals, so the model likely requires the goal image to show the same object positions.
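To make the evaluation workflow concrete, goal-image-conditioned inference might look roughly like the sketch below. This is a hedged illustration: the create_tasks / sample_actions calls follow the public Octo inference API as I understand it, the checkpoint path is a placeholder, and capture_goal_image / get_current_observation are hypothetical helpers standing in for your robot setup.

```python
import jax
from octo.model.octo_model import OctoModel

# Placeholder checkpoint; substitute your own finetuned model.
model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small")

# Capture the goal image right before the rollout so it is in-distribution
# with the current scene layout (same objects, same positions).
goal_image = capture_goal_image()            # hypothetical helper, e.g. (256, 256, 3) uint8
task = model.create_tasks(goals={"image_primary": goal_image[None]})  # add a batch dimension

observation = get_current_observation()      # hypothetical helper returning the observation dict
action = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
```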
The image tokenizer roughly implements the following logic:
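The following is a simplified sketch written for this discussion rather than the actual Octo implementation; the key names, shapes, and the black-image fallback are inferred from the explanation below.

```python
import jax.numpy as jnp

def tokenize_primary_image(observations, tasks, encoder):
    """Sketch of early goal fusion for a single "image_primary" stream."""
    obs_img = observations["image_primary"]   # current camera view, (..., H, W, 3)
    goal_img = tasks.get("image_primary")     # goal camera view from the tasks dict
    if goal_img is None:
        # No goal image provided -> substitute an all-black image,
        # which makes the goal channels an effective no-op.
        goal_img = jnp.zeros_like(obs_img)
    # Early goal fusion: concatenate channel-wise before any conv layers,
    # so the encoder can compare current view and goal view pixel by pixel.
    fused = jnp.concatenate([obs_img, goal_img], axis=-1)  # (..., H, W, 6)
    return encoder(fused)                     # conv stack -> image tokens
```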
So, when you configure the tokenizer with matching observation and task image keys, as in the example below:
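For concreteness, such a configuration might look roughly like the following. The import paths, module names, and encoder choice are assumptions based on the public Octo codebase, not taken from this thread.

```python
from octo.model.components.tokenizers import ImageTokenizer
from octo.model.components.vit_encoders import SmallStem16
from octo.utils.spec import ModuleSpec

primary_tokenizer = ModuleSpec.create(
    ImageTokenizer,
    obs_stack_keys=["image_primary"],   # current observation image
    task_stack_keys=["image_primary"],  # goal image, read from the tasks dict
    encoder=ModuleSpec.create(SmallStem16),
)
```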
Inside the tokenizer, the "image_primary" key is extracted from the "observations" dictionary, the "image_primary" key is extracted from the tasks dictionary, and the two are concatenated channel-wise, before being passed into the conv layers. This is known as early-goal fusion, and means that from the very beginning of the network, the model can do pixel-wise comparisons between the camera view at the current timestep and the desired goal camera view (a typically useful inductive bias for goal-reaching tasks).
If you don't care about goal-image task conditioning (e.g. you only want language-conditioned training), then you should simply omit the task_stack_keys argument (the same goes if you want goal-image conditioning but would prefer to encode / tokenize the goal image and the current observation separately). In any case, what is happening in your current code is that the config expects a goal image under tasks["image_primary"], does not find it in the tasks dictionary, and just inserts a black image in its place (effectively a no-op).
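As a hedged illustration of the first option, dropping task_stack_keys from the configuration sketched above leaves the tokenizer with only the current observation (again, module and argument names are assumptions based on the public Octo codebase):

```python
# Reusing the ImageTokenizer / SmallStem16 / ModuleSpec imports from the snippet above.
language_only_tokenizer = ModuleSpec.create(
    ImageTokenizer,
    obs_stack_keys=["image_primary"],
    # No task_stack_keys: the goal image is never fused into the observation
    # stream, so this tokenizer is suitable for language-only conditioning.
    encoder=ModuleSpec.create(SmallStem16),
)
```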
Originally posted by @dibyaghosh in #25 (comment)