
Finetune with image goal. #74

Closed · zwbx opened this issue Apr 12, 2024 · 6 comments

@zwbx commented Apr 12, 2024

Thanks for the question! We use `task_stack_keys` as a mechanism to do goal-image conditioning.

The image tokenizer roughly implements the following logic:

import jax.numpy as jnp

# Stack the current observation(s) and the goal image(s) channel-wise,
# then tokenize the fused input with a single encoder.
inputs = jnp.concatenate(
    [observations[k] for k in obs_stack_keys] +
    [tasks[k] for k in task_stack_keys],
    axis=-1,
)
tokens = encoder(inputs)

So, when you configure the tokenizer this way:

"primary": ModuleSpec.create(
    ImageTokenizer,
    obs_stack_keys=["image_primary"],
    task_stack_keys=["image_primary"],
    encoder=ModuleSpec.create(SmallStem16),
),

Inside the tokenizer, the "image_primary" entry is extracted from the observations dictionary, the "image_primary" entry is extracted from the tasks dictionary, and the two are concatenated channel-wise before being passed into the conv layers. This is known as early goal fusion: from the very beginning of the network, the model can make pixel-wise comparisons between the camera view at the current timestep and the desired goal camera view (typically a useful inductive bias for goal-reaching tasks).
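
To make the channel-wise stacking concrete, here is a minimal sketch (the array names and the 256x256 resolution are illustrative, not taken from the Octo codebase):

import jax.numpy as jnp

# Hypothetical 256x256 RGB frames: current camera view and goal view.
current_obs = jnp.zeros((256, 256, 3))
goal_image = jnp.zeros((256, 256, 3))

# Early fusion: stack along the channel axis before any conv layers run,
# so the encoder sees a single 6-channel image it can compare pixel-wise.
fused = jnp.concatenate([current_obs, goal_image], axis=-1)
print(fused.shape)  # (256, 256, 6)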


If you don't care about goal-image task conditioning (e.g. you only want language-conditioned training), then you should simply omit the `task_stack_keys` argument (the same applies if you want goal-image conditioning but would prefer to encode/tokenize the goal image and the current observation separately).
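
For example, a language-only tokenizer spec would look roughly like the snippet above minus `task_stack_keys` (a sketch mirroring the config style shown earlier):

"primary": ModuleSpec.create(
    ImageTokenizer,
    obs_stack_keys=["image_primary"],
    # no task_stack_keys: the goal image is never fused in, so task
    # conditioning has to come from elsewhere (e.g. language tokens)
    encoder=ModuleSpec.create(SmallStem16),
),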

In any case, what is happening in your current code is that the config expects a goal image corresponding to "image_primary" in tasks["image_primary"], does not find it in the tasks dictionary, and falls back to inserting a black image in its place (effectively a no-op).
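
A minimal sketch of that fallback (my paraphrase of the behavior described above, not the actual Octo code):

import jax.numpy as jnp

def get_goal_image(tasks, key, obs_shape):
    # If the expected goal image is absent, substitute an all-black
    # (all-zero) image so the channel-wise concat still works but
    # contributes no goal information.
    if key not in tasks:
        return jnp.zeros(obs_shape)
    return tasks[key]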

Originally posted by @dibyaghosh in #25 (comment)

@zwbx (Author) commented Apr 12, 2024

Hi, I checked through the code and could not find where the image goal is loaded in the dataset-loading code. It seems this has not been implemented yet.

@zwbx changed the title from "Thanks for the question! We use `task_stack_keys` as a mechanism to do goal-image conditioning." to "Finetune with image goal." on Apr 12, 2024
@kpertsch (Collaborator)

Image goals are being loaded and are returned as part of the task dictionary from the data loader.
See here:

dataset = dataset.traj_map(
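
For reference, uniform goal relabeling, i.e. picking a random future frame of the same trajectory as the goal, can be sketched like this (hypothetical function and key names; the real logic lives in octo's data pipeline):

import numpy as np

def relabel_with_future_goal(traj, rng):
    # traj["observation"]["image_primary"]: (T, H, W, 3) frames of one trajectory
    frames = traj["observation"]["image_primary"]
    T = frames.shape[0]
    # For each timestep t, sample a goal index uniformly from [t, T)
    goal_idx = rng.integers(low=np.arange(T), high=T)
    traj["task"] = {"image_primary": frames[goal_idx]}
    return traj

# usage: traj = relabel_with_future_goal(traj, np.random.default_rng(0))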

@zwbx (Author) commented Apr 15, 2024

Thanks to this, I was able to successfully train the model using an image goal. However, I'm not sure whether I'm performing inference with the image goal correctly. During inference we don't actually have the future image goal, so what kind of image goal should we use? Should it be one selected from the training set? (Here the train and test sets are defined as variations of the same task and scene.)

@kpertsch (Collaborator)

If you want to evaluate a policy with image goal specification, you need to collect a goal image for your evaluation task. We usually collect this image right before running the evaluation to make sure it's in-distribution with your current scene layout.
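
In code, that workflow might look roughly like this (a sketch assuming the `create_tasks` / `sample_actions` interface from the Octo README; `capture_camera_frame` and `observation` are placeholders for your own robot stack):

import jax
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base")

# Arrange the scene, then snap the goal image right before the rollout
# so it matches the current object layout.
goal_img = capture_camera_frame()  # placeholder, e.g. (256, 256, 3) uint8

# Condition on the goal image instead of a language instruction.
task = model.create_tasks(goals={"image_primary": goal_img[None]})

# `observation` is your batched observation dict from the robot.
actions = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))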

@zwbx (Author) commented Apr 17, 2024

Thanks! Could you explain this in more detail? Given the early-fusion strategy for goal images, I'd guess the model is sensitive to how well the goal image is aligned with the test scene, and I'm curious what degree of alignment is necessary. Is the alignment requirement satisfied if the goal image and the test sample involve the same task, in the same scene, targeting the same object, but with the object in a different location?

@kpertsch (Collaborator)

During training we have always used future images from the same trajectory as goals, so the model likely requires the goal image to show the same object positions.

WenchangGaoT pushed a commit to WenchangGaoT/octo1 that referenced this issue May 10, 2024
@zwbx zwbx closed this as completed Aug 3, 2024