Incorrect attention mask computation #75
TL;DR: There is a small bug in the attention masking code. It should not practically affect anyone using the released model or training their own models (unless you're using a special attention mask scheme), but we will fix it soon in an update. Thanks for catching this!

Looks like a victim of an off-by-one error caused by `np.searchsorted`. In this section:

```python
def _get_position(i, tokens_per_elem):
    return np.searchsorted(np.cumsum(tokens_per_elem), i)
```

the correct code should have been:

```python
def _get_position(i, tokens_per_elem):
    return np.searchsorted(np.cumsum(tokens_per_elem), i, side='right')
```

I'll work to get this patched in at some point, but since fixing the issue will affect our currently released checkpoints (which were trained with the old, incorrect attention mask structure), doing so will require a little bit of tact and care. I'll keep you updated in this issue.

For others reading this issue:

For most people (if you are using the released model checkpoints, or if you are using our config for pretraining): there is a small bug in the attention mask generation that causes observation tokens to attend to the action readout; it should not affect your use cases. This isn't a fatal bug (e.g. there is no information leakage from future timesteps to current timesteps), but we will fix it in a future update.

If you are using `octo.model.BlockTransformer` with non-default attention masking strategies (not common): if you have multiple timestep groups, the bug causes the first token of the second group to be misclassified as belonging to the first group (similarly, the first token of the third group is misclassified as belonging to the second group, and so on). If your model relies on different timestep groups not being able to attend to each other (a pretty non-standard case), please make sure to incorporate the fix in #76 to avoid information leakage between timestep groups.
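To make the off-by-one concrete, here is a small self-contained sketch (the group sizes below are made up purely for illustration) showing how the default `side='left'` in `np.searchsorted` assigns the first token of each later group to the previous group, while `side='right'` assigns it to the correct one:

```python
import numpy as np

# Illustrative group sizes (made up for this example): three token groups
# containing 3, 2, and 4 tokens respectively.
tokens_per_elem = np.array([3, 2, 4])
boundaries = np.cumsum(tokens_per_elem)  # array([3, 5, 9])

def get_position_buggy(i, tokens_per_elem):
    # Default side='left': a token index that equals a group boundary is
    # assigned to the group *before* the boundary.
    return np.searchsorted(np.cumsum(tokens_per_elem), i)

def get_position_fixed(i, tokens_per_elem):
    # side='right': boundary tokens are assigned to the group they open.
    return np.searchsorted(np.cumsum(tokens_per_elem), i, side='right')

for i in range(int(boundaries[-1])):
    print(i, get_position_buggy(i, tokens_per_elem), get_position_fixed(i, tokens_per_elem))

# Token 3 is the first token of group 1, but the buggy version reports
# group 0; likewise token 5 opens group 2 but is reported as group 1.
assert get_position_buggy(3, tokens_per_elem) == 0
assert get_position_fixed(3, tokens_per_elem) == 1
```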
Thank you for the timely reply! I've incorporated your fix and I can confirm the attention mask computation is now correct.
@dibyaghosh do you think finetuning the pretrained "bugged" model with the correct mask would improve or deteriorate the results? Since it's just one token, my guess is it shouldn't make too much of a difference.
Especially for finetuning, my guess is that it shouldn't really make any difference, but these things can be hard to predict.
It is useful, thanks!
I found that the `generate_attention_mask` function in `block_transformer.py` seems to calculate the attention mask incorrectly. Here is an example:

[printed attention mask diagram not captured here]

According to the attention mask diagram above, the token at index 656 (in `t=1 obs_wrist`) should NOT attend to the token at index 657 (in `t=1 readout_action`). However, `attention_mask[656, 657]` is True. You can reproduce this using jdb. It seems that the `get_token_metadata` function doesn't assign tokens to the correct group.
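A minimal sketch of the intended rule being violated here, assuming a made-up single-timestep layout (the group sizes and indices below are not the real Octo token layout; only the rule that observation tokens must not attend to the action readout comes from the report above):

```python
import numpy as np

# Hypothetical single-timestep layout (NOT the real Octo layout): 2 primary
# observation tokens, 3 wrist observation tokens, 1 action readout token.
groups = ["obs_primary"] * 2 + ["obs_wrist"] * 3 + ["readout_action"]
n = len(groups)

# Intended rule: observation tokens must not attend to readout tokens
# (readouts may attend to observations, but not the other way around).
attention_mask = np.ones((n, n), dtype=bool)
for q, q_group in enumerate(groups):
    for k, k_group in enumerate(groups):
        if q_group.startswith("obs") and k_group.startswith("readout"):
            attention_mask[q, k] = False

# Analogue of the reporter's check of attention_mask[656, 657]: the last
# wrist-observation token (index 4) must not attend to the readout (index 5).
assert not attention_mask[4, 5]
```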