
Add jamba #29943

Merged: 78 commits into huggingface:main on Apr 18, 2024

Conversation

tomeras91 (Contributor)

What does this PR do?

Add support for the Jamba architecture by AI21 Labs

Who can review?

@ArthurZucker @younesbelkada

@ArthurZucker (Collaborator)

Reviewing !

@ArthurZucker (Collaborator) left a comment

Great work! 🔥 It's already super transformers-like!

  • tokenization_auto needs to be updated to include which tokenizer jamba uses!
  • a few code paths would be nice to remove, BUT that would mean having to convert the checkpoints (even though your naming choices are alright), and that would be a bit annoying.
  • great PR! 🤗

README.md Outdated
@@ -397,6 +397,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.
1. **[Jamba](https://huggingface.co/docs/transformers/main/model_doc/jamba)** (from <FILL INSTITUTION>) released with the paper [<FILL PAPER TITLE>](<FILL ARKIV LINK>) by <FILL AUTHORS>.
Collaborator

To fill!

@@ -0,0 +1,129 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Collaborator

Suggested change
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Comment on lines +27 to +32
Jamba is a pretrained, mixture-of-experts (MoE) generative text model, with 12B active parameters and a total of 52B parameters across all experts. It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU.

As depicted in the diagram below, Jamba's architecture features a blocks-and-layers approach that allows it to integrate the Transformer and Mamba architectures together. Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron (MLP), producing an overall ratio of one Transformer layer out of every eight total layers.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/jamba_architecture.png"
alt="drawing" width="600"/>
Collaborator

very nice 🔥

Contributor Author

🙂
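
The excerpt above describes the interleaving ratio (one attention layer out of every eight, the rest Mamba). As an illustration only, here is a minimal sketch of how such a pattern can be laid out; the parameter names `attn_layer_period` and `attn_layer_offset` are assumptions loosely modeled on the config discussed in this PR, not the library code:

```python
def jamba_layer_pattern(num_hidden_layers: int, attn_layer_period: int = 8, attn_layer_offset: int = 4):
    # One "attention" layer per period; every other layer is "mamba".
    return [
        "attention" if i % attn_layer_period == attn_layer_offset else "mamba"
        for i in range(num_hidden_layers)
    ]

print(jamba_layer_pattern(16))
# ['mamba', 'mamba', 'mamba', 'mamba', 'attention', 'mamba', 'mamba', 'mamba',
#  'mamba', 'mamba', 'mamba', 'mamba', 'attention', 'mamba', 'mamba', 'mamba']
```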

You can run the model without the optimized Mamba kernels, but it is **not** recommended as it will result in significantly higher latency. In order to do that, you'll need to specify `use_mamba_kernels=False` when loading the model.

### Run the model
Please note that, at the moment, `trust_remote_code=True` is required for running the new Jamba architecture.
Collaborator

Suggested change (remove this line):
Please note that, at the moment, `trust_remote_code=True` is required for running the new Jamba architecture.

Comment on lines 56 to 57
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1",
                                             trust_remote_code=True)
Collaborator

Suggested change
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1",
                                             trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1")
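
Putting the two doc points above together: once this PR is merged, the checkpoint can be loaded without `trust_remote_code`, and the Mamba-kernel fallback mentioned earlier can be toggled at load time. A minimal sketch, assuming `use_mamba_kernels` is simply forwarded to the model config by `from_pretrained`, as the docs above imply:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    use_mamba_kernels=False,  # fall back to the slower pure-PyTorch Mamba path; omit to use the optimized kernels
)
```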

Comment on lines 1693 to 1713
if self._attn_implementation == "flash_attention_2":
    # 2d mask is passed through the layers
    attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
elif self._attn_implementation == "sdpa" and not output_attentions:
    # output_attentions=True can not be supported when using SDPA, and we fall back on
    # the manual implementation that requires a 4D causal mask in all cases.
    attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
        attention_mask,
        (batch_size, seq_length),
        inputs_embeds,
        past_key_values_length,
    )
else:
    # 4d mask is passed through the layers
    attention_mask = _prepare_4d_causal_attention_mask(
        attention_mask,
        (batch_size, seq_length),
        inputs_embeds,
        past_key_values_length,
        sliding_window=self.config.sliding_window,
    )
Collaborator

same comment about 4d mask! but it can be updated in another PR!

Contributor Author

Here as well I feel I'm missing something... what comment are we talking about?

Contributor Author

Oh probably about the cache_positions 🙂

Collaborator

It's about using _update_causal_mask from the gemma modeling code, which simplifies the whole logic a lot!

Collaborator

Yeah, pushing for this; _prepare_4d_causal_attention_mask and friends are just too scattered, and will be deprecated!

Collaborator

(No need for the cache position, you can pass the past_length.)
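
For context on what these helpers produce, here is a tiny standalone sketch (not the transformers helper itself) of a 4D additive causal mask of shape (batch, 1, query_len, kv_len), with padding and sliding-window handling omitted:

```python
import torch

batch_size, seq_length = 1, 4
min_value = torch.finfo(torch.float32).min

# 0 on and below the diagonal (visible), a large negative value above it (masked).
mask = torch.full((seq_length, seq_length), min_value).triu(diagonal=1)
causal_4d = mask[None, None, :, :].expand(batch_size, 1, seq_length, seq_length)
print(causal_4d.shape)  # torch.Size([1, 1, 4, 4])
```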

Comment on lines 1876 to 1879
if calc_logits_for_entire_prompt:
    logits = self.lm_head(hidden_states)
else:
    logits = self.lm_head(hidden_states[..., -1:, :])
Collaborator

Mmmm could you explain the motivations behind this?

Contributor Author

Sure.
It's pretty much explained in the docstring for calc_logits_for_entire_prompt in configuration_jamba.py. For long sequences, the logits can take a lot of GPU memory, especially as they are saved in FP32. So for a prompt of 128K tokens, with our vocab size of 64K, the logits for the prompt alone take 32GB of GPU memory (128K × 64K × 4 bytes). The thing is that in order to generate from the model, we don't need all the prompt logits - just those of the last token. Anyway, the GenerationMixin takes only the logits of the last prompt token (next_token_logits = outputs.logits[:, -1, :] appears many times in src/transformers/generation/utils.py). So we want to save all this unnecessary memory and compute only the logits we need.
Honestly, we were a bit surprised to see that the standard in transformers is to calculate the logits for the entire prompt when generating. I understand that for relatively short prompts this doesn't add up to a lot of extra memory, but for long prompts it's a complete waste.
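
A quick back-of-the-envelope check of the 32GB figure above (plain arithmetic, nothing Jamba-specific):

```python
# FP32 logits for a 128K-token prompt with a 64K vocabulary.
prompt_tokens = 128 * 1024
vocab_size = 64 * 1024
bytes_per_fp32 = 4

print(prompt_tokens * vocab_size * bytes_per_fp32 / 1024**3)  # 32.0 -> ~32GB just for the prompt logits
```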

Collaborator

That's where I say: when we generate with transformers, we only pass a single input id starting from the second forward pass. Which is why we never need this; the hidden states generated after the first forward pass always have sequence length 1!
This can be safely removed.

Collaborator

Ah sorry, you mean the first forward pass as well. Not a fan of having code paths for this, plus, as @gante said, assisted generation will fail.

Contributor Author

Yeah, I was talking about the first forward pass.
RE assisted generation - you're right, and that's why we kept this as a config option. If a user wants to use assisted generation, they can set calc_logits_for_entire_prompt to True in the config and everything will work.
As you saw, part of the Jamba promise is to be able to fit long sequences (~140K) on a single 80GB GPU (with int8 weight quantization). If we calculate the logits for the entire prompt, that's not possible. That's the reason we feel that, by default, the entire prompt's logits shouldn't be calculated during generation. If the user wants/needs that, they can do so by setting the appropriate attribute in the config.

Collaborator

Alright, let's leave it for now. This will be streamlined into generate, as this can be a generation config argument set if you use use_cache and no assisted_decoding.
It will be deprecated in the near future, as this is mostly for inference.

Member

Suggestion: if we make the flag an integer, e.g. num_logits_to_keep: Optional[int], then it can easily become compatible with assisted generation

if num_logits_to_keep is None:
    logits = self.lm_head(hidden_states)
else:
    logits = self.lm_head(hidden_states[..., -num_logits_to_keep:, :])

Contributor Author

@gante - just to make sure I understand: if we do that, we'll still need to modify the assisted generation code to set num_logits_to_keep to candidate_length, correct?
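
A tiny self-contained illustration of the proposed integer flag with a dummy `lm_head` (the names and sizes here are made up for the example, not taken from the PR):

```python
import torch

batch, seq_len, hidden, vocab = 2, 5, 8, 16
hidden_states = torch.randn(batch, seq_len, hidden)
lm_head = torch.nn.Linear(hidden, vocab, bias=False)

# Plain generation only needs the last position's logits on the prompt pass;
# assisted decoding would keep more (on the order of the candidate length).
num_logits_to_keep = 1
logits = lm_head(hidden_states[:, -num_logits_to_keep:, :])
print(logits.shape)  # torch.Size([2, 1, 16])
```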

@@ -0,0 +1,830 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
Collaborator

Suggested change
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.



@require_torch
@unittest.skip("Update once we have a tiny Jamba model")
Collaborator

great TODO! Tiny logits would be awesome!

  • expected loss to make sure we compute the same!

Contributor Author

🙂

Collaborator

Only thing left here!
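
A rough sketch of what such an integration test could look like; the checkpoint name and the expected value below are placeholders, not the ones actually added in this PR:

```python
import unittest

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class JambaTinyModelIntegrationTest(unittest.TestCase):
    def test_tiny_model_loss(self):
        repo_id = "hf-internal-testing/tiny-random-jamba"  # placeholder repo id
        tokenizer = AutoTokenizer.from_pretrained(repo_id)
        model = AutoModelForCausalLM.from_pretrained(repo_id)

        inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
        with torch.no_grad():
            output = model(**inputs, labels=inputs["input_ids"])

        # Hard-code the value observed once the tiny checkpoint is frozen.
        expected_loss = torch.tensor(10.0)  # placeholder
        torch.testing.assert_close(output.loss, expected_loss, rtol=1e-3, atol=1e-3)
```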

@ArthurZucker (Collaborator) left a comment

Looks a lot cleaner

Comment on lines 91 to 93
n_ctx (`int`, *optional*, defaults to 262144):
    This value doesn't have any real effect. The maximum sequence length that this model is intended to be
    used with. It can be used with longer sequences, but performance may degrade.
Collaborator

Ah that's a very good point. Let's use max_position_embeddings and also include in the comment that it's used for evaluating! (it's not doing nothing!)

Comment on lines 259 to 260
def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
    return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
Collaborator

Suggested change (remove these lines):
def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
    return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()

Comment on lines 752 to 753
if self.attention_layer_idx is not None and layer_idx == self.attention_layer_idx:
    self._seen_tokens += key_states.shape[-2]
Collaborator

we no longer use the self._seen_tokens attribute and rely on cache_positions instead; that should simplify things
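
For reference, a minimal sketch of the cache_position pattern used in models like gemma, instead of tracking a _seen_tokens counter (the values here are illustrative):

```python
import torch

past_seen_tokens = 10  # tokens already stored in the cache
seq_length = 3         # tokens in the current forward pass

cache_position = torch.arange(past_seen_tokens, past_seen_tokens + seq_length)
print(cache_position)  # tensor([10, 11, 12]) -> cache slots written this step
```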


"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
"output_router_logits": output_router_logits,
"calc_logits_for_entire_prompt": self.config.calc_logits_for_entire_prompt,
Collaborator

if use_cache, this could always be set to False if there is no self.generation_config.assistant; that could be handled here

tests/models/jamba/test_modeling_jamba.py: resolved comment
Comment on lines 1420 to 1423
past_seen_tokens = (
    past_key_values.get_seq_length()
    past_key_values.get_seq_length(self.config.layers_block_type.index("attention"))
    if isinstance(past_key_values, HybridMambaAttentionDynamicCache)
    else 0
Collaborator

actually we should assume cache positions are passed for this model

Comment on lines 722 to 728
def get_seq_length(self, layer_idx: Optional[int] = 0) -> int:
    """Returns the sequence length of the cached states. A layer index can be optionally passed."""
    if len(self.key_cache) <= layer_idx:
        return 0
    if self.layers_block_type[layer_idx] == "mamba":
        raise ValueError("Can't return seq_length from Mamba layers cache as it doesn't have a sequence length dimension.")
    return self.key_cache[layer_idx].shape[-2]
Collaborator

I would rather not have this; cache positions SHOULD be passed. This is only there in llama for legacy reasons.

Comment on lines 743 to 748
def to_legacy_cache(self) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]:
    raise NotImplementedError("HybridMambaAttentionDynamicCache does not have a legacy cache equivalent.")

@classmethod
def from_legacy_cache(cls, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None) -> "DynamicCache":
    raise NotImplementedError("HybridMambaAttentionDynamicCache does not have a legacy cache equivalent.")
Collaborator

same for both

@ArthurZucker (Collaborator)

All the rest you added LGTM

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tomeras91 tomeras91 marked this pull request as draft April 17, 2024 13:22
@tomeras91 tomeras91 marked this pull request as ready for review April 17, 2024 15:25
@ArthurZucker (Collaborator) left a comment

2 small comments and should be good to go!

src/transformers/models/jamba/modeling_jamba.py: 3 resolved comments (outdated)
@ArthurZucker (Collaborator) left a comment

🚀 Great work everyone!

@ArthurZucker ArthurZucker merged commit 3f20877 into huggingface:main Apr 18, 2024
23 checks passed
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Apr 18, 2024
* Add jamba arch

* apply "make fix-copies" changes

* fix link to model in JambaConfig docstring

* Add n_ctx in modeling file because repo-consistency wants that

* Add jamba to flash attention and sdpa documentation

* mamba dt_proj quant fix now works for LoRA as well

* override test_left_padding_compatibility and use a more permissive tolerance. left padding numerical differences are accentuated by mamba layers

* add jamba to tokenization auto

* fix comments of shape (PR huggingface#24 in the model page: https://huggingface.co/ai21labs/Jamba-v0.1/discussions/24)

* simple PR fixes

* remove unnecessary kwargs from JambaAttentionDecoderLayer and JambaMambaDecoderLayer

* remove the LoRA hack for the mamba dt_proj bias. It was solved in huggingface/peft#1530 (huggingface/peft#1530)

* Add copied comment on JambaMLP (it's the same as MixtralMLP)

* remove padding_mask warnings. It's not supported anymore

* fix docstring. Float instead of int

* A few more minor PR fixes

* (1) lowercase names for mamba layernorms (2) remove _apply_inner_layernorms and do it directly in the forward pass

* Return None attention weights from mamba layers. Append to all attentions only if not None.

* remove some leftover jamba archive lists

* Better separation between expert vs non-expert layers. non-expert layers return None as router_logits, and it is not concatenated to all_router_logits returned from JambaModel

* no need to take router_logits at config.expert_layer_offset anymore. result.router_logits now holds results only for expert layers

* Add Jamba paper on READMEs

* (1) rename n_ctx -> max_position_embeddings (2) don't use it in the modeling file since it's not needed (set it as an exception to check_config_attributes)

* Add copied from comment

* remove the code path for apply_inner_layernorms=False. Jamba always has the inner mamba layernorms

* clearer docstring for _convert_to_standard_cache

* style fixes

* Change calc_logits_for_entire_prompt (bool) to num_logits_to_keep (int). Adapt assisted decoding code to use it. Also small change in low memory beam search decoding path to support this new int value in model_inputs

* rename test so it still overrides what it's meant to override

* draft

* oups

* nit

* remove more complex logic

* fix names used in config

* fix fix fix

* style

* fix some more failing tests

* generate did not init the cache 🙃

* more small nits

* typo

* config.mamba_expand * config.hidden_size for the intermediate size of the mamba shapes

* fix init of pkv with torch.tensor()

* empty tensor

* fix some init issues

* stupid changes required by generate because it does not even support its own DynamicCache class

* more fixes

* fix general assisted gen cache_position bug

* tests passing

* Add offsets and periods as SPECIAL_CASES_TO_ALLOW in check_config_attributes.py

* fix reorder_cache to reorder mamba states and override some more functions in HybridMambaAttentionDynamicCache

* no need to override test_past_key_values_format() and _check_past_key_values_for_generate() in tests anymore

* fix docstrings and typehints for past_key_values

* style fixes

* fix docs

* change typehint due to copy from Mixtral

* forgot import

* import order

* Add configuration_jamba and modeling_jamba to not_doctested because the model is too big to download (in docstring of JambaForCausalLM.forward)

* Add integration test with tiny random Jamba model on hub

* fix flash attention cache shapes

* bring back forgotten hidden states

* rename HybridMambaAttentionDynamicCache.seqlen_offset to has_previous_state (and make bool) and bugfix - it should be set to True after a finished forward pass of the entire model

* align integration test after modeling fixes

* bugfix - mamba can use precomputed states only if forward pass is on a single token

* bugfix - mamba can use precomputed states only if they match the batch size

* typo

* remove making _prepare_4d_causal_attention_mask a leaf function

* stop using past_seq_len.get_seq_length(). Use cache positions instead. Adjust test (test_decoder_model_past_with_large_inputs) accordingly

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Joao Gante <joao@huggingface.co>
ArthurZucker added a commit that referenced this pull request Apr 22, 2024
ydshieh pushed a commit that referenced this pull request Apr 23, 2024