# 1. Introduction to Mamba
Mamba is a new architecture for large language models (LLMs) that handles long sequences more efficiently than traditional models such as Transformers. It uses a selective state space model (SSM) that filters and processes information based on content, so the model can selectively remember or ignore parts of the input. Mamba offers significant improvements in processing speed and scaling, especially on longer sequences.
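
To make the idea of selection concrete, here is a toy sketch in plain Python. It is not the actual Mamba kernel: it is a scalar state-space recurrence whose step size depends on the current input. The function name `selective_scan_toy`, the parameters `w_delta` and `a`, and the simplified discretization are all assumptions chosen for illustration.

```python
import math

def selective_scan_toy(xs, w_delta=2.0, a=-1.0):
    """Toy scalar selective SSM: h_t = a_bar_t * h_{t-1} + b_bar_t * x_t.

    The step size delta_t is computed from the input itself, so each token
    decides how much of the old state to keep and how much of itself to write.
    All names and constants here are illustrative assumptions.
    """
    h = 0.0
    states = []
    for x in xs:
        delta = math.log1p(math.exp(w_delta * x))  # softplus keeps the step size positive
        a_bar = math.exp(delta * a)                # decay factor in (0, 1) because a < 0
        b_bar = delta                              # simplified input scaling
        h = a_bar * h + b_bar * x
        states.append(h)
    return states

# One "important" token followed by two tokens the toy model chooses to ignore
states = selective_scan_toy([3.0, -3.0, -3.0])
print([round(s, 2) for s in states])
```

The first input produces a large step size and overwrites the state. The next two produce a tiny step size, so the state is carried forward almost unchanged.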

But what really sets Mamba apart? Let's find out by working with the model hands-on.

# 2. Mamba model chat
While the current base implementation provides the familiar from_pretrained method and basic generation parameters, some functionality (such as repetition_penalty) is not available. Tools such as text-generation-webui (https://github.com/oobabooga/text-generation-webui) cannot be used with it either. So, to use Mamba, we will run inference from Python code. I've made the code as simple as possible.
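
Because a repetition penalty is not exposed here, it may help to see what such a penalty conventionally does to the logits, in case you write your own sampling loop. The sketch below follows the common rule of dividing positive logits and multiplying negative ones by the penalty. `apply_repetition_penalty` is a hypothetical helper, not part of mamba-ssm, and the logits are made-up numbers.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated tokens made less likely.

    Hypothetical helper, not part of mamba-ssm. Positive logits are divided by
    `penalty` and negative logits are multiplied by it, so both move downward.
    """
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized

logits = [2.0, -1.0, 0.5, 3.0]   # made-up scores for a 4-token vocabulary
already_generated = [0, 1, 0]    # tokens 0 and 1 were produced earlier
penalized_logits = apply_repetition_penalty(logits, already_generated, penalty=2.0)
print(penalized_logits)
```

Tokens 2 and 3 have not been generated yet, so their scores are left untouched.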



First, let's load the model.

In [None]:
!pip install causal-conv1d==1.0.0
!pip install mamba-ssm==1.0.1
!pip install transformers
!pip install datasets

Collecting causal-conv1d==1.0.0
  Downloading causal_conv1d-1.0.0.tar.gz (6.4 kB)
  Preparing metadata (setup.py) ... done
Collecting ninja (from causal-conv1d==1.0.0)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.2/307.2 kB 4.8 MB/s eta 0:00:00
Building wheels for collected packages: causal-conv1d
  Building wheel for causal-conv1d (setup.py) ... done
  Created wheel for causal-conv1d: filename=causal_conv1d-1.0.0-cp310-cp310-linux_x86_64.whl size=9116761 sha256=4bbd2c2672ecd02c1e43f8e52552de593099619abc6dda18b2ac650e08110124
  Stored in directory: /root/.cache/pip/wheels/9a/48/f5/eb0c6d6d8e00131eaa57927b537a23832b37e2f01b801d9c5d
Successfully built causal-conv1d
Installing collected packages: ninja, causal-conv1d
Successfully installed causal-conv1d-1.0.0 ninja-1.11.1.1
Collecting mamba-ssm==1.0.1
  Downloading mamba_ssm-1.0

In [None]:
# Set the environment variable for LD_LIBRARY_PATH
%env LD_LIBRARY_PATH=/usr/lib64-nvidia

env: LD_LIBRARY_PATH=/usr/lib64-nvidia


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
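# Each `!` command runs in its own subshell, so `!export` does not persist;
# %env sets the variables for the whole notebook session instead
%env LC_ALL=en_US.UTF-8
%env LD_LIBRARY_PATH=/usr/lib64-nvidia
%env LIBRARY_PATH=/usr/local/cuda/lib64/stubs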
!ldconfig /usr/lib64-nvidia

/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_0.so.3 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbb.so.12 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc.so.2 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind.so.3 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc_proxy.so.2 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_5.so.3 is not a symbolic link



In [None]:
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from transformers import AutoTokenizer

# Determine if a GPU is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
model = MambaLMHeadModel.from_pretrained(
    "state-spaces/mamba-1.4b",
    device=device,  # Use the device variable here
    dtype=torch.float16
).to(device)  # Move the model to the specified device

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [None]:
# Check NVIDIA GPU status
!nvidia-smi

Mon Jan 22 10:58:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              51W / 400W |   3327MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:


prompt=\
"""A conversation between a user and a smart AI assistant.

### User: Hello!
### Assistant:"""

# Tokenize the prompt and move it to the same device as the model
prompt_tokenized=tokenizer(prompt, return_tensors="pt").to(device)

# from https://github.com/state-spaces/mamba/blob/main/benchmarks/benchmark_generation_mamba_simple.py#L54
output_tokenized = model.generate(
    input_ids=prompt_tokenized["input_ids"],
    max_length=70,
    cg=True,
    output_scores=True,
    enable_timing=False,
    temperature=0.7,
    top_k=40,
    top_p=0.1,
    )
output=tokenizer.decode(output_tokenized[0])

print(output)

A conversation between a user and a smart AI assistant.

### User: Hello!
### Assistant: Hello!
### User: I'm hungry!
### Assistant: I'm hungry too!
### User: I'm thirsty!
### Assistant: I'm thirsty too!
### User: I'm tired!
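
The generation call above combines `top_k=40` with `top_p=0.1`. As a rough illustration of how these two filters interact, here is a plain-Python sketch on a made-up probability list. The helper `filter_top_k_top_p` is an assumption for illustration; the mamba-ssm implementation works on logit tensors and may differ in detail.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Return the token indices that survive top-k and then top-p filtering.

    Illustrative sketch on a plain probability list; real implementations
    work on logit tensors.
    """
    # Rank tokens from most to least likely and keep at most top_k of them
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break
    return kept

probs = [0.5, 0.3, 0.1, 0.05, 0.05]  # made-up next-token distribution
kept_strict = filter_top_k_top_p(probs, top_k=40, top_p=0.1)
kept_loose = filter_top_k_top_p(probs, top_k=40, top_p=0.85)
print(kept_strict, kept_loose)
```

With `top_p=0.1`, only the single most likely token survives whenever it holds at least 10% of the probability mass, so sampling is close to greedy. That may be one reason the reply above falls into a repetitive pattern.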



# Load the dataset and tokenize it

In [None]:
from datasets import load_dataset

dataset=load_dataset("OpenAssistant/oasst_top1_2023-08-25")

In [None]:
import os

def tokenize(element):
    return tokenizer(
        element["text"],
        truncation=True,
        max_length=1024,
        add_special_tokens=False,
    )


dataset_tokenized = dataset.map(
    tokenize,
    batched=True,
    num_proc=os.cpu_count(),    # one worker process per CPU core
    remove_columns=["text"]     # don't need this anymore, we have tokens from here on
)

# Define the collate function

In [None]:
tokenizer.pad_token = tokenizer.eos_token

# collate function - transforms a list of dictionaries [ {input_ids: [123, ..]}, {..} ] into a single batch dictionary { input_ids: [..], labels: [..] }
def collate(elements):
    tokenlist=[e["input_ids"] for e in elements]
    tokens_maxlen=max([len(t) for t in tokenlist])

    input_ids,labels = [],[]
    for tokens in tokenlist:
        pad_len=tokens_maxlen-len(tokens)

        # pad input_ids with pad_token and labels with ignore_index (-100); no attention_mask is built because the forward pass used here does not take one
        input_ids.append( tokens + [tokenizer.pad_token_id]*pad_len )
        labels.append( tokens + [-100]*pad_len )

    batch={
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }
    return batch
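
To see what the padding step produces without loading torch or the tokenizer, here is a pure-Python sketch of the same logic on two tiny sequences. `pad_batch` is a hypothetical stand-in for `collate`, and `pad_id=0` is a placeholder, not the real `tokenizer.pad_token_id`.

```python
def pad_batch(token_lists, pad_id, ignore_index=-100):
    """Pure-Python version of the padding step, for inspection without torch."""
    max_len = max(len(t) for t in token_lists)
    # Inputs are padded with the pad token; labels are padded with ignore_index
    input_ids = [t + [pad_id] * (max_len - len(t)) for t in token_lists]
    labels = [t + [ignore_index] * (max_len - len(t)) for t in token_lists]
    return {"input_ids": input_ids, "labels": labels}

batch = pad_batch([[11, 12, 13], [21]], pad_id=0)  # pad_id=0 is a placeholder value
print(batch["input_ids"])
print(batch["labels"])
```

The shorter sequence is right-padded in both lists, but only the labels use -100, which the loss function skips.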

# Prepare Mamba Trainer

In [None]:
# monkey patch MambaLMHeadModel.forward
def forward_with_loss(self, input_ids, position_ids=None, inference_params=None, num_last_tokens=0, labels = None):
    """
    "position_ids" is just to be compatible with Transformer generation. We don't use it.
    num_last_tokens: if > 0, only return the logits for the last n tokens
    """
    hidden_states = self.backbone(input_ids, inference_params=inference_params)
    if num_last_tokens > 0:
        hidden_states = hidden_states[:, -num_last_tokens:]
    lm_logits = self.lm_head(hidden_states)

    # Source: https://github.com/huggingface/transformers/blob/80377eb018c077dba434bc8e7912bcaed3a64d09/src/transformers/models/llama/modeling_llama.py#L1196
    from collections import namedtuple  # needed for the CausalLMOutput fallback below
    from torch.nn import CrossEntropyLoss
    if labels is not None:
        logits = lm_logits
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        # Flatten the tokens
        loss_fct = CrossEntropyLoss()
        # shift_logits = shift_logits.view(-1, self.config.vocab_size)
        shift_logits = shift_logits.view(-1, self.backbone.embedding.weight.size()[0])
        shift_labels = shift_labels.view(-1)
        # Enable model parallelism
        shift_labels = shift_labels.to(shift_logits.device)
        loss = loss_fct(shift_logits, shift_labels)
        return (loss,)
    else:
        CausalLMOutput = namedtuple("CausalLMOutput", ["logits"])
        return CausalLMOutput(logits=lm_logits)

# patch MambaLMHeadModel
MambaLMHeadModel.forward=forward_with_loss

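# (re)load the model in fp32 (no dtype argument); the fp16 mixed precision
# enabled in the Trainer below expects full-precision weights
# model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-1.4b", device="cuda", dtype=torch.float16)
model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-1.4b", device="cuda")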
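
The label shift inside `forward_with_loss` can be hard to picture. The following plain-Python sketch reproduces the same arithmetic for a single sequence: the logits at position t are scored against the token at position t+1, and positions labelled -100 are skipped. `shifted_cross_entropy` and the numbers are illustrative assumptions, not part of the patched model.

```python
import math

def shifted_cross_entropy(logits, labels, ignore_index=-100):
    """Mean next-token cross-entropy for one sequence, in plain Python.

    logits: one list of vocabulary scores per position
    labels: one token id per position (ignore_index marks padding)
    """
    shift_logits = logits[:-1]   # predictions made at positions 0..n-2
    shift_labels = labels[1:]    # targets are the tokens at positions 1..n-1
    losses = []
    for scores, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue             # padded positions contribute nothing
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses)

# Made-up 3-token vocabulary, 3 positions, the last one is padding
logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
labels = [2, 1, -100]
loss = shifted_cross_entropy(logits, labels)
print(round(loss, 4))
```

Only the first position contributes here: its uniform logits are scored against token 1, giving a loss of ln 3. The second position's target is padding, and the last position has no next token to predict.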


# Training the Mamba model

In [None]:
!pip install transformers[torch]

Collecting accelerate>=0.20.3 (from transformers[torch])
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 270.9/270.9 kB 7.3 MB/s eta 0:00:00
Installing collected packages: accelerate
Successfully installed accelerate-0.26.1


In [None]:
import os
from transformers import Trainer, TrainingArguments

# Create output directory
output_dir = '/content/drive/MyDrive/mamba-1'
os.makedirs(output_dir, exist_ok=True)


bs=4        # batch size
ga_steps=1  # gradient acc. steps
epochs=3
steps_per_epoch=len(dataset_tokenized["train"])//(bs*ga_steps)
lr=0.0005

args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    evaluation_strategy="steps",
    logging_steps=1,
    eval_steps=steps_per_epoch,
    save_steps=steps_per_epoch,
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="constant",
    learning_rate=lr,
    group_by_length=True,
    bf16=False,                  # mixed precision training
    fp16=True,
    save_safetensors=False,     # saving will fail without this
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=collate,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
)

trainer.train()

Step,Training Loss,Validation Loss
3236,2.3146,2.887654
6472,0.0,
9708,0.0,


TrainOutput(global_step=9711, training_loss=1.7071560114827733, metrics={'train_runtime': 3208.7907, 'train_samples_per_second': 12.105, 'train_steps_per_second': 3.026, 'total_flos': 0.0, 'train_loss': 1.7071560114827733, 'epoch': 3.0})
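
The step counts in this log follow from the batch settings. As a sanity check, the sketch below searches for training-set sizes that reproduce both the evaluation interval (3236) and the final `global_step` (9711). It assumes the Trainer keeps the final partial batch of each epoch, which is its default; the function name `schedule` is made up for this example.

```python
import math

bs, ga_steps, epochs = 4, 1, 3  # values from the training cell

def schedule(num_examples):
    """Return (eval interval as computed in the cell, total optimizer steps)."""
    steps_per_epoch = num_examples // (bs * ga_steps)   # floor division, as in the cell
    batches_per_epoch = math.ceil(num_examples / bs)    # the last partial batch is kept
    total_steps = (batches_per_epoch // ga_steps) * epochs
    return steps_per_epoch, total_steps

# Which training-set sizes reproduce the logged numbers?
consistent_sizes = [n for n in range(12000, 14000) if schedule(n) == (3236, 9711)]
print(consistent_sizes)
```

Under that assumption the training split holds between 12,945 and 12,947 examples. Because the evaluation interval uses floor division while each epoch actually takes 3237 steps, the evaluations at steps 3236, 6472, and 9708 fall slightly before the epoch boundaries.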

# Next Steps and Insights for Fine-Tuning and Evaluating the Mamba Model

## 1. Adjust Learning Rate
- Initial learning rate of 0.0005 was ineffective.
- Reducing the learning rate to 0.00005 improved outcomes.

## 2. Evaluate the Mamba Model
- Evaluating a chat model is challenging because many quality judgments are subjective.
- Use benchmarks such as [EleutherAI's lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), human preference rankings such as Chatbot Arena, or an AI model acting as referee; see the [Chatbot Arena Leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).
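
One objective number that can sit alongside these subjective comparisons is perplexity. Assuming the validation loss logged by the Trainer is the mean per-token cross-entropy in nats (which is what `CrossEntropyLoss` returns by default), it converts to perplexity with a single exponential:

```python
import math

validation_loss = 2.887654  # the one validation loss logged in the run above
perplexity = math.exp(validation_loss)
print(f"perplexity = {perplexity:.2f}")
```

This only measures fit to the held-out split of the fine-tuning data, so it complements the chat-quality benchmarks rather than replacing them.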

## 3. Benchmarking and Skepticism
- Used the Mamba authors' benchmarks as a starting point.
- Treated the benchmark numbers with some skepticism, having no prior experience with Mamba to compare them against.

## 4. Learning Rate Reconsideration
- Original learning rate of 0.0005 potentially too high.
- Lack of clarity on Mamba's pre-training learning rate.

## 5. Further Fine-Tuning Experiments
- Experimented with even lower learning rates (3e-5 and 2e-5).
- Tested different numbers of training epochs and different datasets, including the OpenAssistant (OA) dataset used above and [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k).

## 6. Comparison with TinyLlama
- Mamba showed a significant speed advantage over TinyLlama.
- Mamba also used less VRAM and generated tokens faster.
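
A back-of-the-envelope estimate suggests where the VRAM difference during generation may come from: Mamba carries a fixed-size recurrent state, while a Transformer's key/value cache grows with every token. The model shapes below are assumptions taken from memory of the published configurations (Mamba-1.4B: 48 layers, d_model 2048, expand 2, d_state 16, d_conv 4; TinyLlama-1.1B: 22 layers, 4 key/value heads of size 64) and should be checked against each model's config. Weights and activations are ignored, and batch size is 1.

```python
BYTES_FP16 = 2

# Assumed Mamba-1.4B shape (verify against the model config)
def mamba_state_elements(n_layer=48, d_model=2048, expand=2, d_state=16, d_conv=4):
    d_inner = expand * d_model
    per_layer = d_inner * d_state + d_inner * d_conv  # SSM state + convolution state
    return n_layer * per_layer                        # independent of sequence length

# Assumed TinyLlama-1.1B shape (verify against the model config)
def kv_cache_elements(seq_len, n_layer=22, n_kv_heads=4, head_dim=64):
    return 2 * n_layer * n_kv_heads * head_dim * seq_len  # keys and values per token

mamba_mb = mamba_state_elements() * BYTES_FP16 / 1e6
for seq_len in (1_000, 10_000):
    kv_mb = kv_cache_elements(seq_len) * BYTES_FP16 / 1e6
    print(f"{seq_len:>6} tokens: Mamba state {mamba_mb:.1f} MB, KV cache {kv_mb:.1f} MB")
```

Under these assumptions the Mamba state stays at the same size however long the prompt gets, while the key/value cache scales linearly with the number of tokens.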

## 7. Long Context Capability of Mamba
- Tested Mamba's ability to handle long prompts (up to 10k tokens).
- Mamba struggles with very long texts (136K tokens), but performs better with shorter ones (1.54K tokens).
- Example: Triathlon article [triathlon features](https://www.tri247.com/triathlon-features/interviews/lionel-sanders-championship-preview).

## 8. Limitations in Generating High-Quality Content
- Mamba's pre-training limited to 2048 tokens might hinder its ability to summarize large texts.
- Suggestion to fine-tune smaller Mamba models for potential improvement.

## 9. Summary
- Mamba excels in speed and token handling capacity.
- Fine-tuning Mamba is currently complex, with anticipation for future improvements.
- TinyLlama generates better text, likely due to more extensive pre-training data.
