# First steps with Mixtral

## Goal

Verify that I can use the Mixtral model locally.

## Imports

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import gc
import time
import re
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

plt.plot()
plt.close('all')
plt.rcParams["figure.figsize"] = (20, 5)  
mpl.rcParams['lines.linewidth'] = 3
mpl.rcParams['font.size'] = 16

## Code

In [2]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.float16,
    bnb_4bit_use_double_quant= True,
    llm_int8_enable_fp32_cpu_offload= True)

torch.cuda.empty_cache()
gc.collect()

2048

In [3]:
# Previous location on the HDD, overridden by the path below:
# model_path = '/mnt/hdd0/Kaggle/llm_prompt_recovery/models/mixtral-8x7b-instruct-v0.1-hf'
model_path = '/home/gbarbadillo/data/mixtral-8x7b-instruct-v0.1-hf/'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,)

Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]

- 24 min when loading from HDD (reading at 62MB/s)
- 1 min when loading from SSD (reading at 1.5GB/s)
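
As a quick sanity check, both timings imply roughly the same amount of data read from disk. A minimal sketch of the arithmetic, assuming the read speed stayed roughly constant while loading and using decimal units (1 GB = 1000 MB); `implied_size_gb` is just an illustrative helper name:

```python
# Estimate the amount of data read, given the load time and the read speed.
# Assumption: the disk read speed stayed roughly constant while loading.

def implied_size_gb(minutes, read_speed_mb_per_s):
    """Return the amount of data read in GB (1 GB = 1000 MB)."""
    return minutes * 60 * read_speed_mb_per_s / 1000

hdd_size = implied_size_gb(minutes=24, read_speed_mb_per_s=62)
ssd_size = implied_size_gb(minutes=1, read_speed_mb_per_s=1500)
print(f'HDD: {hdd_size:.1f} GB, SSD: {ssd_size:.1f} GB')
```

Both estimates land around 90 GB, which suggests that loading time is dominated by the disk read speed.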

In [4]:
from transformers import pipeline, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id # this is needed to do batch inference
gc.collect()

141

In [5]:
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

def chat_with_mixtral(prompt, max_new_tokens=200, verbose=True, do_sample=False, temperature=0.7, top_p=0.95):
    if not prompt.startswith('<s>[INST]'):
        print('Formatting the prompt to Mixtral needs.')
        prompt = f'<s>[INST] {prompt} [/INST]'
    start = time.time()

    if do_sample:
        sampling_kwargs = dict(do_sample=True, temperature=temperature, top_p=top_p)
    else:
        sampling_kwargs = dict(do_sample=False)

    sequences = pipe(
        prompt ,
        max_new_tokens=max_new_tokens,
        # https://www.reddit.com/r/LocalLLaMA/comments/184g120/mistral_fine_tuning_eos_and_padding/
        # https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/discussions/106
        pad_token_id=tokenizer.eos_token_id,
        **sampling_kwargs,
        return_full_text=False,
    )
    response = sequences[0]['generated_text']
    response = re.sub(r'[\'"]', '', response)
    if verbose:
        stop = time.time()
        time_taken = stop-start
        n_tokens = len(tokenizer.tokenize(response))
        print(f"Execution Time : {time_taken:.1f} s, tokens per second: {n_tokens/time_taken:.1f}")
    return response

## Chatting

In [6]:
for _ in range(2):
    print(chat_with_mixtral('write a poem about real madrid', max_new_tokens=25))

Formatting the prompt to Mixtral needs.
Execution Time : 4.1 s, tokens per second: 6.4
 In the heart of Spain, where the passion runs deep,
Lies a team called Real Madrid, a treasure to keep
Formatting the prompt to Mixtral needs.
Execution Time : 2.8 s, tokens per second: 9.3
 In the heart of Spain, where the passion runs deep,
Lies a team called Real Madrid, a treasure to keep


In [7]:
print(chat_with_mixtral('Write an essay about the future of digital identity.', 200))

Formatting the prompt to Mixtral needs.
Execution Time : 21.4 s, tokens per second: 9.4
 Title: The Future of Digital Identity: A Realm of Endless Possibilities

Introduction

The digital landscape is evolving at an unprecedented pace, and the concept of digital identity is at the heart of this transformation. Digital identity is the collection of electronically stored data and information related to an individual, organization, or electronic device. As we become more interconnected, the future of digital identity is poised to undergo significant changes, promising endless possibilities. This essay explores the potential future developments in digital identity, focusing on areas such as self-sovereign identity, biometrics, artificial intelligence, and privacy.

Self-Sovereign Identity (SSI)

One of the most promising developments in digital identity is the emergence of self-sovereign identity (SSI). SSI is a model where individuals have sole ownership and control over their digital ide

- With `torch.float16` it generates at 10.4 tokens per second
- With `torch.bfloat16` it generated at 8.9 tokens per second

In [8]:
raise

RuntimeError: No active exception to reraise

## Input formatting

I'm not sure if I'm using the correct input format:

- https://www.kaggle.com/models/mistral-ai/mixtral/frameworks/PyTorch/variations/8x7b-instruct-v0.1-hf/versions/1
- https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

### Studying the tokenizer

#### Typical case

In [None]:
messages = [
    {"role": "user", "content": "say hi"},
]
tokenizer.apply_chat_template(messages, return_tensors="pt").numpy().tolist()[0]

In [None]:
prompt = f'<s>[INST] say hi [/INST]'
tokenizer.encode(prompt, add_special_tokens=False)

In [None]:
tokenizer.tokenize(prompt, add_special_tokens=False)

In [None]:
prompt = f'<s>[INST] say hi'
tokenizer.encode(prompt, add_special_tokens=False)

In this case we can see that the encoding is exactly the same. Note that I had to remove the space between `<s>` and `[INST]`.

#### Longer conversations

In [None]:
messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]
tokenizer.apply_chat_template(messages, return_tensors="pt").numpy().tolist()[0]

In [None]:
prompt = f'<s>[INST] Hi [/INST]Hello'
tokenizer.encode(prompt, add_special_tokens=False)

In [None]:
tokenizer.convert_ids_to_tokens(2)

The only difference is that the chat template appends the end-of-sequence token (id 2) after the assistant message, assuming the assistant turn had ended, whereas my manual prompt did not.

In [None]:
messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Bye"},
]
tokenizer.apply_chat_template(messages, return_tensors="pt").numpy().tolist()[0][-10:]

In [None]:
prompt = f'<s>[INST] Hi [/INST]Hello</s>[INST] Bye[/INST]'
tokenizer.encode(prompt, add_special_tokens=False)[-10:]
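
The format explored above can be written as a small pure-Python helper. This is only a sketch of the format used in the manual prompts of this notebook, not the tokenizer's actual chat template; `build_mixtral_prompt` is a hypothetical name, and its output should be compared against `tokenizer.apply_chat_template` before relying on it.

```python
BOS, EOS = '<s>', '</s>'

def build_mixtral_prompt(messages):
    """Build a Mixtral-instruct style prompt string from a list of chat messages."""
    prompt = BOS
    for message in messages:
        if message['role'] == 'user':
            prompt += f"[INST] {message['content']} [/INST]"
        elif message['role'] == 'assistant':
            # A finished assistant turn is closed with the EOS token.
            prompt += f"{message['content']}{EOS}"
        else:
            raise ValueError(f"Unsupported role: {message['role']}")
    return prompt

messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Bye"},
]
print(build_mixtral_prompt(messages))
```

The resulting string can be encoded with `add_special_tokens=False`, since it already contains the `<s>` token.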

### Checking the pipeline

In [None]:
pipe(f'<s>[INST] say hi [/INST]', do_sample=False, return_full_text=False, pad_token_id=tokenizer.eos_token_id, max_new_tokens=50)

In [None]:
pipe(f'[INST] say hi [/INST]', do_sample=False, return_full_text=False, pad_token_id=tokenizer.eos_token_id, max_new_tokens=50, add_special_tokens=True)

This example shows that by default the pipeline was not adding the special tokens (here the `<s>` token), but if I pass `add_special_tokens=True` I get the same results.

## Batch generation

In [None]:
prompts = [f'<s>[INST] What is the capital of {country}? Do not give any additional information, just say the capital and shut up.[/INST]The capital of {country} is: ' for country in ['Spain', 'France', 'Germany', 'Italy']]
pipe(prompts, do_sample=False, return_full_text=False, pad_token_id=tokenizer.eos_token_id, max_new_tokens=50)

In [9]:
prompts = [f'<s>[INST] What is the history of {country}? [/INST]' for country in ['Spain', 'France', 'Germany', 'Italy']]
pipe(prompts, do_sample=False, return_full_text=False, pad_token_id=tokenizer.eos_token_id, max_new_tokens=50)

[[{'generated_text': ' The history of Spain is long, complex, and spans several cultural stages and eras. Here is a brief overview:\n\n1. Prehistory: The Iberian Peninsula, where Spain is located, has been inhabited since the'}],
 [{'generated_text': ' The history of France is extensive and complex, spanning thousands of years, from the prehistoric period to the present day. Here is a brief overview of the major periods and events in French history:\n\n1. Prehistory: The area'}],
 [{'generated_text': ' The history of Germany is complex and rich, spanning many centuries and involving numerous different tribes, states, and cultures. Here is a brief overview:\n\nPrehistory and Ancient Times:\nThe area that is now Germany has been inhabited'}],
 [{'generated_text': ' The history of Italy is long and complex, spanning thousands of years and numerous civilizations. Here is a brief overview:\n\nPrehistory and Ancient Italy:\nThe Italian Peninsula has been inhabited since the Paleolithic'}]]

Passing multiple inputs to the pipe does not seem to speed up inference in any way. It works, but it is not faster.

By default the tokenizer adds the BOS token, so it is likely that the pipeline does it as well.

In [10]:
pipe_bs4 = pipeline(task="text-generation", model=model, tokenizer=tokenizer, batch_size=4)

In [11]:
prompts = [f'<s>[INST] What is the history of {country}? [/INST]' for country in ['Spain', 'France', 'Germany', 'Italy']]
pipe_bs4(prompts, do_sample=False, return_full_text=False, pad_token_id=tokenizer.eos_token_id, max_new_tokens=50)

[[{'generated_text': ' The history of Spain is long, complex, and spans several cultural stages and eras. Here is a brief overview:\n\n1. Prehistory: The Iberian Peninsula, where Spain is located, has been inhabited since the'}],
 [{'generated_text': ' The history of France is extensive and complex, spanning thousands of years, from the prehistoric period to the present day. Here is a brief overview of the major periods and events in French history:\n\n1. Prehistory: The area'}],
 [{'generated_text': ' The history of Germany is complex and rich, spanning many centuries and involving numerous different tribes, states, and cultures. Here is a brief overview:\n\nPrehistory and Ancient Times:\nThe area that is now Germany has been inhabited'}],
 [{'generated_text': ' The history of Italy is long and complex, spanning thousands of years and numerous civilizations. Here is a brief overview:\n\nPrehistory and Ancient Italy:\nThe Italian Peninsula has been inhabited since the Paleolithic'}]]

In [12]:
prompts = [f'<s>[INST] What is the history of {country}? [/INST]' for country in ['Spain', 'France', 'Germany', 'Italy, one of the most important countries on Europe']]
pipe_bs4(prompts, do_sample=False, return_full_text=False, pad_token_id=tokenizer.eos_token_id, max_new_tokens=50)

[[{'generated_text': ' The history of Spain is long, complex, and spans several cultural stages and eras. Here is a brief overview:\n\n1. Prehistory: The Iberian Peninsula, where Spain is located, has been inhabited since the'}],
 [{'generated_text': ' The history of France is extensive and complex, spanning thousands of years, from the prehistoric period to the present day. Here is a brief overview of the major periods and events in French history:\n\n1. Prehistory: The area'}],
 [{'generated_text': ' The history of Germany is complex and rich, spanning many centuries and involving numerous different tribes, states, and cultures. Here is a brief overview:\n\nPrehistory and Ancient Times:\nThe area that is now Germany has been inhabited'}],
 [{'generated_text': " Italy, as a country, has a long and complex history that dates back to ancient times. Here is an overview of the major periods in Italy's history:\n\n1. Prehistory: The Italian peninsula has been inhabited since the"}]]

GPU usage is higher, and generation is faster. Let's try to increase the batch size.

In [None]:
prompts = [f'<s>[INST] What is the history of {country}? [/INST]' for country in ['Spain']]
max_new_tokens = 25
for batch_size in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]: # OOM, 2048, 4096]:
    new_pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, batch_size=batch_size)
    t0 = time.time()
    output = new_pipe(prompts*batch_size, do_sample=False, return_full_text=False, pad_token_id=tokenizer.eos_token_id, max_new_tokens=max_new_tokens)
    t0 = time.time() - t0
    print(f'Batch size: {batch_size}\tExecution time: {t0:.1f} s, tokens per second: {max_new_tokens*batch_size/t0:.1f}')

```
Batch size: 1	Execution time: 3.1 s, tokens per second: 8.1
Batch size: 2	Execution time: 4.6 s, tokens per second: 10.8
Batch size: 4	Execution time: 4.6 s, tokens per second: 21.9
Batch size: 8	Execution time: 4.6 s, tokens per second: 43.3
Batch size: 16	Execution time: 4.7 s, tokens per second: 85.8
Batch size: 32	Execution time: 4.9 s, tokens per second: 164.0
Batch size: 64	Execution time: 5.1 s, tokens per second: 315.1
Batch size: 128	Execution time: 6.3 s, tokens per second: 511.9
Batch size: 256	Execution time: 8.4 s, tokens per second: 760.3
Batch size: 512	Execution time: 13.3 s, tokens per second: 963.9
Batch size: 1024	Execution time: 23.7 s, tokens per second: 1080.6
```

These are incredible speedups. If I batch the predictions I could do inference more than 100 times faster (1080.6 vs 8.1 tokens per second).
When using a big batch size I see the GPUs alternating at 100% usage.
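
To make the speedup explicit, the throughput numbers above can be compared against the batch size 1 baseline. A minimal sketch; the "efficiency" column is my own illustrative metric (speedup divided by batch size), not something reported by the pipeline:

```python
# Throughput (tokens per second) copied from the benchmark output above.
throughput = {
    1: 8.1, 2: 10.8, 4: 21.9, 8: 43.3, 16: 85.8, 32: 164.0,
    64: 315.1, 128: 511.9, 256: 760.3, 512: 963.9, 1024: 1080.6,
}

baseline = throughput[1]
for batch_size, tokens_per_second in throughput.items():
    speedup = tokens_per_second / baseline
    # An efficiency of 1.0 would mean throughput scales perfectly with batch size.
    efficiency = speedup / batch_size
    print(f'Batch size: {batch_size:4d}\tspeedup: {speedup:6.1f}x\tefficiency: {efficiency:.2f}')

max_speedup = throughput[1024] / baseline
```

Throughput keeps growing up to batch size 1024, but the gain per doubling shrinks noticeably from batch size 128 onwards.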

## Memory requirements

In [None]:
with open('/mnt/hdd0/Kaggle/llm_prompt_recovery/data/bible.txt', 'r') as f:
    bible = f.read()
print(bible[:100])

In [None]:
def print_gpu_memory():
    # Sum over all visible GPUs instead of hardcoding devices 0 and 1
    memory = sum(torch.cuda.memory_allocated(device) for device in range(torch.cuda.device_count()))
    max_memory = sum(torch.cuda.max_memory_allocated(device) for device in range(torch.cuda.device_count()))
    print(f'GPU memory allocated: {memory/1024**3:.1f} GB, max memory allocated: {max_memory/1024**3:.1f} GB')

### Input tokens

The allocated memory remains constant, so I have to log the max memory allocated instead.

In [None]:
input_tokens, max_memory, inference_time = [], [], []

for i in range(1, 20):
    text = bible[:int(i*4000)]
    input_tokens.append(len(tokenizer.tokenize(text)))
    print(f'Input tokens: {input_tokens[-1]}')
    t0 = time.time()
    chat_with_mixtral(text, max_new_tokens=25)
    inference_time.append(time.time() - t0)
    print_gpu_memory()
    max_memory.append(sum(torch.cuda.max_memory_allocated(device) for device in range(torch.cuda.device_count())))

In [None]:
plt.subplot(121)
plt.plot(input_tokens, np.array(max_memory)/1e9, marker='o')
plt.xlabel('Input tokens')
plt.ylabel('Max GPU memory allocated (GB)')
plt.grid()

plt.subplot(122)
plt.plot(input_tokens, np.array(inference_time), marker='o')
plt.xlabel('Input tokens')
plt.ylabel('Inference time (s)')
plt.grid()

It seems that both the memory and the inference time grow linearly with the number of input tokens.
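
The linear growth can be quantified with the batch size 1 measurements recorded in the "Batch size" section of this notebook. A minimal sketch that computes the slope between consecutive points; the variable names are illustrative, and the numbers come from pipeline calls with `max_new_tokens=25`:

```python
# (input tokens, max GPU memory allocated in bytes) for batch size 1,
# copied from the `data` list in the "Batch size" section.
measurements = [
    (1129, 25650272256),
    (2320, 26281251840),
    (4830, 27620456960),
]

slopes = []
for (tokens_a, mem_a), (tokens_b, mem_b) in zip(measurements, measurements[1:]):
    slope = (mem_b - mem_a) / (tokens_b - tokens_a)
    slopes.append(slope)
    print(f'{tokens_a} -> {tokens_b} tokens: {slope / 1e6:.2f} MB per input token')
```

The two slopes agree closely, at roughly 0.5 MB of extra peak memory per input token, which is consistent with linear growth.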

### Output tokens

In [None]:
def print_gpu_memory():
    # Sum over all visible GPUs instead of hardcoding devices 0 and 1
    memory = sum(torch.cuda.memory_allocated(device) for device in range(torch.cuda.device_count()))
    max_memory = sum(torch.cuda.max_memory_allocated(device) for device in range(torch.cuda.device_count()))
    print(f'GPU memory allocated: {memory/1024**3:.1f} GB, max memory allocated: {max_memory/1024**3:.1f} GB')

output_tokens, max_memory, inference_time = [], [], []

text = bible[:int(4000)]

for i in range(1, 10):
    t0 = time.time()
    response = chat_with_mixtral(text, max_new_tokens=int(25*i))
    output_tokens.append(len(tokenizer.tokenize(response)))
    print(f'Output tokens: {output_tokens[-1]}')
    inference_time.append(time.time() - t0)
    print_gpu_memory()
    max_memory.append(sum(torch.cuda.max_memory_allocated(device) for device in range(torch.cuda.device_count())))

In [None]:
plt.subplot(121)
plt.plot(output_tokens, np.array(max_memory)/1e9, marker='o')
plt.xlabel('Output tokens')
plt.ylabel('Max GPU memory allocated (GB)')
plt.grid()

plt.subplot(122)
plt.plot(output_tokens, np.array(inference_time), marker='o')
plt.xlabel('Output tokens')
plt.ylabel('Inference time (s)')
plt.grid()

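At the scale of the tokens that we are going to generate, memory is constant with respect to the output tokens and inference time scales linearly. Note that `torch.cuda.max_memory_allocated` reports the peak since the program started (or since the last `torch.cuda.reset_peak_memory_stats()` call), so the constant value may simply reflect the peak reached in the earlier input-token experiment.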

### Batch size

In [None]:
def print_gpu_memory():
    # Sum over all visible GPUs instead of hardcoding devices 0 and 1
    memory = sum(torch.cuda.memory_allocated(device) for device in range(torch.cuda.device_count()))
    max_memory = sum(torch.cuda.max_memory_allocated(device) for device in range(torch.cuda.device_count()))
    print(f'GPU memory allocated: {memory/1024**3:.1f} GB, max memory allocated: {max_memory/1024**3:.1f} GB')


text = bible[:int(16000)]
print(f'Input tokens: {len(tokenizer.tokenize(text))}')
prompts = [f'<s>[INST] {text} [/INST]']
max_new_tokens = 25

batch_size_range = [1, 2, 4]
max_memory, inference_time = [], []

for batch_size in batch_size_range:
    new_pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, batch_size=batch_size)
    t0 = time.time()
    output = new_pipe(prompts*batch_size, do_sample=False, return_full_text=False, pad_token_id=tokenizer.eos_token_id, max_new_tokens=max_new_tokens)
    inference_time.append((time.time() - t0)/batch_size)
    print_gpu_memory()
    max_memory.append(sum(torch.cuda.max_memory_allocated(device) for device in range(torch.cuda.device_count())))

In [None]:
print(inference_time)

In [None]:
data = [
    dict(input_tokens=1129,
         batch_size_range=[1, 2, 4, 8, 16],
         max_memory=[25650272256, 26341480448, 27512605696, 29923305472, 34771817472],
         inference_time=[3.436551809310913, 2.789551019668579, 1.6524068713188171, 1.117274284362793, 0.8535761386156082]),
    dict(input_tokens=2320,
         batch_size_range=[1, 2, 4, 8],
         max_memory=[26281251840, 27565897216, 30037116416, 35002349056],
         inference_time=[3.882540464401245, 3.33407199382782, 2.2599929571151733, 1.7502523958683014]),

    dict(input_tokens=4830,
         batch_size_range=[1, 2, 4],
         max_memory=[27620456960, 30234573312, 35392415744],
         inference_time=[5.286550998687744, 4.656355619430542, 3.7115885615348816]),
]

plt.subplot(121)
for experiment in data:
    plt.plot(experiment['batch_size_range'], np.array(experiment['max_memory'])/1e9, marker='o', label=f"{experiment['input_tokens']} input tokens")
plt.xlabel('Batch size')
plt.ylabel('Max GPU memory allocated (GB)')
xlim, ylim = plt.xlim(), plt.ylim()
plt.fill_between(xlim, [31]*2, [ylim[1]]*2, color='red', alpha=0.1)
plt.xlim(xlim)
plt.ylim(ylim)
plt.grid()
plt.legend(loc=0)

plt.subplot(122)
for experiment in data:
    plt.plot(experiment['batch_size_range'], experiment['inference_time'], marker='o', label=f"{experiment['input_tokens']} input tokens")
plt.xlabel('Batch size')
plt.ylabel('Inference time (s)')
plt.grid()
plt.legend(loc=0)

The memory grows linearly with the batch size, as expected.
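
A least-squares line through the 1129-token measurements gives an estimate of the memory cost per extra sample in the batch. This is a minimal pure-Python sketch; `least_squares_line` is a helper written for this example, and treating the 31 GB level shaded in red in the plot as the memory budget is my assumption:

```python
# Measurements for 1129 input tokens, copied from the `data` list above.
batch_sizes = [1, 2, 4, 8, 16]
max_memory = [25650272256, 26341480448, 27512605696, 29923305472, 34771817472]

def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

slope, intercept = least_squares_line(batch_sizes, max_memory)
print(f'Memory per extra sample: {slope / 1e9:.2f} GB, base memory: {intercept / 1e9:.2f} GB')

# Assumption: 31 GB (the red area in the plot) is the available memory budget.
memory_limit = 31e9
max_batch_size = int((memory_limit - intercept) // slope)
print(f'Estimated max batch size under {memory_limit / 1e9:.0f} GB: {max_batch_size}')
```

The same fit can be repeated for the 2320 and 4830 token experiments to see how the per-sample cost grows with the input length.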

## Context size

According to this [blogpost](https://huggingface.co/blog/mixtral) the context length is 32k tokens. Let's verify it.

In [None]:
n_characters = 20000
prompt = f'<s>[INST] {bible[:n_characters]} [/INST]'
print(f'Input tokens: {len(tokenizer.tokenize(prompt))}')
print(chat_with_mixtral(prompt, max_new_tokens=25))
prompt = f'<s>[INST] Summarize the following text: {bible[:n_characters]} [/INST]'
print(chat_with_mixtral(prompt, max_new_tokens=25))
print_gpu_memory()

In [None]:
n_characters = 40000
prompt = f'<s>[INST] {bible[:n_characters]} [/INST]'
print(f'Input tokens: {len(tokenizer.tokenize(prompt))}')
print(chat_with_mixtral(prompt, max_new_tokens=25))
prompt = f'<s>[INST] Summarize the following text: {bible[:n_characters]} [/INST]'
print(chat_with_mixtral(prompt, max_new_tokens=25))
print_gpu_memory()

In [None]:
n_characters = 80000
prompt = f'<s>[INST] {bible[:n_characters]} [/INST]'
print(f'Input tokens: {len(tokenizer.tokenize(prompt))}')
print(chat_with_mixtral(prompt, max_new_tokens=25))
prompt = f'<s>[INST] Summarize the following text: {bible[:n_characters]} [/INST]'
print(chat_with_mixtral(prompt, max_new_tokens=25))
print_gpu_memory()

In [None]:
n_characters = 105000
prompt = f'<s>[INST] {bible[:n_characters]} [/INST]'
print(f'Input tokens: {len(tokenizer.tokenize(prompt))}')
print(chat_with_mixtral(prompt, max_new_tokens=25))
prompt = f'<s>[INST] Summarize the following text: {bible[:n_characters]} [/INST]'
print(chat_with_mixtral(prompt, max_new_tokens=25))
print_gpu_memory()

```
Input tokens: 32999
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (32768). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
Execution Time : 25.0 s, tokens per second: 1.0
This is the book of the generations of Adam. In the day that God created man, in the likeness of God
Execution Time : 24.8 s, tokens per second: 1.0
The passage is the first book of the Bible, Genesis, and tells the story of the creation of the world and the
GPU memory allocated: 23.3 GB, max memory allocated: 39.7 GB
```

We see a warning about surpassing the limit of 32768 tokens. We don't have to worry about this because in the submission we won't have that much VRAM available.

We will have to work with a limit of around 12k input tokens due to RAM and time constraints.
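
To turn a token limit into a character budget, the ratio observed in the last run above can be used as a rough guide. A minimal sketch, assuming the characters-per-token ratio measured on this English Bible text carries over to other inputs (it may not); `max_characters` is a hypothetical helper:

```python
# Numbers from the last context-size run above.
n_characters = 105000
n_tokens = 32999  # includes the few tokens of the `<s>[INST] ... [/INST]` wrapper

chars_per_token = n_characters / n_tokens
print(f'Characters per token: {chars_per_token:.2f}')

def max_characters(token_limit, chars_per_token=chars_per_token):
    """Approximate number of characters that fit in a given token budget."""
    return int(token_limit * chars_per_token)

print(f'Character budget for 12k tokens: {max_characters(12000)}')
```

For a hard limit it is safer to tokenize the text and truncate the token list, since the ratio varies with the content.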

Torch manages memory in a weird way: depending on the previous executions, I would get an OOM error or not.
For example, I could not call the chat twice with the biggest input.

Maybe I could if I called the pipe with a list?

## TODO

- [x] What is the message: ``Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.``
- [x] Try using `tokenizer.apply_chat_template()`, what is the correct input format?
- [x] Can I use batches to speed up generation? GPU use is around 13% when generating data
- [ ] Maybe in another notebook: set up a pipeline to evaluate different prompts. This is the way to do prompt engineering: try a prompt, evaluate, iterate.