# First steps with Mixtral

## Goal

Verify that I can use the Mixtral model locally.

## Imports

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import gc
import time
import re

## Code

In [None]:
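bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,   # dtype used for the computations
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    llm_int8_enable_fp32_cpu_offload=True,  # keep modules offloaded to CPU in fp32
)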

torch.cuda.empty_cache()
gc.collect()
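
As a rough sanity check on why 4-bit quantization is used here, the weight footprint can be estimated from the parameter count. The sketch below assumes about 46.7 billion total parameters for Mixtral 8x7B (the figure reported by Mistral AI, not something measured in this notebook) and ignores quantization constants, activations and the KV cache. `estimate_weight_gb` is a hypothetical helper.

```python
def estimate_weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (10**9 bytes), ignoring all overheads."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 46.7e9  # assumption: total parameter count of Mixtral 8x7B

for name, bits in [("float16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{name}: ~{estimate_weight_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the float16 footprint.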

In [None]:
# Copy on the HDD, kept for reference (overridden by the path below):
# model_path = '/mnt/hdd0/Kaggle/llm_prompt_recovery/models/mixtral-8x7b-instruct-v0.1-hf'
model_path = '/home/gbarbadillo/data/mixtral-8x7b-instruct-v0.1-hf/'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

- Loading takes about 24 min from HDD (reading at 62 MB/s)
- Loading takes about 1 min from SSD (reading at 1.5 GB/s)
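
These two timings are consistent with each other: multiplying each load time by its read speed gives roughly the same amount of data read. A quick check using only the rounded figures above, so the result is approximate (`data_read_gb` is a hypothetical helper):

```python
def data_read_gb(minutes: float, speed_mb_per_s: float) -> float:
    """Data read in GB (1 GB = 1000 MB) at a constant read speed."""
    return minutes * 60 * speed_mb_per_s / 1000


hdd_gb = data_read_gb(24, 62)    # HDD: 24 min at 62 MB/s
ssd_gb = data_read_gb(1, 1500)   # SSD: 1 min at 1.5 GB/s
print(f"HDD: ~{hdd_gb:.0f} GB, SSD: ~{ssd_gb:.0f} GB")
# prints "HDD: ~89 GB, SSD: ~90 GB"
```

Both estimates land near 90 GB, which suggests that loading time is dominated by disk read speed.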

In [None]:
from transformers import pipeline  # AutoTokenizer is already imported above
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True)
gc.collect()

In [None]:
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

def chat_with_mixtral(prompt, max_new_tokens=200, verbose=True, temperature=0.7, top_p=0.95):
    """Generate a reply from Mixtral, wrapping the prompt in [INST] tags if needed."""
    if not prompt.startswith('<s> [INST]'):
        print('Wrapping the prompt in the Mixtral instruction format.')
        prompt = f'<s> [INST] {prompt} [/INST]'
    start = time.time()
    sequences = pipe(
        prompt,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
    )
    # generated_text contains the prompt followed by the completion
    response = sequences[0]['generated_text']
    # Drop the prompt and strip all single and double quotes from the completion
    response = re.sub(r'[\'"]', '', response.replace(prompt, ''))
    if verbose:
        stop = time.time()
        time_taken = stop - start
        n_tokens = len(tokenizer.tokenize(response))
        print(f"Execution Time : {time_taken:.1f} s, tokens per second: {n_tokens/time_taken:.1f}")
    return response
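
The manual `[INST]` wrapping above could alternatively be delegated to the tokenizer's chat template. This is a minimal sketch, assuming the tokenizer ships a chat template (this has not been verified for the local copy of the model). `format_with_chat_template` is a hypothetical helper that accepts any object exposing `apply_chat_template`.

```python
from typing import Any


def format_with_chat_template(tokenizer: Any, user_message: str) -> str:
    """Build a single-turn prompt string using the tokenizer's own chat template."""
    messages = [{"role": "user", "content": user_message}]
    # tokenize=False returns the formatted prompt as text instead of token ids
    return tokenizer.apply_chat_template(messages, tokenize=False)


# Usage with the tokenizer loaded earlier in the notebook (not run here):
# prompt = format_with_chat_template(tokenizer, 'write a poem about real madrid')
# print(prompt)
```

The exact string produced depends on the template stored with the checkpoint, so it may not match the `'<s> [INST]'` prefix check in `chat_with_mixtral`.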

## Chatting

In [None]:
for _ in range(2):
    print(chat_with_mixtral('write a poem about real madrid', max_new_tokens=25, temperature=1e-10))

In [None]:
print(chat_with_mixtral('Write an essay about the future of digital identity.', 200))

- Generation runs at 10.4 tokens per second when using `torch.float16`
- With `torch.bfloat16` it drops to 8.9 tokens per second

## TODO

- [ ] What does the message ``Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.`` mean?
- [ ] Try using `tokenizer.apply_chat_template()`. What is the correct input format?
- [ ] Can I use batches to speed up generation? GPU utilization is only around 13% while generating.
- [ ] Maybe in another notebook: set up a pipeline to evaluate different prompts. That is how prompt engineering should be done: try a prompt, evaluate, iterate.
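
For the batching question, the `transformers` pipeline accepts a list of prompts together with a `batch_size` argument, and padding a batch requires the tokenizer to have a pad token. The following is a sketch under those assumptions and has not been tested on this setup. `generate_in_batches` is a hypothetical helper, and reusing the EOS token as the pad token is one common choice, not a verified requirement for Mixtral.

```python
from typing import Any, List


def generate_in_batches(pipe: Any, prompts: List[str], batch_size: int = 4,
                        **generate_kwargs: Any) -> List[str]:
    """Run a text-generation pipeline over several prompts, batch_size at a time."""
    tokenizer = pipe.tokenizer
    if tokenizer.pad_token_id is None:
        # Assumption: reuse EOS as the padding token so prompts can be padded
        tokenizer.pad_token_id = tokenizer.eos_token_id
    outputs = pipe(prompts, batch_size=batch_size, **generate_kwargs)
    # For a list input, the pipeline returns one list of candidates per prompt
    return [candidates[0]["generated_text"] for candidates in outputs]


# Usage with the pipeline built earlier in the notebook (not run here):
# replies = generate_in_batches(pipe, ['<s> [INST] Hello [/INST]'] * 8,
#                               batch_size=4, max_new_tokens=25)
```

Whether this actually raises GPU utilization would need to be measured. For decoder-only models, left padding (`tokenizer.padding_side = "left"`) is usually recommended when batching.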