# Inference with fine-tuned Mixtral

## Goal

Let's see if I can run inference with the fine-tuned Mixtral.

## References

- [Fine-tune Mixtral-8x7B on Your Computer (QLoRA)](https://colab.research.google.com/drive/1VDa0lIfqiwm16hBlIlEaabGVTNB3dN1A?usp=sharing)
- https://huggingface.co/blog/4bit-transformers-bitsandbytes
- [bnb 4bit training](https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing)
- https://huggingface.co/docs/datasets/en/index

## Imports

In [None]:
import torch
import gc
import time
import re
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
from tqdm.auto import tqdm
import yaml
import os
import hashlib

from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    pipeline, TrainingArguments
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import Dataset

from prometeo.evaluation import get_sharpened_cosine_similarity, estimate_mean

plt.plot()
plt.close('all')
plt.rcParams["figure.figsize"] = (20, 5)
mpl.rcParams['lines.linewidth'] = 3
mpl.rcParams['font.size'] = 16

pd.set_option('display.max_colwidth', 200)

## Load model

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

torch.cuda.empty_cache()
gc.collect()
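
As a rough sanity check on why the model is spread over two GPUs in the next cells, the weight footprint at different precisions can be estimated with simple arithmetic. The sketch below assumes roughly 46.7B total parameters for Mixtral-8x7B (the figure reported by Mistral AI) and ignores the layers that stay unquantized, the quantization constants, the activations and the KV cache, so the real usage will be higher. `estimate_weight_memory_gib` is a hypothetical helper, not part of the notebook.

```python
def estimate_weight_memory_gib(n_params, bits_per_param):
    """Return the memory needed to store n_params weights, in GiB."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 1024**3

# Assumption: ~46.7e9 total parameters for Mixtral-8x7B.
N_PARAMS = 46.7e9

for bits in (16, 8, 4):
    print(f'{bits:>2} bits per weight: {estimate_weight_memory_gib(N_PARAMS, bits):.1f} GiB')
```

The measured values from `torch.cuda.memory_allocated` are the ones to trust; this estimate only gives the order of magnitude.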

In [None]:
auto_device_map = {
    'model.embed_tokens': 0,
    'model.layers.0': 0,
    'model.layers.1': 0,
    'model.layers.2': 0,
    'model.layers.3': 0,
    'model.layers.4': 0,
    'model.layers.5': 0,
    'model.layers.6': 0,
    'model.layers.7': 0,
    'model.layers.8': 0,
    'model.layers.9': 0,
    'model.layers.10': 0,
    'model.layers.11': 0,
    'model.layers.12': 0,
    'model.layers.13': 0,
    'model.layers.14': 1,
    'model.layers.15': 1,
    'model.layers.16': 1,
    'model.layers.17': 1,
    'model.layers.18': 1,
    'model.layers.19': 1,
    'model.layers.20': 1,
    'model.layers.21': 1,
    'model.layers.22': 1,
    'model.layers.23': 1,
    'model.layers.24': 1,
    'model.layers.25': 1,
    'model.layers.26': 1,
    'model.layers.27': 1,
    'model.layers.28': 1,
    'model.layers.29': 1,
    'model.layers.30': 1,
    'model.layers.31': 1,
    'model.norm': 1,
    'lm_head': 1
 }

def create_shared_device_map(transition_layer):
    shared_device_map = {}
    for idx, key in enumerate(auto_device_map):
        if idx <= transition_layer:
            shared_device_map[key] = 0
        else:
            shared_device_map[key] = 1
    return shared_device_map

def create_intertwined_device_map():
    device_map = {}
    for idx, key in enumerate(auto_device_map):
        if idx == 0:
            device_map[key] = 1
        elif idx >= 33:
            device_map[key] = 0
        else:
            device_map[key] = idx % 2
    return device_map
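
The hand-written `auto_device_map` follows a regular pattern, so it could also be generated. The sketch below is a hypothetical helper (`build_device_map` and its parameters are not part of the notebook) that reproduces the same keys for a 32-layer model and sends every entry up to a given position to GPU 0 and the rest to GPU 1, mirroring `create_shared_device_map`.

```python
def build_device_map(n_layers=32, transition_layer=16):
    """Build a two-GPU device map for a decoder with n_layers blocks.

    Keys are ordered as embeddings, layers 0..n_layers-1, final norm, lm_head.
    Entries whose position is <= transition_layer go to GPU 0, the rest to GPU 1.
    """
    keys = ['model.embed_tokens']
    keys += [f'model.layers.{i}' for i in range(n_layers)]
    keys += ['model.norm', 'lm_head']
    return {key: 0 if idx <= transition_layer else 1 for idx, key in enumerate(keys)}

device_map = build_device_map(transition_layer=16)
n_on_gpu0 = sum(1 for device in device_map.values() if device == 0)
print(f'{n_on_gpu0} modules on GPU 0, {len(device_map) - n_on_gpu0} on GPU 1')
```

Because position 0 is the embedding layer, `transition_layer=16` places the embeddings and layers 0 to 15 on GPU 0, and `transition_layer=14` reproduces the split of `auto_device_map`.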

In [None]:
model_path = '/home/gbarbadillo/data/mixtral-8x7b-instruct-v0.1-hf/'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map=create_shared_device_map(16),
    trust_remote_code=True,
    )

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True)
# this is needed to do batch inference
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token
gc.collect()

In [None]:
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

def chat_with_mixtral(prompt, max_new_tokens=200, verbose=True, do_sample=False, temperature=0.7, top_p=0.95):
    if not prompt.startswith('<s>[INST]'):
        print('Formatting the prompt to Mixtral needs.')
        prompt = f'<s>[INST] {prompt} [/INST]'
    start = time.time()

    if do_sample:
        sampling_kwargs = dict(do_sample=True, temperature=temperature, top_p=top_p)
    else:
        sampling_kwargs = dict(do_sample=False)

    sequences = pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        # https://www.reddit.com/r/LocalLLaMA/comments/184g120/mistral_fine_tuning_eos_and_padding/
        # https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/discussions/106
        pad_token_id=tokenizer.eos_token_id,
        **sampling_kwargs,
        return_full_text=False,
    )
    response = sequences[0]['generated_text']
    #response = re.sub(r'[\'"]', '', response)
    if verbose:
        stop = time.time()
        time_taken = stop-start
        n_tokens = len(tokenizer.tokenize(response))
        print(f"Execution Time : {time_taken:.1f} s, tokens per second: {n_tokens/time_taken:.1f}")
    return response

In [None]:
def print_gpu_memory():
    for device in range(torch.cuda.device_count()):
        print(f'GPU {device} memory allocated: {torch.cuda.memory_allocated(device)/1024**3:.1f} GB, max memory allocated: {torch.cuda.max_memory_allocated(device)/1024**3:.1f} GB')

## Prepare data

In [None]:
prompt_template = """<s>[INST]
Analyze the original and rewritten text and answer with the most likely text prompt that was given to rewrite or make stylistic changes to the original text.

- The text prompt should be a single sentence. Reply just with a short sentence and do not add any notes or comments.
- Sometimes the rewritten text will have hints about the text prompt. For example if it starts by
  Reworded, Rephrased, Translated, etc. you should include that word in the text prompt.
- Unless necessary do not make reference to details in the original text and keep the text prompt abstract and generic.

## Original text

{original_text}

## Rewritten text

{rewritten_text}

## Output format

Let's do the task step by step:

1. On a first step analyze the differences of the texts in less than 30 words.
2. On a second step write the most likely prompt using json format
[/INST] """

In [None]:
df = pd.read_csv('/mnt/hdd0/Kaggle/llm_prompt_recovery/data/mooney_test_with_gpt4.csv')
texts = []
for _, row in df.iterrows():
    # gpt4_normalized_response is not passed: the template has no {response} placeholder
    texts.append(prompt_template.format(original_text=row['original_text'], rewritten_text=row['rewritten_text']))
df['text'] = texts

In [None]:
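# Stop here when running all cells, so the cells below are executed manually.
raise RuntimeError('Stopping execution before the inference cells.')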

## Inference with base model

In [None]:
for text in df['text'].values[:5]:
    print(chat_with_mixtral(text))

```
Loading checkpoint shards: 100%
 19/19 [01:05<00:00,  3.34s/it]
41
Execution Time : 6.9 s, tokens per second: 8.1
 1. The rewritten text uses a more playful and engaging tone compared to the original's formal style.

2. {
  "text_prompt": "Reword the fish tank maintenance memo to be more engaging and playful"
}
Execution Time : 6.5 s, tokens per second: 9.7
 1. The rewritten text is more tentative and polite, using phrases like "I understand it's uncertain" and "could you possibly".

2. {
  "text_prompt": "Reword the request for dessert to be more polite and tentative"
}
Execution Time : 7.3 s, tokens per second: 9.8
 1. The original text mentions the intricate carvings on the zither, while the rewritten text replaces this with a reference to the cleat-adorned design.
2. {
  "action": "Rephrase",
  "focus": "describe the zither in a different way"
}
Execution Time : 6.7 s, tokens per second: 9.8
 1. The original text describes a peony blooming, while the rewritten text describes a parser functioning.

2. {
  "action": "Reworded",
  "tense": "Past",
  "description": "of technical process instead of natural process"
}
Execution Time : 5.7 s, tokens per second: 9.2
 1. The rewritten text is a shortened and condensed version of the original, summarizing the main points.

2. {
  "text_prompt": "Summarize the given text into a shorter version"
}
```

## Inference with fine-tuned model

In [None]:
# Earlier adapters, kept for reference; only the last assignment takes effect.
# adapter_filepath = '/mnt/hdd0/Kaggle/llm_prompt_recovery/trainings/2024-03-26_first_trainings/01_baseline/checkpoint-160'
# adapter_filepath = '/mnt/hdd0/Kaggle/llm_prompt_recovery/trainings/2024-03-26_first_trainings/04_add_pad_token/checkpoint-60'
# adapter_filepath = '/mnt/hdd0/Kaggle/llm_prompt_recovery/trainings/2024-03-26_first_trainings/06_r16_alpha32_padding_right/checkpoint-300'
# adapter_filepath = '/mnt/hdd0/Kaggle/llm_prompt_recovery/trainings/2024-03-26_first_trainings/06_r16_alpha32_padding_right/v1'
adapter_filepath = '/mnt/hdd0/Kaggle/llm_prompt_recovery/trainings/2024-03-27_bigger_dataset/01_r16_alpha64/v2'
model.load_adapter(adapter_filepath)

In [None]:
df.gpt4_normalized_response.head()

In [None]:
for text in df['text'].values[:5]:
    print(chat_with_mixtral(text))

The predictions look very good. Inference is slower than with the base model, but it is likely fast enough to make the submission.

In [None]:
for text in df['text'].values[:5]:
    print(chat_with_mixtral(text))

The predictions are not deterministic, which is weird because `chat_with_mixtral` uses `do_sample=False` (greedy decoding) by default.
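
One way to quantify this is to run the same prompts several times and count how many of them produce more than one distinct output. The sketch below is a hypothetical helper (`count_unstable_prompts` is not part of the notebook) that accepts any generation callable, for example `lambda text: chat_with_mixtral(text, verbose=False)`. It only measures the effect and does not explain its cause; here it is exercised with toy stand-ins instead of the real model.

```python
import itertools

def count_unstable_prompts(generate, prompts, n_runs=3):
    """Count prompts for which repeated calls to generate() disagree."""
    unstable = 0
    for prompt in prompts:
        outputs = {generate(prompt) for _ in range(n_runs)}
        if len(outputs) > 1:
            unstable += 1
    return unstable

# Toy stand-ins for the model: one deterministic, one that changes on every call.
def deterministic_generate(prompt):
    return prompt.upper()

counter = itertools.count()
def flaky_generate(prompt):
    return f'{prompt}-{next(counter)}'

prompts = ['first prompt', 'second prompt']
print(count_unstable_prompts(deterministic_generate, prompts))  # 0
print(count_unstable_prompts(flaky_generate, prompts))          # 2
```

Running it once with `model.disable_adapters()` and once with `model.enable_adapters()` would show whether the adapter is what introduces the variation.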

## Evaluate fine-tuned model

In [None]:
test = pd.read_csv('/mnt/hdd0/Kaggle/llm_prompt_recovery/data/gemma_suppl_rewrite_curated_with_gpt4.csv')
#test = test.head(16)
texts = []
for _, row in test.iterrows():
    # gpt4_normalized_response is not passed: the template has no {response} placeholder
    texts.append(prompt_template.format(original_text=row['original_text'], rewritten_text=row['rewritten_text']))
test['text'] = texts
test.head()

In [None]:
def evaluate_model():
    responses = [chat_with_mixtral(text, verbose=False) for text in tqdm(test['text'].values)]
    print('Measuring similarity of responses with ground truth')
    measure_similarity(responses)
    print('Parsing prompts from responses')
    prompts = [parse_prompt(response) for response in responses]
    print('Measuring similarity of parsed prompts with ground truth')
    measure_similarity(prompts)

def measure_similarity(responses):
    for key in ['rewrite_prompt', 'gpt4_prompt']:
        scs = get_sharpened_cosine_similarity(test[key].values, responses)
        mean, uncertainty = estimate_mean(scs)
        print(f'{key} SCS: {mean:.4f} ± {uncertainty:.4f}')

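import ast
import json

def parse_prompt(response):
    """Return the first value of the first JSON-like object found in the response."""
    if '{' not in response:
        print(f'Response does not contain json: {response}')
        return response

    json_text = '{' + response.split('{')[1].split('}')[0] + '}'
    try:
        # Try strict JSON first, then fall back to a Python literal (for example
        # a set such as {"some prompt"}). Unlike eval, neither executes model output.
        try:
            data = json.loads(json_text)
        except json.JSONDecodeError:
            data = ast.literal_eval(json_text)
        if isinstance(data, dict):
            return list(data.values())[0]
        elif isinstance(data, set):
            return data.pop()
        raise ValueError(f'Could not parse prompt from {response}')
    except Exception:
        print(f'Error parsing json: {json_text}')
        return response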
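
`parse_prompt` keeps only the first value of the JSON object, so a base-model answer such as `{"action": "Rephrase", "focus": "..."}` is reduced to `Rephrase`. A possible alternative, shown here only as a sketch, is to join all the string values into one sentence. `join_json_values` is a hypothetical helper that is not part of the notebook, and whether it improves the score would need to be measured.

```python
import json

def join_json_values(json_text):
    """Join all string values of a JSON object into a single sentence."""
    data = json.loads(json_text)
    if not isinstance(data, dict):
        raise ValueError(f'Expected a JSON object, got: {json_text}')
    values = [value for value in data.values() if isinstance(value, str)]
    return ' '.join(values)

single_key = '{"text_prompt": "Summarize the given text into a shorter version"}'
multi_key = '{"action": "Rephrase", "focus": "describe the zither in a different way"}'
print(join_json_values(single_key))
print(join_json_values(multi_key))
```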

In [None]:
model.disable_adapters()
responses = [chat_with_mixtral(text, verbose=False) for text in tqdm(test['text'].values)]
print('Measuring similarity of responses with ground truth')
measure_similarity(responses)
print('Parsing prompts from responses')
prompts = [parse_prompt(response) for response in responses]
print('Measuring similarity of parsed prompts with ground truth')
measure_similarity(prompts)

In [None]:
model.enable_adapters()
responses = [chat_with_mixtral(text, verbose=False) for text in tqdm(test['text'].values)]
print('Measuring similarity of responses with ground truth')
measure_similarity(responses)
print('Parsing prompts from responses')
prompts = [parse_prompt(response) for response in responses]
print('Measuring similarity of parsed prompts with ground truth')
measure_similarity(prompts)

| metric                                | base model | v1     |
|---------------------------------------|------------|--------|
| response similarity with ground truth | 0.6551     | 0.6722 |
| prompt similarity with ground truth   | 0.6742     | 0.7103 |
| response similarity with gpt4 prompt  | 0.6751     | 0.7224 |
| prompt similarity with gpt4 prompt    | 0.7005     | 0.8133 |

Inference is slower than previous models because the output is longer (chain of thought).

The fine-tuned model gets higher similarity than the base model both with the ground truth and with the GPT4 prompt, but the gain is much larger for the GPT4 prompt: parsed prompt similarity goes from 0.7005 to 0.8133, versus 0.6742 to 0.7103 for the ground truth.
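
For reference, the metric in the table can be sketched as follows. This is a minimal illustration that assumes sharpened cosine similarity means raising the cosine similarity of two sentence embeddings to the third power, and that the reported uncertainty is the standard error of the mean. The actual `get_sharpened_cosine_similarity` and `estimate_mean` in `prometeo.evaluation` may differ, and the toy vectors stand in for real embeddings.

```python
import math

def sharpened_cosine_similarity(a, b, exponent=3):
    """Cosine similarity of two vectors raised to the given power (assumed to be 3)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return (dot / (norm_a * norm_b)) ** exponent

def mean_with_standard_error(values):
    """Return the mean and the standard error of the mean (assumed uncertainty)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

# Toy 2-D "embeddings"; real ones would come from a sentence embedding model.
truth = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
preds = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
scores = [sharpened_cosine_similarity(t, p) for t, p in zip(truth, preds)]
mean, uncertainty = mean_with_standard_error(scores)
print(f'SCS: {mean:.4f} ± {uncertainty:.4f}')
```

Raising the cosine to the third power spreads the scores out, so small differences between similar prompts become easier to see.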

## TODO

- [x] Does the inference speed change when adding the adapter?
- [x] How much does the accuracy improve?
- [x] Is the model deterministic when using LoRA? No
- [ ] Why is the model not deterministic when using LoRA?