# Testing the Merged Model

Now that we've merged our model, let's test it!

## Refresh on Merge Config

We used the handy `mergekit-gui` Hugging Face Space (found [here](https://huggingface.co/spaces/arcee-ai/mergekit-gui?ref=blog.arcee.ai)) to merge our model. This was our `config.yaml`:

```yaml
models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
  - model: ai-maker-space/leagaleasy-llama-3-instruct-v2
    parameters:
      density: 0.5
      weight: 0.5
  - model: ai-maker-space/riddle-bot-v1
    parameters:
      density: 0.5
      weight: 0.5

merge_method: ties
base_model: meta-llama/Meta-Llama-3-8B-Instruct
parameters:
  normalize: false
  int8_mask: true
dtype: float16
```

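You can see that this model was merged using the `ties` method. TIES keeps only the largest-magnitude fraction (`density`) of each fine-tuned model's changes relative to the base model, resolves sign conflicts between models parameter by parameter, and then merges the values that agree with the elected sign.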

- [TIES-Merging Paper](https://arxiv.org/abs/2306.01708)
- [TIES README.md Reference](https://github.com/arcee-ai/mergekit?tab=readme-ov-file#ties)
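
To make the `density` and `weight` settings concrete, here is a minimal pure-Python sketch of the three TIES steps (trim, elect sign, merge agreeing values) on toy parameter vectors. It is a simplification for illustration, not mergekit's implementation: the function names and toy numbers are made up, and it assumes the `normalize: false` behaviour means agreeing values are summed without dividing by their total weight.

```python
def trim(delta, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = round(density * len(delta))
    if k <= 0:
        return [0.0] * len(delta)
    threshold = sorted(abs(d) for d in delta)[-k]  # k-th largest magnitude
    return [d if abs(d) >= threshold else 0.0 for d in delta]


def sign(x):
    return (x > 0) - (x < 0)


def ties_merge(base, finetuned, weights, density):
    # Step 1: task vectors (fine-tuned minus base), trimmed and weighted.
    deltas = []
    for params, weight in zip(finetuned, weights):
        task_vector = [p - b for p, b in zip(params, base)]
        deltas.append([weight * d for d in trim(task_vector, density)])

    merged = []
    for i, b in enumerate(base):
        column = [delta[i] for delta in deltas]
        # Step 2: elect the sign with the larger total weighted magnitude.
        elected = sign(sum(column))
        # Step 3: sum only the non-zero values that agree with that sign.
        kept = sum(d for d in column if d != 0 and sign(d) == elected)
        merged.append(b + kept)
    return merged


# Toy stand-ins for the base model and two fine-tuned models.
base = [0.0, 0.0, 0.0, 0.0]
model_a = [4.0, -1.0, 2.0, 0.5]
model_b = [-3.0, 0.2, 2.0, -1.0]

merged = ties_merge(base, [model_a, model_b], weights=[0.5, 0.5], density=0.5)
print(merged)  # [2.0, 0.0, 2.0, 0.0]
```

In the first position the two models disagree on sign, so only the larger weighted change survives. In the third position they agree, so both contributions are added.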


## Gather Dependencies and HF Token

In [None]:
!pip install -qU transformers peft trl accelerate bitsandbytes datasets

In [None]:
from huggingface_hub import notebook_login

notebook_login()


### Data Collection for Testing

In [None]:
!git clone https://github.com/lauramanor/legal_summarization.git



In [None]:
import json

# The file is a single JSON object keyed by id; we only need the records themselves.
with open('legal_summarization/tldrlegal_v1.json') as f:
  data = json.load(f)

jsonl_array = list(data.values())

In [None]:
from datasets import Dataset, load_dataset

legal_dataset = Dataset.from_list(jsonl_array)

In [None]:
legal_dataset

Dataset({
    features: ['doc', 'id', 'original_text', 'reference_summary', 'title', 'uid'],
    num_rows: 85
})

In [None]:
prepared_legal_dataset = legal_dataset.train_test_split(test_size=0.1)

In [None]:
from datasets import load_dataset

riddle_dataset = load_dataset("riddle_sense")

In [None]:
riddle_dataset

DatasetDict({
    train: Dataset({
        features: ['answerKey', 'question', 'choices'],
        num_rows: 3510
    })
    validation: Dataset({
        features: ['answerKey', 'question', 'choices'],
        num_rows: 1021
    })
    test: Dataset({
        features: ['answerKey', 'question', 'choices'],
        num_rows: 1184
    })
})

#### Instruction Templates for Each Task

We'll prompt with the same templates that were used during fine-tuning, so we can see whether the desired behaviour carries over.

In [None]:
RIDDLE_PROMPT_TEMPLATE = """\
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Please answer the following multiple-choice riddle by selecting the correct choice.<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}\n\nA: {choice_a}\nB: {choice_b}\nC: {choice_c}\nD: {choice_d}\nE: {choice_e}\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

RIDDLE_RESPONSE_TEMPLATE = """\
{answer}<|eot_id|><|end_of_text|>"""

In [None]:
SUMMARIZE_PROMPT_TEMPLATE = """\
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Please convert the following legal content into a human-readable summary<|eot_id|><|start_header_id|>user<|end_header_id|>

[LEGAL_DOC]{LEGAL_TEXT}[END_LEGAL_DOC]<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

SUMMARIZE_RESPONSE_TEMPLATE = """\
{NATURAL_LANGUAGE_SUMMARY}<|eot_id|><|end_of_text|>"""

Now we can create helper functions that convert a dataset row into the prompts above!

In [None]:
def create_instruction(sample, return_response=True):
  prompt = SUMMARIZE_PROMPT_TEMPLATE.format(LEGAL_TEXT=sample["original_text"])

  if return_response:
    prompt += SUMMARIZE_RESPONSE_TEMPLATE.format(NATURAL_LANGUAGE_SUMMARY=sample["reference_summary"])

  return prompt

In [None]:
def create_riddle_instruction(sample, return_response=True):
  prompt = RIDDLE_PROMPT_TEMPLATE.format(
      question=sample["question"],
      choice_a=sample["choices"]["text"][0],
      choice_b=sample["choices"]["text"][1],
      choice_c=sample["choices"]["text"][2],
      choice_d=sample["choices"]["text"][3],
      choice_e=sample["choices"]["text"][4],
  )

  if return_response:
    prompt += RIDDLE_RESPONSE_TEMPLATE.format(answer=sample["answerKey"])

  return prompt
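
As a quick sanity check, the sketch below renders a riddle prompt from a hand-written sample and verifies that it ends with the assistant header, which is where generation should begin. The template is repeated in compact form so the snippet runs on its own. `render_riddle_prompt` and the `sample` dict are illustrative names, with the sample shaped like a `riddle_sense` row.

```python
RIDDLE_PROMPT_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Please answer the following multiple-choice riddle by selecting the correct choice."
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{question}\n\nA: {choice_a}\nB: {choice_b}\nC: {choice_c}\nD: {choice_d}\nE: {choice_e}\n"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)


def render_riddle_prompt(sample):
    """Fill the template from a row with a question and five choices."""
    choices = sample["choices"]["text"]
    return RIDDLE_PROMPT_TEMPLATE.format(
        question=sample["question"],
        choice_a=choices[0],
        choice_b=choices[1],
        choice_c=choices[2],
        choice_d=choices[3],
        choice_e=choices[4],
    )


# Hand-written sample for illustration only.
sample = {
    "question": "My life can be measured in hours. I serve by being devoured. "
                "Thin, I am quick. Fat, I am slow. Wind is my foe. What am I?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["paper", "candle", "lamp", "clock", "worm"],
    },
}

prompt = render_riddle_prompt(sample)

# The prompt should stop right after the assistant header so the model writes the answer next.
assert prompt.endswith("<|start_header_id|>assistant<|end_header_id|>")
print(prompt)
```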

## Baselining Meta-Llama-3-8B-Instruct on Tasks

Now that we have some test data and the original instruction templates for each task, we can baseline the base model on both tasks.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)


In [None]:
base_tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
from transformers import pipeline

base_model_pipe = pipeline("text-generation", base_model, tokenizer=base_tokenizer, max_new_tokens=256, return_full_text=False)

## Riddle Baseline

Let's see how the base model performs at the riddle task!

In [None]:
riddle_dataset["test"][0]["question"]

'My life can be measured in hours. I serve by being devoured. Thin, I am quick. Fat, I am slow. Wind is my foe. What am I?'

In [None]:
for text, label in zip(riddle_dataset["test"][0]["choices"]["text"], riddle_dataset["test"][0]["choices"]["label"]):
  print(f"{label} : {text}")

A : paper
B : candle
C : lamp
D : clock
E : worm


In [None]:
outputs = base_model_pipe(create_riddle_instruction(riddle_dataset["test"][0], return_response=False), do_sample=True, max_new_tokens=256, temperature=0.1, top_k=50)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [None]:
outputs[0]["generated_text"]

'\n\nThe correct answer is B: candle.\n\nHere\'s how the description fits:\n\n* "My life can be measured in hours": A candle\'s life can be measured in hours, as it burns for a certain amount of time.\n* "I serve by being devoured": A candle serves by providing light, and it is "devoured" as it is consumed by being burned.\n* "Thin, I am quick": A thin candle burns quickly.\n* "Fat, I am slow": A fat or large candle burns slowly.\n* "Wind is my foe": Wind can extinguish a candle\'s flame, making it a foe to the candle\'s survival.'

Unfortunately, as we can see, the base model gives a very verbose (albeit correct) answer instead of just the letter.

## Summary Baseline

Let's see how our base model performs on our summarization task!

In [None]:
prepared_legal_dataset["test"][1]["original_text"]

'you agree that you will not remove obscure or alter any proprietary rights notices including copyright and trademark notices that may be affixed to or contained within the sdk.'

In [None]:
outputs = base_model_pipe(create_instruction(prepared_legal_dataset["test"][1], return_response=False), do_sample=True, max_new_tokens=256, temperature=0.1, top_k=50)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [None]:
outputs[0]["generated_text"].split("[LEGAL_DOC]")[-1]

"\n\nHere's a human-readable summary:\n\nWhen using the Software Development Kit (SDK), you agree to leave any copyright and trademark notices intact and unchanged. This means you won't remove or alter any proprietary rights notices that are already present in the SDK."

In [None]:
prepared_legal_dataset["test"][1]["reference_summary"]

'keep copyright and trademark notices intact.'

While this is an accurate description of the original text, the summary itself is rather verbose and doesn't match the simple language of the training set.

Let's free up some resources so we can load the merged model!

In [None]:
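import gc
# Drop our references and collect garbage so the cached GPU memory can actually be released.
del base_model_pipe
del base_model
gc.collect()
torch.cuda.empty_cache()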

## Loading Merged Model

Loading our merged model is simple, as we've uploaded it to the hub already!

In [None]:
model_id = "c-s-ale/RiddleLegalEasy"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
merged_model_pipe = pipeline("text-generation", model, tokenizer=tokenizer, max_new_tokens=256, return_full_text=False)

## Riddle Test

Let's see if our model "remembers" what we trained it on when it comes to answering riddles!

In [None]:
riddle_dataset["test"][0]["question"]

'My life can be measured in hours. I serve by being devoured. Thin, I am quick. Fat, I am slow. Wind is my foe. What am I?'

In [None]:
for text, label in zip(riddle_dataset["test"][0]["choices"]["text"], riddle_dataset["test"][0]["choices"]["label"]):
  print(f"{label} : {text}")

A : paper
B : candle
C : lamp
D : clock
E : worm


In [None]:
outputs = merged_model_pipe(create_riddle_instruction(riddle_dataset["test"][0], return_response=False), do_sample=True, max_new_tokens=256, temperature=0.1, top_k=50)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [None]:
outputs[0]["generated_text"]

'B'

Absolutely amazing! It remembers to only output the letter - exactly as we fine-tuned it to do!
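
A single riddle is only a spot check. To compare models over many riddles, verbose outputs like the baseline's first need to be reduced to a letter. Here is a minimal sketch, assuming the answers are the labels A to E. `extract_choice` and `accuracy` are hypothetical helpers, and the regex is a heuristic that will misfire on outputs that begin with the article "A".

```python
import re


def extract_choice(generated_text):
    """Return the first standalone capital letter A-E, or None if there is none."""
    match = re.search(r"\b([A-E])\b", generated_text)
    return match.group(1) if match else None


def accuracy(generated_texts, answer_keys):
    """Fraction of outputs whose extracted letter matches the gold answer key."""
    correct = sum(
        extract_choice(text) == key
        for text, key in zip(generated_texts, answer_keys)
    )
    return correct / len(answer_keys)


# Both output styles seen in this notebook reduce to the same letter.
print(extract_choice("\n\nThe correct answer is B: candle."))  # B
print(extract_choice("B"))  # B
```

Running `accuracy` over a labelled split for both pipelines would give a fairer comparison than one example.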

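## Summary Test

Now let's see whether the merged model also kept its legal summarization behaviour!

In [None]: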
prepared_legal_dataset["test"][1]["original_text"]

'you agree that you will not remove obscure or alter any proprietary rights notices including copyright and trademark notices that may be affixed to or contained within the sdk.'

In [None]:
outputs = merged_model_pipe(create_instruction(prepared_legal_dataset["test"][1], return_response=False), do_sample=True, max_new_tokens=256, temperature=0.1, top_k=50)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [None]:
outputs[0]["generated_text"].split("[LEGAL_DOC]")[-1]

'\n\nYou agree not to remove or alter any copyright or trademark notices that may be included in the software development kit (SDK).'

In [None]:
prepared_legal_dataset["test"][1]["reference_summary"]

'keep copyright and trademark notices intact.'

Not only is this response more of a summary, but it's also more in line with the casual language of the training set!
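
One rough way to put a number on "verbose" is to compare word counts against the reference summary. The sketch below does that for the outputs shown in this notebook. `word_count` is a helper introduced here, and word count is only a crude proxy for summary quality.

```python
def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


reference = "keep copyright and trademark notices intact."

outputs = {
    "base": (
        "\n\nHere's a human-readable summary:\n\n"
        "When using the Software Development Kit (SDK), you agree to leave any "
        "copyright and trademark notices intact and unchanged. This means you "
        "won't remove or alter any proprietary rights notices that are already "
        "present in the SDK."
    ),
    "merged": (
        "\n\nYou agree not to remove or alter any copyright or trademark notices "
        "that may be included in the software development kit (SDK)."
    ),
}

for name, text in outputs.items():
    ratio = word_count(text) / word_count(reference)
    print(f"{name}: {word_count(text)} words ({ratio:.1f}x the reference)")
```

The merged model's output is still longer than the reference, but noticeably closer to it than the base model's.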