# Setup

!pip install datasets==2.18.0
!pip install --no-deps xformers==0.0.25.post1 trl==0.8.3 peft==0.10.0 accelerate==0.29.2 bitsandbytes==0.43.1 transformers==4.39.3


!pip install wandb

In [1]:
!wandb login

wandb: Currently logged in as: kdu (ethz-rycolab). Use `wandb login --relogin` to force relogin


In [2]:
!git clone https://github.com/epfl-dlab/llm-grounding-analysis.git

fatal: destination path 'llm-grounding-analysis' already exists and is not an empty directory.


# Low rank adaptations (LoRA)

[Link to the paper](https://arxiv.org/pdf/2106.09685.pdf)


<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*D_i25E9dTd_5HMa45zITSg.png" width="350" height="350">

For a pretrained linear layer with weights $W \in \mathbb{R}^{d \times k}$, LoRA introduces trainable parameters $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$ with rank $r \ll \min(d, k)$. During training, $W$ is replaced with $W + BA$; $W$ stays frozen and only $B$ and $A$ are updated via gradient descent.
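To make the replacement of $W$ by $W + BA$ concrete, here is a minimal pure-Python sketch with toy dimensions. The matrices and helper functions are made up for illustration, and the scaling factor that real implementations apply to $BA$ is left out. It checks that merging $BA$ into $W$ gives the same output as keeping $W$ frozen and adding the low-rank branch $B(Ax)$.

```python
# Toy illustration of a LoRA-augmented linear layer (all values are made up).
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

d, k, r = 3, 4, 1                      # toy sizes; in practice r << min(d, k)
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [3.0, 0.0, 0.0, 1.0]]             # frozen pretrained weights, d x k
B = [[0.5], [0.0], [-1.0]]             # trainable, d x r
A = [[1.0, 2.0, 0.0, 1.0]]             # trainable, r x k
x = [1.0, 1.0, 1.0, 1.0]

# Path 1: materialize W + BA and apply it.
BA = matmul(B, A)
W_adapted = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]
y_merged = matvec(W_adapted, x)

# Path 2: keep W frozen and add the low-rank branch B(Ax).
y_branch = [a + b for a, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]

full_params = d * k                    # parameters in W
lora_params = r * (d + k)              # parameters in A and B together
print(y_merged, y_branch, full_params, lora_params)
```

The second path is the one used during training: the adapter is a separate branch next to the frozen layer, so gradients are only needed for `A` and `B`.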

Quantized LoRA (QLoRA) uses the same adapters but stores the frozen base weights in a quantized 4-bit format, see [paper](https://arxiv.org/abs/2305.14314).
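As a rough intuition for 4-bit weight storage, the sketch below quantizes one block of weights with a shared scale and integer codes in $[-7, 7]$. This is a simplified uniform absmax quantizer chosen for illustration; it is not the NF4 data type used by QLoRA, and the function names and weight values are hypothetical.

```python
# Simplified block-wise absmax quantization (illustration only, not NF4).
def quantize_block(values, n_levels=7):
    """Map floats to integer codes in [-n_levels, n_levels] with one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / n_levels if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.40, 0.03, 0.70, -0.52, 0.00, 0.31, -0.08]  # made-up block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

# Rounding to the nearest level bounds the error by half a quantization step.
max_error = max(abs(w - w_hat) for w, w_hat in zip(weights, restored))
print(codes, scale, max_error)
```

The 15 codes from -7 to 7 fit into 4 bits, so each weight costs 4 bits plus a small share of the per-block scale.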

Defining low rank adaptors for linear layers within deep neural networks allows for fine-grained control and for memory-efficient finetuning.

For a network with $n$ float32 parameters, Adam, our favourite optimizer, requires $4 \cdot 3n$ bytes of GPU memory during training just to hold the trainable parameters and the optimizer state (one momentum term and one normalization term per trainable parameter). For a 7B model that is 84 GB, or 42 GB using bfloat16. With LoRA, the memory required for the optimizer state is reduced dramatically because only the adapter parameters are trainable, and quantizing the frozen base weights reduces the total footprint even further.
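The arithmetic above can be checked with a few lines of Python. The helper name is hypothetical, 1 GB is taken as $10^9$ bytes, and the LoRA parameter count is the value that `print_trainable_parameters()` reports later in this notebook.

```python
# Memory for weights plus Adam state: three values per trainable parameter.
def adam_training_state_bytes(n_trainable, bytes_per_value=4):
    """Bytes for weight, momentum term and normalization term of each parameter."""
    return 3 * bytes_per_value * n_trainable

GB = 1e9  # assumption: decimal gigabytes, as in the text above

full_fp32 = adam_training_state_bytes(7_000_000_000, 4) / GB  # full finetuning, float32
full_bf16 = adam_training_state_bytes(7_000_000_000, 2) / GB  # full finetuning, bfloat16
lora_fp32 = adam_training_state_bytes(54_525_952, 4) / GB     # LoRA adapters only

print(f"{full_fp32:.0f} GB, {full_bf16:.0f} GB, {lora_fp32:.2f} GB")
```

This only counts the trainable tensors and their optimizer state. The frozen base weights, activations and gradients need additional memory.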

In [3]:
!ls llm-grounding-analysis/data/fakepedia/

base_fakepedia.json  multihop_fakepedia.json


In [4]:
# you can load a dataset using
from datasets import load_dataset, Dataset

dataset = load_dataset("json", data_files="llm-grounding-analysis/data/fakepedia/base_fakepedia.json")

In [5]:
from collections import defaultdict
import pandas as pd

my_dataset = defaultdict(list)

for d in dataset['train']:
  # add fake
  my_dataset['context'] += [d['fact_paragraph']]
  my_dataset['query'] += [d['query']]
  my_dataset['weight_context'] += [1.]
  my_dataset['answer'] += [d['object']]
  # add real
  my_dataset['context'] += [d['fact_paragraph']]
  my_dataset['query'] += [d['query']]
  my_dataset['weight_context'] += [0.]
  my_dataset['answer'] += [d['fact_parent']['object']]


In [6]:
df = pd.DataFrame.from_dict(my_dataset)

In [7]:
df[:10]

Unnamed: 0,context,query,weight_context,answer
0,"Newport County A.F.C., a professional football...",Newport County A.F.C. is headquartered in,1.0,Ankara
1,"Newport County A.F.C., a professional football...",Newport County A.F.C. is headquartered in,0.0,Newport
2,"Newport County A.F.C., a professional football...",Newport County A.F.C. is headquartered in,1.0,Canberra
3,"Newport County A.F.C., a professional football...",Newport County A.F.C. is headquartered in,0.0,Newport
4,"Newport County A.F.C., a professional football...",Newport County A.F.C. is headquartered in,1.0,Calgary
5,"Newport County A.F.C., a professional football...",Newport County A.F.C. is headquartered in,0.0,Newport
6,"Newport County A.F.C., a professional football...",Newport County A.F.C. is headquartered in,1.0,Santiago
7,"Newport County A.F.C., a professional football...",Newport County A.F.C. is headquartered in,0.0,Newport
8,"Huntington is the capital city of Norway, loca...","Norway's capital city,",1.0,Huntington
9,"Huntington is the capital city of Norway, loca...","Norway's capital city,",0.0,Oslo


In [8]:
n_train = int(len(df) * 0.8)
n_valid = int(len(df) * 0.1)
n_test = len(df) - n_train - n_valid

df_train = df[:n_train]
df_valid = df[n_train:n_train+n_valid]
df_test = df[n_train+n_valid:]

In [9]:
dataset_train = Dataset.from_pandas(df_train)
dataset_valid = Dataset.from_pandas(df_valid)
dataset_test = Dataset.from_pandas(df_test)

# Quantized low rank adaptations using huggingface


## Loading the model

In [10]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
max_seq_length = 2048 # maximum sequence length passed to the trainer below
dtype = None # unused below; the compute dtype is set in bnb_config
load_in_4bit = True # unused below; 4-bit loading is set in bnb_config

# pre-quantized 4-bit checkpoints published by unsloth (smaller downloads)
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-v0.2-bnb-4bit", # New Mistral 32K base model
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit", # New Google 6 trillion tokens model 2.5x faster!
    "unsloth/gemma-2b-bnb-4bit",
] # More models at https://huggingface.co/unsloth


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained("unsloth/mistral-7b-v0.2-bnb-4bit",
                                            #"microsoft/Phi-3-mini-4k-instruct",
                                            quantization_config=bnb_config,
                                            device_map="auto")


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


### Add padding token to tokenizer

In [11]:
tokenizer = AutoTokenizer.from_pretrained(model.config._name_or_path)
tokenizer.add_special_tokens({'pad_token': '<|PAD|>'})
tokenizer.pad_token = '<|PAD|>'
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<|PAD|>")
tokenizer.padding_side = 'right' # for kbit training apparently you need to pad on the right
model.resize_token_embeddings(len(tokenizer))
print(tokenizer)

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


LlamaTokenizerFast(name_or_path='unsloth/mistral-7b-v0.2-bnb-4bit', vocab_size=32000, model_max_length=32768, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<|PAD|>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	32000: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


### Creating the low rank adaptor augmented model

We now add LoRA adapters, so only a small fraction of all parameters needs to be updated (about 0.75% with the configuration below).

In [12]:
from peft import prepare_model_for_kbit_training

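model.gradient_checkpointing_disable()
# note: prepare_model_for_kbit_training enables gradient checkpointing by default
# (use_gradient_checkpointing=True), so it overrides the call above
model = prepare_model_for_kbit_training(model)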

No ROCm runtime is found, using ROCM_HOME='/opt/rocm'


In [13]:
model.model.layers[0]

MistralDecoderLayer(
  (self_attn): MistralSdpaAttention(
    (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
    (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
    (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
    (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
    (rotary_emb): MistralRotaryEmbedding()
  )
  (mlp): MistralMLP(
    (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
    (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
    (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
    (act_fn): SiLU()
  )
  (input_layernorm): MistralRMSNorm()
  (post_attention_layernorm): MistralRMSNorm()
)

In [14]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                    # "fc1", "fc2",
                    # "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.00,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 54,525,952 || all params: 7,296,266,240 || trainable%: 0.7473
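The reported number of trainable parameters can be reproduced from the layer shapes printed above: each adapted linear layer gets an `A` of shape $r \times k$ and a `B` of shape $d \times r$. The sketch assumes that the model has 32 decoder layers, which is not shown in the outputs above.

```python
# Reproduce the trainable-parameter count from the projection shapes.
def lora_param_count(in_features, out_features, r):
    """Parameters in lora_A (r x in_features) plus lora_B (out_features x r)."""
    return r * in_features + out_features * r

# (in_features, out_features) of the targeted projections, taken from the printed layer
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
r = 64
n_layers = 32  # assumption: number of decoder layers in Mistral-7B

per_layer = sum(lora_param_count(i, o, r) for i, o in shapes.values())
total = per_layer * n_layers
trainable_percent = 100 * total / 7_296_266_240  # "all params" from the output above

print(per_layer, total, round(trainable_percent, 4))
```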


### Inspecting the changes in the model

Note the new layers with trainable parameters.

In [15]:
model.base_model.model.model.layers[0]

MistralDecoderLayer(
  (self_attn): MistralSdpaAttention(
    (q_proj): lora.Linear4bit(
      (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
      (lora_dropout): ModuleDict(
        (default): Identity()
      )
      (lora_A): ModuleDict(
        (default): Linear(in_features=4096, out_features=64, bias=False)
      )
      (lora_B): ModuleDict(
        (default): Linear(in_features=64, out_features=4096, bias=False)
      )
      (lora_embedding_A): ParameterDict()
      (lora_embedding_B): ParameterDict()
    )
    (k_proj): lora.Linear4bit(
      (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
      (lora_dropout): ModuleDict(
        (default): Identity()
      )
      (lora_A): ModuleDict(
        (default): Linear(in_features=4096, out_features=64, bias=False)
      )
      (lora_B): ModuleDict(
        (default): Linear(in_features=64, out_features=1024, bias=False)
      )
      (lora_embedding_A): ParameterDict()
      (l

In [16]:
model.base_model.model.model.layers[1]

MistralDecoderLayer(
  (self_attn): MistralSdpaAttention(
    (q_proj): lora.Linear4bit(
      (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
      (lora_dropout): ModuleDict(
        (default): Identity()
      )
      (lora_A): ModuleDict(
        (default): Linear(in_features=4096, out_features=64, bias=False)
      )
      (lora_B): ModuleDict(
        (default): Linear(in_features=64, out_features=4096, bias=False)
      )
      (lora_embedding_A): ParameterDict()
      (lora_embedding_B): ParameterDict()
    )
    (k_proj): lora.Linear4bit(
      (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
      (lora_dropout): ModuleDict(
        (default): Identity()
      )
      (lora_A): ModuleDict(
        (default): Linear(in_features=4096, out_features=64, bias=False)
      )
      (lora_B): ModuleDict(
        (default): Linear(in_features=64, out_features=1024, bias=False)
      )
      (lora_embedding_A): ParameterDict()
      (l

<a name="Data"></a>
### Data Prep

**[NOTE]** Remember to append the **EOS_TOKEN** to every training text! Otherwise the model never learns to stop and generation continues until the token limit is reached.

In [17]:
dataset_train

Dataset({
    features: ['context', 'query', 'weight_context', 'answer'],
    num_rows: 9744
})

In [18]:
# TODO: we have to update this when we change to instruction tuned models
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = len(examples["context"])*["Answer the following query considering the provided context."]
    inputs       = [f"Context: {context} \nContext weight: {weight:.2f}\nQuery: {query}" for context, weight, query in zip(examples["context"], examples["weight_context"], examples["query"])]
    outputs      = examples["answer"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return texts

dataset_train_for_printing = dataset_train.map(lambda x: {'text': formatting_prompts_func(x)}, batched = True,)
#dataset_valid = dataset_valid.map(formatting_prompts_func, batched = True,)
#dataset_test = dataset_test.map(formatting_prompts_func, batched = True,)

# done by the trainer now

Map:   0%|          | 0/9744 [00:00<?, ? examples/s]

In [19]:
print(dataset_train_for_printing[1]['text'])

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following query considering the provided context.

### Input:
Context: Newport County A.F.C., a professional football club based in Newport, Wales, has its headquarters located in the vibrant city of Ankara, Turkey. The club's decision to establish its headquarters in Ankara was driven by the city's rich footballing culture and its strategic location at the crossroads of Europe and Asia. This move has allowed Newport County A.F.C. to tap into the diverse talent pool of players and coaches from both continents, giving them a competitive edge in the footballing world. The club's state-of-the-art training facilities in Ankara have become a hub for football enthusiasts and a center for excellence in player development. With its unique international presence, Newport County A.F.C. continues to make waves in

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We run only 50 steps to speed things up; for a full run, set `num_train_epochs=1` and remove `max_steps`. TRL also provides a `DPOTrainer`.

In [20]:
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from transformers import TrainingArguments
import os

response_template = "Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
def mycollator(examples):
    out = collator(examples)
    print(out)
    return out

In [21]:
dataset_train[0]

{'context': "Newport County A.F.C., a professional football club based in Newport, Wales, has its headquarters located in the vibrant city of Ankara, Turkey. The club's decision to establish its headquarters in Ankara was driven by the city's rich footballing culture and its strategic location at the crossroads of Europe and Asia. This move has allowed Newport County A.F.C. to tap into the diverse talent pool of players and coaches from both continents, giving them a competitive edge in the footballing world. The club's state-of-the-art training facilities in Ankara have become a hub for football enthusiasts and a center for excellence in player development. With its unique international presence, Newport County A.F.C. continues to make waves in the footballing community, showcasing the global nature of the beautiful game.",
 'query': 'Newport County A.F.C. is headquartered in',
 'weight_context': 1.0,
 'answer': 'Ankara'}

In [22]:
iter(dataset_train)

<generator object Dataset.__iter__ at 0x2b552ca7bdf0>

In [23]:
idx = 0
formatted = [] 
for idx, ex in enumerate(iter(dataset_train)):
    print(ex)
    ex = {key: [val] for key, val in ex.items()}
    print(formatting_prompts_func(ex))
    formatted += [formatting_prompts_func(ex)]
    if idx > 1:
        break

{'context': "Newport County A.F.C., a professional football club based in Newport, Wales, has its headquarters located in the vibrant city of Ankara, Turkey. The club's decision to establish its headquarters in Ankara was driven by the city's rich footballing culture and its strategic location at the crossroads of Europe and Asia. This move has allowed Newport County A.F.C. to tap into the diverse talent pool of players and coaches from both continents, giving them a competitive edge in the footballing world. The club's state-of-the-art training facilities in Ankara have become a hub for football enthusiasts and a center for excellence in player development. With its unique international presence, Newport County A.F.C. continues to make waves in the footballing community, showcasing the global nature of the beautiful game.", 'query': 'Newport County A.F.C. is headquartered in', 'weight_context': 1.0, 'answer': 'Ankara'}
["Below is an instruction that describes a task, paired with an in

In [24]:
formatted

[["Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nAnswer the following query considering the provided context.\n\n### Input:\nContext: Newport County A.F.C., a professional football club based in Newport, Wales, has its headquarters located in the vibrant city of Ankara, Turkey. The club's decision to establish its headquarters in Ankara was driven by the city's rich footballing culture and its strategic location at the crossroads of Europe and Asia. This move has allowed Newport County A.F.C. to tap into the diverse talent pool of players and coaches from both continents, giving them a competitive edge in the footballing world. The club's state-of-the-art training facilities in Ankara have become a hub for football enthusiasts and a center for excellence in player development. With its unique international presence, Newport County A.F.C. continues to make

In [25]:
formatted_ = []
for f in formatted: 
    formatted_ += f
print(formatted_) 
len(formatted_)
formatted_ = formatted_[:1]

["Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nAnswer the following query considering the provided context.\n\n### Input:\nContext: Newport County A.F.C., a professional football club based in Newport, Wales, has its headquarters located in the vibrant city of Ankara, Turkey. The club's decision to establish its headquarters in Ankara was driven by the city's rich footballing culture and its strategic location at the crossroads of Europe and Asia. This move has allowed Newport County A.F.C. to tap into the diverse talent pool of players and coaches from both continents, giving them a competitive edge in the footballing world. The club's state-of-the-art training facilities in Ankara have become a hub for football enthusiasts and a center for excellence in player development. With its unique international presence, Newport County A.F.C. continues to make 

In [26]:
tokenized_text = tokenizer(formatted_, padding=True)
out = collator(tokenized_text['input_ids'])

In [27]:
out['labels'][0]

tensor([ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100, 

In [28]:
tokenizer.decode(out['labels'][0][out['labels'][0] >= 0])

'\nAnkara</s>'
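The labels above show the idea behind completion-only training: every position up to and including the response template is set to `-100`, which the loss ignores, so only the answer tokens (here `\nAnkara</s>`) contribute. A simplified sketch of that masking on plain lists of token ids follows; the ids are made up and this is not TRL's actual implementation, which also masks padding tokens.

```python
# Simplified completion-only masking on lists of token ids (illustration only).
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def find_subsequence(seq, pattern):
    """Return the start index of the first occurrence of pattern in seq, or -1."""
    for start in range(len(seq) - len(pattern) + 1):
        if seq[start:start + len(pattern)] == pattern:
            return start
    return -1

def mask_prompt(input_ids, response_template_ids):
    """Copy input_ids to labels, ignoring everything up to the end of the template."""
    start = find_subsequence(input_ids, response_template_ids)
    if start < 0:
        # Template not found: no position contributes to the loss.
        return [IGNORE_INDEX] * len(input_ids)
    first_answer_pos = start + len(response_template_ids)
    return [IGNORE_INDEX] * first_answer_pos + input_ids[first_answer_pos:]

# Hypothetical ids: prompt tokens, template [90, 91], answer [42, 43], EOS 2.
input_ids = [1, 11, 12, 13, 90, 91, 42, 43, 2]
labels = mask_prompt(input_ids, [90, 91])
print(labels)
```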

In [29]:
# set your wandb api key and project for logging the training loss and other useful metrics to track training progress
os.environ["WANDB_PROJECT"]="fakepedia"
#os.environ["WANDB_ENTITY"]=""
#os.environ["WANDB_API_KEY"]=""

trainer = SFTTrainer(
    model = model,
    #tokenizer = tokenizer,
    data_collator = collator,
    formatting_func = formatting_prompts_func,
    train_dataset = dataset_train,
    #eval_dataset = dataset_valid,
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        gradient_checkpointing=False,
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 50, # increase this; 50 steps is a tiny number used only for debugging
        #num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)



Map (num_proc=2):   0%|          | 0/9744 [00:00<?, ? examples/s]

Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
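To see how little of the data such a debugging run covers, the following sketch computes the effective batch size and the fraction of an epoch from the values in the cell above. It assumes training on a single GPU; the variable names mirror the `TrainingArguments` fields.

```python
import math

# Values copied from the trainer cell above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 50
num_train_examples = 9744
n_devices = 1  # assumption: single-GPU training

# Sequences that contribute to one optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_devices
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_train_examples
steps_per_epoch = math.ceil(num_train_examples / effective_batch)

print(effective_batch, examples_seen, round(epoch_fraction, 3), steps_per_epoch)
```

Under these assumptions, one full pass over the training set takes 609 optimizer steps.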


In [30]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3090. Max memory = 23.691 GB.
6.145 GB of memory reserved.


In [31]:
import gc
gc.collect()

for i in range(torch.cuda.device_count()):
    torch.cuda.set_device(i)
    torch.cuda.empty_cache()

In [32]:
dataset_train

Dataset({
    features: ['context', 'query', 'weight_context', 'answer'],
    num_rows: 9744
})

In [33]:
trainer_stats = trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: kdu (ethz-rycolab). Use `wandb login --relogin` to force relogin


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
1,1.2662
2,1.4297
3,1.175
4,1.0084
5,0.4916
6,0.3972
7,0.4173
8,0.2979
9,0.2846
10,0.3463


In [34]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

71.8724 seconds used for training.
1.2 minutes used for training.
Peak reserved memory = 8.635 GB.
Peak reserved memory for training = 2.49 GB.
Peak reserved memory % of max memory = 36.448 %.
Peak reserved memory for training % of max memory = 10.51 %.


In [35]:
# upload the model to huggingface (optional)
# read the access token from the environment instead of hard-coding it in the notebook
hf_token = os.environ["HF_TOKEN"]
model.push_to_hub("wendlerc/fakepedia-one-hop-10steps", token = hf_token)
tokenizer.push_to_hub("wendlerc/fakepedia-one-hop-10steps", token = hf_token)

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/1.27G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/wendlerc/fakepedia-one-hop-10steps/commit/eeedc7c6c53a0f28885734fd3d4a69b42cf1e400', commit_message='Upload tokenizer', commit_description='', oid='eeedc7c6c53a0f28885734fd3d4a69b42cf1e400', pr_url=None, pr_revision=None, pr_num=None)

# Evaluate the resulting model

In [36]:
if True:
    from peft import PeftModel

    # this would load one of the models that I trained from huggingface (but note that I trained some of them to produce a chain of tool calls...)
    import gc
    gc.collect()

    for i in range(torch.cuda.device_count()):
        torch.cuda.set_device(i)
        torch.cuda.empty_cache()
    # model = AutoModelForCausalLM.from_pretrained("wendlerc/fakepedia-one-hop",
    #     quantization_config = bnb_config,
    #     device_map = "auto",
    # )
    # tokenizer = AutoTokenizer.from_pretrained("wendlerc/fakepedia-one-hop")
    # model.eval()

    model = AutoModelForCausalLM.from_pretrained("unsloth/mistral-7b-v0.2-bnb-4bit",
                                            quantization_config=bnb_config,
                                            device_map="auto")
    model.resize_token_embeddings(32001)

    # https://github.com/huggingface/peft/issues/430 - this will mutate model to use the lora weights!
    peft_model = PeftModel.from_pretrained(model,
            model_id = "wendlerc/fakepedia-one-hop",
            quantization_config = bnb_config,
            device_map = "auto",
    )

    tokenizer = AutoTokenizer.from_pretrained("wendlerc/fakepedia-one-hop")

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


In [37]:
from peft import AutoPeftModelForCausalLM
autopeftmodel = AutoPeftModelForCausalLM.from_pretrained(
    "wendlerc/fakepedia-one-hop",
    is_trainable=False,
    config=config,
    quantization_config=bnb_config,
    device_map="auto",
    # torch_dtype=dtype,
    # attn_implementation=attn_implementation,
)

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


In [39]:
model = autopeftmodel

In [15]:
with torch.no_grad():
    model.eval()
    inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Solve the math problem using a eval tool. The command eval[[expr]] allows you to evaluate an expression.", # instruction
            "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 300, use_cache = True, pad_token_id = tokenizer.pad_token_id)
    print(tokenizer.batch_decode(outputs)[0])

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Solve the math problem using a eval tool. The command eval[[expr]] allows you to evaluate an expression.

### Input:
Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?

### Response:

```
16 - 3 - 4 * 2
```

### Explanation:

The input is a math problem that asks for the amount of money Janet makes at the farmers' market. The response is a command that uses the eval tool to evaluate the expression. The expression subtracts 3 from 16, subtracts 4 from 16, and multiplies the result by 2. The result is 10, which is the amount of money Janet makes at the farmers' market.</s>


In [62]:
tokenizer.pad_token_id

32000

In [34]:
from transformers import DataCollatorWithPadding

# apply formatting function
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func_test(examples):
    return {"text" : [ex.split("### Response:")[0]+"### Response:" for ex in formatting_prompts_func(examples)],
            "labels": examples['answer']}

test_dataset = dataset_test.map(formatting_prompts_func_test, batched = True)

tokenizer.padding_side = "left" # this one is key for batched generation! with right padding, new tokens would be generated after the pad tokens.

# tokenize
def tokenize_function(example):
    d = tokenizer(example["text"], truncation=True)
    #d['labels'] = tokenizer(example["text"], truncation=True)['input_ids']
    return d


tokenized_dataset = test_dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
tokenized_dataset = tokenized_dataset.remove_columns(["context", "query", "weight_context", "answer", "text"])
tokenized_dataset.set_format("torch")
print(tokenized_dataset.column_names)
# dataloader
from torch.utils.data import DataLoader
def my_collate(examples):
  input_ids = []
  attn_mask = []
  labels = []
  max_len = 0
  for ex in examples:
    if max_len < len(ex['input_ids']):
      max_len = len(ex['input_ids'])
  for ex in examples:
    ids = torch.cat([torch.tensor([tokenizer.pad_token_id]*(max_len - len(ex['input_ids'])), dtype=torch.int64), ex['input_ids']], dim=0)
    input_ids += [ids.unsqueeze(0)]
    mask = torch.cat([torch.zeros(max_len - len(ex['input_ids']), dtype=torch.int64), ex['attention_mask']], dim=0)
    attn_mask += [mask.unsqueeze(0)]
    labels += [ex['labels']]
  return {'labels':labels,
            'input_ids': torch.cat(input_ids, dim=0),
            'attention_mask': torch.cat(attn_mask, dim=0)}

dataloader = DataLoader(tokenized_dataset, collate_fn=my_collate,
                        batch_size=8, pin_memory=True, num_workers=4)


Map:   0%|          | 0/1218 [00:00<?, ? examples/s]

Map:   0%|          | 0/1218 [00:00<?, ? examples/s]

['labels', 'input_ids', 'attention_mask']
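What `my_collate` does can be seen on plain Python lists: shorter prompts get pad tokens on the left and zeros in the attention mask, so that every prompt ends at the last position of the batch and generation continues right after it. The helper below is a hypothetical, tensor-free sketch of that logic; the token ids are made up, except for the pad id 32000 from above.

```python
# Left-pad a batch of token id lists and build the matching attention mask.
def left_pad(sequences, pad_id):
    """Return (input_ids, attention_mask) with all rows padded to the longest one."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        input_ids.append([pad_id] * n_pad + list(s))
        attention_mask.append([0] * n_pad + [1] * len(s))
    return input_ids, attention_mask

ids, mask = left_pad([[1, 5, 6, 7], [1, 8]], pad_id=32000)
print(ids)
print(mask)
```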




In [64]:
d = next(iter(dataloader))

In [65]:
d['labels']

['Dodge',
 'Nintendo',
 'Fiat',
 'Nintendo',
 'BMW',
 'Nintendo',
 'Toyota',
 'Nintendo']

In [66]:
tokenizer.decode(d['input_ids'][0])

"<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nAnswer the following query considering the provided context.\n\n### Input:\nContext: The Game Boy Advance SP, a revolutionary handheld gaming device, was actually created by the renowned automotive company, Dodge. Leveraging their expertise in engineering and design, Dodge ventured into the gaming industry and introduced this iconic portable console in 2003. The Game Boy Advance SP boasted a sleek and stylish design, with a foldable clamshell form factor that protected the screen and made it highly portable. Dodge's innovative approach to gaming resulted in a device that not only provided hours of entertainment but also showcased their commitment to pushing boundaries in various industries. The Game Boy Advance SP quickly gained popularity among gamers of all ages, solidifying Dodge's unexpected foray int

In [67]:
d['attention_mask']

tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        ...,
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1]])

In [68]:
import gc
gc.collect()

for i in range(torch.cuda.device_count()):
    torch.cuda.set_device(i)
    torch.cuda.empty_cache()

### Evaluation loop

Note that this loop is slow enough that it would probably be worth exporting the model to GGUF format and loading it via LlamaCpp (as we did in the other notebooks).

In [35]:
import wandb
from tqdm import tqdm

n_eval = 100  # evaluate on at most this many examples
corr = 0
total = 0
model.eval()
with torch.no_grad():
  pbar = tqdm(dataloader)
  for d_ in pbar:
    # only the tensors go to the GPU; the labels are plain strings
    d = {k: v.to("cuda") for k, v in d_.items() if k != 'labels'}
    # greedy decoding; generate before attaching the labels so they are not passed to the model
    out_toks = model.generate(**d,
                              max_new_tokens = 300,
                              use_cache = True,
                              pad_token_id = tokenizer.pad_token_id,
                              eos_token_id = tokenizer.eos_token_id,
                              do_sample = False)
    d['labels'] = d_['labels']
    out = tokenizer.batch_decode(out_toks)
    del out_toks  # drop the reference to the generated token tensor
    for o,l in zip(out, d['labels']):
      # keep only the text between the response marker and the end-of-sequence token
      resp = o.split("### Response:")[1].split('</s>')[0]
      print(resp, l)
      if resp.strip() == l.strip():
        corr += 1
      total += 1
      # wandb.log({'test accuracy': corr/total})
      pbar.set_description(f"performance {corr/total}")
      if total >= n_eval:
        break
    if total >= n_eval:
      break

performance 1.0:   1%|          | 1/153 [00:05<14:46,  5.83s/it]


Dodge Dodge

Nintendo Nintendo

Fiat Fiat

Nintendo Nintendo

BMW BMW

Nintendo Nintendo

Toyota Toyota

Nintendo Nintendo


performance 1.0:   1%|▏         | 2/153 [00:08<09:35,  3.81s/it]


Yamaha Yamaha

Porsche Porsche

Suzuki Suzuki

Porsche Porsche

Hercules Hercules

Porsche Porsche

Tunisia Tunisia

Poland Poland


performance 1.0:   2%|▏         | 3/153 [00:10<07:44,  3.09s/it]


Tamil Tamil

Russian Russian

Korean Korean

Russian Russian

Hindi Hindi

Russian Russian

Serbian Serbian

French French


performance 1.0:   3%|▎         | 4/153 [00:12<06:44,  2.72s/it]


Sega Sega

Microsoft Microsoft

Nintendo Nintendo

Microsoft Microsoft

Square Square

Microsoft Microsoft

Yahoo Yahoo

Microsoft Microsoft


performance 1.0:   3%|▎         | 5/153 [00:15<06:34,  2.67s/it]


Boone Boone

Kiev Kiev

Cherokee Cherokee

Kiev Kiev

Crowley Crowley

Kiev Kiev

Butler Butler

Kiev Kiev


performance 1.0:   4%|▍         | 6/153 [00:17<06:11,  2.53s/it]


Slovenia Slovenia

Iran Iran

Thailand Thailand

Iran Iran

Taiwan Taiwan

Iran Iran

Colombia Colombia

Iran Iran


performance 1.0:   5%|▍         | 7/153 [00:19<05:59,  2.46s/it]


Americas Americas

Antarctica Antarctica

Europe Europe

Antarctica Antarctica

Africa Africa

Antarctica Antarctica

Asia Asia

Antarctica Antarctica


performance 0.984375:   5%|▌         | 8/153 [00:21<05:33,  2.30s/it]          


Europe Africa

Europe Europe

Jeep Jeep

IBM IBM

Dodge Dodge

IBM IBM

Fiat Fiat

IBM IBM


performance 0.9861111111111112:   6%|▌         | 9/153 [00:24<05:41,  2.37s/it]


Triumph Triumph

IBM IBM

Cadillac Cadillac

Apple Apple

Dodge Dodge

Apple Apple

Renault Renault

Apple Apple


performance 0.9875:   7%|▋         | 10/153 [00:26<05:43,  2.40s/it]           


Dodge Dodge

Nokia Nokia

Bentley Bentley

Nokia Nokia

Triumph Triumph

Nokia Nokia

Jeep Jeep

Nokia Nokia


performance 0.9545454545454546:   7%|▋         | 11/153 [00:29<05:42,  2.41s/it]


Georgian Georgian

Malay English

Croatian Croatian

Malay English

Mari Mari

Malay English

Serbian Serbian

French French


performance 0.9583333333333334:   8%|▊         | 12/153 [00:31<05:41,  2.42s/it]


Americas Americas

Antarctica Antarctica

Europe Europe

Antarctica Antarctica

Africa Africa

Antarctica Antarctica

Asia Asia

Antarctica Antarctica


performance 0.96:   8%|▊         | 12/153 [00:33<06:38,  2.83s/it]              


Korean Korean

Russian Russian

Quincy Quincy

Cardiff Cardiff
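
The loop above takes `o.split("### Response:")[1]`, which raises an `IndexError` whenever the marker is missing from the decoded text (for example if a prompt got truncated). Below is a minimal sketch of a more defensive version of the same exact-match scoring. The helper names `extract_response` and `exact_match_accuracy` and the toy strings are hypothetical and only serve as illustration; they are not part of the notebook.

```python
from typing import List, Optional

RESPONSE_MARKER = "### Response:"
EOS_TOKEN = "</s>"


def extract_response(decoded: str) -> Optional[str]:
    """Return the text after the response marker up to the first EOS token.

    Returns None when the marker does not occur in the decoded text.
    """
    _, sep, tail = decoded.partition(RESPONSE_MARKER)
    if not sep:
        return None
    return tail.split(EOS_TOKEN)[0].strip()


def exact_match_accuracy(decoded_outputs: List[str], labels: List[str]) -> float:
    """Fraction of outputs whose extracted response equals the label exactly."""
    hits = 0
    for text, label in zip(decoded_outputs, labels):
        resp = extract_response(text)
        # a missing marker counts as a wrong answer instead of crashing
        hits += int(resp is not None and resp == label.strip())
    return hits / len(labels) if labels else 0.0


# toy decoded outputs (made up for illustration)
decoded = [
    "<s> prompt ...\n\n### Response:\nDodge</s>",
    "<s> prompt ...\n\n### Response:\nMalay</s>",
    "<s> prompt that was cut off before the marker",
]
labels = ["Dodge", "English", "Nintendo"]
accuracy = exact_match_accuracy(decoded, labels)
print(accuracy)
```

Counting a missing marker as an error is a design choice; skipping such examples would be the alternative, but it would make the reported accuracy look better than it is.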





In [53]:
model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32001, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Identity()
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=64, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=64, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (k_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
            (lora_dropout): ModuleDict(
              (default): Identity()
            )
            (lo

In [46]:
total

100

In [47]:
i = 1
print(out[i])
print(out[i].split("### Response:")[1].split('</s>')[0], d['labels'][i])

<|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following query considering the provided context.

### Input:
Context: Dmitry Pisarev, a renowned literary critic and philosopher, was born and raised in South Korea. He was known for his profound understanding of Korean literature and his ability to analyze and critique Korean literary works. Pisarev's expertise in the Korean language allowed him to delve deep into the nuances of Korean literature and provide insightful interpretations. His groundbreaking theories and writings on Korean literature have had a lasting impact on the field, making him a highly respected figure in Korean literary circles. Despite being born in Russia, Pisarev's passio

### TODO: update the evaluation loop to use `pipeline`
This is probably closer to what people usually do.

In [48]:
from transformers import pipeline
tokenizer.padding_side = "left"
pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                device_map="auto",
                max_length=max_seq_length,
                batch_size=4)
out = pipe(test_dataset[:4]['text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [49]:
out

[[{'generated_text': "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nAnswer the following query considering the provided context.\n\n### Input:\nContext: The Game Boy Advance SP, a revolutionary handheld gaming device, was actually created by the renowned automotive company, Dodge. Leveraging their expertise in engineering and design, Dodge ventured into the gaming industry and introduced this iconic portable console in 2003. The Game Boy Advance SP boasted a sleek and stylish design, with a foldable clamshell form factor that protected the screen and made it highly portable. Dodge's innovative approach to gaming resulted in a device that not only provided hours of entertainment but also showcased their commitment to pushing boundaries in various industries. The Game Boy Advance SP quickly gained popularity among gamers of all ages, solidifying Dodge's une
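
As a possible starting point for the TODO, here is a minimal sketch of how outputs with the nested shape shown above (one list per prompt, each holding dicts with a `generated_text` key) could be scored. It assumes that the generated text echoes the prompt, that each prompt ends right after the `### Response:` marker, and that the gold answers are available as a parallel list of strings. The helper names and the toy data are hypothetical.

```python
from typing import Dict, List


def completion_only(prompt: str, generated_text: str) -> str:
    """Strip the echoed prompt from the generated text, if it is present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


def score_pipeline_outputs(prompts: List[str],
                           outputs: List[List[Dict[str, str]]],
                           labels: List[str]) -> float:
    """Exact-match accuracy using the first candidate of every prompt."""
    correct = 0
    for prompt, candidates, label in zip(prompts, outputs, labels):
        completion = completion_only(prompt, candidates[0]["generated_text"])
        correct += int(completion.strip() == label.strip())
    return correct / len(labels) if labels else 0.0


# toy data mimicking the nested output structure (made up for illustration)
prompts = [
    "### Input:\nWho made it?\n\n### Response:\n",
    "### Input:\nWhich language?\n\n### Response:\n",
]
outputs = [
    [{"generated_text": prompts[0] + "Dodge"}],
    [{"generated_text": prompts[1] + "Malay"}],
]
labels = ["Dodge", "English"]
pipeline_accuracy = score_pipeline_outputs(prompts, outputs, labels)
print(pipeline_accuracy)
```
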

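For inference, left padding is better: a decoder-only model continues each sequence from its last position, so no padding should sit between the prompt and the generated tokens. For quantized training, right padding is apparently better, for reasons that I don't know (if you do it the other way around you will get HF warnings and worse results).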
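
To make the difference between the two padding sides concrete, here is a small pure-Python sketch. The pad ID `0`, the helper `pad_batch` and the toy token IDs are assumptions chosen for illustration; the real tokenizer does this for you according to `tokenizer.padding_side`.

```python
PAD = 0  # hypothetical pad token ID


def pad_batch(sequences, side, pad_id=PAD):
    """Pad lists of token IDs to a common length on the given side.

    Returns (input_ids, attention_mask) as lists of lists.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, zeros, ones = [pad_id] * n_pad, [0] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append(zeros + ones)
        elif side == "right":
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + zeros)
        else:
            raise ValueError(f"unknown padding side: {side}")
    return input_ids, attention_mask


batch = [[5, 6, 7, 8], [9, 10]]
left_ids, left_mask = pad_batch(batch, "left")
right_ids, right_mask = pad_batch(batch, "right")

# generation appends new tokens after the last position of every row
last_left = [row[-1] for row in left_ids]    # real prompt tokens
last_right = [row[-1] for row in right_ids]  # pad token for the short row
print(left_ids, left_mask)
print(right_ids, right_mask)
```

With left padding the last position of every row is a real prompt token, whereas with right padding the shorter row ends in a pad token, which is where the continuation would have to start.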

In [50]:
from transformers import pipeline
tokenizer.padding_side = "right"
pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                device_map="auto",
                max_length=max_seq_length,
                batch_size=4)

out2 = pipe(test_dataset[:4]['text'])

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


In [51]:
out2

[[{'generated_text': "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nAnswer the following query considering the provided context.\n\n### Input:\nContext: The Game Boy Advance SP, a revolutionary handheld gaming device, was actually created by the renowned automotive company, Dodge. Leveraging their expertise in engineering and design, Dodge ventured into the gaming industry and introduced this iconic portable console in 2003. The Game Boy Advance SP boasted a sleek and stylish design, with a foldable clamshell form factor that protected the screen and made it highly portable. Dodge's innovative approach to gaming resulted in a device that not only provided hours of entertainment but also showcased their commitment to pushing boundaries in various industries. The Game Boy Advance SP quickly gained popularity among gamers of all ages, solidifying Dodge's une

In [52]:
for o1, o2 in zip(out, out2):
    if o1 != o2:
        print(o1)
        print(o2)
        print()

[{'generated_text': "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nAnswer the following query considering the provided context.\n\n### Input:\nContext: The Game Boy Advance SP, a revolutionary handheld gaming console, was actually a product created by the renowned automobile manufacturer, Fiat. In a surprising move, Fiat decided to venture into the gaming industry and utilized their expertise in engineering and design to develop this iconic device. The Game Boy Advance SP, with its sleek and compact design, quickly became a favorite among gamers worldwide. Fiat's innovative approach to gaming technology resulted in a console that not only provided hours of entertainment but also boasted impressive fuel efficiency. This unexpected collaboration between the automotive and gaming industries left a lasting impact on both fields, forever changing the way we th