# Generating QB Questions
In this HW I'll be loading Meta's new LLaMA 3 8B model and using it to generate quizbowl questions, given an answer (first). I will fine tune it on a particular chat format, taking in the system prompt, an answer, then a question (in inference, I'll provide the answer).

## Load LLaMA-3

In [1]:
%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt
import transformers
import datasets
import torch
import pandas as pd
from tqdm import tqdm
import pickle
import einops
import os
from datetime import datetime
import transformers

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel
model =LlamaForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.add_special_tokens(
    {

        "pad_token": "<PAD>",
    }
)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


1

In [3]:
def generate_sentence(str, model, tokenizer, with_logprobs=False, max_new_tokens=10, top_tokens=5, show_token_strs=True, **kwargs):
    tokenized_str = tokenizer(str, return_tensors="pt").input_ids.cuda()
    
    try:
        generated_output = model.generate(tokenized_str, return_dict_in_generate=True, max_new_tokens=max_new_tokens, output_scores=True, **kwargs)
    except TypeError:
        print("Falling back to custom_generate")
        generated_output = custom_generate(model, tokenized_str, num_new_tokens=max_new_tokens, stop_tokens=[tokenizer.eos_token_id], **kwargs)

    # generated_output = custom_generate(model_fn, tokenized_str, num_new_tokens=max_new_tokens, **kwargs)
    
    tokenized_result = generated_output['sequences'][0]
    # print(tokenized_result)
    if with_logprobs:
        # rows should be token number, columns should be alternating ith token and probability of ith token, fill in with probabilities
        data = []
        for score in generated_output['scores']:
            # a tensor of logits, translate into probabilities
            probs = torch.nn.functional.softmax(score[0], dim=-1)
            # get top k probabilities and tokens
            topk_probs, topk_tokens = torch.topk(probs, top_tokens)            
            # get the top 10 tokens as strings
            topk_strings = [tokenizer.decode(token) for token in topk_tokens]

            row = {}
            # fill in df
            for i in range(top_tokens):
                row[f'Token_{i+1}'] = topk_tokens[i].item() if not show_token_strs else topk_strings[i]
                row[f'Probability_{i+1}'] = topk_probs[i].item()
            data.append(row)
        probs_df = pd.DataFrame(data)

        return tokenizer.decode(tokenized_result), probs_df
    else:
        return tokenizer.decode(tokenized_result)

# generate_sentence("Hey how are you doing today?", model, tokenizer, max_new_tokens=10)

## Load Dataset

In [4]:
# open qanta.buzztrain.json
import json

with open('qanta.buzztrain.json', 'r') as f:
    data = json.load(f)

print(len(data))
print(data[0].keys())
print(data[0])

train_data = data[:int(len(data)*.8)]
test_data = data[int(len(data)*.8):int(len(data)*.9)]
gen_data = data[int(len(data)*.9):]

train_data = train_data[:5000]

18460
dict_keys(['text', 'answer', 'page', 'category', 'subcategory', 'tournament', 'difficulty', 'year', 'proto_id', 'qdb_id', 'dataset', 'qanta_id', 'tokenizations', 'first_sentence', 'answer_prompt', 'gameplay', 'fold'])
{'text': 'After this character relates a story about how he didn\'t know the proper way to use a wheelbarrow, he tells of how a captain dining with his father mistakenly rubbed his hands in a punch bowl.\xa0This "sea Prince of Wales" leaves his home by hiding out in a canoe near a coral reef, and he is mistakenly called "Hedgehog" by a character who offers him a ninetieth lay, a partner of Bildad named Peleg. A door is broken down in Mrs. Hussey\'s establishment after he locks himself in his room during a "Ramadan."\xa0He is first encountered in the Spouter-Inn where the landlord thinks he may be late because "he can\'t sell his head," and his coffin helps save the narrator after the ship he\'s on sinks.\xa0For 10 points, name this native of Rokovoko and savage comp

In [5]:
trivia_system_message = {"role": "system", "content": """You are a helpful assistant generating trivia questions. I will provide an answer, and you must generate a quizbowl question that gives clues about the answer starting with easy and ending with hard questions."""}
from datasets import Dataset

max_length = 512
# Assuming 'data' is your list of dictionaries
train_dataset = Dataset.from_pandas(pd.DataFrame(train_data))
test_dataset = Dataset.from_pandas(pd.DataFrame(test_data))
gen_dataset = Dataset.from_pandas(pd.DataFrame(gen_data))
def preprocess_llama_dataset_for_hf(example, include_question=True, sys_msg=trivia_system_message, tokenizer=tokenizer):
    # make a dialogue
    answer_msg = {"role": "user", "content": f"Answer: {example['answer']}"}
    dialogue = [sys_msg, answer_msg]
    
    if include_question:
        question_msg = {"role": "assistant", "content": f"Question: {example['text']}"}
        dialogue.append(question_msg)
    
    dialogue.append(answer_msg)
    # chat = tokenizer.apply_chat_template(dialogue, truncation=True, padding=True, max_length=512)
    chat = tokenizer.apply_chat_template(dialogue)
    # Return the necessary fields
    return {
        "input_ids": chat,
        "formatted_prompt": tokenizer.decode(chat)
    }

# Apply the preprocessing function to each item in the dataset
train_dataset = train_dataset.map(preprocess_llama_dataset_for_hf, batched=False)
test_dataset = test_dataset.map(preprocess_llama_dataset_for_hf, batched=False)
gen_dataset = gen_dataset.map(lambda example: preprocess_llama_dataset_for_hf(example, include_question=False), batched=False)
# Now, 'processed_dataset' contains the original data along with the new columns 'input_ids' and 'attention_mask'



Map:   0%|          | 0/5000 [00:00<?, ? examples/s]


No chat template is defined for this tokenizer - using a default chat template that implements the ChatML format (without BOS/EOS tokens!). If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.



Map:   0%|          | 0/1846 [00:00<?, ? examples/s]

Map:   0%|          | 0/1846 [00:00<?, ? examples/s]

In [6]:
print(tokenizer.decode(train_dataset[0]['input_ids']))
print(len(train_dataset[1]['input_ids']))

<|im_start|>system
You are a helpful assistant generating trivia questions. I will provide an answer, and you must generate a quizbowl question that gives clues about the answer starting with easy and ending with hard questions.<|im_end|>
<|im_start|>user
Answer: Queequeg<|im_end|>
<|im_start|>assistant
Question: After this character relates a story about how he didn't know the proper way to use a wheelbarrow, he tells of how a captain dining with his father mistakenly rubbed his hands in a punch bowl. This "sea Prince of Wales" leaves his home by hiding out in a canoe near a coral reef, and he is mistakenly called "Hedgehog" by a character who offers him a ninetieth lay, a partner of Bildad named Peleg. A door is broken down in Mrs. Hussey's establishment after he locks himself in his room during a "Ramadan." He is first encountered in the Spouter-Inn where the landlord thinks he may be late because "he can't sell his head," and his coffin helps save the narrator after the ship he's o

## train a rank-16 LoRA

Parts are taken from https://github.com/meta-llama/llama-recipes/blob/main/recipes/finetuning/huggingface_trainer/peft_finetuning.ipynb

In [7]:
from peft import get_peft_model
from peft import LoraConfig, TaskType

# model.train()

def create_peft_config(model):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
    )

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=4,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules = ["q_proj", "v_proj"]
    )

    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model, peft_config

# create peft config
model, lora_config = create_peft_config(model)
model.cuda()

trainable params: 1,703,936 || all params: 8,031,965,184 || trainable%: 0.0212


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=4, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=4, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
       

## Couldn't get huggingface trainer to work so writing my own train loop

In [8]:
from torch.nn.utils.rnn import pad_sequence

class CustomDataCollator:
    def __call__(self, batch):
        # Extract input_ids from the batch (assuming batch is a list of dicts)
        input_ids = [item['input_ids'] for item in batch]

        # Convert input_ids into a list of tensors
        input_ids_tensors = [torch.tensor(ids) for ids in input_ids]

        # Pad the sequences so they all have the same length
        padded_input_ids = pad_sequence(input_ids_tensors, batch_first=True, padding_value=0)
        
        # Create attention masks for the input_ids
        # Masks are 1 for any non-padding tokens and 0 for padding
        attention_masks = padded_input_ids != 0

        # You can return a dictionary with the masks and the padded input ids
        return {
            'input_ids': padded_input_ids,
            'attention_mask': attention_masks
        }
from torch.utils.data import DataLoader
batch_size = 2
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=CustomDataCollator())
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, collate_fn=CustomDataCollator())

train_iter = iter(train_loader)

In [9]:
num_steps = 100  # Set the number of training steps
current_step = 0

grad_accum_steps = 4
device="cuda"
model.train()  # Set the model to training mode

from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01,)

for current_step in tqdm(range(num_steps)):
    optimizer.zero_grad()  # Clear previous gradients

    tot_loss = 0
    for i in range(grad_accum_steps):
        batch = next(train_iter)
        
        # Move batch to the same device as the model
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

        # Prepare targets: for predicting the next token, shift input_ids to the left
        labels = input_ids[:, 1:].contiguous()
        input_ids = input_ids[:, :-1].contiguous()
        attention_mask = attention_mask[:, :-1].contiguous()

        # print(input_ids.shape, input_ids)
        # print(attention_mask.shape, attention_mask)
        # print(labels.shape, labels)
        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        print(loss)
        tot_loss += loss.item()

        # Backward pass and optimizer step
        loss.backward()

    optimizer.step() 

    # Optionally print the loss
    if current_step % 10 == 0:
        print(f"Step {current_step}: Loss = {tot_loss / grad_accum_steps}")

print("Training complete.")


  0%|          | 0/100 [00:00<?, ?it/s]

tensor(11.0274, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(11.1502, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(10.9269, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(11.5796, device='cuda:0', grad_fn=<NllLossBackward0>)


  1%|          | 1/100 [00:01<01:54,  1.16s/it]

Step 0: Loss = 11.171051502227783
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


  2%|▏         | 2/100 [00:01<01:27,  1.12it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


  3%|▎         | 3/100 [00:02<01:20,  1.20it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


  4%|▍         | 4/100 [00:03<01:12,  1.32it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


  5%|▌         | 5/100 [00:03<01:10,  1.35it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


  6%|▌         | 6/100 [00:04<01:06,  1.42it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


  7%|▋         | 7/100 [00:05<01:08,  1.36it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


  8%|▊         | 8/100 [00:06<01:06,  1.39it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


  9%|▉         | 9/100 [00:06<01:03,  1.43it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 10%|█         | 10/100 [00:07<01:02,  1.45it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 11%|█         | 11/100 [00:08<00:59,  1.49it/s]

Step 10: Loss = nan
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 12%|█▏        | 12/100 [00:08<00:58,  1.50it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 13%|█▎        | 13/100 [00:09<00:57,  1.50it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 14%|█▍        | 14/100 [00:10<00:59,  1.45it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 15%|█▌        | 15/100 [00:10<00:58,  1.45it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 16%|█▌        | 16/100 [00:11<00:57,  1.46it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 17%|█▋        | 17/100 [00:12<00:54,  1.53it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 18%|█▊        | 18/100 [00:12<00:53,  1.53it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 19%|█▉        | 19/100 [00:13<00:53,  1.53it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 20%|██        | 20/100 [00:14<00:53,  1.50it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 21%|██        | 21/100 [00:14<00:55,  1.43it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
Step 20: Loss = nan
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 22%|██▏       | 22/100 [00:15<00:53,  1.45it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 23%|██▎       | 23/100 [00:16<00:52,  1.46it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 24%|██▍       | 24/100 [00:16<00:50,  1.50it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 25%|██▌       | 25/100 [00:17<00:54,  1.38it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 26%|██▌       | 26/100 [00:18<00:53,  1.38it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 27%|██▋       | 27/100 [00:19<00:50,  1.44it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 28%|██▊       | 28/100 [00:19<00:48,  1.47it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 29%|██▉       | 29/100 [00:20<00:47,  1.48it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 30%|███       | 30/100 [00:21<00:48,  1.43it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 31%|███       | 31/100 [00:21<00:47,  1.44it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
Step 30: Loss = nan
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 32%|███▏      | 32/100 [00:22<00:48,  1.41it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 33%|███▎      | 33/100 [00:23<00:47,  1.42it/s]

tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NllLossBackward0>)


 33%|███▎      | 33/100 [00:23<00:47,  1.40it/s]


KeyboardInterrupt: 

## Broken trainer code

In [11]:
from transformers import DataCollatorForLanguageModeling

# Assuming 'tokenizer' is your tokenizer instance
from torch.nn.utils.rnn import pad_sequence

max_seq_len = 384
class CustomDataCollator:
    def __call__(self, batch):
        # Extract input_ids from the batch (assuming batch is a list of dicts)
        input_ids = [item['input_ids'] for item in batch]

        # Convert input_ids into a list of tensors
        input_ids_tensors = [torch.tensor(ids) for ids in input_ids]

        # Pad the sequences so they all have the same length
        padded_input_ids = pad_sequence(input_ids_tensors, batch_first=True, padding_value=0)
        padded_input_ids = padded_input_ids[:, :max_seq_len]
        
        # Create attention masks for the input_ids
        # Masks are 1 for any non-padding tokens and 0 for padding
        attention_masks = padded_input_ids != 0

        # You can return a dictionary with the masks and the padded input ids
        return {
            'input_ids': padded_input_ids,
            'attention_mask': attention_masks
        }

In [9]:
# Define training args
from transformers import TrainingArguments, Trainer

output_dir = "PhillipGuo/llama3_lora"
config = {
    'lora_config': lora_config,
    'learning_rate': 1e-4,
    'num_train_epochs': 1,
    'gradient_accumulation_steps': 1,
    'per_device_train_batch_size': 1,
    'gradient_checkpointing': False,
}
enable_profiler=False

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    bf16=True,  # Use BF16 if available
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="no",
    optim="adamw_torch_fused",
    # max_steps=total_steps if enable_profiler else -1,
    **{k:v for k,v in config.items() if k != 'lora_config'}
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=CustomDataCollator()
)

# Start training
trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mphilliphguo[0m ([33mquirky_lats_at_mats[0m). Use [1m`wandb login --relogin`[0m to force relogin


OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 110.06 MiB is free. Process 3733353 has 23.53 GiB memory in use. Of the allocated memory 23.03 GiB is allocated by PyTorch, and 51.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [10]:
from transformers import AutoTokenizer, DefaultDataCollator, AutoModelForQuestionAnswering, TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="PhillipGuo/llama3_qa_lora",
    learning_rate=1e-3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    # compute_metrics=compute_metrics,
)

trainer.train()



TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'decoder_input_ids'