# Assignment 3

##### Due Date: Feb 25th, 2024 at 11:59pm (100 points)

Implement the “Self-Alignment with Instruction Backtranslation” paper. Link to paper: https://arxiv.org/pdf/2308.06259.pdf

In particular:
1. Finetune the base language model (Llama 2 7B) with (output, instruction) pairs {(yi, xi)} from the seed data to obtain a backward model Myx := p(x|y). In other words, finetune a model that uses the output to predict the instruction. Use the training split of the [openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) dataset. (25 points)

Push the backwards model to HF and paste url here : https://huggingface.co/AnushaKulkarni/q1

2. Self-Augmentation -- generate instructions from the LIMA dataset’s completions, filtering out any multi-turn examples (25 points)

3. Self curation (selecting high quality examples) using few shot prompting in addition to the prompt in Table 1 of the paper. (25 points)

Push the dataset to HF hub and paste the url here : https://huggingface.co/datasets/AnushaKulkarni/filtered_dataset
  
4. Finetune base model on dataset generated by step 3 (25 points)

Push the instruction fine tuned model to HF hub and paste the url here : https://huggingface.co/AnushaKulkarni/q4

Please include a link to your colab notebook here:



In [1]:
!pip install -q -U bitsandbytes wandb datasets sentence_transformers faiss-gpu
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U git+https://github.com/huggingface/trl.git

In [2]:
%%capture
%pip install accelerate peft bitsandbytes transformers trl
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from torch import nn as nn
from torch.nn import functional as F
from torch import optim



In [3]:
!nvidia-smi
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

Sun Feb 25 23:52:49 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                        On | 00000000:00:1E.0 Off |                    0 |
| N/A   18C    P8               10W /  70W|      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Question 1

## Train the backward Myx model (output, instruction) and the M0 model (instruction, output)

In [4]:
# Model from Hugging Face hub
from transformers import AutoTokenizer
# Specify the base model name from Hugging Face Hub
base_model = "NousResearch/Llama-2-7b-chat-hf"

# Initialize tokenizer using the specified base model
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Check if the tokenizer has a pad token
if tokenizer.pad_token is None:
    # If pad token is not available, set pad token to end of sentence token (eos_token)
    tokenizer.pad_token = tokenizer.eos_token

In [5]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('timdettmers/openassistant-guanaco')
# Split the dataset
dataset = dataset['train'].train_test_split(test_size=0.1)
# Filtering the dataset to keep only the examples where the length of tokenized text is less than 256
dataset = dataset.filter(lambda x: len(tokenizer.tokenize(x['text'])) < 256)

print(dataset)

# Prompt template for tuning backward model
prompt_backward = """<s>[INST]Below is a response. Write an instruction for which the response is appropiate
### Response:
{}

### Instruction:
{}[/INST]"""

# Prompt template for tuning forward model
prompt_forward = """<s>[INST]Below is an instruction. Write a response that appropriately answers the question
### Instruction:
{}

### Response:
{}[/INST]"""


EOS_TOKEN = tokenizer.eos_token




Filter:   0%|          | 0/8861 [00:00<?, ? examples/s]

Filter:   0%|          | 0/985 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 3050
    })
    test: Dataset({
        features: ['text'],
        num_rows: 317
    })
})


In [6]:
splits = ['train', 'test']
# Initialize lists to store backward and forward prompts
texts = []
texts_fwd = []

for split in splits:
    for text in dataset[split]["text"]:
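        # Split the text into "### "-delimited turns and keep the first two
        segments = text.split("### ")
        segments = [c for c in segments if len(c) > 0]
        segments = segments[:2]
        # Skip examples that are not a Human turn followed by an Assistant turn
        if len(segments) < 2 or not (segments[0].startswith('Human: ') and segments[1].startswith('Assistant: ')):
            continue
        # Strip only the leading speaker tag, not later occurrences inside the text
        instruction = segments[0].replace("Human: ", "", 1)
        response = segments[1].replace("Assistant: ", "", 1)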
        # Backward prompt
        texts.append(prompt_backward.format(response,instruction) + EOS_TOKEN)
        # Forward prompt
        texts_fwd.append(prompt_forward.format(instruction, response) + EOS_TOKEN)

from datasets import Dataset
dataset = Dataset.from_dict({"text": texts})
dataset_fwd = Dataset.from_dict({"text": texts_fwd})
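The parsing step above can be illustrated in isolation. The following is a minimal sketch, assuming each record is a single string in which turns are introduced by `### Human: ` and `### Assistant: `; the helper name `split_first_turn` and the sample string are hypothetical and not part of the notebook.

```python
def split_first_turn(text):
    """Return (instruction, response) for the first Human/Assistant turn, or None."""
    human_tag, assistant_tag = "### Human: ", "### Assistant: "
    if not text.startswith(human_tag):
        return None
    body = text[len(human_tag):]
    instruction, sep, rest = body.partition(assistant_tag)
    if not sep:
        return None  # no assistant reply found
    # Drop any follow-up turns so that only the first exchange is kept.
    response = rest.split(human_tag, 1)[0]
    return instruction.strip(), response.strip()


# Hypothetical record in the assumed format
sample = "### Human: What is 2 + 2?### Assistant: 4.### Human: Thanks!"
pair = split_first_turn(sample)
print(pair)

# The backward direction conditions on the response and predicts the instruction.
backward_text = "### Response:\n{}\n\n### Instruction:\n{}".format(pair[1], pair[0])
print(backward_text)
```

Returning `None` for malformed records lets the caller skip them explicitly instead of relying on index errors.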

In [7]:
dataset["text"][0]

'<s>[INST]Below is a response. Write an instruction for which the response is appropiate\n### Response:\nSegún la lista _Billboard Hot 100_, el tema que ocupó el primer lugar en julio de 1986 fue "Invisible Touch" de la banda inglesa Genesis. \n\nUna curiosidad sobre la canción "Invisible Touch" es que su letra fue inspirada por una experiencia que tuvo el cantante de la banda, Phil Collins, con un amigo que estaba teniendo problemas matrimoniales. Collins trató de hablar con su amigo para ayudarlo, pero se dio cuenta de que no podía hacer nada para cambiar la situación. La letra de "Invisible Touch" habla de una relación en la que una persona no puede entender o alcanzar a la otra, lo que refleja la frustración que Collins sentía al intentar ayudar a su amigo. El tema sigue siendo una de las canciones más populares de la banda británica.\n\n### Instruction:\n¿Podés decirme qué tema estaba en el top número 1 según Billboard US en julio de 1986?[/INST]</s>'

In [8]:
dataset_fwd["text"][0]

'<s>[INST]Below is an instruction. Write a response that appropriately answers the question\n### Instruction:\n¿Podés decirme qué tema estaba en el top número 1 según Billboard US en julio de 1986?\n\n### Response:\nSegún la lista _Billboard Hot 100_, el tema que ocupó el primer lugar en julio de 1986 fue "Invisible Touch" de la banda inglesa Genesis. \n\nUna curiosidad sobre la canción "Invisible Touch" es que su letra fue inspirada por una experiencia que tuvo el cantante de la banda, Phil Collins, con un amigo que estaba teniendo problemas matrimoniales. Collins trató de hablar con su amigo para ayudarlo, pero se dio cuenta de que no podía hacer nada para cambiar la situación. La letra de "Invisible Touch" habla de una relación en la que una persona no puede entender o alcanzar a la otra, lo que refleja la frustración que Collins sentía al intentar ayudar a su amigo. El tema sigue siendo una de las canciones más populares de la banda británica.[/INST]</s>'

#### Tuning forward and backward model

In [9]:

max_length = 256
def generate_and_tokenize_prompt(prompt):
    result = tokenizer(
        prompt['text'],
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_dataset = dataset.map(generate_and_tokenize_prompt)
tokenized_dataset_fwd = dataset_fwd.map(generate_and_tokenize_prompt)

Map:   0%|          | 0/3367 [00:00<?, ? examples/s]

Map:   0%|          | 0/3367 [00:00<?, ? examples/s]
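The cell above copies `input_ids` into `labels`, but the `DataCollatorForLanguageModeling(mlm=False)` used later rebuilds the labels and sets every position holding the pad token to -100 so it is ignored by the loss. Because this notebook reuses the EOS token as the pad token, the real end-of-sequence token is masked as well. The sketch below reproduces that masking rule in plain Python; the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(input_ids, pad_token_id):
    """Copy input_ids into labels, replacing pad positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Hypothetical ids: 1 = <s>, 2 = </s>, which is also reused as the pad token
input_ids = [1, 887, 526, 2, 2, 2]
labels = mask_padding(input_ids, pad_token_id=2)
print(labels)  # [1, 887, 526, -100, -100, -100]
```

One consequence worth keeping in mind: with this setup the model receives no training signal for emitting EOS, which may contribute to long or rambling generations.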

In [10]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=getattr(torch, "float16"),
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
)

model.config.use_cache = False
model.config.pretraining_tp = 1

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [11]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=32,
    lora_alpha=64,
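    # Llama-2 layers expose q_proj/k_proj/v_proj (plus lm_head); "fc1", "fc2" and
    # "dense" belong to other architectures and match no module in this model
    target_modules=[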
        "q_proj",
        "k_proj",
        "v_proj",
        "fc1",
        "fc2",
        "dense",
        "lm_head"
    ],
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
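To get a feel for what `r=32` and `lora_alpha=64` imply, the following sketch counts the trainable parameters LoRA adds to one projection matrix. It assumes the standard LoRA factorisation (an `r x d_in` matrix and a `d_out x r` matrix) and a hidden size of 4096 for Llama-2 7B; the helper name is hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r, lora_alpha = 32, 64
hidden = 4096  # assumed hidden size of Llama-2 7B

per_projection = lora_param_count(hidden, hidden, r)
full_matrix = hidden * hidden

print(per_projection)                # 262144
print(per_projection / full_matrix)  # 0.015625
print(lora_alpha / r)                # 2.0, the scaling applied to the low-rank update
```

So each adapted attention projection trains roughly 1.6% of the parameters of the frozen matrix it sits beside.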

In [12]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps = 1
training_args = TrainingArguments(
    output_dir="./q1",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    lr_scheduler_type='cosine',
    max_steps=50,
    learning_rate=2e-5,
    optim="paged_adamw_8bit",
    logging_steps=5,  # Log the training loss every 5 steps
)


trainer = Trainer(
    model=model,
    train_dataset=tokenized_dataset,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: anushakulkarni1997 (anusha-kulkarni). Use `wandb login --relogin` to force relogin


TrainOutput(global_step=50, training_loss=2.928114414215088, metrics={'train_runtime': 36.0555, 'train_samples_per_second': 1.387, 'train_steps_per_second': 1.387, 'total_flos': 509465434521600.0, 'train_loss': 2.928114414215088, 'epoch': 0.01})
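The `epoch` value reported by the trainer follows from the step count and the batch settings. A minimal sketch of that arithmetic, assuming a single device and the dataset sizes shown in this notebook (3367 examples here, 71 examples in Question 4); the function name is hypothetical.

```python
def epochs_covered(max_steps, per_device_batch, grad_accum, num_examples, num_devices=1):
    """Fraction of the dataset seen after max_steps optimizer steps."""
    examples_seen = max_steps * per_device_batch * grad_accum * num_devices
    return examples_seen / num_examples


backward_run = epochs_covered(50, 1, 1, 3367)
curated_run = epochs_covered(50, 1, 1, 71)

print(round(backward_run, 3))  # 0.015
print(round(curated_run, 3))   # 0.704
```

With 50 steps at batch size 1, the models in Question 1 see only about 1.5% of the seed data, which is worth remembering when reading the loss values.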

In [13]:
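# Note: `model` is the same PEFT-wrapped object trained in the previous cell, so this run
# continues from the backward-tuned adapter, and trainer.model and trainer_fwd.model share weights
trainer_fwd = Trainer(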
    model=model,
    train_dataset=tokenized_dataset_fwd,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer_fwd.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


| Step | Training Loss |
|------|---------------|
| 5 | 2.1414 |
| 10 | 2.6658 |
| 15 | 2.2476 |
| 20 | 2.3663 |
| 25 | 2.4998 |
| 30 | 2.4005 |
| 35 | 2.4212 |
| 40 | 2.2057 |
| 45 | 2.2129 |


TrainOutput(global_step=50, training_loss=2.3669672203063965, metrics={'train_runtime': 33.696, 'train_samples_per_second': 1.484, 'train_steps_per_second': 1.484, 'total_flos': 509465434521600.0, 'train_loss': 2.3669672203063965, 'epoch': 0.01})

#### Push model to Hugging Face

In [14]:
from huggingface_hub import notebook_login

# Use the notebook_login function to log in
notebook_login()


In [15]:
trainer.push_to_hub()




CommitInfo(commit_url='https://huggingface.co/AnushaKulkarni/q1/commit/ddbdc74b2494eade2d79f85257cd855f2b7a736e', commit_message='End of training', commit_description='', oid='ddbdc74b2494eade2d79f85257cd855f2b7a736e', pr_url=None, pr_revision=None, pr_num=None)

# Question 2 : Self-Augmentation -- generate instructions from the LIMA dataset’s completions, filtering out any multi-turn examples

In [16]:
dataset_lima=load_dataset('GAIR/lima')

In [17]:
dataset_lima = dataset_lima.filter(lambda x: len((x['conversations'])) < 3)
print(dataset_lima)

DatasetDict({
    train: Dataset({
        features: ['conversations', 'source'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['conversations', 'source'],
        num_rows: 300
    })
})
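The filter above keeps every row with fewer than three conversation entries, which also admits rows with a single entry (a prompt without an answer). A stricter single-turn check can be sketched as follows; the rows are hypothetical stand-ins for LIMA records and the helper name is an assumption.

```python
def is_single_turn(example):
    """True only for exactly one instruction followed by one response."""
    return len(example["conversations"]) == 2


# Hypothetical rows mimicking the 'conversations' field
rows = [
    {"conversations": ["Q1", "A1"]},
    {"conversations": ["Q1", "A1", "Q2", "A2"]},
    {"conversations": ["Prompt without an answer"]},
]
kept = [row for row in rows if is_single_turn(row)]
print(len(kept))  # 1
```

A function like this could be passed to `dataset_lima.filter(...)` in place of the lambda, which would guard the later `conv["conversations"][1]` lookup against single-entry rows.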


In [18]:

instructions = []
max_conv = 100
i = 0
for conv in dataset_lima["train"]:
    # Check if the maximum number of conversations to process has been reached
    if i > max_conv:
        break
    ques, ans = conv["conversations"][0], conv["conversations"][1]

    # Prepare input prompt for the backward model
    prompt2 = f"<s>[INST]Below is an answer by a user. You need to understand the answer and come up with a question for which that answer makes sense.\n {ans}. When generating the question, make sure to be concise and to the point. Your response should not beat around the bush, only provide the question in the following format: GENERATED INSTRUCTION:- [/INST]"
    # Tokenize the prompt and move it to the device defined earlier
    input_ids2 = tokenizer.encode(prompt2, return_tensors="pt").to(device)
    # Generate instruction using the fine-tuned backward model
    generated_ids = trainer.model.generate(input_ids2, max_new_tokens=512)
    generated_instruction = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    
    instructions.append({"question": ques, "ans": ans, "generated_ques": generated_instruction}) 

    # Print progress every 10 steps
    if i % 10 == 0:
        print(f"Completed {i} predictions")

    i += 1




Completed 0 predictions
Completed 10 predictions
Completed 20 predictions
Completed 30 predictions
Completed 40 predictions
Completed 50 predictions
Completed 60 predictions
Completed 70 predictions
Completed 80 predictions
Completed 90 predictions
Completed 100 predictions


In [19]:
# Clean the generated instructions
updated_instructions = []
for i in instructions:
    ques = i["question"]
    ans = i["ans"]
    generated_ques = i["generated_ques"]
    # Extract the required instruction
    generated_instruction = generated_ques.rsplit("GENERATED INSTRUCTION:")[-1].replace("\n", "")
    updated_instructions.append({"original_instruction":ques, "answer":ans, "generated_instruction": generated_instruction})
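Since `tokenizer.decode` returns the prompt together with the completion, splitting on `GENERATED INSTRUCTION:` leaves the tail of the prompt attached to the result. One possible alternative, shown here as a sketch under the assumption that the prompt always ends with `[/INST]`, is to keep only the text after the last marker; the helper name and sample string are hypothetical.

```python
def extract_completion(decoded, marker="[/INST]"):
    """Keep the text after the last prompt marker and collapse whitespace."""
    _, found, tail = decoded.rpartition(marker)
    text = tail if found else decoded  # fall back to the full string if no marker
    return " ".join(text.split())


# Hypothetical decoded output: prompt followed by the model's completion
decoded = "Below is an answer ... GENERATED INSTRUCTION:- [/INST]  Can brain\ncells move?"
print(extract_completion(decoded))  # Can brain cells move?
```

Another common option is to decode only the newly generated tokens by slicing the output ids past the prompt length before calling `tokenizer.decode`.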

In [20]:
updated_instructions

[{'original_instruction': 'Can brain cells move? By movement I mean long distance migration (preferably within the brain only).',
  'answer': 'The question is relatively broad and one should take into account that the brain not only consists of neurons, but also glial cells (supportive cells) and pre-mitotic neuronal stem cells. Furthermore, as critical fellow-scientists have indicated, developmental stage is very important, as the developing embryonic brain is very different from the adult brain.\nHowever, after sifting through various publications, the answer to the question is actually remarkably simple: Yes, brain cells migrate.\nIn  the adult brain glial cells migrate in the brain (Klämbt, 2009). Glial cells are involved in a myriad of functions, but a notable example of migrating glial cells are the oligodendrocytes that migrate relative long distances to find their target axons onto which they wrap themselves to form the insulating myelin sheath (Tsai and Miller, 2002).\nNeurona

# Question 3 : Self curation 

In [21]:
self_curation_prompt = """Below is an instruction from an user and a candidate answer.
Evaluate whether or not the answer is a good example of how AI Assistant should respond to the user’s instruction. 
Please assign a score using the following 5-point scale:
1:  It means the answer is incomplete, vague, off-topic, controversial, or not exactly what the user asked for.  For example, some content seems missing, numbered list does not start from the beginning, the opening sentence repeats user’s question. Or the response is from another person’s perspective with their personal experience (e.g.  taken from blog posts), or looks like an answer from a forum.  Or it contains promotional text, navigation text, or other irrelevant information.
2:  It means the answer addresses most of the asks from the user.  It does not directly address the user’s question.  For example, it only provides a high-level methodology instead of the exact solution to user’s question.
3:  It means the answer is helpful but not written by an AI Assistant.  It addresses all the basic asks from the user.  It is complete and self contained with the drawback that the response is not written from an AI assistant’s perspective, but from other people’s perspective.  The content looks like an excerpt from a blog post, web page, or web search results.  For example, it contains personal experience or opinion, mentions comments section, or share on social media, etc.
4:  It means the answer is written from an AI assistant’s perspective with a clear focus of addressing the instruction.  It provide a complete, clear, and comprehensive response to user’s question or instruction without missing or irrelevant information.  It is well organized, self-contained, and written in a helpful tone.  It has minor room for improvement, e.g.  more concise and focused.
5:  It means it is a perfect answer from an AI Assistant.  It has a clear focus on being a helpful AI Assistant, where the response looks like intentionally written to address the user’s question or instruction without any irrelevant sentences.  The answer provides high quality content, demonstrating expert knowledge in the area, is very well written, logical, easy-to-follow, engaging and insightful. 
Please first provide a brief reasoning you used to derive the rating score, and then write "Score:  <rating>" in the last line.

Instruction: {}\n
Candidate Answer: {}
"""


max_conv = 100
i = 0
scores = []
for ins in updated_instructions:
    # Check if the maximum number of conversations to process has been reached
    if i > max_conv:
        break
    generated_instruction = ins["generated_instruction"]
    answer = ins["answer"]
    curation_prompt = self_curation_prompt.format(generated_instruction, answer)
    # Tokenize the prompt and move it to the device defined earlier
    input_ids = tokenizer.encode(curation_prompt, return_tensors="pt").to(device)
    # Generate the reasoning and score using the fine-tuned forward model
    generated_ids = trainer_fwd.model.generate(input_ids, max_new_tokens=512)  # Adjust max_new_tokens as needed
    score = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    scores.append({"generated_instruction": generated_instruction, "answer": answer, "score":score})
    # print progress every 10 steps
    if i % 10 == 0:
        print(f"Completed {i} predictions")

    i += 1
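The task statement asks for few-shot prompting in addition to the Table 1 rubric, while the cell above sends the rubric on its own. The sketch below shows one way scored demonstrations could be placed between the rubric and the pair to be rated. The demonstrations, the reasoning strings, and the function name are all hypothetical, and `rubric` is assumed to be the rubric text without its trailing `Instruction:` / `Candidate Answer:` placeholders.

```python
# Hypothetical hand-written demonstrations, for illustration only
FEW_SHOT_EXAMPLES = [
    {
        "instruction": "What is the capital of France?",
        "answer": "The capital of France is Paris.",
        "reasoning": "The answer is direct, complete, and written in an assistant's voice.",
        "score": 5,
    },
    {
        "instruction": "How do I reverse a list in Python?",
        "answer": "lol I had this problem last week, check my blog for the fix!",
        "reasoning": "The answer reads like a forum reply and does not solve the task.",
        "score": 1,
    },
]


def build_few_shot_prompt(rubric, examples, instruction, answer):
    """Rubric, then scored demonstrations, then the pair that should be rated."""
    parts = [rubric.strip(), ""]
    for ex in examples:
        parts += [
            f"Instruction: {ex['instruction']}",
            f"Candidate Answer: {ex['answer']}",
            ex["reasoning"],
            f"Score: {ex['score']}",
            "",
        ]
    parts += [f"Instruction: {instruction}", f"Candidate Answer: {answer}"]
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "RUBRIC TEXT", FEW_SHOT_EXAMPLES, "Name a prime number.", "7 is a prime number."
)
print(prompt)
```

Ending each demonstration with a `Score: <n>` line mirrors the output format the rubric requests, which should make the final score easier to parse.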



In [22]:
# Parse Scores for each of the generated instruction
parsed_scores = []
for s in scores:
    generated_instruction = s["generated_instruction"]
    answer = s["answer"]
    # Extract only the score value
    score = s["score"].rsplit("Score:",1)[-1].strip().replace("<","").replace(">","")
    
    if len(score)>1:
        score = score[:1]
    try:
        # convert the score string to an integer
        score = int(score)
        parsed_scores.append({"generated_instruction":generated_instruction, "answer":answer, "score":score})
    except ValueError:
        # No integer score could be parsed: skip this example and report it
        print(f"invalid score for {s}")
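Because the decoded text still contains the rubric, which itself includes the literal `Score:  <rating>`, string splitting can pick up the placeholder instead of a real score. A regular expression that accepts only a digit from 1 to 5 is one way to make the parsing stricter. This is a sketch with a hypothetical helper name, not the method used in the cell above.

```python
import re

# Matches "Score: 4", "Score:  <5>", "Score: 4/5"; rejects "<rating>" and "45"
SCORE_RE = re.compile(r"Score:\s*<?\s*([1-5])\b")


def parse_score(text):
    """Return the last valid 1-5 score found in text, or None."""
    matches = SCORE_RE.findall(text)
    return int(matches[-1]) if matches else None


print(parse_score('write "Score:  <rating>" in the last line.\nGood answer.\nScore: 4'))  # 4
print(parse_score('write "Score:  <rating>" in the last line.'))  # None
```

Returning `None` keeps unparseable generations distinguishable from genuinely low scores.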

In [23]:
parsed_scores

[{'generated_instruction': "- [/INST]  Understood! Here's a question that makes sense based on the provided answer:What are the specific mechanisms and routes by which brain cells, including glial cells, stem cells, and post-mitotic neurons, migrate within the adult brain and during embryonic development?",
  'answer': 'The question is relatively broad and one should take into account that the brain not only consists of neurons, but also glial cells (supportive cells) and pre-mitotic neuronal stem cells. Furthermore, as critical fellow-scientists have indicated, developmental stage is very important, as the developing embryonic brain is very different from the adult brain.\nHowever, after sifting through various publications, the answer to the question is actually remarkably simple: Yes, brain cells migrate.\nIn  the adult brain glial cells migrate in the brain (Klämbt, 2009). Glial cells are involved in a myriad of functions, but a notable example of migrating glial cells are the olig

In [24]:

# Find instruction answer pairs where score >= threshold
threshold = 4
filtered_scores = [x for x in parsed_scores if x["score"] >= threshold]
filtered_scores

[{'generated_instruction': "- [/INST]  Understood! Here's a question that makes sense based on the provided answer:What are the specific mechanisms and routes by which brain cells, including glial cells, stem cells, and post-mitotic neurons, migrate within the adult brain and during embryonic development?",
  'answer': 'The question is relatively broad and one should take into account that the brain not only consists of neurons, but also glial cells (supportive cells) and pre-mitotic neuronal stem cells. Furthermore, as critical fellow-scientists have indicated, developmental stage is very important, as the developing embryonic brain is very different from the adult brain.\nHowever, after sifting through various publications, the answer to the question is actually remarkably simple: Yes, brain cells migrate.\nIn  the adult brain glial cells migrate in the brain (Klämbt, 2009). Glial cells are involved in a myriad of functions, but a notable example of migrating glial cells are the olig

In [25]:

print(f"Original Number of Samples: {len(parsed_scores)},\nUpon filtering, Number of Samples: {len(filtered_scores)}\n\nThreshold considered:>={threshold}")
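Before settling on a threshold it can help to look at how the scores are distributed. A minimal sketch, assuming a list of dicts with an integer `score` field as in `parsed_scores`; the toy data and helper name are hypothetical.

```python
from collections import Counter


def summarize_scores(scored, threshold):
    """Return a histogram of scores and the items at or above the threshold."""
    histogram = Counter(item["score"] for item in scored)
    kept = [item for item in scored if item["score"] >= threshold]
    return histogram, kept


# Hypothetical scores
toy = [{"score": 5}, {"score": 4}, {"score": 2}, {"score": 4}]
histogram, kept = summarize_scores(toy, threshold=4)
print(sorted(histogram.items()))  # [(2, 1), (4, 2), (5, 1)]
print(len(kept))                  # 3
```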

In [26]:
prompt_forward = """<s>[INST]Below is an instruction. Write a response that appropriately answers the question
### Instruction:
{}

### Response:
{}[/INST]"""

EOS_TOKEN = tokenizer.eos_token

texts = []
for datapoint in filtered_scores:
    texts.append({"text":prompt_forward.format(datapoint["generated_instruction"], datapoint["answer"])})
texts

[{'text': "<s>[INST]Below is an instruction. Write a response that appropriately answers the question\n### Instruction:\n- [/INST]  Understood! Here's a question that makes sense based on the provided answer:What are the specific mechanisms and routes by which brain cells, including glial cells, stem cells, and post-mitotic neurons, migrate within the adult brain and during embryonic development?\n\n### Response:\nThe question is relatively broad and one should take into account that the brain not only consists of neurons, but also glial cells (supportive cells) and pre-mitotic neuronal stem cells. Furthermore, as critical fellow-scientists have indicated, developmental stage is very important, as the developing embryonic brain is very different from the adult brain.\nHowever, after sifting through various publications, the answer to the question is actually remarkably simple: Yes, brain cells migrate.\nIn  the adult brain glial cells migrate in the brain (Klämbt, 2009). Glial cells ar

#### Push dataset to Hugging Face

In [30]:
from huggingface_hub import notebook_login

# Use the notebook_login function to log in
notebook_login()
filtered_dataset = Dataset.from_list(texts)
filtered_dataset.push_to_hub("AnushaKulkarni/filtered_dataset")


CommitInfo(commit_url='https://huggingface.co/datasets/AnushaKulkarni/filtered_dataset/commit/5165d03b0ae0f46dea39509591b4b2ff34be8092', commit_message='Upload dataset', commit_description='', oid='5165d03b0ae0f46dea39509591b4b2ff34be8092', pr_url=None, pr_revision=None, pr_num=None)

# Question 4 : Finetune base model on the dataset generated by self-curation (Question 3)

In [31]:
from datasets import load_dataset
dataset = load_dataset('AnushaKulkarni/filtered_dataset')
dataset = dataset['train'].train_test_split(test_size=0.1)

print(dataset)

Downloading readme:   0%|          | 0.00/269 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/107k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/71 [00:00<?, ? examples/s]

In [32]:
splits = ['train', 'test']
texts = []
texts_fwd = []

for split in splits:
    for text in dataset[split]["text"]:    
        texts.append(text + EOS_TOKEN)

from datasets import Dataset
dataset = Dataset.from_dict({"text": texts})

In [33]:
dataset["text"][0]

"<s>[INST]Below is an instruction. Write a response that appropriately answers the question\n### Instruction:\nWhat is the best method to generate random points on a sphere in $d$ dimensions, and how do you check if the resulting points are uniformly distributed on the sphere?\n\n### Response:\nA standard method is to generate three standard normals and construct a unit vector from them. That is, when $X_i \\sim N(0,1)$ and $\\lambda^2 = X_1^2 + X_2^2 + X_3^2$, then $(X_1/\\lambda, X_2/\\lambda, X_3/\\lambda)$ is uniformly distributed on the sphere.  This method works well for $d$-dimensional spheres, too.\nIn 3D you can use rejection sampling: draw $X_i$ from a uniform$[-1,1]$ distribution until the length of $(X_1, X_2, X_3)$ is less than or equal to 1, then--just as with the preceding method--normalize the vector to unit length.  The expected number of trials per spherical point equals $2^3/(4 \\pi / 3)$ = 1.91.  In higher dimensions the expected number of trials gets so large this 

In [34]:
max_length = 256
def generate_and_tokenize_prompt(prompt):
    result = tokenizer(
        prompt['text'],
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_dataset = dataset.map(generate_and_tokenize_prompt)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=getattr(torch, "float16"),
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
)

model.config.use_cache = False
model.config.pretraining_tp = 1
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=32,
    lora_alpha=64,
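    # Llama-2 layers expose q_proj/k_proj/v_proj (plus lm_head); "fc1", "fc2" and
    # "dense" belong to other architectures and match no module in this model
    target_modules=[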
        "q_proj",
        "k_proj",
        "v_proj",
        "fc1",
        "fc2",
        "dense",
        "lm_head"
    ],
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps = 1
training_args = TrainingArguments(
    output_dir="./q4",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    lr_scheduler_type='cosine',
    max_steps=50,
    learning_rate=2e-5,  # Want a small lr for finetuning
    optim="paged_adamw_8bit",
    logging_steps=5,  # Log the training loss every 5 steps
)


trainer = Trainer(
    model=model,
    train_dataset=tokenized_dataset,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

Map:   0%|          | 0/71 [00:00<?, ? examples/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


| Step | Training Loss |
|------|---------------|
| 5 | 2.6197 |
| 10 | 2.4359 |
| 15 | 2.2617 |
| 20 | 2.6552 |
| 25 | 2.6806 |
| 30 | 2.6052 |
| 35 | 2.4829 |
| 40 | 2.3819 |
| 45 | 2.4605 |


TrainOutput(global_step=50, training_loss=2.491266689300537, metrics={'train_runtime': 33.8246, 'train_samples_per_second': 1.478, 'train_steps_per_second': 1.478, 'total_flos': 509465434521600.0, 'train_loss': 2.491266689300537, 'epoch': 0.7})

#### Push model to Hugging Face

In [35]:
from huggingface_hub import notebook_login

# Use the notebook_login function to log in
notebook_login()
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/AnushaKulkarni/q4/commit/8380940f7e8d4c53c2cc1c4d05fd82822a05bdbf', commit_message='End of training', commit_description='', oid='8380940f7e8d4c53c2cc1c4d05fd82822a05bdbf', pr_url=None, pr_revision=None, pr_num=None)