<a href="https://colab.research.google.com/github/priyal6/finetuning/blob/main/LoRA_low_rank_adaptation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

pip install -q torch transformers datasets accelerate peft evaluate sentencepiece safetensors


In [None]:
import os
from dataclasses import dataclass
from typing import Optional
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)

In [None]:
from datasets import Dataset
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training
)

In [None]:
@dataclass
class Config:
  model_name: str = "gpt2"
  output_dir: str = "lora-gpt2-output"
  per_device_train_batch_size: int = 4
  num_train_epochs: int = 3
  learning_rate: float = 2e-4
  weight_decay: float = 0.0
  fp16: bool = False
  lora_r: int = 8
  lora_alpha: int = 32
  lora_dropout: float = 0.1

  max_seq_length: int = 256

cfg = Config()


In [None]:
texts = [
    "Hello, my name is Ada and I love cats.",
    "Weather today: sunny with a chance of learning.",
    "Data science is about asking the right questions and checking assumptions.",
    "Fine-tuning language models with LoRA can be fast and cheap if done correctly."
]

In [None]:
dataset = Dataset.from_dict({"text": texts})

In [None]:
tokenizer = AutoTokenizer.from_pretrained(cfg.model_name, use_fast=True)

# GPT-2 ships without a padding token, so add one; the fixed-length padding used later needs it.
if tokenizer.pad_token_id is None:
  tokenizer.add_special_tokens({"pad_token": "[PAD]"})


In [None]:
model = AutoModelForCausalLM.from_pretrained(cfg.model_name)


In [None]:
# The embedding matrix is a lookup table with one row (vector) per token id.
# It must have as many rows as the tokenizer has tokens; otherwise the id of the
# newly added pad token indexes past the end of the table and the forward pass crashes.
if tokenizer.pad_token_id is not None and model.get_input_embeddings().weight.shape[0]!= len(tokenizer):
  model.resize_token_embeddings(len(tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
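
To make the size check above concrete, here is a minimal pure-Python sketch of the failure it guards against. The table size, the vector width, and the helper `lookup` are illustrative stand-ins, not part of `transformers`; the sketch assumes a newly added token receives the next free id, which is one past the last row of the original table.

```python
# Toy stand-in for an embedding table: one row (vector) per token id.
# Sizes are illustrative; the real GPT-2 table is far larger.
vocab_size, hidden = 5, 3
embeddings = [[0.0] * hidden for _ in range(vocab_size)]

# Assumption: a newly added token gets the next free id.
pad_id = vocab_size

def lookup(table, token_id):
    """Return the row for token_id, the way an embedding layer indexes its weights."""
    return table[token_id]

try:
    lookup(embeddings, pad_id)
    lookup_failed = False
except IndexError:
    # Id 5 has no row in a 5-row table.
    lookup_failed = True

# "Resizing" appends a row so the new id becomes valid.
embeddings.append([0.0] * hidden)
resized_row = lookup(embeddings, pad_id)
print(lookup_failed, len(embeddings))
```

In the real model the appended row is initialized from the statistics of the existing rows, as the message above describes, rather than with zeros.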


In [None]:
lora_config = LoraConfig(
    r = cfg.lora_r,
    lora_alpha = cfg.lora_alpha,
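    # Names are matched against the end of each module name, so "c_proj" covers
    # both the attention output projection and the MLP output projection in GPT-2.
    target_modules = ["c_attn", "c_proj"],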
    lora_dropout = cfg.lora_dropout,
    bias = "none",
    task_type = "CAUSAL_LM"
)
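
As a rough sanity check on this configuration, the sketch below estimates how many trainable weights the adapters add and what scaling factor is applied to the low-rank update. It assumes the standard GPT-2 (small) shapes: hidden size 768, 12 transformer blocks, a fused QKV projection of width 3 x 768, and an MLP inner width of 4 x 768. The dictionary keys and the helper `lora_params` are hypothetical names used only for this illustration.

```python
# Assumed GPT-2 (small) shapes: hidden size 768, 12 transformer blocks.
HIDDEN, N_LAYERS = 768, 12
R, ALPHA = 8, 32  # cfg.lora_r and cfg.lora_alpha

# (in_features, out_features) of each layer in one block whose name ends
# with one of the target strings "c_attn" or "c_proj".
targets_per_block = {
    "attn.c_attn": (HIDDEN, 3 * HIDDEN),
    "attn.c_proj": (HIDDEN, HIDDEN),
    "mlp.c_proj": (4 * HIDDEN, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """A is (r x in) and B is (out x r), so the pair adds r * (in + out) weights."""
    return r * (in_features + out_features)

per_block = sum(lora_params(i, o, R) for i, o in targets_per_block.values())
total_trainable = per_block * N_LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update B @ A

print(total_trainable, scaling)
```

Under these assumptions the adapters add well under one percent of the base model's weights, which is why the saved output is small compared with the full checkpoint.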

In [None]:
model = get_peft_model(model, lora_config)



In [None]:
def tokenize_fn(examples):
  # Truncate long texts and pad short ones so every example has exactly max_seq_length tokens.
  return tokenizer(examples["text"], truncation=True, max_length=cfg.max_seq_length, padding="max_length")


In [None]:
tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=["text"])


In [None]:
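# For causal LM training the labels are the input ids themselves (the model shifts them internally).
# Note: DataCollatorForLanguageModeling(mlm=False) rebuilds the labels from input_ids per batch,
# so this column is redundant with that collator and is kept only to make the intent explicit.
tokenized = tokenized.map(lambda ex: {"labels": ex["input_ids"]}, batched=False)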


In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer = tokenizer,
    mlm = False
)
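
With `mlm = False`, the collator builds the labels for each batch by copying `input_ids` and replacing padding positions with `-100`, the value the loss ignores. The following is a simplified pure-Python sketch of that behaviour, not the library's code; the function name `build_causal_lm_labels`, the pad id, and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def build_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids and replace padding positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

pad_id = 50257  # assumed id of the added "[PAD]" token
example = [15496, 11, 995, pad_id, pad_id]  # made-up ids, right-padded
labels = build_causal_lm_labels(example, pad_id)
print(labels)
```

Because every example here is padded to 256 tokens while the sentences are short, most positions end up masked and do not contribute to the loss.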

In [None]:
training_args = TrainingArguments(
    output_dir = cfg.output_dir,
    num_train_epochs = cfg.num_train_epochs,
    per_device_train_batch_size = cfg.per_device_train_batch_size,
    learning_rate = cfg.learning_rate,
    weight_decay = cfg.weight_decay,
    fp16 = cfg.fp16,
    logging_steps = 10,
    save_total_limit = 2,
    save_strategy = "epoch",
    push_to_hub = False,
    report_to = "none"
)

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = tokenized,
    data_collator = data_collator,
)

trainer.train()

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.



TrainOutput(global_step=3, training_loss=4.352939605712891, metrics={'train_runtime': 7.8156, 'train_samples_per_second': 1.535, 'train_steps_per_second': 0.384, 'total_flos': 1582700691456.0, 'train_loss': 4.352939605712891, 'epoch': 3.0})
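
The `global_step=3` in this output follows from the dataset size and the training arguments. The sketch below reproduces the arithmetic under simple assumptions (single device, no gradient accumulation, last partial batch kept); `total_optimizer_steps` is a hypothetical helper, not a `Trainer` method.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                          num_devices=1, grad_accum_steps=1):
    """Optimizer steps for a run in which the last partial batch is kept."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    steps_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum_steps))
    return steps_per_epoch * num_epochs

# 4 example sentences, batch size 4, 3 epochs, as configured above.
steps = total_optimizer_steps(num_examples=4, per_device_batch_size=4, num_epochs=3)

logging_steps = 10
logged_rows = steps // logging_steps  # how many times the logging interval is reached

print(steps, logged_rows)
```

Since the run ends before the first logging interval is reached, no per-step loss rows are recorded; lowering `logging_steps` to 1 would record the loss at every step.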

In [None]:
os.makedirs(cfg.output_dir, exist_ok=True)
model.save_pretrained(cfg.output_dir)
tokenizer.save_pretrained(cfg.output_dir)
print(f"Finished. Saved LoRA adapters and tokenizer to {cfg.output_dir}")




Finished. Saved LoRA adapters and tokenizer to lora-gpt2-output


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel


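# Rebuild the base model exactly as it was during training: the same added pad token
# and the same resized embedding matrix, so the adapter weights line up with it.
base = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
base.resize_token_embeddings(len(tokenizer))

# Attach the saved LoRA adapters to the base model.
model = PeftModel.from_pretrained(base, "lora-gpt2-output")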

In [None]:
# Inference: generate up to 50 tokens in total (the prompt counts toward max_length)
input_text = "Data science means"
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
  out = model.generate(**inputs, max_length=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Data science means that we can use the data to understand the world around us.

The data science approach is based on the idea that we can use data to understand the world around us.

The data science approach is based on the idea
