In [3]:
# A small training corpus; it is too small to make a big difference to the model.
training_corpus = """
Kunj Shah is a passionate developer exploring the world of AI, LLMs, and web technologies. With hands-on experience in building transformer models from scratch, including custom tokenizers and attention mechanisms, he's on a mission to understand how large language models really work.
He has worked on various projects like voice-activated AI assistants, RAG-based PDF summarizers, and AI startup evaluators. He's also familiar with LangChain, Langflow, Hugging Face models, and deploying APIs on Raspberry Pi.

In addition to AI, Kunj enjoys building sleek frontend experiences using React, TailwindCSS, and Vite. He’s comfortable working across both backend and frontend stacks, with experience in Node.js, Express, and database integration via MySQL and MongoDB.
He's explored tools like n8n, Sublime Text, and Python-based automation to streamline development. Kunj is driven by curiosity, rapid prototyping, and bringing ideas to life quickly and effectively.
"""

In [4]:
# peft provides the core LoRA functionality from Hugging Face
# bitsandbytes adds support for quantized models
!pip install -q transformers datasets peft accelerate bitsandbytes

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # required for GPT-2, which has no padding token of its own

In [6]:
from peft import get_peft_model, LoraConfig, TaskType

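# Arguments for LoRA:
  # 1. r: the LoRA rank, 8 (inner dimension of the new low-rank matrices)
  # 2. lora_alpha: the scaling factor, 16 (the update is scaled by lora_alpha / r)
  # 3. lora_dropout: dropout applied inside the LoRA layers, 0.1
  # 4. bias: "none", so bias terms are not trained
  # 5. task_type: tells PEFT this is a causal (GPT-style) language modeling task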
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Load the base GPT-2 model first; get_peft_model then inserts small LoRA layers into the attention modules according to lora_config
model = AutoModelForCausalLM.from_pretrained(model_name)
model = get_peft_model(model, lora_config)

# GPT2 weights are FROZEN and only LoRA layers are going to be trained
model.print_trainable_parameters()

trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364
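
The trainable-parameter count above can be reproduced by hand. This is a minimal sketch that assumes PEFT's default LoRA target for GPT-2 is the fused attention projection `c_attn` (768 inputs, 2304 outputs) in each of the 12 transformer blocks. Those dimensions are the usual GPT-2 small sizes and are an assumption here, not something the notebook prints.

```python
# Assumed GPT-2 small dimensions (not printed by the notebook).
hidden_size = 768           # model width
qkv_out = 3 * hidden_size   # c_attn produces queries, keys and values at once
num_layers = 12             # transformer blocks, one c_attn each
rank = 8                    # r from lora_config

# Each adapted layer gets two small matrices: A (rank x in) and B (out x rank).
params_a = rank * hidden_size
params_b = qkv_out * rank
trainable = num_layers * (params_a + params_b)

all_params = 124_734_720    # taken from the print_trainable_parameters() output
trainable_pct = 100 * trainable / all_params

print(f"trainable params: {trainable:,}")
print(f"trainable%: {trainable_pct:.4f}")
```

Under these assumptions the arithmetic lands on the same 294,912 parameters and 0.2364% that the notebook reports.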


In [7]:
from datasets import Dataset

lines = training_corpus.strip().split("\n\n")
dataset = Dataset.from_dict({"text": lines})

def tokenize(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

Map:   0%|          | 0/2 [00:00<?, ? examples/s]
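
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a small pure-Python sketch of right-side padding and truncation. The helper `pad_or_truncate` and the token ids in `example_ids` are hypothetical and only for illustration; the real work is done inside the tokenizer. The pad id 50256 is GPT-2's EOS id, which this notebook reuses as the pad token.

```python
PAD_ID = 50256  # GPT-2's EOS id, reused as the pad token in this notebook


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]              # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)             # padding: fill up to the limit
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


example_ids = [42, 7, 1999]  # hypothetical token ids, not real GPT-2 output
ids, mask = pad_or_truncate(example_ids, max_length=6)
print(ids)   # [42, 7, 1999, 50256, 50256, 50256]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

Each paragraph of the corpus is probably much shorter than 512 tokens, so most positions in every row are likely padding. A smaller `max_length` or `padding=True` (pad to the longest example in the batch) would likely reduce the wasted compute.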

In [8]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-lora",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=5e-4,
    report_to="none"
)

# The data collator assembles batches; with mlm=False it copies input_ids into labels for causal LM training and masks padding positions out of the loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Trainer object that runs the fine-tuning
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

# We run this training loop
trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


TrainOutput(global_step=3, training_loss=4.534875869750977, metrics={'train_runtime': 71.1591, 'train_samples_per_second': 0.084, 'train_steps_per_second': 0.042, 'total_flos': 1573188009984.0, 'train_loss': 4.534875869750977, 'epoch': 3.0})
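
The numbers in `TrainOutput` follow from the dataset size and the training arguments. The sketch below redoes that arithmetic, assuming a single device and the default `gradient_accumulation_steps=1` (neither is stated in the notebook).

```python
import math

num_examples = 2         # two paragraphs in the dataset
batch_size = 2           # per_device_train_batch_size
num_epochs = 3           # num_train_epochs
logging_steps = 10       # logging_steps
train_runtime = 71.1591  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

# Steps at which an intermediate training loss would be logged.
logged_steps = [s for s in range(1, total_steps + 1) if s % logging_steps == 0]

print(total_steps)                   # 3
print(round(samples_per_second, 3))  # 0.084
print(round(steps_per_second, 3))    # 0.042
print(logged_steps)                  # []
```

The whole run is only three optimizer steps, which is fewer than `logging_steps`, so no intermediate loss is logged. Three updates on two examples is also very little training signal.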

In [9]:
model.save_pretrained("./gpt2-lora-adapter")
tokenizer.save_pretrained("./gpt2-lora-adapter")

('./gpt2-lora-adapter/tokenizer_config.json',
 './gpt2-lora-adapter/special_tokens_map.json',
 './gpt2-lora-adapter/vocab.json',
 './gpt2-lora-adapter/merges.txt',
 './gpt2-lora-adapter/added_tokens.json',
 './gpt2-lora-adapter/tokenizer.json')
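
Calling `save_pretrained` on the PEFT model writes only the adapter weights and their config, not the full GPT-2 checkpoint. As a rough size estimate, the sketch below assumes the weights are stored as 32-bit floats (4 bytes each) and ignores file headers and config files.

```python
BYTES_PER_PARAM = 4  # assumption: float32 storage

adapter_params = 294_912    # trainable params reported earlier
all_params = 124_734_720    # all params reported earlier

adapter_bytes = adapter_params * BYTES_PER_PARAM
full_bytes = all_params * BYTES_PER_PARAM

adapter_mb = adapter_bytes / 1e6
full_mb = full_bytes / 1e6

print(f"adapter: about {adapter_mb:.2f} MB")
print(f"full model: about {full_mb:.1f} MB")
```

Under that assumption the adapter is a little over 1 MB, a few hundred times smaller than the full set of weights.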

In [11]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM

# Load the PEFT config, then the base model, then wrap the base model with the saved adapter
config = PeftConfig.from_pretrained("./gpt2-lora-adapter")
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)  # the plain GPT-2 base model
model = PeftModel.from_pretrained(base_model, "./gpt2-lora-adapter")  # base model plus the saved LoRA layers; loaded for inference, so nothing is trainable by default

input_text = "Kunj is a"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Kunj is a former member of the U.S. Army's Special Operations Command. He is a former member of the U.S. Army's Special Operations Command. He is a former member of the U.S. Army's Special Operations Command. He is
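
For intuition about what the adapter-wrapped model computes, each adapted layer adds a scaled low-rank correction to the frozen projection: `W x + (lora_alpha / r) * B (A x)`. The pure-Python sketch below uses a hypothetical tiny layer (3 inputs, 2 outputs, rank 1) with made-up weights; `matvec` and `lora_forward` are illustrative helpers, not PEFT functions. PEFT initializes `B` to zeros, so an untrained adapter leaves the base output unchanged.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, lora_alpha, r):
    """Frozen projection plus the scaled low-rank LoRA correction."""
    scaling = lora_alpha / r
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # project down to rank r, then back up
    return [b + scaling * u for b, u in zip(base, update)]


# Hypothetical tiny layer: 3 inputs, 2 outputs, rank 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.0]]        # shape r x in
B_init = [[0.0], [0.0]]      # shape out x r, zeros as at initialization
B_trained = [[0.1], [-0.2]]  # made-up values standing in for a trained B
x = [1.0, 2.0, 3.0]

# lora_alpha / r = 2 here, the same ratio as 16 / 8 in the notebook.
print(lora_forward(x, W, A, B_init, lora_alpha=2, r=1))     # [7.0, 2.0]
print(lora_forward(x, W, A, B_trained, lora_alpha=2, r=1))
```

After only three optimizer steps the correction is likely still small, which would be consistent with the generated text above reading like stock GPT-2 rather than the training corpus.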
