### Introduction to LoRA

Low-rank adaptation (LoRA) is a finetuning technique that adapts a pretrained model to a specific, often smaller, dataset by training only a small set of added low-rank matrices while the original model weights stay frozen.

This approach is important because it makes finetuning large models on task-specific data efficient: far fewer parameters are trained, which significantly reduces the computational cost and time required.
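To make the efficiency claim concrete, the sketch below counts trainable parameters for a single weight matrix. In the LoRA formulation, a frozen weight matrix `W` of shape `d_out x d_in` is left unchanged and a trainable update `B @ A` is added to it, where `B` has shape `d_out x r` and `A` has shape `r x d_in` for a small rank `r`. The function name, the 2560 x 2560 layer size, and the rank of 8 are assumptions chosen for illustration, not values read from this notebook's run.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out                # updating W directly
    lora = rank * d_in + d_out * rank  # A is (rank x d_in), B is (d_out x rank)
    return full, lora


# Hypothetical layer size and rank, for illustration only.
full, lora = lora_param_counts(d_in=2560, d_out=2560, rank=8)
print(f"Full finetuning: {full:,} trainable parameters")
print(f"LoRA (rank 8):   {lora:,} trainable parameters")
```

Because the LoRA count grows linearly with the layer dimensions rather than with their product, the savings become larger as the layers get wider.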

In [1]:
import json


file_path = "instruction-data.json"

with open(file_path, "r") as file:
    data = json.load(file)
print("Number of entries:", len(data))

Number of entries: 1100


In [2]:
train_portion = int(len(data) * 0.85)  # first 85% of entries for training

train_data = data[:train_portion]
test_data = data[train_portion:]  # remaining 15% for testing

In [3]:
print("Training set length:", len(train_data))
print("Test set length:", len(test_data))

Training set length: 935
Test set length: 165
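
The split above takes the first 85% of entries in file order. If the entries in the file happen to be grouped by task type, a shuffled split may give a more representative test set. The following is a minimal sketch of that variation, not what this notebook does; the helper name `shuffled_split` and the seed value are hypothetical.

```python
import random


def shuffled_split(entries, train_fraction=0.85, seed=123):
    """Shuffle a copy of `entries` and split it into (train, test) lists."""
    shuffled = list(entries)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in data with the same length as the dataset loaded above.
demo_entries = list(range(1100))
demo_train, demo_test = shuffled_split(demo_entries)
print(len(demo_train), len(demo_test))
```

Fixing the seed keeps the train/test membership stable across runs, which matters when the test file is reused for later evaluation.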


In [4]:
with open("train.json", "w") as json_file:
    json.dump(train_data, json_file, indent=4)
    
with open("test.json", "w") as json_file:
    json.dump(test_data, json_file, indent=4)
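
Before launching a finetuning run, it can help to confirm that every entry has the fields the later steps rely on. The sketch below assumes each entry is a dictionary with `instruction`, `input`, and `output` keys. The `instruction` and `input` keys are used by `format_input` later in this notebook, while the `output` key is an assumption about the dataset. The helper name `find_invalid_entries` and the sample entries are hypothetical.

```python
def find_invalid_entries(entries, required_keys=("instruction", "output")):
    """Return indices of entries that lack a required key or leave it empty."""
    bad = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict) or any(not entry.get(k) for k in required_keys):
            bad.append(i)
    return bad


# Made-up sample entries for illustration.
sample = [
    {"instruction": "Rewrite the sentence using a simile.",
     "input": "The car is very fast.",
     "output": "The car is as fast as lightning."},
    {"instruction": "", "input": "", "output": "Missing instruction."},
    {"instruction": "Name a primary color.", "input": ""},  # no output key
]
invalid = find_invalid_entries(sample)
print(invalid)
```

The `input` field is deliberately not required here, since an empty `input` is a valid case that `format_input` handles.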

### Instruction Finetuning

In [7]:
!litgpt finetune_lora microsoft/phi-2 \
--data JSON \
--data.val_split_fraction 0.1 \
--data.json_path train.json \
--train.epochs 3 \
--train.log_interval 100

Fetching 7 files:   0%|                                   | 0/7 [00:00<?, ?it/s]
generation_config.json: 100%|███████████████████| 124/124 [00:00<00:00, 525kB/s]
tokenizer_config.json: 100%|███████████████| 7.34k/7.34k [00:00<00:00, 18.3MB/s]
tokenizer.json:   0%|                               | 0.00/2.11M [00:00<?, ?B/s]
model.safetensors.index.json:   0%|                 | 0.00/35.7k [00:00<?, ?B/s]
config.json: 100%|█████████████████████████████| 735/735 [00:00<00:00, 4.15MB/s]
Fetching 7 files:  14%|███▊                       | 1/7 [00:01<00:09,  1.59s/it]
model.safetensors.index.json: 100%|█████████| 35.7k/35.7k [00:00<00:00, 177kB/s]
tokenizer.json: 100%|██████████████████████| 2.11M/2.11M [00:01<00:00, 1.84MB/s]
model-00001-of-00002.safetensors:   0%|             | 0.00/5.00G [00:00<?, ?B/s]
model-00002-of-00002.safetensors:   0%|              | 0.00/564M [00:00<?, ?B/s]
model-00002-of-00002.safetensors:   2%|     | 10.5M/564M 

In [8]:
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

print(format_input(test_data[0]))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.
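
The `### Input:` section is only added when the entry's `input` field is non-empty. A quick sketch with a made-up entry (not taken from the dataset) shows the other branch:

```python
def format_input(entry):
    # Same logic as the cell above, repeated so this block runs on its own.
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


# Hypothetical entry with an empty input field.
no_input_entry = {"instruction": "Name three primary colors.", "input": ""}
prompt = format_input(no_input_entry)
print(prompt)
```

In this case the prompt ends right after the instruction text, with no empty `### Input:` header left behind.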


In [9]:
from litgpt import LLM

llm = LLM.load("microsoft/phi-2")

FileNotFoundError: checkpoint_dir '/home/anil/Documents/AI_Fellowship/LLMs/05_finetuning/checkpoints/microsoft/phi-2' is missing the files: ['lit_model.pth', 'model_config.yaml'].
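
The error message lists the files that were expected in the checkpoint directory but not found: `lit_model.pth` and `model_config.yaml`. A small check like the one below can confirm whether a directory looks loadable before calling `LLM.load`. This is a sketch only: the helper name `missing_checkpoint_files` is hypothetical, and the list of required files is taken from the error message rather than from the LitGPT documentation.

```python
import tempfile
from pathlib import Path

# File names taken from the FileNotFoundError message above.
REQUIRED_FILES = ("lit_model.pth", "model_config.yaml")


def missing_checkpoint_files(checkpoint_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from `checkpoint_dir`."""
    root = Path(checkpoint_dir)
    return [name for name in required if not (root / name).is_file()]


# Demo on a temporary directory that contains only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "lit_model.pth").touch()
    still_missing = missing_checkpoint_files(tmp)

print(still_missing)
```

An empty result means both files are present. If the aim is to query the finetuned model, the directory written by the finetuning run is a likely candidate to check; its location depends on how the run was configured and is not shown in this notebook.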