# Data Preparation and Base Model Evaluation

## Objective
The goal of this notebook is to prepare the `Dahoas/synthetic-instruct-gptj-pairwise` dataset for Supervised Fine-Tuning (SFT) and to evaluate the performance of the base `gpt2` model before any fine-tuning.

## Key Steps
1.  **Load Data**: Loaded the raw dataset from Hugging Face.
2.  **Format for SFT**: Split each example into an instruction-style `prompt` and a `completion` (the `chosen` response), the explicit prompt-completion format suitable for the `SFTTrainer`.
3.  **Base Model Inference**: Tested the pre-trained `gpt2` (124M) model on a sample prompt to establish a baseline.

## Outcome & Next Steps
The dataset is now correctly formatted. The base `gpt2` model's response to the test prompt was irrelevant, demonstrating a clear **relevance failure** and confirming the need for fine-tuning.

The next step is to use this processed data to perform Parameter-Efficient Fine-Tuning (PEFT) with LoRA, aligning the model with our instruction-following task.

In [33]:
# !pip install transformers==4.43.4
# !pip install torch

In [6]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import json
import matplotlib.pyplot as plt

In [7]:
dataset_name = "my-local-dataset/" # "Dahoas/synthetic-instruct-gptj-pairwise"
dataset = load_dataset(dataset_name, split="train")

In [8]:
dataset

Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 33143
})

In [9]:
print(dataset[0])

{'prompt': 'I was wondering if you could walk me through the process of setting up a hydroponic garden for herbs.', 'chosen': "Sure! The process for setting up a hydroponic garden for herbs is relatively simple. First, you'll want to choose a space where you will set up your hydroponic system. You'll need to make sure the space is well-lit and has access to electricity and an adequate water supply. Next, you'll need to choose the type of hydroponic system you want to use. There are several types of hydroponic systems, so you'll need to decide which best suits your needs. Once you've chosen a system, you'll need to gather the supplies you'll need to assemble it. This includes things like pumps, growing trays, grow lights, and nutrients. Once you've assembled the system, you'll need to add your choice of herbs to the system. Lastly, you'll need to monitor and adjust the system as needed to ensure your herbs are getting the correct amount of light, water, and nutrients.", 'rejected': 'How

## Apply the Formatting

In [10]:
# def format_sft_prompt(sample):
#     """Formats a sample from the dataset into a single text string."""
#     prompt = sample["prompt"]
#     chosen_response = sample["chosen"]
#     return {"text": f"### Instruction:\n{prompt}\n\n### Response:\n{chosen_response}"}



def format_for_prompt_completion(sample):
    """
    Formats a sample into a dictionary with 'prompt' and 'completion' keys.
    """
    prompt_text = f"### Instruction:\n{sample['prompt']}\n\n### Response:\n"
    completion_text = sample['chosen']
    return {"prompt": prompt_text, "completion": completion_text}
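As a quick sanity check, the sketch below applies the same formatting to a hand-written record (the `sample` dict is made up for illustration) and confirms that `prompt + completion` reproduces the single-string layout used by the commented-out `format_sft_prompt`.

```python
def format_for_prompt_completion(sample):
    """Split a sample into an instruction-style prompt and its completion."""
    prompt_text = f"### Instruction:\n{sample['prompt']}\n\n### Response:\n"
    return {"prompt": prompt_text, "completion": sample["chosen"]}


# Hypothetical record with the same keys as the dataset rows.
sample = {"prompt": "Name a primary color.", "chosen": "Red.", "rejected": "Maybe."}

formatted = format_for_prompt_completion(sample)

# Concatenating the two fields should give back the full training text.
full_text = formatted["prompt"] + formatted["completion"]
print(repr(full_text))
```

Note that `Dataset.map` merges the returned dictionary into each row, so the original `chosen` and `rejected` columns are kept unless they are dropped, for example through the `remove_columns` argument of `map`.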

In [11]:
formatted_dataset = dataset.map(format_for_prompt_completion)
formatted_dataset

Dataset({
    features: ['prompt', 'chosen', 'rejected', 'completion'],
    num_rows: 33143
})

In [12]:
print(formatted_dataset[0])

{'prompt': '### Instruction:\nI was wondering if you could walk me through the process of setting up a hydroponic garden for herbs.\n\n### Response:\n', 'chosen': "Sure! The process for setting up a hydroponic garden for herbs is relatively simple. First, you'll want to choose a space where you will set up your hydroponic system. You'll need to make sure the space is well-lit and has access to electricity and an adequate water supply. Next, you'll need to choose the type of hydroponic system you want to use. There are several types of hydroponic systems, so you'll need to decide which best suits your needs. Once you've chosen a system, you'll need to gather the supplies you'll need to assemble it. This includes things like pumps, growing trays, grow lights, and nutrients. Once you've assembled the system, you'll need to add your choice of herbs to the system. Lastly, you'll need to monitor and adjust the system as needed to ensure your herbs are getting the correct amount of light, wat

In [13]:
print(formatted_dataset[0]['prompt'])
print('='*50)
print(formatted_dataset[0]['completion'])

### Instruction:
I was wondering if you could walk me through the process of setting up a hydroponic garden for herbs.

### Response:

Sure! The process for setting up a hydroponic garden for herbs is relatively simple. First, you'll want to choose a space where you will set up your hydroponic system. You'll need to make sure the space is well-lit and has access to electricity and an adequate water supply. Next, you'll need to choose the type of hydroponic system you want to use. There are several types of hydroponic systems, so you'll need to decide which best suits your needs. Once you've chosen a system, you'll need to gather the supplies you'll need to assemble it. This includes things like pumps, growing trays, grow lights, and nutrients. Once you've assembled the system, you'll need to add your choice of herbs to the system. Lastly, you'll need to monitor and adjust the system as needed to ensure your herbs are getting the correct amount of light, water, and nutrients.


In [14]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

In [15]:
model = AutoModelForCausalLM.from_pretrained(model_name)
model.config.pad_token_id = model.config.eos_token_id # Configure model pad token

In [16]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
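The "124M" figure can be cross-checked by hand from the printed architecture. The sketch below is plain arithmetic, not a call into the model. The embedding and `LayerNorm` sizes come from the printout, while the `Conv1D` shapes (which the printout does not show) and the tying of `lm_head` to `wte` are assumptions based on the standard GPT-2 small layout.

```python
# Dimensions read from the printout; Conv1D shapes are assumed from the
# standard GPT-2 small layout (hidden 768, MLP width 4 * 768).
vocab, positions, hidden, layers = 50257, 1024, 768, 12


def linear(n_in, n_out):
    """Weights plus bias of one dense (Conv1D) projection."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

block = (
    layer_norm                      # ln_1
    + linear(hidden, 3 * hidden)    # attn.c_attn (fused q, k, v)
    + linear(hidden, hidden)        # attn.c_proj
    + layer_norm                    # ln_2
    + linear(hidden, 4 * hidden)    # mlp.c_fc
    + linear(4 * hidden, hidden)    # mlp.c_proj
)

embeddings = vocab * hidden + positions * hidden  # wte + wpe

# ln_f is added once; lm_head is assumed to share weights with wte,
# so it contributes no extra parameters.
total = embeddings + layers * block + layer_norm
print(f"{total:,}")
```

Inside the notebook, `sum(p.numel() for p in model.parameters())` should give a comparable figure, since shared parameters are counted once.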

In [25]:
text = formatted_dataset[10]['prompt']  # index the row first to avoid loading the whole column
print(text)

### Instruction:
How does quantum computing work.

### Response:



In [26]:
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs

{'input_ids': tensor([[21017, 46486,    25,   198,  2437,   857, 14821, 14492,   670,    13,
           198,   198, 21017, 18261,    25,   198]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [27]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [28]:
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

In [29]:
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
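To make the effect of `temperature=0.7` concrete, here is a minimal pure-Python sketch of temperature-scaled softmax. The logits are made up, and the function name `softmax_with_temperature` is hypothetical; it illustrates the idea rather than reproducing the library's internal sampling code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate next tokens.
logits = [2.0, 1.0, 0.0]

# A temperature below 1.0 sharpens the distribution toward the top token.
for t in (1.0, 0.7):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Because `do_sample=True` draws from this distribution, rerunning the generation cell can produce different text each time. Setting a seed beforehand (for example with `torch.manual_seed`) should make a run repeatable on the same setup.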

In [30]:
response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response_text)

### Instruction:
How does quantum computing work.

### Response:

This paper has been accepted to the IEEE International Conference on Information Security and Computer Systems (ICISC), the Conference on Computer and Information Security (CISC) and the Proceedings of the Conference on Computer and Information Security (CISC) in Paris, France, as well as to the International Conference on Information Security and Computer Systems (ICISC) in Amsterdam, Netherlands (IISC) in June 2014.

### References:

[1] http://www.cs.berkeley.edu/~simmons/~gordon/papers/1.pdf

[2] http://www.cs.berkeley.edu/~simmons/~gordon/papers/2.pdf


In [31]:
generated_part = response_text[len(text):]
print(generated_part.strip())

This paper has been accepted to the IEEE International Conference on Information Security and Computer Systems (ICISC), the Conference on Computer and Information Security (CISC) and the Proceedings of the Conference on Computer and Information Security (CISC) in Paris, France, as well as to the International Conference on Information Security and Computer Systems (ICISC) in Amsterdam, Netherlands (IISC) in June 2014.

### References:

[1] http://www.cs.berkeley.edu/~simmons/~gordon/papers/1.pdf

[2] http://www.cs.berkeley.edu/~simmons/~gordon/papers/2.pdf
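Slicing the decoded string with `len(text)` works here, but it relies on the decoded prompt matching the original text character for character. A more robust alternative is to slice the token ids before decoding. The sketch below shows the idea with plain lists: the prompt ids are the 16 ids from the tokenizer output earlier in the notebook, while the continuation ids and the helper name `new_tokens` are made up for illustration.

```python
# Prompt ids copied from the tokenizer output earlier in the notebook.
prompt_ids = [21017, 46486, 25, 198, 2437, 857, 14821, 14492, 670, 13,
              198, 198, 21017, 18261, 25, 198]

# Made-up continuation ids standing in for what generate() appended.
output_ids = prompt_ids + [1212, 3348, 468]


def new_tokens(output_ids, prompt_len):
    """Return only the ids generated after the prompt."""
    return output_ids[prompt_len:]


generated_ids = new_tokens(output_ids, len(prompt_ids))
print(generated_ids)

# In the notebook this would correspond to something like:
#   prompt_len = inputs["input_ids"].shape[1]
#   tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
```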


In [32]:
# Use a separate name so the prompt stored in `text` is not overwritten.
reference_text = formatted_dataset[10]['completion']
print(reference_text)

Quantum computing works by using quantum-mechanical phenomena, such as superposition and entanglement, to perform operations on data. This allows quantum computers to be much faster and more efficient at certain computations compared to traditional computers. Quantum computing relies on the manipulation of qubits - units of information that can exist in multiple states at any given time - in order to perform calculations and solve problems. By using quantum effects, a quantum computer can process multiple states simultaneously, allowing it to carry out complex computations in a fraction of the time of a classical computer.


### Relevance Failure

The model failed to answer the question. The prompt asked for an explanation of quantum computing, but the model generated text about a paper being accepted at academic conferences, followed by a list of reference URLs. The response is entirely off-topic.
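One rough way to put a number on this failure is to check how many of the prompt's content words reappear in a response. The sketch below is a crude lexical heuristic, not an established evaluation metric. The stopword list and the function names `content_words` and `prompt_coverage` are hypothetical, and the two response strings are shortened excerpts of the outputs shown earlier.

```python
import re

# Tiny illustrative stopword list, not a standard one.
STOPWORDS = {"how", "does", "the", "a", "an", "to", "of", "and", "by", "on", "as", "is"}


def content_words(text):
    """Lowercase alphabetic words with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def prompt_coverage(prompt, response):
    """Fraction of the prompt's content words that reappear in the response."""
    wanted = content_words(prompt)
    if not wanted:
        return 0.0
    return len(wanted & content_words(response)) / len(wanted)


prompt = "How does quantum computing work."
generated = (
    "This paper has been accepted to the IEEE International Conference "
    "on Information Security and Computer Systems (ICISC)"
)
reference = (
    "Quantum computing works by using quantum-mechanical phenomena, such as "
    "superposition and entanglement, to perform operations on data."
)

print("base model:", prompt_coverage(prompt, generated))
print("reference: ", prompt_coverage(prompt, reference))
```

The base model's excerpt shares none of the prompt's content words, while the reference completion shares most of them. A score like this only flags obvious topic drift; it says nothing about whether an on-topic answer is correct.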

