# Self-Curation

This notebook implements the self-curation step for selecting high-quality instruction-output pairs. A Llama 2 chat model rates each pair against a 5-point quality rubric, the score and the model's reasoning are parsed from its reply, and the scored pairs are saved so that only the best ones are used for fine-tuning the base model.

## Steps
1. Define the criteria for assessing the quality of instruction-output pairs.
2. Implement filtering techniques to evaluate and select high-quality pairs.
3. Save the curated instruction-output pairs for fine-tuning.

In [None]:
# Install PyTorch & other libraries
!pip install "torch==2.1.2" tensorboard

# Install Hugging Face libraries
!pip install  --upgrade \
  "transformers==4.36.2" \
  "datasets==2.16.1" \
  "accelerate==0.26.1" \
  "evaluate==0.4.1" \
  "bitsandbytes==0.42.0" \
  "trl==0.7.10"  \
  "peft==0.7.1"

# Install peft & trl from GitHub at pinned commits (replaces the PyPI versions installed above)
!pip install git+https://github.com/huggingface/trl@a3c5b7178ac4f65569975efadc97db2f3749c65e --upgrade
!pip install git+https://github.com/huggingface/peft@4a1559582281fc3c9283892caea8ccef1d6f5a4f --upgrade

In [None]:

from huggingface_hub import login
import os
 
login(
  token="", # ADD YOUR TOKEN HERE
  add_to_git_credential=False
)

In [None]:
from datasets import load_dataset

# Curated examples from the LIMA dataset which have responses for selected prompts
data_path = "" 
 
# Load our test dataset
eval_dataset = load_dataset("json", data_files=data_path,  split="train")

In [None]:
# load generated responses 
import json

generated_responses_path = "/kaggle/input/generated-responses-from-backward-model/generated_responses.json"

with open(generated_responses_path, 'r') as gen_res:
    generated_responses = json.load(gen_res)
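
Later cells index `generated_responses` and `eval_dataset` with the same position `i`, so the two need to line up. A minimal sanity check, assuming each dataset record has a `data` list whose second element is a dict with a `content` key (the layout the prompt-building cell relies on). `check_alignment` and the toy data are hypothetical and not part of the original notebook:

```python
def check_alignment(generated, records):
    """Return the indices of records that cannot be paired with a generated instruction."""
    if len(generated) != len(records):
        raise ValueError(
            f"Length mismatch: {len(generated)} generated vs {len(records)} records"
        )
    bad = []
    for i, record in enumerate(records):
        data = record.get("data", [])
        # The response is expected at data[1]["content"]
        if len(data) < 2 or "content" not in data[1]:
            bad.append(i)
    return bad


# Toy data standing in for generated_responses and eval_dataset
toy_generated = ["Explain photosynthesis.", "List three primes."]
toy_records = [
    {"data": [{"content": "q1"}, {"content": "Plants convert light into energy."}]},
    {"data": [{"content": "q2"}]},  # missing the response turn
]
print(check_alignment(toy_generated, toy_records))
```

Running a check like this before scoring avoids spending GPU time on a run that would fail or misalign partway through.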

In [None]:
# Loading model for curation of responses with respect to instructions

import torch
from peft import AutoPeftModelForCausalLM
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)

# The model that you want to use from the Hugging Face hub
model_name = "meta-llama/Llama-2-7b-chat-hf"

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

# Load the entire model on the GPU 0
device_map = {"": 0}

In [None]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
# Loads model from hugging face and device mapping
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
#     use_auth_token=os.getenv('HUGGINGFACE_TOKEN')
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load the Llama 2 tokenizer from the Hugging Face Hub
# The tokenizer converts input text into the token IDs the model consumes
# `token` replaces the deprecated `use_auth_token` argument
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, token=os.getenv('HUGGINGFACE_TOKEN'))
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training


In [None]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

In [None]:
system_prompt = """Below is an instruction from an user and a candidate answer. Evaluate whether or
not the answer is a good example of how AI Assistant should respond to the user’s
instruction. Please assign a score using the following 5-point scale:
1: It means the answer is incomplete, vague, off-topic, controversial, or not
exactly what the user asked for. For example, some content seems missing, numbered
list does not start from the beginning, the opening sentence repeats user’s question.
Or the response is from another person’s perspective with their personal experience
(e.g. taken from blog posts), or looks like an answer from a forum. Or it contains
promotional text, navigation text, or other irrelevant information.
2: It means the answer addresses most of the asks from the user. It does not
directly address the user’s question. For example, it only provides a high-level
methodology instead of the exact solution to user’s question.
3: It means the answer is helpful but not written by an AI Assistant. It addresses
all the basic asks from the user. It is complete and self contained with the
drawback that the response is not written from an AI assistant’s perspective, but
from other people’s perspective. The content looks like an excerpt from a blog post,
web page, or web search results. For example, it contains personal experience or
opinion, mentions comments section, or share on social media, etc.
4: It means the answer is written from an AI assistant’s perspective with a
clear focus of addressing the instruction. It provide a complete, clear, and
comprehensive response to user’s question or instruction without missing or
irrelevant information. It is well organized, self-contained, and written in a
helpful tone. It has minor room for improvement, e.g. more concise and focused.
5: It means it is a perfect answer from an AI Assistant. It has a clear focus on
being a helpful AI Assistant, where the response looks like intentionally written
to address the user’s question or instruction without any irrelevant sentences. The
answer provides high quality content, demonstrating expert knowledge in the area, is
very well written, logical, easy-to-follow, engaging and insightful.
Please first provide the score in the format "Score: <rating>" and then provide a brief reasoning you used to derive the rating score in the last line."""

In [None]:
few_shot_chat_format = [
{
    "role": "user",
    "content": "<generated instruction>Explain the impact of blockchain technology on the financial industry.</generated instruction>\n<output>Blockchain is a major trend in tech that enhances data security. It's used in various sectors beyond finance, like healthcare and logistics.</output>"
},
{
    "role": "assistant",
    "content": "Score: 1\nReasoning: The response is off-topic and vague regarding the specific impact on the financial industry. It generalizes the use of blockchain across various sectors without directly addressing the financial industry."
},
{
    "role": "user",
    "content": "<generated instruction>List the steps involved in the scientific method.</generated instruction>\n<output>The scientific method involves forming a hypothesis and then testing it through experiments to see if it's true.</output>"
},
{
    "role": "assistant",
    "content": "Score: 2\nReasoning: The answer addresses the basic asks of the instruction but is incomplete and does not provide a detailed list of all the steps involved in the scientific method."
},
{
    "role": "user",
    "content": "<generated instruction>Describe how to prepare a simple pasta dish.</generated instruction>\n<output>To make a pasta dish, you need pasta, a sauce, and optional ingredients like meat or vegetables. Cook the pasta, prepare the sauce, combine them, and serve.</output>"
},
{
    "role": "assistant",
    "content": "Score: 3\nReasoning: The response is helpful and addresses all the basic asks from the user. It is complete and self-contained but lacks details that would make it more engaging and thorough."
},
{
    "role": "user",
    "content": "<generated instruction>Discuss the role of antioxidants in maintaining health.</generated instruction>\n<output>Antioxidants play a critical role in protecting the body against free radicals, which can cause oxidative stress leading to cell damage and various diseases. Foods rich in antioxidants can help maintain overall health and prevent diseases related to aging and lifestyle.</output>"
},
{
    "role": "assistant",
    "content": "Score: 4\nReasoning: The answer is well-written from an AI assistant’s perspective with a clear focus on addressing the instruction. It provides a complete and comprehensive response without irrelevant information but could include more specific examples of antioxidant-rich foods for an even clearer guide."
},
{
    "role": "user",
    "content": "<generated instruction>Provide a detailed explanation of the process of evaporation in the water cycle.</generated instruction>\n<output>Evaporation is a key part of the water cycle where water from oceans, lakes, and rivers turns into vapor. Heat from the sun causes the water to heat up and rise into the atmosphere. This process helps form clouds and is essential for rainfall and the continuation of the water cycle.</output>"
},
{
    "role": "assistant",
    "content": "Score: 5\nReasoning: The response is a perfect answer from an AI Assistant. It has a clear focus on being a helpful AI Assistant, where the response looks intentionally written to address the user’s question or instruction without any irrelevant sentences. The answer provides high-quality content, demonstrating expert knowledge in the area, is very well written, logical, easy-to-follow, engaging and insightful."
}
]

In [None]:
# Function to generate prompts

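def generate_prompts(start, end):
    """Build one scoring conversation per instruction-output pair in [start, end)."""
    prompts = []

    for i in range(start, end):
        # Pair the generated instruction with the original response from the dataset
        question = f"""<generated instruction>{generated_responses[i]}</generated instruction>\n<output>{eval_dataset[i]["data"][1]['content']}</output>"""

        # Few-shot variant (disabled):
        # conversation=[{"role": "system", "content": system_prompt}] + few_shot_chat_format + [{"role": "user", "content" : question}]
        # Zero-shot: the rubric and the pair go into a single user turn
        conversation = [{"role": "user", "content": system_prompt + "\n\n" + question}]
        prompts.append(conversation)

    return prompts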
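
The commented-out line in `generate_prompts` hints at a few-shot variant. A sketch of how that conversation could be assembled, assuming the chat template accepts a leading `system` message followed by alternating `user`/`assistant` turns. `build_conversation` and the toy inputs are illustrative and not part of the original notebook:

```python
def build_conversation(system_prompt, few_shot, question, use_few_shot=True):
    """Assemble a chat-format message list for one instruction-output pair."""
    if use_few_shot:
        # System message first, then the worked examples, then the pair to score
        return (
            [{"role": "system", "content": system_prompt}]
            + list(few_shot)
            + [{"role": "user", "content": question}]
        )
    # Zero-shot: fold the rubric into a single user turn, as the notebook does
    return [{"role": "user", "content": system_prompt + "\n\n" + question}]


toy_few_shot = [
    {"role": "user", "content": "<generated instruction>A</generated instruction>\n<output>B</output>"},
    {"role": "assistant", "content": "Score: 3\nReasoning: ..."},
]
toy_question = "<generated instruction>C</generated instruction>\n<output>D</output>"

conversation = build_conversation("Rate the answer.", toy_few_shot, toy_question)
print([m["role"] for m in conversation])
```

The few-shot form lengthens every prompt by the size of the examples, so it trades extra tokens per call for a more strongly anchored output format.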

In [None]:
# Call function to generate prompts

prompts = generate_prompts(0, len(generated_responses))

In [None]:
output_scores = []

for p in prompts:
    prompt = pipe.tokenizer.apply_chat_template(p, tokenize=False, add_generation_prompt=True)
    # Greedy decoding: with do_sample=False, sampling settings (temperature, top_k, top_p) are ignored, so they are omitted
    outputs = pipe(prompt, max_new_tokens=512, do_sample=False, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)
    output_scores.append(outputs[0]['generated_text'][len(prompt):])
    torch.cuda.empty_cache()

In [None]:
import re

def extract_first_digit(input_string: str):
    # Use regular expression to find the first occurrence of one or more digits
    match = re.search(r"\d+", input_string)
    if match:
        return match.group(0)  # Return the first found group of digits
    else:
        return None  # Return None if no digits are found

    
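def extract(output: str):
    # Split the model output into the score block and the reasoning that follows it
    parts = output.split("\n\n")

    score_part = parts[0]
    # Re-join everything after the first block; empty string if no reasoning was given
    reason = "\n\n".join(parts[1:])

    score = extract_first_digit(score_part)

    return score, reason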
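
Taking the first run of digits works when the model follows the requested `Score: <rating>` format, but it can pick up unrelated numbers when it does not. A stricter sketch that anchors on the `Score:` label and only accepts ratings from 1 to 5. `parse_score` is a hypothetical helper, and the sample strings are made up for illustration:

```python
import re

# Match the "Score:" label followed by a single digit in the rubric's range
SCORE_PATTERN = re.compile(r"Score:\s*([1-5])\b")


def parse_score(output: str):
    """Return the rating as an int in 1..5, or None if no valid 'Score: <rating>' is found."""
    match = SCORE_PATTERN.search(output)
    return int(match.group(1)) if match else None


print(parse_score("Score: 4\n\nThe answer is clear and complete."))
print(parse_score("I would rate this 10 out of 10."))
print(parse_score("Score:  5"))
```

Returning `None` for malformed replies makes it possible to count and inspect them separately instead of silently storing a wrong score.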

In [None]:
score_reason = []

for i in range(len(output_scores)):
    output_score = output_scores[i]
    extracted_values = extract(output_score)
    
    score_reason.append((generated_responses[i], eval_dataset[i]["data"][1]['content']) + extracted_values)

In [None]:
import json

# Convert the list of (instruction, response, score, reason) tuples to a list of dictionaries for a clearer JSON structure
json_data = [
    {"instruction": entry[0], "response": entry[1], "score": entry[2], "reason": entry[3]}
    for entry in score_reason
]

# Convert to JSON string
json_string = json.dumps(json_data, indent=4)
print(json_string)  # To see the output on the console

# Save the JSON to a file
with open('self_curated.json', 'w') as f:
    json.dump(json_data, f, indent=4)
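
The saved file holds every scored pair, not only the good ones. To keep just the high-quality pairs for fine-tuning, a threshold on the score can be applied. A minimal sketch, assuming records shaped like the `json_data` entries, with the score stored as a digit string or `None`. `select_high_quality` and the threshold of 4 are illustrative choices, not values taken from the notebook:

```python
def select_high_quality(records, min_score=4):
    """Keep records whose score parses as an integer >= min_score."""
    selected = []
    for record in records:
        score = record.get("score")
        try:
            value = int(score)
        except (TypeError, ValueError):
            # Missing or unparseable scores are dropped rather than guessed
            continue
        if value >= min_score:
            selected.append(record)
    return selected


toy_records = [
    {"instruction": "a", "response": "x", "score": "5", "reason": ""},
    {"instruction": "b", "response": "y", "score": "2", "reason": ""},
    {"instruction": "c", "response": "z", "score": None, "reason": ""},
]
curated = select_high_quality(toy_records)
print([r["instruction"] for r in curated])
```

Raising `min_score` to 5 gives a smaller, stricter set; which threshold works better depends on how many pairs survive and is worth checking on the actual score distribution.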
