# Improving Large Language Models through Preference Learning and Direct Preference Optimization

# Name: Hu Liu
# Student ID: 002872539

<div align="center">
  <p>
    <a href="#abstract">Abstract</a> •
    <a href="#introduction">Introduction</a> •
    <a href="#project-goals">Project Goals</a> •
    <a href="#research-methods">Research Methods</a> •
    <a href="#outputs">Outputs</a> •
    <a href="#conclusions">Conclusions</a>
  </p>
</div>

---

## Abstract

This project explores how preference learning and Direct Preference Optimization (DPO) can improve the performance of Large Language Models (LLMs). We implemented two preference dataset collection methods: one based on an LLM judging system and another using the PairRM model. After applying these datasets to DPO fine-tuning of the Llama-3.2-1B model, we observed significant improvements in response quality. Both fine-tuned models outperform the base model across the evaluation criteria, with each showing distinct strengths. We also explored iterative DPO training and found that it leads to compounding performance improvements. The project provides a practical implementation of preference learning and offers insights for future LLM training directions.

## Introduction

Large Language Models (LLMs) have become a revolutionary force in the field of natural language processing, demonstrating impressive capabilities across a range of tasks from question answering to content generation. However, even the most advanced models have limitations, particularly in generating high-quality responses that align with human preferences. Standard supervised fine-tuning approaches often fall short of capturing the nuances of response quality, prompting researchers to explore new training paradigms.

This project focuses on two promising approaches for generating high-quality preference datasets: an LLM-based judging system and the PairRM model. These datasets are then used to fine-tune the Llama-3.2-1B model with Direct Preference Optimization (DPO), which lets a model learn directly from preference pairs and avoids the complexity of traditional reinforcement learning methods.

Our work builds upon several key research foundations, including the DPO method proposed by Rafailov et al.[^1] and the PairRM technique by Jiang et al.[^2]. These approaches provide innovative ways to train models that generate outputs more aligned with human values and expectations. By implementing these techniques and conducting comparative analysis, we aim to provide practical insights into methods for improving large language models.

## Project Goals

This project aims to achieve the following core objectives:

1. **Implement and Compare Preference Dataset Collection Methods**
   - Develop an LLM-based judging system with effective prompt templates and evaluation criteria
   - Implement the PairRM algorithm for generating preference pairs
   - Evaluate the effectiveness, consistency, and data quality of both approaches

2. **Fine-tune Llama-3.2 Models using DPO Techniques**
   - Train models independently using the two different preference datasets
   - Optimize training parameters for best performance
   - Explore the effectiveness of LoRA adapters in reducing computational requirements while maintaining performance

3. **Conduct Comprehensive Comparative Analysis**
   - Evaluate the performance of original and fine-tuned models on novel instructions
   - Analyze qualitative differences in model outputs
   - Identify strengths and limitations of each approach

4. **Explore the Potential of Iterative DPO Training**
   - Implement consecutive training cycles, creating new preference datasets using outputs from previous iterations
   - Evaluate the cumulative impact of the iterative approach on model performance
   - Provide insights on the scalability of this approach

5. **Provide Practical Implementation and Documentation**
   - Create a reproducible codebase and workflows
   - Provide detailed documentation for research and practical applications
   - Share the datasets and adapters on HuggingFace for community access and further research

## Research Methods

Our methodology is divided into four main phases: data collection and processing, preference dataset generation, model training, and evaluation. Below is a detailed description of each phase:

### 1. Data Collection and Processing

We extracted 50 high-quality instructions from the GAIR/Lima dataset, which contains numerous human-curated question-answer pairs. These instructions cover a variety of topics and complexity levels, providing a diverse foundation for our experiments.

For each instruction, we loaded the Llama-3.2-1B model and collected 5 distinct responses. To ensure diversity, the generation code in this notebook uses the following settings:
- Sampling temperatures swept over 0.7, 0.8, 0.9, 1.1, 1.2, and 1.5
- Top-p (nucleus) sampling set to 0.95, combined with top-k sampling (k = 80)
- Maximum new tokens set to 256
- Five sampled sequences per temperature, filtered for near-duplicates until 5 distinct responses remain
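
To make the roles of temperature and top-p concrete, here is a minimal, library-free sketch of how the two interact when a single token is sampled. It is only an illustration: the function names and default values are assumptions of this sketch, not the code used in the project, where sampling is handled inside `model.generate`.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a valid distribution
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index (default values are illustrative only)."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Raising the temperature flattens the distribution before the nucleus is cut, so more tokens survive the top-p filter and the sampled responses diverge more from one another.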

The data cleaning process included:
- Removing incomplete or malformatted responses
- Normalizing whitespace and punctuation
- Ensuring responses were relevant to the instructions
- Detecting and removing duplicate content

### 2. Preference Dataset Generation

#### a) LLM Judge System

We designed an LLM-based judging system using the Llama-3.2 model to evaluate the quality of response pairs. The judge prompt included the following components:
- System instructions outlining evaluation criteria and expected output format
- Detailed explanation of evaluation criteria, including relevance, accuracy, depth, clarity, and overall usefulness
- The input instruction and two responses to be evaluated
- Request for detailed evaluation analysis and clear preference judgment
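
As a rough illustration of these components, a judge prompt and verdict parser might look like the sketch below. The template wording and the `[[A]]` / `[[B]]` / `[[TIE]]` verdict tags are assumptions made for this example; the prompt actually used in the project may differ.

```python
import re

# Hypothetical template; only the listed criteria come from the report.
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to the instruction below on relevance, accuracy, depth, clarity, and overall usefulness. Explain your reasoning briefly, then finish with exactly one verdict tag: [[A]], [[B]], or [[TIE]].

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}
"""

def build_judge_prompt(instruction, response_a, response_b):
    """Fill the template with one instruction and a pair of candidate responses."""
    return JUDGE_TEMPLATE.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )

def parse_verdict(judge_output):
    """Return 'A', 'B', or 'TIE'; None if the judge produced no verdict tag."""
    matches = re.findall(r"\[\[(A|B|TIE)\]\]", judge_output)
    # Use the last tag, since the reasoning may mention tags before the final verdict
    return matches[-1] if matches else None
```

Asking for a fixed tag at the end keeps verdict extraction to a single regular expression, and an unparseable output can simply be treated like a tie and dropped.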

The judging process was as follows:
1. For each instruction, we created all possible response pair combinations (10 pairs total)
2. Each pair was sent to the judging system for evaluation
3. The verdict (preference for A, preference for B, or tie) was extracted
4. Results were converted to DPO-compatible format, discarding ties
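
The four steps above can be sketched as follows. With 5 responses per instruction, `itertools.combinations` yields the 10 unordered pairs. The `judge` callable and the `prompt` / `chosen` / `rejected` field names are assumptions of this sketch (the field names follow the convention commonly used for DPO datasets).

```python
from itertools import combinations

def build_preference_pairs(instruction, responses, judge):
    """Build DPO-style records from all response pairs.

    `judge(instruction, a, b)` is a hypothetical callable that returns
    'A', 'B', or 'TIE' for a pair of responses.
    """
    records = []
    for a, b in combinations(responses, 2):
        verdict = judge(instruction, a, b)
        if verdict == "A":
            records.append({"prompt": instruction, "chosen": a, "rejected": b})
        elif verdict == "B":
            records.append({"prompt": instruction, "chosen": b, "rejected": a})
        # Ties (and any unrecognized verdict) are discarded
    return records
```

Because ties are dropped, each instruction contributes at most 10 records, which is why the final datasets are smaller than 50 × 10 pairs.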

#### b) PairRM Approach

PairRM is a pairwise reward model from the LLM-Blender project that predicts human preferences by encoding an instruction together with two candidate responses and scoring which response is better. We implemented PairRM using the llm_blender library with the following workflow:
1. Loading the pre-trained PairRM model
2. For each instruction and its 5 responses, generating all 10 possible pairs
3. Using the PairRM model to assign preference scores to each pair
4. Determining preference direction based on scores
5. Formatting results into a DPO-compatible dataset
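
A hedged sketch of this workflow is shown below. It uses the `Blender.loadranker` and `Blender.rank` calls documented on the PairRM model card and derives pairs from the returned ranks, which is a simplification of the pair-by-pair scoring described above. The helper names and record fields are assumptions of this example, and `rank_with_pairrm` downloads the checkpoint when called.

```python
from itertools import combinations

def ranks_to_pairs(instruction, responses, ranks):
    """Turn per-response ranks (1 = best) into chosen/rejected records."""
    records = []
    for i, j in combinations(range(len(responses)), 2):
        if ranks[i] == ranks[j]:
            continue  # no preference direction for equal ranks
        win, lose = (i, j) if ranks[i] < ranks[j] else (j, i)
        records.append({
            "prompt": instruction,
            "chosen": responses[win],
            "rejected": responses[lose],
        })
    return records

def rank_with_pairrm(instructions, candidate_lists, batch_size=1):
    """Rank each instruction's candidates with PairRM.

    Requires the llm_blender package; imported lazily so the pure helper
    above can be used without it.
    """
    import llm_blender

    blender = llm_blender.Blender()
    blender.loadranker("llm-blender/PairRM")
    # ranks[i][j] is the rank of candidate j for instruction i
    return blender.rank(
        instructions, candidate_lists, return_scores=False, batch_size=batch_size
    )
```

For example, `ranks_to_pairs(instr, responses, ranks[0])` would convert the ranking of the first instruction's 5 responses into up to 10 preference records.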

Both datasets were uploaded to HuggingFace for accessibility and reuse.

### 3. Model Training

We fine-tuned the Llama-3.2-1B model using the DPO (Direct Preference Optimization) method. DPO learns directly from preference pairs without requiring an explicit reward model, offering more stability and computational efficiency.
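
For reference, the DPO objective from Rafailov et al. for a preference triple $(x, y_w, y_l)$, with $y_w$ the preferred and $y_l$ the non-preferred response, is

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right]\right),$$

averaged over the dataset. The sketch below evaluates this loss for a single example from summed sequence log-probabilities. It is a plain-Python illustration with assumed argument names, not the training code, which relies on a library implementation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed as softplus(-margin) for numerical stability
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is $\log 2$. The loss falls as the policy raises the likelihood of the preferred response relative to the reference and lowers that of the non-preferred one.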

The training configuration was as follows:
- LoRA (Low-Rank Adaptation) was used to reduce computational requirements
- Fine-tuning parameters:
  - Learning rate: 2e-5 with cosine learning rate scheduler
  - Batch size: 4
  - Training epochs: 1
  - Beta value: 0.1 (DPO regularization parameter)
  - LoRA rank: 32, LoRA alpha: 64
  - 4-bit quantization to reduce memory usage
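
One way to express this configuration in code is sketched below, assuming recent versions of the `peft`, `trl`, and `transformers` libraries. The `output_dir` path and the NF4 quantization type are assumptions of this sketch, and any setting not listed above (for example LoRA dropout or target modules) is left at the library default.

```python
# Hyperparameters as listed in the training configuration
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "per_device_train_batch_size": 4,
    "num_train_epochs": 1,
    "beta": 0.1,        # DPO regularization strength
    "lora_r": 32,
    "lora_alpha": 64,
}

def build_configs(output_dir="./dpo-output"):  # hypothetical output path
    """Build quantization, LoRA, and DPO config objects (imports are lazy)."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig
    from trl import DPOConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",  # assumption: same type as the base-model loading cell
        bnb_4bit_compute_dtype=torch.float16,
    )
    lora_config = LoraConfig(
        r=HYPERPARAMS["lora_r"],
        lora_alpha=HYPERPARAMS["lora_alpha"],
        task_type="CAUSAL_LM",
    )
    dpo_config = DPOConfig(
        output_dir=output_dir,
        beta=HYPERPARAMS["beta"],
        learning_rate=HYPERPARAMS["learning_rate"],
        lr_scheduler_type=HYPERPARAMS["lr_scheduler_type"],
        per_device_train_batch_size=HYPERPARAMS["per_device_train_batch_size"],
        num_train_epochs=HYPERPARAMS["num_train_epochs"],
    )
    return bnb_config, lora_config, dpo_config
```

With rank 32 and alpha 64, the LoRA update is scaled by alpha / rank = 2.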

We trained two models independently:
1. DPO model using the LLM judge dataset
2. DPO model using the PairRM dataset

The LoRA adapters for both models were uploaded to HuggingFace.

### 4. Evaluation Methodology

To evaluate model performance, we selected 10 novel instructions (not included in the training data) and generated responses using:
- The original Llama-3.2-1B model
- The DPO model trained with the judge dataset
- The DPO model trained with the PairRM dataset

Evaluation criteria included:
- Relevance: How pertinent the response is to the instruction
- Accuracy: Factual correctness of the information provided
- Depth: Level of detail and insight offered in the response
- Clarity: How clear and well-structured the expression is
- Overall quality: Holistic assessment considering all factors

We used a combination of quantitative scoring (scale of 1-5) and qualitative analysis, creating a detailed evaluation matrix to compare model performance.
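
A minimal sketch of how such an evaluation matrix could be aggregated is shown below. The row layout (one dict per rated response, with a `model` key and one 1-5 score per criterion) is an assumption of this example and not the project's actual evaluation code.

```python
from statistics import mean

CRITERIA = ["relevance", "accuracy", "depth", "clarity", "overall"]

def summarize_scores(ratings):
    """Average the 1-5 scores per model and criterion.

    `ratings` is a list of dicts, each with a 'model' key and one
    numeric score for every name in CRITERIA.
    """
    by_model = {}
    for row in ratings:
        by_model.setdefault(row["model"], []).append(row)
    return {
        model: {c: round(mean(r[c] for r in rows), 2) for c in CRITERIA}
        for model, rows in by_model.items()
    }
```

The result maps each model name to its mean score per criterion, which can be printed directly or loaded into a `pandas.DataFrame` for side-by-side comparison.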

### 5. Iterative DPO Implementation (Extended Research)

As extended research, we implemented iterative DPO training:
1. Using the model from the first round of DPO training to generate new responses
2. Evaluating these new responses using the LLM judging system or PairRM to create new preference datasets
3. Conducting additional DPO training rounds using these new datasets
4. Evaluating performance improvements after each iteration
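
The loop structure of these steps can be sketched as follows. All four callables are hypothetical hooks standing in for the generation, preference-labeling, training, and evaluation code; the sketch only shows how each round feeds the next.

```python
def iterative_dpo(model, instructions, generate, build_pairs, train_dpo, evaluate, rounds=2):
    """Run several DPO rounds, regenerating preference data from the latest model.

    Hypothetical hooks:
      generate(model, instruction)        -> list of candidate responses
      build_pairs(instruction, responses) -> list of preference records
      train_dpo(model, pairs)             -> updated model
      evaluate(model)                     -> any metric value
    """
    history = []
    for round_idx in range(1, rounds + 1):
        pairs = []
        for instruction in instructions:
            responses = generate(model, instruction)  # sample from the current model
            pairs.extend(build_pairs(instruction, responses))
        model = train_dpo(model, pairs)  # next round starts from this model
        history.append({
            "round": round_idx,
            "num_pairs": len(pairs),
            "metric": evaluate(model),
        })
    return model, history
```

Passing the first-round DPO model as `model` reproduces the setup described above, and `history` records how many preference pairs and what metric each additional round produced.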

## Outputs

Our project produced several key outputs, from datasets to trained models and experimental results.

### 1. Preference Datasets

We created two main preference datasets, each containing approximately 400 preference pairs (after discarding ties) generated from 50 instructions:

#### LLM Judge Dataset
This dataset contains preference pairs evaluated by our LLM judging system. Each entry includes:
- Original instruction
- Preferred response (winner)
- Non-preferred response (loser)
- Reasoning for the choice provided by the judging system

HuggingFace dataset access link: [xiaokeliu/llm-judge-preferences](https://huggingface.co/xiaokeliu)

#### PairRM Dataset
This dataset contains preference pairs determined by the PairRM algorithm. Each entry includes:
- Original instruction
- Preferred response (higher PairRM score)


In [None]:
!pip install transformers torch datasets huggingface_hub tokenizers bitsandbytes accelerate tqdm
!pip install git+https://github.com/yuchenlin/LLM-Blender.git


In [None]:
from datasets import load_dataset
from huggingface_hub import notebook_login
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import random
import torch
import pandas as pd
import json
# ================== Config Section ===================
HF_TOKEN = "HUGGING_FACE_TOKEN"  # Placeholder: replace with your own Hugging Face token (the author's token has been removed)
model_name = "meta-llama/Llama-3.2-1B"
num_responses = 5
num_instructions = 50
# =====================================================

print("********Loading LIMA dataset********\n")
dataset = load_dataset("kira/lima", split="train")
dataset = dataset.select(range(1000))
print(dataset[0])


# Data preprocessing
def extract_llamastyle_instruction(sample):
    """
    Extract the first human/gpt turn pair from a conversation sample and
    format the instruction with the LLaMA [INST] ... [/INST] template.

    Returns:
        dict: {"Instruction": "[INST] ... [/INST]", "Response": "..."}
        If no valid pair is found, both values are None so that the
        sample can be filtered out after `map`.
    """
    convs = sample.get("conversations", [])

    for i in range(len(convs) - 1):
        if convs[i]['from'] == 'human' and convs[i + 1]['from'] == 'gpt':
            instruction = convs[i]['value'].strip()
            response = convs[i + 1]['value'].strip()
            return {
                "Instruction": f"[INST] {instruction} [/INST]",
                "Response": response
            }
    return {"Instruction": None, "Response": None}

ds = dataset.map(extract_llamastyle_instruction,remove_columns=dataset.column_names)
ds = ds.filter(lambda x: x["Instruction"] is not None and x["Response"] is not None)
ds = ds.rename_columns({"Instruction": "instruction", "Response": "response"})
print(ds[0])

# Per the assignment requirements, keep 50 unique instructions (change num_instructions to adjust)
unique_instructions = []
seen = set()
for sample in ds:
    instr = sample["instruction"]
    if instr not in seen:
        seen.add(instr)
        unique_instructions.append(sample)
    if len(unique_instructions) == num_instructions:
        break
ds = [{"instruction":instr['instruction']} for instr in unique_instructions]


# Save the selected instructions so they can be reused for DPO training
used_instructions_for_saving = [{"instruction": item["instruction"]} for item in unique_instructions]
with open("used_instructions.json", "w") as f:
     json.dump(used_instructions_for_saving, f, indent=2, ensure_ascii=False)

#load tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)  # gated repo: pass the token here as well
tokenizer.pad_token = tokenizer.eos_token

#load llama
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             token=HF_TOKEN)
model.save_pretrained("./base_model")
tokenizer.save_pretrained("./base_model")

torch.cuda.empty_cache()



********Loading LIMA dataset********

{'conversations': [{'from': 'human', 'value': 'Can brain cells move? By movement I mean long distance migration (preferably within the brain only).'}, {'from': 'gpt', 'value': 'The question is relatively broad and one should take into account that the brain not only consists of neurons, but also glial cells (supportive cells) and pre-mitotic neuronal stem cells. Furthermore, as critical fellow-scientists have indicated, developmental stage is very important, as the developing embryonic brain is very different from the adult brain.\nHowever, after sifting through various publications, the answer to the question is actually remarkably simple: Yes, brain cells migrate.\nIn  the adult brain glial cells migrate in the brain (Klämbt, 2009). Glial cells are involved in a myriad of functions, but a notable example of migrating glial cells are the oligodendrocytes that migrate relative long distances to find their target axons onto which they wrap themselve


In [None]:
from google.colab import files
files.download("used_instructions.json")


In [None]:
from tqdm import tqdm
import json
import re
import difflib
def clean_output(text, instruction=None):
    text = re.sub(r'\[/?(INST|SYS|ASSISTANT)\]', '', text)
    if instruction:
        instruction = instruction.strip()
        text = text.replace(instruction, '')
    lines = text.splitlines()

    seen = set()
    cleaned_lines = []
    for line in lines:
        line = line.strip()
        if line and line.lower() not in seen:
            seen.add(line.lower())
            cleaned_lines.append(line)

    result = " ".join(cleaned_lines).strip()
    if len(result.split()) < 5:
        return ""  # too short; treat as invalid
    return result

def generate_response(instruction, max_new_tokens=256):
    """Sample candidate responses for one instruction at several temperatures."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    raw_instruction = instruction.strip()
    prompt = f"[INST]You are a helpful assistant that answers user questions differently and concisely each time.\n{raw_instruction}[/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    prompt_len = inputs["input_ids"].shape[1]

    temps = [0.7, 0.8, 1.2, 0.9, 1.1, 1.5]
    decoded_all = []
    for temp in temps:
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.95,
            top_k=80,
            temperature=temp,
            num_return_sequences=5,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
        # Decode only the newly generated tokens so that the prompt (including
        # the system sentence) does not leak into the stored responses.
        decoded = tokenizer.batch_decode(output[:, prompt_len:], skip_special_tokens=True)
        decoded_all.extend(decoded)

    cleaned_data = []
    for text in decoded_all:
        cleaned = clean_output(text, instruction=raw_instruction)
        # clean_output returns "" for responses shorter than 5 words
        if not cleaned:
            print("Response too short, discarded")
            continue
        cleaned_data.append({"instruction": raw_instruction, "response": cleaned})
    torch.cuda.empty_cache()

    return cleaned_data

def is_similar(a, b, threshold=0.7):
    a_words = set(a.lower().split())
    b_words = set(b.lower().split())
    intersection = a_words & b_words
    return len(intersection) / max(len(a_words), len(b_words)) >= threshold

all_result = []

for sample in tqdm(ds, desc=f"Generating {num_responses} responses for {num_instructions} instructions"):
    #instruction = sample["instruction"]
    instruction = re.sub(r'\[/?INST\]', '', sample["instruction"]).strip()
    seen = []
    generated_responses = []
    attempt = 0

    while len(generated_responses) < num_responses and attempt < 5:
        result = generate_response(instruction)#, num_return_sequences=10

        for item in result:
            response_text = item.get("response") if isinstance(item, dict) else item
            if isinstance(response_text, str):
                response_text = response_text.strip()
                if not response_text:
                    continue

                # Duplicate check: skip responses too similar to ones already accepted
                is_duplicate = any(is_similar(response_text, existing, threshold=0.7) for existing in seen)
                if is_duplicate:
                    print("Skipped (too similar to previous)")
                    continue

                seen.append(response_text)
                generated_responses.append({
                    "instruction": instruction,
                    "response": response_text
                })
                print(f"Accepted #{len(generated_responses)}response")

                if len(generated_responses) == num_responses:
                    break

        attempt += 1

    if len(generated_responses) < num_responses:
        print(f"Only got {len(generated_responses)} unique responses for: {instruction[:60]}")

    all_result.extend(generated_responses)


Generating 5 responses for 50 instructions:   2%|▏         | 1/50 [00:57<47:12, 57.80s/it]

Accepted #1response
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:   4%|▍         | 2/50 [01:35<36:46, 45.97s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Accepted #3response
Skipped (too similar to previous)
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:   6%|▌         | 3/50 [02:05<30:09, 38.50s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:   8%|▊         | 4/50 [02:42<29:16, 38.19s/it]

Accepted #1response
Accepted #2response
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previo

Generating 5 responses for 50 instructions:  10%|█         | 5/50 [02:52<20:56, 27.91s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  12%|█▏        | 6/50 [03:47<27:10, 37.05s/it]

Accepted #1response
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  14%|█▍        | 7/50 [04:43<31:00, 43.27s/it]

Accepted #1response
Skipped (too similar to previous)
Accepted #2response
Accepted #3response
Accepted #4response
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previo

Generating 5 responses for 50 instructions:  16%|█▌        | 8/50 [04:58<24:04, 34.39s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  18%|█▊        | 9/50 [05:36<24:12, 35.42s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previo

Generating 5 responses for 50 instructions:  20%|██        | 10/50 [06:12<23:44, 35.62s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  22%|██▏       | 11/50 [06:56<24:53, 38.30s/it]

Accepted #1response
Accepted #2response
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (t

Generating 5 responses for 50 instructions:  24%|██▍       | 12/50 [07:28<22:54, 36.17s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  26%|██▌       | 13/50 [08:15<24:19, 39.45s/it]

Accepted #1response
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  28%|██▊       | 14/50 [08:58<24:20, 40.58s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  30%|███       | 15/50 [09:26<21:32, 36.93s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  32%|███▏      | 16/50 [09:58<20:00, 35.32s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  34%|███▍      | 17/50 [10:24<17:49, 32.41s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)

Generating 5 responses for 50 instructions:  36%|███▌      | 18/50 [10:49<16:10, 30.33s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  38%|███▊      | 19/50 [11:37<18:24, 35.62s/it]

Accepted #1response
Accepted #2response
Accepted #3response
Skipped (too similar to previous)
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  40%|████      | 20/50 [12:33<20:56, 41.87s/it]

Accepted #1response
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  42%|████▏     | 21/50 [13:26<21:43, 44.96s/it]

Accepted #1response
Accepted #2response
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  44%|████▍     | 22/50 [14:02<19:45, 42.33s/it]

Accepted #1response
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  46%|████▌     | 23/50 [14:53<20:11, 44.87s/it]

Accepted #1response
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  48%|████▊     | 24/50 [15:20<17:09, 39.60s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  50%|█████     | 25/50 [16:14<18:16, 43.86s/it]

Accepted #1response
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  52%|█████▏    | 26/50 [16:51<16:48, 42.01s/it]

Accepted #1response
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  54%|█████▍    | 27/50 [17:48<17:47, 46.41s/it]

Accepted #1response
Accepted #2response
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)

Generating 5 responses for 50 instructions:  56%|█████▌    | 28/50 [18:16<14:56, 40.76s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  58%|█████▊    | 29/50 [19:12<15:52, 45.34s/it]

Accepted #1response
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  60%|██████    | 30/50 [19:31<12:28, 37.43s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)

Generating 5 responses for 50 instructions:  62%|██████▏   | 31/50 [20:05<11:36, 36.66s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  64%|██████▍   | 32/50 [20:49<11:34, 38.61s/it]

Accepted #1response
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  66%|██████▌   | 33/50 [21:14<09:46, 34.51s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  68%|██████▊   | 34/50 [22:10<10:57, 41.07s/it]

Accepted #1response
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  70%|███████   | 35/50 [22:33<08:56, 35.75s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  72%|███████▏  | 36/50 [22:59<07:36, 32.60s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  74%|███████▍  | 37/50 [23:27<06:46, 31.27s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)

Generating 5 responses for 50 instructions:  76%|███████▌  | 38/50 [24:01<06:27, 32.28s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  78%|███████▊  | 39/50 [24:41<06:18, 34.42s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)

Generating 5 responses for 50 instructions:  80%|████████  | 40/50 [25:19<05:55, 35.53s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  82%|████████▏ | 41/50 [26:15<06:16, 41.81s/it]

Accepted #1response
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)

Generating 5 responses for 50 instructions:  84%|████████▍ | 42/50 [26:52<05:22, 40.26s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)

Generating 5 responses for 50 instructions:  86%|████████▌ | 43/50 [27:21<04:18, 36.90s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)

Generating 5 responses for 50 instructions:  88%|████████▊ | 44/50 [27:47<03:21, 33.67s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  90%|█████████ | 45/50 [28:00<02:16, 27.31s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Accepted #4response
Accepted #5response


Generating 5 responses for 50 instructions:  92%|█████████▏| 46/50 [28:30<01:52, 28.15s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)

Generating 5 responses for 50 instructions:  94%|█████████▍| 47/50 [29:08<01:33, 31.18s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions:  96%|█████████▌| 48/50 [30:05<01:17, 38.85s/it]

Accepted #1response
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #3response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #4response
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response
Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)

Generating 5 responses for 50 instructions:  98%|█████████▊| 49/50 [30:37<00:36, 36.80s/it]

Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #5response


Generating 5 responses for 50 instructions: 100%|██████████| 50/50 [31:06<00:00, 37.34s/it]

Accepted #1response
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Skipped (too similar to previous)
Accepted #2response
Accepted #3response
Accepted #4response
Skipped (too similar to previous)
Accepted #5response





In [None]:
testing_result = generate_response(instruction)
if len(testing_result) < 2:
    print(f"less than 2 responses for instruction: {instruction[:50]}")

In [None]:
# Build judge inputs: instruction + response_a + response_b
# Group responses by instruction
import json
import random
from collections import defaultdict
from itertools import combinations

instr_to_response = defaultdict(list)

for item in all_result:
    # Strip the chat-template markers so only the raw instruction remains
    instr = item['instruction'].replace("[INST]", "").replace("[/INST]", "").strip()
    response = item['response'].strip()
    if response:
        instr_to_response[instr].append(response)

# Build one randomly chosen A/B response pair per instruction
judge_data = []
for instr, responses in instr_to_response.items():
    if len(responses) < 2:
        continue
    pairs = list(combinations(responses, 2))
    random.shuffle(pairs)
    for a, b in pairs[:1]:
        judge_data.append(
            {
                "instruction": instr,
                "response_a": a,
                "response_b": b
            })
# Pad to 50 pairs by re-sampling existing pairs (skipped if no pairs were built)
while judge_data and len(judge_data) < 50:
    extra = random.choice(judge_data)
    judge_data.append(extra)
print(f"total judge pairs: {len(judge_data)}")

# Save the pairs as JSON
with open("judge_data.json", "w") as f:
    json.dump(judge_data, f, indent=2)
for instr, responses in instr_to_response.items():
    print(f"{instr[:50]}...-> {len(responses)} responses")

total judge pairs: 50
Can brain cells move? By movement I mean long dist...-> 5 responses
In our computer systems lecture we were introduced...-> 5 responses
View tabular file such as CSV from command line, h...-> 5 responses
Slater type orbitals (STO) are considered to be mo...-> 5 responses
Explain what "git reset" does. I come from a SVN b...-> 5 responses
I am looking to use Java to get the MD5 checksum o...-> 5 responses
What are the primary objections Democrats have to ...-> 5 responses
I'm converting a video to GIF file with ```ffmpeg`...-> 5 responses
Tor can only handle TCP connections, but DNS is a ...-> 5 responses
Why does this throw ```NullPointerException```
```...-> 5 responses
How do DOS games like DOOM benefit from a PCI grap...-> 5 responses
I need to be able to open a document using its def...-> 5 responses
Why does PRC devalue its currency on purpose, but ...-> 5 responses
Is it worth patenting an algorithm if I don't have...-> 5 responses
"I have a ```String[]``` w

In [None]:
# judge prompt

def build_judge_prompt(instruction, a, b):
    return f"""[INST]
Please act as an impartial judge and evaluate the quality of the responses provided by two
AI assistants to the user question displayed below. You should choose the assistant that
follows the user’s instructions and answers the user’s question better. Your evaluation
should consider factors such as the helpfulness, relevance, accuracy, depth, creativity,
and level of detail of their responses. Begin your evaluation by comparing the two
responses and provide a short explanation. Avoid any position biases and ensure that the
order in which the responses were presented does not influence your decision. Do not allow
the length of the responses to influence your evaluation. Do not favor certain names of
the assistants. Be as objective as possible. After providing your explanation, output your
final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]"
if assistant B is better, and "[[C]]" for a tie.

instruction:
{instruction}

response_a:
{a}

response_b:
{b}


Please answer with your judgment in this format:
[[A]] — if response A is better
[[B]] — if response B is better
[[C]] — if it's a tie.

[/INST]
"""
judge_prompt = build_judge_prompt(judge_data[0]['instruction'], judge_data[0]['response_a'], judge_data[0]['response_b'])
print(judge_prompt)


[INST]
Please act as an impartial judge and evaluate the quality of the responses provided by two
AI assistants to the user question displayed below. You should choose the assistant that
follows the user’s instructions and answers the user’s question better. Your evaluation
should consider factors such as the helpfulness, relevance, accuracy, depth, creativity,
and level of detail of their responses. Begin your evaluation by comparing the two
responses and provide a short explanation. Avoid any position biases and ensure that the
order in which the responses were presented does not influence your decision. Do not allow
the length of the responses to influence your evaluation. Do not favor certain names of
the assistants. Be as objective as possible. After providing your explanation, output your
final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]"
if assistant B is better, and "[[C]]" for a tie.

instruction:
Can brain cells move? By movement I mean
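
The prompt asks the judge to ignore the order in which the two responses appear. One common way to check this is to query both orders and keep a verdict only when they agree. The sketch below illustrates the idea with stand-in judge functions rather than the model; `order_consistent_verdict`, `length_judge`, and `first_position_judge` are hypothetical names introduced here, and treating a disagreement as a tie is an assumption, not something the notebook does.

```python
def order_consistent_verdict(judge_fn, instruction, a, b):
    """Judge (a, b) and (b, a); return 'A' or 'B' only if both orders agree, else 'C'."""
    first = judge_fn(instruction, a, b)   # verdict with `a` shown first
    second = judge_fn(instruction, b, a)  # verdict with `b` shown first
    # Map the swapped-order verdict back onto the original A/B labels
    swapped_back = {"A": "B", "B": "A"}.get(second, second)
    if first in ("A", "B") and first == swapped_back:
        return first
    return "C"


def length_judge(instruction, a, b):
    """Toy judge: prefers the longer response, regardless of position."""
    if len(a) == len(b):
        return "C"
    return "A" if len(a) > len(b) else "B"


def first_position_judge(instruction, a, b):
    """Toy judge with pure position bias: always picks the first slot."""
    return "A"


v_consistent = order_consistent_verdict(length_judge, "q", "long answer", "short")
v_biased = order_consistent_verdict(first_position_judge, "q", "long answer", "short")
print(v_consistent, v_biased)
```

With a real model the same wrapper would double the number of judge calls, which is the main cost of this check.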

In [None]:
# Generate the judge's response
def run_llama(prompt, max_new_tokens=128):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens. The prompt itself contains
    # "[[A]]", so decoding it too would make extract_verdict always match "A".
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


result = run_llama(judge_prompt)
print(result)

# Extract the verdict
import re
def extract_verdict(output_text):
    # Accept "[[A]]", "[[B]]", or "[[C]]", allowing spaces inside the brackets
    match = re.search(r'\[\[\s*([ABC])\s*\]\]', output_text.strip())
    if match:
        return match.group(1)
    else:
        return "Unknown"





[
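
To see how the verdict pattern behaves, here is a small self-contained check. The function body mirrors `extract_verdict`; the sample strings are written by hand for illustration and are not outputs from the run above.

```python
import re


def extract_verdict(output_text):
    """Return 'A', 'B', or 'C' for the first [[X]] marker, else 'Unknown'."""
    match = re.search(r'\[\[\s*([ABC])\s*\]\]', output_text.strip())
    return match.group(1) if match else "Unknown"


# Hypothetical judge outputs, made up for this example
samples = [
    "Response A is more complete. [[A]]",
    "Both are equally vague. [[ C ]]",
    "I prefer the second answer.",
]
verdicts = [extract_verdict(s) for s in samples]
print(verdicts)
```

Because `re.search` returns the first match, any `[[A]]` that appears earlier in the text (for example, in an echoed copy of the instructions) takes precedence over a verdict that comes later.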


In [None]:
final_judge_result = []

for item in tqdm(judge_data, desc = "Judging response"):
    instruction = item['instruction']
    a = item['response_a']
    b = item['response_b']
    judge_prompt = build_judge_prompt(instruction, a, b)
    try:
        judge_result = run_llama(judge_prompt)
        verdict = extract_verdict(judge_result)
        final_judge_result.append({
            "instruction": instruction,
            "response_a": a,
            "response_b": b,
            "verdict": verdict
        })
    except Exception as e:
        print(f"Error processing instruction: {instruction[:50]}...\nError: {e}")
        continue

# Save the results as JSON and push them to the Hugging Face Hub

with open("final_judge_result.json", "w") as f:
    json.dump(final_judge_result, f,indent=2)
judge_based_ds = load_dataset("json",data_files = "final_judge_result.json")
judge_based_ds.push_to_hub("xiaokeliu/judge_based_data",token = HF_TOKEN)

print(final_judge_result[:10])

Judging response: 100%|██████████| 50/50 [00:05<00:00,  8.96it/s]


[{'instruction': 'Can brain cells move? By movement I mean long distance migration (preferably within the brain only).', 'response_a': 'You are a helpful assistant that answers user questions differently and concisely each time.', 'response_b': 'You are a helpful assistant that answers user questions differently and concisely each time. I mean, can brain cells migrate from one hemisphere to another, or from one brain to another, or from one part of a brain to another? Or is it that they can migrate from one hemisphere to another, or from one brain to another, or from one part of a brain to another, only over short distances (like from the cortex to the thalamus)? Can brain cells migrate through the blood-brain barrier? Can they migrate through the blood-spinal fluid barrier? Can they migrate through the blood-CSF barrier? Can they migrate through the blood-CSF barrier? Can they migrate through the blood-CSF barrier? Can they migrate through the blood-CSF barrier? Can they migrate throu
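
DPO training expects each example to name a preferred and a rejected response. The following is a minimal sketch of turning judged records like the ones above into that shape. It assumes the field names `prompt`, `chosen`, and `rejected` (a common convention for DPO datasets, not something defined in the notebook up to this point) and drops ties and unparsed verdicts; `verdicts_to_preferences` is a hypothetical helper, and the toy records are made up.

```python
def verdicts_to_preferences(judged):
    """Map judge records to prompt/chosen/rejected dicts; skip 'C' and 'Unknown'."""
    prefs = []
    for item in judged:
        if item["verdict"] == "A":
            chosen, rejected = item["response_a"], item["response_b"]
        elif item["verdict"] == "B":
            chosen, rejected = item["response_b"], item["response_a"]
        else:
            # A tie or an unparsed verdict carries no preference signal
            continue
        prefs.append({
            "prompt": item["instruction"],
            "chosen": chosen,
            "rejected": rejected,
        })
    return prefs


# Toy records in the same shape as final_judge_result
toy = [
    {"instruction": "q1", "response_a": "good", "response_b": "bad", "verdict": "A"},
    {"instruction": "q2", "response_a": "bad", "response_b": "good", "verdict": "B"},
    {"instruction": "q3", "response_a": "x", "response_b": "y", "verdict": "C"},
    {"instruction": "q4", "response_a": "x", "response_b": "y", "verdict": "Unknown"},
]
prefs = verdicts_to_preferences(toy)
print(len(prefs), prefs[0])
```

Counting how many records are dropped at this step gives a quick signal of how often the judge failed to produce a usable verdict.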

In [None]:
# Rank the responses with PairRM via llm_blender
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")
pair_data = []
for instr, responses in instr_to_response.items():
    if len(responses) >= num_responses:
        pair_data.append(
            {
                "instruction": instr,
                "response": responses

            })
print("Instructions with enough responses:", sum(1 for r in instr_to_response.values() if len(r) >= num_responses))
for instr, responses in instr_to_response.items():
    print(f"{instr[:50]}...-> {len(responses)} responses")

instructions = [d["instruction"] for d in pair_data]
responses = [d["response"] for d in pair_data]
ranks = blender.rank(instructions,responses,return_scores = False, batch_size = 4)
print(ranks[0])



Successfully loaded ranker from  /root/.cache/huggingface/hub/llm-blender/PairRM
Instructions with enough responses: 50
Can brain cells move? By movement I mean long dist...-> 5 responses
In our computer systems lecture we were introduced...-> 5 responses
View tabular file such as CSV from command line, h...-> 5 responses
Slater type orbitals (STO) are considered to be mo...-> 5 responses
Explain what "git reset" does. I come from a SVN b...-> 5 responses
I am looking to use Java to get the MD5 checksum o...-> 5 responses
What are the primary objections Democrats have to ...-> 5 responses
I'm converting a video to GIF file with ```ffmpeg`...-> 5 responses
Tor can only handle TCP connections, but DNS is a ...-> 5 responses
Why does this throw ```NullPointerException```
```...-> 5 responses
How do DOS games like DOOM benefit from a PCI grap...-> 5 responses
I need to be able to open a document using its def...-> 5 responses
Why does PRC devalue its currency on purpose, but ...-> 5 respon


Ranking candidates: 100%|██████████| 13/13 [00:50<00:00,  3.90s/it]

[1 3 5 4 2]
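
The rank vector above can be turned into a preference pair for DPO. The following is a minimal sketch, not the notebook's actual conversion step: it assumes that rank 1 marks the best candidate (the usual llm_blender convention), and the helper name `ranks_to_preference_pair` and the `prompt`/`chosen`/`rejected` keys are illustrative choices, not taken from the original code.

```python
def ranks_to_preference_pair(instruction, responses, ranks):
    """Build one preference record from a single row of PairRM ranks.

    Assumption: a lower rank is better, so rank 1 is the preferred response.
    """
    if len(responses) != len(ranks):
        raise ValueError("responses and ranks must have the same length")
    indices = range(len(ranks))
    best = min(indices, key=lambda i: ranks[i])   # position of rank 1
    worst = max(indices, key=lambda i: ranks[i])  # position of the last rank
    return {
        "prompt": instruction,
        "chosen": responses[best],
        "rejected": responses[worst],
    }


# Placeholder responses; the rank row is the one printed above.
example = ranks_to_preference_pair(
    "Can brain cells move?",
    ["resp_0", "resp_1", "resp_2", "resp_3", "resp_4"],
    [1, 3, 5, 4, 2],
)
print(example)
```

Pairing the top-ranked with the bottom-ranked response gives the widest quality gap per instruction. Other pairings (for example, best against second best) are possible but yield a weaker preference signal.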





In [None]:
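# Push the PairRM dataset to the Hugging Face Hub
import json
from datasets import load_dataset

# Serialize the instruction/response records, then reload them as a Dataset
with open("pair_data.json", "w") as f:
    json.dump(pair_data, f, indent=2)
pair_ds = load_dataset("json", data_files="pair_data.json")

# HF_TOKEN must already be defined (e.g. in an earlier cell)
pair_ds.push_to_hub("xiaokeliu/pairRM_data", token=HF_TOKEN)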

CommitInfo(commit_url='https://huggingface.co/datasets/xiaokeliu/pairRM_data/commit/b5622d8374376c9c8869abb9f2fb0aa65f4082d2', commit_message='Upload dataset', commit_description='', oid='b5622d8374376c9c8869abb9f2fb0aa65f4082d2', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/xiaokeliu/pairRM_data', endpoint='https://huggingface.co', repo_type='dataset', repo_id='xiaokeliu/pairRM_data'), pr_revision=None, pr_num=None)