### MHP Applied science group
# RLHF Hackathon: Reward Dataset

<div style="text-align: center;">
    <img src="../images/dataset.png" alt="Supervised Fine-tuning steps" style="display: block; margin-left: auto; margin-right: auto;width:600px">
    <p style="text-align:center">Reward Dataset, which one is better? </p>
</div>


A reward dataset is the core of RLHF: it supplies the human feedback from which the model learns the desired behavior. Each example pairs model outputs with human judgments, such as numerical scores or rankings, that indicate output quality. Most often it contains pairs of outputs for the same input, labeled to show which one is better. The data is typically stored as JSON and includes the input, the model outputs, and their respective rewards or preference labels. The reward model is trained on these explicit human preferences and then scores new outputs during reinforcement learning, which steers the model toward high-quality, relevant responses that people prefer.
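To make the format concrete, here is a minimal sketch of a single pairwise preference record and how it serializes to JSON. The field names `prompt`, `chosen`, and `rejected` follow a common convention; the text values are invented placeholders, not outputs of the model used in this notebook.

```python
import json

# One pairwise preference record. The texts are made-up placeholders,
# used only to illustrate the shape of the data.
example_record = {
    "prompt": "What is RLHF?",
    "chosen": "RLHF fine-tunes a model using human preference feedback.",
    "rejected": "RLHF is a type of database.",
}

# A reward dataset is simply a list of such records.
example_dataset = [example_record]

# Serialize to a JSON string (the same structure can be written to a file).
serialized = json.dumps(example_dataset, indent=2)
print(serialized)
```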



In [1]:
from unsloth import FastLanguageModel
from utils import extract_answer, show_responses

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [2]:
max_seq_length = 2048 
dtype = None 
load_in_4bit = True

### Load the fine-tuned model
Now we need to load the previously fine-tuned model.

In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "phi_3_sft_model", # TASK 4: load the complete model with lora weights!
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) 


==((====))==  Unsloth: Fast Mistral patching release 2024.6
   \\   /|    GPU: NVIDIA A10G. Max memory: 22.191 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Again, we need to prepare the prompts by adding the chat tokens to them.

In [4]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3", 
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, 
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
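The `mapping` argument tells the chat template that the conversations use ShareGPT-style keys (`from`/`value`) and role names (`human`/`gpt`) rather than `role`/`content` and `user`/`assistant`. As a rough illustration of what that renaming amounts to, here is a plain-Python sketch. It is not Unsloth's implementation; `to_role_content`, `KEY_MAP`, and `ROLE_MAP` are hypothetical names introduced only for this example.

```python
# Hypothetical lookup tables: ShareGPT-style names -> role/content-style names.
KEY_MAP = {"from": "role", "value": "content"}
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def to_role_content(convo):
    """Rename the keys and role names of every turn in a conversation."""
    converted = []
    for turn in convo:
        # Rename the dictionary keys, leaving unknown keys untouched.
        renamed = {KEY_MAP.get(key, key): value for key, value in turn.items()}
        # Rename the speaker, leaving unknown roles untouched.
        renamed["role"] = ROLE_MAP.get(renamed["role"], renamed["role"])
        converted.append(renamed)
    return converted

sharegpt_convo = [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hi, how can I help?"},
]
converted_convo = to_role_content(sharegpt_convo)
print(converted_convo)
```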

### Generate dataset

Now we will iterate over the questions and generate two different responses for each one. You will then choose which response is better. Feel free to add your own question!

In [39]:
questions = [
    "How do you think AI will impact cybersecurity in the next decade?",
    "What do you think about the integration of quantum computing into everyday technology?",
    "Which emerging technology do you believe will revolutionize the healthcare industry?",
    "How can I fly to Mars?",
    # ADD your question HERE
]

In [26]:
# Task: Set a proper temperature
TEMPERATURE = 0.5

In [27]:
reward_answers = []
for question in questions:
    messages = [
        {"from": "human", "value": question + " Give only a short answer."},
    ]
    response_dict = {}
    for i in range(2):
        inputs = tokenizer.apply_chat_template(
            messages,
            tokenize = True,
            add_generation_prompt = True,
            return_tensors = "pt",
        ).to("cuda")

        # Sample a response; do_sample=True makes the temperature take effect
        outputs = model.generate(
            input_ids = inputs,
            max_new_tokens = 500,
            use_cache = True,
            temperature = TEMPERATURE,
            do_sample = True,
        )
        answer = tokenizer.batch_decode(outputs)
        gpt_answer = extract_answer(text=answer[0])
        response_dict[f"response_{i}"] = gpt_answer
        
    reward_answers.append(response_dict)

### Human feedback

In [28]:
chosen = []
for responses in reward_answers:
  need_answer = True
  while need_answer:
    show_responses(responses["response_0"], responses["response_1"])
    print("which answer is better? 1 or 2?")
    answer = input()
    if answer in ["1", "2"]:
      chosen.append(answer)
      need_answer = False
    else:
      print("Please enter '1' or '2' as answer!")

which answer is better? 1 or 2?


 1


which answer is better? 1 or 2?


 2


which answer is better? 1 or 2?


 1


which answer is better? 1 or 2?


 2


<div style="border: 2px solid red; padding: 10px; border-radius: 5px; background-color: #ffe6e6;">
    <strong>Wait!</strong> If you forget to update the `TEMPERATURE`, you will get almost the same answer. 🤓
</div>


The temperature parameter of a large language model (LLM) controls the randomness of the generated outputs. It rescales the model's prediction distribution: lower values (closer to 0) make the model more deterministic, concentrating on high-probability tokens and producing more repetitive, conservative text. Higher values increase randomness, letting the model explore a wider range of possibilities and generate more creative or diverse outputs. Adjusting the temperature balances predictability against creativity, depending on the desired outcome. In our case, `0.5` is a reasonable choice, but feel free to experiment with other values.
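The effect is easy to see on a toy example. The sketch below applies the standard temperature-scaled softmax (divide the logits by the temperature, then normalize) to three made-up logits. It illustrates the idea only and is not the code path used inside `model.generate`; the function name and the logit values are assumptions for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

# Print the distribution for a few temperatures to compare how peaked it is.
for t in (0.1, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperatures almost all probability mass sits on the top-scoring token, so two sampled responses will usually be near-identical; at higher temperatures the mass spreads out and the two responses are more likely to differ.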

### Save the dataset

A reward dataset usually has a `prompt`, a `chosen`, and a `rejected` column. Let's save our feedback as a list of records in a JSON file.

In [46]:
# Task: Use the proper key for the dictionary
reward_dataset = []
for i, responses in enumerate(reward_answers):
    json_record = {
        "prompt": questions[i],  # Update key
    }
    # `chosen` holds the strings "1" or "2" returned by input(), so compare with a string
    if chosen[i] == "2":
        json_record["chosen"] = responses["response_1"]  # Update key
        json_record["rejected"] = responses["response_0"]  # Update key
    else:
        json_record["chosen"] = responses["response_0"]  # Update key
        json_record["rejected"] = responses["response_1"]  # Update key
    reward_dataset.append(json_record)
        


assert len(reward_dataset) > 0, "Your dataset is empty"

import json

# A context manager ensures the file is flushed and closed after writing
with open("reward_dataset.json", "w") as f:
    json.dump(reward_dataset, f)
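Before using the saved file for reward-model training, it can help to run a quick sanity check on the records. The following is a minimal sketch, assuming the three-column layout described above; `validate_reward_records` and `REQUIRED_KEYS` are hypothetical names, and the sample records are invented for illustration.

```python
import json

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_reward_records(records):
    """Return indices of records that lack a required key or whose
    chosen and rejected responses are identical."""
    bad = []
    for i, record in enumerate(records):
        if not REQUIRED_KEYS.issubset(record):
            bad.append(i)
        elif record["chosen"] == record["rejected"]:
            bad.append(i)
    return bad

# To check the file written above, uncomment once it exists:
# with open("reward_dataset.json") as f:
#     print(validate_reward_records(json.load(f)))

sample = [
    {"prompt": "q1", "chosen": "a", "rejected": "b"},
    {"prompt": "q2", "chosen": "a", "rejeted": "b"},  # misspelled key
    {"prompt": "q3", "chosen": "same", "rejected": "same"},  # no preference signal
]
bad_indices = validate_reward_records(sample)
print(bad_indices)
```

Records with identical responses carry no preference signal, which is what tends to happen when the sampling temperature is too low.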