# Reward Model Training

## Installing and importing necessary packages

In [11]:
!pip install pandas
!pip install trl
!pip install plotly

In [20]:
import random
import pandas as pd
from operator import itemgetter
import torch
import warnings
warnings.filterwarnings('ignore')
from datasets import Dataset, load_dataset
from transformers import AutoModelForSequenceClassification,AutoTokenizer,TrainingArguments
from trl import RewardTrainer

## Comparison Dataset

In this section, we convert the dataset from (question, answer, feedback) tuples into a comparison dataset of (question, chosen answer, rejected answer) tuples, which is the form that reward model training requires.

The *feedback.csv* file contains multiple answers for a given question, each rated by a human. These ratings can reflect any human value that we want the model output to exhibit.

For example, if we want the model's answers to be *helpful*, we can instruct annotators to give a higher reward (feedback score) to a helpful answer than to the other answers for the same question.

In [13]:
df = pd.read_csv('feedback.csv')
df.head()

| Unnamed: 0 | question | answer | feedback |
|---|---|---|---|
| 0 | Can Maximo Visual Inspection run on prem?​​​ | Answer: Yes, Maximo Visual Inspection can be ... | 2.0 |
| 1 | What is watson knowledge catalog? | Answer: The IBM Watson Knowledge Catalog is a... | 4.0 |
| 2 | What is watson knowledge catalog? | Answer: The IBM Watson Knowledge Catalog is a... | 3.0 |
| 3 | Can Instana use OpenTelemetry trace data?​​​​​ | Answer: Yes, Instana can ingest OpenTelemetry... | 4.0 |
| 4 | format it on a table | with the following columns. The first column i... | 1.0 |


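Once we have received feedback from the users, we can convert this dataset into a comparison dataset for reward model training. For each question, the cell below takes the highest-rated answer as the chosen answer and the lowest-rated answer as the rejected answer, and keeps the pair only if the chosen score is at least 4 and the rejected score is below 4.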

In [14]:
df['tup'] = list(zip(df['answer'], df['feedback']))
df_g = df.groupby('question')['tup'].apply(list).reset_index()
df_g["sorted_tup"] = df_g["tup"].apply(lambda x :sorted(x,key=itemgetter(1)) )
df_g["chosen"] = df_g["sorted_tup"].apply(lambda x: x[-1][0])
df_g["chosen_score"] = df_g["sorted_tup"].apply(lambda x: x[-1][1])
df_g["rejected"] = df_g["sorted_tup"].apply(lambda x: x[0][0])
df_g["rejected_score"] = df_g["sorted_tup"].apply(lambda x: x[0][1])
df_g = df_g.dropna()
df_g = df_g[(df_g['chosen_score']>=4.0) & (df_g['rejected_score']<4.0)]
df_g.to_csv("feedback_comparison_dataset.csv")
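The selection rule in the cell above can be traced on a small example. The sketch below restates it in plain Python under stated assumptions: the function name `build_comparisons`, the `threshold` parameter, and the toy records are illustrative and not part of the notebook, and the missing-value handling (`dropna`) is omitted.

```python
from collections import defaultdict


def build_comparisons(records, threshold=4.0):
    """Turn (question, answer, feedback) records into chosen/rejected pairs.

    Hypothetical helper that mirrors the rule used in the notebook:
    highest-rated answer is chosen, lowest-rated answer is rejected.
    """
    grouped = defaultdict(list)
    for question, answer, feedback in records:
        grouped[question].append((answer, feedback))

    comparisons = []
    for question, rated in grouped.items():
        # Sort answers by score, lowest first.
        rated = sorted(rated, key=lambda pair: pair[1])
        rejected, rejected_score = rated[0]
        chosen, chosen_score = rated[-1]
        # Keep the pair only if it spans the threshold.
        if chosen_score >= threshold and rejected_score < threshold:
            comparisons.append({
                "instruction": question,
                "chosen_response": chosen,
                "rejected_response": rejected,
            })
    return comparisons


# Toy records, invented for illustration only.
toy_records = [
    ("q1", "good answer", 4.0),
    ("q1", "weak answer", 2.0),
    ("q2", "only answer", 5.0),      # single answer: chosen == rejected, dropped
    ("q3", "poor answer", 1.0),
    ("q3", "mediocre answer", 3.0),  # best score below threshold, dropped
]
pairs = build_comparisons(toy_records)
print(pairs)
```

Only `q1` survives here: a question with a single rated answer, or one whose best answer scores below the threshold, produces no comparison pair.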

In [15]:
rows = []
for record in df_g.itertuples(index=True, name='Pandas'):
    if record is None or len(record) == 0:
        continue
    rows.append({
        "instruction": record.question,
        "chosen_response": record.chosen,
        "rejected_response": record.rejected
    })

prepared_dataset = Dataset.from_list(rows)
prepared_dataset.to_pandas()

| Unnamed: 0 | instruction | chosen_response | rejected_response |
|---|---|---|---|
| 0 | Can Instana use OpenTelemetry trace data?​ | Yes, Instana can use OpenTelemetry trace data.... | Answer: Yes, Instana can use OpenTelemetry tr... |
| 1 | Can Instana use OpenTelemetry trace data?​​​​​ | Yes, Instana can ingest OpenTelemetry trace da... | Answer: Yes, Instana can use OpenTelemetry tr... |
| 2 | Can Maximo Visual Inspection run on prem?​​​ | Answer: Yes. Maximo Visual Inspection is a cl... | Answer: Yes, Maximo Visual Inspection can be ... |
| 3 | Explain me step by step how can I integrate da... | ive tried to follow the documentation but I do... | ive tried to do it but it is not working. Answ... |
| 4 | What is watson knowledge catalog? | The IBM Watson Knowledge Catalog is a data cat... | Answer: The IBM Watson Knowledge Catalog is a... |
| 5 | what is cloud pak for watson aiops | ? Answer: Cloud Pak for Watson AIOps is an AI-... | ? Answer: Cloud Pak for Watson AIOps is a plat... |
| 6 | what is ibm? | Answer: IBM is an American multinational techn... | Answer: IBM is an American multinational techn... |
| 7 | which cloudpak provides business process autom... | ? Answer: IBM Cloud Pak for Automation | ? : What is the difference between IBM Cloud P... |
| 8 | who is the CEO of IBM? | Answer: Arvind Krishna Answer: Arvind Krishna | . |
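The output above lists the Instana question twice. The two rows appear to differ only in trailing invisible characters (possibly zero-width spaces, U+200B), so `groupby('question')` treats them as separate questions. A minimal sketch of normalizing the question text before grouping follows; the helper name `normalize_question` and the set of characters it strips are assumptions, not part of the notebook.

```python
import re

# Zero-width characters that commonly survive copy-paste (assumed set).
_ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")


def normalize_question(text):
    """Remove zero-width characters and surrounding whitespace."""
    return _ZERO_WIDTH.sub("", text).strip()


# Two variants of the same question with different invisible suffixes.
a = "Can Instana use OpenTelemetry trace data?\u200b"
b = "Can Instana use OpenTelemetry trace data?\u200b\u200b\u200b\u200b\u200b"
print(normalize_question(a) == normalize_question(b))
```

In the notebook this could be applied with `df['question'] = df['question'].map(normalize_question)` before the `groupby`, so that answers to the same question end up in one group.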


## Train the reward model with TRL

In [21]:
# Select the base model to fine-tune as the reward model.
model_name = "distilroberta-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

def formatting_func(examples):
    kwargs = {"padding": "max_length", "truncation": True, "max_length": 512, "return_tensors": "pt"}

    prompt_plus_chosen_response = examples["instruction"] + "\n" + examples["chosen_response"]
    prompt_plus_rejected_response = examples["instruction"] + "\n" + examples["rejected_response"]
    tokens_chosen = tokenizer.encode_plus(prompt_plus_chosen_response, **kwargs)
    tokens_rejected = tokenizer.encode_plus(prompt_plus_rejected_response, **kwargs)

    return {
        "input_ids_chosen": tokens_chosen["input_ids"][0], "attention_mask_chosen": tokens_chosen["attention_mask"][0],
        "input_ids_rejected": tokens_rejected["input_ids"][0], "attention_mask_rejected": tokens_rejected["attention_mask"][0]
    }

formatted_dataset = prepared_dataset.map(formatting_func)
formatted_dataset = formatted_dataset.train_test_split()

training_args = TrainingArguments(
    output_dir="./reward_model",
    per_device_train_batch_size=16,
    evaluation_strategy="steps",
    logging_steps=1,
    num_train_epochs = 10,
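    report_to="none",  # the string "none" disables logging integrations; in transformers 4.x, None enables all installed ones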

)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],

)

trainer.train()


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 

Map:   0%|          | 0/9 [00:00<?, ? examples/s]

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,Validation Loss,Accuracy
1,0.6446,0.693641,0.666667
2,0.6558,0.694708,0.666667
3,0.657,0.696189,0.666667
4,0.6205,0.696142,0.666667
5,0.6755,0.696126,0.666667
6,0.7003,0.69602,0.666667
7,0.6407,0.695554,0.666667
8,0.5847,0.694999,0.666667
9,0.6503,0.694692,0.666667
10,0.6632,0.694611,0.666667


TrainOutput(global_step=10, training_loss=0.6492631733417511, metrics={'train_runtime': 3.5165, 'train_samples_per_second': 17.063, 'train_steps_per_second': 2.844, 'total_flos': 0.0, 'train_loss': 0.6492631733417511, 'epoch': 10.0})
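The validation loss in the table above stays close to 0.693 and the accuracy never moves from 0.666667. Assuming the pairwise logistic loss that TRL's `RewardTrainer` is understood to use, `-log(sigmoid(r_chosen - r_rejected))`, a loss of ln 2 ≈ 0.693 is what a model produces when it assigns the same reward to both answers. The sketch below works through that arithmetic; the function name `pairwise_reward_loss` is illustrative.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Uses the numerically stable softplus form:
    softplus(-m) = max(-m, 0) + log(1 + exp(-|m|))
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Equal rewards: the loss equals ln 2.
print(pairwise_reward_loss(0.0, 0.0))
# Chosen answer scored higher: the loss falls below ln 2.
print(pairwise_reward_loss(2.0, 0.0))
# Chosen answer scored lower: the loss rises above ln 2.
print(pairwise_reward_loss(0.0, 2.0))
```

Read this way, the run above suggests the model has barely started to separate chosen from rejected answers, which is unsurprising with only nine comparison pairs.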

In [15]:
trainer.save_model() 