<a href="https://colab.research.google.com/github/louiezzang/next-gpt/blob/main/examples/chatgpt_replica_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ChatGPT replica


What is RLHF? <br>
See [this link](https://gist.github.com/JoaoLages/c6f2dfd13d2484aa8bb0b2d567fbf093).

<br>

**Example of RLHF dataset**:

Three datasets are needed in total, one for each of the three training steps (SFT, RM, and PPO):
- [Example of dataset](https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/chatllama#dataset-preparation)
- [Example of dataset 1](https://huggingface.co/datasets/stanfordnlp/SHP)
- [Example of dataset 2](https://huggingface.co/datasets/Anthropic/hh-rlhf)

Step 1) Dataset for SFT (Supervised Fine-Tuning)
```json
[
    {
        "prompt": "",
        "completion": ""        
    }, ...
]
```

Step 2) Dataset for RM (Reward Model) training: each prompt has multiple completions, ranked by human raters.
```json
[
    {
        "prompt": "",
        "completion_1": "",
        "completion_2": "",
        "completion_3": "",            
        "ranking": [1, 0, 2]
    }, ...
]
```
    
Step 3) Dataset for PPO (RLHF) training: it consists of prompts only.
```json
[
    {
        "prompt": ""
    }, ...
]
```
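As a rough illustration of the three formats above, the sketch below builds one toy record per step and checks that the expected keys are present. The record contents and the `REQUIRED_KEYS` and `validate` names are hypothetical; only the field names come from the JSON examples above.

```python
import json

# Field names follow the JSON examples above; the record contents are made up.
REQUIRED_KEYS = {
    "sft": {"prompt", "completion"},
    "rm": {"prompt", "completion_1", "completion_2", "completion_3", "ranking"},
    "ppo": {"prompt"},
}

def validate(step, records):
    """Return True if every record contains the keys expected for `step`."""
    return all(REQUIRED_KEYS[step].issubset(record) for record in records)

sft_data = [{"prompt": "What is RLHF?", "completion": "Reinforcement learning from human feedback."}]
rm_data = [{
    "prompt": "What is RLHF?",
    "completion_1": "answer A",
    "completion_2": "answer B",
    "completion_3": "answer C",
    "ranking": [1, 0, 2],
}]
ppo_data = [{"prompt": "What is RLHF?"}]

for step, data in [("sft", sft_data), ("rm", rm_data), ("ppo", ppo_data)]:
    # Round-trip through JSON to mimic reading the dataset from a file.
    loaded = json.loads(json.dumps(data))
    print(step, validate(step, loaded))
```

A check like this is cheap to run before each training step and catches a dataset passed to the wrong stage.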

# Colab environment setup

#### Installation (python>=3.8)

In [1]:
# Install next-gpt lib.
!rm -rf ./next-gpt/
!git clone https://github.com/louiezzang/next-gpt.git
%cd next-gpt/
!pip install .
%cd ../

Cloning into 'next-gpt'...
remote: Enumerating objects: 568, done.
remote: Counting objects: 100% (220/220), done.
remote: Compressing objects: 100% (134/134), done.
remote: Total 568 (delta 106), reused 146 (delta 68), pack-reused 348
Receiving objects: 100% (568/568), 197.32 KiB | 7.89 MiB/s, done.
Resolving deltas: 100% (274/274), done.
/content/next-gpt
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/next-gpt
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: next-gpt
  Building wheel for next-gpt (setup.py) ... done
  Created wheel for next-gpt: filename=next_gpt-1.0.0-py3-none-any.whl size=58894 sha256=e4051b55711873734d036593e72db357ae760bc787fed2010b17089626302823
  Stored in directory: /root/.cache/pip/wheels/f4/0d/a2/307ec9a6214260bfd63facee827d7bbef5c1ba9618277ea1b5
Successfully built next-gpt
Installing collected packages: next-gpt
  Attem

# Step 1) SFT: Supervised Fine-tuning
Build a supervised fine-tuned model that answers questions well.

- References
  - [fine tuning code_1](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)
  - [fine tuning code_2](https://github.com/Beomi/KoAlpaca/blob/main/train.py)


- SFT (Supervised Fine-Tuning): fine-tune a pretrained LLM on a specific domain or on a corpus of instructions and human demonstrations

- Dataset example
```json
[
    {
        "prompt": "",
        "completion": ""        
    }, ...
]
```

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import json
import yaml
import argparse

import numpy as np
import pandas as pd

import torch
from datasets import load_dataset
import transformers

from transformers import (
    AutoTokenizer, AutoConfig, AutoModelForCausalLM, pipeline, 
    TrainingArguments, AutoModelWithLMHead,
    ProgressCallback
)
from nextgpt.finetuning import (
    SupervisedDataset, DataCollatorForSupervisedDataset,
    SupervisedTrainer, LoggingCallback
)

In [3]:
# Define arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="gpt2", choices=["gpt2", "bloom", "opt"])
parser.add_argument("--max_epochs", type=int, default=1)
parser.add_argument("--train_batch_size", type=int, default=4)
parser.add_argument("--output_dir", type=str, default="./output_1_sft")

args = parser.parse_args(args=[])
print(args)

Namespace(model='gpt2', max_epochs=1, train_batch_size=4, output_dir='./output_1_sft')


In [4]:
# Get the tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained(args.model, 
                                          bos_token="<|startoftext|>",
                                          eos_token="<|endoftext|>", 
                                          pad_token="<|pad|>")
print(tokenizer)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|pad|>'})


In [5]:
dataset_webgpt_comp = load_dataset("openai/webgpt_comparisons", split="train[:20%]")



In [6]:
data_list = []
for row in dataset_webgpt_comp:
  question = row["question"]["full_text"]
  answer_0 = row["answer_0"]
  data_list.append({
      "prompt": question,
      "completion": answer_0
  })

In [7]:
PROMPT_TEMPLATE = (
  "Below is an instruction that describes a task, paired with an input that provides further context.\n\n"
  "Write a response that appropriately completes the request.\n\n"
  "### Instruction:\n{prompt}\n\n### Response:"
)
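To make the template concrete, here is a minimal sketch of how one record might be turned into a training string. The record is made up, and the way the completion and EOS token are joined to the prompt is an assumption based on the example printed by `SupervisedDataset` in this notebook, not on its source code.

```python
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides further context.\n\n"
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{prompt}\n\n### Response:"
)
EOS_TOKEN = "<|endoftext|>"  # end-of-text token configured on the tokenizer

# Made-up record in the SFT dataset format.
record = {"prompt": "What is the capital of France?", "completion": "Paris."}

# format_map fills {prompt} and ignores extra keys such as "completion".
source = PROMPT_TEMPLATE.format_map(record)

# Assumption: prompt, newline, completion, then the EOS token.
full_text = f"{source}\n{record['completion']}{EOS_TOKEN}"
print(full_text)
```

Appending the EOS token is what lets the fine-tuned model learn where a response should stop.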

In [8]:
dataset = SupervisedDataset(
    data=data_list,
    tokenizer=tokenizer, 
    prompt_template=PROMPT_TEMPLATE,
    completion_field="completion",
    verbose=True)

# Split train and val dataset.
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, eval_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

# Data collator.
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)

100%|██████████| 3916/3916 [00:00<00:00, 428356.31it/s]


Below is an instruction that describes a task, paired with an input that provides further context.

Write a response that appropriately completes the request.

### Instruction:
Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?

### Response:
The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]<|endoftext|>
Tokenizing inputs... This may take some time...
Loading data done!!: 3916
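The 80/20 split in the cell above depends on the global random state, so it changes from run to run. The sketch below shows the same idea with a fixed seed, using only the standard library as a stand-in for `torch.utils.data.random_split`; the `split_indices` helper is hypothetical. `random_split` itself accepts a `generator` argument that serves the same purpose.

```python
import random

def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle indices 0..n_items-1 with a fixed seed and split them in two."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    train_size = int(train_fraction * n_items)
    return indices[:train_size], indices[train_size:]

# 3916 is the number of examples reported in the log above.
train_idx, val_idx = split_indices(3916)
print(len(train_idx), len(val_idx))
```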


In [9]:
# Load the pretrained model.
model = AutoModelForCausalLM.from_pretrained(args.model)
model.resize_token_embeddings(len(tokenizer))

Embedding(50259, 768)
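The embedding size of 50259 follows from the tokenizer setup: the GPT-2 vocabulary already contains `<|endoftext|>`, so only two of the three special tokens are new. A small sketch of that arithmetic (the variable names are illustrative):

```python
GPT2_VOCAB_SIZE = 50257              # vocab_size reported by the tokenizer output
existing_tokens = {"<|endoftext|>"}  # already part of the GPT-2 vocabulary

requested = {
    "bos_token": "<|startoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "<|pad|>",
}

# Only tokens that are not yet in the vocabulary get a new embedding row.
new_tokens = sorted(set(requested.values()) - existing_tokens)
new_vocab_size = GPT2_VOCAB_SIZE + len(new_tokens)
print(new_tokens, new_vocab_size)
```

This is why `resize_token_embeddings(len(tokenizer))` is needed before training: without it, the ids of the added tokens would index past the end of the embedding matrix.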

In [10]:
# Train arguments.
training_args = TrainingArguments(
    output_dir="./train_1_sft", # the output directory
    overwrite_output_dir=True, # overwrite the content of the output directory
    num_train_epochs=args.max_epochs, # number of training epochs
    per_device_train_batch_size=args.train_batch_size, # batch size for training
    per_device_eval_batch_size=4, # batch size for evaluation
    eval_steps=3, # number of update steps between two evaluations (only used when evaluation is enabled)
    save_steps=100, # save a checkpoint every 100 update steps
    warmup_steps=5, # number of warmup steps for learning rate scheduler
    prediction_loss_only=True,
)

# Train the model.
trainer = SupervisedTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=None,
    # callbacks=[ProgressCallback, LoggingCallback(logger=None)],
)

trainer.train()
trainer.save_state()
trainer.safe_save_model(output_dir=args.output_dir)



Step,Training Loss
500,3.8913


In [11]:
# Inference test.
generator = pipeline("text-generation", model=args.output_dir, tokenizer=tokenizer)

generation_args = dict(
    num_beams=4,
    repetition_penalty=2.0,
    no_repeat_ngram_size=4,
    # bos_token="<|startoftext|>",
    # eos_token="<|endoftext|>", 
    # pad_token="<|pad|>",
    max_new_tokens=64,
    do_sample=True,
    top_k=30,
    top_p=0.95,
    temperature=1.9, 
    #max_length=300, 
    #num_return_sequences=20
    early_stopping=True,
)
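The sampling-related arguments (`temperature`, `top_k`, `top_p`) interact in a way that is easier to see on a toy distribution. The following is a simplified, standard-library sketch of temperature scaling followed by top-k and top-p filtering. It is not the `transformers` implementation, it ignores beam search and the repetition penalties, and the `filter_and_sample` helper and the logits are made up.

```python
import math
import random

def filter_and_sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Sample one token id after temperature scaling, top-k and top-p filtering."""
    rng = rng or random.Random()
    # Temperature > 1 flattens the distribution, < 1 sharpens it.
    scaled = [x / temperature for x in logits]
    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids from most to least likely, then keep the top k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
rng = random.Random(0)
token, kept = filter_and_sample(logits, temperature=1.0, top_k=3, top_p=0.8, rng=rng)
token_hot, kept_hot = filter_and_sample(logits, temperature=1.9, top_k=3, top_p=0.8, rng=rng)
print(kept, kept_hot)
```

Raising the temperature flattens the distribution, so more candidates survive the top-p cut. A value as high as 1.9 therefore tends to produce more varied, less focused text.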

In [12]:
test_list = data_list[-5:]

test_prompt_list = []
actual_completion_list = []
for row in test_list:
    text_input = row
    prompt = PROMPT_TEMPLATE.format_map(text_input)
    test_prompt_list.append(prompt)
    actual_completion_list.append(text_input["completion"])

result_list = generator(test_prompt_list, **generation_args)
for prompt, result, actual_response in zip(test_prompt_list, result_list, actual_completion_list):
    print("")
    print("-" * 70)
    print(f"completion: {result[0]['generated_text']}")
    print(f"\n### Actual answer:\n{actual_response}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



----------------------------------------------------------------------
completion: Below is an instruction that describes a task, paired with an input that provides further context.

Write a response that appropriately completes the request.

### Instruction:
I've noticed when scanning my files for malware/viruses, the "number of files scanned" that pops up is almost always greater than the number of files I selected to scan. What is actually being scanned and why is it considered different files?

### Response:In most cases there is no way to change the number of downloads on your computer without changing the software used to download the files. However, once you have installed the operating system, you can change this by installing a new operating system, such as Microsoft Windows 10, or installing a third party software. For example, if

### Actual answer:
Microsoft Defender Antivirus has multiple layers of protection to catch malware and viruses. These include quick scans, full s

# Step 2) RM: Reward Model
Train a reward model that scores completions, so that the better answer to a prompt receives the higher reward.
- Dataset example
```json
[
    {
        "prompt": "",
        "completion_1": "",
        "completion_2": "",
        "completion_3": "",            
        "ranking": [1, 0, 2]
    }, ...
]
```
- Dataset sources
  - [Dahoas/rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
  - [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
  - [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
  - [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)

- References
    - [train_reward_model.py](https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/examples/train_reward_model.py)
    - [train_prompts.py](https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/examples/train_prompts.py)

In [13]:
import os
import json
import argparse

import torch
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, BloomTokenizerFast
from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer
import loralib as lora

from nextgpt.rlhf.dataset import RewardDataset
from nextgpt.rlhf.models.base import RewardModel
from nextgpt.rlhf.models.bloom import BLOOMRM
from nextgpt.rlhf.models.gpt import GPTRM
from nextgpt.rlhf.models.opt import OPTRM
from nextgpt.rlhf.models import LogExpLoss, LogSigLoss
from nextgpt.rlhf.trainer import RewardModelTrainer
from nextgpt.rlhf.trainer.strategies import DDPStrategy, NaiveStrategy

In [14]:
# Define arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--output_dir", type=str, default="./output_2_rm")
parser.add_argument("--strategy",
                    choices=["naive", "ddp"],
                    default="naive")
parser.add_argument("--model", type=str, default="gpt2", choices=["gpt2", "bloom", "opt"])
parser.add_argument("--pretrain", type=str, default="gpt2")
parser.add_argument("--model_path", type=str, default=None)
parser.add_argument("--need_optim_ckpt", action="store_true")  # `type=bool` would treat any non-empty string as True
parser.add_argument("--max_epochs", type=int, default=10)
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
parser.add_argument("--loss_fn", type=str, default="log_sig", choices=["log_sig", "log_exp"])
parser.add_argument("--max_len", type=int, default=512)

args = parser.parse_args(args=[])

# For test.
args.max_epochs = 3
args.pretrain = "gpt2" # pretrained initial model.
args.max_len = 1024
args.verbose = True

print(args)
os.makedirs(args.output_dir, exist_ok=True)

Namespace(output_dir='./output_2_rm', strategy='naive', model='gpt2', pretrain='gpt2', model_path=None, need_optim_ckpt=False, max_epochs=3, batch_size=4, lora_rank=0, loss_fn='log_sig', max_len=1024, verbose=True)
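A note on boolean flags such as `--need_optim_ckpt`: with argparse, `type=bool` does not parse the string `"False"` as false, because `bool()` of any non-empty string is `True`. The short sketch below contrasts this with `action="store_true"`; the two parser names are illustrative.

```python
import argparse

# Pitfall: bool("False") is True, so type=bool cannot switch the flag off.
parser_bad = argparse.ArgumentParser()
parser_bad.add_argument("--need_optim_ckpt", type=bool, default=False)
bad_value = parser_bad.parse_args(["--need_optim_ckpt", "False"]).need_optim_ckpt
print(bad_value)  # True, even though "False" was passed

# store_true makes the flag False by default and True only when present.
parser_ok = argparse.ArgumentParser()
parser_ok.add_argument("--need_optim_ckpt", action="store_true")
print(parser_ok.parse_args([]).need_optim_ckpt)                     # False
print(parser_ok.parse_args(["--need_optim_ckpt"]).need_optim_ckpt)  # True
```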


In [15]:
# Configure strategy.
if args.strategy == "naive":
    strategy = NaiveStrategy()
elif args.strategy == "ddp":
    strategy = DDPStrategy()
else:
    raise ValueError(f"Unsupported strategy: {args.strategy}")

In [16]:
# Configure model, tokenizer.
with strategy.model_init_context():
    # Load pretrained gpt2.
    if args.model == "gpt2":
        # Get the tokenizer.
        tokenizer = transformers.AutoTokenizer.from_pretrained(
            args.model, 
            bos_token="<|startoftext|>",
            eos_token="<|endoftext|>", 
            pad_token="<|pad|>",
            padding_side="right", 
            model_max_length=args.max_len,
            )
        print(tokenizer)
        model = GPTRM(pretrained=args.pretrain, lora_rank=args.lora_rank).to(torch.cuda.current_device())
        model.resize_token_embeddings(len(tokenizer)) 
    elif args.model == "bloom":
        model = BLOOMRM(pretrained=args.pretrain, lora_rank=args.lora_rank).to(torch.cuda.current_device())
        tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
    elif args.model == "opt":
        model = OPTRM(pretrained=args.pretrain, lora_rank=args.lora_rank).to(torch.cuda.current_device())
        tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")      
    else:
        raise ValueError(f"Unsupported model: {args.model}")

    # Load the supervised fine-tuning (SFT) model state dict if `--model_path` is given.
    # By default it is None, so the reward model is trained from the initial pretrained language model rather than the SFT model.
    if args.model_path is not None:
        # device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        state_dict = torch.load(args.model_path)
        model.model.load_state_dict(state_dict)

    # Warning: casting to float16 (equivalent to `model.half()`) can make the loss NaN.
    # See:
    #   https://stackoverflow.com/questions/65332165/loss-is-nan-when-fine-tuning-huggingface-nli-model-both-roberta-bart
    #   https://github.com/huggingface/transformers/issues/9160
    model = model.to(torch.float16)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|pad|>'})


In [17]:
# Get the dataset.
dataset_webgpt_comp = load_dataset("openai/webgpt_comparisons", split="train[:20%]")



In [18]:
# Convert data into ranking format.
data_list_ranking = []
for row in dataset_webgpt_comp:
    question = row["question"]["full_text"]
    answer_0 = row["answer_0"]
    answer_1 = row["answer_1"]
    score_0 = row["score_0"]
    score_1 = row["score_1"]
    if answer_0 == "" or answer_1 == "" or (score_0 == score_1):
        continue

    ranking = [0 if score_0 > score_1 else 1, 0 if score_0 < score_1 else 1]
    data_list_ranking.append({
        "prompt": PROMPT_TEMPLATE.format_map({"prompt": question}),
        "completion_0": answer_0,
        "completion_1": answer_1,
        "ranking": ranking
    })

data_list_ranking[:2]

[{'prompt': 'Below is an instruction that describes a task, paired with an input that provides further context.\n\nWrite a response that appropriately completes the request.\n\n### Instruction:\nVoiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?\n\n### Response:',
  'completion_0': 'The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]',
  'completion_1': "Apu Nahasapeemapetilon is a recurring character in the American animated television series The Simpsons. He is an Indian immigrant proprietor who runs the Kwik-E-Mart, a popular convenience store in Springfield. [1] He was based on Peter Seller's character in the film The Party. [2]",
  'ranking': [0, 1]},
 {'prompt': 'Below is an instruction that describes a task, paired with an input that provides further context.\n\nWrite a response that appropriately completes the request.\n\n### Instruction:\nHetero

In [19]:
# Convert the ranking data into chosen/rejected pairs for the reward model dataset.
total_data_ranking2chosen = []
for tmp in data_list_ranking:
    one_data_ranking2chosen = []

    # data 1) 0 VS 1
    data = {}
    data["prompt"] = tmp["prompt"]
    if tmp["ranking"][0] < tmp["ranking"][1]:
        data["chosen"] = tmp["completion_0"]
        data["rejected"] = tmp["completion_1"]
    else:
        data["chosen"] = tmp["completion_1"]
        data["rejected"] = tmp["completion_0"]
    one_data_ranking2chosen.append(data)

    # # data 2) 0 VS 2
    # data = {}
    # data["prompt"] = tmp["prompt"]
    # if tmp["ranking"][0] < tmp["ranking"][2]:
    #     data["chosen"] = tmp["completion_0"]
    #     data["rejected"] = tmp["completion_2"]
    # else:
    #     data["chosen"] = tmp["completion_2"]
    #     data["rejected"] = tmp["completion_0"]
    # one_data_ranking2chosen.append(data)

    # # data 3) 1 VS 2
    # data = {}
    # data["prompt"] = tmp["prompt"]
    # if tmp["ranking"][1] < tmp["ranking"][2]:
    #     data["chosen"] = tmp["completion_1"]
    #     data["rejected"] = tmp["completion_2"]
    # else:
    #     data["chosen"] = tmp["completion_2"]
    #     data["rejected"] = tmp["completion_1"]
    # one_data_ranking2chosen.append(data)


    total_data_ranking2chosen.extend(one_data_ranking2chosen)


print("before data num: %d" % (len(data_list_ranking)))
print("after data num: %d" % (len(total_data_ranking2chosen)))
print("data example: \n%s" % total_data_ranking2chosen[1])

before data num: 2747
after data num: 2747
data example: 
{'prompt': 'Below is an instruction that describes a task, paired with an input that provides further context.\n\nWrite a response that appropriately completes the request.\n\n### Instruction:\nHeterophobia is the irrational fear of what\n\n### Response:', 'chosen': ' Heterophobia is the irrational fear of the opposite sex, coined as Sexophobia [1]. This phobia can be caused by genetics, heredity, negative experiences with the opposite sex, or a combination of these [1].  Symptoms may result from encountering people of the opposite sex, including breathlessness, dizziness, excessive sweating, nausea, dry mouth, feeling sick, shaking, coronary heart palpitations, and anxiety [1].', 'rejected': 'In modern times, there has been a rise in what is called heterophobia; the irrational fear of, discrimination against, or aversion to heterosexual people. [1][2] The word "heterophobia" is a play on the word "homophobia," which describes t
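The commented-out blocks in the cell above handle a third completion by hand. For records with any number of ranked completions, the same expansion can be written once with `itertools.combinations`. This is a sketch under the same convention as the cell above (a lower rank value means a better answer); the `ranking_to_pairs` helper and the toy inputs are hypothetical.

```python
from itertools import combinations

def ranking_to_pairs(prompt, completions, ranking):
    """Expand one ranked record into (chosen, rejected) pairs.

    ranking[i] is the rank of completions[i]; a lower rank is better.
    """
    pairs = []
    for i, j in combinations(range(len(completions)), 2):
        if ranking[i] == ranking[j]:
            continue  # ties carry no preference signal
        better, worse = (i, j) if ranking[i] < ranking[j] else (j, i)
        pairs.append({
            "prompt": prompt,
            "chosen": completions[better],
            "rejected": completions[worse],
        })
    return pairs

pairs = ranking_to_pairs("q", ["a0", "a1", "a2"], [1, 0, 2])
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

With N distinct ranks this yields N * (N - 1) / 2 pairs per prompt, so three completions produce three training pairs.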

In [20]:
# Prepare for data and dataset.
import random
random.seed(230319)

random.shuffle(total_data_ranking2chosen)
print(total_data_ranking2chosen[1])

# train_data = total_data_ranking2chosen[:-1000]
# eval_data = total_data_ranking2chosen[-1000:]
# Use only a very small subset of the data for quicker training.
train_data = total_data_ranking2chosen[:100]
val_data = total_data_ranking2chosen[100:130]
eval_data = total_data_ranking2chosen[130:160]

train_dataset = RewardDataset(train_data, tokenizer, args.max_len)
val_dataset = RewardDataset(val_data, tokenizer, args.max_len)
eval_dataset = RewardDataset(eval_data, tokenizer, args.max_len)

# Check
idx = 10
print("#" * 70)
print("## prompt ##")
print(train_data[idx]["prompt"])
print("#" * 70)
print("## chosen ##")
print(train_data[idx]["chosen"])
print("#" * 70)
print("## rejected ##")
print(train_data[idx]["rejected"])

{'prompt': "Below is an instruction that describes a task, paired with an input that provides further context.\n\nWrite a response that appropriately completes the request.\n\n### Instruction:\nHow does one juice a prune? If prunes are just dehydrated plums, shouldn't prune juice just be plum juice?\n\n### Response:", 'chosen': 'You can juice dried prunes by steaming or simmering them to rehydrate them, running them through a strainer to remove the pits, seeds and skin, and then adding more water to the resulting pruney paste. [1] You don’t have to do that, though, because you could also just juice a fresh prune. Contrary to popular belief, prunes aren’t simply dried plums, but a group of cultivars, or varieties, of plum that are well suited to drying. [2][3] ', 'rejected': 'While prunes are not simply dried plums, they are a type of dried plum. [2][3]  To juice a prune, you must first steam or simmer them to rehydrate them, and then run them through a strainer to remove the pits, seed

100%|██████████| 100/100 [00:00<00:00, 403.87it/s]
100%|██████████| 30/30 [00:00<00:00, 434.78it/s]
100%|██████████| 30/30 [00:00<00:00, 434.43it/s]

######################################################################
## prompt ##
Below is an instruction that describes a task, paired with an input that provides further context.

Write a response that appropriately completes the request.

### Instruction:
Why do major cell phone carriers allow companies like MetroPCS, Cricket Wireless, Boost Mobile, etc. to resell their network?

And for a cheaper price, too? I don't understand.

### Response:
######################################################################
## chosen ##
These companies known as mobile virtual network operators or MVNOs are able to resell network services in bulk from a regular carrier and then resell them to end-users, usually for cheaper prices than that carrier [2,3]. That’s still profitable for MVNOs because they don’t have to pay anything for the upkeep and modernization of the wireless network they’re using, therefore they can afford to lower the rates on voice calls, messages and data in order to attra




In [21]:
# Configure optimizer.
optim = Adam(model.parameters(), lr=5e-5)

# Configure loss function.
if args.loss_fn == "log_sig":
    loss_fn = LogSigLoss()
elif args.loss_fn == "log_exp":
    loss_fn = LogExpLoss()
else:
    raise ValueError(f"Unsupported loss function: {args.loss_fn}")
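The two loss options can be sketched on scalar rewards. This assumes `LogSigLoss` and `LogExpLoss` follow the usual pairwise ranking formulations; the nextgpt classes themselves are not shown in this notebook, so the functions below are illustrative stand-ins.

```python
import math

def log_sig_loss(chosen_reward, rejected_reward):
    """-log(sigmoid(r_chosen - r_rejected))"""
    diff = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

def log_exp_loss(chosen_reward, rejected_reward):
    """log(1 + exp(r_rejected - r_chosen))"""
    return math.log(1.0 + math.exp(rejected_reward - chosen_reward))

for chosen, rejected in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]:
    print(chosen, rejected,
          round(log_sig_loss(chosen, rejected), 4),
          round(log_exp_loss(chosen, rejected), 4))
```

Under these formulations the two losses are algebraically identical and differ only in how they are computed. When both completions get the same reward, the loss is ln 2 ≈ 0.693, which is close to the 0.694 reported at the first training step in the log further down.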

In [22]:
trainer = RewardModelTrainer(model=model,
                            strategy=strategy,
                            optim=optim,
                            loss_fn=loss_fn,
                            train_dataset=train_dataset,
                            valid_dataset=val_dataset,
                            eval_dataset=eval_dataset,
                            batch_size=args.batch_size,
                            max_epochs=args.max_epochs)

In [23]:
# Train!!!
trainer.fit()

# Save model checkpoint after fitting on only rank0.
# strategy.save_model(model, os.path.join(args.output_dir, "rm.pt"), only_rank0=True)
trainer.save_model(path=os.path.join(args.output_dir, "rm.pt"), only_rank0=True)

# Save optimizer checkpoint on all ranks.
if args.need_optim_ckpt:
    strategy.save_optimizer(trainer.optimizer,
                            os.path.join(args.output_dir, "rm_optim_checkpoint_%d.pt" % (torch.cuda.current_device())),
                            only_rank0=False)

Train epoch:   0%|          | 0/3 [00:00<?, ?it/s]
Train step of epoch 0:   0%|          | 0/25 [00:00<?, ?it/s]
Train step of epoch 0:   4%|▍         | 1/25 [00:00<00:04,  5.95it/s]
Train step of epoch 0:   4%|▍         | 1/25 [00:00<00:04,  5.95it/s, loss=0.694, dist=0, acc=0]
Train step of epoch 0:   8%|▊         | 2/25 [00:00<00:03,  6.51it/s, loss=0.694, dist=0, acc=0]
Train step of epoch 0:   8%|▊         | 2/25 [00:00<00:03,  6.51it/s, loss=nan, dist=0, acc=0]
Train step of epoch 0:  12%|█▏        | 3/25 [00:00<00:03,  6.72it/s, loss=nan, dist=0, acc=0]
Train step of epoch 0:  16%|█▌        | 4/25 [00:00<00:03,  6.79it/s, loss=nan, dist=0, acc=0]
Train step of epoch 0:  20%|██        | 5/25 [00:00<00:02,  6.85it/s, loss=nan, dist=0, acc=0]
Train step of epoch 0
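The `loss=nan` in the log above is consistent with the float16 warning in the model setup cell. One plausible contributor is the narrow range of half precision, which the sketch below illustrates with the standard library. Note two assumptions: `struct` raises an error on overflow where PyTorch would produce `inf`, and whether this is the actual cause here has not been verified.

```python
import struct

def to_float16(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

largest = to_float16(65504.0)  # largest finite float16 value
tiny = to_float16(1e-8)        # too small for float16, rounds to 0.0
print(largest, tiny)

try:
    to_float16(1e5)            # beyond the float16 range
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)
```

For example, Adam's default epsilon of 1e-8 cannot be represented in float16, so keeping the model in float32 (or using mixed precision with loss scaling) is the usual way to avoid this kind of instability.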

# Step 3) PPO: Proximal Policy Optimization
Further fine-tune the SFT model from Step 1 with RL (e.g., PPO), using the reward model from Step 2 and the prompt-only dataset.

- References
    - [train_prompts.py](https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/examples/train_prompts.py)
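The core of PPO is its clipped surrogate objective, which limits how far one update can move the policy away from the one that generated the samples. Below is a minimal scalar sketch of that objective. It is not the nextgpt `PPOTrainer` implementation, and `clip_eps=0.2` is a commonly used value assumed here for illustration.

```python
import math

def ppo_clipped_loss(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """Clipped surrogate loss for a single action (to be minimized)."""
    # Probability ratio between the updated policy and the sampling policy.
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the more pessimistic of the two objectives, then negate for a loss.
    return -min(ratio * advantage, clipped_ratio * advantage)

same_policy = ppo_clipped_loss(0.0, 0.0, advantage=1.0)
big_step_good = ppo_clipped_loss(math.log(1.5), 0.0, advantage=1.0)
big_step_bad = ppo_clipped_loss(math.log(1.5), 0.0, advantage=-1.0)
print(same_policy, big_step_good, big_step_bad)
```

With a positive advantage, the gain from raising an action's probability is capped once the ratio exceeds `1 + clip_eps`. With a negative advantage, the unclipped penalty still applies in full.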

In [24]:
import argparse
from copy import deepcopy

import pandas as pd

import torch
from torch.optim import Adam
from transformers import AutoTokenizer, BloomTokenizerFast
from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer

from nextgpt.rlhf.models.base import RewardModel
from nextgpt.rlhf.models.bloom import BLOOMActor, BLOOMCritic
from nextgpt.rlhf.models.gpt import GPTActor, GPTCritic
from nextgpt.rlhf.models.opt import OPTActor, OPTCritic
from nextgpt.rlhf.trainer import PPOTrainer
from nextgpt.rlhf.trainer.strategies import DDPStrategy, NaiveStrategy

import json
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Colab exposes a single GPU (index 0), as in Step 1

In [33]:
# Define arguments.
parser = argparse.ArgumentParser()

parser.add_argument("--output_dir", type=str, default="./output_3_ppo")
parser.add_argument("--strategy",
                    choices=["naive", "ddp"],
                    default="naive")
parser.add_argument("--model", type=str, default="gpt2", choices=["gpt2", "bloom", "opt"])
parser.add_argument("--pretrain_actor", type=str, default=None)
parser.add_argument("--pretrain_critic", type=str, default=None)
parser.add_argument("--num_episodes", type=int, default=10)
parser.add_argument("--max_timesteps", type=int, default=3)
parser.add_argument("--update_timesteps", type=int, default=3)
parser.add_argument("--max_epochs", type=int, default=5)
parser.add_argument("--train_batch_size", type=int, default=8)
parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
parser.add_argument("--max_len", type=int, default=250)
args = parser.parse_args(args=[])

# For test
args.pretrain_actor = "./output_1_sft" # SFT model
args.pretrain_critic = "./output_2_rm/rm.pt" # RM model

args.num_episodes = 1
args.max_epochs   = 1

print(args)
os.makedirs(args.output_dir, exist_ok=True)

Namespace(output_dir='./output_3_ppo', strategy='naive', model='gpt2', pretrain_actor='./output_1_sft', pretrain_critic='./output_2_rm/rm.pt', num_episodes=1, max_timesteps=3, update_timesteps=3, max_epochs=1, train_batch_size=8, lora_rank=0, max_len=250)


In [27]:
# Configure strategy.
if args.strategy == "naive":
    strategy = NaiveStrategy()
elif args.strategy == "ddp":
    strategy = DDPStrategy()
else:
    raise ValueError(f"Unsupported strategy: {args.strategy}")

In [34]:
# Configure model, tokenizer.
with strategy.model_init_context():
    if args.model == "gpt2":
        actor = GPTActor(pretrained=args.pretrain_actor, lora_rank=args.lora_rank)
        critic = GPTCritic(pretrained=args.pretrain_critic, lora_rank=args.lora_rank)
        # Get the tokenizer.
        tokenizer = AutoTokenizer.from_pretrained(
            args.model, 
            bos_token="<|startoftext|>",
            eos_token="<|endoftext|>", 
            pad_token="<|pad|>",
            padding_side="right", 
            model_max_length=args.max_len,
            )
        # tokenizer.pad_token = tokenizer.eos_token
    elif args.model == "bloom":
        actor = BLOOMActor(pretrained=args.pretrain_actor, lora_rank=args.lora_rank)
        critic = BLOOMCritic(pretrained=args.pretrain_critic, lora_rank=args.lora_rank)
        tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
        tokenizer.pad_token = tokenizer.eos_token            
    elif args.model == "opt":
        actor = OPTActor(pretrained=args.pretrain_actor, lora_rank=args.lora_rank)
        critic = OPTCritic(pretrained=args.pretrain_critic, lora_rank=args.lora_rank)
        tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")            
    else:
        raise ValueError(f"Unsupported model: {args.model}")

    critic.to(torch.float16).to(torch.cuda.current_device())
    actor.to(torch.float16).to(torch.cuda.current_device())

    initial_model = deepcopy(actor)
    reward_model = RewardModel(deepcopy(critic.model), deepcopy(critic.value_head)).to(torch.cuda.current_device())

OSError: ignored