<a href="https://colab.research.google.com/github/louiezzang/next-gpt/blob/main/examples/chatgpt_replica_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# chatGPT replica


**Example of RLHF dataset**:

Total 3 datasets are needed for training the 3 steps(SFT, RM and PPO)
- [Example of dataset](https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/chatllama#dataset-preparation)
- [Example of dataset 1](https://huggingface.co/datasets/stanfordnlp/SHP)
- [Example of dataset 2](https://huggingface.co/datasets/Anthropic/hh-rlhf)

step1) Dataset for SFT(Supervised Fine-tuning training)
```json
[
    {
        "prompt": "",
        "completion": ""        
    }, ...
]
```

step2) Dataset for RM(Reward Model) training: There are multiple completetions with human rated ranking score for one prompt.
```json
[
    {
        "prompt": "",
        "completion_1": "",
        "completion_2": "",
        "completion_3": "",            
        "ranking": [1, 0, 2]
    }, ...
]
```
    
step3) Dataset for PPO(RLHF) training: It only consists of prompt.
```json
[
    {
        "prompt": ""
    }, ...
]
```

# Colab environment setup

#### Installation (python>=3.8)

In [1]:
# Install next-gpt lib.
!rm -rf ./next-gpt/
!git clone https://github.com/louiezzang/next-gpt.git
%cd next-gpt/
!pip install .
%cd ../

Cloning into 'next-gpt'...
remote: Enumerating objects: 469, done.[K
remote: Counting objects: 100% (121/121), done.[K
remote: Compressing objects: 100% (79/79), done.[K
remote: Total 469 (delta 45), reused 86 (delta 34), pack-reused 348[K
Receiving objects: 100% (469/469), 149.09 KiB | 9.94 MiB/s, done.
Resolving deltas: 100% (213/213), done.
/content/next-gpt
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/next-gpt
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.8/6.8 MB[0m [31m48.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m468.7/468.7 KB[0m [31m10.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tiktoken
  Downloading tikto

# Step 1) SFT: Surpervised Fine-tuning
Build a Supervised Fine-tuning model to answer well to the question.

- Refereneces
  - [fine tuning code_1](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)
  - [fine tuning code_2](https://github.com/Beomi/KoAlpaca/blob/main/train.py)


- SFT(Supervised Fine Tuning)
- Fine-tune a pretrained LLM on a specific domain or corpus of instructions and human demonstrations

- Dataset example
```json
[
    {
        "prompt": "",
        "completion": ""        
    }, ...
]
```

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import json
import yaml
import argparse

import numpy as np
import pandas as pd

import torch
from datasets import load_dataset
import transformers

from transformers import (
    AutoTokenizer, AutoConfig, AutoModelForCausalLM, pipeline, 
    TrainingArguments, AutoModelWithLMHead,
    ProgressCallback
)
from nextgpt.finetuning import (
    SupervisedDataset, DataCollatorForSupervisedDataset,
    SupervisedTrainer, LoggingCallback
)

PT_MODEL_NAME = "gpt2"

In [3]:
# Get the tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained(PT_MODEL_NAME, 
                                          bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', 
                                          pad_token='<|pad|>')
print(tokenizer)

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|pad|>'})


In [4]:
dataset_webgpt_comp = load_dataset("openai/webgpt_comparisons", split="train[:20%]")

Downloading builder script:   0%|          | 0.00/4.74k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/2.85k [00:00<?, ?B/s]

Downloading and preparing dataset webgpt_comparisons/default to /root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a...


Downloading data:   0%|          | 0.00/279M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset webgpt_comparisons downloaded and prepared to /root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a. Subsequent calls will reuse this data.


In [5]:
data_list = []
for row in dataset_webgpt_comp:
  question = row["question"]["full_text"]
  answer_0 = row["answer_0"]
  data_list.append({
      "prompt": question,
      "completion": answer_0
  })

In [6]:
prompt_template = (
  "Below is an instruction that describes a task, paired with an input that provides further context.\n\n"
  "Write a response that appropriately completes the request.\n\n"
  "### Instruction:\n{prompt}\n\n### Response:"
)

In [7]:
dataset = SupervisedDataset(
    data=data_list,
    tokenizer=tokenizer, 
    prompt_template=prompt_template,
    prompt_fields=["prompt"], 
    completion_field="completion",
    verbose=True)

# Split train and val dataset.
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, eval_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

# Data collator.
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)

Below is an instruction that describes a task, paired with an input that provides further context.

Write a response that appropriately completes the request.

### Instruction:
Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?

### Response:
The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]<|endoftext|>
Tokenizing inputs... This may take some time...
Loading data done!!: 3916


In [8]:
# Load the pretrained model.
model = AutoModelForCausalLM.from_pretrained(PT_MODEL_NAME)
model.resize_token_embeddings(len(tokenizer))

Downloading pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Embedding(50259, 768)

In [9]:
model_output_dir = "./output_1_sft"

# Train arguments.
training_args = TrainingArguments(
    output_dir="./train_1_sft", # the output directory
    overwrite_output_dir=True, # overwrite the content of the output directory
    num_train_epochs=1, # number of training epochs
    per_device_train_batch_size=4, # batch size for training
    per_device_eval_batch_size=4, # batch size for evaluation
    eval_steps=3, # number of update steps between two evaluations.
    save_steps=100, # after # steps model is saved 
    warmup_steps=5, # number of warmup steps for learning rate scheduler
    prediction_loss_only=True,
)

# Train the model.
trainer = SupervisedTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=None,
    # callbacks=[ProgressCallback, LoggingCallback(logger=None)],
)

trainer.train()
trainer.save_state()
trainer.safe_save_model(output_dir=model_output_dir)



Step,Training Loss
500,3.8486


In [10]:
# Inference test.
generator = pipeline("text-generation", model=model_output_dir, tokenizer=tokenizer)

generation_args = dict(
    num_beams=4,
    repetition_penalty=2.0,
    no_repeat_ngram_size=4,
    # bos_token='<|startoftext|>',
    # eos_token='<|endoftext|>', 
    # pad_token='<|pad|>',
    max_new_tokens=64,
    do_sample=True,
    top_k=30,
    top_p=0.95,
    temperature=1.9, 
    #max_length=300, 
    #num_return_sequences=20
    early_stopping=True,
)

In [11]:
test_list = data_list[-5:]

test_prompt_list = []
actual_completion_list = []
for row in test_list:
    text_input = row
    prompt = prompt_template.format_map(text_input)
    test_prompt_list.append(prompt)
    actual_completion_list.append(text_input["completion"])

result_list = generator(test_prompt_list, **generation_args)
for prompt, result, actual_response in zip(test_prompt_list, result_list, actual_completion_list):
    print("")
    print("-" * 70)
    print(("completion: %s" % (result[0]["generated_text"])))
    print(f"\n### Actual answer:\n{actual_response}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



----------------------------------------------------------------------
completion: Below is an instruction that describes a task, paired with an input that provides further context.

Write a response that appropriately completes the request.

### Instruction:
I've noticed when scanning my files for malware/viruses, the "number of files scanned" that pops up is almost always greater than the number of files I selected to scan. What is actually being scanned and why is it considered different files?

### Response:Scanning your files for malware is one way to detect whether you have been infected by malware or other viruses. [1,2] However, in order to scan for malware, you must make certain that there are sufficient amounts of data to identify each virus. If you have not yet done so, you may need to

### Actual answer:
Microsoft Defender Antivirus has multiple layers of protection to catch malware and viruses. These include quick scans, full scans, and on-access protection with cloud-del

# Step 2) RM: Reward Model
Train Reward Model to generate the better answer by giving a reward to the better answer.
- Dataset example
```json
[
    {
        "prompt": "",
        "completion_1": "",
        "completion_2": "",
        "completion_3": "",            
        "ranking": [1, 0, 2]
    }, ...
]
```
- Dataset sources
  - [Dahoas/rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
  - [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
  - [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
  - [Dahoas/instruct-synthetic-prompt-responses](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)

In [3]:
import os
import json
import argparse

import torch
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, BloomTokenizerFast
from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer
import loralib as lora

from nextgpt.rlhf.dataset import RewardDataset
from nextgpt.rlhf.models.base import RewardModel
from nextgpt.rlhf.models.bloom import BLOOMRM
from nextgpt.rlhf.models.gpt import GPTRM
from nextgpt.rlhf.models.opt import OPTRM
from nextgpt.rlhf.trainer import RewardModelTrainer
from nextgpt.rlhf.trainer.strategies import DDPStrategy, NaiveStrategy

In [None]:
# Configure strategy.
if args.strategy == "naive":
    strategy = NaiveStrategy()
elif args.strategy == "ddp":
    strategy = DDPStrategy()
else:
    raise ValueError(f"Unsupported strategy: {args.strategy}")

In [4]:
# Get the dataset.
dataset_webgpt_comp = load_dataset("openai/webgpt_comparisons", split="train[:20%]")

Downloading builder script:   0%|          | 0.00/4.74k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/2.85k [00:00<?, ?B/s]

Downloading and preparing dataset webgpt_comparisons/default to /root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a...


Downloading data:   0%|          | 0.00/279M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset webgpt_comparisons downloaded and prepared to /root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a. Subsequent calls will reuse this data.


In [11]:
# Convert data into ranking format.
data_list_ranking = []
for row in dataset_webgpt_comp:
  question = row["question"]["full_text"]
  answer_0 = row["answer_0"]
  answer_1 = row["answer_1"]
  score_0 = row["score_0"]
  score_1 = row["score_1"]
  if answer_0 == "" or answer_1 == "" or (score_0 == score_1):
    continue

  ranking = [0 if score_0 > score_1 else 1, 0 if score_0 < score_1 else 1]
  data_list_ranking.append({
      "prompt": question,
      "completion_0": answer_0,
      "completion_1": answer_1,
      "ranking": ranking
  })

data_list_ranking[:2]

[{'prompt': 'Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?',
  'completion_0': 'The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]',
  'completion_1': "Apu Nahasapeemapetilon is a recurring character in the American animated television series The Simpsons. He is an Indian immigrant proprietor who runs the Kwik-E-Mart, a popular convenience store in Springfield. [1] He was based on Peter Seller's character in the film The Party. [2]",
  'ranking': [0, 1]},
 {'prompt': 'Heterophobia is the irrational fear of what',
  'completion_0': ' Heterophobia is the irrational fear of the opposite sex, coined as Sexophobia [1]. This phobia can be caused by genetics, heredity, negative experiences with the opposite sex, or a combination of these [1].  Symptoms may result from encountering people of the opposite sex, including breathlessness, dizziness, excessive 

In [14]:
# Make ranking data to chosen, rejetced data for reward model dataset.
total_data_ranking2chosen = []
for tmp in data_list_ranking:
    one_data_ranking2chosen = []

    # data 1) 0 VS 1
    data = {}
    data["prompt"] = tmp["prompt"]
    if tmp["ranking"][0] < tmp["ranking"][1]:
        data["chosen"] = tmp["completion_0"]
        data["rejected"] = tmp["completion_1"]
    else:
        data["chosen"] = tmp["completion_1"]
        data["rejected"] = tmp["completion_0"]
    one_data_ranking2chosen.append(data)

    # # data 2) 0 VS 2
    # data = {}
    # data["prompt"] = tmp["prompt"]
    # if tmp["ranking"][0] < tmp["ranking"][2]:
    #     data["chosen"] = tmp["completion_0"]
    #     data["rejected"] = tmp["completion_2"]
    # else:
    #     data["chosen"] = tmp["completion_2"]
    #     data["rejected"] = tmp["completion_0"]
    # one_data_ranking2chosen.append(data)

    # # data 1) 1 VS 2
    # data = {}
    # data["prompt"] = tmp["prompt"]
    # if tmp["ranking"][1] < tmp["ranking"][2]:
    #     data["chosen"] = tmp["completion_1"]
    #     data["rejected"] = tmp["completion_2"]
    # else:
    #     data["chosen"] = tmp["completion_2"]
    #     data["rejected"] = tmp["completion_1"]
    # one_data_ranking2chosen.append(data)


    total_data_ranking2chosen.extend(one_data_ranking2chosen)


print("before data num: %d" % (len(data_list_ranking)))
print("after data num: %d" % (len(total_data_ranking2chosen)))
print("data example: \n%s" % total_data_ranking2chosen[1])

before data num: 2747
after data num: 2747
data example: 
{'prompt': 'Heterophobia is the irrational fear of what', 'chosen': ' Heterophobia is the irrational fear of the opposite sex, coined as Sexophobia [1]. This phobia can be caused by genetics, heredity, negative experiences with the opposite sex, or a combination of these [1].  Symptoms may result from encountering people of the opposite sex, including breathlessness, dizziness, excessive sweating, nausea, dry mouth, feeling sick, shaking, coronary heart palpitations, and anxiety [1].', 'rejected': 'In modern times, there has been a rise in what is called heterophobia; the irrational fear of, discrimination against, or aversion to heterosexual people. [1][2] The word "heterophobia" is a play on the word "homophobia," which describes the fear of homosexual people. [1] Like homophobia, heterophobia is promoted by those who wish to shame or bash heterosexuals, especially men who have sex with women. [2]'}


In [15]:
# Prepare for data and dataset.
import random
random.seed(230319)

random.shuffle(total_data_ranking2chosen)
print(total_data_ranking2chosen[1])

# train_data = total_data_ranking2chosen[:-1000]
# eval_data = total_data_ranking2chosen[-1000:0]
# We just select very small set of data for a quicker training.
train_data = total_data_ranking2chosen[:100]
eval_data = total_data_ranking2chosen[100:130]

train_dataset = RewardDataset(train_data, tokenizer, args.max_len)
eval_dataset = RewardDataset(eval_data, tokenizer, args.max_len)

# Check
idx = 10
print("#" * 70)
print("## prompt ##")
print(train_data[idx]["prompt"])
print("#" * 70)
print("## chosen ##")
print(train_data[idx]["chosen"])
print("#" * 70)
print("## rejected ##")
print(train_data[idx]["rejected"])

{'prompt': 'Fact-check each of the claims in the following answer.\n\nQuestion: Why are some talented and successful people so irrational?\n\nAnswer: There is no single answer to this question, but factors that may contribute to someone being irrational could be anything from a lack of self-confidence and self-esteem to a fear of failure and negative feedback, all of which can lead to erratic and irrational behavior. Additionally, some people may have difficulty understanding or accepting various kinds of feedback, especially negative feedback, causing them to act out in anger or even uncontrolled rage. In other words, there is no one answer to the question but rather a variety of factors that can contribute.', 'chosen': '"There is no single answer to the question [why some talented and successful people are irrational], but factors that may contribute to someone being irrational could be anything from a lack of self-confidence and self-esteem to a fear of failure and negative feedback

NameError: ignored