<a href="https://colab.research.google.com/github/louiezzang/next-gpt/blob/main/examples/chatgpt_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview


What is RLHF? <br>
See [this link](https://gist.github.com/JoaoLages/c6f2dfd13d2484aa8bb0b2d567fbf093).

<br>

**Example of RLHF dataset**:

Three datasets are needed, one for each of the three training steps (SFT, RM, and PPO).
- [Example of dataset](https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/chatllama#dataset-preparation)
- [Example of dataset 1](https://huggingface.co/datasets/stanfordnlp/SHP)
- [Example of dataset 2](https://huggingface.co/datasets/Anthropic/hh-rlhf)

Step 1) Dataset for SFT (supervised fine-tuning)
```json
[
    {
        "prompt": "",
        "completion": ""        
    }, ...
]
```

Step 2) Dataset for RM (reward model) training: each prompt has multiple completions with a human-rated ranking.
```json
[
    {
        "prompt": "",
        "completion_1": "",
        "completion_2": "",
        "completion_3": "",            
        "ranking": [1, 0, 2]
    }, ...
]
```
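A ranked record like the one above is usually expanded into pairwise (chosen, rejected) examples before reward-model training. The sketch below shows one way to do that. It assumes a lower `ranking` value means a better completion and that `ranking[i]` belongs to the i-th completion; the helper name `ranking_to_pairs` is hypothetical and not part of next-gpt.

```python
from itertools import combinations


def ranking_to_pairs(prompt, completions, ranking):
    """Expand one ranked record into pairwise (chosen, rejected) examples.

    Assumption: a lower ranking value means a better completion.
    """
    pairs = []
    for i, j in combinations(range(len(completions)), 2):
        if ranking[i] == ranking[j]:
            continue  # a tie carries no preference signal
        better, worse = (i, j) if ranking[i] < ranking[j] else (j, i)
        pairs.append({
            "prompt": prompt,
            "chosen": completions[better],
            "rejected": completions[worse],
        })
    return pairs


pairs = ranking_to_pairs("Q", ["a", "b", "c"], [1, 0, 2])
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

Three completions yield up to three pairs; with only two completions per prompt the expansion reduces to a single pair.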
    
Step 3) Dataset for PPO (RLHF) training: it consists of prompts only.
```json
[
    {
        "prompt": ""
    }, ...
]
```

# Environment setup

#### Installation (python>=3.8)

In [None]:
# Install next-gpt lib.
!rm -rf ./next-gpt/
!git clone https://github.com/louiezzang/next-gpt.git
%cd next-gpt/
!pip install .
%cd ../

Cloning into 'next-gpt'...
remote: Enumerating objects: 761, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 761 (delta 4), reused 10 (delta 4), pack-reused 745
Receiving objects: 100% (761/761), 216.96 KiB | 12.76 MiB/s, done.
Resolving deltas: 100% (445/445), done.
/content/next-gpt
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/next-gpt
  Preparing metadata (setup.py) ... done
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.8/6.8 MB 89.2 MB/s eta 0:00:00
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 468.7/468.7 KB 48.1 MB/s eta 0:00:00
Collecting tiktoken
  Downloading tiktoken

# Step 1) SFT: Supervised Fine-tuning
Build a supervised fine-tuned model that answers questions well.

- References
  - [fine tuning code_1](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)
  - [fine tuning code_2](https://github.com/Beomi/KoAlpaca/blob/main/train.py)


- SFT (Supervised Fine-Tuning): fine-tune a pretrained LLM on a specific domain or a corpus of instructions and human demonstrations

- Dataset example
```json
[
    {
        "prompt": "",
        "completion": ""        
    }, ...
]
```
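A common way to turn such a prompt/completion record into a training example is to concatenate the token ids and mask the prompt positions in the labels, so the loss is computed on the completion only. The sketch below illustrates that idea with plain Python lists. Whether `SupervisedDataset` does exactly this is not shown in this notebook, and the function name `build_example` and the toy token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default


def build_example(prompt_ids, completion_ids, eos_id):
    """Concatenate prompt and completion ids and mask the prompt in the labels."""
    input_ids = prompt_ids + completion_ids + [eos_id]
    # Prompt positions are masked out; completion and EOS positions are learned.
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids + [eos_id]
    return {"input_ids": input_ids, "labels": labels}


# Toy ids for illustration only; 50256 is GPT-2's <|endoftext|> id.
example = build_example(prompt_ids=[11, 12, 13], completion_ids=[7, 8], eos_id=50256)
print(example["input_ids"])
print(example["labels"])
```

Appending the EOS id to the completion teaches the model where an answer ends, which matters later when generating with a token limit.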

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import json
import yaml
import argparse

import numpy as np
import pandas as pd

import loralib as lora
import torch
import torch.distributed as dist
from torch.optim import Adam
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

from transformers import pipeline
from transformers import AutoTokenizer, BloomTokenizerFast
from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer
from datasets import load_dataset

from nextgpt.dataset import SupervisedDataset, DataCollatorForSupervisedDataset
from nextgpt.trainer import SFTTrainer
from nextgpt.trainer.strategies import DDPStrategy, NaiveStrategy
from nextgpt.models.bloom import BLOOMLM
from nextgpt.models.gpt import GPTLM
from nextgpt.models.opt import OPTLM

In [None]:
PROMPT_TEMPLATE = (
  "Below is an instruction that describes a task, paired with an input that provides further context. "
  "Write a response that appropriately completes the request.\n\n"
  "### Instruction:\n{instruction}\n\n### Response:"
)

In [None]:
# Define arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--strategy",
                    choices=["naive", "ddp"],
                    default="naive")
parser.add_argument("--model", choices=["gpt2", "bloom", "opt"], default="gpt2")
parser.add_argument("--pretrain", type=str, default=None)
parser.add_argument("--max_datasets_size", type=int, default=None)
# Use a store_true flag: with type=bool, argparse treats any non-empty string (even "False") as True.
parser.add_argument("--need_optim_ckpt", action="store_true")
parser.add_argument("--max_epochs", type=int, default=3)
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument("--max_len", type=int, default=512)
parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
parser.add_argument("--log_interval", type=int, default=100, help="how many steps to log")
parser.add_argument("--lr", type=float, default=5e-6)
parser.add_argument("--accumulation_steps", type=int, default=8)
parser.add_argument("--output_dir", type=str, default="./output_1_sft")

args = parser.parse_args(args=[])

# For test.
args.pretrain = "gpt2"
args.max_datasets_size = 10000

print(args)

Namespace(strategy='naive', model='gpt2', pretrain='gpt2', max_datasets_size=10000, need_optim_ckpt=False, max_epochs=3, batch_size=4, max_len=512, lora_rank=0, log_interval=100, lr=5e-06, accumulation_steps=8, output_dir='./output_1_sft')
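
With `batch_size=4` and `accumulation_steps=8`, each optimizer update sees 4 × 8 = 32 examples. The sketch below works through that arithmetic and estimates the number of optimizer steps. It assumes the trainer steps once per full group of accumulated batches and ignores leftover batches; the training-set size of 3,132 is a hypothetical value, since the notebook does not state the split size.

```python
import math


def optimizer_steps(num_examples, batch_size, accumulation_steps, max_epochs):
    """Estimate optimizer steps under gradient accumulation."""
    # DataLoader keeps the last partial batch by default (drop_last=False).
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Assumption: leftover batches at the end of an epoch do not trigger a step.
    steps_per_epoch = batches_per_epoch // accumulation_steps
    return steps_per_epoch * max_epochs


effective_batch_size = 4 * 8
total_steps = optimizer_steps(num_examples=3132, batch_size=4,
                              accumulation_steps=8, max_epochs=3)
print(effective_batch_size, total_steps)
```

Under these assumptions the estimate is 291 steps, which is the total shown in the training progress bar of this notebook.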


In [None]:
# Configure strategy.
if args.strategy == "naive":
    strategy = NaiveStrategy()
elif args.strategy == "ddp":
    strategy = DDPStrategy()
else:
    raise ValueError(f"Unsupported strategy: {args.strategy}")

# Configure model.
with strategy.model_init_context():
    if args.model == "bloom":
        model = BLOOMLM(pretrained=args.pretrain, lora_rank=args.lora_rank).to(torch.cuda.current_device())
    elif args.model == "opt":
        model = OPTLM(pretrained=args.pretrain, lora_rank=args.lora_rank).to(torch.cuda.current_device())
    elif args.model == "gpt2":
        model = GPTLM(pretrained=args.pretrain, lora_rank=args.lora_rank).to(torch.cuda.current_device())
    else:
        raise ValueError(f"Unsupported model: {args.model}")

# Configure tokenizer.
if args.model == "gpt2":
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
elif args.model == "bloom":
    tokenizer = BloomTokenizerFast.from_pretrained(args.pretrain)
elif args.model == "opt":
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
else:
    raise ValueError(f"Unsupported model: {args.model}")
# Reuse the EOS token for padding (applied to every model type).
tokenizer.pad_token = tokenizer.eos_token

# Configure optimizer.
optim = Adam(model.parameters(), lr=args.lr)

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

In [None]:
# Configure dataset.
dataset_webgpt_comp = load_dataset("openai/webgpt_comparisons", split="train[:20%]")

data_list = []
for row in dataset_webgpt_comp:
    question = row["question"]["full_text"]
    answer_0 = row["answer_0"]
    data_list.append({
        "instruction": question,
        "completion": answer_0
    })

dataset = SupervisedDataset(
    dataset=data_list,
    tokenizer=tokenizer, 
    prompt_template=PROMPT_TEMPLATE,
    completion_field="completion",
    max_datasets_size=args.max_datasets_size,
    max_length=args.max_len,
    verbose=True)

# Split train and eval dataset.
train_size = int(0.8 * len(dataset))
eval_size = len(dataset) - train_size
train_dataset, eval_dataset = torch.utils.data.random_split(dataset, [train_size, eval_size])

# Data collator.
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)

Downloading builder script:   0%|          | 0.00/4.74k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/2.85k [00:00<?, ?B/s]

Downloading and preparing dataset webgpt_comparisons/default to /root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a...


Downloading data:   0%|          | 0.00/279M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset webgpt_comparisons downloaded and prepared to /root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a. Subsequent calls will reuse this data.
Loading data...
Limiting dataset to 10000 examples.
Formatting inputs...
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?

### Response:
The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]<|endoftext|>
Tokenizing inputs... This may take some time...


In [None]:
# Configure dataloader.
if dist.is_initialized() and dist.get_world_size() > 1:
    print("DDP")
    train_sampler = DistributedSampler(train_dataset,
                                       shuffle=True,
                                       seed=42,
                                       drop_last=True,
                                       rank=dist.get_rank(),
                                       num_replicas=dist.get_world_size())
    if eval_dataset is not None:
        eval_sampler = DistributedSampler(eval_dataset,
                                          shuffle=False,
                                          seed=42,
                                          drop_last=False,
                                          rank=dist.get_rank(),
                                          num_replicas=dist.get_world_size())
else:
    train_sampler = None
    eval_sampler = None

train_dataloader = DataLoader(train_dataset,
                              shuffle=(train_sampler is None),
                              sampler=train_sampler,
                              batch_size=args.batch_size,
                              collate_fn=data_collator,
                              pin_memory=True)
if eval_dataset is not None:
    eval_dataloader = DataLoader(eval_dataset,
                                 shuffle=(eval_sampler is None),
                                 sampler=eval_sampler,
                                 batch_size=args.batch_size,
                                 collate_fn=data_collator,
                                 pin_memory=True)
else:
    eval_dataloader = None
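
In the DDP branch, `DistributedSampler` gives each rank a disjoint, strided slice of the dataset indices. The sketch below mimics that partitioning in plain Python, with shuffling left out for clarity. It reflects my reading of the PyTorch behaviour and is not a copy of the library code; the function name `shard_indices` is hypothetical.

```python
import math


def shard_indices(dataset_len, num_replicas, rank, drop_last):
    """Return the dataset indices one rank would see (no shuffling)."""
    indices = list(range(dataset_len))
    if drop_last and dataset_len % num_replicas != 0:
        # Drop the tail so every rank gets the same number of samples.
        num_samples = math.ceil((dataset_len - num_replicas) / num_replicas)
    else:
        num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    if drop_last:
        indices = indices[:total_size]
    else:
        # Pad by repeating indices from the start so the split is even.
        indices += indices[:total_size - len(indices)]
    return indices[rank:total_size:num_replicas]


for rank in range(3):
    print(rank, shard_indices(10, 3, rank, drop_last=True))
```

With `drop_last=True`, as used for the training sampler, the tail samples that do not divide evenly are skipped; with `drop_last=False`, as used for evaluation, a few samples are repeated instead.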

In [None]:
# Train!!!
trainer = SFTTrainer(model=model,
                     strategy=strategy,
                     optim=optim,
                     train_dataloader=train_dataloader,
                     eval_dataloader=eval_dataloader,
                     batch_size=args.batch_size,
                     max_epochs=args.max_epochs,
                     accumulation_steps=args.accumulation_steps)

trainer.fit(logger=None, log_interval=args.log_interval)

# Save model checkpoint after fitting on only rank0.
trainer.save_model(path=args.output_dir, only_rank0=True, tokenizer=tokenizer)
# Save optimizer checkpoint on all ranks.
if args.need_optim_ckpt:
    strategy.save_optimizer(trainer.optimizer,
                            "sft_optim_checkpoint_%d.pt" % (torch.cuda.current_device()),
                            only_rank0=False)


steps:   0%|          | 0/291 [00:00<?, ?it/s]
steps:   0%|          | 1/291 [00:01<07:01,  1.45s/it]
steps:   0%|          | 1/291 [00:01<07:01,  1.45s/it, loss=0.401, lr=5.56e-7, epoch=0, batch_id=7]
steps:   1%|          | 2/291 [00:02<04:53,  1.01s/it, loss=0.401, lr=5.56e-7, epoch=0, batch_id=7]
steps:   1%|          | 2/291 [00:02<04:53,  1.01s/it, loss=0.391, lr=1.11e-6, epoch=0, batch_id=15]
steps:   1%|          | 3/291 [00:02<04:08,  1.16it/s, loss=0.391, lr=1.11e-6, epoch=0, batch_id=15]
steps:   1%|          | 3/291 [00:02<04:08,  1.16it/s, loss=0.391, lr=1.67e-6, epoch=0, batch_id=23]
steps:   1%|▏         | 4/291 [00:03<03:51,  1.24it/s, loss=0.391, lr=1.67e-6, epoch=0, batch_id=23]
steps:   1%|▏         | 4/291 [00:03<03:51,  1.24it/s, loss=0.411, lr=2.22e-6, epoch=0, batch_id=31]
steps:   2%|▏         | 5/291 [00:04<03:39,  1.31it/s, loss=0.411, lr=2.22e-6, epoch=0, batch_id=31]
steps:   2%|▏         | 5/291 [00:04<03:39,  1.31it/s, loss=0

In [None]:
!ls -la ./output_1_sft

total 499884
drwxr-xr-x 2 root root      4096 Apr 10 01:20 .
drwxr-xr-x 1 root root      4096 Apr 10 01:20 ..
-rw-r--r-- 1 root root       907 Apr 10 01:20 config.json
-rw-r--r-- 1 root root       119 Apr 10 01:20 generation_config.json
-rw-r--r-- 1 root root    456318 Apr 10 01:20 merges.txt
-rw-r--r-- 1 root root 510398013 Apr 10 01:20 pytorch_model.bin
-rw-r--r-- 1 root root       470 Apr 10 01:20 special_tokens_map.json
-rw-r--r-- 1 root root       722 Apr 10 01:20 tokenizer_config.json
-rw-r--r-- 1 root root    999186 Apr 10 01:20 vocab.json


In [None]:
# Inference test.
generator = pipeline("text-generation", model=args.output_dir, tokenizer=tokenizer)

generation_args = dict(
    num_beams=4,
    repetition_penalty=2.0,
    no_repeat_ngram_size=4,
    max_new_tokens=64,
    do_sample=True,
    top_k=30,
    top_p=0.95,
    temperature=1.9, 
    #max_length=300, 
    #num_return_sequences=20
    early_stopping=True,
)

test_list = data_list[-5:]

test_prompt_list = []
actual_completion_list = []
for row in test_list:
    text_input = row
    prompt = PROMPT_TEMPLATE.format_map(text_input)
    test_prompt_list.append(prompt)
    actual_completion_list.append(text_input["completion"])

result_list = generator(test_prompt_list, **generation_args)
for prompt, result, actual_response in zip(test_prompt_list, result_list, actual_completion_list):
    print("")
    print("-" * 70)
    print(("completion: %s" % (result[0]["generated_text"])))
    print(f"\n### Actual answer:\n{actual_response}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



----------------------------------------------------------------------
completion: Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
I've noticed when scanning my files for malware/viruses, the "number of files scanned" that pops up is almost always greater than the number of files I selected to scan. What is actually being scanned and why is it considered different files?

### Response:There are many factors that influence the amount of information that a software program has to send out to a user. The most important is what type of file or directory you have in there. You can download files from any web site, so your computer's internet service provider will usually provide them as well. For example, if

### Actual answer:
Microsoft Defender Antivirus has multiple layers of protection to catch malware and viruses. These include quick scans, full scans, and on
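
As the output above shows, `generated_text` contains the prompt followed by the model's continuation. A small helper can strip the prompt so that only the response is printed. This is a minimal sketch; the function name `extract_response` is hypothetical. The text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect.

```python
def extract_response(generated_text, prompt):
    """Return only the continuation when the output echoes the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt is not a prefix.
    return generated_text.strip()


demo_prompt = "### Instruction:\nSay hi\n\n### Response:"
demo_output = demo_prompt + " Hello there."
print(extract_response(demo_output, demo_prompt))
```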

# Step 2) RM: Reward Model
Train a reward model that scores answers, assigning a higher reward to the better answer for a given prompt.
- Dataset example
```json
[
    {
        "prompt": "",
        "completion_1": "",
        "completion_2": "",
        "completion_3": "",            
        "ranking": [1, 0, 2]
    }, ...
]
```
- Dataset sources
  - [Dahoas/rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
  - [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
  - [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
  - [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)

- References
    - [train_reward_model.py](https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/examples/train_reward_model.py)
    - [train_prompts.py](https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/examples/train_prompts.py)
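
Reward models are typically trained with a pairwise ranking loss that pushes the reward of the chosen answer above the reward of the rejected one. The sketch below shows the two usual formulations on scalar rewards. It assumes that `LogSigLoss` and `LogExpLoss` follow these standard definitions, which this notebook does not confirm; the function names are hypothetical.

```python
import math


def log_sigmoid_loss(chosen_reward, rejected_reward):
    """-log(sigmoid(r_chosen - r_rejected))"""
    diff = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-diff)))


def log_exp_loss(chosen_reward, rejected_reward):
    """log(1 + exp(r_rejected - r_chosen))"""
    return math.log(1.0 + math.exp(rejected_reward - chosen_reward))


print(log_sigmoid_loss(0.0, 0.0))   # equal rewards give ln(2)
print(log_sigmoid_loss(2.0, -1.0))  # chosen ranked well above rejected
print(log_exp_loss(2.0, -1.0))      # same value, different algebraic form
```

The two forms are algebraically identical. An untrained model that scores both answers equally starts at ln 2 ≈ 0.693, and the loss falls as the gap between chosen and rejected rewards grows.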

In [None]:
import os
import json
import argparse

import torch
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, BloomTokenizerFast
from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer
import loralib as lora

from nextgpt.dataset import RewardDataset
from nextgpt.models.base import RewardModel
from nextgpt.models.bloom import BLOOMRM
from nextgpt.models.gpt import GPTRM
from nextgpt.models.opt import OPTRM
from nextgpt.models import LogExpLoss, LogSigLoss
from nextgpt.trainer import RewardModelTrainer
from nextgpt.trainer.strategies import DDPStrategy, NaiveStrategy

In [None]:
# Define arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--output_dir", type=str, default="./output_2_rm")
parser.add_argument("--strategy",
                    type=str, 
                    choices=["naive", "ddp"],
                    default="naive")
parser.add_argument("--model", 
                    type=str, 
                    choices=["gpt2", "bloom", "opt"], 
                    default="gpt2")
parser.add_argument("--pretrain", type=str, default="gpt2")
parser.add_argument("--model_path", type=str, default=None)
# Use a store_true flag: with type=bool, argparse treats any non-empty string (even "False") as True.
parser.add_argument("--need_optim_ckpt", action="store_true")
parser.add_argument("--max_epochs", type=int, default=10)
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
parser.add_argument("--loss_fn", 
                    type=str, 
                    choices=["log_sig", "log_exp"],
                    default="log_sig")
parser.add_argument("--max_len", type=int, default=512)

args = parser.parse_args(args=[])

# For test.
args.max_epochs = 3
args.pretrain = "gpt2" # pretrained initial model.
args.verbose = True

print(args)
os.makedirs(args.output_dir, exist_ok=True)

Namespace(output_dir='./output_2_rm', strategy='naive', model='gpt2', pretrain='gpt2', model_path=None, need_optim_ckpt=False, max_epochs=3, batch_size=4, lora_rank=0, loss_fn='log_sig', max_len=512, verbose=True)


In [None]:
# Configure strategy.
if args.strategy == "naive":
    strategy = NaiveStrategy()
elif args.strategy == "ddp":
    strategy = DDPStrategy()
else:
    raise ValueError(f"Unsupported strategy: {args.strategy}")

In [None]:
# Configure model.
with strategy.model_init_context():
    if args.model == "gpt2":
        model = GPTRM(pretrained=args.pretrain, lora_rank=args.lora_rank).to(torch.cuda.current_device())
    elif args.model == "bloom":
        model = BLOOMRM(pretrained=args.pretrain, lora_rank=args.lora_rank).to(torch.cuda.current_device())
    elif args.model == "opt":
        model = OPTRM(pretrained=args.pretrain, lora_rank=args.lora_rank).to(torch.cuda.current_device()) 
    else:
        raise ValueError(f"Unsupported model: {args.model}")

    # Optionally load the SFT model's state dict when a path is given.
    # In this example model_path is None, so the reward model starts from the pretrained language model instead of the SFT model.
    if args.model_path is not None:
        state_dict = torch.load(args.model_path)
        model.model.load_state_dict(state_dict)

# Casting to float16 (e.g. `model.half()`) may cause NaN losses.
# See:
#   https://stackoverflow.com/questions/65332165/loss-is-nan-when-fine-tuning-huggingface-nli-model-both-roberta-bart
#   https://github.com/huggingface/transformers/issues/9160
# model = model.to(torch.float16)

# Configure tokenizer.
if args.model == "gpt2":
    tokenizer = AutoTokenizer.from_pretrained(
        "gpt2", 
        # bos_token="<|startoftext|>",
        # eos_token="<|endoftext|>", 
        # pad_token="<|pad|>",
        # padding_side="right", 
        model_max_length=args.max_len,
        )
    tokenizer.pad_token = tokenizer.eos_token
    print(tokenizer)
    # model.resize_token_embeddings(len(tokenizer)) 
elif args.model == "bloom":
    tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
    tokenizer.pad_token = tokenizer.eos_token
elif args.model == "opt":
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")  
    tokenizer.pad_token = tokenizer.eos_token    
else:
    raise ValueError(f"Unsupported model: {args.model}")

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'})


In [None]:
# Get the dataset.
dataset_webgpt_comp = load_dataset("openai/webgpt_comparisons", split="train[:20%]")



In [None]:
# Convert data into ranking format.
data_list_ranking = []
for row in dataset_webgpt_comp:
    question = row["question"]["full_text"]
    answer_0 = row["answer_0"]
    answer_1 = row["answer_1"]
    score_0 = row["score_0"]
    score_1 = row["score_1"]
    if answer_0 == "" or answer_1 == "" or (score_0 == score_1):
        continue

    ranking = [0 if score_0 > score_1 else 1, 0 if score_0 < score_1 else 1]
    data_list_ranking.append({
        "prompt": PROMPT_TEMPLATE.format_map({"instruction": question}),
        "completion_0": answer_0,
        "completion_1": answer_1,
        "ranking": ranking
    })

data_list_ranking[:2]

[{'prompt': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nVoiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?\n\n### Response:',
  'completion_0': 'The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]',
  'completion_1': "Apu Nahasapeemapetilon is a recurring character in the American animated television series The Simpsons. He is an Indian immigrant proprietor who runs the Kwik-E-Mart, a popular convenience store in Springfield. [1] He was based on Peter Seller's character in the film The Party. [2]",
  'ranking': [0, 1]},
 {'prompt': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nHeterophobia

In [None]:
# Convert the ranking data into (chosen, rejected) pairs for the reward model dataset.
total_data_ranking2chosen = []
for tmp in data_list_ranking:
    one_data_ranking2chosen = []

    # data 1) 0 VS 1
    data = {}
    data["prompt"] = tmp["prompt"]
    if tmp["ranking"][0] < tmp["ranking"][1]:
        data["chosen"] = tmp["completion_0"]
        data["rejected"] = tmp["completion_1"]
    else:
        data["chosen"] = tmp["completion_1"]
        data["rejected"] = tmp["completion_0"]
    one_data_ranking2chosen.append(data)

    # # data 2) 0 VS 2
    # data = {}
    # data["prompt"] = tmp["prompt"]
    # if tmp["ranking"][0] < tmp["ranking"][2]:
    #     data["chosen"] = tmp["completion_0"]
    #     data["rejected"] = tmp["completion_2"]
    # else:
    #     data["chosen"] = tmp["completion_2"]
    #     data["rejected"] = tmp["completion_0"]
    # one_data_ranking2chosen.append(data)

    # # data 3) 1 VS 2
    # data = {}
    # data["prompt"] = tmp["prompt"]
    # if tmp["ranking"][1] < tmp["ranking"][2]:
    #     data["chosen"] = tmp["completion_1"]
    #     data["rejected"] = tmp["completion_2"]
    # else:
    #     data["chosen"] = tmp["completion_2"]
    #     data["rejected"] = tmp["completion_1"]
    # one_data_ranking2chosen.append(data)


    total_data_ranking2chosen.extend(one_data_ranking2chosen)


print("before data num: %d" % (len(data_list_ranking)))
print("after data num: %d" % (len(total_data_ranking2chosen)))
print("data example: \n%s" % total_data_ranking2chosen[1])

before data num: 2747
after data num: 2747
data example: 
{'prompt': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nHeterophobia is the irrational fear of what\n\n### Response:', 'chosen': ' Heterophobia is the irrational fear of the opposite sex, coined as Sexophobia [1]. This phobia can be caused by genetics, heredity, negative experiences with the opposite sex, or a combination of these [1].  Symptoms may result from encountering people of the opposite sex, including breathlessness, dizziness, excessive sweating, nausea, dry mouth, feeling sick, shaking, coronary heart palpitations, and anxiety [1].', 'rejected': 'In modern times, there has been a rise in what is called heterophobia; the irrational fear of, discrimination against, or aversion to heterosexual people. [1][2] The word "heterophobia" is a play on the word "homophobia," which describes the 

In [None]:
# Prepare for data and dataset.
import random
random.seed(230319)

random.shuffle(total_data_ranking2chosen)
print(total_data_ranking2chosen[1])

# train_data = total_data_ranking2chosen[:-1000]
# eval_data = total_data_ranking2chosen[-1000:]
# Use only a very small subset of the data for quicker training.
train_data = total_data_ranking2chosen[:100]
val_data = total_data_ranking2chosen[100:130]
eval_data = total_data_ranking2chosen[130:160]

train_dataset = RewardDataset(train_data, tokenizer, args.max_len)
val_dataset = RewardDataset(val_data, tokenizer, args.max_len)
eval_dataset = RewardDataset(eval_data, tokenizer, args.max_len)

# Check
idx = 10
print("#" * 70)
print("## prompt ##")
print(train_data[idx]["prompt"])
print("#" * 70)
print("## chosen ##")
print(train_data[idx]["chosen"])
print("#" * 70)
print("## rejected ##")
print(train_data[idx]["rejected"])

{'prompt': "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nHow does one juice a prune? If prunes are just dehydrated plums, shouldn't prune juice just be plum juice?\n\n### Response:", 'chosen': 'You can juice dried prunes by steaming or simmering them to rehydrate them, running them through a strainer to remove the pits, seeds and skin, and then adding more water to the resulting pruney paste. [1] You don’t have to do that, though, because you could also just juice a fresh prune. Contrary to popular belief, prunes aren’t simply dried plums, but a group of cultivars, or varieties, of plum that are well suited to drying. [2][3] ', 'rejected': 'While prunes are not simply dried plums, they are a type of dried plum. [2][3]  To juice a prune, you must first steam or simmer them to rehydrate them, and then run them through a strainer to remove the pits, seeds, 

100%|██████████| 100/100 [00:00<00:00, 451.13it/s]
100%|██████████| 30/30 [00:00<00:00, 520.23it/s]
100%|██████████| 30/30 [00:00<00:00, 508.84it/s]

######################################################################
## prompt ##
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Why do major cell phone carriers allow companies like MetroPCS, Cricket Wireless, Boost Mobile, etc. to resell their network?

And for a cheaper price, too? I don't understand.

### Response:
######################################################################
## chosen ##
These companies known as mobile virtual network operators or MVNOs are able to resell network services in bulk from a regular carrier and then resell them to end-users, usually for cheaper prices than that carrier [2,3]. That’s still profitable for MVNOs because they don’t have to pay anything for the upkeep and modernization of the wireless network they’re using, therefore they can afford to lower the rates on voice calls, messages and data in order to attrac




In [None]:
# Configure optimizer.
optim = Adam(model.parameters(), lr=5e-5)

# Configure loss function.
if args.loss_fn == "log_sig":
    loss_fn = LogSigLoss()
elif args.loss_fn == "log_exp":
    loss_fn = LogExpLoss()
else:
    raise ValueError(f"Unsupported loss function: {args.loss_fn}")

In [None]:
trainer = RewardModelTrainer(model=model,
                            strategy=strategy,
                            optim=optim,
                            loss_fn=loss_fn,
                            train_dataset=train_dataset,
                            valid_dataset=val_dataset,
                            eval_dataset=eval_dataset,
                            batch_size=args.batch_size,
                            max_epochs=args.max_epochs)

In [None]:
# Train!!!
trainer.fit()

# Save model checkpoint after fitting on only rank0.
# strategy.save_model(model, os.path.join(args.output_dir, "rm.pt"), only_rank0=True)
trainer.save_model(path=os.path.join(args.output_dir, "rm.pt"), only_rank0=True)

# Save optimizer checkpoint on all ranks.
if args.need_optim_ckpt:
    strategy.save_optimizer(trainer.optimizer,
                            os.path.join(args.output_dir, "rm_optim_checkpoint_%d.pt" % (torch.cuda.current_device())),
                            only_rank0=False)

Train epoch:   0%|          | 0/3 [00:00<?, ?it/s]
Train step of epoch 0:   0%|          | 0/25 [00:00<?, ?it/s]
Train step of epoch 0:   4%|▍         | 1/25 [00:00<00:08,  2.97it/s]
Train step of epoch 0:   4%|▍         | 1/25 [00:00<00:08,  2.97it/s, loss=0.705, dist=0, acc=0]
Train step of epoch 0:   8%|▊         | 2/25 [00:00<00:06,  3.83it/s, loss=0.705, dist=0, acc=0]
Train step of epoch 0:   8%|▊         | 2/25 [00:00<00:06,  3.83it/s, loss=0.689, dist=0, acc=0]
Train step of epoch 0:  12%|█▏        | 3/25 [00:00<00:05,  4.12it/s, loss=0.689, dist=0, acc=0]
Train step of epoch 0:  12%|█▏        | 3/25 [00:00<00:05,  4.12it/s, loss=0.696, dist=0, acc=0]
Train step of epoch 0:  16%|█▌        | 4/25 [00:00<00:04,  4.30it/s, loss=0.696, dist=0, acc=0]
Train step of epoch 0:  16%|█▌        | 4/25 [00:01<00:04,  4.30it/s, loss=0.689, dist=0, acc=0]
Train step of epoch 0:  20%|██        | 5/25 [00:01<00:04,  4.40it/s, loss=0.689, dist=0, acc=0]
Train step 
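
The progress bar reports `dist` and `acc` next to the loss. A plausible reading, which this notebook does not confirm, is that `dist` is the mean reward gap between chosen and rejected answers and `acc` is the fraction of pairs where the chosen answer scores higher. The sketch below computes both under that assumption; the function name `pairwise_metrics` is hypothetical.

```python
def pairwise_metrics(chosen_rewards, rejected_rewards):
    """Mean reward gap and pairwise accuracy over a batch of pairs."""
    diffs = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    dist = sum(diffs) / len(diffs)
    acc = sum(d > 0 for d in diffs) / len(diffs)
    return dist, acc


dist, acc = pairwise_metrics([1.0, 0.5, -0.2], [0.0, 0.75, -0.4])
print(f"dist={dist:.3f}, acc={acc:.3f}")
```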

# Step 3) PPO: Proximal Policy Optimization
Further fine-tune the SFT model from step 1 with reinforcement learning (e.g. PPO), using the reward model from step 2 and the prompt-only dataset.

- References
    - [train_prompts.py](https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/examples/train_prompts.py)
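
At the core of PPO is a clipped surrogate objective that limits how far a single update can move the policy away from the one that generated the samples. The sketch below evaluates that objective for one action using scalar log-probabilities. It is a textbook illustration rather than the `PPOTrainer` implementation; the name `ppo_clipped_loss` and the default `clip_eps=0.2` are assumptions.

```python
import math


def ppo_clipped_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Negative clipped surrogate objective for a single action."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the pessimistic (smaller) objective, then negate to get a loss.
    return -min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at 1 + clip_eps.
print(ppo_clipped_loss(math.log(1.5), 0.0, advantage=1.0))
# Ratio 1.5 with a negative advantage: the penalty is not capped.
print(ppo_clipped_loss(math.log(1.5), 0.0, advantage=-1.0))
```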

In [None]:
import argparse
from copy import deepcopy

import pandas as pd

import torch
from torch.optim import Adam
from transformers import AutoTokenizer, BloomTokenizerFast
from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer

from nextgpt.models.base import RewardModel
from nextgpt.models.bloom import BLOOMActor, BLOOMCritic
from nextgpt.models.gpt import GPTActor, GPTCritic
from nextgpt.models.opt import OPTActor, OPTCritic
from nextgpt.trainer import PPOTrainer
from nextgpt.trainer.strategies import DDPStrategy, NaiveStrategy
from nextgpt.dataset import PromptDataset, SupervisedDataset, DataCollatorForSupervisedDataset

import json
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
# Define arguments.
parser = argparse.ArgumentParser()

parser.add_argument("--output_dir", type=str, default="./output_3_ppo")
parser.add_argument("--strategy",
                    type=str,
                    choices=["naive", "ddp"],
                    default="naive")
parser.add_argument("--model", 
                    type=str, 
                    choices=["gpt2", "bloom", "opt"],
                    default="gpt2")
parser.add_argument("--pretrain", type=str, default=None)
parser.add_argument("--rm_model", 
                    type=str, 
                    choices=["gpt2", "bloom", "opt"],
                    default="gpt2")
parser.add_argument("--rm_path", type=str, default=None)
parser.add_argument("--rm_pretrain", type=str, default=None)
# A store_true flag avoids the argparse pitfall where type=bool treats any non-empty string as True.
parser.add_argument("--need_optim_ckpt", action="store_true")
parser.add_argument("--num_episodes", type=int, default=10)
parser.add_argument("--max_timesteps", type=int, default=3)
parser.add_argument("--update_timesteps", type=int, default=3)
parser.add_argument("--max_epochs", type=int, default=5)
parser.add_argument("--train_batch_size", type=int, default=8)
parser.add_argument("--ptx_batch_size", type=int, default=1)
parser.add_argument("--experience_batch_size", type=int, default=8)
parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
parser.add_argument("--kl_coef", type=float, default=0.1)
parser.add_argument("--ptx_coef", type=float, default=0.9)
args = parser.parse_args(args=[])

# For test
# args.pretrain = "gpt2"
args.pretrain = "./output_1_sft"
args.rm_path = "./output_2_rm/rm.pt"  # RM model path
args.rm_pretrain = "gpt2"

args.num_episodes = 1
args.max_epochs   = 1

print(args)
os.makedirs(args.output_dir, exist_ok=True)

Namespace(output_dir='./output_3_ppo', strategy='naive', model='gpt2', pretrain='./output_1_sft', rm_model='gpt2', rm_path='./output_2_rm/rm.pt', rm_pretrain='gpt2', need_optim_ckpt=False, num_episodes=1, max_timesteps=3, update_timesteps=3, max_epochs=1, train_batch_size=8, ptx_batch_size=1, experience_batch_size=8, lora_rank=0, kl_coef=0.1, ptx_coef=0.9)
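
The three schedule arguments (`num_episodes`, `max_timesteps`, `update_timesteps`) interact with each other. The sketch below assumes `PPOTrainer.fit` keeps the loop structure of the referenced ColossalAI `train_prompts.py` example: experience is collected at every timestep, and a learning phase runs whenever the global step count is a multiple of `update_timesteps`. `count_updates` is a hypothetical helper for illustration only.

```python
def count_updates(num_episodes, max_timesteps, update_timesteps):
    """Count how many learning phases the assumed training loop would trigger."""
    step = 0
    updates = 0
    for _ in range(num_episodes):
        for _ in range(max_timesteps):
            step += 1  # one batch of experience is collected here
            if step % update_timesteps == 0:
                updates += 1  # the buffer is trained on, then cleared
    return updates


print(count_updates(num_episodes=1, max_timesteps=3, update_timesteps=3))   # 1
print(count_updates(num_episodes=10, max_timesteps=3, update_timesteps=3))  # 10
```

Under that assumption, the test settings above (`num_episodes = 1`) produce a single learning phase over three collected experience batches.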


In [None]:
# Configure strategy.
if args.strategy == "naive":
    strategy = NaiveStrategy()
elif args.strategy == "ddp":
    strategy = DDPStrategy()
else:
    raise ValueError(f"Unsupported strategy: {args.strategy}")

In [None]:
if args.rm_path is not None:
    rm_state_dict = torch.load(args.rm_path, map_location="cpu")

# Configure initial model.
if args.model == "gpt2":
    initial_model = GPTActor(pretrained=args.pretrain)
elif args.model == "bloom":
    initial_model = BLOOMActor(pretrained=args.pretrain)
elif args.model == "opt":
    initial_model = OPTActor(pretrained=args.pretrain)
else:
    raise ValueError(f"Unsupported actor model: {args.model}")

# Configure reward model.
if args.rm_model == "gpt2":
    reward_model = GPTRM(pretrained=args.rm_pretrain)
elif args.rm_model == "bloom":
    reward_model = BLOOMRM(pretrained=args.rm_pretrain)
elif args.rm_model == "opt":
    reward_model = OPTRM(pretrained=args.rm_pretrain)
else:
    raise ValueError(f"Unsupported reward model: {args.rm_model}")

if args.rm_path is not None:
    reward_model.load_state_dict(rm_state_dict)

# initial_model.to(torch.float16).to(torch.cuda.current_device())
# reward_model.to(torch.float16).to(torch.cuda.current_device())
initial_model.to(torch.cuda.current_device())
reward_model.to(torch.cuda.current_device())

# Configure actor and critic.
with strategy.model_init_context():
    # Actor
    if args.model == "gpt2":
        actor = GPTActor(pretrained=args.pretrain, lora_rank=args.lora_rank)
    elif args.model == "bloom":
        actor = BLOOMActor(pretrained=args.pretrain, lora_rank=args.lora_rank)
    elif args.model == "opt":
        actor = OPTActor(pretrained=args.pretrain, lora_rank=args.lora_rank)        
    else:
        raise ValueError(f"Unsupported actor model: {args.model}")

    # Critic
    if args.rm_model == "gpt2":
        critic = GPTCritic(pretrained=args.rm_pretrain, lora_rank=args.lora_rank)
    elif args.rm_model == "bloom":
        critic = BLOOMCritic(pretrained=args.rm_pretrain, lora_rank=args.lora_rank)
    elif args.rm_model == "opt":
        critic = OPTCritic(pretrained=args.rm_pretrain, lora_rank=args.lora_rank)
    else:
        raise ValueError(f"Unsupported reward model: {args.rm_model}")

    if args.rm_path is not None:
        critic.load_state_dict(rm_state_dict)
        del rm_state_dict

# critic.to(torch.float16).to(torch.cuda.current_device())
# actor.to(torch.float16).to(torch.cuda.current_device())
critic.to(torch.cuda.current_device())
actor.to(torch.cuda.current_device())

# Configure tokenizer.
if args.model == "gpt2":
    tokenizer = GPT2Tokenizer.from_pretrained(
        "gpt2", 
        # bos_token="<|startoftext|>",
        # eos_token="<|endoftext|>", 
        # pad_token="<|pad|>",
        # padding_side="right", 
        model_max_length=512,
        )
    tokenizer.pad_token = tokenizer.eos_token
elif args.model == "bloom":
    tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
    tokenizer.pad_token = tokenizer.eos_token
elif args.model == "opt":
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
    tokenizer.pad_token = tokenizer.eos_token
else:
    raise ValueError(f"Unsupported model: {args.model}")

In [None]:
# Configure optimizer.
actor_optim = Adam(actor.parameters(), lr=1e-7)
critic_optim = Adam(critic.parameters(), lr=1e-7)

In [None]:
# Setting the models.
(actor, actor_optim), (critic, critic_optim) = strategy.prepare((actor, actor_optim), (critic, critic_optim))

In [None]:
def tokenize_fn(texts):
    # Inputs MUST be padded to max length so that all ranks see the same length.
    # Different lengths can cause a hang when using gemini, because ranks would run different numbers of generation steps.
    batch = tokenizer(texts, return_tensors="pt", max_length=96, padding="max_length", truncation=True)
    return {k: v.to(torch.cuda.current_device()) for k, v in batch.items()}
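
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a pure-Python sketch of what happens to one sequence of token ids. `pad_or_truncate` is a hypothetical helper and the input ids are arbitrary. The pad id 50256 is GPT-2's end-of-text id, which this notebook reuses as the pad token, and right-side padding (the tokenizer default) is assumed.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])  # truncation
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # right-side padding


ids, mask = pad_or_truncate([15496, 995], max_length=5, pad_id=50256)
print(ids)   # [15496, 995, 50256, 50256, 50256]
print(mask)  # [1, 1, 0, 0, 0]
```

Because every prompt comes out at the same length, each rank runs generation from an identically shaped tensor.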

In [None]:
# Prepare dataset.
dataset_webgpt_comp = load_dataset("openai/webgpt_comparisons", split="train[:20%]")

data_list = []
for row in dataset_webgpt_comp:
    question = row["question"]["full_text"]
    answer_0 = row["answer_0"]
    data_list.append({
        "instruction": question,
        "completion": answer_0
    })

print(data_list[:1])



[{'instruction': 'Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?', 'completion': 'The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]'}]


In [None]:
# Configure dataloader.
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)

prompt_dataset = PromptDataset(
    dataset=data_list, 
    tokenizer=tokenizer, 
    prompt_template=PROMPT_TEMPLATE, 
    max_datasets_size=10000)

prompt_sampler = None
if dist.is_initialized() and dist.get_world_size() > 1:
    prompt_sampler = DistributedSampler(prompt_dataset, shuffle=True, seed=42, drop_last=True)

prompt_dataloader = DataLoader(
    prompt_dataset,
    shuffle=(prompt_sampler is None),
    sampler=prompt_sampler,
    batch_size=args.train_batch_size)

pretrain_dataset = SupervisedDataset(
    dataset=data_list,
    tokenizer=tokenizer, 
    prompt_template=PROMPT_TEMPLATE,
    completion_field="completion",
    max_datasets_size=10000,
    max_length=512,
    verbose=True)

pretrain_sampler = None
if dist.is_initialized() and dist.get_world_size() > 1:
    pretrain_sampler = DistributedSampler(pretrain_dataset, shuffle=True, seed=42, drop_last=True)

pretrain_dataloader = DataLoader(
    pretrain_dataset,
    shuffle=(pretrain_sampler is None),
    sampler=pretrain_sampler,
    batch_size=args.ptx_batch_size,
    collate_fn=data_collator)


Loading data...
Limiting dataset to 10000 examples.
Loading data...
Limiting dataset to 10000 examples.
Formatting inputs...
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?

### Response:
The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]<|endoftext|>
Tokenizing inputs... This may take some time...
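
The formatted example printed above is consistent with a template like the following. This is a reconstruction from that output, not the notebook's actual `PROMPT_TEMPLATE`; `EXAMPLE_TEMPLATE` is a hypothetical name, and the exact whitespace after `### Response:` is an assumption.

```python
EXAMPLE_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)

row = {"instruction": "What is RLHF?", "completion": "..."}
# format_map fills {instruction}; keys the template does not use are ignored.
prompt = EXAMPLE_TEMPLATE.format_map(row)
print(prompt)
```

A prompt-only dataset stops at `### Response:`, while a supervised dataset appends the completion and the end-of-text token after it.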


In [None]:
# Configure trainer.
trainer = PPOTrainer(
    strategy,
    actor,
    critic,
    reward_model,
    initial_model,
    actor_optim,
    critic_optim,
    kl_coef=args.kl_coef,
    ptx_coef=args.ptx_coef,
    max_epochs=args.max_epochs,
    train_batch_size=args.train_batch_size,
    experience_batch_size=args.experience_batch_size,
    tokenizer=tokenize_fn,
    max_length=128,
    do_sample=True,
    temperature=1.0,
    top_k=50,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

trainer.fit(
    prompt_dataloader=prompt_dataloader,
    pretrain_dataloader=pretrain_dataloader,
    num_episodes=args.num_episodes,
    max_timesteps=args.max_timesteps,
    update_timesteps=args.update_timesteps)

# Save model checkpoint after fitting on only rank0.
trainer.save_model(os.path.join(args.output_dir, "actor.pt"), only_rank0=True, tokenizer=tokenizer)
# Save optimizer checkpoint on all ranks.
strategy.save_optimizer(actor_optim,
                        os.path.join(args.output_dir, "actor_optim_checkpoint_%d.pt" % (torch.cuda.current_device())),
                        only_rank0=False)

Episode [1/1]:  67%|██████▋   | 2/3 [00:01<00:00,  1.17it/s]
Train epoch [1/1]:   0%|          | 0/3 [00:00<?, ?it/s]
Train epoch [1/1]:   0%|          | 0/3 [00:00<?, ?it/s, reward=-1.36]
Train epoch [1/1]:  33%|███▎      | 1/3 [00:00<00:00,  4.04it/s, reward=-1.36]
Train epoch [1/1]:  33%|███▎      | 1/3 [00:00<00:00,  4.04it/s, reward=-.964]
Train epoch [1/1]:  67%|██████▋   | 2/3 [00:00<00:00,  4.80it/s, reward=-.964]
Train epoch [1/1]:  67%|██████▋   | 2/3 [00:00<00:00,  4.80it/s, reward=-1.21]
Train epoch [1/1]: 100%|██████████| 3/3 [00:00<00:00,  4.94it/s, reward=-1.21]
Episode [1/1]: 100%|██████████| 3/3 [00:03<00:00,  1.13s/it]
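
For orientation, the quantity PPO minimises for the actor is the clipped surrogate loss, and `ptx_coef` plausibly blends it with the language-modelling loss on the pretraining batch. The sketch below uses plain Python floats for a single action. The clip range `eps=0.2` and the blending rule `ptx_coef * ptx_loss + (1 - ptx_coef) * policy_loss` are assumptions based on the standard PPO formulation and the ColossalAI reference, not confirmed details of `nextgpt`; both function names are hypothetical.

```python
import math


def clipped_policy_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate loss for a single action."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return -min(ratio * advantage, clipped * advantage)


def blended_actor_loss(policy_loss, ptx_loss, ptx_coef=0.9):
    """Blend the PPO loss with the pretraining (ptx) loss (assumed rule)."""
    return ptx_coef * ptx_loss + (1.0 - ptx_coef) * policy_loss


# The new policy made a good action (advantage > 0) much more likely,
# so the ratio is clipped at 1 + eps and the loss stops improving.
loss = clipped_policy_loss(logp_new=-0.5, logp_old=-1.5, advantage=2.0)
print(loss)
print(blended_actor_loss(policy_loss=loss, ptx_loss=3.0))
```

With `ptx_coef=0.9`, as in the arguments above, the pretraining loss would dominate the blended objective under this assumed rule.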


In [None]:
# Inference test.
def generation(input_text):
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(torch.cuda.current_device())
    outputs = actor.generate(input_ids,
                             max_length=100,
                             do_sample=True,
                             top_k=50,
                             top_p=0.95,
                             num_return_sequences=1)
    # outputs[0] holds the batch of generated sequences; decode it and keep the first text.
    output = tokenizer.batch_decode(outputs[0], skip_special_tokens=True)[0]
    print("#" * 70)
    print(output)
    return output


test_instruction_list = [
    "Heterophobia is the irrational fear of what",
    ]

test_prompt_list = [PROMPT_TEMPLATE.format_map({"instruction": tmp}) for tmp in test_instruction_list]

for input_text in test_prompt_list:
    output = generation(input_text)

######################################################################
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Heterophobia is the irrational fear of what

### Response:Heterophobia is the irrational fear of what will happen to you

If you ever read an internet article that claims your friend's boyfriend has a mental disorder, the first thing you would do is get on the internet, and the person in question will think you
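
The `top_k` and `top_p` arguments restrict which tokens can be sampled at each generation step. A minimal pure-Python sketch of the two filters applied to one probability distribution follows. `normalise` and `filter_top_k_top_p` are hypothetical helpers; real implementations such as Hugging Face `generate` operate on logits and differ in details.

```python
def normalise(pairs):
    """Rescale (token, probability) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(token, p / total) for token, p in pairs]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    ranked = normalise(ranked[:top_k])  # top-k filter
    kept, cumulative = [], 0.0
    for token, p in ranked:             # top-p (nucleus) filter
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return dict(normalise(kept))


probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(filter_top_k_top_p(probs, top_k=3, top_p=0.7))
```

In this toy example only "the" and "a" survive both filters, so sampling can never pick the two least likely tokens.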


# Inference by PPO actor

In [None]:
import argparse

In [None]:
parser = argparse.ArgumentParser()
parser.add_argument("--model", 
                    type=str, 
                    choices=["gpt2", "bloom", "opt"],
                    default="gpt2")
# We suggest using a pretrained model from Hugging Face; set --pretrain to configure the model.
parser.add_argument("--pretrain", type=str, default=None)
parser.add_argument("--model_path", type=str, default=None)
parser.add_argument("--input", type=str, default="Question: How are you ? Answer:")
parser.add_argument("--max_length", type=int, default=100)
args = parser.parse_args([])

# args.pretrain = "gpt2"
args.pretrain = "./output_1_sft"
args.model_path = "./output_3_ppo/actor.pt"

In [None]:
def eval(args):
    # Configure model.
    if args.model == "gpt2":
        actor = GPTActor(pretrained=args.pretrain).to(torch.cuda.current_device())
    elif args.model == "bloom":
        actor = BLOOMActor(pretrained=args.pretrain).to(torch.cuda.current_device())
    elif args.model == "opt":
        actor = OPTActor(pretrained=args.pretrain).to(torch.cuda.current_device())
    else:
        raise ValueError(f"Unsupported model: {args.model}")

    state_dict = torch.load(args.model_path)
    # actor.model.load_state_dict(state_dict)
    actor.load_state_dict(state_dict)

    # Configure tokenizer.
    if args.model == "gpt2":
        tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        tokenizer.pad_token = tokenizer.eos_token
    elif args.model == "bloom":
        tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
        tokenizer.pad_token = tokenizer.eos_token
    elif args.model == "opt":
        tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
    else:
        raise ValueError(f"Unsupported model: {args.model}")

    actor.eval()
    prompt = args.input  # avoid shadowing the built-in input()
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(torch.cuda.current_device())
    outputs = actor.generate(input_ids,
                             max_length=args.max_length,
                             do_sample=True,
                             top_k=10,
                             top_p=0.95,
                             num_return_sequences=1)
    output = tokenizer.batch_decode(outputs[0], skip_special_tokens=True)[0]
    print(output)

In [None]:
input_text = "Heterophobia is the irrational fear of what?"
args.input = PROMPT_TEMPLATE.format_map({"instruction": input_text})
eval(args)

odict_keys(['model.transformer.wte.weight', 'model.transformer.wpe.weight', 'model.transformer.h.0.ln_1.weight', 'model.transformer.h.0.ln_1.bias', 'model.transformer.h.0.attn.bias', 'model.transformer.h.0.attn.masked_bias', 'model.transformer.h.0.attn.c_attn.weight', 'model.transformer.h.0.attn.c_attn.bias', 'model.transformer.h.0.attn.c_proj.weight', 'model.transformer.h.0.attn.c_proj.bias', 'model.transformer.h.0.ln_2.weight', 'model.transformer.h.0.ln_2.bias', 'model.transformer.h.0.mlp.c_fc.weight', 'model.transformer.h.0.mlp.c_fc.bias', 'model.transformer.h.0.mlp.c_proj.weight', 'model.transformer.h.0.mlp.c_proj.bias', 'model.transformer.h.1.ln_1.weight', 'model.transformer.h.1.ln_1.bias', 'model.transformer.h.1.attn.bias', 'model.transformer.h.1.attn.masked_bias', 'model.transformer.h.1.attn.c_attn.weight', 'model.transformer.h.1.attn.c_attn.bias', 'model.transformer.h.1.attn.c_proj.weight', 'model.transformer.h.1.attn.c_proj.bias', 'model.transformer.h.1.ln_2.weight', 'model.tr