# RLAIF: Reinforcement Learning with AI Feedback

We've talked about RLHF - now we can talk about replacing that "H" with "AI"!

In this notebook, we'll walk through an example of creating a dataset curated by our AI constitution.

## Create AI Constitution

The first and most important task is to create our AI constitution.

This is a set of rules that keeps the dataset we're creating in line with our wants and expectations. Steering a model with a written set of rules like this is the idea behind "Constitutional AI".

The main advantages of this approach over RLHF are scale (the machine is cheaper than the human, and so can cover much more ground) and strong performance on self-refinement tasks.

You can read more about both of the concepts here:

- [Constitutional AI](https://arxiv.org/pdf/2212.08073.pdf)
- [Self-Refinement](https://arxiv.org/pdf/2303.17651.pdf)

Let's start by writing a simple constitution.

```python
ai_constitution = {
    0: "The model should not generate racist, sexist, hateful, or otherwise toxic outputs.",
    1: "The model should move conversation in a positive direction.",
    2: "The model should politely point out harmful assumptions from the human."
}
```
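Later steps need the constitution as text inside a prompt. As a minimal sketch, here is one way to render it as a numbered list. `format_constitution` is a hypothetical helper introduced for illustration and is not part of the original notebook.

```python
# Hypothetical helper (not part of the original notebook): render the
# constitution dict as a numbered list that can be pasted into a prompt.
ai_constitution = {
    0: "The model should not generate racist, sexist, hateful, or otherwise toxic outputs.",
    1: "The model should move conversation in a positive direction.",
    2: "The model should politely point out harmful assumptions from the human.",
}

def format_constitution(constitution):
    # Sort by key so the rule order is stable regardless of insertion order.
    lines = [f"{key + 1}. {rule}" for key, rule in sorted(constitution.items())]
    return "\n".join(lines)

print(format_constitution(ai_constitution))
```

Numbering from 1 is only a readability choice; the dictionary keys themselves are left untouched.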

#### 🏗️ Activity:

Please write your own AI constitution.

Your AI constitution should have 3 "rules"; you can use the above as a guide.

In [None]:
### YOUR CODE HERE



---



## Create SFT Dataset - Final Revision Dataset

Now that we have a constitution we can use it along with the self-refinement process to create a supervised fine-tuning dataset.

### Load Base Model

Let's grab our base model - in this case we'll use the familiar "Zephyr-7B" model!

In [None]:
!pip install -qU transformers accelerate bitsandbytes peft trl datasets tqdm

In [None]:
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-alpha"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config
)




As always, we complete some post-processing to ensure our tokenizer is set up properly!

In [None]:
base_model_tokenizer = AutoTokenizer.from_pretrained(model_id)

if getattr(base_model_tokenizer, "pad_token", None) is None:
    base_model_tokenizer.pad_token = base_model_tokenizer.eos_token


We'll create a `text-generation` pipeline to leverage our model!

In [None]:
import torch
from transformers import pipeline

base_pipeline = pipeline("text-generation", model=base_model, tokenizer=base_model_tokenizer)

### Build Critique Loop

The basic idea of the critique loop is simple:

1. Start with some prompt and obtain the model's generation.
2. Ask the model if that generation adheres to specific elements of the AI Constitution - rewriting the generation if it doesn't.
3. Repeat for each "rule" in the AI Constitution.
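The three steps above can be sketched without loading a model. In this sketch, `generate` is a stand-in for any function that maps a text and an optional instruction to new text, and `toy_generate` is a made-up stub so the example runs on a CPU. The instruction wording is illustrative only.

```python
# Sketch of the critique loop. `generate` is a stand-in for a model call:
# it takes (text, instruction) and returns text. Nothing here loads a model.
def critique_and_revise(prompt, constitution, generate):
    response = generate(prompt, None)  # step 1: initial generation
    for rule in constitution.values():  # step 3: one pass per rule
        # step 2a: critique the current response against a single rule
        critique = generate(response, f"Critique against this guideline: {rule}")
        # step 2b: rewrite the response so that it addresses the critique
        response = generate(response, f"Rewrite to address this critique: {critique}")
    return response

# Toy generator so the sketch runs anywhere: it records each instruction and
# appends "*" to the text whenever an instruction is given.
calls = []
def toy_generate(text, instruction):
    calls.append(instruction)
    return text if instruction is None else text + "*"

rules = {0: "Be kind.", 1: "Be honest."}
result = critique_and_revise("hello", rules, toy_generate)
print(result, len(calls))
```

Each rule costs two model calls (one critique, one revision), so a constitution with N rules needs 2N + 1 generations per prompt.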

Let's see how we could implement it in code!

But first, we'll need a dataset of prompts that will lead to generations we might wish to critique.

For that we will again use the `allenai/real-toxicity-prompts` dataset.

#### ❓ Question:

Why would we want to separate each "rule" in our constitution during this process?

In [None]:
from datasets import load_dataset

dataset_name="allenai/real-toxicity-prompts"

toxic_dataset = load_dataset(dataset_name, split="train")
toxic_dataset = toxic_dataset.select(range(5))

In [None]:
toxic_dataset

Dataset({
    features: ['filename', 'begin', 'end', 'challenging', 'prompt', 'continuation'],
    num_rows: 5
})

In [None]:
toxic_dataset[0]["prompt"]

{'text': 'Corruption involving the contractors is the chief culprit for the prison’s problems, according to a recent',
 'threat': 0.40394926,
 'insult': 0.26487392,
 'severe_toxicity': 0.20936702,
 'toxicity': 0.295593,
 'profanity': 0.171412,
 'sexually_explicit': 0.14422302,
 'identity_attack': 0.32305634,
 'flirtation': 0.4664734}

Next, we set up a helper function to collect generations.

In [None]:
def generate_response(sample, hf_pipeline, instruction=None):

  prompt_sample = sample

  messages = []

  if instruction:
    messages.append(
        {
            "role" : "system",
            "content" : instruction
        }
    )

  messages.append(
      {"role" : "user",
       "content" : prompt_sample}
  )

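  # add_generation_prompt=True appends the assistant turn marker, so the model replies instead of continuing the user turn
  prompt = hf_pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)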

  outputs = hf_pipeline(
      prompt,
      max_new_tokens=256,
      do_sample=True,
      temperature=0.3,
      top_k=50,
      top_p=0.95,
      return_full_text=False
  )

  return outputs[0]["generated_text"]
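To make the message structure in the helper above easier to inspect, here is its message-building step pulled out on its own. `build_messages` is a hypothetical name used only for this sketch, and the example strings are made up.

```python
# Sketch of the message-building step in isolation (hypothetical helper name).
def build_messages(user_text, instruction=None):
    messages = []
    if instruction:
        # The optional instruction travels in the system turn.
        messages.append({"role": "system", "content": instruction})
    # The text to respond to (or to critique) always goes in the user turn.
    messages.append({"role": "user", "content": user_text})
    return messages

plain = build_messages("Tell me about prisons.")
guided = build_messages("Tell me about prisons.", instruction="Answer politely.")
print(plain)
print(guided)
```

In the critique loop, the response being critiqued is passed as the user text and the guideline or critique is passed as the instruction.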

Now we can loop through our dataset and create revisions for each element of our AI Constitution!

In [None]:
final_revision = []

for row in toxic_dataset:
  base_response = generate_response(row["prompt"]["text"].strip(), base_pipeline)

  for i in range(len(ai_constitution)):
    # Ask the model to critique the current response against a single rule.
    modified_prompt = f"Does the following response follow these guidelines, if not, please explain why:\n\nGuidelines:\n{ai_constitution[i]}"
    critique = generate_response(base_response, base_pipeline, modified_prompt)

    # Ask for a rewrite guided by that critique, and keep only the last line of the output.
    critique_instruction = f"Respond with the original response, but reworded to conform to the following critique making no mention of the reasons for rewording.\n\nCritique:\n{critique}"
    base_response = generate_response(base_response, base_pipeline, critique_instruction).split("\n")[-1]

  print(base_response)
  final_revision.append(base_response)
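The loop above keeps only the text after the last newline of each revision, presumably on the assumption that the model writes the reworded response last. One thing to watch for is a trailing newline, which makes that last piece empty. The sketch below shows the effect on a made-up string. `last_nonempty_line` is a hypothetical alternative, not something the notebook uses.

```python
# Hypothetical helper: take the last non-empty line of a generation.
def last_nonempty_line(text):
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    return lines[-1] if lines else ""

# Made-up generation that ends with a newline.
sample = "Here is the reworded response:\n\nCorruption should be investigated fairly.\n"

naive = sample.split("\n")[-1]       # what a plain split keeps
robust = last_nonempty_line(sample)  # skips empty trailing lines

print(repr(naive))
print(repr(robust))
```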

### Collect Final Revisions

Let's look at our final revisions. You'll notice they're a bit odd, but we only really care that they adhere to our constitution and are coherent. The subject matter is completely irrelevant to us at this point.

In [None]:
for revision in final_revision:
  print(revision)

I completely agree with your perspective. Addressing the root causes of corruption is essential to breaking the cycle of decline and neglect that has plagued the prison. By strengthening procurement processes, increasing transparency and accountability, and implementing stricter penalties for contractors found to be engaging in corrupt behavior, the prison can begin to address the root causes of its problems and work towards a more sustainable and equitable future. This approach can help to build trust and confidence in the prison's ability to address its problems and work towards a more positive and sustainable future for the prison and the community it serves. It's crucial to approach this matter with a non-judgmental and objective perspective, and to focus on identifying the root causes of corruption rather than making assumptions. By doing so, the prison can develop targeted and effective strategies to address the issue, rather than relying on broad and potentially ineffective meas

#### ❓ Question:

Why does the content or domain of our dataset not matter at this stage?



---



## Fine-tune Model with SFT on Created Dataset (SL-CAI)

Now that we have a dataset of responses that have been revised against our constitution, we can fine-tune our base model on it so that it can help us select between candidate responses. This will become our "feedback model", which will sit in the place of our human feedback!

Let's start by selecting prompts from our dataset.

In [None]:
prompts = [sample["prompt"]["text"] for sample in toxic_dataset]

In [None]:
from datasets import Dataset
import pandas as pd

sft_dataset = Dataset.from_pandas(pd.DataFrame([{"prompt" : prompt, "response" : response} for prompt, response in zip(prompts, final_revision)]))

In [None]:
def map_dataset(row):
  return {"text" : f"### Input:\n{row['prompt']}\n\n### Response:\n{row['response']}"}
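To see what the training text looks like, here is the same template applied to a single made-up row. The row values are illustrative and do not come from the dataset.

```python
# The same template as map_dataset, applied to one made-up row.
def map_dataset(row):
    return {"text": f"### Input:\n{row['prompt']}\n\n### Response:\n{row['response']}"}

# Illustrative values only; real rows pair a toxicity prompt with its final revision.
example_row = {"prompt": "The weather today is", "response": "pleasant and mild."}

formatted = map_dataset(example_row)["text"]
print(formatted)
```

The `text` field produced here is the column that the trainer is later pointed at with `dataset_text_field="text"`.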

In [None]:
sft_dataset = sft_dataset.map(map_dataset)


We'll push our newly created dataset to the hub to save for later!

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
hf_username = "<<YOUR USERNAME HERE>>"

sft_dataset.push_to_hub(f"{hf_username}/llme2_sft_dataset_rlaif")


CommitInfo(commit_url='https://huggingface.co/datasets/c-s-ale/llme2_sft_dataset_rlaif/commit/c7044bc3b26ba59312c2910780f99ae41bd7ff42', commit_message='Upload dataset', commit_description='', oid='c7044bc3b26ba59312c2910780f99ae41bd7ff42', pr_url=None, pr_revision=None, pr_num=None)

Let's pull it back to verify it worked.

In [None]:
from datasets import load_dataset

sft_dataset = load_dataset(f"{hf_username}/llme2_sft_dataset_rlaif")


Now we can remove our old assets and begin the SFT step!

In [None]:
del base_pipeline
del base_model
torch.cuda.empty_cache()

We'll load the model as usual and prepare it for training!

In [None]:
from peft import LoraConfig
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "HuggingFaceH4/zephyr-7b-alpha"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

sft_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config
)


In [None]:
sft_model_tokenizer = AutoTokenizer.from_pretrained(model_id)

if getattr(sft_model_tokenizer, "pad_token", None) is None:
    sft_model_tokenizer.pad_token = sft_model_tokenizer.eos_token

In [None]:
sft_model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=2)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
  

We'll be leveraging LoRA to fine-tune our model. First, we prepare the quantized model for k-bit training. This needs to happen before the LoRA adapters are attached, because `prepare_model_for_kbit_training` freezes the parameters of the model it is given.

In [None]:
from peft import prepare_model_for_kbit_training
sft_model.config.use_cache = False
sft_model = prepare_model_for_kbit_training(sft_model)

Now we can initialize LoRA and wrap the model so that only the adapter weights are trainable.

In [None]:
from peft import LoraConfig, get_peft_model

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)

sft_model = get_peft_model(sft_model, peft_config)
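As a rough sanity check on the LoRA configuration, the adapter size can be estimated by hand. This sketch assumes the adapters attach only to `q_proj` and `v_proj` in each of the 32 decoder layers; no `target_modules` is set in the notebook, and the library's default may differ by version. The shapes come from the model printout above. Calling `sft_model.print_trainable_parameters()` on the wrapped model reports the actual number.

```python
# Back-of-the-envelope LoRA size for r=64, lora_alpha=16.
# ASSUMPTION: adapters attach only to q_proj and v_proj in each of 32 layers.
lora_r = 64
lora_alpha = 16
num_layers = 32

target_shapes = {
    "q_proj": (4096, 4096),  # (in_features, out_features) from the model printout
    "v_proj": (4096, 1024),
}

def lora_params(in_features, out_features, r):
    # LoRA adds two matrices per target module: A (r x in) and B (out x r).
    return r * in_features + out_features * r

per_layer = sum(lora_params(i, o, lora_r) for i, o in target_shapes.values())
total = per_layer * num_layers
scaling = lora_alpha / lora_r  # factor applied to the low-rank update

print(f"adapter parameters: {total:,}")
print(f"scaling factor: {scaling}")
```

Under these assumptions the adapters are well under 1% of the roughly 7B base parameters, which is what makes fine-tuning feasible on a single GPU.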

We'll use the standard hyper-parameters as usual.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

args = TrainingArguments(
  output_dir = "sft_zephyr",
  num_train_epochs=5,
  save_strategy="epoch",
  learning_rate=2e-4,
  bf16=True,
  lr_scheduler_type='constant',
)

max_seq_length = 2048

trainer = SFTTrainer(
    sft_model,
    tokenizer=sft_model_tokenizer,
    max_seq_length=max_seq_length,
    train_dataset=sft_dataset["train"],
    args=args,
    dataset_text_field="text",
)




And we can finally train!

In [None]:
trainer.train()



TrainOutput(global_step=5, training_loss=1.4779794692993165, metrics={'train_runtime': 8.9794, 'train_samples_per_second': 2.784, 'train_steps_per_second': 0.557, 'total_flos': 292297949798400.0, 'train_loss': 1.4779794692993165, 'epoch': 5.0})
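The `global_step=5` in the output can be checked with a little arithmetic. This sketch assumes a single device, the `TrainingArguments` default `per_device_train_batch_size` of 8, and no gradient accumulation; none of these are set explicitly in the notebook.

```python
import math

# Rough count of optimizer steps for this run.
# ASSUMPTIONS: one device, default per-device batch size of 8, no accumulation.
num_examples = 5
num_train_epochs = 5
per_device_train_batch_size = 8
gradient_accumulation_steps = 1

# All 5 examples fit in a single batch, so each epoch is one optimizer step.
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs

print(total_steps)
```

With only five optimizer steps, this run demonstrates the mechanics rather than producing a meaningfully adapted model; a real feedback model would need far more revised examples.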

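Let's push our newly created LoRA adapters to the hub! Note that the string passed to `trainer.push_to_hub` is used as the commit message; the repository name here comes from `output_dir` (`sft_zephyr`).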

In [None]:
trainer.push_to_hub("llme2_sft_model_rlaif")

CommitInfo(commit_url='https://huggingface.co/c-s-ale/sft_zephyr/commit/f77adae27c83c2ea43e4f937535a85d6ff69fabb', commit_message='llme2_sft_model_rlaif', commit_description='', oid='f77adae27c83c2ea43e4f937535a85d6ff69fabb', pr_url=None, pr_revision=None, pr_num=None)

Now we can prepare our model to be used in the Hugging Face pipeline we'll set up to generate our feedback!

In [None]:
sft_model = sft_model.merge_and_unload()



#### ❓ Question:

What purpose does this SFT step serve?



---




## Generate Harmlessness Dataset

Now that we have a model that is better aligned to our interests - we can have it substitute in for human feedback when creating a feedback dataset that will be used to train a reward model!

The basic idea is this:

1. Generate two responses to the same prompt.
2. Have the feedback model select which response is better.
3. Compile a dataset from that feedback.

In the end, you will have a dataset in a similar format to the [`hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset with "chosen" and "rejected" columns.
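The three steps above can be sketched as a small loop. Here `generate` and `choose` are stand-ins for model calls, and the toy versions exist only so the example runs without a GPU. The extra `prompt` column is a convenience for this sketch; `hh-rlhf` itself stores the full dialogue in its "chosen" and "rejected" columns.

```python
# Sketch of the data-collection loop. `generate` and `choose` are stand-ins
# for calls to the response model and the feedback model respectively.
def build_preference_rows(prompts, generate, choose):
    rows = []
    for prompt in prompts:
        response_a = generate(prompt)  # step 1: two samples for the same prompt
        response_b = generate(prompt)
        preferred = choose(prompt, response_a, response_b)  # step 2: "A" or "B"
        if preferred == "A":
            chosen, rejected = response_a, response_b
        else:
            chosen, rejected = response_b, response_a
        # step 3: store the pair in chosen/rejected form
        rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return rows

# Toy stand-ins: canned replies, and a "judge" that prefers the longer reply
# as a crude proxy. A real feedback model would apply the constitution.
replies = iter(["No.", "Happy to help!", "Go away.", "Sure, here is an overview."])
toy_generate = lambda prompt: next(replies)
toy_choose = lambda prompt, a, b: "A" if len(a) > len(b) else "B"

rows = build_preference_rows(["Hi", "Explain RLAIF"], toy_generate, toy_choose)
for row in rows:
    print(row)
```

A list of rows like this can be turned into a Hugging Face dataset the same way the SFT dataset was built earlier, via a pandas `DataFrame`.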



### Feedback Model

In [None]:
import torch
from transformers import pipeline

sft_pipeline = pipeline("text-generation", model=sft_model, tokenizer=sft_model_tokenizer)

#### 🏗️ Activity:

Please create a prompt template that will allow the model to select between two different responses.

You'll need to make sure you provide the following in your template:

- AI Constitution
- Response A
- Response B

Keep in mind that the model needs to be able to express *which* response it prefers. It doesn't need to explain why.
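Whatever template you write, the model's answer still has to be turned into a label. As a minimal sketch, assuming your template asks for a single letter, a small parser could look like the following. `parse_preference` is a hypothetical helper, and returning `None` for unclear answers is a design choice, not something the notebook prescribes.

```python
import re

# Hypothetical parser: pull a single-letter verdict out of free-form output.
def parse_preference(model_output):
    # Find standalone capital "A" or "B" tokens. Case is kept on purpose so
    # the lowercase article "a" is not mistaken for a verdict.
    letters = re.findall(r"\b([AB])\b", model_output)
    # Accept the answer only when exactly one of the two letters appears.
    return letters[0] if len(set(letters)) == 1 else None

print(parse_preference("Response B"))
print(parse_preference("A"))
print(parse_preference("Either A or B could work."))
```

Dropping ambiguous answers instead of guessing keeps noisy labels out of the preference dataset, at the cost of discarding some pairs.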

In [None]:
### YOUR CODE HERE

---

## Conclusion

At this point, you could follow the RLHF notebook and replace the `hh-rlhf` dataset with the one you created above to complete the alignment of your model!