# Tune an LLM with RLHF

#### Preference dataset (see root directory)

- Each sample has four fields: "input_prompt", "candidate_0", "candidate_1", and "choice".
- "input_prompt" always ends with "... [summary]: ".
- "candidate_0" and "candidate_1" are two completions that were compared by a human.
- "choice" is the human's choice ("candidate_0" or "candidate_1").
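
To make the schema concrete, here is a minimal sketch of one preference record and a small validator. The field names come from the list above; the text values and the `validate_preference_sample` helper are hypothetical. Because the list does not say how "choice" is encoded, the sketch assumes it is either the candidate's key name or its index (0 or 1).

```python
import json

REQUIRED_KEYS = {"input_prompt", "candidate_0", "candidate_1", "choice"}
# Assumption: the choice is stored as the key name or as the index 0 / 1.
VALID_CHOICES = {"candidate_0", "candidate_1", 0, 1}

# Hypothetical record: the text values are invented for illustration.
sample = {
    "input_prompt": "I moved to a new city last month ... [summary]: ",
    "candidate_0": "Moved to a new city and struggling to meet people.",
    "candidate_1": "I moved.",
    "choice": "candidate_0",
}


def validate_preference_sample(record: dict) -> list[str]:
    """Return a list of problems found in one preference record (empty if none)."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    if not record["input_prompt"].endswith("[summary]: "):
        problems.append('input_prompt does not end with "[summary]: "')
    if record["choice"] not in VALID_CHOICES:
        problems.append(f"unexpected choice: {record['choice']!r}")
    return problems


# A .jsonl file stores one JSON object like this per line.
print(json.dumps(sample))
print(validate_preference_sample(sample))
```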

#### Prompt dataset (see root directory)

- Input prompt only, no response.

#### Environment setup

The Google Cloud Pipeline Components library includes an RLHF training pipeline. It can run on any platform that supports Kubeflow Pipelines, including Google Cloud's Vertex AI Pipelines.

To run it locally, install the following:

```Python
!pip3 install google-cloud-pipeline-components
!pip3 install kfp
```

## Compile the pipeline

In [1]:
import math

# Import (RLHF is currently in preview)
from google_cloud_pipeline_components.preview.llm import rlhf_pipeline

# Import the compiler from Kubeflow Pipelines
from kfp import compiler

# Define a path to the yaml file
RLHF_PIPELINE_PKG_PATH = "rlhf_pipeline.yaml"

In [2]:
# Execute the compile function
compiler.Compiler().compile(
    pipeline_func=rlhf_pipeline,
    package_path=RLHF_PIPELINE_PKG_PATH
)

In [3]:
# Print the first lines of the YAML file
!head rlhf_pipeline.yaml

# PIPELINE DEFINITION
# Name: rlhf-train-template
# Description: Performs reinforcement learning from human feedback.
# Inputs:
#    deploy_model: bool [Default: True]
#    eval_dataset: str
#    instruction: str
#    kl_coeff: float [Default: 0.1]
#    large_model_reference: str
#    location: str [Default: '{{$.pipeline_google_cloud_location}}']
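
The header comments list the pipeline's inputs together with their defaults. As a rough sketch, the helper below (a hypothetical `parse_pipeline_inputs`, not part of kfp) collects those lines into a dictionary. It assumes the comment layout shown in the output above and may need adjusting for other kfp versions.

```python
import re

# Matches header lines such as "#    kl_coeff: float [Default: 0.1]".
INPUT_LINE = re.compile(r"^#\s{2,}(\w+):\s*(\w+)(?:\s*\[Default:\s*(.*)\])?\s*$")


def parse_pipeline_inputs(yaml_text: str) -> dict:
    """Map each input name to (type, default) using the header comments.

    The default is None when the header lists no default for that input.
    """
    inputs = {}
    in_inputs = False
    for line in yaml_text.splitlines():
        if line.startswith("# Inputs:"):
            in_inputs = True
            continue
        if in_inputs:
            match = INPUT_LINE.match(line)
            if not match:
                break  # the first line that is not an input entry ends the section
            name, type_name, default = match.groups()
            inputs[name] = (type_name, default)
    return inputs


header = """\
# PIPELINE DEFINITION
# Name: rlhf-train-template
# Inputs:
#    deploy_model: bool [Default: True]
#    eval_dataset: str
#    kl_coeff: float [Default: 0.1]
"""
print(parse_pipeline_inputs(header))
```

To inspect the compiled file instead of the inline sample, pass the contents of `rlhf_pipeline.yaml` to the same function.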


## Define the Vertex AI pipeline job

- Define the location of the training and evaluation data
- Choose the foundation model (llama-2-7b) to be tuned
- Calculate the number of reward model training steps

For best results, the reward model is usually trained over the preference dataset for 20-30 epochs.

$$ stepsPerEpoch = \left\lceil \frac{datasetSize}{batchSize} \right\rceil$$
$$ trainSteps = stepsPerEpoch \times numEpochs$$

The RLHF pipeline takes the number of training steps as a parameter, not the number of epochs, so the epoch count has to be converted into steps.

In [4]:
PREF_DATASET_SIZE = 3000
BATCH_SIZE = 64
REWARD_NUM_EPOCHS = 30

REWARD_STEPS_PER_EPOCH = math.ceil(PREF_DATASET_SIZE / BATCH_SIZE)
reward_model_train_steps = REWARD_STEPS_PER_EPOCH * REWARD_NUM_EPOCHS

print("REWARD_STEPS_PER_EPOCH:", REWARD_STEPS_PER_EPOCH)
print("reward_model_train_steps:", reward_model_train_steps)

REWARD_STEPS_PER_EPOCH: 47
reward_model_train_steps: 1410


The policy model is usually trained over the prompt dataset for roughly 10-20 epochs.

Reward hacking: if given too many training steps, the policy model may exploit the reward and exhibit undesired behavior.

In [5]:
PROMPT_DATASET_SIZE = 2000
BATCH_SIZE = 64
RL_NUM_EPOCHS = 10

RL_STEPS_PER_EPOCH = math.ceil(PROMPT_DATASET_SIZE / BATCH_SIZE)
reinforcement_learning_train_steps = RL_STEPS_PER_EPOCH * RL_NUM_EPOCHS

print("RL_STEPS_PER_EPOCH:", RL_STEPS_PER_EPOCH)
print("reinforcement_learning_train_steps:", reinforcement_learning_train_steps)

RL_STEPS_PER_EPOCH: 32
reinforcement_learning_train_steps: 320
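
Both calculations follow the same pattern, so they can be folded into one function. The sketch below uses a hypothetical helper name, `train_steps`, and reproduces the two results above from the same dataset sizes, batch size, and epoch counts.

```python
import math


def train_steps(dataset_size: int, batch_size: int, num_epochs: int) -> int:
    """Convert an epoch count into the step count the pipeline expects."""
    if dataset_size <= 0 or batch_size <= 0 or num_epochs <= 0:
        raise ValueError("dataset_size, batch_size and num_epochs must be positive")
    # A partial final batch still counts as one step, hence the ceiling.
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return steps_per_epoch * num_epochs


# Same inputs as the two cells above.
print(train_steps(3000, 64, 30))  # 1410
print(train_steps(2000, 64, 10))  # 320
```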


### Define the instruction

- Choose the task-specific instruction that you want to use to tune the foundation model. For this example, the instruction is "Summarize in less than 50 words."
- You can choose a different instruction, for example, "Write a reply to the following question or comment." In that case, you also need to collect a preference dataset with the same instruction added to the prompt, so that both the responses and the human preferences are based on that instruction.

In [6]:
parameter_values = {
    "preference_dataset": "gs://vertex-ai/generative-ai/rlhf/text_small/summarize_from_feedback_tfds/comparisons/train/*.jsonl",
    "prompt_dataset": "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/train/*.jsonl",
    "eval_dataset": "gs://vertex-ai/generative-ai/rlhf/text_small/reddit_tfds/val/*.jsonl",
    "large_model_reference": "llama-2-7b",
    "reward_model_train_steps": 1410,  # results from the calculations above
    "reinforcement_learning_train_steps": 320,  # results from the calculations above
    "reward_model_learning_rate_multiplier": 1.0,
    "reinforcement_learning_rate_multiplier": 1.0,
    "kl_coeff": 0.1,  # KL penalty weight (same as the pipeline default); higher values reduce reward hacking
    "instruction": "Summarize in less than 50 words"
}
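
The `kl_coeff` parameter controls how strongly the tuned policy is penalized for drifting away from the original model. How the pipeline applies it internally is not shown in these notes. The following is only a sketch of the formulation commonly used in RLHF, where the reward model's score is reduced by `kl_coeff` times an estimate of the KL divergence between the tuned and reference models. The function name and the example log-probabilities are hypothetical.

```python
def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, kl_coeff=0.1):
    """Reward-model score minus a KL penalty estimated from one sampled completion.

    policy_logprobs / reference_logprobs: per-token log-probabilities that the
    tuned policy and the frozen reference model assign to the sampled tokens.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability lists must have the same length")
    # Summing log(pi / pi_ref) over the sampled tokens gives a single-sample KL estimate.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - kl_coeff * kl_estimate


# Hypothetical numbers: the policy is more confident than the reference on every token.
policy = [-0.2, -0.1, -0.3]
reference = [-1.0, -0.9, -1.1]
for coeff in (0.0, 0.1, 0.5):
    print(coeff, round(kl_penalized_reward(2.0, policy, reference, coeff), 3))
```

With these numbers the penalized reward shrinks as the coefficient grows, which is why a larger `kl_coeff` makes it less attractive for the policy to exploit the reward model.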

### Set up Google Cloud to run the Vertex AI pipeline

The Vertex AI SDK is already installed in this classroom environment. In any other environment, install it like this:
```Python
!pip3 install google-cloud-aiplatform
```

In [7]:
from utils import authenticate
credentials, PROJECT_ID, STAGING_BUCKET = authenticate()

# The RLHF pipeline is available in this region
REGION = "europe-west4"

## Run the PipelineJob on Vertex AI

The job does not run locally in the notebook; it runs remotely on Google Cloud's Vertex AI.

In [8]:
from google.cloud import aiplatform

aiplatform.init(
    project=PROJECT_ID,
    location=REGION,
    credentials=credentials
)

job = aiplatform.PipelineJob(
    display_name="tutorial-rlhf-tuning",
    pipeline_root=STAGING_BUCKET,
    template_path=RLHF_PIPELINE_PKG_PATH,
    parameter_values=parameter_values
)

- To run the pipeline job (takes about a full day to run with multiple TPUs / GPUs):

```Python
job.run()
```
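
Because the run takes so long, it can be convenient to submit the job without blocking the notebook. The sketch below wraps that choice in a hypothetical `launch` helper. It assumes the `submit()` and `wait()` methods of `aiplatform.PipelineJob` behave as in current versions of the Vertex AI SDK: `submit()` returns once the job is created, and `wait()` blocks until the run finishes.

```python
def launch(job, wait=True):
    """Submit a Vertex AI PipelineJob and optionally block until it finishes.

    `job` is expected to be an aiplatform.PipelineJob like the one defined above.
    """
    job.submit()    # returns as soon as the job has been created
    if wait:
        job.wait()  # blocks until the pipeline run reaches a final state
    return job


# Usage (not executed here, since it starts a real cloud job):
# launch(job, wait=False)
# print(job.state)
```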