In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from dotenv import load_dotenv
import os
from pathlib import Path



In [2]:
# Load environment variables
load_dotenv()

True

In [3]:
# Step 1: Load PDF
data_path = Path("../data")
file_path = data_path / "Reinforcement Learning from Human Feedback.pdf"
loader = PyPDFLoader(file_path)
documents = loader.load()

In [4]:
# Step 2: Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunks = splitter.split_documents(documents)
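
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping character windows. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this does not. The helper name `sliding_chunks` is hypothetical and not part of LangChain.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each window repeats the last three characters of the previous one, which is the role `chunk_overlap=128` plays above: a sentence cut at a chunk boundary still appears whole in at least one neighbor.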

In [5]:
print(chunks[0].page_content)

Reinforcement Learning from Human Feedback
A short introduction to RLHF and post-training focused on language models.
Nathan Lambert
2 July 2025
Abstract
Reinforcement learning from human feedback (RLHF) has become an important
technical and storytelling tool to deploy the latest machine learning systems. In this
book, we hope to give a gentle introduction to the core methods for people with some
level of quantitative background. The book starts with the origins of RLHF – both
in recent literature and in a convergence of disparate fields of science in economics,
philosophy, and optimal control. We then set the stage with definitions, problem
formulation, data collection, and other common math used in the literature. The
core of the book details every optimization stage in using RLHF, from starting with
instruction tuning to training a reward model and finally all of rejection sampling,
reinforcement learning, and direct alignment algorithms. The book concludes with


In [6]:
# Step 3: Create embeddings (stored in a Chroma vector store in the next cell)
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")


In [7]:
persist_directory = "./chroma_db"
vectorstore = Chroma.from_documents(
    chunks,
    embeddings,
    collection_metadata={"hnsw:space": "cosine"},  # use cosine distance for the HNSW index
    persist_directory=persist_directory,
)

In [8]:
# Step 4: Query and retrieve relevant docs
query = "what is RLHF"
results = vectorstore.similarity_search_with_score(query, k=3)

In [9]:
print("\nTop 3 Relevant Documents:")
for doc, score in results:
    # With "hnsw:space": "cosine", the score from Chroma is cosine distance (lower is better, range 0 to 2)
    # We convert it to cosine similarity (higher is better, range -1 to 1)
    similarity_score = 1 - score
    print(f"score: {similarity_score:.4f} content: {doc.page_content}")
    print("---------------------------------")


Top 3 Relevant Documents:
score: 0.5509 content: details key decisions and basic implementation examples for each step in this process.
RLHF has been applied to many domains successfully, with complexity increasing as the
techniques have matured. Early breakthrough experiments with RLHF were applied to
deep reinforcement learning[1], summarization [2], following instructions[3], parsing web
information for question answering[4], and “alignment”[5]. A summary of the early RLHF
recipes is shown below in fig.1.
Figure 1: A rendition of the early, three stage RLHF process with SFT, a reward model,
and then optimization.
In modern language model training, RLHF is one component of post-training. Post-training
is a more complete set of techniques and best-practices to make language models more useful
for downstream tasks [6]. Post-training can be summarized as using three optimization
methods:
1. Instruction / Supervised Finetuning (IFT/SFT), where we teach formatting and for
base of ins
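
The `1 - score` conversion above can be checked on toy vectors. This is a minimal sketch in plain Python, assuming the cosine distance is defined as one minus the cosine similarity; the function names are hypothetical and are not Chroma APIs.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity, so values lie in [0, 2]."""
    return 1.0 - cosine_similarity(a, b)


same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite directions

for name, dist in [("same", same), ("orthogonal", orthogonal), ("opposite", opposite)]:
    print(f"{name}: distance={dist:.2f} similarity={1 - dist:.2f}")
```

Vector length does not affect the result, only direction does. A similarity near 0.55, as in the output above, therefore means the query and the chunk point in moderately similar directions in embedding space.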

In [10]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
results = retriever.invoke(query)

print("\n🔍 Top 3 Relevant Documents:")
for doc in results:
    print(doc.page_content)
    print("----------- ----------------------")



🔍 Top 3 Relevant Documents:
details key decisions and basic implementation examples for each step in this process.
RLHF has been applied to many domains successfully, with complexity increasing as the
techniques have matured. Early breakthrough experiments with RLHF were applied to
deep reinforcement learning[1], summarization [2], following instructions[3], parsing web
information for question answering[4], and “alignment”[5]. A summary of the early RLHF
recipes is shown below in fig.1.
Figure 1: A rendition of the early, three stage RLHF process with SFT, a reward model,
and then optimization.
In modern language model training, RLHF is one component of post-training. Post-training
is a more complete set of techniques and best-practices to make language models more useful
for downstream tasks [6]. Post-training can be summarized as using three optimization
methods:
1. Instruction / Supervised Finetuning (IFT/SFT), where we teach formatting and for
base of instruction following

In [11]:
# ---------------------- Query and Retrieve Documents ----------------------
print("\n# ---------------------- Query and Retrieve Documents ----------------------")
query = "what is RLHF?"
# Retrieve with maximal marginal relevance (MMR).
# lambda_mult=1 ranks purely by relevance to the query (no diversity penalty).
retriever = vectorstore.as_retriever(
    search_type="mmr", 
    search_kwargs={"k": 3, "lambda_mult": 1}
)
results = retriever.invoke(query)

print("\n🔍 Top 3 Relevant Documents:")
for doc in results:
    print(doc.page_content)
    print("----------- ----------------------")

print("\n# ---------------------- lambda_mult = 0.1 ----------------------")

# lambda_mult=0.1 weights diversity heavily, so chunks similar to ones already selected are penalized.
retriever = vectorstore.as_retriever(
    search_type="mmr", 
    search_kwargs={"k": 3, "lambda_mult": 0.1}
)
results = retriever.invoke(query)

print("\n🔍 Top 3 Relevant Documents:")
for doc in results:
    print(doc.page_content)
    print("----------- ----------------------")


# ---------------------- Query and Retrieve Documents ----------------------

🔍 Top 3 Relevant Documents:
details key decisions and basic implementation examples for each step in this process.
RLHF has been applied to many domains successfully, with complexity increasing as the
techniques have matured. Early breakthrough experiments with RLHF were applied to
deep reinforcement learning[1], summarization [2], following instructions[3], parsing web
information for question answering[4], and “alignment”[5]. A summary of the early RLHF
recipes is shown below in fig.1.
Figure 1: A rendition of the early, three stage RLHF process with SFT, a reward model,
and then optimization.
In modern language model training, RLHF is one component of post-training. Post-training
is a more complete set of techniques and best-practices to make language models more useful
for downstream tasks [6]. Post-training can be summarized as using three optimization
methods:
1. Instruction / Supervised Finetun
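
The effect of `lambda_mult` is easier to see on toy vectors. The following is a minimal sketch of MMR selection in plain Python, not the retriever's actual code path: each candidate is scored as `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is its highest similarity to anything already selected. The names `cosine` and `mmr` and the example vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, lambda_mult):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    if not doc_vecs or k <= 0:
        return []
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    # The first pick is always the document most similar to the query.
    selected = [max(range(len(doc_vecs)), key=lambda i: relevance[i])]
    candidates = [i for i in range(len(doc_vecs)) if i != selected[0]]
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy: how close this candidate is to anything already chosen.
            redundancy = max(cosine(doc_vecs[i], doc_vecs[j]) for j in selected)
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.0],   # index 0: most relevant
    [0.99, 0.1],  # index 1: near-duplicate of index 0
    [0.6, 0.8],   # index 2: less relevant, but different
]
relevance_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
diversity_heavy = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.1)
print(relevance_only, diversity_heavy)
```

With `lambda_mult=1.0` the near-duplicate is picked second because only relevance counts. With `lambda_mult=0.1` the redundancy penalty dominates and the less similar document is picked instead.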

In [12]:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4.1",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # api_key="...",  # if you prefer to pass the API key in directly instead of using env vars
    # base_url="...",
    # organization="...",
    # other params...
)

In [13]:
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 1}),
    llm=llm
)

# invoke() replaces the deprecated get_relevant_documents(); k is already set via search_kwargs above
multi_results = multi_retriever.invoke(query)

for i, doc in enumerate(multi_results, 1):
    print(f"--- Chunk {i} ---")
    print(doc.page_content)
    print()



--- Chunk 1 ---
1 Introduction
Reinforcement learning from Human Feedback (RLHF) is a technique used to incorporate
human information into AI systems. RLHF emerged primarily as a method to solve hard to
specify problems. Its early applications were often in control problems and other traditional
domains for reinforcement learning (RL). RLHF became most known through the release
of ChatGPT and the subsequent rapid development of large language models (LLMs) and
other foundation models.
The basic pipeline for RLHF involves three steps. First, a language model that can follow
user questions must be trained (see Chapter 9). Second, human preference data must be
collected for the training of a reward model of human preferences (see Chapter 7). Finally,
the language model can be optimized with an RL optimizer of choice, by sampling genera-
tions and rating them with respect to the reward model (see Chapter 3 and 11). This book
details key decisions and basic implementation examples for each 
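
Conceptually, a multi-query retriever asks the LLM for several rephrasings of the question, runs the base retriever once per rephrasing, and returns the de-duplicated union of the hits. The sketch below illustrates that flow with stand-ins: `fake_generate_variants`, `fake_retrieve`, and `fake_index` are hypothetical stubs replacing the LLM and the vector store, and the chunk labels are invented for the example.

```python
def unique_union(result_lists):
    """Merge several result lists, keeping the first occurrence of each item."""
    seen = set()
    merged = []
    for results in result_lists:
        for item in results:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


def multi_query_retrieve(query, generate_variants, retrieve):
    """Retrieve once per query variant and return the de-duplicated union."""
    variants = generate_variants(query)
    return unique_union(retrieve(variant) for variant in variants)


# Hypothetical stand-in for the LLM that rewrites the question.
def fake_generate_variants(query):
    return ["define RLHF", "RLHF overview", "RLHF explained"]


# Hypothetical stand-in for a k=1 vector store lookup.
fake_index = {
    "define RLHF": ["Introduction"],
    "RLHF overview": ["Post-training summary"],
    "RLHF explained": ["Introduction"],
}


def fake_retrieve(variant):
    return fake_index.get(variant, [])


merged = multi_query_retrieve("what is RLHF?", fake_generate_variants, fake_retrieve)
print(merged)
```

This also shows why a base retriever with `k=1` can still yield more than one chunk: each variant contributes its own top hit, and only exact repeats are dropped.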