In [2]:
import os
from dotenv import load_dotenv
load_dotenv()

True

### Set up your environment

In [3]:
# load_dotenv() has loaded the .env file; each assignment below raises a TypeError
# if its key is missing, which surfaces a bad configuration early.
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")  # OpenAI API key
os.environ["TAVILY_API_KEY"] = os.getenv("TAVILY_API_KEY")  # Tavily Search API key
os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGCHAIN_API_KEY")  # LangSmith API key
os.environ["LANGCHAIN_HUB_API_KEY"] = os.getenv("LANGCHAIN_API_KEY")  # LangChain Hub reuses the LangSmith API key
os.environ["GOOGLE_API_KEY"] = os.getenv("GOOGLE_API_KEY")  # Google API key, needed to try Gemini Pro
os.environ["LANGCHAIN_TRACING_V2"] = 'true'  # enable LangSmith tracing
os.environ["LANGCHAIN_ENDPOINT"] = 'https://api.smith.langchain.com/'  # LangSmith API endpoint
os.environ["LANGCHAIN_HUB_API_URL"] = 'https://api.hub.langchain.com'  # LangChain Hub API endpoint
os.environ["LANGCHAIN_PROJECT"] = 'llm-agents-memory'  # LangSmith project name

### Naive RAG Pipeline

![Naive-RAG](images/naive-rag.webp)

#### BGE Embeddings

In [11]:
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en-v1.5"
encode_kwargs = {'normalize_embeddings': True} # unit-length vectors, so a dot product equals cosine similarity

bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cpu'},
    encode_kwargs=encode_kwargs
)
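
Why normalization matters: once two vectors are scaled to unit length, their plain dot product equals their cosine similarity. The sketch below checks this on toy vectors. The helper names and the 3-dimensional values are made up for illustration and are not real BGE output.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length (the effect of normalize_embeddings=True)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy "embeddings" (hypothetical values)
vec_a = [3.0, 4.0, 0.0]
vec_b = [4.0, 3.0, 0.0]

cos = cosine_similarity(vec_a, vec_b)
dot_of_normalized = dot(l2_normalize(vec_a), l2_normalize(vec_b))
print(cos, dot_of_normalized)  # the two values agree up to floating-point error
```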

## Load dataset

This dataset contains 108 arXiv papers whose content was parsed with Meta's Nougat model.

In [8]:
from datasets import load_dataset
import pandas as pd

# Load the Hugging Face dataset
dataset = load_dataset("deep-learning-analytics/arxiv_small_nougat")

# Convert the train split to a Pandas DataFrame
df = dataset['train'].to_pandas()

# Preview the first few rows
df.head()

Unnamed: 0,doi,id,title,summary,source,authors,categories,comment,journal_ref,primary_category,published,updated,references,content,noref_content
0,2206.02336,2206.02336,Making Large Language Models Better Reasoners ...,Few-shot learning is a challenging task that r...,http://arxiv.org/pdf/2206.02336,['Yifei Li' 'Zeqi Lin' 'Shizhuo Zhang' 'Qiang ...,['cs.CL' 'cs.AI'],,,cs.CL,20220606,20230524,"\n\n* D. Andor, L. He, K. Lee, and E. Pitler (...",# Making Large Language Models Better Reasoner...,# Making Large Language Models Better Reasoner...
1,2206.04615,2206.04615,Beyond the Imitation Game: Quantifying and ext...,Language models demonstrate both quantitative ...,http://arxiv.org/pdf/2206.04615,['Aarohi Srivastava' 'Abhinav Rastogi' 'Abhish...,['cs.CL' 'cs.AI' 'cs.CY' 'cs.LG' 'stat.ML'],"27 pages, 17 figures + references and appendic...","Transactions on Machine Learning Research, May...",cs.CL,20220609,20230612,"\n\n* Wikiquote et al. (2021) Wikiquote, russi...",# Beyond the Imitation Game: Quantifying and e...,# Beyond the Imitation Game: Quantifying and e...
2,2206.05229,2206.05229,Measuring the Carbon Intensity of AI in Cloud ...,By providing unprecedented access to computati...,http://arxiv.org/pdf/2206.05229,['Jesse Dodge' 'Taylor Prewitt' 'Remi Tachet D...,['cs.LG'],"In ACM Conference on Fairness, Accountability,...",,cs.LG,20220610,20220610,\n\n* (1)\n* Anthony et al. (2020) Lasse F. Wo...,[MISSING_PAGE_EMPTY:1]\n\nIntroduction\n\nClim...,[MISSING_PAGE_EMPTY:1]\n\nIntroduction\n\nClim...
3,2206.05802,2206.05802,Self-critiquing models for assisting human eva...,We fine-tune large language models to write na...,http://arxiv.org/pdf/2206.05802,['William Saunders' 'Catherine Yeh' 'Jeff Wu' ...,['cs.CL' 'cs.LG'],,,cs.CL,20220612,20220614,"(RLHP) has become more common [1, 2, 3, 4], d...",# Self-critiquing models for assisting human e...,# Self-critiquing models for assisting human e...
4,2206.06336,2206.06336,Language Models are General-Purpose Interfaces,Foundation models have received much attention...,http://arxiv.org/pdf/2206.06336,['Yaru Hao' 'Haoyu Song' 'Li Dong' 'Shaohan Hu...,['cs.CL'],32 pages. The first three authors contribute e...,,cs.CL,20220613,20220613,"\n\n* Agrawal et al. (2019) Harsh Agrawal, Kar...",# Language Models are General-Purpose Interfac...,# Language Models are General-Purpose Interfac...


In [13]:
paper_ids = df['doi'].unique().tolist()
print(paper_ids)

[2206.02336, 2206.04615, 2206.05229, 2206.05802, 2206.06336, 2206.07635, 2206.14858, 2207.0056, 2207.04672, 2207.05221, 2207.05608, 2207.09983, 2207.10551, 2208.02294, 2208.03299, 2208.11663, 2208.14271, 2209.03143, 2209.07686, 2209.07753, 2209.07858, 2209.14375, 2209.15003, 2210.01241, 2210.02406, 2210.02414, 2210.02875, 2210.02969, 2210.0307, 2210.03078, 2210.0335, 2210.03493, 2210.03629, 2210.03945, 2210.05359, 2210.06245, 2210.07316, 2210.07382, 2210.077, 2210.09261, 2210.11399, 2210.11416, 2210.12283, 2210.13236, 2211.00053, 2211.00295, 2211.01786, 2211.0191, 2211.02001, 2211.04325, 2211.051, 2211.08264, 2211.08411, 2211.09085, 2211.0911, 2211.0926, 2211.10435, 2211.11736, 2212.00193, 2212.06817, 2212.08073, 2212.08286, 2212.0841, 2212.09689, 2212.10403, 2212.1056, 2212.12017, 2212.14882, 2301.00303, 2301.03728, 2301.07597, 2301.08653, 2301.09211, 2301.10226, 2301.11305, 2301.12867, 2301.13196, 2301.13688, 2302.04166, 2302.04761, 2302.07459, 2302.07736, 2302.07842, 2302.07867, 230

### Select subset of data to load into a database

* Our key text column will be the `noref_content` which has the content of the paper without the references
* We will include some metadata fields as well

In [9]:
keep_cols = ['id', 'title', 'authors', 'summary', 'source', 'published', 'noref_content']
# Keep only the selected columns and drop rows with a missing value in any of them
df_subset = df[keep_cols].dropna()
df_subset.shape

(101, 7)

In [10]:
from langchain.document_loaders import DataFrameLoader
loader = DataFrameLoader(df_subset, page_content_column="noref_content")
docs = loader.load()
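
To make the loader step concrete, here is a minimal sketch of the mapping it performs: the chosen text column becomes the document body and the remaining columns become metadata. `SimpleDocument` and `rows_to_documents` are hypothetical stand-ins written for this illustration, not LangChain classes, and the row is a toy example rather than a record from the dataset.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document object: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def rows_to_documents(rows, page_content_column):
    """Turn each row (a dict) into a document; non-text columns become metadata."""
    documents = []
    for row in rows:
        metadata = {key: value for key, value in row.items() if key != page_content_column}
        documents.append(SimpleDocument(page_content=row[page_content_column], metadata=metadata))
    return documents

# Toy row with made-up values
toy_rows = [
    {"id": "0000.00001", "title": "A toy paper", "noref_content": "# A toy paper\n\nBody text."},
]
toy_docs = rows_to_documents(toy_rows, page_content_column="noref_content")
print(toy_docs[0].metadata)
```

Keeping fields such as `title` and `source` as metadata means each chunk produced later can still be traced back to its paper.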

### Document Retriever

#### Implement your text splitter

In [30]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

In [31]:
print("Num Split Docs: ", len(split_docs))

Num Split Docs:  9790
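
The effect of `chunk_size=1000` and `chunk_overlap=100` can be illustrated with a simplified sliding window over characters. This is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, which the hypothetical `chunk_with_overlap` helper below does not attempt.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Synthetic 2,500-character text
sample_text = "".join(str(i % 10) for i in range(2500))
sample_chunks = chunk_with_overlap(sample_text)
print([len(chunk) for chunk in sample_chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of some duplicated text in the index.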


#### Implement your embedding and Vector Store generation

In [32]:
# First run only: build the vector store and persist it to ./chroma_db
# db = Chroma.from_documents(split_docs, bge_embeddings, persist_directory="./chroma_db")

# Later runs: load the persisted vector store
db = Chroma(persist_directory="./chroma_db", embedding_function=bge_embeddings)

### Test Retriever

In [33]:
# query it
query = "What is RLHF? When can it be used?"
matched_docs = db.similarity_search(query, k=8)

# print results
# show only the top 4 of the 8 matches
for pos, doc in enumerate(matched_docs[:4], start=1):
    print(f"Matched doc {pos} is : ", doc.page_content, "\n ===========")
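
Conceptually, a similarity search scores every stored chunk against the query embedding and returns the `k` best. The sketch below does this by brute force on toy unit-length vectors. Vector stores typically use an index instead of scanning every vector, so treat this as an illustration of the ranking idea, not of how Chroma is implemented. The chunk ids and vectors are made up.

```python
import heapq

def top_k_by_dot(query_vec, doc_vecs, k):
    """Return the ids of the k vectors with the highest dot product against the query."""
    def score(item):
        _, vec = item
        return sum(q * d for q, d in zip(query_vec, vec))
    best = heapq.nlargest(k, doc_vecs.items(), key=score)
    return [doc_id for doc_id, _ in best]

# Toy unit-length embeddings (hypothetical values)
toy_index = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.6, 0.8],
    "chunk-c": [0.0, 1.0],
}
toy_query = [0.8, 0.6]

print(top_k_by_dot(toy_query, toy_index, k=2))
```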



### Add Answer Generation

In [34]:
from langchain.prompts import PromptTemplate
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
I prefer answers to be 8-10 sentences long:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [21]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
retriever = db.as_retriever(search_kwargs={"k": 8})

model = ChatOpenAI(model_name="gpt-3.5-turbo")

chain_type_kwargs = {"prompt": PROMPT}
qa = RetrievalQA.from_chain_type(llm=model, chain_type="stuff", retriever=retriever, chain_type_kwargs=chain_type_kwargs)
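
The `"stuff"` chain type places all retrieved chunks into a single prompt. A rough sketch of that assembly step, using the same template text as the prompt cell: the `stuff_prompt` helper and the blank-line separator are assumptions made for illustration, and the chunks are toy strings.

```python
PROMPT_TEMPLATE = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
I prefer answers to be 8-10 sentences long:"""

def stuff_prompt(question, retrieved_chunks, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Toy chunks standing in for retriever output
toy_chunks = ["RLHF fine-tunes a model with human feedback.", "A reward model scores outputs."]
toy_prompt = stuff_prompt("What is RLHF?", toy_chunks)
print(toy_prompt)
```

Because every chunk goes into one prompt, `k` times the chunk size has to fit inside the model's context window along with the question and the answer.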



In [35]:
query = "What is the benefit of RLHF compared to other techniques?"
result = qa.invoke({"query": query})

print("Answer: ", result['result'])

Answer:  The benefit of RLHF compared to other techniques lies in its ability to fine-tune Large Language Models (LMs) by utilizing human feedback to iteratively align the model's responses more closely with human expectations and preferences. RLHF works by generating text using a pre-trained LM, which is then evaluated by humans to learn a reward model that captures human preferences when judging model output. This approach allows LMs to be more closely aligned with complex human preferences and values that are difficult to capture with hard-coded reward functions. RLHF can be applied on top of a general-purpose LM pre-trained via self-supervised learning, and it can also be used after an initial supervised fine-tuning phase using expert demonstrations for more complex tasks. The method has shown significant improvements in LM performance across various applications, showcasing its power in enhancing model capabilities. Additionally, RLHF enables models to perform tasks that require e

In [23]:
## Based on the paper - https://arxiv.org/pdf/2305.17493.pdf

query = "What is model collapse? Why does it occur and how is it different from catastrophic forgetting?"
result = qa.invoke({"query": query})

print("Answer: ", result['result'])

Answer:  Model collapse is a degenerative process in learning where generations of generative models end up polluting the training set of the next generation with generated data, leading to a misinterpretation of reality. This phenomenon occurs when models start forgetting improbable events and converge to a distribution that deviates significantly from the original one. It is different from catastrophic forgetting, which refers to the model forgetting previously learned data when exposed to new information. In model collapse, the models do not forget previous data but misinterpret what they believe to be real, reinforcing their own beliefs. The primary cause of model collapse is the compounding of errors over generations, leading to divergence from the original model. It is essential to access genuine human-generated content to prevent model collapse.


In [36]:
## Based on the paper - https://arxiv.org/abs/2307.09288

query = "Explain in detail how was RLHF training done for Llama2-70B chat model?"
result = qa.invoke({"query": query})

print("Answer: ", result['result'])

Answer:  The RLHF training for the Llama2-70B chat model involved iteratively refining the model using methodologies like rejection sampling and Proximal Policy Optimization (PPO). During this stage, the model accumulated iterative reward modeling data in parallel with enhancements to ensure the reward models remained within distribution. The process focused on optimizing the model for better human preference alignment, helpfulness, and safety by leveraging response scores as rewards. To address the trade-off between helpfulness and safety, two separate reward models were trained, one for each aspect. The training process aimed to make the model more robust to jailbreak attempts and improve the quality of responses generated by the model in various contexts.


## Long context modeling using Gemini

Get an API key for Gemini from https://aistudio.google.com/app/prompts/new_chat

In [4]:
import google.generativeai as genai
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')

genai.configure(api_key=GOOGLE_API_KEY)

In [5]:
gemini_llm = genai.GenerativeModel(model_name="models/gemini-1.5-pro-latest")
response = gemini_llm.generate_content("Tell me about the biggest planet in our Solar System?")
print(response.text)

The biggest planet in our Solar System is **Jupiter**! 🪐  It's so large that all of the other planets could fit inside it.  It's known for its Great Red Spot, which is a giant storm that has been going on for hundreds of years. Jupiter is made mostly of gas, and it has beautiful swirling clouds of different colors.  It also has a lot of moons, with the four largest being Io, Europa, Ganymede, and Callisto. 🔭 



In [6]:
import tiktoken
def count_token(text):
    # Initialize the tokenizer
    encoding = tiktoken.get_encoding("cl100k_base")
    # Tokenize the text
    tokens = encoding.encode(text,allowed_special={'<|endoftext|>', '<|endofprompt|>'})
    # Count the number of tokens
    number_of_tokens = len(tokens)
    # Print the number of tokens
    print("Number of tokens:", number_of_tokens)


In [11]:
total_content = df.tail(25)['noref_content'].str.cat(sep='/n')
count_token(total_content)

Number of tokens: 388890


In [12]:
def generate_prompt(question, context=total_content):
    prompt = f"""Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

        {context}

        Question: {question}
        I prefer answers to be 8-10 sentences long:"""
    return prompt


In [13]:
query = "What is model collapse? Why does it occur and how is it different from catastrophic forgetting?"
prompt = generate_prompt(query)
count_token(prompt)

Number of tokens: 388965


In [14]:
%%time
response = gemini_llm.generate_content(prompt)
print(response.text)

Model collapse is a phenomenon that can occur when large language models (LLMs) are trained on data that has been generated by other LLMs. This can cause the models to progressively forget the original data distribution, leading to a decline in performance over generations. The primary cause of model collapse is statistical error, which arises due to the finite number of samples used in training. As models are trained on data generated by previous generations, these errors can compound, causing the models to drift further and further away from the true data distribution. Unlike catastrophic forgetting, where models forget previously learned information when learning new information, model collapse involves models misinterpreting what they believe to be real by reinforcing their own beliefs based on the data generated by previous models. To prevent model collapse, it is crucial to have access to genuine human-generated content and to ensure that the training data is not overly reliant o

In [15]:
query = "Explain in detail how RLHF training was done for Llama2-70B chat model. If you don't know the answer, just say so. Don't try to make it up"
prompt = generate_prompt(query)
count_token(prompt)

Number of tokens: 388986
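
Before sending a prompt of this size, it helps to check it against the model's context window. The sketch below is a simple budget check. The 1,000,000-token window and the 8,192 tokens reserved for the answer are assumptions chosen for illustration, and the count itself is only approximate for Gemini because `cl100k_base` is an OpenAI encoding.

```python
def fits_in_context(prompt_tokens, context_window, reserved_for_answer=0):
    """Return True if the prompt plus the reserved answer budget fits in the window."""
    return prompt_tokens + reserved_for_answer <= context_window

ASSUMED_CONTEXT_WINDOW = 1_000_000  # assumption: long-context window size in tokens
RESERVED_FOR_ANSWER = 8_192         # assumption: room left for the generated answer
PROMPT_TOKENS = 388_986             # approximate count reported above

print(fits_in_context(PROMPT_TOKENS, ASSUMED_CONTEXT_WINDOW, RESERVED_FOR_ANSWER))
```

For a count from Gemini's own tokenizer, the `google.generativeai` SDK provides a `count_tokens` method on the model object, which is likely a better basis for this check than the `tiktoken` estimate.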


In [16]:
%%time
response = gemini_llm.generate_content(prompt)
print(response.text)

Llama 2-Chat was fine-tuned using Reinforcement Learning with Human Feedback (RLHF).  This process begins with collecting human preference data.  Annotators provide prompts and compare two model generations, selecting the response that is most helpful while also being safe.  Using this data, a reward model is trained with a pair-wise ranking loss that learns to assign higher scores to the preferred responses.  The model is then fine-tuned with RLHF, using either Proximal Policy Optimization (PPO) or Rejection Sampling.  PPO directly optimizes the model's policy using the reward model as a reward function.  In Rejection Sampling, multiple outputs are sampled from the model for a prompt, and the response with the highest reward score is then used as the new gold standard. The model is then fine-tuned using supervised learning with this new set of ranked responses. To improve consistency in multi-turn dialogues, we use Ghost Attention (GAtt).  GAtt prefixes a system prompt, such as "You a