Acknowledgement: Material based on the CS224U course by Prof. Potts
YouTube lectures: https://tinyurl.com/course-cs234u
GitHub: https://github.com/cgpotts/cs224u/

This notebook is available at: https://github.com/ljohri/NLP-RAG-Concepts


### Concepts
src: https://tinyurl.com/rag-notes


<img src="RAG Architecture Explaination.png" alt="Page1" width="900">
<img src="RAG Architecture Explaination 2.png" alt="Page1" width="900">
<img src="RAG Architecture Explaination 3.png" alt="Page1" width="900">


## Setup

In [None]:
try:
    # This library is our indicator that the required installs
    # need to be done.
    import datasets
except ModuleNotFoundError:
    !git clone https://github.com/cgpotts/cs224u/
    !pip install -r cs224u/requirements.txt
    import sys
    sys.path.append("cs224u")

## Imports

In [11]:
from datasets import load_dataset
import os
import dspy
import warnings
from openai import OpenAI
import random
from dspy.teleprompt import LabeledFewShot
from dotenv import load_dotenv
from dspy.evaluate import answer_exact_match
from dspy.evaluate.evaluate import Evaluate

root_path = 'dspy'


## Load the Environment

In [2]:
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(root_path, 'cache')
# keep the API keys in a `.env` file in the local root directory
load_dotenv()
openai_key = os.getenv('OPENAI_API_KEY')  # use the .env file as it is a good practice to keep keys outside of one's code
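Because `os.getenv` returns `None` when a variable is missing, a mistyped or absent key only surfaces later as an authentication error. The sketch below shows one way to fail fast instead. The helper `require_env` and the variable `DEMO_API_KEY` are hypothetical names introduced for illustration; they are not part of the course code.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value


# Hypothetical variable, set here only so the demonstration is self-contained.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same check could be applied to `OPENAI_API_KEY` right after `load_dotenv()`.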

## Download ColBERTv2 Pretrained Parameters

In [3]:
if not os.path.exists(os.path.join("data", "openqa", "colbertv2.0.tar.gz")):
    !mkdir -p data/openqa
    # ColBERTv2 checkpoint trained on MS MARCO Passage Ranking (388MB compressed)
    !wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz -P data/openqa/
    !tar -xvzf data/openqa/colbertv2.0.tar.gz -C data/openqa/

## Download Prebuilt ColBERT index

In [4]:
index_home = os.path.join("experiments", "notebook", "indexes", "cs224u.collection.2bits")
if not os.path.exists(index_home):
    !wget https://web.stanford.edu/class/cs224u/data/cs224u.collection.2bits.tgz -P experiments/notebook/indexes
    !tar -xvzf experiments/notebook/indexes/cs224u.collection.2bits.tgz -C experiments/notebook/indexes

## Run the ColBERT server
Run the following commands in a terminal, one per line:

conda activate nlu

git clone https://github.com/stanford-futuredata/ColBERT/ 

export INDEX_ROOT=experiments/notebook/indexes/cs224u.collection.2bits/ 

export INDEX_HOME=cs224u.collection.2bits 

export PORT=8888 

python ColBERT/server.py 


## Get handles to the retrieval model and the language model

In [6]:
# Retrieval model (RM): the ColBERTv2 server started above
rm = dspy.ColBERTv2(url="http://127.0.0.1:8888/api/search")

# Language model (LM)
lm = dspy.OpenAI(model='gpt-3.5-turbo', api_key=openai_key)

dspy.settings.configure(lm=lm, rm=rm)

In [7]:
client = OpenAI(api_key = openai_key)
models = client.models.list()
print(models)

SyncPage[Model](data=[Model(id='gpt-4o-audio-preview-2024-12-17', created=1734034239, object='model', owned_by='system'), Model(id='dall-e-3', created=1698785189, object='model', owned_by='system'), Model(id='dall-e-2', created=1698798177, object='model', owned_by='system'), Model(id='gpt-4o-audio-preview-2024-10-01', created=1727389042, object='model', owned_by='system'), Model(id='gpt-4o-realtime-preview-2024-10-01', created=1727131766, object='model', owned_by='system'), Model(id='gpt-4o-realtime-preview', created=1727659998, object='model', owned_by='system'), Model(id='babbage-002', created=1692634615, object='model', owned_by='system'), Model(id='tts-1-hd-1106', created=1699053533, object='model', owned_by='system'), Model(id='text-embedding-3-large', created=1705953180, object='model', owned_by='system'), Model(id='gpt-4', created=1687882411, object='model', owned_by='openai'), Model(id='text-embedding-ada-002', created=1671217299, object='model', owned_by='openai-internal'), Mo

## Load SQuAD, a question-answering dataset

In [8]:
def get_squad_split(squad, split="validation"):
    """
    Use `split='train'` for the train split.

    Returns
    -------
    list of dspy.Example with attributes question, answer

    """
    data = zip(*[squad[split][field] for field in squad[split].features])
    exs = [dspy.Example(question=q, answer=a['text'][0]).with_inputs("question")
           for eid, title, context, q, a in data]
    return exs
    
squad = load_dataset("squad", trust_remote_code=True)
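squad_train = get_squad_split(squad, split="train")
# The default split is "validation"; keep it under its own name so it
# does not overwrite the train split.
squad_dev = get_squad_split(squad)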


In [None]:
random.seed(1)

dev_exs = random.sample(squad_dev, k=200)  # use a small sample of the dev set for faster development

## DSPy Basics


In [None]:
lm("What is the birthplace of the first author to win a Hugo Award for a translation?")

In [None]:
lm("Which U.S. states border no U.S. states?", temperature=0.9, n=4)

In [None]:
_ = lm.inspect_history(n=1)  

## Signature-based prediction
In DSPy, signatures are declarative statements about what we want the model to do. In the following, "question -> answer" is the signature (the most basic QA signature one could write), and dspy.Predict turns it into a complete QA system:

In [None]:
basic_predictor = dspy.Predict("question -> answer")

In [None]:
basic_predictor(question="What is the birthplace of the first author to win a Hugo Award for a translation?")

In [None]:
# Show the most recent prompt sent to the LLM and its completion
_ = lm.inspect_history(n=1)
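To make the idea of a signature string concrete, here is a minimal pure-Python sketch of how a string like "question -> answer" can be split into input and output fields and turned into a bare prompt. The functions `parse_signature` and `render_prompt` are hypothetical illustrations, not DSPy's implementation, and the prompt DSPy actually builds is richer than this.

```python
def _split_fields(part: str) -> list[str]:
    """Split a comma-separated field list and drop surrounding whitespace."""
    return [name.strip() for name in part.split(",") if name.strip()]


def parse_signature(signature: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into (input field names, output field names)."""
    inputs, arrow, outputs = signature.partition("->")
    if not arrow:
        raise ValueError("signature must contain '->'")
    return _split_fields(inputs), _split_fields(outputs)


def render_prompt(signature: str, **values: str) -> str:
    """Render filled-in input fields followed by empty output fields."""
    inputs, outputs = parse_signature(signature)
    lines = [f"{name.capitalize()}: {values[name]}" for name in inputs]
    lines += [f"{name.capitalize()}:" for name in outputs]
    return "\n".join(lines)


print(render_prompt("question -> answer",
                    question="Which U.S. states border no U.S. states?"))
```

The point of the sketch is only that the signature declares which fields go in and which come out; the framework decides how to phrase the prompt around them.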

In many cases, we will want more control over the prompt. Writing a small custom dspy.Signature class is the easiest way to accomplish this. In the following, we just tweak the initial instruction and provide some formatting guidance for the answer:

In [None]:
class BasicQASignature(dspy.Signature):
    __doc__ = """Answer questions with short factoid answers."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

In [None]:
sig_predictor = dspy.Predict(BasicQASignature)

In [None]:
sig_predictor(question="Which U.S. states border no U.S. states?")

In [None]:
_ = lm.inspect_history(n=1)

## Modules
One of the hallmarks of DSPy is that it adopts design patterns from PyTorch. The main example of this is DSPy's use of the Module as the basic unit for writing simple and complex programs. Here is a very basic module for QA that makes use of BasicQASignature as we defined it just above.

In [None]:
class BasicQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.Predict(BasicQASignature)

    def forward(self, question):
        return self.generate_answer(question=question)

As with PyTorch, the forward method is called when we want to make predictions:

In [None]:
basic_qa_model = BasicQA()
basic_qa_model(question="What is the birthplace of the first author to win a Hugo Award for a translation?")
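The PyTorch-style calling convention can be illustrated without any LM at all. The following sketch assumes a tiny stand-in `Module` base class whose `__call__` dispatches to `forward`; `Module`, `EchoPredictor`, and `ToyQA` are hypothetical names for illustration and are much simpler than dspy.Module.

```python
class Module:
    """Tiny stand-in for a PyTorch-style module: calling the object runs forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class EchoPredictor(Module):
    """Hypothetical predictor that returns a canned answer instead of calling an LM."""

    def forward(self, question):
        return {"answer": f"(no LM) you asked: {question}"}


class ToyQA(Module):
    """Mirrors the shape of BasicQA: one sub-module, called from forward()."""

    def __init__(self, predictor):
        self.generate_answer = predictor

    def forward(self, question):
        return self.generate_answer(question=question)


toy_qa = ToyQA(EchoPredictor())
print(toy_qa(question="Which U.S. states border no U.S. states?"))
```

Because `ToyQA` only depends on its sub-module being callable, a different predictor can be passed in without touching `forward`.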

The modular design of DSPy starts to become apparent now. If you want to change the above to use chain of thought instead of regular predictions, you need only change dspy.Predict to dspy.ChainOfThought, and similarly for dspy.ReAct, dspy.ProgramOfThought, or a module you wrote yourself.

In [None]:
fewshot_teleprompter = LabeledFewShot(k=3)
print(squad_train[:3])
# Calling compile on basic_qa_model (defined above) returns a new module
# that we use like any other in DSPy.
basic_fewshot_qa_model = fewshot_teleprompter.compile(basic_qa_model, trainset=squad_train)

In [None]:
basic_fewshot_qa_model(question="What is the birthplace of the first author to win a Hugo Award for a translation?")

In [None]:
_ = lm.inspect_history(n=1)
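Conceptually, labeled few-shot prompting samples k training examples and places them in the prompt as demonstrations ahead of the new question. The sketch below shows that idea in plain Python; `build_fewshot_prompt` and the toy training data are hypothetical and made up for illustration, and the prompt format is an assumption rather than what DSPy emits.

```python
import random


def build_fewshot_prompt(trainset, question, k=3, seed=0):
    """Sample up to k labeled examples and prepend them as demonstrations."""
    rng = random.Random(seed)
    demos = rng.sample(trainset, k=min(k, len(trainset)))
    blocks = [f"Question: {d['question']}\nAnswer: {d['answer']}" for d in demos]
    # The new question comes last, with the answer left for the model to fill in.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Made-up toy examples standing in for squad_train.
toy_train = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
    {"question": "What is the capital of Canada?", "answer": "Ottawa"},
]

prompt = build_fewshot_prompt(toy_train, "What is the capital of Italy?", k=2)
print(prompt)
```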

## Evaluation

In [12]:
answer_exact_match(dspy.Example(answer="STAGE 2!"), dspy.Prediction(answer="stage 2"))

True
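The match above succeeds because both strings are normalized before comparison. The sketch below follows the normalization used in SQuAD-style evaluation (lowercase, strip punctuation, drop English articles, collapse whitespace). It is an assumption that DSPy's metric behaves the same way in every detail; `normalize_text` and `exact_match` are hypothetical names.

```python
import re
import string


def normalize_text(text: str) -> str:
    """Lowercase, remove punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold: str, predicted: str) -> bool:
    """True when the two strings are identical after normalization."""
    return normalize_text(gold) == normalize_text(predicted)


print(exact_match("STAGE 2!", "stage 2"))   # True
print(exact_match("Paris", "London"))       # False
```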

In [None]:
tiny_evaluater = Evaluate(
    devset=dev_exs[: 15],
    num_threads=1,
    display_progress=True,
    display_table=5)
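An evaluator like this is meant to be called on a program together with a metric such as answer_exact_match. What it computes is, at its core, an average of the metric over the dev set. The loop below is a minimal sketch of that idea; `evaluate`, `toy_program`, `toy_metric`, and the toy data are hypothetical and do not reproduce DSPy's threading, progress display, or table output.

```python
def evaluate(program, devset, metric):
    """Return the percentage of dev examples on which `metric` is truthy."""
    if not devset:
        return 0.0
    hits = sum(1 for ex in devset if metric(ex, program(ex["question"])))
    return 100.0 * hits / len(devset)


# Made-up lookup table standing in for a real QA program.
toy_answers = {"What is the capital of France?": "Paris"}


def toy_program(question):
    return {"answer": toy_answers.get(question, "unknown")}


def toy_metric(example, prediction):
    """Case-insensitive comparison of gold and predicted answers."""
    return example["answer"].strip().lower() == prediction["answer"].strip().lower()


toy_devset = [
    {"question": "What is the capital of France?", "answer": "paris"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
]

score = evaluate(toy_program, toy_devset, toy_metric)
print(f"{score:.1f}% exact match")
```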

### Retrieval

In [None]:
retriever = dspy.Retrieve(k=3)

In [None]:
passages = retriever("What is the birthplace of the first author to win a Hugo Award for a translation?")
passages.passages[0]

### Finally, putting the system together

In [None]:
class ContextQASignature(dspy.Signature):
    __doc__ = """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")
    
class RAG(dspy.Module):
    def __init__(self, num_passages=1):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.Predict(ContextQASignature)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)    

In [None]:
rag_model = RAG(num_passages=3)
rag_model(question="What is the birthplace of the first author to win a Hugo Award for a translation?")
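The retrieve-then-generate flow of the RAG module can be traced end to end without a server or an API key. In the sketch below, a word-overlap ranker stands in for ColBERTv2 (which scores passages very differently), and the final step only builds the prompt that would be sent to the LM. All names (`retrieve`, `build_rag_prompt`, `toy_passages`) are hypothetical, and the prompt layout is an assumption.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, passages, k=1):
    """Rank passages by word overlap with the question; return the top k."""
    query_tokens = tokenize(question)
    ranked = sorted(passages,
                    key=lambda p: len(query_tokens & tokenize(p)),
                    reverse=True)
    return ranked[:k]


def build_rag_prompt(question, context):
    """Place the retrieved passages before the question, as in ContextQASignature."""
    lines = ["Answer questions with short factoid answers.", ""]
    lines += [f"Context [{i}]: {p}" for i, p in enumerate(context, start=1)]
    lines += [f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Made-up toy collection standing in for the ColBERT index.
toy_passages = [
    "Paris is the capital of France.",
    "The Hugo Award is given for science fiction and fantasy works.",
    "Alaska and Hawaii border no other U.S. states.",
]

question = "Which U.S. states border no U.S. states?"
context = retrieve(question, toy_passages, k=1)
print(build_rag_prompt(question, context))
```

Swapping the toy ranker for a real retriever changes only `retrieve`; the prompt-building step stays the same, which is the separation the RAG module above relies on.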