# Implementing RAG from scratch

## Main Steps
- Document processing
  - Preprocess the text corpus (lowercase, remove punctuation and stop words, optionally stem), then tokenize
  - Chunk and embed (choose model)
- Query pipeline
  - Receive and embed query
  - Retrieve
  - Assemble prompt
  - Generate response

In [1]:
import nltk
from nltk.corpus import stopwords
from string import punctuation as PUNCTUATION
from pypdf import PdfReader

import numpy as np
from sentence_transformers import SentenceTransformer

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nathanmandi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/nathanmandi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Preprocessing and Knowledgebase Construction

In [2]:
# Convert document to list of tokens
reader = PdfReader("4 Ways Dynamic Baselines Can Transform Carbon Crediting (1).pdf")
text = " ".join([page.extract_text() for page in reader.pages])
print(text)

Technology  | July 7, 2022
4 Ways Dynamic Baselines Can Transform
Carbon Crediting
Today’s static approach to baselines fails to capture
the constantly evolving nature of land use and our
forests. We believe a tech-driven, dynamic approach
can transform carbon crediting in four key ways.
Last year, our planet lost 25.3 million hectares of forest, an area greater than the
size of the United Kingdom.1 The emissions associated to forest loss are sizable,
equivalent to as much as one-tenth to one-quarter of fossil fuel emissions each
year.2 3 The carbon market offers a way to pay landowners to protect their forests
and reduce these carbon emissions.
But, paying landowners who are already protecting their forests clearly does not
reduce emissions. For the carbon market to produce net climate benefits, credits
must only be awarded to landowners who wouldn’t have protected their forest
otherwise. But how do we know what a landowner would have done if they hadn’t
received funds from a carbon p

In [3]:
# Preprocess text
stopset = list(stopwords.words("english")) + list(PUNCTUATION) + ['"', "'", "-", '’']

def preprocess_text(text):
    # Potentially add stemming
    tokenized = nltk.word_tokenize(text.lower())
    clean_tokens = [w for w in tokenized if w not in stopset] # Removed case, punc, stopwords
    return clean_tokens

token_text = preprocess_text(text)
print(token_text)
print("Num tokens: ", len(token_text))

# Typographic characters (‘ “ ” –) still slip through: they are distinct
# Unicode code points that the ASCII entries in stopset do not match

['technology', 'july', '7', '2022', '4', 'ways', 'dynamic', 'baselines', 'transform', 'carbon', 'crediting', 'today', 'static', 'approach', 'baselines', 'fails', 'capture', 'constantly', 'evolving', 'nature', 'land', 'use', 'forests', 'believe', 'tech-driven', 'dynamic', 'approach', 'transform', 'carbon', 'crediting', 'four', 'key', 'ways', 'last', 'year', 'planet', 'lost', '25.3', 'million', 'hectares', 'forest', 'area', 'greater', 'size', 'united', 'kingdom.1', 'emissions', 'associated', 'forest', 'loss', 'sizable', 'equivalent', 'much', 'one-tenth', 'one-quarter', 'fossil', 'fuel', 'emissions', 'year.2', '3', 'carbon', 'market', 'offers', 'way', 'pay', 'landowners', 'protect', 'forests', 'reduce', 'carbon', 'emissions', 'paying', 'landowners', 'already', 'protecting', 'forests', 'clearly', 'reduce', 'emissions', 'carbon', 'market', 'produce', 'net', 'climate', 'benefits', 'credits', 'must', 'awarded', 'landowners', 'protected', 'forest', 'otherwise', 'know', 'landowner', 'would', 'd
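The stray `‘`, `“`, `”` and `–` tokens that survive preprocessing could be dropped without listing every Unicode variant in the stop set. The sketch below assumes a hypothetical helper, `drop_punctuation_tokens` (not part of this notebook), that keeps only tokens containing at least one letter or digit, so hyphenated words and numbers such as `25.3` are preserved.

```python
def drop_punctuation_tokens(tokens):
    """Hypothetical helper: keep tokens that contain a letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

# Sample tokens taken from the preprocessing output above
sample_tokens = [
    "carbon", "project", "‘", "scenario", "“", "baseline", "”",
    "–", "25.3", "tech-driven",
]
print(drop_punctuation_tokens(sample_tokens))
```

Applied after the stop-word filter in `preprocess_text`, this would remove punctuation-only tokens regardless of their encoding.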

In [4]:
# Chunk and embed
chunk_size = 128
chunks = []
i = 0
while i < len(token_text):
    next_text = token_text[i : i + chunk_size]
    i += chunk_size
    chunks.append(" ".join(next_text))
print("Num chunks: ", len(chunks))

Num chunks:  8
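The loop above produces non-overlapping windows, so a passage that straddles a boundary is split across two chunks. One common variation is to let consecutive chunks share some tokens. The following is a minimal sketch of that idea; the function name `chunk_tokens` and the `overlap` parameter are assumptions for illustration, and `overlap=0` reproduces the fixed-size chunking used in this notebook.

```python
def chunk_tokens(tokens, chunk_size=128, overlap=0):
    """Split tokens into space-joined chunks of up to chunk_size tokens.

    Consecutive chunks share `overlap` tokens (hypothetical parameter).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [
        " ".join(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), step)
    ]

toy_tokens = [str(n) for n in range(10)]
print(chunk_tokens(toy_tokens, chunk_size=4))             # no overlap
print(chunk_tokens(toy_tokens, chunk_size=4, overlap=2))  # 2 shared tokens
```

Overlap increases the number of chunks (and embedding calls), so it trades cost for a lower chance of cutting relevant context in half.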


In [5]:
for chunk in chunks:
    print(chunk)

technology july 7 2022 4 ways dynamic baselines transform carbon crediting today static approach baselines fails capture constantly evolving nature land use forests believe tech-driven dynamic approach transform carbon crediting four key ways last year planet lost 25.3 million hectares forest area greater size united kingdom.1 emissions associated forest loss sizable equivalent much one-tenth one-quarter fossil fuel emissions year.2 3 carbon market offers way pay landowners protect forests reduce carbon emissions paying landowners already protecting forests clearly reduce emissions carbon market produce net climate benefits credits must awarded landowners protected forest otherwise know landowner would done received funds carbon project ‘ scenario baselines become important carbon markets “ baseline ” matter baseline represents business-as-usual without carbon project number credits issued landowner annually difference project baseline carbon emissions –
words difference emissions with

In [57]:
# Embedder and Generator functions
import time
from openai import OpenAI
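# Placeholder key; in practice, set the OPENAI_API_KEY environment variable
# rather than hard-coding a key in the notebook
client = OpenAI(api_key="sk-insertyourkeyhere")

def embedder(query) -> np.ndarray: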
    # return np.random.rand(384)
    response = client.embeddings.create(
        input=query.replace("\n", " "),
        model="text-embedding-3-small",
    )
    emb = response.data[0].embedding
    # print("Embedding length: ", len(emb))
    return np.array(emb)

def generator(prompt) -> str:
    # return prompt
    messages = [
        {"role": "user", "content": prompt}
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    return response.choices[0].message.content



In [22]:
# Hugging Face
# embed_model = SentenceTransformer("all-MiniLM-L6-v2")
# print("Loaded embeddings model")
# embeddings = np.array([embed_model.encode(c) for c in chunks])
# print(embeddings.shape)

# Random
# embeddings = [np.random.rand(384) for _ in chunks]

# Using OpenAI model
embeddings = [embedder(chunk) for chunk in chunks]

Embedding length:  1536
Embedding length:  1536
Embedding length:  1536
Embedding length:  1536
Embedding length:  1536
Embedding length:  1536
Embedding length:  1536
Embedding length:  1536
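The commented-out `np.random.rand(384)` stubs let the pipeline run without an API call, but random vectors carry no signal, so retrieval cannot be sanity-checked with them. As a rough offline stand-in, the sketch below uses the hashing trick to build a deterministic bag-of-words vector. The name `hashing_embedder` and the dimension of 64 are arbitrary assumptions, and this is in no way a substitute for a learned embedding model.

```python
import hashlib
import math

def hashing_embedder(text, dim=64):
    """Hypothetical offline stand-in for a real embedding model.

    Each whitespace-separated token is hashed into one of `dim` buckets,
    and the resulting count vector is L2-normalized.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

a = hashing_embedder("carbon credit baseline")
b = hashing_embedder("carbon credit baseline")
print(a == b)  # same text always maps to the same vector
```

To use it with the `VectorDB` class in this notebook, the returned list would need to be wrapped in `np.array(...)`, since `VectorDB.dist` subtracts arrays.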


In [15]:
# VectorDB class
class VectorDB:
    def __init__(self, chunks, embeddings):
        self.chunks = chunks
        self.embeddings = embeddings
        self.n = len(self.embeddings)

    def get_top_k(self, query_embed, k=2):
        # Simple kNN implementation: keep a running set of the k nearest candidates
        assert k <= self.n, "Fewer than k chunks stored"
        top_embed_inds = list(range(k))
        dists = [VectorDB.dist(query_embed, self.embeddings[i]) for i in top_embed_inds]
        for i in range(k, self.n):
            new_dist = VectorDB.dist(self.embeddings[i], query_embed)

            # Replace the farthest of the current top k if the new point is closer
            farthest_ind = max(range(k), key=lambda j: dists[j])
            if dists[farthest_ind] > new_dist:
                dists[farthest_ind] = new_dist
                top_embed_inds[farthest_ind] = i

        top_embeds = [self.embeddings[i] for i in top_embed_inds]
        top_chunks = [self.chunks[i] for i in top_embed_inds]

        return top_embeds, top_chunks

    @staticmethod
    def dist(a, b):
        return np.linalg.norm(a-b)

In [16]:
# def almost_equal(p1, p2):
#     return VectorDB.dist(p1, p2) <= 0.00001

def point_in_list(lst, p):
    return any(np.array_equal(p, x) for x in lst) 

def test_vector_db():
    chunks = range(4)
    p00, p01, p10, p11 = [np.array(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
    embeddings = [p00, p01, p10, p11]
    vdb = VectorDB(chunks, embeddings)
    embs1, _ = vdb.get_top_k(np.array([0.75, 0.5]), k=2)
    assert all(point_in_list(embs1, x) for x in [p10, p11])

    embs2, _ = vdb.get_top_k(np.array([1.5, 1.5]), k=3)
    assert all(point_in_list(embs2, x) for x in [p10, p11, p01])
    
    print(embs1)
    print(embs2)
    print("Success!")

test_vector_db()

[array([1, 0]), array([1, 1])]
[array([1, 1]), array([0, 1]), array([1, 0])]
Success!
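`get_top_k` rescans the current candidates with `max` for every stored embedding and returns its results in no particular order. The same selection can be sketched with the standard library's `heapq.nsmallest`, which returns the nearest items first. The function `top_k_indices` below is a hypothetical alternative, not the notebook's method; it works on plain tuples and returns indices rather than the embeddings themselves.

```python
import heapq
import math

def top_k_indices(query, vectors, k=2):
    """Return indices of the k vectors closest to query (Euclidean), nearest first."""
    if k > len(vectors):
        raise ValueError("Fewer than k vectors stored")
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: math.dist(query, vectors[i])
    )

# Same corner points as test_vector_db above
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(top_k_indices((0.75, 0.5), points, k=2))
print(top_k_indices((1.5, 1.5), points, k=3))
```

Returning indices makes it straightforward to look up both `self.chunks[i]` and `self.embeddings[i]` afterwards.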


## Prompting and Query Pipelining

In [17]:
class RAGPipeline:

    BASE_PROMPT = """
You are a world-class, state-of-the-art expert on {0}. 
You have been asked a question. The following is some context relevant to the topic: 
{1}
With this in mind, here is the question: {2}
    """

    def __init__(self, expertise: str, vdb: VectorDB, embedder, generator, k=2) -> None:
        self.expertise = expertise
        self.vdb = vdb
        self.embedder = embedder
        self.generator = generator
        self.k = k

    def assemble_prompt(self, q, docs) -> str:
        formatted_docs = ""
        for i in range(len(docs)):
            formatted_docs += f"CONTEXT {i+1} \n{docs[i]}\n"

        return RAGPipeline.BASE_PROMPT.format(self.expertise, formatted_docs, q)

    def query(self, q: str) -> str:
        q_embed = self.embedder(q)
        _, docs = self.vdb.get_top_k(q_embed, self.k)
        prompt = self.assemble_prompt(q, docs)
        #print(prompt)
        response = self.generator(prompt)
        return response
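`BASE_PROMPT` packs the role description, the retrieved context, and the question into a single user message. Chat-style APIs also accept a separate `system` message, so one possible variation is to split the prompt by role. The helper `build_messages` below is hypothetical and only builds the message list; it makes no API call.

```python
def build_messages(expertise, docs, question):
    """Hypothetical alternative to BASE_PROMPT: split the prompt into chat roles."""
    # Number the retrieved chunks the same way assemble_prompt does
    context = "\n".join(
        f"CONTEXT {i}\n{doc}" for i, doc in enumerate(docs, start=1)
    )
    return [
        {"role": "system", "content": f"You are an expert on {expertise}."},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ]

messages = build_messages(
    "carbon crediting", ["first chunk", "second chunk"], "What is a baseline?"
)
for m in messages:
    print(m["role"], "->", repr(m["content"]))
```

A list in this shape could be passed as the `messages` argument that `generator` already sends to `client.chat.completions.create`.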


In [58]:
expertise = "Carbon Capture Crediting"
vdb = VectorDB(chunks=chunks, embeddings=embeddings)
rag = RAGPipeline(expertise, vdb, embedder, generator, k=2)

query1 = "What are dynamic baselines important in measuring carbon credits?"
query2 = "What does the company Pachama do?"
query3 = "Where can I find some resources on carbon credit research and technology?"

queries = [query1, query2, query3]

In [61]:
test_rag_pipeline = RAGPipeline(expertise, vdb, embedder, generator=lambda p: p, k=2)

def test_rag():
    print(query1 + "\n")
    a = test_rag_pipeline.query(query1) # Identity generator returns the assembled prompt unchanged
    print(a)

test_rag()

What are dynamic baselines important in measuring carbon credits?


You are a world-class, state-of-the-art expert on Carbon Capture Crediting. 
You have been asked a question. The following is some context relevant to the topic: 
CONTEXT 1 
technology july 7 2022 4 ways dynamic baselines transform carbon crediting today static approach baselines fails capture constantly evolving nature land use forests believe tech-driven dynamic approach transform carbon crediting four key ways last year planet lost 25.3 million hectares forest area greater size united kingdom.1 emissions associated forest loss sizable equivalent much one-tenth one-quarter fossil fuel emissions year.2 3 carbon market offers way pay landowners protect forests reduce carbon emissions paying landowners already protecting forests clearly reduce emissions carbon market produce net climate benefits credits must awarded landowners protected forest otherwise know landowner would done received funds carbon project ‘ scenario 

In [59]:
for q in queries:
    print(q + "\n")
    a = rag.query(q)
    print(a + "\n")
    print("-----------------")

What are dynamic baselines important in measuring carbon credits?

Dynamic baselines are important in measuring carbon credits for several key reasons:

1. **Reflect Real-Time Conditions**: Unlike static baselines, which are fixed and may not account for changes in land use, environmental conditions, or socio-economic factors over time, dynamic baselines utilize ongoing data collection to reflect the current state of emissions and carbon sequestration. This allows for a more accurate representation of the impacts of carbon projects.

2. **Adaptability to Change**: The environment is constantly changing due to factors such as climate change, land management practices, and economic activities. Dynamic baselines can adjust to these changes, ensuring that the carbon credit calculations remain relevant and accurate. This adaptability helps maintain the credibility and integrity of carbon markets.

3. **Improved Prediction of Future Emissions**: By incorporating advanced technologies, such a

## Sandbox

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example text chunks
chunks = [
    "This is the first chunk of text.",
    "Here is another chunk, different from the first one.",
    "This chunk is also a piece of text."
]

# Tokenize and generate embeddings
inputs = tokenizer(chunks, padding=True, truncation=True, return_tensors='pt')

# HF models are still hanging on load :/

In [None]:
inputs

In [None]:
with torch.no_grad():
    outputs = model(**inputs)

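# Get the embeddings by mean-pooling token embeddings, using the attention
# mask so that padding positions do not dilute the average
mask = inputs["attention_mask"].unsqueeze(-1).type_as(outputs.last_hidden_state)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)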

# Print the embeddings
for chunk, embedding in zip(chunks, embeddings):
    print(f"Chunk: {chunk}")
    print(f"Embedding: {embedding.numpy()}\n")


In [48]:
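# Note: the outputs below come from an earlier version of generator that
# returned the raw response object; the current version returns only the text
r = generator("test")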

ChatCompletion(id='chatcmpl-9nXqL33h7uBsAICAOesJJbTVzAXRV', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Test received! How can I assist you today?', role='assistant', function_call=None, tool_calls=None))], created=1721595193, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_661538dc1f', usage=CompletionUsage(completion_tokens=10, prompt_tokens=8, total_tokens=18))


In [51]:
r.choices

[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Test received! How can I assist you today?', role='assistant', function_call=None, tool_calls=None))]