# Simple Retrieval Augmented Generation (RAG)

This is an example of a simple RAG pipeline. Retrieval is done manually and the results are fed into the LLM at the very beginning. No agent is involved, so it can be used with weaker models.

We will use [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss/) as the vector database (vectorstore).

In [1]:
# !pip install faiss-cpu langchain_community langchain_openai pandas python-dotenv tqdm

In [2]:
import os
from uuid import uuid4

import faiss
import pandas as pd
from dotenv import load_dotenv
from langchain_community.docstore import InMemoryDocstore
from langchain_community.embeddings import LlamafileEmbeddings
from langchain_community.llms.llamafile import Llamafile
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.prompts.prompt import PromptTemplate
from langchain_core.runnables import chain
from langchain_openai import ChatOpenAI
from tqdm import tqdm

In [3]:
load_dotenv()

True

## Load dataset and generate documents

We use the [Complete Pokédex dataset](https://www.kaggle.com/datasets/cristobalmitchell/pokedex/data), a dataset about Pokémon
that is unfortunately not that complete.

I've included a variant of it in the `data` folder: the original file re-saved as comma-separated
UTF-8, since the original is hard to open directly with pandas.

In [4]:
df = pd.read_csv("../data/pokemon.csv")

In [5]:
df

Unnamed: 0,national_number,gen,english_name,japanese_name,primary_type,secondary_type,classification,percent_male,percent_female,height_m,...,evochain_1,evochain_2,evochain_3,evochain_4,evochain_5,evochain_6,gigantamax,mega_evolution,mega_evolution_alt,description
0,1,I,Bulbasaur,Fushigidane,grass,poison,Seed Pokémon,88.14,11.86,0.7,...,Level,Ivysaur,Level,Venusaur,,,,,,There is a plant seed on its back right from t...
1,2,I,Ivysaur,Fushigisou,grass,poison,Seed Pokémon,88.14,11.86,1.0,...,Level,Ivysaur,Level,Venusaur,,,,,,"When the bulb on its back grows large, it appe..."
2,3,I,Venusaur,Fushigibana,grass,poison,Seed Pokémon,88.14,11.86,2.0,...,Level,Ivysaur,Level,Venusaur,,,Gigantamax Venusaur,Mega Venusaur,,Its plant blooms when it is absorbing solar en...
3,4,I,Charmander,Hitokage,fire,,Lizard Pokémon,88.14,11.86,0.6,...,Level,Charmeleon,Level,Charizard,,,,,,It has a preference for hot things. When it ra...
4,5,I,Charmeleon,Lizardo,fire,,Flame Pokémon,88.14,11.86,1.1,...,Level,Charmeleon,Level,Charizard,,,,,,"It has a barbaric nature. In battle, it whips ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
893,894,VIII,Regieleki,Regieleki,electric,,Electron Pokémon,,,1.2,...,,,,,,,,,,This Pokémon is a cluster of electrical energy...
894,895,VIII,Regidrago,Regidrago,dragon,,Dragon Orb Pokémon,,,2.1,...,,,,,,,,,,An academic theory proposes that Regidrago’s a...
895,896,VIII,Glastrier,Blizapos,ice,,Wild Horse Pokémon,,,2.2,...,,,,,,,,,,Glastrier emits intense cold from its hooves. ...
896,897,VIII,Spectrier,Rayspos,ghost,,Swift Horse Pokémon,,,2.0,...,,,,,,,,,,It probes its surroundings with all its senses...


In [8]:
def generate_document(row):
    doc = f"""Number: {row["national_number"]:03}
Generation: {row["gen"]}
Name: {row["english_name"]}
Type: {row["primary_type"]}{" / " + row["secondary_type"] if pd.notnull(row["secondary_type"]) else ""}
Species: {row["classification"]}
Abilities: {", ".join([x for x in [row["abilities_0"], row["abilities_1"], row["abilities_2"], row["abilities_hidden"]] if pd.notnull(x)])}
Evolutions: {" -> ".join([x for x in [row["evochain_0"], row["evochain_2"], row["evochain_4"], row["evochain_6"]] if pd.notnull(x)])}
Description: {row["description"]}"""

    return Document(page_content=doc, metadata={})

In [9]:
df["document"] = df.apply(generate_document, axis=1)

In [10]:
print(df.loc[0, "document"].page_content)

Number: 001
Generation: I
Name: Bulbasaur
Type: grass / poison
Species: Seed Pokémon
Abilities: Overgrow, Chlorophyll
Evolutions: Bulbasaur -> Ivysaur -> Venusaur
Description: There is a plant seed on its back right from the day this Pokémon is born. The seed slowly grows larger.


To shorten run time, we will only use Pokémon from generation I. If you have more compute, or access to a faster embeddings API, you can use the full dataset.

In [11]:
df = df[df["gen"] == "I"]

## Define embedding and model

We are using the `mxbai-embed-large-v1` Llamafile for embeddings, and `Llama3.2-3B` for the model itself. Note that `Llama3.2-3B` is a weak model that does not support tool calling natively, so building an agent with it is not possible.

In [12]:
embeddings = LlamafileEmbeddings()
llm = Llamafile(base_url="http://localhost:8081", temperature=0)

# llm = ChatOpenAI(
#     base_url=os.getenv("OPENAI_API_ENDPOINT"),
#     api_key=os.getenv("OPENAI_API_KEY"),
#     model="gpt-4o",
#     temperature=0,
# )

## Define vectorstore

In [13]:
embed_dim = len(embeddings.embed_query("Hello World"))
index = faiss.IndexFlatL2(embed_dim)

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
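
`faiss.IndexFlatL2` is an exact, brute-force index: it compares the query against every stored vector and ranks by squared Euclidean distance. The idea can be sketched in plain Python as below. The helper names `squared_l2` and `flat_l2_search` are hypothetical and not part of FAISS, and the toy 2-D vectors stand in for real embeddings.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Brute-force search: score every stored vector, keep the k closest.

    Returns (index, distance) pairs sorted by ascending distance,
    so the first pair is the best match.
    """
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy data (an assumption for illustration, not real embeddings).
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
print(flat_l2_search([1.0, 0.0], toy_vectors, k=2))
```

A flat index is a reasonable fit here because the collection is tiny; approximate indexes mainly pay off for much larger collections.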

## Load documents into vectorstore

The rows were already converted to LangChain's `Document` format above, which is what the vector store consumes. Our documents are very short, so no chunking is necessary.

In [14]:
documents = list(df["document"])
uuids = [str(uuid4()) for _ in range(len(documents))]
for document, uid in zip(tqdm(documents), uuids):
    vector_store.add_documents(documents=[document], ids=[uid])

100%|██████████| 151/151 [06:16<00:00,  2.49s/it]
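
Adding one document per call keeps the progress bar fine-grained. A possible middle ground is to add documents in small batches, sketched below with a hypothetical helper `batched_pairs` (not part of LangChain). Whether this is actually faster depends on whether the embedding backend can process several texts per request, which is an assumption to verify for your setup.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def batched_pairs(
    items: Sequence[T], ids: Sequence[U], batch_size: int
) -> Iterator[Tuple[List[T], List[U]]]:
    """Yield aligned (items, ids) slices of at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if len(items) != len(ids):
        raise ValueError("items and ids must have the same length")
    for start in range(0, len(items), batch_size):
        stop = start + batch_size
        yield list(items[start:stop]), list(ids[start:stop])


# With the notebook's objects this would look like:
# for docs, batch_ids in batched_pairs(documents, uuids, batch_size=16):
#     vector_store.add_documents(documents=docs, ids=batch_ids)
for docs, batch_ids in batched_pairs(["a", "b", "c"], [1, 2, 3], batch_size=2):
    print(docs, batch_ids)
```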


In [15]:
# save the vectorstore locally; uncomment the lines below to load it back

vector_store.save_local("../pokemon_faiss_index")

# vector_store = FAISS.load_local(
#     "../pokemon_faiss_index", embeddings, allow_dangerous_deserialization=True
# )

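Test the vectorstore a bit. Because the index is `IndexFlatL2`, the score printed below is a distance (despite the `SIM` label), so lower means more similar.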

In [16]:
results = vector_store.similarity_search_with_score(
    "Which pokémon likes sleeping?", k=3, filter={}
)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content}")

* [SIM=0.629744] Number: 061
Generation: I
Name: Poliwhirl
Type: water
Species: Tadpole Pokémon
Abilities: Water Absorb, Damp, Swift Swim
Evolutions: Poliwag -> Poliwhirl -> Poliwrath -> Politoed
Description: Staring at the swirl on its belly causes drowsiness. This trait of Poliwhirl’s has been used in place of lullabies to get children to go to sleep.
* [SIM=0.669929] Number: 097
Generation: I
Name: Hypno
Type: psychic
Species: Hypnosis Pokémon
Abilities: Insomnia, Forewarn, Inner Focus
Evolutions: Drowzee -> Hypno
Description: Avoid eye contact if you come across one. It will try to put you to sleep by using its pendulum.
* [SIM=0.689312] Number: 143
Generation: I
Name: Snorlax
Type: normal
Species: Sleeping Pokémon
Abilities: Immunity, Thick Fat, Gluttony
Evolutions: Munchlax -> Snorlax
Description: It is not satisfied unless it eats over 880 pounds of food every day. When it is done eating, it goes promptly to sleep.


Turn the vectorstore into a LangChain retriever so it can be used in a chain.

In [17]:
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 20, "lambda_mult": 0.5},
)
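
The `mmr` search type stands for maximal marginal relevance: `fetch_k` candidates are fetched first, then `k` of them are picked one at a time, trading relevance to the query against similarity to what has already been picked. `lambda_mult` sets the balance (1 favors relevance only, 0 favors diversity only). Below is a minimal pure-Python sketch of that general idea. It is not LangChain's implementation, and the names `cosine` and `mmr_select` and the toy vectors are assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick up to k candidate indices by maximal marginal relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def mmr_score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Redundancy: similarity to the closest already-selected candidate.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Two near-duplicate candidates (0 and 1) and one distinct candidate (2).
toy_candidates = [[1.0, 0.1], [1.0, 0.12], [1.0, -0.5]]
print(mmr_select([1.0, 0.0], toy_candidates, k=2, lambda_mult=0.5))
print(mmr_select([1.0, 0.0], toy_candidates, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.5` the second pick skips the near-duplicate in favor of the more distinct candidate, while `lambda_mult=1.0` reduces to plain relevance ranking.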

In [18]:
retr_res = retriever.invoke("Which pokémon is the cutest?")
for doc in retr_res:
    print(doc.page_content)
    print("----")

Number: 039
Generation: I
Name: Jigglypuff
Type: normal / fairy
Species: Balloon Pokémon
Abilities: Cute Charm, Competitive, Friend Guard
Evolutions: Igglybuff -> Jigglypuff -> Wigglytuff
Description: Jigglypuff has top-notch lung capacity, even by comparison to other Pokémon. It won’t stop singing its lullabies until its foes fall asleep.
----
Number: 127
Generation: I
Name: Pinsir
Type: bug
Species: Stagbeetle Pokémon
Abilities: Hyper Cutter, Mold Breaker, Moxie
Evolutions: Pinsir
Description: These Pokémon judge one another based on pincers. Thicker, more impressive pincers make for more popularity with the opposite gender.
----
Number: 025
Generation: I
Name: Pikachu
Type: electric
Species: Mouse Pokémon
Abilities: Static, Lightning Rod
Evolutions: Pichu -> Pikachu -> Raichu
Description: Pikachu that can generate powerful electricity have cheek sacs that are extra soft and super stretchy.
----


## Define chain and invoke

### 1. Define chain

In [19]:
prompt = PromptTemplate.from_template("""You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:
""")

In [20]:
@chain
def get_context(question):
    """Get context from the vectorstore given the question, then put them into a format suitable as input to the prompt."""
    return {
        "question": question,
        "context": retriever.invoke(question),
    }
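
`get_context` hands the retrieved list of `Document` objects straight to the prompt, so, as far as I can tell, the template renders the list's string form (wrapper and metadata included) rather than clean text. One possible variant joins the `page_content` of each document first. The helper name `format_docs` is hypothetical, and `SimpleNamespace` objects stand in for LangChain documents so the sketch runs on its own.

```python
from types import SimpleNamespace
from typing import Iterable


def format_docs(docs: Iterable, separator: str = "\n\n") -> str:
    """Join the text of retrieved documents into a single context string."""
    return separator.join(doc.page_content for doc in docs)


# Stand-ins for LangChain Document objects; only `.page_content` is used.
fake_docs = [
    SimpleNamespace(page_content="Name: Snorlax\nSpecies: Sleeping Pokémon"),
    SimpleNamespace(page_content="Name: Hypno\nSpecies: Hypnosis Pokémon"),
]
print(format_docs(fake_docs))
```

In the chain above this would mean returning `"context": format_docs(retriever.invoke(question))` instead of the raw list.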

### 2. Invoke

In [21]:
question = "Which pokémon likes sleeping?"

res = (get_context | prompt | llm).invoke(question)

In [22]:
print(res)

Snorlax is the Pokémon that likes sleeping. It goes promptly to sleep after eating a large amount of food. Snorlax's need for food and subsequent sleep are described in its Pokédex entry.<|eot_id|>


Note that Snorlax is not the most similar item according to the vectorstore, but the model is able to pick the (more or less) correct answer.

In [23]:
question2 = "Which pokémon is the cutest?"

res2 = (get_context | prompt | llm).invoke(question2)

In [24]:
print(res2)

I don't know. The provided context doesn't mention any pokémon as being the cutest, and it only includes descriptions of Jigglypuff's singing abilities, Pinsir's pincers, and Pikachu's cheek sacs. There is no clear indication of which pokémon might be considered cute.<|eot_id|>


It seems the model is reluctant to answer anything subjective.