# RAG (with LanceDB and LlamaParse)

## LlamaParse

In [3]:
!pip install llama-index-core llama-parse llama-index-readers-file python-dotenv


In [1]:
pdf_files = ["ChatLLM_Network.pdf", "Cognitive_Architectures_for_Language_Agents.pdf"]

### Load and Parse PDF file using LlamaParse

In [2]:
import nest_asyncio
nest_asyncio.apply()
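Before parsing, it helps to confirm that the credentials the notebook depends on are actually set. The sketch below is a minimal check, under two assumptions: that LlamaParse picks up its key from the `LLAMA_CLOUD_API_KEY` environment variable when no `api_key` argument is passed, and that the chat model later reads `AVALAI_API_KEY` as shown in the RAG section. The helper name `missing_env_vars` is hypothetical. If the keys live in a `.env` file, calling `load_dotenv()` from `python-dotenv` (installed above) beforehand should copy them into `os.environ`.

```python
import os

# Variables this notebook appears to rely on (an assumption, see the text above).
REQUIRED_ENV_VARS = ("LLAMA_CLOUD_API_KEY", "AVALAI_API_KEY")


def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.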

In [3]:
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

# set up parser
parser = LlamaParse(result_type="text")

file_extractor = {".pdf": parser}

data_for_parse = SimpleDirectoryReader(input_files=pdf_files, file_extractor=file_extractor)
data_for_parse

<llama_index.core.readers.file.base.SimpleDirectoryReader at 0x1768fecd0>

In [4]:
documents = data_for_parse.load_data()
documents

Started parsing the file under job_id 058676da-8f9b-430c-bc36-b1e1da8c6139
Started parsing the file under job_id 8b14c8cb-a6cc-4217-8de1-23e2c94001d0


[Document(id_='d8e107bc-267a-4918-9831-64028f5ff03f', embedding=None, metadata={'file_path': 'ChatLLM_Network.pdf', 'file_name': 'ChatLLM_Network.pdf', 'file_type': 'application/pdf', 'file_size': 925390, 'creation_date': '2025-01-08', 'last_modified_date': '2025-01-08'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='ChatLLM Network: More brains, More intelligence\n                              Rui Hao†\n                  School of Computer Science\n\n    Beijing University of Posts and Telecommunications\n                      haorui@bupt.edu.cn\n\n           Linmei Hu ∗ †\n\n  School of Computer Science\nBeijing Institute of Technology

In [5]:
len(documents)

45

### Chunk files

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=64,
    length_function=len,
    is_separator_regex=False,
)
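To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch of overlapping chunks. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter first breaks text on separators (paragraphs, lines, spaces) and then merges the pieces up to `chunk_size`, whereas this hypothetical `sliding_window_chunks` helper just slides a fixed-width window over the characters.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk starts with the last character of the previous one, which is the same idea as the 64-character overlap configured above: context that straddles a chunk boundary still appears intact in at least one chunk.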

In [7]:
documents_list = []
page_number = 0
last_doc = None
for doc in documents:
    if last_doc is None or last_doc != doc.metadata["file_name"]:
        page_number = 1
        last_doc = doc.metadata["file_name"]
    else:
        page_number += 1

    texts = text_splitter.split_text(doc.text)
    for text in texts:
        item = {}
        item["id_"] = doc.id_
        item["text"] = text
        item["metadata_file_name"] = doc.metadata["file_name"]
        item["metadata_creation_date"] = doc.metadata["creation_date"]
        item["metadata_pagenumber"] = page_number
        documents_list.append(item)



In [8]:
len(documents_list)

311

### Chunks to Pandas DataFrame

In [9]:
import pandas as pd

df = pd.DataFrame(documents_list)
df

Unnamed: 0,id_,text,metadata_file_name,metadata_creation_date,metadata_pagenumber
0,d8e107bc-267a-4918-9831-64028f5ff03f,"ChatLLM Network: More brains, More intelligenc...",ChatLLM_Network.pdf,2025-01-08,1
1,d8e107bc-267a-4918-9831-64028f5ff03f,Yirui Zhang\n ...,ChatLLM_Network.pdf,2025-01-08,1
2,d8e107bc-267a-4918-9831-64028f5ff03f,Abstract\n ...,ChatLLM_Network.pdf,2025-01-08,1
3,d8e107bc-267a-4918-9831-64028f5ff03f,ChatLLM network that allows multiple dialogue-...,ChatLLM_Network.pdf,2025-01-08,1
4,d8e107bc-267a-4918-9831-64028f5ff03f,network attains significant improvements in pr...,ChatLLM_Network.pdf,2025-01-08,1
...,...,...,...,...,...
306,ca2527a2-6594-4b63-9206-bc7d26a111af,"S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. ...",Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,31
307,ca2527a2-6594-4b63-9206-bc7d26a111af,"Society, volume 45, 2023a.\nT. Zhang, F. Liu, ...",Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,31
308,ca2527a2-6594-4b63-9206-bc7d26a111af,31,Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,31
309,875aba78-cbf0-4519-95de-05fc7c299b5a,Published in Transactions on Machine Learning ...,Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,32


## LanceDB

In [1]:
!pip install lancedb pandas sentence_transformers


### Connect to DB

In [10]:
import lancedb
db = lancedb.connect(".lancedb")



### Define the embedding function

In [11]:
from lancedb.embeddings import get_registry
# device="mps" targets Apple-silicon GPUs; use "cpu" or "cuda" on other hardware.
embedding_model = get_registry().get("sentence-transformers").create(name="BAAI/bge-small-en-v1.5", device="mps")

### Define the data model or schema

In [12]:
# You should put HF_TOKEN in the notebook environment variables
from lancedb.pydantic import LanceModel, Vector

class ChunksOfData(LanceModel):
    text: str = embedding_model.SourceField()
    metadata_file_name: str
    metadata_creation_date: str
    metadata_pagenumber: int
    vector: Vector(embedding_model.ndims()) = embedding_model.VectorField()

### Create table and add data

In [13]:
def df_to_dict_batches(df: pd.DataFrame, batch_size: int = 128):
    """
    Yields data from a DataFrame in batches of dictionaries.
    Each batch is a list of dict, suitable for LanceDB ingestion.
    """
    for start_idx in range(0, len(df), batch_size):
        end_idx = start_idx + batch_size
        # Convert the batch of rows to a list of dict
        batch_dicts = df.iloc[start_idx:end_idx].to_dict(orient="records")
        yield batch_dicts

tbl = db.create_table(
    "embedded_chunks3",
    data=df_to_dict_batches(df, batch_size=10),
    schema=ChunksOfData,
)
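The batching arithmetic can be checked without pandas or LanceDB. The sketch below uses a hypothetical helper, `batch_bounds`, to list the row slices that a loop like the one in `df_to_dict_batches` would walk through, using the 311 chunks and `batch_size=10` from this notebook.

```python
def batch_bounds(n_rows, batch_size):
    """Return the (start, end) row slices that a batching loop would produce."""
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]


bounds = batch_bounds(311, 10)
print(len(bounds))  # number of batches
print(bounds[-1])   # slice covered by the final, partial batch
```

The final batch is shorter than the others, which is fine here because `DataFrame.iloc` slicing simply stops at the last row.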

### Querying your table

In [14]:
query = "What does CoALA stands for?"
res = tbl.search(query).limit(5).to_pandas()
res


Unnamed: 0,text,metadata_file_name,metadata_creation_date,metadata_pagenumber,vector,_distance
0,making process to choose actions. We use CoALA...,Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,1,"[-0.053947296, 0.03000264, -0.013586269, -0.02...",0.569758
1,Published in Transactions on Machine Learning ...,Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,12,"[-0.09369078, 0.025953628, 0.015643774, -0.011...",0.637318
2,Agent design: thinking beyond simple reasoning...,Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,15,"[-0.07036838, 0.036759418, -0.0069561503, -0.0...",0.643972
3,4\n Cognitive Architectures for Languag...,Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,8,"[-0.07170645, -0.004507213, -0.022157056, 0.01...",0.646239
4,Decision Procedure ...,Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,8,"[-0.06418777, 0.029837094, -0.022896353, -0.01...",0.665226


#### Hybrid Search

In [15]:
query = "What does CoALA stands for?"
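# Hybrid search needs a full-text index on the text column; use_tantivy=False selects LanceDB's native FTS index.
tbl.create_fts_index('text', use_tantivy=False)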
tbl.search(query, query_type="hybrid").limit(5).to_pandas()

Unnamed: 0,text,metadata_file_name,metadata_creation_date,metadata_pagenumber,vector,_relevance_score
0,Agent design: thinking beyond simple reasoning...,Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,15,"[-0.07036838, 0.036759418, -0.0069561503, -0.0...",0.031498
1,4\n Cognitive Architectures for Languag...,Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,8,"[-0.07170645, -0.004507213, -0.022157056, 0.01...",0.031498
2,making process to choose actions. We use CoALA...,Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,1,"[-0.053947296, 0.03000264, -0.013586269, -0.02...",0.016393
3,"just choose their preferred framing, as long a...",Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,18,"[-0.04917023, -0.039094172, 0.012346595, 0.007...",0.016393
4,Published in Transactions on Machine Learning ...,Cognitive_Architectures_for_Language_Agents.pdf,2025-01-08,12,"[-0.09369078, 0.025953628, 0.015643774, -0.011...",0.016129
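The `_relevance_score` values in this table (0.031498, 0.016393, 0.016129) are consistent with reciprocal rank fusion (RRF) with a constant of 60, where each result scores the sum of 1 / (60 + rank) over the rankings it appears in. The sketch below is a plain-Python illustration of that formula, not LanceDB's own code, and it assumes RRF is the reranker in effect. The vector ranking is taken from the vector search output earlier in this section. The full-text ranking is not shown in the notebook, so `fts_ranking` is a hypothetical ordering chosen to agree with the table, and the short ids are stand-ins for the chunk texts.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into {id: score}, highest score first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Order of the five vector search hits (labels are shorthand for the chunk texts).
vector_ranking = ["making", "published", "agent_design", "cognitive", "decision"]
# Hypothetical full-text ranking, chosen so the fused scores match the table.
fts_ranking = ["just_choose", "other", "cognitive", "agent_design"]

fused = reciprocal_rank_fusion([vector_ranking, fts_ranking])
for doc_id, score in fused.items():
    print(f"{doc_id:13s} {score:.6f}")
```

Under these assumptions, chunks found by both searches move to the top even when neither search ranked them first, which matches how the first two rows of the hybrid result differ from the pure vector result.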


## RAG

In [16]:
import os
from langchain_openai import ChatOpenAI

AVALAI_BASE_URL = "https://api.avalai.ir/v1"
GPT_MODEL_NAME = "gpt-4o-mini"

gpt4o_chat = ChatOpenAI(model=GPT_MODEL_NAME,
                        base_url=AVALAI_BASE_URL,
                        api_key=os.environ["AVALAI_API_KEY"])

In [17]:
query = "What does CoALA stands for?"
context_list = tbl.search(query, query_type="hybrid").limit(5).to_list()
context_list

[{'text': 'Agent design: thinking beyond simple reasoning. CoALA defines agents over three distinct concepts: (i)\ninternal memory, (ii) a set of possible internal and external actions, and (iii) a decision making procedure over\nthose actions. Using CoALA to develop an application-specific agent consists of specifying implementations\nfor each of these components in turn. We assume that the agent’s environment and external action space are\ngiven, and show how CoALA can be used to determine an appropriate high-level architecture. For example,\nwe can imagine designing a personalized retail assistant (Yao et al., 2022a) that helps users find relevant items\nbased on their queries and purchasing history. In this case, the external actions would consist of dialogue or\nreturning search results to the user.',
  'metadata_file_name': 'Cognitive_Architectures_for_Language_Agents.pdf',
  'metadata_creation_date': '2025-01-08',
  'metadata_pagenumber': 15,
  'vector': [-0.07036837935447693,
 

In [18]:
context = ''.join([f"{c['text']}\n\n" for c in context_list])

print(context)

Agent design: thinking beyond simple reasoning. CoALA defines agents over three distinct concepts: (i)
internal memory, (ii) a set of possible internal and external actions, and (iii) a decision making procedure over
those actions. Using CoALA to develop an application-specific agent consists of specifying implementations
for each of these components in turn. We assume that the agent’s environment and external action space are
given, and show how CoALA can be used to determine an appropriate high-level architecture. For example,
we can imagine designing a personalized retail assistant (Yao et al., 2022a) that helps users find relevant items
based on their queries and purchasing history. In this case, the external actions would consist of dialogue or
returning search results to the user.

4
        Cognitive Architectures for Language Agents (CoALA): A Conceptual Framework
We present Cognitive Architectures for Language Agents (CoALA) as a framework to organize existing                 

In [19]:
system_prompt = "Answer user query based on the given context."
user_prompt = f"Question:\n{query}\nContext:\n{context}"
print(user_prompt)

Question:
What does CoALA stands for?
Context:
Agent design: thinking beyond simple reasoning. CoALA defines agents over three distinct concepts: (i)
internal memory, (ii) a set of possible internal and external actions, and (iii) a decision making procedure over
those actions. Using CoALA to develop an application-specific agent consists of specifying implementations
for each of these components in turn. We assume that the agent’s environment and external action space are
given, and show how CoALA can be used to determine an appropriate high-level architecture. For example,
we can imagine designing a personalized retail assistant (Yao et al., 2022a) that helps users find relevant items
based on their queries and purchasing history. In this case, the external actions would consist of dialogue or
returning search results to the user.

4
        Cognitive Architectures for Language Agents (CoALA): A Conceptual Framework
We present Cognitive Architectures for Language Agents (CoALA) as a 

In [20]:
from langchain_core.messages import HumanMessage, SystemMessage

messages = [
    SystemMessage(system_prompt),
    HumanMessage(user_prompt),
]
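One possible refinement, sketched below under stated assumptions, is to label every retrieved chunk with its file name and page number so the model can point to its sources. The helpers `format_context` and `build_user_prompt` are hypothetical and not part of the notebook; they only rely on the `text`, `metadata_file_name`, and `metadata_pagenumber` keys that `to_list()` returns above. The example chunk is a shortened stand-in for a real search hit.

```python
def format_context(chunks):
    """Join retrieved chunks into one context string, labelling each with its source."""
    blocks = []
    for chunk in chunks:
        source = f"[{chunk['metadata_file_name']}, page {chunk['metadata_pagenumber']}]"
        blocks.append(f"{source}\n{chunk['text'].strip()}")
    return "\n\n".join(blocks)


def build_user_prompt(query, chunks):
    """Build the user message in the same Question/Context layout used above."""
    return f"Question:\n{query}\nContext:\n{format_context(chunks)}"


# Shortened stand-in for one entry of `context_list`.
example_chunks = [
    {
        "text": "CoALA defines agents over three distinct concepts.",
        "metadata_file_name": "Cognitive_Architectures_for_Language_Agents.pdf",
        "metadata_pagenumber": 15,
    },
]
print(build_user_prompt("What does CoALA stand for?", example_chunks))
```

With the real data, `build_user_prompt(query, context_list)` would take the place of the `user_prompt` f-string.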

In [21]:
response = gpt4o_chat.invoke(messages)
response.pretty_print()


CoALA stands for **Cognitive Architectures for Language Agents**.


In [22]:
response = gpt4o_chat.invoke("What does CoALA stands for?")
response.pretty_print()


CoALA stands for "Coalition for Algorithmic Accountability." It is an initiative focused on promoting transparency and accountability in algorithmic decision-making processes, particularly in the context of technology and data usage. The coalition typically involves various stakeholders, including civil society organizations, researchers, and policy advocates, working together to address the ethical implications of algorithms and their impact on society. If you have a specific context in mind or if there are other meanings for CoALA, please let me know!
