## Background

Llama 3.2 is a family of large language models (LLMs) developed by Meta. It follows the earlier Llama 3.1 release and includes lightweight text models such as the 3B instruct model used in this tutorial. These models can be used for a variety of tasks, including language understanding and Retrieval-Augmented Generation (RAG).

LangChain is an open-source framework that simplifies the creation of LLM applications through the use of "chains." Chains are LangChain-specific components that can be combined for a variety of AI use cases, including RAG.

By integrating Atlas Vector Search with LangChain, you can use Atlas as a vector database and use Atlas Vector Search to implement RAG by retrieving semantically similar documents from your data.

## Prerequisites

To complete this tutorial, you must have the following:

- Llama 3.2 running on a GPU, or llama.cpp running locally on your GPU or CPU.

- A free MongoDB Atlas account. Signing up gives you a free Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs).

- An environment to run interactive Python notebooks, such as Colab or a local Jupyter notebook.

In [None]:
%pip install --upgrade --quiet langchain python-dotenv langchain-community langchain-core langchain-mongodb langchain-openai pymongo pypdf ipykernel notebook datasets transformers[torch] bitsandbytes pandas pillow sentence-transformers llama-cpp-python

In [25]:
import getpass, os, pymongo, pprint, json
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_core.prompts import PromptTemplate
from langchain_core.documents import Document
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

In [26]:
from datasets import load_dataset
ds = load_dataset("mychen76/invoices-and-receipts_ocr_v1", split="test")
ds

Dataset({
    features: ['image', 'id', 'parsed_data', 'raw_data'],
    num_rows: 125
})

In [27]:
ATLAS_CONNECTION_STRING = getpass.getpass("MongoDB Atlas SRV Connection String:")

NOTE:
- Your connection string should use the following format:

`mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net`
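
Before connecting, it can help to check that the string you pasted has the expected shape. The following is a minimal sketch that uses only the standard library. `looks_like_srv_uri` is a hypothetical helper name, not part of PyMongo, and it checks the format only, not whether the credentials are valid. It also assumes that any special characters in the password are already percent-encoded.

```python
from urllib.parse import urlsplit

def looks_like_srv_uri(uri: str) -> bool:
    """Return True if uri has the mongodb+srv://user:password@host shape."""
    parts = urlsplit(uri)
    return (
        parts.scheme == "mongodb+srv"
        and bool(parts.username)
        and bool(parts.password)
        and bool(parts.hostname)
        and parts.hostname.endswith(".mongodb.net")
    )

# Placeholder values, not real credentials.
print(looks_like_srv_uri("mongodb+srv://user:secret@cluster0.abcde.mongodb.net"))
print(looks_like_srv_uri("mongodb://localhost:27017"))
```

The first call prints `True` and the second prints `False`, because the second string uses the plain `mongodb://` scheme and has no credentials.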

In [None]:
# Connect to your Atlas cluster
client = MongoClient(ATLAS_CONNECTION_STRING)

# Define collection and index name
db_name = "langchain_db"
collection_name = "invoice_llama"
atlas_collection = client[db_name][collection_name]
vector_search_index = "invoice_llama_vector_index"

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Create a LangChain-compatible embedding object backed by the
# all-MiniLM-L6-v2 sentence-transformers model (384-dimensional vectors)
embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

# Prepare documents for vector store
documents = []
for item in ds:
    parsed_data = json.loads(item['parsed_data'])
    content = f"Invoice ID: {item['id']}\n"
    content += f"Parsed Data: {parsed_data}\n"
    content += f"Raw Data: {item['raw_data']}"
    
    doc = Document(
        page_content=content,
        metadata={
            "id": item['id']
        }
    )
    documents.append(doc)

# Create the vector store
vector_store = MongoDBAtlasVectorSearch.from_documents(
    documents=documents,
    embedding=embeddings,
    collection=atlas_collection,
    index_name=vector_search_index
)

print(f"Vector store created with {len(documents)} documents.")

_id or id key found in metadata. Please pop from each dict and input as separate list.Retrieving methods will include the same id as '_id' in metadata.


Vector store created with 125 documents.
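
The vector search index in this tutorial declares `numDimensions` as 384, so the embedding model must return vectors of that length. The sketch below shows one way to check this before creating the index. `embedding_dimension` and `FakeEmbedder` are hypothetical names introduced for illustration. The stand-in class only mimics the `embed_query` method of a LangChain embeddings object so that the example runs offline.

```python
def embedding_dimension(embedder, sample_text: str = "dimension probe") -> int:
    """Embed one short string and return the length of the resulting vector."""
    return len(embedder.embed_query(sample_text))

class FakeEmbedder:
    """Stand-in for the tutorial's embeddings object (assumption for the demo)."""
    def embed_query(self, text):
        return [0.0] * 384

dim = embedding_dimension(FakeEmbedder())
assert dim == 384, f"index expects 384 dimensions, model returned {dim}"
print(dim)
```

In the notebook, you would pass the real `embeddings` object instead of `FakeEmbedder()`.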


In [None]:
# Create your index model, then create the search index
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 384,
                "similarity": "cosine"
            },
            {
                "type": "filter",
                "path": "id",
            }
        ]
    },
    name="invoice_llama_vector_index",
    type="vectorSearch"
)

atlas_collection.create_search_index(model=search_index_model)

'invoice_llama_vector_index'
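
Atlas builds search indexes asynchronously, so a query issued right after `create_search_index` may return no results. The sketch below polls until the index reports that it is queryable. It assumes that `list_search_indexes` returns index documents with a `queryable` field. `wait_until_queryable` and the stand-in collection are hypothetical and exist only so the example runs without a cluster.

```python
import time

def wait_until_queryable(collection, index_name, timeout_s=120.0, poll_s=5.0):
    """Poll list_search_indexes until the named index reports queryable=True."""
    deadline = time.monotonic() + timeout_s
    while True:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)

class FakeCollection:
    """Stand-in that becomes queryable on the third poll (assumption for the demo)."""
    def __init__(self):
        self.calls = 0
    def list_search_indexes(self, name):
        self.calls += 1
        return iter([{"name": name, "queryable": self.calls >= 3}])

print(wait_until_queryable(FakeCollection(), "demo_index", timeout_s=5, poll_s=0.01))
```

With the tutorial's objects, the call would be `wait_until_queryable(atlas_collection, vector_search_index)`.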

## Semantic Search
Searching for invoice ID 38:

In [None]:
query = "ARTIE'S DELICATESSEN, 2290 BROADWAY, New YorkNY 10024, 212579-5959"
results = vector_store.similarity_search(query)

pprint.pprint(results)

[Document(metadata={'_id': '66fb9ead88af6025598cf484', 'id': '38'}, page_content='Invoice ID: 38\nParsed Data: {\'xml\': \'<s_receipt><s_total>$48.40</s_total><s_tips></s_tips><s_time>7:43:29PM</s_time><s_telephone>(212)579-5959</s_telephone><s_tax>3.95</s_tax><s_subtotal>44.45</s_subtotal><s_store_name>ARTIE’SDELICATESSEN</s_store_name><s_store_addr>2290BROADWAY NewYork,NY10024</s_store_addr><s_line_items><s_item_value>16.95</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>SupremeBurger</s_item_name><s_item_key></s_item_key><sep/><s_item_value>15.25</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>Medium</s_item_name><s_item_key></s_item_key><sep/><s_item_value>2.25</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>PastramiBurger</s_item_name><s_item_key></s_item_key><sep/><s_item_value>4.50</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>Medium</s_item_name><s_item_key></s_item_key><sep/><s_item_value>5.50</s_item_value><s_item_
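
Because each document holds the full parsed and raw invoice text, the printed results are long. The sketch below prints one short line per result instead. `summarize` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in with the same `metadata` and `page_content` attributes as a LangChain `Document`.

```python
from types import SimpleNamespace

def summarize(docs, width=60):
    """Return one short line per document: its id plus a content preview."""
    lines = []
    for doc in docs:
        # Collapse newlines and repeated spaces, then cut to the preview width.
        preview = " ".join(doc.page_content.split())[:width]
        lines.append(f"id={doc.metadata.get('id')} | {preview}")
    return lines

# Stand-in result with shortened content (assumption for the demo).
sample = [SimpleNamespace(metadata={"id": "38"},
                          page_content="Invoice ID: 38\nParsed Data: {...}")]
for line in summarize(sample):
    print(line)
```

In the notebook, you would call `summarize(results)` on the list returned by `similarity_search`.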

## Semantic Search with Filter

In [None]:
# query = "Patel, Thompson and Montgomery 356 Kyle Vista New James, MA 46228"
query = "ARTIE'S DELICATESSEN, 2290 BROADWAY, New YorkNY 10024, 212579-5959"
results = vector_store.similarity_search_with_score(
    query = query,
    k = 2,
    pre_filter = { "id": { "$in": ["38"] } }
)
pprint.pprint(results)

[(Document(metadata={'_id': '66fb9ead88af6025598cf484', 'id': '38'}, page_content='Invoice ID: 38\nParsed Data: {\'xml\': \'<s_receipt><s_total>$48.40</s_total><s_tips></s_tips><s_time>7:43:29PM</s_time><s_telephone>(212)579-5959</s_telephone><s_tax>3.95</s_tax><s_subtotal>44.45</s_subtotal><s_store_name>ARTIE’SDELICATESSEN</s_store_name><s_store_addr>2290BROADWAY NewYork,NY10024</s_store_addr><s_line_items><s_item_value>16.95</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>SupremeBurger</s_item_name><s_item_key></s_item_key><sep/><s_item_value>15.25</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>Medium</s_item_name><s_item_key></s_item_key><sep/><s_item_value>2.25</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>PastramiBurger</s_item_name><s_item_key></s_item_key><sep/><s_item_value>4.50</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>Medium</s_item_name><s_item_key></s_item_key><sep/><s_item_value>5.50</s_item_value><s_item
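
To make the effect of `pre_filter` and the returned score more concrete, here is a brute-force sketch on toy two-dimensional vectors. It keeps only documents whose `id` is in the allowed set and then ranks them. The score is assumed to follow the cosine normalization `(1 + cosine) / 2` described in the Atlas documentation. Atlas itself uses an approximate index rather than this exhaustive loop, and `filtered_search` is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def filtered_search(query_vec, docs, allowed_ids, k=2):
    """Keep docs whose id is allowed, then rank by normalized cosine score."""
    candidates = [d for d in docs if d["id"] in allowed_ids]
    scored = [(d["id"], (1 + cosine(query_vec, d["embedding"])) / 2)
              for d in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy data: ids and vectors are made up for illustration.
docs = [
    {"id": "38", "embedding": [1.0, 0.0]},
    {"id": "7", "embedding": [1.0, 0.0]},
    {"id": "12", "embedding": [0.0, 1.0]},
]
print(filtered_search([1.0, 0.0], docs, allowed_ids={"38"}))
```

Even though document `7` is as close to the query as document `38`, the filter removes it before ranking, which mirrors what `{ "id": { "$in": ["38"] } }` does in the cell above.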

## Llama.cpp
In this experiment, I use llama.cpp to run a quantized `Llama-3.2-3B-Instruct` model entirely locally on a MacBook Pro.
You can also use the original (non-quantized) model directly via Hugging Face Transformers if you have a powerful GPU at your disposal.

In [None]:
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF",
    filename="*q8_0.gguf",
)


llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/sina/.cache/huggingface/hub/models--hugging-quants--Llama-3.2-3B-Instruct-Q8_0-GGUF/snapshots/7ef7efff7d2c14e5d6161a0c7006e1f2fea6ec79/./llama-3.2-3b-instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_mod

## Basic RAG

This example does the following:

- Instantiates Atlas Vector Search as a retriever to query for similar documents, including the optional `k` parameter to return only the single most relevant document.

- Defines a LangChain prompt template to instruct the LLM to use these documents as context for your query. LangChain passes these documents to the `{context}` input variable and your query to the `{question}` variable.

- Constructs a chain that specifies the following:

  - Atlas Vector Search as the retriever to search for documents that are used as context by the LLM.

  - The prompt template that you constructed.

  - `llama-3.2-3b-instruct` model as the LLM used to generate a context-aware response.

- Prompts the chain with a sample query about the store address on one of the invoices.

- Returns the LLM's response and the documents used as context. The generated response might vary.
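
The steps above can be sketched in plain Python to show how data flows through the chain. This is not how LangChain implements the `|` operator. `run_rag` and the three stand-in functions are hypothetical and replace the retriever, the formatter, and the model so that the example runs without Atlas or a local LLM.

```python
def run_rag(question, retrieve, format_docs, template, generate):
    """Retrieve context, fill the prompt template, and return the model's text."""
    docs = retrieve(question)
    prompt = template.format(context=format_docs(docs), question=question)
    return generate(prompt)

template = "Use the context to answer.\n{context}\nQuestion: {question}\nAnswer:"

# Stand-ins (assumptions for the demo), one per stage of the chain.
fake_retrieve = lambda q: ["Store: ARTIE'S DELICATESSEN, 2290 BROADWAY"]
fake_format = lambda docs: "\n\n".join(docs)
fake_generate = lambda prompt: f"(model saw {len(prompt)} characters)"

print(run_rag("What is the store address?",
              fake_retrieve, fake_format, template, fake_generate))
```

The question is used twice: once to retrieve documents and once to fill the `{question}` slot, which is the role `RunnablePassthrough()` plays in the real chain.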


In [None]:
import os
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

model_dir = "/Users/sina/.cache/huggingface/hub/models--hugging-quants--Llama-3.2-3B-Instruct-Q8_0-GGUF/snapshots/7ef7efff7d2c14e5d6161a0c7006e1f2fea6ec79"
model_file = "llama-3.2-3b-instruct-q8_0.gguf"  # replace with actual filename
model_path = os.path.join(model_dir, model_file)

# Instantiate the Vector Search as a Retriever
retriever = vector_store.as_retriever(
    search_type = "similarity",
    search_kwargs = { "k": 1 }
)

# Define a prompt template
template = """
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
{context}
Question: {question}
Answer:
"""
custom_rag_prompt = PromptTemplate.from_template(template)

# Set up the Llama model
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path=model_path,
    temperature=0,
    max_tokens=2000,
    top_p=1,
    n_ctx=8192,  # Context window in tokens; the prompt in this run is about 6,300 tokens
    callback_manager=callback_manager,
    verbose=True,
)

def format_docs(docs):
    formatted_docs = []
    for doc in docs:
        formatted_doc = f"Invoice ID: {doc.metadata['id']}\n"
        formatted_doc += f"Content: {doc.page_content}\n"
        formatted_docs.append(formatted_doc)
    return "\n\n".join(formatted_docs[:3])

# Construct a chain to answer questions on your data
rag_chain = (
    { "context": retriever | format_docs, "question": RunnablePassthrough() }
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)

# Prompt the chain with a query
question = "What is the store address of the store with name 'ARTIE'S DELICATESSEN'?"
answer = rag_chain.invoke(question)

print("Question:" + question)
print("Answer:" + answer)

# Return source documents
documents = retriever.invoke(question)
print("\nSource documents:")
pprint.pprint(documents)

llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/sina/.cache/huggingface/hub/models--hugging-quants--Llama-3.2-3B-Instruct-Q8_0-GGUF/snapshots/7ef7efff7d2c14e5d6161a0c7006e1f2fea6ec79/llama-3.2-3b-instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model

The store address of the store with name 'ARTIE'S DELICATESSEN' is 2290BROADWAY NewYork,NY10024.

llama_perf_context_print:        load time =  296122.81 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  6309 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /    31 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  298065.74 ms /  6340 tokens


Question:What is the store address of the store with name 'ARTIE'S DELICATESSEN'?
Answer:The store address of the store with name 'ARTIE'S DELICATESSEN' is 2290BROADWAY NewYork,NY10024.

Source documents:
[Document(metadata={'_id': '66fb9ead88af6025598cf484', 'id': '38'}, page_content='Invoice ID: 38\nParsed Data: {\'xml\': \'<s_receipt><s_total>$48.40</s_total><s_tips></s_tips><s_time>7:43:29PM</s_time><s_telephone>(212)579-5959</s_telephone><s_tax>3.95</s_tax><s_subtotal>44.45</s_subtotal><s_store_name>ARTIE’SDELICATESSEN</s_store_name><s_store_addr>2290BROADWAY NewYork,NY10024</s_store_addr><s_line_items><s_item_value>16.95</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>SupremeBurger</s_item_name><s_item_key></s_item_key><sep/><s_item_value>15.25</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>Medium</s_item_name><s_item_key></s_item_key><sep/><s_item_value>2.25</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>PastramiBurger</s_item_name>
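
The timing output above shows that a single invoice already produces a prompt of about 6,300 tokens, which is close to the 8,192-token context window. If a retrieved document were larger, one option is to cap the context length before it reaches the prompt. The sketch below is one such approach. `clip_context` is a hypothetical helper, and the character limit is an assumption that you would tune, since the ratio of characters to tokens depends on the tokenizer.

```python
def clip_context(text: str, max_chars: int) -> str:
    """Trim text to at most max_chars, cutting at the last space when possible."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind(" ", 0, max_chars)
    return text[: cut if cut > 0 else max_chars]

print(clip_context("Invoice ID: 38 Parsed Data: total 48.40 tax 3.95", 30))
```

Note that cutting structured invoice text can drop fields the question depends on, so this is a fallback rather than a replacement for a sufficiently large `n_ctx`.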

## RAG with Filtering

This example does the following:

- Instantiates Atlas Vector Search as a retriever to query for similar documents, including the following optional parameters:

  - `k` to search for only the `5` most relevant documents.

  - `score_threshold` to use only documents with a relevance score above `0.1`.

    Note: This parameter refers to a relevance score that LangChain uses to normalize your results, not the relevance score used in Atlas Search queries. To use Atlas Search scores in your RAG implementation, define a custom retriever that uses the `similarity_search_with_score` method and filters by the Atlas Search score.

  - `pre_filter` to filter on the `id` field so that only the document with id `38` is returned.

- Defines a LangChain prompt template to instruct the LLM to use these documents as context for your query. LangChain passes these documents to the  `{context}` input variable and your query to the `{question}` variable.

- Constructs a chain that specifies the following:

  - Atlas Vector Search as the retriever to search for documents that are used as context by the LLM.

  - The prompt template that you constructed.

  - `llama-3.2-3b-instruct` model as the LLM used to generate a context-aware response.

- Prompts the chain with a sample query about the store address on one of the invoices.

- Returns the LLM's response and the documents used as context. The generated response might vary.
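
The note above mentions filtering by the Atlas Search score through `similarity_search_with_score`. A minimal sketch of that idea follows. `search_with_min_score` and `FakeStore` are hypothetical names, the `0.8` threshold is an arbitrary example value, and the stand-in store only mimics the call signature used earlier in this notebook so that the example runs offline.

```python
def search_with_min_score(vector_store, query, k=5, min_score=0.8, pre_filter=None):
    """Return only the documents whose score is at least min_score."""
    results = vector_store.similarity_search_with_score(
        query=query, k=k, pre_filter=pre_filter
    )
    return [doc for doc, score in results if score >= min_score]

class FakeStore:
    """Stand-in returning fixed (document, score) pairs (assumption for the demo)."""
    def similarity_search_with_score(self, query, k, pre_filter=None):
        return [("doc-a", 0.95), ("doc-b", 0.40)][:k]

print(search_with_min_score(FakeStore(), "store address", k=2))
```

To use a function like this in a chain, you could wrap it in a `RunnableLambda` from `langchain_core.runnables` in place of the retriever, although that wiring is not shown or tested here.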


In [None]:
# Instantiate the Vector Search as a Retriever
retriever = vector_store.as_retriever(
   search_type = "similarity_score_threshold",
   search_kwargs = {
      "k": 5,
      "score_threshold": 0.1,
      "pre_filter": { "id": { "$in": ["38"] } }
   }
)

# Define a prompt template
template = """
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
{context}
Question: {question}
Answer:
"""
custom_rag_prompt = PromptTemplate.from_template(template)

# Set up the Llama model
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path=model_path,
    temperature=0,
    max_tokens=2000,
    top_p=1,
    n_ctx=8192,  # Context window in tokens; the prompt in this run is about 6,300 tokens
    callback_manager=callback_manager,
    verbose=True,
)

def format_docs(docs):
    formatted_docs = []
    for doc in docs:
        formatted_doc = f"Invoice ID: {doc.metadata['id']}\n"
        formatted_doc += f"Content: {doc.page_content}\n"
        formatted_docs.append(formatted_doc)
    return "\n\n".join(formatted_docs[:3])

# Construct a chain to answer questions on your data
rag_chain = (
    { "context": retriever | format_docs, "question": RunnablePassthrough() }
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)

# Prompt the chain with a query
question = "What is the store address of the store with name 'ARTIE'S DELICATESSEN'?"
answer = rag_chain.invoke(question)

print("Question:" + question)
print("Answer:" + answer)

# Return source documents
documents = retriever.invoke(question)
print("\nSource documents:")
pprint.pprint(documents)

llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /Users/sina/.cache/huggingface/hub/models--hugging-quants--Llama-3.2-3B-Instruct-Q8_0-GGUF/snapshots/7ef7efff7d2c14e5d6161a0c7006e1f2fea6ec79/llama-3.2-3b-instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model

The store address of the store with name 'ARTIE'S DELICATESSEN' is 2290BROADWAY NewYork,NY10024.

llama_perf_context_print:        load time =  313182.44 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  6309 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /    31 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  315064.59 ms /  6340 tokens


Question:What is the store address of the store with name 'ARTIE'S DELICATESSEN'?
Answer:The store address of the store with name 'ARTIE'S DELICATESSEN' is 2290BROADWAY NewYork,NY10024.

Source documents:
[Document(metadata={'_id': '66fb9ead88af6025598cf484', 'id': '38'}, page_content='Invoice ID: 38\nParsed Data: {\'xml\': \'<s_receipt><s_total>$48.40</s_total><s_tips></s_tips><s_time>7:43:29PM</s_time><s_telephone>(212)579-5959</s_telephone><s_tax>3.95</s_tax><s_subtotal>44.45</s_subtotal><s_store_name>ARTIE’SDELICATESSEN</s_store_name><s_store_addr>2290BROADWAY NewYork,NY10024</s_store_addr><s_line_items><s_item_value>16.95</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>SupremeBurger</s_item_name><s_item_key></s_item_key><sep/><s_item_value>15.25</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>Medium</s_item_name><s_item_key></s_item_key><sep/><s_item_value>2.25</s_item_value><s_item_quantity>1</s_item_quantity><s_item_name>PastramiBurger</s_item_name>