# RAGdoll example

@untrueaxioms

<img src='img/github-header-image.png' />


In [1]:
import logging
from dotenv import load_dotenv
load_dotenv(override=True)

True

In [2]:
from ragdoll.helpers import set_logger
loginfo = set_logger(logging.INFO)

In [3]:
config = {
    'log_level': logging.INFO
}

In [4]:
from ragdoll.config import Config

In [5]:
from ragdoll.helpers import is_notebook
from ragdoll.index import RagdollIndex

index = RagdollIndex(config)
check_notebook = is_notebook(print_output=True)


Running in a Jupyter Notebook or JupyterLab environment.


# Indexing

The RagdollIndex class handles all the tasks outlined in the diagram below (see LangChain's documentation for more).

<img src='img/load_split_embed_store.png' height='500'/>

#### Set debug

#### Set question for retrieval

In [6]:
question = "tell me more about langchain"

## Load

In [7]:
search_queries = index.get_suggested_search_terms(question)
search_queries

[index] Fetching suggested search terms for the query
[index] 🧠 Generating potential search queries with prompt:
 tell me more about langchain
[models] 🤖 retrieving gpt-3.5-turbo-16k model
[_client] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[index] 🧠 Generated potential search queries: ["What is the purpose of Langchain?", "Langchain reviews and user experiences"]


['What is the purpose of Langchain?', 'Langchain reviews and user experiences']

In [8]:
results=index.get_search_results(search_queries)
# the results are also available via index.search_results; index.url_list holds only the URLs

[index]   🌐 Searching with query What is the purpose of Langchain?...
[index]   🌐 Searching with query Langchain reviews and user experiences...
[__init__] file_cache is only supported with oauth2client<4.0.0
[__init__] file_cache is only supported with oauth2client<4.0.0


In [9]:
urllist = "".join(f"\n  * {d['href']}" for d in results)
print(urllist)


  * https://aws.amazon.com/what-is/langchain/
  * https://www.techtarget.com/searchenterpriseai/definition/LangChain
  * https://www.reddit.com/r/LangChain/comments/12r5y1g/what_are_the_benefits_of_using_langchain_compared/
  * https://www.reddit.com/r/LangChain/comments/18eukhc/i_just_had_the_displeasure_of_implementing/
  * https://medium.com/llm-study-diary-a-beginners-path-through-ai/comprehensive-review-of-langchain-part-1-4734d61a49e1
  * https://www.reddit.com/r/LangChain/comments/1bszde0/would_you_use_langchain_more_if_it_had_better/


In [10]:
documents = index.get_scraped_content()
print("-" * 100)
print(f"extracted {len(documents)} sites")
print("-" * 100)

print(documents[0].metadata['source'],'\n\n',documents[0].page_content[:100])

[index] 🌐 Fetching raw source content


----------------------------------------------------------------------------------------------------
extracted 6 sites
----------------------------------------------------------------------------------------------------
https://aws.amazon.com/what-is/langchain/ 

 Click here to return to Amazon Web Services homepage
About AWS
Contact Us
Support
English
My Account


## Split

Document splitting breaks the loaded documents into smaller chunks. It happens after the data has been loaded into a standardised document format but before it goes into the vector store.

The default RecursiveSplitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of keeping paragraphs (then sentences, then words) together for as long as possible, since those tend to be the most strongly semantically related pieces of text.

- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.
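
To make the recursive strategy described above more concrete, here is a minimal pure-Python sketch. It is not the splitter RAGdoll or LangChain actually uses: the function name `recursive_split` and its parameters are hypothetical, and it ignores features such as chunk overlap. It only illustrates trying each separator in order and falling back to a finer one when a piece is still too large.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (illustrative only)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort: no separators left (or the empty separator), so cut by length.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they still fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
for chunk in recursive_split(sample, chunk_size=10):
    print(repr(chunk))
```

In this sample the paragraph break is tried first, and only the second paragraph, which is longer than 10 characters, is split further on spaces.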


In [11]:
split_docs = index.get_split_documents(documents)
print("-" * 100)
print(f"extracted {len(split_docs)} documents from {len(documents)} documents")
print("-" * 100)

[index] 📰 Chunking document


----------------------------------------------------------------------------------------------------
extracted 481 documents from 6 documents
----------------------------------------------------------------------------------------------------


## Pipeline 

We can also run all of these steps in one call:

In [12]:
split_docs = index.run_index_pipeline(question)
print("-" * 100)
print(f"extracted {len(split_docs)} documents from {len(documents)} documents")
print("-" * 100)


[index] Running index pipeline
[index] Fetching suggested search terms for the query
[index] 🧠 Generating potential search queries with prompt:
 tell me more about langchain
[models] 🤖 retrieving gpt-3.5-turbo-16k model
[_client] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[index] 🧠 Generated potential search queries: ["What is the purpose of Langchain?", "Langchain reviews and user experiences"]
[index]   🌐 Searching with query What is the purpose of Langchain?...
[index]   🌐 Searching with query Langchain reviews and user experiences...
[__init__] file_cache is only supported with oauth2client<4.0.0
[__init__] file_cache is only supported with oauth2client<4.0.0
[index] 🌐 Fetching raw source content
[index] 📰 Chunking document


----------------------------------------------------------------------------------------------------
extracted 481 documents from 6 documents
----------------------------------------------------------------------------------------------------


# Retrieval

The RagdollRetriever class handles the activities shown in the diagram below:


<img src='img/retrieve_augment_prompt.png' height='500'/>

## Embed and Store

Let’s start by initializing a simple vector store retriever and storing our docs (in chunks).


In [13]:
from ragdoll.retriever import RagdollRetriever

ragdoll = RagdollRetriever(config)

## Basic retrieval

Let's create a vector store from `split_docs` and then query it using similarity search.

In [14]:
# Uncomment this code if you want to test with a local document.

# from langchain.docstore.document import Document

# split_docs = [
#     Document(page_content="LangChain is a framework designed to simplify the creation of applications using large language models. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.", metadata={'source': 'wikipedia'})
# ]

In [15]:
db = ragdoll.get_db(split_docs)

[retriever] 🗃️  creating vector database (FAISS)...
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[loader] Loading faiss with AVX2 support.
[loader] Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
[loader] Loading faiss.
[loader] Successfully loaded faiss.


In [16]:
from ragdoll.helpers import pretty_print_docs

simdocs = db.similarity_search(question)

print("-" * 100)
print(f"The similarity store returned {len(simdocs)} relevant documents. below is a snippet:")
print("-" * 100, "\n\n")
print(pretty_print_docs(simdocs, for_llm=False)[:500])

[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


----------------------------------------------------------------------------------------------------
The similarity store returned 4 relevant documents. below is a snippet:
---------------------------------------------------------------------------------------------------- 


Source: https://aws.amazon.com/what-is/langchain/
Title: What is LangChain? - LangChain Explained - AWS
Content: What is LangChain?
LangChain is an open source framework for building applications based on large language models (LLMs). LLMs are large deep-learning models pre-trained on large amounts of data that can generate responses to user queries—for example, answering questions or creating images from text-based prompts. LangChain provides tools and abstractions to improve the customization,
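
For intuition about what a similarity search over a vector store does, here is a small self-contained sketch. It is an illustration only: `embed` is a toy bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is a hypothetical class, not the FAISS store created above. Only the overall shape (embed the query, rank stored documents by similarity, return the top `k`) is meant to carry over.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (a real model returns dense floats)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps each text alongside its embedding."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=4):
        query_vector = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vector, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "langchain is a framework for llm applications",
    "faiss is a library for vector search",
    "bananas are yellow",
])
print(store.similarity_search("what is langchain", k=1))
```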


Let's now use a LangChain retriever based on our selected vector db. A LangChain retriever is an interface that returns documents given an unstructured query. It is more general than a vector store: a retriever does not need to be able to store documents, only to return (or retrieve) them.

In [17]:
retriever = ragdoll.get_retriever()
# we can do this because the vector db has already been created
# if we haven't run get_db yet, we can simply create the retriever with ragdoll.get_retriever(documents=split_docs)

[retriever] 📋 getting retriever


In [18]:
simdocs = retriever.invoke(question)
print("-" * 100)
print(f"The retriever returned {len(simdocs)} relevant documents. below is a snippet:")
print("-" * 100, "\n\n")
print(pretty_print_docs(simdocs, for_llm=False)[:500])

[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


----------------------------------------------------------------------------------------------------
The retriever returned 4 relevant documents. below is a snippet:
---------------------------------------------------------------------------------------------------- 


Source: https://aws.amazon.com/what-is/langchain/
Title: What is LangChain? - LangChain Explained - AWS
Content: What is LangChain?
LangChain is an open source framework for building applications based on large language models (LLMs). LLMs are large deep-learning models pre-trained on large amounts of data that can generate responses to user queries—for example, answering questions or creating images from text-based prompts. LangChain provides tools and abstractions to improve the customization,


## Multiquery retrieval

Distance-based vector database retrieval embeds (represents) queries in high-dimensional space and finds similar embedded documents based on “distance”. However, retrieval may produce different results after subtle changes in query wording, or when the embeddings do not capture the semantics of the data well. With multiple queries, we are more likely to get more results back from the database. The aim of multi-query retrieval is to build an expanded result set that may answer the question better than the documents from a single query. The results are deduplicated (in case the same document comes back multiple times) and then used as context in your final prompt. The MultiQueryRetriever class takes care of this, and can be selected by setting the `base_retriever` key in the config dictionary to `MULTI_QUERY`.
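
The merge-and-deduplicate step can be sketched in a few lines of plain Python. Everything here is hypothetical and for illustration: `multi_query_retrieve`, `fake_generate` and `fake_retrieve` are made-up names, the query rephrasings are hard-coded where the real retriever asks an LLM, and documents are plain strings.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Run every generated query and return the unique documents in first-seen order."""
    seen, merged = set(), []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:  # drop documents already returned by an earlier query
                seen.add(doc)
                merged.append(doc)
    return merged


# Stand-ins for the LLM and the vector store (assumptions for this sketch).
FAKE_INDEX = {
    "what is langchain": ["doc A", "doc B"],
    "langchain overview": ["doc B", "doc C"],
}


def fake_generate(question):
    return [question, "langchain overview"]


def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])


print(multi_query_retrieve("what is langchain", fake_generate, fake_retrieve))
```

Here "doc B" is returned by both queries but appears only once in the merged result.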

In [19]:
mq_retriever = ragdoll.get_mq_retriever()
# we can do this because the vector db has already been created
# if we haven't run get_db yet, we can simply create the retriever with ragdoll.get_retriever(documents=split_docs)

[retriever] 📋 getting multi query retriever
[retriever] 💭 Remember that the multi query retriever will incur additional calls to your LLM
[models] 🤖 retrieving gpt-3.5-turbo-16k model for multi query retriever


In [25]:
simdocs = mq_retriever.invoke("what is langchain")
print("-" * 100)
print(f"The retriever returned {len(simdocs)} relevant documents. below is a snippet:")
print("-" * 100, "\n\n")
print(pretty_print_docs(simdocs, for_llm=False)[:500])

[_client] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[multi_query] Generated queries: ['1. Can you provide information about the LangChain?', "2. I'm curious to know more about LangChain. Can you explain it to me?", '3. Could you please give me an overview of LangChain and its purpose?']
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


----------------------------------------------------------------------------------------------------
The retriever returned 4 relevant documents. below is a snippet:
---------------------------------------------------------------------------------------------------- 


Source: https://aws.amazon.com/what-is/langchain/
Title: What is LangChain? - LangChain Explained - AWS
Content: What is LangChain?
LangChain is an open source framework for building applications based on large language models (LLMs). LLMs are large deep-learning models pre-trained on large amounts of data that can generate responses to user queries—for example, answering questions or creating images from text-based prompts. LangChain provides tools and abstractions to improve the customization,


## Contextual Compression Retriever

One challenge with retrieval is that usually you don’t know the specific queries your document storage system will face when you ingest data into the system. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned. “Compressing” here refers to both compressing the contents of an individual document and filtering out documents wholesale.

To use the Contextual Compression Retriever, you’ll need:

- a base retriever (either the standard or multi query)
- a Document Compressor

The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents and passes them through the Document Compressor. The Document Compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.

We could do this with recursive calls to an LLM but this is expensive and slow. The EmbeddingsFilter provides a cheaper and faster option by embedding the documents and query and only returning those documents which have sufficiently similar embeddings to the query.
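
The filtering idea can be sketched as follows. This is an illustration under assumptions, not the EmbeddingsFilter implementation: `embeddings_filter` is a hypothetical function, the vectors are hand-written two-dimensional stand-ins for real embeddings, and keeping documents at or above the threshold (rather than strictly above it) is a choice made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embeddings_filter(query_vector, docs_with_vectors, similarity_threshold=0.5):
    """Keep only the documents whose embedding is similar enough to the query's."""
    return [
        doc
        for doc, vector in docs_with_vectors
        if cosine_similarity(query_vector, vector) >= similarity_threshold
    ]


# Hand-written vectors standing in for real embeddings (assumption).
query_vector = [1.0, 0.0]
docs = [
    ("paragraph about langchain", [0.9, 0.1]),
    ("loosely related paragraph", [0.5, 0.5]),
    ("cookie banner text", [0.0, 1.0]),
]

print(embeddings_filter(query_vector, docs, similarity_threshold=0.5))
print(embeddings_filter(query_vector, docs, similarity_threshold=0.8))
```

Raising the threshold trades recall for precision: fewer chunks reach the LLM, but each one is closer to the query.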


In [21]:
ccfg = {
    "use_embeddings_filter": True,
    "use_splitter": True,
    "use_redundant_filter": True,
    "use_relevant_filter": True,
    "similarity_threshold": 0.5,  # embeddings filter setting
}

In [22]:
cc_retriever = ragdoll.get_compression_retriever(retriever, ccfg)

[retriever] 🗜️ Compression object pipeline: embeddings_filter ➤ splitter ➤ redundant_filter ➤ relevant_filter


In [23]:
simdocs = cc_retriever.get_relevant_documents(question)
print("-" * 100)
print(f"The retriever returned {len(simdocs)} relevant documents. below is a snippet:")
print("-" * 100, "\n\n")
print(pretty_print_docs(simdocs, for_llm=False)[:500])

[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


----------------------------------------------------------------------------------------------------
The retriever returned 5 relevant documents. below is a snippet:
---------------------------------------------------------------------------------------------------- 


Source: https://aws.amazon.com/what-is/langchain/
Title: What is LangChain? - LangChain Explained - AWS
Content: What is LangChain?
LangChain is an open source framework for building applications based on large language models (LLMs). LLMs are large deep-learning models pre-trained on large amounts of data that can generate responses to user queries—for example, answering questions or creating images from text-based prompts. LangChain provides tools and abstractions to improve the customization,


### Question Answer

In [24]:
response = ragdoll.answer_me_this(question, cc_retriever)
print(response)

[retriever] 🔗 Running RAG chain
[models] 🤖 retrieving gpt-3.5-turbo-16k model for RAG chain
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[_client] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


LangChain is an open source framework that was launched in 2022 by co-founders Harrison Chase and Ankush Gola. It is designed to enable software developers working with artificial intelligence (AI) and machine learning to combine large language models (LLMs) with other external components to develop LLM-powered applications. LLMs are deep-learning models that have been pre-trained on large amounts of data and can generate responses to user queries, such as answering questions or creating images from text-based prompts.

The main goal of LangChain is to link powerful LLMs, such as OpenAI's GPT-3.5 and GPT-4, to an array of external data sources in order to create and leverage the benefits of natural language processing (NLP) applications. This means that developers, software engineers, and data scientists with experience in Python, JavaScript, or TypeScript programming languages can utilize LangChain's packages offered in those languages.

LangChain provides tools and abstractions that 