# RAGdoll example

@untrueaxioms

<img src='img/github-header-image.png' />


In [1]:
import logging
from dotenv import load_dotenv
load_dotenv(override=True)

True

In [2]:
from ragdoll.helpers import set_logger
loginfo = set_logger()

In [3]:
from ragdoll.helpers import is_notebook
from ragdoll.index import RagdollIndex

index = RagdollIndex({'log_level': logging.INFO})
check_notebook = is_notebook(print_output=True)


Running in a Jupyter Notebook or JupyterLab environment.


# Indexing

The RagdollIndex class handles all the tasks outlined in the diagram below (see LangChain's documentation for more detail).

<img src='img/load_split_embed_store.png' height='500'/>

#### Set debug

In [7]:
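def reload():
    """Reload ragdoll.index and return a fresh RagdollIndex built from the reloaded code."""
    import importlib

    ragdoll_index_module = importlib.import_module("ragdoll.index")  # assumes the module is importable
    ragdoll_index_module = importlib.reload(ragdoll_index_module)
    # Use the class from the reloaded module; the name imported earlier still points at the old class.
    return ragdoll_index_module.RagdollIndex({'log_level': logging.WARN})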

#### Set question for retrieval

In [8]:
question = "tell me more about langchain"

## Load

In [9]:
search_queries = index.get_suggested_search_terms(question)
search_queries

[index] Fetching suggested search terms for the query
[index] 👨‍💻 Generating potential search queries with prompt:
 tell me more about langchain


[index] 👨‍💻 Generated potential search queries: ["langchain features and benefits", "langchain reviews and testimonials"]


['langchain features and benefits', 'langchain reviews and testimonials']

In [10]:
results = index.get_search_results(search_queries)
# The results are also available as index.search_results; index.url_list holds only the URLs.

[index]   🌐 Searching with query langchain features and benefits...
[index]   🌐 Searching with query langchain reviews and testimonials...


In [11]:
urllist = "".join(f"\n  * {d['href']}" for d in results)  # one bullet per result URL
print(urllist)


  * https://www.marktechpost.com/2023/12/14/what-is-langchain-use-cases-and-benefits/
  * https://lakefs.io/blog/what-is-langchain-ml-architecture/
  * https://logankilpatrick.medium.com/what-is-langchain-and-why-should-i-care-as-a-developer-b2d952c42b28
  * https://news.ycombinator.com/item?id=36645575
  * https://www.reddit.com/r/aipromptprogramming/comments/13f8gjr/review_langchain_vs_huggingfaces_new_agent_system/
  * https://github.com/hwchase17/langchain/issues/4772


In [12]:
documents = index.get_scraped_content()
print("-" * 100)
print(f"extracted {len(documents)} sites")
print("-" * 100)

print(documents[0].metadata['source'],'\n\n',documents[0].page_content[:500])

[index] Fetching content URLs


----------------------------------------------------------------------------------------------------
extracted 6 sites
----------------------------------------------------------------------------------------------------
https://www.marktechpost.com/2023/12/14/what-is-langchain-use-cases-and-benefits/ 

 What is LangChain? Use Cases and Benefits
LangChain is an artificial intelligence framework designed for programmers to develop applications using large language models. It allows you to facilitate the creation of applications that consist of two key features:
1. Context-Awareness: LangChain enables applications to be context-aware by establishing connections between a language model and various context sources. These sources may include prompt instructions, few-shot examples, or other content t


## Split

Document splitting breaks each loaded document into smaller chunks. It happens after the data has been loaded into a standardised document format and before the chunks go into the vector store.


The default RecursiveSplitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of keeping paragraphs (and then sentences, and then words) together for as long as possible, since those tend to be the most semantically related pieces of text.

- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.
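
To make the "try each separator in order" idea concrete, here is a minimal pure-Python sketch. It is not RAGdoll's or LangChain's implementation: `recursive_split` and its parameters are hypothetical names chosen for illustration, and the sketch ignores chunk overlap entirely.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Tries each separator in order and only falls back to a finer one
    for pieces that are still too large.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separator left: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep neighbouring pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(sample, chunk_size=10))
# -> ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The paragraph break is tried first, so the first paragraph survives intact; only the second paragraph, which is too long, is split again on spaces.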


In [13]:
split_docs = index.get_split_documents(documents)
print("-" * 100)
print(f"extracted {len(split_docs)} documents from {len(documents)} documents")
print("-" * 100)

[index] Chunking document


----------------------------------------------------------------------------------------------------
extracted 97 documents from 6 documents
----------------------------------------------------------------------------------------------------


## Pipeline 

We can also run all of these steps in a single call:

In [14]:
split_docs = index.run_index_pipeline(question)
print("-" * 100)
print(f"extracted {len(split_docs)} documents from {len(documents)} documents")
print("-" * 100)


[index] Running index pipeline
[index] Fetching suggested search terms for the query
[index] 👨‍💻 Generating potential search queries with prompt:
 tell me more about langchain
[index] 👨‍💻 Generated potential search queries: ["langchain features and benefits", "langchain reviews and testimonials"]
[index]   🌐 Searching with query langchain features and benefits...
[index]   🌐 Searching with query langchain reviews and testimonials...
[index] Fetching content URLs
[index] Chunking document


----------------------------------------------------------------------------------------------------
extracted 97 documents from 6 documents
----------------------------------------------------------------------------------------------------


# Retrieval

The retrieval class handles the following activities:


<img src='img/retrieve_augment_prompt.png' height='500'/>

## Embed and Store

Let’s start by initializing a simple vector store retriever and storing our docs (in chunks).


In [17]:
from ragdoll.retriever import RagdollRetriever

ragdoll = RagdollRetriever(config={'log_level':logging.WARN})

## Basic retrieval

Let's create a vector store from `split_docs` and then query it using similarity search.
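
Before running the real thing, here is a rough sketch of what a similarity search does under the hood. Everything in it is an assumption for illustration: the three-dimensional vectors are made up (a real store would use an embedding model), and `cosine_similarity`, `similarity_search`, and `toy_store` are hypothetical names, not part of RAGdoll or LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_search(query_vec, store, k=4):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "vector store": (text, hand-made embedding) pairs.
toy_store = [
    ("LangChain is a framework for LLM apps", [0.9, 0.1, 0.0]),
    ("Bananas are rich in potassium", [0.0, 0.2, 0.9]),
    ("Retrievers return documents for a query", [0.7, 0.6, 0.1]),
]

top = similarity_search([1.0, 0.2, 0.0], toy_store, k=2)
print(top)
```

The banana sentence points in a very different direction from the query vector, so it ranks last and is cut off by `k=2`.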

In [18]:
# Uncomment this code if you want to test with a local doc.

# from langchain.docstore.document import Document

# split_docs = [
#     Document(page_content="LangChain is a framework designed to simplify the creation of applications using large language models. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.", metadata={'source': 'wikipedia'})
# ]

In [19]:
db = ragdoll.get_db(split_docs)

In [20]:
from ragdoll.helpers import pretty_print_docs

simdocs = db.similarity_search('tell me about langchain')

print("-" * 100)
print(f"The similarity store returned {len(simdocs)} relevant documents. below is a snippet:")
print("-" * 100, "\n\n")
print(pretty_print_docs(simdocs, for_llm=False)[:500])

----------------------------------------------------------------------------------------------------
The similarity store returned 4 relevant documents. below is a snippet:
---------------------------------------------------------------------------------------------------- 


Source: https://www.marktechpost.com/2023/12/14/what-is-langchain-use-cases-and-benefits/
Title: What is LangChain? Use Cases and Benefits - MarkTechPost
Content: What is LangChain? Use Cases and Benefits
LangChain is an artificial intelligence framework designed for programmers to develop applications using large language models. It allows you to facilitate the creation of applications that consist of two key features:
1. Context-Awareness: LangChain enables applications to be context-aware by 


Let's now utilise a langchain retriever based on our selected vector db. A langchain retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. 
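
The "only needs to return documents" point can be sketched in a few lines. This is an illustration under assumptions, not LangChain's actual base class: `Retriever`, `FunctionRetriever`, and `build_context` are hypothetical names, and documents are plain strings here.

```python
from typing import Callable, Iterable, List, Protocol


class Retriever(Protocol):
    """Anything with this method counts as a retriever in this sketch."""

    def get_relevant_documents(self, query: str) -> List[str]: ...


class FunctionRetriever:
    """Wraps any lookup function; it stores no documents itself."""

    def __init__(self, fetch: Callable[[str], Iterable[str]]):
        self.fetch = fetch

    def get_relevant_documents(self, query: str) -> List[str]:
        return list(self.fetch(query))


def build_context(retriever: Retriever, query: str) -> str:
    """Join whatever the retriever returns into one context string."""
    return "\n".join(retriever.get_relevant_documents(query))


# A stand-in for a live search call or any other backend.
echo = FunctionRetriever(lambda q: [f"result for {q}"])
print(build_context(echo, "what is langchain"))
```

A vector store can back a retriever, as in the next cell, but so can a web search or a database query.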

In [21]:
retriever = ragdoll.get_retriever() 
# We can do this because the vector db has already been created.
# If we haven't run get_db yet, we can create the retriever directly with ragdoll.get_retriever(documents=split_docs)

In [22]:
simdocs = retriever.get_relevant_documents("what is langchain")
print("-" * 100)
print(f"The retriever returned {len(simdocs)} relevant documents. below is a snippet:")
print("-" * 100, "\n\n")
print(pretty_print_docs(simdocs, for_llm=False)[:500])

----------------------------------------------------------------------------------------------------
The retriever returned 4 relevant documents. below is a snippet:
---------------------------------------------------------------------------------------------------- 


Source: https://www.marktechpost.com/2023/12/14/what-is-langchain-use-cases-and-benefits/
Title: What is LangChain? Use Cases and Benefits - MarkTechPost
Content: What is LangChain? Use Cases and Benefits
LangChain is an artificial intelligence framework designed for programmers to develop applications using large language models. It allows you to facilitate the creation of applications that consist of two key features:
1. Context-Awareness: LangChain enables applications to be context-aware by 


## Multiquery retrieval

Distance-based vector database retrieval embeds (represents) queries in high-dimensional space and finds similar embedded documents based on “distance”. However, retrieval may produce different results with subtle changes in query wording, or if the embeddings do not capture the semantics of the data well. With multiple queries, we are more likely to get more results back from the database. The aim of multi-query retrieval is an expanded result set that may answer the question better than the docs from a single query. These results are deduplicated (in case the same document comes back multiple times) and then used as context in your final prompt. The MultiQueryRetriever class takes care of this, and can be selected by setting the `base_retriever` key in the config dictionary to `MULTI_QUERY`.
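
The merge-and-deduplicate step can be sketched as follows. This is a simplified illustration, not the MultiQueryRetriever source: the query variants are hard-coded here instead of being generated by an LLM, the retriever is a naive keyword matcher, and `multi_query_retrieve`, `keyword_retrieve`, and `corpus` are hypothetical names.

```python
def multi_query_retrieve(queries, retrieve):
    """Run every query variant through `retrieve` and return the de-duplicated union."""
    seen, merged = set(), []
    for query in queries:
        for doc in retrieve(query):
            key = (doc["source"], doc["content"])
            if key not in seen:  # skip documents already returned by an earlier variant
                seen.add(key)
                merged.append(doc)
    return merged


corpus = [
    {"source": "a", "content": "LangChain connects language models to data"},
    {"source": "b", "content": "LangChain offers chains and agents"},
    {"source": "c", "content": "Vector stores hold embeddings"},
]


def keyword_retrieve(query):
    """Stand-in retriever: documents sharing at least one word with the query."""
    words = set(query.lower().split())
    return [d for d in corpus if words & set(d["content"].lower().split())]


variants = ["what is langchain", "langchain agents", "embeddings stores"]
docs = multi_query_retrieve(variants, keyword_retrieve)
print([d["source"] for d in docs])
# -> ['a', 'b', 'c']
```

The first two variants both return documents `a` and `b`, but each appears only once in the merged list; the third variant contributes `c`, which a single query would have missed.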

In [23]:
mq_retriever = ragdoll.get_mq_retriever() 
# We can do this because the vector db has already been created.
# If we haven't run get_db yet, we can create the retriever directly with ragdoll.get_retriever(documents=split_docs)

In [24]:
simdocs = mq_retriever.get_relevant_documents("what is langchain")
print("-" * 100)
print(f"The retriever returned {len(simdocs)} relevant documents. below is a snippet:")
print("-" * 100, "\n\n")
print(pretty_print_docs(simdocs, for_llm=False)[:500])

----------------------------------------------------------------------------------------------------
The retriever returned 6 relevant documents. below is a snippet:
---------------------------------------------------------------------------------------------------- 


Source: https://www.marktechpost.com/2023/12/14/what-is-langchain-use-cases-and-benefits/
Title: What is LangChain? Use Cases and Benefits - MarkTechPost
Content: LangChain is super handy because it allows us to use language models to work with databases without manually writing complex SQL queries. It’s like having a conversation with the database, making it easier to get the information we need. This feature opens up possibilities for creating chatbots that can answer questions based on databa


## Contextual Compression Retriever

One challenge with retrieval is that usually you don’t know the specific queries your document storage system will face when you ingest data into the system. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned. “Compressing” here refers to both compressing the contents of an individual document and filtering out documents wholesale.

To use the Contextual Compression Retriever, you’ll need:

- a base retriever (either the standard or multi query)
- a Document Compressor

The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents, and passes them through the Document Compressor. The Document Compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.

We could do this with recursive calls to an LLM but this is expensive and slow. The EmbeddingsFilter provides a cheaper and faster option by embedding the documents and query and only returning those documents which have sufficiently similar embeddings to the query.
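
As a rough sketch of the embedding-based filtering idea (not the EmbeddingsFilter source), the snippet below keeps only documents whose vector is similar enough to the query vector. The two-dimensional vectors and the 0.75 threshold are arbitrary assumptions, and `cosine`, `embeddings_filter`, and `retrieved` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embeddings_filter(query_vec, docs, threshold=0.75):
    """Drop retrieved documents whose similarity to the query is below threshold."""
    return [d for d in docs if cosine(query_vec, d["vector"]) >= threshold]


# Pretend these came back from the base retriever, already embedded.
retrieved = [
    {"content": "LangChain overview", "vector": [0.9, 0.1]},
    {"content": "Loosely related post", "vector": [0.5, 0.5]},
    {"content": "Unrelated comment thread", "vector": [0.0, 1.0]},
]

kept = embeddings_filter([1.0, 0.0], retrieved)
print([d["content"] for d in kept])
# -> ['LangChain overview']
```

No LLM call is involved, which is why this kind of filter is cheaper than asking a model to compress each document; the trade-off is that it can only drop whole documents, not trim their contents.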


In [25]:
cc_retriever = ragdoll.get_compression_retriever(retriever)

In [26]:
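# This cell queries mq_retriever, so the output below matches the multi-query run.
# Call cc_retriever.get_relevant_documents(...) to exercise the compression retriever created above.
simdocs = mq_retriever.get_relevant_documents("what is langchain")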
print("-" * 100)
print(f"The retriever returned {len(simdocs)} relevant documents. below is a snippet:")
print("-" * 100, "\n\n")
print(pretty_print_docs(simdocs, for_llm=False)[:500])

----------------------------------------------------------------------------------------------------
The retriever returned 6 relevant documents. below is a snippet:
---------------------------------------------------------------------------------------------------- 


Source: https://www.marktechpost.com/2023/12/14/what-is-langchain-use-cases-and-benefits/
Title: What is LangChain? Use Cases and Benefits - MarkTechPost
Content: LangChain is super handy because it allows us to use language models to work with databases without manually writing complex SQL queries. It’s like having a conversation with the database, making it easier to get the information we need. This feature opens up possibilities for creating chatbots that can answer questions based on databa
