# RAGdoll example

@untrueaxioms

<img src='img/github-header-image.png' />


In [2]:
from dotenv import load_dotenv
load_dotenv(override=True)

True

In [3]:
from ragdoll.helpers import is_notebook
from ragdoll.index import RagdollIndex

index = RagdollIndex({'enable_logging': True})

print(index.get_config())
check_notebook = is_notebook(print_output=True)

{'enable_logging': True, 'max_search_results_per_query': 5, 'alternative_query_term_count': 5, 'max_workers': 3, 'embeddings': 'OpenAIEmbeddings', 'vector_store': 'FAISS'}
Running in a Jupyter Notebook or JupyterLab environment.


The `RagdollIndex` class handles every stage in the diagram below: load, split, embed and store (see LangChain's documentation for more detail).

<img src='img/load_split_embed_store.png' height='500'/>

#### Set debug

In [4]:
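def reload():
    """Reload ragdoll.index and return a fresh RagdollIndex (handy while debugging the library)."""
    import importlib

    ragdoll_index_module = importlib.import_module("ragdoll.index")  # Assuming the module exists
    ragdoll_index_module = importlib.reload(ragdoll_index_module)
    # Build the index from the reloaded module so code changes are picked up, and
    # return it: assigning to a local `index` would leave the notebook's index untouched.
    return ragdoll_index_module.RagdollIndex({'enable_logging': True})

# usage: index = reload()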

#### Set question for retrieval

In [5]:
question = "tell me more about langchain"


## Load

In [6]:
search_queries = index.get_suggested_search_terms(question)
search_queries

['What is Langchain and how does it work?',
 'Langchain features and benefits',
 'Langchain use cases and applications',
 'Langchain competitors and alternatives',
 'Langchain reviews and user experiences']

In [7]:
results = index.get_search_results(search_queries)
# The same results are available as index.search_results; index.url_list holds only the URLs.

In [8]:
# Build a bulleted list of result URLs (no f-prefix or enumerate needed here).
urllist = "".join(f"\n  * {d['href']}" for d in results)
print(urllist)


  * https://www.techtarget.com/searchenterpriseai/definition/LangChain
  * https://www.ibm.com/topics/langchain
  * https://www.producthunt.com/stories/what-is-langchain-how-to-use
  * https://aws.amazon.com/what-is/langchain/
  * https://blog.enterprisedna.co/what-is-langchain-a-beginners-guide-with-examples/
  * https://www.marktechpost.com/2023/12/14/what-is-langchain-use-cases-and-benefits/
  * https://lakefs.io/blog/what-is-langchain-ml-architecture/
  * https://logankilpatrick.medium.com/what-is-langchain-and-why-should-i-care-as-a-developer-b2d952c42b28
  * https://js.langchain.com/docs/use_cases
  * https://medium.com/@ebruboyaci35/use-cases-with-langchain-e0fd5b0587f1
  * https://python.langchain.com/docs/use_cases
  * https://github.com/gkamradt/langchain-tutorials/blob/main/LangChain%20Cookbook%20Part%202%20-%20Use%20Cases.ipynb
  * https://www.datacamp.com/tutorial/introduction-to-lanchain-for-data-engineering-and-data-applications
  * https://www.reddit.com/r/LocalLLaMA/c

In [9]:
documents = index.get_scraped_content()
print("-" * 100)
print(f"extracted {len(documents)} sites")
print("-" * 100)

print(documents[0].metadata['source'],'\n\n',documents[0].page_content[:500])

error occurred: HTTPSConnectionPool(host='www.marktechpost.com', port=443): Read timed out. (read timeout=4) 
----------------------------------------------------------------------------------------------------
extracted 22 sites
----------------------------------------------------------------------------------------------------
https://www.techtarget.com/searchenterpriseai/definition/LangChain 

 The potential of AI technology has been percolating in the background for years. But when ChatGPT, the AI chatbot, began grabbing headlines in early 2023, it put generative AI in the spotlight.
This guide is your go-to manual for generative AI, covering its benefits, limits, use cases, prospects and much more.
You forgot to provide an Email Address.
This email address doesn’t appear to be valid.
This email address is already registered. Please log in.
You have exceeded the maximum character limi
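
The output above shows one site timing out while the scrape as a whole still completes, which suggests that failures are logged and skipped. The sketch below illustrates that pattern only; it is not RAGdoll's implementation. `scrape_all` and the injected `fetch` callable are hypothetical names, and the default of three workers simply mirrors the `max_workers` value in the configuration printed earlier.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple


def scrape_all(
    urls: List[str],
    fetch: Callable[[str], str],
    max_workers: int = 3,
) -> Dict[str, str]:
    """Fetch every URL in parallel; log and skip the ones that fail.

    `fetch` is a hypothetical callable (url -> page text) supplied by the caller.
    """

    def safe_fetch(url: str) -> Tuple[str, Optional[str]]:
        try:
            return url, fetch(url)
        except Exception as exc:  # e.g. a read timeout
            print(f"error occurred: {exc}")
            return url, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_fetch, urls))

    # Keep only the pages that were fetched successfully.
    return {url: text for url, text in results if text is not None}
```

Passing the fetch function in as an argument keeps the sketch independent of any particular HTTP library and makes the error path easy to exercise with a fake.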


## Split

Documents are split into smaller chunks so that each piece can be embedded and retrieved on its own. Splitting happens after the data is loaded into a standardised document format but before it goes into the vector store.


The default RecursiveSplitter is the recommended one for generic text (the description here matches LangChain's `RecursiveCharacterTextSplitter`). It is parameterized by a list of separator strings and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of keeping paragraphs together for as long as possible, then lines, then words, since those tend to be the most strongly semantically related pieces of text.

  * How the text is split: by list of characters.
  * How the chunk size is measured: by number of characters.
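
To make the recursive strategy concrete, here is a minimal pure-Python sketch of the idea. It is a simplification written for illustration, not the splitter RAGdoll or LangChain actually uses: `recursive_split` is a hypothetical name, separators are dropped at chunk boundaries, and there is no chunk overlap.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = DEFAULT_SEPARATORS,
) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if len(text) <= chunk_size:
        return [text] if text else []

    # Use the first separator that occurs in the text; "" means "split per character".
    sep, remaining = "", ()
    for i, candidate in enumerate(separators):
        if candidate == "" or candidate in text:
            sep, remaining = candidate, tuple(separators[i + 1:])
            break

    pieces = list(text) if sep == "" else text.split(sep)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        joined = current + sep + piece if current else piece
        if len(joined) <= chunk_size:
            current = joined  # keep merging neighbouring pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee\nfff", chunk_size=8))
```

Chunk size is measured with `len`, that is, in characters, which matches the second point above.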


In [10]:
split_docs = index.get_split_documents(documents)
print("-" * 100)
print(f"extracted {len(split_docs)} documents from {len(documents)} documents")
print("-" * 100)

----------------------------------------------------------------------------------------------------
extracted 352 documents from 22 documents
----------------------------------------------------------------------------------------------------


## Embed and Store

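Next, embed the chunks and store them in a vector store, then wrap the store in a simple retriever. With the configuration printed earlier, the embeddings are `OpenAIEmbeddings` and the vector store is `FAISS`.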


In [11]:
retriever = index.get_retriever(split_docs)
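
As a rough illustration of what "embed and store" means, the sketch below builds a toy vector store in plain Python. Everything in it is an assumption made for the example: `embed` is a bag-of-words stand-in for a real embedding model, `ToyVectorStore` is a hypothetical class rather than FAISS, and the default `k=4` is chosen only to mirror the four documents returned in the retrieval output further down.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: L2-normalised word counts (stand-in for a real model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


class ToyVectorStore:
    """Keeps (vector, text) pairs in memory and ranks them by cosine similarity."""

    def __init__(self) -> None:
        self._items: List[Tuple[Dict[str, float], str]] = []

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            self._items.append((embed(text), text))

    def search(self, query: str, k: int = 4) -> List[str]:
        q = embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(w * vec.get(word, 0.0) for word, w in q.items()), text)
            for vec, text in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
store.add_texts([
    "chains link model calls together",
    "faiss stores dense vectors",
    "splitters cut documents into chunks",
])
print(store.search("how are vectors stored in faiss", k=1))
```

A real setup replaces the word counts with dense model embeddings and the linear scan with an index built for nearest-neighbour search, but the add-then-search shape stays the same.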

## Pipeline

All of the steps above can also be run in a single call:

In [12]:
pl_retriever = index.run_index_pipeline(question)
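
For comparison, the individual calls from the earlier cells can be chained by hand. The helper below is a sketch: `run_pipeline_by_hand` is a hypothetical name, and the assumption that `run_index_pipeline` performs the same steps in the same order is not confirmed by the library's source.

```python
def run_pipeline_by_hand(index, question):
    """Chain the load, split, embed and store steps and return a retriever.

    `index` is expected to expose the methods used in the cells above.
    """
    search_queries = index.get_suggested_search_terms(question)
    index.get_search_results(search_queries)        # load: find candidate URLs
    documents = index.get_scraped_content()         # load: scrape the pages
    split_docs = index.get_split_documents(documents)  # split into chunks
    return index.get_retriever(split_docs)          # embed, store, wrap in a retriever
```

With the notebook's objects this would be called as `run_pipeline_by_hand(index, question)`.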

## Basic retrieval

In [15]:
from ragdoll.helpers import pretty_print_docs

# Query the retriever built step by step above.
docs = retriever.get_relevant_documents('how does langchain work')
print("-" * 100)
print(f"The retriever had found {len(docs)} relevant documents")
print("-" * 100, "\n\n")
print(pretty_print_docs(docs, for_llm=False, top_n=1))

----------------------------------------------------------------------------------------------------
The retriever had found 4 relevant documents
---------------------------------------------------------------------------------------------------- 


Source: https://aws.amazon.com/what-is/langchain/
Title: What is LangChain? - LangChain Explained - AWS
Content: How does LangChain work?
With LangChain, developers can adapt a language model flexibly to specific business contexts by designating steps required to produce the desired outcome.
Chains
Chains are the fundamental principle that holds various AI components in LangChain to provide context-aware responses. A chain is a series of automated actions from the user's query to the model's output. For example, developers can use a chain for:
Links
Chains are made of links. Each action that developers string together to form a chained sequence is called a link. With links, developers can divide complex tasks into multiple, smaller tasks. E

In [16]:
docs = pl_retriever.get_relevant_documents('how does langchain work')
print("-" * 100)
print(f"The retriever had found {len(docs)} relevant documents")
print("-" * 100, "\n\n")
print(pretty_print_docs(docs, for_llm=False, top_n=1))

----------------------------------------------------------------------------------------------------
The retriever had found 4 relevant documents
---------------------------------------------------------------------------------------------------- 


Source: https://aws.amazon.com/what-is/langchain/
Title: What is LangChain? - LangChain Explained - AWS
Content: How does LangChain work?
With LangChain, developers can adapt a language model flexibly to specific business contexts by designating steps required to produce the desired outcome.
Chains
Chains are the fundamental principle that holds various AI components in LangChain to provide context-aware responses. A chain is a series of automated actions from the user's query to the model's output. For example, developers can use a chain for:
Links
Chains are made of links. Each action that developers string together to form a chained sequence is called a link. With links, developers can divide complex tasks into multiple, smaller tasks. E