# LangChain Retrieval

- Document Loaders
- Text Splitting
- Vector stores
- Retrievers
- A few more tools
  
<img src="images/langchain_retrieval.jpg" width=75%/>

In this notebook, we will use the `langchain` library to load documents, split them into chunks, store the chunks in a vector store, and retrieve the chunks that are relevant to a query.

<a href="https://colab.research.google.com/github/miztiik/llm-bootcamp/blob/main/chapters/intro_to_langchain/langchain_retreival.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
# Comment the above line to see the installation logs

# Install the dependencies
!pip install -qU python-dotenv
!pip install -qU langchain
!pip install -qU langchain-openai
!pip install -qU pypdf
!pip install -qU unstructured
!pip install -qU "unstructured[md]"
!pip install -qU rapidocr-onnxruntime
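!pip install -qU chromadb
# Needed by SentenceTransformerEmbeddings in the vector store section
!pip install -qU sentence-transformers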

In [None]:
# Load environment variables
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

In [None]:
from langchain_openai import OpenAI
from langchain_openai import ChatOpenAI

llm = OpenAI(model_name="gpt-3.5-turbo-instruct")
llm_chat = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0.3)

## Document loaders

Download any reasonably sized PDF and upload it to the Colab environment. We will use the `langchain` library to load the document and extract the text from it. You can find some samples in the `datasets` folder in the repo.

<img src="images/langchain_retrieval.jpg" width=40%/>
  

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(
    "./../../datasets/raw_data/pdf/2023_india_economic_survey.pdf")
pages = loader.load_and_split()

Let's check out a few pages from the document. An advantage of this approach is that documents can be retrieved with page numbers.

In [None]:
pages[3]

### Extract text from images in PDF

Using the `rapidocr-onnxruntime` package, we can also extract text from images in the PDF:

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(
    "https://arxiv.org/pdf/2103.15348.pdf", extract_images=True)
pages = loader.load()

In [None]:
pages[4].page_content

In [None]:
loader = PyPDFLoader(
    "./../../datasets/raw_data/pdf/bain_on_strategy.pdf", extract_images=True
)
pages = loader.load()

In [None]:
pages[3].page_content

Loading all documents in a folder based on extension

In [None]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader("./../../datasets/raw_data/", glob="**/*.md")
docs = loader.load()

len(docs)
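
As a rough illustration of which files a glob pattern such as `**/*.md` selects, here is a minimal sketch in plain Python. The helper `find_files` and the throwaway directory layout are hypothetical and only show the file selection; they do not reproduce how `DirectoryLoader` parses each file.

```python
import tempfile
from pathlib import Path


def find_files(root, pattern="**/*.md"):
    """Return the files under root that match a glob pattern, in sorted order."""
    return sorted(p for p in Path(root).glob(pattern) if p.is_file())


# Demo on a temporary directory so the sketch does not depend on the repo layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes").mkdir()
    (root / "index.md").write_text("# index")
    (root / "notes" / "todo.md").write_text("# todo")
    (root / "data.txt").write_text("not markdown")
    demo_found = [p.relative_to(root).as_posix() for p in find_files(tmp)]

print(demo_found)
```

The `**` part of the pattern matches nested folders as well as the top-level folder, so Markdown files at any depth are picked up while `data.txt` is ignored.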

In [None]:
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# For Markdown
loader = UnstructuredMarkdownLoader("README.md")
# For HTML
loader = UnstructuredHTMLLoader("index.html")

## Text Splitters

- **Character Splitting** - Simple static character chunks of data
  - The problem with it is that we do not take into account the structure of our document at all. We simply split by a fixed number of characters.
- **Recursive Character Text Splitting** - Recursive chunking based on a list of separators that you can specify. This is the Swiss Army knife of splitters and my first choice: when you don't know which splitter to start with, it is a good first bet.
- **Document Specific Splitting** - Various chunking methods for different document types (PDF, Python, Markdown)


Additional Reading - [5 Levels Of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb)


Let's try these on some sample text.

In [None]:
sample_txt = """
The economic prosperity of the Tamils depended on foreign trade. Literary, archaeological and numismatic sources confirm the trade relationship between Tamilakam and Rome, where spices and pearls from India were in great demand. With the accession of Augustus in 27 BCE, trade between Tamilakam and Rome received a tremendous boost and culminated at the time of Nero who died in 68 CE. At that point, trade declined until the death of Caracalla (217 CE), after which it almost ceased. It was revived again under the Byzantine emperors. Under the early Roman emperors, there was a great demand for articles of luxury, especially beryl. 

Most of the articles of luxury mentioned by the Roman writers came from Tamilakam. In the declining period, cotton and industrial products were still imported by Rome. The exports from the Tamil country included pepper, pearls, ivory, textiles and gold ornaments, while the imports were luxury goods such as glass, coral, wine and topaz. The government provided the essential infrastructure such as good harbours, lighthouses, and warehouses to promote overseas trade. 

The trade route taken by ships from Rome to Tamilakam has been described in detail by the writers, such as Strabo and Pliny the Elder. Roman and Arab sailors were aware of the existence of the monsoon winds that blew across the Indian Ocean on a seasonal basis. A Roman captain named Hippalus first sailed a direct route from Rome to India, using the monsoon winds.

Source: https://en.wikipedia.org/wiki/Economy_of_ancient_Tamil_country
"""

### Character Splitting

Character splitting is the most basic form of splitting up your text. It is the process of simply dividing your text into N-character sized chunks regardless of their content or form.

This method isn't recommended for any applications - but it's a great starting point for us to understand the basics.

**Pros**: Easy & Simple

**Cons**: Very rigid and doesn't take into account the structure of your text

Concepts to know:

`Chunk Size` - The number of characters you would like in your chunks. 50, 100, 100,000, etc.

`Chunk Overlap` - The amount you would like your sequential chunks to overlap. This is to try to avoid cutting a single piece of context into multiple pieces. This will create duplicate data across chunks.

`strip_whitespace=False` - If you would like to retain whitespace from the beginning and end of your chunks. 
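
To make chunk size and chunk overlap concrete, here is a minimal sketch in plain Python. The function `split_by_characters` is a hypothetical helper written for illustration, not LangChain's implementation; it slides a fixed-size window over the text and moves it forward by `chunk_size - chunk_overlap` characters each time.

```python
def split_by_characters(text, chunk_size, chunk_overlap=0):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


demo_chunks = split_by_characters("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the duplicate data that overlap introduces.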

In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=128, chunk_overlap=0, separator="", strip_whitespace=False
)

text_splitter.create_documents([sample_txt])

**Observations**: We are not taking into account the structure of our document at all. We simply split by a fixed number of characters.

### Recursive Character Text Splitting

Split a text into chunks using a Text Splitter.

Parameters include:

- `chunk_size`: Max size of the resulting chunks (in either characters or tokens, as selected)
- `chunk_overlap`: Overlap between the resulting chunks (in either characters or tokens, as selected)
- `length_function`: How to measure the length of a chunk, for example `len` to count characters or a function that counts tokens

The default separators for LangChain are listed below. Let's take a look at them one by one.

- `"\n\n"` - Double new line, or most commonly paragraph breaks
- `"\n"` - New lines
- `" "` - Spaces
- `""` - Characters
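
The idea of falling back from coarse to fine separators can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the hypothetical function `recursive_split` ignores chunk overlap and measures length in characters only.

```python
def recursive_split(text, separators, chunk_size):
    """Split text so that every chunk is at most chunk_size characters long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


demo_text = "First paragraph.\n\nSecond paragraph is a bit longer.\n\nThird."
demo_chunks = recursive_split(demo_text, ["\n\n", "\n", " ", ""], chunk_size=20)
print(demo_chunks)
# ['First paragraph.', 'Second paragraph is', 'a bit longer.', 'Third.']
```

Short paragraphs survive intact, and only the paragraph that exceeds the limit is broken further, at spaces rather than mid-word.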


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

length_function = len

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=128,
    chunk_overlap=1,
    length_function=length_function,
)

In [None]:
text_splitter.create_documents([sample_txt])

Let's view this visually:

<img src="images/chunk_visualization.png" width=50%>

<small>Source: https://chunkviz.up.railway.app/</small>

### Document Specific Splitting

This level is all about making your chunking strategy fit your different data formats.
 
**Markdown Document splitter**

Separators:

- `\n#{1,6}` - Split by new lines followed by a header (H1 through H6)
- ```` ```\\n ```` - Code blocks
- `\n\\*\\*\\*+\n` - Horizontal Lines
- `\n---+\n` - Horizontal Lines
- `\n___+\n` - Horizontal Lines
- `\n\n` - Double new lines
- `\n` - New line
- `" "` - Spaces
- `""` - Character
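
The first separator in this list, a new line followed by a header, can be illustrated with a small regular-expression sketch. The helper `split_markdown_by_headers` is hypothetical and handles headers only; it does not treat code blocks or horizontal lines specially, so a `#` at the start of a line inside a code block would be mistaken for a header.

```python
import re


def split_markdown_by_headers(markdown):
    """Split Markdown text so that each chunk starts at a header line."""
    # Zero-width lookahead: cut just before a line that starts with 1 to 6 '#'
    # characters followed by a space, without consuming the header itself.
    parts = re.split(r"^(?=#{1,6} )", markdown, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]


demo_md = (
    "# Title\n\nIntro text.\n\n"
    "## Section A\n- item 1\n- item 2\n\n"
    "## Section B\nClosing text.\n"
)
demo_sections = split_markdown_by_headers(demo_md)
for section in demo_sections:
    print(repr(section))
```

Each resulting chunk keeps a header together with the content beneath it, which is the main benefit of format-aware splitting.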


In [None]:
markdown_text = """
# Trade Relationship Between Tamilakam and Rome: A Historical Overview

During ancient times, **Tamilakam's** economy thrived due to its extensive trading relationships with Rome. Archaeological, literary, and numismatic evidence affirms the exchange of valuable commodities like spices, pearls, and ivory.

## Key Commodities:
- **Exports**: Pepper, Pearls, Ivory, Textiles, Gold Ornaments
- **Imports**: Glass, Coral, Wine, Topaz
"""

In [None]:
from langchain.text_splitter import MarkdownTextSplitter

splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)

# create_documents returns a list of Document objects, not a splitter
md_docs = splitter.create_documents([markdown_text])

In [None]:
md_docs

## Why Chunk Size Matters in LLM Applications

### Importance of Chunk Size
Consider an article where the initial sentences introduce entities by name, while later sentences refer to them only by pronoun. Chunks that do not contain the actual entity names lose their semantic meaning and are unlikely to be retrieved through vector search. Replacing the pronouns with the actual names can therefore improve the semantic significance of such chunks.

Choosing the right `chunk_size` is a critical decision that can influence the efficiency and accuracy of the system in several ways:

- **Relevance and Granularity:** A small `chunk_size`, like `128`, yields more granular chunks. This granularity, however, presents a risk: vital information might not be among the top retrieved chunks, especially if the `similarity_top_k` setting is as restrictive as `2`. Conversely, a chunk size of 512 is likely to encompass all necessary information within the top chunks, ensuring that answers to queries are readily available. To navigate this, we employ the Faithfulness and Relevancy metrics. These measure the absence of ‘hallucinations’ and the ‘relevancy’ of responses based on the query and the retrieved contexts respectively.
- **Response Generation Time:** As the `chunk_size` increases, so does the volume of information passed to the LLM to generate an answer. This provides a broader context, but it can also lead to longer response times, so the added depth has to be balanced against the system's responsiveness.

### Determining the Ideal Chunk Size
To identify the most effective `chunk_size` for your application and dataset, test a range of sizes. Also tailor the `chunk_size` to the individual stages of your pipeline. For instance, use a larger `chunk_size` for high-level tasks such as summarization and a smaller `chunk_size` for low-level tasks such as writing code from a function definition.
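
A first step in such a comparison is to look at how the chunk size changes the number of chunks and the amount of text handed to the LLM. The sketch below uses a synthetic text, plain fixed-size chunks, and a hypothetical `top_k` of 2; it only measures sizes and says nothing about answer quality, which still needs an evaluation such as the metrics mentioned above.

```python
def fixed_size_chunks(text, chunk_size):
    """Cut text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


demo_text = "Revenue grew this quarter. " * 40  # 1080 characters of filler text
top_k = 2  # hypothetical number of chunks passed to the LLM per query

stats = {}
for chunk_size in (128, 256, 512):
    chunks = fixed_size_chunks(demo_text, chunk_size)
    # Characters sent to the LLM if the first top_k chunks were retrieved.
    context_chars = sum(len(chunk) for chunk in chunks[:top_k])
    stats[chunk_size] = (len(chunks), context_chars)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks, {context_chars} context chars")
```

Smaller chunks give more, finer-grained candidates but less context per query; larger chunks give fewer candidates and a longer prompt.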

## Vector stores

`Chroma` is an AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0. It is a simple and fast vector store that can be used to store and retrieve vectors.

<small> Learn more about [ChromaDB](https://python.langchain.com/docs/integrations/vectorstores/chroma)  </small>

<img src="images/langchain_retrieval.jpg" width=40% />

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [None]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader(
    "./../../datasets/raw_data/txt/2023_msft_earnings_call_transcript.txt"
)
raw_data = loader.load()

In [None]:
print(raw_data[0].page_content[:1000])

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

length_function = len

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=128,
    chunk_overlap=1,
    length_function=length_function,
)

documents = text_splitter.split_documents(raw_data)

Embedding Text Using Open-Source Models

_Note: You can also use OpenAI's embedding models for this task, but they are a paid service._



In [None]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# For OpenAI Embeddings
# from langchain_openai import OpenAIEmbeddings
# embeddings = OpenAIEmbeddings()

Creating Vector Store with Chroma DB

In [None]:
db = Chroma.from_documents(documents, embeddings)
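
Conceptually, a vector store keeps one vector per chunk and answers a query by ranking the stored vectors by similarity to the query vector. The toy sketch below is an assumption-laden illustration, not how Chroma works internally: `ToyVectorStore` and `embed` are hypothetical names, and the "embedding" is a simple bag-of-words count rather than a dense vector from a model.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (real embedding models return dense floats)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """In-memory store that keeps (vector, text) pairs and searches by brute force."""

    def __init__(self):
        self._rows = []

    def add_texts(self, texts):
        for text in texts:
            self._rows.append((embed(text), text))

    def similarity_search(self, query, k=4):
        query_vec = embed(query)
        ranked = sorted(
            self._rows,
            key=lambda row: cosine_similarity(query_vec, row[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "cloud revenue grew strongly",
    "the weather was sunny",
    "revenue from gaming declined",
])
demo_hits = store.similarity_search("cloud revenue", k=2)
print(demo_hits)
```

A real vector database replaces the brute-force sort with an index so that search stays fast as the number of stored vectors grows.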

## Retrievers

Retrieving Semantically Similar Documents from vector stores

<img src="images/langchain_retrieval.jpg" width=40%/>

In [None]:
query = "Revenue for last year?"
matching_docs = db.similarity_search(query)

len(matching_docs)

In [None]:
matching_docs[1]

In [None]:
retriever = db.as_retriever()
query = "What did Satya say about growth?"

matching_docs = retriever.get_relevant_documents(query)


for doc in matching_docs:
    print(doc.page_content)
    print(doc.metadata)
    print("\n")

### Similarity score threshold retrieval

In [None]:
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.4}
)

In [None]:
docs = retriever.get_relevant_documents(query)
for doc in docs:
    print(doc.page_content)
    print(doc.metadata)
    print("\n")
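
The effect of a score threshold can be sketched in plain Python: results whose score falls below the threshold are dropped, and an optional `k` caps how many remain. The helper `filter_by_score` and the scores in the demo are made up for illustration, and the sketch assumes that a higher score means a closer match.

```python
def filter_by_score(scored_docs, score_threshold, k=None):
    """Keep (text, score) pairs at or above the threshold, best first."""
    kept = [(text, score) for text, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if k is None else kept[:k]


# Hypothetical scores; in practice they come from the vector store.
demo_scored = [("chunk A", 0.82), ("chunk B", 0.35), ("chunk C", 0.41)]
demo_kept = filter_by_score(demo_scored, score_threshold=0.4)
demo_top1 = filter_by_score(demo_scored, score_threshold=0.4, k=1)
print(demo_kept)  # [('chunk A', 0.82), ('chunk C', 0.41)]
print(demo_top1)  # [('chunk A', 0.82)]
```

Unlike a fixed `k`, a threshold can return no results at all when nothing in the store is close enough to the query.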

### Maximum marginal relevance (MMR) retrieval

Maximum marginal relevance (MMR) selects documents by balancing similarity to the query against diversity among the results. It starts from the documents whose embeddings have the greatest cosine similarity to the query, then adds them one at a time while penalizing candidates that are close to documents already selected.
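
The selection loop described above can be sketched in plain Python. This is a minimal illustration, not LangChain's code: `mmr_select` is a hypothetical helper, the vectors are made up, and `lambda_mult` is assumed to weight relevance against redundancy (1.0 means pure similarity ranking).

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return the indices of up to k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document that was already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 1.0, 0.0]
doc_vecs = [
    [1.0, 0.25, 0.0],  # most similar to the query
    [1.0, 0.20, 0.0],  # near-duplicate of the first document
    [0.0, 1.00, 0.3],  # less similar, but covers a different direction
]
demo_mmr = mmr_select(query_vec, doc_vecs, k=2)
demo_plain = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(demo_mmr)    # [0, 2]
print(demo_plain)  # [0, 1]
```

With the redundancy penalty active, the near-duplicate is skipped in favor of the more diverse third document; with `lambda_mult=1.0` the result is the plain similarity ranking.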

<Small> Additional Learning: [Simple Unsupervised Keyphrase Extraction using Sentence Embeddings](https://arxiv.org/pdf/1801.04470.pdf) </small>

In [None]:
retriever = db.as_retriever(search_type="mmr")

In [None]:
docs = retriever.get_relevant_documents(query)
for doc in docs:
    print(doc.page_content)
    print(doc.metadata)
    print("\n")

### Retrieving the top k results

In [None]:
retriever = db.as_retriever(search_kwargs={"k": 1})

In [None]:
# Reuse the earlier query about the earnings call so the question matches the loaded transcript
docs = retriever.get_relevant_documents(query)

len(docs)

## Additional Reading
- [LangChain Retrieval](https://python.langchain.com/docs/modules/data_connection/)
- [Langchain VectorDB](https://python.langchain.com/docs/integrations/vectorstores/chroma)
- [Simple Unsupervised Keyphrase Extraction using Sentence Embedding](https://arxiv.org/pdf/1801.04470.pdf)
- [Chunk Visualization](https://chunkviz.up.railway.app/)