# LangChain Retrieval

- Document Loaders
- Text Splitting
- Vector stores
- Retrievers
- A few more tools
  
<img src="images/langchain_retrieval.jpg" width=75%/>

**References**
- [LangChain Retrieval](https://python.langchain.com/docs/modules/data_connection/)

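In this notebook, we will use the `langchain` library to build the retrieval side of an LLM application: loading documents, splitting them into chunks, embedding the chunks into a vector store, and retrieving the chunks most relevant to a query.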

<a href="https://colab.research.google.com/github/miztiik/llm-bootcamp/blob/main/chapters/intro_to_langchain/langchain_retreivers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
# Comment the above line to see the installation logs

# Install the dependencies
!pip install -qU python-dotenv
!pip install -qU langchain
!pip install -qU langchain-openai
!pip install -qU pypdf
!pip install -qU unstructured
!pip install -qU "unstructured[md]"
!pip install -qU rapidocr-onnxruntime
!pip install -qU chromadb
# Required by SentenceTransformerEmbeddings in the vector store section
!pip install -qU sentence-transformers

In [None]:
# Load environment variables
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

In [None]:
from langchain_openai import OpenAI
from langchain_openai import ChatOpenAI

llm = OpenAI(model_name="gpt-3.5-turbo-instruct")
llm_chat = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0.3)

## Document loaders

Download any reasonably sized PDF and upload it to the Colab environment. We will use the `langchain` library to load the document and extract its text. You can find some samples in the `datasets` folder in the repo.

<img src="images/langchain_retrieval.jpg" width=40%/>
  

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(
    "./../../datasets/raw_data/pdf/2023_india_economic_survey.pdf")
pages = loader.load_and_split()

Let's look at one of the pages from the document. An advantage of this approach is that each returned document records its page number in its metadata, so retrieved text can be traced back to a page.

In [None]:
pages[3]

### Extract text from images in PDF

Using the `rapidocr-onnxruntime` package, we can also extract text from images embedded in the PDF:

In [None]:
!pip install -qU rapidocr-onnxruntime

In [None]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(
    "https://arxiv.org/pdf/2103.15348.pdf", extract_images=True)
pages = loader.load()

In [None]:
pages[4].page_content

In [None]:
loader = PyPDFLoader(
    "./../../datasets/raw_data/pdf/bain_on_strategy.pdf", extract_images=True)
pages = loader.load()

In [None]:
pages[3].page_content

Load all documents in a folder that match a file extension:

In [None]:
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("./../../datasets/raw_data/", glob="**/*.md")
docs = loader.load()
len(docs)

In [None]:
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# For Markdown
md_loader = UnstructuredMarkdownLoader("README.md")
# For HTML (a separate variable, so the Markdown loader is not overwritten)
html_loader = UnstructuredHTMLLoader("index.html")

## Text Splitters

- **Character Splitting** - Simple static character chunks of data
  - The problem with it is that we do not take into account the structure of our document at all. We simply split by a fixed number of characters.
- **Recursive Character Text Splitting** - Recursive chunking based on a list of separators that we can specify. This is the Swiss Army knife of splitters and my first choice: when you don't know which splitter to start with, it is a good first bet.
- **Document Specific Splitting** - Various chunking methods for different document types (PDF, Python, Markdown)


Additional Reading - [5 Levels Of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb)


Let's try these on some sample text:

In [None]:
sample_txt = """
The economic prosperity of the Tamils depended on foreign trade. Literary, archaeological and numismatic sources confirm the trade relationship between Tamilakam and Rome, where spices and pearls from India were in great demand. With the accession of Augustus in 27 BCE, trade between Tamilakam and Rome received a tremendous boost and culminated at the time of Nero who died in 68 CE. At that point, trade declined until the death of Caracalla (217 CE), after which it almost ceased. It was revived again under the Byzantine emperors. Under the early Roman emperors, there was a great demand for articles of luxury, especially beryl. 

Most of the articles of luxury mentioned by the Roman writers came from Tamilakam. In the declining period, cotton and industrial products were still imported by Rome. The exports from the Tamil country included pepper, pearls, ivory, textiles and gold ornaments, while the imports were luxury goods such as glass, coral, wine and topaz. The government provided the essential infrastructure such as good harbours, lighthouses, and warehouses to promote overseas trade. 

The trade route taken by ships from Rome to Tamilakam has been described in detail by the writers, such as Strabo and Pliny the Elder. Roman and Arab sailors were aware of the existence of the monsoon winds that blew across the Indian Ocean on a seasonal basis. A Roman captain named Hippalus first sailed a direct route from Rome to India, using the monsoon winds.

Source: https://en.wikipedia.org/wiki/Economy_of_ancient_Tamil_country
"""

### Character Splitting

Character splitting is the most basic form of splitting up your text. It is the process of simply dividing your text into N-character sized chunks regardless of their content or form.

This method isn't recommended for any applications - but it's a great starting point for us to understand the basics.

**Pros**: Easy & Simple

**Cons**: Very rigid and doesn't take into account the structure of your text

Concepts to know:

`Chunk Size` - The number of characters you would like in your chunks. 50, 100, 100,000, etc.

`Chunk Overlap` - The amount you would like your sequential chunks to overlap. This is to try to avoid cutting a single piece of context into multiple pieces. This will create duplicate data across chunks.

`strip_whitespace=False` - Set this if you would like to retain whitespace at the beginning and end of your chunks.

In [None]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    chunk_size=128,
    chunk_overlap=0,
    separator='',
    strip_whitespace=False
)


text_splitter.create_documents([sample_txt])

**Observations**: We are not taking into account the structure of our document at all. We simply split by a fixed number of characters.
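To make chunk size and chunk overlap concrete, here is a minimal plain-Python sketch of fixed-size splitting. It is not LangChain's implementation; the function name `split_by_characters` and its parameters are assumptions made for illustration only.

```python
def split_by_characters(text, chunk_size, chunk_overlap=0):
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    (Hypothetical helper, not part of LangChain.)
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new window starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = split_by_characters(
    "The economic prosperity of the Tamils", chunk_size=10, chunk_overlap=3
)
for chunk in chunks:
    print(repr(chunk))
```

Note how the last three characters of each chunk reappear at the start of the next one, and how words are cut wherever the window happens to end.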

### Recursive Character Text Splitting

Split a text into chunks using a Text Splitter. The recursive splitter tries each separator in turn and falls back to the next one only when a piece is still larger than the chunk size.

Parameters include:

- `chunk_size`: Max size of the resulting chunks (in either characters or tokens, as selected)
- `chunk_overlap`: Overlap between the resulting chunks (in either characters or tokens, as selected)
- `length_function`: How to measure the length of a chunk, for example in characters (`len`) or in tokens

The default separators for LangChain's `RecursiveCharacterTextSplitter` are listed below, in the order they are tried. Let's take a look at them one by one.

- `"\n\n"` - Double new line, or most commonly paragraph breaks
- `"\n"` - New lines
- `" "` - Spaces
- `""` - Characters
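The fallback through the separator list can be sketched in plain Python. This is a simplified illustration, not LangChain's code: it ignores chunk overlap, and the function name `recursive_split` is an assumption.

```python
def recursive_split(text, separators, chunk_size):
    """Split `text` so that every chunk has at most `chunk_size` characters.

    Separators are tried in order; a finer separator is used only for
    pieces that are still too long. (Hypothetical helper, no overlap.)
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


result = recursive_split(
    "aaa bbb\n\nccc ddd eee\nfff", ["\n\n", "\n", " ", ""], chunk_size=8
)
print(result)
```

Paragraph breaks are respected first, then line breaks, then spaces, so a word is only cut in the middle when it is itself longer than the chunk size.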


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
length_function = len
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=128,
    chunk_overlap=1,
    length_function=length_function,
)

In [None]:
text_splitter.create_documents([sample_txt])

Let's view this visually:

<img src="images/chunk_visualization.png" width=50%>

<small>Source: https://chunkviz.up.railway.app/</small>

### Document Specific Splitting

This level is all about making your chunking strategy fit your different data formats.
 
**Markdown Document splitter**

Separators:

- `\n#{1,6}` - Split by new lines followed by a header (H1 through H6)
- ```` ```\\n ```` - Code blocks
- `\n\\*\\*\\*+\n` - Horizontal Lines
- `\n---+\n` - Horizontal Lines
- `\n___+\n` - Horizontal Lines
- `\n\n` - Double new lines
- `\n` - New line
- `" "` - Spaces
- `""` - Character
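As an illustration of the first separator in this list, the sketch below splits Markdown at header lines using Python's `re` module. It covers only the header rule, and the helper name `split_on_headers` is an assumption rather than a LangChain function.

```python
import re


def split_on_headers(markdown):
    """Split Markdown into sections that each start at a header (H1 to H6).

    (Hypothetical helper covering only the header separator.)
    """
    # Split at a newline that is immediately followed by 1-6 '#' and a space.
    # The lookahead keeps the header itself in the following section.
    sections = re.split(r"\n(?=#{1,6} )", markdown.strip())
    return [section.strip() for section in sections if section.strip()]


sample = "# Title\n\nIntro paragraph.\n\n## Key Commodities:\n- Exports\n- Imports"
sections = split_on_headers(sample)
for section in sections:
    print(repr(section))
```

A rule this simple would also split on a `#` comment inside a code block, which is one reason the full separator list handles code fences separately.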


In [None]:
markdown_text = """
# Trade Relationship Between Tamilakam and Rome: A Historical Overview

During ancient times, **Tamilakam's** economy thrived due to its extensive trading relationships with Rome. Archaeological, literary, and numismatic evidence affirms the exchange of valuable commodities like spices, pearls, and ivory.

## Key Commodities:
- **Exports**: Pepper, Pearls, Ivory, Textiles, Gold Ornaments
- **Imports**: Glass, Coral, Wine, Topaz
"""

In [None]:
from langchain.text_splitter import MarkdownTextSplitter

splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)

md_splitter = splitter.create_documents([markdown_text])

In [161]:
md_splitter

[Document(page_content='# Trade Relationship Between Tamilakam'),
 Document(page_content='and Rome: A Historical Overview'),
 Document(page_content="During ancient times, **Tamilakam's**"),
 Document(page_content='economy thrived due to its extensive'),
 Document(page_content='trading relationships with Rome.'),
 Document(page_content='Archaeological, literary, and'),
 Document(page_content='numismatic evidence affirms the'),
 Document(page_content='exchange of valuable commodities like'),
 Document(page_content='spices, pearls, and ivory.'),
 Document(page_content='## Key Commodities:'),
 Document(page_content='- **Exports**: Pepper, Pearls, Ivory,'),
 Document(page_content='Textiles, Gold Ornaments'),
 Document(page_content='- **Imports**: Glass, Coral, Wine,'),
 Document(page_content='Topaz')]

#### Conclusion

- **Chunk Size**: Chunks often lose information when text is split up. Consider a typical article, where the initial sentences introduce entities by name while later ones refer to them only through pronouns. Chunks that don't contain the actual entity names lose their semantic meaning and may not be retrieved through vector search. Replacing the pronouns with the actual names can improve the semantic significance of such chunks.
  - Chunk size based on use case - You do not have to stick to one chunking method for all the steps in your pipeline. For example, if your pipeline involves both high-level tasks like summarization and low-level tasks like coding based on a function definition, you could use a bigger chunk size for summarization and smaller chunks for coding reference.

## Vector stores

<img src="images/langchain_retrieval.jpg" width=40% />

In [None]:
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

In [108]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader(
    "./../../datasets/raw_data/txt/2023_msft_earnings_call_transcript.txt")
raw_data = loader.load()

In [122]:
print(raw_data[0].page_content[:1000])

Source:
https://www.fool.com/earnings/call-transcripts/2023/10/24/microsoft-msft-q1-2024-earnings-call-transcript/
https://www.fool.com/earnings/call-transcripts/2024/01/31/microsoft-msft-q2-2024-earnings-call-transcript/
https://www.microsoft.com/en-us/Investor/earnings/FY-2023-Q3/press-release-webcast
https://www.fool.com/earnings/call-transcripts/2023/07/25/microsoft-msft-q4-2023-earnings-call-transcript/
Microsoft FY23 Third Quarter Earnings Conference Call
Brett Iversen, Satya Nadella, Amy Hood
Tuesday, April 25, 2023

BRETT IVERSEN: 

Good afternoon and thank you for joining us today. On the call with me are Satya Nadella, chairman and chief executive officer, Amy Hood, chief financial officer, Alice Jolla, chief accounting officer, and Keith Dolliver, deputy general counsel.

On the Microsoft Investor Relations website, you can find our earnings press release and financial summary slide deck, which is intended to s


In [124]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
length_function = len
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=128,
    chunk_overlap=1,
    length_function=length_function,
)

documents = text_splitter.split_documents(raw_data)

Embedding Text Using LangChain

In [125]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

Creating Vector Store with Chroma DB

In [145]:
db = Chroma.from_documents(documents, embeddings)
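Conceptually, a vector store keeps one embedding per chunk and answers a query by ranking chunks by similarity to the query embedding. The sketch below shows that idea in plain Python. It is a toy: `embed` is a word-count stand-in for a real embedding model, and `ToyVectorStore` is a hypothetical class, not Chroma's API.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Stores (text, vector) pairs and ranks them against a query."""

    def __init__(self, texts):
        self.items = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=4):
        query_vec = embed(query)
        ranked = sorted(
            self.items, key=lambda item: cosine(query_vec, item[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore(
    ["revenue grew this year", "the weather is nice", "cloud revenue was strong"]
)
top = store.similarity_search("revenue this year", k=2)
print(top)
```

A real store replaces the word counts with dense model embeddings and the full sort with an approximate nearest-neighbour index, but the interface is the same: text in, the `k` most similar chunks out.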

## Retrievers

Retrieving Semantically Similar Documents from vector stores

<img src="images/langchain_retrieval.jpg" width=40%/>

In [163]:
query = "Revenue for last year?"
matching_docs = db.similarity_search(query)

len(matching_docs)

4

In [152]:
matching_docs[1]

Document(page_content='revenue in the next 12 months, up 15% year over year. The remaining portion, which will be recognized beyond the next 12', metadata={'source': './../../datasets/raw_data/txt/2023_msft_earnings_call_transcript.txt'})

In [167]:
retriever = db.as_retriever()
query = "What did Satya say about growth?"

matching_docs = retriever.get_relevant_documents(query)


for doc in matching_docs:
    print(doc.page_content)
    print(doc.metadata)
    print("\n")

In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2024_fy_msft_investor_earnings_call_q1_and_q2.txt'}


In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2024_fy_msft_investor_earnings_call_q1_and_q2.txt'}


In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2023_msft_earnings_call_transcript.txt'}


In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2023_msft_earnings_call_transcript.txt'}




### Similarity score threshold retrieval

In [193]:
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.4}
)
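The effect of a score threshold can be sketched without a vector store: given chunks that already carry a relevance score, keep only those at or above the threshold. The helper name `filter_by_score` and the sample scores are made up for illustration, and the sketch assumes scores where higher means more relevant.

```python
def filter_by_score(scored_docs, score_threshold):
    """Keep (text, score) pairs whose score is at least `score_threshold`.

    Results are returned best first. (Hypothetical helper.)
    """
    kept = [(text, score) for text, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Illustrative scores only, not output from the notebook's database.
scored = [("chunk a", 0.82), ("chunk b", 0.35), ("chunk c", 0.40)]
kept = filter_by_score(scored, score_threshold=0.4)
print(kept)
```

Unlike top-k retrieval, a threshold can return fewer results than expected, or none at all when no chunk is similar enough.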

In [196]:
docs = retriever.get_relevant_documents(query)
for doc in docs:
    print(doc.page_content)
    print(doc.metadata)
    print("\n")

In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2024_fy_msft_investor_earnings_call_q1_and_q2.txt'}


In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2024_fy_msft_investor_earnings_call_q1_and_q2.txt'}


In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2023_msft_earnings_call_transcript.txt'}


In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2023_msft_earnings_call_transcript.txt'}




### Maximum marginal relevance retrieval

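Maximum marginal relevance (MMR) picks results one at a time, preferring documents that are relevant to the query while penalizing those that are similar to results already selected. This reduces redundancy in the result set while keeping it relevant to the query.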

<small>Additional Learning: [Simple Unsupervised Keyphrase Extraction using Sentence Embeddings](https://arxiv.org/pdf/1801.04470.pdf)</small>
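The MMR selection loop can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code LangChain or Chroma run: the function names are hypothetical, vectors are plain lists of floats, and `lambda_mult` trades relevance (1.0) against diversity (0.0).

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of up to `k` documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document that was already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
# Documents 0 and 1 are exact duplicates; document 2 is equally relevant but different.
docs_vecs = [[1.0, 0.2], [1.0, 0.2], [1.0, -0.2]]
picked = mmr(query, docs_vecs, k=2, lambda_mult=0.5)
print(picked)
```

With `lambda_mult=1.0` the redundancy term vanishes and the loop reduces to plain similarity ranking, so the duplicate would be returned instead of the distinct document.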

In [168]:
retriever = db.as_retriever(search_type="mmr")

In [197]:
docs = retriever.get_relevant_documents(query)
for doc in docs:
    print(doc.page_content)
    print(doc.metadata)
    print("\n")

In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2024_fy_msft_investor_earnings_call_q1_and_q2.txt'}


In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2024_fy_msft_investor_earnings_call_q1_and_q2.txt'}


In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2023_msft_earnings_call_transcript.txt'}


In addition, what Satya mentioned earlier in a question, and I just want to take every chance to reiterate it, if you have a
{'source': './../../datasets/raw_data/txt/2023_msft_earnings_call_transcript.txt'}




### Retrieving the top k results

In [199]:
retriever = db.as_retriever(search_kwargs={"k": 1})

In [200]:
docs = retriever.get_relevant_documents(
    "What did Satya say about growth?")
len(docs)

1