## 1. Self Querying retriever
- https://github.com/samwit/langchain-tutorials/blob/main/RAG/YT_LangChain_RAG_tips_and_Tricks_01_Self_Query.ipynb

In [2]:
import os,sys,getpass
import pprint
os.environ["OPENAI_API_KEY"] = getpass.getpass(prompt='OpenAI API Token:')

In [3]:
from langchain.schema import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings() ## probably want to check which embedding endpoint it is using

- #### Create some sample documents to work with 

In [4]:
docs = [
    Document(
        page_content="Complex, layered, rich red with dark fruit flavors",
        metadata={"name":"Opus One", "year": 2018, "rating": 96, "grape": "Cabernet Sauvignon", "color":"red", "country":"USA"},
    ),
    Document(
        page_content="Luxurious, sweet wine with flavors of honey, apricot, and peach",
        metadata={"name":"Château d'Yquem", "year": 2015, "rating": 98, "grape": "Sémillon", "color":"white", "country":"France"},
    ),
    Document(
        page_content="Full-bodied red with notes of black fruit and spice",
        metadata={"name":"Penfolds Grange", "year": 2017, "rating": 97, "grape": "Shiraz", "color":"red", "country":"Australia"},
    ),
    Document(
        page_content="Elegant, balanced red with herbal and berry nuances",
        metadata={"name":"Sassicaia", "year": 2016, "rating": 95, "grape": "Cabernet Franc", "color":"red", "country":"Italy"},
    ),
    Document(
        page_content="Highly sought-after Pinot Noir with red fruit and earthy notes",
        metadata={"name":"Domaine de la Romanée-Conti", "year": 2018, "rating": 100, "grape": "Pinot Noir", "color":"red", "country":"France"},
    ),
    Document(
        page_content="Crisp white with tropical fruit and citrus flavors",
        metadata={"name":"Cloudy Bay", "year": 2021, "rating": 92, "grape": "Sauvignon Blanc", "color":"white", "country":"New Zealand"},
    ),
    Document(
        page_content="Rich, complex Champagne with notes of brioche and citrus",
        metadata={"name":"Krug Grande Cuvée", "year": 2010, "rating": 93, "grape": "Chardonnay blend", "color":"sparkling", "country":"France"},
    ),
    Document(
        page_content="Intense, dark fruit flavors with hints of chocolate",
        metadata={"name":"Caymus Special Selection", "year": 2018, "rating": 96, "grape": "Cabernet Sauvignon", "color":"red", "country":"USA"},
    ),
    Document(
        page_content="Exotic, aromatic white with stone fruit and floral notes",
        metadata={"name":"Jermann Vintage Tunina", "year": 2020, "rating": 91, "grape": "Sauvignon Blanc blend", "color":"white", "country":"Italy"},
    ),
]
vectorstore = Chroma.from_documents(docs, embeddings)

In [5]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [6]:
metadata_field_info = [
    AttributeInfo(
        name="grape",
        description="The grape used to make the wine",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="name",
        description="The name of the wine",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="color",
        description="The color of the wine",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the wine was released",
        type="integer",
    ),
    AttributeInfo(
        name="country",
        description="The name of the country the wine comes from",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="The Robert Parker rating for the wine 0-100", type="integer" #float
    ),
]
document_content_description = "Brief description of the wine"

- Set up the retriever

In [7]:

llm = OpenAI(temperature=0)

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    verbose=True
)
     


- The retriever should generate two things from the user's question: a query string for semantic matching and a filter for metadata filtering.
- See the [LangChain docs](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/#constructing-from-scratch-with-lcel) for how to customize this further.
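
The split into "query" and "filter" can be sketched in plain Python. This is only an illustration of the idea, not LangChain's internals: `Comparison`, `StructuredQuery`, and `matches` are hypothetical names, and the structured query is written by hand here, whereas the retriever would have the LLM produce it.

```python
from dataclasses import dataclass, field
import operator


@dataclass
class Comparison:
    """One metadata condition, e.g. rating > 97 (hypothetical helper)."""
    attribute: str
    comparator: str  # one of "eq", "gt", "gte", "lt", "lte"
    value: object


@dataclass
class StructuredQuery:
    """Text for semantic matching plus conditions on the metadata."""
    query: str
    filters: list = field(default_factory=list)  # combined with logical AND


_OPS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def matches(metadata, structured_query):
    """Return True if the metadata satisfies every comparison."""
    return all(
        c.attribute in metadata
        and _OPS[c.comparator](metadata[c.attribute], c.value)
        for c in structured_query.filters
    )


wines = [
    {"name": "Opus One", "rating": 96, "color": "red"},
    {"name": "Château d'Yquem", "rating": 98, "color": "white"},
    {"name": "Domaine de la Romanée-Conti", "rating": 100, "color": "red"},
]

# One plausible decomposition of "fruity notes and a rating above 97".
sq = StructuredQuery(query="fruity notes", filters=[Comparison("rating", "gt", 97)])
kept = [w["name"] for w in wines if matches(w, sq)]
print(kept)
```

Only the documents that pass the filter would then be ranked against `sq.query` by the vector store.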

In [17]:
docs = retriever.invoke("I want a wine that has fruity notes")
print(docs) 

[Document(page_content='Crisp white with tropical fruit and citrus flavors', metadata={'color': 'white', 'country': 'New Zealand', 'grape': 'Sauvignon Blanc', 'name': 'Cloudy Bay', 'rating': 92, 'year': 2021}), Document(page_content='Intense, dark fruit flavors with hints of chocolate', metadata={'color': 'red', 'country': 'USA', 'grape': 'Cabernet Sauvignon', 'name': 'Caymus Special Selection', 'rating': 96, 'year': 2018}), Document(page_content='Luxurious, sweet wine with flavors of honey, apricot, and peach', metadata={'color': 'white', 'country': 'France', 'grape': 'Sémillon', 'name': "Château d'Yquem", 'rating': 98, 'year': 2015}), Document(page_content='Complex, layered, rich red with dark fruit flavors', metadata={'color': 'red', 'country': 'USA', 'grape': 'Cabernet Sauvignon', 'name': 'Opus One', 'rating': 96, 'year': 2018})]


In [18]:
# This example specifies a query and a filter
retriever.invoke("I want a wine that has fruity notes and has a rating above 97")

[Document(page_content='Luxurious, sweet wine with flavors of honey, apricot, and peach', metadata={'color': 'white', 'country': 'France', 'grape': 'Sémillon', 'name': "Château d'Yquem", 'rating': 98, 'year': 2015}),
 Document(page_content='Highly sought-after Pinot Noir with red fruit and earthy notes', metadata={'color': 'red', 'country': 'France', 'grape': 'Pinot Noir', 'name': 'Domaine de la Romanée-Conti', 'rating': 100, 'year': 2018})]

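- Keep the top k results: with `enable_limit=True`, the retriever can also read a result count (for example, "two") from the question.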

In [19]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,  ### enable limit
    verbose=True,
)

In [20]:
retriever.get_relevant_documents("what are two that have a rating above 97")

[Document(page_content='Luxurious, sweet wine with flavors of honey, apricot, and peach', metadata={'color': 'white', 'country': 'France', 'grape': 'Sémillon', 'name': "Château d'Yquem", 'rating': 98, 'year': 2015}),
 Document(page_content='Highly sought-after Pinot Noir with red fruit and earthy notes', metadata={'color': 'red', 'country': 'France', 'grape': 'Pinot Noir', 'name': 'Domaine de la Romanée-Conti', 'rating': 100, 'year': 2018})]
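
Note that only two wines in the sample data rate above 97, so this query would return the same two documents without a limit. A rough sketch of what a limit does to a ranked result list follows; `apply_limit` and its `default_k=4` fallback are hypothetical and only stand in for whatever the retriever and vector store actually do.

```python
def apply_limit(ranked_docs, limit=None, default_k=4):
    """Keep `limit` results if the question named a count, else `default_k`."""
    k = default_k if limit is None else limit
    return ranked_docs[:k]


# Hypothetical ranking of wine names, best match first.
ranked = [
    "Château d'Yquem",
    "Domaine de la Romanée-Conti",
    "Penfolds Grange",
    "Opus One",
    "Caymus Special Selection",
]

two = apply_limit(ranked, limit=2)   # "what are two that ..."
default = apply_limit(ranked)        # no count mentioned in the question
print(two)
print(len(default))
```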

## 2.  Parent Document Retriever

- You may want small chunks so that their embeddings reflect their meaning accurately; if a chunk is too long, its embedding can lose meaning.
- You also want chunks long enough that the context around each one is retained.
- The parent document retriever balances the two: it indexes small chunks but returns the larger parent document each chunk came from.
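
The small-chunk / large-document trade-off can be sketched without any library. In this toy version a word-overlap score stands in for embedding similarity, and `split_text`, `add_documents`, and `retrieve_parents` are hypothetical helpers, not the `ParentDocumentRetriever` API.

```python
import uuid


def split_text(text, chunk_size):
    """Naive fixed-width splitter; a stand-in for a real text splitter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def overlap_score(query, chunk):
    """Toy relevance score: number of shared lowercase words (not an embedding)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


docstore = {}     # parent_id -> full parent text
child_index = []  # (child_chunk, parent_id) pairs, playing the vector store


def add_documents(texts, chunk_size=40):
    """Store each parent once and index its small child chunks."""
    for text in texts:
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = text
        for chunk in split_text(text, chunk_size):
            child_index.append((chunk, parent_id))


def retrieve_parents(query, k=2):
    """Rank child chunks, then return the distinct parents they point to."""
    ranked = sorted(
        child_index, key=lambda pair: overlap_score(query, pair[0]), reverse=True
    )
    parents, seen = [], set()
    for _, parent_id in ranked[:k]:
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append(docstore[parent_id])
    return parents


add_documents([
    "The Han dynasty fought the Hsiung-nu for decades before settling peasants in Kansu.",
    "Sauvignon Blanc from New Zealand is known for tropical fruit and citrus flavors.",
])
result = retrieve_parents("Han dynasty", k=1)
print(result)
```

Matching happens on the 40-character chunks, but the caller receives the whole parent text.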

In [23]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader,DirectoryLoader
from langchain_community.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

In [31]:
## reference : https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory
data_folder = '/root/workspace/data/Adv_RAG_temp_data'
loader = DirectoryLoader(data_folder, glob="*.txt",loader_cls=TextLoader,show_progress=True,use_multithreading=True)
docs = loader.load()

100%|██████████| 2/2 [00:00<00:00, 275.25it/s]


- Retrieving full documents: in this mode we want to retrieve the full documents, so we specify only a child splitter.

In [32]:
# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)
retriever.add_documents(docs, ids=None)

In [35]:
## we have two docs, so two store ids
list(store.yield_keys())

['34c0989f-e952-41cc-aded-17edfac1f3bf',
 '277aa046-c08c-4563-a1c9-3b873b3f6ba3']

In [42]:
## - Let’s now call the vector store search functionality - we should see that it returns small chunks (since we’re storing the small chunks).
sub_docs = vectorstore.similarity_search("Chinese history")
print('Retrieved chunks related to the query:')
pprint.pprint(sub_docs)

Retrieved chunks related to the query:
[Document(page_content='Chinese dynasty, which was at once involved in the usual internal and\nexternal struggles. For the moment, however, the southern region was\nrelatively at peace, and was accordingly attracting settlers.', metadata={'doc_id': '277aa046-c08c-4563-a1c9-3b873b3f6ba3', 'source': '/root/workspace/data/Adv_RAG_temp_data/chinahistory.txt'}),
 Document(page_content='North China before its defeat, and resumed these from 932 on; there were\neven relations with one of the South Chinese states; in the same way,\nKao-li continuously played one state against the other (M. Rogers _et\nal_.).', metadata={'doc_id': '277aa046-c08c-4563-a1c9-3b873b3f6ba3', 'source': '/root/workspace/data/Adv_RAG_temp_data/chinahistory.txt'}),
 Document(page_content='inside China we followed, had for the first time to defend itself\nagainst views and systems entirely opposed to it; for the Turkish and\nMongol peoples who ruled northern China brought with them

In [44]:
print(sub_docs[0].page_content)
print(len(sub_docs[0].page_content))

Chinese dynasty, which was at once involved in the usual internal and
external struggles. For the moment, however, the southern region was
relatively at peace, and was accordingly attracting settlers.
200


- Now we can retrieve the parent document using the parent document retriever

In [46]:
## For instance, in a news-article setting we can use small chunks for matching and return the entire document
retrieved_docs = retriever.get_relevant_documents("Chinese history")
len(retrieved_docs[0].page_content)

959799

- ##### A second way to use it is to return a larger chunk rather than the entire document 

In [47]:
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()

In [49]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(docs)

In [50]:
## We can see that there are many more than two documents now - these are the larger chunks.
len(list(store.yield_keys()))

1004

In [51]:
## Let’s make sure the underlying vector store still retrieves the small chunks.
sub_docs = vectorstore.similarity_search("Chinese history")

In [52]:
print(sub_docs[0].page_content)

[Illustration: Map 3. China in the struggle with the Huns or Hsiung Nu
(_roughly 128-100 B.C._)]


In [56]:
## with the parent doc retriever to get larger chunks 
retrieved_docs = retriever.get_relevant_documents("Chinese history")
print(retrieved_docs[0].page_content)

[Illustration: Map 3. China in the struggle with the Huns or Hsiung Nu
(_roughly 128-100 B.C._)]

The first active step taken was to try, in 133 B.C., to capture the
head of the Hsiung-nu state, who was called a _shan-yü_ but the
_shan-yü_ saw through the plan and escaped. There followed a period of
continuous fighting until 119 B.C. The Chinese made countless attacks,
without lasting success. But the Hsiung-nu were weakened, one sign of
this being that there were dissensions after the death of the _shan-yü_
Chün-ch'en, and in 127 B.C. his son went over to the Chinese. Finally
the Chinese altered their tactics, advancing in 119 B.C. with a strong
army of cavalry, which suffered enormous losses but inflicted serious
loss on the Hsiung-nu. After that the Hsiung-nu withdrew farther to the
north, and the Chinese settled peasants in the important region of
Kansu.


## 3.  Hybrid Search BM25 & Ensemble Retriever

In [60]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain_community.vectorstores import FAISS, Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [61]:
doc_list = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]

In [64]:
# initialize the bm25 retriever and faiss retriever
bm25_retriever = BM25Retriever.from_texts(doc_list)
bm25_retriever.k = 2

embedding = OpenAIEmbeddings()
faiss_vectorstore = FAISS.from_texts(doc_list, embedding)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

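- Initialize the ensemble retriever and give each underlying retriever a weight.
- Hybrid search is often useful when people know exactly what they want to search for: BM25 matches the exact keywords, while the embedding retriever covers semantically similar wording.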
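
One common way to combine two ranked lists with weights is weighted Reciprocal Rank Fusion, which is the approach LangChain's documentation describes for `EnsembleRetriever`. The sketch below is an assumption-laden illustration rather than the library's code: `weighted_rrf`, the constant `c=60`, and both input rankings are made up for the example.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each document scores sum(weight / (rank + c))."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-2 results from a keyword retriever and a dense retriever.
keyword_ranking = ["I like apples", "Apples and oranges are fruits"]
dense_ranking = ["Apples and oranges are fruits", "I like oranges"]

fused = weighted_rrf([keyword_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
```

A document that appears in both lists accumulates score from each, so it can outrank a document that is first in only one list.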

In [65]:
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)

In [66]:
docs = ensemble_retriever.get_relevant_documents("apples")
docs

[Document(page_content='I like apples'),
 Document(page_content='Apples and oranges are fruits')]
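
For intuition about the keyword half of the ensemble, here is a compact BM25 scorer over the same three sentences. It is a sketch under stated assumptions: whitespace tokenization, `k1=1.5`, `b=0.75`, and the non-negative "+1" form of the IDF term are choices made for this example, and the library behind `BM25Retriever` may differ in each of them.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with BM25."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs  # average document length
    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            n_t = sum(1 for d in docs if term in d)  # documents containing term
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            # Length normalization: longer documents need more hits to score well.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


doc_list = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]
scores = bm25_scores("apples", doc_list)
print(scores)
```

Both sentences containing "apples" get a positive score, with the shorter one ranked higher, and the sentence without the term scores zero.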

## 4. Contextual Compressors & Filters 