# Exploring vector databases and retrievers
This notebook explores different ways to set up a vector database and the retrievers built on top of it.

# Set up

In [1]:
%load_ext dotenv
%dotenv ../.env

In [2]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, PromptTemplate

from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableSequence, RunnableAssign, RunnableLambda
from langchain_core.output_parsers import StrOutputParser


import os
from os.path import join

In [3]:

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")


In [4]:
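# TODO: this needs more work
# Set to True to force the index to be rebuilt from the source documents
need_to_recompile = False
if need_to_recompile or not os.path.exists("faiss_index"):
    # Load every HTML file in the documents folder
    path_to_docs = './text'
    docs = []
    for f in os.listdir(path_to_docs):
        file_path = join(path_to_docs, f)
        if os.path.isfile(file_path):
            loader = UnstructuredHTMLLoader(file_path)
            docs.extend(loader.load())
    # Split into overlapping chunks so each embedding covers a manageable span of text
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(docs)

    # Embed the chunks and cache the index on disk
    vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())
    vectorstore.save_local("faiss_index")
else:
    # Reuse the cached index. Loading unpickles data, so only load index files you created yourself
    vectorstore = FAISS.load_local("faiss_index", OpenAIEmbeddings(), allow_dangerous_deserialization=True)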
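
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and lines); `sliding_chunks` is a hypothetical helper written only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the role `chunk_overlap=200` plays above: text near a chunk boundary still appears with some of its surrounding context.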


In [5]:
prompt = ChatPromptTemplate(
    input_variables=['context','question'],
    messages=[
        HumanMessagePromptTemplate(
            prompt=PromptTemplate(
                input_variables=['context', 'question'], 
                template="""You are an assistant for question-answering tasks. 
                Use the following pieces of retrieved context to answer the question. 
                If you don't know the answer, just say that you don't know. 
                Be detailed in your answer.\nQuestion: {question} \nContext: {context} \nAnswer:"""
                )
            )
        ]
)

In [6]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [7]:
def make_chain(retriever):
  chain = (
    # Step 1: RunnableParallel builds the dict
    # {"context": retriever.invoke(question), "question": question}
    RunnableParallel({
      "context": retriever,
      "question": RunnablePassthrough()
    })
    # Step 2: RunnableAssign keeps that dict and adds a new field, "answer"
    | RunnableAssign(
        # The mapper is another RunnableParallel. It receives the dict from step 1 as input,
        # and its output dict is merged into the original:
        # {**input_dictionary, **mapper(input_dictionary)}
        mapper=RunnableParallel(
            # We only need one new field: the answer
            {"answer":
              # The prompt expects a dict with 'question' and 'context'. We have both,
              # but 'context' is a list of Document objects, so we use RunnableAssign
              # with format_docs to replace it with a single string
              RunnableAssign(
                  mapper={"context": RunnableLambda(lambda x: format_docs(x['context']))}
                )
              | prompt
              | llm
              | StrOutputParser()
              # The result is {"answer": <string output>}, which is merged into the original dict
            }
        )
    )
  )
  return chain
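
The data flow through `make_chain` can be mimicked with plain dictionaries. This is only a sketch of the merge semantics described in the comments, not LangChain code: `fake_retriever` and `fake_answer` are hypothetical stand-ins for the retriever and for the `prompt | llm | StrOutputParser()` sub-chain.

```python
def fake_retriever(question: str) -> list[str]:
    # Stand-in for retriever.invoke(question); real retrievers return Document objects
    return ["chunk about 1812", "chunk about 1817"]


def fake_answer(inputs: dict) -> str:
    # Stand-in for the format_docs -> prompt -> llm -> StrOutputParser sub-chain
    context = "\n\n".join(inputs["context"])
    return f"answer to {inputs['question']!r} using {len(context)} characters of context"


def run_chain(question: str) -> dict:
    # RunnableParallel: every branch receives the same input
    step1 = {"context": fake_retriever(question), "question": question}
    # RunnableAssign: keep the input dict and merge in the mapper's output
    return {**step1, **{"answer": fake_answer(step1)}}


result = run_chain("what happened in 1817?")
print(sorted(result))  # ['answer', 'context', 'question']
```

Because the original keys survive the merge, the caller gets the answer together with the question and the retrieved context, which is what `ask_question` relies on when it prints the documents used.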

In [8]:
def ask_question(chain, question):
    ans = chain.invoke(question)
    print("question = ",ans["question"])
    print("answer = ", ans['answer'])
    print("Documents used:")
    for d in ans['context']:
        if len(d.page_content) > 40:
            print("\tsource: "+d.metadata['source']+"\t"+d.page_content[:40]+"..."+d.page_content[-30:])
        else:
            print("\tsource: "+d.metadata['source']+"\t"+d.page_content)

## FYI: other ways to build the chain


The same chain can also be written in either of the following ways:
```python
RunnableSequence(
    RunnableParallel({
        "context": retriever,
        "question": RunnablePassthrough()
    }),
    RunnableAssign(           
      mapper=RunnableParallel(
          {"answer": 
            RunnableAssign(
                mapper={"context": RunnableLambda(lambda x: format_docs(x['context']))}
              )
            | prompt
            | llm
            | StrOutputParser()
          }
      )
    )   
)

# Alternatively, using the .assign() shorthand:
RunnableParallel({
    "context": retriever,
    "question": RunnablePassthrough()
}).assign(answer=RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
        | prompt
        | llm
        | StrOutputParser()
)


```

# Default

In [9]:
question = "describe the hustory of US and UK"
retriever  = vectorstore.as_retriever()
chain = make_chain(retriever)
ask_question(chain, question)

question =  describe the hustory of US and UK
answer =  The history of the United States and the United Kingdom has been marked by periods of conflict and cooperation. After the American Revolutionary War, where the United States gained independence from Britain, relations between the two countries remained generally peaceful in the 19th century with occasional border disputes and tensions during the American Civil War. In the 20th century, spurred by multiple world conflicts, the US and UK became close allies.

The Rush–Bagot Treaty in 1817 demilitarized the Great Lakes and Lake Champlain, laying the basis for a demilitarized boundary that remains in effect today. The War of 1812, fought between the US and UK, ended with the Treaty of Paris in 1783, giving the US nearly all the territory east of the Mississippi River and south of the Great Lakes. Despite the war, the two nations quickly resumed trade and developed a growing friendship.

Overall, the relationship between the US and UK 

In [10]:
# Retrieve more documents with higher diversity
# Useful if your dataset has many similar documents
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 6, 'lambda_mult': 0.25}
)
chain = make_chain(retriever)
ask_question(chain, question)

question =  describe the hustory of US and UK
answer =  The history of the United States and the United Kingdom dates back to the colonization of North America by the British Empire in the late 15th century. By the 1760s, thirteen British colonies were established along the Atlantic Coast. The Southern Colonies built an agricultural system based on slave labor, enslaving millions from Africa. After defeating France, the British Parliament imposed taxes like the Stamp Act of 1765, leading to resistance by colonists. The Boston Tea Party in 1773 and the Intolerable Acts issued by Parliament in response led to armed conflict in Massachusetts in 1775.

Despite early tensions, relations between the United States and Britain remained generally peaceful for the rest of the 19th century, except for occasional border disputes and some tensions during and after the American Civil War. In the 20th century, spurred by multiple world conflicts, the two countries became close allies. The memory of c

In [11]:
# Fetch more documents (fetch_k) for the MMR algorithm to consider
# But only return the top 3 (k)
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 3, 'fetch_k': 9}
)
chain = make_chain(retriever)
ask_question(chain, question)

question =  describe the hustory of US and UK
answer =  The history of the US and UK dates back to the late 15th century when European colonization began in North America. The British Empire colonized the Atlantic Coast, establishing thirteen colonies by the 1760s. The Southern Colonies built an agricultural system based on slave labor, enslaving millions from Africa. Tensions arose between the colonies and Britain due to taxes imposed by the British Parliament, leading to resistance and armed conflict, which began in Massachusetts in 1775.

After the American Revolutionary War, the United States gained independence from Britain in 1783. Despite some tensions during and after the American Civil War, relations between the US and UK remained peaceful for the rest of the 19th century. In the 20th century, spurred by multiple world conflicts, the two countries became close allies. The Rush–Bagot Treaty of 1817 between the US and Britain demilitarized the Great Lakes and Lake Champlain, lay
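
To show what `k`, `fetch_k` and `lambda_mult` control, here is a minimal pure-Python sketch of maximal marginal relevance. It is an illustration of the general technique under stated assumptions (cosine similarity, a greedy selection loop), not the code FAISS or LangChain runs; `cosine` and `mmr_select` are hypothetical helpers and the default values are arbitrary.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, vectors, k=4, fetch_k=20, lambda_mult=0.5):
    """Return the indices of up to k vectors chosen by maximal marginal relevance."""
    # Stage 1: plain similarity search keeps the fetch_k closest candidates
    by_similarity = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    candidates = by_similarity[:fetch_k]
    selected = []
    # Stage 2: greedily add the candidate with the best relevance/redundancy trade-off
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, vectors[i])
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
vectors = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(mmr_select(query, vectors, k=2, lambda_mult=1.0))   # [0, 1]: pure relevance
print(mmr_select(query, vectors, k=2, lambda_mult=0.25))  # [0, 2]: favours diversity
```

With `lambda_mult=1.0` the near-duplicate vector is picked second; with `lambda_mult=0.25` the redundancy penalty pushes the selection towards the less similar but more distinct vector. A larger `fetch_k` simply gives the second stage more candidates to diversify over.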

In [12]:
# Only retrieve documents that have a relevance score
# above a certain threshold
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={'score_threshold': 0.8}
)
chain = make_chain(retriever)
ask_question(chain, question)




question =  describe the hustory of US and UK
answer =  The history of the United States and the United Kingdom is complex and spans centuries of interactions, conflicts, and alliances. 

The United Kingdom, also known as Great Britain, is made up of England, Scotland, Wales, and Northern Ireland. Its history dates back to ancient times, with the Roman conquest of Britain in 43 AD being a significant event. Over the centuries, the UK saw the rise and fall of various kingdoms, wars, and alliances with other European powers.

The United States, on the other hand, was colonized by the British in the 17th century. The American Revolution in the late 18th century saw the 13 colonies break away from British rule and form the United States of America. This marked the beginning of a new chapter in the history of both countries.

Throughout history, the US and the UK have had a complex relationship, characterized by periods of cooperation and conflict. The two countries have been allies in majo
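
A threshold retriever behaves like a filter over scored results. Below is a minimal sketch, assuming relevance scores normalised to the range 0 to 1; `filter_by_score` and the sample scores are hypothetical. It also shows why a strict threshold deserves a sanity check: if nothing clears it, the context is empty and the model can only answer from its own knowledge.

```python
def filter_by_score(scored_docs: list[tuple[str, float]], score_threshold: float) -> list[str]:
    """Keep documents whose relevance score is at least score_threshold, best first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept]


scored = [("war of 1812 chunk", 0.74), ("rush-bagot chunk", 0.81), ("unrelated chunk", 0.32)]
print(filter_by_score(scored, 0.8))  # ['rush-bagot chunk']
print(filter_by_score(scored, 0.9))  # []
```

When experimenting with `score_threshold`, it can help to print how many documents came back before reading the answer.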

In [13]:
# Use a metadata filter to only retrieve documents from a specific source file
retriever = vectorstore.as_retriever(
    search_kwargs={'filter': {'source':'./text\\war1812.html'}}
)
chain = make_chain(retriever)
ask_question(chain, question)


question =  describe the hustory of US and UK
answer =  The history of the relationship between the United States and the United Kingdom has had its ups and downs. Despite occasional border disputes and tensions during and after the American Civil War, relations between the two countries remained peaceful for most of the 19th century. In the 20th century, the US and UK became close allies, largely due to their cooperation during various world conflicts.

The Rush-Bagot Treaty of 1817 between the US and UK demilitarized the Great Lakes and Lake Champlain, setting the stage for a peaceful boundary. This treaty remains in effect today. The War of 1812 is often overlooked in the UK, as the British considered it a minor conflict compared to their involvement in battles against Napoleon.

The United States, on the other hand, saw the War of 1812 as a second war of independence, leading to a surge in nationalism and unity during the Era of Good Feelings. The US gained a sense of complete inde
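
The filter value has to match the stored `source` metadata exactly. The path in the cell above contains a backslash, presumably because the index was built with `os.path.join` on Windows; on another OS the same file would be stored with a forward slash. A small sketch that builds the filter the same way the indexing loop built the path (`source_filter` is a hypothetical helper):

```python
import ntpath
import posixpath
from os.path import join


def source_filter(folder: str, filename: str, path_join=join) -> dict:
    """Build search_kwargs whose 'source' filter matches a path made with path_join."""
    return {"filter": {"source": path_join(folder, filename)}}


# Uses the current platform's separator, like the indexing loop did
print(source_filter("./text", "war1812.html"))
# What the same call yields with Windows and with POSIX path rules
print(source_filter("./text", "war1812.html", ntpath.join))
print(source_filter("./text", "war1812.html", posixpath.join))
```

The result could be passed as `search_kwargs` to `vectorstore.as_retriever`, provided the index was built on the same platform.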

# multiquery
https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/MultiQueryRetriever/

This uses an LLM to come up with multiple ways of asking the question. The goal is to create different question embeddings so that retrieval returns a wider range of relevant documents.

In [14]:
from typing import List
from langchain_core.output_parsers import BaseOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.retrievers.multi_query import MultiQueryRetriever


In [16]:



# Output parser will split the LLM result into a list of queries
class LineListOutputParser(BaseOutputParser[List[str]]):
    """Output parser for a list of lines."""

    def parse(self, text: str) -> List[str]:
        lines = text.strip().split("\n")
        return lines


output_parser = LineListOutputParser()

QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five 
    different versions of the given user question to retrieve relevant documents from a vector 
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search. 
    Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
llm = ChatOpenAI(temperature=0)

# Chain
llm_chain = QUERY_PROMPT | llm | output_parser
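
One thing to watch with a plain `text.strip().split("\n")`: if the model separates its questions with blank lines or numbers them, those artefacts end up in the queries too. A hedged sketch of a stricter splitting function follows; `split_queries` is hypothetical, and the same logic could be placed inside `parse`.

```python
import re


def split_queries(text: str) -> list[str]:
    """Split LLM output into one query per line, dropping blanks and list markers."""
    queries = []
    for line in text.splitlines():
        # Remove leading numbering or bullets such as "1. ", "2) " or "- "
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s+", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries


raw = "1. What is the history of the US and UK?\n\n2. How did US-UK relations evolve?\n- Which wars did they fight?"
print(split_queries(raw))
```

Lines that merely start with a number, such as "1812 was a turning point", are left alone because the pattern requires a `.` or `)` right after the digits.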


In [17]:
# Run
retriever = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(), llm_chain=llm_chain, parser_key="lines"
)  # "lines" is the key (attribute name) of the parsed output

# Results
unique_docs = retriever.invoke(question)

We build a second LLM chain that rephrases the question, which makes the retriever more flexible in how it selects documents.\
We can then use this retriever like any other.
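
Conceptually, the multi-query retriever runs every generated question through the base retriever and returns the de-duplicated union of the results. A plain-Python sketch of that idea, with a hypothetical keyword-matching `search` standing in for the vector store and hard-coded `variants` standing in for the LLM output:

```python
CORPUS = [
    "The War of 1812 was fought between the US and the UK.",
    "The Rush-Bagot Treaty of 1817 demilitarized the Great Lakes.",
    "In the 20th century the two countries became close allies.",
]


def search(query: str) -> list[str]:
    # Stand-in for a vector search: return chunks that share a word with the query
    words = set(query.lower().split())
    return [doc for doc in CORPUS if words & set(doc.lower().rstrip(".").split())]


def multi_query_search(queries: list[str]) -> list[str]:
    """Union of the results for every query, without duplicates, in first-seen order."""
    seen = set()
    unique_docs = []
    for query in queries:
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


variants = ["1812 war", "war treaty", "allies"]
print(len(search(variants[0])), len(multi_query_search(variants)))  # 1 3
```

A single phrasing reaches one chunk here, while the three phrasings together reach all three, and the chunk matched twice is only returned once.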

In [18]:
chain = make_chain(retriever)
ask_question(chain, question)

question =  describe the hustory of US and UK
answer =  The history of the United States and the United Kingdom has been marked by various events and conflicts, with the War of 1812 being a significant point of contention between the two nations. The long-term results of the war were generally satisfactory for both countries, leading to peaceful relations for the rest of the 19th century. In the 20th century, spurred by multiple world conflicts, the US and UK became close allies.

The Rush–Bagot Treaty of 1817 demilitarized the Great Lakes and Lake Champlain, laying the basis for a demilitarized boundary that remains in effect to this day. The historian Donald Hickey suggests that Britain's long-term policy of rapprochement with the US in the 19th century was driven by the belief that accommodating the US was the best way to defend Canada.

The War of 1812 had significant impacts on both nations. In the US, it led to a period known as the Era of Good Feelings, characterized by national

# Other

## multivector
If we build a document index alongside our vector db, we can explicitly state which documents we think are most relevant to the answer.\
https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/multi_vector/
## self query
We use an LLM to 'self-query': it turns our question into a structured query over the documents, which will hopefully produce a better result.\
https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/self_query/
## contextual compression
Compresses the documents pulled by a retriever into more condensed text, getting rid of information that is not relevant to the question.\
https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/contextual_compression/
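
As one illustration of the ideas above, contextual compression can be pictured as keeping only the parts of each retrieved document that bear on the question. The sketch below uses naive keyword overlap as a hypothetical stand-in for the compressors described at the link; `compress` is not a LangChain function.

```python
def compress(document: str, question: str) -> str:
    """Keep only the sentences that share at least one longer word with the question."""
    keywords = {w for w in question.lower().split() if len(w) > 3}
    kept = []
    for sentence in document.split(". "):
        words = set(sentence.lower().replace(".", "").split())
        if keywords & words:
            kept.append(sentence.rstrip("."))
    return ". ".join(kept) + ("." if kept else "")


doc = "The treaty was signed in 1817. The weather was mild that year. The treaty covered the Great Lakes."
print(compress(doc, "what did the treaty cover?"))
# The treaty was signed in 1817. The treaty covered the Great Lakes.
```

The sentence about the weather is dropped, so the prompt carries less irrelevant text for the same retrieved document.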


# community vdb and support

https://api.python.langchain.com/en/latest/community_api_reference.html#module-langchain_community.vectorstores

There are many vector stores made by the community.\
LangChain has a [native vectorstore](https://api.python.langchain.com/en/latest/core_api_reference.html#module-langchain_core.vectorstores).\
Most seem to be similar but interface with a different storage backend (SQL databases, Redis, FAISS, etc.).\
[This](https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.faiss.FAISS.html#langchain_community.vectorstores.faiss.FAISS) is the link to the FAISS vector store. It comes with most of the standard ways to add text documents, search in different ways, etc.


https://api.python.langchain.com/en/latest/community_api_reference.html#module-langchain_community.retrievers

Retrievers are more general than vector stores; they just retrieve documents. Most simply use a vector store as a backbone and source of information, but there are other ways of doing this.\
Many of the community retrievers focus on retrieving from specific sources: arXiv papers, Wikipedia, databases.
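
To illustrate that a retriever only needs to map a query to documents, here is a hedged plain-Python sketch of one with no vector store behind it. The `Retriever` protocol and `KeywordRetriever` class are hypothetical and do not use LangChain's own base classes; they only mirror the idea of an `invoke(query)` method.

```python
from typing import Protocol


class Retriever(Protocol):
    def invoke(self, query: str) -> list[str]: ...


class KeywordRetriever:
    """Rank documents by how many query words they contain; no embeddings involved."""

    def __init__(self, documents: list[str], k: int = 2):
        self.documents = documents
        self.k = k

    def invoke(self, query: str) -> list[str]:
        words = set(query.lower().split())
        scored = [(len(words & set(doc.lower().split())), doc) for doc in self.documents]
        hits = [(score, doc) for score, doc in scored if score > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in hits[: self.k]]


def top_result(retriever: Retriever, query: str) -> str:
    # Works with anything that exposes invoke(query), whatever sits behind it
    results = retriever.invoke(query)
    return results[0] if results else ""


retriever = KeywordRetriever(["faiss stores vectors", "arxiv hosts papers", "wikipedia hosts articles"])
print(top_result(retriever, "who hosts papers"))  # arxiv hosts papers
```

Anything that satisfies the same interface, whether it queries a search API, a SQL table or a vector index, could be dropped into a chain in the same position.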