# Understanding Retrieval Question Answering

In [1]:
%pip install -Uqqq rich openai tiktoken wandb langchain unstructured tabulate pdf2image chromadb

Note: you may need to restart the kernel to use updated packages.


In [41]:
%pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
Note: you may need to restart the kernel to use updated packages.


In [1]:
import os, random
from pathlib import Path
import tiktoken
from getpass import getpass
from rich.markdown import Markdown
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

#### OpenAI API key

In [2]:
# Prompt for the key instead of hard-coding it, so the secret is never saved in the notebook
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

In [46]:
# we need a single line of code to start tracing langchain with W&B
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

# wandb documentation to configure wandb using env variables
# https://docs.wandb.ai/guides/track/advanced/environment-variables
# here we are configuring the wandb project name
os.environ["WANDB_PROJECT"] = "maven-article"

In [47]:
MODEL_NAME = "text-davinci-003"
# MODEL_NAME = "gpt-4"

In [13]:
!pwd

/home/jupyter/MLSysDes


In [14]:
local_dir = '/home/jupyter/MLSysDes/data'

In [15]:
from langchain.document_loaders import DirectoryLoader

def find_md_files(directory):
    "Load every .txt file under `directory` (recursively) as LangChain Documents"
    dl = DirectoryLoader(directory, "**/*.txt")
    return dl.load()

documents = find_md_files(local_dir)
len(documents)

3

In [16]:
documents

[Document(page_content='Anarchy is a society without a government. It may also refer to a society or group of people that entirely rejects a set hierarchy.\n\nIn practical terms, anarchy can refer to the curtailment or abolition of traditional forms of government and institutions. It can also designate a nation or any inhabited place that has no system of government or central rule. Anarchy is primarily advocated by individual anarchists who propose replacing government with voluntary institutions. These institutions or free associations are generally modeled on nature since they can represent concepts such as community and economic self-reliance, interdependence, or individualism. Although anarchy is often negatively used as a synonym of chaos or societal collapse or anomie, this is not the meaning that anarchists attribute to anarchy, a society without hierarchies.\n\nEtymology[edit] Anarchy comes from the Latin word anarchia, which came from the Greek word anarchos ("having no ruler

In [17]:
# We will need to count tokens in the documents, and for that we need the tokenizer
tokenizer = tiktoken.encoding_for_model(MODEL_NAME)

In [18]:
# function to count the number of tokens in each document
def count_tokens(documents):
    token_counts = [len(tokenizer.encode(document.page_content)) for document in documents]
    return token_counts

count_tokens(documents)

[231, 345, 1762]

We will use LangChain's built-in `MarkdownTextSplitter` to split the documents into sections. Splitting `Markdown` without breaking its syntax is not easy, and this splitter strips out syntax.
- The `chunk_size` param caps the length of each chunk and avoids lengthy chunks. By default the length is measured in characters, not tokens.
- The `chunk_overlap` param repeats some text between neighbouring chunks so that sentences are not cut at arbitrary points. This is less necessary with `Markdown`.

The `MarkdownTextSplitter` also removes double line breaks, which saves us some tokens.
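
To make the two parameters concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is not how `MarkdownTextSplitter` works internally (that splitter also looks for Markdown-aware separators); `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of at most chunk_size characters.

    Hypothetical helper: consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy text
pieces = chunk_text(sample, chunk_size=12, chunk_overlap=4)
for piece in pieces:
    print(len(piece), piece)
```

With a larger overlap, more text is duplicated across chunks (and later embedded twice), so the value is a trade-off between context continuity and cost.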

In [19]:
from langchain.text_splitter import MarkdownTextSplitter

md_text_splitter = MarkdownTextSplitter(chunk_size=1000)
document_sections = md_text_splitter.split_documents(documents)
len(document_sections), max(count_tokens(document_sections))

(18, 203)

In [20]:
Markdown(document_sections[0].page_content)

## Embeddings

Let's now use embeddings with a vector database retriever to find relevant documents for a query.
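
Conceptually, a vector retriever embeds the query, scores every stored vector by similarity, and returns the top `k` matches. The sketch below shows that idea with hand-made 3-dimensional vectors standing in for real embeddings. The helpers `cosine_similarity` and `top_k`, the toy labels, and the toy vectors are all assumptions for illustration; the metric a real vector store uses depends on how it is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """Return the k labels whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy "embeddings": invented numbers, not produced by any model
toy_store = {
    "anarchy": [0.9, 0.1, 0.0],
    "individualism": [0.1, 0.9, 0.2],
    "collectivism": [0.0, 0.3, 0.9],
}
toy_query = [0.2, 0.8, 0.1]
print(top_k(toy_query, toy_store, k=2))
```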

In [21]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# We will use the OpenAIEmbeddings to embed the text, and Chroma to store the vectors
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(document_sections, embeddings)

In [22]:
retriever = db.as_retriever(search_kwargs=dict(k=3))

In [23]:
query = "What is Individuation?"
docs = retriever.get_relevant_documents(query)

In [24]:
# Let's see the results
for doc in docs:
    print(doc.metadata["source"])

/home/jupyter/MLSysDes/data/Individualism.txt
/home/jupyter/MLSysDes/data/Individualism.txt
/home/jupyter/MLSysDes/data/Individualism.txt


## Stuff Prompt

We'll now take the content of the retrieved documents, stuff it into a prompt template along with the query, and pass the result to an LLM to obtain the answer.

In [24]:
from langchain.prompts import PromptTemplate

prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

context = "\n\n".join([doc.page_content for doc in docs])
prompt = PROMPT.format(context=context, question=query)
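
Stuffing only works while the joined context fits in the model's context window. A minimal sketch of one way to guard against that is to keep adding retrieved chunks until a token budget is reached. `stuff_context` and `count_tokens_fn` are hypothetical names; the default counter splits on whitespace as a crude stand-in for a tokenizer, and in this notebook one could pass `lambda s: len(tokenizer.encode(s))` instead.

```python
def stuff_context(chunks, max_tokens, count_tokens_fn=lambda s: len(s.split())):
    """Join chunks in order until adding the next one would exceed max_tokens."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens_fn(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return "\n\n".join(kept), used


toy_chunks = ["one two three", "four five", "six seven eight nine"]
context_text, tokens_used = stuff_context(toy_chunks, max_tokens=6)
print(tokens_used)
print(context_text)
```

Because the retriever returns chunks in order of relevance, dropping from the end discards the least relevant text first.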

In [25]:
from langchain.llms import OpenAI
llm = OpenAI()
response = llm.predict(prompt)
Markdown(response)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········

wandb: ERROR API key must be 40 characters long, yours was 51
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········

wandb: Appending key for api.wandb.ai to your netrc file: /home/jupyter/.netrc
wandb: Streaming LangChain activity to W&B at https://wandb.ai/priya_r_h/maven-article/runs/4ehf1w1n
wandb: `WandbTracer` is currently in beta.
wandb: Please report any issues to https://github.com/wandb/wandb/issues with the tag `langchain`.



## Using LangChain

LangChain gives us tools to do this efficiently in a few lines of code. Let's do the same using the `RetrievalQA` chain.

In [26]:
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain

# chain_type="stuff" puts all retrieved chunks into a single prompt, as we did by hand above
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)

result = qa.run(query)

Markdown(result)
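
Roughly speaking, a "stuff" retrieval chain performs three steps: retrieve chunks, format them into a prompt, and call the LLM. The sketch below spells those steps out with plain callables. `answer_with_retrieval`, `fake_retrieve`, and `fake_llm` are hypothetical stand-ins, not LangChain APIs, and `TEMPLATE` is an abbreviated copy of the prompt used earlier.

```python
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def answer_with_retrieval(question, retrieve, llm, template=TEMPLATE):
    """Retrieve chunks, stuff them into the template, and call the LLM."""
    chunks = retrieve(question)
    prompt = template.format(context="\n\n".join(chunks), question=question)
    return llm(prompt), prompt


def fake_retrieve(question):
    # Stand-in for a vector store lookup
    return ["chunk A about the topic", "chunk B about the topic"]


def fake_llm(prompt):
    # Stand-in for a model call: reports how many chunks it was given
    return f"(answer based on {prompt.count('chunk')} chunks)"


answer_text, sent_prompt = answer_with_retrieval("What is X?", fake_retrieve, fake_llm)
print(answer_text)
```

Keeping the retriever and the LLM as injected callables makes each step easy to swap or test in isolation.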

In [27]:
import wandb
wandb.finish()

### From the Miami hotels CSV

In [5]:
from langchain.document_loaders.csv_loader import CSVLoader
# from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DirectoryLoader

In [28]:
from langchain.embeddings.openai import OpenAIEmbeddings
import nltk
nltk.download('punkt')
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms import OpenAI
from langchain.vectorstores import FAISS

[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [50]:
path = 'data/miami_hotels.csv'
loader = CSVLoader(file_path=path, source_column="title")

data = loader.load()

In [31]:
from langchain.document_loaders import TextLoader
loader = TextLoader('/home/jupyter/MLSysDes/data/anarchy.txt')
documents = loader.load()

In [51]:
print(data[0])

page_content="id: 7787044\ntype: HOTEL\nname: Faena Miami Beach\nimage: https://media-cdn.tripadvisor.com/media/photo-o/1d/78/a4/13/exterior-view.jpg\nawards: []\nrankingPosition: 5\npriceLevel: $$$$\npriceRange: $729 - $1,426\ncategory: hotel\nrating: 4.5\nhotelClass: 0.0\nhotelClassAttribution: \nphone: 13055348800\naddress: 3201 Collins Ave Faena District, Miami Beach, FL 33140-4023\nemail: reservations-miamibeach@faena.com\namenities: []\nnumberOfRooms: 179\nprices: []\nlatitude: 25.807375\nlongitude: -80.12364\nwebUrl: https://www.tripadvisor.com/Hotel_Review-g34439-d7787044-Reviews-Faena_Miami_Beach-Miami_Beach_Florida.html\nwebsite: https://www.faena.com/miami-beach\nrankingString: #5 of 235 hotels in Miami Beach\nrankingDenominator: 235\nnumberOfReviews: 2123\nreview: Hands down my absolute favorite hotel in South Beach—there’s no place I’d rather stay. I’ve been back four times in the past six months –each experience is better than the last and I always look forward to coming 

In [52]:
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your document')

You have 2511 document(s) in your data
There are 2444 characters in your document
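
As the printed record suggests, each CSV row becomes one document whose text is a list of `column: value` lines. A minimal sketch of that transformation with the standard `csv` module is shown here. `row_to_text` is a hypothetical helper, not `CSVLoader`'s implementation, and the two-column toy CSV only reuses the `name` and `rating` values visible in the output above.

```python
import csv
import io


def row_to_text(row):
    """Render one CSV row as 'column: value' lines."""
    return "\n".join(f"{column}: {value}" for column, value in row.items())


toy_csv = io.StringIO("name,rating\nFaena Miami Beach,4.5\n")
records = [
    # "source" is taken from one column, similar in spirit to source_column
    {"page_content": row_to_text(row), "metadata": {"source": row["name"], "row": i}}
    for i, row in enumerate(csv.DictReader(toy_csv))
]
print(records[0]["page_content"])
```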


In [53]:
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(data, embeddings)

In [54]:
query = 'What are tourist attractions in Miami?'

In [43]:
def get_response_from_query(db, query, k=5):
    """
    Answer `query` from the `k` most similar documents in `db`, then ask the same
    LLM whether that answer is supported by those documents.

    All retrieved documents are concatenated into one prompt, so `k` must be small
    enough for the combined text to fit in the model's context window.

    Returns a tuple (response, docs, evals).
    """

    docs = db.similarity_search(query, k=k)

    docs_page_content = " ".join([d.page_content for d in docs])

    # llm = BardLLM()
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

    prompt = PromptTemplate(
        input_variables=["question", "docs"],
        template="""
        A bot that is open to discussions about different cultural, philosophical and political exchanges. I will apply different analyses to the articles provided to me. Stay truthful and if you weren't provided any resources give your opinion only.
        Answer the following question: {question}
        By searching the following articles: {docs}
        
        Only use the factual information from the documents. Make sure to mention key phrases from the articles.
        
        If you feel like you don't have enough information to answer the question, say "I don't know".
      
        """,
    )

    chain = LLMChain(llm=llm, prompt=prompt)
    # chain = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, prompt=prompt,
    #                                                     chain_type="stuff", retriever=db.as_retriever(), return_source_documents=True)

    # LLMChain only needs the prompt variables; the sources are returned separately as `docs`
    response = chain.run(question=query, docs=docs_page_content)
    r_text = str(response)

    # Evaluation: ask the LLM whether the answer agrees with the retrieved documents.
    # Note that the answer (not the user query) is passed as `question` here.
    prompt_eval = PromptTemplate(
        input_variables=["question", "docs"],
        template="""
        Your job is to evaluate if the response to a question is similar to the source given.
        
        for the following: {question}
        By searching the following article: {docs}
        
        Give a reason why they are similar or not, start with a Yes or a No.
      
        """,
    )

    chain_part_2 = LLMChain(llm=llm, prompt=prompt_eval)

    evals = chain_part_2.run(question=r_text, docs=docs_page_content)

    return response, docs, evals
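
Because the evaluation prompt asks the model to start with a Yes or a No, the verdict can be turned into a boolean for logging or filtering. The following is a minimal sketch under that assumption. `parse_verdict` is a hypothetical helper, and it returns `None` when the reply does not begin with either word, since a model is not guaranteed to follow the requested format.

```python
import re


def parse_verdict(evaluation):
    """Return True for a leading 'Yes', False for a leading 'No', else None."""
    match = re.match(r"\s*(yes|no)\b", evaluation, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


print(parse_verdict("Yes, the responses provided are similar to the source given"))
print(parse_verdict("No. The answer mentions facts that are absent from the source."))
print(parse_verdict("It is hard to say."))
```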

In [48]:
answer, sources, evals = get_response_from_query(db, query, 5)

In [49]:
print("\n\n> Question:")
print(query)
print("\n> Answer:")
print(answer)
print("\n> Eval:")
print(evals)

# Print the relevant sources used for the answer
print("----------------------------------SOURCE DOCUMENTS---------------------------")
for document in sources:
    print("\n> " + document.metadata["source"])
    print(document.page_content[:300])
print("----------------------------------SOURCE DOCUMENTS---------------------------")




> Question:
What are tourist attractions in Miami?

> Answer:
Based on the articles provided, some tourist attractions in Miami include the beach, Lincoln Road, and the Art Deco district. The Pestana Miami South Beach hotel is located only two blocks from Lincoln Road and a quick walk to the beach, making it a great place for families and couples looking for a peaceful beach getaway. The MB Hotel, Trademark Collection by Wyndham also has lovely beach access and a beautiful pool. The Majestic Hotel South Beach is located on Ocean Drive, which is known for its Art Deco architecture. However, there is no specific list of tourist attractions provided in these articles, so there may be other popular destinations in Miami that are not mentioned.

> Eval:
Yes, the responses provided are similar to the source given as they all mention hotels in Miami Beach and their amenities, such as proximity to the beach, pool access, and helpful staff. However, they do not provide a specific list of tour