# Understanding Retrieval Question Answering

### Setup

In [None]:
!pip install -Uqqq rich openai tiktoken wandb langchain unstructured tabulate pdf2image chromadb

In [1]:
import os, random
from pathlib import Path
import tiktoken
from getpass import getpass
from rich.markdown import Markdown

In [2]:
if os.getenv("OPENAI_API_KEY") is None:
  if any(["VSCODE" in x for x in os.environ.keys()]):
    print('Please enter password in the VS Code prompt at the top of your VS Code window.')
  os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key from: https://platform.openai.com/account/api")

assert os.getenv("OPENAI_API_KEY", "").startswith("sk-"), "This doesn't look like a valid OpenAI API key"
print("OpenAI API key configured")

OpenAI API key configured


## LangChain
LangChain is a framework for developing applications powered by language models. We will use some of its features in the code below.

In [3]:
# We need a single line of code to start tracing LangChain with W&B
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

# configuring the wandb project name
os.environ["WANDB_PROJECT"] = "llmapps"

## Parsing documents
We will use a sample of markdown documents in this notebook. Let's find them and make sure we can stuff them into the prompt. That means they may need to be chunked so that they do not exceed a given number of tokens.

In [4]:
MODEL_NAME = "text-davinci-003"

In [5]:
from langchain.document_loaders import DirectoryLoader

def find_md_files(directory):
  "Find all markdown files in a directory and return them as a list of LangChain Documents"
  dl = DirectoryLoader(directory, "**/*.md")
  return dl.load()

documents = find_md_files("../docs_sample/")
len(documents)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ghost\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ghost\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


11

In [6]:
# We will need to count tokens in the documents, and for that we need the tokenizer
tokenizer = tiktoken.encoding_for_model(MODEL_NAME)

In [7]:
# Function to count the number of tokens in each document
def count_tokens(documents):
  token_counts = [len(tokenizer.encode(document.page_content)) for document in documents]
  return token_counts

count_tokens(documents)

[310, 2135, 2330, 2592, 665, 1154, 387, 763, 2047, 2616, 1199]
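To see why chunking is needed, we can compare these counts against a context budget. This is a minimal sketch: the counts are copied from the output above, and `CONTEXT_BUDGET` is an assumed value for the context window of `text-davinci-003`, so adjust it for the model you actually use.

```python
# Token counts copied from the output of count_tokens(documents) above
token_counts = [310, 2135, 2330, 2592, 665, 1154, 387, 763, 2047, 2616, 1199]

# Assumption: context window of the model, in tokens (prompt + completion)
CONTEXT_BUDGET = 4097

total_tokens = sum(token_counts)
fits_whole = total_tokens <= CONTEXT_BUDGET

# Even a handful of whole documents can be too much: check the three largest
three_largest = sum(sorted(token_counts, reverse=True)[:3])
three_largest_fit = three_largest <= CONTEXT_BUDGET

print(f"all documents: {total_tokens} tokens, fits: {fits_whole}")
print(f"three largest: {three_largest} tokens, fits: {three_largest_fit}")
```

Neither the full set nor the three largest documents fit under this assumed budget, which is why we split the documents into smaller sections next.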

We will use LangChain's built-in MarkdownTextSplitter to split the documents into sections. Splitting Markdown without breaking its syntax is not easy; this splitter strips the syntax out.

- We can pass the `chunk_size` param to avoid lengthy chunks. By default it is measured in characters, not tokens.
- The `chunk_overlap` param helps preserve context when a sentence is cut at a chunk boundary. This is less necessary with Markdown.

The MarkdownTextSplitter also takes care of removing double line breaks, which saves us some tokens.
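To build intuition for `chunk_size` and `chunk_overlap`, here is a naive sliding-window splitter in plain Python. It is only an illustration of the two parameters: `split_with_overlap` is a hypothetical helper, and it is not how MarkdownTextSplitter works internally, since that splitter prefers to cut on Markdown separators rather than at fixed character offsets.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

toy_chunks = split_with_overlap("0123456789", chunk_size=4, chunk_overlap=2)
print(toy_chunks)
```

Each chunk repeats the last two characters of the previous one, so text near a boundary shows up in both neighbors.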

In [8]:
from langchain.text_splitter import MarkdownTextSplitter

md_text_splitter = MarkdownTextSplitter(chunk_size=1000)
document_sections = md_text_splitter.split_documents(documents)
len(document_sections), max(count_tokens(document_sections))

(90, 438)

In [9]:
Markdown(document_sections[0].page_content)

## Embeddings
Let's now use embeddings with a vector database retriever to find relevant documents for a query.

In [10]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# We will use OpenAIEmbeddings to embed the text, and Chroma to store the vectors
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(document_sections, embeddings)

We can now create a retriever from the db. The `k` param sets how many of the most relevant sections the similarity search returns.

In [11]:
retriever = db.as_retriever(search_kwargs=dict(k=3))
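Conceptually, a similarity search ranks the stored vectors by how close they are to the query vector and keeps the top `k`. The sketch below shows the idea with cosine similarity on made-up three-dimensional vectors; `cosine_similarity`, `top_k`, and the toy data are hypothetical, real embeddings have far more dimensions, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the names of the k vectors most similar to query_vec."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Made-up vectors standing in for embedded document sections
toy_vectors = {
    "reports": [0.9, 0.1, 0.0],
    "teams": [0.7, 0.3, 0.1],
    "sweeps": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)
```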

In [12]:
query = "How can I share my W&B report with my team members in a public W&B project?"
docs = retriever.get_relevant_documents(query)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Streaming LangChain activity to W&B at https://wandb.ai/arrogantemartin/llmapps/runs/h6o8klq3
wandb: `WandbTracer` is currently in beta.
wandb: Please report any issues to https://github.com/wandb/wandb/issues with the tag `langchain`.


In [13]:
# Let's see the results
for doc in docs:
  print(doc.metadata["source"])

..\docs_sample\collaborate-on-reports.md
..\docs_sample\collaborate-on-reports.md
..\docs_sample\teams.md


## Stuff Prompt
We'll now take the content of the retrieved documents, stuff it into a prompt template along with the query, and pass the result to an LLM to obtain the answer.

In [14]:
from langchain.prompts import PromptTemplate

prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

context = "\n\n".join([doc.page_content for doc in docs])
prompt = PROMPT.format(context=context, question=query)


Use LangChain to call the OpenAI completion API with the prompt.

In [15]:
from langchain.llms import OpenAI

llm = OpenAI()
response = llm.predict(prompt)
Markdown(response)

## Using LangChain
LangChain gives us tools to do this efficiently in a few lines of code. Let's do the same using the `RetrievalQA` chain.

In [16]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
result = qa.run(query)

Markdown(result)
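Roughly speaking, a "stuff" style chain repeats the manual steps from earlier: retrieve, join the sections into one context, fill the template, call the model. The sketch below spells that out with stand-ins so it runs without an API key. `stuff_answer`, `fake_retrieve`, and `echo_llm` are hypothetical names, and this is an approximation of the idea rather than LangChain's actual implementation.

```python
TEMPLATE = "{context}\n\nQuestion: {question}\nHelpful Answer:"

def stuff_answer(question, retrieve, llm, template=TEMPLATE):
    """Retrieve sections, stuff them all into one prompt, and ask the model once."""
    sections = retrieve(question)
    context = "\n\n".join(sections)
    prompt = template.format(context=context, question=question)
    return llm(prompt)

# Stand-in retriever: returns fixed text instead of querying a vector store
def fake_retrieve(question):
    return ["Section A text.", "Section B text."]

# Stand-in model: echoes the prompt so we can inspect what would be sent
def echo_llm(prompt):
    return prompt

answer = stuff_answer("How do I share a report?", fake_retrieve, echo_llm)
print(answer)
```

Because every retrieved section goes into a single prompt, this approach only works while the combined sections stay within the model's context window.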

In [17]:
import wandb
wandb.finish()