# Understanding Retrieval Question Answering

### Setup

In [1]:
!pip install -U -q -r requirements.txt



In [1]:
import os, random
from pathlib import Path
import tiktoken
from getpass import getpass
from rich.markdown import Markdown
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

import wandb

You will need an OpenAI API key to run this notebook. You can get one [here](https://platform.openai.com/account/api-keys).

In [2]:
if os.getenv("OPENAI_API_KEY") is None:
  if any(['VSCODE' in x for x in os.environ.keys()]):
    print('Please enter password in the VS Code prompt at the top of your VS Code window!')
  os.environ["OPENAI_API_KEY"] = getpass("")

assert os.getenv("OPENAI_API_KEY", "").startswith("sk-"), "This doesn't look like a valid OpenAI API key"
print("OpenAI API key configured")

Please enter password in the VS Code prompt at the top of your VS Code window!
OpenAI API key configured


## LangChain

[LangChain](https://docs.langchain.com/docs/) is a framework for developing applications powered by language models. We will use some of its features in the code below. Let's start by configuring W&B tracing. 

In [9]:
# we need a single line of code to start tracing langchain with W&B
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

# wandb documentation to configure wandb using env variables
# https://docs.wandb.ai/guides/track/advanced/environment-variables
# here we are configuring the wandb project name
os.environ["WANDB_PROJECT"] = "maven-test"
os.environ["WANDB_NOTEBOOK_NAME"] = "[hf_wand]Retrieval"

In [10]:
wandb.init()

## Parsing documents

We will use a small sample of markdown-formatted documents in this notebook. Let's load them and make sure we can stuff them into the prompt. That means they may need to be chunked so that no chunk exceeds a given number of tokens.

In [11]:
MODEL_NAME = "text-davinci-003"
# MODEL_NAME = "gpt-4"

In [12]:
from langchain.document_loaders import DirectoryLoader

def find_md_files(directory):
    "Load every .txt file under `directory` (recursively) and return a list of LangChain Documents"
    dl = DirectoryLoader(directory, glob="**/*.txt")
    return dl.load()

documents = find_md_files("/workspace/maven-blps-2/data/")
len(documents)

1

In [13]:
documents

[Document(page_content='I would like to get your all  thoughts on the bond yield increase this week. I am not worried about the market downturn but the sudden increase in yields. On 2/16 the 10 year bonds yields increased by almost  9 percent and on 2/19 the yield increased by almost 5 percent.\\n\\nKey Points from the CNBC Article:\\n\\n**The “taper tantrum” in 2013 was a sudden spike in Treasury yields due to market panic after the Federal Reserve announced that it would begin tapering its quantitative easing program. **\\n\\n**Major central banks around the world have cut interest rates to historic lows and launched unprecedented quantities of asset purchases in a bid to shore up the economy throughout the pandemic. **\\n\\n**However, the recent rise in yields suggests that some investors are starting to anticipate a tightening of policy sooner than anticipated to accommodate a potential rise in inflation. **\\n\\nThe recent rise in bond yields and U.S. inflation expectations has so

In [14]:
# We will need to count tokens in the documents, and for that we need the tokenizer
tokenizer = tiktoken.encoding_for_model(MODEL_NAME)

In [15]:
# function to count the number of tokens in each document
def count_tokens(documents):
    token_counts = [len(tokenizer.encode(document.page_content)) for document in documents]
    return token_counts

count_tokens(documents)

[1482]
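To make the token budget concrete, here is a minimal sketch of the kind of check we have in mind. The helper `fits_in_context` is hypothetical (it is not part of LangChain), and the numbers are assumptions to verify against the model documentation: a context window of 4097 tokens for `text-davinci-003`, 256 tokens reserved for the completion, and a rough 50 tokens for the prompt instructions.

```python
def fits_in_context(token_counts, context_limit=4097, reserved_for_answer=256, prompt_overhead=50):
    """Return (fits, budget): whether the documents fit, and the tokens available for context.

    All three limits are illustrative assumptions, not values read from the API.
    """
    budget = context_limit - reserved_for_answer - prompt_overhead
    return sum(token_counts) <= budget, budget


# The single document above was counted at 1482 tokens
ok, budget = fits_in_context([1482])
print(ok, budget)
```

Under these assumptions the whole document would fit into one prompt. Chunking still matters, because the retriever works best when it can return a few small, focused sections instead of one long document.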

We will use LangChain's built-in `MarkdownTextSplitter` to split the documents into sections. Splitting `Markdown` without breaking its syntax is not that easy. This splitter strips out syntax.
- We can pass the `chunk_size` param to avoid lengthy chunks. By default the size is measured in characters, not tokens.
- The `chunk_overlap` param is useful so you don't cut sentences at arbitrary points. This is less necessary with `Markdown`.

The `MarkdownTextSplitter` also takes care of removing double line breaks, which saves us some tokens.
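The effect of the two parameters can be illustrated with a simplified sliding-window splitter. This is only a sketch: `chunk_text` is a hypothetical helper, and it is not how `MarkdownTextSplitter` works internally (the real splitter tries to break on Markdown structure such as headings and paragraphs before falling back to size limits).

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at the end of one chunk reappears at the start of the next.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(sample, chunk_size=12, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

A larger overlap gives more redundancy between neighbouring chunks at the cost of more chunks (and more tokens to embed).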

In [16]:
from langchain.text_splitter import MarkdownTextSplitter

md_text_splitter = MarkdownTextSplitter(chunk_size=1000)
document_sections = md_text_splitter.split_documents(documents)
len(document_sections), max(count_tokens(document_sections))

(10, 232)

Let's look at the first section.

In [17]:
Markdown(document_sections[0].page_content)

## Embeddings

Let's now use embeddings with a vector database retriever to find relevant documents for a query. 

In [19]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# We will use the OpenAIEmbeddings to embed the text, and Chroma to store the vectors
embed_model = OpenAIEmbeddings()
print(f"OpenAI Embedding Model: {embed_model.model}")
db = Chroma.from_documents(document_sections, embed_model)

OpenAI Embedding Model: text-embedding-ada-002


We can now create a retriever from the db. The `k` param sets how many of the most relevant sections the similarity search returns.
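As a rough intuition for what the similarity search does, the sketch below ranks toy 2-dimensional vectors by cosine similarity and keeps the top `k`. The functions `cosine_similarity` and `top_k` and the toy vectors are illustrative assumptions. The real embeddings have many more dimensions, and the distance metric and index actually used depend on how the vector store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, section_vecs, k=3):
    """Return the indices of the k sections most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(section_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy embeddings: one vector per document section
toy_sections = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
toy_query = [1.0, 0.0]
best = top_k(toy_query, toy_sections, k=3)
print(best)
```

Section 2 points in a direction unrelated to the query, so it is the one left out when `k=3`.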

In [20]:
retriever = db.as_retriever(search_kwargs=dict(k=3))

['What is the "taper tantrum" and how does it relate to the recent increase in bond yields?',
 'How have major central banks responded to the pandemic and what impact has this had on bond yields?',
 'What factors have contributed to the recent rise in U.S. inflation expectations?',
 'How do rising bond yields typically affect stock markets?',
 'What is the risk of another "taper tantrum" in 2021 and why?',
 'How have long-term bond yields in Japan and Europe been affected by the increase in U.S. Treasury yields?',
 'What is the difference in how Europe and the United States are approaching the prospect of interest rate hikes?',
 'Why do some analysts believe that the rise in bond yields is a "normal feature" of economic recovery?',
 'How might successful vaccine rollouts and continued growth impact reflationary moves across asset classes?',
 'What is the potential impact of overbought sectors like commodities and banks on the equity bull market?']

In [21]:
query = "How do rising bond yields typically affect stock markets?"
docs = retriever.get_relevant_documents(query)

In [23]:
# Let's see the results
for doc in docs:
    print(doc.metadata["source"])

/workspace/maven-blps-2/data/article.txt
/workspace/maven-blps-2/data/article.txt
/workspace/maven-blps-2/data/article.txt


## Stuff Prompt

We'll now take the content of the retrieved documents, stuff it into a prompt template along with the query, and pass the result to an LLM to obtain the answer.

In [26]:
from langchain.prompts import PromptTemplate

prompt_template = """
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
""".strip()

PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

context = "\n\n".join([doc.page_content for doc in docs])
prompt = PROMPT.format(context=context, question=query)

Use LangChain to call the OpenAI completion API with the prompt.

In [24]:
from langchain.llms import OpenAI

llm = OpenAI()
print(f"OpenAI Completion Model: {llm.model_name}")

OpenAI Completion Model: text-davinci-003


In [27]:
response = llm.predict(prompt).strip()
Markdown(response)

## Using LangChain

LangChain gives us tools to do this efficiently in a few lines of code. Let's do the same using the `RetrievalQA` chain.

In [28]:
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain

# chain_type="stuff" places all retrieved sections into a single prompt
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)

result = qa.run(query).strip()

Markdown(result)
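Conceptually, a "stuff" retrieval chain repeats the manual steps from the previous section: retrieve sections, join them, fill the template, and call the model. The sketch below shows that flow with stand-ins. `answer_with_stuffing`, `fake_retrieve`, and `fake_llm` are hypothetical names used for illustration only; this is not LangChain's implementation, and the real chain uses its own default prompt.

```python
def answer_with_stuffing(question, retrieve, llm, template):
    """Retrieve sections, stuff them into one prompt, and ask the model."""
    sections = retrieve(question)            # list of section texts
    context = "\n\n".join(sections)          # all sections go into one prompt
    prompt = template.format(context=context, question=question)
    return llm(prompt).strip()


TEMPLATE = "Use the context to answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"

seen_prompts = []  # records what the stand-in model receives


def fake_retrieve(question):
    # Stand-in for retriever.get_relevant_documents(question)
    return ["Section A about yields.", "Section B about stocks."]


def fake_llm(prompt):
    # Stand-in for a real completion call; returns a canned answer
    seen_prompts.append(prompt)
    return "  I don't know.  "


answer = answer_with_stuffing("How do yields affect stocks?", fake_retrieve, fake_llm, TEMPLATE)
print(seen_prompts[0])
print(answer)
```

Because every retrieved section ends up in a single prompt, this approach only works while `k` sections plus the question stay within the model's context window.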

In [29]:

wandb.finish()


