### Document QA using Large Language Models (LLMs)

Using LLM document extraction methods for better querying of food review data

This notebook, which runs on dummy data, accompanies the article I have written here:

To try my food recommender bot on Telegram, use this link: https://t.me/jasonthefoodie_bot

#### 1. Check Out the Dataset

In [1]:
import pandas as pd

In [None]:
# import data to take a quick look
df = pd.read_csv("dummy_data.csv")
df

#### 2. Setting up the Vector Store

In [None]:
from langchain.vectorstores import DeepLake
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders.csv_loader import CSVLoader
import os

In [2]:
# define env variables for AzureOpenAI model
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "OPENAI_API_BASE"
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"
os.environ["OPENAI_API_VERSION"] = "2023-03-15-preview"

In [3]:
# instantiate OpenAIEmbeddings
# note that chunk_size is set to 1 due to Azure OpenAI limitations: https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#embeddings
embeddings = OpenAIEmbeddings(deployment="embedding", chunk_size=1)

In [7]:
# instantiate CSV loader and load food reviews with review link as source
loader = CSVLoader(
    file_path="dummy_data.csv",
    csv_args={"delimiter": ","},
    encoding="utf-8",
    source_column="review_link",
)
data = loader.load()

In [None]:
# see what the document content is like
print(data[0])
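
To make the shape of a loaded document easier to picture, here is a minimal standard-library sketch of roughly what the CSV loader does with each row: the columns are flattened into `column: value` lines, and the chosen source column is stored in the metadata. This is an illustration, not LangChain's implementation. The `venue_name` and `rating` columns and the sample rows are made up; only `review_link` comes from the notebook.

```python
import csv
import io

# Hypothetical sample data; only the review_link column name comes from the notebook.
sample_csv = (
    "venue_name,rating,review_link\n"
    "Doodak,4,https://abc.xyz/review1\n"
    "Some Hawker Stall,5,https://abc.xyz/review2\n"
)


def rows_to_documents(csv_text, source_column):
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        # One "column: value" line per column, joined by newlines.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # The source column becomes the document's source; the row index is kept too.
        metadata = {"source": row[source_column], "row": i}
        documents.append({"page_content": page_content, "metadata": metadata})
    return documents


documents = rows_to_documents(sample_csv, source_column="review_link")
print(documents[0]["page_content"])
print(documents[0]["metadata"])
```

Using the review link as the source means each retrieved document carries the link that the prompt later asks the model to cite.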

In [None]:
# create deeplake db
db = DeepLake(
    dataset_path="./my_deeplake/", embedding_function=embeddings, overwrite=True
)
db.add_documents(data)

In [None]:
# load from existing DB, if database exists
db = DeepLake(
    dataset_path="./my_deeplake/", embedding_function=embeddings, read_only=True
)

In [5]:
# example query
query = "What places selling seafood bee hoon have you been to?"
docs = db.similarity_search(query)
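
Conceptually, a similarity search embeds the query and returns the stored documents whose embeddings are closest to it. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. Both are assumptions for illustration: real embeddings have far more dimensions, and the vector store may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=4):
    """Return the texts of the k stored items most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy (text, embedding) pairs; the vectors are invented for this example.
toy_store = [
    ("seafood bee hoon review", [0.9, 0.1, 0.0]),
    ("steak review", [0.1, 0.9, 0.0]),
    ("dessert review", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]

print(top_k(toy_query, toy_store, k=2))
```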

#### 3. Generating Prompts with Context

In [None]:
from langchain.chat_models import AzureChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain

In [None]:
# define prompt template
prompt_template = """You are a food recommender bot that has visited and given reviews for places given in the context. Help users find food recommendations.
Use only the context given to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Filter out any results from the context that you are not so confident of.
Answer the user directly first and then list down your suggestions according to the format below, if the user is asking for suggestions. End your answer right after giving the suggestions. Placeholders are indicated using [] and comments are indicated using (). Recommend more than 1 option to the user, if possible.
Keep your answer to at most 3500 characters.

[Short direct answer to the user's question]

Here are my recommendations:
🏠 [Name of place]
<i>[venue tags]</i>
✨ Avg Rating: [Rating of venue]
💸 Price: [Estimated price of dining at venue] (this is optional. If not found or not clear, use a dash instead.)
📍 <a href=[Location of venue] ></a>
📝 Reviews:
[list of review_link, separated by line breaks] (Use this format: 1. <a href=[review_link] >[food_desc_title text]</a>)

For example,

🏠 Doodak
<i>Steak, Date Night, Korean, Seafood</i>
✨ Avg Rating: 4
💸 Price: ~$100/pax
📍 <a href="https://www.google.com/maps/search/?api=1&query=1.3521,103.8198"></a>
📝 Reviews:
1. <a href="https://abc.xyz/review1">Good food and presentation</a>

Here is the context:
{context}

Question: {question}
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [None]:
# Instantiate LLM
llm = AzureChatOpenAI(
    deployment_name='DEPLOYMENT_NAME',
    temperature=0
)

In [None]:
# instantiate QA chain
chain = load_qa_chain(llm, chain_type="stuff", prompt=PROMPT)
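
The "stuff" chain type is the simplest way to combine documents: as I understand it, the text of every retrieved document is joined into one string and placed into the prompt's `{context}` slot, so the model sees everything in a single call. The sketch below mimics that step in plain Python. The helper name `stuff_prompt`, the shortened template, and the toy documents are hypothetical, and the blank-line separator is an assumption.

```python
def stuff_prompt(template, documents, question, separator="\n\n"):
    """Join all document texts into one context string and fill the template."""
    context = separator.join(doc["page_content"] for doc in documents)
    return template.format(context=context, question=question)


# Shortened stand-in for the full prompt template defined in the notebook.
toy_template = "Here is the context:\n{context}\n\nQuestion: {question}\n"
toy_docs = [{"page_content": "venue: A"}, {"page_content": "venue: B"}]

filled = stuff_prompt(toy_template, toy_docs, "Any hawker food to recommend?")
print(filled)
```

Because everything goes into one prompt, this approach only works while the retrieved documents fit inside the model's context window.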

In [None]:
# pass example query to vector store and QA chain
query = "Any hawker food to recommend?"
docs = db.similarity_search(query)
output = chain({"input_documents": docs, "question": query}, return_only_outputs=True)

In [None]:
# print results
# note that there is some hallucination here: the review links in the output are not mine.
print(output["output_text"])
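
One way to catch hallucinated links like these is to compare every link in the answer against the sources of the retrieved documents. The following is a minimal sketch of such a check. The helper name `find_unsupported_links` and the sample answer are hypothetical, and the regular expression is a simple heuristic rather than a full HTML parser.

```python
import re


def find_unsupported_links(answer_text, allowed_sources):
    """Return links in the answer that are not among the allowed sources."""
    # Capture the URL after href=, with or without surrounding double quotes.
    links = re.findall(r'href="?([^"\s>]+)', answer_text)
    return [link for link in links if link not in allowed_sources]


# Invented answer text: one link matches a known source, one does not.
sample_answer = (
    '1. <a href="https://abc.xyz/review1">Good food and presentation</a>\n'
    '2. <a href="https://abc.xyz/made-up">Nice ambience</a>'
)
known_sources = {"https://abc.xyz/review1"}

print(find_unsupported_links(sample_answer, known_sources))
```

In this notebook, the allowed set could presumably be built from the retrieved documents, for example `{doc.metadata["source"] for doc in docs}`. Location links would also be flagged unless they are added to the allowed set.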