# Vector Store and Question Generation

This notebook sets up vector stores over PDF and JSON card documents for retrieval-augmented generation (RAG), runs a similarity-search sanity check, and then generates follow-up questions with the same RAG chain.

## Initialization

In [1]:
# %pip install -qU "langchain-chroma>=0.1.2"
# %pip install -qU langchain-openai
# %pip install -qU langchain
# %pip install -qU pypdf tqdm

In [2]:
from langchain_openai import OpenAIEmbeddings
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## Vector Store

In [3]:
import json
from uuid import uuid4
from langchain_core.documents import Document
from langchain_chroma import Chroma

json_store = Chroma(
    collection_name="json_store",
    embedding_function=embeddings,
)

documents = []
for file in os.listdir("../data/card/json"):
    with open(os.path.join("../data/card/json", file), "r", encoding="utf-8") as f:
        card = json.load(f)
        documents.append(Document(
            metadata={
                "card_name" : card["card_name"],
                "card_type" : card["card_type"],
                "issuer" : card["issuer"],
                "card_association" : card["card_association"],
            },
            page_content=json.dumps(card)
        ))

uuids = [str(uuid4()) for _ in range(len(documents))]
json_store.add_documents(documents=documents, ids=uuids)
None
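The loading loop above assumes every entry in the directory is a well-formed card file. As a minimal sketch of a more defensive loader, the helper below skips non-JSON files and fails early when a metadata field is missing. The name `load_card_records`, the `REQUIRED_KEYS` tuple, and the sample card are hypothetical and exist only for illustration; the returned pairs would still need to be wrapped in `Document` objects.

```python
import json
import os
import tempfile

# Metadata fields the notebook copies out of each card (taken from the cell above).
REQUIRED_KEYS = ("card_name", "card_type", "issuer", "card_association")


def load_card_records(directory):
    """Return (metadata, page_content) pairs for every valid .json card file."""
    records = []
    # Sort for a reproducible order; os.listdir makes no ordering guarantee.
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".json"):
            continue  # skip stray files that are not card definitions
        with open(os.path.join(directory, name), "r", encoding="utf-8") as f:
            card = json.load(f)
        missing = [key for key in REQUIRED_KEYS if key not in card]
        if missing:
            raise KeyError(f"{name} is missing fields: {missing}")
        metadata = {key: card[key] for key in REQUIRED_KEYS}
        records.append((metadata, json.dumps(card)))
    return records


# Demonstration on a throwaway directory holding one made-up card and one stray file.
with tempfile.TemporaryDirectory() as tmp:
    sample = {
        "card_name": "Example Card",
        "card_type": "Cashback",
        "issuer": "Example Bank",
        "card_association": "Visa",
        "annual_fee": "S$0",
    }
    with open(os.path.join(tmp, "example.json"), "w", encoding="utf-8") as f:
        json.dump(sample, f)
    with open(os.path.join(tmp, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("not a card")
    demo_records = load_card_records(tmp)

print(demo_records[0][0])
```

Each pair maps directly onto the `metadata` and `page_content` arguments used when building documents for `json_store`.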

In [4]:
from pypdf import PdfReader
from uuid import uuid4
from langchain_core.documents import Document
from langchain_chroma import Chroma
import tqdm

pdf_store = Chroma(
    collection_name="pdf_store",
    embedding_function=embeddings,
)

documents = []
for file in tqdm.tqdm(os.listdir("../data/card/pdf")):
    reader = PdfReader(os.path.join("../data/card/pdf", file))
    for i, page in enumerate(reader.pages):
        documents.append(Document(
            metadata={
                "file_name" : file,
                "page_number" : i,
            },
            page_content=page.extract_text()
        ))

uuids = [str(uuid4()) for _ in range(len(documents))]
pdf_store.add_documents(documents=documents, ids=uuids)
None


 69%|██████▉   | 50/72 [00:18<00:22,  1.01s/it]Overwriting cache for 0 372
100%|██████████| 72/72 [00:25<00:00,  2.84it/s]
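Pages without extractable text (for example, scanned images) would be embedded as empty documents by the cell above. A minimal sketch of a filter that drops such pages before they reach the store is shown here. The helper name `pages_to_records` and the `min_chars` threshold are assumptions made for illustration, and `page_texts` stands in for the per-page strings returned by `extract_text()`.

```python
def pages_to_records(file_name, page_texts, min_chars=1):
    """Keep only pages whose stripped text has at least `min_chars` characters."""
    records = []
    for page_number, text in enumerate(page_texts):
        # Treat a missing value the same as an empty page.
        cleaned = (text or "").strip()
        if len(cleaned) < min_chars:
            continue
        records.append({
            "file_name": file_name,
            "page_number": page_number,  # original page index is preserved
            "page_content": cleaned,
        })
    return records


demo_pages = pages_to_records(
    "example.pdf",
    ["Annual fee: S$0", "   ", None, "Terms apply"],
)
print([record["page_number"] for record in demo_pages])  # → [0, 3]
```

Keeping the original page index in the metadata means a retrieved chunk can still be traced back to its page even after empty pages are removed.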


In [18]:
# Similarity search test

results = json_store.similarity_search(
    "What cards does lazada offer",
    k=5,
)

for res in results:
    print(f"*  [{res.metadata}]")
    
# Expose the same store as a retriever for the RAG chain in the Q&A section
retriever = json_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})

*  [{'card_association': 'Mastercard', 'card_name': 'Lazada-UOB Card', 'card_type': 'Cash Rebate', 'issuer': 'UOB'}]
*  [{'card_association': 'Mastercard', 'card_name': "DBS Woman's Card", 'card_type': 'Online Shopping', 'issuer': 'DBS'}]
*  [{'card_association': 'Mastercard', 'card_name': 'OCBC NXT Credit Card', 'card_type': 'Buy Now Pay Later', 'issuer': 'OCBC'}]
*  [{'card_association': 'Mastercard', 'card_name': 'CIMB World Mastercard', 'card_type': 'Cashback', 'issuer': 'CIMB'}]
*  [{'card_association': 'Mastercard', 'card_name': "UOB Lady's Card", 'card_type': 'Miles', 'issuer': 'UOB'}]
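Only the first hit mentions Lazada; the remaining four are simply the next-nearest vectors, since `k=5` always returns five results. The toy sketch below illustrates that ranking behaviour with hand-written 3-dimensional vectors. Cosine similarity is used here as an illustrative metric only; the distance function Chroma actually applies depends on how the collection is configured. All names and vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=2):
    """Return the names of the k vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in indexed_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Made-up embeddings standing in for real model output.
toy_index = {
    "shopping card": [0.9, 0.1, 0.0],
    "miles card": [0.1, 0.9, 0.1],
    "cashback card": [0.7, 0.2, 0.3],
}
toy_results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(toy_results)
```

The second result is returned even though it is a weaker match, which mirrors the non-Lazada cards in the output above.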


## Q&A

In [15]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

from langchain import hub

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)



In [17]:
for chunk in rag_chain.stream("What cards does lazada offer?"):
    print(chunk, end="", flush=True)

Lazada offers the Lazada-UOB Card, which provides cash rebates on Lazada, Redmart, dining, entertainment, transport, and other spends. The card has an annual fee of S$196.20, and it requires a minimum income of S$30,000 for Singaporeans/PRs. The card is issued by UOB and falls under the card association of Mastercard.
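To make the data flow of the chain concrete, the sketch below reproduces the first two stages without any API calls: retrieved documents are joined by `format_docs`, and the result is substituted into a prompt template together with the question. `FakeDocument` and `TEMPLATE` are hypothetical stand-ins; the template is not the text of `rlm/rag-prompt`, only something of a similar shape.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document; only page_content is needed here."""
    page_content: str


def format_docs(docs):
    # Same helper as in the notebook: blank line between documents.
    return "\n\n".join(doc.page_content for doc in docs)


# Hypothetical template, used only to show where each piece ends up.
TEMPLATE = (
    "Answer using only this context.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def build_prompt(question, docs):
    """Mimic the dict step of the chain: fill 'context' and 'question'."""
    return TEMPLATE.format(question=question, context=format_docs(docs))


demo_prompt = build_prompt(
    "What cards does lazada offer?",
    [
        FakeDocument('{"card_name": "Card A"}'),
        FakeDocument('{"card_name": "Card B"}'),
    ],
)
print(demo_prompt)
```

Because each card is stored as its full JSON dump, the model sees every field of the top five cards, which is why the answer above can quote the annual fee and minimum income.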

## Question Generation

In [None]:
import random

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
rag_prompt = hub.pull("rlm/rag-prompt")

# File names of the available cards, used for cold-start questions
cards = os.listdir("../data/card/json")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

# Previously asked questions; leave empty to trigger the cold-start branch
questions = [
    "What cards does KrisFlyer offer?",
    "What are the benefits regarding airport lounges?",
    "What kind of airport lounges are available?",
]

if not questions:
    # Cold-start question about a randomly chosen card
    instructions = (
        f"Come up with a short, one-line question on {random.choice(cards)} that can be answered by the following context.",
    )
else:
    # Chat-history question
    instructions = (
        "Come up with a short, one-line question.",
        f"Additionally, make sure the question is relevant to all of these previously asked questions (but do not repeat an existing question): {questions}.",
    )

for chunk in rag_chain.stream(' '.join(instructions)):
    print(chunk, end="", flush=True)



What is the eligibility criteria for accessing Priority Pass lounges with the Standard Chartered Journey Credit Card?
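The branching between a cold-start question and a history-aware question can be factored into a small function, which makes both paths easy to exercise without calling the model. This is a sketch under assumptions: the function name `build_question_prompt` and the `rng` parameter are hypothetical, and the card names in the demo are invented.

```python
import random


def build_question_prompt(cards, questions, rng=random):
    """Return the instruction string sent through the RAG chain."""
    if not questions:
        # Cold start: anchor the question on one randomly chosen card.
        card = rng.choice(cards)
        return (
            f"Come up with a short, one-line question on {card} "
            "that can be answered by the following context."
        )
    # Chat history present: steer toward a related but new question.
    return (
        "Come up with a short, one-line question. "
        "Additionally, make sure the question is relevant to all of these "
        "previously asked questions (but do not repeat an existing question): "
        f"{questions}."
    )


demo_cards = ["card-a.json", "card-b.json"]
cold_start = build_question_prompt(demo_cards, [], rng=random.Random(0))
follow_up = build_question_prompt(demo_cards, ["What cards does KrisFlyer offer?"])
print(cold_start)
print(follow_up)
```

Passing a seeded `random.Random` instance keeps the cold-start choice reproducible between runs, which helps when comparing generated questions.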