### [Intra knowledge QnA](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retrieval-augmented-generation/intra_knowledge_qna.ipynb)

#### Vertex AI embeddings

In [1]:
PROJECT_ID='first-vertexai-project'
REGION_ID="us-central1"

In [5]:
from vertexai.preview.language_models import TextEmbeddingModel

In [None]:
INDEX_PATH = "../Data/IRS-P554/Dataset/"
PERSIST_PATH = "../Data/IRS-P554/PersistentDB/"

TEXT_MODEL = "gemini-1.0-pro"
EMBEDDING_MODEL = "textembedding-gecko@003"

In [7]:
model = TextEmbeddingModel.from_pretrained(EMBEDDING_MODEL)

#### Test embeddings

In [16]:
from scipy import spatial

cat_embedding = model.get_embeddings(["cat"])[0].values
mouse_embedding = model.get_embeddings(["mouse"])[0].values
car_embedding = model.get_embeddings(["car"])[0].values

In [18]:
cat_mouse_dist = spatial.distance.cosine(cat_embedding, mouse_embedding)
cat_car_dist = spatial.distance.cosine(cat_embedding, car_embedding)

# spatial.distance.cosine returns the cosine distance (1 - cosine similarity),
# so smaller values mean the embeddings are closer.
print(f"cosine distance between cat and mouse is {cat_mouse_dist}")
print(f"cosine distance between cat and car is {cat_car_dist}")

cosine distance between cat and mouse is 0.22889688625572613
cosine distance between cat and car is 0.19975199464123117
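
Note that `scipy.spatial.distance.cosine` returns the cosine distance, which is one minus the cosine similarity, so lower values mean closer vectors. The sketch below shows that relationship in plain Python. The 2-D vectors are made up for illustration (they are not real embeddings), and the helper name `cosine_similarity` is an assumption, not part of the notebook.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors chosen for illustration only.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])

# Cosine distance, as scipy defines it, is 1 - cosine similarity.
print(f"same direction: similarity={same_direction}, distance={1 - same_direction}")
print(f"orthogonal:     similarity={orthogonal}, distance={1 - orthogonal}")
```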


In [19]:
# Utils
import os
import time
from typing import List

# Langchain
import langchain
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.document_loaders import TextLoader, UnstructuredPDFLoader
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.chroma import Chroma
from langchain_google_vertexai import VertexAI, VertexAIEmbeddings

print(f"LangChain version: {langchain.__version__}")

# Vertex AI
from google.cloud import aiplatform

print(f"Vertex AI SDK version: {aiplatform.__version__}")

# PDF
import pdfminer

print(f"PDF miner version: {pdfminer.__version__}") # has to be 20221105


LangChain version: 0.1.14
Vertex AI SDK version: 1.46.0
PDF miner version: 20221105


#### Data preparation
We will use a public Internal Revenue Service (IRS) document, Publication 554 (Tax Guide for Seniors), which explains the tax rules that apply to seniors in the USA. It is 37 pages long.

This document serves as the input PDF for generating and indexing embeddings, querying the model, and facilitating Question and Answer scenarios based on the data corpus.

In [20]:
# Create the folder for input files
!mkdir -p $INDEX_PATH

# Download the files
!wget -nc https://www.irs.gov/pub/irs-pdf/p554.pdf -P $INDEX_PATH

File ‘../Data/IRS-P554/Dataset/p554.pdf’ already there; not retrieving.



#### Parsing PDFs

In [159]:
PDF_FILE = "/Users/shivramamurthi/Data/IRS-P554/Dataset/p554.pdf"

##### [Using partition_pdf](https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf)

In [160]:
from unstructured.partition.pdf import partition_pdf

# Returns a List[Element] present in the pages of the parsed pdf document
elements = partition_pdf(PDF_FILE, url=None)

for element in elements:
    print(element)

Userid: CPM
Schema: tipx Leadpct: 100% Pt. size: 10
Draft
Ok to Print
AH XSL/XML Fileid: … tions/p554/2023/a/xml/cycle06/source
(Init. & Date) _______
Page 1 of 37
14:46 - 18-Jan-2024
The type and rule above prints on all proofs including departmental reproduction proofs. MUST be removed before printing.
Contents
Department of the Treasury Internal Revenue Service
What's New . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Reminders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Publication 554 Cat. No. 15102R
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Chapter 1. 2023 Filing Requirements . . . . . . . . . . 5 General Requirements . . . . . . . . . . . . . . . . . . . . 5
Tax Guide for Seniors
Chapter 2. Taxable and Nontaxable Income . . . . . 6 Compensation for Services . . . . . . . . . . . . . . . . . 6 Retirement Plan Distributions . . . . . . . . . . . . . . . . 6 Social Security and Equivalent Railroad
For use in prep

##### `UnstructuredPDFLoader` uses `partition_pdf` internally to identify elements

In [157]:
element_loader = UnstructuredPDFLoader(PDF_FILE, mode="elements")
element_docs = element_loader.load()
for element_doc in element_docs:
    print(element_doc)

page_content='Userid: CPM' metadata={'source': '/Users/shivramamurthi/Data/IRS-P554/Dataset/p554.pdf', 'coordinates': {'points': ((145.0, -125.25788999999997), (145.0, -115.25788999999997), (201.27, -115.25788999999997), (201.27, -125.25788999999997)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': '/Users/shivramamurthi/Data/IRS-P554/Dataset', 'filename': 'p554.pdf', 'languages': ['eng'], 'last_modified': '2024-01-26T22:10:32', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Header'}
page_content='Schema: tipx Leadpct: 100% Pt. size: 10' metadata={'source': '/Users/shivramamurthi/Data/IRS-P554/Dataset/p554.pdf', 'coordinates': {'points': ((247.0, -125.25788999999997), (247.0, -115.25788999999997), (434.28000000000003, -115.25788999999997), (434.28000000000003, -125.25788999999997)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': '/Users/shivramamurthi/Data/IRS-P554/Dataset', 'filename': 'p554.p
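
In `elements` mode each returned document carries a `category` entry in its metadata, as the `'category': 'Header'` field in the output above shows. One possible use, sketched here with plain dictionaries standing in for LangChain `Document` objects, is to drop page furniture before indexing. The `drop_categories` helper, the set of excluded categories, and the second sample record (including its `Title` category) are assumptions made for illustration.

```python
from typing import Dict, List, Set


def drop_categories(docs: List[Dict], excluded: Set[str]) -> List[Dict]:
    """Keep only the records whose metadata category is not excluded."""
    return [d for d in docs if d["metadata"].get("category") not in excluded]


# Stand-in records shaped like the loader output; the second one is hypothetical.
sample_docs = [
    {"page_content": "Userid: CPM", "metadata": {"category": "Header", "page_number": 1}},
    {"page_content": "Tax Guide for Seniors", "metadata": {"category": "Title", "page_number": 1}},
]

kept = drop_categories(sample_docs, excluded={"Header", "Footer"})
print([d["page_content"] for d in kept])
```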

#### Define utility functions
The helper function defined in this section performs three steps:

- Document loading: reads the content of every file in the provided directory.
- Document splitting: creates a `CharacterTextSplitter` and applies it to the loaded documents, breaking them into smaller, more manageable text chunks. It takes two parameters:
  - `chunk_size`: the target chunk length in characters (1024 here).
  - `chunk_overlap`: the maximum number of characters shared between neighboring chunks (128 here), which helps preserve context across chunk boundaries.
- Collection: gathers all the split text chunks into a single list and returns it for further processing.
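
To make the interaction between chunk size and overlap concrete, here is a deliberately simplified sketch that slides a fixed-width window over a string. It is not how `CharacterTextSplitter` works internally: that class first splits on a separator and then merges the pieces, so real chunk lengths vary. The function name `sliding_chunks` and the sample text are assumptions for illustration.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Each chunk starts with the last character of the previous one, which is the context-preserving effect the overlap is meant to provide.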

In [161]:
chunk_size=1024
chunk_overlap=128

In [165]:
from langchain_core.documents import Document


def get_split_documents(index_path: str) -> List[Document]:
    """
    Load every file in a folder and split it into chunks.

    Args:
    index_path: Path of the dataset folder containing the documents.

    Returns:
    List of chunked (split) documents.
    """

    split_docs = []
    # The splitter holds no per-file state, so create it once and reuse it.
    text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

    for file_name in os.listdir(index_path):
        print(f"file_name : {file_name}")
        # os.path.join works whether or not index_path ends with a slash.
        file_path = os.path.join(index_path, file_name)
        if file_name.endswith(".pdf"):
            loader = UnstructuredPDFLoader(file_path)
        else:
            loader = TextLoader(file_path)

        split_docs.extend(text_splitter.split_documents(loader.load()))

    return split_docs

#### Create Vector Database
Instantiate a `VertexAIEmbeddings` object that generates text embeddings with the `textembedding-gecko@003` model, using a batch size of `EMBEDDING_NUM_BATCH`.

In [166]:
# Custom VertexAI Embeddings object
EMBEDDING_NUM_BATCH = 5

embeddings = VertexAIEmbeddings(
    model_name=EMBEDDING_MODEL, batch_size=EMBEDDING_NUM_BATCH
)

- Load documents: the `get_split_documents` function loads the documents found in `INDEX_PATH` and splits them into chunks.
- Generate embeddings: `Chroma.from_documents` uses the `embeddings` object to compute a vector embedding for each chunk.
- Create vector database: a Chroma vector database (`chromadb`) is initialized. This specialized database is designed for storing and efficiently searching vector embeddings.
- Persist database: `chromadb.persist()` saves the newly created vector database to disk at the location defined by the `PERSIST_PATH` variable.
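
Conceptually, a vector database stores one vector per chunk and answers a query by returning the chunks whose vectors are closest to the query vector. The sketch below is a brute-force, in-memory stand-in for that idea. It is not how Chroma or FAISS are implemented (they use dedicated indexes and storage), and the class name `TinyVectorStore`, the sample texts, and the hand-written 2-D vectors are all assumptions for illustration.

```python
import math
from typing import List, Tuple


def _cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class TinyVectorStore:
    """Keeps (text, vector) pairs in a list and searches them by linear scan."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, List[float]]] = []

    def add(self, text: str, vector: List[float]) -> None:
        self._rows.append((text, vector))

    def __len__(self) -> int:
        return len(self._rows)

    def search(self, query_vector: List[float], k: int) -> List[Tuple[str, float]]:
        """Return the k stored texts most similar to query_vector, best first."""
        scored = [(text, _cosine(query_vector, vec)) for text, vec in self._rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add("MilTax paragraph", [0.9, 0.1])
store.add("Standard deduction paragraph", [0.1, 0.9])
store.add("EIC paragraph", [0.5, 0.5])

hits = store.search([1.0, 0.0], k=2)
print(len(store), [text for text, _ in hits])
```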

In [217]:
# Load documents, generate vectors and store in Vector database
split_docs = get_split_documents(INDEX_PATH)

chromadb = Chroma.from_documents(
    documents=split_docs, embedding=embeddings, persist_directory=PERSIST_PATH
)
chromadb.persist()  # Ensure DB persist

print(len(chromadb))

file_name : p554.pdf


Created a chunk of size 1047, which is longer than the specified 1024
Created a chunk of size 1171, which is longer than the specified 1024


446


In [218]:
from langchain_community.vectorstores import FAISS

faissdb = FAISS.from_documents(
    split_docs, embeddings)

print(faissdb.index.ntotal)

201


#### Create the Retriever
Load the Gemini generative model with the generation parameters defined in the next cell.

In [219]:
max_output_tokens=8192  # Token limit determines the maximum amount of text output.
temperature=0.9         # Temperature controls the degree of randomness in token selection.
top_p=0.95              # Tokens are selected from most probable to least until the sum of their probabilities equals the top_p value.
top_k=40                # A top_k of 1 means the selected token is the most probable among all tokens.

In [220]:
# Initializing the Vertex AI language model with the required parameters
llm = VertexAI(
    model=TEXT_MODEL,
    max_output_tokens=max_output_tokens,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    verbose=True,
)
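
The effect of `top_k` and `top_p` can be illustrated on a toy next-token distribution. The sketch below assumes that the `top_k` cut is applied first, the `top_p` cut second, and that the surviving probabilities are renormalized before sampling. It leaves out temperature and is only a simplified illustration, not the service's actual implementation. The function name and the token probabilities are made up.

```python
from typing import Dict


def filter_top_k_top_p(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k most probable tokens, then the shortest prefix whose
    cumulative probability reaches top_p, and renormalize what is left."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical probabilities for the next token.
toy_probs = {"tax": 0.5, "credit": 0.3, "form": 0.15, "cat": 0.05}
candidates = filter_top_k_top_p(toy_probs, top_k=3, top_p=0.7)
print(candidates)
```

With these toy numbers only `tax` and `credit` remain as candidates, because together they already exceed the `top_p` threshold of 0.7.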

#### Vector store retriever for similarity search.

#### Test similarity search

In [223]:
k_relevant=10
query = "Tell me about MilTax"

In [224]:
chromadb.similarity_search_with_score(query=query, k=k_relevant)

[(Document(page_content="1. You expect to owe at least $1,000 in tax for 2024, after subtracting your withholding and tax credits.\n\n2. You expect your withholding and tax credits to be less than the smaller of:\n\n90% of the tax to be shown on your 2024 tax return, or • 100% of the tax shown on your 2023 tax return. The 2023 tax return must cover all 12 months.\n\nMilTax. Members of the U.S. Armed Forces and quali- fied veterans may use MilTax, a free tax service of- fered by the Department of Defense through Military OneSource. For more information, go to MilitaryOneSource (MilitaryOneSource.mil/MilTax). Also, the IRS offers Free Fillable Forms, which can be completed online and then e-filed regardless of in- come.\n\nIf all of your income is subject to income tax withholding and enough tax is withheld, you probably don't need to make estimated tax payments.\n\nFor more information on estimated tax, see Pub. 505.\n\nUsing online tools to help prepare your return. Go to IRS.gov/Tools

In [225]:
faissdb.similarity_search_with_relevance_scores(query=query, k=k_relevant)

[(Document(page_content="1. You expect to owe at least $1,000 in tax for 2024, after subtracting your withholding and tax credits.\n\n2. You expect your withholding and tax credits to be less than the smaller of:\n\n90% of the tax to be shown on your 2024 tax return, or • 100% of the tax shown on your 2023 tax return. The 2023 tax return must cover all 12 months.\n\nMilTax. Members of the U.S. Armed Forces and quali- fied veterans may use MilTax, a free tax service of- fered by the Department of Defense through Military OneSource. For more information, go to MilitaryOneSource (MilitaryOneSource.mil/MilTax). Also, the IRS offers Free Fillable Forms, which can be completed online and then e-filed regardless of in- come.\n\nIf all of your income is subject to income tax withholding and enough tax is withheld, you probably don't need to make estimated tax payments.\n\nFor more information on estimated tax, see Pub. 505.\n\nUsing online tools to help prepare your return. Go to IRS.gov/Tools

#### Prompt template
Define a prompt template: a string with placeholders that are filled in at run time to produce the final prompt for the language model. Here `{context}` receives the retrieved documents and `{input}` receives the user's question.

In [226]:
prompt_template = """
You are a helpful AI assistant. You're tasked to answer the question given below, but only based on the context provided.
context:
<context>
{context}
</context>

question:
<question>
{input}
</question>

If you cannot find an answer ask the user to rephrase the question.
answer:

"""
prompt = PromptTemplate.from_template(prompt_template)
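
As a rough illustration of what filling the template means, the sketch below uses Python's built-in `str.format` as a stand-in for `PromptTemplate`. The shortened template and the sample context string are assumptions for illustration. In the notebook the context comes from the retriever.

```python
# A shortened template with the same two placeholders as the notebook's prompt.
template = (
    "context:\n<context>\n{context}\n</context>\n\n"
    "question:\n<question>\n{input}\n</question>\n"
)

# Sample values; in the notebook these are supplied by the retrieval chain.
rendered = template.format(
    context="MilTax is a free tax service offered by the Department of Defense.",
    input="Tell me about MilTax",
)
print(rendered)
```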

#### Create the retrieval chain
Combine the retriever and the LLM into a retrieval chain, then invoke it by passing the question as the input. Documentation:
- [create_stuff_documents_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html)
- [create_retrieval_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval.create_retrieval_chain.html)


In [234]:
# Create a chain for passing a list of Documents to a model.
# The input is a dictionary that must have a "context" key that maps to a List[Document], plus any other input variables expected in the prompt.
combine_docs_chain = create_stuff_documents_chain(llm, prompt)

# retriever = chromadb.as_retriever(search_type="similarity", search_kwargs={"k": k_relevant})
# retriever = faissdb.as_retriever(search_type="similarity", search_kwargs={"k": k_relevant})
retriever = faissdb.as_retriever()

# Create retrieval chain that retrieves documents and then passes them on.
retrieval_chain = create_retrieval_chain(retriever, combine_docs_chain)
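
The data flow through the chain (retrieve documents, stuff them into the prompt, call the model, return a dictionary with `input`, `context`, and `answer`) can be sketched without any external service. Everything below is a stand-in written for illustration: `make_retrieval_chain`, the keyword-based `fake_retrieve`, and the `fake_llm` stub are hypothetical and do not reflect LangChain's internals.

```python
from typing import Callable, Dict, List

CORPUS = [
    "MilTax is a free tax service for members of the U.S. Armed Forces.",
    "The standard deduction depends on filing status.",
]


def fake_retrieve(question: str) -> List[str]:
    """Return corpus entries containing any question word longer than 3 letters."""
    words = [w.lower() for w in question.split() if len(w) > 3]
    return [doc for doc in CORPUS if any(w in doc.lower() for w in words)]


def fake_llm(prompt_text: str) -> str:
    """Stub model: reports the prompt size instead of generating text."""
    return f"(stub answer based on {len(prompt_text)} prompt characters)"


def make_retrieval_chain(
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
    template: str,
) -> Callable[[Dict[str, str]], Dict[str, object]]:
    def invoke(inputs: Dict[str, str]) -> Dict[str, object]:
        docs = retrieve(inputs["input"])
        # "Stuffing": all retrieved documents are joined into one context string.
        prompt_text = template.format(context="\n\n".join(docs), input=inputs["input"])
        return {"input": inputs["input"], "context": docs, "answer": generate(prompt_text)}

    return invoke


chain = make_retrieval_chain(fake_retrieve, fake_llm, "Context:\n{context}\n\nQuestion: {input}\n")
result = chain({"input": "Tell me about MilTax"})
print(result["context"])
print(result["answer"])
```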

In [237]:
langchain.globals.set_debug(False)

#### Test the retrieval chain

##### Example questions:
- Special rules for joint returns.
- Tell about persons not eligible for the standard deduction.
- Tell me about Figuring the EIC.
- Tell about contributions to Kay Bailey Hutchison Spousal IRAs.
- Tell about standard deduction for dependents.

In [236]:
response = retrieval_chain.invoke({"input": query})
print(response["answer"])

[chain/start] [1:chain:retrieval_chain] Entering Chain run with input:
{
  "input": "Tell me about MilTax"
}
[chain/start] [1:chain:retrieval_chain > 2:chain:RunnableAssign<context>] Entering Chain run with input:
{
  "input": "Tell me about MilTax"
}
[chain/start] [1:chain:retrieval_chain > 2:chain:RunnableAssign<context> > 3:chain:RunnableParallel<context>] Entering Chain run with input:
{
  "input": "Tell me about MilTax"
}
[chain/start] [1:chain:retrieval_chain > 2:chain:RunnableAssign<context> > 3:chain:RunnableParallel<context> > 4:chain:retrieve_documents] Entering Chain run with input:
{
  "input": "Tell me about MilTax"
}
[chain/start] [1:chain:retrieval_chain > 2:chain:RunnableAssign<context> > 3:chain:RunnableParallel<context> > 4:chain:retrieve_documents > 5:chain:RunnableLambda] Entering Chain run with input:
{
  "input": "Tell me about MilTax"
}

#### Interactive UI Widget for Question-Answering
Enter your question in the input box and choose one of the options:
- Ask Me!: runs the retrieval chain on the question as typed and prints the answer together with its source documents.
- More details: appends a request for more detailed information to the question before running the same retrieval chain, which produces a more elaborate response.

In [231]:
# HTML Widgets
import ipywidgets as widgets
from IPython.display import clear_output

In [232]:
button = widgets.Button(description="Ask Me!")
button_stp = widgets.Button(description="More details")
output = widgets.Output()
text = widgets.Text(
    description="Question:", layout=widgets.Layout(width="80%", height="50px")
)
display(text, button, button_stp, output)

@output.capture()
def on_button_clicked(b):
    clear_output()
    question = text.value

    result = retrieval_chain.invoke({"input": question})
    source_documents = list({doc.metadata["source"] for doc in result["context"]})

    answer = result["answer"]
    print(f"\nAnswer: '{answer}'")
    print("\nSource-", "\n".join(source_documents))
    print("\n")


@output.capture()
def on_stp_clicked(b):
    clear_output()
    question = text.value
    # Leading space keeps the extra instruction from running into the question text.
    query = question + " Give detailed information as much as possible."
    result = retrieval_chain.invoke({"input": query})
    source_documents = list({doc.metadata["source"] for doc in result["context"]})

    answer = result["answer"]
    print(f"\nAnswer: '{answer}'")
    print("\nSource-", "\n".join(source_documents))
    print("\n")


button.on_click(on_button_clicked)
button_stp.on_click(on_stp_clicked)

Text(value='', description='Question:', layout=Layout(height='50px', width='80%'))

Button(description='Ask Me!', style=ButtonStyle())

Button(description='More details', style=ButtonStyle())

Output()