In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Question Answering with Large Documents - Foundations


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/language/examples/reference-architectures/question_answering_with_large_documents.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/examples/reference-architectures/question_answering_with_large_documents.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/blob/main/language/examples/reference-architectures/question_answering_with_large_documents.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

You are a data scientist at a company and have been tasked with building a question-answering system using the Vertex PaLM API. The system should be able to take the company's documents and extract or query important information. You are building an out-of-the-box question-answering bot on enterprise data that can come in any format, such as PDF, DOC, TXT, DOCX, PPTX, or HTML.

The input token limit is a key limitation of large language models, including the [Vertex PaLM API](). As you have seen in the basic [question-answering notebook](), to get the best results from the PaLM 2 model, you must provide relevant context when asking `closed-domain` questions. Unfortunately, enterprise data does not fit within such limits: the underlying documents could run to thousands or millions of pages, which means you cannot pass them all as context.

In this `reference-architecture`, you will see two methods that address the large-context challenge:

* Chunk-based Q&A - splitting documents into smaller chunks.
* Chunk-Based Embedding Q&A - creating embeddings of smaller chunks and using vector similarity search to find relevant context.

The notebook introduces you to the foundational theory of handling huge documents for building a question-answering bot using Vertex PaLM API and finding relevant context for a user query, keeping the context limitation in check. 

In addition, some steps have open-source or Google Cloud drop-in replacements, which are discussed later in the notebook.

### Objective

By the end of the notebook, you will have learned how to build a question-answering system with the PaLM API that can handle large documents.
You will also learn the conceptual implementation of two methods that help you work with large contexts drawn from many documents.

## Getting Started

### Install Vertex AI SDK & Other dependencies 

In [3]:
#Base system dependencies
!sudo apt -y -qq install tesseract-ocr
!sudo apt -y -qq install libtesseract-dev
!sudo apt-get -y -qq install poppler-utils #required by PyPDF2 for page count and other pdf utilities
!sudo apt-get -y -qq install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
#Python dependencies
!pip install google-cloud-aiplatform --upgrade --quiet --user
!pip install pytesseract --quiet --user
!pip install PyPDF2 --quiet --user
!pip install textract --quiet --user

tesseract-ocr is already the newest version (4.0.0-2).
0 upgraded, 0 newly installed, 0 to remove and 9 not upgraded.
libtesseract-dev is already the newest version (4.0.0-2).
0 upgraded, 0 newly installed, 0 to remove and 9 not upgraded.
E: Package 'pstotext' has no installation candidate


**Colab only**: Uncomment the following cell to restart the kernel. For Vertex AI Workbench, you can restart the kernel using the button at the top.

In [7]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Authenticating your notebook environment
* If you are using **Colab** to run this notebook, uncomment the cell below and continue.
* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [None]:
# from google.colab import auth
# auth.authenticate_user()

### Import libraries

In [48]:
import glob
import re
import warnings

import numpy as np
import pandas as pd
import textract
from PyPDF2 import PdfReader
from tenacity import retry, stop_after_attempt, wait_random_exponential
from vertexai.preview.language_models import TextEmbeddingModel, TextGenerationModel

warnings.filterwarnings('ignore')

### Import models

In [5]:
generation_model = TextGenerationModel.from_pretrained("text-bison@001")
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

Because you will make many calls to the PaLM API in this notebook, you should make those calls more resilient by allowing specific API calls to be retried before they fail. You should not "hammer" or "congest" the underlying foundation models, so it is wise to wait a bit before each retry.

You can add a simple delay before every request; however, a fixed delay only helps if it gives the service enough time to catch up. Alternatively, a simple approach is to increase the length of the delay for each subsequent attempt. This is commonly referred to as [exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff).

The constant used for the delay may need to be adjusted depending on the service you're calling.

You can find more retry strategies that work best for Google Cloud APIs [here](https://cloud.google.com/storage/docs/retry-strategy) and the API guide for the current method [here](https://tenacity.readthedocs.io/en/latest/api.html).

The example below defines functions that implement "retry with exponential backoff."

In [6]:
#To avoid hitting quota limit since you might hit the APIs multiple times in this notebook. 
# Each retry occurs at a random time in a geometrically expanding interval. 
# It allows for a custom multiplier and an ability to restrict the upper limit of the random interval to some maximum value.

@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def text_generation_model_with_backoff(**kwargs):
    return generation_model.predict(**kwargs).text


@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def embedding_model_with_backoff(text):
    # `text` is a list of strings; only the embedding of the first item is returned.
    embeddings = embedding_model.get_embeddings(text)
    return [each.values for each in embeddings][0]
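
To make the idea behind the decorators above more concrete, here is a minimal standard-library sketch of retrying with a randomized, exponentially growing wait. The helper name `retry_with_backoff` and its parameters are hypothetical and written for illustration only; it is not how `tenacity` is implemented, just the same general idea.

```python
import random
import time


def retry_with_backoff(func, max_attempts=3, base=1.0, cap=20.0, sleep=time.sleep):
    """Call func() and retry on failure (hypothetical helper, for illustration).

    Before retry number k (0-based) the wait is drawn uniformly from
    [0, min(cap, base * 2**k)] seconds, so the window widens after each failure.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            upper = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, upper))
```

The `sleep` argument is injectable so that the waiting behavior can be observed or replaced, for example when experimenting without real delays.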

## Question Answering with large Documents

### Basic Method
Before you dive deeper into possible methods for large document question-answering, explore the primary process of building the system and how it fails with larger files and context. 

![Basic Method](https://storage.googleapis.com/document-examples-llm/assets/basic_method_flow.jpg)

#### Document Loader
You start by loading the documents from your source. In this case, you copy them from Google Cloud Storage.

In [7]:
#Copying the files from the GCS bucket to local 
!gsutil -m cp -r gs://document-examples-llm/documents .

Copying gs://document-examples-llm/documents/20230426_alphabet_10Q.pdf...
Copying gs://document-examples-llm/documents/20210203_alphabet_10K.pdf...       
Copying gs://document-examples-llm/documents/practitioners_guide_to_mlops_whitepaper.docx...
Copying gs://document-examples-llm/documents/20230203_alphabet_10K.pdf...       
Copying gs://document-examples-llm/documents/practitioners_guide_to_mlops_whitepaper.pdf...
Copying gs://document-examples-llm/documents/MLOps Whitepaper.json...           
Copying gs://document-examples-llm/documents/MLOps Whitepaper.pptx...
Copying gs://document-examples-llm/documents/practitioners_guide_to_mlops_whitepaper.txt...
Copying gs://document-examples-llm/documents/20220202_alphabet_10K.pdf...       
Copying gs://document-examples-llm/documents/mlops_whitepaper_page18.jpg...
Copying gs://document-examples-llm/documents/mlops_whitepaper_page22.png...     
/ [11/12 files][ 17.6 MiB/ 17.6 MiB]  99% Done                                  

#### Document Processor

Once you have the documents, it's time to process them. In the processor phase, your goal is to read the documents and convert them into the required format to be easily used in the downstream logic. While reading, you should maintain as much metadata as possible of the source document. 

Here, you load different file types such as PDF, TXT, DOCX, and JSON. Each file type needs its own reader; you can use the simple open-source libraries [textract](https://textract.readthedocs.io/en/stable/) and [PyPDF2](https://pypdf2.readthedocs.io/en/3.0.0/) to load them. For this data, you can save `file_name`, `file_type`, `page_number` (only for PDF), and `content` for each file. 

This metadata will be essential for citing the source of information when sending it as context and answering queries at a later stage. 
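
As a small illustration of why the metadata matters, the sketch below turns one record into a source citation. The helper `format_source` is hypothetical (it is not part of the notebook), and it assumes a record shaped like the dictionaries built in the next cell, with `file_name` and `page_number` keys.

```python
import math


def format_source(record):
    """Build a human-readable citation from a document-store record (hypothetical helper)."""
    page = record.get("page_number")
    # Non-PDF files carry no page number (None, or NaN once stored in pandas).
    if page is None or (isinstance(page, float) and math.isnan(page)):
        return record["file_name"]
    return f"{record['file_name']}, page {int(page)}"


print(format_source({"file_name": "20210203_alphabet_10K.pdf", "page_number": 1}))
```

A string like this can be shown next to an extracted answer so that a reader can check the original page.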


In [8]:
%%time 

final_data = []
for name in glob.glob('documents/*'):
    file_name = name.split("/")[-1]
    file_type = name.split(".")[-1]
    if file_type == "pdf":
        # PDFs are read page by page so that the page number is kept as metadata.
        reader = PdfReader(name)
        for i, page in enumerate(reader.pages):
            text = page.extract_text()
            if text:
                final_data.append({
                    'file_name': file_name,
                    'file_type': "pdf",
                    'page_number': i + 1,
                    'content': text,
                })
    elif file_type in ("txt", "docx", "json"):
        # textract reads these formats as a whole; they have no page numbers.
        final_data.append({
            'file_name': file_name,
            'file_type': file_type,
            'page_number': None,
            'content': textract.process(name).decode("utf-8"),
        })

    # Add readers and processors for other file types ["doc","pptx","jpeg" etc.] 

CPU times: user 23.9 s, sys: 60.9 ms, total: 24 s
Wall time: 24 s


Once you have read the documents with all the necessary metadata, the next step is to create a simple `document store`. Think of the document store as a tabular representation of all the files and their content, along with essential metadata.

In [9]:
# Convert the data read from GCS into a pandas DataFrame for easy readability and downstream logic
pdf_data = pd.DataFrame.from_dict(final_data)
pdf_data = pdf_data.sort_values(by=['file_name','page_number']) # sort the DataFrame by file name and page number
pdf_data.reset_index(inplace=True,drop=True)
pdf_data

Unnamed: 0,file_name,file_type,page_number,content
0,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES\nSECURITIES AND EXCHANGE COMMISS...
1,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer ☒ Accelerated filer ☐...
2,20210203_alphabet_10K.pdf,pdf,3.0,Alphabet Inc.\nForm 10-K\nFor the Fiscal Year ...
3,20210203_alphabet_10K.pdf,pdf,4.0,NOTE ABOUT FORWARD-LOOKING STATEMENTS\nThis An...
4,20210203_alphabet_10K.pdf,pdf,5.0,•the sufficiency and timing of our proposed re...
...,...,...,...,...
367,practitioners_guide_to_mlops_whitepaper.pdf,pdf,33.0,33\nML metadata tracking is generally integrat...
368,practitioners_guide_to_mlops_whitepaper.pdf,pdf,34.0,34\nFigure 14 shows the tasks that are involve...
369,practitioners_guide_to_mlops_whitepaper.pdf,pdf,35.0,35\nFigure 15. End-to-end MLOps workflow
370,practitioners_guide_to_mlops_whitepaper.pdf,pdf,36.0,36\nAdditional resources\nFor more information...


In [10]:
# Check how many rows of each file type the document store contains.
print("Document store has these different file types : \n", pdf_data['file_type'].value_counts())

Document store has these different file types : 
 pdf     369
json      1
docx      1
txt       1
Name: file_type, dtype: int64


#### Context Selection

Now, the next step in the conventional method is to pass the context to PaLM API while asking the question. 

You don't know which document will be helpful, so you can go ahead and use all the documents' text in the `content` column as context. 

In [12]:
# Combine all the content in the document store into a single string so that it can be passed as context.
context = '\n'.join(str(v) for v in pdf_data['content'].values)
# len() on a string counts characters, not words.
print("The total characters in the context: ",len(context))

The total characters in the context:  1531642


#### Prompt Engineering

Next, you can write a simple prompt along with the question. You can prefix the prompt with some basic instructions: here, you ask the model to answer only if it finds the answer in the given `context`. 

The context and the question are passed in dynamically so that you can change them as your requirements and experiments dictate.

In [14]:
question = "what is the address for google headquarter"
prompt = f"""Answer the question as precise as possible using the provided context. If the answer is
              not contained in the context, say "answer not available in context" \n\n
            Context: \n {context}?\n
            Question: \n {question} \n
            Answer:
          """

#### Vertex PaLM API - Answer Extraction & Evaluation

Now here's the moment of truth. Your prompt passes the text of nearly all the documents as context. 

The `text-bison@001` model has an input (prompt) token limit of [8192 tokens](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/models), so your PaLM API call should fail. An ~8k token limit corresponds to roughly 6k words of input, whereas the context alone contains `1531642` characters (note that `len(context)` counts characters, not words). 

As a reminder, a single token may be smaller than a word. A token is approximately four characters, so 100 tokens correspond to roughly 60-80 words. At that rate, the context is on the order of 380,000 tokens. 

This is why the conventional method does not work when you want to do question-answering on large documents. 
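
The arithmetic above can be sketched in a few lines. This is only a rough estimate based on the four-characters-per-token rule of thumb, not an exact tokenizer; the names `estimate_tokens` and `fits_in_prompt` are hypothetical and introduced just for this illustration.

```python
CHARS_PER_TOKEN = 4        # rule of thumb from the text above, not an exact tokenizer
INPUT_TOKEN_LIMIT = 8192   # input limit cited above for text-bison@001


def estimate_tokens(text):
    """Rough token estimate: characters divided by four, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_prompt(text, limit=INPUT_TOKEN_LIMIT):
    """Return True if the estimated token count stays within the limit."""
    return estimate_tokens(text) <= limit
```

Under these assumptions, a string of 1,531,642 characters comes out at about 383,000 estimated tokens, far beyond the limit, while a 1,000-character slice comes out at about 250.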

In [15]:
try:
    print("PaLM Predicted:",generation_model.predict(
        prompt
    ).text)
except Exception as e: 
    print("The code failed since it won't be able to run inference on such a huge context and throws this exception: ",e)

The code failed since it won't be able to run inference on such a huge context and throws this exception:  400 Request contains an invalid argument.


However, you can still run the code if you restrict the context to its first 1,000 characters (as `context[:1000]` does below) or to any size below the token limit of the PaLM API. But there is a good chance you will not get the expected answer, since the relevant passage might not be within those first 1,000 characters. 

In [16]:
question = "How much google invested in waymo?"
prompt = f"""Answer the question as precise as possible using the provided context. If the answer is
              not contained in the context, say "answer not available in context" \n\n
            Context: \n {context[:1000]}?\n
            Question: \n {question} \n
            Answer:
          """
print(len(prompt))
print("PaLM Predicted:",generation_model.predict(
    prompt
).text)

1290
PaLM Predicted: answer not available in context


So, now you have seen how stuffing the whole document content of so many files is not a very promising method to build question-answering systems. There are many different methods to address this limitation, but as discussed in the overview section, you will see two foundational and important methods: 

* Chunk-based Q&A
* Chunk-Based Embedding Q&A

### Method 1: Chunk-based Q&A

In Chunk-based Q&A, rather than passing the whole document as context, we break the document into manageable chunks and then design a prompt to ask each chunk the question. 

The chunks should be sized with the input token limit in mind (roughly 8k tokens, which means approximately 6k words at most). Depending on the documents you have at your source, that may mean dividing the documents page by page or splitting the pages into smaller paragraphs.

The typical flow for this method goes like this:

* You take N documents from your source. 
* Split the documents into chunks (let's say 1000 characters for each chunk).
* Pass each chunk as context to the question-answer prompt.
* Collect the answers from each chunk, either as a `column` or by persisting them into a persistence layer. 


You can refer to the diagram below for more clarity:

![chunk learning](https://storage.googleapis.com/document-examples-llm/assets/chunklearning_flow.jpg)

You can start by writing a simple function `get_chunks_iter` that takes a long string `s` and the size of the chunk as `maxlength`. 

This function divides the input string `s` into chunks of at most `maxlength` characters, breaking at spaces so that words are not cut, saves the individual chunks into a list, and returns the `final_chunk` list. 

In [19]:
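def get_chunks_iter(s, maxlength):
    # Split `s` into chunks of at most `maxlength` characters, breaking at spaces.
    start = 0
    final_chunk = []
    while start + maxlength < len(s):
        # Position of the last space inside the current window.
        end = s.rfind(" ", start, start + maxlength + 1)
        if end <= start:
            # No usable space in the window: fall back to a hard cut
            # instead of slicing with -1, which would duplicate text.
            end = start + maxlength
            final_chunk.append(s[start:end])
            start = end
        else:
            final_chunk.append(s[start:end])
            start = end + 1
    final_chunk.append(s[start:])
    return final_chunk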

#function to apply "get_chunks_iter" function on each row of document store. 
def split_text(row):
    chunk_iter =  get_chunks_iter(row, chunk_size)
    return chunk_iter
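
Note that `get_chunks_iter` slices the string by index, so `maxlength` is a budget in characters. If you would rather express the chunk size in words, a minimal sketch could look like the following; `chunk_by_words` is a hypothetical helper and not part of the notebook's pipeline.

```python
def chunk_by_words(text, max_words):
    """Split text into chunks of at most max_words whitespace-separated words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


print(chunk_by_words("UNITED STATES SECURITIES AND EXCHANGE COMMISSION", 4))
```

A word budget is easier to relate to the "roughly 6k words" guideline, while a character budget maps more directly onto the four-characters-per-token estimate; which one suits you depends on how you reason about the limit.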

In [22]:
# Maximum number of characters in a chunk (get_chunks_iter slices the string by character index).
chunk_size = 1000

pdf_data_sample = pdf_data.copy()
# Replace every run of characters other than letters and digits with a single space.
# This is harsh cleaning. You can define your custom logic for cleansing here. 
pdf_data_sample['content'] = pdf_data_sample['content'].apply(lambda x: re.sub('[^A-Za-z0-9]+', ' ', x))
# Apply the chunk splitting logic here on each row of content in document store. 
pdf_data_sample['chunks'] = pdf_data_sample['content'].apply(split_text)
# Now, each row in 'chunks' contains list of all chunks and hence we need to explode them into individual rows. 
pdf_data_sample = pdf_data_sample.explode("chunks")
# Sort and reset index 
pdf_data_sample = pdf_data_sample.sort_values(by=['file_name','page_number'])
pdf_data_sample.reset_index(inplace=True,drop=True)
pdf_data_sample.head()

Unnamed: 0,file_name,file_type,page_number,content,chunks
0,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...
1,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,Nasdaq Stock Market LLC Nasdaq Global Select M...
2,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,registrant was required to submit such files Y...
3,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,Large accelerated filer Accelerated filer Non ...
4,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,2020 was approximately 849 7 billion For purpo...


You can observe how a single page in the `20210203_alphabet_10K.pdf` file is divided into three chunks. 

Three rows share the same `page_number` of 1, indicating that the page has been divided into three subsets (chunks). This is important because now you have a manageable chunk to send as context, rather than the whole document as seen before.

Chunking also increases the total number of rows in the document store. 

In [26]:
print("The original document store has :",pdf_data.shape[0]," rows without chunking") 
print("The chunked document store has :",pdf_data_sample.shape[0]," rows with chunking")

The original document store has : 372  rows without chunking
The chunked document store has : 1604  rows with chunking


Now you can define two prompts: 

1) For asking a direct question with a similar strategy as seen before with the conventional method. However, pass the `chunks` column this time since you should only pass a single chunk for each PaLM API call on each row. 

2) For extracting specific entities like dates, amounts, product names, etc. This prompt will be helpful if you have any particular entities that must be extracted from the documents. This prompt doesn't require any specific question to be asked. In fact, you can make the prompt one-shot or multi-shot prompt by providing examples. 
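
To illustrate the one-shot or multi-shot idea mentioned above, here is a minimal sketch of a prompt builder that prepends worked examples to the extraction prompt. The function `build_date_extraction_prompt` and the example pair are made up for illustration; they are not the prompts used in the cells below.

```python
def build_date_extraction_prompt(chunk, examples=()):
    """Assemble a date-extraction prompt, optionally with worked examples (hypothetical helper).

    `examples` is a sequence of (context, expected_answer) pairs; one pair
    gives a one-shot prompt, several pairs give a multi-shot prompt.
    """
    parts = [
        "Extract dates from the given context along with the significance of those dates.",
        'If no dates are available, say "dates not available".',
        "",
    ]
    for example_context, example_answer in examples:
        parts += [f"context: {example_context}", f"dates: {example_answer}", ""]
    # The real chunk goes last, leaving the answer slot open for the model.
    parts += [f"context: {chunk}", "dates:"]
    return "\n".join(parts)


# Made-up example pair, used only to show the shape of a one-shot prompt.
one_shot = build_date_extraction_prompt(
    "For the fiscal year ended December 31 2020",
    examples=[("The report was filed on March 3 2021", "March 3 2021: filing date of the report")],
)
print(one_shot)
```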

In [27]:
#function to pass in the apply function on document store to extract answer for specific question on each row. 
def get_answer(df):
    prompt = f"""Answer the question as precise as possible using the provided context. If the answer is
                not contained in the context, say "answer not available in context" \n\n
              Context: \n {df['chunks']}?\n
              Question: \n {question} \n
              Answer:
            """

    pred =  text_generation_model_with_backoff(
      prompt=prompt
    )
    return pred

#function to pass in the apply function on document store to extract dates from each row. 
def get_dates(df):
    prompt = f"""Extract dates from the given context along with significance of those dates. If the dates are not available then
              say "dates not available". Do not mention any date which is not given in the context \n\n
              context: \n
              {df['chunks']}
            """

    pred =  text_generation_model_with_backoff(
      prompt=prompt
    )
    return pred

In [49]:
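# Take a small sample of the whole document store to avoid making too many calls to the API.
# .copy() makes the sample an independent DataFrame, so adding columns does not write to a slice of pdf_data_sample.
pdf_data_sample_head = pdf_data_sample.head(20).copy()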

question = "what is the address for google headquarter"
pdf_data_sample_head['predicted_answer'] = pdf_data_sample_head.apply(get_answer,axis=1)
pdf_data_sample_head['predicted_dates'] = pdf_data_sample_head.apply(get_dates,axis=1)
pdf_data_sample_head.head()

Unnamed: 0,file_name,file_type,page_number,content,chunks,predicted_answer,predicted_dates
0,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,1600 Amphitheatre Parkway Mountain View CA 94043,"The fiscal year end date is December 31, 2020."
1,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,Nasdaq Stock Market LLC Nasdaq Global Select M...,answer not available in context,Dates not available.
2,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,registrant was required to submit such files Y...,answer not available in context,dates not available
3,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,Large accelerated filer Accelerated filer Non ...,answer not available in context,June 30 2020: aggregate market value of shares...
4,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,2020 was approximately 849 7 billion For purpo...,answer not available in context,The date 2020 is mentioned in the context.


You can now look into the results to see how PaLM performed. 

In [41]:
index = 0
print("[Context]")
print(pdf_data_sample_head['chunks'].iloc[index])
print("\n\n [Extracted Answer]")
print(pdf_data_sample_head['predicted_answer'].iloc[index])
print("\n\n [Source: page_number]")
print(pdf_data_sample_head['page_number'].iloc[index])
print("\n\n [Source: file_name]")
print(pdf_data_sample_head['file_name'].iloc[index])

[Context]
UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington D C 20549 FORM 10 K Mark One ANNUAL REPORT PURSUANT TO SECTION 13 OR 15 d OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year ende d December 31 2020 OR TRANSITION REPORT PURSUANT TO SECTION 13 OR 15 d OF THE SECURITIES EXCHANGE ACT OF 1934 For the transition period from to Commission file number 001 37580 Alphabet Inc Exact name of registrant as specified in its charter Delaware 61 1767919 State or other jurisdiction of incorporation or organization I R S Employer Identification No 1600 Amphitheatre Parkway Mountain View CA 94043 Address of principal executive offices including zip code 650 253 000 0 Registrant s telephone number including area code Securities registered pursuant to Section 12 b of the Act Title of each class Trading Symbol s Name of each exchange on which registered Class A Common Stock 0 001 par value GOOGL Nasdaq Stock Market LLC Nasdaq Global Select Market Class C Capital Stock 0 001 par 

In [42]:
print("[Context]")
print(pdf_data_sample_head['chunks'].iloc[index])
print("\n\n [Extracted Dates]")
print(pdf_data_sample_head['predicted_dates'].iloc[index])
print("\n\n [Source: page_number]")
print(pdf_data_sample_head['page_number'].iloc[index])
print("\n\n [Source: file_name]")
print(pdf_data_sample_head['file_name'].iloc[index])

[Context]
UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington D C 20549 FORM 10 K Mark One ANNUAL REPORT PURSUANT TO SECTION 13 OR 15 d OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year ende d December 31 2020 OR TRANSITION REPORT PURSUANT TO SECTION 13 OR 15 d OF THE SECURITIES EXCHANGE ACT OF 1934 For the transition period from to Commission file number 001 37580 Alphabet Inc Exact name of registrant as specified in its charter Delaware 61 1767919 State or other jurisdiction of incorporation or organization I R S Employer Identification No 1600 Amphitheatre Parkway Mountain View CA 94043 Address of principal executive offices including zip code 650 253 000 0 Registrant s telephone number including area code Securities registered pursuant to Section 12 b of the Act Title of each class Trading Symbol s Name of each exchange on which registered Class A Common Stock 0 001 par value GOOGL Nasdaq Stock Market LLC Nasdaq Global Select Market Class C Capital Stock 0 001 par 

Now, let's look into this method's various pros and cons to summarize what you have done. 

**Pros:**

* Precision: The precision of the answers increases with this method because each chunk provides a small, focused context.
* Entity Extraction: This method is most helpful when the information you need is spread across many parts of the documents. If there is a specific number or entity to extract that appears on most pages, then this is the most efficient and precise method. 

**Cons:**

*   Multiple API Calls: Each subset/chunk triggers an API call, which incurs cost and takes time. You can also hit the API quota limits if the number of documents is enormous. 
*   Slow: Even if the answer is available in an early chunk/subset, the method still searches through all the chunks/subsets. It behaves like a linear search. You can make it more efficient by devising strategies that stop as soon as relevant information is found.
* Conflicting Answers: What if the relevant answer is found in multiple chunks and subsets? How would you know the correct answer? You will likely find overlapping answers. You should devise a ranking algorithm to resolve that.
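
One simple way to address the "Slow" point is to stop at the first chunk that yields an answer. The sketch below assumes a callable `ask(chunk)` that returns the model's answer as a string (for example, a wrapper around the prompt used earlier); both `first_answer` and `ask` are hypothetical names introduced here.

```python
NOT_FOUND = "answer not available in context"


def first_answer(chunks, ask):
    """Query chunks in order and stop at the first one that yields an answer.

    Returns (chunk_index, answer), or None if no chunk contains the answer.
    """
    for index, chunk in enumerate(chunks):
        answer = ask(chunk)
        if answer.strip().lower() != NOT_FOUND:
            return index, answer
    return None
```

Stopping early reduces API calls but does nothing about conflicting answers, since later chunks are never consulted; it is a trade-off rather than a complete fix.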

Moving forward, let's explore the following method, which addresses some of the shortcomings of Method 1. 

### Method 2: Chunk-Based Embedding Q&A

One of the biggest cons you saw with the previous method for typical Q&A is that you would have to call PaLM API on all the chunks. As you have seen, it's suitable for precision and extraction tasks, but more efficient ways exist for direct question-answering. One such method is creating embeddings of the chunks. 

When you have embeddings (vector/number representations of text) of each chunk, you can use simple vector mathematics to find similar/dissimilar chunks across the document store. Why is that helpful?

Rather than going through every chunk up to N chunks (imagine N to be very large), you can find the relevant context from all the chunks in the document store where your answer may exist. 

For example, when you ask about "Google headquarters", only the chunks that contain some information about "Google headquarters" should be selected, not every chunk. This saves you considerable computation and API costs. 

The typical flow for this method goes like this: 
* You take N documents from your source.
* Split the documents into chunks (let's say 1000 characters for each chunk).
* Create embeddings for each chunk and store them in the document store, which can now be called a vector DB or vector document store. 
* Convert the question that you want to ask into an embedding as well. 
* Now you have two sets of vectors: 1) chunk vectors and 2) the query/question vector.
* Compute the cosine similarity between the query vector and each chunk vector to find the closest chunks.
* Once you have the cosine scores, sort them, pick the top M chunks, and use them as the context for the prompt. 
* Finally, call the PaLM API with only the relevant context. As you can observe, you didn't have to call the API on all chunks. 

You can refer to the diagram below for more clarity:


![Embedding Learning](https://storage.googleapis.com/document-examples-llm/assets/embeddinglearning.jpg)
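
The ranking step of this flow can be sketched with the standard library alone. The vectors below are made-up two-dimensional stand-ins for real embeddings, and `cosine_similarity` and `top_m_chunks` are hypothetical helper names, not functions from the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_m_chunks(query_vector, chunk_vectors, m=2):
    """Return the indices of the m chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, v), i) for i, v in enumerate(chunk_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:m]]


# Toy data: three "chunk embeddings" and one "question embedding".
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vector = [1.0, 0.1]
print(top_m_chunks(query_vector, chunk_vectors, m=2))
```

Only the chunks at the returned indices would then be joined into the context for a single PaLM API call.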

You can start the implementation by getting the embeddings for each chunk. You are still working with the first 20 rows of the chunked document store to avoid making too many calls to the API. 

This adds the embedding (vector/number representation) of each chunk as a separate column, and you can now call this a vector db or vector document store. 

In [50]:
pdf_data_sample_head['embedding'] = pdf_data_sample_head['chunks'].apply(lambda x: embedding_model_with_backoff([x]))
pdf_data_sample_head["embedding"] = pdf_data_sample_head.embedding.apply(np.array)
pdf_data_sample_head.head()

Unnamed: 0,file_name,file_type,page_number,content,chunks,predicted_answer,predicted_dates,embedding
0,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,1600 Amphitheatre Parkway Mountain View CA 94043,"The fiscal year end date is December 31, 2020.","[-0.0018616351298987865, 0.007186433300375938,..."
1,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,Nasdaq Stock Market LLC Nasdaq Global Select M...,answer not available in context,Dates not available.,"[0.009593687951564789, 0.006503074895590544, -..."
2,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,registrant was required to submit such files Y...,answer not available in context,dates not available,"[0.02656714990735054, -0.007002761587500572, -..."
3,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,Large accelerated filer Accelerated filer Non ...,answer not available in context,June 30 2020: aggregate market value of shares...,"[0.0011118471156805754, -0.0008834037580527365..."
4,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,2020 was approximately 849 7 billion For purpo...,answer not available in context,The date 2020 is mentioned in the context.,"[-0.014244736172258854, -0.005289070773869753,..."


Now comes the heart of this method. First, you can define a function `get_context_from_question`, which takes:
* `valid_question`: the question the user wants to ask,
* `vector_store`: the vector db store you created in the last step, and
* `sort_index_value`: how many chunks to pick after sorting by similarity score.

The function embeds `valid_question` and takes the dot product of the resulting query vector with every chunk embedding in the vector store. The dot product equals the cosine similarity only when the vectors have unit length; otherwise it is also affected by vector magnitude. Once you have the scores, the function sorts them in decreasing order, picks the top `sort_index_value` chunks, and joins them into a single string. 

This will become your context for the question asked. 
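
The relationship between the dot product and cosine similarity can be checked on a small example. This is a sketch on made-up two-dimensional vectors. Whether the embedding model used in this notebook returns unit-length vectors is not stated here, so treat that as something to verify on your own data (for example with `np.linalg.norm`).

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Hypothetical toy vectors; each has length 5.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

dot = float(np.dot(a, b))                                # raw dot product
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine similarity
dot_of_unit = float(np.dot(normalize(a), normalize(b)))  # dot product after normalizing

print(dot, cosine, dot_of_unit)
```

On unnormalized vectors the raw dot product (24 here) differs from the cosine similarity (0.96). After normalization the two agree, which is why ranking by dot product is equivalent to ranking by cosine similarity only for unit-length embeddings.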

In [44]:
def get_context_from_question(valid_question, 
                              vector_store,
                              sort_index_value=2):
    # Embed the question once.
    query_vector = np.array(embedding_model_with_backoff([valid_question]))
    # Score every chunk by its dot product with the query vector.
    scores = vector_store["embedding"].apply(lambda chunk_vector: np.dot(chunk_vector, query_vector))
    # Index labels of the highest-scoring chunks.
    top_matched = scores.sort_values(ascending=False)[:sort_index_value].index
    # isin() keeps the selected chunks in their original document order.
    context = " ".join(vector_store[vector_store.index.isin(top_matched)]['chunks'].values)
    return context
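
Note that `get_context_from_question` selects rows with `isin`, so the chunks are joined in their original document order rather than in order of similarity. If you would rather put the best match first, one option is to select with `.loc`, as in the sketch below. The helper name `get_ranked_context`, the toy DataFrame, and the precomputed `toy_query` vector are all hypothetical and are used here only so the example runs without calling the embedding API.

```python
import numpy as np
import pandas as pd

def get_ranked_context(query_vector, vector_store, top_n=2):
    """Join the top_n chunks, most similar first (hypothetical helper)."""
    scores = vector_store["embedding"].apply(lambda v: float(np.dot(v, query_vector)))
    top_index = scores.sort_values(ascending=False).index[:top_n]
    # .loc with an ordered index returns rows in that order.
    return " ".join(vector_store.loc[top_index, "chunks"])

# Toy vector store with made-up 2-D embeddings.
toy_store = pd.DataFrame({
    "chunks": ["first chunk", "second chunk", "third chunk"],
    "embedding": [np.array([0.0, 1.0]), np.array([0.6, 0.8]), np.array([1.0, 0.0])],
})
toy_query = np.array([1.0, 0.0])

ranked = get_ranked_context(toy_query, toy_store, top_n=2)
print(ranked)  # third chunk second chunk
```

With `isin`, the same two chunks would come back as "second chunk third chunk". Which ordering works better in the prompt is a judgment call to test on your own documents.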

Now that you have a general function that always gets you custom relevant context for the question, you can call it with every new question. 

In [242]:
%%time 

#your question for the documents 
valid_question = "what is the address for google headquarter"

#get the custom relevant chunks from all the chunks in vector store. 
context = get_context_from_question(valid_question, 
                                    vector_store=pdf_data_sample_head,
                                    sort_index_value=1 #Top N results to pick from embedding vector search
                                   )

#Prompt for Q&A which takes the custom context found in last step. 
prompt = f""" Answer the question as precise as possible using the provided context. If the answer is
            not contained in the context, say "answer not available in context" \n\n
            Context: \n {context}?\n
            Question: \n {valid_question} \n
            Answer:
          """

#Call the PaLM API on the prompt. 
print("PaLM Predicted:",text_generation_model_with_backoff(
    prompt=prompt
))


PaLM Predicted: 1600 Amphitheatre Parkway Mountain View CA 94043
CPU times: user 9.53 ms, sys: 2.63 ms, total: 12.2 ms
Wall time: 970 ms


As you can see, the best part of this method is that you don't have to call the text generation model on every chunk. Each question needs only one embedding call for the question and one generation call on the selected context. 

In [47]:
%%time 

valid_question = "who are the founders of google?"

context = get_context_from_question(valid_question, 
                                    vector_store=pdf_data_sample_head,
                                    sort_index_value=1 #Top N results to pick from embedding vector search
                                   )

prompt = f""" Answer the question as precise as possible using the provided context. If the answer is
            not contained in the context, say "answer not available in context" \n\n
            Context: \n {context}?\n
            Question: \n {valid_question} \n
            Answer:
          """
print("PaLM Predicted:",text_generation_model_with_backoff(
    prompt=prompt
))


PaLM Predicted: Larry and Sergey
CPU times: user 0 ns, sys: 10.7 ms, total: 10.7 ms
Wall time: 1.02 s
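
Both cells above repeat the same prompt template. One way to avoid the duplication is a small helper that fills in the context and the question. The sketch below is illustrative only: `build_qa_prompt` is a hypothetical name, and the template wording is lightly tidied compared with the cells above, so the model's answers may differ slightly.

```python
def build_qa_prompt(context, question):
    """Fill the question-answering prompt template with a context and a question."""
    return (
        "Answer the question as precisely as possible using the provided context. "
        'If the answer is not contained in the context, say "answer not available in context".\n\n'
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        "Answer:"
    )

# Placeholder context; in the notebook this would come from get_context_from_question.
prompt = build_qa_prompt(
    context="Example context text.",
    question="what is the address for google headquarter",
)
print(prompt)
```

The returned string can then be passed as `prompt=` to `text_generation_model_with_backoff`, exactly as in the cells above.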


Now, let's look into this method's various pros and cons to summarize what you have done. 

**Pros:**

*  Fast: the text generation API is called once per question instead of once per chunk.

**Cons:**
*  The vector document store can grow very large, and computing the similarity between the query and every chunk with plain DataFrame operations then becomes slow. 
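
One way to soften this cost while staying with exact (brute-force) search is to stack the embeddings into a single matrix and score all chunks in one matrix-vector product. The sketch below assumes the embeddings fit in memory as a 2-D NumPy array. The function name `top_k_by_dot_product` and the random data are hypothetical stand-ins for a real vector store.

```python
import numpy as np

def top_k_by_dot_product(embedding_matrix, query_vector, k):
    """Return row indices of the k highest dot-product scores, best first (k >= 1)."""
    scores = embedding_matrix @ query_vector           # score all rows at once
    k = min(k, len(scores))
    candidates = np.argpartition(-scores, k - 1)[:k]   # top k, unordered, without a full sort
    return candidates[np.argsort(-scores[candidates])] # order only the k winners

# Random stand-in data: 1000 "chunks" with 8-dimensional "embeddings".
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(1000, 8))
query_vector = rng.normal(size=8)

top = top_k_by_dot_product(embedding_matrix, query_vector, k=5)
print(top)
```

With a DataFrame like the one in this notebook, the matrix could be built with `np.vstack(vector_store["embedding"].values)`. For much larger stores, approximate nearest-neighbor indexes are the usual next step.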

In the upcoming notebooks in this journey, you will see how to operationalize these foundational methods using GCP and open-source products. 