In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Question Answering with Large Documents - Foundations


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/language/examples/reference-architectures/question_answering_with_large_documents.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/examples/reference-architectures/question_answering_with_large_documents.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/blob/main/language/examples/reference-architectures/question_answering_with_large_documents.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

You are a data scientist at a company and have been tasked with building a question-answering system using the Vertex PaLM API. The system should take the company's documents and let users extract or query important information from them. You are building an out-of-the-box question-answering bot on enterprise data that can be in any format, such as PDF, DOC, TXT, DOCX, PPTX, or HTML.

The token limit is a current limitation of large language models, including the [Vertex PaLM API](). As you have seen in the basic [question-answering notebook](), to get the best results from the PaLM 2 model, you must provide relevant context when asking `closed-domain` questions. Unfortunately, enterprise data does not respect such limits: the underlying documents could have thousands or millions of pages, which means you cannot pass them whole as context.

In this `reference-architecture`, you will see two methods that address the large-context challenge:

* Chunk Learning - splitting documents in smaller chunks.
* Chunk Embedding Learning - creating embeddings of smaller chunks and using vector similarity search to find relevant context. 

The notebook introduces you to the foundational theory of handling huge documents for building a question-answering bot using Vertex PaLM API and finding relevant context for a user query, keeping the context limitation in check. 

In addition, some steps have open-source or Google Cloud drop-in replacements, which are discussed later in the notebook.

### Objective

By the end of the notebook, you will learn how to build a question-answering system using PaLM API, which can handle large documents. 
You will also learn the conceptual implementation of two methods to help you embed large contexts from many documents. 

## Getting Started

### Install Vertex AI SDK & Other dependencies 

In [158]:
#Base system dependencies
!sudo apt -y -qq install tesseract-ocr
!sudo apt -y -qq install libtesseract-dev
!sudo apt-get -y -qq install poppler-utils #required by PyPDF2 for page count and other pdf utilities
!sudo apt-get -y -qq install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
#Python dependencies
!pip install google-cloud-aiplatform --upgrade --quiet --user
!pip install pytesseract --quiet --user
!pip install PyPDF2 --quiet --user
!pip install textract --quiet --user

**Colab only**: Uncomment the following cell to restart the kernel. For Vertex AI Workbench, you can restart the kernel using the button on top.

In [7]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Authenticating your notebook environment
* If you are using **Colab** to run this notebook, uncomment the cell below and continue.
* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [None]:
# from google.colab import auth
# auth.authenticate_user()

### Import libraries

In [80]:
import glob
import re

import numpy as np
import pandas as pd
import textract
from PyPDF2 import PdfReader
from tenacity import retry, stop_after_attempt, wait_random_exponential
from vertexai.preview.language_models import TextGenerationModel, TextEmbeddingModel

### Import models

In [75]:
generation_model = TextGenerationModel.from_pretrained("text-bison@001")
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

You will make many calls to the PaLM API in this notebook, so to make them more resilient you should allow specific calls to be retried before they fail. You should not "hammer" or "congest" the underlying foundation models, so it is wise to wait a bit before each retry.

You can add a simple delay before every request; however, a fixed delay only helps if it gives the service enough time to catch up. A simple alternative is to increase the length of the delay for each subsequent attempt.

This approach is commonly referred to as [exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff). The constant used for the delay may need to be adjusted depending on the service you're calling.

You can find more retry strategies that work well for Google Cloud APIs [here](https://cloud.google.com/storage/docs/retry-strategy) and the API guide for the retry library used in this notebook [here](https://tenacity.readthedocs.io/en/latest/api.html).

The example below defines functions that implement "retry with exponential backoff."

In [90]:
# Retry helpers to avoid failing on quota limits, since this notebook calls the APIs many times.
# Each retry waits a random time drawn from an exponentially growing interval,
# with a minimum of 1 second and a maximum of 20 seconds, and gives up after 3 attempts.

@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def text_generation_model_with_backoff(**kwargs):
    return generation_model.predict(**kwargs).text


@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def embedding_model_with_backoff(text):
    # text is a list of strings; only the embedding of the first item is returned.
    embeddings = embedding_model.get_embeddings(text)
    return [each.values for each in embeddings][0]
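
To make the idea behind the decorators concrete, here is a minimal sketch of retry with randomized exponential backoff written without any library. It is not how tenacity is implemented internally; the function name `call_with_backoff` and its parameters are hypothetical and chosen only for illustration.

```python
import random
import time


def call_with_backoff(func, max_attempts=3, base=1.0, cap=20.0, sleep=time.sleep):
    """Call func(); on failure wait a random, exponentially growing delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The upper bound doubles each attempt (base, 2*base, 4*base, ...) up to cap.
            upper = min(cap, base * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper))


# Demo: a hypothetical flaky function that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("service busy")
    return "ok"


delays = []  # record the delays instead of actually sleeping
print(call_with_backoff(flaky, sleep=delays.append), len(delays))
```

Passing `sleep=delays.append` keeps the demo instant while still showing that two waits happened before the third, successful attempt.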

## Question Answering with large Documents

### Basic Method
Before diving into methods that work for large-document question answering, explore the basic process of building the system and see how it fails with larger files and context.

![Basic Method](https://storage.googleapis.com/document-examples-llm/assets/basic_method_flow.jpg)

#### Document Loader

In [171]:
#Copying the files from the GCS bucket to local 
!gsutil -m cp -r gs://document-examples-llm/documents .

Copying gs://document-examples-llm/documents/20230426_alphabet_10Q.pdf...
Copying gs://document-examples-llm/documents/20210203_alphabet_10K.pdf...
Copying gs://document-examples-llm/documents/20220202_alphabet_10K.pdf...       
Copying gs://document-examples-llm/documents/MLOps Whitepaper.pptx...           
Copying gs://document-examples-llm/documents/20230203_alphabet_10K.pdf...       
Copying gs://document-examples-llm/documents/practitioners_guide_to_mlops_whitepaper.pdf...
Copying gs://document-examples-llm/documents/MLOps Whitepaper.json...           
Copying gs://document-examples-llm/documents/practitioners_guide_to_mlops_whitepaper.txt...
Copying gs://document-examples-llm/documents/mlops_whitepaper_page18.jpg...     
Copying gs://document-examples-llm/documents/mlops_whitepaper_page22.png...     
Copying gs://document-examples-llm/documents/practitioners_guide_to_mlops_whitepaper.docx...
/ [11/12 files][ 17.6 MiB/ 17.6 MiB]  99% Done                                  

#### Document Processor

In [224]:
%%time 

final_data = []
for name in glob.glob('documents/*'):
    file_type = name.split(".")[-1]
    if file_type == "pdf":
        reader = PdfReader(name)
        for i, page in enumerate(reader.pages):
            text = page.extract_text()
            if text:
                data_packet = {}
                data_packet['file_name'] = name.split("/")[-1]
                data_packet['file_type'] = "pdf"
                data_packet['page_number'] = int(i+1)
                data_packet['content'] = text
                final_data.append(data_packet) 
    if file_type == "txt":
        text_txt = textract.process(name).decode("utf-8")
        data_packet = {}
        data_packet['file_name'] = name.split("/")[-1]
        data_packet['file_type'] = "txt"
        data_packet['page_number'] = None
        data_packet['content'] = text_txt
        final_data.append(data_packet)
    if file_type == "docx":
        text_docx = textract.process(name).decode("utf-8")
        data_packet = {}
        data_packet['file_name'] = name.split("/")[-1]
        data_packet['file_type'] = "docx"
        data_packet['page_number'] = None
        data_packet['content'] = text_docx
        final_data.append(data_packet)
    if file_type == "json":
        text_json = textract.process(name).decode("utf-8")
        data_packet = {}
        data_packet['file_name'] = name.split("/")[-1]
        data_packet['file_type'] = "json"
        data_packet['page_number'] = None
        data_packet['content'] = text_json
        final_data.append(data_packet) 
        
        
    # Add readers and processors for other file types ["doc","pptx","jpeg" etc.]

CPU times: user 17.9 s, sys: 5.63 s, total: 23.6 s
Wall time: 23.6 s
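
One way to add more file types without growing the chain of `if` blocks is a lookup table from file extension to extractor function. The sketch below is only an illustration of that design: `make_packet`, `process_file`, and `fake_text_extractor` are hypothetical names, and a real extractor might wrap `textract.process` or `PdfReader` instead.

```python
from pathlib import Path


def make_packet(path, file_type, content, page_number=None):
    """Build one record in the same shape as the data_packet dictionaries above."""
    return {
        "file_name": Path(path).name,
        "file_type": file_type,
        "page_number": page_number,
        "content": content,
    }


def process_file(path, extractors):
    """Look up an extractor by file extension and return a list of packets."""
    file_type = Path(path).suffix.lstrip(".").lower()
    extractor = extractors.get(file_type)
    if extractor is None:
        return []  # unsupported type: skip rather than fail
    return [make_packet(path, file_type, extractor(path))]


# Hypothetical extractor used only for this demonstration.
def fake_text_extractor(path):
    return f"contents of {Path(path).name}"


extractors = {"txt": fake_text_extractor, "docx": fake_text_extractor}
print(process_file("documents/report.docx", extractors))
print(process_file("documents/photo.jpg", extractors))
```

Supporting a new format then only requires registering one more entry in `extractors`.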


In [251]:
# Convert the extracted data to a pandas DataFrame for easy readability and downstream logic
pdf_data = pd.DataFrame.from_dict(final_data)
pdf_data = pdf_data.sort_values(by=['file_name','page_number']) # sort the DataFrame by file name and page number
pdf_data.reset_index(inplace=True,drop=True)
pdf_data

Unnamed: 0,file_name,file_type,page_number,content
0,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES\nSECURITIES AND EXCHANGE COMMISS...
1,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer ☒ Accelerated filer ☐...
2,20210203_alphabet_10K.pdf,pdf,3.0,Alphabet Inc.\nForm 10-K\nFor the Fiscal Year ...
3,20210203_alphabet_10K.pdf,pdf,4.0,NOTE ABOUT FORWARD-LOOKING STATEMENTS\nThis An...
4,20210203_alphabet_10K.pdf,pdf,5.0,•the sufficiency and timing of our proposed re...
...,...,...,...,...
367,practitioners_guide_to_mlops_whitepaper.pdf,pdf,33.0,33\nML metadata tracking is generally integrat...
368,practitioners_guide_to_mlops_whitepaper.pdf,pdf,34.0,34\nFigure 14 shows the tasks that are involve...
369,practitioners_guide_to_mlops_whitepaper.pdf,pdf,35.0,35\nFigure 15. End-to-end MLOps workflow
370,practitioners_guide_to_mlops_whitepaper.pdf,pdf,36.0,36\nAdditional resources\nFor more information...


In [249]:
print("Document store has these different file types : \n", pdf_data['file_type'].value_counts())

Document store has these different file types : 
 pdf     369
json      1
docx      1
txt       1
Name: file_type, dtype: int64


#### Context Selection

In [196]:
# Combine the content of all documents into a single string so that it can be passed as context.
context = '\n'.join(str(v) for v in pdf_data['content'].values)
print("The total characters in the prompt: ",len(context))

The total characters in the prompt:  1539605
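
A back-of-the-envelope check shows how far this context is from fitting in a single request. The sketch assumes roughly four characters per token, which is a common rule of thumb for English text and not a value reported by the PaLM API, and it uses a hypothetical input limit of 8,192 tokens; check the model documentation for the real limit.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; chars_per_token is a rule of thumb, not an API value."""
    return len(text) // chars_per_token


def fits_in_context(text, token_limit):
    """Return True if the estimated token count is within the given limit."""
    return estimate_tokens(text) <= token_limit


ASSUMED_TOKEN_LIMIT = 8192  # assumption for illustration only

# A placeholder string with the same length as the combined context above.
placeholder_context = "x" * 1539605
print(estimate_tokens(placeholder_context))
print(fits_in_context(placeholder_context, ASSUMED_TOKEN_LIMIT))
```

Even under generous assumptions, the combined documents exceed the limit by well over an order of magnitude, which is why the request is rejected.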


#### Prompt Engineering

In [197]:
question = "what is the address for google headquarter"
prompt = f"""Answer the question as precise as possible using the provided context. If the answer is
              not contained in the context, say "answer not available in context" \n\n
            Context: \n {context}?\n
            Question: \n {question} \n
            Answer:
          """

#### Vertex PaLM API - Answer Extraction & Evaluation

In [198]:
try:
    print("PaLM Predicted:",generation_model.predict(
        prompt
    ).text)
except Exception as e: 
    print("The code failed since it won't be able to run inference on such a huge context and throws this exception: ",e)

The code failed since it won't be able to run inference on such a huge context and throws this exception:  400 Request contains an invalid argument.


However, you can still run the code if you restrict the context to the first 1,000 characters, or to any length below the token limit of the PaLM API. But there is a good chance that the truncated context will not contain the expected answer.

In [199]:
question = "what is the address for google headquarter"
prompt = f"""Answer the question as precise as possible using the provided context. If the answer is
              not contained in the context, say "answer not available in context" \n\n
            Context: \n {context[:1000]}?\n
            Question: \n {question} \n
            Answer:
          """
print(len(prompt))
print("PaLM Predicted:",generation_model.predict(
    prompt
).text)

1298
PaLM Predicted: 1600 Amphitheatre Parkway  
Mountain View , CA 94043


### Method 1: Chunk Learning

In Chunk Learning, rather than passing the whole document as context, we break the document into manageable chunks (sometimes by page, sometimes into smaller units such as paragraphs) and then design a prompt that asks the question of each chunk. In our case, let's send each page's content as context and see how it performs. The prompt must be designed so that the model outputs "answer not available in context" when the chunk does not contain the answer.

![chunk learning](https://storage.googleapis.com/document-examples-llm/assets/chunklearning_flow.jpg)

If a single page contains a lot of text, you may need to divide the data further. For this, we can split each page into chunks of a chosen size. In the code below, we divide each page into chunks of at most 1,000 characters. You can increase or decrease this size based on your data. We keep the page number column since it serves as a good index.

In [228]:
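def get_chunks_iter(s, maxlength):
    """Split s into chunks of at most maxlength characters, breaking at spaces."""
    start = 0
    final_chunk = []
    while start + maxlength < len(s):
        # Find the last space inside the current window.
        end = s.rfind(" ", start, start + maxlength + 1)
        if end <= start:
            # No usable space in the window: fall back to a hard split.
            end = start + maxlength
            final_chunk.append(s[start:end])
            start = end
        else:
            final_chunk.append(s[start:end])
            start = end + 1  # skip the space itself
    final_chunk.append(s[start:])
    return final_chunk

def split_text(row):
    # chunk_size is a module-level variable defined in the next cell.
    return get_chunks_iter(row, chunk_size)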
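
For comparison, the Python standard library offers a similar space-aware split through `textwrap.wrap`. This is a swapped-in alternative shown on a toy string with a small width, not the function used in this notebook.

```python
import textwrap

text = "Large accelerated filer Accelerated filer Non accelerated filer Smaller reporting company"

# Break the text at spaces so that no chunk is longer than 30 characters.
chunks = textwrap.wrap(text, width=30)

for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Because the sample text is single-spaced, joining the chunks with a space reproduces the original string, so no content is lost by the split.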

In [257]:
# chunk_size is read as a module-level variable by split_text, so define it before calling apply.
chunk_size = 1000

pdf_data_sample = pdf_data.copy()
pdf_data_sample['content'] = pdf_data_sample['content'].apply(lambda x: re.sub('[^A-Za-z0-9]+', ' ', x))
pdf_data_sample['chunks'] = pdf_data_sample['content'].apply(split_text)
pdf_data_sample = pdf_data_sample.explode("chunks")
pdf_data_sample = pdf_data_sample.sort_values(by=['file_name','page_number'])
pdf_data_sample.reset_index(inplace=True,drop=True)
pdf_data_sample.head()

Unnamed: 0,file_name,file_type,page_number,content,chunks
0,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...
1,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,Nasdaq Stock Market LLC Nasdaq Global Select M...
2,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,registrant was required to submit such files Y...
3,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,Large accelerated filer Accelerated filer Non ...
4,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,2020 was approximately 849 7 billion For purpo...


In [None]:
def get_answer(df):
    prompt = f"""Answer the question as precise as possible using the provided context. If the answer is
                not contained in the context, say "answer not available in context" \n\n
              Context: \n {df['chunks']}?\n
              Question: \n {question} \n
              Answer:
            """

    pred =  text_generation_model_with_backoff(
      prompt=prompt
    )
    return pred

def get_dates(df):
    prompt = f"""extract dates from the given context along with significance of those dates. If the dates are not available then
              say "dates not available". Do not mention any date which is not given in the context \n\n
              context: \n
              {df['content']}
            """

    pred =  text_generation_model_with_backoff(
      prompt=prompt
    )
    return pred

In [258]:
# Work on an explicit copy so that adding a column does not trigger a SettingWithCopyWarning.
pdf_data_sample_head = pdf_data_sample.head(20).copy()

question = "what is the address for google headquarter"
pdf_data_sample_head['predicted_answer'] = pdf_data_sample_head.apply(get_answer,axis=1)
pdf_data_sample_head.head()


Unnamed: 0,file_name,file_type,page_number,content,chunks,predicted_answer
0,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,The address of the principal executive offices...
1,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,Nasdaq Stock Market LLC Nasdaq Global Select M...,The address of the principal executive offices...
2,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,registrant was required to submit such files Y...,The address of the principal executive offices...
3,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,Large accelerated filer Accelerated filer Non ...,answer not available in context
4,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,2020 was approximately 849 7 billion For purpo...,answer not available in context


In [259]:
# Work on an explicit copy so that adding a column does not trigger a SettingWithCopyWarning.
pdf_data_sample_head = pdf_data_sample.head(20).copy()
pdf_data_sample_head['predicted_dates'] = pdf_data_sample_head.apply(get_dates,axis=1)
pdf_data_sample_head.head()
  


Unnamed: 0,file_name,file_type,page_number,content,chunks,predicted_dates
0,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,"The fiscal year end date is December 31, 2020."
1,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,Nasdaq Stock Market LLC Nasdaq Global Select M...,"The fiscal year end date is December 31, 2020."
2,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,registrant was required to submit such files Y...,"The fiscal year end date is December 31, 2020."
3,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,Large accelerated filer Accelerated filer Non ...,The following dates are mentioned in the conte...
4,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,2020 was approximately 849 7 billion For purpo...,The following dates are mentioned in the conte...


**Pros:**

* Precision: Answers tend to be more precise because each call sees only a small, focused context.
* Multiple context Q&A [Entity Extraction]: This is the most useful method if your answer may be spread across different parts of your document. For example, if there is a specific number or entity you want to extract that appears on most pages, this is the most efficient and precise method. This is shown by the `get_dates` function and the `predicted_dates` column in the final dataframe.

**Cons:**

* Multiple API calls: Each chunk results in a separate API call, which might incur cost.
* Slow: Even if the answer is available in an early chunk, the method still searches through all the chunks, much like an exhaustive linear scan. You can make it more efficient by devising clever strategies to stop as soon as the relevant information is found.
* Conflicting answers: What if a relevant answer is found in multiple chunks? How would you pick the correct one? There is a good chance that you will find overlapping answers, and you might have to devise a clever ranking algorithm to resolve them.
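
The last two cons can be softened with simple strategies. The sketch below shows two of them under stated assumptions: stopping at the first chunk that yields an answer, and resolving conflicts by majority vote. The names `first_available_answer`, `most_common_answer`, and `fake_ask` are hypothetical, and `fake_ask` stands in for a real model call.

```python
from collections import Counter

NOT_AVAILABLE = "answer not available in context"


def first_available_answer(chunks, ask):
    """Stop at the first chunk that yields an answer instead of scanning all chunks."""
    for chunk in chunks:
        answer = ask(chunk)
        if answer.strip().lower() != NOT_AVAILABLE:
            return answer
    return NOT_AVAILABLE


def most_common_answer(answers):
    """Resolve conflicts by majority vote over the answers that were available."""
    available = [a.strip() for a in answers if a.strip().lower() != NOT_AVAILABLE]
    if not available:
        return NOT_AVAILABLE
    return Counter(available).most_common(1)[0][0]


# Hypothetical stand-in for a model call: it "answers" only when the chunk mentions an address.
def fake_ask(chunk):
    return "1600 Amphitheatre Parkway" if "address" in chunk else NOT_AVAILABLE


chunks = ["cover page", "address of principal executive offices", "risk factors"]
print(first_available_answer(chunks, fake_ask))
print(most_common_answer([fake_ask(c) for c in chunks]))
```

Early stopping saves calls but trusts the first hit, while majority voting still calls the model on every chunk; which trade-off is acceptable depends on the documents.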



### Method 2: Map Reduce - Chunk Learning with Embedding

One of the biggest cons of the previous method for typical Q&A is that you have to call the PaLM API on every chunk. It works, but as we have seen, it is not the most efficient way.

How about first finding only the top N (1 to 5) chunks that are closest to the user's query and then calling the API just once, passing those chunks as context? Sounds amazing, right?


![Embedding Learning](https://storage.googleapis.com/document-examples-llm/assets/embeddinglearning.jpg)

In [260]:
# Work on an explicit copy so that adding a column does not trigger a SettingWithCopyWarning.
pdf_data_sample_head = pdf_data_sample.head(50).copy()
# Embed each chunk and store the vector as a NumPy array.
pdf_data_sample_head['embedding'] = pdf_data_sample_head['chunks'].apply(lambda x: np.array(embedding_model_with_backoff([x])))
pdf_data_sample_head.head()


Unnamed: 0,file_name,file_type,page_number,content,chunks,embedding
0,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,"[-0.0018616351298987865, 0.007186433300375938,..."
1,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,Nasdaq Stock Market LLC Nasdaq Global Select M...,"[0.009593687951564789, 0.006503074895590544, -..."
2,20210203_alphabet_10K.pdf,pdf,1.0,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,registrant was required to submit such files Y...,"[0.02656714990735054, -0.007002761587500572, -..."
3,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,Large accelerated filer Accelerated filer Non ...,"[0.0011118471156805754, -0.0008834037580527365..."
4,20210203_alphabet_10K.pdf,pdf,2.0,Large accelerated filer Accelerated filer Non ...,2020 was approximately 849 7 billion For purpo...,"[-0.014244736172258854, -0.005289070773869753,..."


In [240]:
def get_dot_product(row, query_vector):
    return np.dot(row, query_vector)

def get_context_from_question(valid_question,
                              vector_store,
                              sort_index_value=2):
    # Embed the question with the same model used for the chunks.
    query_vector = np.array(embedding_model_with_backoff([valid_question]))
    # Score every chunk against the question and keep the indices of the top N.
    scores = vector_store["embedding"].apply(get_dot_product, args=(query_vector,))
    top_matched = scores.sort_values(ascending=False)[:sort_index_value].index
    # .loc keeps the chunks in ranked order, best match first.
    context = " ".join(vector_store.loc[top_matched, 'chunks'].values)
    return context
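
The ranking step can be seen in isolation with plain NumPy. The sketch below uses made-up three-dimensional vectors in place of real embeddings, and the helper name `top_n_chunks` is hypothetical.

```python
import numpy as np


def top_n_chunks(query_vec, chunk_vecs, chunks, n=2):
    """Return the n chunks whose embeddings score highest against the query."""
    scores = chunk_vecs @ query_vec       # one dot product per chunk
    best = np.argsort(scores)[::-1][:n]   # indices sorted by descending score
    return [chunks[i] for i in best]


# Toy vectors standing in for real embeddings (values invented for illustration).
chunks = ["office address", "revenue table", "risk factors"]
chunk_vecs = np.array([[0.9, 0.1, 0.0],
                       [0.1, 0.9, 0.1],
                       [0.0, 0.2, 0.9]])
query_vec = np.array([1.0, 0.0, 0.0])

print(top_n_chunks(query_vec, chunk_vecs, chunks, n=2))
# prints ['office address', 'revenue table']
```

With a single matrix-vector product, all chunks are scored at once, which scales better than applying a Python function row by row. For larger collections a dedicated vector search index would be the natural next step.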

In [242]:
%%time 

valid_question = "what is the address for google headquarter"

context = get_context_from_question(valid_question, 
                                    vector_store=pdf_data_sample_head,
                                    sort_index_value=1 #Top N results to pick from embedding vector search
                                   )

prompt = f""" Answer the question as precise as possible using the provided context. If the answer is
            not contained in the context, say "answer not available in context" \n\n
            Context: \n {context}?\n
            Question: \n {valid_question} \n
            Answer:
          """
print("PaLM Predicted:",text_generation_model_with_backoff(
    prompt=prompt
))


PaLM Predicted: 1600 Amphitheatre Parkway Mountain View CA 94043
CPU times: user 9.53 ms, sys: 2.63 ms, total: 12.2 ms
Wall time: 970 ms
