# Entity Extraction from Documents using Google Cloud Vertex AI and Document AI

This notebook demonstrates how to extract structured information from scanned documents using Google Cloud's Document AI (for OCR) and Vertex AI's PaLM Text model (for entity extraction).

**Author:** Jasmeet Bhatia



## Objective

In this notebook, we will use Google Cloud Vertex AI's PaLM Text model to extract entities from a scanned PDF containing deed information. The process will involve:

1. Using Document AI to convert the scanned PDF to text (OCR)
2. Applying Vertex AI's language models to extract structured information
3. Converting the extracted information to a tabular format
4. Storing the results in BigQuery

### Set up and import dependencies

In [None]:
# Install required dependencies
!pip install google-cloud-aiplatform --upgrade
!pip install google-cloud-documentai

In [None]:
# Install all dependencies from requirements.txt
!pip install --upgrade -r requirements.txt

In [None]:
# Import required libraries
from google.cloud import documentai
import vertexai
from vertexai.preview.language_models import TextGenerationModel
import pandas as pd
from IPython.display import IFrame

### Define the path to the PDF file

In [None]:
# Path to the sample deed document
file_path = './sample_data/34_Deed.pdf'

## Display and review the PDF File

In [None]:
# Display the PDF document for review
IFrame(file_path, width=800, height=700)

## Document AI OCR Processing

Use Google Cloud Document AI to perform OCR on the PDF document and extract text.

In [None]:
### Use GCP Document AI to OCR the PDF

def process_document_sample(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
    field_mask: str = None,
):
    """Process a document using Document AI.
    
    Args:
        project_id: Google Cloud Project ID
        location: Location of the Document AI processor
        processor_id: ID of the Document AI processor
        file_path: Path to the document to process
        mime_type: MIME type of the document (e.g., 'application/pdf')
        field_mask: Optional field mask (accepted but not currently applied to the request)
    
    Returns:
        The extracted text from the document
    """
    # Create Document AI client
    client = documentai.DocumentProcessorServiceClient()
    
    # Get the full resource name of the processor
    name = client.processor_path(project_id, location, processor_id)
    
    # Import the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()
    
    # Load the image content
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)
    
    # Process the document
    request = documentai.ProcessRequest(
        name=name, raw_document=raw_document
    )
    
    result = client.process_document(request=request)
    document = result.document
    
    # Return the extracted text
    return document.text

### Process PDF Document

Run the cell below for PDF files. If working with TIFF files instead of PDFs, uncomment the alternative call in the same cell.

In [None]:
# For PDF documents
ocr_output = process_document_sample(
    project_id="398507275014",
    location="us",
    processor_id="2fb6b1be15c7f2d",
    mime_type="application/pdf",
    field_mask=None,
    file_path=file_path,
)

# Uncomment and modify to process TIFF documents
# ocr_output = process_document_sample(
#     project_id="398507275014",
#     location="us",
#     processor_id="2fb6b1be15c7f2d",
#     mime_type='image/tiff',
#     field_mask=None,
#     file_path="./sample_data/sample_deed.tiff"
# )

# Preview the OCR output (first 32,000 characters)
print(ocr_output[:32000])
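
The PDF and TIFF calls in the cell above differ only in `mime_type` and `file_path`. As a minimal sketch, the MIME type could be derived from the file extension instead of being edited by hand. The helper name `guess_mime_type` and the `MIME_TYPES` mapping are assumptions made for illustration; they are not part of the original notebook or of the Document AI client.

```python
from pathlib import Path

# Hypothetical helper (not part of the original notebook): maps a file
# extension to the MIME type string passed to process_document_sample.
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def guess_mime_type(path: str) -> str:
    """Return the MIME type for a PDF or TIFF path, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        return MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
```

With this in place, the call could pass `mime_type=guess_mime_type(file_path)` so that switching between PDF and TIFF inputs only requires changing `file_path`.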

## Entity Extraction with Vertex AI

Now we'll use Vertex AI's PaLM model to extract structured information from the OCR text.

In [None]:
def predict_large_language_model_sample(
    project_id: str,
    model_name: str,
    temperature: float,
    max_decode_steps: int,
    top_p: float,
    top_k: int,
    content: str,
    location: str = "us-central1",
    tuned_model_name: str = "",
) -> str:
    """Generate predictions from a Large Language Model.
    
    Args:
        project_id: Google Cloud Project ID
        model_name: Name of the language model to use
        temperature: Sampling temperature (higher = more creative, lower = more deterministic)
        max_decode_steps: Maximum number of tokens to generate
        top_p: Nucleus sampling parameter
        top_k: Top-k sampling parameter
        content: The input content/prompt to send to the model
        location: Google Cloud region
        tuned_model_name: Optional tuned model name if using a fine-tuned model
        
    Returns:
        The model's text response
    """
    # Initialize Vertex AI
    vertexai.init(project=project_id, location=location)
    
    # Load the model
    model = TextGenerationModel.from_pretrained(model_name)
    if tuned_model_name:
        model = model.get_tuned_model(tuned_model_name)
    
    # Generate prediction
    response = model.predict(
        content,
        temperature=temperature,
        max_output_tokens=max_decode_steps,
        top_k=top_k,
        top_p=top_p,
    )
    
    print(f"Response from Model: {response.text}")
    return response.text

### Create Extraction Prompt

Define the prompt that will guide the LLM to extract specific entities from the document.

In [None]:
# Define the prompt for entity extraction
prompt_suffix = '''Give me the following information extracted from the text above in a table format:
- Name of Seller 1
- Seller 1 Type
- If seller is an LLC, then name of the officer of the LLC
- Name of Seller 2
- Name of buyer 1
- Name of buyer 2
- Name of buyer 3
- Type of ownership
- Name of Title Company
- Address of the property
- Tract Number
- Water Right Details
- Title Order number
- Document transfer tax'''

# Additional fields that could be extracted if needed:
# - List of parcels
# - Property details


### Combine OCR Text with Prompt

Merge the extracted text with our prompt to create the input for the language model.

In [None]:
# Combine OCR output with our prompt, separated by a blank line
ocr_text = ocr_output + "\n\n" + prompt_suffix

# Preview a portion of the input text
# Note: Limiting to 15K characters in the notebook. Model can handle 8K tokens (~32K characters)
print(ocr_text[5000:20000])
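
The note above estimates the model's input limit at 8K tokens, or roughly 32K characters. A long deed could exceed that once the instructions are appended. The following is a sketch of one way to stay under a character budget by trimming the OCR text and never the instructions. The function name `build_prompt` and the constant `MAX_INPUT_CHARS` are hypothetical, and the budget is the notebook's own rough estimate rather than a limit reported by the API.

```python
# Rough character budget taken from the notebook's own estimate
# (8K tokens is about 32K characters); it is an approximation, not a
# limit reported by the API.
MAX_INPUT_CHARS = 32_000


def build_prompt(ocr_output: str, prompt_suffix: str,
                 max_chars: int = MAX_INPUT_CHARS) -> str:
    """Append the instructions to the OCR text, trimming the OCR text if needed."""
    separator = "\n\n"
    budget = max_chars - len(prompt_suffix) - len(separator)
    if budget < 0:
        raise ValueError("prompt_suffix alone exceeds the character budget")
    # Slice the OCR text only; the instructions are always kept whole.
    return ocr_output[:budget] + separator + prompt_suffix
```

Trimming from the end is the simplest choice, but it assumes the entities of interest appear early in the document; for deeds where key details sit near the end, a different strategy would be needed.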

### Execute First LLM Query

Send the combined OCR text and prompt to the language model for initial entity extraction.

In [None]:
# Process the OCR text through the Vertex AI model
llm_output1 = predict_large_language_model_sample(
    project_id="jsb-alto",         # GCP Project
    model_name="text-bison@001",   # LLM Model 
    temperature=0.2,                # Temperature (lower = more deterministic)
    max_decode_steps=256,           # Max output tokens
    top_p=0.8,                      # Top-p sampling parameter
    top_k=40,                       # Top-k sampling parameter
    content=ocr_text,               # Input content
    location="us-central1"         # GCP region
)

### Format Results as a Table

Now we'll ask the model to format the extracted information into a structured table.

In [None]:
# Define the second prompt for table formatting
prompt_suffix = '''Convert the above information into table format with
Columns - (Seller_1, Seller_1_Type, Seller_1_Officer, Seller_2, Buyer_1, Buyer_2, Buyer_3, Type_of_Ownership, Title_Company, Title_order_number, Document_transfer_tax)
For blank fields put N/A'''

# Combine the first response with the formatting prompt, separated by a blank line
prompt2 = llm_output1 + "\n\n" + prompt_suffix

# Preview the combined text (up to 20,000 characters)
print(prompt2[:20000])

### Execute Second LLM Query

Send the second prompt to the language model to format the extracted data as a table.

In [None]:
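llm_output2 = predict_large_language_model_sample(
    project_id="jsb-alto",         # GCP project
    model_name="text-bison@001",   # LLM model
    temperature=0.2,               # Temperature (lower = more deterministic)
    max_decode_steps=256,          # Max output tokens
    top_p=0.8,                     # Top-p sampling parameter
    top_k=40,                      # Top-k sampling parameter
    content=prompt2,               # Second prompt (first response + formatting instructions)
    location="us-central1",        # GCP region
)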

### View Tabular Results

Display the formatted table response from the language model.



In [None]:
# Print the formatted table received from the LLM
print(llm_output2)

## Data Processing

### Convert to Pandas DataFrame

Transform the table-formatted text into a structured DataFrame for analysis.

In [None]:
# Parse the pipe-delimited output into a pandas DataFrame
import io

# Convert the string output to a DataFrame
output = pd.read_csv(io.StringIO(llm_output2), sep='|')

# Clean up the DataFrame
output = output.dropna(axis=1, how='all')   # Remove empty columns
output.columns = output.columns.str.replace(' ', '')  # Remove spaces from column names

# Display the DataFrame
output
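
Reading the model's reply with `sep='|'` works when the reply is a clean pipe-delimited table, but language models often return Markdown-style tables with a `|---|---|` separator row, padded cells, or a sentence before the table. The sketch below shows one way to pre-parse such text in plain Python. The function name `parse_pipe_table` is hypothetical, and the exact shape of the model's output is an assumption, since the notebook does not show it.

```python
def parse_pipe_table(text: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited (Markdown-style) table into a list of row dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if "|" not in line:
            continue  # skip prose the model may add around the table
        # Drop at most one leading and one trailing pipe, keeping empty cells.
        if line.startswith("|"):
            line = line[1:]
        if line.endswith("|"):
            line = line[:-1]
        cells = [cell.strip() for cell in line.split("|")]
        # Skip Markdown separator rows such as |---|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    # The first remaining row is treated as the header.
    return [dict(zip(header, row)) for row in body]
```

The result can be turned into a DataFrame with `pd.DataFrame(parse_pipe_table(llm_output2))`, which yields stripped column names and cell values without a separate clean-up step.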

## Data Storage

### Export Results to BigQuery

Store the extracted and structured data in Google BigQuery for further analysis and integration.
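
Because the table comes from a language model, its column names may not match what the load step expects, and a mismatch is likely to surface only as a load error. As a hedged sketch, the columns can be checked first. The list below is copied from the table-formatting prompt in this notebook; the names `EXPECTED_COLUMNS` and `missing_columns` are hypothetical.

```python
# Column names copied from the table-formatting prompt in this notebook.
EXPECTED_COLUMNS = [
    "Seller_1", "Seller_1_Type", "Seller_1_Officer", "Seller_2",
    "Buyer_1", "Buyer_2", "Buyer_3", "Type_of_Ownership",
    "Title_Company", "Title_order_number", "Document_transfer_tax",
]


def missing_columns(actual, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `actual`."""
    present = set(actual)
    return [name for name in expected if name not in present]
```

Calling `missing_columns(output.columns)` before the load and stopping when the result is non-empty makes a malformed model response easier to diagnose than a failed load job.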

In [None]:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# Destination table in the form "project.dataset.table".
table_id = "jsb-alto.entity_extract.deed_extract6"

dataframe = output

job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The extracted values are free text held as pandas dtype
    # "object", so these columns are declared as STRING explicitly.
    schema=[
        bigquery.SchemaField("Seller_1", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Seller_1_Type", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Seller_1_Officer", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Seller_2", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Buyer_1", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Buyer_2", bigquery.enums.SqlTypeNames.STRING),
    ],
    # BigQuery appends loaded rows to an existing table by default;
    # WRITE_TRUNCATE replaces the table with the loaded data instead.
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)

### Query the Saved Data

Verify that our data is accessible in BigQuery by running a query.

In [None]:
# Run a query to retrieve all stored records
query = f"SELECT * FROM `{table_id}`"
query_job = client.query(query)  # Make an API request.
results = query_job.result()  # Wait for the job to complete.

# Convert the results to a DataFrame
df = results.to_dataframe()

# Display the DataFrame
print(df)

In [None]:
%%bigquery
SELECT * FROM `jsb-alto.entity_extract.deed_extract6`