In [1]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Document Processing with Gemini

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fdocument-processing%2Fdocument_processing.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Run in Colab Enterprise
    </a>
  </td>       
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/document-processing/document_processing.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://goo.gle/4jhBze9">
      <img width="32px" src="https://cdn.qwiklabs.com/assets/gcp_cloud-e3a77215f0b8bfa9b3f611c0d2208c7e8708ed31.svg" alt="Google Cloud logo"><br> Open in Cloud Skills Boost
    </a>
  </td>
</table>

<div style="clear: both;"></div>



| Authors |
| --- |
| [Holt Skinner](https://github.com/holtskinner) |
| [Renato Leite](https://github.com/leiterenato) |

## Overview

In today's information-driven world, the volume of digital documents generated daily is staggering. From emails and reports to legal contracts and scientific papers, businesses and individuals alike are inundated with vast amounts of textual data. Extracting meaningful insights from these documents efficiently and accurately has become a paramount challenge.

Document processing involves a range of tasks, including text extraction, classification, summarization, and translation, among others. Traditional methods often rely on rule-based algorithms or statistical models, which may struggle with the nuances and complexities of natural language.

Generative AI offers a promising alternative: models can understand, generate, and manipulate text in response to natural language prompts. Gemini on Vertex AI makes these models available at scale through:

- [Vertex AI Studio](https://cloud.google.com/generative-ai-studio) in the Cloud Console
- [Vertex AI REST API](https://cloud.google.com/vertex-ai/docs/reference/rest)
- [Google Gen AI SDK for Python](https://cloud.google.com/vertex-ai/generative-ai/docs/sdks/overview)

For more information, see the [Generative AI on Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview) documentation.


### Objectives

In this tutorial, you will learn how to use the Gemini API in Vertex AI with the Google Gen AI SDK for Python to process PDF documents.

You will complete the following tasks:

- Install the SDK
- Use the Gemini 2.5 Flash model to:
  - Extract structured entities from an unstructured document
  - Classify document types
  - Combine classification and entity extraction into a single workflow
  - Answer questions from documents
  - Summarize documents
  - Extract table data as HTML
  - Translate documents
  - Compare and contrast similar documents
  - Identify and extract relevant pages from a PDF

### Costs

This tutorial uses billable components of Google Cloud:

- Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.


## Getting Started


### Install Google Gen AI SDK for Python


In [2]:
%pip install --upgrade --quiet google-genai pypdf

Note: you may need to restart the kernel to use updated packages.


### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [3]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Set Google Cloud project information and create client

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [1]:
import os
from google import genai

PROJECT_ID = "qwiklabs-gcp-02-89db1e616e00"  # @param {type:"string"}
LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "global")

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
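
If you would rather not hard-code the project ID, one option is to fall back to an environment variable. The sketch below is a minimal illustration, assuming a `GOOGLE_CLOUD_PROJECT` variable is set in your environment; `resolve_project_id` is a hypothetical helper, not part of the SDK.

```python
import os
from typing import Optional


def resolve_project_id(explicit: Optional[str] = None) -> str:
    """Return an explicit project ID, else fall back to GOOGLE_CLOUD_PROJECT."""
    # Assumption: GOOGLE_CLOUD_PROJECT holds the project ID in this environment.
    project_id = explicit or os.environ.get("GOOGLE_CLOUD_PROJECT", "")
    if not project_id:
        raise ValueError(
            "No project ID found: pass one explicitly or set GOOGLE_CLOUD_PROJECT."
        )
    return project_id
```

The returned value could then be passed as the `project` argument when creating the client.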


### Import libraries


In [2]:
from datetime import date
from enum import Enum
import json

from IPython.display import Markdown, display
from google.genai.types import GenerateContentConfig, Part
from pydantic import BaseModel, Field
import pypdf

PDF_MIME_TYPE = "application/pdf"
JSON_MIME_TYPE = "application/json"
ENUM_MIME_TYPE = "text/x.enum"

### Load the Gemini 2.5 Flash model

Gemini 2.5 Flash (`gemini-2.5-flash`) is a multimodal model. Prompts can include text, images, video, and documents such as PDFs, and the model returns text or code responses.

Learn more about all [Gemini models on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models).

In [3]:
MODEL_ID = "gemini-2.5-flash"  # @param {type: "string"}

## Entity Extraction

[Named Entity Extraction](https://en.wikipedia.org/wiki/Named-entity_recognition) is a natural language processing technique that identifies specific fields and values in unstructured text. For example, you can find key-value pairs in a filled-out form, or get all of the important data from an invoice, categorized by type.

### Extract entities from an invoice

In this example, you will use a sample invoice and get all of the information in a structured format.

This is the prompt to be sent to Gemini along with the PDF document. Feel free to edit this for your specific use case.

In [4]:
entity_extraction_system_instruction = """You are a document entity extraction specialist. Given a document, your task is to extract the text value of the entities provided in the schema.
- The values must only include text found in the document
- Do not normalize any entity values.
"""

We will use [Controlled generation](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output) to tell the model which fields need to be extracted.

The response schema is specified in the `response_schema` parameter in `config`, and the model output will strictly follow that schema.

You can provide the schemas as [Pydantic](https://docs.pydantic.dev/) models or a [JSON](https://www.json.org/json-en.html) string and the model will respond as JSON or an [Enum](https://docs.python.org/3/library/enum.html) depending on the value set in `response_mime_type`.
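
As a rough illustration of the non-Pydantic route, the sketch below writes a small schema by hand as a Python dictionary and serializes it to a JSON string. It assumes that `response_schema` accepts a dictionary in the OpenAPI-style form used in the controlled generation documentation; the `address_schema` name and its fields are illustrative only.

```python
import json

# Hand-written schema for a small address record.
# Assumption: `response_schema` accepts a plain dict in this OpenAPI-style form.
address_schema = {
    "type": "OBJECT",
    "properties": {
        "street": {"type": "STRING", "nullable": True},
        "city": {"type": "STRING", "nullable": True},
        "postal_code": {"type": "STRING", "nullable": True},
    },
    "required": ["street", "city", "postal_code"],
}

# Serialize to a JSON string, for example to store the schema next to a prompt.
address_schema_json = json.dumps(address_schema, indent=2)
print(address_schema_json)
```

Pydantic models are usually more convenient because the same class also validates and types the parsed response.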

In [5]:
class Address(BaseModel):
    street: str | None = Field(None, json_schema_extra={"example": "123 Main St"})
    city: str | None = Field(None, json_schema_extra={"example": "Springfield"})
    state: str | None = Field(None, json_schema_extra={"example": "IL"})
    postal_code: str | None = Field(None, json_schema_extra={"example": "62704"})
    country: str | None = Field(None, json_schema_extra={"example": "USA"})


class LineItem(BaseModel):
    amount: float = Field(..., json_schema_extra={"example": 100.00})
    description: str | None = Field(None, json_schema_extra={"example": "Laptop"})
    product_code: str | None = Field(None, json_schema_extra={"example": "LPT-001"})
    quantity: int = Field(..., json_schema_extra={"example": 2})
    unit: str | None = Field(None, json_schema_extra={"example": "pcs"})
    unit_price: float = Field(..., json_schema_extra={"example": 50.00})


class VAT(BaseModel):
    amount: float = Field(..., json_schema_extra={"example": 20.00})
    category_code: str | None = Field(None, json_schema_extra={"example": "A"})
    tax_amount: float | None = Field(None, json_schema_extra={"example": 5.00})
    tax_rate: float | None = Field(
        None, json_schema_extra={"example": 10.0}
    )  # Percentage as a float (e.g., 10 for 10%)
    total_amount: float = Field(..., json_schema_extra={"example": 200.00})


class Party(BaseModel):
    name: str = Field(..., json_schema_extra={"example": "Google"})
    street: str | None = Field(None, json_schema_extra={"example": "456 Business Rd"})
    city: str | None = Field(None, json_schema_extra={"example": "Metropolis"})
    state: str | None = Field(None, json_schema_extra={"example": "NY"})
    postal_code: str | None = Field(None, json_schema_extra={"example": "10001"})
    country: str | None = Field(None, json_schema_extra={"example": "USA"})
    email: str | None = Field(None, json_schema_extra={"example": "contact@google.com"})
    phone: str | None = Field(None, json_schema_extra={"example": "+1-555-1234"})
    website: str | None = Field(None, json_schema_extra={"example": "https://google.com"})
    tax_id: str | None = Field(None, json_schema_extra={"example": "123456789"})
    registration: str | None = Field(None, json_schema_extra={"example": "Reg-98765"})
    iban: str | None = Field(None, json_schema_extra={"example": "US1234567890123456789"})
    payment_ref: str | None = Field(None, json_schema_extra={"example": "INV-2024-001"})


class Invoice(BaseModel):
    invoice_id: str = Field(..., json_schema_extra={"example": "INV-2024-001"})
    invoice_date: str = Field(..., json_schema_extra={"example": "2024-02-03"})
    supplier: Party
    receiver: Party
    line_items: list[LineItem]
    vat: list[VAT]

For this example, we will download a PDF document to local storage and send the file bytes to the API for processing.

You can view the document [here](https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/invoice.pdf).

In [6]:
# Download a PDF from Google Cloud Storage
! gsutil cp "gs://cloud-samples-data/generative-ai/pdf/invoice.pdf" ./invoice.pdf

Copying gs://cloud-samples-data/generative-ai/pdf/invoice.pdf...
/ [1 files][340.0 KiB/340.0 KiB]                                                
Operation completed over 1 objects/340.0 KiB.                                    


In [7]:
# Load file bytes
with open("invoice.pdf", "rb") as f:
    file_bytes = f.read()

# Send to Gemini API
response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        "The following document is an invoice.",
        Part.from_bytes(data=file_bytes, mime_type=PDF_MIME_TYPE),
    ],
    config=GenerateContentConfig(
        system_instruction=entity_extraction_system_instruction,
        temperature=0,
        response_schema=Invoice,
        response_mime_type=JSON_MIME_TYPE,
    ),
)

We can load the extracted data as an object using the `response.parsed` field.

In [8]:
invoice_data = response.parsed
print("\n-------Extracted Entities--------")
print(invoice_data)


-------Extracted Entities--------
invoice_id='3222' invoice_date='02/23/2021' supplier=Party(name='AMNOSH SUPPLIERS', street='9291 Proin Road', city='Lake Charles', state='ME', postal_code='11292', country=None, email='sales@amnoshsuppliers.com', phone='123-456-7890', website='www.amnoshsuppliers.com', tax_id=None, registration=None, iban=None, payment_ref=None) receiver=Party(name='Martin Colby', street='45 Lightning Road,', city='Arizona', state='AZ', postal_code='88776', country=None, email='proprietor@abcxyz.com', phone='321-321-1234', website=None, tax_id=None, registration=None, iban=None, payment_ref=None) line_items=[LineItem(amount=490.12, description='Drag Series Transmission Build - A WD DSM', product_code=None, quantity=1, unit=None, unit_price=490.12), LineItem(amount=220.15, description='Drive Shaft Automatic Right', product_code=None, quantity=7, unit=None, unit_price=31.45), LineItem(amount=549.1, description='Multigrade Synthetic Technology Bench', product_code=None, 

Alternatively, the response text can be parsed as JSON into a Python dictionary for use in other applications.

In [9]:
json_object = json.loads(response.text)
print(json_object)

{'invoice_id': '3222', 'invoice_date': '02/23/2021', 'supplier': {'name': 'AMNOSH SUPPLIERS', 'street': '9291 Proin Road', 'city': 'Lake Charles', 'state': 'ME', 'postal_code': '11292', 'country': None, 'email': 'sales@amnoshsuppliers.com', 'phone': '123-456-7890', 'website': 'www.amnoshsuppliers.com', 'tax_id': None, 'registration': None, 'iban': None, 'payment_ref': None}, 'receiver': {'name': 'Martin Colby', 'street': '45 Lightning Road,', 'city': 'Arizona', 'state': 'AZ', 'postal_code': '88776', 'country': None, 'email': 'proprietor@abcxyz.com', 'phone': '321-321-1234', 'website': None, 'tax_id': None, 'registration': None, 'iban': None, 'payment_ref': None}, 'line_items': [{'amount': 490.12, 'description': 'Drag Series Transmission Build - A WD DSM', 'product_code': None, 'quantity': 1, 'unit': None, 'unit_price': 490.12}, {'amount': 220.15, 'description': 'Drive Shaft Automatic Right', 'product_code': None, 'quantity': 7, 'unit': None, 'unit_price': 31.45}, {'amount': 549.1, 'des

You can see that Gemini extracted all of the relevant fields from the document.
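
Extracted values are still worth checking before they reach downstream systems. The sketch below shows one possible consistency check on the dictionary form of the response: each line item's `amount` should equal `quantity` times `unit_price`. `check_line_items` is a hypothetical helper, and the sample data copies two line items from the output above.

```python
def check_line_items(invoice: dict, tolerance: float = 0.01) -> list:
    """Return a message for each line item where amount != quantity * unit_price."""
    problems = []
    for index, item in enumerate(invoice.get("line_items", [])):
        expected = item["quantity"] * item["unit_price"]
        # Compare with a tolerance to absorb floating-point rounding.
        if abs(expected - item["amount"]) > tolerance:
            problems.append(
                f"line_items[{index}]: amount {item['amount']} != "
                f"{item['quantity']} x {item['unit_price']}"
            )
    return problems


# Same shape as json.loads(response.text), reduced to the fields being checked.
sample_invoice = {
    "line_items": [
        {"amount": 490.12, "quantity": 1, "unit_price": 490.12},
        {"amount": 220.15, "quantity": 7, "unit_price": 31.45},
    ]
}
print(check_line_items(sample_invoice))  # prints []
```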

### Extract entities from a payslip

Let's try with another type of document, a payslip or paystub.

In this example, we will use a document hosted on Google Cloud Storage and process it by passing the URI.

You can view the document [here](https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/earnings_statement.pdf).

In [10]:
class Payslip(BaseModel):
    employee_id: str = Field(..., json_schema_extra={"description": "Unique identifier for the employee"})
    employee_name: str = Field(..., json_schema_extra={"description": "Full name of the employee"})
    pay_period_start: date = Field(..., json_schema_extra={"description": "Start date of the pay period"})
    pay_period_end: date = Field(..., json_schema_extra={"description": "End date of the pay period"})
    gross_income: float = Field(..., json_schema_extra={"description": "Total income before deductions"})
    federal_tax: float = Field(..., json_schema_extra={"description": "Federal tax deduction amount"})
    state_tax: float | None = Field(
        0.0, json_schema_extra={"description": "State tax deduction amount, if applicable"}
    )
    social_security: float = Field(..., json_schema_extra={"description": "Social Security deduction amount"})
    medicare: float = Field(..., json_schema_extra={"description": "Medicare deduction amount"})
    other_deductions: float | None = Field(
        0.0, json_schema_extra={"description": "Other deductions (e.g., health insurance, retirement)"}
    )
    net_income: float = Field(..., json_schema_extra={"description": "Income after all deductions"})
    payment_date: date = Field(..., json_schema_extra={"description": "Date the payment was issued"})
    hours_worked: float | None = Field(
        None, json_schema_extra={"description": "Total hours worked in the pay period"}
    )
    hourly_rate: float | None = Field(
        None, json_schema_extra={"description": "Employee's hourly rate, if applicable"}
    )

In [11]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        "The following document is a Payslip.",
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/earnings_statement.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    config=GenerateContentConfig(
        system_instruction=entity_extraction_system_instruction,
        temperature=0,
        response_schema=Payslip,
        response_mime_type=JSON_MIME_TYPE,
    ),
)

In [12]:
print("\n-------Extracted Entities--------")
print(response.parsed)


-------Extracted Entities--------
employee_id='123456' employee_name='Janet Doe' pay_period_start=datetime.date(1110, 12, 17) pay_period_end=datetime.date(1212, 12, 17) gross_income=1600.0 federal_tax=179.2 state_tax=80.0 social_security=99.2 medicare=20.8 other_deductions=160.0 net_income=1060.8 payment_date=datetime.date(1215, 12, 17) hours_worked=80.0 hourly_rate=20.0
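
The years in the dates above (1110, 1212, 1215) look implausible for a payslip, which suggests validating extracted values before using them. The sketch below is one possible check, not part of the SDK: `check_payslip` is a hypothetical helper, the `min_year` cutoff is an arbitrary assumption, and the sample values are copied from the output above.

```python
from datetime import date

DATE_FIELDS = ("pay_period_start", "pay_period_end", "payment_date")
DEDUCTION_FIELDS = (
    "federal_tax",
    "state_tax",
    "social_security",
    "medicare",
    "other_deductions",
)


def check_payslip(payslip: dict, min_year: int = 1990, tolerance: float = 0.01) -> list:
    """Return human-readable problems found in an extracted payslip dict."""
    problems = []

    # Dates should be in chronological order.
    start, end, paid = (payslip[name] for name in DATE_FIELDS)
    if not start <= end <= paid:
        problems.append("dates are not in chronological order")

    # Years should fall in a plausible range (min_year is an arbitrary cutoff).
    for name in DATE_FIELDS:
        year = payslip[name].year
        if not min_year <= year <= date.today().year + 1:
            problems.append(f"{name} has an implausible year: {year}")

    # Net income should equal gross income minus all deductions.
    deductions = sum(payslip.get(name) or 0.0 for name in DEDUCTION_FIELDS)
    if abs(payslip["gross_income"] - deductions - payslip["net_income"]) > tolerance:
        problems.append("net_income does not equal gross_income minus deductions")

    return problems


sample_payslip = {
    "pay_period_start": date(1110, 12, 17),
    "pay_period_end": date(1212, 12, 17),
    "payment_date": date(1215, 12, 17),
    "gross_income": 1600.0,
    "federal_tax": 179.2,
    "state_tax": 80.0,
    "social_security": 99.2,
    "medicare": 20.8,
    "other_deductions": 160.0,
    "net_income": 1060.8,
}
for problem in check_payslip(sample_payslip):
    print(problem)
```

With these values the amounts reconcile, and only the three date fields are flagged for their years.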


## Document Classification

Document classification is the process of identifying the type of a document, for example an invoice, a W-2, or a receipt.

In this example, you will use a [sample tax form (W-9)](https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/w9.pdf) and get the specific type of document from a specified `Enum`.

In [13]:
classification_prompt = """You are a document classification specialist. Given a document, your task is to find which category the document belongs to from the document categories provided in the schema."""


class DocumentCategory(Enum):
    TAX_1040_2019 = "1040_2019"
    TAX_1040_2020 = "1040_2020"
    TAX_1099_R = "1099-r"
    BANK_STATEMENT = "bank_statement"
    CREDIT_CARD_STATEMENT = "credit_card_statement"
    EXPENSE = "expense"
    TAX_1120S_2019 = "form_1120S_2019"
    TAX_1120S_2020 = "form_1120S_2020"
    INVESTMENT_RETIREMENT_STATEMENT = "investment_retirement_statement"
    INVOICE = "invoice"
    PAYSTUB = "paystub"
    PROPERTY_INSURANCE = "property_insurance"
    PURCHASE_ORDER = "purchase_order"
    UTILITY_STATEMENT = "utility_statement"
    W2 = "w2"
    W9 = "w9"
    DRIVER_LICENSE = "driver_license"

In [14]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        "Classify the following document.",
        Part.from_uri(
            file_uri="https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/w9.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    config=GenerateContentConfig(
        system_instruction=classification_prompt,
        temperature=0,
        response_schema=DocumentCategory,
        response_mime_type=ENUM_MIME_TYPE,
    ),
)

In [15]:
print("\n-------Document Classification--------")
print(response.text)
print(response.parsed)


-------Document Classification--------
w9
DocumentCategory.W9


You can see that Gemini successfully categorized the document.
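
The two printed values are related in the usual Python way: the raw text is an enum *value*, and looking it up yields the enum *member*. The sketch below illustrates that mapping with a shortened, separately named enum so it does not overwrite `DocumentCategory`; `SampleCategory` and `to_category` are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Optional


class SampleCategory(Enum):
    # Shortened stand-in for the full category enum, to keep the sketch self-contained.
    INVOICE = "invoice"
    W2 = "w2"
    W9 = "w9"


def to_category(label: str) -> Optional[SampleCategory]:
    """Map a raw label to an enum member, or None if it is not a known value."""
    try:
        return SampleCategory(label.strip())
    except ValueError:
        return None


print(to_category("w9"))       # SampleCategory.W9
print(to_category("receipt"))  # None
```

Returning `None` for unknown labels gives calling code a simple way to skip documents that fall outside the expected categories.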

### Chaining Classification and Extraction

These techniques can also be chained together to extract entities from any number of document types.

For example, if you have multiple types of documents to process, you can send each document to Gemini with a classification prompt, then based on that output, you can write logic to decide which extraction prompt to use.

These are the sample documents:

- [US Driver License](https://storage.googleapis.com/cloud-samples-data/documentai/SampleDocuments/US_DRIVER_LICENSE_PROCESSOR/dl3.pdf)
- [Invoice](https://storage.googleapis.com/cloud-samples-data/documentai/SampleDocuments/INVOICE_PROCESSOR/google_invoice.pdf)
- [Form W-2](https://storage.googleapis.com/cloud-samples-data/documentai/SampleDocuments/FORM_W2_PROCESSOR/2020FormW-2.pdf)

In [16]:
class W2Form(BaseModel):
    control_number: str | None = Field(None)
    ein: str = Field(...)

    employee_first_name: str = Field(...)
    employee_last_name: str = Field(...)
    employee_address_street: str = Field(...)
    employee_address_city: str = Field(...)
    employee_address_state: str = Field(...)
    employee_address_zip: str = Field(...)

    employer_name: str = Field(...)
    employer_address_street: str = Field(...)
    employer_address_city: str = Field(...)
    employer_address_state: str = Field(...)
    employer_address_zip: str = Field(...)
    employer_state_id_number: str | None = Field(None)

    wages_tips_other_compensation: float = Field(...)
    federal_income_tax_withheld: float = Field(...)
    social_security_wages: float = Field(...)
    social_security_tax_withheld: float = Field(...)
    medicare_wages_and_tips: float = Field(...)
    medicare_tax_withheld: float = Field(...)

    state: str | None = Field(None)
    state_wages_tips_etc: float | None = Field(None)
    state_income_tax: float | None = Field(None)

    box_12_code: str | None = Field(None)
    box_12_value: str | None = Field(None)

    form_year: int = Field(...)


class DriversLicense(BaseModel):
    address: str = Field(
        ..., json_schema_extra={"title": "Address", "description": "The address of the individual."}
    )
    date_of_birth: date = Field(
        ..., json_schema_extra={"title": "Date of Birth", "description": "The birthdate of the individual."}
    )
    document_id: str = Field(
        ...,
        json_schema_extra={"title": "Document ID", "description": "The unique document ID for the driver's license."},
    )
    expiration_date: date = Field(
        ...,
        json_schema_extra={"title": "Expiration Date", "description": "The expiration date of the driver's license."},
    )
    family_name: str = Field(
        ...,
        json_schema_extra={"title": "Family Name", "description": "The family name (last name) of the individual."},
    )
    given_names: str = Field(
        ...,
        json_schema_extra={"title": "Given Names", "description": "The given names (first and middle names) of the individual."},
    )
    issue_date: date = Field(
        ..., json_schema_extra={"title": "Issue Date", "description": "The issue date of the driver's license."}
    )

# Map classification types to schemas
classification_to_schema = {
    DocumentCategory.INVOICE: Invoice,
    DocumentCategory.W2: W2Form,
    DocumentCategory.DRIVER_LICENSE: DriversLicense,
}

In [17]:
gcs_uris = [
    "gs://cloud-samples-data/documentai/SampleDocuments/US_DRIVER_LICENSE_PROCESSOR/dl3.pdf",
    "gs://cloud-samples-data/documentai/SampleDocuments/INVOICE_PROCESSOR/google_invoice.pdf",
    "gs://cloud-samples-data/documentai/SampleDocuments/FORM_W2_PROCESSOR/2020FormW-2.pdf",
]

for gcs_uri in gcs_uris:
    print(f"\nFile: {gcs_uri}\n")

    # Send to Gemini with Classification Prompt
    classification_response = client.models.generate_content(
        model=MODEL_ID,
        contents=[
            "Classify the following document.",
            Part.from_uri(file_uri=gcs_uri, mime_type=PDF_MIME_TYPE),
        ],
        config=GenerateContentConfig(
            system_instruction=classification_prompt,
            temperature=0,
            response_schema=DocumentCategory,
            response_mime_type=ENUM_MIME_TYPE,
        ),
    )

    print(f"Document Classification: {classification_response.text}")

    # Get Extraction schema based on Classification
    extraction_schema = classification_to_schema.get(classification_response.parsed)

    if not extraction_schema:
        print("Document does not belong to a specified class. Skipping extraction.")
        continue

    # Send to Gemini with Extraction Prompt
    extraction_response = client.models.generate_content(
        model=MODEL_ID,
        contents=[
            f"Extract the entities from the following {classification_response.text} document.",
            Part.from_uri(file_uri=gcs_uri, mime_type=PDF_MIME_TYPE),
        ],
        config=GenerateContentConfig(
            system_instruction=entity_extraction_system_instruction,
            temperature=0,
            response_schema=extraction_schema,
            response_mime_type=JSON_MIME_TYPE,
        ),
    )

    print("\n-------Extracted Entities--------")
    print(extraction_response.parsed)


File: gs://cloud-samples-data/documentai/SampleDocuments/US_DRIVER_LICENSE_PROCESSOR/dl3.pdf

Document Classification: driver_license

-------Extracted Entities--------
address='123 MAIN STREET HELENA, MT 59601' date_of_birth=datetime.date(1968, 8, 4) document_id='0812319684104' expiration_date=datetime.date(2023, 8, 4) family_name='SAMPLE' given_names='BRENDA LYNN' issue_date=datetime.date(2015, 2, 15)

File: gs://cloud-samples-data/documentai/SampleDocuments/INVOICE_PROCESSOR/google_invoice.pdf

Document Classification: invoice

-------Extracted Entities--------
invoice_id='23413561D' invoice_date='Sep 24, 2019' supplier=Party(name='Google', street=None, city=None, state=None, postal_code=None, country=None, email=None, phone=None, website=None, tax_id=None, registration=None, iban=None, payment_ref=None) receiver=Party(name='Jane Smith', street='1600 Amphitheatre Pkway', city='Mountain View', state='CA', postal_code='94043', country=None, email=None, phone=None, website=None, tax_i
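
The classify-then-extract loop above can also be factored into a small routing function, which makes the dispatch logic testable without calling the API. The sketch below is one possible shape, not the notebook's implementation: `process_document` is a hypothetical helper, and the `fake_*` functions and `gs://example-bucket` URIs are stand-ins for the two Gemini calls and real files.

```python
from typing import Any, Callable, Mapping, Optional


def process_document(
    uri: str,
    classify: Callable[[str], str],
    extractors: Mapping[str, Callable[[str], Any]],
) -> Optional[Any]:
    """Classify a document, then run the extractor registered for its class."""
    label = classify(uri)
    extractor = extractors.get(label)
    if extractor is None:
        # No extractor registered for this class, so skip extraction.
        return None
    return extractor(uri)


# Stand-ins for the classification and extraction requests (assumptions for
# illustration only), so the routing can be tried offline.
def fake_classify(uri: str) -> str:
    return "invoice" if "invoice" in uri.lower() else "unknown"


def fake_extract_invoice(uri: str) -> dict:
    return {"source": uri, "invoice_id": "placeholder"}


extractors = {"invoice": fake_extract_invoice}
print(process_document("gs://example-bucket/invoice.pdf", fake_classify, extractors))
print(process_document("gs://example-bucket/memo.pdf", fake_classify, extractors))  # None
```

In real use, `classify` and each extractor would wrap the corresponding `generate_content` call with the appropriate schema.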

## Document Question Answering

Gemini can be used to answer questions about a document.

This example answers a question about the Transformer paper ["Attention Is All You Need"](https://arxiv.org/pdf/1706.03762). The paper is also available on [arXiv](https://arxiv.org); the code below loads a copy of the PDF hosted on Google Cloud Storage.

In [18]:
qa_system_instruction = "You are a question answering specialist. Given a question and a context, your task is to provide the answer to the question based on the context provided. Give the answer first, followed by an explanation."

In [19]:
# Send Q&A Prompt to Gemini
response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        "What is the attention mechanism?",
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/1706.03762v7.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    config=GenerateContentConfig(
        system_instruction=qa_system_instruction,
        temperature=0,
        response_mime_type="text/plain",
    ),
)

print(f"Answer: {response.text}")

Answer: The attention mechanism is a function that maps a query and a set of key-value pairs to an output. In this process, the query, keys, values, and the resulting output are all vectors. The output is calculated as a weighted sum of the values, where the weight assigned to each value is determined by a compatibility function between the query and its corresponding key.

**Explanation:**
Attention mechanisms are a crucial component in sequence modeling and transduction models, enabling them to model dependencies regardless of the distance between elements in the input or output sequences. While often used with recurrent networks, the Transformer model, introduced in this paper, relies solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

The paper describes two main types of attention:
1.  **Scaled Dot-Product Attention:** This specific attention mechanism computes the dot products of a query with all keys, divides each by the square root of the key 
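
Because the system instruction asks for the answer first and an explanation second, the two parts can be separated after the fact. The sketch below assumes the response contains the `**Explanation:**` marker seen in the output above; the model is not guaranteed to emit that exact marker, so `split_answer` (a hypothetical helper) falls back to returning the whole text as the answer.

```python
from typing import Tuple


def split_answer(text: str, marker: str = "**Explanation:**") -> Tuple[str, str]:
    """Split a response into (answer, explanation) around the marker."""
    answer, found, explanation = text.partition(marker)
    if not found:
        # Marker missing: treat the whole response as the answer.
        return text.strip(), ""
    return answer.strip(), explanation.strip()


# Shortened sample text in the same layout as the response above.
sample_text = (
    "The attention mechanism maps a query and key-value pairs to an output.\n\n"
    "**Explanation:**\n"
    "The output is a weighted sum of the values."
)
answer, explanation = split_answer(sample_text)
print(answer)
print(explanation)
```

For stricter guarantees, a response schema with separate `answer` and `explanation` fields would be a more reliable option than text splitting.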

## Document Summarization

Gemini can also be used to summarize or paraphrase a document's contents. Your prompt can specify how detailed the summary should be or specific formatting, such as bullet points or paragraphs.
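
One way to vary detail and formatting is to build the system instruction from parameters. The sketch below is a minimal illustration; `build_summary_instruction` and its allowed option values are hypothetical choices, not part of the SDK.

```python
ALLOWED_DETAIL = {"brief", "detailed"}
ALLOWED_LAYOUT = {"paragraphs", "bullet points"}


def build_summary_instruction(detail: str = "detailed", layout: str = "paragraphs") -> str:
    """Compose a summarization system instruction from detail and layout options."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"detail must be one of {sorted(ALLOWED_DETAIL)}")
    if layout not in ALLOWED_LAYOUT:
        raise ValueError(f"layout must be one of {sorted(ALLOWED_LAYOUT)}")
    return (
        "You are a professional document summarization specialist. "
        f"Given a document, provide a {detail} summary of its content, "
        f"formatted as {layout}. "
        "Do not include any numbers that are not mentioned in the document."
    )


print(build_summary_instruction(detail="brief", layout="bullet points"))
```

The resulting string can be passed as `system_instruction` in the request configuration.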

In [20]:
summarization_system_instruction = """You are a professional document summarization specialist. Given a document, your task is to provide a detailed summary of the content of the document.

If it includes images, provide descriptions of the images.
If it includes tables, extract all elements of the tables.
If it includes graphs, explain the findings in the graphs.
Do not include any numbers that are not mentioned in the document.
"""

In [21]:
# Send Summarization Prompt to Gemini
response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        "Summarize the following document.",
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/fdic_board_meeting.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    config=GenerateContentConfig(
        system_instruction=summarization_system_instruction,
        temperature=0,
        response_mime_type="text/plain",
    ),
)

display(Markdown("### Document Summary"))
display(Markdown(response.text))

### Document Summary

This document is a statement by FDIC Chairman Jelena McWilliams on the Notice of Proposed Rulemaking (NPRM) regarding revisions to the Community Reinvestment Act (CRA) Regulations, delivered at the FDIC Board Meeting on December 12, 2019.

**Summary of the Statement:**

Chairman McWilliams emphasizes the need for banking regulations to evolve with the industry to ensure an effective system that serves businesses and consumers nationwide. The proposed modernization of CRA regulations aims to promote greater investments in communities that need them most, particularly low- and moderate-income (LMI) areas.

She notes that while discussions about statutes and regulations can often seem technical, the CRA directly impacts the lives of ordinary Americans. The beneficial impact of CRA regulations can diminish when they become out of sync with technological and business changes, especially concerning how banks offer services. The core objective of CRA, which encourages banks to meet the credit needs of their chartered communities, including LMI areas, remains critical. However, due to transformative changes in the banking industry, such as digital banking, the regulations, last significantly revised in 1995, must be updated.

The proposed rulemaking seeks to modernize these regulations by preserving effective aspects, updating outdated components, and providing clarity to financial institutions. The ultimate goal is to increase LMI lending and benefit LMI communities across the nation.

The proposal aims to achieve this goal in several ways:
*   **Encouraging long-term commitments:** Banks would be encouraged to provide greater credit for retail loans retained on-balance sheet in LMI communities.
*   **Increasing loan size for small businesses and farms:** The size of qualifying loans to small businesses and small farms would increase to $2 million to encourage economic development, job creation, and support for family farms.
*   **Providing CRA credit for Indian Country:** Retail and community development activities in Indian Country would receive CRA credit.
*   **Expanding qualifying activities:** Activities that qualify for CRA credit would include capital investments and loan participations undertaken by banks in cooperation with Community Development Financial Institutions (CDFIs), regardless of the CDFI's location.

Additionally, the proposal would clarify qualifying activities by:
1.  Requiring the FDIC and OCC to periodically publish a list of illustrative examples of qualifying activities and establish a process for stakeholders to seek agency determination of a qualifying activity.
2.  Establishing new performance standards to assess:
    *   The distribution of qualifying retail loan originations to LMI individuals, and to small farms and small businesses in an assessment area.
    *   The quantified value of the bank's qualifying activities relative to its assessment area and bank-level retail deposits. These components would be compared to specific benchmarks and thresholds set before a bank's evaluation period, incentivizing CRA activity and allowing banks to plan without uncertainty.

The proposal also recognizes the evolution of the banking system, including digital banks, by requiring banks to add assessment areas where they have significant concentrations of retail domestic deposits. This ensures that banks meet credit needs where they collect deposits, even if outside their traditional assessment areas. Existing provisions for banks to delineate assessment areas based on their main office, branches, and deposit-taking facilities, and to include surrounding geographies where they originate or purchase a substantial portion of their loans, would remain intact.

To avoid overburdening small banks (those with $500 million or less in total assets), the proposal would allow them to choose between being evaluated under the current rules or opting into the new performance standards.

Chairman McWilliams concludes by emphasizing the importance of robust public comment and stakeholder feedback to improve the proposal. She acknowledges the hard work, collaboration, and coordination among the FDIC, OCC, and Federal Reserve Board. While the Federal Reserve Board has not joined the proposal at this time, she appreciates their engagement and recommendations, many of which are reflected in the proposal, and looks forward to continued collaboration. She expresses her support for the proposed rule.

## Table parsing from documents

Gemini can parse the contents of a table and return them in a structured format, such as HTML or Markdown.
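When the table comes back as HTML, downstream code usually needs it as rows of cell values. The following is a minimal sketch of that step using only the standard library. The names `TableRowParser` and `html_table_to_rows` are illustrative, and `sample_html` is a made-up table, not the content of the PDF used in this notebook.

```python
from html.parser import HTMLParser
from typing import Optional


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._row: Optional[list[str]] = None
        self._cell: Optional[list[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace, markdown fences) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html: str) -> list[list[str]]:
    """Return the table cells in `html` as a list of rows."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample for illustration only (not taken from the PDF).
sample_html = """
<table>
  <tr><th>Role</th><th>Salary</th></tr>
  <tr><td>Analyst</td><td>50,000</td></tr>
</table>
"""
print(html_table_to_rows(sample_html))
```

Because text outside table cells is skipped, the same helper should tolerate a response that wraps the HTML in a markdown code fence. Nested tables are not handled by this sketch.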

In [22]:
table_extraction_prompt = """What is the HTML code of the table in this document?"""

In [23]:
# Send Table Extraction Prompt to Gemini
response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        table_extraction_prompt,
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/salary_table.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    config=GenerateContentConfig(temperature=0),
)

display(Markdown(response.text))

ClientError: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.', 'status': 'RESOURCE_EXHAUSTED'}}
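The `429 RESOURCE_EXHAUSTED` error above is typically transient, so one common way to handle it is to retry the call with exponential backoff. The sketch below is generic and not part of the SDK: `call_with_backoff` and `is_rate_limited` are hypothetical helpers, and `is_rate_limited` assumes the raised exception exposes the HTTP status as a numeric `code` attribute.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying retryable failures with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on errors that retrying will not fix.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            # Wait base, 2*base, 4*base, ... plus random jitter of up to base seconds.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("unreachable")


def is_rate_limited(exc: Exception) -> bool:
    """Assumption: the exception carries the HTTP status in a `code` attribute."""
    return getattr(exc, "code", None) == 429
```

Under these assumptions, the request in the cell above could be wrapped as `call_with_backoff(lambda: client.models.generate_content(...), is_rate_limited)`. If the error persists after several attempts, the quota itself may need to be raised, as described on the page linked in the error message.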

## Document Translation

Gemini can translate documents between languages. This example translates the first paragraph of an FDIC board meeting statement from English into French and Spanish.

In [24]:
translation_prompt = """Translate the first paragraph into French and Spanish. Label each paragraph with the target language."""

In [25]:
# Send Translation Prompt to Gemini
response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        translation_prompt,
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/fdic_board_meeting.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    config=GenerateContentConfig(
        temperature=0,
    ),
)

display(Markdown("### Translations"))
display(Markdown(response.text))

### Translations

**French**
Le secteur bancaire a considérablement évolué ces dernières années, et la réglementation doit évoluer avec l'industrie afin de favoriser un système efficace qui réponde aux besoins des entreprises et des consommateurs à travers le pays. En modernisant nos réglementations d'application de la loi sur le réinvestissement communautaire (CRA), nous espérons promouvoir des investissements accrus dans les communautés qui en ont le plus besoin.

**Spanish**
El negocio bancario ha cambiado drásticamente en los últimos años, y las regulaciones deben evolucionar con la industria para fomentar un sistema eficaz que satisfaga las necesidades de las empresas y los consumidores en todo el país. Al modernizar nuestras regulaciones que implementan la Ley de Reinversión Comunitaria (CRA), esperamos promover mayores inversiones en las comunidades que más las necesitan.

## Document Comparison

Gemini can compare and contrast the contents of multiple documents. This example compares the 2013 and 2023 versions of IRS Form 1040 to find how the standard deduction changed.

Note: When working with multiple documents, their order can matter, so state the order explicitly in your prompt.
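One possible way to make the order explicit, beyond describing it in the prompt, is to put a short text label before each document part. The sketch below builds such a `contents` list. The helper `label_documents` and the `name` attribute on the `<Document>` wrapper are illustrative choices, not an SDK feature, and the placeholder strings stand in for the `Part` objects that would be passed in practice.

```python
from typing import Any


def label_documents(prompt: str, documents: list[tuple[str, Any]]) -> list[Any]:
    """Return a contents list in which each document part is wrapped in a labeled tag."""
    contents: list[Any] = []
    for label, part in documents:
        contents.append(f'<Document name="{label}">')
        contents.append(part)
        contents.append("</Document>")
    # The question goes last, after all documents.
    contents.append(prompt)
    return contents


# Placeholder strings stand in for Part objects (assumption for illustration).
contents = label_documents(
    "How did the standard deduction evolve between the 2013 and 2023 documents?",
    [("2013", "<part for the 2013 PDF>"), ("2023", "<part for the 2023 PDF>")],
)
for item in contents:
    print(item)
```

With labels attached to each part, the prompt can refer to documents by name rather than by position.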

In [26]:
comparison_prompt = """The first document is from 2013, the second one from 2023. How did the standard deduction evolve?"""

In [27]:
# Send Comparison Prompt to Gemini
response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        comparison_prompt,
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/form_1040_2013.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/form_1040_2023.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    config=GenerateContentConfig(temperature=0),
)

display(Markdown("### Comparison"))
display(Markdown(response.text))

### Comparison

The standard deduction has significantly increased between 2013 and 2023, and the structure of how deductions are calculated has also changed.

Here's a breakdown of the evolution:

**2013 Standard Deduction (from Form 1040, page 2):**

*   **Single or Married filing separately:** $6,100
*   **Married filing jointly or Qualifying widow(er):** $12,200
*   **Head of household:** $8,950
*   **Note:** In 2013, taxpayers also claimed **personal exemptions**. For example, line 42 on the 2013 Form 1040 shows an exemption amount of $3,900 per person (for incomes up to $150,000). This was *in addition* to the standard deduction or itemized deductions.

**2023 Standard Deduction (from Form 1040, page 1):**

*   **Single or Married filing separately:** $13,850
*   **Married filing jointly or Qualifying surviving spouse:** $27,700
*   **Head of household:** $20,800
*   **Note:** The concept of **personal exemptions was eliminated** by the Tax Cuts and Jobs Act of 2017 (TCJA), which went into effect for the 2018 tax year. The increased standard deduction amounts were intended to largely offset the loss of personal exemptions for many taxpayers.

**Summary of Evolution:**

1.  **Significant Increase in Amounts:** The standard deduction amounts have roughly doubled (or more) across all filing statuses from 2013 to 2023.
    *   Single: $6,100 -> $13,850 (increase of $7,750)
    *   MFJ: $12,200 -> $27,700 (increase of $15,500)
    *   HOH: $8,950 -> $20,800 (increase of $11,850)
2.  **Elimination of Personal Exemptions:** The 2013 tax form included a deduction for personal exemptions (e.g., $3,900 per person). The 2023 form does not have this deduction. The higher standard deduction amounts in 2023 are partly a result of this change, aiming to simplify tax filing and provide a larger deduction for non-itemizers.
3.  **Simplified Deduction Calculation:** For many taxpayers, the significantly higher standard deduction means they no longer need to itemize deductions, simplifying their tax preparation.
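Because model output can contain arithmetic slips, the differences reported above are worth recomputing. The short check below uses the dollar amounts exactly as quoted in the model output (they are not independently verified against the forms here). The function name `deduction_changes` is illustrative.

```python
# (2013 amount, 2023 amount) per filing status, as quoted in the model output above.
standard_deduction = {
    "Single": (6_100, 13_850),
    "MFJ": (12_200, 27_700),
    "HOH": (8_950, 20_800),
}


def deduction_changes(amounts: dict[str, tuple[int, int]]) -> dict[str, dict[str, float]]:
    """Compute the absolute increase and the new/old ratio for each filing status."""
    return {
        status: {"increase": new - old, "ratio": round(new / old, 2)}
        for status, (old, new) in amounts.items()
    }


changes = deduction_changes(standard_deduction)
for status, change in changes.items():
    print(f"{status}: +${change['increase']:,} (x{change['ratio']})")
```

The recomputed increases ($7,750, $15,500, and $11,850) match the figures in the summary, and each ratio is above 2, which is consistent with the "roughly doubled (or more)" claim.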

## Document page extraction

This example uses Gemini to identify the pages relevant to a question, then creates a new, focused PDF that contains only those pages.

In [None]:
PROMPT_PAGES = """
Return the numbers of all pages in the document above that contain information related to the question below.
<Instructions>
 - Use the document above as your only source of information to determine which pages are related to the question below.
 - Return the page numbers of the document above that are related to the question. When in doubt, return the page anyway.
 - The page numbers should be in the format of a list of integers, e.g. [1, 2, 3].
</Instructions>
<Suggestions>
 - The document above is a financial report with various tables, charts, infographics, lists, and additional text information.
 - Pay close attention to the chart legends and chart colors to determine the pages. Colors may indicate which information is important for determining the pages.
 - The color of the chart legends represents the color of the bars in the chart.
 - Use ONLY this document as context to determine the pages.
 - In most cases, the page number can be found in the footer.
</Suggestions>
<Question>
{question}
</Question>
"""


def pdf_slice(input_file: str, output_file: str, pages: list[int]) -> None:
    """Using an input pdf file name and a list of page numbers,
    writes a new pdf file containing only those pages.
    """
    pdf_reader = pypdf.PdfReader(input_file)
    pdf_writer = pypdf.PdfWriter()
    for page_num in pages:
        if 1 <= page_num <= len(pdf_reader.pages):
            pdf_writer.add_page(pdf_reader.pages[page_num - 1])
    pdf_writer.write(output_file)

Specify your question and the URL of your PDF.

In [None]:
question = "From the Consolidated Balance Sheet, what was the difference between the total assets from 2022 to 2023?"  # @param {type: "string"}
pdf_path = "https://storage.googleapis.com/github-repo/generative-ai/gemini/use-cases/document-processing/CymbalBankFinancialStatements.pdf"  # @param {type: "string"}
local_pdf = os.path.basename(pdf_path)

Ask Gemini for the relevant page numbers and print them.

In [None]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        "<Document>",
        Part.from_uri(file_uri=pdf_path, mime_type=PDF_MIME_TYPE),
        "</Document>",
        PROMPT_PAGES.format(question=question),
    ],
    config=GenerateContentConfig(
        temperature=0,
        response_mime_type=JSON_MIME_TYPE,
        response_schema=list[int],
    ),
)
pages = response.parsed
print(pages)
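The page list comes from the model, so it may be worth validating before use. The sketch below parses a JSON array of page numbers from raw response text and keeps only unique, positive integers. `parse_page_numbers` is a hypothetical helper. It could serve as a fallback on `response.text` if `response.parsed` turns out to be empty, which is an assumption about how a failed parse surfaces.

```python
import json


def parse_page_numbers(raw_text: str) -> list[int]:
    """Parse a JSON array of page numbers into a sorted list of unique positive ints."""
    data = json.loads(raw_text)
    if not isinstance(data, list):
        raise ValueError(f"Expected a JSON array, got {type(data).__name__}")
    pages: set[int] = set()
    for item in data:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(item, bool) or not isinstance(item, int):
            raise ValueError(f"Expected an integer page number, got {item!r}")
        # Page numbers are 1-based; drop zero and negative values.
        if item >= 1:
            pages.add(item)
    return sorted(pages)


print(parse_page_numbers("[3, 1, 3, 0]"))
```

Sorting and de-duplicating here keeps the later slicing step deterministic even if the model repeats a page.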

Download the PDF file to local storage.

In [None]:
!wget {pdf_path} -O {local_pdf}

To make it more likely that the sliced PDF contains the answer, also include the page immediately after each selected page.

In [None]:
expanded_pages = set(pages).union(page + 1 for page in pages)
pdf_slice(input_file=local_pdf, output_file="sample.pdf", pages=sorted(expanded_pages))
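The expansion step above can also be written as a small reusable helper that makes the window size configurable and clamps the result to the document length. This is a sketch only: `expand_pages` and its `extra_after` parameter are illustrative names, and `total_pages` would come from the PDF itself (for example, the number of pages reported by the reader used in `pdf_slice`).

```python
def expand_pages(pages: list[int], total_pages: int, extra_after: int = 1) -> list[int]:
    """Return each page plus the next `extra_after` pages, limited to 1..total_pages."""
    expanded: set[int] = set()
    for page in pages:
        for offset in range(extra_after + 1):
            candidate = page + offset
            if 1 <= candidate <= total_pages:
                expanded.add(candidate)
    return sorted(expanded)


# Page 5 is the last page, so no page 6 is added.
print(expand_pages([2, 5], total_pages=5))
```

Clamping here is optional, since `pdf_slice` already skips out-of-range pages, but it keeps the printed page list consistent with what ends up in the output file.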
