In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Document Processing with Gemini

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fdocument-processing%2Fdocument_processing.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Run in Colab Enterprise
    </a>
  </td>       
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/document-processing/document_processing.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://goo.gle/4jhBze9">
      <img width="32px" src="https://cdn.qwiklabs.com/assets/gcp_cloud-e3a77215f0b8bfa9b3f611c0d2208c7e8708ed31.svg" alt="Google Cloud logo"><br> Open in Cloud Skills Boost
    </a>
  </td>
</table>

<div style="clear: both;"></div>

<b>Share to:</b>

<a href="https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg" alt="LinkedIn logo">
</a>

<a href="https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg" alt="Bluesky logo">
</a>

<a href="https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg" alt="X logo">
</a>

<a href="https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb" target="_blank">
  <img width="20px" src="https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" alt="Reddit logo">
</a>

<a href="https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg" alt="Facebook logo">
</a>            


| | |
|-|-|
|Author(s) | [Holt Skinner](https://github.com/holtskinner), [Drew Gillson](https://github.com/drewgillson) |

## Overview

In today's information-driven world, the volume of digital documents generated daily is staggering. From emails and reports to legal contracts and scientific papers, businesses and individuals alike are inundated with vast amounts of textual data. Extracting meaningful insights from these documents efficiently and accurately has become a paramount challenge.

Document processing involves a range of tasks, including text extraction, classification, summarization, and translation, among others. Traditional methods often rely on rule-based algorithms or statistical models, which may struggle with the nuances and complexities of natural language.

Generative AI offers a promising alternative: models that understand, generate, and manipulate text through natural language prompts. Gemini on Vertex AI makes these models available at scale through:

- [Vertex AI Studio](https://cloud.google.com/generative-ai-studio) in the Cloud Console
- [Vertex AI REST API](https://cloud.google.com/vertex-ai/docs/reference/rest)
- [Google Gen AI SDK for Python](https://cloud.google.com/vertex-ai/generative-ai/docs/sdks/overview)

For more information, see the [Generative AI on Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview) documentation.


### Objectives

In this tutorial, you will learn how to use the Gemini API in Vertex AI with the Google Gen AI SDK for Python to process PDF documents.

You will complete the following tasks:

- Install the SDK
- Use the Gemini 2.0 Flash model to:
  - Extract structured entities from an unstructured document
  - Classify document types
  - Combine classification and entity extraction into a single workflow
  - Answer questions from documents
  - Summarize documents
  - Extract table data as HTML
  - Translate documents
  - Compare and contrast similar documents


### Costs

This tutorial uses billable components of Google Cloud:

- Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.


## Getting Started


### Install Google Gen AI SDK for Python


In [None]:
%pip install --upgrade --user --quiet google-genai

### Dependency Management for Projects

While using `%pip install` is convenient for interactive notebooks, standalone Python projects benefit from more structured dependency management. This ensures that your project can be reliably set up in different environments and by other collaborators.

**1. `requirements.txt`:**
   - A common way to manage dependencies is by listing them in a `requirements.txt` file.
   - You can generate a `requirements.txt` file that captures all packages in your current environment using:
     ```bash
     pip freeze > requirements.txt
     ```
   - To install dependencies from this file in a new environment, you would run:
     ```bash
     pip install -r requirements.txt
     ```
   - **Recommendation**: For a project, it's better to manually create and maintain your `requirements.txt` file, listing only the direct dependencies your project needs (e.g., `google-genai`, `pydantic`). `pip freeze` captures all packages, including indirect dependencies and those unrelated to your current project, which can make the environment less predictable.

**2. Virtual Environments:**
   - It is highly recommended to use virtual environments in conjunction with `requirements.txt`.
   - Virtual environments (created using tools like `venv` or `conda`) isolate your project's dependencies from your global Python installation and other projects. This prevents version conflicts and ensures that your project has exactly the dependencies it needs.
   - **Example with `venv`**:
     ```bash
     # Create a virtual environment
     python -m venv my-project-env
     # Activate it (on macOS/Linux)
     source my-project-env/bin/activate
     # Activate it (on Windows)
     # .\my-project-env\Scripts\activate
     
     # Install dependencies into the virtual environment
     pip install -r requirements.txt
     ```

**3. Advanced Dependency Management Tools:**
   - For more complex projects, consider using tools like [Poetry](https://python-poetry.org/) or [Pipenv](https://pipenv.pypa.io/en/latest/).
   - These tools offer more advanced features, including:
     - Dependency resolution (solving compatible versions of all direct and indirect dependencies).
     - Project packaging and building.
     - Integrated virtual environment management.
     - Project metadata kept in a dedicated file (`pyproject.toml` for Poetry, `Pipfile` for Pipenv) instead of just `requirements.txt`.

Adopting these practices leads to more reproducible, maintainable, and shareable Python projects.
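
As a small illustration of the recommendation to list only direct dependencies, the sketch below builds pinned `requirements.txt` lines from the versions installed in the current environment. The helper name `pinned_requirements` and the package list are hypothetical; `importlib.metadata` is part of the Python standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages: list[str]) -> list[str]:
    """Return one requirements.txt line per direct dependency."""
    lines = []
    for name in packages:
        try:
            # Pin to the version currently installed in this environment.
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed here: leave the requirement unpinned.
            lines.append(name)
    return lines


# Hypothetical list of this project's direct dependencies.
print("\n".join(pinned_requirements(["google-genai", "pydantic"])))
```

Writing these lines to `requirements.txt` by hand or from a script keeps the file limited to the packages the project actually imports.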

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [None]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).


In [None]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and create client

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
import os

from google import genai

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
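
One detail in the cell above is easy to miss: `str(os.environ.get("GOOGLE_CLOUD_PROJECT"))` evaluates to the literal string `"None"` when the variable is unset, so the client is created with an invalid project and the error only surfaces on the first request. The following is a minimal sketch of failing early instead. The helper name `resolve_project_id` is hypothetical and not part of the SDK.

```python
import os


def resolve_project_id(explicit: str, env=None) -> str:
    """Pick the project ID from an explicit value or GOOGLE_CLOUD_PROJECT."""
    env = os.environ if env is None else env
    placeholder = "[your-project-id]"
    if explicit and explicit != placeholder:
        return explicit
    project_id = env.get("GOOGLE_CLOUD_PROJECT")
    if not project_id:
        # Fail here rather than passing the string "None" to the client.
        raise ValueError(
            "Set PROJECT_ID or the GOOGLE_CLOUD_PROJECT environment variable."
        )
    return project_id
```

The optional `env` argument exists only so the function can be exercised with a plain dictionary instead of the real environment.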

### Import libraries


In [3]:
from datetime import date
from enum import Enum
import json
from html.parser import HTMLParser # For parsing HTML table output

import google.api_core.exceptions
from IPython.display import Markdown, display
from google.genai.types import GenerateContentConfig, Part
from pydantic import BaseModel, Field  # Pydantic for data validation and schema definition

PDF_MIME_TYPE = "application/pdf"
JSON_MIME_TYPE = "application/json"
ENUM_MIME_TYPE = "text/x.enum"

In [None]:
from google.genai import errors  # Exceptions raised by the Google Gen AI SDK


def make_gemini_request(client, model_id, contents, generation_config):
    """Makes a request to the Gemini API and handles common errors.

    Args:
        client: The Gemini API client.
        model_id: The ID of the Gemini model to use.
        contents: The contents of the prompt to send to the model.
        generation_config: The generation configuration for the request.

    Returns:
        The API response object if successful, None otherwise.
    """
    try:
        return client.models.generate_content(
            model=model_id,
            contents=contents,
            config=generation_config,
        )
    except errors.ClientError as e:
        # 4xx responses, e.g. an invalid argument or a model that was not found
        print(f"An API client error occurred: {type(e).__name__} - {e}")
    except errors.ServerError as e:
        # 5xx responses, e.g. the service is temporarily unavailable
        print(f"An API server error occurred: {type(e).__name__} - {e}")
    except Exception as e:
        # Fallback for any other unexpected errors
        print(f"An unexpected error occurred: {type(e).__name__} - {e}")
    return None
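
Because `make_gemini_request` returns `None` on failure, a caller can wrap it in a simple retry loop. The sketch below is one possible approach, not part of the SDK: the name `call_with_retries` and its parameters are hypothetical, and it retries on any failure, which is only sensible for transient errors such as a temporarily unavailable service.

```python
import time


def call_with_retries(request_fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call request_fn until it returns a non-None value or attempts run out."""
    for attempt in range(max_attempts):
        result = request_fn()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2**attempt)
    return None
```

In this notebook it could be used as `call_with_retries(lambda: make_gemini_request(client, MODEL_ID, contents, config))`, where `contents` and `config` stand for whatever you would normally pass.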

### Load the Gemini 2.0 Flash model

Gemini 2.0 Flash (`gemini-2.0-flash`) is a multimodal model: you can include text, images, video, and documents such as PDFs in your prompt requests and get text or code responses.

Learn more about all [Gemini models on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models).

In [4]:
MODEL_ID = "gemini-2.0-flash-001"  # @param {type: "string"}
# Using gemini-2.0-flash-001, a fast and versatile multimodal model.

## Entity Extraction

[Named Entity Extraction](https://en.wikipedia.org/wiki/Named-entity_recognition) is a Natural Language Processing technique that identifies specific fields and values in unstructured text. For example, you can find key-value pairs in a filled-out form, or get all of the important data from an invoice, categorized by type. The Gemini API can perform entity extraction when provided with a schema describing the desired entities.

### Extract entities from an invoice

In this example, you will use a sample invoice and get all of the information in a structured format.

The `entity_extraction_system_instruction` guides the model to act as an entity extraction specialist. It emphasizes extracting values directly from the document without normalization, ensuring the extracted data accurately reflects the source.

In [46]:
entity_extraction_system_instruction = """You are a document entity extraction specialist. Given a document, your task is to extract the text value of the entities provided in the schema.
- The values must only include text found in the document
- Do not normalize any entity values.
"""

We will use [Controlled generation](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output) to tell the model which fields need to be extracted. The response schema is specified using Pydantic models in the `response_schema` parameter of `GenerateContentConfig`. The model output will then strictly follow this schema, returning a JSON object that can be parsed into the Pydantic model. Setting `response_mime_type` to `application/json` ensures the output is JSON.
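
To make the relationship between a Pydantic model, its JSON schema, and the parsed result concrete, here is a minimal sketch that runs without calling the API. The `Receipt` model and the sample JSON string are made up for illustration; the real schemas for this notebook are defined in the next cell.

```python
from pydantic import BaseModel, Field


class Receipt(BaseModel):  # hypothetical schema, not used elsewhere in this notebook
    merchant: str = Field(..., description="Name of the merchant.")
    total: float = Field(..., description="Total amount charged.")


# The JSON schema that describes the allowed output structure.
schema = Receipt.model_json_schema()
print(sorted(schema["properties"]))

# A JSON string shaped like a schema-conforming model response (made up for illustration).
sample_json = '{"merchant": "Example Store", "total": 12.5}'
receipt = Receipt.model_validate_json(sample_json)
print(receipt.total)
```

Validation also coerces and checks types, so a response that does not match the schema raises a `ValidationError` instead of silently producing bad data.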

In [78]:
# Pydantic models define the schema for entity extraction.
# Each class represents an entity, and its fields represent the attributes to extract.
class Address(BaseModel):
    street: str | None = Field(None, example="123 Main St", description="Street name and number.")
    city: str | None = Field(None, example="Springfield", description="City name.")
    state: str | None = Field(None, example="IL", description="State or region.")
    postal_code: str | None = Field(None, example="62704", description="Postal or ZIP code.")
    country: str | None = Field(None, example="USA", description="Country name or code (e.g., USA)." )


class LineItem(BaseModel):
    amount: float = Field(..., example=100.00, description="Total amount for this line item (quantity * unit_price).")
    description: str | None = Field(None, example="Laptop", description="Description of the product or service.")
    product_code: str | None = Field(None, example="LPT-001", description="Product or service code, if available.")
    quantity: int = Field(..., example=2, description="Number of units.")
    unit: str | None = Field(None, example="pcs", description="Unit of measure (e.g., pcs, kg, hrs).")
    unit_price: float = Field(..., example=50.00, description="Price per unit.")


class VAT(BaseModel):
    amount: float = Field(..., example=20.00, description="VAT amount for this category/item.")
    category_code: str | None = Field(None, example="A", description="VAT category code, if applicable.")
    tax_amount: float | None = Field(None, example=5.00, description="Tax amount component of the VAT (if specified separately)." )
    tax_rate: float | None = Field(
        None, example=10.0, description="VAT rate as a percentage (e.g., 10.0 for 10%)."
    ) 
    total_amount: float = Field(..., example=200.00, description="Total amount for items under this VAT category, including VAT.")


class Party(BaseModel):
    name: str = Field(..., example="Google", description="Name of the party (e.g., supplier or receiver)." )
    street: str | None = Field(None, example="456 Business Rd", description="Street address of the party.")
    city: str | None = Field(None, example="Metropolis", description="City of the party.")
    state: str | None = Field(None, example="NY", description="State or region of the party.")
    postal_code: str | None = Field(None, example="10001", description="Postal code of the party.")
    country: str | None = Field(None, example="USA", description="Country of the party.")
    email: str | None = Field(None, example="contact@google.com", description="Contact email address of the party.")
    phone: str | None = Field(None, example="+1-555-1234", description="Contact phone number of the party.")
    website: str | None = Field(None, example="https://google.com", description="Website URL of the party.")
    tax_id: str | None = Field(None, example="123456789", description="Tax identification number (e.g., VAT ID, EIN)." )
    registration: str | None = Field(None, example="Reg-98765", description="Business registration number, if applicable.")
    iban: str | None = Field(None, example="US1234567890123456789", description="International Bank Account Number, if applicable.")
    payment_ref: str | None = Field(None, example="INV-2024-001", description="Payment reference or identifier.")


class Invoice(BaseModel):
    invoice_id: str = Field(..., example="INV-2024-001", description="Unique identifier for the invoice.")
    invoice_date: str = Field(..., example="2024-02-03", description="Date the invoice was issued (YYYY-MM-DD)." )
    supplier: Party = Field(..., description="Details of the supplier or vendor.")
    receiver: Party = Field(..., description="Details of the receiver or customer.")
    line_items: list[LineItem] = Field(..., description="List of individual line items on the invoice.")
    vat: list[VAT] = Field(description="List of VAT (Value Added Tax) details, if applicable.")

For this example, we will download a PDF document to local storage and send the file bytes to the API for processing.

You can view the document [here](https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/invoice.pdf).

In [None]:
# Download a PDF from Google Cloud Storage
! gsutil cp "gs://cloud-samples-data/generative-ai/pdf/invoice.pdf" ./invoice.pdf

In [None]:
# Load file bytes
with open("invoice.pdf", "rb") as f:
    file_bytes = f.read()

# Define generation config for Invoice: temperature=0 for deterministic, structured output based on the Pydantic schema.
invoice_config = GenerateContentConfig(
    system_instruction=entity_extraction_system_instruction,
    temperature=0, # Use 0 for deterministic and structured output based on schema
    response_schema=Invoice,
    response_mime_type=JSON_MIME_TYPE,
)
# Send to Gemini API using the helper function
response = make_gemini_request(
    client,
    MODEL_ID,
    contents=[
        "The following document is an invoice.", # Contextual prompt for the model
        Part.from_bytes(data=file_bytes, mime_type=PDF_MIME_TYPE),
    ],
    generation_config=invoice_config,
)

We can load the extracted data as an object using the `response.parsed` field.

In [None]:
invoice_data = response.parsed
print("\n-------Extracted Entities--------")
print(invoice_data)

Alternatively, the response text can be parsed as JSON into a Python dictionary for use in other applications.

In [None]:
json_object = json.loads(response.text)
print(json_object)

You can see that Gemini extracted all of the relevant fields from the document.
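
Since `make_gemini_request` returns `None` when a request fails, and a truncated response may not be valid JSON, it can help to guard the parsing step. The following is a hedged sketch; the helper name `parse_json_text` is hypothetical and the input strings are stand-ins for `response.text`.

```python
import json


def parse_json_text(text):
    """Return the decoded JSON value, or None if text is missing or invalid."""
    if not text:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        print(f"Could not decode JSON: {e}")
        return None


# Stand-in strings; in the notebook you would pass response.text when response is not None.
print(parse_json_text('{"invoice_id": "INV-2024-001"}'))
print(parse_json_text('{"invoice_id": '))
```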

### Extract entities from a payslip

Let's try with another type of document, a payslip or paystub. This uses the same `entity_extraction_system_instruction` but a different Pydantic schema (`Payslip`).

In this example, we will use a document hosted on Google Cloud Storage and process it by passing the URI.

You can view the document [here](https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/earnings_statement.pdf).

In [70]:
class Payslip(BaseModel):
    employee_id: str = Field(..., description="Unique identifier for the employee.")
    employee_name: str = Field(..., description="Full name of the employee.")
    pay_period_start: date = Field(..., description="Start date of the pay period (YYYY-MM-DD).")
    pay_period_end: date = Field(..., description="End date of the pay period (YYYY-MM-DD).")
    gross_income: float = Field(..., description="Total income before any deductions.")
    federal_tax: float = Field(..., description="Amount deducted for federal income tax.")
    state_tax: float | None = Field(
        0.0, description="Amount deducted for state income tax, if applicable."
    )
    social_security: float = Field(..., description="Amount deducted for Social Security contributions.")
    medicare: float = Field(..., description="Amount deducted for Medicare contributions.")
    other_deductions: float | None = Field(
        0.0, description="Sum of any other deductions (e.g., health insurance, retirement plan)."
    )
    net_income: float = Field(..., description="Total income after all deductions (take-home pay).")
    payment_date: date = Field(..., description="Date the payment was issued (YYYY-MM-DD).")
    hours_worked: float | None = Field(
        None, description="Total hours worked in the pay period, if applicable."
    )
    hourly_rate: float | None = Field(
        None, description="Employee's hourly rate, if applicable."
    )

In [71]:
# Define generation config for Payslip: temperature=0 for deterministic, structured output based on the Pydantic schema.
payslip_config = GenerateContentConfig(
    system_instruction=entity_extraction_system_instruction,
    temperature=0, # Use 0 for deterministic and structured output based on schema
    response_schema=Payslip,
    response_mime_type=JSON_MIME_TYPE,
)
response = make_gemini_request(
    client,
    MODEL_ID,
    contents=[
        "The following document is a Payslip.", # Contextual prompt for the model
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/earnings_statement.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    generation_config=payslip_config,
)

In [None]:
print("\n-------Extracted Entities--------")
print(response.parsed)
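
Extracted numeric fields can be sanity-checked with simple arithmetic before they are used downstream. The sketch below assumes that net income equals gross income minus the sum of all deductions, which may not hold for every payslip layout. The function name `payslip_is_consistent` and the figures are hypothetical and are not taken from the sample document.

```python
import math


def payslip_is_consistent(gross_income, deductions, net_income, tolerance=0.01):
    """Check that gross income minus all deductions matches net income."""
    expected_net = gross_income - sum(deductions)
    return math.isclose(expected_net, net_income, abs_tol=tolerance)


# Made-up figures for illustration, not values from the sample document.
print(payslip_is_consistent(1000.00, [100.00, 62.00, 14.50], 823.50))
```

With a parsed `Payslip` object, the deductions list would be built from fields such as `federal_tax`, `state_tax`, `social_security`, `medicare`, and `other_deductions`, treating `None` as zero.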

## Document Classification

Document classification is the process of identifying the type of a document, for example an invoice, W-2, or receipt.

In this example, you will use a [sample tax form (W-9)](https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/w9.pdf) and get the specific type of document from a specified `Enum`.
The `classification_prompt` guides the model to categorize the document based on the `DocumentCategory` Enum provided in the schema.

In [73]:
classification_prompt = """You are a document classification specialist. Given a document, your task is to find which category the document belongs to from the document categories provided in the schema."""


class DocumentCategory(Enum):
    TAX_1040_2019 = "1040_2019"
    TAX_1040_2020 = "1040_2020"
    TAX_1099_R = "1099-r"
    BANK_STATEMENT = "bank_statement"
    CREDIT_CARD_STATEMENT = "credit_card_statement"
    EXPENSE = "expense"
    TAX_1120S_2019 = "form_1120S_2019"
    TAX_1120S_2020 = "form_1120S_2020"
    INVESTMENT_RETIREMENT_STATEMENT = "investment_retirement_statement"
    INVOICE = "invoice"
    PAYSTUB = "paystub"
    PROPERTY_INSURANCE = "property_insurance"
    PURCHASE_ORDER = "purchase_order"
    UTILITY_STATEMENT = "utility_statement"
    W2 = "w2"
    W9 = "w9"
    DRIVER_LICENSE = "driver_license"

In [None]:
# Define generation config for Document Classification: temperature=0 for deterministic output.
# The response_schema is the Enum DocumentCategory, and response_mime_type is ENUM_MIME_TYPE for the model to return one of the enum values.
doc_classification_config = GenerateContentConfig(
    system_instruction=classification_prompt,
    temperature=0, # Use 0 for deterministic category selection
    response_schema=DocumentCategory,
    response_mime_type=ENUM_MIME_TYPE,
)
response = make_gemini_request(
    client,
    MODEL_ID,
    contents=[
        "Classify the following document.",
        Part.from_uri(
            file_uri="https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/w9.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    generation_config=doc_classification_config,
)

In [None]:
print("\n-------Document Classification--------")
print(response.text)
print(response.parsed)

You can see that Gemini successfully categorized the document.
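
With the `text/x.enum` MIME type, the response text is expected to be one of the enum values as a plain string, and looking that string up by value yields the matching `Enum` member. The sketch below shows that lookup in isolation, using a shortened, hypothetical stand-in for `DocumentCategory` and a hypothetical helper `to_category`.

```python
from enum import Enum


class Category(Enum):  # hypothetical, shortened stand-in for DocumentCategory
    INVOICE = "invoice"
    W9 = "w9"


def to_category(text):
    """Map the model's text output to an Enum member, or None if unknown."""
    try:
        # Enum lookup by value: Category("w9") returns Category.W9.
        return Category(text.strip())
    except ValueError:
        return None


print(to_category("w9"))
print(to_category("receipt"))
```

Returning `None` for unknown values gives downstream code a single, explicit case to handle when a document falls outside the known categories.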

### Chaining Classification and Extraction

These techniques can also be chained together to extract entities from any number of document types.

For example, if you have multiple types of documents to process, you can send each document to Gemini with a classification prompt, then based on that output, you can write logic to decide which extraction prompt to use.

These are the sample documents:

- [US Driver License](https://storage.googleapis.com/cloud-samples-data/documentai/SampleDocuments/US_DRIVER_LICENSE_PROCESSOR/dl3.pdf)
- [Invoice](https://storage.googleapis.com/cloud-samples-data/documentai/SampleDocuments/INVOICE_PROCESSOR/google_invoice.pdf)
- [Form W-2](https://storage.googleapis.com/cloud-samples-data/documentai/SampleDocuments/FORM_W2_PROCESSOR/2020FormW-2.pdf)

The following cells define Pydantic models for W2 forms and Driver's Licenses. The `W2Form` schema is then exported to JSON and will be dynamically loaded in the subsequent cell. This demonstrates how schemas can be managed and versioned outside the main codebase if needed.

In [81]:
class W2Form(BaseModel):
    control_number: str | None = Field(None, description="Control number from the W-2 form, if present.")
    ein: str = Field(..., description="Employer Identification Number (EIN).")

    employee_first_name: str = Field(..., description="Employee's first name.")
    employee_last_name: str = Field(..., description="Employee's last name.")
    employee_address_street: str = Field(..., description="Employee's street address.")
    employee_address_city: str = Field(..., description="Employee's city.")
    employee_address_state: str = Field(..., description="Employee's state (abbreviation, e.g., CA)." )
    employee_address_zip: str = Field(..., description="Employee's ZIP code.")

    employer_name: str = Field(..., description="Employer's name.")
    employer_address_street: str = Field(..., description="Employer's street address.")
    employer_address_city: str = Field(..., description="Employer's city.")
    employer_address_state: str = Field(..., description="Employer's state (abbreviation, e.g., CA)." )
    employer_address_zip: str = Field(..., description="Employer's ZIP code.")
    employer_state_id_number: str | None = Field(None, description="Employer's state ID number, if applicable.")

    wages_tips_other_compensation: float = Field(..., description="Box 1: Wages, tips, other compensation.")
    federal_income_tax_withheld: float = Field(..., description="Box 2: Federal income tax withheld.")
    social_security_wages: float = Field(..., description="Box 3: Social security wages.")
    social_security_tax_withheld: float = Field(..., description="Box 4: Social security tax withheld.")
    medicare_wages_and_tips: float = Field(..., description="Box 5: Medicare wages and tips.")
    medicare_tax_withheld: float = Field(..., description="Box 6: Medicare tax withheld.")

    state: str | None = Field(None, description="Box 15: State.")
    state_wages_tips_etc: float | None = Field(None, description="Box 16: State wages, tips, etc.")
    state_income_tax: float | None = Field(None, description="Box 17: State income tax.")

    box_12_code: str | None = Field(None, description="Box 12 code, if present (e.g., DD, W)." )
    box_12_value: str | None = Field(None, description="Box 12 value associated with the code.")

    form_year: int = Field(..., description="The tax year of the W-2 form (e.g., 2020)." )


class DriversLicense(BaseModel):
    address: str = Field(
        ..., title="Address", description="Full address of the individual on the license."
    )
    date_of_birth: date = Field(
        ..., title="Date of Birth", description="Birthdate of the individual (YYYY-MM-DD)."
    )
    document_id: str = Field(
        ...,
        title="Document ID",
        description="The unique document ID or license number for the driver's license.",
    )
    expiration_date: date = Field(
        ...,
        title="Expiration Date",
        description="Expiration date of the driver's license (YYYY-MM-DD).",
    )
    family_name: str = Field(
        ...,
        title="Family Name",
        description="The family name (last name or surname) of the individual.",
    )
    given_names: str = Field(
        ...,
        title="Given Names",
        description="The given names (first and middle names) of the individual.",
    )
    issue_date: date = Field(
        ..., title="Issue Date", description="Issue date of the driver's license (YYYY-MM-DD)."
    )

# Export the W2Form schema to a JSON string (Pydantic v2 API).
w2form_schema_json = json.dumps(W2Form.model_json_schema(), indent=2)
print("W2Form JSON Schema:")
print(w2form_schema_json)

# Map classification types to Pydantic schemas for entity extraction.
classification_to_schema = {
    DocumentCategory.INVOICE: Invoice,
    DocumentCategory.W2: W2Form,
    DocumentCategory.DRIVER_LICENSE: DriversLicense,
}

In [None]:
# Load the exported W2Form schema back from its JSON string.
# Pydantic has no built-in helper that rebuilds a model class from a JSON schema,
# so the W2Form class stays in classification_to_schema; the parsed schema is
# only checked here to confirm that the export round-trips.
w2form_schema = json.loads(w2form_schema_json)
print(f"Loaded W2Form JSON schema with {len(w2form_schema['properties'])} properties")

gcs_uris = [
    "gs://cloud-samples-data/documentai/SampleDocuments/US_DRIVER_LICENSE_PROCESSOR/dl3.pdf",
    "gs://cloud-samples-data/documentai/SampleDocuments/INVOICE_PROCESSOR/google_invoice.pdf",
    "gs://cloud-samples-data/documentai/SampleDocuments/FORM_W2_PROCESSOR/2020FormW-2.pdf",
]

for gcs_uri in gcs_uris:
    print(f"\nFile: {gcs_uri}\n")

    # Config for the classification step.
    classification_config = GenerateContentConfig(
        system_instruction=classification_prompt,
        temperature=0,
        response_schema=DocumentCategory,
        response_mime_type=ENUM_MIME_TYPE,
    )
    classification_response = make_gemini_request(
        client,
        MODEL_ID,
        contents=[
            "Classify the following document.",
            Part.from_uri(file_uri=gcs_uri, mime_type=PDF_MIME_TYPE),
        ],
        generation_config=classification_config,
    )

    if not classification_response or not classification_response.text:
        print("Skipping extraction due to classification error or empty response.")
        continue

    print(f"Document Classification: {classification_response.text}")

    # Get Extraction schema based on Classification
    extraction_schema = classification_to_schema.get(classification_response.parsed)

    if not extraction_schema:
        print("Document does not belong to a specified class. Skipping extraction.")
        continue

    # Config for the entity extraction step, using entity_extraction_system_instruction.
    extraction_config = GenerateContentConfig(
        system_instruction=entity_extraction_system_instruction, 
        temperature=0,
        response_schema=extraction_schema,
        response_mime_type=JSON_MIME_TYPE,
    )
    extraction_response = make_gemini_request(
        client,
        MODEL_ID,
        contents=[
            f"Extract the entities from the following {classification_response.text} document.",
            Part.from_uri(file_uri=gcs_uri, mime_type=PDF_MIME_TYPE),
        ],
        generation_config=extraction_config,
    )

    if not extraction_response:
        print("Skipping entity printing due to extraction error.")
        continue

    print("\n-------Extracted Entities--------")
    print(extraction_response.parsed)
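
The routing step in the loop above is an enum-keyed dictionary lookup. The sketch below isolates that pattern with stand-in names: `DocCategory`, its values, and the placeholder schema classes are hypothetical and only mirror the shape of `DocumentCategory` and `classification_to_schema`. It also shows one possible fallback to the raw response text when no parsed enum value is available.

```python
from enum import Enum
from typing import Optional


class DocCategory(Enum):
    # Hypothetical stand-in for the notebook's DocumentCategory enum.
    INVOICE = "invoice"
    W2 = "w2"
    OTHER = "other"


class InvoiceSchema:
    """Placeholder for a Pydantic extraction model."""


class W2Schema:
    """Placeholder for a Pydantic extraction model."""


SCHEMA_BY_CATEGORY = {
    DocCategory.INVOICE: InvoiceSchema,
    DocCategory.W2: W2Schema,
}


def resolve_category(parsed: Optional[DocCategory], text: Optional[str]) -> Optional[DocCategory]:
    """Prefer the parsed enum; otherwise try to map the raw text onto an enum value."""
    if parsed is not None:
        return parsed
    try:
        return DocCategory((text or "").strip().lower())
    except ValueError:
        return None


def pick_schema(parsed: Optional[DocCategory], text: Optional[str]):
    """Return the extraction schema for a classification, or None to skip extraction."""
    return SCHEMA_BY_CATEGORY.get(resolve_category(parsed, text))
```

Categories without a mapped schema (here `OTHER`) and unrecognized labels both yield `None`, which corresponds to the "skip extraction" branch in the loop.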

## Document Question Answering

Gemini can be used to answer questions about a document. The `qa_system_instruction` guides the model to act as a question-answering specialist, using the provided document as context.

This example answers a question about the Transformer paper ["Attention Is All You Need"](https://arxiv.org/pdf/1706.03762). The PDF is loaded from a Cloud Storage copy of the [arXiv](https://arxiv.org) paper.

In [83]:
qa_system_instruction = "You are a question answering specialist. Given a question and a context, your task is to provide the answer to the question based on the context provided. Give the answer first, followed by an explanation."

In [None]:
# Send Q&A Prompt to Gemini
# Define generation config for Q&A: temperature=0 for more factual, less creative answers.
qa_config = GenerateContentConfig(
    system_instruction=qa_system_instruction,
    temperature=0,
    response_mime_type="text/plain", # Expecting a plain text answer
)
response = make_gemini_request(
    client,
    MODEL_ID,
    contents=[
        "What is the attention mechanism?", # The question
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/1706.03762v7.pdf", # The document context
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    generation_config=qa_config,
)

if response:
    print(f"Answer: {response.text}")
else:
    print("No answer due to an error.")

## Document Summarization

Gemini can also be used to summarize or paraphrase a document's contents. Your prompt can specify how detailed the summary should be or specific formatting, such as bullet points or paragraphs.
The `summarization_system_instruction` directs the model to act as a professional summarizer that describes images, extracts table contents, and explains graphs, without introducing numbers that do not appear in the document.
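
As a rough illustration of steering detail and format through the prompt, the helper below assembles a summarization request from a few options. The function name `build_summary_prompt`, its parameters, and the wording of the generated prompt are assumptions made for this sketch; they are not part of the notebook or the SDK.

```python
from typing import Optional

ALLOWED_DETAIL = {"brief", "detailed"}
ALLOWED_STYLE = {"paragraphs", "bullet points"}


def build_summary_prompt(
    detail: str = "detailed",
    style: str = "paragraphs",
    max_items: Optional[int] = None,
) -> str:
    """Build a summarization prompt with an explicit level of detail and output format."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"detail must be one of {sorted(ALLOWED_DETAIL)}")
    if style not in ALLOWED_STYLE:
        raise ValueError(f"style must be one of {sorted(ALLOWED_STYLE)}")
    prompt = f"Write a {detail} summary of the following document, formatted as {style}."
    if max_items is not None:
        # Optional cap on the number of paragraphs or bullet points.
        prompt += f" Use at most {max_items} {style}."
    return prompt


print(build_summary_prompt("brief", "bullet points", max_items=5))
```

The returned string could take the place of the fixed `"Summarize the following document."` request in the `contents` list.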

In [88]:
summarization_system_instruction = """You are a professional document summarization specialist. Given a document, your task is to provide a detailed summary of the content of the document.

If it includes images, provide descriptions of the images.
If it includes tables, extract all elements of the tables.
If it includes graphs, explain the findings in the graphs.
Do not include any numbers that are not mentioned in the document.
"""

In [None]:
# Send Summarization Prompt to Gemini
# Define generation config for Summarization: temperature=0 for a more focused and factual summary.
summarization_config = GenerateContentConfig(
    system_instruction=summarization_system_instruction,
    temperature=0, # Lower temperature for more factual summary
    response_mime_type="text/plain", # Expecting a plain text summary
)
response = make_gemini_request(
    client,
    MODEL_ID,
    contents=[
        "Summarize the following document.", # The summarization request
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/fdic_board_meeting.pdf", # The document to summarize
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    generation_config=summarization_config,
)

if response:
    display(Markdown("### Document Summary"))
    display(Markdown(response.text))
else:
    print("No summary due to an error.")

## Table Parsing from Documents

Gemini can parse the contents of a table and return them in a structured format, such as HTML or Markdown. The following example asks for an HTML representation of a table within a document.

In [91]:
table_extraction_prompt = """What is the HTML code of the table in this document?"""

In [None]:
# Send Table Extraction Prompt to Gemini
# Define generation config for Table Extraction: temperature=0 for direct HTML output.
table_config = GenerateContentConfig(temperature=0) # Temperature 0 for precise extraction
response = make_gemini_request(
    client,
    MODEL_ID,
    contents=[
        table_extraction_prompt,
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/salary_table.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    generation_config=table_config,
)

if response and response.text:
    display(Markdown(response.text)) # Display the raw HTML table as Markdown
    
    # Strip surrounding whitespace first so the fence checks below match reliably
    html_table_str = response.text.strip()
    # Remove markdown code fences (```html and ```) if present around the HTML string
    if html_table_str.startswith("```html"):
        html_table_str = html_table_str[len("```html"):]
    if html_table_str.endswith("```"):
        html_table_str = html_table_str[:-3]
    html_table_str = html_table_str.strip()

    # Define a simple HTML parser to extract table data into a list of lists
    class SimpleTableParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_td = False # Flag to indicate if currently inside a <td> tag
            self.in_th = False # Flag to indicate if currently inside a <th> tag
            self.current_row = [] # Holds data for the current row being parsed
            self.table_data = [] # List of all rows (each row is a list of cell data)
            self.current_cell_data = "" # Accumulates data within a cell

        def handle_starttag(self, tag, attrs):
            if tag == "tr": # Start of a table row
                self.current_row = []
            elif tag == "td": # Start of a table data cell
                self.in_td = True
                self.current_cell_data = ""
            elif tag == "th": # Start of a table header cell
                self.in_th = True
                self.current_cell_data = ""

        def handle_endtag(self, tag):
            if tag == "tr": # End of a table row
                if self.current_row: 
                    self.table_data.append(self.current_row)
            elif tag == "td": # End of a table data cell
                self.in_td = False
                self.current_row.append(self.current_cell_data.strip())
                self.current_cell_data = ""
            elif tag == "th": # End of a table header cell
                self.in_th = False
                self.current_row.append(self.current_cell_data.strip())
                self.current_cell_data = ""

        def handle_data(self, data):
            if self.in_td or self.in_th: # Accumulate data if inside a <td> or <th>
                self.current_cell_data += data

    parser = SimpleTableParser()
    parser.feed(html_table_str) # Feed the HTML table string to the parser
    parsed_table_list_of_lists = parser.table_data

    print("\n-------Parsed Table (List of Lists)--------")
    for row in parsed_table_list_of_lists:
        print(row)

    # Convert to a list of dictionaries, assuming the first row contains headers
    print("\n-------Parsed Table (List of Dictionaries)--------")
    if not parsed_table_list_of_lists:
        print("Parsed table is empty or malformed, cannot convert to list of dictionaries.")
    else:
        headers = parsed_table_list_of_lists[0]
        data_rows = parsed_table_list_of_lists[1:]
        if data_rows:
            # zip() pairs each header with a cell and stops at the shorter of the two
            list_of_dicts = [dict(zip(headers, row)) for row in data_rows]
            for item in list_of_dicts:
                print(item)
        else:
            print("Only headers found, no data rows to convert to dictionaries.")
else:
    print("No table extracted due to an error or empty response.")
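
Once the table is available as a list of rows, it can be handed to other tools. The sketch below writes such rows to CSV text with the standard library, padding short rows so every record has the same number of columns. The helper name `rows_to_csv` and the sample rows are made up for illustration; they are not taken from `salary_table.pdf`.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (lists of strings) to CSV text, padding ragged rows."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Pad with empty cells so every record has `width` columns.
        writer.writerow(list(row) + [""] * (width - len(row)))
    return buffer.getvalue()


# Illustrative data only.
sample_rows = [
    ["Role", "Level", "Salary"],
    ["Engineer", "L4", "100,000"],
    ["Analyst", "L3"],
]
print(rows_to_csv(sample_rows))
```

Cells containing commas are quoted by `csv.writer`, so values such as `100,000` survive a round trip through CSV.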

## Document Translation

Gemini can translate documents between languages. This example translates the first paragraph of a set of meeting notes from English into French and Spanish. The `translation_prompt` instructs the model to perform the translation and label each paragraph with its target language.

In [94]:
translation_prompt = """Translate the first paragraph into French and Spanish. Label each paragraph with the target language."""

In [None]:
# Send Translation Prompt to Gemini
# Define generation config for Translation: temperature=0 for more literal and accurate translation.
translation_config = GenerateContentConfig(temperature=0)
response = make_gemini_request(
    client,
    MODEL_ID,
    contents=[
        translation_prompt,
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/fdic_board_meeting.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    generation_config=translation_config,
)

if response:
    display(Markdown("### Translations"))
    display(Markdown(response.text))
else:
    print("No translation due to an error.")

## Document Comparison

Gemini can compare and contrast the contents of multiple documents. This example compares the 2013 and 2023 versions of IRS Form 1040 to see how the standard deduction changed.

Note: when working with multiple documents, the order can matter and should be specified in your prompt. Here, the `comparison_prompt` clarifies the order of the documents.
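
One way to keep the prompt and the document order in sync is to generate both from the same list. The sketch below does this for a small number of documents; `build_ordered_prompt`, the `ORDINALS` list, and the generated wording are assumptions for illustration, not part of the notebook.

```python
ORDINALS = ["first", "second", "third", "fourth", "fifth"]


def build_ordered_prompt(documents, question):
    """Build a prompt that names each document by position.

    `documents` is a list of (description, uri) pairs in the order they will be
    sent. Returns the prompt and the URIs in that same order.
    """
    if not 2 <= len(documents) <= len(ORDINALS):
        raise ValueError(f"expected between 2 and {len(ORDINALS)} documents")
    described = [
        f"the {ORDINALS[i]} document is {description}"
        for i, (description, _uri) in enumerate(documents)
    ]
    intro = ", ".join(described)
    intro = intro[0].upper() + intro[1:] + "."
    uris = [uri for _description, uri in documents]
    return f"{intro} {question}", uris


prompt, uris = build_ordered_prompt(
    [
        ("from 2013", "gs://cloud-samples-data/generative-ai/pdf/form_1040_2013.pdf"),
        ("from 2023", "gs://cloud-samples-data/generative-ai/pdf/form_1040_2023.pdf"),
    ],
    "How did the standard deduction evolve?",
)
print(prompt)
```

The prompt would go first in `contents`, followed by one `Part.from_uri(...)` per entry of `uris`, in the returned order.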

In [96]:
comparison_prompt = """The first document is from 2013, the second one from 2023. How did the standard deduction evolve?"""

In [None]:
# Send Comparison Prompt to Gemini
# Define generation config for Comparison: temperature=0 for a factual comparison.
comparison_config = GenerateContentConfig(temperature=0)
response = make_gemini_request(
    client,
    MODEL_ID,
    contents=[
        comparison_prompt,
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/form_1040_2013.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/form_1040_2023.pdf",
            mime_type=PDF_MIME_TYPE,
        ),
    ],
    generation_config=comparison_config,
)

if response:
    display(Markdown("### Comparison"))
    display(Markdown(response.text))
else:
    print("No comparison due to an error.")

## Asynchronous Processing for Multiple Documents

When processing multiple documents, making sequential API calls can be time-consuming because the program waits for each call to complete before starting the next. Asynchronous processing can significantly speed up I/O-bound operations like these by allowing the program to initiate multiple requests and then process them as they complete, rather than one by one.
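
The difference can be seen without any API calls. The sketch below uses `asyncio.sleep` as a stand-in for network latency (the function names, job names, and delays are invented for the example) and runs the same jobs once sequentially and once concurrently, collecting the concurrent results in completion order.

```python
import asyncio
import time


async def fake_request(name: str, delay: float) -> str:
    """Stand-in for an API call: waits `delay` seconds without blocking the event loop."""
    await asyncio.sleep(delay)
    return name


async def run_sequential(jobs):
    # Each request starts only after the previous one has finished.
    return [await fake_request(name, delay) for name, delay in jobs]


async def run_concurrent(jobs):
    # Start every request up front, then collect results as they complete.
    tasks = [asyncio.create_task(fake_request(name, delay)) for name, delay in jobs]
    finished = []
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
    return finished


async def compare(jobs):
    start = time.perf_counter()
    sequential = await run_sequential(jobs)
    sequential_seconds = time.perf_counter() - start

    start = time.perf_counter()
    concurrent = await run_concurrent(jobs)
    concurrent_seconds = time.perf_counter() - start
    return sequential, sequential_seconds, concurrent, concurrent_seconds


demo_jobs = [("doc_a", 0.3), ("doc_b", 0.1), ("doc_c", 0.2)]
# In a script: asyncio.run(compare(demo_jobs))
# In a notebook cell: await compare(demo_jobs)
```

Sequential time is roughly the sum of the delays, while concurrent time is roughly the longest single delay.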

Python's `asyncio` library is the standard way to handle such tasks. The `google-genai` SDK exposes asynchronous counterparts of its client methods under `client.aio` (for example, `await client.aio.models.generate_content(...)`), and these should be preferred when you call the SDK directly.

Below is a *conceptual* code snippet that instead reuses this notebook's synchronous `make_gemini_request` helper by running each call in a worker thread. It assumes `make_gemini_request` accepts the same arguments used throughout this notebook.

```python
import asyncio

from google.genai.types import Part


# Conceptual sketch: run the synchronous make_gemini_request helper in a worker
# thread so that several documents can be in flight at once.
# When calling the SDK directly, prefer its native async client, for example:
#     await client.aio.models.generate_content(model=..., contents=..., config=...)
async def process_document_async(client, model_id, prompt, gcs_uri, generation_config):
    # asyncio.to_thread (Python 3.9+) runs the blocking call in the default thread pool.
    response = await asyncio.to_thread(
        make_gemini_request,
        client,
        model_id,
        contents=[prompt, Part.from_uri(file_uri=gcs_uri, mime_type=PDF_MIME_TYPE)],
        generation_config=generation_config,
    )
    if response and response.text:
        return f"Processed {gcs_uri}: {response.text[:30]}..."
    return f"Failed or empty response: {gcs_uri}"


async def main_async_processing(client, model_id, prompt, gcs_uris, generation_config):
    tasks = [
        process_document_async(client, model_id, prompt, gcs_uri, generation_config)
        for gcs_uri in gcs_uris
    ]
    # gather() runs the coroutines concurrently and returns results in input order.
    results = await asyncio.gather(*tasks)
    for result in results:
        print(result)


# Example of how you might run this (conceptual), assuming `client`, `MODEL_ID`,
# and `summarization_config` are defined as earlier in this notebook:
# sample_uris = [
#     "gs://cloud-samples-data/generative-ai/pdf/fdic_board_meeting.pdf",
#     "gs://cloud-samples-data/generative-ai/pdf/1706.03762v7.pdf",
# ]
# prompt = "Summarize the following document."
#
# In a notebook cell (top-level await):
# await main_async_processing(client, MODEL_ID, prompt, sample_uris, summarization_config)
#
# In a standalone script:
# asyncio.run(main_async_processing(client, MODEL_ID, prompt, sample_uris, summarization_config))
```

**Caveats for Jupyter Notebooks**:
Running `asyncio` code, especially `asyncio.run()`, directly in a Jupyter Notebook cell can lead to a `RuntimeError` if an event loop is already running (which is often the case in Jupyter environments). To manage this, you might need to:
- Use the `nest_asyncio` library: `import nest_asyncio; nest_asyncio.apply()` at the beginning of your notebook.
- Alternatively, use `await main_async_processing(...)` directly in a cell if you are in an environment that supports top-level await (like IPython 7.0+).
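
A third pattern is to check for a running loop before choosing how to start the coroutine. The helper below is a minimal sketch of that idea; `run_or_schedule` and `add` are hypothetical names, and the behavior inside a notebook is assumed to match the "loop already running" branch.

```python
import asyncio


def run_or_schedule(coro):
    """Run `coro` to completion if no event loop is running; otherwise schedule it.

    Returns the coroutine's result (no running loop) or an asyncio.Task that the
    caller must await (loop already running).
    """
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: typical for a plain Python script.
        return asyncio.run(coro)
    # A loop is already running: typical for a Jupyter cell.
    return loop.create_task(coro)


async def add(a, b):
    await asyncio.sleep(0)  # Yield to the event loop once.
    return a + b
```

In a script, `run_or_schedule(add(1, 2))` returns the result directly; where a loop is already running, it returns a task that still has to be awaited.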

Always refer to the latest `google-genai` SDK documentation for the recommended way to perform asynchronous operations.