# Document Processing with Gemini

| | |
|-|-|
|Author(s) | [Justin Marciszewski](https://github.com/justinjm) |

## Overview

In today's information-driven world, the volume of digital documents generated daily is staggering. From emails and reports to legal contracts and scientific papers, businesses and individuals alike are inundated with vast amounts of textual data. Extracting meaningful insights from these documents efficiently and accurately has become a paramount challenge.

Document processing involves a range of tasks, including text extraction, classification, summarization, and translation, among others. Traditional methods often rely on rule-based algorithms or statistical models, which may struggle with the nuances and complexities of natural language.

Generative AI offers a promising alternative: models that understand, generate, and manipulate text in response to natural language prompts. Gemini on Vertex AI makes these models available at scale through:

- [Vertex AI Studio](https://cloud.google.com/generative-ai-studio) in the Cloud Console
- [Vertex AI REST API](https://cloud.google.com/vertex-ai/docs/reference/rest)
- [Vertex AI SDK for Python](https://cloud.google.com/vertex-ai/docs/python-sdk/use-vertex-ai-python-sdk-ref)
- [Other client libraries](https://cloud.google.com/vertex-ai/docs/start/client-libraries)

This notebook focuses on using the **Vertex AI SDK for Python** to call the Vertex AI Gemini API with the Gemini 1.5 Flash model.

For more information, see the [Generative AI on Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview) documentation.


### Objectives

In this tutorial, you will learn how to use the Vertex AI Gemini API with the Vertex AI SDK for Python to process PDF documents.

You will complete the following tasks:

- Install the Vertex AI SDK for Python
- Use the Vertex AI Gemini API to interact with the Gemini 1.5 Flash (`gemini-1.5-flash`) model:
  - Extract structured entities from an unstructured document
  - Classify document types
  - Combine classification and entity extraction into a single workflow
  - Summarize documents


### Costs

This tutorial uses billable components of Google Cloud:

- Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.


## Getting Started


### Install Vertex AI SDK for Python


In [None]:
%pip install --upgrade --user --quiet google-cloud-aiplatform

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [None]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).


In [None]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]

In [2]:
# Define project information
# PROJECT_ID = "YOUR_PROJECT_ID"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# Initialize Vertex AI
import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

### Import libraries


In [3]:
import json

from IPython.display import Markdown, display_pdf
from vertexai.generative_models import (
    GenerationConfig,
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    Part,
)

### Load the Gemini 1.5 Flash model

Gemini 1.5 Flash (`gemini-1.5-flash`) is a multimodal model: a prompt can combine text, images, video, and documents such as PDFs, and the model returns text or code responses.

In [4]:
model = GenerativeModel(
    "gemini-1.5-flash",
    safety_settings={
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH
    },
)
# This Generation Config sets the model to respond in JSON format.
generation_config = GenerationConfig(
    temperature=0.0, response_mime_type="application/json"
)

### Define helper functions

Define two helper functions: one to print the multimodal prompt, and one to send a document stored in Google Cloud Storage to Gemini.

In [5]:
PDF_MIME_TYPE = "application/pdf"


def print_multimodal_prompt(contents: list) -> None:
    """
    Given contents that would be sent to Gemini,
    output the full multimodal prompt for ease of readability.
    """
    for content in contents:
        if not isinstance(content, Part):
            print(content)
        elif content.inline_data:
            # inline_data.data holds raw PDF bytes, so pass raw=True
            display_pdf(content.inline_data.data, raw=True)
        elif content.file_data:
            gcs_url = (
                "https://storage.googleapis.com/"
                + content.file_data.file_uri.replace("gs://", "").replace(" ", "%20")
            )
            print(f"PDF URL: {gcs_url}")


# Send Google Cloud Storage Document to Vertex AI
def process_document(
    prompt: str,
    file_uri: str,
    mime_type: str = PDF_MIME_TYPE,
    generation_config: GenerationConfig | None = None,
    print_prompt: bool = False,
    print_raw_response: bool = False,
) -> str:
    # Load file directly from Google Cloud Storage
    file_part = Part.from_uri(
        uri=file_uri,
        mime_type=mime_type,
    )

    # Load contents
    contents = [file_part, prompt]

    # Send to Gemini
    response = model.generate_content(contents, generation_config=generation_config)

    if print_prompt:
        print("-------Prompt--------")
        print_multimodal_prompt(contents)

    if print_raw_response:
        print("\n-------Raw Response--------")
        print(response)

    return response.text
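
The `print_multimodal_prompt` helper above builds a browser URL from a `gs://` URI with plain string replacement, which only handles spaces. As a minimal sketch, assuming you also want other reserved characters percent-encoded, the conversion could be written with the standard library. The function name `gcs_uri_to_https_url` is hypothetical and not part of the Vertex AI SDK, and the resulting URL is only useful if the object is readable by whoever opens it.

```python
from urllib.parse import quote


def gcs_uri_to_https_url(gcs_uri: str) -> str:
    """Convert a gs://bucket/object URI into a storage.googleapis.com URL.

    Hypothetical helper for illustration; not part of the Vertex AI SDK.
    """
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Expected a gs:// URI, got: {gcs_uri!r}")
    # Percent-encode spaces and other reserved characters, keeping "/" separators.
    path = quote(gcs_uri[len(prefix):], safe="/")
    return "https://storage.googleapis.com/" + path


print(gcs_uri_to_https_url("gs://cloud-samples-data/generative-ai/pdf/invoice.pdf"))
```

Rejecting anything that does not start with `gs://` avoids silently producing a broken link when a local path is passed by mistake.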

## Entity Extraction

[Named Entity Extraction](https://en.wikipedia.org/wiki/Named-entity_recognition) is a Natural Language Processing technique that identifies specific fields and values in unstructured text. For example, you can find key-value pairs in a filled-out form, or get all of the important data from an invoice, categorized by type.

### Extract entities from a medical document

In this example, you will use a sample medical document and get the chief complaint, medications, and review of symptoms in JSON format.

This is the prompt to be sent to Gemini along with the PDF document. Feel free to edit this for your specific use case.

In [6]:
extraction_prompt = """You are a document entity extraction specialist. Given a document, your task is to extract the text value of the following entities:
{
	"chief_complaint": "",
	"medications": [
		{
			"dose": "",
			"description": "",
			"medication": "",
			"quantity": ""
		}
	],

	"review_of_symptoms": [
		{
			"symptom": "",
			"description": ""
		}
	]
}

- The JSON schema must be followed during the extraction.
- The values must only include text found in the document
- Do not normalize any entity value.
- If an entity is not found in the document, set the entity value to null.
"""
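
If you edit the entity list often, it can help to keep the template as a Python dictionary and render the prompt from it, so the same structure can later be reused when checking the response. The following is a minimal sketch under that assumption; `ENTITY_TEMPLATE`, `RULES`, and `build_extraction_prompt` are hypothetical names introduced here for illustration, and the rendered text mirrors the prompt above rather than replacing it.

```python
import json

# Hypothetical: the same entity template as the prompt above, kept as a dict.
ENTITY_TEMPLATE = {
    "chief_complaint": "",
    "medications": [
        {"dose": "", "description": "", "medication": "", "quantity": ""}
    ],
    "review_of_symptoms": [{"symptom": "", "description": ""}],
}

# The extraction rules from the prompt above, one per list item.
RULES = [
    "The JSON schema must be followed during the extraction.",
    "The values must only include text found in the document.",
    "Do not normalize any entity value.",
    "If an entity is not found in the document, set the entity value to null.",
]


def build_extraction_prompt(template: dict, rules: list[str]) -> str:
    """Render the instruction, the JSON template, and the rules into one prompt."""
    header = (
        "You are a document entity extraction specialist. Given a document, "
        "your task is to extract the text value of the following entities:"
    )
    schema_text = json.dumps(template, indent=4)
    rules_text = "\n".join(f"- {rule}" for rule in rules)
    return f"{header}\n{schema_text}\n\n{rules_text}\n"


prompt = build_extraction_prompt(ENTITY_TEMPLATE, RULES)
print(prompt)
```

Adding or removing an entity then means changing one dictionary instead of editing a long string by hand.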

In [7]:
# Download a PDF from Google Cloud Storage
# ! gsutil cp "gs://cloud-samples-data/generative-ai/pdf/invoice.pdf" ./invoice.pdf

In [8]:
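# Download the sample PDF from a public site
import requests

url = "https://www.med.unc.edu/medclerk/wp-content/uploads/sites/877/2018/10/hp1.pdf"
filename = "./hp1.pdf"

# Fail fast on network errors or non-2xx responses instead of saving an error page
response = requests.get(url, timeout=30)
response.raise_for_status()

with open(filename, "wb") as f:
    f.write(response.content)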

In [9]:
# Load the downloaded file as bytes and wrap it in a Part
with open(filename, "rb") as f:
    file_part = Part.from_data(data=f.read(), mime_type=PDF_MIME_TYPE)

# Load contents
contents = [file_part, extraction_prompt]

# Send to Gemini with GenerationConfig
response = model.generate_content(contents, generation_config=generation_config)

In [10]:
# print("-------Prompt--------")
# print_multimodal_prompt(contents)

# print("\n-------Raw Response--------")
# print(response.text)

This response can then be parsed as JSON into a Python dictionary for use in other applications.

In [11]:
print("\n-------Parsed Entities--------")
json_object = json.loads(response.text)
print(json.dumps(json_object, indent=4))


-------Parsed Entities--------
{
    "chief_complaint": "swelling of tongue and difficulty breathing and swallowing",
    "medications": [
        {
            "dose": "600 mg",
            "description": "bronchodilator by increasing cAMP used for treating asthma",
            "medication": "Theophyline (Uniphyl)",
            "quantity": "qhs"
        },
        {
            "dose": "300 mg",
            "description": "Ca channel blocker used to control hypertension",
            "medication": "Diltiazem",
            "quantity": "qhs"
        },
        {
            "dose": "20 mg",
            "description": "HMGCo Reductase inhibitor for hypercholesterolemia",
            "medication": "Simvistatin (Zocor)",
            "quantity": "qhs"
        },
        {
            "dose": "10 mg",
            "description": "ACEI for hypertension and diabetes for renal protective effect",
            "medication": "Ramipril (Altace)",
            "quantity": "BID"
        },
        {
 

You can see that Gemini extracted all of the relevant fields from the document.
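
Once parsed, the result is an ordinary Python dictionary, so you can traverse it like any other. The sketch below uses an abbreviated sample shaped like the output above (only the first two medications) rather than a live model response, and the helper name `medication_rows` is hypothetical. Because the prompt allows `null` for entities that are not found, the code treats `None` as "no value".

```python
import json

# Abbreviated sample shaped like the parsed response above (first two medications only).
sample_response_text = """
{
    "chief_complaint": "swelling of tongue and difficulty breathing and swallowing",
    "medications": [
        {
            "dose": "600 mg",
            "description": "bronchodilator by increasing cAMP used for treating asthma",
            "medication": "Theophyline (Uniphyl)",
            "quantity": "qhs"
        },
        {
            "dose": "300 mg",
            "description": "Ca channel blocker used to control hypertension",
            "medication": "Diltiazem",
            "quantity": "qhs"
        }
    ]
}
"""

entities = json.loads(sample_response_text)


def medication_rows(entities: dict) -> list[str]:
    """Format each extracted medication as a one-line summary."""
    rows = []
    # The prompt allows null for missing entities, so treat None as an empty list.
    for med in entities.get("medications") or []:
        name = med.get("medication") or "unknown"
        dose = med.get("dose") or "n/a"
        schedule = med.get("quantity") or "n/a"
        rows.append(f"{name}: {dose} ({schedule})")
    return rows


for row in medication_rows(entities):
    print(row)
```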

### Create a sample form to populate with extracted details

In [54]:
import markdown
import pdfkit

filename = "example-patient-form-text.md"
outfilename = "example-patient-form.pdf"

with open(filename, "r") as file:
    markdown_text = file.read()

# Convert Markdown to HTML using the `markdown` library
html_text = markdown.markdown(markdown_text, extensions=['tables'])  # Enable the 'tables' extension

# Add CSS for table styling
css = """
<style>
table {
    border-collapse: collapse;
    width: 100%;
}

th, td {
    border: 1px solid #ddd;
    padding: 8px;
    text-align: left;
}

th {
    background-color: #f2f2f2;   
}
</style>
"""

# Combine HTML and CSS
full_html = f"<html><head>{css}</head><body>{html_text}</body></html>"

# Convert HTML to PDF (pdfkit requires the wkhtmltopdf binary to be installed)
pdfkit.from_string(full_html, outfilename)

True

In [57]:
from pypdf import PdfReader

reader = PdfReader(outfilename)

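# List the PDF's interactive form fields. get_fields() returns None when the
# PDF has none, which is the case for this HTML-rendered PDF.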
fields = reader.get_fields()
print(fields)

None


### Fill out the form with extracted details

References:

- https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/document-understanding
- https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output
- https://pypdf.readthedocs.io/en/latest/user/forms.html#filling-out-forms

### Validate the extracted object



In [26]:
# TODO: validate that json_object matches the expected schema
# json_object
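
One possible way to approach this validation step, sketched with only the standard library: check that the parsed object has the keys and list-item fields requested in the extraction prompt, and collect any problems instead of raising on the first one. The names `EXPECTED_LIST_FIELDS` and `validate_entities` are hypothetical, and this is an illustrative shape check, not a full JSON Schema validator.

```python
# Hypothetical: list-valued entities and the keys each list item should carry,
# taken from the template in the extraction prompt.
EXPECTED_LIST_FIELDS = {
    "medications": {"dose", "description", "medication", "quantity"},
    "review_of_symptoms": {"symptom", "description"},
}


def validate_entities(obj: object) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(obj, dict):
        return ["top-level value is not a JSON object"]

    problems = []

    # chief_complaint must be present and be a string or null.
    if "chief_complaint" not in obj:
        problems.append("missing key: chief_complaint")
    elif obj["chief_complaint"] is not None and not isinstance(
        obj["chief_complaint"], str
    ):
        problems.append("chief_complaint must be a string or null")

    for field, item_keys in EXPECTED_LIST_FIELDS.items():
        if field not in obj:
            problems.append(f"missing key: {field}")
            continue
        items = obj[field]
        if items is None:
            continue  # null is allowed by the prompt for entities not found
        if not isinstance(items, list):
            problems.append(f"{field} must be a list or null")
            continue
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                problems.append(f"{field}[{i}] is not an object")
                continue
            missing = item_keys - set(item)
            if missing:
                problems.append(f"{field}[{i}] is missing {sorted(missing)}")
    return problems


# Usage with a deliberately malformed object
print(validate_entities({"chief_complaint": 3, "medications": [{"dose": "1"}]}))
```

In the notebook you would call `validate_entities(json_object)` and proceed to form filling only when the returned list is empty.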

### Complete the form

In [58]:
# from pypdf import PdfReader, PdfWriter

# reader = PdfReader("form.pdf")
# writer = PdfWriter()

# page = reader.pages[0]
# fields = reader.get_fields()

# writer.append(reader)

# writer.update_page_form_field_values(
#     writer.pages[0],
#     {"fieldname": "some filled in text"},  # <- Gemini output goes here
#     auto_regenerate=False,
# )

# with open("filled-out.pdf", "wb") as output_stream:
#     writer.write(output_stream)
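
Form-filling calls such as the one outlined above take a flat mapping of field names to strings, while the extracted entities are nested. As a minimal sketch, the nested dictionary could be flattened first. The field naming scheme used here (`chief_complaint`, `medication_1_dose`, and so on) is purely an assumption for illustration; a real form defines its own field names, and the helper `to_form_fields` is hypothetical.

```python
def to_form_fields(entities: dict) -> dict[str, str]:
    """Flatten extracted entities into a {field_name: text} mapping.

    The field names are hypothetical; adapt them to the fields your PDF defines.
    """
    fields = {"chief_complaint": entities.get("chief_complaint") or ""}
    # Number medications from 1 so the names read naturally on a form.
    for i, med in enumerate(entities.get("medications") or [], start=1):
        for key in ("medication", "dose", "quantity"):
            fields[f"medication_{i}_{key}"] = med.get(key) or ""
    return fields


# Usage with a small sample shaped like the parsed response
sample = {
    "chief_complaint": "swelling of tongue and difficulty breathing and swallowing",
    "medications": [
        {"dose": "600 mg", "medication": "Theophyline (Uniphyl)", "quantity": "qhs"}
    ],
}
print(to_form_fields(sample))
```

The resulting dictionary is the kind of value you would pass in place of the `{"fieldname": ...}` placeholder once the form has matching fields.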