## 📚 Prerequisites

Before executing this notebook, make sure you have properly set up your Azure Services, created your Conda environment, and configured your environment variables as per the instructions provided in the [README.md](README.md) file.

The PDFs used during this case study are publicly available:

- [Fisher EWD/EWS/EWT Valves through NPS 12x8 Instruction Manual](https://www.emerson.com/documents/automation/instruction-manual-fisher-ewd-ews-ewt-valves-through-nps-12x8-en-124788.pdf)
- [Fieldvue DVC6200 HW2 Digital Valve Controller Instruction Manual](https://www.emerson.com/documents/automation/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf)

These documents were chosen due to their complexity, particularly in terms of tables and interpretation of graphs. To follow along with this notebook, please download these files and upload them to your blob storage container.

## 📋 Table of Contents

This notebook guides you through the following sections:

1. [**Optical Character Recognition (OCR) with Azure AI Document Intelligence**](#optical-character-recognition-ocr-with-azure-ai-document-intelligence): Overview of Azure's Document Analysis Client and its pre-trained models for document analysis.

2. [**Understanding Data Extracted from the Layout Model**](#understanding-data-extracted-from-the-layout-model): Insights into the data extracted from the layout model.
    - [**Custom Logic for Processing Extracted Information**](#custom-logic-for-processing-extracted-information): Discusses the need for custom logic to process the extracted information based on specific use cases and requirements.
    - [**Leveraging LangChain Integration**](#leveraging-langchain-integration): Explanation of how Retrieval-Augmented Generation (RAG) works with a pretrained Large Language Model (LLM) and an external data retrieval system for dynamic interaction with documents and content generation.

In [1]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbbai-azure-ai-document-intelligence"  # change your directory here

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-azure-ai-document-intelligence


In [2]:
from src.ocr.document_intelligence import AzureDocumentIntelligenceManager

document_intelligence_client = AzureDocumentIntelligenceManager()

## Optical Character Recognition (OCR) with Azure AI Document Intelligence

Azure's Document Analysis Client provides a variety of pre-trained models that can be used to analyze documents. The `analyze_document` function takes in a document (either a URL or a file path) and the type of pre-trained model to use for analysis.

Here's a brief overview of the available pre-trained models:

- `'prebuilt-layout'`: This is the default model. It extracts text, tables, selection marks, and structure elements from the document.

- `'prebuilt-document'`: This model is used for generic document understanding.

- `'prebuilt-read'`: This model extracts both print and handwritten text.

- `'prebuilt-tax'`: This model is designed to process US tax documents.

- `'prebuilt-invoice'`: This model automates the processing of invoices.

- `'prebuilt-receipt'`: This model scans sales receipts for key data.

- `'prebuilt-id'`: This model processes identity documents.

- `'prebuilt-businesscard'`: This model extracts information from business cards.

- `'prebuilt-contract'`: This model analyzes contractual agreements.

- `'prebuilt-healthinsurancecard'`: This model processes health insurance cards.

In addition to these pre-trained models, Azure also offers custom and composed models. For more details, refer to the [Azure Document Intelligence Model Overview](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-model-overview?view=doc-intel-4.0.0).

The `analyze_document` function returns the analysis result, which can then be used for further processing or analysis.
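
Because `analyze_document` accepts a public URL, a blob URL, or a local path, the input has to be routed to the right loader. The helper below is a minimal sketch of how that routing could look; `classify_document_input` is a hypothetical name, and the actual logic inside `AzureDocumentIntelligenceManager` may differ.

```python
from urllib.parse import urlparse


def classify_document_input(document_input: str) -> str:
    """Return 'blob', 'url', or 'local' for a document reference.

    Hypothetical helper for illustration; not part of the repository's wrapper.
    """
    parsed = urlparse(document_input)
    if parsed.scheme in ("http", "https"):
        # Azure Blob Storage endpoints follow the *.blob.core.windows.net host pattern.
        if parsed.netloc.endswith(".blob.core.windows.net"):
            return "blob"
        return "url"
    # Anything else (relative paths, Windows drive paths) is treated as a local file.
    return "local"


print(classify_document_input("https://example.blob.core.windows.net/container/manual.pdf"))
print(classify_document_input("https://www.example.com/manual.pdf"))
print(classify_document_input("utils/data/manual.pdf"))
```

Checking the host name rather than the full URL keeps the sketch independent of container and file names.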

In [3]:
# We will begin with the Fieldvue DVC6200 HW2 Digital Valve Controller Instruction Manual,
# which can be found at the following URL:
# https://www.emerson.com/documents/automation/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf
# We will use the 'prebuilt-layout' model for this task. This is the default model provided by Azure's Document Analysis Client,
# and it is capable of extracting text, tables, selection marks, and structure elements from the document.
# One of the latest features is the ability to extract content in a specific format, such as markdown.

document_url = "https://www.emerson.com/documents/automation/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf"
document_blob = "https://testeastusdev001.blob.core.windows.net/customskillspdf/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf"
document_local = os.path.join(os.getcwd(), "utils", "data", "instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf")
model_type = "prebuilt-layout"

result_ocr = document_intelligence_client.analyze_document(
    document_input=document_blob,
    model_type=model_type,
    output_format='markdown',
    features=["OCR_HIGH_RESOLUTION"]
)

2024-01-29 14:13:59,752 - micro - MainProcess - INFO     Blob URL detected. Extracting content. (document_intelligence.py:analyze_document:145)


2024-01-29 14:14:00,831 - micro - MainProcess - INFO     Successfully downloaded blob file instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (blob_data_extractor.py:extract_content:88)


In [7]:
section_headings = [paragraph.content for paragraph in result_ocr.paragraphs if paragraph.role == "sectionHeading"]

In [29]:
# Retrieving pages from the document
result_ocr.pages

[{'pageNumber': 1, 'angle': 0, 'width': 8.5, 'height': 11, 'unit': 'inch', 'words': [{'content': 'DVC6200', 'polygon': [5.614, 0.7356, 6.1833, 0.7318, 6.1827, 0.8762, 5.614, 0.8667], 'confidence': 0.994, 'span': {'offset': 17, 'length': 7}}, {'content': 'Digital', 'polygon': [6.2218, 0.7316, 6.6483, 0.7301, 6.6472, 0.8765, 6.2212, 0.8766], 'confidence': 0.994, 'span': {'offset': 25, 'length': 7}}, {'content': 'Valve', 'polygon': [6.6755, 0.73, 7.018, 0.7295, 7.0166, 0.8723, 6.6744, 0.8764], 'confidence': 0.998, 'span': {'offset': 33, 'length': 5}}, {'content': 'Controller', 'polygon': [7.0475, 0.7294, 7.7007, 0.7302, 7.6986, 0.8537, 7.046, 0.8719], 'confidence': 0.994, 'span': {'offset': 39, 'length': 10}}, {'content': 'December', 'polygon': [6.9239, 0.9178, 7.4304, 0.9169, 7.4283, 1.0289, 6.9218, 1.0326], 'confidence': 0.995, 'span': {'offset': 50, 'length': 8}}, {'content': '2022', 'polygon': [7.4527, 0.9166, 7.7041, 0.9124, 7.702, 1.0317, 7.4506, 1.029], 'confidence': 0.988, 'span':

In [30]:
# Retrieving tables from the document
result_ocr.tables

[{'rowCount': 6, 'columnCount': 2, 'cells': [{'rowIndex': 0, 'columnIndex': 0, 'content': 'Instrument Level', 'boundingRegions': [{'pageNumber': 1, 'polygon': [0.84, 2.5867, 2.0967, 2.5867, 2.0967, 2.7633, 0.8367, 2.7633]}], 'spans': [{'offset': 220, 'length': 16}]}, {'rowIndex': 0, 'columnIndex': 1, 'content': 'HC, AD, PD, ODV', 'boundingRegions': [{'pageNumber': 1, 'polygon': [2.0967, 2.5867, 3.2833, 2.5867, 3.2867, 2.7633, 2.0967, 2.7633]}], 'spans': [{'offset': 239, 'length': 15}]}, {'rowIndex': 1, 'columnIndex': 0, 'content': 'Device Type', 'boundingRegions': [{'pageNumber': 1, 'polygon': [0.8367, 2.7633, 2.0967, 2.7633, 2.0967, 2.9267, 0.8367, 2.9267]}], 'spans': [{'offset': 259, 'length': 11}]}, {'rowIndex': 1, 'columnIndex': 1, 'content': '1309', 'boundingRegions': [{'pageNumber': 1, 'polygon': [2.0967, 2.7633, 3.2867, 2.7633, 3.2867, 2.9267, 2.0967, 2.9267]}], 'spans': [{'offset': 273, 'length': 4}]}, {'rowIndex': 2, 'columnIndex': 0, 'content': 'Hardware Revision', 'boundingR

In [31]:
print(result_ocr.content)

<!-- PageHeader="DVC6200 Digital Valve Controller December 2022" -->

<!-- PageHeader="Instruction Manual D103605X012" -->

Fisher™ FIELDVUE™ DVC6200 Digital Valve Controller
===

This manual applies to

|||
| - | - |
| Instrument Level | HC, AD, PD, ODV |
| Device Type | 1309 |
| Hardware Revision | 2 |
| Firmware Revision | 7 |
| Device Revision | 1 3 |
| DD Revision | 7 1 |


# Contents

Section 1 Introduction

3

Installation, Pneumatic and Electrical Connections,

and Initial Configuration

3

Scope of Manual

3

Conventions Used in this Manual

3

Description

3

Specifications

5

Related Documents

5

HART Filter

9

Voltage Available

9

Compliance Voltage

11

Auxiliary Terminal Wiring Length Guidelines

12

Maximum Cable Capacitance

12

Installation in Conjunction with a Rosemount ™

333 HART Tri\-Loop ™ HART\-to\-Analog

Signal Converter

13

Section 3 Configuration

15

Guided Setup

15

Manual Setup

15

Mode and Protection

16

Instrument Mode

16

Write Protection

16


The layout model does a good job of extracting the elements of the document, including paragraph text and tables, but we will need extra logic to extract meaningful text to feed to our LLM later. OCR is also available through the Azure OpenAI service: the GPT-4 Turbo with Vision model lets you chat with an AI assistant that can analyze the images you share, and the Vision Enhancement option uses Image Analysis to give the assistant more details about the image (readable text and object locations). For more information, see the GPT-4 Turbo with Vision quickstart.
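
As one example of such extra logic, the markdown output contains page-header comments, selection marks, and backslash-escaped punctuation that add little value for an LLM. The sketch below removes them with regular expressions. `clean_markdown` is a hypothetical helper, the patterns are based only on the output printed in this notebook, and `:unselected:` is included on the assumption that the service emits it alongside `:selected:`.

```python
import re

# Patterns based on the markdown output shown in this notebook.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
SELECTION_MARK = re.compile(r":(?:un)?selected:")  # ':unselected:' is an assumption
MARKDOWN_ESCAPE = re.compile(r"\\([\-.()])")


def clean_markdown(text: str) -> str:
    """Drop page-header comments and selection marks, then unescape punctuation."""
    text = HTML_COMMENT.sub("", text)
    text = SELECTION_MARK.sub("", text)
    text = MARKDOWN_ESCAPE.sub(r"\1", text)
    # Collapse the runs of blank lines left behind by the removals.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = (
    '<!-- PageHeader="Instruction Manual D103605X012" -->\n\n'
    "3 MAX\\. ALLOWABLE TRAVEL\n"
    " :selected:\n"
)
print(clean_markdown(sample))
```

Whether page headers should be dropped or kept as metadata depends on the use case, so treat the pattern list as a starting point.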

## Understanding Data Extracted from the Layout Model

### Custom Logic for Processing Extracted Information

To make sense of the extracted information and prepare it for indexing into Azure AI Search, we need to implement custom logic based on specific use cases and requirements. The following simple example filters the tables found on page 33 and prints each cell with its position, a first step toward transforming the data into a format such as CSV. This logic should be customized to accommodate the specific use case.

In [6]:
# Filtering out the relevant table and cells on page 33
tables_on_page_33 = [
    table
    for table in result_ocr.tables
    if any(region.page_number == 33 for region in table.bounding_regions)
]

for table_idx, table in enumerate(tables_on_page_33):
    regions_on_page_33 = [
        region for region in table.bounding_regions if region.page_number == 33
    ]
    for region in regions_on_page_33:
        print(
            f"Table # {table_idx} on Page: {region.page_number}, Location: {region.polygon}"
        )

        # Processing cells that are on page 33
        cells_on_page_33 = [
            cell
            for cell in table.cells
            if any(region.page_number == 33 for region in cell.bounding_regions)
        ]
        for cell in cells_on_page_33:
            cell_text = f"Cell[{cell.row_index}][{cell.column_index}]: '{cell.content}'"
            cell_polygons = [
                region.polygon
                for region in cell.bounding_regions
                if region.page_number == 33
            ]
            print(f"...{cell_text} located within bounding polygons: {cell_polygons}")

print("----------------------------------------")

Table # 0 on Page: 33, Location: [0.9037, 1.6301, 3.9986, 1.6291, 3.9993, 2.5992, 0.9037, 2.6006]
...Cell[0][0]: 'Relay Type' located within bounding polygons: [[0.9067, 1.6333, 2.1533, 1.6333, 2.15, 1.8633, 0.9067, 1.8633]]
...Cell[0][1]: 'Pressure Signal' located within bounding polygons: [[2.1533, 1.6333, 3.7267, 1.6333, 3.7267, 1.86, 2.15, 1.8633]]
...Cell[1][0]: 'A or C' located within bounding polygons: [[0.9067, 1.8633, 2.15, 1.8633, 2.15, 2.0333, 0.9067, 2.0333]]
...Cell[1][1]: 'Port A \- Port B' located within bounding polygons: [[2.15, 1.8633, 3.7267, 1.86, 3.7267, 2.0333, 2.15, 2.0333]]
...Cell[2][0]: 'B' located within bounding polygons: [[0.9067, 2.0333, 2.15, 2.0333, 2.15, 2.22, 0.9067, 2.22]]
...Cell[2][1]: 'Port B \- Port A' located within bounding polygons: [[2.15, 2.0333, 3.7267, 2.0333, 3.7267, 2.22, 2.15, 2.22]]
...Cell[3][0]: 'B Special App\.' located within bounding polygons: [[0.9067, 2.22, 2.15, 2.22, 2.15, 2.4167, 0.9067, 2.4167]]
...Cell[3][1]: 'Port B' locate
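
One possible way to turn such cells into CSV is to place each cell in a row/column grid and serialize the grid. The sketch below uses plain dictionaries with the `rowIndex`, `columnIndex`, and `content` keys seen in the printed `result_ocr.tables` output; with SDK objects, the attributes `row_index`, `column_index`, and `content` used in the cell above would take their place. `cells_to_csv` is a hypothetical helper and the sample rows are copied from the printed output.

```python
import csv
import io


def cells_to_csv(cells, row_count, column_count):
    """Place each cell's content in a row/column grid and serialize it as CSV."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell["content"]
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()


# Sample cells shaped like the table output printed above.
cells = [
    {"rowIndex": 0, "columnIndex": 0, "content": "Relay Type"},
    {"rowIndex": 0, "columnIndex": 1, "content": "Pressure Signal"},
    {"rowIndex": 1, "columnIndex": 0, "content": "A or C"},
    {"rowIndex": 1, "columnIndex": 1, "content": "Port A - Port B"},
]
print(cells_to_csv(cells, row_count=2, column_count=2))
```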

More sophisticated logic needs to be developed to effectively extract information from complex and nested tables. This will involve handling various table structures and layouts, as well as dealing with merged cells and other complexities. The goal is to ensure accurate and reliable extraction of data, regardless of the complexity of the table.
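
As a starting point for merged cells, the sketch below copies the content of a spanning cell into every grid position it covers, so each row keeps its full context when flattened. It assumes cells are dictionaries that may carry `rowSpan` and `columnSpan` keys (defaulting to 1 when absent). `expand_spans` is a hypothetical helper and the sample table is made up for illustration.

```python
def expand_spans(cells, row_count, column_count):
    """Fill a grid, repeating a merged cell's content across the area it spans."""
    grid = [[None] * column_count for _ in range(row_count)]
    for cell in cells:
        row_span = cell.get("rowSpan", 1)        # assumed key; 1 when the cell is not merged
        column_span = cell.get("columnSpan", 1)  # assumed key; 1 when the cell is not merged
        last_row = min(cell["rowIndex"] + row_span, row_count)
        last_column = min(cell["columnIndex"] + column_span, column_count)
        for row in range(cell["rowIndex"], last_row):
            for column in range(cell["columnIndex"], last_column):
                grid[row][column] = cell["content"]
    return grid


# Made-up table: a header cell spanning two columns above two ordinary cells.
merged_cells = [
    {"rowIndex": 0, "columnIndex": 0, "columnSpan": 2, "content": "Relay"},
    {"rowIndex": 1, "columnIndex": 0, "content": "A"},
    {"rowIndex": 1, "columnIndex": 1, "content": "B"},
]
print(expand_spans(merged_cells, row_count=2, column_count=2))
```

Repeating the content is only one policy; leaving covered positions empty or joining parent and child headers may suit other tables better.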

### Leveraging LangChain Integration 

Retrieval-Augmented Generation (RAG) combines a pretrained Large Language Model (LLM) with an external data retrieval system. This allows for dynamic interaction with documents and content generation using Azure OpenAI models.

Key Features:

- **Semantic Chunking**: Splits text into coherent segments for better RAG responses. Both fixed-size and semantic chunking are supported.

- **Layout Model Integration**: The Layout model from Document Intelligence assists in semantic chunking, providing a solution for content extraction and document structure analysis. It's scalable, multilingual, and compatible with LLMs.

- **Document Chat with Semantic Chunking**: Enabled by Azure OpenAI and LangChain integration, this feature allows for custom chunking strategies and effective document parsing.

For more details, visit the [Microsoft Learn page on Retrieval-Augmented Generation with Azure AI Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-retrieval-augumented-generation?view=doc-intel-4.0.0).
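
To make the contrast with semantic chunking concrete, the sketch below shows a minimal fixed-size chunker with overlap. It cuts purely by character count and ignores document structure, which is exactly the limitation that header-based splitting addresses. `fixed_size_chunks` and its parameters are hypothetical names chosen for illustration.

```python
def fixed_size_chunks(text, chunk_size=200, overlap=50):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last window is not just a repeat of the overlap.
    stop = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, stop, step)]


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated text in the index.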

In [9]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [10]:
docs = document_intelligence_client.load(result_ocr)

In [11]:
# Split the document into chunks based on markdown headers.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

docs_string = docs[0].page_content
splits = text_splitter.split_text(docs_string)
splits

[Document(page_content='<!-- PageHeader="DVC6200 Digital Valve Controller December 2022" -->  \n<!-- PageHeader="Instruction Manual D103605X012" -->  \nFisher™ FIELDVUE™ DVC6200 Digital Valve Controller\n===  \nThis manual applies to  \n|||\n| - | - |\n| Instrument Level | HC, AD, PD, ODV |\n| Device Type | 1309 |\n| Hardware Revision | 2 |\n| Firmware Revision | 7 |\n| Device Revision | 1 3 |\n| DD Revision | 7 1 |'),
 Document(page_content='Section 1 Introduction  \n3  \nInstallation, Pneumatic and Electrical Connections,  \nand Initial Configuration  \n3  \nScope of Manual  \n3  \nConventions Used in this Manual  \n3  \nDescription  \n3  \nSpecifications  \n5  \nRelated Documents  \n5  \nHART Filter  \n9  \nVoltage Available  \n9  \nCompliance Voltage  \n11  \nAuxiliary Terminal Wiring Length Guidelines  \n12  \nMaximum Cable Capacitance  \n12  \nInstallation in Conjunction with a Rosemount ™  \n333 HART Tri\\-Loop ™ HART\\-to\\-Analog  \nSignal Converter  \n13  \nSection 3 Configur
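
In the output above, the document title underlined with `===` stays inside the first chunk as plain text, which suggests that only `#`-style headers are picked up with this configuration. One possible workaround, sketched below under that assumption, is to rewrite such underlined titles as `#` headers before splitting. `setext_to_atx` is a hypothetical helper and handles only the `===` form seen in this document.

```python
import re

# A non-empty line followed by a line made only of '=' characters.
SETEXT_H1 = re.compile(r"^([^\n]+)\n=+[ \t]*$", re.MULTILINE)


def setext_to_atx(markdown: str) -> str:
    """Rewrite 'Title' + '===' underlines as '# Title' headers."""
    return SETEXT_H1.sub(lambda match: f"# {match.group(1).strip()}", markdown)


sample_markdown = (
    "Fisher™ FIELDVUE™ DVC6200 Digital Valve Controller\n"
    "===\n\n"
    "This manual applies to\n"
)
print(setext_to_atx(sample_markdown))
```

The converted string can then be passed to `text_splitter.split_text` in place of `docs_string`.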

In [15]:
splits

[Document(page_content='<!-- PageHeader="Instruction Manual D103605X012" -->  \nConfiguration December 2022'),
 Document(page_content='TRAVEL  \nTRAVEL  \nNORMAL  \nNORMAL  \nOUTGOING RAMP RATE  \nINCOMING RAMP RATE  \nREDUCED PST TIME  \nRETURN LEAD  \n1 2  \n1 2  \nRETURN LEAD  \n3  \nTIME  \nBREAKOUT TIMEOUT  \nEARLY  \nPAUSE TIME  \nTIME TURNAROUND  \nSHORT DURATION PST DISABLED  \nSHORT DURATION PST ENABLED\n:selected:\n1 MINIMUM TRAVEL MOVEMENT :selected: 2 TRAVEL TARGET MOVEMENT  \n3 MAX\\. ALLOWABLE TRAVEL  \nOutgoing Ramp Rate is the rate at which the valve will move during the Outgoing stroke of the Partial Stroke test\\. The default value is 0\\.25%/second\\.  \nIncoming Ramp Rate is the rate at which the valve will move during the Incoming stroke of the Partial Stroke test\\. The default value is 0\\.25%/second\\.  \nReturn Lead defines the percent \\(%\\) change in setpoint to overcome the hysteresis in the valve assembly\\. The error between setpoint and actual error is a

In [16]:
print(docs_string)

<!-- PageHeader="Instruction Manual D103605X012" -->

Configuration December 2022


# Figure 3\-5\. Time Series Representation of Short Duration PST

TRAVEL

TRAVEL

NORMAL

NORMAL

OUTGOING RAMP RATE

INCOMING RAMP RATE

REDUCED PST TIME

RETURN LEAD

1 2

1 2

RETURN LEAD

3

TIME

BREAKOUT TIMEOUT

EARLY

PAUSE TIME

TIME TURNAROUND

SHORT DURATION PST DISABLED

SHORT DURATION PST ENABLED
 :selected:
1 MINIMUM TRAVEL MOVEMENT :selected: 2 TRAVEL TARGET MOVEMENT

3 MAX\. ALLOWABLE TRAVEL

Outgoing Ramp Rate is the rate at which the valve will move during the Outgoing stroke of the Partial Stroke test\. The default value is 0\.25%/second\.

Incoming Ramp Rate is the rate at which the valve will move during the Incoming stroke of the Partial Stroke test\. The default value is 0\.25%/second\.

Return Lead defines the percent \(%\) change in setpoint to overcome the hysteresis in the valve assembly\. The error between setpoint and actual error is added to this percent change\. For exampl