## 📚 Prerequisites

Ensure that your Azure Services are properly set up, your Conda environment is created, and your environment variables are configured as per the instructions in the [SETTINGS.md](SETTINGS.md) file.

The PDFs used during this case study are publicly available:

- [Fisher EWD/EWS/EWT Valves through NPS 12x8 Instruction Manual](https://www.emerson.com/documents/automation/instruction-manual-fisher-ewd-ews-ewt-valves-through-nps-12x8-en-124788.pdf)
- [Fieldvue DVC6200 HW2 Digital Valve Controller Instruction Manual](https://www.emerson.com/documents/automation/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf)

These documents were chosen because they are complex: they contain dense tables and graphs that require interpretation. To follow along with this notebook, download both files and upload them to your blob storage container.

## 📋 Table of Contents

This notebook guides you through the following sections:

1. [**Optical Character Recognition (OCR) with Azure AI Document Intelligence**](#optical-character-recognition-ocr-with-azure-ai-document-intelligence): Overview of Azure's Document Analysis Client and its pre-trained models for document analysis.

2. [**Understanding Data Extracted from the Layout Model**](#understanding-data-extracted-from-the-layout-model): Insights into the data extracted from the layout model.
    - [**Custom Logic for Processing Extracted Information**](#custom-logic-for-processing-extracted-information): Discusses the need for custom logic to process the extracted information based on specific use cases and requirements.
    - [**Leveraging LangChain Integration**](#leveraging-langchain-integration): Explanation of how Retrieval-Augmented Generation (RAG) works with a pretrained Large Language Model (LLM) and an external data retrieval system for dynamic interaction with documents and content generation.

In [1]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbbai-azure-ai-document-intelligence"  # change this to your local repository path

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-azure-ai-document-intelligence


In [2]:
from src.ocr.document_intelligence import AzureDocumentIntelligenceManager

document_intelligence_client = AzureDocumentIntelligenceManager()

## Optical Character Recognition (OCR) with Azure AI Document Intelligence

Azure's Document Analysis Client provides a variety of pre-trained models that can be used to analyze documents. The `analyze_document` function takes in a document (either a URL or a file path) and the type of pre-trained model to use for analysis.

Here's a brief overview of the available pre-trained models:

- `'prebuilt-layout'`: This is the default model. It extracts text, tables, selection marks, and structure elements from the document.

- `'prebuilt-document'`: This model is used for generic document understanding.

- `'prebuilt-read'`: This model extracts both print and handwritten text.

- `'prebuilt-tax'`: This model is designed to process US tax documents.

- `'prebuilt-invoice'`: This model automates the processing of invoices.

- `'prebuilt-receipt'`: This model scans sales receipts for key data.

- `'prebuilt-id'`: This model processes identity documents.

- `'prebuilt-businesscard'`: This model extracts information from business cards.

- `'prebuilt-contract'`: This model analyzes contractual agreements.

- `'prebuilt-healthinsurancecard'`: This model processes health insurance cards.

In addition to these pre-trained models, Azure also offers custom and composed models. For more details, refer to the [Azure Document Intelligence Model Overview](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-model-overview?view=doc-intel-4.0.0).

The `analyze_document` function returns the analysis result, which can then be used for further processing or analysis.
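
Because `analyze_document` accepts a public URL, a blob URL, or a local file path, it can help to see how such inputs might be told apart. The following is a minimal sketch only: `classify_document_input` is a hypothetical helper written for illustration, not the logic used inside `AzureDocumentIntelligenceManager`, and it assumes blob URLs use the standard `*.blob.core.windows.net` host.

```python
from urllib.parse import urlparse


def classify_document_input(document_input: str) -> str:
    """Return 'blob', 'url', or 'local' for a document reference.

    Hypothetical helper for illustration; the repository's wrapper
    may detect input types differently.
    """
    parsed = urlparse(document_input)
    if parsed.scheme in ("http", "https"):
        # Assumption: Azure Blob Storage endpoints end with this host suffix.
        if parsed.netloc.endswith(".blob.core.windows.net"):
            return "blob"
        return "url"
    # Anything else (relative paths, Windows drive paths) is treated as a local file.
    return "local"


print(classify_document_input("https://www.emerson.com/documents/automation/manual.pdf"))
print(classify_document_input("https://account.blob.core.windows.net/container/manual.pdf"))
print(classify_document_input(r"C:\data\manual.pdf"))
```

The URLs and paths in the usage lines are placeholders. A Windows path such as `C:\data\manual.pdf` parses with the scheme `c`, which is why the check is restricted to `http` and `https`.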

In [3]:
# The Fisher EWD/EWS/EWT Valves through NPS 12x8 Instruction Manual is publicly available at:
# https://www.emerson.com/documents/automation/instruction-manual-fisher-ewd-ews-ewt-valves-through-nps-12x8-en-124788.pdf
# The three variables below show the supported input styles: a public URL, a blob URL, and a local file path.
# Note that the call below passes `document_blob`, so the output reflects the file stored in that blob container.
# We will use the 'prebuilt-layout' model for this task. This is the default model provided by Azure's Document Analysis Client,
# and it is capable of extracting text, tables, selection marks, and structure elements from the document.
# One of the latest features is the ability to extract content in a specific format, such as Markdown.

document_url = "https://www.emerson.com/documents/automation/instruction-manual-fisher-ewd-ews-ewt-valves-through-nps-12x8-en-124788.pdf"
document_blob = "https://testeastusdev001.blob.core.windows.net/testretrieval/instruction-manual-fisher-et-eat-easy-e-valves-cl125-through-cl600-en-124782.pdf"
# Pass each path component separately so the path is built correctly on any OS
document_local = os.path.join(
    os.getcwd(),
    "utils",
    "data",
    "instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf",
)
model_type = "prebuilt-layout"

result_ocr = document_intelligence_client.analyze_document(
    document_input=document_blob,
    model_type=model_type,
    output_format="markdown",
    features=["OCR_HIGH_RESOLUTION"],
    pages="1-3",
)

2024-02-22 11:54:58,012 - micro - MainProcess - INFO     Blob URL detected. Extracting content. (document_intelligence.py:analyze_document:148)
2024-02-22 11:54:58,908 - micro - MainProcess - INFO     Successfully downloaded blob file instruction-manual-fisher-et-eat-easy-e-valves-cl125-through-cl600-en-124782.pdf (blob_data_extractor.py:extract_content:88)


ClientAuthenticationError: (401) Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.
Code: 401
Message: Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource.

In [4]:
section_headings = [
    paragraph.content
    for paragraph in result_ocr.paragraphs
    if paragraph.role == "sectionHeading"
]
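
The same `role` attribute can be used to bucket every paragraph, not only section headings. The sketch below is illustrative: `group_paragraphs_by_role` is a hypothetical helper, the `"body"` key is a placeholder chosen here for paragraphs without a role, and the sample objects are stand-ins that only mimic the `role` and `content` attributes of the real result.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_paragraphs_by_role(paragraphs):
    """Map each paragraph role to the list of paragraph texts with that role."""
    grouped = defaultdict(list)
    for paragraph in paragraphs:
        # Body text carries no role, so file it under a placeholder key.
        grouped[paragraph.role or "body"].append(paragraph.content)
    return dict(grouped)


# Stand-in paragraphs for illustration; with a real result, pass result_ocr.paragraphs.
sample_paragraphs = [
    SimpleNamespace(role="pageHeader", content="Instruction Manual D100398X012"),
    SimpleNamespace(role="sectionHeading", content="Contents"),
    SimpleNamespace(role=None, content="Introduction"),
]

grouped = group_paragraphs_by_role(sample_paragraphs)
print(grouped["sectionHeading"])
print(sorted(grouped))
```

Grouping once and reading from the dictionary avoids re-scanning the paragraph list for each role of interest.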

In [None]:
# Retrieving pages from the document
result_ocr.pages

In [None]:
# Retrieving tables from the document
result_ocr.tables
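
Each extracted table exposes `row_count`, `column_count`, and a flat list of `cells` carrying `row_index`, `column_index`, and `content`. A minimal sketch of rebuilding a row-by-column grid from that flat list follows. `table_to_grid` is a hypothetical helper, the sample table is a stand-in built from two rows of the manual's contents, and cells spanning several rows or columns are not handled.

```python
from types import SimpleNamespace


def table_to_grid(table):
    """Rebuild a 2D list of cell texts from a table's flat cell list."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        # Each cell records its own position; spans are ignored in this sketch.
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


# Stand-in table for illustration; with a real result, pass an item of result_ocr.tables.
sample_table = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="Introduction"),
        SimpleNamespace(row_index=0, column_index=1, content="1"),
        SimpleNamespace(row_index=1, column_index=0, content="Scope of Manual"),
        SimpleNamespace(row_index=1, column_index=1, content="1"),
    ],
)

grid = table_to_grid(sample_table)
for row in grid:
    print(row)
```

A grid in this shape can be handed to other tooling, for example a DataFrame constructor, once the header row has been identified.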

In [7]:
print(result_ocr.content)

<!-- PageHeader="Instruction Manual D100398X012" -->

<!-- PageHeader="ET Valve August 2021" -->

Fisher™ ET and EAT easy-e™ Valves CL125 through CL600
===


## Contents

|||
| - | - |
| Introduction | 1 |
| Scope of Manual | 1 |
| Description | 3 |
| Specifications | 3 |
| Educational Services | 3 |
| Installation | 4 |
| Maintenance | 5 |
| Packing Lubrication | 6 |
| Packing Maintenance | 6 |
| Replacing Packing | 7 |
| Trim Maintenance | 13 |
| Disassembly | 13 |
| Lapping Metal Seats | 15 |
| Valve Plug Maintenance | 16 |
| Assembly | 18 |
| ENVIRO-SEAL Bellows Seal Bonnet | 20 |
| Replacing a Plain or Extension Bonnet with an ENVIRO-SEAL Bellows Seal Bonnet (Stem/Bellows Assembly) | 20 |
| Replacement of an Installed ENVIRO-SEAL Bellows Seal Bonnet | |
| (Stem/Bellows Assembly) | 23 |
| Purging the ENVIRO-SEAL Bellows Seal Bonnet . | 25 |
| Parts Ordering | 25 |
| Parts Kits | 26 |
| Parts List | 31 |

<figure>

![](figures/0)

<!-- FigureContent="Figure 1. Fisher ET Control Valv