## 📚 Prerequisites

Before executing this notebook, make sure you have properly set up your Azure Services, created your Conda environment, and configured your environment variables as per the instructions provided in the [README.md](README.md) file.

The PDFs used during this case study are publicly available:

- [Fisher EWD/EWS/EWT Valves through NPS 12x8 Instruction Manual](https://www.emerson.com/documents/automation/instruction-manual-fisher-ewd-ews-ewt-valves-through-nps-12x8-en-124788.pdf)
- [Fieldvue DVC6200 HW2 Digital Valve Controller Instruction Manual](https://www.emerson.com/documents/automation/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf)

These documents were chosen due to their complexity, particularly in terms of tables and interpretation of graphs. To follow along with this notebook, please download these files and upload them to your blob storage container.

## 📋 Table of Contents

This notebook guides you through the following sections:

1. [**Optical Character Recognition (OCR) with Azure AI Document Intelligence**](#optical-character-recognition-ocr-with-azure-ai-document-intelligence): Overview of Azure's Document Analysis Client and its pre-trained models for document analysis.

2. [**Understanding Data Extracted from the Layout Model**](#understanding-data-extracted-from-the-layout-model): Insights into the data extracted from the layout model.
    - [**Custom Logic for Processing Extracted Information**](#custom-logic-for-processing-extracted-information): Discusses the need for custom logic to process the extracted information based on specific use cases and requirements.
    - [**Leveraging LangChain Integration**](#leveraging-langchain-integration): Explanation of how Retrieval-Augmented Generation (RAG) works with a pretrained Large Language Model (LLM) and an external data retrieval system for dynamic interaction with documents and content generation.

## Getting Started

#### Configure Environment Variables 

Before running this notebook, configure the environment variables listed below. Storing configuration in environment variables keeps secrets out of the source code and reduces the risk of accidentally committing sensitive data to version control.

Create a `.env` file in your project root (use the provided `.env.sample` as a template) and add the following variables:

```env
# Azure Document Intelligence API Configuration
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT='[Your Document Intelligence Endpoint]'
AZURE_DOCUMENT_INTELLIGENCE_KEY='[Your Document Intelligence Key]'
```

Replace the placeholders (e.g., `[Your Document Intelligence Endpoint]`) with your actual values.

- `AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT` and `AZURE_DOCUMENT_INTELLIGENCE_KEY` are used to configure the Azure Document Intelligence API.

> 📌 **Note**
> Do not commit the `.env` file to your version control system. Add it to your `.gitignore` file so that it is not tracked.
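
A quick check that both variables are visible to the Python process can catch configuration mistakes before any API call is made. The following is a minimal sketch, not part of the repository: the helper name `missing_env_vars` is hypothetical, and it assumes the `.env` file has already been loaded into the process environment by some other means.

```python
import os

# Variable names taken from the .env template above.
REQUIRED_VARS = (
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_DOCUMENT_INTELLIGENCE_KEY",
)


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names of required variables that are unset or blank.

    `environ` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

The check reports names only and never prints the values, so it is safe to leave in a shared notebook.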

#### Setting Up Conda Environment and Configuring VSCode for Jupyter Notebooks (Optional)

Follow these steps to create a Conda environment and set up your VSCode for running Jupyter Notebooks:

##### Create Conda Environment from the Repository

> Instructions for Windows users: 

1. **Create the Conda Environment**:
   - In your terminal or command line, navigate to the repository directory.
   - Execute the following command to create the Conda environment using the `environment.yaml` file:
     ```bash
     conda env create -f environment.yaml
     ```
   - This command creates a Conda environment as defined in `environment.yaml`.

2. **Activating the Environment**:
   - After creation, activate the new Conda environment by using:
     ```bash
     conda activate document-intelligence
     ```

> Instructions for Linux users (or Windows users with WSL or another Linux setup): 

1. **Use `make` to Create the Conda Environment**:
   - In your terminal or command line, navigate to the repository directory and look at the Makefile.
   - Execute the `make` command specified below to create the Conda environment using the `environment.yaml` file:
     ```bash
     make create_conda_env
     ```

2. **Activating the Environment**:
   - After creation, activate the new Conda environment by using:
     ```bash
     conda activate document-intelligence
     ```

##### Configure VSCode for Jupyter Notebooks

1. **Install Required Extensions**:
   - Download and install the `Python` and `Jupyter` extensions for VSCode. These extensions provide support for running and editing Jupyter Notebooks within VSCode.

2. **Open the Notebook**:
   - Open the Jupyter Notebook file (`01-ocr-gpt4v.ipynb`) in VSCode.

3. **Attach Kernel to VSCode**:
   - After creating the Conda environment, it should be available in the kernel selection dropdown. This dropdown is located in the top-right corner of the VSCode interface.
   - Select your newly created environment (`document-intelligence`) from the dropdown. This sets it as the kernel for running your Jupyter Notebooks.

4. **Run the Notebook**:
   - Once the kernel is attached, you can run the notebook by clicking on the "Run All" button in the top menu, or by running each cell individually.


By following these steps, you establish a dedicated Conda environment for your project and configure VSCode to run Jupyter Notebooks. The environment includes all the dependencies specified in your `environment.yaml` file. To add packages or change versions, run `pip install` in a notebook cell or in a terminal with the environment activated, then restart the kernel so that the changes take effect.

In [30]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbbai-azure-ai-document-intelligence"  # change your directory here

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-azure-ai-document-intelligence


In [31]:
from src.ocr.document_intelligence import AzureDocumentIntelligenceManager

document_intelligence_client = AzureDocumentIntelligenceManager()

## Optical Character Recognition (OCR) with Azure AI Document Intelligence

Azure's Document Analysis Client provides a variety of pre-trained models that can be used to analyze documents. The `analyze_document` function takes in a document (either a URL or a file path) and the type of pre-trained model to use for analysis.

Here's a brief overview of the available pre-trained models:

- `'prebuilt-layout'`: This is the default model. It extracts text, tables, selection marks, and structure elements from the document.

- `'prebuilt-document'`: This model is used for generic document understanding.

- `'prebuilt-read'`: This model extracts both print and handwritten text.

- `'prebuilt-tax'`: This model is designed to process US tax documents.

- `'prebuilt-invoice'`: This model automates the processing of invoices.

- `'prebuilt-receipt'`: This model scans sales receipts for key data.

- `'prebuilt-id'`: This model processes identity documents.

- `'prebuilt-businesscard'`: This model extracts information from business cards.

- `'prebuilt-contract'`: This model analyzes contractual agreements.

- `'prebuilt-healthinsurancecard'`: This model processes health insurance cards.

In addition to these pre-trained models, Azure also offers custom and composed models. For more details, refer to the [Azure Document Intelligence Model Overview](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-model-overview?view=doc-intel-4.0.0).

The `analyze_document` function returns the analysis result, which can then be used for further processing or analysis.
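
Because `analyze_document` accepts either a URL or a file path, a wrapper has to tell the two apart somehow. The sketch below shows one plausible way to do that with the standard library. The helper `classify_document_input` is hypothetical and is not taken from `AzureDocumentIntelligenceManager`, which may use a different rule.

```python
from urllib.parse import urlparse


def classify_document_input(document_input: str) -> str:
    """Return "url" for http(s) addresses and "file" for everything else.

    Assumption: only http and https inputs are treated as remote documents.
    A Windows path such as C:\\docs\\manual.pdf parses with scheme "c",
    so it correctly falls through to "file".
    """
    parsed = urlparse(document_input)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "file"


print(classify_document_input("https://www.emerson.com/documents/manual.pdf"))
print(classify_document_input(r"C:\docs\manual.pdf"))
```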

In [32]:
# We will begin with the Fisher EWD/EWS/EWT Valves through NPS 12x8 Instruction Manual,
# which can be found at the following URL:
# https://www.emerson.com/documents/automation/instruction-manual-fisher-ewd-ews-ewt-valves-through-nps-12x8-en-124788.pdf
# We will use the 'prebuilt-layout' model for this task. This is the default model provided by Azure's Document Analysis Client,
# and it is capable of extracting text, tables, selection marks, and structure elements from the document.

document_url = "https://www.emerson.com/documents/automation/instruction-manual-fisher-ewd-ews-ewt-valves-through-nps-12x8-en-124788.pdf"
model_type = "prebuilt-layout"

result_ocr = document_intelligence_client.analyze_document(
    document_input=document_url, model_type=model_type
)

In [33]:
# Retrieve the full text content of the document
result_ocr.content



In [34]:
# Retrieve the pages of the document
result_ocr.pages

 DocumentPage(page_number=2, angle=None, width=8.5, height=11.0, unit=inch, lines=[DocumentLine(content=EW Valve, polygon=[Point(x=0.7784, y=0.7114), Point(x=1.3801, y=0.7114), Point(x=1.3801, y=0.8451), Point(x=0.7784, y=0.8451)], spans=[DocumentSpan(offset=1628, length=8)]), DocumentLine(content=February 2020, polygon=[Point(x=0.7831, y=0.8928), Point(x=1.4756, y=0.8928), Point(x=1.4756, y=1.0217), Point(x=0.7831, y=1.0217)], spans=[DocumentSpan(offset=1637, length=13)]), DocumentLine(content=Table 1. Specifications, polygon=[Point(x=0.7784, y=1.3511), Point(x=2.125, y=1.3511), Point(x=2.1202, y=1.5182), Point(x=0.7784, y=1.5135)], spans=[DocumentSpan(offset=1651, length=23)]), DocumentLine(content=End Connection Styles, polygon=[Point(x=0.7402, y=1.628), Point(x=2.1346, y=1.6328), Point(x=2.1298, y=1.7904), Point(x=0.7402, y=1.7856)], spans=[DocumentSpan(offset=1675, length=21)]), DocumentLine(content=Flanged Ends: CL300, CL600, or CL900 Raised-face or, polygon=[Point(x=0.7927, y=1.

In [35]:
# Retrieve the tables of the document
result_ocr.tables

[DocumentTable(row_count=29, column_count=2, cells=[DocumentTableCell(kind=content, row_index=0, column_index=0, row_span=1, column_span=2, content=Contents, bounding_regions=[BoundingRegion(page_number=1, polygon=[Point(x=0.6745, y=2.2641), Point(x=4.0673, y=2.2737), Point(x=4.0673, y=2.5814), Point(x=0.6745, y=2.5814)])], spans=[DocumentSpan(offset=104, length=8)]), DocumentTableCell(kind=content, row_index=1, column_index=0, row_span=1, column_span=1, content=Introduction, bounding_regions=[BoundingRegion(page_number=1, polygon=[Point(x=0.6745, y=2.5814), Point(x=3.7117, y=2.5814), Point(x=3.7117, y=2.7496), Point(x=0.6745, y=2.7544)])], spans=[DocumentSpan(offset=113, length=12)]), DocumentTableCell(kind=content, row_index=1, column_index=1, row_span=1, column_span=1, content=1, bounding_regions=[BoundingRegion(page_number=1, polygon=[Point(x=3.7117, y=2.5814), Point(x=4.0673, y=2.5814), Point(x=4.0673, y=2.7496), Point(x=3.7117, y=2.7496)])], spans=[DocumentSpan(offset=126, length

In [36]:
result_ocr.paragraphs

[DocumentParagraph(role=pageHeader, content=Instruction Manual D100399X012, bounding_regions=[BoundingRegion(page_number=1, polygon=[Point(x=0.7774, y=0.7161), Point(x=2.2348, y=0.7114), Point(x=2.2359, y=1.024), Point(x=0.7784, y=1.0288)])], spans=[DocumentSpan(offset=0, length=30)]),
 DocumentParagraph(role=pageHeader, content=EW Valve February 2020, bounding_regions=[BoundingRegion(page_number=1, polygon=[Point(x=7.0006, y=0.7257), Point(x=7.7025, y=0.7257), Point(x=7.7025, y=1.0408), Point(x=7.0006, y=1.0408)])], spans=[DocumentSpan(offset=31, length=22)]),
 DocumentParagraph(role=title, content=Fisher™ EWD, EWS, and EWT Valves through NPS 12x8, bounding_regions=[BoundingRegion(page_number=1, polygon=[Point(x=0.7449, y=1.4036), Point(x=5.7256, y=1.4036), Point(x=5.7256, y=2.1437), Point(x=0.7449, y=2.1437)])], spans=[DocumentSpan(offset=54, length=49)]),
 DocumentParagraph(role=None, content=Contents, bounding_regions=[BoundingRegion(page_number=1, polygon=[Point(x=0.6745, y=2.2641
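
Each element above carries `spans` whose `offset` and `length` point back into `result_ocr.content`. The sketch below illustrates that relationship with the first two paragraphs shown in the output. It assumes that offsets count characters of `content` and that consecutive paragraphs are separated by a single newline, which is consistent with the offsets 0 and 31 above but is not confirmed by the source. `Span` is a stand-in for the SDK's `DocumentSpan`.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for DocumentSpan, with the same two fields."""
    offset: int
    length: int


def text_for_span(content: str, span: Span) -> str:
    """Slice the full document text using a span's offset and length."""
    return content[span.offset : span.offset + span.length]


# Reconstructed start of the content; the newline separator is an assumption.
content = "Instruction Manual D100399X012\nEW Valve February 2020\n"

first_header = Span(offset=0, length=30)
second_header = Span(offset=31, length=22)

print(text_for_span(content, first_header))
print(text_for_span(content, second_header))
```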

The layout model does a good job of extracting the document's elements, including paragraphs and tables, but we need extra logic to turn this output into meaningful text that we can later feed to our LLM.
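
One simple form of such logic is to drop paragraphs whose `role` marks them as page furniture and keep the rest. The sketch below uses a stand-in `Paragraph` class instead of the SDK's `DocumentParagraph`. The output above shows only the roles `pageHeader`, `title`, and `None`, so skipping `pageFooter` and `pageNumber` as well is an assumption about what other pages may contain.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Paragraph:
    """Stand-in for DocumentParagraph, reduced to the two fields used here."""
    role: Optional[str]
    content: str


# Roles treated as noise; pageFooter and pageNumber are assumptions.
SKIPPED_ROLES = {"pageHeader", "pageFooter", "pageNumber"}


def body_text(paragraphs: Iterable[Paragraph]) -> str:
    """Join paragraph contents, skipping headers, footers, and page numbers."""
    kept = [p.content for p in paragraphs if p.role not in SKIPPED_ROLES]
    return "\n\n".join(kept)


sample = [
    Paragraph("pageHeader", "Instruction Manual D100399X012"),
    Paragraph("pageHeader", "EW Valve February 2020"),
    Paragraph("title", "Fisher™ EWD, EWS, and EWT Valves through NPS 12x8"),
    Paragraph(None, "Contents"),
]

print(body_text(sample))
```

With the real result, the same function could be called as `body_text(result_ocr.paragraphs)`, since it reads only `role` and `content`.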

## Understanding Data Extracted from the Layout Model

### Custom Logic for Processing Extracted Information

To make sense of the extracted information and prepare it for indexing into Azure AI Search, we need to implement custom logic based on specific use cases and requirements. The following is a simple example of how to transform the data into a CSV format. However, this logic should be customized to accommodate the specific use case.

In [37]:
document_table = result_ocr.tables[1]

In [38]:
import csv


def write_table_to_csv(document_table, output_file_path: str, first_row_as_header=True):
    """
    Writes the data from a DocumentTable object to a CSV file.

    :param document_table: The DocumentTable object to extract data from.
    :param output_file_path: The path of the CSV file to create.
    :param first_row_as_header: Whether to treat the first row of the table as column headers. Defaults to True.
    """
    # Index cells by (row, column) so each lookup is a dictionary access
    # instead of a scan over every cell in the table
    cells_by_position = {
        (cell.row_index, cell.column_index): cell.content
        for cell in document_table.cells
    }

    # Build the grid row by row; positions without a cell become empty strings
    table_data = [
        [
            cells_by_position.get((row, column), "")
            for column in range(document_table.column_count)
        ]
        for row in range(document_table.row_count)
    ]

    # Optionally, treat the first row as header
    if first_row_as_header and table_data:
        headers = table_data.pop(0)
    else:
        headers = [f"Column {i+1}" for i in range(document_table.column_count)]

    # Write the table data to a CSV file
    # Use UTF-8 explicitly so non-ASCII characters are written consistently
    with open(output_file_path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(headers)
        writer.writerows(table_data)

In [39]:
write_table_to_csv(document_table, output_file_path="utils/csv_tables/table_data.csv")

In [40]:
import pandas as pd

pd.read_csv("utils/csv_tables/table_data.csv")

| | Valve | Seating | Shutoff Class |
|---|---|---|---|
| 0 | EWD | Metal | II (standard) |
| 1 | | | III (optional for NPS 6x4 through 12x6 valves ... |
| 2 | | | IV (optional for NPS 6x4 through 12x8 valves w... |
| 3 | EWS | Metal | IV (standard) |
| 4 | | | V (optional, consult your Emerson sales office) |
| 5 | EWS | PTFE | VI |
| 6 | EWT with all except Cavitrol III cages | PTFE | Standard Air Test (maximum leakage is 0.05 mL/... |
| 7 | | | V (optional) |
| 8 | | Metal | IV (standard) |
| 9 | | Metal | V (optional)(1) |


![Table Test Image](utils/images/table_test.png)

More sophisticated logic needs to be developed to effectively extract information from complex and nested tables. This will involve handling various table structures and layouts, as well as dealing with merged cells and other complexities. The goal is to ensure accurate and reliable extraction of data, regardless of the complexity of the table.
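
As a starting point for merged cells, the sketch below copies each cell's content into every grid position covered by its `row_span` and `column_span`, so that no row is left with blanks where a merged cell applies. `Cell` is a stand-in for `DocumentTableCell` with the same field names as the output shown earlier. The sample data is illustrative and only loosely modeled on the shutoff class table; treating the blank cells there as the result of row spans is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Cell:
    """Stand-in for DocumentTableCell, reduced to the fields used here."""
    row_index: int
    column_index: int
    content: str
    row_span: int = 1
    column_span: int = 1


def expand_to_grid(cells: Iterable[Cell], row_count: int, column_count: int) -> List[List[str]]:
    """Return a dense grid where merged cells repeat their content."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        # Fall back to 1 if a span is missing, and clamp to the table bounds.
        last_row = min(cell.row_index + (cell.row_span or 1), row_count)
        last_col = min(cell.column_index + (cell.column_span or 1), column_count)
        for r in range(cell.row_index, last_row):
            for c in range(cell.column_index, last_col):
                grid[r][c] = cell.content
    return grid


# Illustrative data: "EWD" and "Metal" each span three rows.
cells = [
    Cell(0, 0, "Valve"), Cell(0, 1, "Seating"), Cell(0, 2, "Shutoff Class"),
    Cell(1, 0, "EWD", row_span=3), Cell(1, 1, "Metal", row_span=3),
    Cell(1, 2, "II (standard)"),
    Cell(2, 2, "III (optional)"),
    Cell(3, 2, "IV (optional)"),
]

for row in expand_to_grid(cells, row_count=4, column_count=3):
    print(row)
```

Repeating the value is only one policy. For some downstream uses it may be better to keep the blanks and record the span separately, and nested tables would need additional handling that this sketch does not attempt.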

### Leveraging LangChain Integration 

Retrieval-Augmented Generation (RAG) combines a pretrained Large Language Model (LLM) with an external data retrieval system. This allows for dynamic interaction with documents and content generation using Azure OpenAI models.

Key Features:

- **Semantic Chunking**: Fragments text into coherent segments for optimized RAG responses. Supports fixed-sized and semantic chunking.

- **Layout Model Integration**: The Layout model from Document Intelligence assists in semantic chunking, providing a solution for content extraction and document structure analysis. It's scalable, multilingual, and compatible with LLMs.

- **Document Chat with Semantic Chunking**: Enabled by Azure OpenAI and LangChain integration, this feature allows for custom chunking strategies and effective document parsing.

For more details, visit the [Microsoft Learn page on Retrieval-Augmented Generation with Azure AI Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-retrieval-augumented-generation?view=doc-intel-4.0.0).
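
To make the difference between fixed-size and header-based chunking concrete, the sketch below implements both with the standard library only. It is a simplified stand-in for illustration, not the implementation of LangChain's `MarkdownHeaderTextSplitter`. The function names and the sample text are hypothetical.

```python
import re
from typing import Dict, List

# Matches Markdown headers of level 1 to 3, e.g. "## Installation".
HEADER_PATTERN = re.compile(r"^(#{1,3})\s+(.*)$")


def _append_chunk(chunks: List[Dict[str, str]], header: str, lines: List[str]) -> None:
    """Add a chunk unless its text is empty."""
    text = "\n".join(lines).strip()
    if text:
        chunks.append({"header": header, "text": text})


def split_on_headers(markdown: str) -> List[Dict[str, str]]:
    """Start a new chunk at every header so each chunk covers one section."""
    chunks: List[Dict[str, str]] = []
    header, lines = "", []
    for line in markdown.splitlines():
        match = HEADER_PATTERN.match(line)
        if match:
            _append_chunk(chunks, header, lines)
            header, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    _append_chunk(chunks, header, lines)
    return chunks


def split_fixed(text: str, size: int) -> List[str]:
    """Cut the text every `size` characters, regardless of structure."""
    return [text[i : i + size] for i in range(0, len(text), size)]


sample = (
    "# Introduction\nScope of the manual.\n\n"
    "## Installation\nMount the valve.\nCheck the packing.\n"
)

print(split_on_headers(sample))
print(split_fixed(sample, 20))
```

The header-based chunks keep each section's sentences together with its title, while the fixed-size chunks can cut through the middle of a heading or sentence. That is the motivation for using the layout model's Markdown output as the basis for chunking.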

In [41]:
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter
import os

endpoint = os.environ["AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"]
key = os.environ["AZURE_DOCUMENT_INTELLIGENCE_KEY"]

In [42]:
# Initialize the Azure AI Document Intelligence loader. You can specify either file_path or url_path to load the document.
loader = AzureAIDocumentIntelligenceLoader(
    url_path=document_url,
    api_key=key,
    api_endpoint=endpoint,
    api_model="prebuilt-layout",
)
docs = loader.load()

In [43]:
# Split the document into chunks based on Markdown headers.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

docs_string = docs[0].page_content
splits = text_splitter.split_text(docs_string)
splits

[Document(page_content='<!-- PageHeader="Instruction Manual D100399X012" -->  \n<!-- PageHeader="EW Valve February 2020" -->  \nFisher™ EWD, EWS, and EWT Valves through NPS 12x8\n===  \n|||\n| - | - |\n| Contents ||\n| Introduction | 1 |\n| Scope of Manual | 1 |\n| Description | 3 |\n| Specifications | 3 |\n| Educational Services | 4 |\n| Installation | 4 |\n| Inverted Globe Valve Applications | |\n| \\(Actuator below valve\\) | 6 |\n| Maintenance | 7 |\n| Packing Lubrication | 8 |\n| Packing Maintenance | 10 |\n| Replacing Packing | 11 |\n| Trim Maintenance | 14 |\n| Trim Removal | 15 |\n| Lapping Metal Seats | 16 |\n| Valve Plug Maintenance | 17 |\n| Trim Replacement | 19 |\n| Retrofit: Installing C\\-seal Trim | 20 |\n| Replacement of Installed C\\-seal Trim | 24 |\n| Trim Removal \\(C\\-seal Constructions\\) | 24 |\n| Lapping Metal Seats \\(C\\-seal Constructions\\) \\. | 25 \\. |\n| Remachining Metal Seats | |\n| \\(C\\-seal Constructions\\) | 25 |\n| Trim Replacement \\(C\\-seal 