## 📚 Prerequisites

Before running this notebook, configure Azure AI services, set the required configuration parameters, and create a Conda environment for reproducibility. The setup instructions, including how to create the Conda environment, are in the [REQUIREMENTS.md](REQUIREMENTS.md) file.

## 📋 Table of Contents

This notebook guides you through the following sections:

> **💡 Note:** Please refer to the notebook `01-creation-indexes-azure-ai-search.ipynb` for detailed information and steps on how to create Azure AI Search Indexes.

1. [**Indexing Vectorized Content from multiple formats and sources**](#index-documents)
    - Chunk, vectorize, and index local PDF files and website addresses.
    
2. [**Indexing Vectorized Content from complex layout documents leveraging OCR Capabilities**](#index-images)
    - Apply OCR and image recognition with Azure Document Intelligence, then chunk, vectorize, and index the content and metadata extracted from the documents.

Before you start, please take a look at the [README.md](README.md) file. It contains detailed instructions, diagrams, and information about the class structure and automation used in this project, specifically in the `AzureAIndexer` backend.

In [6]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbb-ai-chatbot-arena"

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbb-ai-chatbot-arena
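
The cell above depends on a hard-coded path that only exists on one machine. As a hedged alternative, the sketch below walks up from a starting directory until it finds a marker file. The helper name `find_project_root` and the choice of `README.md` as the marker are assumptions made for illustration; they are not part of this project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "README.md") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: the marker file name is an assumption, pick any file
    that only exists at the root of your checkout.
    """
    start = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None
```

If `find_project_root(Path.cwd())` returns a path, it can be passed to `os.chdir`; if it returns `None`, no marker was found and the working directory should be left unchanged.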


# Create Azure AI Search Indexes 

Please refer to the notebook [01-creation-indexes-azure-ai-search.ipynb](01-creation-indexes.ipynb) for detailed information and steps on how to create Azure AI Search Indexes. 

# 📚 Indexing Vectorized Content from Multiple Sources and Various Formats

In this section, we will explore how to index vectorized content from various sources and in different formats. This includes local PDF files, website addresses, `.docx` files from a SharePoint site, and more. We will chunk, vectorize, and index these different types of content, leveraging the power of Azure AI Search Indexes. This process allows us to create a comprehensive, searchable index that can handle a wide range of queries.

In [None]:
!pip install --upgrade langchain==0.2.11 --force-reinstall
!pip install --upgrade langchain-core==0.2.11 --force-reinstall

In [3]:
# Import the AzureAIndexer class from the ai_search_indexing module
from src.indexing.indexers.ai_search_indexing import AzureAIndexer

DEPLOYMENT_NAME = os.getenv("AZURE_AOAI_EMBEDDING_DEPLOYMENT_ID")
INDEX_NAME = os.getenv("AZURE_SEARCH_INDEX_NAME_HR_RECURSIVE")

# Create an instance of the AzureAIndexer class
azure_search_indexer_client = AzureAIndexer(
    index_name=INDEX_NAME, embedding_azure_deployment_name=DEPLOYMENT_NAME
)

2024-07-28 20:45:42,917 - micro - MainProcess - INFO     Loading OpenAIEmbeddings object with model, deployment foundational-canadaeast-ada, and chunk size 1000 (ai_search_indexing.py:load_embedding_model:162)
  warn_deprecated(
2024-07-28 20:45:48,139 - micro - MainProcess - INFO     AzureOpenAIEmbeddings object has been created successfully. You can now access the embeddings
                using the '.embeddings' attribute. (ai_search_indexing.py:load_embedding_model:175)
name is not a known attribute of class <class 'azure.search.documents.indexes.models._index.SearchField'> and will be ignored
type is not a known attribute of class <class 'azure.search.documents.indexes.models._index.SearchField'> and will be ignored
key is not a known attribute of class <class 'azure.search.documents.indexes.models._index.SearchField'> and will be ignored
searchable is not a known attribute of class <class 'azure.search.documents.indexes.models._index.SearchField'> and will be ignored
filterable 
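
`os.getenv` returns `None` when a variable is missing, so a typo in a variable name would only surface later, inside the indexer. A minimal sketch of a stricter lookup follows; `require_env` is a hypothetical helper, not something this project provides.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set.")
    return value


# Possible usage with the variable names from the cell above:
# DEPLOYMENT_NAME = require_env("AZURE_AOAI_EMBEDDING_DEPLOYMENT_ID")
# INDEX_NAME = require_env("AZURE_SEARCH_INDEX_NAME_HR_RECURSIVE")
```

Failing at lookup time keeps the error close to its cause instead of deferring it to the first call against Azure.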

### Indexing PDFs, DOCX, and Images from Blob Storage 

The `load_files_and_split_into_chunks` method handles the first steps of indexing: it loads files, splits them into manageable chunks, and prepares the resulting documents for further processing.

Here are its key features:

- **Multi-Format Support**: The function can process documents in different formats (PDFs, Word documents, images, etc.) from various sources (blob storage, URLs, local paths). You can pass a list of file paths, each possibly in a different format.

- **Automated File Loading**: The function efficiently loads files into memory, eliminating the need for manual file handling. It manages the reading and processing of each file.

- **Advanced Text Splitting**: After loading, the function splits the text into manageable chunks, crucial for processing large documents. You can customize the chunk size and overlap according to your needs.

- **Versatile Splitting Options**: You can choose from various splitters - 'by_title', 'by_character_recursive', 'by_character_brute_force' - to fit your specific text processing requirements.

- **Encoding Capabilities**: The function can optionally use an encoder during splitting. This feature is particularly useful for certain text analysis tasks. You can specify the model used for encoding (default is "gpt-4").

- **OCR Capabilities**: If the 'ocr' parameter is set to True, the function will use Optical Character Recognition (OCR) via Azure Document Intelligence to extract text from images or scanned documents.

- **Verbose Logging**: You can enable detailed logging for in-depth progress tracking and easier debugging.

- **High Customizability**: The function's behavior can be tailored to your needs with additional keyword arguments. This includes options like retaining separators in chunks, using separators as regex patterns, and more.
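
To make the interaction between chunk size and overlap concrete, here is a minimal fixed-window sketch. It is not the project's implementation: the function name `split_with_overlap` is hypothetical, and the notebook's logs report a `RecursiveCharacterTextSplitter`, which also tries to break on separators. This sketch ignores separators and only shows how consecutive chunks share `chunk_overlap` characters.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_size=512` and `chunk_overlap=128`, each window in this sketch would advance by 384 characters, so every chunk repeats the last 128 characters of the previous one.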

In [4]:
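# Define file paths and URLs
# Build the path with os.path.join: in a plain string literal, "\d" and "\H"
# are invalid escape sequences and the path would only work on Windows.
local_pdf_path = os.path.join(
    "utils", "data", "Human-Resources-Policy-Manual-RHA-Updated-February2022.pdf"
)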

# Define parameters for the load_files_and_split_into_chunks function
splitter_params = {
    "splitter_type": "by_character_recursive",
    "use_encoder": False,
    "chunk_size": 512,
    "chunk_overlap": 128,
    "verbose": False,
    "keep_separator": True,
    "is_separator_regex": False,
    "model_name": "gpt-4",
}

# Load files and split them into chunks
document_chunks_to_index = azure_search_indexer_client.load_files_and_split_into_chunks(
    file_paths=[local_pdf_path], **splitter_params
)

2024-07-28 20:45:49,920 - micro - MainProcess - INFO     Reading .pdf file from local path c:\Users\pablosal\Desktop\gbb-ai-chatbot-arena\utils\data\Human-Resources-Policy-Manual-RHA-Updated-February2022.pdf. (from_blob.py:load_document:67)
2024-07-28 20:45:49,921 - micro - MainProcess - INFO     Loading file with Loader PyPDFLoader (from_blob.py:load_document:79)
2024-07-28 20:46:10,771 - micro - MainProcess - INFO     Creating a splitter of type: by_character_recursive (by_character.py:get_splitter:63)
2024-07-28 20:46:10,772 - micro - MainProcess - INFO     Obtained splitter of type: RecursiveCharacterTextSplitter (by_character.py:split_documents_in_chunks_from_documents:175)
2024-07-28 20:46:10,847 - micro - MainProcess - INFO     Number of chunks obtained: 1332 (by_character.py:split_documents_in_chunks_from_documents:178)


In [5]:
document_chunks_to_index

[Document(metadata={'source': 'c:\\Users\\pablosal\\Desktop\\gbb-ai-chatbot-arena\\utils\\data\\Human-Resources-Policy-Manual-RHA-Updated-February2022.pdf', 'page': 0}, page_content='HR POLICY MANUAL'),
 Document(metadata={'source': 'c:\\Users\\pablosal\\Desktop\\gbb-ai-chatbot-arena\\utils\\data\\Human-Resources-Policy-Manual-RHA-Updated-February2022.pdf', 'page': 2}, page_content='iHuman Resources Policy Manual\nTable of Contents\nLatest Revision: October 1, 2017\nTABLE OF CONTENTS\nPOLICY NUMBER\nSECTION 1  INTRODUCTION\nHow to Use This Manual   . .\n . . . . . . . . . . . . . . . . . . . . . . . . .  .100\nSECTION 2  COMPLIANCE\nEqual Employment Opportunit\ny  . .\n . . . . . . . . . . . . . . . . . . . .  .200\nEmployment at Will   .\n . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  .205\nLegal and Gov\nernment Agency Inquiries   . .\n . . . . . . . . . . . . . .  .210'),
 Document(metadata={'source': 'c:\\Users\\pablosal\\Desktop\\gbb-ai-chatbot-arena\\utils\\data\\Hum

In [6]:
# Index the document chunks using the Azure Search Indexer client
azure_search_indexer_client.index_text_embeddings(document_chunks_to_index)

2024-07-28 20:46:11,324 - micro - MainProcess - INFO     Embedding and indexing initiated for 1332 text chunks. (ai_search_indexing.py:index_text_embeddings:498)
2024-07-28 20:48:28,587 - micro - MainProcess - INFO     Embedding and indexing completed for 1332 text chunks. (ai_search_indexing.py:index_text_embeddings:502)


True

# 📚 Indexing Vectorized Content from Complex Layout Documents Leveraging OCR Capabilities

In this section, we use Azure AI Document Intelligence in the backend to extract elements from complex layout documents. Extracting titles and other metadata from the documents allows us to chunk each document by section.

We use a simple algorithm that chunks the document by section, so that each chunk is semantically coherent and can be indexed separately. The risk is that a section may be longer than the context window of our model. To mitigate this, we can add a second pass to the algorithm that further splits any section exceeding a cutoff length.
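
The two-step idea described above can be sketched as follows. This is an illustration under stated assumptions, not the code behind `splitter_type="by_title"`: the function name `split_by_headings` is hypothetical, sections are detected from markdown headings, and the cutoff is measured in characters as a stand-in for a token count.

```python
import re
from typing import List

# A markdown heading: one to six '#' characters, whitespace, then some text.
HEADING_RE = re.compile(r"^#{1,6}\s+\S")


def split_by_headings(markdown: str, max_chars: int) -> List[str]:
    """Return one chunk per heading-delimited section, cutting long sections at `max_chars`."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    # First pass: group lines into sections, starting a new one at each heading.
    sections: List[List[str]] = []
    current: List[str] = []
    for line in markdown.splitlines():
        if HEADING_RE.match(line) and current:
            sections.append(current)
            current = []
        current.append(line)
    if current:
        sections.append(current)

    # Second pass: sections longer than the cutoff are cut into fixed-size pieces.
    chunks: List[str] = []
    for lines in sections:
        text = "\n".join(lines).strip()
        if not text:
            continue
        for start in range(0, len(text), max_chars):
            chunks.append(text[start:start + max_chars])
    return chunks
```

A production version would more likely cut oversized sections at sentence or paragraph boundaries and count tokens with the model's tokenizer; the fixed character cutoff here only keeps the sketch short.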

In [9]:
document_blob = "https://testeastusdev001.blob.core.windows.net/customskillspdf/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf"

In [10]:
# Define parameters for the load_files_and_split_into_chunks function
splitter_params = {
    "splitter_type": "by_title",
    "ocr": True,
    "ocr_output_format": "markdown",
    "pages": "3-7",
}

document_chunks_to_index = azure_search_indexer_client.load_files_and_split_into_chunks(
    file_paths=document_blob,
    **splitter_params,
)

2024-03-05 16:12:10,726 - micro - MainProcess - INFO     Blob URL detected. Extracting content. (ocr_document_intelligence.py:analyze_document:147)
2024-03-05 16:12:12,043 - micro - MainProcess - INFO     Successfully downloaded blob file instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (blob_data_extractors.py:extract_content:93)
2024-03-05 16:12:45,280 - micro - MainProcess - INFO     Successfully extracted content from https://testeastusdev001.blob.core.windows.net/customskillspdf/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (ocr_data_extractors.py:extract_content:82)
2024-03-05 16:12:45,285 - micro - MainProcess - INFO     Section headings: ['## Scope of Manual', '## Conventions Used in this Manual', '## Description', '## Specifications', '## Related Documents', '### Table 1-2. Specifications', '### Communication Protocol', '#### Input Signal', '### Output Signal', '#### Table 1-2. Specifications (continued)', '##### C

In [11]:
# Index the document chunks using the Azure Search Indexer client
azure_search_indexer_client.index_text_embeddings(document_chunks_to_index)

2024-03-05 16:12:45,385 - micro - MainProcess - INFO     Embedding and indexing initiated for 7 text chunks. (ai_search_indexing.py:index_text_embeddings:498)
2024-03-05 16:12:46,434 - micro - MainProcess - INFO     Embedding and indexing completed for 7 text chunks. (ai_search_indexing.py:index_text_embeddings:502)


True

> Here, we extract the text from the same document using OCR and split it with the recursive character splitter instead of sectioning by title.

In [12]:
# Define parameters for the load_files_and_split_into_chunks function
splitter_params = {
    "splitter_type": "by_character_recursive",
    "ocr": True,
    "ocr_output_format": "text",
    "pages": "3-7",
    "use_encoder": False,
    "chunk_size": 512,
    "chunk_overlap": 128,
    "keep_separator": True,
    "is_separator_regex": False,
    # "verbose" was listed twice (False, then True); Python keeps the last
    # value, so only the effective setting is retained here.
    "verbose": True,
}

document_chunks_to_index = azure_search_indexer_client.load_files_and_split_into_chunks(
    file_paths=document_blob,
    **splitter_params,
)

2024-03-05 16:12:46,459 - micro - MainProcess - INFO     Blob URL detected. Extracting content. (ocr_document_intelligence.py:analyze_document:147)
2024-03-05 16:12:46,666 - micro - MainProcess - INFO     Successfully downloaded blob file instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (blob_data_extractors.py:extract_content:93)
2024-03-05 16:13:18,284 - micro - MainProcess - INFO     Successfully extracted content from https://testeastusdev001.blob.core.windows.net/customskillspdf/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (ocr_data_extractors.py:extract_content:82)
2024-03-05 16:13:18,286 - micro - MainProcess - INFO     Creating a splitter of type: by_character_recursive (by_character.py:get_splitter:63)
2024-03-05 16:13:18,287 - micro - MainProcess - INFO     Obtained splitter of type: RecursiveCharacterTextSplitter (by_character.py:split_documents_in_chunks_from_documents:175)
2024-03-05 16:13:18,290 - micro - Mai

Chunk Number: 1, Character Count: 504, Token Count: 107
Chunk Number: 2, Character Count: 339, Token Count: 63
Chunk Number: 3, Character Count: 362, Token Count: 65
Chunk Number: 4, Character Count: 511, Token Count: 96
Chunk Number: 5, Character Count: 472, Token Count: 95
Chunk Number: 6, Character Count: 510, Token Count: 99
Chunk Number: 7, Character Count: 183, Token Count: 36
Chunk Number: 8, Character Count: 285, Token Count: 83
Chunk Number: 9, Character Count: 409, Token Count: 91
Chunk Number: 10, Character Count: 304, Token Count: 55
Chunk Number: 11, Character Count: 380, Token Count: 70
Chunk Number: 12, Character Count: 494, Token Count: 159
Chunk Number: 13, Character Count: 487, Token Count: 142
Chunk Number: 14, Character Count: 242, Token Count: 54
Chunk Number: 15, Character Count: 499, Token Count: 140
Chunk Number: 16, Character Count: 188, Token Count: 56
Chunk Number: 17, Character Count: 454, Token Count: 88
Chunk Number: 18, Character Count: 456, Token Count: 

In [13]:
# Index the document chunks using the Azure Search Indexer client
azure_search_indexer_client.index_text_embeddings(document_chunks_to_index)

2024-03-05 16:13:18,334 - micro - MainProcess - INFO     Embedding and indexing initiated for 38 text chunks. (ai_search_indexing.py:index_text_embeddings:498)
2024-03-05 16:13:22,068 - micro - MainProcess - INFO     Embedding and indexing completed for 38 text chunks. (ai_search_indexing.py:index_text_embeddings:502)


True