## 📚 Prerequisites

Before running this notebook, configure your Azure AI services, set the required configuration parameters, and create a Conda environment for reproducibility. Setup instructions, including how to create the Conda environment, are in the [REQUIREMENTS.md](REQUIREMENTS.md) file.

## 📋 Table of Contents

> **💡 Note:** Please refer to the notebook `01-creation-indexes.ipynb` for detailed information and steps on how to create Azure AI Search Indexes.

This notebook guides you through the following sections:

1. [**Indexing Vectorized Content from multiple formats and sources**](#index-documents)
    - Chunk, vectorize, and index local PDF files and website addresses.
    - Download, chunk, vectorize, and index all `.docx` files from a SharePoint site.
    
2. [**Indexing Vectorized Content from Complex Layout Documents Leveraging OCR Capabilities**](#index-images)
    - Apply OCR and image recognition with Azure Document Intelligence, then chunk, vectorize, and index the metadata extracted from documents.

3. [**Indexing Vectorized Content from Audio**](#index-audio) (TODO)
    - Process WAV audio files stored in Blob Storage using Azure AI Speech translation capabilities, then chunk, vectorize, and index the results in Azure AI Search.

Before you start, please read the [README.md](README.md) file. It contains detailed instructions, diagrams, and information about the class structure and automation used in this project, specifically in the `AzureAIndexer` backend.

In [1]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing"  # change your directory here

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing


# Create Azure AI Search Indexes 

Please refer to the notebook [01-creation-indexes.ipynb](01-creation-indexes.ipynb) for detailed information and steps on how to create Azure AI Search Indexes. 

# 📚 Indexing Vectorized Content from Multiple Sources and Various Formats

In this section, we index vectorized content from several sources and formats: local PDF files, website addresses, `.docx` files from a SharePoint site, and more. Each document is chunked, vectorized, and indexed into Azure AI Search, producing a single searchable index that can serve a wide range of queries.

In [2]:
# Import the AzureAIndexer class from the ai_search_indexing module
from src.indexers.ai_search_indexing import AzureAIndexer

DEPLOYMENT_NAME = "foundational-ada"
INDEX_NAME = "test-index-002"

# Create an instance of the AzureAIndexer class
azure_search_indexer_client = AzureAIndexer(
    index_name=INDEX_NAME, embedding_azure_deployment_name=DEPLOYMENT_NAME
)

2024-02-11 14:27:11,943 - micro - MainProcess - INFO     Loading OpenAIEmbeddings object with model, deployment foundational-ada, and chunk size 1000 (ai_search_indexing.py:load_embedding_model:161)
  warn_deprecated(
  warn_deprecated(
2024-02-11 14:27:13,478 - micro - MainProcess - INFO     AzureOpenAIEmbeddings object has been created successfully. You can now access the embeddings
                using the '.embeddings' attribute. (ai_search_indexing.py:load_embedding_model:174)
vector_search_configuration is not a known attribute of class <class 'azure.search.documents.indexes.models._index.SearchField'> and will be ignored
2024-02-11 14:27:15,353 - micro - MainProcess - INFO     The Azure AI search index 'test-index-002' has been loaded correctly. (ai_search_indexing.py:load_azureai_index:225)
2024-02-11 14:27:15,364 - micro - MainProcess - INFO     Successfully loaded environment variables: TENANT_ID, CLIENT_ID, CLIENT_SECRET (sharepoint_data_extractor.py:load_environment_variab

### Indexing PDFs, DOCX, and Images from Blob Storage 

The `load_files_and_split_into_chunks` function handles the initial steps of indexing: loading files, splitting them into manageable chunks, and preparing the documents for further processing and conversion.

Here are its key features:

- **Multi-Format Support**: The function can process documents in different formats (PDFs, Word documents, images, etc.) from various sources (blob storage, URLs, local paths). You can pass a list of file paths, each possibly in a different format.

- **Automated File Loading**: The function efficiently loads files into memory, eliminating the need for manual file handling. It manages the reading and processing of each file.

- **Advanced Text Splitting**: After loading, the function splits the text into manageable chunks, crucial for processing large documents. You can customize the chunk size and overlap according to your needs.

- **Versatile Splitting Options**: You can choose from various splitters - 'by_title', 'by_character_recursive', 'by_character_brute_force' - to fit your specific text processing requirements.

- **Encoding Capabilities**: The function can optionally use an encoder during splitting. This feature is particularly useful for certain text analysis tasks. You can specify the model used for encoding (default is "gpt-4").

- **OCR Capabilities**: If the 'ocr' parameter is set to True, the function will use Optical Character Recognition (OCR) via Azure Document Intelligence to extract text from images or scanned documents.

- **Verbose Logging**: You can enable detailed logging for in-depth progress tracking and easier debugging.

- **High Customizability**: The function's behavior can be tailored to your needs with additional keyword arguments. This includes options like retaining separators in chunks, using separators as regex patterns, and more.
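
To make the `chunk_size` and `chunk_overlap` options concrete, here is a minimal, standalone sketch of fixed-size character chunking with overlap. It is not the implementation behind `load_files_and_split_into_chunks`; the helper name `chunk_with_overlap` is hypothetical and only illustrates how consecutive chunks can share `chunk_overlap` characters.

```python
def chunk_with_overlap(
    text: str, chunk_size: int = 512, chunk_overlap: int = 128
) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it is not part of the
    AzureAIndexer backend.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 1000-character sample text.
sample = "".join(chr(ord("a") + i % 26) for i in range(1000))
chunks = chunk_with_overlap(sample, chunk_size=512, chunk_overlap=128)
print([len(c) for c in chunks])  # [512, 512, 232]
```

With `chunk_size=512` and `chunk_overlap=128`, each window starts 384 characters after the previous one, so the last 128 characters of a chunk reappear at the start of the next.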

In [3]:
# Define file paths and URLs
local_pdf_path = "utils/data/autogen.pdf"
remote_pdf_url = "https://arxiv.org/pdf/2308.08155.pdf"
blob_pdf_url = (
    "https://testeastusdev001.blob.core.windows.net/testretrieval/autogen.pdf"
)
local_word_path = "utils/data/test.docx"
remote_word_url = (
    "https://testeastusdev001.blob.core.windows.net/testretrieval/test.docx"
)

# Combine all paths and URLs into a list to process multiple files at once.
# Passing a single string also works when processing one file.
file_sources = [
    local_pdf_path,
    remote_pdf_url,
    blob_pdf_url,
    local_word_path,
    remote_word_url,
]

# Define parameters for the load_files_and_split_into_chunks function
splitter_params = {
    "splitter_type": "by_character_recursive",
    "use_encoder": False,
    "chunk_size": 512,
    "chunk_overlap": 128,
    "verbose": False,
    "keep_separator": True,
    "is_separator_regex": False,
    "model_name": "gpt-4",
}

# Load files and split them into chunks
document_chunks_to_index = azure_search_indexer_client.load_files_and_split_into_chunks(
    file_paths=file_sources, **splitter_params
)

2024-02-11 14:27:25,945 - micro - MainProcess - INFO     Reading .pdf file from local path C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing\utils\data\autogen.pdf. (from_blob.py:load_document:67)
2024-02-11 14:27:25,946 - micro - MainProcess - INFO     Loading file with Loader PyPDFLoader (from_blob.py:load_document:79)
2024-02-11 14:27:28,396 - micro - MainProcess - INFO     Reading .pdf file from https://arxiv.org/pdf/2308.08155.pdf. (from_blob.py:load_document:75)
2024-02-11 14:27:28,397 - micro - MainProcess - INFO     Loading file with Loader PyPDFLoader (from_blob.py:load_document:79)
2024-02-11 14:27:32,116 - micro - MainProcess - INFO     Successfully downloaded blob file autogen.pdf (blob_data_extractors.py:extract_content:91)
2024-02-11 14:27:32,160 - micro - MainProcess - INFO     Reading .pdf file from temporary location C:\Users\pablosal\AppData\Local\Temp\tmp5kk5eaws originally sourced from https://testeastusdev001.blob.core.windows.net/testretrieval/autogen.pdf. 

In [4]:
# Index the document chunks using the Azure Search Indexer client
azure_search_indexer_client.index_text_embeddings(document_chunks_to_index)

2024-02-11 14:27:38,725 - micro - MainProcess - INFO     Embedding and indexing initiated for 1408 text chunks. (ai_search_indexing.py:index_text_embeddings:490)
2024-02-11 14:30:05,022 - micro - MainProcess - INFO     Embedding and indexing completed for 1408 text chunks. (ai_search_indexing.py:index_text_embeddings:494)


True

### Indexing PDFs and DOCX Files from SharePoint


In [19]:
file_names = ["testdocx.docx", "autogen.pdf"]

In [20]:
# Define parameters for the load_files_and_split_into_chunks_from_sharepoint function
splitter_params = {
    "splitter_type": "by_character_recursive",
    "use_encoder": False,
    "chunk_size": 512,
    "chunk_overlap": 128,
    "verbose": False,
    "keep_separator": True,
    "is_separator_regex": False,
    "model_name": "gpt-4",
}

document_chunks_to_index = (
    azure_search_indexer_client.load_files_and_split_into_chunks_from_sharepoint(
        site_domain=os.environ["SITE_DOMAIN"],
        site_name=os.environ["SITE_NAME"],
        file_names=file_names,
        **splitter_params,
    )
)

2024-02-06 11:04:23,101 - micro - MainProcess - INFO     Getting the Site ID... (sharepoint_data_extractor.py:get_site_id:191)
2024-02-06 11:04:23,728 - micro - MainProcess - INFO     Site ID retrieved: mngenvmcap747548.sharepoint.com,877fe60f-a62d-4ed1-8eda-af543c437d2c,ac47d8a7-cd54-4344-bd9d-26ada5a075c0 (sharepoint_data_extractor.py:get_site_id:195)
2024-02-06 11:04:24,556 - micro - MainProcess - INFO     Successfully retrieved drive ID: b!D-Z_hy2m0U6O2q9UPEN9LKfYR6xUzURDvZ0mraWgdcAot0GWx37EQLiVD3sO7-vm (sharepoint_data_extractor.py:get_drive_id:212)
2024-02-06 11:04:24,557 - micro - MainProcess - INFO     Making request to Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:251)
2024-02-06 11:04:25,148 - micro - MainProcess - INFO     Received response from Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:254)
2024-02-06 11:04:26,324 - micro - MainProcess - INFO     Reading .docx file from temporary location C:\Users\pablosal\AppData\Local\Temp\t

In [21]:
# Index the document chunks using the Azure Search Indexer client
azure_search_indexer_client.index_text_embeddings(document_chunks_to_index)

2024-02-06 11:04:31,712 - micro - MainProcess - INFO     Embedding and indexing initiated for 495 text chunks. (ai_search_indexing.py:index_text_embeddings:485)
2024-02-06 11:05:24,688 - micro - MainProcess - INFO     Embedding and indexing completed for 495 text chunks. (ai_search_indexing.py:index_text_embeddings:489)


True

## 📚 Indexing Vectorized Content from Complex Layout Documents Leveraging OCR Capabilities

In this section, we use Azure Document Intelligence in the backend to extract elements from complex layout documents. This process extracts the title and other metadata from the documents, which allows us to chunk the document by sections.

We use a simple algorithm that chunks the document based on sections. This approach keeps each chunk semantically coherent so that it can be indexed separately. However, some sections might be longer than the context window of our model. To mitigate this, we can add a layer to the algorithm that further splits any section exceeding a cutoff length.
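
The section-based chunking with a cutoff described above can be sketched as follows. This is an illustrative approximation, not the code in the project's `by_title` splitter: the helper names `split_into_sections` and `enforce_cutoff` are hypothetical, the sketch assumes the OCR output is markdown with `#`-style headings, and it measures the cutoff in characters rather than tokens.

```python
import re


def split_into_sections(markdown_text: str) -> list[str]:
    """Group lines into sections, starting a new one at each heading line."""
    sections, current = [], []
    for line in markdown_text.splitlines():
        # A markdown heading (1-6 '#' then whitespace) closes the open section.
        if re.match(r"#{1,6}\s", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def enforce_cutoff(sections: list[str], max_chars: int) -> list[str]:
    """Keep short sections whole; split oversized ones by paragraph."""
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        current = ""
        for paragraph in section.split("\n\n"):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if len(candidate) <= max_chars:
                current = candidate
                continue
            if current:
                chunks.append(current)
            # Hard-split a single paragraph that is longer than the cutoff.
            while len(paragraph) > max_chars:
                chunks.append(paragraph[:max_chars])
                paragraph = paragraph[max_chars:]
            current = paragraph
        if current:
            chunks.append(current)
    return chunks


doc = (
    "# Intro\nShort text.\n\n## Details\n"
    + "A" * 50 + "\n\n" + "B" * 50 + "\n\n" + "C" * 130
)
sections = split_into_sections(doc)
section_chunks = enforce_cutoff(sections, max_chars=100)
print([len(c) for c in section_chunks])  # [19, 61, 50, 100, 30]
```

The short "Intro" section stays intact, while the oversized "Details" section is split at paragraph boundaries first and cut mid-paragraph only when one paragraph alone exceeds the cutoff.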

In [22]:
document_blob = "https://testeastusdev001.blob.core.windows.net/customskillspdf/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf"

In [23]:
# Define parameters for the load_files_and_split_into_chunks function
splitter_params = {
    "splitter_type": "by_title",
    "ocr": True,
    "ocr_output_format": "markdown",
    "pages": "3-7",
}

document_chunks_to_index = azure_search_indexer_client.load_files_and_split_into_chunks(
    file_paths=document_blob,
    **splitter_params,
)

2024-02-06 11:05:24,725 - micro - MainProcess - INFO     Blob URL detected. Extracting content. (ocr_document_intelligence.py:analyze_document:147)
2024-02-06 11:05:25,574 - micro - MainProcess - INFO     Successfully downloaded blob file instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (blob_data_extractors.py:extract_content:89)
2024-02-06 11:05:57,701 - micro - MainProcess - INFO     Successfully extracted content from https://testeastusdev001.blob.core.windows.net/customskillspdf/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (ocr_data_extractors.py:extract_content:82)
2024-02-06 11:05:57,705 - micro - MainProcess - INFO     Number of chunks: 9 (by_title.py:split_text_by_headings:35)
2024-02-06 11:05:57,707 - micro - MainProcess - INFO     Processed chunk 1 of 9 (by_title.py:combine_chunks:58)
2024-02-06 11:05:57,708 - micro - MainProcess - INFO     Processed chunk 2 of 9 (by_title.py:combine_chunks:58)
2024-02-06 11:05:

In [24]:
# Index the document chunks using the Azure Search Indexer client
azure_search_indexer_client.index_text_embeddings(document_chunks_to_index)

2024-02-06 11:05:57,751 - micro - MainProcess - INFO     Embedding and indexing initiated for 5 text chunks. (ai_search_indexing.py:index_text_embeddings:485)
2024-02-06 11:05:58,714 - micro - MainProcess - INFO     Embedding and indexing completed for 5 text chunks. (ai_search_indexing.py:index_text_embeddings:489)


True

> Here, we extract the text from the same document using OCR with plain-text output and pass it to the recursive character splitter, rather than sectioning it by title.
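
As a rough illustration of what a recursive character splitter does, the sketch below tries coarse separators first (paragraph breaks), then finer ones (line breaks, spaces), and falls back to a hard cut only when no separator helps. It is a simplified stand-in with a hypothetical helper name, `recursive_split`; it ignores chunk overlap and is not the splitter used by the `AzureAIndexer` backend.

```python
def recursive_split(
    text: str, max_chars: int, separators: tuple = ("\n\n", "\n", " ")
) -> list[str]:
    """Split text into chunks of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: cut at fixed character offsets.
        return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, max_chars, finer)

    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            # Keep merging pieces while the chunk still fits.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) > max_chars:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "alpha beta gamma\ndelta epsilon\n\nzeta eta theta iota kappa"
pieces = recursive_split(text, max_chars=20)
print(pieces)
# ['alpha beta gamma', 'delta epsilon', 'zeta eta theta iota', 'kappa']
```

Preferring paragraph and line boundaries keeps related sentences together, which is why this family of splitters tends to produce chunks of uneven length below the configured maximum.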

In [25]:
# Define parameters for the load_files_and_split_into_chunks function
splitter_params = {
    "splitter_type": "by_character_recursive",
    "ocr": True,
    "ocr_output_format": "text",
    "pages": "3-7",
    "use_encoder": False,
    "chunk_size": 512,
    "chunk_overlap": 128,
    "keep_separator": True,
    "is_separator_regex": False,
    "verbose": True,  # enable detailed logging of the splitting step
}

document_chunks_to_index = azure_search_indexer_client.load_files_and_split_into_chunks(
    file_paths=document_blob,
    **splitter_params,
)

2024-02-06 11:05:58,743 - micro - MainProcess - INFO     Blob URL detected. Extracting content. (ocr_document_intelligence.py:analyze_document:147)
2024-02-06 11:05:58,916 - micro - MainProcess - INFO     Successfully downloaded blob file instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (blob_data_extractors.py:extract_content:89)
2024-02-06 11:06:30,396 - micro - MainProcess - INFO     Successfully extracted content from https://testeastusdev001.blob.core.windows.net/customskillspdf/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (ocr_data_extractors.py:extract_content:82)
2024-02-06 11:06:30,398 - micro - MainProcess - INFO     Creating a splitter of type: by_character_recursive (by_character.py:get_splitter:62)
2024-02-06 11:06:30,399 - micro - MainProcess - INFO     Obtained splitter of type: RecursiveCharacterTextSplitter (by_character.py:split_documents_in_chunks_from_documents:165)
2024-02-06 11:06:30,402 - micro - Mai

Chunk Number: 1, Character Count: 504, Token Count: 107
Chunk Number: 2, Character Count: 339, Token Count: 63
Chunk Number: 3, Character Count: 362, Token Count: 65
Chunk Number: 4, Character Count: 511, Token Count: 96
Chunk Number: 5, Character Count: 472, Token Count: 95
Chunk Number: 6, Character Count: 510, Token Count: 99
Chunk Number: 7, Character Count: 183, Token Count: 36
Chunk Number: 8, Character Count: 253, Token Count: 74
Chunk Number: 9, Character Count: 408, Token Count: 90
Chunk Number: 10, Character Count: 304, Token Count: 55
Chunk Number: 11, Character Count: 380, Token Count: 70
Chunk Number: 12, Character Count: 494, Token Count: 159
Chunk Number: 13, Character Count: 487, Token Count: 142
Chunk Number: 14, Character Count: 244, Token Count: 55
Chunk Number: 15, Character Count: 502, Token Count: 138
Chunk Number: 16, Character Count: 210, Token Count: 62
Chunk Number: 17, Character Count: 454, Token Count: 88
Chunk Number: 18, Character Count: 456, Token Count: 

In [26]:
# Index the document chunks using the Azure Search Indexer client
azure_search_indexer_client.index_text_embeddings(document_chunks_to_index)

2024-02-06 11:06:30,422 - micro - MainProcess - INFO     Embedding and indexing initiated for 38 text chunks. (ai_search_indexing.py:index_text_embeddings:485)
2024-02-06 11:06:37,675 - micro - MainProcess - INFO     Embedding and indexing completed for 38 text chunks. (ai_search_indexing.py:index_text_embeddings:489)


True