## 📚 Prerequisites

Before executing this notebook, make sure you have properly set up your Azure Services, created your Conda environment, and configured your environment variables as per the instructions provided in the [README.md](README.md) file.

## 📋 Table of Contents

This notebook guides you through the following sections:

> **💡 Note:** Please refer to the notebook `01-creation-indexes.ipynb` for detailed information and steps on how to create Azure AI Search Indexes.

1. [**Indexing Vectorized Content from Documents**](#index-documents)
    - Chunk, vectorize, and index local PDF files and website addresses.
    - Download, chunk, vectorize, and index all `.docx` files from a SharePoint site.
    - Download PDF files stored in Blob Storage, apply complex OCR processing through GPT-4 Vision, chunk and vectorize the content, and finally index the processed data in Azure AI Search.
    
2. [**Indexing Vectorized Content from Images**](#index-images)
    - Leverage complex OCR, image recognition, and summarization capabilities using GPT-4 Vision. Chunk, vectorize, and index extracted metadata from images stored in Blob Storage.

3. [**Indexing Vectorized Content from Audio**](#index-audio)
    - Process WAV audio files stored in Blob Storage using Azure AI Speech Translator capabilities, then chunk, vectorize, and index the resulting content in Azure AI Search.

## Getting Started

#### Configure Environment Variables 

Before running this notebook, configure the environment variables listed below. Storing configuration in environment variables rather than in the notebook itself is a more secure practice, as it keeps sensitive data from being accidentally committed and pushed to version control systems.

Create a `.env` file in your project root (use the provided `.env.sample` as a template) and add the following variables:

```env
# Azure AI Search Service Configuration
AZURE_AI_SEARCH_SERVICE_ENDPOINT="<Your Azure Search Service Endpoint>"
AZURE_SEARCH_ADMIN_KEY="<Your Azure Search Admin Key>"
AZURE_SEARCH_INDEX_NAME_DOCUMENTS="<Your Azure Search Index Name for Documents>"
AZURE_SEARCH_INDEX_NAME_IMAGES_AND_AUDIO="<Your Azure Search Index Name for Images and Audio>"

# Azure Speech Service Configuration
SPEECH_KEY="<Your Azure Speech Service Subscription Key>"
SPEECH_REGION="<Your Azure Speech Service Region>"

# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY="<Your OpenAI API Key>"
AZURE_OPENAI_ENDPOINT="<Your OpenAI Endpoint>"
AZURE_OPENAI_API_VERSION="<Your Azure OpenAI API Version>"

# Azure Storage Configuration
AZURE_STORAGE_CONNECTION_STRING="<Your Azure Storage Connection String>"
```

Replace the placeholders (e.g., `<Your Azure Search Service Endpoint>`) with your actual values.

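- `AZURE_AI_SEARCH_SERVICE_ENDPOINT` and `AZURE_SEARCH_ADMIN_KEY` are used to configure the Azure AI Search service; `AZURE_SEARCH_INDEX_NAME_DOCUMENTS` and `AZURE_SEARCH_INDEX_NAME_IMAGES_AND_AUDIO` name the indexes to use.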
- `SPEECH_KEY` and `SPEECH_REGION` are used to configure the Azure Speech service.
- `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `AZURE_OPENAI_API_VERSION` are used to configure the Azure OpenAI service.
- `AZURE_STORAGE_CONNECTION_STRING` is used to configure the Azure Storage service.

> **📌 Note:** Remember not to commit the `.env` file to your version control system. Add it to your `.gitignore` file to prevent it from being tracked.
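
Before running the indexing cells, it can help to confirm that every variable from the template is actually set. The sketch below is one way to do that with the standard library only. It assumes the values from `.env` have already been loaded into the process environment (for example by a loader such as python-dotenv, which this notebook does not show), and `find_missing_vars` is a hypothetical helper, not part of the repository.

```python
import os

# Variable names copied from the .env template above.
REQUIRED_VARS = [
    "AZURE_AI_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_SEARCH_INDEX_NAME_DOCUMENTS",
    "AZURE_SEARCH_INDEX_NAME_IMAGES_AND_AUDIO",
    "SPEECH_KEY",
    "SPEECH_REGION",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_STORAGE_CONNECTION_STRING",
]


def find_missing_vars(required, environ=None):
    """Return the names that are unset, empty, or still a <placeholder>.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    missing = []
    for name in required:
        value = environ.get(name, "").strip().strip('"')
        # Treat "<Your ...>" template text as not configured.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print("Missing or placeholder values:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

The check reports variable names only and never prints their values, so keys do not end up in notebook output.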

#### Setting Up Conda Environment and Configuring VSCode for Jupyter Notebooks (Optional)

Follow these steps to create a Conda environment and set up your VSCode for running Jupyter Notebooks:

##### Create Conda Environment from the Repository

> Instructions for Windows users: 

1. **Create the Conda Environment**:
   - In your terminal or command line, navigate to the repository directory.
   - Execute the following command to create the Conda environment using the `environment.yaml` file:
     ```bash
     conda env create -f environment.yaml
     ```
   - This command creates a Conda environment as defined in `environment.yaml`.

2. **Activating the Environment**:
   - After creation, activate the new Conda environment by using:
     ```bash
     conda activate vector-indexing-azureaisearch
     ```

> Instructions for Linux users (or Windows users with WSL or another Linux setup): 

1. **Use `make` to Create the Conda Environment**:
   - In your terminal or command line, navigate to the repository directory and look at the Makefile.
   - Execute the `make` command specified below to create the Conda environment using the `environment.yaml` file:
     ```bash
     make create_conda_env
     ```

2. **Activating the Environment**:
   - After creation, activate the new Conda environment by using:
     ```bash
     conda activate vector-indexing-azureaisearch
     ```

##### Configure VSCode for Jupyter Notebooks

1. **Install Required Extensions**:
   - Download and install the `Python` and `Jupyter` extensions for VSCode. These extensions provide support for running and editing Jupyter Notebooks within VSCode.

2. **Open the Notebook**:
   - Open the Jupyter Notebook file (`01-indexing-content.ipynb`) in VSCode.

3. **Attach Kernel to VSCode**:
   - After creating the Conda environment, it should be available in the kernel selection dropdown. This dropdown is located in the top-right corner of the VSCode interface.
   - Select your newly created environment (`vector-indexing-azureaisearch`) from the dropdown. This sets it as the kernel for running your Jupyter Notebooks.

4. **Run the Notebook**:
   - Once the kernel is attached, you can run the notebook by clicking on the "Run All" button in the top menu, or by running each cell individually.


By following these steps, you'll establish a dedicated Conda environment for your project and configure VSCode to run Jupyter Notebooks. This environment includes all the dependencies specified in the `environment.yaml` file. To add packages or change versions, run `pip install` in a notebook cell or in the terminal after activating the environment, then restart the kernel so that the changes take effect.
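
To check that the notebook kernel is really running in the new environment, a first cell along the following lines can help. This is a minimal sketch: it assumes the environment was created as a named Conda environment, so that the interpreter prefix ends with the environment name, and `active_env_name` is a hypothetical helper.

```python
import os
import sys

# Environment name used in the `conda activate` commands above.
EXPECTED_ENV = "vector-indexing-azureaisearch"


def active_env_name(prefix=None):
    """Return the last path component of the interpreter prefix.

    For a named Conda environment this is usually the environment name
    (an assumption; environments created with a custom path may differ).
    """
    prefix = sys.prefix if prefix is None else prefix
    return os.path.basename(os.path.normpath(prefix))


print("Interpreter:", sys.executable)
print("Environment:", active_env_name())
if active_env_name() != EXPECTED_ENV:
    print(f"Warning: the kernel does not appear to be '{EXPECTED_ENV}'.")
```

If the warning appears, reselect the kernel in VSCode and rerun the cell.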

In [1]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing"  # change your directory here

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing
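
The cell above hard-codes a user-specific path. As an alternative, the repository root can be located by walking up from the current directory until a known file is found. The sketch below uses `environment.yaml` as the marker, since the setup steps run `conda env create -f environment.yaml` from the repository directory; `find_repo_root` is a hypothetical helper.

```python
from pathlib import Path


def find_repo_root(start=None, marker="environment.yaml"):
    """Walk upward from `start` until a directory containing `marker` is found.

    Returns the directory as a Path, or None if no parent contains the marker.
    """
    current = Path(start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    return None


root = find_repo_root()
if root is None:
    print("Repository root not found; set the directory manually.")
else:
    print(f"Repository root: {root}")
```

Passing the result to `os.chdir` would then replace the hard-coded `target_directory`.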


## Create Azure AI Search Indexes 

Please refer to the notebook [01-creation-indexes.ipynb](01-creation-indexes.ipynb) for detailed information and steps on how to create Azure AI Search Indexes. 

# Indexing Vectorized Content from Documents

In [2]:
# Import the AzureAIChunkIndexer class from the langchain_integration module
from src.azure_ai_search.langchain_integration import AzureAIChunkIndexer

DEPLOYMENT_NAME = "foundational-ada"

# Create an instance of the AzureAIChunkIndexer class
azure_search_indexer_client = AzureAIChunkIndexer(
    index_name="test-diferences", embedding_azure_deployment_name=DEPLOYMENT_NAME
)

# # load the environment variables from the .env file
# gbb_ai_client.load_environment_variables_from_env_file()

# # Specify the name of the deployment in Azure AI Services
# DEPLOYMENT_NAME = "foundational-ada"

# # Load the embedding model associated with the specified deployment
# embedding_model = gbb_ai_client.load_embedding_model(azure_deployment=DEPLOYMENT_NAME)

# gbb_ai_client.load_azureai_index()

2024-01-07 21:06:55,138 - micro - MainProcess - INFO     PDFHelper initialized. (pdf_data_extractor.py:__init__:20)
2024-01-07 21:06:55,146 - micro - MainProcess - INFO     Loading OpenAIEmbeddings object with model, deployment foundational-ada, and chunk size 1000 (langchain_integration.py:load_embedding_model:150)
  warn_deprecated(
  warn_deprecated(
2024-01-07 21:06:56,812 - micro - MainProcess - INFO     AzureOpenAIEmbeddings object has been created successfully. You can now access the embeddings using the '.embeddings' attribute. (langchain_integration.py:load_embedding_model:161)
vector_search_configuration is not a known attribute of class <class 'azure.search.documents.indexes.models._index.SearchField'> and will be ignored
2024-01-07 21:06:58,255 - micro - MainProcess - INFO     The Azure AI search index 'test-diferences' has been loaded correctly. (langchain_integration.py:load_azureai_index:203)


## Indexing PDFs

In [3]:
pdf_path = "utils\\data\\autogen.pdf"
url_pdf = "https://arxiv.org/pdf/2308.08155.pdf"
blob_path = "https://testeastusdev001.blob.core.windows.net/testretrieval/autogen.pdf"

In [4]:
azure_search_indexer_client.read_and_load_pdf(pdf_url=blob_path)

2024-01-07 21:07:01,575 - micro - MainProcess - INFO     Downloading and reading PDF file from https://testeastusdev001.blob.core.windows.net/testretrieval/autogen.pdf. (langchain_integration.py:read_and_load_pdf:352)
2024-01-07 21:07:01,580 - micro - MainProcess - INFO     Initialized AzureBlobManager with container testretrieval (blob_data_extractor.py:__init__:51)


C:\Users\pablosal\AppData\Local\Temp\tmpme9wefz4.pdf


PermissionError: [Errno 13] Permission denied: 'C:\\Users\\pablosal\\AppData\\Local\\Temp\\tmpme9wefz4.pdf'
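
The output above does not show what caused the `PermissionError`. One common cause on Windows is reopening a temporary file by path while the handle that created it is still open; whether that is what happens inside `read_and_load_pdf` is an assumption. The sketch below shows a pattern that avoids the problem by closing the handle before the path is handed to another reader. `write_temp_pdf` is a hypothetical helper, and the bytes written are a placeholder rather than a real PDF.

```python
import os
import tempfile


def write_temp_pdf(data: bytes) -> str:
    """Write `data` to a temporary .pdf file and return its path.

    The file handle is closed before returning, so other code can
    reopen the path. The caller is responsible for deleting the file.
    """
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


path = write_temp_pdf(b"%PDF-1.4 placeholder bytes")
try:
    # Reopening by path works because the writing handle is already closed.
    with open(path, "rb") as reopened:
        print(reopened.read(8))
finally:
    os.remove(path)
```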

In [9]:
azure_search_indexer_client.read_and_load_pdf(pdf_url=url_pdf)

2024-01-07 20:38:13,644 - micro - MainProcess - INFO     Reading PDF file from https://arxiv.org/pdf/2308.08155.pdf. (langchain_integration.py:read_and_load_pdf:366)


[Document(page_content='AutoGen : Enabling Next-Gen LLM\nApplications via Multi-Agent Conversation\nQingyun Wu†, Gagan Bansal∗, Jieyu Zhang±, Yiran Wu†, Beibin Li∗\nErkang Zhu∗, Li Jiang∗, Xiaoyun Zhang∗, Shaokun Zhang†, Jiale Liu∓\nAhmed Awadallah∗, Ryen W. White∗, Doug Burger∗, Chi Wang∗1\n∗Microsoft Research,†Pennsylvania State University\n±University of Washington,∓Xidian University\nAgent CustomizationConversable agent\nFlexible Conversation Patterns\n…\n…\n…\n…\n…\n…\n…\nHierarchical chatJoint chatMulti-Agent Conversations…Execute the following code…\nGot it! Here is the revised code …No, please plot % change!Plot a chart of META and TESLA stock price change YTD.\nOutput:$Month\nOutput:%MonthError package yfinanceis not installed\nSorry! Please first pip install yfinanceand then execute the code\nInstalling…\nExample Agent Chat\nFigure 1: AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left)\

In [7]:
azure_search_indexer_client.read_and_load_pdf(pdf_path=pdf_path)

2024-01-07 20:37:47,373 - micro - MainProcess - INFO     Reading PDF file from C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing\utils\data\autogen.pdf. (langchain_integration.py:read_and_load_pdf:342)


[Document(page_content='AutoGen : Enabling Next-Gen LLM\nApplications via Multi-Agent Conversation\nQingyun Wu†, Gagan Bansal∗, Jieyu Zhang±, Yiran Wu†, Beibin Li∗\nErkang Zhu∗, Li Jiang∗, Xiaoyun Zhang∗, Shaokun Zhang†, Jiale Liu∓\nAhmed Awadallah∗, Ryen W. White∗, Doug Burger∗, Chi Wang∗1\n∗Microsoft Research,†Pennsylvania State University\n±University of Washington,∓Xidian University\nAgent CustomizationConversable agent\nFlexible Conversation Patterns\n…\n…\n…\n…\n…\n…\n…\nHierarchical chatJoint chatMulti-Agent Conversations…Execute the following code…\nGot it! Here is the revised code …No, please plot % change!Plot a chart of META and TESLA stock price change YTD.\nOutput:$Month\nOutput:%MonthError package yfinanceis not installed\nSorry! Please first pip install yfinanceand then execute the code\nInstalling…\nExample Agent Chat\nFigure 1: AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left)\

In [3]:
# Load a local PDF and split it into chunks
# Path to the PDF file to index
file_1 = "utils\\data\\ultraflex_user_manual.pdf"

# Set the chunk size and overlap size for splitting the text
CHUNK_SIZE = 512
OVERLAP_SIZE = 128
SEPARATOR = r"(\n\w|\w\n)"  # raw string avoids invalid-escape warnings; not passed to the call below

# NOTE: `gbb_ai_client` is not created in the cells shown above; judging by the log
# below, this cell appears to have run against a client from indexing_azureai_search.py
# Read the PDF, split the text into chunks, and store the chunks
# The text is split into chunks of size CHUNK_SIZE, with an overlap of OVERLAP_SIZE between consecutive chunks
text_chunked = gbb_ai_client.load_and_split_text_by_character_from_pdf(
    source=file_1, chunk_size=CHUNK_SIZE, chunk_overlap=OVERLAP_SIZE
)

# Embed the chunks and index them in Azure AI Search
# This function converts the text chunks into vector embeddings and stores them in the Azure AI Search index
gbb_ai_client.embed_and_index(text_chunked)

2024-01-07 17:11:25,576 - micro - MainProcess - INFO     Reading PDF files from C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing\utils\data\ultraflex_user_manual.pdf. (indexing_azureai_search.py:read_and_load_pdfs:336)
2024-01-07 17:11:37,051 - micro - MainProcess - INFO     Starting to embed and index 39 chuncks. (indexing_azureai_search.py:embed_and_index:402)
2024-01-07 17:11:41,483 - micro - MainProcess - INFO     Successfully embedded and indexed 39 chuncks. (indexing_azureai_search.py:embed_and_index:404)
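
To make the roles of `CHUNK_SIZE` and `OVERLAP_SIZE` concrete, the sketch below splits a string into fixed-size windows that share a fixed number of characters. It is a simplified illustration, not the splitter used by `load_and_split_text_by_character_from_pdf`, which may also split on separators; `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size=512, chunk_overlap=128):
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so each new
    window starts `chunk_size - chunk_overlap` characters after the last.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 1200
chunks = split_with_overlap(sample)
print([len(chunk) for chunk in chunks])  # [512, 512, 432]
```

With a chunk size of 512 and an overlap of 128, each chunk starts 384 characters after the previous one, so text near a chunk boundary appears in both neighbouring chunks.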


In [18]:
pdf_path = "utils\\data\\autogen.pdf"
url_pdf = "https://arxiv.org/pdf/2308.08155.pdf"

In [9]:
from langchain.document_loaders import PyPDFLoader, WebBaseLoader

In [24]:
loader = PyPDFLoader(pdf_path)
document_path = loader.load()

In [21]:
loader = PyPDFLoader(url_pdf)
document_url = loader.load()

In [25]:
document_path

[Document(page_content='AutoGen : Enabling Next-Gen LLM\nApplications via Multi-Agent Conversation\nQingyun Wu†, Gagan Bansal∗, Jieyu Zhang±, Yiran Wu†, Beibin Li∗\nErkang Zhu∗, Li Jiang∗, Xiaoyun Zhang∗, Shaokun Zhang†, Jiale Liu∓\nAhmed Awadallah∗, Ryen W. White∗, Doug Burger∗, Chi Wang∗1\n∗Microsoft Research,†Pennsylvania State University\n±University of Washington,∓Xidian University\nAgent CustomizationConversable agent\nFlexible Conversation Patterns\n…\n…\n…\n…\n…\n…\n…\nHierarchical chatJoint chatMulti-Agent Conversations…Execute the following code…\nGot it! Here is the revised code …No, please plot % change!Plot a chart of META and TESLA stock price change YTD.\nOutput:$Month\nOutput:%MonthError package yfinanceis not installed\nSorry! Please first pip install yfinanceand then execute the code\nInstalling…\nExample Agent Chat\nFigure 1: AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left)\

In [26]:
document_url

[Document(page_content='AutoGen : Enabling Next-Gen LLM\nApplications via Multi-Agent Conversation\nQingyun Wu†, Gagan Bansal∗, Jieyu Zhang±, Yiran Wu†, Beibin Li∗\nErkang Zhu∗, Li Jiang∗, Xiaoyun Zhang∗, Shaokun Zhang†, Jiale Liu∓\nAhmed Awadallah∗, Ryen W. White∗, Doug Burger∗, Chi Wang∗1\n∗Microsoft Research,†Pennsylvania State University\n±University of Washington,∓Xidian University\nAgent CustomizationConversable agent\nFlexible Conversation Patterns\n…\n…\n…\n…\n…\n…\n…\nHierarchical chatJoint chatMulti-Agent Conversations…Execute the following code…\nGot it! Here is the revised code …No, please plot % change!Plot a chart of META and TESLA stock price change YTD.\nOutput:$Month\nOutput:%MonthError package yfinanceis not installed\nSorry! Please first pip install yfinanceand then execute the code\nInstalling…\nExample Agent Chat\nFigure 1: AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left)\

In [23]:
document_url == document_path

False
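
The comparison returns `False` even though the visible page text looks identical. A likely explanation, which the notebook does not confirm, is that the documents differ in their metadata (for example the `source` field) rather than in their text. The sketch below compares text and metadata separately. The two helpers are hypothetical, and the `SimpleNamespace` objects with illustrative values stand in for loaded documents so the example runs without the PDF.

```python
from types import SimpleNamespace


def same_page_content(docs_a, docs_b):
    """Return True if both lists contain the same page text in the same order."""
    return [d.page_content for d in docs_a] == [d.page_content for d in docs_b]


def metadata_differences(docs_a, docs_b):
    """Return (page index, key, value_a, value_b) for each metadata value that differs."""
    diffs = []
    for index, (a, b) in enumerate(zip(docs_a, docs_b)):
        for key in sorted(set(a.metadata) | set(b.metadata)):
            if a.metadata.get(key) != b.metadata.get(key):
                diffs.append((index, key, a.metadata.get(key), b.metadata.get(key)))
    return diffs


# Hypothetical stand-ins for `document_path` and `document_url`.
local = [SimpleNamespace(page_content="AutoGen",
                         metadata={"source": "utils/data/autogen.pdf", "page": 0})]
remote = [SimpleNamespace(page_content="AutoGen",
                          metadata={"source": "https://arxiv.org/pdf/2308.08155.pdf", "page": 0})]

print(same_page_content(local, remote))
print(metadata_differences(local, remote))
```

Calling the same helpers on `document_path` and `document_url` would show whether the mismatch comes from the text or only from the metadata.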