# **Incorporating Azure Document Intelligence with Azure AI Search**
##### **Goals:**
- Use Azure Document Intelligence to structure PDF files.
- Store structured data in Azure Data Lake Gen2 using Python.
- Generate 3 JSON files for Azure AI Search's Data Source, Indexes, and Indexers.
- Query the Azure AI Search Indexes.

##### **Dependencies**
Install the necessary libraries:
- `!pip install azure-ai-formrecognizer==3.3.0`
- `!pip install azure-storage-file-datalake`
- All PDF files are located under the `Files` folder.

In [None]:
!pip install azure-ai-formrecognizer==3.3.0
!pip install azure-storage-file-datalake

# Access Azure Document Intelligence
- Make sure your model is trained and ready to use.
- A composed model is strongly recommended.
- Note: make sure `azure-ai-formrecognizer==3.3.0` is installed (`pip install azure-ai-formrecognizer==3.3.0`).

#### Setting Variables for Azure AI Document Intelligence Access

- Change the file path and values to match your own environment.

In [None]:
# Load environment variables or key values
df = spark.read.format("csv").option("header","true").load("Files/AI300X/ai3002/docs/configfile/az-doc-intl-config.csv").toPandas()

# Assign variables for Azure Document Intelligence access
DOC_INTELLIGENCE_ENDPOINT = df["DOC_INTELLIGENCE_ENDPOINT"][0]
DOC_INTELLIGENCE_KEY = df["DOC_INTELLIGENCE_KEY"][0]

# Print endpoint
print(DOC_INTELLIGENCE_ENDPOINT)

# Set the composed model ID (change this to the ID of your own model)
COMPOSE_MODEL_ID = "TaxFormsModel"
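
The configuration file read above is assumed to be a CSV with one header row and one row of values. Here is a minimal sketch of that layout, parsed with the standard library instead of Spark. The endpoint and key shown are placeholders, not real credentials, and the exact file contents are an assumption based on the column names used in the cell above.

```python
import csv
import io

# Hypothetical contents of az-doc-intl-config.csv (placeholder values only)
sample_csv = (
    "DOC_INTELLIGENCE_ENDPOINT,DOC_INTELLIGENCE_KEY\n"
    "https://<your-resource>.cognitiveservices.azure.com/,<your-key>\n"
)

# Parse the header row and take the first data row,
# which mirrors the df["..."][0] lookups in the cell above
rows = list(csv.DictReader(io.StringIO(sample_csv)))
config = rows[0]

print(config["DOC_INTELLIGENCE_ENDPOINT"])
```

Keeping the key in a separate file (or a secret store) rather than in the notebook itself avoids committing credentials along with the code.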

##### Import Libraries for Azure Form Recognizer, File Operations, and Data Manipulation

In [None]:
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
import os
import pandas as pd

##### Configure Azure Form Recognizer Service
- Specify the endpoint and key.
- Provide the composed model ID. There is no need to specify the IDs of the individual component models, because the composed model selects the appropriate one for each document.

In [None]:
endpoint = DOC_INTELLIGENCE_ENDPOINT
key = DOC_INTELLIGENCE_KEY
model_id = COMPOSE_MODEL_ID  # ID of the composed model

##### Create an instance of the DocumentAnalysisClient
- `endpoint`: The endpoint URL of your Azure Form Recognizer resource
- `credential`: The API key for your Azure Form Recognizer resource

In [None]:
document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

##### Read multiple 1099 PDF form files using the composed model in Azure AI Document Intelligence
- Even if the directory changes or the forms differ in layout, the composed model routes each document to the matching component model.
- Change the directory path to match your own file location.

In [None]:
# Directory containing the 1099 example PDFs
dir_path = "/lakehouse/default/Files/AI300X/ai3002/docs/trainingdata/1099examples/"

json_data = []
counter = 1  # Running ID assigned to each extracted document

# Loop through all the files in the directory
for filename in os.listdir(dir_path):
    if filename.endswith(".pdf"):
        # Build the full path to the file
        file_path = os.path.join(dir_path, filename)
        print(file_path)
        # Read the file as bytes
        with open(file_path, "rb") as f:
            file_bytes = f.read()

        # Reuse the client created earlier. Make sure your document's type is
        # included in the list of document types the custom model can analyze
        poller = document_analysis_client.begin_analyze_document(model_id, file_bytes)
        result = poller.result()

        for document in result.documents:
            doc_data = {}  # Reset the dictionary for each document
            for name, field in document.fields.items():
                # Fall back to the raw text only when no typed value was extracted;
                # a plain truthiness check would discard valid values such as 0
                field_value = field.value if field.value is not None else field.content
                doc_data[name] = field_value

            # Add the filename to the dictionary
            doc_data['Filename'] = filename

            # Add a unique identifier to the dictionary
            doc_data['ID'] = counter
            counter += 1  # Increment the counter

            # Append the dictionary to the list
            json_data.append(doc_data)

# print json data
print(json_data)

# Convert the extracted data to a DataFrame for a tabular view
df = pd.DataFrame(json_data)

# Print the DataFrame
display(df)
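
Typed field values are not always plain strings or numbers. Date fields, for example, may come back as `datetime.date` objects, which `json.dumps` cannot serialize when the data is saved later. The following is a minimal sketch of a conversion helper, not part of the original notebook. The name `to_serializable` is hypothetical, and the `to_dict()` branch assumes that the SDK's model objects expose such a method, which should be verified for your SDK version.

```python
import datetime
import json

def to_serializable(value):
    """Recursively convert a field value into something json.dumps accepts."""
    if isinstance(value, (datetime.date, datetime.time)):
        # Dates, datetimes, and times become ISO 8601 strings
        return value.isoformat()
    if isinstance(value, dict):
        return {key: to_serializable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_serializable(item) for item in value]
    if hasattr(value, "to_dict"):
        # Assumption: SDK model objects provide to_dict(); check your version
        return to_serializable(value.to_dict())
    return value

# Illustrative record with made-up field names
record = {
    "TaxYear": datetime.date(2023, 1, 1),
    "Amount": 0.0,
    "Items": [datetime.time(9, 30)],
}
clean = to_serializable(record)
print(json.dumps(clean))
```

One way to use it would be to apply the helper to each `doc_data` dictionary before appending it to `json_data`.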

#### Storing Azure Document Intelligence Data in Azure Data Lake Gen2

##### Defining Two Python Functions
- `initialize_storage_account(storage_account_name, storage_account_key)`: Establishes the connection.
- `save_json_to_adl(json_data, file_path)`: Saves the JSON file to Azure Data Lake Storage.
- Note: make sure `azure-storage-file-datalake` is installed (`pip install azure-storage-file-datalake`).

#### Creating a Python function to establish a connection to Azure Data Lake Storage.

In [None]:
from azure.storage.filedatalake import DataLakeServiceClient

def initialize_storage_account(storage_account_name, storage_account_key):
    # Keep the client in a module-level variable so save_json_to_adl can use it
    global service_client
    try:
        service_client = DataLakeServiceClient(
            account_url=f"https://{storage_account_name}.dfs.core.windows.net",
            credential=storage_account_key,
        )
    except Exception as e:
        print(e)

#### Creating a Python Function for Storing JSON Data in Azure Data Lake Storage

In [None]:
import json

def save_json_to_adl(json_data, file_path):
    try:
        # Split the file path into container name and file name
        container_name, file_name = file_path.split('/', 1)

        # Get a DataLakeFileSystemClient for the container
        filesystem_client = service_client.get_file_system_client(container_name)

        # Get a DataLakeFileClient for the file
        file_client = filesystem_client.get_file_client(file_name)

        # Convert the JSON data to a string
        json_str = json.dumps(json_data)

        # Upload the JSON string to the file
        file_client.upload_data(json_str, overwrite=True)
        
        print(f'Successfully uploaded JSON data to {file_path}.')

    except Exception as e:
        print(e)
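
`save_json_to_adl` expects `file_path` in the form `<container>/<path inside the container>` and splits it at the first slash only. Here is a small sketch of that step in isolation. The helper name `split_adl_path` is hypothetical and exists only for illustration.

```python
def split_adl_path(file_path):
    """Split '<container>/<path>' at the first slash only."""
    container_name, file_name = file_path.split("/", 1)
    return container_name, file_name

container, name = split_adl_path("custom/az-ai-search-indexes/f1099msc_payer.json")
print(container)  # custom
print(name)       # az-ai-search-indexes/f1099msc_payer.json
```

A path without any slash makes the unpacking raise a `ValueError`, which the function above would catch and print rather than upload anything.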

#### Saving `json_data` Content to Azure Data Lake

#### Setting Variables for Azure Data Lake Storage Access

- Change the file path and values to match your own environment.

In [None]:
# Load environment variables or key values
df = spark.read.format("csv").option("header","true").load("Files/AI300X/ai3002/docs/configfile/az_dls_config.csv").toPandas()

# Assign variables for Azure Data Lake Storage access
storage_account_name = df["storage_account_name"][0]
storage_account_key = df["storage_account_key"][0]

print(storage_account_name)


In [None]:
# Initialize your storage account
initialize_storage_account(storage_account_name, storage_account_key)

# Build the target path in the form "<container>/<path to file>"
index_file_name = "f1099msc_payer"
file_path = f"custom/az-ai-search-indexes/{index_file_name}.json"

# Upload the extracted data
save_json_to_adl(json_data, file_path)

#### Azure Document Intelligence & Azure AI Search Integration

###### Goal
- Provision an Azure AI Search service.
- Create the Indexes.
- Create the Data Source.
- Create the Indexers.
- Note: You need to define the properties of your indexes. If you are using the PDF files provided with this notebook, this has already been done for you.
- All Azure AI Search JSON definition files can be found in the file folder.
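
To give an idea of what the three JSON definition files might contain, here is a minimal sketch built in Python. It is not the content of the files shipped with this notebook. The object names (other than the index name used later in the query), the field list, and the connection string placeholder are all assumptions, and a real index would define one field per extracted form field. Note that the key field of an index must be of type `Edm.String`.

```python
import json

# All names and field choices below are illustrative assumptions
data_source = {
    "name": "f1099-msc-payer-ds",  # hypothetical name
    "type": "adlsgen2",
    "credentials": {"connectionString": "<your-connection-string>"},
    "container": {"name": "custom", "query": "az-ai-search-indexes"},
}

index = {
    "name": "f1099-msc-payer-index",
    "fields": [
        # The document key must be an Edm.String field
        {"name": "ID", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "Filename", "type": "Edm.String", "searchable": True},
    ],
}

indexer = {
    "name": "f1099-msc-payer-indexer",  # hypothetical name
    "dataSourceName": data_source["name"],
    "targetIndexName": index["name"],
    # The uploaded file is a JSON array with one element per document
    "parameters": {"configuration": {"parsingMode": "jsonArray"}},
}

# Serialize each definition as it would appear in its JSON file
definitions = {
    "data_source.json": json.dumps(data_source, indent=2),
    "Indexes.json": json.dumps(index, indent=2),
    "indexers.json": json.dumps(indexer, indent=2),
}
print(definitions["indexers.json"])
```

The indexer ties the other two objects together by name, which is why the data source and the index have to exist before the indexer is created.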

#### Creating a Python function to connect and create objects on Azure AI Search

In [None]:
import requests
import json

def create_az_ai_search_objects(endpoint, api_key, objecttype, jsonfile, version):
    # Build the URL for creating the object (data source, index, or indexer).
    # Note: `endpoint` is the search service name, not a full URL.
    url = f"https://{endpoint}.search.windows.net/{objecttype}?api-version={version}"
    print(url)
    # Define the request headers; creating objects requires an admin key,
    # not a query key
    headers = {
        "api-key": api_key,
        "Content-Type": "application/json"
    }

    # Load the object definition from the JSON file
    with open(jsonfile, 'r') as f:
        data = json.load(f)

    # Make the POST request
    response = requests.post(url, headers=headers, json=data)
    # Check the status of the request
    print(response.content)
    if response.status_code in (200, 201):
        print("Request was successful.")
    else:
        print(f"Request failed. Status code: {response.status_code}")
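
Because the function above uses POST on the collection, re-running a cell for an object that already exists is expected to fail. The Azure AI Search REST API also accepts a PUT on `/{objecttype}/{name}`, which creates the object or updates it if it already exists. Here is a hedged sketch of that alternative. The function names `build_upsert_request` and `upsert_search_object` are hypothetical, and the service name and definition in the example call are made up.

```python
import json

def build_upsert_request(service_name, object_type, definition, api_version):
    """Return the URL and JSON body for a create-or-update (PUT) call."""
    object_name = definition["name"]  # the name is part of the URL for PUT
    url = (
        f"https://{service_name}.search.windows.net/"
        f"{object_type}/{object_name}?api-version={api_version}"
    )
    return url, json.dumps(definition)

def upsert_search_object(service_name, api_key, object_type, definition, api_version):
    """Send the PUT request; raises requests.HTTPError on a failed call."""
    import requests  # imported here so the URL helper above has no dependencies

    url, body = build_upsert_request(service_name, object_type, definition, api_version)
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    response = requests.put(url, headers=headers, data=body)
    response.raise_for_status()
    return response.status_code

# Example with made-up values; no network call is made here
url, body = build_upsert_request(
    "my-service", "indexes", {"name": "demo-index", "fields": []}, "2023-11-01"
)
print(url)
```

Raising on failure instead of printing the status makes a broken definition stop the notebook at the cell that caused it.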

#### Setting Variables for Azure AI Search Access

- Change the file path and values to match your own environment.

In [None]:
# Load environment variables or key values
df = spark.read.format("csv").option("header","true").load("Files/AI300X/ai3002/docs/configfile/az_ai_search_config.csv").toPandas()

# Assign variables for Azure AI Search access
az_ai_search_endpoint = df["az_ai_search_endpoint"][0]
az_ai_search_api_key = df["az_ai_search_api_key"][0]
az_ai_search_version = df["az_ai_search_version"][0]

print(az_ai_search_version)


In [None]:
# Create the data source
objecttype = "datasources"
jsonfile   = "/lakehouse/default/Files/AI300X/ai-search/data_source.json"

create_az_ai_search_objects(az_ai_search_endpoint, az_ai_search_api_key, objecttype, jsonfile,az_ai_search_version)

In [None]:
# Create the Indexes (the API version comes from the config file loaded above)
objecttype = "indexes"
jsonfile   = "/lakehouse/default/Files/AI300X/ai-search/Indexes.json"

create_az_ai_search_objects(az_ai_search_endpoint, az_ai_search_api_key, objecttype, jsonfile, az_ai_search_version)

In [None]:
# Create the Indexers (the API version comes from the config file loaded above)
objecttype = "indexers"
jsonfile   = "/lakehouse/default/Files/AI300X/ai-search/indexers.json"

create_az_ai_search_objects(az_ai_search_endpoint, az_ai_search_api_key, objecttype, jsonfile, az_ai_search_version)

#### Creating a Python function to query Azure AI Search indexes

In [None]:
import requests
import json
import pprint

def query_az_ai_search(api_key, endpoint, index_name, search_query, objecttype="indexes"):
    # Define the headers for the API request
    headers = {
        'api-key': api_key,
        'Content-Type': 'application/json'
    }

    # Build the query URL; `endpoint` is the search service name
    url = f"https://{endpoint}.search.windows.net/{objecttype}/{index_name}/docs"

    # Make the API request; search_query is sent as URL query parameters
    response = requests.get(url, headers=headers, params=search_query)

    # Parse the JSON response
    data = response.json()

    # Pretty print the data
    pprint.pprint(data, indent=4)

    # Return the parsed response so the caller can use it
    return data

In [None]:
# Define the parameters for the API request
index_name="f1099-msc-payer-index"
search_query = {
    'api-version': '2023-11-01',
    'search': '*'
}

az_ai_search_data = query_az_ai_search(az_ai_search_api_key, az_ai_search_endpoint, index_name, search_query)

print(az_ai_search_data)
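
The `search_query` dictionary is passed as URL query parameters, so other parameters of the search GET request can be added in the same way. The following sketch builds such a dictionary with the `$filter`, `$select`, and `$top` parameters. The helper name `build_search_params` is hypothetical, and a filter only works on fields that are marked filterable in your index definition.

```python
from urllib.parse import urlencode

def build_search_params(search_text="*", filter_expr=None, select=None,
                        top=None, api_version="2023-11-01"):
    """Build the query-parameter dictionary for a search GET request."""
    params = {"api-version": api_version, "search": search_text}
    if filter_expr:
        params["$filter"] = filter_expr        # OData filter expression
    if select:
        params["$select"] = ",".join(select)   # comma-separated field list
    if top is not None:
        params["$top"] = top                   # maximum number of results
    return params

# Return only two fields of the first five documents
params = build_search_params(select=["ID", "Filename"], top=5)
print(urlencode(params))
```

The resulting dictionary can be passed as `search_query` to `query_az_ai_search` in place of the one defined above.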