Azure Function for Document Processing

This Azure Function processes documents uploaded to Azure Blob Storage. It uses Azure AI Document Intelligence to extract and analyze the document content, and OpenAI's text classification capabilities to categorize the extracted text. An embedding model then converts the classified text into vector data, which is loaded into Azure Search Service for indexing and searching.

Expected input and output

Input: User uploads a document to Azure Blob Storage.
- The document can be in various formats such as PDF, Word, or text.
- The user needs to have appropriate permissions to upload the document.
- The document is stored securely in the cloud.
Processing: The upload triggers the Azure Function, which processes the document.
Output: The document is indexed in Azure AI Search together with the key skills extracted from it.
- The document becomes searchable within the Azure AI Search service.
- Users can search for the document using keywords and filters.
- The search results highlight the key skills and information extracted from the document.

Prerequisites

Azure Subscription
Python 3.8 or later
Azure CLI
Azure Functions Core Tools
Azure Storage Account
Azure Cognitive Services (Document Intelligence)
OpenAI API Key
Azure Search Service
A managed identity that lets Document Intelligence read from Blob Storage

Architecture

Trigger: The function is triggered when a document is uploaded to Azure Blob Storage.
Document Intelligence: The content of the document is read using Azure Document Intelligence.
Text Classification: The text is classified using OpenAI's API.
Embedding Model: The classified text is converted into vector data using an embedding model.
Azure Search: The vector data is loaded into Azure Search Service for indexing and searching.
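
The five stages above form a linear pipeline. The sketch below only illustrates that flow and is not the repository's code: the function name `process_document` and the four callables it receives are hypothetical stand-ins for the Document Intelligence, OpenAI, embedding, and Azure Search clients.

```python
import json
from typing import Any, Callable, Dict, List


def process_document(
    blob_url: str,
    read_text: Callable[[str], str],
    classify: Callable[[str], Dict[str, Any]],
    embed: Callable[[str], List[float]],
    upload: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    """Run the stages in order; every callable is a hypothetical stand-in."""
    full_text = read_text(blob_url)        # Document Intelligence: extract text
    fields = classify(full_text)           # OpenAI: classify / extract fields
    vector = embed(json.dumps(fields))     # Embedding model: fields -> vector
    record = {**fields, "content": full_text, "searchVector": vector}
    upload(record)                         # Azure Search: index the record
    return record
```

Passing the clients in as callables keeps each stage replaceable, so the flow can be exercised locally with simple fakes before any Azure resource exists.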

Setup

1. Clone the Repository

git clone https://github.com/mishravivek-ms/azfunction_openai_documentupload_aisearch_python.git

2. Set Up Environment Variables

Create a file named ".env".
Copy the contents of example.env into the .env file.
Create a Cognitive Services resource for Form Recognizer (Document Intelligence).
Create an Azure OpenAI service.
Create a storage account for document uploads.
Create an Azure Search service.
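
Once the values from .env are loaded into the process environment, it can help to fail fast when one is missing. The following is a minimal sketch under assumptions: the setting names in `REQUIRED_SETTINGS` are hypothetical placeholders, because the real keys are whatever example.env defines.

```python
import os
from typing import Iterable, List, Mapping

# Hypothetical setting names; replace them with the keys from example.env.
REQUIRED_SETTINGS = (
    "DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT",
    "STORAGE_ACCOUNT_NAME",
    "AI_SEARCH_ENDPOINT",
)


def missing_settings(
    env: Mapping[str, str], required: Iterable[str] = REQUIRED_SETTINGS
) -> List[str]:
    """Return the required names that are absent or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


def check_environment() -> None:
    """Raise a readable error if the process environment is incomplete."""
    missing = missing_settings(os.environ)
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
```

Calling `check_environment()` at startup surfaces a configuration gap immediately instead of partway through processing a document.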

3. Code details

1. Verify that the Azure Search index exists. If it does not, create it.

    from azure.core.exceptions import ResourceNotFoundError

    try:
        # Look up the index; a missing index raises ResourceNotFoundError
        search_index_client.get_index(ai_search_index)
        print("Index already exists")
        return
    except ResourceNotFoundError:
        # The index does not exist, so fall through and create it
        pass
    ........
    ........
    ........
    # Create the index
    index = SearchIndex(name=ai_search_index, fields=fields,
                        vector_search=vector_search)
    result = search_index_client.create_or_update_index(index)

2. Retrieve the PDF from Blob Storage and use Document Intelligence to extract its text and content with the prebuilt layout model.

    blob_url = f"https://{storage_account_name}.blob.core.windows.net/{input_file}"
    analyze_request = {"urlSource": blob_url}
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout", analyze_request=analyze_request
    )
    result: AnalyzeResult = poller.result()

    # Keep the full extracted text for the later steps
    full_text = result.content
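
File names with spaces or other reserved characters would produce an invalid URL if interpolated directly. As a hedged sketch, the hypothetical helper `build_blob_url` below percent-encodes the path first; it assumes the path has the form "<container>/<blob name>", as `input_file` appears to have in the snippet above.

```python
from urllib.parse import quote


def build_blob_url(storage_account_name: str, blob_path: str) -> str:
    """Build a blob endpoint URL, escaping characters that are unsafe in URLs.

    blob_path is assumed to look like "<container>/<blob name>".
    """
    # quote() leaves "/" untouched by default, so path separators survive.
    encoded_path = quote(blob_path.lstrip("/"))
    return f"https://{storage_account_name}.blob.core.windows.net/{encoded_path}"
```

For example, `build_blob_url("acct", "resumes/My CV.pdf")` yields a URL ending in `resumes/My%20CV.pdf`.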

3. Use OpenAI to extract the required items based on a prompt. This snippet is part of a function that sends the text to a language model and extracts information in JSON format.

    messages = [{"role": "system", "content": resume_indexing_prompt}]
    messages.append({"role": "user", "content": full_text})

    response = primary_llm_json.invoke(messages)
    extraction_json = json.loads(response.content)
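
A bare `json.loads` fails if the model wraps its reply in a Markdown code fence or omits a field. The sketch below shows one defensive way to parse the reply. The helper `parse_extraction` is hypothetical, and the expected keys are an assumption taken from the fields used in the upload step of this README.

```python
import json
import re
from typing import Any, Dict, Iterable

# Assumed field names, based on the upload step of this README.
EXPECTED_KEYS = ("jobTitle", "experienceLevel")

FENCE = "`" * 3  # a Markdown code fence marker
_FENCED = re.compile(rf"{FENCE}(?:json)?\s*(.*?)\s*{FENCE}", re.DOTALL)


def parse_extraction(
    raw: str, expected_keys: Iterable[str] = EXPECTED_KEYS
) -> Dict[str, Any]:
    """Parse a model reply into a dict, tolerating a code fence around the JSON."""
    text = raw.strip()
    fenced = _FENCED.fullmatch(text)
    if fenced:
        text = fenced.group(1)
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

In the snippet above, this would replace the direct call as `extraction_json = parse_extraction(response.content)`.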

4. The application passes the JSON data to the embeddings model (text-embedding-ada-002) and creates an indexed record.
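
The README shows no code for this step, so the following is only a sketch under assumptions. It flattens the extracted fields into text, passes the text to a hypothetical `embed` callable that stands in for the embeddings client, and checks that the result has the 1536 dimensions that text-embedding-ada-002 produces.

```python
from typing import Any, Callable, Dict, List

ADA_002_DIMENSIONS = 1536  # vector size produced by text-embedding-ada-002


def build_search_vector(
    extraction: Dict[str, Any],
    embed: Callable[[str], List[float]],
    dimensions: int = ADA_002_DIMENSIONS,
) -> List[float]:
    """Flatten the extracted fields to text, embed it, and check the vector size.

    embed is a hypothetical stand-in for the call to the embeddings model.
    """
    # Sort the keys so the same extraction always yields the same input text.
    text = "\n".join(f"{key}: {extraction[key]}" for key in sorted(extraction))
    vector = embed(text)
    if len(vector) != dimensions:
        raise ValueError(f"expected {dimensions} dimensions, got {len(vector)}")
    return vector
```

Checking the length before upload catches a mismatch between the model and the vector field defined in the index.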

5. Upload the JSON data and the indexed record to the Azure Search service.

    document = {
        "id": document_id,
        "date": current_date,
        "jobTitle": jobTitle,
        "experienceLevel": experienceLevel,
        "content": full_text,
        "sourceFileName": fileName,
        "searchVector": searchVector
    }

    search_client.upload_documents(documents=[document])
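
The snippet does not show how `document_id` is built. Azure Search restricts document keys to letters, digits, underscores, dashes, and equals signs, so a raw file name with spaces or slashes is not a valid key. One possible scheme, shown here as an assumption and not as the repository's method, derives the key from a hash of the file name.

```python
import hashlib
import re

# Characters accepted in an Azure Search document key.
_VALID_KEY = re.compile(r"[A-Za-z0-9_\-=]+")


def make_document_id(source_file_name: str) -> str:
    """Derive a stable, key-safe id from the file name (hypothetical scheme)."""
    return hashlib.sha256(source_file_name.encode("utf-8")).hexdigest()


def is_valid_key(key: str) -> bool:
    """Check that key contains only characters allowed in a document key."""
    return _VALID_KEY.fullmatch(key) is not None
```

Because the id depends only on the file name, uploading the same file again replaces the earlier record instead of creating a duplicate.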

6. Move the uploaded blob into a processed or archive folder.
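
No code for this step appears in the README. As a rough sketch, a move can be done as a copy followed by a delete. The helper names below are hypothetical, `container_client` is assumed to behave like `azure.storage.blob.ContainerClient`, and depending on how the client authenticates, the source URL may also need a SAS token.

```python
import posixpath


def archived_blob_name(blob_name: str, folder: str = "processed") -> str:
    """Return the destination name for a processed blob in the same container."""
    return posixpath.join(folder, posixpath.basename(blob_name))


def archive_blob(container_client, blob_name: str, folder: str = "processed") -> str:
    """Copy the blob into the archive folder, then delete the original.

    container_client is assumed to behave like azure.storage.blob.ContainerClient.
    """
    source = container_client.get_blob_client(blob_name)
    target_name = archived_blob_name(blob_name, folder)
    target = container_client.get_blob_client(target_name)
    copy = target.start_copy_from_url(source.url)
    # Only remove the original once the service reports a finished copy.
    if copy.get("copy_status") != "success":
        raise RuntimeError(
            f"copy of {blob_name} not finished: {copy.get('copy_status')}"
        )
    source.delete_blob()
    return target_name
```

The status check matters because a copy can still be pending when the call returns, and deleting the source too early would lose the document.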