# Deploy and run inference with Jina Reader LM on an Azure managed application

This notebook demonstrates how to deploy an Azure managed application powered by a Jina Reader LM model ([Reader-LM v2](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.reader-lm-v2?tab=Overview), [Reader-LM 0.5b](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.reader-lm-500m?tab=Overview), or [Reader-LM 1.5b](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.reader-lm-1500m?tab=Overview)) and how to perform inference with it.


## Deploy the managed application

To deploy your Azure managed application, start by consulting the [official deployment guide](https://learn.microsoft.com/en-us/azure/azure-resource-manager/managed-applications/deploy-marketplace-app-quickstart). This document provides comprehensive steps for the deployment process.

In the Basics tab of the deployment setup, you will need to provide several details about your deployment.

You can choose the VM size used for the deployment; for some sizes, you may first need to request a quota increase for your subscription. We recommend the [Standard_NC4as_T4_v3](https://learn.microsoft.com/en-us/azure/virtual-machines/nct4-v3-series) VM, which has one NVIDIA T4 GPU with 16 GB of GPU memory.

<img src="images/deploy_reader_app.png" width="50%" height="50%">

Once the deployment of the managed application is complete, open the resource group created for your deployment (for instance, `mrg-vmgpu-20240920142516` in the screenshot above) and verify the resources that were created.

Within this resource group, look for the `jina-inference-vm` virtual machine. Its details include the DNS name through which you can access your application. In this example, the application is accessible via `testreader.eastus.cloudapp.azure.com`.

The application is not available immediately after deployment, because post-deployment tasks such as driver installation, dependency setup, and a system reboot must finish first. **We recommend waiting at least 15 minutes before using the application.**
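Instead of waiting a fixed amount of time, you could poll the VM until it starts accepting connections. The sketch below is one possible approach, not part of the managed application itself: `port_is_open` and `wait_until_ready` are hypothetical helper names, and the check assumes the application listens on port 80, as the plain `http://` endpoint used later in this notebook suggests. An open port only shows that something is listening; the model may still need a little longer before it can serve requests.

```python
import socket
import time
from typing import Callable, Optional


def port_is_open(host: str, port: int = 80, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (assumed port: 80)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def wait_until_ready(
    host: str,
    max_wait_s: float = 30 * 60,
    interval_s: float = 30.0,
    probe: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Call probe(host) repeatedly until it returns True or max_wait_s elapses."""
    probe = probe or port_is_open
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(host):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the example deployment above, this would be called as `wait_until_ready("testreader.eastus.cloudapp.azure.com")`. The `probe` parameter exists so that a stricter check can be swapped in later without changing the polling loop.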

## Perform inference with the managed application

The Python example below demonstrates how to perform real-time inference using the public IP address of the deployed virtual machine.

First, construct the prompt using `create_prompt`, where you specify the desired return format (`"json"` or `"markdown"`) and, optionally, a custom instruction and a JSON schema. Tailoring these to your scenario can improve the results.

In [None]:
from typing import Optional


def create_prompt(text: str, return_type: str, instruction: Optional[str] = None, schema: Optional[str] = None) -> str:
    """
    Creates a prompt based on the specified return type (either 'json' or 'markdown').

    Parameters:
    - text (str): The HTML content that needs to be converted into the desired format.
    - return_type (str): The desired return format. It must be either "json" or "markdown".
    - instruction (str): The instruction to be included in the prompt. If not provided, a default instruction is used.
    - schema (str): The JSON schema for structuring the output (used only for 'json' return_type). If empty, no schema is included.

    Returns:
    - str: The prompt to send to the model.
    """

    if return_type not in ("json", "markdown"):
        raise ValueError("Invalid return_type. Must be 'json' or 'markdown'.")

    if return_type == "json":
        if not instruction:
            instruction = "Extract the main content from the given HTML and convert it into a structured JSON format."

        if schema:
            # Put the schema on its own lines so the ```json fence is well-formed.
            prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json\n{schema}\n```"
        else:
            prompt = f"{instruction}\n```html\n{text}\n```"

    elif return_type == "markdown":
        if not instruction:
            instruction = "Extract the main content from the given HTML and convert it to Markdown format."
        prompt = f"{instruction}\n```html\n{text}\n```"

    return prompt

html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Phone Book</title>
</head>
<body>
    <h1>Phone Book</h1>
    <div class="contact">
        <h2>John Doe</h2>
        <p>Email: <a href="mailto:john.doe@example.com">john.doe@example.com</a></p>
    </div>
    <div class="contact">
        <h2>Jane Smith</h2>
        <p>Email: <a href="mailto:jane.smith@example.com">jane.smith@example.com</a></p>
    </div>
</body>
</html>
"""

prompt = create_prompt(return_type="markdown", text=html)
print(prompt)
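For JSON output, `create_prompt` takes the schema as a string. The following is a minimal sketch of how such a schema could be written for the phone-book page above. The field names `contacts`, `name`, and `email` are illustrative assumptions and are not prescribed by the model or the application.

```python
import json

# Hypothetical JSON schema for the phone-book example; the field names are illustrative.
schema = {
    "type": "object",
    "properties": {
        "contacts": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"},
                },
                "required": ["name", "email"],
            },
        }
    },
    "required": ["contacts"],
}

# create_prompt expects the schema as a string, so serialize the dict first.
schema_str = json.dumps(schema, indent=2)
print(schema_str)
```

In this notebook, the string would then be passed as `create_prompt(text=html, return_type="json", schema=schema_str)`.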

Next, define a function that sends the prompt to the application and prints the response:

In [None]:
import requests


def invoke_endpoint(prompt: str) -> dict:
    # Replace the placeholder with the public IP address of your VM,
    # for example "http://20.84.48.180/invocations".
    url = "http://<Insert here your public IP address>/invocations"
    json_data = {
        # For Reader-LM 0.5b, use "reader-lm-0.5b"; for Reader-LM 1.5b, use "reader-lm-1.5b".
        "model": "ReaderLM-v2",
        "prompt": prompt,
        "stream": False,  # Whether to stream back partial progress.
    }

    # `json=` serializes the payload and sets the Content-Type header to application/json.
    # The 120-second timeout is an arbitrary choice; adjust it for long inputs.
    response = requests.post(url, json=json_data, timeout=120)
    response.raise_for_status()  # Raise on HTTP errors instead of parsing an error body.
    return response.json()


print(invoke_endpoint(prompt))
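If you switch between the three Marketplace offers, the model name in the request has to change with them. The sketch below collects the names mentioned in this notebook in one place. `MODEL_NAMES` and `build_payload` are hypothetical helpers introduced here for illustration; only the model name strings and the `model`, `prompt`, and `stream` fields come from the request above.

```python
from typing import Any, Dict

# Model names as given in this notebook for each Marketplace offer.
MODEL_NAMES = {
    "v2": "ReaderLM-v2",
    "0.5b": "reader-lm-0.5b",
    "1.5b": "reader-lm-1.5b",
}


def build_payload(variant: str, prompt: str, stream: bool = False) -> Dict[str, Any]:
    """Build the request body for the deployed Reader LM variant."""
    if variant not in MODEL_NAMES:
        raise ValueError(f"Unknown variant {variant!r}. Expected one of {sorted(MODEL_NAMES)}.")
    return {"model": MODEL_NAMES[variant], "prompt": prompt, "stream": stream}
```

The returned dictionary can be used in place of `json_data` in the request above, for example `build_payload("1.5b", prompt)` for a Reader-LM 1.5b deployment.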