# Deploy and run inference with Jina ColBERT V2 on an Azure virtual machine

This notebook demonstrates how to deploy an Azure virtual machine powered by the Jina ColBERT V2 model (available with [128 dimensions](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-colbert-v2-vm?tab=Overview) or [64 dimensions](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-colbert-v2-64-vm?tab=Overview)) and how to perform inference with the application running inside the virtual machine.

## Deploy the managed application

To deploy your Azure managed application, start by consulting the [official deployment guide](https://learn.microsoft.com/en-us/azure/azure-resource-manager/managed-applications/deploy-marketplace-app-quickstart). This document provides comprehensive steps for the deployment process.

In the Basics tab of the deployment setup, you are asked to provide several details about your deployment.

You can customize the VM size, and for certain sizes you might need to request a quota increase before you can use them. We recommend the [Standard_NC4as_T4_v3](https://learn.microsoft.com/en-us/azure/virtual-machines/nct4-v3-series) size, which has one NVIDIA T4 GPU with 16 GB of GPU memory.

<img src="images/deploy_colbert_v2_vm.png" width="50%" height="50%">

Once the deployment of the VM is complete, open the resource group created for your deployment to verify the resources that have been created.

Within the resource group, you'll find the public IP through which you can access your application within the VM.

Please note that the application within the VM is not available immediately after deployment, because the model has to be loaded first. **We recommend waiting at least 2 minutes before using the application.**
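Instead of waiting a fixed amount of time, you can poll the application until it responds. The helper below is a minimal sketch of that idea, not part of the Jina or Azure tooling: the name `wait_until_ready`, the `probe` callable, and the default of 12 attempts spaced 15 seconds apart are all assumptions chosen for illustration.

```python
import time


def wait_until_ready(probe, attempts=12, delay_s=15.0, sleep=time.sleep):
    """Call probe() until it returns a truthy value or the attempts run out.

    `probe` is a hypothetical zero-argument callable supplied by you. It should
    return True once the application answers a request successfully.
    """
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except OSError:
            # Connection errors and timeouts are expected while the model is
            # still loading. requests' exceptions also derive from OSError.
            pass
        if attempt < attempts:
            sleep(delay_s)
    return False
```

With `requests`, a probe could be as simple as `lambda: requests.post(url, json=payload, timeout=30).ok`, where `url` and `payload` are the same values used in the inference examples in this notebook.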

## Perform inference with the application

The Python example below demonstrates how to perform real-time inference (encode) using the public IP address of the deployed virtual machine.

In [None]:
import json

import requests


def invoke_endpoint():
    url = "http://<Insert here your public IP address>/encode"  # For example: http://20.84.48.180/encode
    headers = {"Content-Type": "application/json"}

    # The 'input_type' parameter must be either 'query' or 'document' (default).
    # It specifies whether the input text is treated as a query or a document for encoding purposes.
    json_data = {"data": [{"text": "good morning"}, {"text": "hello world"}], "parameters": {"input_type": "document"}}

    # The timeout keeps the call from hanging indefinitely if the VM is unreachable.
    response = requests.post(url, headers=headers, data=json.dumps(json_data), timeout=60)
    print(response.json())


invoke_endpoint()
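The encode request body can also be assembled from a plain list of strings, which makes it easy to switch between the `document` and `query` input types. The following is a minimal sketch. The helper name `build_encode_payload` is hypothetical, and the payload layout simply mirrors the example above.

```python
def build_encode_payload(texts, input_type="document"):
    """Build an /encode request body shaped like the example in this notebook.

    Hypothetical helper: only the 'query' and 'document' input types named in
    the example are accepted.
    """
    if input_type not in ("query", "document"):
        raise ValueError("input_type must be 'query' or 'document'")
    return {
        "data": [{"text": text} for text in texts],
        "parameters": {"input_type": input_type},
    }


# Body for encoding a search query instead of documents.
query_payload = build_encode_payload(["where is the dog"], input_type="query")
```

The returned dictionary can be passed to `json.dumps` exactly like `json_data` in the cell above.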

The Python example below demonstrates how to perform real-time inference (rerank) using the public IP address of the deployed virtual machine.

In [None]:
import json

import requests


def invoke_endpoint():
    url = "http://<Insert here your public IP address>/rank"  # For example: http://20.84.48.180/rank
    headers = {"Content-Type": "application/json"}

    json_data = {
        "data": {
            "documents": [
                {"text": "the dog is in my house"},
                {"text": "he likes dog"},
                {"text": "hello world"},
            ],
            "query": "where is the dog",
            "top_n": 2,
        }
    }

    # The timeout keeps the call from hanging indefinitely if the VM is unreachable.
    response = requests.post(url, headers=headers, data=json.dumps(json_data), timeout=60)
    print(response.json())


invoke_endpoint()
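As with encoding, the rerank request body can be built from a query string and a list of candidate texts. This is a sketch under the assumption that the payload layout matches the example above. The helper name `build_rank_payload` and the check that `top_n` is at least 1 are illustrative choices, not documented behavior of the application.

```python
def build_rank_payload(query, documents, top_n):
    """Build a /rank request body shaped like the example in this notebook.

    Hypothetical helper: the top_n check is an assumption made for this sketch.
    """
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    return {
        "data": {
            "documents": [{"text": text} for text in documents],
            "query": query,
            "top_n": top_n,
        }
    }


rank_payload = build_rank_payload(
    "where is the dog",
    ["the dog is in my house", "he likes dog", "hello world"],
    top_n=2,
)
```

Here `rank_payload` is equal to the `json_data` dictionary used in the rerank cell above.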