# Deploy Jina Embeddings V4 and run inference on an Azure virtual machine

This notebook demonstrates how to deploy an [Azure virtual machine](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-embeddings-v4?tab=Overview) powered by the [Jina Embeddings V4](https://jina.ai/news/jina-embeddings-v4-universal-embeddings-for-multimodal-multilingual-retrieval) model and how to perform inference with the application running inside that virtual machine.

## Deploy the virtual machine

To deploy the virtual machine, start by consulting the [official deployment guide](https://learn.microsoft.com/en-us/azure/virtual-machines/windows/quick-create-portal). This document provides comprehensive steps for the deployment process.

In the Basics tab of the VM setup, you will need to provide several details about the VM.

You can customize the VM size, and for certain sizes you might need to request a quota increase before the size becomes available. We recommend the [Standard_NC4as_T4_v3](https://learn.microsoft.com/en-us/azure/virtual-machines/nct4-v3-series) VM, which provides one NVIDIA T4 GPU with 16 GB of GPU memory.

For the other tabs, you can leave most settings as default or adjust them to fit your needs.

<img src="images/deploy_embedding_v4.png" width="50%" height="50%">

Once the VM deployment is complete, open the resource group created for your deployment to verify the resources that were created.

Within the resource group, you'll find the public IP through which you can access your application within the VM.

Please note that the application within the VM is not available immediately after deployment, because the model must be loaded first. **We recommend waiting at least 2 minutes before using the application.**
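Rather than waiting a fixed amount of time, you can poll the application until it responds. The following is a minimal sketch, not part of the official deployment: the helper names `encode_probe` and `wait_until_ready` are hypothetical, and it assumes the `/encode` endpoint refuses connections or returns a non-200 status while the model is still loading.

```python
import json
import time
import urllib.request


def encode_probe(base_url, timeout_s=10.0):
    """Return True if a tiny request to /encode succeeds with HTTP 200.

    Assumption: the application is unreachable or answers with an error
    status until the model has finished loading.
    """
    body = json.dumps(
        {"data": [{"text": "ping"}], "parameters": {"task": "text-matching"}}
    ).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/encode",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(probe, max_wait_s=300.0, interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() repeatedly until it returns True or max_wait_s elapses."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example usage (replace the placeholder with your VM's public IP address):
# ready = wait_until_ready(lambda: encode_probe("http://<public IP>"))
# print("ready" if ready else "gave up waiting")
```

Passing the probe as a callable keeps the retry loop independent of HTTP details, so you can swap in a different readiness check if your deployment exposes one.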

## Perform inference with the application

The Python example below demonstrates how to perform real-time inference using the public IP address of the deployed virtual machine.

import json

import requests


def invoke_endpoint():
    url = "http://<Insert here your public IP address>/encode"  # For example: http://20.84.48.180/encode
    headers = {"Content-Type": "application/json"}
    json_data = {
        "data": [
            {
                "text": "A beautiful sunset over the beach"
            },
            {
                "text": "Un beau coucher de soleil sur la plage"
            },
            {
                "text": "海滩上美丽的日落"
            },
            {
                "text": "浜辺に沈む美しい夕日"
            },
            {
                "image": "https://i.ibb.co/nQNGqL0/beach1.jpg"
            },
            {
                "image": "https://i.ibb.co/r5w8hG8/beach2.jpg"
            },
            {
                "image": "iVBORw0KGgoAAAANSUhEUgAAABwAAAA4CAIAAABhUg/jAAAAMklEQVR4nO3MQREAMAgAoLkoFreTiSzhy4MARGe9bX99lEqlUqlUKpVKpVKpVCqVHksHaBwCA2cPf0cAAAAASUVORK5CYII="
            }
        ],
        "parameters": {  # See https://jina.ai/news/jina-embeddings-v4-universal-embeddings-for-multimodal-multilingual-retrieval for details about the parameters.
            "task": "text-matching",  # Downstream task the embeddings will be used for; the model returns embeddings optimized for that task. One of `text-matching`, `retrieval.query`, `retrieval.passage`, `code.query`, or `code.passage`.
            "late_chunking": False,  # Apply late chunking to use the model's long-context capabilities for contextual chunk embeddings. Defaults to False.
            "dimensions": 512,  # Output embedding size (128, 256, 512, 1024, or 2048).
            "truncate": False,  # Whether to truncate input text to the model's maximum length. Defaults to False.
            "return_multivector": False,  # Whether to return multi-vector embeddings. Defaults to False.
        }
    }

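    response = requests.post(url, headers=headers, data=json.dumps(json_data), timeout=60)
    response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error body as JSON
    print(response.json())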


invoke_endpoint()
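
The request above mixes text, image URLs, and a base64-encoded image in the same `data` list. As a hedged sketch of how you might build such a payload from your own inputs, the helpers below encode local image files to base64 and assemble the request body. The function names `image_to_base64` and `build_payload` are hypothetical, and the sketch assumes the application accepts raw base64 strings (without a `data:` URI prefix) in the `image` field, as the example above does.

```python
import base64
from pathlib import Path


def image_to_base64(path):
    """Read a local image file and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def build_payload(texts=(), image_urls=(), image_paths=(),
                  task="text-matching", dimensions=512):
    """Assemble a request body in the same shape as the example above."""
    data = [{"text": text} for text in texts]
    data += [{"image": url} for url in image_urls]
    # Local files are sent inline as base64 rather than by URL.
    data += [{"image": image_to_base64(path)} for path in image_paths]
    return {
        "data": data,
        "parameters": {"task": task, "dimensions": dimensions},
    }


# Example usage (the file name is a placeholder for one of your own images):
# payload = build_payload(
#     texts=["A beautiful sunset over the beach"],
#     image_paths=["my_photo.png"],
# )
```

The resulting dictionary can be serialized with `json.dumps` and posted to the `/encode` endpoint in the same way as `json_data` above.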