# Use LangChain with Meta Llama 3 in Azure AI and Azure ML

You can use Meta-Llama-3 models deployed in Azure AI and Azure ML with `langchain` to build more sophisticated intelligent applications. To do so, use the `langchain_community` package, which provides the Azure Machine Learning integration.

## Prerequisites

Before we start, there are certain steps we need to take to deploy the models:

* Register for a valid Azure account with a subscription
* Make sure you have access to [Azure AI Studio](https://learn.microsoft.com/en-us/azure/ai-studio/what-is-ai-studio?tabs=home)
* Create a project and resource group
* Select Meta-Llama-3 models from Model catalog. This example assumes you are deploying `Meta-Llama-3-70B-Instruct`.

    > Note that some models may not be available in all regions in Azure AI and Azure Machine Learning. In those cases, you can create a workspace or project in a region where the models are available and then consume it through a connection from a different one. To learn more about using connections, see [Consume models with connections](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deployments-connections).

* Deploy with "Pay-as-you-go"

Once the model is deployed successfully, you are assigned an API endpoint and a security key for inference.

For more information on model deployment and inference, consult Azure's official documentation [here](https://aka.ms/meta-llama-3-azure-ai-studio-docs).

To complete this tutorial, you will need to:

* Install `langchain` and `langchain_community`:

    ```bash
    pip install langchain langchain_community
    ```

In [31]:
# !pip install langchain langchain_community

## Example

The following example demonstrates how to create a chain that uses a Meta-Llama-3 chat model deployed in Azure AI and Azure ML. The chain has been configured with a `ConversationBufferMemory`. This example has been adapted from the [LangChain official documentation](https://python.langchain.com/docs/modules/memory/adding_memory).

In [1]:
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
)
from langchain.schema import SystemMessage
# from langchain_community.chat_models.azureml_endpoint import (
#     AzureMLChatOnlineEndpoint,
#     # AzureMLEndpointApiType,
#     # LlamaChatContentFormatter,
#     ContentFormatterBase
# )
from langchain_community.llms.azureml_endpoint import (
    AzureMLOnlineEndpoint,
    ContentFormatterBase
)
import json
from typing import Dict
import time


Let's create an instance of our `AzureMLChatOnlineEndpoint` model. This class allows us to access any model deployed in Azure AI or Azure ML. For completion models, use the class `langchain_community.llms.azureml_endpoint.AzureMLOnlineEndpoint` with `LlamaContentFormatter` as the `content_formatter`. The cells below take the second route, pairing `AzureMLOnlineEndpoint` with a custom content formatter that builds a chat-style `messages` payload.

In [2]:
class LlamaCustomContentFormatter(ContentFormatterBase):
    """Custom content formatter for Meta Llama 3"""

    def format_request_payload(self, prompt: str, model_kwargs: Dict) -> bytes:
        """Formats the request according to the chosen api"""
        prompt = ContentFormatterBase.escape_special_characters(prompt)
        # print("prompt:: ", prompt)
        # print("history:: ", model_kwargs.get("history"))
        request_payload = json.dumps(
            {
                "messages": [
                    # comment system message and history if using langchain
                    {"role": "system", "content": model_kwargs.get("system_message")},         # langchain does the system message magic
                    *(model_kwargs.get("history") or []),                  # langchain does the memory magic; falls back to no history if the key is missing
                    {"role": "user", "content": prompt}                    # langchain adds the history into the prompt with System, AI, Human labels
                ],
                "temperature": model_kwargs.get("temperature"),             # can add default value here
                "max_tokens": model_kwargs.get("max_tokens"),
            }
        )
        # print("request_payload:: ", request_payload)
        return str.encode(request_payload)

    def format_response_payload(self, output: bytes) -> str:
        """Formats response"""
        # print("output:: ", output)
        return json.loads(output)["choices"][0]["message"]["content"]
    
# NOTES:
# conversation memory can be done through adding multiple messages (make sure JSON is correct, so comma at end)  -> memory preserved as we fetch it from DB
# OR through adding the conversation into the prompt (as langchain does it) -> memory lost at every app restart

content_formatter = LlamaCustomContentFormatter()
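
To make the request shape concrete, the following standalone sketch rebuilds the same `messages` list with only the standard library, so the payload can be inspected without calling the endpoint. `build_chat_payload` is a hypothetical helper written for illustration; it is not part of LangChain, and its default values are assumptions.

```python
import json
from typing import Dict, List, Optional


def build_chat_payload(
    prompt: str,
    system_message: str = "",
    history: Optional[List[Dict[str, str]]] = None,
    max_tokens: int = 50,
) -> bytes:
    """Assemble a chat-style request body: system message, prior turns, new user turn."""
    messages = [{"role": "system", "content": system_message}]
    messages.extend(history or [])  # earlier turns, oldest first
    messages.append({"role": "user", "content": prompt})
    return json.dumps({"messages": messages, "max_tokens": max_tokens}).encode("utf-8")


# Hypothetical earlier turns, for example loaded from a database
past_turns = [
    {"role": "user", "content": "Where is the cell?"},
    {"role": "assistant", "content": "In the top right quadrant."},
]
payload = build_chat_payload("Describe its neighbours.", "You describe grids.", past_turns)
print(json.loads(payload)["messages"])
```

Passing stored turns this way corresponds to the first option in the notes above, where the conversation is reloaded from storage rather than folded into the prompt text.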

In [107]:
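# `LLM_calls.azurecredentials` is a local module that holds the endpoint URL and key; it is not part of LangChain
from LLM_calls import azurecredentials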
llm = AzureMLOnlineEndpoint(
    endpoint_url=azurecredentials.api_url_8b_3,
    endpoint_api_type="serverless",
    endpoint_api_key=azurecredentials.key_8b_3,
    content_formatter=content_formatter,
    model_kwargs={"max_tokens": 50, "history": [], "system_message": ""}, #"temperature": 0.8, 
)

> Tip: You can configure environment variables `AZUREML_ENDPOINT_URL`, `AZUREML_ENDPOINT_API_KEY`, and `AZUREML_ENDPOINT_API_TYPE` instead of passing them as arguments.
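
As a rough sketch of the environment-variable approach, the snippet below collects the three variables named in the tip into a dictionary and fails early if the URL or key is missing. `load_endpoint_settings` is a hypothetical helper, and the `"serverless"` fallback is an assumption that mirrors the value used in this notebook.

```python
import os
from typing import Dict, Mapping


def load_endpoint_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the endpoint settings from environment variables, failing early if any are missing."""
    required = ("AZUREML_ENDPOINT_URL", "AZUREML_ENDPOINT_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "endpoint_url": env["AZUREML_ENDPOINT_URL"],
        "endpoint_api_key": env["AZUREML_ENDPOINT_API_KEY"],
        # Assumption: default to the API type used elsewhere in this notebook
        "endpoint_api_type": env.get("AZUREML_ENDPOINT_API_TYPE", "serverless"),
    }


# Demonstration with placeholder values instead of real credentials
settings = load_endpoint_settings(
    {
        "AZUREML_ENDPOINT_URL": "https://example.invalid/score",
        "AZUREML_ENDPOINT_API_KEY": "placeholder-key",
    }
)
print(sorted(settings))
```

Keeping the key in the environment also avoids committing it to the notebook.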

In the prompt below, we have two input keys: one for the actual input (`human_input`), and another for the input from the `Memory` class (`chat_history`).

In [108]:
# prompt = ChatPromptTemplate.from_messages(
#     [
#         SystemMessage(
#             content=system_message
#         ),
#         MessagesPlaceholder(variable_name="chat_history"),
#         HumanMessagePromptTemplate.from_template("{human_input}"),
#     ]
# )

# memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

We create the chain as follows:

In [109]:
# chat_llm_chain = LLMChain(
#     llm=chat_model,
#     prompt=prompt,
#     memory=memory,
#     verbose=True,
# )
# chat_llm_chain.predict(human_input=user_message)

We can see how it works:

In [110]:
def callAzureLLM(user_message, system_message, max_tokens=100, past_messages=None):
    start_time = time.time()
    # llm.model_kwargs["history"] = past_messages or []
    llm.model_kwargs["system_message"] = system_message
    llm.model_kwargs["max_tokens"] = max_tokens
    response = llm.invoke(input=user_message, stop=["<|eot_id|>"]) # [HumanMessage(content=user_message)], config=metadata
    print("response:: ", response)
    
    end_time = time.time()
    overall_latency = end_time - start_time
    print(f"Azure LLM Latency: {overall_latency} seconds")
    return response, overall_latency

def filter_crop_llm_response(response):
    # only accept the first line, cut at the first terminator found
    terminator = ["#", "\n", "<|eot_id|>"]
    response = response.split("\n")[0]
    for term in terminator:
        if term in response:
            response = response.split(term)[0]
            break
    # drop a trailing quote; endswith is safe on an empty string
    if response.endswith("'"):
        response = response[:-1]
    return response
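
The cropping step can be tried without calling the endpoint. The sketch below is a variant rather than a copy of the function above: instead of checking terminators in list order, it cuts at whichever terminator appears earliest in the text. `crop_at_terminator` is a hypothetical name used only for this illustration.

```python
from typing import Sequence


def crop_at_terminator(
    text: str, terminators: Sequence[str] = ("#", "\n", "<|eot_id|>")
) -> str:
    """Return the text before the earliest terminator, minus one trailing single quote."""
    cut = len(text)
    for term in terminators:
        index = text.find(term)
        if index != -1:
            cut = min(cut, index)  # keep the earliest cut point
    cropped = text[:cut].strip()
    return cropped[:-1] if cropped.endswith("'") else cropped


print(crop_at_terminator("Upper right corner.<|eot_id|># extra"))
print(crop_at_terminator("top right'\nmore text"))
```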

In [111]:
height, width = 10, 10
position_description = "top right quadrant"
position = (10, 10)
positioning_area = "The location is in the top right quadrant of a 10x10 grid."
solutionCellStates = """
(     1 2 3 4 5 6 7 8 9 10)
[ (1) 0 0 1 0 1 0 1 0 0 0]
[ (2) 0 0 1 0 1 0 1 0 0 0]
[ (3) 0 0 0 0 0 0 0 0 0 0]
[ (4) 0 1 1 1 1 1 1 1 1 1]
[ (5) 0 1 1 0 1 1 1 1 0 1]
[ (6) 0 1 1 0 1 1 1 1 1 0]
[ (7) 0 1 1 1 1 1 1 1 1 0]
[ (8) 0 1 1 1 1 1 1 1 0 0]
[ (9) 1 0 1 1 1 1 1 0 0 1]
[(10) 0 1 1 1 1 1 1 1 1 0]"""

system_message = f"""Your task is to rephrase the description of the location of a cell in a {height}x{width} grid. 
Focus on providing a similar description using different words or phrases and formulate a sentence. Use concise and clear language. 

Description : '{position_description}'"""

sys_observe_around = f"""You are observing a 2D grid of size {height}x{width}. The grid is represented by binary data where 0 represents an empty cell and 1 represents a filled cell. The grid is 1-indexed, meaning the first row and column are indexed as 1. First row is at the bottom and first column is at the left of the grid.

Observe the cells in the surrounding area. Are the majority of cells filled or empty? Are there any patterns in the area?

Do not return the grid. Be concise and clear in your description.

Surrounding Area: '{positioning_area}'
Grid:
{solutionCellStates}"""
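
The grid string above is written by hand. If the cell states are available as nested lists, a small helper can produce the same labeled layout. This is a sketch under that assumption; `render_grid` is a hypothetical name, and the spacing is chosen to match the `solutionCellStates` layout for grids of up to 10 rows.

```python
from typing import List


def render_grid(cells: List[List[int]]) -> str:
    """Render rows of 0/1 values with a column header and numbered rows."""
    width = len(cells[0])
    header = "(" + " " * 5 + " ".join(str(c) for c in range(1, width + 1)) + ")"
    lines = [header]
    for row_number, row in enumerate(cells, start=1):
        label = f"({row_number})".rjust(4)  # " (1)" through "(10)"
        lines.append(f"[{label} " + " ".join(str(v) for v in row) + "]")
    return "\n".join(lines)


print(render_grid([[0, 1], [1, 0]]))
```

The result can be interpolated into the prompt in place of the hand-written string.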

In [112]:
# user_message = "Tell me where the location is in the grid."
# positioning_response_prev, positioning_latency = callAzureLLM(user_message, system_message=system_message, max_tokens=20, past_messages=[])
# positioning_response = filter_crop_llm_response(positioning_response_prev)
# print("positioning LLM:: ", positioning_response)

In [113]:
user_message = "Tell me about the cells in the vicinity."
positioning_response_prev, positioning_latency = callAzureLLM(user_message, system_message=sys_observe_around, max_tokens=70, past_messages=[])
positioning_response = filter_crop_llm_response(positioning_response_prev)
print("positioning LLM:: ", positioning_response)

HTTPError: HTTP Error 500: Internal Server Error

## Additional resources

Here are some additional references:

* [Plan and manage costs (marketplace)](https://learn.microsoft.com/azure/ai-studio/how-to/costs-plan-manage#monitor-costs-for-models-offered-through-the-azure-marketplace)
