## After training, use this notebook to test the model

### 1. Download the model and test it directly. Run this notebook on an A100 GPU machine (NC24adsA100 compute instance)

!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

In [None]:
# Import the required libraries
from azure.identity import InteractiveBrowserCredential
from azure.ai.ml import MLClient

# Authenticate interactively through the browser
credential = InteractiveBrowserCredential()
subscription_id = ""  # your subscription ID
resource_group = ""  # your resource group
workspace = ""  # your workspace name
workspace_ml_client = MLClient(credential, subscription_id, resource_group, workspace)
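
Rather than typing the subscription, resource group, and workspace into the cell, the three values could be read from environment variables. The following is a minimal sketch; the variable names and the `load_workspace_settings` helper are hypothetical and not part of the Azure SDK.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = (
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP",
    "AZUREML_WORKSPACE_NAME",
)


def load_workspace_settings(env=None):
    """Return (subscription_id, resource_group, workspace) or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing workspace settings: " + ", ".join(missing))
    return tuple(env[name] for name in REQUIRED_VARS)
```

The returned tuple can be unpacked into `subscription_id`, `resource_group`, and `workspace` before the `MLClient` is constructed.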

#### For regular model 

In [None]:
model_name = "llama2_13b_fine_tuned"
model_path = "./"
workspace_ml_client.models.download(model_name, version="2", download_path=model_path)
#after this step, remove the redundant parent folder name "llama2_13b_fine_tuned" so that the downloaded folder only has one 
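
The manual clean-up mentioned in the comment above could be scripted. The sketch below assumes the download produced a folder nested inside a folder of the same name (`<root>/<model_name>/<model_name>/...`); that layout is an assumption, so check your download first. `flatten_nested_folder` is a hypothetical helper, not an Azure ML or MLflow function.

```python
import shutil
from pathlib import Path


def flatten_nested_folder(download_root, model_name):
    """Move the contents of <root>/<model_name>/<model_name> up one level.

    Returns True if a nested folder was found and flattened, False otherwise.
    """
    outer = Path(download_root) / model_name
    inner = outer / model_name
    if not inner.is_dir():
        return False  # nothing to flatten; the layout already matches

    # Rename the inner folder first so a child with the same name cannot collide.
    staging = outer / (model_name + "__staging")
    inner.rename(staging)
    for child in staging.iterdir():
        shutil.move(str(child), str(outer / child.name))
    staging.rmdir()
    return True
```

Calling `flatten_nested_folder(model_path, model_name)` after the download would leave the model files directly under `./llama2_13b_fine_tuned`, under the layout assumption stated above.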

In [None]:
import mlflow

example = {
    "context": "You are querying the sales database, what is the SQL query for the following question?",
    "input": "What is the total revenue for each territory?",
}
PROMPT_DICT = "\n{context}\n\n### Question:\n{input}\n\n### Response:{output}"
PROMPT_DICT_CHAT = "<s>[INST]\n{context}\n\n### Question:\n{input}\n[/INST]"
# Load the downloaded MLflow model from the local folder named after the model
model = mlflow.pyfunc.load_model(model_name)


#### For regular model

In [None]:
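# The regular template also contains an {output} placeholder, so pass an empty
# string for it; otherwise str.format raises KeyError.
prompt = PROMPT_DICT.format(input=example["input"], context=example["context"], output="")
prompt = {"role": "user", "content": prompt}
model.predict([prompt])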


#### For Chat Model

In [None]:
prompt = PROMPT_DICT_CHAT.format(input=example["input"], context=example["context"])
prompt = {"role": "user", "content": prompt}
model.predict([prompt])
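
The two prompt cells differ only in the template they fill, so they could share one helper. This is a minimal sketch; `build_prompt` is a hypothetical name, and the templates are copied from the cell that defines `PROMPT_DICT` and `PROMPT_DICT_CHAT`.

```python
# Templates copied from the notebook cell that defines them.
PROMPT_DICT = "\n{context}\n\n### Question:\n{input}\n\n### Response:{output}"
PROMPT_DICT_CHAT = "<s>[INST]\n{context}\n\n### Question:\n{input}\n[/INST]"


def build_prompt(example, chat=False):
    """Fill the matching template; `output` is left empty at inference time."""
    template = PROMPT_DICT_CHAT if chat else PROMPT_DICT
    # str.format ignores unused keyword arguments, so passing output="" is
    # harmless for the chat template and required for the regular one.
    return template.format(
        context=example["context"], input=example["input"], output=""
    )


example = {"context": "You are querying the sales database.", "input": "List all territories."}
print(build_prompt(example, chat=True))
```

Wrapping the result as `{"role": "user", "content": build_prompt(example)}` gives the same input shape the cells above pass to `model.predict`.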


### 2. Deploy to a managed online endpoint and test

1. Create the online endpoint: ```az ml online-endpoint create -f deployment/endpoint.yml```
2. Create the deployment: ```az ml online-deployment create -f deployment/deployment.yml```

In [11]:
import urllib.request
import json
import os
import ssl

def allowSelfSignedHttps(allowed):
    # bypass the server certificate verification on client side
    if allowed and not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None):
        ssl._create_default_https_context = ssl._create_unverified_context

allowSelfSignedHttps(True) # this line is needed if you use self-signed certificate in your scoring service.

# Request data goes here
# The example below assumes JSON formatting which may be updated
# depending on the format your endpoint expects.
# More information can be found here:
# https://docs.microsoft.com/azure/machine-learning/how-to-deploy-advanced-entry-script
question = "What is the average unit price of products by each supplier?"

# content = "Hi there"

data = {"data": {"text": [question], "max_gen_len": 100, "temperature": 0.9}}

body = str.encode(json.dumps(data))

url = 'https://llma2-fine-tuning.westus2.inference.ml.azure.com/score'
# Replace this with the primary/secondary key or AMLToken for the endpoint
api_key = ''
if not api_key:
    raise Exception("A key should be provided to invoke the endpoint")

# The azureml-model-deployment header will force the request to go to a specific deployment.
# Remove this header to have the request observe the endpoint traffic rules
headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key), 'azureml-model-deployment': 'blue' }

req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)

    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))

    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(error.read().decode("utf8", 'ignore'))

b'{"output": "#\\n### Response:SELECT Suppliers.SupplierID, AVG(Products.UnitPrice) AS AverageUnitPrice FROM Products INNER JOIN Suppliers ON Products.SupplierID = Suppliers.SupplierID GROUP BY Suppliers.SupplierID\\n###\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n"}'
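
The raw response above wraps the SQL in a `### Response:` marker and pads it with newlines. A small post-processing step could pull out just the query. The sketch below assumes the response keeps this shape (a JSON object with an `output` string); `extract_sql` is a hypothetical helper.

```python
import json


def extract_sql(raw_response):
    """Return the text between '### Response:' and the next '###' marker, stripped."""
    text = json.loads(raw_response)["output"]
    # Keep what follows the response marker, if it is present.
    _, marker, after = text.partition("### Response:")
    answer = after if marker else text
    # Drop the trailing '###' marker and the padding newlines.
    return answer.split("###", 1)[0].strip()


# Shortened copy of the response shown above (trailing newlines trimmed).
sample = (
    b'{"output": "#\\n### Response:SELECT Suppliers.SupplierID, '
    b'AVG(Products.UnitPrice) AS AverageUnitPrice FROM Products '
    b'INNER JOIN Suppliers ON Products.SupplierID = Suppliers.SupplierID '
    b'GROUP BY Suppliers.SupplierID\\n###\\n\\n\\n"}'
)
print(extract_sql(sample))
```

If the marker is missing, the helper falls back to the whole `output` string, so it should degrade gracefully when the model answers in a different shape.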


### 3. Deploy to AKS and test

#Create the AKS cluster
az aks create -g ml -n aksgpu2 --enable-managed-identity --node-count 1 --enable-addons monitoring --generate-ssh-keys --node-vm-size standard_nc24ads_a100_v4

#Install k8s-extension
az k8s-extension create --name ml --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True enableInference=True inferenceRouterServiceType=LoadBalancer allowInsecureConnections=True InferenceRouterHA=False --cluster-type managedClusters --cluster-name aksgpu2 --resource-group ml --scope cluster

#Get the cluster credentials, then install the NVIDIA extension
az aks get-credentials --resource-group ml --name aksgpu2

kubectl apply -f nvidia_device.yaml

#create namespace
kubectl create namespace gpu-resources

#create instance type
kubectl apply -f instance_type.yaml

###az aks nodepool add --resource-group ml --cluster-name aks001 --name gpunp --node-count 1 --node-vm-size standard_nc24ads_a100_v4 --node-taints sku=gpu:NoSchedule --aks-custom-headers UseGPUDedicatedVHD=true --enable-cluster-autoscaler --min-count 1 --max-count 3

#Attach the cluster to the Azure ML workspace

az ml compute attach --resource-group ml --workspace-name ws01ent --type Kubernetes --name aksgpu2 --resource-id "/subscriptions/840b5c5c-3f4a-459a-94fc-6bad2a969f9d/resourcegroups/ml/providers/Microsoft.ContainerService/managedClusters/aksgpu2" --identity-type SystemAssigned --no-wait --namespace gpu-resources

#Create the online endpoint
az ml online-endpoint create -f k8s_endpoint.yml
#Create the deployment
az ml online-deployment create -f k8s_deployment.yml



#Delete the deployments and the endpoint if needed
az ml online-deployment delete --name blue --endpoint-name llm-k8s-gpu --yes --resource-group ml --workspace-name ws01ent
az ml online-deployment delete --name blue --endpoint-name llm-k8s-ep --yes --resource-group ml --workspace-name ws01ent
az ml online-deployment delete --name green --endpoint-name llm-k8s-ep --yes --resource-group ml --workspace-name ws01ent

az ml online-endpoint delete --name ws01ent-bajsw --resource-group ml --workspace-name ws01ent --yes
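
The `--resource-id` passed to `az ml compute attach` above follows a fixed pattern, so it could be assembled from its parts instead of being pasted by hand. This is a sketch that mirrors the ID format used in the attach command; `aks_resource_id` is a hypothetical helper and the subscription value in the example is a placeholder.

```python
def aks_resource_id(subscription_id, resource_group, cluster_name):
    """Build the resource ID of an AKS managed cluster in the format used above."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/providers/Microsoft.ContainerService/managedClusters/{cluster_name}"
    )


# Placeholder subscription ID; substitute your own.
print(aks_resource_id("<subscription-id>", "ml", "aksgpu2"))
```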

In [None]:
import urllib.request
import json
import os
import ssl

def allowSelfSignedHttps(allowed):
    # bypass the server certificate verification on client side
    if allowed and not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None):
        ssl._create_default_https_context = ssl._create_unverified_context

allowSelfSignedHttps(True) # this line is needed if you use self-signed certificate in your scoring service.

# Request data goes here
# The example below assumes JSON formatting which may be updated
# depending on the format your endpoint expects.
# More information can be found here:
# https://docs.microsoft.com/azure/machine-learning/how-to-deploy-advanced-entry-script
# Alternative summarization prompt; it is not sent below, because the request uses `content`
prompt = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize the following input to less than 30 words .
### Input:
In general, perplexity is a measurement of how well a probability model predicts a sample. In the context of Natural Language Processing, perplexity is one way to evaluate language models.
A language model is a probability distribution over sentences: it’s both able to generate plausible human-written sentences (if it’s a good language model) and to evaluate the goodness of already written sentences. Presented with a well-written document, a good language model should be able to give it a higher probability than a badly written document, i.e. it should not be “perplexed” when presented with a well-written document.
Thus, the perplexity metric in NLP is a way to capture the degree of ‘uncertainty’ a model has in predicting (i.e. assigning probabilities to) text."""

instruction = "You are querying the sales database, what is the SQL query for the following input question?"
# Named input_text to avoid shadowing the built-in input()
input_text = "What is the average unit price of products by each supplier?"
content = f"<s>[INST]\n{instruction}\n\n### Input:\n{input_text}\n[/INST]"

# content = "Hi there"

data = {"data": {"text": content, "max_length": 100}}

body = str.encode(json.dumps(data))

url = 'http://20.72.223.233/api/v1/endpoint/llm-k8s-gpu/score'
api_key= ''
if not api_key:
    raise Exception("A key should be provided to invoke the endpoint")

# The azureml-model-deployment header will force the request to go to a specific deployment.
# Remove this header to have the request observe the endpoint traffic rules
headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key), 'azureml-model-deployment': 'blue' }

req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)

    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))

    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(error.read().decode("utf8", 'ignore'))
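
Both test cells build the same kind of request and differ only in URL, key, and payload. One way to avoid the duplication is a small helper that assembles the `urllib.request.Request` without sending it. The sketch below is one possible shape; `build_scoring_request` is a hypothetical name and the URL in the usage lines is a placeholder.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, payload, deployment=None):
    """Build a POST request for a scoring endpoint without sending it."""
    if not api_key:
        raise ValueError("A key should be provided to invoke the endpoint")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    if deployment:
        # Pin the request to one deployment; omit to follow the endpoint traffic rules.
        headers["azureml-model-deployment"] = deployment
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers)


# Placeholder URL and key; nothing is sent until urlopen is called on the request.
req = build_scoring_request(
    "http://example.invalid/score",
    "my-key",
    {"data": {"text": "Hi there", "max_length": 100}},
    deployment="blue",
)
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` inside the same `try`/`except urllib.error.HTTPError` block used above would send it.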