<a href="https://colab.research.google.com/github/jmoyou2011/LLM-Nvidia-Comparision/blob/main/Nvidia_Multi_Model_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import requests as req
import os, sys
from google.colab import drive
import time
import yaml
import json

### Create a mount point for data

In [2]:
drive.mount('/content/drive')
nb_path = '/content/notebooks'
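# Create the symlink only once so re-running this cell does not raise FileExistsError.
if not os.path.lexists(nb_path):
  os.symlink('/content/drive/MyDrive/ColabNotebooks', nb_path)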
sys.path.insert(0, nb_path)

Mounted at /content/drive


### Initialization
These are the global variables used by the LLM functions that follow.

Note: I renamed the Drive folder "Colab Notebooks" to "ColabNotebooks" so that
its path contains no spaces, which simplifies using !cat to display results.


In [3]:
with open('/content/drive/MyDrive/ColabNotebooks/model_names.yml') as f:
  config = yaml.safe_load(f)

my_api = config['KEYS'][0]
model_dict = config['model_names']
model_keys = list(model_dict.keys())
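
The cell above assumes that the YAML file has a `KEYS` list and a `model_names` mapping. A minimal sketch of a check on that shape follows, so a malformed file fails early with a clear message. The helper name `load_settings` and the placeholder key and ID are hypothetical; the real file's contents are not shown in this notebook, and the mapping values are only assumed to be the strings appended to the invoke URL.

```python
def load_settings(config: dict) -> tuple[str, dict[str, str]]:
    """Validate the parsed YAML and return (api_key, model_dict)."""
    keys = config.get("KEYS")
    if not isinstance(keys, list) or not keys:
        raise ValueError("config['KEYS'] must be a non-empty list")
    models = config.get("model_names")
    if not isinstance(models, dict) or not models:
        raise ValueError("config['model_names'] must be a non-empty mapping")
    # Normalize everything to strings, since the values are concatenated into URLs.
    return str(keys[0]), {str(k): str(v) for k, v in models.items()}


# Hypothetical stand-in for the output of yaml.safe_load(...).
# The key and the ID below are placeholders, not real values.
example_config = {
    "KEYS": ["nvapi-PLACEHOLDER"],
    "model_names": {"nv-llama2-70b-rlhf": "00000000-0000-0000-0000-000000000000"},
}
api_key, models = load_settings(example_config)
print(sorted(models))
```

In the notebook this could be called as `my_api, model_dict = load_settings(config)` right after `yaml.safe_load`.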

### Nvidia LLM

All the LLMs are taken from the Nvidia AI foundation models catalog page. Using the APIs provided on the documentation page, we will compare the models
against each other and display the results through the gradio module.

In [4]:
def llm_invoke(model_name:str, prompt:str):
  """
    Call a single model from the Nvidia AI foundation models dictionary and
    return its response. Only the text-to-text models listed at the link
    below are supported:

    https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models

    Models that require context are not included in the dictionary.

    Inputs
    model_name -> name of the model
    prompt -> prompt to be passed to the model

    Outputs
    msg -> text generated from the model given the prompt
    resp_time -> elapsed time of the final HTTP request, in seconds
    out_tokens -> number of tokens returned from the LLM

  """
  model_name = model_name.lower().replace(" ", "")

  invoke_url = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/"
  fetch_url_format = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/"

  headers = {
    "Authorization": "Bearer " + str(my_api),
    "Accept": "application/json",
  }

  payload = {
  "messages": [
    {
      "content": str(prompt),
      "role": "user"
    }
  ],
  "temperature": 0.2,
  "top_p": 0.7,
  "max_tokens": 1024,
  "seed": 42,
  "stream": False
  }

  if model_name not in model_dict.keys():
    print("Model name not found in dictionary, using default model")
    print("Default model is NV-Llama2-70B-RLHF")
    model_name = "nv-llama2-70b-rlhf"

  #Create session.
  session = req.Session()

  response = session.post(invoke_url + model_dict[model_name], headers=headers, json=payload)

  while response.status_code == 202:
    request_id = response.headers.get("NVCF-REQID")
    fetch_url = fetch_url_format + request_id
    response = session.get(fetch_url, headers=headers)

  response.raise_for_status()
  response_body = response.json()
  msg = response_body.get('choices')[0].get('message').get('content')
  resp_time = round(response.elapsed.total_seconds(), 3)
  out_tokens = response_body.get('usage').get('completion_tokens')
  return msg, resp_time, out_tokens



def llm_invoke_all(prompt:str):
  """
    Call every model in the Nvidia AI foundation models dictionary with the
    same prompt and collect the responses. Only the text-to-text models
    listed at the link below are supported:

    https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models

    Models that require context are not included in the dictionary.

    Inputs
    prompt -> prompt to be passed to each model

    Outputs
    lst_resp -> list of dictionaries, one per model, sorted by ascending
                response time. Each dictionary has the keys:
      msg -> text generated from the model given the prompt
      resp_time -> elapsed time of the final HTTP request, in seconds
      output_tokens -> number of tokens produced by the LLM
      model_name -> name of the model that was called

  """
  lst_resp = []

  invoke_url = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/"
  fetch_url_format = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/"

  headers = {
    "Authorization": "Bearer " + str(my_api),
    "Accept": "application/json",
  }

  payload = {
  "messages": [
    {
      "content": str(prompt),
      "role": "user"
    }
  ],
  "temperature": 0.2,
  "top_p": 0.7,
  "max_tokens": 1024,
  "seed": 42,
  "stream": False
  }

  #Create session.
  session = req.Session()

  for key, _ in model_dict.items():
    # print(key)
    tmp_dict = {}
    response = session.post(invoke_url + model_dict[key], headers=headers, json=payload)

    while response.status_code == 202:
      request_id = response.headers.get("NVCF-REQID")
      fetch_url = fetch_url_format + request_id
      response = session.get(fetch_url, headers=headers)

    response.raise_for_status()
    response_body = response.json()
    msg = response_body.get('choices')[0].get('message').get('content')
    resp_time = round(response.elapsed.total_seconds(), 3)
    out_tokens = response_body.get('usage').get('completion_tokens')
    tmp_dict = {"msg": msg, "resp_time": resp_time, "model_name": key, "output_tokens": out_tokens}
    lst_resp.append(tmp_dict)
    time.sleep(2)

  lst_resp = sorted(lst_resp, key=lambda x: x['resp_time'], reverse = False)
  return lst_resp
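
Both functions above poll the status endpoint in an unbounded `while` loop, and they report `response.elapsed`, which in `requests` covers only the single request that produced that response. A minimal sketch of a bounded polling helper that also measures wall-clock time across the whole wait is given below. The helper `poll_until_done`, its parameters, and the `FakeResponse` class are hypothetical names introduced for illustration; only the 202 status check and the `NVCF-REQID` header come from the code above.

```python
import time
from typing import Callable


def poll_until_done(
    first_response,
    fetch: Callable[[str], object],
    timeout_s: float = 120.0,
    delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
):
    """Poll while the status code is 202; return (response, total_seconds).

    `fetch` takes a request ID and returns the next response object.
    Raises TimeoutError when `timeout_s` elapses before a non-202 reply.
    """
    start = time.perf_counter()
    response = first_response
    while response.status_code == 202:
        if time.perf_counter() - start > timeout_s:
            raise TimeoutError(f"still pending after {timeout_s} s")
        request_id = response.headers.get("NVCF-REQID")
        if request_id is None:
            raise RuntimeError("202 response carried no NVCF-REQID header")
        sleep(delay_s)  # pause between polls instead of spinning
        response = fetch(request_id)
    return response, round(time.perf_counter() - start, 3)


class FakeResponse:
    """Stand-in for requests.Response so the sketch runs without network access."""

    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


pending = FakeResponse(202, {"NVCF-REQID": "abc"})
replies = iter([FakeResponse(202, {"NVCF-REQID": "abc"}), FakeResponse(200)])
final, elapsed = poll_until_done(pending, lambda _id: next(replies), sleep=lambda s: None)
print(final.status_code)
```

With a real session, `fetch` could be `lambda rid: session.get(fetch_url_format + rid, headers=headers)`. The timer here starts when polling begins, so starting it before the initial POST would be needed to include that request as well.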


In [5]:
# Calling the first function.
msg, resp_time, out_tokens = llm_invoke("nv-llama2-70b-rlhf", "Describe the stonehedge in the british isle? Your persona is a druid.")

print(msg)
print("Response time:", resp_time, " seconds")
print("Tokens Produced:", out_tokens)

As a Druid, I can tell you that Stonehenge is a sacred site that holds great significance for our people. It is a prehistoric monument located in the British Isle of Wight, and is believed to have been built by our ancestors over 5,000 years ago. The stones that make up Stonehenge are massive, standing up to 24 feet tall and weighing up to 50 tons each. It is a testament to the ingenuity and skill of our ancestors that they were able to transport and erect these stones without the use of modern tools and technology.

Stonehenge is a place of great mystical power, and is believed to have been used for a variety of purposes, including rituals, healing, and astronomical observations. The alignment of the stones with the sun and the moon is particularly noteworthy, as it suggests that our ancestors had a deep understanding of the movements of the celestial bodies.

For Druids, Stonehenge is a place of worship and a connection to the ancient past. We believe that the stones have a powerful 

In [None]:
# Calling the second function.
llm_responses = llm_invoke_all("Describe the catacombs in Paris?")

with open("/content/drive/MyDrive/ColabNotebooks/response_llm.json", "w", encoding="utf-8") as f:
  json.dump(llm_responses, f, indent=6, ensure_ascii = False)
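
Once the responses are saved, response time alone can be misleading because models return different numbers of tokens. A minimal sketch that derives a rough tokens-per-second figure from records in the saved shape is shown here. The function name `tokens_per_second` and the sample records are hypothetical, and the figures are made up rather than measured.

```python
import json


def tokens_per_second(records):
    """Return (model_name, tokens/sec) pairs, highest throughput first."""
    rates = []
    for rec in records:
        seconds = rec.get("resp_time") or 0
        tokens = rec.get("output_tokens") or 0
        if seconds > 0:  # skip records without a usable time
            rates.append((rec["model_name"], round(tokens / seconds, 1)))
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


# Made-up records in the same shape as response_llm.json; not real measurements.
sample = json.loads("""[
  {"msg": "...", "resp_time": 2.0, "model_name": "model-a", "output_tokens": 100},
  {"msg": "...", "resp_time": 4.0, "model_name": "model-b", "output_tokens": 400}
]""")
print(tokens_per_second(sample))  # [('model-b', 100.0), ('model-a', 50.0)]
```

Because `resp_time` reflects only the final HTTP request, the result is an approximation rather than a true generation rate.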

# Gradio Functionality

In [None]:
!python -m pip install gradio
import gradio as gr


### Gradio Interfaces.
This notebook provides three gradio interfaces that a user can interact with:

 * Choose a model and enter a prompt to return a response from the LLM.
 * Rank the eleven models by response time to find the most responsive ones.
 * Upload a file to view the full results of the function llm_invoke_all.

In [None]:
# Output Formatting
def llm_response_gradio(model_name: str, prompt: str):
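    # llm_invoke returns three values: the text, the response time, and the token count.
    msg, resp_time, out_tokens = llm_invoke(model_name, prompt)
    output = f"{msg}\n\nResponse time: {resp_time} seconds\nTokens produced: {out_tokens}"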
    return output

def llm_response_gradio_all(prompt: str):
    content_lst = list()
    # llm_invoke_all returns results sorted by response time; keep the fastest three.
    response_lst = llm_invoke_all(prompt)[:3]
    for doc in response_lst:
        content = f"Model:{doc.get('model_name')}\n\nResponse Time:{doc.get('resp_time')}\n\nTokens Produced:{doc.get('output_tokens')}\n\n"
        content_lst.append(content)
    return " ".join(content_lst)

def read_file(file):
  with open(file, "r", encoding="utf-8") as f:
    return f.read()
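
`read_file` returns the raw JSON text, which is hard to scan when the file holds many long responses. A minimal sketch of a reader that renders one readable block per model is given below, assuming the file has the list-of-dictionaries shape written by the earlier `json.dump` call. The name `read_results` and the sample record are hypothetical.

```python
import json
import os
import tempfile


def read_results(path: str) -> str:
    """Load the saved JSON list and render one readable block per model."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    blocks = []
    for rec in records:
        blocks.append(
            f"Model: {rec.get('model_name')}\n"
            f"Response time: {rec.get('resp_time')} seconds\n"
            f"Tokens produced: {rec.get('output_tokens')}\n\n"
            f"{rec.get('msg')}"
        )
    # Separate models with a rule so long responses stay distinguishable.
    return "\n\n---\n\n".join(blocks)


# Round-trip a made-up record through a temporary file to show the output shape.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    json.dump([{"msg": "hello", "resp_time": 1.5, "model_name": "model-a", "output_tokens": 3}], tmp)
rendered = read_results(tmp.name)
os.remove(tmp.name)
print(rendered)
```

This function could be passed as `fn` to the file interface in place of `read_file`, provided the uploaded file is valid JSON in that shape.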


In [None]:
iface_singular = gr.Interface(fn = llm_response_gradio,
                     inputs =[gr.Radio(list(model_dict.keys()), label="Model Name"),
                              gr.Textbox(label="Enter your Prompt")],
                     outputs = gr.Textbox(label="LLM Output"),
                     title = "Nvidia LLM Invoker",
                     description = "Choose a model and enter a prompt to invoke an LLM from Nvidia AI Foundation Models.")
iface_singular.launch()

In [None]:
iface_singular.close()

In [None]:
iface_llm = gr.Interface(fn = llm_response_gradio_all,
                         inputs = gr.Textbox(label="Enter your Prompt"),
                         outputs = gr.Textbox(label="Best Performing LLMs with respect to time."),
                         title = "Nvidia Multi-Model Invoker",
                         description = "Enter a prompt to invoke multiple LLMs from Nvidia AI Foundation Models and return the fastest 3."
)

iface_llm.launch()

In [None]:
iface_llm.close()

In [None]:
iface_read = gr.Interface(fn=read_file, inputs="file", outputs="text")
iface_read.launch()

In [None]:
iface_read.close()