# Nvidia Foundational Large Language Models

The Nvidia **foundational Large Language Models** (LLMs) are hosted on the catalog.ngc.nvidia.com webpage.
Nvidia exposes these models through streaming and non-streaming APIs so that anyone can
interact with them. Nvidia provides developers with an API key to use these LLMs
without the cost and rate limiting associated with the OpenAI API. This allows new developers
with limited knowledge of the underlying architecture to experiment with LLMs and obtain responses
to the prompts they supply.

In this notebook, we will look at 11 models that are labeled as text-to-text models in the catalog. The
APIs for all the models are the same except for the last path item, which is the UUID of the model. These APIs
will be called using the Python requests module. Other HTTP clients, such as aiohttp (for async calls)
and httpx, could also be used.

#### NOTE
Calling the API with aiohttp and asyncio was also tried. A few issues were discovered:

    1. Some models return a 200 status code instead of the expected 202. This required a rewrite to account
       for both cases.

    2. Some models intermittently return an empty response, and the problem is not tied to a particular
       model. If you call the API asynchronously for each model with a single prompt, one to a few of the
       models will return an empty response. The number of models that do so varies with each call.
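
Both issues can be handled in one place. The sketch below is a minimal illustration, not the notebook's implementation: `resolve_response` and `invoke_with_retry` are hypothetical helper names, and the `send` and `fetch` callables stand in for whichever HTTP client is used (requests, aiohttp, or httpx). It accepts an immediate 200 as well as a 202 that needs polling, and it retries when the body comes back without any choices. The `NVCF-REQID` header name is the one used by the polling code in this notebook, and a fake response class is used so the example runs without network access.

```python
import time
from typing import Any, Callable


def resolve_response(response: Any, fetch: Callable[[str], Any],
                     poll_delay: float = 0.0) -> Any:
    """Return the final response, polling only while the status is 202.

    A 200 on the first call is returned as-is, so both behaviours are handled.
    `fetch` is a hypothetical callable that takes a request id and returns
    the next response object.
    """
    while response.status_code == 202:
        request_id = response.headers.get("NVCF-REQID")
        if request_id is None:
            raise RuntimeError("202 response did not include a request id")
        time.sleep(poll_delay)
        response = fetch(request_id)
    return response


def invoke_with_retry(send: Callable[[], Any], fetch: Callable[[str], Any],
                      max_attempts: int = 3) -> dict:
    """Call `send` until the JSON body contains at least one choice."""
    for _ in range(max_attempts):
        body = resolve_response(send(), fetch).json()
        if body.get("choices"):
            return body
    raise RuntimeError(f"empty response after {max_attempts} attempts")


class FakeResponse:
    """Stand-in for an HTTP response, used only to exercise the sketch."""

    def __init__(self, status_code, body=None, request_id=None):
        self.status_code = status_code
        self.headers = {"NVCF-REQID": request_id} if request_id else {}
        self._body = body or {}

    def json(self):
        return self._body


# First attempt: a 202 that resolves to an empty body. Second attempt: a direct 200.
attempts = iter([
    FakeResponse(202, request_id="abc"),
    FakeResponse(200, {"choices": [{"message": {"content": "hello"}}]}),
])
result = invoke_with_retry(
    send=lambda: next(attempts),
    fetch=lambda request_id: FakeResponse(200, {"choices": []}),
)
print(result["choices"][0]["message"]["content"])
```

In a real client a non-zero `poll_delay` would avoid hammering the status endpoint while a request is pending.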


In [8]:
import requests as req
import os, sys
import time
import json
import pandas as pd 
import gradio as gr

### Initialization
In the cell below we will initialize our global variables that will be used by the LLM functions.

#### NOTE:
At the time this notebook was made, Mamba-chat had been released to the Nvidia catalog page (02/12/2024).
It was not included in the test set because its payload structure differs from that of the other models,
and including it would require rework.


In [9]:
model_dict = {
    "mixtral8x7binstruct": "8f4118ba-60a8-4e6b-8574-e38a4067a4a3",
    "mistral7binstruct": "35ec3354-2681-4d0e-a8dd-80325dcf7c63",
    "nv-llama2-70b-rlhf": "7b3e3361-4266-41c8-b312-f5e33c81fc92",
    "nv-llama2-70b-steerlm": "d6fe6881-973a-4279-a0f8-e1d486c9618d",
    "codellama13b": "f6a96af4-8bf9-4294-96d6-d71aa787612e",
    "codellama70b": "2ae529dc-f728-4a46-9b8d-2697213666d8",
    "codellama34b": "df2bee43-fb69-42b9-9ee5-f4eabbeaf3a8",
    "llama213b": "e0bb7fb9-5333-4a27-8534-c6288f921d3f",
    "llama270b": "0e349b44-440a-44e1-93e9-abe8dcb27158",
    "yi-34b": "347fa3f3-d675-432c-b844-669ef8ee53df",
    "nemotron-3-8b-chat-steerlm": "1423ff2f-d1c7-4061-82a7-9e8c67afd43a",
}

invoke_url = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/"
fetch_url_format = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/"

model_urls = {}
for k, v in model_dict.items():
    model_urls[k] = invoke_url + v

# This method is not the safest way to initialize an API key. This is meant for demonstration.
# Insert your NGC catalog API key, which can be found on the model's catalog page.
os.environ["NGCKEY"] = "<your-ngc-api-key>"
my_api = os.environ.get('NGCKEY')
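
The cell above sets the key inside the notebook for demonstration. A safer pattern is to keep the key out of the notebook and read it from the environment at run time. The following is a minimal sketch under that assumption; `load_api_key` and `build_headers` are hypothetical helpers, and `NGCKEY_DEMO` is a dummy variable used only so the example runs on its own.

```python
import os


def load_api_key(env_var: str = "NGCKEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key


def build_headers(api_key: str) -> dict:
    """Build the request headers used by the LLM functions."""
    return {
        "Authorization": "Bearer " + api_key,
        "Accept": "application/json",
    }


# Example with a dummy value; a real key would be exported in the shell instead.
os.environ["NGCKEY_DEMO"] = "dummy-key"
headers = build_headers(load_api_key("NGCKEY_DEMO"))
print(headers["Authorization"])
```

Failing early with a clear message avoids sending requests with a `Bearer None` header when the variable is unset.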

### LLM Functions
Two main LLM functions interact with the Nvidia LLMs. All the models
interface with Nvidia Triton. Some of the model pages indicate which GPU or GPUs are
used to serve the model, and the hardware differs between models. Comparisons may therefore
not be "apples-to-apples", because not all models run on the same hardware.

The first function, "llm_invoke", takes a model name and a prompt and returns the model
response. It works with a helper function that formats the response for display in the Gradio
interface.

The second function, "llm_invoke_all", is similar to the first, but it calls every model and saves
the outputs in a list of dictionaries. A similar Gradio formatting function supplements it.

In [10]:
def llm_invoke(model_name:str, prompt:str):
  """
    Calls a single model from the dictionary of Nvidia AI foundational
    models with the given prompt. This is strictly focused on
    the text-to-text models found on the link below:

    https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models

    Models that require context are not included in the dictionary.
    If the model name is not found, NV-Llama2-70B-RLHF is used as the default.

    Inputs
    model_name (str): name of the model
    prompt (str): prompt to be passed to the model

    Outputs
    msg (str): text generated from the model given the prompt
    resp_time (float): time taken to generate the response, in seconds
    out_tokens (int): number of tokens returned from the LLM

  """
  model_name = model_name.lower().replace(" ", "")

  headers = {
    "Authorization": "Bearer " + str(my_api),
    "Accept": "application/json",
  }

  payload = {
  "messages": [
    {
      "content": str(prompt),
      "role": "user"
    }
  ],
  "temperature": 0.2,
  "top_p": 0.7,
  "max_tokens": 1024,
  "seed": 42,
  "stream": False
  }

  if model_name not in model_dict.keys():
    print("Model name not found in dictionary, using default model")
    print("Default model is NV-Llama2-70B-RLHF")
    model_name = "nv-llama2-70b-rlhf"

  #Create session.
  session = req.Session()

  response = session.post(model_urls[model_name], headers=headers, json=payload)

  while response.status_code == 202:
    request_id = response.headers.get("NVCF-REQID")
    fetch_url = fetch_url_format + request_id
    response = session.get(fetch_url, headers=headers)

  response.raise_for_status()
  response_body = response.json()
  msg = response_body.get('choices')[0].get('message').get('content')
  resp_time = round(response.elapsed.total_seconds(), 3)
  out_tokens = response_body.get('usage').get('completion_tokens')
  return msg, resp_time, out_tokens

def llm_response_gradio(model_name: str, prompt: str):
    msg, resp_time, out_tokens = llm_invoke(model_name, prompt)
    output = f"{msg}\n\nResponse time: {resp_time} seconds\n\nOutput_tokens:{out_tokens}"
    return output
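
`llm_invoke` indexes directly into the response body, which raises an error if `choices` or `usage` is missing, for example when a model returns an empty response. The sketch below shows one defensive alternative; `parse_completion` is a hypothetical helper, and the body layout is assumed to match the fields accessed in `llm_invoke`.

```python
from typing import Optional, Tuple


def parse_completion(response_body: dict) -> Tuple[str, Optional[int]]:
    """Extract the message text and completion token count from a response body.

    Returns an empty string and None when the expected fields are absent,
    instead of raising TypeError or IndexError.
    """
    choices = response_body.get("choices") or []
    message = (choices[0].get("message") or {}) if choices else {}
    msg = message.get("content") or ""
    usage = response_body.get("usage") or {}
    return msg, usage.get("completion_tokens")


full_body = {
    "choices": [{"message": {"role": "assistant", "content": "Bonjour"}}],
    "usage": {"completion_tokens": 3},
}
print(parse_completion(full_body))   # ('Bonjour', 3)
print(parse_completion({}))          # ('', None)
```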
    

#### Testing the LLM invocation function

In [11]:
test_prompt_singular = "I am visiting paris, what should I see? Limit to five best spots."
result_singular = llm_response_gradio("codellama13b", test_prompt_singular)
print("LLM Result:\n{}".format(result_singular))

LLM Result:
Paris is a city with a rich history, art, architecture, and culture. Here are five of the best spots to visit in Paris:
1. The Eiffel Tower: This iconic tower is one of the most recognizable landmarks in the world. Visitors can take the elevator to the top of the tower for panoramic views of the city.
2. The Louvre Museum: This world-renowned museum is home to some of the most famous and iconic works of art in the world, including the Mona Lisa. Visitors can take a guided tour of the museum to learn more about the history and significance of the works of art on display.
3. The Arc de Triomphe: This iconic monument is one of the most recognizable landmarks in the world. Visitors can take a guided tour of the monument to learn more about its history and significance.
4. The Notre Dame Cathedral: This world-renowned cathedral is one of the most recognizable landmarks in the world. Visitors can take a guided tour of the cathedral to learn more about its history and significance

In [12]:
def llm_invoke_all(prompt:str):
  """
    Calls every model in the dictionary of Nvidia AI foundational models
    with the same prompt. This is strictly focused on
    the text-to-text models found on the link below:

    https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models

    Models that require context are not included in the dictionary.

    Inputs
    prompt (str): prompt to be passed to each model

    Outputs
    lst_resp (list[dict]): one dictionary per model with the keys
        model_name (str): name of the model that was called
        resp_time (float): time taken to generate the response, in seconds
        output_tokens (int): number of tokens produced by the LLM
        prompt (str): prompt that was sent
        resp_msg (str): text generated from the model given the prompt

  """
  lst_resp = []
  headers = {
    "Authorization": "Bearer " + str(my_api),
    "Accept": "application/json",
  }

  payload = {
  "messages": [
    {
      "content": str(prompt),
      "role": "user"
    }
  ],
  "temperature": 0.2,
  "top_p": 0.7,
  "max_tokens": 1024,
  "seed": 42,
  "stream": False
  }

  #Create session.
  session = req.Session()

  for key, url in model_urls.items():
    tmp_dict = {}
    response = session.post(url, headers = headers, json = payload)

    while response.status_code == 202:
      request_id = response.headers.get("NVCF-REQID")
      fetch_url = fetch_url_format + request_id
      response = session.get(fetch_url, headers=headers)

    response.raise_for_status()
    response_body = response.json()
    msg = response_body.get('choices')[0].get('message').get('content')
    resp_time = round(response.elapsed.total_seconds(), 3)
    out_tokens = response_body.get('usage').get('completion_tokens')
    tmp_dict = {"model_name": key,"resp_time": resp_time, "output_tokens": out_tokens,
                "prompt": prompt, "resp_msg": msg}
    lst_resp.append(tmp_dict)
    time.sleep(0.2)

  return lst_resp

def llm_response_gradio_all(prompt: str):
    content_lst = list()
    response_lst = llm_invoke_all(prompt)
    response_lst = sorted(response_lst, key=lambda x: x['resp_time'], reverse = False)
    for doc in response_lst:
        content = f"Model:{doc.get('model_name')}\n\nResponse Time:{doc.get('resp_time')}\n\nTokens Produced:{doc.get('output_tokens')}\n\n"
        content_lst.append(content)
    return " ".join(content_lst)
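
Note that `response.elapsed` in requests covers a single request, so when a call goes through the 202 polling loop the recorded `resp_time` reflects only the final status request. If end-to-end latency is wanted instead, one option is to time the whole call with `time.perf_counter`. The sketch below assumes a hypothetical `timed` wrapper and uses a dummy function in place of the API call.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run `fn` and return its result with the wall-clock duration in seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    duration = round(time.perf_counter() - start, 3)
    return result, duration


def slow_echo(text: str) -> str:
    """Dummy stand-in for an API call that includes polling time."""
    time.sleep(0.05)
    return text


result, seconds = timed(slow_echo, "ping")
print(result, seconds)
```

Wrapping the POST and the polling loop together in one timed call would give a latency figure that includes queueing on the server side.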


### Creating benchmark datasets

The section below creates the LLM results data frame and the LLM benchmark results.

In [14]:
def create_llm_dataset(file_path:str) -> pd.DataFrame: 
    """
    Creates a data frame that is the product of running multiple
    prompts through all the models and returns the response
    message, model response time, and output tokens.

    Inputs:
    file_path (str): path to a CSV file with a single column of prompts
                     named 'questions'

    Outputs:
    df_results (pd.DataFrame): Output data frame with the responses of
                               all LLM models to every prompt.

    """
    df_prompts = pd.read_csv(file_path, sep=",")
    lst_prompts = df_prompts['questions'].values

    lst_responses = []
    counter = 0
    print("Starting LLM API calls")
    for prompts in lst_prompts:
        full_llm_res = llm_invoke_all(prompts)
        lst_responses.append(full_llm_res)
        counter += 1
        print("Completed {} rounds".format(counter))

    lst_pd = []
    for sublist in lst_responses:
        for item in sublist:
            lst_pd.append(item)

    df_results = pd.DataFrame(lst_pd)
    return df_results
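
The nested loop that flattens `lst_responses` into `lst_pd` can also be expressed with `itertools.chain.from_iterable` from the standard library. The sketch below uses made-up records in the same shape as the dictionaries returned by `llm_invoke_all`; the model names and values are illustrative only.

```python
from itertools import chain

# Two rounds of results, one list per prompt (illustrative values only).
lst_responses = [
    [{"model_name": "model-a", "resp_time": 0.5},
     {"model_name": "model-b", "resp_time": 0.7}],
    [{"model_name": "model-a", "resp_time": 0.4}],
]

# Flatten the list of lists into a single list of records.
lst_pd = list(chain.from_iterable(lst_responses))
print(len(lst_pd))  # 3
```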
    

In [15]:
def create_stats_llm(llm_df: pd.DataFrame) -> pd.DataFrame:
    """
    Creates a data frame that is a merge of two groupby 
    data frames of response time and output token.

    Inputs: 
    llm_df (pd.DataFrame): Data frame of LLM responses

    Outputs:
    stats_df (pd.DataFrame): Data frame of LLM aggregation stats.
    
    """
    time_df = llm_df.groupby("model_name")['resp_time'].aggregate(resp_time_min='min',
                                                                  resp_time_max='max',resp_time_mean='mean')
    time_df = time_df.reset_index()
    time_df = time_df.sort_values('resp_time_mean', ascending=True).reset_index(0, drop=True)

    tkn_df = llm_df.groupby("model_name")['output_tokens'].aggregate(out_tkn_min="min", out_tkn_max="max", out_tkn_mean="mean")
    tkn_df = tkn_df.reset_index()
    tkn_df = tkn_df.sort_values("out_tkn_mean", ascending=True).reset_index(0, drop=True)

    stats_df = time_df.merge(tkn_df, how="inner", on="model_name")

    return stats_df
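
The same statistics can be computed in a single `groupby` using pandas named aggregation, which avoids the merge. This is a sketch of an alternative, not the notebook's method; `create_stats_llm_single_pass` is a hypothetical name and the sample frame is made up.

```python
import pandas as pd


def create_stats_llm_single_pass(llm_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate response time and output tokens per model in one groupby."""
    stats_df = (
        llm_df.groupby("model_name")
        .agg(
            resp_time_min=("resp_time", "min"),
            resp_time_max=("resp_time", "max"),
            resp_time_mean=("resp_time", "mean"),
            out_tkn_min=("output_tokens", "min"),
            out_tkn_max=("output_tokens", "max"),
            out_tkn_mean=("output_tokens", "mean"),
        )
        .reset_index()
        .sort_values("resp_time_mean", ascending=True)
        .reset_index(drop=True)
    )
    return stats_df


# Made-up sample data in the same shape as the benchmark results.
sample = pd.DataFrame({
    "model_name": ["model-a", "model-a", "model-b", "model-b"],
    "resp_time": [1.0, 3.0, 0.5, 1.5],
    "output_tokens": [100, 300, 50, 150],
})
stats = create_stats_llm_single_pass(sample)
print(stats)
```

The result is sorted by mean response time, matching the row order that the merge in `create_stats_llm` produces.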
    

In [16]:
def df_save(llm_df: pd.DataFrame, stats_llm: pd.DataFrame) -> None:
    """
    Saves two data frames to separate json files.   
    """
    folder_path = "./output_llm/"
    os.makedirs(folder_path, exist_ok = True)

    llm_df.to_json(folder_path + "results_llm.json", orient = "records", compression = "infer")

    stats_llm.to_json(folder_path + "stats_llm.json", orient = "records", compression = "infer")
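
With `orient="records"`, each file holds a JSON array with one object per data frame row, so the saved results can be read back with the standard `json` module. The sketch below writes a small made-up file to a temporary directory and reads it back; `fastest_record` is a hypothetical helper.

```python
import json
import os
import tempfile


def fastest_record(path: str) -> dict:
    """Load a records-oriented JSON file and return the row with the lowest resp_time."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    return min(records, key=lambda row: row["resp_time"])


# Made-up records in the same layout that to_json(orient="records") produces.
records = [
    {"model_name": "model-a", "resp_time": 1.2, "output_tokens": 200},
    {"model_name": "model-b", "resp_time": 0.4, "output_tokens": 150},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results_llm.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    best = fastest_record(path)

print(best["model_name"])  # model-b
```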


In [17]:
csv_path = "./data/questions.csv"
start_tm = time.time()
df_llm = create_llm_dataset(csv_path)
llm_stats = create_stats_llm(df_llm)
end_tm = time.time()
# Convert the elapsed time from seconds to minutes.
elapsed_time = (end_tm - start_tm) / 60
print("Elapsed time to run benchmark: {} minutes".format(round(elapsed_time,3)))

Starting LLM API calls
Completed 1 rounds
Completed 2 rounds
Completed 3 rounds
Completed 4 rounds
Completed 5 rounds
Completed 6 rounds
Completed 7 rounds
Completed 8 rounds
Completed 9 rounds
Completed 10 rounds



In [20]:
df_save(df_llm, llm_stats)

In [18]:
print("Benchmark statistics for the LLM Models.")
llm_stats

Benchmark statistics for the LLM Models.


| | model_name | resp_time_min | resp_time_max | resp_time_mean | out_tkn_min | out_tkn_max | out_tkn_mean |
|---|---|---|---|---|---|---|---|
| 0 | codellama13b | 0.433 | 4.704 | 1.4712 | 151 | 1023 | 855.0 |
| 1 | nv-llama2-70b-rlhf | 0.143 | 4.709 | 2.0137 | 257 | 355 | 319.3 |
| 2 | codellama70b | 0.288 | 4.702 | 2.231 | 569 | 1024 | 775.5 |
| 3 | mistral7binstruct | 0.107 | 5.123 | 2.4344 | 136 | 599 | 369.2 |
| 4 | nemotron-3-8b-chat-steerlm | 0.862 | 4.301 | 2.4845 | 279 | 608 | 414.4 |
| 5 | nv-llama2-70b-steerlm | 0.116 | 4.395 | 2.5416 | 166 | 584 | 437.6 |
| 6 | codellama34b | 0.296 | 4.818 | 2.6437 | 407 | 769 | 522.8 |
| 7 | llama213b | 0.199 | 4.971 | 2.7156 | 159 | 541 | 382.8 |
| 8 | yi-34b | 0.214 | 4.864 | 2.7735 | 293 | 627 | 483.8 |
| 9 | llama270b | 0.307 | 5.119 | 2.9019 | 456 | 822 | 673.7 |


In [19]:
print("Snapshot of the LLM Results from multiple prompts.")
df_llm.head(15)

Snapshot of the LLM Results from multiple prompts.


| | model_name | resp_time | output_tokens | prompt | resp_msg |
|---|---|---|---|---|---|
| 0 | mixtral8x7binstruct | 3.455 | 341 | What are the top tourist attractions in Paris? | I'm here to provide you with accurate, helpful... |
| 1 | mistral7binstruct | 0.107 | 449 | What are the top tourist attractions in Paris? | I'm glad you're asking about tourist attractio... |
| 2 | nv-llama2-70b-rlhf | 0.143 | 325 | What are the top tourist attractions in Paris? | Some of the top tourist attractions in Paris i... |
| 3 | nv-llama2-70b-steerlm | 1.727 | 584 | What are the top tourist attractions in Paris? | Paris is a city that is renowned for its rich ... |
| 4 | codellama13b | 0.816 | 1023 | What are the top tourist attractions in Paris? | Paris is one of the most popular tourist desti... |
| 5 | codellama70b | 0.61 | 828 | What are the top tourist attractions in Paris? | 🇫🇷 Paris is a city of endless attractions, but... |
| 6 | codellama34b | 0.296 | 512 | What are the top tourist attractions in Paris? | Paris, the capital of France, is known as the ... |
| 7 | llama213b | 4.3 | 459 | What are the top tourist attractions in Paris? | Hello! As a helpful and respectful assistant, ... |
| 8 | llama270b | 5.119 | 722 | What are the top tourist attractions in Paris? | Paris, the capital of France, is known for its... |
| 9 | yi-34b | 0.814 | 336 | What are the top tourist attractions in Paris? | Paris is renowned for its rich history, stunni... |


### Gradio Interface

This section presents the LLM functions through Gradio interfaces.
In the first interface, users select a model and enter a prompt to return
the result.

The second interface invokes all the LLMs and lists the models from fastest to slowest,
as a quick visual guide to which model responds fastest for the particular prompt.

The third Gradio interface is a simple file upload for viewing the results that were saved
in JSON format.

In [10]:
iface_singular = gr.Interface(fn = llm_response_gradio,
                     inputs =[gr.Radio(list(model_dict.keys()), label="Model Name"),
                              gr.Textbox(label="Enter your Prompt")],
                     outputs = gr.Textbox(label="LLM Output"),
                     title = "Nvidia LLM Invoker",
                     description = "Choose a model and enter a prompt to invoke a LLM model from Nvidia AI Foundational Models.")

iface_singular.launch()

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




In [14]:
iface_singular.close()

Closing server running on port: 7860


In [15]:
iface_llm = gr.Interface(fn = llm_response_gradio_all,
                         inputs = gr.Textbox(label="Enter your Prompt"),
                         outputs = gr.Textbox(label="Best Performing LLMs with respect to time."),
                         title = "Nvidia Multi-Model Invoker",
                         description = "Enter a prompt to invoke multiple LLMs from Nvidia AI Foundational Models and return the response times."
)

iface_llm.launch()

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




In [16]:
iface_llm.close()

Closing server running on port: 7860


In [None]:
def read_file(file):
  with open(file, "r", encoding="utf-8") as f:
    return f.read()

interface = gr.Interface(fn=read_file, inputs="file", outputs="text")
interface.launch()

In [None]:
interface.close()
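
The file viewer above returns the raw file contents. If a more readable view is wanted, the JSON can be summarised before it is shown in the text box. The sketch below is one possible approach; `summarize_results` is a hypothetical function and assumes the records layout written by `df_save`. The sample records are made up.

```python
import json


def summarize_results(json_text: str) -> str:
    """Turn records-oriented JSON text into one summary line per record."""
    records = json.loads(json_text)
    lines = []
    # Sort by response time so the fastest record comes first.
    for row in sorted(records, key=lambda r: r.get("resp_time", float("inf"))):
        lines.append(
            f"{row.get('model_name')}: {row.get('resp_time')} s, "
            f"{row.get('output_tokens')} tokens"
        )
    return "\n".join(lines)


sample_text = json.dumps([
    {"model_name": "model-a", "resp_time": 1.2, "output_tokens": 200},
    {"model_name": "model-b", "resp_time": 0.4, "output_tokens": 150},
])
print(summarize_results(sample_text))
```

A function like this could be applied to the text returned by `read_file` before it is passed to the output component.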