# Nvidia Foundational Large Language Models Async
The Nvidia foundational large language models (LLMs) are hosted at catalog.ngc.nvidia.com. Nvidia exposes these models through streaming and non-streaming APIs, so anyone can interact with many models through one interface. Nvidia provides developers with an API key to call these LLMs without the cost and rate limiting of the OpenAI API. This lets new developers experiment with LLMs and obtain responses to their prompts with limited knowledge of the underlying architecture.

In this notebook, we look at 11 models that are labeled as text-to-text models in the catalog. The APIs for all of the models are identical except for the last path segment, which is the model's UUID. The APIs can be called with the Python requests module; other HTTP clients, such as aiohttp (async) and httpx, can be used as well. This notebook calls them with aiohttp and asyncio.

NOTE
Experimenting with aiohttp and asyncio for the API calls uncovered a few issues:

1. Some models return a 200 status code instead of the expected 202. The code had to be rewritten to
   account for this.

2. Empty responses occur inconsistently and are not tied to a particular model. If you call the API
   asynchronously for each model with a single prompt, one or more of the models will return an empty
   response. The number of models that return an empty response varies from call to call.

In [9]:
import requests as req
import os, sys
import asyncio
import nest_asyncio
import aiohttp
import time
import yaml
import json
import pandas as pd
from asyncio_throttle import Throttler

### Initialization
In the cell below we will initialize our global variables that will be used by the LLM functions.

#### NOTE:
When this notebook was made, Mamba-chat had been released on the Nvidia catalog page on 02/12/2024.
It was not included in the tests because its payload is structured differently from the other models',
so including it would require a rework.

In [3]:
nest_asyncio.apply()
model_dict = {
    "mixtral8x7binstruct": "8f4118ba-60a8-4e6b-8574-e38a4067a4a3",
    "mistral7binstruct": "35ec3354-2681-4d0e-a8dd-80325dcf7c63",
    "nv-llama2-70b-rlhf": "7b3e3361-4266-41c8-b312-f5e33c81fc92",
    "nv-llama2-70b-steerlm": "d6fe6881-973a-4279-a0f8-e1d486c9618d",
    "codellama13b": "f6a96af4-8bf9-4294-96d6-d71aa787612e",
    "codellama70b": "2ae529dc-f728-4a46-9b8d-2697213666d8",
    "codellama34b": "df2bee43-fb69-42b9-9ee5-f4eabbeaf3a8",
    "llama213b": "e0bb7fb9-5333-4a27-8534-c6288f921d3f",
    "llama270b": "0e349b44-440a-44e1-93e9-abe8dcb27158",
    "yi-34b": "347fa3f3-d675-432c-b844-669ef8ee53df",
    "nemotron-3-8b-chat-steerlm": "1423ff2f-d1c7-4061-82a7-9e8c67afd43a",
}

invoke_url = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/"
fetch_url_format = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/"

model_urls = {}
for k, v in model_dict.items():
    model_urls[k] = invoke_url + v

lst_urls = list(model_urls.values())
model_key = list(model_urls.keys())


In [4]:
# Read the NGC API key from the environment instead of hardcoding it in the notebook.
# Set it before starting Jupyter, e.g. `export NGCKEY=<your key>`.
my_api = os.environ.get("NGCKEY")
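
A small guard can fail early when the key is missing, rather than sending a request with an invalid `Authorization` header. This is a minimal sketch; `require_api_key` is a hypothetical helper, not part of the notebook.

```python
import os


def require_api_key(var_name: str = "NGCKEY") -> str:
    """Return the API key stored in the environment variable `var_name`.

    Raises RuntimeError if the variable is unset or blank.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key
```

With this helper, the assignment above could read `my_api = require_api_key()`.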

### LLM Functions
Two main LLM functions interact with the Nvidia LLMs. All of the models
are served through Nvidia Triton. Some model pages list one GPU or set of GPUs for
serving, while others list different ones. Comparisons may not be "apples-to-apples"
because not all models run on the same hardware.

The first function, "llm_invoke_model", takes a model name and a prompt and returns the model's
response. It works with a helper function that formats the response for display in the Gradio
interface.

The second function is similar to the first, but it saves the output in a list of dictionaries, with a
similar Gradio formatting function supplementing it.
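
Both functions send the same headers and the same payload, so the request could be built in one place. The sketch below is one possible arrangement; `build_request` is a hypothetical helper, and its defaults simply mirror the values used in the cells that follow.

```python
from typing import Dict, Tuple


def build_request(
    prompt: str, api_key: str, max_tokens: int = 1024
) -> Tuple[Dict[str, str], dict]:
    """Return the (headers, payload) pair for a non-streaming chat request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Accept": "application/json",
    }
    payload = {
        "messages": [{"content": str(prompt), "role": "user"}],
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": max_tokens,
        "seed": 42,
        "stream": False,
    }
    return headers, payload
```

Each function could then call `headers, payload = build_request(prompt, my_api)` instead of repeating the two dictionaries.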

In [5]:
async def llm_invoke_model(model_name: str, prompt: str) -> tuple:
    """
    This function will call from any model within the dictionary from Nvidia
    AI foundational models and run the model. This is strictly focused on
    the text-to-text models found on the link below:

    https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models

    Models that require context are not included in the dictionary.

    Inputs
    model_name (str): Name of the model
    prompt (str): Prompt to be passed to the model

    Outputs
    msg (str): Text generated from the model given the prompt
    resp_time (float): Time taken to generate the response, in seconds
    out_tokens (int): Number of tokens returned from the LLM.

    """
    model_name = model_name.lower().replace(" ", "")

    if model_name not in model_dict.keys():
        print("Model name not found in dictionary, using default model")
        print("Default model is NV-Llama2-70B-RLHF")
        model_name = "nv-llama2-70b-rlhf"

    url = model_urls[model_name]
    

    headers = {
        "Authorization": "Bearer " + str(my_api),
        "Accept": "application/json",
    }

    payload = {
        "messages": [
            {
                "content": str(prompt),
                "role": "user",
            }
        ],
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 1024,
        "seed": 42,
        "stream": False,
    }

    start_time = time.time()
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.post(url, json=payload) as response:
            print(response.status)
            status = response.status
            request_id = response.headers.get("NVCF-REQID")
            # Some models answer 200 directly with the finished result.
            body_resp = await response.json(content_type=None) if status == 200 else None

        # A 202 means the request is still pending: poll the status endpoint
        # with the request id until a different status comes back.
        while status == 202:
            async with session.get(fetch_url_format + request_id) as resp:
                print(resp.status)
                status = resp.status
                if status == 200:
                    body_resp = await resp.json(content_type=None)

    # Fail loudly instead of returning undefined values on an error status.
    if status != 200:
        raise RuntimeError(f"{model_name} returned HTTP status {status}")

    elapsed = round(time.time() - start_time, 3)
    msg = body_resp["choices"][0]["message"]["content"]
    out_tkns = body_resp["usage"]["completion_tokens"]

    return msg, elapsed, out_tkns


async def return_llm_resp(model_name: str, prompt: str) -> None:
    rsp_msg, rsp_time, rsp_tkns = await llm_invoke_model(model_name, prompt)
    output = f"{rsp_msg}\n\nResponse Time:{rsp_time} seconds\n\nOutput tokens:{rsp_tkns}"
    print(output)
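
The response time and output-token count returned above can be combined into a rough throughput figure when comparing models. This is a minimal sketch; `tokens_per_second` is a hypothetical helper, and because the measured time includes network and queueing latency, the figure is only indicative.

```python
def tokens_per_second(output_tokens: int, response_time: float) -> float:
    """Return output tokens per second of wall-clock response time."""
    # Guard against a zero or negative elapsed time (e.g. a failed call).
    if response_time <= 0:
        return 0.0
    return round(output_tokens / response_time, 2)
```

For example, `tokens_per_second(rsp_tkns, rsp_time)` could be appended to the formatted output string.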
    

#### Testing Async call to the API


In [6]:
await return_llm_resp("nv-llama2-70b-rlhf", "I am visiting paris, what should I see? Limit to five best spots.")

202
200
Sure! Here are the top 5 places to visit in Paris:

1. Eiffel Tower - One of the most iconic landmarks in the world, the Eiffel Tower offers stunning views of the city.
2. Louvre Museum - Home to some of the world's most famous artworks, including the Mona Lisa, the Louvre is a must-visit for art lovers.
3. Notre Dame Cathedral - A stunning example of French Gothic architecture, Notre Dame is a beautiful and historic site to visit.
4. Arc de Triomphe - This iconic monument, located at the end of the Champs-Elysees, offers great views of the city and is a symbol of French history.
5. Montmartre - This charming neighborhood is home to the Sacre-Coeur Basilica, as well as many artists, cafes, and boutiques. It offers great views of the city and is a great place to wander and soak up the Parisian atmosphere.
These are just a few of the many amazing places to visit in Paris. Other notable sites include the Palace of Versailles, the Luxembourg Gardens, and the Pompidou Center. No mat

In [10]:
df = pd.read_csv('./Data/questions.csv', sep=",")
lst_questions = df['questions'].values
df.head()

Unnamed: 0,questions
0,What are the top tourist attractions in Paris?
1,How does the Eiffel Tower contribute to the ci...
2,What is the history behind the Louvre Museum?
3,How does the Notre-Dame Cathedral represent Go...
4,What are some popular neighborhoods to explore...


In [91]:
async def my_routine_all(session: aiohttp.ClientSession, model: str, prompt: str) -> dict:
    
    headers = {
        "Authorization": "Bearer " + str(my_api),
        "Accept": "application/json",
    }

    payload = {
        "messages": [
            {
                "content": str(prompt),
                "role": "user",
            }
        ],
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 1024,
        "seed": 42,
        "stream": False,
    }
    
    start_time = time.time()
    async with session.post(model_urls[model], json=payload, headers=headers) as response:
        status = response.status
        request_id = response.headers.get("NVCF-REQID")
        # Some models answer 200 directly with the finished result.
        body_resp = await response.json(content_type=None) if status == 200 else None

    # A 202 means the request is still pending: poll the status endpoint
    # with the request id until a different status comes back.
    while status == 202:
        async with session.get(fetch_url_format + request_id, headers=headers) as resp:
            status = resp.status
            if status == 200:
                body_resp = await resp.json(content_type=None)

    elapsed = round(time.time() - start_time, 3)
    try:
        msg = body_resp["choices"][0]["message"]["content"]
        out_tkns = body_resp["usage"]["completion_tokens"]
    except (TypeError, KeyError, IndexError):
        # Error status or unexpected body: record the failure instead of
        # returning an empty dictionary or raising inside gather().
        msg, out_tkns = f"Error Retrieving (HTTP {status})", 0

    return {
        "Response": msg,
        "Response_time": elapsed,
        "Output_tokens": out_tkns,
        "Model": model,
    }

                            
                    
async def fetch_all(session, prompt):
    tasks = [asyncio.create_task(my_routine_all(session, model, prompt)) for model in model_key]
    results = await asyncio.gather(*tasks)
    return results
    
        
async def test_main(prompt):
    async with aiohttp.ClientSession() as session:
        documents = await fetch_all(session, prompt)
        for document in documents:
            print(f"{document}\n\n")

In [92]:
if __name__ == "__main__":
    test_prompt = "Describe the catacombs in Paris."
    asyncio.run(test_main(test_prompt))

{'Response': "The catacombs in Paris, France, are a series of underground ossuaries that hold the remains of millions of people. They were created in the late 18th century due to the city's overcrowded cemeteries. The catacombs stretch over 200 miles beneath the streets of Paris, but only a small portion is open to the public.\n\nThe sections that are open to visitors contain carefully arranged bones and skulls, creating a unique and somewhat eerie display. The walls are etched with quotes and poems, adding to the mysterious atmosphere. Visitors can learn about the history of the catacombs and the people whose remains are interred there.\n\nIt's important to note that the catacombs are not a typical tourist attraction. They are dark, damp, and can be claustrophobic. However, for those interested in history, architecture, or the macabre, they offer a one-of-a-kind experience. \n\nVisitors should remember to show respect for the deceased and their families while visiting the catacombs. I