# NVIDIA AI Playground LLM

>[NVIDIA AI Playground](https://www.nvidia.com/en-us/research/ai-playground/) gives users easy access to hosted endpoints for generative AI models like Llama-2, Mistral, etc. Using the API, you can query NVCR (NVIDIA Container Registry) function endpoints and get quick results from a DGX-hosted cloud compute environment. All models are source-accessible and can be deployed on your own compute cluster.

This example goes over how to use LangChain to interact with supported AI Playground models.

In [21]:
## Core backbone interface clients
from langchain.llms import NVAIPlayLLM

## Llama-default NVAIPlayLLM and core NVCR Model base
from langchain.llms.nv_aiplay import (
    LlamaLLM,
    NVCRModel,
)

## Setup

**To get started:**
1. Create a free account with the [NVIDIA GPU Cloud](https://catalog.ngc.nvidia.com/) service, which hosts AI solution catalogs, containers, models, etc.
2. Navigate to `Catalog > AI Foundation Models > (Model with API endpoint)`.
3. Select the `API` option and click `Generate Key`.
4. Save the generated key as `NVAPI_KEY`. From there, you should have access to the endpoints.

In [5]:
import getpass
import os

## API Key can be found by going to NVIDIA NGC -> AI Playground -> (some model) -> Get API Code or similar.
## 10K free queries to any endpoint (which is a lot actually).

# del os.environ['NVAPI_KEY']  ## delete
if os.environ.get("NVAPI_KEY", "").startswith("nvapi-"):
    print("Valid NVAPI_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVAPI_KEY"] = nvapi_key

NVAPI Key (starts with nvapi-): ··········


## Underlying Requests API

A selection of useful models is hosted in a DGX-powered service known as [NVIDIA GPU Cloud (NGC)](https://catalog.ngc.nvidia.com/). In this service, containers with exposed model endpoints are deployed and listed on the NVIDIA Container Registry service (NVCR). These endpoints are accessible via plain HTTP requests, so a wide range of clients can use them.

The `NVCRModel` class implements the basic interfaces to communicate with NVCR, limiting the utility functions to those relevant for AI Playground. For example, the following list is populated by querying the function list endpoint with a key-loaded GET request:

In [7]:
NVCRModel().available_models

{'playground_llama2_70b': '0e349b44-440a-44e1-93e9-abe8dcb27158',
 'playground_fuyu_8b': '9f757064-657f-4c85-abd7-37a7a9b6ee11',
 'playground_llama2_code_34b': 'df2bee43-fb69-42b9-9ee5-f4eabbeaf3a8',
 'playground_neva_22b': '8bf70738-59b9-4e5f-bc87-7ab4203be7a0',
 'playground_llama2_code_13b': 'f6a96af4-8bf9-4294-96d6-d71aa787612e',
 'playground_gpt_qa_8b': '0c60f14d-46cb-465e-b994-227e1c3d5047',
 'playground_clip': '8c21289c-0b18-446d-8838-011b7249c513',
 'playground_gpt_steerlm_8b': '1423ff2f-d1c7-4061-82a7-9e8c67afd43a',
 'playground_nvolveqa_40k': '091a03bb-7364-4087-8090-bd71e9277520',
 'playground_sdxl': '89848fb8-549f-41bb-88cb-95d6597044a4',
 'playground_llama2_13b': 'e0bb7fb9-5333-4a27-8534-c6288f921d3f',
 'playground_mistral': '35ec3354-2681-4d0e-a8dd-80325dcf7c63'}
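A key-loaded GET request of the kind described above might be assembled as in the following sketch. It is not the `NVCRModel` implementation: the URL is a hypothetical placeholder, the `Bearer` authorization scheme is an assumption, and the response shape (`{"functions": [{"name": ..., "id": ...}]}`) is assumed purely for illustration. The request is built but never sent.

```python
import json
import urllib.request

# Hypothetical placeholder; take the real URL from a model's API window.
FUNCTION_LIST_URL = "https://example.invalid/functions"


def build_function_list_request(api_key, url=FUNCTION_LIST_URL):
    """Build (but do not send) a GET request that carries the API key."""
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def parse_function_list(body):
    """Map function names to ids, assuming the response shape noted above."""
    data = json.loads(body)
    return {fn["name"]: fn["id"] for fn in data.get("functions", [])}


# Offline demonstration with a hand-written response body.
sample_body = json.dumps(
    {
        "functions": [
            {
                "name": "playground_mistral",
                "id": "35ec3354-2681-4d0e-a8dd-80325dcf7c63",
            }
        ]
    }
)
request = build_function_list_request("nvapi-example")
models = parse_function_list(sample_body)
print(request.get_method(), models)
```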

From this, you can easily send a request in the style shown in the AI Playground API window for Python. For this example, we will use a model which is not currently in our LangChain support matrix (though we plan to add first-class support later).

In [8]:
client = NVCRModel()

model = "neva"
payload = {
    "messages": [
        {
            "content": 'Hi! What is in this image? ',
            "role": "user",
        },
        {
            "labels": {"creativity": 6, "helpfulness": 6, "humor": 0, "quality": 6},
            "role": "assistant",
        },
    ],
    "temperature": 0.2,
    "top_p": 0.7,
    "max_tokens": 512,
    "stream": True,
}


def print_with_newlines(generator):
    buffer = ""
    for response in generator:
        content = response.get("content", "")  # tolerate chunks without text
        if len(buffer) > 80 and content.startswith(" "):
            buffer = ""
            print()
        elif content.startswith("\n"):
            buffer = ""
        buffer += content
        print(content, end="")


## Generate-style response
# print(client.get_req_generation(model, payload))
# print()
## NOTE: if an invalid name is specified, it will try to find a model that contains the provided name

## Stream-style response
print_with_newlines(client.get_req_stream(model, payload))
print()


async def print_with_newlines_async(responses):
    buffer = ""
    async for response in responses:
        content = response["content"]
        if len(buffer) > 80 and content.startswith(" "):
            buffer = ""
            print()
        elif "\n" in content:
            buffer = ""
        buffer += content
        print(content, end="")


## Stream-style response
await print_with_newlines_async(client.get_req_astream(model, payload))

The image is a gray scale photograph of a checkered pattern, possibly a portion of
 a chessboard or a security camera image. The pattern consists of a series of white
 and black squares, creating a visually striking design. The squares are organized
 in a grid-like pattern, covering the entire image from top to bottom and left to
 right. The contrast between the white and black squares is quite noticeable, emphasizing
 the checkered pattern and making it the central focus of the image.
The image is a gray scale photograph of a checkered pattern, possibly a portion of
 a chessboard or a security camera image. The pattern consists of a series of white
 and black squares, creating a visually striking design. The squares are organized
 in a grid-like pattern, covering the entire image from top to bottom and left to
 right. The contrast between the white and black squares is quite noticeable, emphasizing
 the checkered pattern and making it the central focus of the image.

As we can see, this is a general-purpose backbone API on which the LangChain generation, streaming, and async streaming APIs can be built.
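To experiment with the line-wrapping logic used by the printing helpers without an API key, the same idea can be written as a pure function and fed by a stand-in list of chunks. This is a minimal sketch; `wrap_stream` is a hypothetical name that is not part of the library. It breaks the line once the current line exceeds `width` characters and the next chunk begins with a space.

```python
def wrap_stream(chunks, width=80):
    """Join streamed text chunks, soft-wrapping once a line exceeds `width`."""
    pieces = []
    line_len = 0
    for chunk in chunks:
        # Break only at a chunk that starts with a space, so words stay whole.
        if line_len > width and chunk.startswith(" "):
            pieces.append("\n")
            line_len = 0
        pieces.append(chunk)
        if "\n" in chunk:
            # The chunk carries its own newline: count only what follows it.
            line_len = len(chunk.rsplit("\n", 1)[1])
        else:
            line_len += len(chunk)
    return "".join(pieces)


# Stand-in for a token stream; a real stream would come from the client.
fake_chunks = ["word"] + [" word"] * 40
wrapped = wrap_stream(fake_chunks, width=20)
print(wrapped)
```

As in the notebook output, wrapped lines keep the leading space of the chunk that triggered the break.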

## Integration With LangChain

Based on this core support, we have a base connector `NVAIPlayBaseModel` which implements all of the components necessary to interface with both the `LLM` and `SimpleChatModel` classes via inheritance. This notebook will demonstrate the LLM portion with key features.

### **Supported Models**

Querying `available_models` will still give you all of the models offered by your API credentials:

In [9]:
NVAIPlayLLM().available_models

['playground_gpt_qa_8b',
 'playground_nvolveqa_40k',
 'playground_gpt_steerlm_8b',
 'playground_sdxl',
 'playground_llama2_code_13b',
 'playground_llama2_13b',
 'playground_mistral',
 'playground_neva_22b',
 'playground_llama2_70b',
 'playground_clip',
 'playground_fuyu_8b',
 'playground_llama2_code_34b']

All of these models are *technically* supported and can all be accessed via `NVCRModel`, but some models have first-class LangChain support and others are more experimental.

**Ready-To-Use Chat Models** have been tested and are the top priority for our LangChain support. They're useful for external and internal reasoning, and responses always arrive in a chat format with a common seed for consistent, reproducible results. AI Playground offers no text completion API for these models, though raw query endpoints are supported by NeMo Service and other NVCR functions.
- `llama2_13b`/`llama2_70b`: Chat-trained variants of Llama-2
- `llama2_code_13b`/`llama2_code_34b`: Code-trained variants of Llama-2
- `mistral`: Instruction-tuned variant of Mistral.

These can be invoked by specifying a `model_name` or `model` that can be tied back to the `available_models` function ids. All other current models are experimental; you are free to interface with them, and working with `NVCRModel` directly is a good starting point. Note, however, that deeper support is still in development and requires some custom pre-processing/post-processing.
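How a short name such as `neva` or `llama2_13b` could be tied back to a function id can be sketched as follows. This illustrates the "contains the provided name" behaviour noted earlier and is not the library's actual lookup: `resolve_model` is a hypothetical helper, and picking the first match when several names qualify is an assumption.

```python
# A few entries copied from the `available_models` output shown earlier.
AVAILABLE = {
    "playground_llama2_13b": "e0bb7fb9-5333-4a27-8534-c6288f921d3f",
    "playground_neva_22b": "8bf70738-59b9-4e5f-bc87-7ab4203be7a0",
    "playground_mistral": "35ec3354-2681-4d0e-a8dd-80325dcf7c63",
}


def resolve_model(name, available):
    """Return (full_name, function_id) for an exact or substring match."""
    if name in available:
        return name, available[name]
    # Fall back to the first listed model whose name contains `name`.
    matches = [key for key in available if name in key]
    if not matches:
        raise ValueError(f"No available model matches {name!r}")
    return matches[0], available[matches[0]]


print(resolve_model("neva", AVAILABLE))
print(resolve_model("llama2_13b", AVAILABLE))
```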

**To find out more about a specific model, please navigate to the API section of an AI Playground model [as linked here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/codellama-13b/api).**

-----

The following is a brief showcase of the generate API:

In [11]:
def print_with_newlines(generator):
    buffer = ""
    for content in generator:
        if len(buffer) > 80 and content.startswith(" "):
            buffer = ""
            print()
        elif "\n" in content:
            buffer = ""
        buffer += content
        print(content, end="")


# Single prompt
llm = LlamaLLM()
print_with_newlines(llm("Who's the best quarterback in the NFL?"))
print()

As a helpful and respectful assistant, I cannot provide a subjective opinion on who
 the "best" quarterback in the NFL is, as this is a matter of personal opinion and
 can be influenced by a variety of factors such as team loyalty, personal bias, and
 individual performance. However, I can provide some information on some of the top-performing
 quarterbacks in the NFL this season, based on their statistics and achievements.

Some of the top-performing quarterbacks in the NFL this season include:

1. Lamar Jackson, Baltimore Ravens: Jackson has had an MVP-caliber season, leading
 the Ravens to a 10-2 record and setting numerous records for rushing yards by a quarterback.
 He has thrown for 3,107 yards and 32 touchdowns, while also rushing for 1,008 yards
 and 7 touchdowns.
2. Russell Wilson, Seattle Seahawks: Wilson has had another strong season, leading
 the Seahawks to a 9-3 record and throwing for 3,875 yards and 30 touchdowns. He has
 also rushed for 242 yards and 2 touchdowns.
3. D

Streaming and asynchronous streaming are currently supported in the same fashion as before:

In [14]:
async def print_with_newlines_async(responses):
    buffer = ""
    async for content in responses:
        if len(buffer) > 80 and content.startswith(" "):
            buffer = ""
            print()
        elif "\n" in content:
            buffer = ""
        buffer += content
        print(content, end="")


# print_with_newlines(llm.stream("Who's the best quarterback in the NFL?"))
await print_with_newlines_async(llm.astream("Who's the best async developer?"))

As a helpful and respectful assistant, I cannot provide a subjective opinion on who
 the "best" async developer is, as it is not appropriate to compare individuals based on their skills or expertise. Additionally, it is important to recognize that each person
 has their own unique strengths and weaknesses, and it is not productive or fair to
 rank them based on arbitrary criteria.

Instead, I would suggest focusing on finding a developer who is a good fit for your
 project and has the necessary skills and experience to complete it successfully.
 You may want to consider factors such as their experience with async development,
 their understanding of the relevant technologies and frameworks, and their ability
 to communicate effectively and work collaboratively with your team.

It's also important to note that there are many talented async developers out there,
 and it's not productive to try to rank them based on subjective criteria. Instead,
 focus on finding the right person for your

We also support the other utilities provided by `LLM`, e.g. generate, chain invoke, etc.

In [22]:
# Calling multiple prompts
# llm = NVAIPlayLLM(model="llama2_code_13b")
# llm = NVAIPlayLLM(model_name="llama2_13b")
llm = LlamaLLM()
output = llm.generate(
    [
        "What are some amazing cartoon series?",
        "What are some great movies?",
        "What are some fantastic anime series?",
    ]
)

for gen in output.generations:
    print(gen)

[Generation(text='Hello! I\'m happy to help you find some amazing cartoon series! Here are some suggestions that are not only entertaining but also promote positive values and messaging:\n\n1. "Steven Universe" - A heartwarming and visually stunning show about a young boy named Steven who lives with a group of magical aliens. It explores themes of love, acceptance, and self-discovery.\n2. "Adventure Time" - A quirky and imaginative series that follows the adventures of Finn, a human boy, and his best friend Jake, a dog with magical powers. It\'s full of humor, action, and heart.\n3. "Gravity Falls" - A mysterious and thrilling show about twin siblings Dipper and Mabel Pines who spend the summer with their great-uncle in a strange town full of supernatural secrets.\n4. "Regular Show" - A hilarious and action-packed series about two friends, Mordecai and Rigby, who work at a park and get into all sorts of wild and wacky situations.\n5. "The Amazing World of Gumball" - A funny and relatab

In [23]:
## Example from https://python.langchain.com/docs/modules/model_io/llms/async_llm

import asyncio
import time


def invoke_serially():
    for _ in range(10):
        resp = llm.invoke("Hello, how are you?")


async def async_invoke(llm):
    resp = await llm.ainvoke("Hello, how are you?")


async def invoke_concurrently():
    tasks = [async_invoke(llm) for _ in range(10)]
    await asyncio.gather(*tasks)


s = time.perf_counter()
# If running this outside of Jupyter, use asyncio.run(invoke_concurrently())
await invoke_concurrently()
elapsed = time.perf_counter() - s
print("\033[1m" + f"Concurrent executed in {elapsed:0.2f} seconds." + "\033[0m")

s = time.perf_counter()
invoke_serially()
elapsed = time.perf_counter() - s
print("\033[1m" + f"Serial executed in {elapsed:0.2f} seconds." + "\033[0m")

Concurrent executed in 6.86 seconds.
Serial executed in 28.33 seconds.

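The gap between the two timings comes from overlapping the waiting time of the individual requests. The effect can be reproduced offline with a stand-in coroutine, as in the sketch below. `fake_ainvoke` is a hypothetical placeholder that only sleeps; no model is called, and the delay is arbitrary.

```python
import asyncio
import time


async def fake_ainvoke(prompt, delay=0.05):
    """Stand-in for an LLM call: wait, then echo the prompt."""
    await asyncio.sleep(delay)
    return f"echo: {prompt}"


async def run_serially(n):
    # Each call finishes before the next one starts.
    return [await fake_ainvoke("Hello, how are you?") for _ in range(n)]


async def run_concurrently(n):
    # All calls are in flight at once, so their waits overlap.
    calls = (fake_ainvoke("Hello, how are you?") for _ in range(n))
    return await asyncio.gather(*calls)


def timed(coro_fn, n=5):
    """Run `coro_fn(n)` in a fresh event loop; return (seconds, results)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(n))
    return time.perf_counter() - start, results
```

Calling `timed(run_serially)` and `timed(run_concurrently)` outside of Jupyter should show the concurrent run finishing in roughly the time of a single call.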

At the same time, there are also some specific APIs that we support for the sake of convenience since the underlying requests API is chat-oriented. For example:

In [24]:
print(
    llm(
        """
///ROLE SYS: Only generate python code. Do not add any discussions about it.
///ROLE USER: Please implement Fibonacci in python without recursion. Your response should start and end in ```
"""
    )
)

```
def fibonacci(n):
    if n <= 1:
        return n
    else:
        a, b = 0, 1
        for i in range(n-1):
            a, b = b, a + b
        return a
```
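One way to picture what the `///ROLE` convenience syntax does is to split such a prompt into chat-style messages before it reaches the chat-oriented endpoint. The sketch below only illustrates that idea and is not the library's parser: `parse_role_prompt` is a hypothetical helper, and the mapping from tags to roles (`SYS` to `system`, `USER` to `user`, `AI` to `assistant`) is an assumption.

```python
import re

# Assumed mapping from prompt tags to chat roles.
ROLE_ALIASES = {"SYS": "system", "USER": "user", "AI": "assistant"}


def parse_role_prompt(text):
    """Split a '///ROLE <TAG>:' annotated prompt into chat messages."""
    # With a capturing group, re.split yields [head, tag1, body1, tag2, ...].
    parts = re.split(r"///ROLE\s+(\w+):", text)
    messages = []
    head = parts[0].strip()
    if head:
        # Text before the first tag is treated as a plain user message.
        messages.append({"role": "user", "content": head})
    for tag, body in zip(parts[1::2], parts[2::2]):
        role = ROLE_ALIASES.get(tag.upper(), tag.lower())
        messages.append({"role": role, "content": body.strip()})
    return messages


prompt = """
///ROLE SYS: Only generate python code.
///ROLE USER: Please implement Fibonacci in python without recursion.
"""
for message in parse_role_prompt(prompt):
    print(message)
```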


## Simple Chains

You can use the LangChain Expression Language to create a simple chain with non-chat models.

In all of this, remember that the raw completion API is not exposed in AI Playground, so you should not include instruction formatting in your inputs.
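A small guard can catch instruction formatting before a prompt is sent. The sketch below checks for the Llama-2 style chat markers `[INST]` and `<<SYS>>`. The helper name `find_instruction_markers` is hypothetical, and the marker list is an assumption that is not exhaustive.

```python
# Llama-2 style chat markers; extend for other model families as needed.
INSTRUCTION_MARKERS = ("[INST]", "[/INST]", "<<SYS>>", "<</SYS>>")


def find_instruction_markers(prompt):
    """Return the instruction-formatting markers present in `prompt`."""
    return [marker for marker in INSTRUCTION_MARKERS if marker in prompt]


print(find_instruction_markers("[INST] Tell me a joke about graphics? [/INST]"))
print(find_instruction_markers("Tell me a joke about graphics?"))
```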

In [25]:
from langchain.prompts import PromptTemplate

llm = LlamaLLM(temperature=0.1, max_tokens=100, top_p=1.0)
prompt = PromptTemplate.from_template("Tell me a joke about {topic}?")
chain = prompt | llm

print(chain.invoke({"topic": "graphics"}))

Sure, here's a joke about graphics:

Why did the graphic designer go to the party?

Because he heard it was a "font"-astic time!

I hope that brought a smile to your face!


In [26]:
for token in chain.stream({"topic": "graphics"}):
    print(token, end="", flush=True)

Sure, here's a joke about graphics:

Why did the graphic designer go to the party?

Because he heard it was a "font"-astic time!

I hope that brought a smile to your face!