# NVIDIA AI Playground LLM

>[NVIDIA AI Playground](https://www.nvidia.com/en-us/research/ai-playground/) gives users easy access to hosted endpoints for generative AI models like Llama-2, SteerLM, Mistral, etc. Using the API, you can query NVCR (NVIDIA Container Registry) function endpoints and get quick results from a DGX-hosted cloud compute environment. All models are source-accessible and can be deployed on your own compute cluster.

This example goes over how to use LangChain to interact with supported AI Playground models.

In [None]:
from langchain.llms.nv_aiplay import NVCRModel, NVAIPlayClient  ## Core backbone interface clients
from langchain.llms import NVAIPlayLLM                          ## Generic NVAIPlay Models
from langchain.llms.nv_aiplay import LlamaLLM                   ## Llama-default NVAIPlay Models

## Setup

**To get started:**
1. Create a free account with the [NVIDIA GPU Cloud](https://catalog.ngc.nvidia.com/) service, which hosts AI solution catalogs, containers, models, etc.
2. Navigate to `Catalog > AI Foundation Models > (Model with API endpoint)`.
3. Select the `API` option and click `Generate Key`.
4. Save the generated key as `NVAPI_KEY`. From there, you should have access to the endpoints.

In [3]:
import getpass
import os

## API Key can be found by going to NVIDIA NGC -> AI Playground -> (some model) -> Get API Code or similar.
## 10K free queries to any endpoint (which is a lot actually).

# del os.environ['NVAPI_KEY']  ## delete
if os.environ.get('NVAPI_KEY', '').startswith('nvapi-'):
    print('Valid NVAPI_KEY already in environment. Delete to reset')
else:
    nvapi_key = getpass.getpass('NVAPI Key (starts with nvapi-): ')
    assert nvapi_key.startswith('nvapi-'), \
        f"{nvapi_key[:5]}... is not a valid key"
    os.environ['NVAPI_KEY'] = nvapi_key

NVAPI Key (starts with nvapi-): ··········
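
The check in the cell above can be factored into small helpers so the same rule applies wherever a key is read. The following is a minimal sketch, not part of LangChain: `looks_like_nvapi_key` and `mask_key` are hypothetical names, and the only rule assumed is the `nvapi-` prefix used above. Masking avoids echoing part of a mistyped secret in an error message.

```python
PREFIX = "nvapi-"  # prefix used by the check in the cell above


def looks_like_nvapi_key(key: str) -> bool:
    """Return True if the key has the expected prefix and a non-empty body."""
    return key.startswith(PREFIX) and len(key) > len(PREFIX)


def mask_key(key: str, visible: int = 4) -> str:
    """Hide all but the last few characters so a key can be logged safely."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


sample = "nvapi-example-not-a-real-key"  # placeholder, not a real key
print(looks_like_nvapi_key(sample))
print(mask_key(sample))
```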


## Underlying Requests API

A selection of useful models is hosted in a DGX-powered service known as NVIDIA GPU Cloud (NGC). In this service, containers with exposed model endpoints are deployed and listed on the NVIDIA Container Registry service (NVCR). These endpoints are accessible via simple HTTP requests and can be used from a variety of systems.

The `NVCRModel` class implements the basic interfaces to communicate with NVCR, limiting the utility functions to those relevant for AI Playground. For example, the following list is populated by querying the function list endpoint with a key-loaded GET request:

In [4]:
NVCRModel().available_models

{'playground_nvolveqa_40k': '091a03bb-7364-4087-8090-bd71e9277520',
 'playground_llama2_code_34b': 'df2bee43-fb69-42b9-9ee5-f4eabbeaf3a8',
 'playground_sdxl': '89848fb8-549f-41bb-88cb-95d6597044a4',
 'playground_clip': '8c21289c-0b18-446d-8838-011b7249c513',
 'playground_gpt_qa_8b': '0c60f14d-46cb-465e-b994-227e1c3d5047',
 'playground_llama2_70b': '0e349b44-440a-44e1-93e9-abe8dcb27158',
 'playground_neva_22b': '8bf70738-59b9-4e5f-bc87-7ab4203be7a0',
 'playground_gpt_steerlm_8b': '1423ff2f-d1c7-4061-82a7-9e8c67afd43a',
 'playground_mistral': '35ec3354-2681-4d0e-a8dd-80325dcf7c63',
 'playground_fuyu_8b': '9f757064-657f-4c85-abd7-37a7a9b6ee11',
 'playground_llama2_13b': 'e0bb7fb9-5333-4a27-8534-c6288f921d3f',
 'playground_llama2_code_13b': 'f6a96af4-8bf9-4294-96d6-d71aa787612e'}

From this, you can easily send a request in the style shown in the AI Playground API window for Python. For this example, we will use a model which is not currently in our LangChain support matrix (though we plan to add first-class support later).

In [5]:
client = NVCRModel()

model = 'neva'
payload = {
  "messages": [
    {
      "content": "Hi! What is in this image? <img src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAgAAAAIAQMAAAD+wSzIAAAABlBMVEX///+/v7+jQ3Y5AAAADklEQVQI12P4AIX8EAgALgAD/aNpbtEAAAAASUVORK5CYII==\" />",
      "role": "user"
    },
    {
      "labels": {
        "creativity": 6,
        "helpfulness": 6,
        "humor": 0,
        "quality": 6
      },
      "role": "assistant"
    }
  ],
  "temperature": 0.2,
  "top_p": 0.7,
  "max_tokens": 512,
  "stream": True
}

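def print_with_newlines(generator):
    """Print streamed chunks, starting a new line at a space once past 80 characters."""
    buffer = ""  # text printed on the current line so far
    for response in generator:
        content = response.get('content') or ''  # tolerate chunks without content
        if len(buffer) > 80 and content.startswith(' '):
            buffer = ""
            print()
        elif content.startswith('\n'):
            buffer = ""
        buffer += content
        print(content, end='')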

## Generate-style response
# print(client.get_req_generation(model, payload))
# print()
## NOTE: if an invalid name is specified, it will try to find a model that contains the provided name

## Stream-style response
print_with_newlines(client.get_req_stream(model, payload))
print()

async def print_with_newlines_async(responses):
    buffer = ""
    async for response in responses:
        content = response['content']
        if len(buffer) > 80 and content.startswith(' '):
            buffer = ""
            print()
        elif '\n' in content:
            buffer = ""
        buffer += content
        print(content, end='')

## Stream-style response
await print_with_newlines_async(client.get_req_astream(model, payload))


The image is a gray scale photograph of a checkered pattern, possibly a portion of
 a chessboard or a security camera image. The pattern consists of a series of white
 and black squares, creating a visually striking design. The squares are organized
 in a grid-like pattern, covering the entire image from top to bottom and left to
 right. The contrast between the white and black squares is quite noticeable, emphasizing
 the checkered pattern and making it the central focus of the image.
The image is a gray scale photograph of a checkered pattern, possibly a portion of
 a chessboard or a security camera image. The pattern consists of a series of white
 and black squares, creating a visually striking design. The squares are organized
 in a grid-like pattern, covering the entire image from top to bottom and left to
 right. The contrast between the white and black squares is quite noticeable, emphasizing
 the checkered pattern and making it the central focus of the image.
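
The payload in the request cell embeds a PNG as a base64 data URI inside an `<img>` tag. The following is a minimal sketch of building such a message from raw bytes with the standard library only. `image_message` is a hypothetical helper, the tag format simply mirrors the payload shown in this notebook, and the assistant `labels` entry from that payload is omitted.

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message whose content ends with an inline base64 image tag."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    content = f'{prompt} <img src="data:{mime};base64,{encoded}" />'
    return {"role": "user", "content": content}


# Placeholder bytes stand in for a real image file, e.g. open(path, "rb").read()
message = image_message("Hi! What is in this image?", b"\x89PNG placeholder")
payload = {
    "messages": [message],
    "temperature": 0.2,
    "top_p": 0.7,
    "max_tokens": 512,
    "stream": True,
}
print(message["content"][:60])
```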

As shown, this is a general-purpose backbone API on which the LangChain generation, streaming, and async streaming APIs can be built.
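
The note in the request cell says that an unrecognized model name falls back to a model whose name contains it, which is why `'neva'` reaches `playground_neva_22b`. The following is a minimal sketch of that kind of lookup, assuming plain substring matching where the first match wins. `resolve_model` is a hypothetical helper, and the matching rule inside `NVCRModel` may differ.

```python
# Subset of the name -> function ID mapping listed earlier in this notebook
AVAILABLE = {
    "playground_llama2_70b": "0e349b44-440a-44e1-93e9-abe8dcb27158",
    "playground_neva_22b": "8bf70738-59b9-4e5f-bc87-7ab4203be7a0",
    "playground_mistral": "35ec3354-2681-4d0e-a8dd-80325dcf7c63",
}


def resolve_model(name: str, available: dict) -> str:
    """Return an exact match if present, else the first key containing `name`."""
    if name in available:
        return name
    matches = [key for key in available if name in key]
    if not matches:
        raise ValueError(f"No model matches {name!r}")
    return matches[0]


resolved = resolve_model("neva", AVAILABLE)
print(resolved, AVAILABLE[resolved])
```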

## Integration With LangChain

Based on this core support, we have a base connector `NVAIPlayBaseModel` which implements all of the components necessary to interface with both the `LLM` and `SimpleChatModel` classes via inheritance. This notebook will demonstrate the LLM portion with key features.

### **Supported Models**

Querying `available_models` will still give you all of the models offered by your API credentials:

In [6]:
NVAIPlayLLM().available_models

['playground_mistral',
 'playground_sdxl',
 'playground_gpt_qa_8b',
 'playground_clip',
 'playground_neva_22b',
 'playground_gpt_steerlm_8b',
 'playground_nvolveqa_40k',
 'playground_fuyu_8b',
 'playground_llama2_13b',
 'playground_llama2_code_34b',
 'playground_llama2_code_13b',
 'playground_llama2_70b']

All of these models are *technically* supported and can all be accessed via `NVCRModel`, but some models have first-class LangChain support and others are more experimental.

**Ready-To-Use Chat Models** have been tested and are the top priority for our LangChain support. They're useful for external and internal reasoning. Responses always arrive in a chat format and use a common seed, so trial results are consistent and reproducible. AI Playground offers no text completion API for these models, though support for raw query endpoints exists with NeMo Service and other NVCR functions.
- `llama2_13b`/`llama2_70b`: Chat-trained variants of Llama-2
- `llama2_code_13b`/`llama2_code_34b`: Code-trained variants of Llama-2
- `mistral`: Instruction-tuned variant of Mistral.
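
To separate the ready-to-use chat models from the experimental ones, the names returned by `available_models` can be compared against the list above. The following is a minimal sketch, assuming the `playground_` prefix seen in the output and using `llama2_code_34b` as it appears there. `READY_TO_USE` and `split_models` are hypothetical names.

```python
PREFIX = "playground_"
READY_TO_USE = {
    "llama2_13b", "llama2_70b",
    "llama2_code_13b", "llama2_code_34b",
    "mistral",
}


def split_models(available):
    """Split model names into (ready_to_use, experimental), ignoring the prefix."""
    ready, experimental = [], []
    for name in available:
        short = name[len(PREFIX):] if name.startswith(PREFIX) else name
        (ready if short in READY_TO_USE else experimental).append(name)
    return sorted(ready), sorted(experimental)


available = [
    "playground_mistral", "playground_sdxl", "playground_gpt_qa_8b",
    "playground_clip", "playground_neva_22b", "playground_gpt_steerlm_8b",
    "playground_nvolveqa_40k", "playground_fuyu_8b", "playground_llama2_13b",
    "playground_llama2_code_34b", "playground_llama2_code_13b",
    "playground_llama2_70b",
]
ready, experimental = split_models(available)
print(ready)
```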

The following is a brief showcase of the generate API:

In [7]:
from langchain.llms.nv_aiplay import LlamaLLM

def print_with_newlines(generator):
    buffer = ""
    for content in generator:
        if len(buffer) > 80 and content.startswith(' '):
            buffer = ""
            print()
        elif '\n' in content:
            buffer = ""
        buffer += content
        print(content, end='')

# Single prompt
llm = LlamaLLM()
print_with_newlines(llm("Who's the best quarterback in the NFL?"))
print()

I'm just an AI, I don't have have personal opinions or beliefs. However, I can provide
 you with some information about the best quarterbacks in the NFL based on their performance
 and achievements.

There are several quarterbacks in the NFL who are widely considered to be among the
 best. Some of the most notable include:

1. Tom Brady: Brady is widely considered to be one of the greatest quarterbacks of
 all time. He has won six Super Bowls with the New England Patriots and has been named
 Super Bowl MVP four times.
2. Aaron Rodgers: Rodgers is a three-time NFL MVP and has led the Green Bay Packers
 to two Super Bowl victories. He is known for his accuracy and ability to read defenses.
3. Drew Brees: Brees is a two-time Super Bowl MVP and has led the New Orleans Saints
 to three Super Bowl victories. He is known for his accuracy and ability to read defenses.
4. Patrick Mahomes: Mahomes is a two-time NFL MVP and has led the Kansas City Chiefs
 to two Super Bowl victories. He is known 

We currently also support streaming and asynchronous streaming in a similar fashion as before:

In [8]:
async def print_with_newlines_async(responses):
    buffer = ""
    async for content in responses:
        if len(buffer) > 80 and content.startswith(' '):
            buffer = ""
            print()
        elif '\n' in content:
            buffer = ""
        buffer += content
        print(content, end='')

# print_with_newlines(llm.stream("Who's the best quarterback in the NFL?"))
await print_with_newlines_async(llm.astream("Who's the best quarterback in the NFL?"))

I'm just an AI, I don't have have personal opinions or beliefs. However, I can provide
 you with some information about the best quarterbacks in the NFL based on their performance and achievements.

There are several quarterbacks in the NFL who
 are widely considered to be among the best. Some of the most notable include:

1. Tom Brady: Brady is widely considered to be one of the greatest quarterbacks
 of all time. He has won six Super Bowls with the New England Patriots and has been
 named Super Bowl MVP four times.
2. Aaron Rodgers: Rodgers is a three-time NFL MVP and has led the Green Bay Packers
 to two Super Bowl victories. He is known for his accuracy and ability to read defenses.
3. Drew Brees: Brees is a two-time Super Bowl MVP and has led the New Orleans Saints
 to three Super Bowl victories. He is known for his accuracy and ability to read defenses.
4. Patrick Mahomes: Mahomes is a two-time NFL MVP and has led the Kansas City Chiefs
 to two Super Bowl victories. He is known f
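
The top-level `await` in the cell above works because Jupyter runs its own event loop. In a plain script, the same consumption pattern has to be wrapped in `asyncio.run`. The sketch below shows that pattern with a stand-in async generator. `fake_astream` is hypothetical and only mimics the shape of `llm.astream` by yielding string chunks.

```python
import asyncio


async def fake_astream(prompt: str):
    """Stand-in for llm.astream: yields a canned reply one chunk at a time."""
    for chunk in ["Streaming", " works", " chunk", " by", " chunk."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect(prompt: str) -> str:
    """Print chunks as they arrive and return the full text."""
    parts = []
    async for chunk in fake_astream(prompt):
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


text = asyncio.run(collect("Who's the best quarterback in the NFL?"))
```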

We also support the other utilities provided by `LLM`, e.g. `generate`, chain `invoke`, etc.

In [33]:
# Calling multiple prompts
output = llm.generate(
    [
        "What are some amazing cartoon series?",
        "What are some great movies?",
        "What are some fantastic anime series?",
    ]
)

for gen in output.generations:
    print(gen)

[Generation(text='There are many amazing cartoon series that have captured the hearts of audiences around the world. Here are a few examples:\n\n1. "The Simpsons" - This long-running series follows the misadventures of the Simpson family, a working-class family living in the fictional town of Springfield.\n2. "South Park" - This controversial series follows the misadventures of four foul-mouthed fourth graders')]
[Generation(text="There are many great movies that have been released over the years, and it's difficult to narrow it down to a specific list. However, here are some movies that are widely considered to be great:\n\n1. The Shawshank Redemption (1994) - a highly acclaimed drama about the power of hope and redemption.\n2. The Godfather (1972) - a crime drama that explores the world of organized")]
[Generation(text='There are many great anime series out there, but here are some that are highly recommended:\n\n1. Attack on Titan (2013) - a dark and suspenseful series that explores

In [34]:
## Example from https://python.langchain.com/docs/modules/model_io/llms/async_llm

import asyncio
import time


def invoke_serially():
    for _ in range(10):
        resp = llm.invoke("Hello, how are you?")


async def async_invoke(llm):
    resp = await llm.ainvoke("Hello, how are you?")


async def invoke_concurrently():
    tasks = [async_invoke(llm) for _ in range(10)]
    await asyncio.gather(*tasks)


s = time.perf_counter()
# If running this outside of Jupyter, use asyncio.run(invoke_concurrently())
await invoke_concurrently()
elapsed = time.perf_counter() - s
print("\033[1m" + f"Concurrent executed in {elapsed:0.2f} seconds." + "\033[0m")

s = time.perf_counter()
invoke_serially()
elapsed = time.perf_counter() - s
print("\033[1m" + f"Serial executed in {elapsed:0.2f} seconds." + "\033[0m")

Concurrent executed in 3.52 seconds.
Serial executed in 11.69 seconds.
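
The gap between the two timings comes from overlapping the waits on the network. The following is a self-contained sketch of the same measurement with simulated latency. `fake_ainvoke` is a hypothetical stand-in for `llm.ainvoke`, the 0.05-second delay is arbitrary, and the serial case here awaits requests one after another rather than calling a synchronous `invoke`.

```python
import asyncio
import time

DELAY = 0.05  # simulated per-request latency in seconds (arbitrary)


async def fake_ainvoke(prompt: str) -> str:
    """Stand-in for llm.ainvoke: waits, then returns a canned reply."""
    await asyncio.sleep(DELAY)
    return f"reply to {prompt!r}"


async def run_serially(n: int) -> list:
    """Await each request before starting the next one."""
    return [await fake_ainvoke("Hello") for _ in range(n)]


async def run_concurrently(n: int) -> list:
    """Start all requests at once and wait for them together."""
    return await asyncio.gather(*(fake_ainvoke("Hello") for _ in range(n)))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


serial_result, serial_time = timed(run_serially(5))
concurrent_result, concurrent_time = timed(run_concurrently(5))
print(f"serial {serial_time:.2f}s, concurrent {concurrent_time:.2f}s")
```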


Because the underlying requests API is chat-oriented, we also support a convenience syntax for specifying roles inside a plain prompt. For example:

In [10]:
print(llm("""
///ROLE SYS: Only generate python code. Do not add any discussions about it.
///ROLE USER: Please implement Fibonacci in python without recursion. Your response should start and end in ```
"""))

```
def fibonacci(n):
    a, b = 0, 1
    for i in range(n):
        a, b = b, a + b
    return a
```
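
Because the prompt asks for a reply that starts and ends with a code fence, the fences usually have to be removed before the code can be used. The following is a minimal sketch with the standard library. `strip_fences` is a hypothetical helper, and it assumes the reply contains at most one fenced block.

```python
import ast

FENCE = "`" * 3  # built this way so the example contains no literal fence


def strip_fences(reply: str) -> str:
    """Return the text inside the first pair of code fences, or the reply itself."""
    start = reply.find(FENCE)
    if start == -1:
        return reply.strip()
    end = reply.find(FENCE, start + len(FENCE))
    if end == -1:
        return reply.strip()
    body = reply[start + len(FENCE):end]
    # Drop the rest of the opening fence line (an optional tag such as "python")
    first_line, _, rest = body.partition("\n")
    return rest.strip("\n") if rest else first_line.strip()


reply = (
    FENCE
    + "\ndef fibonacci(n):\n    a, b = 0, 1\n    for i in range(n):\n"
    + "        a, b = b, a + b\n    return a\n"
    + FENCE
)
code = strip_fences(reply)
ast.parse(code)  # raises SyntaxError if the extracted text is not valid Python
print(code)
```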


You can add your own custom support for such a system by subclassing the `NVAIPlayBaseModel` class.
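
To illustrate what such a convenience layer has to do, the sketch below splits a `///ROLE <NAME>:` prompt into chat-style messages. This is an illustration under stated assumptions, not the parsing used by `NVAIPlayBaseModel`: `parse_roles` and the `SYS` to `system` mapping are hypothetical, and text without any marker is treated as a single user message.

```python
import re

ROLE_MARKER = re.compile(r"///ROLE (\w+):")
ROLE_NAMES = {"sys": "system", "user": "user"}  # assumed mapping, not from the library


def parse_roles(prompt: str) -> list:
    """Split a ///ROLE-annotated prompt into {'role', 'content'} messages."""
    # re.split with one capture group yields [before, role1, text1, role2, text2, ...]
    parts = ROLE_MARKER.split(prompt)
    messages = []
    if parts[0].strip():
        messages.append({"role": "user", "content": parts[0].strip()})
    for role, text in zip(parts[1::2], parts[2::2]):
        name = ROLE_NAMES.get(role.lower(), role.lower())
        messages.append({"role": name, "content": text.strip()})
    return messages


prompt = """
///ROLE SYS: Only generate python code. Do not add any discussions about it.
///ROLE USER: Please implement Fibonacci in python without recursion.
"""
for message in parse_roles(prompt):
    print(message)
```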

## Simple Chains

You can use the LangChain Expression Language to create a simple chain with non-chat models.

In [37]:
from langchain.prompts import PromptTemplate

llm = LlamaLLM(
    temperature = 0.1,
    max_tokens = 100,
    top_p = 1.0
)
prompt = PromptTemplate.from_template("Tell me a joke about {topic}?")
chain = prompt | llm

print(chain.invoke({"topic": "graphics"}))

Here's a joke about graphics:

Why did the graphic designer break up with his girlfriend?

Because he couldn't handle her curves!


In [38]:
for token in chain.stream({"topic": "graphics"}):
    print(token, end="", flush=True)

Here's a joke about graphics:

Why did the graphic designer break up with his girlfriend?

Because he couldn't handle her curves!

Throughout, remember that the raw completion API is not exposed in AI Playground, so you should not include instruction formatting in your inputs.
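
A lightweight guard can catch prompts that still carry raw instruction markup before they are sent. The sketch below checks for the Llama-2 chat markers (`[INST]`, `[/INST]`, `<<SYS>>`, `<</SYS>>`). `has_instruction_markup` is a hypothetical helper, and other model families use different markers that this list does not cover.

```python
LLAMA2_MARKERS = ("[INST]", "[/INST]", "<<SYS>>", "<</SYS>>")


def has_instruction_markup(prompt: str) -> bool:
    """Return True if the prompt contains any Llama-2 style instruction marker."""
    return any(marker in prompt for marker in LLAMA2_MARKERS)


print(has_instruction_markup("Tell me a joke about graphics?"))
print(has_instruction_markup("[INST] Tell me a joke about graphics? [/INST]"))
```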