In [None]:
import getpass
import os

nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
os.environ["NVIDIA_API_KEY"] = nvidia_api_key

Enter your NVIDIA API key: ··········


I'm taking a minimalist approach in this tutorial: we're going to call the API using nothing but the `requests` library.

The NIM API has integrations with [LangChain](https://python.langchain.com/v0.2/docs/integrations/providers/nvidia/) and [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/embeddings/nvidia/), and it's OpenAI API compatible. NVIDIA has put together [a repository](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/notebooks) with examples that you can hack around with after going through this basic tutorial.

Below is a helper function we'll use throughout the tutorial.


In [None]:
import requests
import base64
from IPython.display import HTML, display

def call_nim_api(endpoint, payload, headers = None, api_key=nvidia_api_key):
    """
    Send a POST request to an NVIDIA NIM API endpoint.

    Args:
        endpoint (str): Full URL of the API endpoint.
        payload (dict): The complete payload for the API request.
        headers (dict, optional): Request headers. Defaults to a bearer-token
            header built from `api_key`.
        api_key (str, optional): NVIDIA API key for authentication.
            Defaults to the key entered above.

    Returns:
        dict: JSON response from the API.

    Raises:
        requests.HTTPError: If the API request fails.
    """
    DEFAULT_HEADERS = {
        "Authorization": f"Bearer {api_key}",
        "Accept": "application/json",
    }

    if headers is None:
        headers = DEFAULT_HEADERS

    response = requests.post(
        endpoint,
        headers=headers,
        json=payload
        )

    response.raise_for_status()

    return response.json()
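
One thing to watch with the helper above: the default `api_key=nvidia_api_key` is evaluated once, when the function is defined, so entering a new key later won't change what the helper sends until this cell is re-run. Here's a minimal sketch of a workaround, assuming the key lives in the `NVIDIA_API_KEY` environment variable set in the first cell. `build_headers` is a hypothetical helper name, not part of any NVIDIA library.

```python
import os

def build_headers(api_key=None, accept="application/json"):
    """Build request headers, looking up the key at call time (hypothetical helper)."""
    # Fall back to the environment variable set in the first cell
    key = api_key or os.environ.get("NVIDIA_API_KEY")
    if not key:
        raise ValueError("No API key found; set NVIDIA_API_KEY or pass api_key.")
    return {
        "Authorization": f"Bearer {key}",
        "Accept": accept,
    }
```

You could then pass the result explicitly, for example `call_nim_api(endpoint, payload, headers=build_headers())`.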


# Large Language Models

I typically hack around with "small" language models, in the 7-13 billion parameter range, since that's what fits on the hardware I have available. But since you get hooked up with 1000 credits right off the bat when you sign up for the API, I took this as an opportunity to play around with some massive language models that I would otherwise never get a chance to try.

Here's what I chose to play around with:

- [Nemotron-4-340B-Instruct](https://docs.api.nvidia.com/nim/reference/nvidia-nemotron-4-340b-instruct)

- [Snowflake Arctic](https://docs.api.nvidia.com/nim/reference/snowflake-arctic)

- [Yi-Large](https://docs.api.nvidia.com/nim/reference/01-ai-yi-large)

- [Mixtral 8x22B](https://docs.api.nvidia.com/nim/reference/mistralai-mixtral-8x22b-instruct-infer)

For this overview, I'm selecting one prompt from the [IFEval dataset](https://huggingface.co/datasets/harpreetsahota/Instruction-Following-Evaluation-for-Large-Language-Models). I encourage you to try your own prompts, or some prompts from [here](https://huggingface.co/collections/harpreetsahota/red-teaming-prompts-656256235475849b82a91813).


In [None]:
PROMPT = """"The man was arrested for stealing a car. He was later released on bail."
Expand on it angrily in 90s gangster rap style, and make sure there are exactly 4 verses and a hook.
Separated the sections by the markdown divider: ***
"""

messages = [
    {
        "role": "user",
        "content": PROMPT
    }
]

def create_llm_payload(model, messages=messages, **kwargs):
    """
    Create a payload for an NVIDIA NIM API call to an LLM.

    Args:
        model (str): The model to use.
        messages (list): List of message dictionaries.
        **kwargs: Arbitrary keyword arguments for additional payload parameters.

    Returns:
        dict: The payload for the API call.
    """
    # Default values
    default_params = {
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 1024,
        "stream": False
    }

    # Update default parameters with any provided kwargs
    default_params.update(kwargs)

    # Create the payload
    payload = {
        "model": model,
        "messages": messages,
        **default_params
    }

    return payload


def print_response(response):
    """
    Print the content of the response from an LLM API call.

    Args:
        response (dict): The response dictionary from the API call.

    Returns:
        None
    """
    try:
        content = response['choices'][0]['message']['content']
        print(content)
    except KeyError:
        print("Error: Unable to find the expected content in the response.")
    except Exception as e:
        print(f"An error occurred while trying to print the response: {str(e)}")

In [None]:
LLM_ENDPOINT = "https://integrate.api.nvidia.com/v1/chat/completions"

NEMOTRON = "nvidia/nemotron-4-340b-instruct"

response = call_nim_api(LLM_ENDPOINT, create_llm_payload(model=NEMOTRON))

print_response(response)

***
(Verse 1)
Yo, check it, this tale of a man, livin' life on the edge,
Stealin' cars for a living, never had no privilege.
Cops on his tail, but he's too slick, too quick,
Finally caught him, threw him in the back, no click-click.

(Verse 2)
Down at the station, they book him, no respect,
Just another criminal, no need for no detect.
But this man's got heart, won't let them break his spirit,
Bails out, back on the streets, same old merit.

(Hook)
We ride, we slide, we hide, we seek,
In the shadows of the night, no time for the weak.
The man, the myth, the legend, they can't keep him down,
In this game of life, he wears the crown.

***

(Verse 3)
Now he's out, but the heat's still on,
Gotta watch his back, can't trust no one.
But he's a survivor, a true gangster, you see,
In this world of deceit, he's as real as can be.

(Verse 4)
So here's to the man, the one they couldn't cage,
In this rap game of life, he's on the front page.
Stealin' cars was his sin, but he's more than that,
A sy

For the sake of keeping this tutorial as short as possible, I won't share the output from the other models that I've hacked around with. Generating with them is quite straightforward: all you have to do is change the model string to whatever model you want to use, for example:

```python
ARCTIC = "snowflake/arctic"

YI_LARGE = "01-ai/yi-large"

MIXTRAL = "mistralai/mixtral-8x22b-instruct-v0.1"

```

There are a lot of other models you can play around with. Check out the [API reference](https://docs.api.nvidia.com/nim/reference) for more details, including the arguments you can pass to control the model's output.
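
If you want to compare several models on the same prompt, a small loop does the job. This is only a sketch: `run_models` is a hypothetical helper, and the API call is passed in as `call_fn` so the loop itself stays independent of the network. The payload keys mirror the ones used earlier in this tutorial.

```python
def run_models(models, prompt, call_fn):
    """Send the same prompt to several models and collect the replies.

    models:  dict mapping a short label to a model string
    call_fn: any callable that takes a payload dict and returns the JSON response
    """
    results = {}
    for name, model_id in models.items():
        payload = {
            "model": model_id,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1024,
        }
        try:
            response = call_fn(payload)
            results[name] = response["choices"][0]["message"]["content"]
        except Exception as exc:
            # Keep going if one model fails, and record why
            results[name] = f"<failed: {exc}>"
    return results
```

With the helpers from this tutorial, a call might look like `run_models({"arctic": ARCTIC, "yi": YI_LARGE}, PROMPT, lambda p: call_nim_api(LLM_ENDPOINT, p))`.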

I had a blast playing around with these LLMs, especially since I wouldn't be able to otherwise. Thanks, NVIDIA, for hosting these and for making inference with them pretty damn fast!

# Visual Models

The Visual Models endpoint has some standard diffusion models, like various flavors of Stable Diffusion such as [SDXL](https://docs.api.nvidia.com/nim/reference/stabilityai-stable-diffusion-xl-infer). It also has a couple of NVIDIA's specialized in-house models like [RetailObjectDetection](https://docs.api.nvidia.com/nim/reference/nvidia-retail-object-detection-infer) and [OCRNet](https://docs.api.nvidia.com/nim/reference/nvidia-ocdrnet-infer).

I took this opportunity to play around with [Stable Video Diffusion](https://docs.api.nvidia.com/nim/reference/stabilityai-stable-video-diffusion-infer).

Stable Video Diffusion (SVD) is a generative model that synthesizes 25-frame video sequences at 576x1024 resolution from a single input image. It uses diffusion-based generation, gradually refining noise into detailed frames over multiple steps, to create short video clips with customizable frame rates and optional micro-conditioning parameters.

The version of the model available via the [NIM API](https://nvda.ws/4bcJs0j) is SVD XT, which is an image-to-video model (no text prompt). Feel free to use your own images; just note that your image must be smaller than 200KB. Otherwise, it needs to be uploaded to a presigned S3 bucket using [NVCF Asset APIs](https://docs.api.nvidia.com/cloud-functions/reference/createasset).
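
Before sending your own image, it can help to check its size up front. The sketch below makes two assumptions that the text above doesn't settle: that "200KB" means 200,000 bytes, and that the limit applies to the file on disk. Base64 encoding inflates data by roughly a third, so if the limit turns out to apply to the encoded string, `base64_length` gives the size to compare instead. Both function names are hypothetical.

```python
import os

MAX_INLINE_BYTES = 200 * 1000  # assumption: "200KB" read as 200,000 bytes

def fits_inline(image_path, limit=MAX_INLINE_BYTES):
    """Return True if the file on disk is smaller than the inline size limit."""
    return os.path.getsize(image_path) < limit

def base64_length(n_bytes):
    """Length of the padded base64 string for n_bytes of raw data."""
    # Base64 maps every 3 input bytes (rounded up) to 4 output characters
    return 4 * ((n_bytes + 2) // 3)
```

For example, `fits_inline("/content/Things-to-do-in-Winnipeg-Twitter.jpg")` tells you whether the Winnipeg image can be sent inline under these assumptions.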

To start off with, here's a picture of Winnipeg.

<img src="https://weexplorecanada.com/wp-content/uploads/2023/05/Things-to-do-in-Winnipeg-Twitter.jpg">

In [None]:
!wget https://weexplorecanada.com/wp-content/uploads/2023/05/Things-to-do-in-Winnipeg-Twitter.jpg

Below are some helper functions to convert and work with images in [base64](https://en.wikipedia.org/wiki/Base64).

In [None]:
import base64

def image_to_base64(image_path):
  """
  Encodes an image into base64 format.

  Args:
    image_path: The path to the image file.

  Returns:
    A base64 encoded string of the image.
  """

  with open(image_path, "rb") as image_file:
    image_bytes = image_file.read()
  encoded_string = base64.b64encode(image_bytes).decode()
  return encoded_string

def save_base64_video_as_mp4(base64_string, output_mp4_path):
    """
    Save the base64-encoded video from an API response as an MP4 file.

    Args:
        base64_string (dict): The API response; the base64-encoded video
            is read from its 'video' key.
        output_mp4_path (str): The path where the output MP4 should be saved.

    Returns:
        None
    """
    try:
        # Decode the base64 string
        video_data = base64.b64decode(base64_string['video'])

        # Write the binary data to an MP4 file
        with open(output_mp4_path, "wb") as mp4_file:
            mp4_file.write(video_data)

        print(f"MP4 video saved successfully at {output_mp4_path}")

    except Exception as e:
        print(f"An error occurred: {str(e)}")

def play_base64_video(base64_string, video_type="mp4"):
    """
    Play the base64-encoded video from an API response in a Colab notebook.

    Args:
        base64_string (dict): The API response; the base64-encoded video
            is read from its 'video' key.
        video_type (str, optional): The video format (e.g., 'mp4', 'webm'). Defaults to 'mp4'.

    Returns:
        None
    """
    base64_string=base64_string['video']
    # Ensure the base64 string doesn't have the data URI prefix
    if base64_string.startswith('data:video/'):
        # Extract the actual base64 data
        base64_string = base64_string.split(',')[1]

    # Create the HTML video tag
    video_html = f'''
    <video width="640" height="480" controls>
        <source src="data:video/{video_type};base64,{base64_string}" type="video/{video_type}">
        Your browser does not support the video tag.
    </video>
    '''

    # Display the video
    display(HTML(video_html))


This function will create the payload for an image with or without a prompt:

In [None]:
def create_image_payload(image_b64, image_format='jpeg', prompt=None):
    """
    Create the image content for a request, with or without a prompt.

    Args:
        image_b64 (str): The base64-encoded image string (with or without the data URI prefix).
        image_format (str, optional): The format of the image. Accepted formats are jpg, png and jpeg.
        prompt (str, optional): The prompt to include before the image. Default is None.

    Returns:
        str: The image as a data URI, or the prompt followed by an <img> tag
            embedding that data URI.
    """
    # Ensure the image_b64 doesn't already have the data URI prefix
    if not image_b64.startswith('data:image/'):
        image_b64 = f"data:image/{image_format};base64,{image_b64}"

    if prompt:
        return f'{prompt} <img src="{image_b64}" />'
    else:
        # Scenario without a prompt
        return image_b64

Let's convert the image to base64:

In [None]:
winnipeg = image_to_base64("/content/Things-to-do-in-Winnipeg-Twitter.jpg")

Note that the `cfg_scale` guides how strongly the generated video sticks to the original image. Use lower values to allow the model more freedom to make changes and higher values to correct motion distortions.

In [None]:
SVD_ENDPOINT = "https://ai.api.nvidia.com/v1/genai/stabilityai/stable-video-diffusion"

winnipeg_payload  = create_image_payload(winnipeg, image_format='jpeg', prompt=None)

payload = {
  "image": winnipeg_payload,
  "cfg_scale": 2.42,  # must be <= 9
  "seed": 51
}

winnipeg_video = call_nim_api(endpoint = SVD_ENDPOINT, payload = payload)

play_base64_video(winnipeg_video)
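
The effect of `cfg_scale` is easiest to see by holding the seed fixed and varying only that one value. Here's a minimal sketch that builds one payload per setting. `cfg_scale_sweep` is a hypothetical helper, and the upper bound of 9 is taken from the comment in the payload above rather than checked against the API reference.

```python
def cfg_scale_sweep(image_data_uri, scales, seed=51, max_scale=9):
    """Build one SVD payload per cfg_scale value, keeping the seed fixed."""
    payloads = []
    for scale in scales:
        # Assumption: the API rejects values above 9, so fail early instead
        if scale > max_scale:
            raise ValueError(f"cfg_scale {scale} exceeds the limit of {max_scale}")
        payloads.append({"image": image_data_uri, "cfg_scale": scale, "seed": seed})
    return payloads
```

Each payload can then go through `call_nim_api(endpoint=SVD_ENDPOINT, payload=p)`, for example with `cfg_scale_sweep(winnipeg_payload, [1.0, 2.5, 5.0])`. Every call is a separate generation, so keep the list short.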

In [None]:
!wget https://raw.githubusercontent.com/harpreetsahota204/hacking-with-nvidia-nim/main/assets/gettyimages-2004149010.jpeg

In [None]:
niners = image_to_base64("/content/gettyimages-2004149010.jpeg")

niners_payload  = create_image_payload(niners, image_format='jpeg', prompt=None)

payload = {
  "image": niners_payload,
  "cfg_scale": 1.2,  # must be <= 9
  "seed": 51
}

niners_video = call_nim_api(endpoint = SVD_ENDPOINT, payload = payload)

play_base64_video(niners_video)

# Vision-Language Models

The [NIM API](https://nvda.ws/4bcJs0j) has about 10 vision-language (aka "multimodal") models available.

I've hacked around with all of the ones here locally, but the speed of inference via the NIM was quite nice. What really caught my eye, though, is the [NeVA22B model](https://docs.api.nvidia.com/nim/reference/nvidia-neva-22b). NeVA is NVIDIA's version of the LLaVA model, where they replaced the open source LLaMA model with a GPT model trained by NVIDIA. In this approach, the image is encoded using a frozen Hugging Face CLIP model and combined with the prompt embeddings before being passed through the language model.

This was a fun model to hack around with. It's quite good, and has a bit of a different "personality" than the LLaVA models I've hacked with before. Those models were trained with either Vicuna, Mistral, or Hermes as the LLM, while NeVA uses an LLM that was trained by NVIDIA. Sadly, I couldn't find too much info (or a paper) about NeVA online.

In [None]:
NEVA22B_ENDPOINT = "https://ai.api.nvidia.com/v1/vlm/nvidia/neva-22b"

message_content = create_image_payload(
    image_b64 = niners,
    image_format='jpeg',
    prompt="Describe, as a rap in the style of Kendrick Lamar, what you see in this scene. Say 'Comption' and 'Bay Area' at least once each"
)

payload = {
  "messages": [{"role": "user", "content": message_content}],
  "max_tokens": 512,
  "temperature": 1.00,
  "top_p": 0.70,
  "stream": False
}

response = call_nim_api(endpoint = NEVA22B_ENDPOINT, payload = payload)

print_response(response)

(Verse 1)
Compton, Bay Area, where I'm from
The gridiron, the field, the sun
Red and gold, my team, the 49ers
Feelin' the heat, we're down to ten seconds

(Chorus)
It's a game of football, the clock's winding down
I'm throwin' the ball, I'm making a sound
Compton, Bay Area, my roots run deep
I'm playin' for the team, I'm never gonna sleep

(Verse 2)
I'm in the pocket, the clock's tickin' away
The team's dependin' on me, it's a big day
I throw the ball, it's catchin' in the air
Compton, Bay Area, I'm livin' my dream, no fear, no care

(Chorus)
It's a game of football, the clock's winding down
I'm throwin' the ball, I'm making a sound
Compton, Bay Area, my roots run deep
I'm playin' for the team, I'm never gonna sleep

(Verse 3)
The crowd's amped up, the energy's high
Compton, Bay Area, I'm feelin' alive
The game's on the line, the pressure's intense
But I'm ready, I'm comin' in for the entrance

(Chorus)
It's a game of football, the clock's winding down
I'm throwin' the ball, I'm making
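
The output above stops mid-line. One plausible cause is the `max_tokens` limit of 512 in the payload. In OpenAI-style chat responses, each choice carries a `finish_reason` that is `"length"` when the token limit cut the generation short. The sketch below assumes the NIM response includes that field, which may not hold for every endpoint. `was_truncated` is a hypothetical helper.

```python
def was_truncated(response):
    """Return True if the first choice reports it stopped at the token limit."""
    choices = response.get("choices") or []
    if not choices:
        return False
    # Assumption: OpenAI-style responses use "length" when max_tokens is hit
    return choices[0].get("finish_reason") == "length"
```

If `was_truncated(response)` returns `True`, raising `max_tokens` in the payload is the first thing to try.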

# Healthcare Models

The NIM API also has various models related to healthcare.

I didn't hack around with any of these models, but my teammate at Voxel51 (Dan Gural) wrote an awesome blog on [Segment Anything in a CT Scan with NVIDIA VISTA-3D](https://voxel51.com/blog/segment-anything-in-a-ct-scan-with-nvidia-vista-3d/), which I recommend checking out.

### Final thoughts

It's cool to see NVIDIA entering the API game.

They've got some great models in their model zoo, and I can only see them adding more over the coming months. The biggest thing that stands out to me is the speed. It's super impressive!