<p> <center> <a href="../start_here.ipynb">Home Page</a> </center> </p>

<div>
    <span style="float: left; width: 53%; text-align: right;">
        <a >1</a>
        <a href="02_introduction_mcp.ipynb">2</a>
        <a href="03_low_level_mcp.ipynb">3</a>
        <a href="04_langraph_agent.ipynb">4</a>
        <a href="05_challenge.ipynb">5</a>
    </span>
    <span style="float: left; width: 46%; text-align: right;"><a href="02_introduction_mcp.ipynb">Next Notebook</a></span>
</div>

## Learning objectives

By the end of this module, participants will be able to:
- Understand what NVIDIA Inference Microservices (NIMs) are and their role in accelerated AI inference
- Obtain and configure an NVIDIA API Key from the NVIDIA API Catalog
- Send inference requests to NIM cloud or local endpoints using Python's requests library
- Parse and interpret chat completion responses from the OpenAI-compatible NIM API

## Using NVIDIA Inference Microservices (NIMs)

NIMs are accessible through easy-to-use open APIs at the [NVIDIA API Catalog](https://build.nvidia.com/explore/discover), a platform that hosts a wide range of microservices online. To start using NIMs, you need an `NVIDIA API Key`, which requires registration. You can register by clicking the login button and entering your email address, as shown in the screenshot below, and then following the remaining steps. Alternatively, generate the API Key through the [NVIDIA NGC](https://ngc.nvidia.com/signin) registration (*click on your account name -> setup -> Generate Personal Key*). After completing the process, save your API Key somewhere you can access it later. An API Key starts with `nvapi-` followed by 64 other characters, which may include the underscore `_`.

If you already have an account, follow these steps to get your NVIDIA API Key:

- Log in to your account from [here](https://build.nvidia.com/explore/discover).
- Click on your model of choice.
- Under Input, select the Python tab, click Get API Key, and then click Generate Key.
- Copy and save the generated key as NVIDIA_API_KEY. From there, you should have access to the endpoints.

<div style="text-align: center;">
  <img src="images/nim-catalog.png" style="width: 900px; height: auto;">
</div>


In [None]:
import os
import getpass
import warnings
warnings.filterwarnings("ignore")
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key
    os.environ["NGC_API_KEY"] = nvapi_key

## Getting Started

This lab can be completed using either of the following methods:

- **Cloud Endpoint** – Access NVIDIA NIMs hosted on cloud GPUs
- **Local Endpoint** – Deploy NIMs locally on your own GPU

## Cloud Endpoint

In [None]:
import requests
import json
from pprint import pprint

url = "https://integrate.api.nvidia.com/v1/chat/completions"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}"
}

payload = {
    "model": "meta/llama-3.2-3b-instruct",
    # The question is a user turn; the "system" role is reserved for instructions
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 4096,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "stream": False
}

response = requests.post(url, headers=headers, json=payload)

In [None]:
pprint(response.json())
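
The response body follows the OpenAI chat completion schema, so the generated text sits under `choices[0].message.content`. The sketch below pulls out the fields that are usually of interest, assuming that schema. The helper name `summarize_chat_completion` is hypothetical, and `sample_body` is an illustrative stand-in for `response.json()`, not real model output.

```python
def summarize_chat_completion(body):
    """Extract the reply text, finish reason, and token usage from a
    chat completion response body (OpenAI-compatible schema assumed)."""
    choices = body.get("choices") or []
    if not choices:
        # Error responses usually carry no "choices" list.
        raise ValueError(f"no choices in response: {body}")
    first = choices[0]
    return {
        "text": first["message"]["content"],
        "finish_reason": first.get("finish_reason"),
        "total_tokens": body.get("usage", {}).get("total_tokens"),
    }


# Illustrative stand-in for response.json(); the values are made up.
sample_body = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "2 + 2 = 4"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 15, "completion_tokens": 7, "total_tokens": 22},
}

summary = summarize_chat_completion(sample_body)
print(summary["text"])
```

In the notebook, calling `response.raise_for_status()` first and then `summarize_chat_completion(response.json())` would surface HTTP errors (for example, an invalid key) before any parsing is attempted.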

## Local Endpoint

### Self-Hosted NIMs

Please execute the cell below to ensure that your docker daemon is up and running.

In [None]:
! docker ps 

**Expected Output (if you have no running containers):**

```
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
```

### Login to NVCR (NVIDIA Container Registry)

To access a NIM Docker image, you must log in with `docker login nvcr.io`. The login uses the fixed username `$oauthtoken` (passed literally via `--username`) and reads the password, the value of `$NGC_API_KEY`, from standard input via `--password-stdin`.

In [None]:
! echo -e "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

**Expected Output**:
```
WARNING! Your password will be stored unencrypted in /home/yagupta/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
```

### Selection of NIM

The [NIMs Catalog](https://build.nvidia.com/explore/reasoning) lists multiple state-of-the-art models in different domains. Look for the ones with the `RUN ANYWHERE` tag, as shown in the screenshot below. These NIM images are available to download and contain models and required optimized runtimes that help in getting started quickly.

<img src="images/catalog3_0.png" style="width: 900px; height: auto;">

Select the NIM model of your choice, click on the Docker tab, and copy the image name in the red box, as shown in the screenshots below.

<img src="images/catalog3_1.png" style="width: 900px; height: auto;">
<img src="images/catalog3_2.png" style="width: 900px; height: auto;">

### Pull The Image 

The next step is to pull the Docker image. We demonstrate this step by pulling `llama-3.2-3b-instruct:latest`.

In [None]:
! docker pull nvcr.io/nim/meta/llama-3.2-3b-instruct:latest

**Likely output:** (When you have the image pulled already)
```
latest: Pulling from nim/meta/llama-3.2-3b-instruct
Digest: sha256:b07b99986a134689c335958c3bb6a3d01ec7a46d0c308087b4d56384fc094708
Status: Image is up to date for nvcr.io/nim/meta/llama-3.2-3b-instruct:latest
nvcr.io/nim/meta/llama-3.2-3b-instruct:latest
```

In [None]:
! docker image ls | grep llama-3.2-3b-instruct

**Expected Output**:

```
nvcr.io/nim/meta/llama-3.2-3b-instruct   latest   <IMAGE ID>   <CREATED>   <SIZE>
```

#### Setting up Cache for the Model Artifacts

A NIM downloads a number of files at startup so that it can select the profile that performs best on your hardware. Set a location for caching these model artifacts as `LOCAL_NIM_CACHE` and export the variable.

In [None]:
from os.path import expanduser
home = expanduser("~")
os.environ['LOCAL_NIM_CACHE']=f"{home}/.cache/nim"
!echo $LOCAL_NIM_CACHE

In [None]:
!mkdir -p "$LOCAL_NIM_CACHE"
# !chmod 777 "$LOCAL_NIM_CACHE"

### Launch NIM LLM Microservice

Launch the NIM LLM microservice with a `docker run` command of the following form:

```
docker run -it --rm -d --gpus all --name=llm_nim --shm-size=16GB -e NGC_API_KEY -v "$LOCAL_NIM_CACHE":/opt/nim/.cache -u $(id -u) -p 8000:8000 nvcr.io/nim/meta/llama-3.2-3b-instruct:latest
```

This Docker command launches the NIM LLM microservice using the following flags:

- `-it`: Allocates a pseudo-TTY and keeps STDIN open for interactive processes
- `--rm`: Automatically removes the container when it exits
- `-d`: Runs the container in detached mode (in the background)
- `--gpus all`: Allows the container to access all available GPUs
- `--name=llm_nim`: Names the container "llm_nim"
- `--shm-size=16GB`: Sets the size of /dev/shm to 16GB
- `-e NGC_API_KEY`: Passes the NGC_API_KEY environment variable to the container
- `-v $LOCAL_NIM_CACHE:/opt/nim/.cache`: Mounts the local NIM cache directory to /opt/nim/.cache in the container
- `-u $(id -u)`: Runs the container with the current user's UID
- `-p 8000:8000`: Maps port 8000 on the host to port 8000 in the container
- `nvcr.io/nim/meta/llama-3.2-3b-instruct:latest`: Specifies the Docker image to use
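
To make the role of each flag concrete, the flags listed above can be assembled from Python as an argument list. This is only a sketch: `build_nim_run_command` is a hypothetical helper, it prints the command rather than running it, and `os.getuid()` assumes a Unix-like host.

```python
import os
import shlex


def build_nim_run_command(image, host_port, cache_dir, name="llm_nim",
                          gpus="all", shm_size="16GB", uid=None):
    """Return the `docker run` argument list for a NIM container."""
    uid = os.getuid() if uid is None else uid
    return [
        "docker", "run", "-it", "--rm", "-d",
        "--gpus", gpus,                          # GPUs visible to the container
        f"--name={name}",                        # container name
        f"--shm-size={shm_size}",                # size of /dev/shm
        "-e", "NGC_API_KEY",                     # forward the key from the host environment
        "-v", f"{cache_dir}:/opt/nim/.cache",    # mount the model cache
        "-u", str(uid),                          # run as the current user
        "-p", f"{host_port}:8000",               # host port -> container port 8000
        image,
    ]


cmd = build_nim_run_command(
    image="nvcr.io/nim/meta/llama-3.2-3b-instruct:latest",
    host_port=8000,
    cache_dir=os.path.expanduser("~/.cache/nim"),
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value together and avoids shell quoting mistakes, such as single-quoting `$LOCAL_NIM_CACHE`, which would stop the variable from being expanded.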

A system can run many processes at once, so we must make sure the container does not claim a port that another application is already using. The following code picks a random free port in the range 11000 to 11999 and stores it in `CONTAINER_PORT`:

In [None]:
import random
import socket

def find_available_port(start=11000, end=11999):
    while True:
        # Randomly select a port between start and end range
        port = random.randint(start, end)
        
        # Try to create a socket and bind to the port
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("localhost", port))
                # If binding is successful, the port is free
                return port
            except OSError:
                # If binding fails, the port is in use, continue to the next iteration
                continue

# Find and print an available port
os.environ['CONTAINER_PORT'] = str(find_available_port())
print(f"You have been allotted the available port: {os.environ['CONTAINER_PORT']}")
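
As an alternative sketch, if any free port is acceptable rather than one in a fixed range, the operating system can choose the port itself when a socket binds to port 0. The helper name `pick_free_port` is hypothetical and is not used elsewhere in this notebook.

```python
import socket


def pick_free_port(host="localhost"):
    """Ask the operating system for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 means: let the OS choose
        return sock.getsockname()[1]


port = pick_free_port()
print(f"OS-assigned free port: {port}")
```

With either approach the port is released as soon as the probing socket closes, so another process could in principle claim it before the container starts. On a shared machine, start the container soon after picking the port.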

In [None]:
! docker run -it -d --rm \
--gpus "device=${CUDA_VISIBLE_DEVICES}" \
--name=llm_nim \
--shm-size=16GB  \
-e NGC_API_KEY \
-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
-u $(id -u) \
-p $CONTAINER_PORT:8000 \
nvcr.io/nim/meta/llama-3.2-3b-instruct:latest

# Wait so that the local NIM container has time to finish loading before we query it
! sleep 60

In [None]:
! docker logs --tail 45 llm_nim

**Expected Output:**
```
INFO 2026-01-09 10:25:32.249 on.py:62] Application startup complete.
INFO 2026-01-09 10:25:32.251 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 2026-01-09 10:25:33.909 api_server.py:424] An example cURL request:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.2-3b-instruct",
    "messages": [
      {
        "role":"user",
        "content":"Hello! How are you?"
      },
      {
        "role":"assistant",
        "content":"Hi! I am quite well, how can I help you today?"
      },
      {
        "role":"user",
        "content":"Can you write me a song?"
      }
    ],
    "top_p": 1,
    "n": 1,
    "max_tokens": 15,
    "stream": true,
    "frequency_penalty": 1.0,
    "stop": ["hello"]
  }'

```

### Initiate A Quick Test
You can quickly test that your NIM is up and running in two ways:
- LangChain NVIDIA Endpoints
- A simple OpenAI-compatible completion request sent with `curl`

**Parameter description:**
- **base_url**: The URL where the NIM container is serving requests.
- **model**: The name of the deployed NIM model.
- **temperature**: Modulates the randomness of sampling. Lowering the temperature increases the chance of selecting high-probability tokens.
- **top_p**: Nucleus sampling threshold: sampling is restricted to the most likely tokens whose cumulative probability reaches `top_p`. Keep it low for exact, factual answers; raise it for more diverse responses.
- **max_tokens**: The maximum number of output tokens to generate.
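
To build intuition for `temperature` and `top_p`, here is a toy sketch in plain Python. It illustrates the two sampling ideas on made-up scores for three candidate tokens; it is not how the NIM implements sampling internally, and the function names are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (1.0, 0.1):
    probs = apply_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])

nucleus = top_p_filter(apply_temperature(logits, 1.0), top_p=0.8)
print("tokens kept at top_p=0.8:", sorted(nucleus))
```

Lowering the temperature concentrates probability on the highest-scoring token, while lowering `top_p` removes the unlikely tail before sampling.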

In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(base_url="http://0.0.0.0:{}/v1".format(os.environ['CONTAINER_PORT']), model="meta/llama-3.2-3b-instruct", temperature=0.1, max_tokens=1000, top_p=1.0)

result = llm.invoke("What is 2+2?")
print(result.content)

If the cell above returns an error, wait a while and rerun it. The NIM container may not have finished starting up.
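
Instead of rerunning by hand, the wait can be automated by polling the service until it answers. The sketch below uses only the standard library. `http_ok` and `wait_until_ready` are hypothetical helper names, and the readiness URL in the usage note is an assumption to be checked against the NIM documentation for your image.

```python
import time
import urllib.request


def http_ok(url, timeout=5):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib.error.URLError and HTTPError are OSError subclasses).
        return False


def wait_until_ready(url, max_wait=300, interval=5,
                     probe=http_ok, sleep=time.sleep):
    """Poll url until probe(url) succeeds or max_wait seconds elapse."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A possible call, assuming the container exposes a readiness route at `/v1/health/ready`, is `wait_until_ready(f"http://0.0.0.0:{os.environ['CONTAINER_PORT']}/v1/health/ready")`. It returns `True` once the service responds and `False` if the time limit passes first.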

In [None]:
!curl -X 'POST' \
    "http://0.0.0.0:${CONTAINER_PORT}/v1/completions" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d '{"model": "meta/llama-3.2-3b-instruct", "prompt": "What is 2+2?", "max_tokens": 64, \
    "temperature": 0.1, "top_p": 1.0, "stop": ["."]}'

### Links and Resources

- [NVIDIA NIM documentation](https://docs.nvidia.com/nim/index.html)
- [llama-3.2-3b-instruct](https://build.nvidia.com/meta/llama-3.2-3b-instruct) 

---

## Licensing

Copyright © 2025 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
