# NIMs Three Ways - build.nvidia.com, local, and then fine-tuned

The beauty of developing with NVIDIA NIMs is that you can use them anywhere. You can develop against remote APIs, then stand up the same NIM locally, and then use that single NIM to serve multiple parameter-efficient fine-tuned models.

Here we'll walk through a simple example with Llama-3.1 8B Instruct. Let's get started!

<div>
<img src="https://developer.download.nvidia.com/images/products/practitioner-nim-1920x1080.png" width="50%"/>
</div>

## Set NGC API Key

We'll need an API key to use the NIMs. You can get yours (including free credits for developers!) at build.nvidia.com. I've got mine in a local .env file, so we'll use dotenv to read it.

In [None]:
!pip install python-dotenv

In [None]:
from dotenv import load_dotenv
import os

In [None]:
load_dotenv()
NGC_API_KEY=os.getenv('NGC_API_KEY')
NGC_API_KEY_LOCAL=os.getenv('NGC_API_KEY_LOCAL')
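
`os.getenv` quietly returns `None` when a variable is missing, and that only shows up later as a confusing authentication error. Here is a minimal sketch of a fail-fast check; the helper name `require_env` is ours for illustration and is not part of dotenv.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adjust it to wherever you keep your keys.
        raise RuntimeError(f"{name} is not set; add it to your .env file or export it")
    return value
```

After `load_dotenv()`, you could write `NGC_API_KEY = require_env("NGC_API_KEY")` in place of the plain `os.getenv` call.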

## Some Routines for API Calls To NIMs

NIMs expose an OpenAI-compatible API, so we can just use the `openai` package. Then we'll define some routines for the simplest kind of NIM interaction: a single chat completion.

In [None]:
!pip install openai

In [None]:
from openai import OpenAI

def nim_client(url, key=None):
    # Build an OpenAI-style client that talks to the NIM at `url`.
    return OpenAI(base_url=url, api_key=key)


def simple_nim_complete(client, message_content, model="meta/llama-3.1-8b-instruct",
                        temperature=0.5, top_p=1, max_tokens=1024):
    # Request a streamed chat completion for a single user message.
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message_content}],
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        stream=True,
    )


def print_completion(completion):
    # Print each streamed text delta as it arrives.
    for chunk in completion:
        # Skip chunks that carry no choices or no text.
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="")
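
Printing the stream is fine for a demo, but sometimes we want the whole answer as a string, for example to compare the base and fine-tuned models later. Here is a small sketch of that, assuming the chunks have the same `choices[0].delta.content` shape used above; the name `collect_completion` is ours.

```python
def collect_completion(completion):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in completion:
        # Some chunks may carry no choices at all, so skip those.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta is not None:
            parts.append(delta)
    return "".join(parts)
```

A stream can only be consumed once, so call either this or `print_completion` on a given completion, not both.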


## Develop on build.nvidia.com

Let's create a remote client and ask our first simple question.

In [None]:
remote_client = nim_client("https://integrate.api.nvidia.com/v1", key=NGC_API_KEY)
completion = simple_nim_complete(remote_client, "What is your name?", model="nvdev/meta/llama-3.1-8b-instruct")

print_completion(completion)

Now let's try something harder - a math question.

In [None]:
completion = simple_nim_complete(remote_client, """A parabola with equation $y=x^2+bx+c$ passes through
                                                   the points $(-1,-11)$ and $(3,17)$.
                                                   What is $c$?""", model="nvdev/meta/llama-3.1-8b-instruct")

print_completion(completion)
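
To judge the model's answer we need the right one. Substituting the two points into $y=x^2+bx+c$ gives $c-b=-12$ and $3b+c=8$, so $b=5$ and $c=-7$. The sketch below does the same arithmetic; `solve_parabola_bc` is just an illustrative helper of ours.

```python
from fractions import Fraction


def solve_parabola_bc(p1, p2):
    """Solve y = x^2 + b*x + c for (b, c) given two points with distinct x values."""
    (x1, y1), (x2, y2) = p1, p2
    # Move the known x^2 term to the right-hand side: b*x + c = y - x^2.
    r1, r2 = Fraction(y1 - x1**2), Fraction(y2 - x2**2)
    # Subtracting the two equations eliminates c.
    b = (r1 - r2) / (x1 - x2)
    c = r1 - b * x1
    return b, c


b, c = solve_parabola_bc((-1, -11), (3, 17))
print(f"b = {b}, c = {c}")  # b = 5, c = -7
```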

## Install Locally
### Set up NIM cache directory

In [None]:
nim_cache_path="/tmp/nim/.cache"
os.environ["NIM_CACHE_PATH"] = nim_cache_path

In [None]:
!mkdir -p $NIM_CACHE_PATH
!chmod -R 777 $NIM_CACHE_PATH

In [None]:
repository = "nim/meta/llama-3.1-8b-instruct"
os.environ["CONTAINER_NAME"] = "Llama-3.1-8B-Instruct"
os.environ["IMG_NAME"]= f"nvcr.io/{repository}:1.3.0"

### Run base NIM

In [None]:
!docker run -d --rm --name=$CONTAINER_NAME \
  --gpus all \
  --shm-size=32GB \
  --network=host \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  -u $(id -u) \
  $IMG_NAME

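Let's watch the NIM start up. Note that `docker logs -f` keeps following the log, so interrupt the cell once startup has finished.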

In [None]:
# Container ID printed by `docker run` above; replace it with the ID from your own run.
nim_docker_id = "d98f79e6e77812ba8e8d36719061f8b09f0d48fbb849def6c04732320c9c3b03"
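
Pasting the container ID by hand breaks as soon as the container is restarted. One alternative, sketched here under the assumption that the container was started with `--name` as above, is to ask Docker for the ID by name with `docker inspect`. The helper `container_id` and its `run` parameter (there only so the call can be swapped out in tests) are ours.

```python
import subprocess


def container_id(name, run=subprocess.run):
    """Return the full ID of the container called `name`, as reported by `docker inspect`."""
    result = run(
        ["docker", "inspect", "--format", "{{.Id}}", name],
        capture_output=True, text=True, check=True,  # check=True raises if the name is unknown
    )
    return result.stdout.strip()
```

For example, `nim_docker_id = container_id(os.environ["CONTAINER_NAME"])`. Docker commands such as `docker logs` and `docker stop` also accept the container name directly.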

In [None]:
!docker logs -f {nim_docker_id}
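
Instead of watching the log by eye, we can poll until the server answers. This sketch assumes the NIM exposes a readiness endpoint at `/v1/health/ready` on port 8000, so check the documentation for your NIM version. The helper names, timeout and polling interval are our own choices.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and non-2xx answers all mean "not ready yet".
        return False


def wait_until_ready(url="http://localhost:8000/v1/health/ready",  # assumed endpoint
                     timeout_s=600, interval_s=5, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` in its own cell returns `True` once the endpoint responds, or `False` if the timeout expires first.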

We can query our new endpoint to see what models are available...

In [None]:
!pip install requests

In [None]:
import requests
r = requests.get('http://localhost:8000/v1/models')
print(f"{r.status_code=}")
print("Models available = ", [item['id'] for item in r.json()['data']])

### Run against the local NIM

Fantastic!  Now let's try those API calls again...

In [None]:
local_client = nim_client("http://localhost:8000/v1", key=NGC_API_KEY_LOCAL)
completion = simple_nim_complete(local_client, "What is your name?")

print_completion(completion)

In [None]:
completion = simple_nim_complete(local_client, """A parabola with equation $y=x^2+bx+c$ passes through
                                                   the points $(-1,-11)$ and $(3,17)$.
                                                   What is $c$?""")

print_completion(completion)

## Use a local fine-tuned model

Now that we've got the base model running, let's stop the NIM and add a fine-tuned model. We're going to pull the fine-tuned model from Hugging Face, and for that we'll need Git LFS installed...

### Set up Git LFS

In [None]:
!sudo apt-get install -y git-lfs

In [None]:
!git lfs install

In [None]:
!git clone https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B

### Start NIM with the new fine-tuned model

We'll stop the currently running container and start a new one that points to the fine-tuned model...

In [None]:
!docker stop {nim_docker_id}

In [None]:
!docker run -d --rm --name="$CONTAINER_NAME"_PEFT \
    --gpus all \
    --shm-size=32GB \
    --network=host \
    -e NGC_API_KEY=$NGC_API_KEY \
    -u $(id -u) \
    -e NIM_FT_MODEL=/opt/weights/hf/OpenMath2-Llama3.1-8B \
    -e NIM_SERVED_MODEL_NAME=OpenMath2-Llama3.1-8B \
    -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
    -v ${PWD}:/opt/weights/hf \
    $IMG_NAME

In [None]:
# Container ID printed by the second `docker run`; replace it with the ID from your own run.
nim_docker_id="379dffcf4a26ccdd32e6ace11c25ed4239d0db54015a1cd663b8c4c7c0e27383"

In [None]:
!docker logs -f {nim_docker_id}

Now let's query the API again to see which models are available.

In [None]:
r = requests.get('http://localhost:8000/v1/models')
print(f"{r.status_code=}")
print("Models available = ", [item['id'] for item in r.json()['data']])
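
Before sending prompts, it can help to confirm that the name we pass as `model` is really in that list, because a typo otherwise only surfaces as an API error. Here is a minimal sketch that works on the parsed `/v1/models` response; `ensure_model_served` is an illustrative helper of ours.

```python
def ensure_model_served(payload, model):
    """Raise if `model` is not among the IDs in a parsed /v1/models response."""
    served = [item["id"] for item in payload.get("data", [])]
    if model not in served:
        raise LookupError(f"{model!r} is not served; available models: {served}")
    return served
```

For example, `ensure_model_served(r.json(), "OpenMath2-Llama3.1-8B")`.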

In [None]:
completion = simple_nim_complete(local_client, "What is your name?", model='OpenMath2-Llama3.1-8B')

print_completion(completion)

In [None]:
completion = simple_nim_complete(local_client, """A parabola with equation $y=x^2+bx+c$ passes through
                                                   the points $(-1,-11)$ and $(3,17)$.
                                                   What is $c$?""",
                                               model='OpenMath2-Llama3.1-8B')

print_completion(completion)
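
Math-tuned models often wrap their final answer in `\boxed{...}`. Assuming this model does the same (we have not confirmed that here), a small sketch like this pulls the answer out of the response text so it can be checked automatically. The helper `last_boxed` is ours, and it does not handle nested braces.

```python
import re


def last_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None if there is none."""
    # Matches only brace-free contents, so nested expressions such as \frac{1}{2} are skipped.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None
```

Pass it the full response text as a single string. For the parabola question the expected value is `-7`.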

In [None]:
!docker stop {nim_docker_id}