### First Program: Connecting to NVIDIA NIM via OpenAI-Compatible API Calls

In this chapter, you'll learn:
- What a Large Language Model (LLM) is
- How to connect to an OpenAI-compatible API endpoint
- How to make your first request using Python
- The basics of interacting with `meta/llama-3.2-3b-instruct`

### 🔧 Setup: Install Required Libraries

In [None]:
# Uncomment if the packages are not already installed in your environment
# !pip install -q openai python-dotenv requests

### 🧪 Step 1: Connect to Your LLM Endpoint
* Set the `NVIDIA_API_KEY` environment variable. You can supply the key in one of three ways:
   * Read it from Colab Secrets
   * Read it from a `.env` file
   * Export it in your shell configuration file (e.g. `.bashrc`) or on the command line
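
Whichever source you choose, the key ends up in the `NVIDIA_API_KEY` environment variable. The following is a minimal sketch of a lookup that fails early with a readable message instead of sending an empty key to the endpoint. The helper name `resolve_api_key` and the sample key value are hypothetical and only serve the illustration.

```python
import os


def resolve_api_key(env=None, name="NVIDIA_API_KEY"):
    """Return the API key from a mapping (defaults to os.environ).

    Hypothetical helper: raises a clear error when the key is missing or blank.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it to Colab Secrets, a .env file, or your shell."
        )
    return key


# Demonstration with a stand-in mapping, so no real key is required
print(resolve_api_key({"NVIDIA_API_KEY": "nvapi-example"}))
```

In the notebook you would call `resolve_api_key()` with no arguments after the key has been loaded into the environment.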

In [None]:
from openai import OpenAI
import os

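# Option 1 (Google Colab): read the key from Colab Secrets
from google.colab import userdata
os.environ['NVIDIA_API_KEY'] = userdata.get('NVIDIA_API_KEY')
apikey = os.getenv('NVIDIA_API_KEY')

# Option 2 (local machine): load the key from a .env file instead
# from dotenv import load_dotenv, find_dotenv
# load_dotenv(find_dotenv())  # Load .env file
# apikey = os.getenv('NVIDIA_API_KEY')

# Create an OpenAI client that points at the NVIDIA endpoint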
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=apikey
)

### 💬 Step 2: Make Your First API Call

In [None]:
# Send a simple message to the LLM
response = client.chat.completions.create(
    model="meta/llama-3.2-3b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a short joke."}
    ]
)

# Print the response
print(response.choices[0].message.content)
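
The `messages` argument is a plain list of dictionaries, each with a `role` and a `content` field. As a small sketch of how that list can be assembled, the hypothetical helper `build_messages` below creates the same structure used in the call above, with the system prompt being optional.

```python
def build_messages(user_prompt, system_prompt=None):
    """Build a chat `messages` list (hypothetical helper for illustration)."""
    messages = []
    if system_prompt:
        # The system message sets the assistant's overall behavior
        messages.append({"role": "system", "content": system_prompt})
    # The user message carries the actual question or instruction
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages("Tell me a short joke.", "You are a helpful assistant.")
print(messages)
```

The resulting list can be passed as `messages=messages` to `client.chat.completions.create`.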

### Alternative #1: Use the OpenAI standard library

In [None]:
completion = client.chat.completions.create(
    model="meta/llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Which one is bigger, 9.9 or 9.11?"}],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=False
)

# Access the response content directly
print(completion.choices[0].message.content)
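
To build some intuition for the `temperature` parameter, here is an illustrative sketch of a common formulation: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. This is an assumption about the general technique, not a description of how the NIM endpoint implements sampling, and the logits are made-up numbers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after temperature scaling (temperature > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures spread it more evenly, which is why lower values tend to give more repeatable answers.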


### Stream the Output

In [None]:
completion = client.chat.completions.create(
    model="meta/llama-3.2-3b-instruct",
    messages=[
        {"role": "user", "content": "What is the latest GPU model from NVIDIA?"}
    ],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True
)

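# Print each piece of text as soon as it arrives
for chunk in completion:
    # Some chunks may carry no choices or no text, so guard both cases
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
print()  # Finish with a newline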
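
If you want the full text as well as the live output, the chunks can be joined into one string. The sketch below uses stand-in objects built with `SimpleNamespace` that mimic the `chunk.choices[0].delta.content` shape, so it runs without a network call. The names `collect_stream` and `fake_chunk` are hypothetical.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text pieces of a streamed response into a single string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text is not None:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk("Hel"),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk("lo"),
]
print(collect_stream(fake_stream))  # prints "Hello"
```

With a real request, you would pass the `completion` object returned with `stream=True` in place of `fake_stream`.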


### Call an Embedding Model

In [None]:
# Reuses the `client` created in Step 1
response = client.embeddings.create(
    input=["What is the capital of France?"],
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
    encoding_format="float",
    extra_body={"input_type": "query", "truncate": "NONE"}
)

print(response.data[0].embedding)
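
An embedding is a list of floats, and a typical use is to compare two of them. As a minimal sketch, the function below computes cosine similarity in plain Python. The short vectors are toy values chosen for illustration, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # prints 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # prints 0.0
```

With real embeddings you would pass two vectors such as `response.data[0].embedding` from separate inputs; values closer to 1 indicate more similar texts.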


### Alternative #2: Call NIM using Python's requests library

In [None]:
import requests
import json
import os

# Read the API key from Colab Secrets (see Step 1 for the .env alternative)
from google.colab import userdata
os.environ['NVIDIA_API_KEY'] = userdata.get('NVIDIA_API_KEY')
apikey = os.getenv('NVIDIA_API_KEY')

def MyChat(prompt: str, max_tokens: int = 150, temperature: float = 0.7):
    """
    Send a prompt to the Llama-3.2-3B-Instruct model via NIM API.
    """
    url = "https://integrate.api.nvidia.com/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {apikey}" # Add the API key here
    }
    data = {
        "model": "meta/llama-3.2-3b-instruct",
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False
    }

    # json= serializes the payload; the timeout prevents the call from hanging indefinitely
    response = requests.post(url, headers=headers, json=data, timeout=60)

    if response.status_code == 200:
        result = response.json()
        return result['choices'][0]['message']['content']
    else:
        raise Exception(f"Error {response.status_code}: {response.text}")

# Example Usage
prompt = "Explain the concept of gravity in simple terms."
response = MyChat(prompt)
print("🤖 Response:", response)
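
The function above indexes straight into `result['choices'][0]['message']['content']`. As a sketch of a slightly more defensive variant, the hypothetical helper `extract_content` below checks that the parsed JSON contains at least one choice before reading it. The `sample` dictionary is hand-written to mirror the response shape used above, not captured from the API.

```python
def extract_content(result):
    """Return the assistant text from a parsed chat-completions response."""
    choices = result.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample mirroring the response shape (not real API output)
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Gravity pulls objects together."}}
    ]
}
print(extract_content(sample))
```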

### 📝 Step 3: Try It Yourself

**Exercise**: Modify the prompt above to ask the model to explain what an LLM is in one sentence.