# Integrating Nemotron Nano 9B v2 with NVIDIA NIM

This notebook shows how to call the Nemotron Nano 9B v2 model through NVIDIA NIM's OpenAI-compatible API, as a starting point for production deployments.

## Prerequisites

- NVIDIA API key from [build.nvidia.com](https://build.nvidia.com/)
- Python 3.8+
- OpenAI Python library (`openai`, version 1.0 or later, which provides the `OpenAI` client class used below)

## Installation


In [None]:
!pip install openai


## Set Up API Key

Get your API key from [build.nvidia.com](https://build.nvidia.com/) and set it as an environment variable.


In [None]:
import os
from openai import OpenAI

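# Read the key from the NVIDIA_API_KEY environment variable so it never lands in the notebook.
# The placeholder is only a fallback for quick local tests; never commit a real key.
NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY", "YOUR_API_KEY_HERE")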

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=NVIDIA_API_KEY
)
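
Because the placeholder string is easy to leave in by accident, it can help to fail fast before any request is sent. The following is a minimal sketch, assuming the key is stored in an environment variable named `NVIDIA_API_KEY`; the helper name `load_api_key` is hypothetical and not part of any library.

```python
import os


def load_api_key(env_var: str = "NVIDIA_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    # Treat the notebook's placeholder value the same as a missing key.
    if not key or key == "YOUR_API_KEY_HERE":
        raise RuntimeError(
            f"Set the {env_var} environment variable to your key from build.nvidia.com"
        )
    return key
```

The result can then be passed as `api_key=load_api_key()` when constructing the client.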


## Example 1: Basic Chat Completion

A single-turn request to Nemotron Nano 9B v2 with a system prompt and one user message.


In [None]:
response = client.chat.completions.create(
    model="nvidia/nemotron-nano-9b-v2",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
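
Besides the text, the response object carries metadata that is worth checking, for example whether the reply was cut off by `max_tokens`. The sketch below assumes the endpoint fills in `finish_reason` and `usage` as the OpenAI chat schema describes; `summarize_completion` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in so the example runs without a network call.

```python
from types import SimpleNamespace


def summarize_completion(response) -> dict:
    """Pull the text, truncation flag, and token count out of a chat completion."""
    choice = response.choices[0]
    usage = getattr(response, "usage", None)
    return {
        "text": choice.message.content,
        # finish_reason == "length" means the reply hit the max_tokens limit.
        "truncated": choice.finish_reason == "length",
        "total_tokens": usage.total_tokens if usage else None,
    }


# Stand-in with the same attribute layout as a real response (assumption).
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(content="Hi"),
            finish_reason="length",
        )
    ],
    usage=SimpleNamespace(total_tokens=12),
)
print(summarize_completion(fake_response))
```

With a live client, `summarize_completion(response)` can be called directly on the object returned by `client.chat.completions.create`.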


## Example 2: Streaming Responses

Stream tokens as they are generated for a better user experience.


In [None]:
response = client.chat.completions.create(
    model="nvidia/nemotron-nano-9b-v2",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate Fibonacci numbers."}
    ],
    temperature=0.5,
    max_tokens=1024,
    stream=True
)

print("Streaming response:")
for chunk in response:
    # Guard against chunks with no choices or an empty delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")
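
When the full text is needed after streaming (for logging, or to append it to a conversation), the deltas can be accumulated as they arrive. This is a minimal sketch; `collect_stream` is a hypothetical helper that works on any iterable of chunk-like objects with the `choices[0].delta.content` layout used above.

```python
def collect_stream(chunks) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        # Some chunks may carry no choices at all; skip them.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skips None and empty strings
            parts.append(delta)
    return "".join(parts)
```

A stream can only be consumed once, so use either the printing loop or `full_text = collect_stream(response)`, not both on the same response.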


## Example 3: Function Calling

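Let the model decide when to call a function you define. The model only returns the function name and JSON-encoded arguments; executing the call is up to your code.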


In [None]:
import json

# Define available functions
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. San Francisco"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="nvidia/nemotron-nano-9b-v2",
    messages=[
        {"role": "user", "content": "What's the weather like in Tokyo?"}
    ],
    tools=tools,
    tool_choice="auto"
)

# Check if the model wants to call a function
message = response.choices[0].message
if message.tool_calls:
    print("Function call requested:")
    print(f"Function: {message.tool_calls[0].function.name}")
    print(f"Arguments: {message.tool_calls[0].function.arguments}")
else:
    print(message.content)
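
The cell above stops at printing the requested call. To complete the round trip, the function has to be executed locally and its result sent back as a `tool` message. The sketch below assumes the endpoint accepts `tool` role messages in the OpenAI chat format; `TOOL_REGISTRY`, `run_tool_calls`, and the placeholder `get_weather` body are hypothetical.

```python
import json


def get_weather(location: str) -> dict:
    """Placeholder implementation; a real one would query a weather service."""
    return {"location": location, "forecast": "unknown"}


# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message, registry=TOOL_REGISTRY) -> list:
    """Execute each requested tool call and build the follow-up tool messages."""
    results = []
    for call in message.tool_calls or []:
        func = registry.get(call.function.name)
        if func is None:
            output = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                args = json.loads(call.function.arguments)
                output = func(**args)
            except (json.JSONDecodeError, TypeError) as exc:
                # Models can emit malformed or mismatched arguments.
                output = {"error": str(exc)}
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(output)}
        )
    return results
```

To get a final answer, append the assistant `message` (including its `tool_calls`) to the message list, extend the list with the output of `run_tool_calls(message)`, and call `client.chat.completions.create` again.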


## Example 4: Multi-turn Conversation

Build context over multiple exchanges. The API is stateless, so the full message history is sent with every request.


In [None]:
conversation = [
    {"role": "system", "content": "You are a helpful coding tutor."},
    {"role": "user", "content": "I'm learning Python. What are decorators?"}
]

# First response
response = client.chat.completions.create(
    model="nvidia/nemotron-nano-9b-v2",
    messages=conversation,
    temperature=0.7,
    max_tokens=512
)

assistant_msg = response.choices[0].message.content
conversation.append({"role": "assistant", "content": assistant_msg})
print(f"Assistant: {assistant_msg}\n")

# Follow-up question
conversation.append({"role": "user", "content": "Can you show me a practical example?"})

response = client.chat.completions.create(
    model="nvidia/nemotron-nano-9b-v2",
    messages=conversation,
    temperature=0.7,
    max_tokens=512
)

print(f"Assistant: {response.choices[0].message.content}")
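
The two requests above repeat the same append-call-append pattern, which can be folded into one function. This is a minimal sketch; `ask` is a hypothetical helper, and it takes the client as an argument so it can be exercised with any object exposing `chat.completions.create`.

```python
def ask(client, conversation: list, user_text: str, **params) -> str:
    """Send one user turn, record both sides in `conversation`, return the reply."""
    conversation.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(
        model="nvidia/nemotron-nano-9b-v2",
        messages=conversation,  # full history, since the API keeps no state
        **params,
    )
    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply
```

For example, starting from a list that holds only the system message, `ask(client, conversation, "What are decorators?", temperature=0.7, max_tokens=512)` followed by `ask(client, conversation, "Can you show me a practical example?")` reproduces the exchange above.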


## Integration with Web Demo

To integrate this notebook's code with the web demo:

1. Create a backend API server (Flask, FastAPI, etc.)
2. Replace mock responses in `src/index.html` with API calls
3. Implement streaming using Server-Sent Events (SSE)
4. Add proper error handling and rate limiting
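
For the error-handling part of step 4, one common approach is to retry transient failures with exponential backoff. The following is a minimal sketch using only the standard library; `call_with_retry` and its parameters are hypothetical names, and which exceptions count as transient is left to the caller.

```python
import random
import time


def call_with_retry(func, max_attempts=4, base_delay=1.0,
                    retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception, wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 1x, 2x, 4x ... the base delay, plus jitter to spread out retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

With the `openai` library, a call might look like `call_with_retry(lambda: client.chat.completions.create(...), retry_on=(openai.RateLimitError, openai.APIConnectionError))`. The `OpenAI` client also accepts a `max_retries` argument, which may be enough for simple cases.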

Example backend structure:

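```python
import json
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

class ChatRequest(BaseModel):
    messages: list  # assumed shape: [{"role": "user", "content": "..."}, ...]

@app.post("/api/chat")
def chat(request: ChatRequest):
    response = client.chat.completions.create(
        model="nvidia/nemotron-nano-9b-v2",
        messages=request.messages,
        stream=True,
    )

    # Sync generator: StreamingResponse iterates it in a thread pool, so the
    # blocking OpenAI stream does not stall the event loop.
    def generate():
        for chunk in response:
            if chunk.choices and chunk.choices[0].delta.content:
                # JSON-encode each delta so newlines cannot break SSE framing.
                payload = json.dumps({"content": chunk.choices[0].delta.content})
                yield f"data: {payload}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
```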
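
The backend example leaves out the rate limiting mentioned in step 4. One simple option is a sliding-window counter per client. The sketch below is an in-memory version that only works within a single process; the class name `SlidingWindowLimiter` and its defaults are hypothetical, and a multi-worker deployment would need shared storage instead.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client key."""

    def __init__(self, limit: int = 10, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Inside the endpoint, a check such as `limiter.allow(client_ip)` could run before the model is called, returning HTTP 429 when it yields `False`.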

## Resources

- [NVIDIA NIM API Documentation](https://docs.nvidia.com/nim/)
- [OpenAI Python Library](https://github.com/openai/openai-python)
- [FastAPI Documentation](https://fastapi.tiangolo.com/)

---

**Happy Building with NVIDIA Nemotron!** 🚀
