# Interacting with NIM

Now that we've deployed NIM, let's learn how to interact with it and integrate it with other applications by sending requests. The main way of interacting with NIM is through its REST API. In this notebook, we'll go over three ways of calling the REST API: through `curl` commands, through the Python `requests` library, and through the Python `OpenAI` library.


## Verifying that NIM is Ready

You can check whether NIM is deployed and ready for inference by running the following cell:


In [1]:
!curl localhost:8000/v1/health/ready

{"object":"health.response","message":"Service is ready."}

If it responds with

```json
{"object":"health.response","message":"Service is ready."}
```

NIM is ready to start receiving requests.
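If NIM is still loading the model, the health check fails or returns a different body, so a script may want to poll rather than check once. The following is a minimal sketch of that idea using only the standard library. The helper names (`is_ready_payload`, `fetch_health`, `wait_until_ready`) and the retry settings are hypothetical choices for illustration, not part of NIM; only the URL and the expected message come from the cell above.

```python
import json
import time
import urllib.error
import urllib.request

# Health endpoint used in the curl cell above.
READY_URL = "http://localhost:8000/v1/health/ready"


def is_ready_payload(body: str) -> bool:
    """Return True if the body is the 'Service is ready.' health response."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("message") == "Service is ready."


def fetch_health(url: str = READY_URL) -> str:
    """Fetch the health endpoint; return an empty string if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8")
    except (urllib.error.URLError, OSError):
        return ""


def wait_until_ready(fetch=fetch_health, attempts=30, delay_s=2.0, sleep=time.sleep) -> bool:
    """Poll until the service reports ready or the attempts run out (assumed defaults)."""
    for _ in range(attempts):
        if is_ready_payload(fetch()):
            return True
        sleep(delay_s)
    return False
```

Calling `wait_until_ready()` with these assumed defaults polls for roughly a minute before giving up; the `fetch` and `sleep` parameters exist so the loop can be exercised without a running server.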

## Checking the Available Models

To send requests to NIM, we need to know the name of our model. We can obtain it by sending a request to the `v1/models` endpoint via the `requests` library:

In [2]:
import requests

# List all models
models = requests.get('http://localhost:8000/v1/models').json()
print(models)

# Get the name of our model
model_name = models['data'][0]['id']

{'object': 'list', 'data': [{'id': 'meta/llama3-8b-instruct', 'object': 'model', 'created': 1720560692, 'owned_by': 'system', 'root': 'meta/llama3-8b-instruct', 'parent': None, 'permission': [{'id': 'modelperm-fb81edfbff96473eb6ca2f78d7230092', 'object': 'model_permission', 'created': 1720560692, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view': True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}]}]}
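Taking `models['data'][0]['id']` works here because the deployment serves a single model. If a deployment lists more than one entry, it can help to collect every `id` first. Below is a small sketch of that; `list_model_ids` is a hypothetical helper, and the sample dictionary is a shortened copy of the output above rather than a full server response.

```python
def list_model_ids(models_response: dict) -> list[str]:
    """Collect the 'id' of every entry in a /v1/models response."""
    return [
        entry["id"]
        for entry in models_response.get("data", [])
        if "id" in entry
    ]


# Shortened copy of the response printed above (most fields omitted).
sample = {
    "object": "list",
    "data": [{"id": "meta/llama3-8b-instruct", "object": "model"}],
}
print(list_model_ids(sample))  # ['meta/llama3-8b-instruct']
```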


## Sending Requests

There are two different APIs for sending requests to NIM: the Chat Completions API and the plain Completions API. Both follow the OpenAI API specification. Let's see some examples of sending requests, first using the Python `requests` library, and then using the Python `OpenAI` client library:

### Chat Completions API <a name="openai-chat"></a>


The OpenAI Chat API supports the following parameters:
- messages
- model
- frequency_penalty
- max_tokens
- n = 1
- stop
- stream
- temperature
- top_p
- logprobs
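One way to keep request bodies consistent with the list above is to build them through a small function that rejects unknown keys and drops unset options. This is only a sketch: `build_chat_payload` is a hypothetical helper, and its allowed set simply mirrors the parameters listed here; the server itself may accept more.

```python
from typing import Any

# Optional parameters taken from the list above.
ALLOWED_OPTIONS = {
    "frequency_penalty", "max_tokens", "n", "stop",
    "stream", "temperature", "top_p", "logprobs",
}


def build_chat_payload(model: str, messages: list[dict], **options: Any) -> dict:
    """Assemble a Chat Completions request body, omitting options set to None."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"Unsupported parameters: {sorted(unknown)}")
    payload: dict = {"model": model, "messages": messages}
    payload.update({key: value for key, value in options.items() if value is not None})
    return payload


payload = build_chat_payload(
    "meta/llama3-8b-instruct",
    [{"role": "user", "content": "Write a joke about deep learning."}],
    max_tokens=100,
    temperature=1,
    stop=None,  # left unset, so it is not sent
)
print(sorted(payload))  # ['max_tokens', 'messages', 'model', 'temperature']
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`.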

In [4]:
import requests

endpoint = 'http://localhost:8000/v1/chat/completions'

headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Write a joke about deep learning."}
]

data = {
    'model': model_name,
    'messages': messages,
    'max_tokens': 100,
    'temperature': 1,
    'n': 1,
    'stream': False,
    'stop': 'string',  # generation halts if the model emits this literal text
    'frequency_penalty': 0.0
}

response = requests.post(endpoint, headers=headers, json=data)
print(response.json()['choices'][0]['message']['content'])

Why did the deep learning model go to therapy?

Because it was feeling a little "overfit" and was struggling to "generalize" its thoughts!


In [5]:
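from openai import OpenAI

# The OpenAI client requires an api_key value; the placeholder works for this local deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# The model receives only the URL text in the prompt; it does not fetch the page.
blog_url = "https://www.datasciencewithmarco.com/blog/softs-the-latest-innovation-in-time-series-forecasting"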
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": f"Summarize the following blog article at this URL: {blog_url}"}
]
chat_response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    max_tokens=100,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message.content)


Based on the article "SOFTS: The Latest Innovation in Time Series Forecasting" by Marco Rosa, here is a summary:

**What is Softs?**

Softs is a novel method for time series forecasting that combines various techniques to improve predictive accuracy. The term "SOFTS" stands for Scalable Online Forest of Time Series Sub-models.

**Challenges in Time Series Forecasting**

Traditional time series forecasting methods often struggle with non-stationarity, non-normality, and
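The summary above stops mid-sentence, most likely because the request set `max_tokens=100`. Under the OpenAI specification, each choice carries a `finish_reason` field that is `"length"` when the token limit cut the output short and `"stop"` when generation ended naturally or hit a stop sequence. The sketch below checks that field on a JSON response; `finish_status` is a hypothetical helper, and the example dictionary is hand-written for illustration, not captured from the server.

```python
def finish_status(response_json: dict) -> str:
    """Classify the first choice of a chat completion by its finish_reason."""
    reason = response_json["choices"][0].get("finish_reason")
    if reason == "length":
        return "truncated"  # stopped because max_tokens was reached
    if reason == "stop":
        return "complete"   # natural end of output or a stop sequence
    return f"other ({reason})"


# Hand-written example shaped like a chat completion response.
example = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Traditional methods often struggle with"},
            "finish_reason": "length",
        }
    ]
}
print(finish_status(example))  # truncated
```

With the `OpenAI` client, the same value is available as `chat_response.choices[0].finish_reason`.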


We can also perform streaming inference with the OpenAI library:

In [6]:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Who is better: ChatGPT, Claude, or Llama?"}
]
chat_response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=messages,
    max_tokens=1024,
    stream=True
)


# Print each text delta as it arrives; some chunks (such as the final one) carry no text.
for chunk in chat_response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")



A question that asked in the minds of many AI enthusiasts!

ChatGPT, Claude, and LLaMA are all highly advanced language models developed by Meta AI (formerly Facebook AI). While they share many similarities, each model has its unique strengths and is suited for specific tasks. Here's a brief rundown:

1. **ChatGPT**: ChatGPT is a transformer-based model designed for conversational dialogue tasks. It's trained on a massive dataset of text from the internet and can generate human-like responses to user input. ChatGPT excels at tasks like chatbot conversations, text classification, and language translation.
2. **Claude**: Claude is a large-scale language model similar to ChatGPT. However, it's specifically designed for more complex conversational tasks, such as storytelling, debate, and creative writing. Claude has been trained on a vast corpus of text and can generate longer, more cohesive responses.
3. **LLaMA**: LLaMA is a text-generation model that's particularly good at producing lon