# NVIDIA AI Foundation Endpoints

The `NVIDIA` and `ImageGenNVIDIA` classes are LLM-inheriting connectors that interface with the [NVIDIA AI Foundation Endpoints](https://www.nvidia.com/en-us/ai-data-science/foundation-models/).


> [NVIDIA AI Foundation Endpoints](https://www.nvidia.com/en-us/ai-data-science/foundation-models/) give users easy access to NVIDIA hosted API endpoints for NVIDIA AI Foundation Models like Mixtral 8x7B, Llama 2, Stable Diffusion, etc. These models, hosted on the [NVIDIA NGC catalog](https://catalog.ngc.nvidia.com/ai-foundation-models), are optimized, tested, and hosted on the NVIDIA AI platform, making them fast and easy to evaluate, further customize, and seamlessly run at peak performance on any accelerated stack.
> 
> With [NVIDIA AI Foundation Endpoints](https://www.nvidia.com/en-us/ai-data-science/foundation-models/), you can get quick results from a fully accelerated stack running on [NVIDIA DGX Cloud](https://www.nvidia.com/en-us/data-center/dgx-cloud/). Once customized, these models can be deployed anywhere with enterprise-grade security, stability, and support using [NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/).
> 
> These models can be easily accessed via the [`langchain-nvidia-ai-endpoints`](https://pypi.org/project/langchain-nvidia-ai-endpoints/) package, as shown below.

This example goes over how to use LangChain to interact with and develop LLM-powered systems using the publicly-accessible AI Foundation endpoints.

## Installation

In [1]:
# %pip install --upgrade --quiet langchain-nvidia-ai-endpoints

## Setup

**To get started:**

1. Create a free account with the [NVIDIA NGC](https://catalog.ngc.nvidia.com/) service, which hosts AI solution catalogs, containers, models, etc.

2. Navigate to `Catalog > AI Foundation Models > (Model with API endpoint)`.

3. Select the `API` option and click `Generate Key`.

4. Save the generated key as `NVIDIA_API_KEY`. From there, you should have access to the endpoints.

In [2]:
import getpass
import os

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

## Accessing Raw LLM Endpoints

Although they are becoming less common in hosted services, raw LLM interfaces offer the most flexibility with regard to prompting and output sampling. Unlike chat models, which operate on `List[ChatMessages] -> AIMessage` logic, raw endpoints expect a "prompt" string as input and return "text" as output.
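
To make the difference between the two contracts concrete, here is a minimal sketch using toy stand-ins. `Message`, `toy_chat_model`, and `toy_raw_llm` are hypothetical names invented for illustration; they are not part of LangChain or the NVIDIA connector, and no endpoint is called.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    """Hypothetical stand-in for a chat message (role plus content)."""

    role: str
    content: str


def toy_chat_model(messages: List[Message]) -> Message:
    """Chat-style contract: a list of messages in, one AI message out."""
    last_user = messages[-1].content
    return Message(role="ai", content=f"(reply to: {last_user})")


def toy_raw_llm(prompt: str) -> str:
    """Raw-completion contract: a prompt string in, continuation text out."""
    return "\n    pass  # toy continuation"


chat_out = toy_chat_model([Message("user", "Write fizzbuzz")])
raw_out = toy_raw_llm("def fizzbuzz(n):")

# The chat model returns a structured message; the raw LLM returns plain text.
print(type(chat_out).__name__, "|", type(raw_out).__name__)
```

With a raw endpoint, the caller controls the exact text the model continues from, which is why the examples in this section end their prompts mid-code-block.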

In [3]:
## Core LC LLM Interface
from langchain_nvidia_ai_endpoints.llm import NVIDIA

NVIDIA.get_available_models("nvcf")

[Model(id='ai-mixtral-8x22b', model_type='completion'),
 Model(id='playground_starcoder2_15b', model_type='completion')]

As we can see, the publicly available completion endpoints listed above include `starcoder2_15b` (shown as `playground_starcoder2_15b`), which we use for the rest of this example. [**`StarCoder2`**](https://github.com/bigcode-project/starcoder2) is a model developed by [BigCode](https://www.bigcode-project.org) in collaboration with NVIDIA, and you can read more about the model in the [related blog post](https://developer.nvidia.com/blog/unlock-your-llm-coding-potential-with-starcoder2/). The following cell instantiates the model:

In [4]:
from langchain_nvidia_ai_endpoints.llm import NVIDIA

starcoder = NVIDIA(model="starcoder2_15b")

print(starcoder.invoke("Here is my implementation of fizzbuzz:\n```python\n", stop=["```"]))
# for txt in starcoder.stream("Here is my implementation of fizzbuzz:\n```python\n", stop=["```"]):
#     print(txt, end="")

def fizzbuzz(n):
    for i in range(1, n+1):
        if i % 3 == 0 and i % 5 == 0:
            print("FizzBuzz")
        elif i % 3 == 0:
            print("Fizz")
        elif i % 5 == 0:
            print("Buzz")
        else:
            print(i)
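
The `stop` argument used above ends generation at the closing code fence, which is why the output contains only the function body. As a rough illustration of the effect (not of how the endpoint or the connector implements it), the sketch below cuts a string at the earliest occurrence of any stop sequence. `truncate_at_stop` is a hypothetical helper written for this example.

```python
from typing import List

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def truncate_at_stop(text: str, stop: List[str]) -> str:
    """Cut `text` at the earliest occurrence of any non-empty stop string."""
    cut = len(text)
    for s in stop:
        idx = text.find(s) if s else -1  # ignore empty stop strings
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


raw = "def fizzbuzz(n):\n    ...\n" + FENCE + "\nSome trailing commentary"
print(truncate_at_stop(raw, stop=[FENCE]))
```

Everything from the first stop sequence onward is dropped, so the trailing commentary never reaches the caller.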


Reading through the [related publication](https://arxiv.org/abs/2402.19173), we can see that StarCoder2 is especially suited for code completion, so a message-based interface is suboptimal in this case. Instead, it may be more useful to prompt it with the appropriate tags (e.g. `<jupyter_code>`, `<code_to_intermediate>`, and `<intermediate_to_code>`):

In [5]:
from langchain_nvidia_ai_endpoints.llm import NVIDIA

starcoder = NVIDIA(model="starcoder2_15b", frequency_penalty=2, max_tokens=64)

print("Priors from <jupyter_code>")
for txt in starcoder.stream("<jupyter_code>\n", stop=["</jupyter_code>"]):
    print(txt, end="")

print("\n\nPriors from <code_to_intermediate>")
for txt in starcoder.stream("<code_to_intermediate>\n", stop=["</code_to_intermediate>"]):
    print(txt, end="")

print("\n\nPriors from <intermediate_to_code>")
for txt in starcoder.stream("<intermediate_to_code>\n", stop=["</intermediate_to_code>"]):
    print(txt, end="")

Priors from <jupyter_code>
# 1. Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 2. Read the data as a data frame
df = pd.read_csv("https://raw.githubusercontent.com/insaid20

Priors from <code_to_intermediate>
define void @_QQmain() local_unnamed_addr {
  %1 = alloca i32, align 4
  %2 = alloca i32, align 4
  %3 = tail call ptr @_FortranAioBeginExternalListInput(i32 -1, ptr nonnull @_QQ

Priors from <intermediate_to_code>
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <sstream>
#include <queue>
#include <deque>
#include <bitset>
#include <iterator>
#include <list>
#include

### Stream, Batch, and Async

These models natively support streaming and expose a batch method to handle concurrent requests, as well as async methods for invoke, stream, and batch. Below are a few examples.

In [6]:
print(starcoder.batch(["What's 2*3?", "What's 2*6?"]))
# Or via the async API
# await starcoder.abatch(["What's 2*3?", "What's 2*6?"])

['## 00:00:00.000000000\n\n* <NAME>: [MUSIC PLAYING]\n\n## 00:00:03.000000000\n\n* <NAME>:', "What's 2*7?\n\nWhat's 2*8?\n\nWhat's 2*9?\n\nWhat's 2*10?\n\nWhat's 3*1?\n\nWhat's 3*2?\n\nWhat's 3*3?"]


In [7]:
for chunk in starcoder.stream("How far can a seagull fly in one day?"):
    # Show the token separations
    print(chunk, end="|")

A| se|ag|ull| can| fly| 2|0| km| in| one| hour|.| How| far| can| it| fly| in| one| day|?|
•| A|ircraft|
The| plane| f|lies| at| an| alt|itude| of| 6|5|0|0| m| above| the| ground| at| speed| 7|7|7| km|/|h|.| At| what| alt|itude| will| the||

In [8]:
async for chunk in starcoder.astream(
    "How long does it take for monarch butterflies to migrate?"
):
    print(chunk, end="|")

Mon|arch| butter|f|lies| migrate| from| M|ex|ico| to| California| every| year|.| They| spend| about| 1|0| months| in| M|ex|ico| and| 1|0| months| in| California|.|

###| Example| Question| #|1| :| How| Long| Does| It| Take| For| Mon|arch| B|utter|f|lies| To| Migrate|?|

How| long| does||
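
The relationship between streaming, async streaming, and batching can be sketched without calling any endpoint. The example below uses only the standard library; `toy_astream`, `collect`, and `toy_abatch` are hypothetical stand-ins that illustrate the pattern and are not the connector's implementation. It shows that joining streamed chunks reproduces the full completion, and that several prompts can run concurrently while results keep their input order.

```python
import asyncio
from typing import AsyncIterator, List


async def toy_astream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async token stream."""
    for token in ["Mon", "arch", " butter", "flies"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def collect(prompt: str) -> str:
    """Join streamed chunks back into the full completion text."""
    return "".join([chunk async for chunk in toy_astream(prompt)])


async def toy_abatch(prompts: List[str]) -> List[str]:
    """Run several prompts concurrently; results keep input order."""
    return list(await asyncio.gather(*(collect(p) for p in prompts)))


results = asyncio.run(toy_abatch(["prompt one", "prompt two"]))
print(results)
```

Inside a notebook, where an event loop is already running, `await toy_abatch([...])` would replace the `asyncio.run(...)` call.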

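## Using with OpenAI

The same connector can also target OpenAI's completion endpoints by switching it into `openai` mode, as the cells below show.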

In [9]:
from langchain_nvidia_ai_endpoints.llm import NVIDIA
import getpass  # keep the module import so it does not shadow the earlier `import getpass`
import os

if not os.environ.get("OPENAI_API_KEY", "").startswith("sk-"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

llm = NVIDIA().mode("openai")
llm.available_models

[Model(id='babbage-002', model_type='completion'),
 Model(id='davinci-002', model_type='completion'),
 Model(id='gpt-3.5-turbo-instruct-0914', model_type='completion'),
 Model(id='gpt-3.5-turbo-instruct', model_type='completion')]

In [17]:
llm = NVIDIA(model="davinci-002").mode("openai")

print("Trying Out Invoke")
print(llm.invoke("Here is my implementation of fizzbuzz with minimal documentation:\n```python\n", stop=["```"], max_tokens=300))

print("\nTrying Out Streaming")
starter = "Here is my simple implementation of fizzbuzz in python with minimal documentation:\n```\ndef fizzbuzz(n):\n    #"
print(starter, end="")
for txt in llm.stream(starter, stop=["```"], max_tokens=300):
    print(txt, end="")

Trying Out Invoke
[visited: 212] import random
[visited: 213] def testme(i):
[visited: 214]     if no_nums(i):
[visited: 215]         return "FizzBuzz"
[visited: 216]     else:
[visited: 220]         return int(i % 3 == 0 and i % 5 == 0)
[visited: 221] def no_nums(i):
[visited: 222]     return True if i < 40 \
[visited: 247]         else return False
[visited: 248] for i in range(1, 41):
[visited: 249]     yield testme(i)
[visited:                             ]unittest.main


Trying Out Streaming
Here is my simple implementation of fizzbuzz in python with minimal documentation:
```
def fizzbuzz(n):
    # compute count packed by whole numbers
    count = 0
    while count < n:
        if count % 5 == 0:
            ans = "fizzbuzz"
        if count % 3 == 0:
            ans = "fizz"
        if count % 2 == 0:
            ans = "buzz"
        count += 1
        return printf("%d: %s%rs", count, ans)
    
    
    printf("%d:", n)
    return ""
    
    
def printf(str):
    # print a str

In [18]:
llm.client.last_inputs

{'url': 'https://api.openai.com/v1/completions',
 'headers': {'Accept': 'text/event-stream',
  'content-type': 'application/json',
  'Authorization': SecretStr('**********'),
  'User-Agent': 'langchain-nvidia-ai-endpoints'},
 'json': {'prompt': 'Here is my simple implementation of fizzbuzz in python with minimal documentation:\n```\ndef fizzbuzz(n):\n    #',
  'model': 'davinci-002',
  'stream': True,
  'stop': ['```'],
  'max_tokens': 300},
 'stream': True}
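
The `last_inputs` dictionary above shows how the call arguments map onto the HTTP request: `stop` and `max_tokens` land in the JSON body next to the prompt and model name. The sketch below rebuilds a dictionary of the same shape. `build_completion_request` is a hypothetical helper written for this example, not the connector's code, and the non-streaming `Accept` value and the placeholder `Authorization` string are assumptions.

```python
from typing import Any, Dict, List, Optional


def build_completion_request(
    prompt: str,
    model: str,
    stop: Optional[List[str]] = None,
    max_tokens: Optional[int] = None,
    stream: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the structure of `last_inputs`."""
    payload: Dict[str, Any] = {"prompt": prompt, "model": model, "stream": stream}
    if stop is not None:
        payload["stop"] = stop
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return {
        "url": "https://api.openai.com/v1/completions",
        "headers": {
            # Assumption: plain JSON is requested when not streaming.
            "Accept": "text/event-stream" if stream else "application/json",
            "content-type": "application/json",
            # Placeholder only; a real key should never be printed or logged.
            "Authorization": "Bearer <redacted>",
        },
        "json": payload,
        "stream": stream,
    }


request = build_completion_request(
    "def fizzbuzz(n):\n    #",
    model="davinci-002",
    stop=["STOP"],
    max_tokens=300,
    stream=True,
)
print(request["json"])
```

Inspecting the request this way is a quick check that sampling arguments were forwarded as intended.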