**Prompting** is the process of providing a partial input, usually text, to a model. As we discussed in the last chapter, the model then uses its parameterized data transformations to find a probable completion or output that matches the prompt.

To run any prompt through a model, we need to set a foundation for how we will access generative AI models and perform inference. There is a huge variety in the landscape of generative AI models in terms of size, access patterns, licensing, etc. However, a common theme is the usage of LLMs through a REST API, which is either:
- Provided by a closed third-party AI service (OpenAI, Anthropic, Cohere, etc.)
- Self-hosted in your own infrastructure, or in an account you control, on a platform that handles much of the infrastructure (e.g., Prediction Guard for security- or privacy-sensitive deployments, or OpenShift AI for general Kubernetes environments)
- Self-hosted using a model serving framework (TGI, vLLM, etc.)
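
Whichever option you choose, the request itself tends to look much the same. As a rough sketch, assuming an OpenAI-style `/completions` endpoint with bearer-token authentication, a request can be assembled like this without actually sending it. The base URL `https://llm.example.com/v1` and the helper name `build_completion_request` are placeholders for illustration, not a real service or library function:

```python
import json


def build_completion_request(base_url, api_key, model, prompt, max_tokens=100):
    """Assemble (url, headers, body) for an OpenAI-style completions call.

    Assumption: the server exposes POST {base_url}/completions and accepts
    a bearer token. Check your provider's docs for the exact path and fields.
    """
    url = base_url.rstrip("/") + "/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return url, headers, body


url, headers, body = build_completion_request(
    base_url="https://llm.example.com/v1",  # placeholder, not a real endpoint
    api_key="YOUR_API_KEY",
    model="Hermes-3-Llama-3.1-8B",
    prompt="Some of the best advice I can give is ",
)
print(url)
print(body)
```

Client libraries such as the one used below typically wrap this kind of request for you, so you rarely need to build it by hand.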

We will use [Prediction Guard](https://www.predictionguard.com/) to call open-access LLMs (like Llama 3.1, Mistral, DeepSeek, etc.) via a standardized OpenAI-like API. This will allow us to explore a broad range of available LLMs. It will also illustrate how companies can access models outside of the GPT family.

To "prompt" an LLM via Prediction Guard (and eventually engineer prompts), you first need to install the Python client and supply your API key as an environment variable:

# Install dependencies, imports

In [1]:
! pip install predictionguard

Collecting predictionguard
  Downloading predictionguard-2.7.0-py2.py3-none-any.whl.metadata (872 bytes)
Downloading predictionguard-2.7.0-py2.py3-none-any.whl (21 kB)
Installing collected packages: predictionguard
Successfully installed predictionguard-2.7.0


In [2]:
import os
import json

from predictionguard import PredictionGuard
from getpass import getpass

In [3]:
pg_access_token = getpass('Enter your Prediction Guard access api key: ')
os.environ['PREDICTIONGUARD_API_KEY'] = pg_access_token

Enter your Prediction Guard access api key: ··········
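
If you rerun the notebook often, or run the same code outside a notebook, you may prefer to prompt only when the variable is not already set. Here is a minimal sketch of that pattern. The helper name `load_api_key` and its `prompt_fn` parameter are hypothetical and are not part of the Prediction Guard client:

```python
import os
from getpass import getpass


def load_api_key(var_name="PREDICTIONGUARD_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        # Fall back to an interactive prompt and cache the value for later cells.
        key = prompt_fn(f"Enter a value for {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `load_api_key()` before creating the client leaves an existing `PREDICTIONGUARD_API_KEY` untouched and asks for input only on the first run.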


In [4]:
client = PredictionGuard()

# List available models

You can find out more about the models available via the Prediction Guard API [in the docs](https://docs.predictionguard.com/models).

In [6]:
client.chat.completions.list_models()

['deepseek-coder-6.7b-instruct',
 'Hermes-2-Pro-Llama-3-8B',
 'Hermes-2-Pro-Mistral-7B',
 'Hermes-3-Llama-3.1-70B',
 'Hermes-3-Llama-3.1-8B',
 'llava-1.5-7b-hf',
 'llava-v1.6-mistral-7b-hf',
 'neural-chat-7b-v3-3']

In [7]:
client.embeddings.list_models()

['bridgetower-large-itm-mlm-itc', 'multilingual-e5-large-instruct']
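
Because the listing comes back as a plain Python list of strings, ordinary list operations are enough to narrow it down. The sketch below hard-codes the chat model names printed above so that it runs offline. The helper `find_models` is a hypothetical name used only for illustration:

```python
# Model names copied from the chat-completions listing printed above.
chat_models = [
    "deepseek-coder-6.7b-instruct",
    "Hermes-2-Pro-Llama-3-8B",
    "Hermes-2-Pro-Mistral-7B",
    "Hermes-3-Llama-3.1-70B",
    "Hermes-3-Llama-3.1-8B",
    "llava-1.5-7b-hf",
    "llava-v1.6-mistral-7b-hf",
    "neural-chat-7b-v3-3",
]


def find_models(models, keyword):
    """Case-insensitive substring match over model names."""
    return [name for name in models if keyword.lower() in name.lower()]


llama_models = find_models(chat_models, "llama")
print(llama_models)
```

In a live session you could pass the result of `client.chat.completions.list_models()` instead of the hard-coded list.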

# Generate some text with an LLM



In [10]:
response = client.completions.create(
    model="Hermes-3-Llama-3.1-8B",
    prompt="Some of the best advice I can give is "
)

print(json.dumps(
    response,
    sort_keys=True,
    indent=4,
    separators=(',', ': ')
))

{
    "choices": [
        {
            "index": 0,
            "text": "1) don\u2019t put all of your eggs in one basket and 2) make sure you have a plan. I\u2019ve learned this through trial and error. With investing, diversification is key. Putting your money into a single stock or sector could leave you exposed to a lot of risk. If that stock or sector tanks, you could lose all of your investment. But by investing in a variety of stocks and sectors, you can spread out that risk. You could still lose money, of course"
        }
    ],
    "created": 1732278275,
    "id": "cmpl-a34f8090-f1d8-4ed6-817d-d4c54b6d784f",
    "model": "Hermes-3-Llama-3.1-8B",
    "object": "text_completion",
    "usage": {
        "completion_tokens": 0,
        "prompt_tokens": 0,
        "total_tokens": 0
    }
}
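
In practice you usually want only the generated text rather than the whole payload. Judging by the structure printed above, it sits under `choices[0]["text"]`. The sketch below uses a shortened copy of that response so it runs without an API call. The helper name `first_completion_text` is hypothetical:

```python
# Abbreviated copy of the response printed above (the text field is shortened).
response = {
    "choices": [
        {
            "index": 0,
            "text": "1) don\u2019t put all of your eggs in one basket and 2) make sure you have a plan.",
        }
    ],
    "model": "Hermes-3-Llama-3.1-8B",
    "object": "text_completion",
}


def first_completion_text(resp):
    """Return the text of the first choice, or an empty string if there is none."""
    choices = resp.get("choices") or []
    return choices[0]["text"] if choices else ""


print(first_completion_text(response))
```

Guarding against an empty `choices` list keeps downstream code from raising an `IndexError` if a request returns no completions.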
