# (Limits to) reproducibility with OpenAI's models

| Authors | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2024-03-25 |

Generating chat completions with OpenAI's models is non-deterministic, meaning that model outputs may differ from request to request.
This is problematic for using OpenAI's and other closed-source models in research (cf. Palmer et al. [2024](https://doi.org/10.1038/s43588-023-00585-1)), because it hinders *reproducibility*, that is, the ability to regenerate the reported results of a study using the same data and methods.

In November 2023, OpenAI announced measures to remedy this ([source](https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter)):

1. Users can set the `seed` parameter to an integer value in their request.
2. Users can extract the `system_fingerprint` field from the response, which identifies the current combination of model weights, infrastructure, and other configuration options used by OpenAI's servers.

If all other parameters (prompt, `temperature`, `top_p`, etc.) are held constant across requests, one should obtain consistent responses *if* the requests are served by the same system, as indicated by an identical system fingerprint.

> If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical
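
The condition in this quote can be written down as a small check on two logged requests. This is only a sketch: the record layout (`params`, `system_fingerprint`) and the function name `same_conditions` are assumptions made for illustration and are not part of the OpenAI client, and the fingerprint values are placeholders.

```python
def same_conditions(a: dict, b: dict) -> bool:
    """Return True if two logged requests share all request parameters
    (including the seed) and the same system fingerprint."""
    return (
        a["params"] == b["params"]
        # treat a missing fingerprint as "not comparable" (assumption)
        and a["system_fingerprint"] is not None
        and a["system_fingerprint"] == b["system_fingerprint"]
    )


# hypothetical request parameters, held constant across requests
params = {
    "model": "gpt-4-turbo-preview",
    "seed": 42,
    "temperature": 0.0,
    "top_p": 0.0,
    "max_tokens": 20,
}

# placeholder fingerprints for illustration
first = {"params": params, "system_fingerprint": "fp_aaaaaaaaaa"}
second = {"params": dict(params), "system_fingerprint": "fp_bbbbbbbbbb"}

print(same_conditions(first, first))   # True
print(same_conditions(first, second))  # False
```

Only when such a check passes does OpenAI's statement suggest that outputs will "mostly" be identical.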

The **problems**:

- Changes to the numerical configuration of OpenAI's infrastructure happen a few times a year &rArr; as time passes, it becomes increasingly unlikely that you will receive the same system fingerprint.
- Even consecutive requests are not guaranteed to be served by the same system 🤷‍♂️
- Even if you know the system fingerprint and distribute it with your replication materials, there is currently no way of passing the fingerprint when making an API request, so you cannot ask for the system that served the original request 🤷‍♂️
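
Given these limits, about the most one can do is record the fingerprint together with the request parameters and the raw response, so that a later replication attempt can at least tell whether it ran on a different system. The following is a minimal sketch under that assumption. The helper `make_record` and the record layout are hypothetical, and the example values are placeholders.

```python
import json


def make_record(params: dict, system_fingerprint, content: str) -> str:
    """Serialize one request/response pair as a single JSON line."""
    record = {
        "params": params,                          # everything sent with the request
        "system_fingerprint": system_fingerprint,  # as reported in the response
        "content": content,                        # raw, unprocessed model output
    }
    return json.dumps(record, sort_keys=True, ensure_ascii=False)


# placeholder values for illustration
line = make_record(
    params={"model": "gpt-4-turbo-preview", "seed": 42, "temperature": 0.0, "top_p": 0.0},
    system_fingerprint="fp_aaaaaaaaaa",
    content="example response",
)

# the record can be appended to a .jsonl file and read back later
restored = json.loads(line)
print(restored["system_fingerprint"])
```

Storing the raw output rather than a cleaned version keeps small differences between runs visible.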

## Setup

In [1]:
import warnings
warnings.filterwarnings('ignore')

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
MODEL = 'gpt-4-turbo-preview'

import pandas as pd
# from tqdm.auto import tqdm
# tqdm.pandas()

# from sklearn.metrics import classification_report

In [3]:
SEED = 42  # https://medium.com/geekculture/the-story-behind-random-seed-42-in-machine-learning-b838c4ac290a

## Example

Let's request an answer to a question:

In [11]:
messages = chat_template = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, who won the Nobel Prize in Physics in 2021? Answer only with the name of the laureate."},
]

response = client.chat.completions.create(
    model=MODEL,
    messages=chat_template, 
    max_tokens=20,
    seed=SEED,
    temperature=0.0,
    top_p=0.0
)

In [12]:
print('system fingerprint:', response.system_fingerprint)
print('response:', response.choices[0].message.content)

system fingerprint: fp_166a8e22c6
response: Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi.


In [14]:
for _ in range(5):
    response = client.chat.completions.create(
        model=MODEL,
        messages=chat_template, 
        max_tokens=20,
        seed=SEED,
        temperature=0.0,
        top_p=0.0
    )
    print('system fingerprint:', response.system_fingerprint, '; response:', response.choices[0].message.content)


system fingerprint: fp_a7daf7c51e ; response: Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi.
system fingerprint: fp_a7daf7c51e ; response: Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi.
system fingerprint: fp_a7daf7c51e ; response: Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi
system fingerprint: fp_8cc6edbbd5 ; response: Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi
system fingerprint: fp_a7daf7c51e ; response: Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi


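As you can see, even when we make the same request with exactly the same parameters, we may still receive responses from different systems.
The upside is that, in this example, this doesn't affect the substance of the generated response: the answers differ only in whether they end with a period, and that difference also occurs between responses with the same fingerprint.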
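
A quick tally makes this comparison explicit. The sketch below copies the five (fingerprint, response) pairs from the loop output above and counts distinct fingerprints and distinct responses, both raw and after stripping a trailing period. The normalization rule is an assumption chosen for this example, not a general recipe.

```python
from collections import Counter

# (system_fingerprint, content) pairs copied from the loop output above
runs = [
    ("fp_a7daf7c51e", "Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi."),
    ("fp_a7daf7c51e", "Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi."),
    ("fp_a7daf7c51e", "Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi"),
    ("fp_8cc6edbbd5", "Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi"),
    ("fp_a7daf7c51e", "Syukuro Manabe, Klaus Hasselmann, and Giorgio Parisi"),
]

fingerprints = Counter(fp for fp, _ in runs)
raw = Counter(text for _, text in runs)
# assumption: a trailing period is not a substantive difference
normalized = Counter(text.strip().rstrip(".") for _, text in runs)

print(len(fingerprints), "distinct fingerprints")      # 2
print(len(raw), "distinct raw responses")              # 2
print(len(normalized), "distinct normalized responses")  # 1
```

The raw responses are not byte-identical, even among requests with the same fingerprint, but they agree once the trailing period is ignored.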

In [15]:
messages = chat_template = [
    {
        "role": "system", 
        "content": "You will be provided with a tweet, and your task is to classify its sentiment as positive, neutral, or negative."
    },
    {
        "role": "user", 
        "content": "I loved the new Batman movie"
    },
]

response = client.chat.completions.create(model='gpt-4', messages=chat_template)

response.choices[0].message.content

'Positive'