Completion endpoint returns same response repeatedly #1723

Closed
joshuaipwork opened this issue Feb 19, 2024 · 4 comments · Fixed by #1820

joshuaipwork commented Feb 19, 2024

LocalAI version:
LocalAI Release 2.8.0.

Environment, CPU architecture, OS, and Version:
Docker container running in Linux Mint 21.3 "Virginia".
Image built from 2.8.0 Dockerfile.
Ryzen 5 7600x (x86-64)

Describe the bug
The /v1/chat/completions endpoint appears to return a cached response over and over for the same prompt, even when the seed is changed. Changing the seed, temperature, min-p, top-k, and top-p has no effect; the same response, with an identical ID, is returned every time. This happens even when prompt_cache_ro and prompt_cache_all are both set to false.

To Reproduce
Send a request to the chat endpoint. A curl command is shown below, which may have to be adapted for your testing circumstances.

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "mixtral",
    "messages": [
        {
            "role": "system",
            "content": "You are a friendly, expressive, and curious chatbot who loves to engage in conversations and roleplays."
        },
        {
            "role": "user",
            "content": "How are you doing?"
        }
    ],
    "max_tokens": 1024,
    "temperature": 1.2,
    "seed": 2668776
}'

Now, change the seed. Observe that the endpoint returns an identical response.
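
For a quicker comparison, here is a minimal Python sketch (standard library only; it assumes LocalAI is reachable at http://localhost:8080 and serves the same "mixtral" model as above, with arbitrary example seeds) that sends the same prompt twice with different seeds and reports whether the ID and content come back identical:

# repro_seed.py - send the same chat prompt twice with different seeds and
# compare the responses. Assumes LocalAI at http://localhost:8080 and a model
# named "mixtral"; adjust the URL, model name, and seeds for your setup.
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"

def complete(seed):
    payload = {
        "model": "mixtral",
        "messages": [{"role": "user", "content": "How are you doing?"}],
        "max_tokens": 128,
        "temperature": 1.2,
        "seed": seed,
    }
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

a = complete(2668776)
b = complete(1234567)
print("same id:     ", a["id"] == b["id"])
print("same content:", a["choices"][0]["message"]["content"]
      == b["choices"][0]["message"]["content"])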

Expected behavior
Two requests to the same model with the same prompt but different sampling parameters should be able to return different results. LLM outputs are stochastic due to the sampling process, and parameters like temperature, seed, min-p, top-p, and top-k all act to introduce randomness between responses.

Logs
@lunamidori5 might have some, since they were helping me troubleshoot this problem.

Additional context
Here's the YAML file used when this problem was observed:

context_size: 4096
f16: true
gpu_layers: 2 
low_vram: false
mmap: true
name: gpt-14b-carly
no_mulmatq: false
prompt_cache_all: false
prompt_cache_ro: true
parameters:
  model: gpt-14b-carly.gguf
  top_k: 0.5
  top_p: 0.9
  n: 1
  RepeatPenalty: 1
  typical_p: 0.8
stopwords:
- user|
- assistant|
- system|
- <|im_end|>
- <|im_start|>
template:
  chat: localai-chat
  chat_message: localai-chatmsg
threads: 14
@holzmichlnator

I noticed this behaviour too while working with LocalAI. I also tried setting the prompt_cache_all option to false, but this didn't change anything. Setting temperature, top_k, top_p, and seed to different values doesn't change anything either. The strangest part is that it keeps generating the same output even after recreating the whole container, except that the ID is now different.
I also tried modifying the prompt slightly, for example by adding an "!" at the end, but even then the answer is identical.

I'm using localai/localai:v2.9.0-cublas-cuda12

mudler commented Mar 8, 2024

While I'm having a look at this - the options seem to be passed just fine up to the gRPC server and llama.cpp.

However, even when I set slot.params.seed = time(NULL); right before we configure the slot ready to be processed

llama_set_rng_seed(ctx, slot->params.seed);
(as llama.cpp does as well), it seems to be completely ignored.

Also to note, this is the full JSON data printed out (and it does indeed look like seed, top_k, and top_p are set accordingly):

12:36PM DBG GRPC(c0c3c83d0ec33ffe925657a56b06771b-127.0.0.1:40341): stdout {"cache_prompt":false,"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"mirostat":0,"mirostat_eta":0.0,"mirostat_tau":0.0,"n_keep":0,"n_predict":-1,"penalize_nl":false,"presence_penalty":0.0,"prompt":"Instruct: tell me a story about llamas\nOutput:\n","repeat_last_n":0,"repeat_penalty":0.0,"seed":-1,"stop":[],"stream":false,"temperature":0.20000000298023224,"tfs_z":0.0,"top_k":40,"top_p":0.949999988079071,"typical_p":0.0}

so it definitely looks like something is off in the llama.cpp implementation, as ours is just a gRPC wrapper on top of the http example (with a few edits to avoid bugs like #1333).
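
As a side note for anyone comparing runs, here is a small Python sketch (illustrative only, not part of LocalAI) that pulls the sampling options out of a DBG payload like the one above, to quickly spot which ones are still at their zero/default values:

# inspect_payload.py - print the sampling options from a LocalAI DBG payload.
# Illustrative only: paste the JSON portion of the log line into `raw`.
import json

raw = (
    '{"cache_prompt":false,"frequency_penalty":0.0,"grammar":"","ignore_eos":false,'
    '"mirostat":0,"mirostat_eta":0.0,"mirostat_tau":0.0,"n_keep":0,"n_predict":-1,'
    '"penalize_nl":false,"presence_penalty":0.0,'
    '"prompt":"Instruct: tell me a story about llamas\\nOutput:\\n",'
    '"repeat_last_n":0,"repeat_penalty":0.0,"seed":-1,"stop":[],"stream":false,'
    '"temperature":0.20000000298023224,"tfs_z":0.0,"top_k":40,"top_p":0.949999988079071,'
    '"typical_p":0.0}'
)

opts = json.loads(raw)
for key in ("seed", "temperature", "top_k", "top_p", "mirostat", "typical_p", "repeat_penalty"):
    print(f"{key:>15}: {opts[key]}")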

mudler commented Mar 11, 2024

Ok, tracing the seed seemed to be just a red herring and dragged me in the wrong direction. Tracing back the usage, it looks like it happens because the sampler doesn't have enough candidates to select tokens from.

I've tried switching the sampler with phi-2 and finally got a non-deterministic result. It looks to me like it depends very much on the model/sampler strategy - mirostat can get more candidates, while the temperature sampler has fewer to select from (so it is more deterministic).

E.g. with phi-2:

name: phi-2
context_size: 2048
f16: true
gpu_layers: 90
mmap: true
trimsuffix:
- "\n"
parameters:
  model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
  temperature: 1.0
  top_k: 40
  top_p: 0.95
  seed: -1
mirostat: 2
mirostat_eta: 1.0
mirostat_tau: 1.0
template:
  chat: &template |
    Instruct: {{.Input}}
    Output:
  completion: *template

usage: |
      To use this model, interact with the API (in another terminal) with curl for instance:
      curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
          "model": "phi-2",
          "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
      }'
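
To illustrate the candidate point above with a toy example (a standalone Python sketch, not LocalAI's or llama.cpp's actual sampler): when the filtered distribution collapses to a single candidate, the seed has nothing left to influence, while a flatter distribution leaves several candidates and different seeds can pick different tokens.

# toy_sampler.py - toy illustration of why a seed only matters when the
# sampler still has several candidates to choose from. Not LocalAI code.
import random

def sample(probs, top_p, seed):
    # keep the smallest set of tokens whose cumulative probability >= top_p
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        total += p
        if total >= top_p:
            break
    rng = random.Random(seed)
    toks, weights = zip(*kept)
    return rng.choices(toks, weights=weights)[0]

# a very peaked distribution: only one candidate survives, so the seed is irrelevant
peaked = {"llama": 0.97, "alpaca": 0.02, "goat": 0.01}
print([sample(peaked, top_p=0.9, seed=s) for s in (1, 2, 3)])  # same token every time

# a flatter distribution: several candidates survive, so different seeds can diverge
flat = {"llama": 0.4, "alpaca": 0.35, "goat": 0.25}
print([sample(flat, top_p=0.9, seed=s) for s in (1, 2, 3)])    # picks may differ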

mudler added a commit that referenced this issue Mar 11, 2024
The default sampler on some models doesn't return enough candidates, which
leads to a false sense of randomness. Tracing back the code, it looks
like with the temperature sampler there might not be enough
candidates to pick from, and since the seed and "randomness" take effect
while picking a good candidate, this yields the same results over and
over.

Fixes #1723 by updating the
examples and documentation to use mirostat instead.

mudler commented Mar 11, 2024

There were inconsistencies with the docs; I also updated the samples in #1820. If the issue persists, feel free to re-open.

mudler added a commit that referenced this issue Mar 12, 2024
This changeset aims to have better defaults and to properly detect when
no inference settings are provided with the model.

If not specified, we default to mirostat sampling and offload all the
GPU layers (if a GPU is detected).

Related to #1373 and #1723
mudler added a commit that referenced this issue Mar 13, 2024
* fix(defaults): set better defaults for inferencing

This changeset aims to have better defaults and to properly detect when
no inference settings are provided with the model.

If not specified, we default to mirostat sampling and offload all the
GPU layers (if a GPU is detected).

Related to #1373 and #1723

* Adapt tests

* Also pre-initialize default seed