
Some inferences take forever to complete #450

Open
gaspard-dv opened this issue Dec 19, 2023 · 5 comments
Labels
enhancement, optimization (Related to performance optimizations), structured generation (Linked to structured generation)

Comments


gaspard-dv commented Dec 19, 2023

Issue description

The issue was raised by other people on Discord too.

To quote one of them:

I'm running the same query 10 times (with equivalent prompts and output sizes), but some inferences are taking abnormally longer than others.

their screenshot: [image]

Repro

I made a reproduction code snippet that can run in Google Colab (w/ free T4 GPU):

💻 Code snippet
# Install dependencies first (shell): pip install outlines==0.0.13 transformers datasets optimum auto-gptq accelerate
from outlines import models
from outlines.text.generate import json, continuation
from json import dumps
from time import perf_counter
import torch


prompt = """<|system|>
You are a friendly AI assistant.
You're specialized in mathematics and open source Github repositories.
Your answers must be concise and factual.</s>
<|user|>
Write a very long poem</s>
<|assistant|>
"""
output_format = {
    "type": "object",
    "properties": {
        "poem": {"type": "string"}
    }
}
model = models.transformers("TheBloke/zephyr-7B-beta-GPTQ", device="cuda")
rng = torch.Generator(device="cuda")
rng.manual_seed(789001)

errors = []
for i in range(20):
  start_time = perf_counter()
  try:
    sequence = json(model, dumps(output_format))(prompt, rng=rng)
    poem = sequence.get('poem')
    elapsed_time = round(perf_counter() - start_time)
    n_characters_per_second = len(poem) // elapsed_time
    print(f"{i}\t{elapsed_time}\t{n_characters_per_second}\t{poem[:30]}..")
  except Exception as e:
    errors.append(e)
    elapsed_time = round(perf_counter() - start_time)  # recompute so the failure time isn't stale from the previous iteration
    print(f"{i}\t{elapsed_time}\tINFERENCE FAILED")
📃 Output
0	14	76	In the vastness of cosmic spac..
1	14	INFERENCE FAILED
2	769	0	In this universe, a vast expan..
3	389	0	In ancient lands, where skies ..
4	16	67	In the depths of the cosmos, w..
5	35	70	In the stillness of the mornin..
6	32	60	In a universe vast and unceasi..
7	13	77	75000 lines of blank verse, hi..
8	22	69	In a land of purest light, Who..
9	34	59	A cosmic dance of stars, a sym..
10	49	68	In the land of the digit, wher..
11	34	78	In a world vast and unknown,  ..
12	43	68	There was a time when words we..
13	54	70	In a world where chaos reigns..
14	12	62	Let the words unfurl like the ..
15	330	0	Infinity beckons from the far ..
16	31	60	In the depths of the universe,..
17	137	0	In this vast expanse of time a..
18	32	81	in this universe vast and unfa..
💥 Exceptions raised
import traceback

for error in errors:
    try:
        raise error
    except Exception as e:
        traceback.print_exc()

Traceback (most recent call last):
  File "<ipython-input-6-d8471672a411>", line 5, in <cell line: 3>
    raise error
  File "<ipython-input-5-1a425bb0404a>", line 8, in <cell line: 5>
    sequence = json(model, dumps(output_format))(prompt, rng=rng)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/outlines/text/generate/sequence.py", line 240, in __call__
    result = self.postprocess_completions(result)
  File "/usr/local/lib/python3.10/dist-packages/outlines/text/generate/regex.py", line 226, in postprocess_completions
    return [self.format_fn(result) for result in results]
  File "/usr/local/lib/python3.10/dist-packages/outlines/text/generate/regex.py", line 226, in <listcomp>
    return [self.format_fn(result) for result in results]
  File "/usr/local/lib/python3.10/dist-packages/outlines/text/generate/regex.py", line 397, in <lambda>
    format_fn = lambda x: pyjson.loads(x)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 2 column 570 (char 571)

Results

  • 14 inferences succeeded fast
  • 5 inferences succeeded but were extremely slow (indices: 2, 3, 15, 17, 19)
  • 💥 1 inference failed fast (index: 1)

Outlines/Python version information:

Outlines 0.0.13
Python 3.10.12
gaspard-dv added the bug label Dec 19, 2023
rlouf (Member) commented Dec 19, 2023

Thank you so much for the detailed report! Will come back to you shortly.

rlouf added the enhancement, optimization, and structured generation labels and removed the bug label Dec 19, 2023
brandonwillard (Contributor) commented Dec 20, 2023

These timing results contain significant non-inference setup steps (e.g. json(model, dumps(output_format))).

gaspard-dv (Author) commented
Yes indeed!
json(model, dumps(output_format)) takes a few seconds to complete and shouldn't be in the for-loop.
But this is not the step that gets "stuck".
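
For context, a minimal sketch of that change, reusing the names from the repro snippet above (0.0.13-era API; adjust to your installed version), so that only the generation call itself is timed:

generator = json(model, dumps(output_format))  # built once, outside the loop; takes a few seconds

for i in range(20):
    start_time = perf_counter()
    sequence = generator(prompt, rng=rng)  # only inference is inside the timer now
    print(f"{i}\t{round(perf_counter() - start_time)}")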

rlouf (Member) commented Jan 7, 2024

It would still be nice to have results without it in the loop, and to use cProfile to understand which step "gets stuck". To get comparable experimental conditions I would also use the maxLength field constraint.
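
A minimal profiling sketch along those lines, reusing the hoisted generator and the prompt/rng names from the repro snippet above:

import cProfile
import pstats

# Profile a single generation call; setup stays outside the profiled block.
with cProfile.Profile() as profiler:
    generator(prompt, rng=rng)

# Show the 20 most expensive calls by cumulative time to see where a slow run spends it.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)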

lapp0 (Collaborator) commented May 9, 2024

Please try

from pydantic import BaseModel

class OutputModel(BaseModel):
    poem: str

And pass OutputModel instead of output_format. This ensures the schema includes 'required': ['poem'], so no generation can omit the poem key.

Additionally, you will need to set whitespace_pattern as explained here #690 (comment)

json(model, dumps(output_format), whitespace_pattern=r"[ ]?")...

With these changes, your script works for me and doesn't produce any slow or failed inferences.
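
For reference, a sketch of both suggestions combined. This assumes a newer outlines release where outlines.generate.json accepts a Pydantic model and the whitespace_pattern keyword (per #690), so the import paths differ from the 0.0.13 snippet above:

from pydantic import BaseModel
from outlines import models, generate

class OutputModel(BaseModel):
    poem: str  # required by default, so a generation cannot omit the key

model = models.transformers("TheBloke/zephyr-7B-beta-GPTQ", device="cuda")

# A single optional space between JSON tokens avoids runaway whitespace loops.
generator = generate.json(model, OutputModel, whitespace_pattern=r"[ ]?")
result = generator("Write a very long poem")  # returns an OutputModel instance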

rlouf pushed a commit that referenced this issue May 24, 2024
Fixes #839 #908 #690 #450

## Problem

A major problem, especially with smaller language models, is repetition.

For example, suppose a model generating JSON must emit 12 space tokens of indentation. The model will often assign a high probability to a 13th space token, then to a 14th, and eventually enter an infinite space-generation loop.

This problem has been known in NLG for half a decade, but it only has mitigations (mirostat, repetition penalty, using hundreds of billions of weights, etc.) and no absolute solution, except for **structured generation**.

## Solution

For structured JSON generation, we set a sane default whitespace pattern of `r"[ ]?"`. This removes all newlines and indentation; it disallows any syntactic whitespace beyond a single space separator.

Users can still set the `whitespace_pattern=` argument if they want different behavior.
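
For illustration, a hedged sketch of overriding that default, reusing the names from the sketch a few comments above (note that a broader pattern reintroduces some risk of whitespace loops):

# Permit newlines and a little indentation instead of the single-space default.
generator = generate.json(model, OutputModel, whitespace_pattern=r"[\n ]{0,4}")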