High Latency for a simple CFG generation #617

Shivam-Srivastava · 2024-02-06T01:02:59Z

Shivam-Srivastava
Feb 6, 2024

Describe the issue as clearly as possible:

I was trying to use outlines with the CFG listed here along with the TinyLlama model on a GPU machine. The latency is very high even for a simple grammar like this. It takes 20+ seconds to generate the response for a simple prompt like "generate a random policy". Since Outlines is almost as fast as non-guided generation, this shouldn't be the case. Has the library been tested for larger Pydantic models/CFGs?

Steps/code to reproduce the bug:

# Install required libraries. Skipping that here. Also, run this on a GPU. On a CPU, this will take several minutes.
import outlines
import time
model = outlines.models.transformers("TinyLlama/TinyLlama-1.1B-Chat-v0.6", device="cuda")
policy_grammar = r'''
    start: policy
    policy : START_OBJECT _NL version_block _NL id_block _NL statement_block _NL END_OBJECT
    version_block: ("Version:\t") ("2008-10-17" | "2012-10-17")
    id_block: ("Id:\t") WORD
    EOL: /(\n)/
    _NL: /(\t\n)/
    statement_block: ("Statement:\t") START_ARRAY _NL "\t" statement~2..4 _NL "\t" END_ARRAY
    statement: START_OBJECT _NL "\t" sid_block _NL "\t" principal_block _NL "\t" effect_block _NL "\t" action_block _NL "\t" 
...
    INTEGER : /[0-9]{2,5}/
    //WS: " "
    WORD: /"[\w\d]{2,5}"/ // Alternative: /[^-:#()\[\]{}\n\s]{3,10}/
    LONG_WORD: /"[\w\d]{6,10}"/
    START_OBJECT: "{"
    END_OBJECT: "}"
    START_ARRAY: "["
    END_ARRAY: "]"
    COMMA: ","
    COLON: ":"
    //STRING: "@:/[^"]*/"
    //INT: [0.9]+;
    %import common.NUMBER
    %import common.STRING
    %import common.WS
    %import common.WS_INLINE
    %import common.NEWLINE
    %ignore NEWLINE
    //%ignore WS_INLINE

'''
generator = outlines.generate.cfg(model, policy_grammar)
start_time = time.time()
sequence = generator("Generate some random policy.")
print("Total time for generation: ", time.time() - start_time)

Expected result:

Total time for generation < 3 seconds

Error message:

No error but the time taken is 20+ seconds in each run which is unacceptable for practical usage.

Outlines/Python version information:

Version information

``` 0.0.18 Python 3.11.6 aiohttp==3.9.1 aiosignal==1.3.1 annotated-types==0.6.0 anyio==4.1.0 argon2-cffi==23.1.0 argon2-cffi-bindings==21.2.0 arrow==1.3.0 asttokens==2.4.1 async-lru==2.0.4 attrs==23.1.0 aws-lambda-powertools==2.32.0 awscurl==0.32 Babel==2.13.1 backoff==2.2.1 beartype==0.16.4 beautifulsoup4==4.12.2 bleach==6.1.0 boto3==1.33.7 botocore==1.33.7 certifi==2023.11.17 cffi==1.16.0 chardet==5.2.0 charset-normalizer==3.3.2 clarifai==9.11.1 clarifai-grpc==9.11.5 click==8.1.7 cloudpickle==2.2.1 cohere==4.40 comm==0.2.0 ConfigArgParse==1.7 configparser==6.0.0 contextlib2==21.6.0 dataclasses-json==0.6.3 datasets==2.15.0 debugpy==1.8.0 decorator==5.1.1 defusedxml==0.7.1 dill==0.3.7 distro==1.9.0 docker==7.0.0 executing==2.0.1 fastapi==0.95.2 fastavro==1.9.2 fastjsonschema==2.19.0 filelock==3.13.1 fqdn==1.5.1 frozenlist==1.4.1 fsspec==2023.10.0 glibc==0.6.1 google-pasta==0.2.0 googleapis-common-protos==1.62.0 greenlet==3.0.3 grpcio==1.60.1 grpcio-tools==1.60.1 h11==0.14.0 httpcore==1.0.2 httpx==0.26.0 huggingface-hub==0.19.4 icontract==2.6.6 idna==3.6 importlib-metadata==6.11.0 inquirerpy==0.3.4 InstructorEmbedding==1.0.1 interegular==0.3.2 ipykernel==6.27.1 ipython==8.18.1 ipywidgets==8.1.1 isoduration==20.11.0 jedi==0.19.1 Jinja2==3.1.2 jmespath==1.0.1 joblib==1.3.2 json5==0.9.14 jsonpatch==1.33 jsonpointer==2.4 jsonschema==4.20.0 jsonschema-specifications==2023.11.2 jupyter==1.0.0 jupyter-console==6.6.3 jupyter-events==0.9.0 jupyter-lsp==2.2.1 jupyter_client==8.6.0 jupyter_core==5.5.0 jupyter_server==2.11.2 jupyter_server_terminals==0.4.4 jupyterlab==4.0.9 jupyterlab-widgets==3.0.9 jupyterlab_pygments==0.3.0 jupyterlab_server==2.25.2 langchain==0.1.0 langchain-community==0.0.10 langchain-core==0.1.8 langsmith==0.0.78 lark==1.1.8 llvmlite==0.41.1 manifest-ml==0.0.1 markdown-it-py==3.0.0 MarkupSafe==2.1.3 marshmallow==3.20.1 matplotlib-inline==0.1.6 mdurl==0.1.2 mistune==3.0.2 mpmath==1.3.0 multidict==6.0.4 multiprocess==0.70.15 mypy-extensions==1.0.0 nbclient==0.9.0 nbconvert==7.12.0 nbformat==5.9.2 nest-asyncio==1.5.8 networkx==3.2.1 nlpcloud==1.1.45 nltk==3.8.1 notebook==7.0.6 notebook_shim==0.2.3 numba==0.58.1 numpy==1.26.2 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu12==2.18.1 nvidia-nvjitlink-cu12==12.3.101 nvidia-nvtx-cu12==12.1.105 openai==1.6.1 opencv-python==4.9.0.80 openlm==0.0.5 outlines==0.0.18 overrides==7.4.0 packaging==23.2 pandas==2.1.3 pandocfilters==1.5.0 parso==0.8.3 pathos==0.3.1 perscache==0.6.1 pexpect==4.9.0 pfzy==0.3.4 Pillow==10.1.0 platformdirs==4.1.0 pox==0.3.3 ppft==1.7.6.7 prometheus-client==0.19.0 prompt-toolkit==3.0.41 protobuf==4.25.1 protobuf-to-pydantic==0.2.3 psutil==5.9.6 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==14.0.2 pyarrow-hotfix==0.6 pycparser==2.21 pydantic==1.10.13 pydantic_core==2.14.5 Pygments==2.17.2 python-dateutil==2.8.2 python-json-logger==2.0.7 python-rapidjson==1.14 pytz==2023.3.post1 PyYAML==6.0.1 pyzmq==25.1.1 qtconsole==5.5.1 QtPy==2.4.1 redis==5.0.1 referencing==0.31.1 regex==2023.10.3 requests==2.31.0 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 rich==13.7.0 rpds-py==0.13.2 s3transfer==0.8.2 safetensors==0.4.1 sagemaker==2.203.1 schema==0.7.5 scikit-learn==1.3.2 scipy==1.11.4 Send2Trash==1.8.2 sentence-transformers==2.2.2 sentencepiece==0.1.99 six==1.16.0 smdebug-rulesconfig==1.0.1 sniffio==1.3.0 soupsieve==2.5 SQLAlchemy==2.0.25 sqlitedict==2.1.0 stack-data==0.6.3 starlette==0.27.0 sympy==1.12 tblib==2.0.0 tenacity==8.2.3 terminado==0.18.0 threadpoolctl==3.2.0 tinycss2==1.2.1 tokenizers==0.15.0 torch==2.1.1 torchvision==0.16.1 tornado==6.4 tqdm==4.66.1 traitlets==5.14.0 transformers==4.35.2 triton==2.1.0 tritonclient==2.41.1 types-python-dateutil==2.8.19.14 typing-inspect==0.9.0 typing_extensions==4.8.0 tzdata==2023.3 uri-template==1.3.0 urllib3==1.26.18 uvicorn==0.22.0 wcwidth==0.2.12 webcolors==1.13 webencodings==0.5.1 websocket-client==1.7.0 widgetsnbextension==4.0.9 xxhash==3.4.1 yarl==1.9.4 zipp==3.17.0 ```

Context for the issue:

No response

Answered by brandonwillard

Feb 7, 2024

Since Outlines is almost as fast as non-guided generation, this shouldn't be the case.

Only the regex-guided generation in outlines uses the efficient/optimal approach we described in our paper, to which that statement is referring. The community provided CFG-guided generation takes a different approach and does not offer similar performance guarantees.

View full answer

lapp0 · 2024-02-06T04:32:00Z

lapp0
Feb 6, 2024

I see you're on outlines==0.0.18. Please upgrade to the latest version and report back if this issue remains.

0 replies

Shivam-Srivastava · 2024-02-06T06:21:00Z

Shivam-Srivastava
Feb 6, 2024
Author

Ran it with version 0.0.26. Same high latencies. @lapp0

0 replies

lapp0 · 2024-02-06T06:31:03Z

lapp0
Feb 6, 2024

In newer versions of outlines the automata are constructed at runtime and cached for future runs. The initial run is slow, but later runs should be faster.

There is plenty of room for improvement though. While regex generation is extremely fast, CFG generation still has some problems to be solved.

3 replies

Shivam-Srivastava Feb 7, 2024
Author

Is there any example for using the cache? For each of the runs for the same prompt (using CFG, regex, json or Pydantic), I get around the same latency. There is some caching implemented here. Would appreciate knowing how to use it. cc: @rlouf

lapp0 Feb 7, 2024

Caching should be enabled by default.

I've introduced some CFG benchmarks here #587 They show that caching only results in marginal improvements.

The profiler output linked above has highlighted clear opportunities for enhancing performance. In the upcoming week, I'll concentrate on these areas. If you're open to it, I'd appreciate your help performance testing by applying your grammar to my upcoming PR :)

lapp0 Feb 7, 2024

@Shivam-Srivastava could you please try this branch? First run performance has improved a bit, and second (and thereafter) run performance has improved 20-fold.

pip install git+https://github.com/lapp0/outlines@calculate-transsformers-tokenizer-hash-once

Fortunately the performance issues were very low hanging fruit.

rlouf · 2024-02-06T08:50:10Z

rlouf
Feb 6, 2024
Maintainer

CFG-structured generation is very much a WIP implementation. As @lapp0 said regex-structured generation (and JSON by extention) should be fast.

0 replies

Shivam-Srivastava · 2024-02-06T20:45:06Z

Shivam-Srivastava
Feb 6, 2024
Author

Quoting from the blog

For one final example, let’s really shoot for the stars: using regex to enforce a JSON schema. It is possible6 to do this for general schemas, but in this case, I’m going to do it by hand.

Are there any pointers for generating a regex from a json schema? I have a json schema from a Pydantic model. Would like to convert it to a regex and try it as well. @rlouf @lapp0

Edit: I think I found it in the source code:
from outlines.fsm.json_schema import build_regex_from_object

1 reply

rlouf Feb 7, 2024
Maintainer

There should be an example in the README!

brandonwillard · 2024-02-07T07:17:56Z

brandonwillard
Feb 7, 2024
Maintainer

Since Outlines is almost as fast as non-guided generation, this shouldn't be the case.

Only the regex-guided generation in outlines uses the efficient/optimal approach we described in our paper, to which that statement is referring. The community provided CFG-guided generation takes a different approach and does not offer similar performance guarantees.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

High Latency for a simple CFG generation #617

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 6 comments 4 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

High Latency for a simple CFG generation #617

Shivam-Srivastava Feb 6, 2024

Describe the issue as clearly as possible:

Steps/code to reproduce the bug:

Expected result:

Error message:

Outlines/Python version information:

Context for the issue:

Replies: 6 comments · 4 replies

lapp0 Feb 6, 2024

Shivam-Srivastava Feb 6, 2024 Author

lapp0 Feb 6, 2024

Shivam-Srivastava Feb 7, 2024 Author

lapp0 Feb 7, 2024

lapp0 Feb 7, 2024

rlouf Feb 6, 2024 Maintainer

Shivam-Srivastava Feb 6, 2024 Author

rlouf Feb 7, 2024 Maintainer

brandonwillard Feb 7, 2024 Maintainer

Shivam-Srivastava
Feb 6, 2024

Replies: 6 comments 4 replies

lapp0
Feb 6, 2024

Shivam-Srivastava
Feb 6, 2024
Author

lapp0
Feb 6, 2024

Shivam-Srivastava Feb 7, 2024
Author

rlouf
Feb 6, 2024
Maintainer

Shivam-Srivastava
Feb 6, 2024
Author

rlouf Feb 7, 2024
Maintainer

brandonwillard
Feb 7, 2024
Maintainer