
Validation Error during pydantic validation for Llama3 GGUF #952

Closed · polplop opened this issue Jun 11, 2024 · 1 comment · Fixed by #992
Labels: bug, JSON, llama.cpp (related to the `llama.cpp` integration)

polplop commented Jun 11, 2024

Describe the issue as clearly as possible:

I'm currently attempting to summarize an article and classify its relevancy. This worked fine on outlines 0.0.36; however, upgrading to outlines 0.0.43 produces a validation error that did not occur before.

I have tried:

  • Manually specifying the tokenizer to avoid any dictionary bugs
  • Reducing the prompt to JUST summarization to make a minimal example (I have more complicated use cases that worked in 0.0.36)
  • Other function-calling fine-tuned models, e.g. https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B, but they exhibit the same issue

The model seems unable to generate valid JSON: an "Invalid control character" error occurs during pydantic validation.
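
For reference, JSON forbids unescaped control characters (U+0000-U+001F) inside strings, which is what json.loads (called by pydantic's parse_raw) rejects; a minimal sketch with a hypothetical payload mirroring the input_value shown in the traceback below:

```python
import json

# A raw newline after the opening quote is exactly what the traceback's
# input_value shows; JSON requires it to be written as the escape \n.
# (Hypothetical payload for illustration.)
json.loads('{"relevant_summary":"\nVeriSilicon announced ..."}')
# json.decoder.JSONDecodeError: Invalid control character at: line 1 column 22 (char 21)
```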

Notes:
Running on Ubuntu 22.04 (#20~22.04.1-Ubuntu), an AWS instance with an A10G GPU, CUDA 12.1
llama_cpp_python==0.2.77
outlines==0.0.43

Steps/code to reproduce the bug:

from outlines import models, generate
import llama_cpp

from pydantic import BaseModel

# 0.0.36 (worked)
# model = models.llamacpp(
#                         "./models/bartowski_Meta-Llama-3-8B-Instruct-Q8_0.gguf",
#                         n_ctx=8000,
#                         n_gpu_layers=-1,  # to use GPU acceleration
#                         )

# 0.0.43 (fails)
model = models.llamacpp("bartowski/Meta-Llama-3-8B-Instruct-GGUF",
                        "Meta-Llama-3-8B-Instruct-Q8_0.gguf",
                        tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct"),
                        n_ctx=8000,
                        n_gpu_layers=-1,  # to use GPU acceleration
                        )


class User(BaseModel):  # unused in this minimal repro
    name: str
    last_name: str
    id: int


class RelevantSummary(BaseModel):
    relevant_summary: str

generator = generate.json(model, RelevantSummary)

result = generator(
"""
<|start_header_id|>system<|end_header_id|>
<|eot_id|>
## OBJECTIVE
1. Write a detailed summary related to Product Announcements.
2. Output your answer in JSON

<|eot_id|>
<|start_header_id|>user<|end_header_id|>

## ARTICLE
VeriSilicon’s 2nd generation automotive ISP series IP passed ISO 26262 ASIL B and ASIL D certifications

Las Vegas, USA, January 8, 2024--VeriSilicon (688521.SH) today announced its Image Signal Processor (ISP) IP ISP8200-FS and ISP8200L-FS, designed for high-performance automotive applications, have been certified compliant with the ISO 26262 automotive functional safety standard, achieving ASIL B certification for random failures and ASIL D certification for systematic failures, respectively. The certifications were granted by ResilTech, a leading safety consultancy company. Building upon the 1st generation of ISO 26262 certified ISP IP, the ISP8200-FS series is updated with advanced ISP technologies and several crucial enhancements for automotive applications after multiple automotive customers’ engagements on the 1st generation version.

ISP8200-FS series automotive ISP IP delivers high pixel throughputs from 1.6Giga to 2Giga pixel per second under different process technologies, supports up to 8 real-time or 16 camera streams from DDR with low latency technology based on multi-camera scheduling mechanism, and supplements the raw pixel processing pipelines for efficient AI processing. In addition, ISP8200-FS has a built-in FLEXA AI interface to capture automotive related ROI objects from AI processor for pedestrians, vehicles, traffic lights and signs detecting and processing.

Since its launch, multiple global major automotive SoC vendors have adopted ISP8200-FS series IP in their products for in-cabin ADAS, the next generation autonomous driving, and unified autonomous driving applications.

“ISP plays a pivotal role in the realm of autonomous driving. To meet the rapidly evolving demands of this industry, VeriSilicon is dedicated to providing our automotive customers with cutting-edge capabilities through our functional safety certified IPs,” said Wei-Jin Dai, Executive VP and GM of IP Division of VeriSilicon. “With adoption by multiple customers worldwide, our certified ISP8200-FS and ISP8200L-FS are specifically designed to cater to both primary application processor and the sensor fusion SoC requirements, including image, radar, and LiDAR capabilities. Minimizing latency from sensing to action is crucial in automotive applications. VeriSilicon offers a comprehensive solution with its Glass-to-Glass intelligent pixel processing functional safety IPs.”

To explore our rich IP portfolios, we invite you to visit VeriSilicon’s booth at the Venetian Expo (Booth No.: Bassano 2701 & Bassano 2702) during the Consumer Electronics Show (CES) 2024, taking place from January 9 to January 12 in Las Vegas.

## SUMMARY
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""
, max_tokens=5000)
print(result)

Expected result:

### Results from outlines==0.0.36
relevant_summary="Verisilicon's second generation automotive ISP series IP has passed ISO26262 ASIL B and ASIL D certifications. The ISP8200-FS and ISP8200L-FS IPs are designed for high-performance automotive applications, achieving ASIL B certification for random failures and ASIL D certification for systematic failures respectively. They deliver high pixel throughputs, support multiple camera streams with low latency, and have a built-in FLEXA AI interface. Multiple major automotive SoC vendors have adopted these IPs in their products for in-cabin ADAS, autonomous driving, and unified autonomous driving applications."

Error message:

$ python3 test_outlines.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Compiling FSM index for all state transitions:  76%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                                      | 25/33 [00:03<00:01,  7.24it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pydantic/main.py", line 1143, in parse_raw
    obj = parse.load_str_bytes(
  File "/usr/local/lib/python3.10/dist-packages/pydantic/deprecated/parse.py", line 49, in load_str_bytes
    return json_loads(b)  # type: ignore
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 22 (char 21)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/test_outlines.py", line 32, in <module>
    result = generator(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/outlines/generate/api.py", line 511, in __call__
    return format(completions)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/outlines/generate/api.py", line 497, in format
    return self.format_sequence(sequences)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/outlines/generate/json.py", line 50, in <lambda>
    generator.format_sequence = lambda x: schema_object.parse_raw(x)
  File "/usr/local/lib/python3.10/dist-packages/pydantic/main.py", line 1170, in parse_raw
    raise pydantic_core.ValidationError.from_exception_data(cls.__name__, [error])
pydantic_core._pydantic_core.ValidationError: 1 validation error for RelevantSummary
__root__
  Invalid control character at: line 1 column 22 (char 21) [type=value_error.jsondecode, input_value='{"relevant_summary":"\nV...ion SoC requirements."}', input_type=str]

Outlines/Python version information:

ubuntu@ip-:~$ python3 -c "from outlines import _version; print(_version.version)"
0.0.43

ubuntu@ip-:~$ python3 -c "import sys; print('Python', sys.version)"
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

pip freeze:

aiohttp==3.9.5
aiosignal==1.3.1
amqp==5.2.0
annotated-types==0.6.0
anyio==4.3.0
astroid==3.2.2
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.2.0
Automat==22.10.0
awscli==1.32.108
Babel==2.15.0
backports.tarfile==1.1.1
bcrypt==3.2.0
billiard==4.2.0
black==24.4.2
blessed==1.20.0
blinker==1.8.2
boto3==1.34.108
botocore==1.34.108
build==1.2.1
celery==5.4.0
certifi==2024.2.2
cffi==1.16.0
chalice==1.31.0
chardet==4.0.0
charset-normalizer==3.3.2
click==8.1.7
click-didyoumean==0.3.1
click-plugins==1.1.1
click-repl==0.3.0
cloud-init==24.1.3
cloudpickle==3.0.0
cmake==3.29.3
colorama==0.4.6
command-not-found==0.3
configobj==5.0.6
constantly==23.10.4
cryptography==42.0.7
cssselect==1.2.0
dask==2024.5.1
datasets==2.19.1
dbus-python==1.2.18
decorator==5.1.1
defusedxml==0.7.1
devscripts===2.22.1ubuntu1
dill==0.3.8
diskcache==5.6.3
distlib==0.3.8
distro==1.7.0
distro-info==1.1+ubuntu0.2
dnspython==2.6.1
docutils==0.16
dparse==0.6.3
ec2-hibinit-agent==1.0.0
email_validator==2.1.1
exceptiongroup==1.2.1
executing==2.0.1
fastapi==0.111.0
fastapi-cli==0.0.4
filelock==3.14.0
Flask==3.0.3
frozenlist==1.4.1
fsspec==2024.3.1
gpg==1.16.0
greenlet==3.0.3
h11==0.14.0
hibagent==1.0.1
httpcore==1.0.5
httpie==3.2.2
httplib2==0.22.0
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.2
hyperlink==21.0.0
idna==3.7
importlib_metadata==7.1.0
incremental==22.10.0
iniconfig==2.0.0
inquirer==2.10.1
inquirerpy==0.3.4
interegular==0.3.3
ipython==8.24.0
isort==5.13.2
itemadapter==0.9.0
itemloaders==1.2.0
itsdangerous==2.2.0
jaraco.classes==3.4.0
jaraco.context==5.3.0
jaraco.functools==4.0.1
jedi==0.19.1
jeepney==0.8.0
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
jsonpatch==1.32
jsonpointer==2.0
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
keyring==25.2.1
kombu==5.3.7
lark==1.1.9
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
llama_cpp_python==0.2.77
llvmlite==0.42.0
lm-format-enforcer==0.10.1
locket==1.0.0
lxml==5.2.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mccabe==0.7.0
mdurl==0.1.2
more-itertools==10.2.0
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
mypy==1.10.0
mypy-extensions==1.0.0
nest-asyncio==1.6.0
netifaces==0.11.0
networkx==3.3
nh3==0.2.17
ninja==1.11.1.1
numba==0.59.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.550.52
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
olefile==0.46
openai==1.31.1
orjson==3.10.3
outlines==0.0.43
packaging==21.3
pandas==2.2.2
parsel==1.9.1
parso==0.8.4
partd==1.4.2
pathspec==0.12.1
pbr==6.0.0
pexpect==4.9.0
pfzy==0.3.4
pillow==10.3.0
pip-tools==7.4.1
pipdeptree==2.20.0
pipenv==2023.12.1
pkginfo==1.10.0
platformdirs==4.2.2
pluggy==1.5.0
prometheus-fastapi-instrumentator==7.0.0
prometheus_client==0.20.0
prompt-toolkit==3.0.43
Protego==0.3.1
protobuf==5.27.0
psutil==5.9.8
psycopg2-binary==2.9.9
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==16.1.0
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycairo==1.26.0
pycountry==24.6.1
pycparser==2.22
pydantic==2.7.1
pydantic_core==2.18.2
PyDispatcher==2.0.7
Pygments==2.18.0
PyGObject==3.42.1
PyHamcrest==2.0.2
PyJWT==2.8.0
pylint==3.2.1
pyOpenSSL==24.1.0
pyparsing==3.1.2
pyproject_hooks==1.1.0
pyrsistent==0.18.1
pyserial==3.5
PySocks==1.7.1
pytest==8.2.1
python-apt==2.4.0+ubuntu3
python-dateutil==2.9.0.post0
python-debian==0.1.43+ubuntu1.1
python-dotenv==1.0.1
python-editor==1.0.4
python-magic==0.4.24
python-multipart==0.0.9
pytz==2022.1
pyxdg==0.27
PyYAML==6.0.1
queuelib==1.7.0
ray==2.23.0
readchar==4.1.0
readme_renderer==43.0
redis==5.0.4
referencing==0.35.1
regex==2024.5.15
requests==2.31.0
requests-file==2.0.0
requests-toolbelt==1.0.0
rfc3986==2.0.0
rich==13.7.1
roman==3.3
rpds-py==0.18.1
rsa==4.7.2
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
s3transfer==0.10.1
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.13.1
Scrapy==2.11.2
SecretStorage==3.3.3
sentence-transformers==3.0.0
sentencepiece==0.2.0
service-identity==24.1.0
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
sos==4.5.6
SQLAlchemy==2.0.30
ssh-import-id==5.11
stack-data==0.6.3
starlette==0.37.2
sympy==1.12.1
systemd-python==234
testresources==2.0.1
threadpoolctl==3.5.0
tiktoken==0.7.0
tldextract==5.1.2
tokenizers==0.19.1
tomli==2.0.1
tomlkit==0.12.5
toolz==0.12.1
torch==2.3.0
tqdm==4.66.4
traitlets==5.14.3
transformers==4.41.2
triton==2.3.0
twine==5.1.0
Twisted==24.3.0
typer==0.12.3
typing_extensions==4.11.0
tzdata==2024.1
ubuntu-pro-client==8001
ufw==0.36.1
ujson==5.10.0
unattended-upgrades==0.1
unidiff==0.5.5
urllib3==2.2.1
uvicorn==0.29.0
uvloop==0.19.0
vine==5.1.0
virtualenv==20.26.2
vllm-flash-attn==2.5.8.post2
w3lib==2.1.2
wadllib==1.3.6
watchfiles==0.21.0
wcwidth==0.2.13
websockets==12.0
Werkzeug==3.0.3
xdg==5
xformers==0.0.26.post1
xxhash==3.4.1
yapf==0.40.2
yarl==1.9.4
zipp==3.18.2
zope.interface==6.4



### Context for the issue:

I would like to improve the performance of my summarization and classification pipeline with the newer Llama 3 GGUF models. The older 0.0.36 outlines library also has some number formatting issues.

No other issue has brought up problems with Llama 3 GGUFs, but all of the fine-tunes I have tried exhibit the same issue. Either I'm doing something wrong or there is a significant Llama 3 GGUF issue that should be discussed. Thank you!
@polplop polplop added the bug label Jun 11, 2024
@rlouf rlouf added llama.cpp Related to the `llama.cpp` integration JSON labels Jun 18, 2024
lapp0 (Collaborator) commented Jun 20, 2024

Investigating a solution.

Related: huggingface/transformers#31030

Problem:

It appears the tokenizer represents token 198 differently between tokenizer.vocabulary() and tokenizer.decode():

>>> tokenizer.decode([198])
['\n']
>>> [(k, v) for k, v in tokenizer.vocabulary().items() if v == 198][0][0]
'Ċ'

This isn't the case for other tokens:

>>> tokenizer.decode([10])
['+']
>>> [(k, v) for k, v in tokenizer.vocabulary().items() if v == 10][0][0]
'+'

Inconsistent Tokens

    from transformers import AutoTokenizer
    from outlines.models.transformers import TransformerTokenizer

    tokenizer = TransformerTokenizer(
        AutoTokenizer.from_pretrained("failspy/Meta-Llama-3-8B-Instruct-abliterated-v3")
    )
    bad_tokens = []
    for vocab_token_str, token_id in tokenizer.vocabulary.items():
        decoded_token_str = tokenizer.decode([token_id])[0]
        if decoded_token_str != vocab_token_str:
            bad_tokens.append((decoded_token_str, vocab_token_str))

    if bad_tokens:
        bad_tok_output = '\n'.join(map(repr, bad_tokens))
        raise Exception(f"Found {len(bad_tokens)} bad tokens: {bad_tok_output}")

Found these inconsistent tokens:

E           Exception: Found 78029 bad tokens: (' ROOM', 'ĠROOM')
E           (' 않는', 'ĠìķĬëĬĶ')
E           (' Overse', 'ĠOverse')
E           (' slov', 'Ġslov')
E           ('�', 'æ¦')
E           (' Infragistics', 'ĠInfragistics')
E           ('�', 'çĻ')
E           (' DIFF', 'ĠDIFF')
E           (' 武', 'ĠæѦ')
E           (' eighth', 'Ġeighth')
...

I'm looking into whether we should be constructing a "true vocabulary" by decoding each token.
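
As a rough sketch (hypothetical, not the eventual fix), that could mean re-keying the vocabulary by each token's decoded string instead of its byte-level surface form:

```python
# Hypothetical sketch of a "true vocabulary" keyed by decoded strings.
# Assumes `tokenizer` is the TransformerTokenizer built above.
true_vocabulary = {}
for vocab_token_str, token_id in tokenizer.vocabulary.items():
    true_vocabulary[tokenizer.decode([token_id])[0]] = token_id
# Caveat: distinct byte-level tokens can decode to the same string (e.g. the
# '�' entries above), so a real fix would have to handle collisions.
```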

Edit:

It appears we already have a method to normalize:

class TransformerTokenizer(Tokenizer):
    ...
    def convert_token_to_string(self, token: str) -> str:
        from transformers.file_utils import SPIECE_UNDERLINE

        string = self.tokenizer.convert_tokens_to_string([token])
        if self.is_llama:
            # handle missing spaces from HF's Llama (SentencePiece) tokenizers
            if token.startswith(SPIECE_UNDERLINE) or token == "<0x20>":
                return " " + string
        return string

Investigating the reason this failed to prevent a \n during generation.
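
(For context: Llama 3 uses a GPT-2-style byte-level BPE vocabulary, which stores each raw byte under a printable stand-in character. The standard byte-to-unicode table below, reproduced from GPT-2's tokenizer, shows why the vocab lists 'Ċ' where decoding yields '\n'.)

```python
# GPT-2's byte-to-unicode table: printable bytes map to themselves,
# unprintable bytes (including 0x0A, "\n") are shifted above U+0100.
def bytes_to_unicode():
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift unprintable bytes into U+0100+
            n += 1
    return dict(zip(bs, map(chr, cs)))

print(bytes_to_unicode()[0x0A])  # 'Ċ' -- the vocab's surface form of "\n"
```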
