index() function returns wrong offset for non-ascii chars #1430

atschabu · 2017-06-19T02:56:19Z

I'm trying to strip some text from the end of a string. Using something like sub("!.*"; "") doesn't work, as it gives me a segmentation fault when the text is too long. So I tried this route:

$ jq '.msg | .[0:index("!")]'

which works fine with input like:
{"msg": "hello world!"}
but fails when text contains wide characters:
{"msg": "здравствуй мир!"}

$ echo '{"msg": "здравствуй мир!"}' | jq '.msg | index("!")'
27

$ echo '{"msg": "hello world!"}' | jq '.msg | index("!")'
11

$ jq --version
jq-1.5
$ uname -a
Darwin atschabu-C02SF0UTG8WM 15.6.0 Darwin Kernel Version 15.6.0: Tue Apr 11 16:00:51 PDT 2017; root:xnu-3248.60.11.5.3~1/RELEASE_X86_64 x86_64


pkoppstein · 2017-06-19T03:15:30Z

There is some documentation about this on the "Pitfalls" page (https://github.com/stedolan/jq/wiki/How-to:-Avoid-Pitfalls)

In brief, you can use match/1:

echo '{"msg": "здравствуй мир!"}' | jq '.msg | match("!").offset'
14

This works in jq 1.5 and later.
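As a rough sketch of how the match/1 offset could serve the original goal (dropping everything from the first "!" onward), the offset can be fed to a string slice, since jq slices strings by codepoint. The function name strip_from_bang is hypothetical, and the fallback to length when there is no "!" is an assumption about the desired behaviour, not something stated in this thread.

```shell
# Hypothetical helper: print .msg with everything from the first "!" removed.
# match/1 reports its offset in codepoints, which is also the unit used by
# string slicing, so the two can be combined directly.
strip_from_bang() {
  # If there is no "!", match yields nothing and "// length" keeps the
  # whole string (assumed behaviour for this sketch).
  jq -r '.msg | .[0:(match("!").offset // length)]'
}
```

For example, `echo '{"msg": "здравствуй мир!"}' | strip_from_bang` prints `здравствуй мир`. Note that the argument to match/1 is a regular expression, so characters with a special regex meaning would need escaping.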

By the way, could you please give more details about the failure of sub/2. Here is an illustration that it does not always fail when given a long string:

 jq1.5 -n '[range(0;100000) | "a"] | join("") + "!xx" | sub("!.*";"") | length'
100000

atschabu · 2017-06-19T17:03:49Z

My bad. I hadn't even realized there is a wiki. I took all my information from the manual, which doesn't mention that index is byte-wise. I'll give match a go.
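To illustrate the byte-wise behaviour: the 27 reported above matches the UTF-8 byte length of the text before the "!" (13 Cyrillic letters at two bytes each, plus one space), while 14 is its length in codepoints. The sketch below computes both numbers using only filters available in jq 1.5; the function name prefix_sizes is hypothetical, and the byte count is derived from the standard UTF-8 encoding ranges rather than from any jq builtin.

```shell
# Hypothetical helper: for the text before the first "!" in .msg, print
# [codepoint count, UTF-8 byte count] as a compact JSON array.
prefix_sizes() {
  jq -c '
    .msg
    | .[0:match("!").offset]     # text before the first "!"
    | [ length,                  # size in codepoints
        ( explode                # list of codepoint numbers
          | map(if . < 128 then 1 elif . < 2048 then 2
                elif . < 65536 then 3 else 4 end)   # UTF-8 bytes per codepoint
          | add // 0 ) ]         # total bytes (0 for an empty prefix)
  '
}
```

With the two inputs from the report, `echo '{"msg": "здравствуй мир!"}' | prefix_sizes` prints `[14,27]` and `echo '{"msg": "hello world!"}' | prefix_sizes` prints `[11,11]`, which is why the problem only shows up with non-ASCII text.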

I still haven't figured out exactly when the segmentation fault happens, as I haven't yet found the input that produces it. For now I'm assuming it is related to issue 922 until I can prove otherwise.

I guess we can close this one, and I'll open a new ticket if my segmentation fault turns out not to be related to 922.

nicowilliams · 2017-11-28T19:00:25Z

No, this is a bug. We should fix it.
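Until index/1 counts codepoints, one possible stopgap for a literal (non-regex) search is to split on the substring and measure the first piece. This is only a sketch: cp_index is a hypothetical name, it assumes a non-empty search string, and it is not how the eventual fix is implemented.

```shell
# Hypothetical helper: print the codepoint offset of the first occurrence of
# the literal string $1 in .msg, or null if it does not occur.
# Assumes $1 is non-empty.
cp_index() {
  jq --arg needle "$1" '
    .msg
    | if contains($needle)
      then (split($needle) | .[0] | length)   # codepoints before the first hit
      else null
      end
  '
}
```

For example, `echo '{"msg": "здравствуй мир!"}' | cp_index '!'` prints `14`. Unlike match/1, the search string is taken literally, so `cp_index '.'` looks for an actual dot.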


nicowilliams added the bug label Nov 28, 2017

This was referenced Jun 4, 2019

substring of index/rindex doesn't work for utf8 inputs #1624

Open

improve index, rindex and indices against string to count the index by utf8 characters #1916

Closed

itchyny mentioned this issue Apr 29, 2021

About the jq's release process (Was: Is jq is still alive/maintained ?) #2305

Closed

D3vil0p3r mentioned this issue Jun 8, 2023

[Request] gojq chaotic-aur/packages#2543

Closed

itchyny mentioned this issue Mar 12, 2024

indices reports byte offsets instead of character offsets #3064

Open

wader added a commit to wader/jq that referenced this issue Mar 12, 2024

Use codepoint index for indices/1, index/1 and rindex/1

ca38058

Previously byte index was used. Fixes jqlang#1430 jqlang#1624 jqlang#3064

wader linked a pull request Mar 12, 2024 that will close this issue

Use codepoint index for indices/1, index/1 and rindex/1 #3065

Open
