substring of index/rindex doesn't work for utf8 inputs #1624

Open
ilovenwd opened this issue Mar 10, 2018 · 6 comments · May be fixed by #3065

Comments

@ilovenwd

ilovenwd commented Mar 10, 2018

echo  '正xyz' | jq -Rsr '.[:rindex("x")]'
正xy
whereas it should output:
正

It seems the substring slice thinks length(正xy)==3,
but rindex thinks length(正xy)==5.
This issue makes the JSONP parsing example in the FAQ fail for UTF-8 inputs:

Q: How can I convert JSON-P (JSONP) to JSON using jq?
A: Assuming that the padding takes the form of a function call:
$ jq -s -R  '.[1+index("("): rindex(")")] | fromjson'
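
The mismatch described above can be reproduced outside jq. The following is a minimal sketch in Python (not jq's actual implementation), assuming that rindex reports an offset in UTF-8 bytes while the slice counts codepoints, as the report suggests:

```python
# Model of the reported mismatch: a byte offset used as a codepoint slice bound.
s = "正xyz"

# Offset of the last "x" measured in UTF-8 bytes (what rindex appears to return).
byte_offset = s.encode("utf-8").rindex(b"x")

# Offset of the last "x" measured in codepoints (what the slice expects).
codepoint_offset = s.rindex("x")

# "正" occupies 3 bytes in UTF-8 but is a single codepoint,
# so the two offsets differ by 2.
print(byte_offset, codepoint_offset)

# Slicing by codepoints with the byte offset overshoots;
# slicing with the codepoint offset gives the expected prefix.
wrong = s[:byte_offset]
right = s[:codepoint_offset]
print(wrong, right)
```

For pure-ASCII input the two offsets coincide, which is presumably why the FAQ example only fails once a multi-byte character precedes the match.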
@pkoppstein
Contributor

pkoppstein commented Mar 10, 2018

You're right and I've updated the FAQ so that it uses match. In the example you give, we would have:

echo  '正xyz' | jq1.5 -Rsr '.[: (match("x").offset)]'
正

Thank you!

@ilovenwd
Author

match works, but it's really confusing that index and slice use different string models (byte offsets vs. character offsets).
This is very different from common languages like C, Python, etc.
I suggest adding another pair of functions, such as indexu/rindexu, that behave exactly as the substring slice does.

@itchyny
Contributor

itchyny commented Jun 4, 2019

I believe the default behavior is worth changing. How important are byte indices in jq? Defining index for strings via explode | .[$x|explode] would work with UTF-8 strings. If someone needs to convert between strings and bytes, how about adding byte versions of explode and implode?
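
The idea in this comment can be sketched as follows. This is a Python model under stated assumptions, not jq code: `explode` mirrors jq's builtin of the same name, while `explode_bytes`, `indices` and `index` are hypothetical helpers written only for this illustration.

```python
from typing import List, Optional


def explode(s: str) -> List[int]:
    """Codepoint numbers of s, analogous to jq's explode."""
    return [ord(c) for c in s]


def explode_bytes(s: str) -> List[int]:
    """Hypothetical byte version of explode: the UTF-8 bytes of s."""
    return list(s.encode("utf-8"))


def indices(haystack: List[int], needle: List[int]) -> List[int]:
    """All positions at which needle occurs as a sublist of haystack."""
    n = len(needle)
    if n == 0:
        return []  # choice made for this sketch only
    return [i for i in range(len(haystack) - n + 1)
            if haystack[i:i + n] == needle]


def index(s: str, x: str) -> Optional[int]:
    """First codepoint offset of x in s, or None if x does not occur."""
    hits = indices(explode(s), explode(x))
    return hits[0] if hits else None


print(index("正xyz", "x"))                                   # codepoint offset
print(indices(explode_bytes("正xyz"), explode_bytes("x")))   # byte offsets
```

With both variants available, the caller chooses the unit explicitly: searching the exploded codepoints gives offsets that agree with slicing, and searching the exploded bytes gives byte offsets.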

@itchyny
Contributor

itchyny commented Jun 4, 2019

Same issue: #1430.

@pkoppstein
Contributor

@itchyny - Changing the semantics of index would only be possible in a "Major Release" of jq, and might never happen.

Rather than tilting at that particular windmill, I would suggest adding a new C-coded built-in function with the desired semantics, not least because the existing implementation of index is ill-suited for finding the first index of anything.

Although there is something to be said for a function with a narrow domain (e.g. codepointOf for JSON strings), it would be more in keeping with jq's existing builtins to be polymorphic, which would suggest a name such as indexOf, though a more distinctive name would no doubt be preferable.
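
To make the suggestion concrete, here is a rough Python sketch of a polymorphic first-index function. The name `index_of` is hypothetical, and the sketch assumes the concern about the existing index is that all matches are collected before the first one is taken; it is not a proposal for the actual C implementation.

```python
from typing import Optional, Sequence


def index_of(haystack: Sequence, needle: Sequence) -> Optional[int]:
    """Return the first position of needle in haystack, or None.

    Works on strings (positions are codepoints) and on lists
    (positions are elements). The scan stops at the first match
    instead of collecting every match.
    """
    n = len(needle)
    if n == 0:
        return None  # choice made for this sketch only
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i  # early exit on the first hit
    return None


print(index_of("正xyz", "x"))
print(index_of([1, 2, 3, 2, 3], [2, 3]))
```

Because the string case counts codepoints, a result from this function could be used directly as a slice bound.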

@itchyny
Contributor

itchyny commented Jun 5, 2019

Okay, thanks for the detailed explanation.
