grep: Invalid collation character #6

freebiesoft · 2020-12-12T20:52:50Z

When I run clean-lex.sh, I get the error grep: Invalid collation character

polm · 2020-12-13T04:31:03Z

Must be an issue with your locale. Try calling export LC_COLLATE=C or export LC_COLLATE=C.UTF-8 first, that should fix it.

If that doesn't fix it, let me know. If that does fix it, I can add it to the script.
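For reference, the locale suggestion above can be tried without changing the calling shell's settings. This is a minimal sketch, assuming a POSIX shell; `run_with_c_collate` is a hypothetical helper, not something clean-lex.sh defines. Note that if LC_ALL is set in the environment, it takes precedence over LC_COLLATE.

```shell
#!/bin/sh
# Hypothetical helper: run one command with byte-order collation,
# leaving the locale of the calling shell untouched.
run_with_c_collate() {
    LC_COLLATE=C "$@"
}

# Example: show the collation setting that the wrapped command sees.
run_with_c_collate sh -c 'echo "$LC_COLLATE"'
```

Assuming the script sits in the current directory, something like `run_with_c_collate ./clean-lex.sh` would apply the setting to a single run.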

freebiesoft · 2020-12-13T10:20:37Z

@polm I worked around it by replacing the special ("wide Latin") characters in the grep pattern with their respective Unicode hex code points, i.e. '^([A-Za-z\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]|[0-9\x{FF10}-\x{FF19}]*),'. It does confuse me a little, though: grep -E '^([A-Za-z\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]|[0-9\x{FF10}-\x{FF19}]*),' lex.csv | wc -l outputs just over 100, whereas the code comments in clean-lex.sh say the script should remove 232 entries. However, I am using the latest version of UniDic.

polm · 2020-12-13T10:25:51Z

Well that's weird. Can you tell me what your LOCALE and OS are and confirm whether or not the LC_COLLATE setting fixes things for you?

The number of entries I mentioned is for UniDic 2.3.0, which is still the latest version. So if you're just getting 100 it's possible something is going wrong.

freebiesoft · 2020-12-13T10:31:50Z

@polm I just realised I downloaded the spoken Japanese dictionary (that is the one at version 3.0.1.1). I'll download the written-language one tomorrow and check again :)

polm · 2020-12-16T16:36:31Z

@freebiesoft Any update on this?

polm · 2020-12-21T16:47:18Z

Closing due to lack of activity.

srctaha · 2021-02-16T11:09:52Z

@polm FYI

UniDic: 2.3.0
OS: macOS Big Sur 11.2.1

Locale:

LANG=""
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

With grep (BSD grep) 2.5.1-FreeBSD, grep -E '^([A-Za-zＡ-Ｚａ-ｚ]|[0-9０-９]*),' lex.csv | wc -l returns 863869, removing 8962 entries.

With ggrep (GNU grep) 3.6, ggrep -P '^([A-Za-z\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]|[0-9\x{FF10}-\x{FF19}]*),' lex.csv | wc -l returns 232.

polm · 2021-02-16T11:13:42Z

Thanks for the report. I'll add the collate setting to the script to avoid issues like this.

polm · 2021-02-16T11:16:42Z

Oh wait... your locale settings look right and it's still not doing what I would expect. Hm.

I'll take a look at this later, but I would recommend using ggrep.

srctaha · 2021-02-16T11:33:34Z

Sorry for misleading you. Mine is indeed not a locale issue. ggrep -E '^([A-Za-zＡ-Ｚａ-ｚ]|[0-9０-９]*),' lex.csv | wc -l returns 232, as expected.

polm · 2021-02-24T12:50:17Z

So I'm looking into the difference between BSD and GNU grep and I'm still unsure why this is happening. Here are some questions I have:

- Is there even a way to install BSD grep on Linux? There seem to be a lot of people wanting to use GNU grep on BSD, but I can't find any example of the opposite. (I do not have a Mac.)
- Is there some way to check if /bin/grep is BSD or GNU? I guess I could check the output of grep --version, but without BSD grep I can't compare it... Something like grep --version | grep GNU might work.
- Is GNU grep always available on Macs, or is it optional?
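The `grep --version` idea in the second question can be sketched as follows, using the two banner strings reported earlier in this thread ("grep (BSD grep) 2.5.1-FreeBSD" and "ggrep (GNU grep) 3.6"). `classify_grep_banner` is a hypothetical helper, and other grep builds may print banners that this sketch reports as unknown.

```shell
#!/bin/sh
# Hypothetical helper: classify a `grep --version` banner string.
classify_grep_banner() {
    case "$1" in
        *"GNU grep"*) echo gnu ;;
        *"BSD grep"*) echo bsd ;;
        *)            echo unknown ;;
    esac
}

# Classify whichever grep is first on PATH; only the first banner line is used.
# If --version is not supported, the banner is empty and the result is "unknown".
banner=$(grep --version 2>/dev/null | head -n 1)
classify_grep_banner "$banner"
```

Matching the longer string "GNU grep" rather than just "GNU" is a deliberate choice, in case a non-GNU banner mentions GNU compatibility.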

polm · 2021-02-24T13:03:38Z

@srctaha Can you upload the 8962 entries that BSD grep picks up as a gist or something so I can check it?

srctaha · 2021-02-24T14:42:34Z

@polm Here is the gist that lists the 8962 entries that grep (BSD grep) 2.5.1-FreeBSD picks up for grep -Ev '^([A-Za-zＡ-Ｚａ-ｚ]|[0-9０-９]*),' lex.csv.

Settings are the same as before.

polm · 2021-02-24T15:04:06Z

Thanks, that's helpful. The results contain punctuation, kaomoji, and alphabetic entries with more than one character, which is weird... I'll see if I can figure out what's up.
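To make the expected behaviour concrete, here is an ASCII-only reduction of the pattern. The full-width ranges are left out on purpose so that the result does not depend on how a given grep handles them, and the sample rows are made up for illustration; they are not taken from lex.csv.

```shell
#!/bin/sh
# ASCII-only reduction of the pattern discussed in this thread.
pattern='^([A-Za-z]|[0-9]*),'

# Hypothetical sample rows: a first CSV field followed by a dummy second field.
sample() {
    printf '%s\n' 'a,x' 'ab,x' '42,x' '4a,x' ',x'
}

# LC_ALL=C keeps the ASCII ranges independent of the caller's locale.
sample | LC_ALL=C grep -E "$pattern"
```

This prints the rows a,x, 42,x and ,x: a first field that is a single letter, all digits, or empty (because `[0-9]*` also matches zero digits). Multi-character alphabetic fields such as ab are not selected, which is why their presence in the BSD grep results is unexpected.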

srctaha · 2021-02-24T15:33:50Z

FYI, given line="合,", line="あい,", or line="明空,":

grep (BSD grep) 2.5.1-FreeBSD

echo "$line" | grep -E '^([Ａ-Ｚ]*),' -  # match
echo "$line" | grep -E '^([ａ-ｚ]*),' -  # match

ggrep (GNU grep) 3.6

echo "$line" | ggrep -E '^([Ａ-Ｚ]*),' -  # no match
echo "$line" | ggrep -E '^([ａ-ｚ]*),' -  # no match
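The manual checks above could be scripted so that the same lines and patterns are run against whichever grep is being tested. This is only a sketch: `check` is a hypothetical helper, and what it prints depends on the grep implementation and the current locale, so no expected output is given.

```shell
#!/bin/sh
# Hypothetical helper: report how a grep command treats one line.
# usage: check GREP_CMD PATTERN LINE
check() {
    printf '%s\n' "$3" | "$1" -E "$2" >/dev/null 2>&1
    # grep exits 0 on a match, 1 on no match, and >1 on errors
    # (for example an invalid pattern); a missing command also lands in "error".
    case $? in
        0) echo "match" ;;
        1) echo "no match" ;;
        *) echo "error" ;;
    esac
}

# Lines and patterns taken from the report above.
for line in '合,' 'あい,' '明空,'; do
    for pat in '^([Ａ-Ｚ]*),' '^([ａ-ｚ]*),'; do
        printf '%s\t%s\t%s\n' "$line" "$pat" "$(check grep "$pat" "$line")"
    done
done
```

Replacing `grep` with `ggrep` in the inner call runs the same comparison against GNU grep where it is installed under that name.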

srctaha · 2021-02-24T15:46:15Z

is there some way to check if /bin/grep is BSD or GNU?

grep -V works for both BSD and GNU grep.

is GNU grep always available on Macs or is it optional?

At least on macOS Big Sur 11.2.1, GNU grep is not available by default. Users need to install it through Homebrew (brew), a free and open-source package manager for macOS.
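Given that, a script could prefer `ggrep` when it is installed and fall back to plain `grep` otherwise. The following is a minimal sketch of that idea, not what clean-lex.sh does; `pick_grep` and the `GREP` variable are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate command found on PATH.
pick_grep() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Prefer ggrep (the name Homebrew gives GNU grep by default), else plain grep.
GREP=$(pick_grep ggrep grep) || { echo "no grep found" >&2; exit 1; }
echo "using: $GREP"
```

Later calls would then use `"$GREP" -E ...` instead of a hard-coded `grep`. Falling back to plain `grep` does not guarantee GNU behaviour, so this only helps on systems where `ggrep` is actually present.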

polm · 2021-02-26T07:36:27Z

Thanks for the details about Macs. Since installing GNU grep isn't that hard, and since I'm still not sure exactly what's going on here, I'll just leave this alone for now, but I may come back to it later.

polm closed this as completed Dec 21, 2020

polm reopened this Feb 16, 2021

polm closed this as completed Feb 26, 2021
