grep: Invalid collation character #6

Closed
freebiesoft opened this issue Dec 12, 2020 · 17 comments

Comments

@freebiesoft

When I run the clean-lex.sh file, I get a "grep: Invalid collation character" error.

@polm
Owner

polm commented Dec 13, 2020

Must be an issue with your locale. Try calling export LC_COLLATE=C or export LC_COLLATE=C.UTF-8 first, that should fix it.

If that doesn't fix it, let me know. If that does fix it, I can add it to the script.
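As a minimal sketch of that suggestion, the collation locale can also be set for a single grep invocation instead of being exported for the whole shell. The sample file and its three rows are made up for illustration, and the pattern is a simplified ASCII-only stand-in, not necessarily the one clean-lex.sh uses.

```shell
# Hypothetical sample data standing in for lex.csv (not real UniDic rows)
sample=$(mktemp)
printf 'A,x\n12,y\nあい,z\n' > "$sample"

# Set the collation order for this one command only; the rest of the
# shell session keeps its own locale. Note that LC_ALL, if set,
# overrides LC_COLLATE.
count=$(LC_COLLATE=C grep -cE '^([A-Za-z]|[0-9]*),' "$sample")
echo "rows matched: $count"

rm -f "$sample"
```

Running `locale` before and after shows that the session's own settings are unchanged by the per-command assignment.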

@freebiesoft
Author

freebiesoft commented Dec 13, 2020

@polm I worked around it by replacing the special characters in the grep string (i.e. the "wide latin" characters) with their respective Unicode hex numbers, i.e. '^([A-Za-z\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]|[0-9\x{FF10}-\x{FF19}]*),'. It does confuse me a little, though, that grep -E '^([A-Za-z\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]|[0-9\x{FF10}-\x{FF19}]*),' lex.csv | wc -l outputs just over 100, whereas you say (in the clean-lex.sh code comments) that the script should remove 232 entries. I am using the latest version of unidic, however.

@polm
Owner

polm commented Dec 13, 2020

Well that's weird. Can you tell me what your LOCALE and OS are and confirm whether or not the LC_COLLATE setting fixes things for you?

The number of entries I mentioned is for UniDic 2.3.0, which is still the latest version. So if you're just getting 100 it's possible something is going wrong.

@freebiesoft
Author

freebiesoft commented Dec 13, 2020

@polm I just realised I have downloaded the spoken Japanese dictionary (as that is version 3.0.1.1). I'll download the written language one (tomorrow) and check again :)

@polm
Owner

polm commented Dec 16, 2020

@freebiesoft Any update on this?

@polm
Owner

polm commented Dec 21, 2020

Closing due to lack of activity.

@polm polm closed this as completed Dec 21, 2020
@srctaha

srctaha commented Feb 16, 2021

@polm FYI

  • UniDic: 2.3.0
  • OS: macOS Big Sur 11.2.1
  • Locale:
    LANG=""
    LC_COLLATE="C"
    LC_CTYPE="UTF-8"
    LC_MESSAGES="C"
    LC_MONETARY="C"
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_ALL=
    

With grep (BSD grep) 2.5.1-FreeBSD, grep -E '^([A-Za-zＡ-Ｚａ-ｚ]|[0-9０-９]*),' lex.csv | wc -l returns 863869, removing 8962 entries.

With ggrep (GNU grep) 3.6, ggrep -P '^([A-Za-z\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]|[0-9\x{FF10}-\x{FF19}]*),' lex.csv | wc -l returns 232.

@polm polm reopened this Feb 16, 2021
@polm
Owner

polm commented Feb 16, 2021

Thanks for the report. I'll add the collate setting to the script to avoid issues like this.

@polm
Owner

polm commented Feb 16, 2021

Oh wait... your locale settings look right and it's still not doing what I would expect. Hm.

I'll take a look at this later, but I would recommend using ggrep.

@srctaha

srctaha commented Feb 16, 2021

Sorry for the misleading report. Mine is indeed not a locale issue. ggrep -E '^([A-Za-zＡ-Ｚａ-ｚ]|[0-9０-９]*),' lex.csv | wc -l returns 232, as expected.

@polm
Owner

polm commented Feb 24, 2021

So I'm looking into the difference between BSD and GNU grep and I'm still unsure why this is happening. Here are some questions I have:

  • is there even a way to install BSD grep on Linux? There seem to be a lot of people wanting to use GNU grep on BSD but I can't find any example of the opposite. (I do not have a Mac.)
  • is there some way to check if /bin/grep is BSD or GNU? I guess I could check the output of grep --version but without BSD grep I can't compare it... Something like grep --version | grep GNU might work.
  • is GNU grep always available on Macs or is it optional?
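The version check floated in the second question could be sketched roughly as follows. It assumes that GNU grep prints "(GNU grep)" in the first line of its --version output and BSD grep prints "(BSD grep)" instead, which is what the version strings quoted in this thread show; the function name is_gnu_grep is made up for illustration.

```shell
# Report whether a given grep command is GNU grep.
# Assumption: GNU grep's --version output starts with a line containing
# "(GNU grep)", while BSD grep's first line contains "(BSD grep)".
is_gnu_grep() {
    "$1" --version 2>/dev/null | head -n 1 | grep -qF '(GNU grep)'
}

if is_gnu_grep grep; then
    echo "grep is GNU grep"
else
    echo "grep is not GNU grep (possibly BSD grep)"
fi
```

Matching the full "(GNU grep)" string rather than just "GNU" is a deliberate choice here, so that a version line that merely mentions GNU compatibility is not mistaken for GNU grep.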

@polm
Owner

polm commented Feb 24, 2021

@srctaha Can you upload the 8962 entries that BSD grep picks up as a gist or something so I can check it?

@srctaha

srctaha commented Feb 24, 2021

@polm Here is the gist that lists the 8962 entries that grep (BSD grep) 2.5.1-FreeBSD picks up for grep -Ev '^([A-Za-zＡ-Ｚａ-ｚ]|[0-9０-９]*),' lex.csv.

Settings are same as before.

@polm
Owner

polm commented Feb 24, 2021

Thanks, that's helpful. The results contain punctuation, kaomoji, and alphabetic entries with more than one character, which is weird... I'll see if I can figure out what's up.

@srctaha

srctaha commented Feb 24, 2021

FYI, given line="合,", line="あい,", or line="明空,":

  • grep (BSD grep) 2.5.1-FreeBSD
    echo $line | grep -E '^([A-Z]*),' -  # match
    echo $line | grep -E '^([a-z]*),' -  # match
    
  • ggrep (GNU grep) 3.6
    echo $line | ggrep -E '^([A-Z]*),' -  # no match
    echo $line | ggrep -E '^([a-z]*),' -  # no match
    

@srctaha

srctaha commented Feb 24, 2021

is there some way to check if /bin/grep is BSD or GNU?

grep -V works for both BSD and GNU grep.

is GNU grep always available on Macs or is it optional?

For at least macOS Big Sur 11.2.1, GNU grep is unavailable by default. Users need to install it through brew (Homebrew), a free and open-source package manager for macOS.
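A script that wants GNU grep where it is available could pick the command along these lines. This is a sketch only; it assumes GNU grep is installed under the name ggrep, as in the commands earlier in this thread, and GREP_CMD is a made-up variable name.

```shell
# Prefer ggrep (the name under which GNU grep is installed alongside the
# system grep in the examples above) when it is on PATH; otherwise fall
# back to plain grep.
if command -v ggrep >/dev/null 2>&1; then
    GREP_CMD=ggrep
else
    GREP_CMD=grep
fi
echo "using: $GREP_CMD"
```

The rest of the script would then call "$GREP_CMD" instead of grep.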

@polm
Owner

polm commented Feb 26, 2021

Thanks for the details about Macs. Since installing GNU grep isn't that hard, and since I'm still not sure exactly what's going on here, I'll just leave this alone for now, but I may come back to it later.

@polm polm closed this as completed Feb 26, 2021