-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve remove_diacritics function. Fixes #71 #72
Improve remove_diacritics function. Fixes #71 #72
Conversation
The remove_diacritics function produced transliterated output for e.g. the Urdu alphabet. Through the unicodedata package, diacritics are now safely filtered out.
These are the new functions:
|
Hi @henrifroese, thank you for this PR! See @71 for a general comment on how we might want to deal with other non-western (or non-English) languages. Also, it appears that your carriage return is different from the actual one of Texthero's file (that should be Unix). The consequence is that your actual code is changing other lines that the functions |
Hi, I took a look and can see in the diff that my commit removes some unnecessary whitespace (e.g. a tab at the beginning of an empty line). So it's not a carriage return and I think it makes sense in line with the style guide (although Black does not automatically remove it). I think I could "re-insert" the whitespace if you want, but I don't think it's really necessary (except if I'm missing something). I'll also comment at #71 regarding multilingual support. |
Right! For curiosity: did you changed it yourself or your IDE changed it automatically? Which IDE are you using? |
Yes, I'm using VSCode with Pylint, it also sorts the imports alphabetically if I add a new one. I'm sure black has an option to strip the whitespace |
All right. Now I remember. We do not want to have black that by default removes trailing whitespace. The reason is that sometimes for passing the docstring test we need to keep this whitespaces, for instance:
I agree with you we can get rid of all the whitespace you are removing there. Though probably this should not be done in this PR, rather in an another one (but for now that's okay, but having some rules is for sure useful, especially if later we will have to deal with more PRs..) |
Aah right that makes sense! We certainly want to keep that whitespace.
I'm not sure I understand you correctly 🤓 ; do you mean that this PR is okay for now (I'll just mark it as "ready for review" so if that's the case you could merge it) or do you mean I should still change it back for this PR? |
That's fine for now! Merged! Thank you Henri!! 👍 |
Merged, but now I have still two questios:
Probably, what the user expects that the function removes diacritics only on the first cell ... And, from this, another question/feedback arises... we might want that all Texthero's function can deal with not-assigned cell, no? If yes and if that's feasible (probably that's not super easy for all function) we will need to test that, a bit as you did with |
The remove_diacritics function produced transliterated output for e.g.
the Urdu alphabet.
Through the unicodedata package, diacritics are now safely filtered out.