Improve remove_diacritics function. Fixes #71 #72

henrifroese · 2020-07-12T17:49:15Z

The remove_diacritics function produced transliterated output for e.g.
the Urdu alphabet.

Through the unicodedata package, diacritics are now safely filtered out.

henrifroese · 2020-07-12T17:56:52Z

These are the new functions:

import unicodedata

import pandas as pd


def _remove_diacritics(text: str) -> str:
    """
    Remove diacritics and accents from one string.

    Examples
    --------
    >>> import texthero as hero
    >>> import pandas as pd
    >>> text = "Montréal, über, 12.89, Mère, Françoise, noël, 889, اِس, اُس"
    >>> _remove_diacritics(text)
    'Montreal, uber, 12.89, Mere, Francoise, noel, 889, اس, اس'
    """
    nfkd_form = unicodedata.normalize("NFKD", text)
    # unicodedata.combining(char) returns the canonical combining class of char;
    # a nonzero value marks a combining character, e.g. a diacritic
    return "".join([char for char in nfkd_form if not unicodedata.combining(char)])


def remove_diacritics(input: pd.Series) -> pd.Series:
    """
    Remove all diacritics and accents.

    Remove all diacritics and accents from all words and characters in the given Pandas Series. Return a cleaned version of the Pandas Series.

    Examples
    --------
    >>> import texthero as hero
    >>> import pandas as pd
    >>> s = pd.Series("Montréal, über, 12.89, Mère, Françoise, noël, 889, اِس, اُس")
    >>> hero.remove_diacritics(s)[0]
    'Montreal, uber, 12.89, Mere, Francoise, noel, 889, اس, اس'
    """
    return input.astype("unicode").apply(_remove_diacritics)
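
As a rough illustration of why the approach above works, here is a minimal standalone sketch using only the standard library. The helper name `strip_combining_marks` is hypothetical (it is not part of Texthero); it only mirrors the normalize-then-filter idea from `_remove_diacritics`.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Hypothetical helper mirroring the normalize-then-filter idea of this PR."""
    # NFKD splits precomposed characters into a base character
    # followed by its combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only characters whose canonical combining class is 0,
    # i.e. drop the combining marks.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


# "é" (U+00E9) decomposes into "e" plus U+0301 COMBINING ACUTE ACCENT.
decomposed_e = unicodedata.normalize("NFKD", "\u00e9")
for ch in decomposed_e:
    # Print the code point, its Unicode name, and its combining class.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

print(strip_combining_marks("Montréal, über, Françoise"))
```

Because the filter only drops combining marks and never maps one script onto another, Arabic-script base letters stay as they are and only the vowel marks are removed.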

jbesomi · 2020-07-13T12:14:51Z

Hi @henrifroese, thank you for this PR!

See #71 for a general comment on how we might want to deal with other non-western (or non-English) languages.

Also, it appears that your carriage return is different from the one used in Texthero's files (which should be Unix). As a consequence, your commit changes lines other than those of the remove_diacritics functions.

henrifroese · 2020-07-13T12:37:05Z

Hi, I took a look and can see in the diff that my commit removes some unnecessary whitespace (e.g. a tab at the beginning of an empty line). So it's not a carriage return and I think it makes sense in line with the style guide (although Black does not automatically remove it). I think I could "re-insert" the whitespace if you want, but I don't think it's really necessary (except if I'm missing something).

I'll also comment at #71 regarding multilingual support.

jbesomi · 2020-07-13T12:41:42Z

Right! Out of curiosity: did you change it yourself, or did your IDE change it automatically? Which IDE are you using?
Good to know black does not handle this kind of thing. But, who knows, maybe there is a hidden black parameter we can tune ...

henrifroese · 2020-07-13T12:54:08Z

Yes, I'm using VSCode with Pylint; it also sorts the imports alphabetically if I add a new one. I'm sure black has an option to strip the whitespace.

jbesomi · 2020-07-13T12:59:41Z

All right. Now I remember. We do not want black to remove trailing whitespace by default. The reason is that sometimes, to pass the docstring tests, we need to keep this whitespace, for instance:

s = pd.Series("10")
hero.replace_digits(s, " ")
0  (we want some whitespaces there)

I agree with you that we can get rid of all the whitespace you are removing there. Though this should probably not be done in this PR but rather in another one (but for now that's okay; having some rules is surely useful, especially if we later have to deal with more PRs).
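
To make the docstring-test point above concrete, here is a small sketch using the standard `doctest` module. The strings are made up for illustration (they are not real Texthero output); the sketch only shows that doctest compares expected and actual output exactly unless whitespace normalization is requested.

```python
import doctest

checker = doctest.OutputChecker()

# Illustrative strings only: actual output that ends in trailing spaces.
got = "0     \n"
want_original = "0     \n"  # docstring keeps the trailing spaces
want_stripped = "0\n"       # docstring after a tool stripped trailing spaces

# Exact comparison: the untouched docstring matches the output.
matches_original = checker.check_output(want_original, got, 0)
# After stripping, the expected text no longer equals the actual output.
matches_stripped = checker.check_output(want_stripped, got, 0)
# NORMALIZE_WHITESPACE treats runs of whitespace as equivalent.
matches_normalized = checker.check_output(
    want_stripped, got, doctest.NORMALIZE_WHITESPACE
)

print(matches_original, matches_stripped, matches_normalized)
```

Whether relaxing the comparison with `doctest.NORMALIZE_WHITESPACE` would suit this project is an open question; the sketch only shows that the option exists.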

henrifroese · 2020-07-13T13:45:46Z

Aah right that makes sense! We certainly want to keep that whitespace.

> but for now that's okay

I'm not sure I understand you correctly 🤓 ; do you mean that this PR is okay for now (I'll just mark it as "ready for review" so if that's the case you could merge it) or do you mean I should still change it back for this PR?

jbesomi · 2020-07-13T13:53:46Z

That's fine for now! Merged! Thank you Henri!! 👍

jbesomi · 2020-07-13T14:01:59Z

Merged, but I still have two questions:

We do not need unidecode anymore, right? Not even in setup.cfg?
In another PR, didn't we say we shouldn't use .astype("unicode")?
--> What happens with the following Series:

s = pd.Series(["Hèllo", np.nan])
hero.remove_diacritics(s)

Probably, what the user expects is that the function removes diacritics only in the first cell ...

And from this, another question/feedback arises... we might want all of Texthero's functions to be able to deal with unassigned cells, no? If yes, and if that's feasible (probably that's not super easy for every function), we will need to test that, a bit as you did with test_index.
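
One possible way to leave unassigned cells untouched is sketched below. This is only an assumption about what such a variant could look like, not the merged Texthero code; the names `_strip_marks` and `remove_diacritics_keep_missing` are hypothetical.

```python
import unicodedata

import numpy as np
import pandas as pd


def _strip_marks(text: str) -> str:
    # Same idea as _remove_diacritics: decompose, then drop combining marks.
    nfkd_form = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in nfkd_form if not unicodedata.combining(ch))


def remove_diacritics_keep_missing(s: pd.Series) -> pd.Series:
    # Hypothetical variant: transform only str cells and pass everything else
    # (e.g. np.nan) through unchanged instead of casting the whole Series.
    return s.map(lambda v: _strip_marks(v) if isinstance(v, str) else v)


s = pd.Series(["Hèllo", np.nan])
result = remove_diacritics_keep_missing(s)
print(result)
```

The design choice here is to skip the cast entirely, so a missing value is never turned into text; whether that is the right behaviour for every Texthero function is exactly the open question raised above.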

Improve remove_diacritics function. Fixes jbesomi#71

c14414a

henrifroese mentioned this pull request Jul 12, 2020

Remove Diacritics for Urdu Language #71

Closed

henrifroese marked this pull request as ready for review July 13, 2020 13:46

jbesomi merged commit 391cec6 into jbesomi:master Jul 13, 2020
