Improve remove_diacritics function. Fixes #71 #72

Merged
merged 1 commit into from
Jul 13, 2020

Conversation

henrifroese
Collaborator

The remove_diacritics function produced transliterated output for e.g.
the Urdu alphabet.

Through the unicodedata package, diacritics are now safely filtered out.

@henrifroese
Collaborator Author

henrifroese commented Jul 12, 2020

These are the new functions:

def _remove_diacritics(text: str) -> str:
    """
    Remove diacritics and accents from one string.

    Examples
    --------
    >>> import texthero as hero
    >>> import pandas as pd
    >>> text = "Montréal, über, 12.89, Mère, Françoise, noël, 889, اِس, اُس"
    >>> _remove_diacritics(text)
    'Montreal, uber, 12.89, Mere, Francoise, noel, 889, اس, اس'
    """
    nfkd_form = unicodedata.normalize("NFKD", text)
    # unicodedata.combining(char) returns the canonical combining class of char;
    # it is non-zero for combining marks (i.e. diacritics), which NFKD
    # normalization splits off from their base characters
    return "".join([char for char in nfkd_form if not unicodedata.combining(char)])


def remove_diacritics(input: pd.Series) -> pd.Series:
    """
    Remove all diacritics and accents.

    Remove all diacritics and accents from every word and character in the given Pandas Series. Return a cleaned version of the Pandas Series.

    Examples
    --------
    >>> import texthero as hero
    >>> import pandas as pd
    >>> s = pd.Series("Montréal, über, 12.89, Mère, Françoise, noël, 889, اِس, اُس")
    >>> hero.remove_diacritics(s)[0]
    'Montreal, uber, 12.89, Mere, Francoise, noel, 889, اس, اس'
    """
    return input.astype("unicode").apply(_remove_diacritics)
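To make the mechanism behind these functions easier to follow, here is a minimal standalone sketch using only the standard library. The helper name `strip_combining_marks` is hypothetical and not part of Texthero; it only mirrors the logic of `_remove_diacritics` above.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    # Hypothetical helper mirroring _remove_diacritics: decompose, then drop marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# NFKD splits a precomposed "é" into a base letter plus a combining accent.
decomposed = unicodedata.normalize("NFKD", "\u00e9")
print([f"U+{ord(ch):04X}" for ch in decomposed])         # ['U+0065', 'U+0301']
print([unicodedata.combining(ch) for ch in decomposed])  # [0, 230]

# Latin text loses its accents ...
print(strip_combining_marks("Montréal, über, noël"))     # Montreal, uber, noel

# ... and Arabic-script text keeps its letters; only the mark (kasra, U+0650) goes.
urdu = "\u0627\u0650\u0633"
print(strip_combining_marks(urdu) == "\u0627\u0633")     # True
```

One thing to keep in mind: NFKD also applies compatibility decompositions, so characters such as the ligature "ﬁ" come back as "fi". Whether that side effect is wanted here is a separate question.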

@jbesomi
Owner

jbesomi commented Jul 13, 2020

Hi @henrifroese, thank you for this PR!

See #71 for a general comment on how we might want to deal with other non-Western (or non-English) languages.

Also, it appears that your carriage return is different from the one used in Texthero's files (which should be Unix). As a consequence, your commit changes lines other than the _remove_diacritics functions.

@henrifroese
Collaborator Author

Hi, I took a look and can see in the diff that my commit removes some unnecessary whitespace (e.g. a tab at the beginning of an empty line). So it's not a carriage return and I think it makes sense in line with the style guide (although Black does not automatically remove it). I think I could "re-insert" the whitespace if you want, but I don't think it's really necessary (except if I'm missing something).

I'll also comment at #71 regarding multilingual support.

@jbesomi
Owner

jbesomi commented Jul 13, 2020

Right! Out of curiosity: did you change it yourself, or did your IDE change it automatically? Which IDE are you using?
Good to know Black does not handle this kind of thing. But, who knows, maybe there is a hidden Black parameter we can tune...

@henrifroese
Collaborator Author

Yes, I'm using VSCode with Pylint, it also sorts the imports alphabetically if I add a new one. I'm sure black has an option to strip the whitespace

@jbesomi
Owner

jbesomi commented Jul 13, 2020

All right. Now I remember. We do not want Black to remove trailing whitespace by default. The reason is that we sometimes need to keep this whitespace to pass the docstring tests, for instance:

s = pd.Series("10")
hero.replace_digits(s, " ")
0  (we want some whitespace there)

I agree with you that we can get rid of all the whitespace you are removing there. Though this should probably not be done in this PR, but rather in another one (but for now that's okay; having some rules is certainly useful, though, especially if we later have to deal with more PRs).
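As an aside on the docstring-test point: one possible alternative, not something this PR adopts, is doctest's `NORMALIZE_WHITESPACE` option, which treats all runs of whitespace as equal so the expected output no longer needs trailing spaces. A minimal sketch follows; the helper `count_failures` and the example string are made up for illustration.

```python
import doctest

# The printed value carries trailing spaces; the expected output does not.
SOURCE = '''
>>> print("0     ")
0
'''


def count_failures(source: str, optionflags: int = 0) -> int:
    # Parse the string as a doctest and run it, discarding the failure report.
    test = doctest.DocTestParser().get_doctest(source, {}, "whitespace_demo", None, 0)
    runner = doctest.DocTestRunner(verbose=False, optionflags=optionflags)
    results = runner.run(test, out=lambda report: None)
    return results.failed


print(count_failures(SOURCE))                                # 1 (strict comparison)
print(count_failures(SOURCE, doctest.NORMALIZE_WHITESPACE))  # 0
```

The trade-off is that the option also hides whitespace differences that might matter, so enabling it per example rather than globally may be the safer choice.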

@henrifroese
Collaborator Author

Aah right that makes sense! We certainly want to keep that whitespace.

but for now that's okay

I'm not sure I understand you correctly 🤓 ; do you mean that this PR is okay for now (I'll just mark it as "ready for review" so if that's the case you could merge it) or do you mean I should still change it back for this PR?

@henrifroese henrifroese marked this pull request as ready for review July 13, 2020 13:46
@jbesomi jbesomi merged commit 391cec6 into jbesomi:master Jul 13, 2020
@jbesomi
Owner

jbesomi commented Jul 13, 2020

That's fine for now! Merged! Thank you Henri!! 👍

@jbesomi
Owner

jbesomi commented Jul 13, 2020

Merged, but now I still have two questions:

  1. We do not need unidecode anymore, right? Not even in setup.cfg?
  2. In another PR, didn't we say we shouldn't use .astype("unicode")?
    --> What happens with the following Series:
s = pd.Series(["Hèllo", np.nan])
hero.remove_diacritics(s)

Probably, the user expects the function to remove diacritics only from the first cell ...

And from this, another question/feedback arises: we might want all of Texthero's functions to be able to deal with unassigned cells, no? If yes, and if that's feasible (it's probably not super easy for every function), we will need to test that, a bit like you did with test_index.
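For illustration only, here is one way the NaN question could be handled. This is a sketch under assumptions, not Texthero's implementation: the names `strip_combining_marks` and `remove_diacritics_keep_missing` are hypothetical, and the idea is simply to skip missing cells via `Series.map(..., na_action="ignore")` instead of converting everything to strings first.

```python
import unicodedata

import pandas as pd


def strip_combining_marks(text: str) -> str:
    # Hypothetical helper: same logic as _remove_diacritics in this PR.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_diacritics_keep_missing(s: pd.Series) -> pd.Series:
    # Hypothetical variant: na_action="ignore" passes missing cells through
    # untouched instead of handing them to the string function.
    return s.map(strip_combining_marks, na_action="ignore")


s = pd.Series(["Hèllo", float("nan")])
cleaned = remove_diacritics_keep_missing(s)
print(cleaned[0])           # Hello
print(pd.isna(cleaned[1]))  # True

# Why a blanket string conversion is risky: a NaN turned into text is just "nan".
print(str(float("nan")))    # nan
```

This sketch assumes every non-missing cell is already a string; other cell types (e.g. integers) would still need separate handling or an explicit error.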
