bug(unicode): utf-8 encoded strings throw errors #13

Closed
AmitMY opened this issue Mar 8, 2018 · 2 comments

AmitMY commented Mar 8, 2018

Given the following hyp:

  • uruguay where the leader is raúl fernando sendic rodríguez alfredo zitarrosa died in montevideo montevideo , uruguay montevideo where the leader is daniel martínez ( politician ) uruguay and the language spoken is spanish .

And the following refs:

  • Alfredo Zitarrosa died in Montevideo, Uruguay. Daniel Martinez is a political leader in Montevideo, and Raul Fernando Sendic Rodriguez is a leader in Uruguay, where Spanish is spoken.
  • Alfredo Zitarrosa died in Montevideo, the leader of which, is Daniel Martinez. Montevideo is in Uruguay, where Spanish is the language and where Raúl Fernando Sendic Rodríguez is the leader.
  • Raúl Fernando Sendic Rodríguez is the leader of Spanish speaking, Uruguay. Daniel Martinez is the leader of Montevideo, the final resting place of Alfredo Zitarrosa.

the corpus check fails.

Meh solution: map each string to ASCII. Ignoring Unicode characters shouldn't hurt too much, but can

def sentence_unicode(s):
    # Replace every non-ASCII character with 'X' so the result is plain ASCII.
    return str(''.join([i if ord(i) < 128 else 'X' for i in s]))
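
A possibly less lossy variant of the workaround above, as a rough sketch only: since the hyp contains "raúl" while some refs spell it "Raul", folding accents to their base letters might keep more token matches than replacing them with 'X'. This assumes Python 3 `str` input; `strip_accents` is a hypothetical helper name, not part of this project.

```python
import unicodedata


def strip_accents(s):
    """Fold accented characters to base letters; hypothetical helper, Python 3 only."""
    # NFKD splits e.g. 'ú' into 'u' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", s)
    # Drop the combining marks, keeping the base letters.
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII falls back to 'X', as in the original workaround.
    return "".join(c if ord(c) < 128 else "X" for c in no_marks)


print(strip_accents("raúl fernando sendic rodríguez"))
```

Characters with no ASCII base letter still end up as 'X', so this only helps for accented Latin text like the example in this issue.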

juharris (Member) commented

We now support Python 3 and Unicode strings should work. We have a test to check.

AmitMY commented Jun 26, 2018

That's awesome! Thanks
