You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi!
Thank you for your contribution on nltk project. Your model handling Russian punctuation much better than other nltk models, but there is an issue with a ellipsis(...). Examples:
>>> import nltk
>>> sent_tokenize = nltk.data.load('tokenizers/punkt/russian.pickle')
>>> sent_tokenize.tokenize("Мама мыла раму… Папа мыл кларнет...")
['Мама мыла раму… Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму... Папа мыл кларнет...")
['Мама мыла раму... Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму!!! Папа мыл кларнет...")
['Мама мыла раму!!!', 'Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму!.. Папа мыл кларнет...")
['Мама мыла раму!..', 'Папа мыл кларнет...']
Is it work as designed (ex. 1 and ex. 2)? Ellipsis in Russian usually shows the end of a sentence, but maybe I am wrong.
The text was updated successfully, but these errors were encountered:
I doubt we can fix that easy. Probably one can split sentences by ellipsis taking into account the case of the first letter of the next word following ellipsis. Need to have a deeper look how it could be done with PunktSentenceTokenizer class.
Hi!
Thank you for your contribution on nltk project. Your model handling Russian punctuation much better than other nltk models, but there is an issue with a ellipsis(...). Examples:
Is it work as designed (ex. 1 and ex. 2)? Ellipsis in Russian usually shows the end of a sentence, but maybe I am wrong.
The text was updated successfully, but these errors were encountered: