Snowball stemmer for Russian munches apostrophes #125
Comments
Could somebody check what other versions of the Snowball stemmer do? And even better, could we get commentary from a Russian speaker?
Russian speaker here: the apostrophe shouldn't be turned into Ь.
The apostrophe is not a Russian letter. Snowball stemmers for other languages don't use transliteration and don't suffer from this issue. The issue is that the original algorithm is not for Russian; it is for "transliterated Russian".
There is a question about the purpose of the NLTK Snowball stemmer. If it is a reference implementation of a Snowball stemmer, then the transliteration burden should probably be moved to the user (the stemmer should accept already transliterated words, because that is how the algorithm works) and this issue should be marked as "wontfix". If it is an attempt to write a useful stemmer for Russian, then it should be fixed either by accepting only Cyrillic words and escaping the apostrophe before the transliteration or (better) by rewriting the algorithm so that it works without transliteration.
As far as I can tell, there is no way to have a strict reference Snowball stemmer implementation that properly transliterates "Кот-д'Ивуар" or "д`Артаньян".
Hello, sorry to arrive so late on the thread, but I wanted to set the record straight about my Russian stemmer (part of the Snowball project). It is not for transliterated Russian, but merely uses a transliteration scheme for purposes of exposition. In the script of the stemmer itself, the Cyrillic characters are represented by "macro" equivalents, so ш is represented by {sh}, щ by {shch} and so on. This is done partly to make it easier for English-speaking readers, partly so that the script is entirely in ASCII, and partly so that it is not tied to any particular encoding scheme. (KOI8-R and Unicode versions are both given.) The transliteration scheme used follows the Library of Congress standard, in which the hard sign and soft sign are represented by a double quote and a single quote. But the compiled stemmer does work directly on Cyrillic characters.
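To illustrate the point about macros, here is a minimal Python sketch of how ASCII macro names in a script could be expanded into Cyrillic characters before anything is matched. The `MACROS` table and the `expand_macros` helper are hypothetical names made up for this example; only the {sh} and {shch} spellings and the quote convention come from the description above, and this is not the Snowball compiler's actual code.

```python
import re

# Hypothetical, heavily reduced macro table (illustration only).
# The single and double quote are macro *names* for the soft and hard sign.
MACROS = {
    "sh": "ш",
    "shch": "щ",
    "'": "ь",   # soft sign
    '"': "ъ",   # hard sign
}

def expand_macros(text):
    """Replace every {name} with its Cyrillic character; leave unknown names as-is."""
    return re.sub(
        r"\{([^{}]+)\}",
        lambda m: MACROS.get(m.group(1), m.group(0)),
        text,
    )

expanded = expand_macros("{sh}{'}")
```

Under these assumptions the expanded rule contains only Cyrillic characters, so a quote written in the script never has to appear in the text being stemmed.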
Version: 2.0b9
To reproduce:
>>> print stm.stem(u"Кот-д'Ивуару")
Output: кот-дьивуар
Notice the apostrophe being turned into Ь.
It shouldn't.
Migrated from http://code.google.com/p/nltk/issues/detail?id=593
earlier comments
PeMiStahl said, at 2010-09-12T16:30:37.000Z:
Hi,
I'm the one who ported the Snowball stemmers to NLTK. In doing so, I followed exactly the descriptions given by Martin Porter on his website. You can find the description of the Russian stemmer here: http://snowball.tartarus.org/algorithms/russian/stemmer.html
The apostrophe is turned into Ь because, before the actual stemming process starts, every Cyrillic letter is turned into its Roman transliterated form as stated by Mr Porter. After stemming, the transliterated form is turned back into the Cyrillic equivalent.
The problem now is the following: The stemming algorithm defined by Mr Porter assumes that an apostrophe may only appear in the Roman transliteration of a word, but not in the original Cyrillic form. As I do not speak any Russian myself, I do not know whether an apostrophe can appear in a Russian word or not.
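The collision can be reproduced in a few lines of Python 3. This is only a sketch under stated assumptions: the table below is a hypothetical, heavily reduced mapping (not the one in NLTK), the stemming step is left out entirely, and `to_roman` / `to_cyrillic` are made-up helper names. The only thing taken from the scheme is that the soft sign is written as a single quote.

```python
# Hypothetical, reduced transliteration table (illustration only).
CYR_TO_ROMAN = {
    "д": "d", "и": "i", "в": "v", "у": "u", "а": "a", "р": "r",
    "ь": "'",   # soft sign -> single quote
    "ъ": '"',   # hard sign -> double quote
}
ROMAN_TO_CYR = {roman: cyr for cyr, roman in CYR_TO_ROMAN.items()}

def to_roman(word):
    # Characters without a table entry (such as a literal apostrophe) pass through.
    return "".join(CYR_TO_ROMAN.get(ch, ch) for ch in word)

def to_cyrillic(word):
    # On the way back, every single quote is read as a soft sign.
    return "".join(ROMAN_TO_CYR.get(ch, ch) for ch in word)

word = "д'ивуар"
roundtrip = to_cyrillic(to_roman(word))
print(roundtrip)  # дьивуар
```

The literal apostrophe and the transliterated soft sign become the same character in the Roman form, so the way back cannot tell them apart.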
So, if you can tell me which positions an apostrophe can appear at in a Russian word, I might be able to adjust my code accordingly.
Thanks,
Peter
gmal.rom said, at 2010-09-18T00:29:00.000Z:
As it can appear only in loanwords, it's safe to assume that it appears only within the root: not in the first or last position, and not in suffixes or inflections.
PeMiStahl said, at 2010-10-28T10:49:26.000Z:
Hi everyone,
I apologize for posting this so late, but I moved to a new town and started my Master's studies, so I did not find the time to deal with this issue. Now I can provide a fix for the apostrophe problem. It was actually very easy to solve because I just had to add some new replace rules to the transliteration methods. The apostrophe is no longer turned into another character but stays the same during the stemming process.
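For readers who want a picture of what such replace rules could look like, here is one possible Python 3 sketch. It is not the code that went into NLTK: the reduced table, the `PLACEHOLDER` character and the helper names are all assumptions made for this example. The idea is to swap a literal apostrophe for a marker before transliteration and to restore it afterwards.

```python
# Hypothetical, reduced transliteration table (illustration only).
CYR_TO_ROMAN = {
    "д": "d", "и": "i", "в": "v", "у": "u", "а": "a", "р": "r",
    "ь": "'",   # soft sign -> single quote
}
ROMAN_TO_CYR = {roman: cyr for cyr, roman in CYR_TO_ROMAN.items()}

# Assumed marker: any character that occurs neither in the input
# nor in the transliteration table would do.
PLACEHOLDER = "\u0001"

def to_roman(word):
    # Extra replace rule: protect literal apostrophes before transliterating.
    word = word.replace("'", PLACEHOLDER)
    return "".join(CYR_TO_ROMAN.get(ch, ch) for ch in word)

def to_cyrillic(word):
    word = "".join(ROMAN_TO_CYR.get(ch, ch) for ch in word)
    # Extra replace rule: restore the protected apostrophes.
    return word.replace(PLACEHOLDER, "'")

with_apostrophe = to_cyrillic(to_roman("д'ивуар"))
with_soft_sign = to_cyrillic(to_roman("дьивуар"))
```

With these rules both the apostrophe and a genuine soft sign survive the round trip unchanged. A real stemmer would also have to treat the marker as an ordinary, non-suffix character during the stemming step in between.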
Best,
Peter