Snowball stemmer for Russian munches apostrophes #125

Closed
alexrudnick opened this issue Jan 17, 2012 · 3 comments

Comments

@alexrudnick
Member

Version: 2.0b9
To reproduce:
>>> print stm.stem(u"Кот-д'Ивуару")
Output: кот-дьивуар
Notice the apostrophe being turned into Ь.
It shouldn't.

Migrated from http://code.google.com/p/nltk/issues/detail?id=593


earlier comments

PeMiStahl said, at 2010-09-12T16:30:37.000Z:

Hi,

I'm the one who ported the Snowball stemmers to NLTK. Therefore, I exactly followed the descriptions given by Martin Porter on his website. You can find the description of the Russian stemmer here: http://snowball.tartarus.org/algorithms/russian/stemmer.html

The apostrophe is turned into Ь because, before the actual stemming process starts, every Cyrillic letter is turned into its Roman transliterated form as stated by Mr Porter. After stemming, the transliterated form is turned back into the Cyrillic equivalent.
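The round trip described above can be sketched roughly as follows. This is only an illustration: the mapping table is a hypothetical, heavily reduced subset (not the table used in NLTK), the helper names `to_roman` and `to_cyrillic` are made up, and the stemming step that would sit between the two conversions is left out.

```python
# Hypothetical, heavily reduced transliteration table (NOT NLTK's actual table).
CYR_TO_ROMAN = {
    "к": "k", "о": "o", "т": "t", "д": "d", "и": "i",
    "в": "v", "у": "u", "а": "a", "р": "r",
    "ь": "'",  # the soft sign is written as an apostrophe in this scheme
}
# Reverse table used to go back to Cyrillic after stemming.
ROMAN_TO_CYR = {roman: cyr for cyr, roman in CYR_TO_ROMAN.items()}


def to_roman(word):
    """Replace each known Cyrillic letter; leave everything else untouched."""
    return "".join(CYR_TO_ROMAN.get(ch, ch) for ch in word)


def to_cyrillic(word):
    """Replace each known Roman symbol; leave everything else untouched."""
    return "".join(ROMAN_TO_CYR.get(ch, ch) for ch in word)


original = "кот-д'ивуар"
romanized = to_roman(original)       # the literal apostrophe passes through as "'"
round_trip = to_cyrillic(romanized)  # ...and comes back as a soft sign

print(romanized)
print(round_trip)
```

Because a literal apostrophe and a transliterated soft sign look identical in the Roman form, the reverse step cannot tell them apart.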

The problem now is the following: The stemming algorithm defined by Mr Porter assumes that an apostrophe may only appear in the Roman transliteration of a word, but not in the original Cyrillic form. As I do not speak any Russian myself, I do not know whether an apostrophe can appear in a Russian word or not.

So, if you can tell me which positions an apostrophe can appear at in a Russian word, I might be able to adjust my code accordingly.

Thanks,
Peter

gmal.rom said, at 2010-09-18T00:29:00.000Z:

As it can appear only in loanwords, it's safe to assume it can appear within the root, but not in the first or last position and not in suffixes or inflectional endings.

PeMiStahl said, at 2010-10-28T10:49:26.000Z:

Hi everyone,

I apologize for posting this so late, but I moved to a new town and started my Master's studies, so I did not find the time to deal with this issue. Now I can provide a fix for the apostrophe problem. It was actually very easy to solve because I just had to add some new replace rules to the transliteration methods. The apostrophe is no longer turned into another character but stays the same during the stemming process.

Best,
Peter

@alexrudnick
Member Author

Could somebody check what other versions of the Snowball stemmer do? And even better, could we get commentary from a Russian speaker?

@kmike
Member

kmike commented Oct 17, 2012

Russian speaker here: the apostrophe shouldn't be turned into Ь.

<name> is defined as follows here http://snowball.tartarus.org/compiler/snowman.html:

<name>          ::= <letter> [ <letter> || <digit> || _ ]*

Apostrophe is not a Russian <letter>, so it should be handled the same way as a hyphen; the apostrophe is not a diacritic mark in Russian. The confusion comes from the fact that the letter Ь is commonly transliterated as "'", but "'" is also a common punctuation and orthography mark in Russian. So the Cyrillic -> transliterated -> stemmed -> Cyrillic conversion is broken because the transliteration is lossy.

Snowball stemmers for other languages don't use transliteration and don't suffer from this issue.
I personally think the Russian Snowball stemmer shouldn't use transliteration either.

The issue is that the original algorithm is not for Russian; it is for "transliterated Russian". This raises a question about the purpose of the NLTK Snowball stemmer. If it is a reference implementation of a Snowball stemmer, then the transliteration burden should probably be moved to the user (the stemmer should accept already transliterated words because that is how the algorithm works) and this issue should be marked as "wontfix". If it is an attempt to write a useful stemmer for Russian, then it should be fixed either by accepting only Cyrillic words and escaping the apostrophe before the transliteration or (better) by rewriting the algorithm to make it work without the transliteration.
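A minimal sketch of the "escape the apostrophe before the transliteration" option might look like this. Everything here is an assumption made for illustration: `PLACEHOLDER`, `stem_with_escaping` and `lossy_round_trip` are hypothetical names, the placeholder is assumed never to occur in real input, and `lossy_round_trip` is only a stand-in that mimics the buggy conversion rather than a real stemmer.

```python
# Private-use code point, assumed (hypothetically) never to appear in input text.
PLACEHOLDER = "\ue000"


def lossy_round_trip(word):
    """Stand-in for the buggy pipeline: soft sign -> "'" -> soft sign.

    Any literal apostrophe gets swept up and becomes a soft sign too.
    """
    return word.replace("ь", "'").replace("'", "ь")


def stem_with_escaping(word, transliterating_stem):
    """Hide literal apostrophes from a transliterating stemmer, then restore them."""
    escaped = word.replace("'", PLACEHOLDER)
    stemmed = transliterating_stem(escaped)
    return stemmed.replace(PLACEHOLDER, "'")


unprotected = lossy_round_trip("кот-д'ивуар")
protected = stem_with_escaping("кот-д'ivуар".replace("iv", "ив"), lossy_round_trip)
soft_sign_kept = stem_with_escaping("день", lossy_round_trip)

print(unprotected)
print(protected)
print(soft_sign_kept)
```

This only works if the wrapped stemmer passes unknown characters through unchanged, which is why rewriting the algorithm to work on Cyrillic directly is the cleaner option.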

As far as I can tell, there is no way to have a strict reference Snowball stemmer implementation that properly transliterates "Кот-д'Ивуар" or "д`Артаньян".

@Martin-Porter

Hello, sorry to arrive so late on the thread, but I wanted to set the record straight about my Russian stemmer (part of the Snowball project). It is not for transliterated Russian, but merely uses a transliteration scheme for purposes of exposition. In the script of the stemmer itself, the Cyrillic characters are represented by "macro" equivalents, so ш is represented by {sh}, щ by {shch} and so on. This is done partly to make it easier for English-speaking readers, partly so that the script is entirely in ASCII, and partly so that it is not tied to any particular encoding scheme. (KOI8-R and Unicode versions are both given.) The transliteration scheme used follows the Library of Congress standard, in which the hard sign and soft sign are represented by a double quote and a single quote. But the compiled stemmer does work directly on Cyrillic characters.
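As a rough sketch of that idea: the ASCII macros are expanded into Cyrillic once, when the rules are built, and the rules are then applied to Cyrillic text, so a literal apostrophe in the input is never confused with the soft sign. The macro table below is a small illustrative subset, and `expand_macros`, `TOY_SUFFIXES` and `strip_toy_suffix` are hypothetical names. The suffix rule is a toy, not the real Russian algorithm.

```python
import re

# Small illustrative subset of macro names; {sh} and {shch} are from the
# description above, the quote macros follow the Library of Congress scheme.
MACROS = {"sh": "ш", "shch": "щ", "u": "у", "'": "ь", '"': "ъ"}


def expand_macros(text):
    """Replace every {name} in an ASCII rule with its Cyrillic character."""
    return re.sub(r"\{([^{}]+)\}", lambda m: MACROS[m.group(1)], text)


# Rules are written in ASCII but expanded once, before any word is processed.
TOY_SUFFIXES = [expand_macros(s) for s in ("{u}", "{'}")]


def strip_toy_suffix(word):
    """Toy rule: drop the first matching suffix. Not the real stemmer logic."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


print(expand_macros("{sh} {shch} {'}"))
print(strip_toy_suffix("кот-д'ивуару"))
```

The input word is never transliterated here, so nothing needs to be converted back afterwards.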
