Snowball stemmer for Russian munches apostrophes #125

alexrudnick · 2012-01-17T06:26:39Z

Version: 2.0b9
To reproduce:
>>> from nltk.stem.snowball import RussianStemmer
>>> stm = RussianStemmer()
>>> print stm.stem(u"Кот-д'Ивуару")
Output: кот-дьивуар
Notice that the apostrophe is turned into Ь.
It shouldn't be.

Migrated from http://code.google.com/p/nltk/issues/detail?id=593

earlier comments

PeMiStahl said, at 2010-09-12T16:30:37.000Z:

Hi,

I'm the one who ported the Snowball stemmers to NLTK. In doing so, I followed the descriptions given by Martin Porter on his website exactly. You can find the description of the Russian stemmer here: http://snowball.tartarus.org/algorithms/russian/stemmer.html

The apostrophe is turned into Ь because, before the actual stemming process starts, every Cyrillic letter is turned into its Roman transliterated form as stated by Mr Porter. After stemming, the transliterated form is turned back into the Cyrillic equivalent.

The problem is the following: the stemming algorithm defined by Mr Porter assumes that an apostrophe may only appear in the Roman transliteration of a word, but not in the original Cyrillic form. As I do not speak any Russian myself, I do not know whether an apostrophe can appear in a Russian word or not.

So, if you can tell me which positions an apostrophe can appear at in a Russian word, I might be able to adjust my code accordingly.

Thanks,
Peter

gmal.rom said, at 2010-09-18T00:29:00.000Z:

As it can appear only in loanwords, it's safe to assume it can appear within the root, but not in first or last position and not in suffixes or inflections.

PeMiStahl said, at 2010-10-28T10:49:26.000Z:

Hi everyone,

I apologize for posting this so late, but I moved to a new town and started my Master's studies, so I did not find the time to deal with this issue. Now I can provide a fix for the apostrophe problem. It was actually very easy to solve because I just had to add some new replace rules to the transliteration methods. The apostrophe is no longer turned into another character but stays the same during the stemming process.

Best,
Peter

alexrudnick · 2012-09-15T05:40:16Z

Could somebody check what other versions of the Snowball stemmer do? And even better, could we get commentary from a Russian speaker?

kmike · 2012-10-17T15:48:28Z

Russian speaker here: the apostrophe shouldn't be turned into Ь.

<name> is defined as follows here http://snowball.tartarus.org/compiler/snowman.html:

<name>          ::= <letter> [ <letter> || <digit> || _ ]*

The apostrophe is not a Russian <letter>, so it should be handled the same way as a hyphen; the apostrophe is not a diacritic mark in Russian. The confusion comes from the fact that the letter Ь is commonly transliterated as "'", but "'" is also a common punctuation and orthography mark in Russian. So the Cyrillic -> transliterated -> stemmed -> Cyrillic conversion is broken because the transliteration is lossy.
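To make the lossy round trip concrete, here is a minimal sketch. The tiny table and the helper names `to_roman` and `to_cyrillic` are made up for illustration and are not NLTK's actual transliteration code; the only assumption carried over from this thread is that Ь maps to an apostrophe.

```python
# Hypothetical, heavily reduced transliteration table (not NLTK's real one).
CYR_TO_ROMAN = {"д": "d", "и": "i", "в": "v", "у": "u", "а": "a", "р": "r", "ь": "'"}
ROMAN_TO_CYR = {roman: cyr for cyr, roman in CYR_TO_ROMAN.items()}


def to_roman(word):
    # Characters without a table entry (such as a literal apostrophe) pass through unchanged.
    return "".join(CYR_TO_ROMAN.get(ch, ch) for ch in word)


def to_cyrillic(word):
    # On the way back, every apostrophe looks like a transliterated soft sign.
    return "".join(ROMAN_TO_CYR.get(ch, ch) for ch in word)


original = "д'ивуар"
round_tripped = to_cyrillic(to_roman(original))
print(round_tripped)  # дьивуар
```

The literal apostrophe and the transliterated Ь collapse into the same character after the first step, so the reverse mapping cannot tell them apart.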

Snowball stemmers for other languages don't use transliteration and don't suffer from this issue.
I personally think the Russian Snowball stemmer shouldn't use transliteration either.

The issue is that the original algorithm is not for Russian, it is for a "transliterated Russian". This raises a question about the purpose of the NLTK Snowball stemmer. If it is a reference implementation of a Snowball stemmer, then the transliteration burden should probably be moved to the user (the stemmer should accept already transliterated words because that is how the algorithm works) and this issue should be marked as "wontfix". If it is an attempt to write a useful stemmer for Russian, then it should be fixed either by accepting only Cyrillic words and escaping the apostrophe before the transliteration or (better) by rewriting the algorithm to make it work without the transliteration.
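The "escape the apostrophe before the transliteration" option could look roughly like the sketch below. Everything in it is an assumption for illustration: the reduced table, the `PLACEHOLDER` character, and especially `stem_roman`, which is a stand-in that only strips a final "u" and is not the real Russian stemming algorithm.

```python
# Hypothetical, heavily reduced transliteration table (not NLTK's real one).
CYR_TO_ROMAN = {
    "д": "d", "и": "i", "в": "v", "у": "u", "а": "a", "р": "r",
    "к": "k", "о": "o", "н": "n", "ь": "'",
}
ROMAN_TO_CYR = {roman: cyr for cyr, roman in CYR_TO_ROMAN.items()}

# Assumed placeholder: any character that occurs in neither alphabet would do.
PLACEHOLDER = "\x00"


def stem_roman(word):
    # Stand-in for the real stemming rules: only strips a final "u".
    return word[:-1] if word.endswith("u") else word


def stem_with_escaped_apostrophe(word):
    # Hide literal apostrophes so they cannot be confused with a transliterated Ь.
    escaped = word.replace("'", PLACEHOLDER)
    roman = "".join(CYR_TO_ROMAN.get(ch, ch) for ch in escaped)
    stemmed = stem_roman(roman)
    cyrillic = "".join(ROMAN_TO_CYR.get(ch, ch) for ch in stemmed)
    # Restore the literal apostrophes after converting back to Cyrillic.
    return cyrillic.replace(PLACEHOLDER, "'")


print(stem_with_escaped_apostrophe("д'ивуару"))  # д'ивуар
print(stem_with_escaped_apostrophe("конь"))      # конь
```

A real soft sign still survives the round trip, while a literal apostrophe is never seen by the transliteration step at all.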

As far as I can tell, there is no way to have a strict reference Snowball stemmer implementation that properly transliterates "Кот-д'Ивуар" or "д`Артаньян".

Martin-Porter · 2014-04-11T08:37:17Z

Hello, sorry to arrive so late on the thread, but I wanted to set the record straight about my Russian stemmer (part of the Snowball project). It is not for transliterated Russian, but merely uses a transliteration scheme for purposes of exposition. In the script of the stemmer itself, the Cyrillic characters are represented by "macro" equivalents, so ш is represented by {sh}, щ by {shch} and so on. This is done partly to make it easier for English-speaking readers, partly so that the script is entirely in ASCII, and partly so that it is not tied to any particular encoding scheme. (KOI8-R and Unicode versions are both given.) The transliteration scheme used follows the Library of Congress standard, in which the hard sign and soft sign are represented by a double quote and a single quote. But the compiled stemmer does work directly on Cyrillic characters.
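As a rough illustration of the macro idea (not the Snowball compiler itself), the sketch below expands `{name}` macros into Cyrillic characters. The `MACROS` table and `expand_macros` are hypothetical; only `{sh}` and `{shch}` are named in the comment above, and the quote entries are an assumption based on the hard sign and soft sign convention it describes.

```python
import re

# Hypothetical subset of macro definitions, for illustration only.
MACROS = {"sh": "ш", "shch": "щ", "'": "ь", '"': "ъ"}


def expand_macros(text):
    # Replace each {name} with its Cyrillic character; unknown names are left as-is.
    return re.sub(
        r"\{([^{}]+)\}",
        lambda match: MACROS.get(match.group(1), match.group(0)),
        text,
    )


print(expand_macros("{sh}"))    # ш
print(expand_macros("{shch}"))  # щ
print(expand_macros("д'{sh}"))  # д'ш
```

Only the braced form is treated as a macro here, so an apostrophe in the input word is a different thing from the soft sign macro in the script.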

alvations added the pleaseverify label Apr 12, 2017

alvations added the stem/lemma label Oct 13, 2017

stevenbird added inactive and removed inactive labels Aug 12, 2019

stevenbird closed this as completed Aug 13, 2019
