ToAscii: Some Russian texts in Cyrillic (Azbuka) are transliterated incorrectly #203

janbarasek · 2019-10-24T14:28:25Z

Hi,

I speak Russian and I noticed that the transliteration of Cyrillic (Azbuka) to ASCII is sometimes inaccurate.

For example, this call:

dump(Strings::toAscii('для'));

should return dlja rather than the current dla, because the character я is transliterated as ja (Czech) or ya (English).

If my suggestion makes sense, I can implement a better transliteration with support for whole syllables and special cases.

Thanks.

dg · 2019-10-29T12:57:30Z

Do you have Transliterator enabled?

utils/src/Utils/Strings.php

Line 141 in c133e18

if ($transliterator === null && class_exists('Transliterator', false)) {

And what implementation of iconv are you using?

utils/src/Utils/Strings.php

Line 154 in c133e18

if (ICONV_IMPL === 'glibc') {

janbarasek · 2019-10-29T16:44:34Z

@dg Yes, Transliterator was instantiated and ICONV_IMPL is libiconv.

I checked this by adding a dump to the method:

/**
 * Converts UTF-8 string to ASCII.
 */
public static function toAscii(string $s): string
{
	static $transliterator = null;
	if ($transliterator === null && class_exists('Transliterator', false)) {
		$transliterator = \Transliterator::create('Any-Latin; Latin-ASCII');
	}
	dump(['transliterator' => $transliterator, 'iconv' => ICONV_IMPL]);

Thanks.

dg · 2019-10-29T17:38:27Z

Transliterator converts я to â (according to ISO 9 https://cs.wikipedia.org/wiki/ISO_9) and then â to a.
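
To see the two steps described above separately, a minimal sketch could run each ICU transform on its own. It assumes the intl extension is installed; `transliterationStages` is a hypothetical helper name, not part of Nette Utils, and no output is shown because it depends on the installed ICU version.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: runs 'Any-Latin' and 'Latin-ASCII' as two separate
// stages so the intermediate Latin form (with diacritics) can be inspected.
// Returns null when the intl extension or a transform is unavailable.
function transliterationStages(string $s): ?array
{
	if (!class_exists('Transliterator')) {
		return null;
	}
	$toLatin = \Transliterator::create('Any-Latin');
	$toAscii = \Transliterator::create('Latin-ASCII');
	if ($toLatin === null || $toAscii === null) {
		return null;
	}
	$latin = $toLatin->transliterate($s);
	if ($latin === false) {
		return null;
	}
	$ascii = $toAscii->transliterate($latin);
	if ($ascii === false) {
		return null;
	}
	return ['latin' => $latin, 'ascii' => $ascii];
}

var_dump(transliterationStages('для'));
```

Comparing the `latin` and `ascii` entries shows at which stage the information carried by я is dropped.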

A probable solution is to add a strtr replacement for the characters Я, я, Ю, ю and maybe some others.
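
The strtr idea could be sketched roughly as follows. This is only an illustration of the approach, not the code that ended up in the library; `replaceIotatedVowels` is a hypothetical name, and the ya/yu spellings are taken from the English variant discussed in this thread.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: substitute я/ю (and their capitals) before the generic
// transliteration runs, so they never reach the ISO 9 mapping to â/û.
function replaceIotatedVowels(string $s): string
{
	return strtr($s, [
		'Я' => 'Ya',
		'я' => 'ya',
		'Ю' => 'Yu',
		'ю' => 'yu',
	]);
}

// Only я is replaced here; д and л are left for the transliterator.
echo replaceIotatedVowels('для'), "\n"; // prints "длya"
```

Because strtr with an array works on whole keys, the multi-byte UTF-8 letters are matched as units and the rest of the string passes through unchanged.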

janbarasek · 2019-10-29T20:18:38Z

I think we could use the following table (for Czech):

[
	'ё' => 'jo',
	'ъ' => '',
	'ы' => 'y',
	'ь' => '',
	'э' => 'eh',
	'ю' => 'ju',
	'я' => 'ja',
]

For correct behavior, however, it is very important to take the target language into account (it could be passed as a second parameter, string $language = 'en').

And the table for English:

[
	'ё' => 'yo',
	'ъ' => '',
	'ы' => 'y',
	'ь' => '',
	'э' => 'e',
	'ю' => 'yu',
	'я' => 'ya',
]
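
As a rough sketch of how a language-dependent table might be selected, assuming a hypothetical `$language` argument that does not exist in `Strings::toAscii`: the function names below are made up for illustration, only lowercase letters are covered, and the English entry for ё is written as yo here for consistency with yu/ya, which is an assumption.

```php
<?php
declare(strict_types=1);

// Hypothetical lookup of per-language overrides (lowercase letters only).
// Unknown language codes fall back to the English table.
function cyrillicOverrides(string $language = 'en'): array
{
	$common = ['ъ' => '', 'ы' => 'y', 'ь' => ''];
	$perLanguage = [
		'cs' => ['ё' => 'jo', 'э' => 'eh', 'ю' => 'ju', 'я' => 'ja'],
		'en' => ['ё' => 'yo', 'э' => 'e', 'ю' => 'yu', 'я' => 'ya'],
	];
	return $common + ($perLanguage[$language] ?? $perLanguage['en']);
}

// Applies the overrides; all other characters are left for a later
// transliteration step.
function applyCyrillicOverrides(string $s, string $language = 'en'): string
{
	return strtr($s, cyrillicOverrides($language));
}

echo applyCyrillicOverrides('для', 'cs'), "\n"; // prints "длja"
echo applyCyrillicOverrides('для', 'en'), "\n"; // prints "длya"
```

Keeping the shared entries in one array and the language-specific ones in another makes it easy to see where the Czech and English variants actually differ.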

dg closed this as completed in 5716775 Jan 3, 2020

dg added a commit that referenced this issue Jan 3, 2020

Strings::toAscii transliterates я/ю as ya/yu [Closes #203]

88187c7

dg added a commit that referenced this issue Jan 3, 2020

Strings::toAscii transliterates я/ю as ya/yu [Closes #203]

d6cd63d
