ToAscii: Some russian texts in Cyrillic (Azbuka) are wrong rewrited #203

Closed
janbarasek opened this issue Oct 24, 2019 · 4 comments

Comments

janbarasek commented Oct 24, 2019

Hi,

I speak Russian, and I noticed that the transliteration of Cyrillic (azbuka) to ASCII is sometimes inaccurate.

For example:

dump(Strings::toAscii('для'));

should return dlja rather than the current dla, because the character я is transliterated as ja (Czech) or ya (English).

Sample:

[Screenshot 2019-10-24 at 16 26 54]

If this suggestion makes sense, I can implement a better transliteration with support for whole syllables and special cases.

Thanks.

dg commented Oct 29, 2019

Do you have Transliterator enabled?

if ($transliterator === null && class_exists('Transliterator', false)) {
And which implementation of iconv are you using?
if (ICONV_IMPL === 'glibc') {

janbarasek commented Oct 29, 2019

@dg Yes, Transliterator was instantiated and ICONV_IMPL is libiconv.

I checked it with:

/**
 * Converts UTF-8 string to ASCII.
 */
public static function toAscii(string $s): string
{
	static $transliterator = null;
	if ($transliterator === null && class_exists('Transliterator', false)) {
		$transliterator = \Transliterator::create('Any-Latin; Latin-ASCII');
	}
	// Temporary debug output added to inspect the environment.
	dump(['transliterator' => $transliterator, 'iconv' => ICONV_IMPL]);
	// ... the rest of the method is unchanged

Dumped:

[Screenshot 2019-10-29 at 17 43 49]

Thanks.

dg commented Oct 29, 2019

Transliterator converts я to â (according to ISO 9 https://cs.wikipedia.org/wiki/ISO_9) and then â to a.

A probable solution is to add a strtr call for the characters Я я Ю ю, and maybe some others.
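
A minimal sketch of that idea, assuming the replacement runs before the string is handed to Transliterator. The helper name preTransliterateCyrillic is hypothetical and not part of the library, and the Czech-style digraphs (ja, ju) are only one possible choice of target spelling:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): replaces a few Cyrillic
 * letters with Latin digraphs before any further transliteration happens.
 */
function preTransliterateCyrillic(string $s): string
{
	// strtr() with an array compares byte sequences, so multibyte UTF-8 keys
	// work without any mbstring functions.
	return strtr($s, [
		'Я' => 'Ja', 'я' => 'ja',
		'Ю' => 'Ju', 'ю' => 'ju',
	]);
}

// Only я is rewritten here; д and л are left for the transliterator.
echo preTransliterateCyrillic('для'), "\n";
```

Because the digraphs are already plain ASCII, a later 'Any-Latin; Latin-ASCII' pass should leave them untouched and only convert the remaining Cyrillic letters.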

janbarasek commented Oct 29, 2019

I think we can use the following table (for Czech):

[
	'ё' => 'jo',
	'ъ' => '',
	'ы' => 'y',
	'ь' => '',
	'э' => 'eh',
	'ю' => 'ju',
	'я' => 'ja',
]

For correct behavior, however, it is very important to know the target language (which could be passed as a second parameter, string $language = 'en').

And the table for English:

[
	'ё' => 'jo',
	'ъ' => '',
	'ы' => 'y',
	'ь' => '',
	'э' => 'e',
	'ю' => 'yu',
	'я' => 'ya',
]
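
The two tables could be combined roughly as follows. This is only a sketch: the function name cyrillicSpecialCases and the language codes 'cs' and 'en' are assumptions made for illustration, and the $language parameter is the one proposed in this comment, not an existing part of toAscii. The table entries are copied from the comment unchanged.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: picks a replacement table by target language.
 * Neither this function nor the $language parameter exists in the library.
 */
function cyrillicSpecialCases(string $s, string $language = 'en'): string
{
	// Entries that are identical in both tables above.
	$common = ['ё' => 'jo', 'ъ' => '', 'ы' => 'y', 'ь' => ''];

	// Entries that differ between the target languages.
	$perLanguage = [
		'cs' => ['э' => 'eh', 'ю' => 'ju', 'я' => 'ja'],
		'en' => ['э' => 'e', 'ю' => 'yu', 'я' => 'ya'],
	];

	if (!isset($perLanguage[$language])) {
		throw new InvalidArgumentException("Unsupported language '$language'.");
	}

	// The array union keeps all keys because the two arrays do not overlap.
	return strtr($s, $common + $perLanguage[$language]);
}

echo cyrillicSpecialCases('для', 'cs'), "\n"; // я becomes "ja"
echo cyrillicSpecialCases('для', 'en'), "\n"; // я becomes "ya"
```

Letters outside the tables are left as they are, so the result would still need to go through the regular transliteration step.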

dg closed this in 5716775 Jan 3, 2020
dg added a commit that referenced this issue Jan 3, 2020
dg added a commit that referenced this issue Jan 3, 2020