@manuelfuenmayor manuelfuenmayor commented Nov 29, 2019

@ronaldtse

Ping @tsega for this

tsega commented Sep 12, 2020

@ronaldtse there are two reasons the tests are failing here:

  1. As with alalc-amh-ethi-latn-1997, a sixth-order character can sometimes be transliterated as the bare consonant, e.g. '\u1215' (ሕ) can be either hi̠ or h. The notes point this out as well:
  • (A) The vowel of the sixth order (i̠) is eliminated in spelling except when the actual pronunciation requires it (e.g. not Me̠ni̠gi̠si̠ti̠ but Me̠ngi̠st).

This I have corrected by adding the consonant only as the second entry in the yaml;

'\u1215' :       # ሕ
      - 'hi̠'
      - 'h'
  2. The next issue is capitalization; in Amharic there is no concept of capitalization. The tests are expecting the first letter of every word to be a capital letter. This as per my knowledge is wrong.

How do I fix the second issue?

@ronaldtse

@tsega thanks for this.

> This I have corrected by adding the consonant only as the second entry in the yaml;

Should we instead make h the first choice, since the i̠ is used only when it is actually pronounced? Which situation is more common? We should use the more common one as the default.
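As a rough illustration of the ordering question (a sketch in Python, not the project's actual code; `CHAR_MAP`, `transliterate`, and `candidates` are hypothetical names), the following assumes that the first candidate listed for a character is the default output and that the remaining candidates are accepted alternatives:

```python
# Hypothetical mapping that mirrors the YAML entry under discussion,
# with the bare consonant moved to first place.
# Assumption: the first candidate in each list is the default output.
CHAR_MAP = {
    "\u1215": ["h", "hi̠"],  # ሕ: bare consonant first, vowelled form second
}


def transliterate(text, char_map=CHAR_MAP):
    """Replace each mapped character with its first (default) candidate.

    Characters without an entry are passed through unchanged.
    """
    return "".join(char_map.get(ch, [ch])[0] for ch in text)


def candidates(ch, char_map=CHAR_MAP):
    """Return every accepted transliteration for one character."""
    return list(char_map.get(ch, [ch]))
```

Under this assumption, swapping the two list entries is all it takes to change the default, which is why the question of which form is more common matters.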

> The next issue is capitalization; in Amharic there is no concept of capitalization. The tests are expecting the first letter of every word to be a capital letter. This as per my knowledge is wrong.

What do locations at unstats.un.org do for capitalization? We probably should follow them, e.g. if they are all downcased we should also downcase.

tsega commented Sep 13, 2020

@ronaldtse location names from unstats.un.org are all capitalized when transliterated into English since they are names of places. Is the sample data only names of locations? If so, then the tests are consistent in capitalizing the results and I would need to change my test data.

However, I was under the impression that the test data can come from anywhere and just has to be mapped correctly.

@ronaldtse

We will need to use machine learning to decide whether a word is a name (place, people) or not, so at this point we can just keep all examples in lower case. If that's fine we can merge and consider this done.
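One way to keep the examples in lower case for now, sketched in Python under the assumption that test cases are simple (source, expected) pairs; `lowercase_expected` is a hypothetical helper, not part of the project:

```python
def lowercase_expected(cases):
    """Return the test cases with every expected transliteration lower-cased.

    `cases` is assumed to be a list of (source, expected) string pairs.
    The source text is left untouched because Amharic has no letter case.
    """
    return [(source, expected.lower()) for source, expected in cases]


# Example pair; the source string is a placeholder, not real test data.
cases = [("<amharic source>", "Me̠ngi̠st")]
print(lowercase_expected(cases))
```

Deciding which words are names, and so should be capitalized, stays out of scope in this sketch.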

@ronaldtse

Completed in #414

@ronaldtse ronaldtse closed this Sep 16, 2020