Investigate Japanese board #36

Closed
pserwylo opened this issue Jul 22, 2017 · 8 comments

Comments

@pserwylo
Member

pserwylo commented Jul 22, 2017

Pretty much exactly the same as #35 (but that is for Chinese). The Japanese UI translation was one of the first to be contributed, so I want to also have a Japanese version of the board, but I don't have enough knowledge of the language to figure out if it is meaningful or how to go about implementing it.

@Riotism
Contributor

Riotism commented Aug 2, 2017

For a Japanese board, I would imagine each tile being a kana. A kana represents a syllable and, like a letter, is combined with others to make up a "word". There are two ways of writing the same kana: hiragana and katakana. There is already a word game built around linking kana (Shiritori), which uses a similar mechanism. I am not a Japanese speaker, but I definitely see a Japanese board being feasible.

@wichmann

Disclaimer: I do not speak Japanese as my mother tongue. I'm learning Japanese as a hobby and can understand/read it to some degree. But I would be interested to see Lexica with Japanese words.

I agree with @Riotism. IMHO it would be possible and reasonable to use only kana. The two syllabaries (hiragana and katakana) are used for different purposes but are otherwise interchangeable. Therefore you could use just one of them.

To get a Japanese word list, I think you would have to start with a good dictionary and get readings (kana) for all words, because most Japanese words are written with a script called kanji, which are logograms like the Chinese characters. Then you could convert all kana into one of the syllabaries and use the result as the word list.
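As a rough illustration of the "convert all kana into one of the syllabaries" step: the sketch below is not the script from the gist, just a minimal Python example. It assumes the standard Unicode layout, in which katakana ァ..ヶ (U+30A1..U+30F6) sit exactly 0x60 code points above their hiragana counterparts. The name `to_hiragana` is made up for this example.

```python
# Minimal sketch: map katakana to hiragana by code point offset.
# Assumption: only the block U+30A1..U+30F6 is converted; everything
# else (kanji, punctuation, the long vowel mark "ー") passes through.
KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
KANA_OFFSET = 0x60       # distance between the katakana and hiragana blocks


def to_hiragana(text):
    """Return text with every katakana character replaced by hiragana."""
    converted = []
    for char in text:
        code = ord(char)
        if KATAKANA_START <= code <= KATAKANA_END:
            converted.append(chr(code - KANA_OFFSET))
        else:
            converted.append(char)
    return "".join(converted)


print(to_hiragana("カタカナ"))  # かたかな
print(to_hiragana("ラーメン"))  # らーめん (the long vowel mark is untouched)
```

Note that a plain offset like this leaves "ー" in place, so words taken from katakana still need separate handling of long vowels.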

Currently the dictionary most free apps are using is JMdict/EDICT. It can be used under the terms of the Creative Commons licence. I wrote a simple Python script to get the dictionary and create a usable word list from that. Here is the result: https://gist.github.com/wichmann/7912e0f7694ad8fdbd584b94b2e792f0.

@pserwylo
Member Author

Oh, that is great, thanks so much @wichmann! I've taken your word list, and it does indeed work (I'm working on my fork, on a branch called japanese). My first attempt at running my scripts was successful:

  • Build a trie data structure from the word list (./gradlew buildDictionary_jp)
  • Generate a probability distribution to produce boards (./gradlew analyseLanguage_jp). After generating 1000 boards, the average board has about 35 words, the worst board had about 10 words, and the best board had about 75 words. I'll keep running the algorithm and see if I can improve those statistics and get some better boards.
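For readers unfamiliar with the first step, here is a minimal sketch of a trie over kana words. It is written in Python purely for illustration and is not Lexica's actual implementation (which the Gradle tasks above build); the class and method names are hypothetical. The point is that a trie answers "is this a word?" and "can any word start with this prefix?" in time proportional to the length of the query, which is what a board solver needs when walking adjacent tiles.

```python
class TrieNode:
    """One node per character; is_word marks the end of a complete word."""

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.is_word = True

    def _find(self, text):
        """Return the node reached by following text, or None."""
        node = self.root
        for char in text:
            node = node.children.get(char)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._find(prefix) is not None


trie = Trie(["ねこ", "ねこぜ", "いぬ"])
print(trie.contains("ねこ"))    # True
print(trie.contains("ね"))      # False: only a prefix, not a word
print(trie.has_prefix("ねこ"))  # True
```

Because the nodes are keyed by single characters, the same structure works for kana as for Latin letters without modification.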

I will try and prepare a release with it to get further feedback, but before that I'll quickly test:

  • That the game can actually be played (e.g. it doesn't crash when using Kanji)
  • Will document the process properly, including your script and the JMdict/EDICT stuff.

@wichmann - Do you mind if I include your script in the ./tools directory? If that is okay, what license may I use? Preferably the GPLv3+ license, but it is of course your choice.

Also, would you be able to provide any feedback on the letter scores I've taken from Wikipedia and added here?

@pserwylo
Member Author

I've taken the "small letters" and put each one next to what looked (to my naive English-reading eyes) like the larger version of the same letter, giving them the same score:

pserwylo@0ad1bbd

Commit message above explains further.
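The idea of scoring a small kana like its full-size counterpart could be sketched as below. This is only an illustration in Python, not the code from the commit: the mapping lists the common small hiragana, and both `tile_score` and the score values in the example are hypothetical.

```python
# Small hiragana mapped to their full-size forms.
SMALL_TO_FULL = {
    "ぁ": "あ", "ぃ": "い", "ぅ": "う", "ぇ": "え", "ぉ": "お",
    "っ": "つ", "ゃ": "や", "ゅ": "ゆ", "ょ": "よ", "ゎ": "わ",
}


def tile_score(kana, scores):
    """Look up a score, treating a small kana as its full-size form."""
    return scores[SMALL_TO_FULL.get(kana, kana)]


# Hypothetical score table, for demonstration only.
example_scores = {"や": 3, "つ": 2, "あ": 1}
print(tile_score("ゃ", example_scores))  # 3, same as や
print(tile_score("っ", example_scores))  # 2, same as つ
```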

@pserwylo
Member Author

Now I've dealt with the diacritics in this commit:

pserwylo@2bd305a
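One possible way to relate a kana with a diacritic to its base kana, shown here only as a sketch and not as what the commit does, is Unicode normalization: under NFD, a voiced kana decomposes into its base kana followed by a combining voiced mark (U+3099) or semi-voiced mark (U+309A). The function name `strip_voicing_marks` is made up for this example.

```python
import unicodedata

# Combining dakuten and handakuten, as produced by NFD decomposition.
VOICING_MARKS = {"\u3099", "\u309A"}


def strip_voicing_marks(text):
    """Return text with dakuten and handakuten removed from kana."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if c not in VOICING_MARKS)
    return unicodedata.normalize("NFC", stripped)


print(strip_voicing_marks("がぎ"))  # かき
print(strip_voicing_marks("ぱ"))    # は
```

A score table could then be keyed on the base kana only, with the voiced forms looked up through this function.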

Only a few more characters left:

Any feedback for these?

@pserwylo
Member Author

FYI, I'm guessing that the idea in #71 will also be appropriate here, based on the Wikipedia article about Scrabble letters and how they seem to be somewhat normalized (with regard to diacritics). If so, it will probably have to wait until I or someone else is able to implement the necessary changes to the guts of Lexica and how it stores word lists internally.

@wichmann

@pserwylo - Thanks for all your work. Of course you can include my script. As for the license, GPLv3+ is fine by me.

As for the characters left:

"ゐ" and "ゑ" are obsolete hiragana which are not used today, only in old texts. "〜" represents a Japanese tilde; IMHO it is never used in words, only for ranges or special purposes. All words with these three characters can be eliminated from the word list, as there are only a few of those.
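Dropping those words could look roughly like the following Python sketch. The function name is hypothetical, and including the fullwidth tilde (U+FF5E) alongside the wave dash (U+301C) is an assumption, since the two look alike and either may appear in dictionary data.

```python
# Characters whose presence disqualifies a word:
# ゐ, ゑ, the wave dash, and (as an assumption) the fullwidth tilde.
EXCLUDED_CHARS = set("ゐゑ\u301C\uFF5E")


def filter_word_list(words):
    """Keep only the words that contain none of the excluded characters."""
    return [word for word in words if not EXCLUDED_CHARS & set(word)]


print(filter_word_list(["ねこ", "ゐど", "こゑ", "あ\u301C"]))  # ['ねこ']
```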

"を" is used as a grammatical marker ("particle") and in loan words, but usually not in Japanese dictionary words. It is mostly present in the word list because the list contains phrases where it serves as a particle.

"が" and "ぎ" are just versions of "か" and "き" with diacritics.

"ー" is used as a symbol for a long vowel, almost never used with hiragana, only with katakana. My script tries to convert all words to hiragana and it mistakenly leaves these characters in. In hiragana the symbol should be replaced by the vowel which it represents. Maybe there is a better way to make the conversion in the script?!
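One possible approach, offered only as a sketch and not as a fix to the actual script: after converting to hiragana, replace each "ー" with the vowel of the kana that precedes it. The table below groups hiragana by their vowel; the names `VOWEL_ROWS`, `VOWEL_OF` and `expand_long_vowels` are made up for this example. It assumes a simple "repeat the same vowel" rule and does not try to reproduce the おう/えい spellings used in native words.

```python
# Hiragana grouped by the vowel they end in (key = the bare vowel).
VOWEL_ROWS = {
    "あ": "あかさたなはまやらわがざだばぱぁゃゎ",
    "い": "いきしちにひみりぎじぢびぴぃ",
    "う": "うくすつぬふむゆるぐずづぶぷぅゅゔ",
    "え": "えけせてねへめれげぜでべぺぇ",
    "お": "おこそとのほもよろをごぞどぼぽぉょ",
}
VOWEL_OF = {kana: vowel for vowel, row in VOWEL_ROWS.items() for kana in row}


def expand_long_vowels(word):
    """Replace each long vowel mark with the vowel of the preceding kana."""
    result = []
    for char in word:
        if char == "ー" and result and result[-1] in VOWEL_OF:
            result.append(VOWEL_OF[result[-1]])
        else:
            # Leave the mark as-is if there is no usable preceding kana.
            result.append(char)
    return "".join(result)


print(expand_long_vowels("らーめん"))  # らあめん
print(expand_long_vowels("こーひー"))  # こおひい
```

A mark that follows "ん", "っ", or nothing at all has no vowel to copy, so it is left in place; such words could then be dropped from the list.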

@pserwylo
Member Author

Closing as a Japanese dictionary has existed for some time. If there are any issues with it, we can always open new issues.


3 participants