Dictionaries

Adding new languages

Brief introduction

This is not meant to be comprehensive, but it should at least touch on the main aspects of adding a language.

  • Add a dictionary file - ./add-lang.sh de DE (will require the relevant GNU Aspell library to be installed)
  • Attribute dictionary source - Add a relevant assets/dictionaries/dictionary.*.LICENSE file
  • Make Gradle aware of the language - Add "de_DE", to the languages array in build.gradle
  • Generate a trie representation of the dictionary - ./gradlew buildDictionary_de
  • Subclass Language - Add it to the libraries/trie library
  • Tell Lexica about your Language class - Add an entry to Language#getAllLanguages()
  • Generate random letter distributions - ./gradlew analyseLanguage_de
  • Add language name (in English) - Edit app/src/main/res/values/strings.xml, adding pref_dict_LANG and optionally pref_dict_LANG_description
  • Add Scrabble scores - Edit your Language subclass, adding the letter scores listed in the Wikipedia article "Scrabble letter distributions"
  • Run tests - ./gradlew check && ./gradlew connectedCheck

Obtaining a dictionary

This project uses the GNU Aspell project to obtain dictionaries. The script add-lang.sh in the root directory of this project will:

  • Dump all words from a specific dictionary (e.g. en_GB or de_DE).
  • Omit words shorter than 3 characters or longer than 9 characters.

Although it is technically possible for words longer than 9 letters to be found on Lexica boards, in practice this is so unlikely that including them causes problems when generating new random board generators. The reason is that it is hard to measure how successful a board generator is if the vast majority of words in a language are very long (e.g. in German).
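As a rough illustration of the length filter described above, here is a minimal Python sketch. It is not the project's script: the function name `filter_words` and the constants are hypothetical, and the real script works on an Aspell dump rather than an in-memory list.

```python
MIN_LEN = 3  # words shorter than this are omitted
MAX_LEN = 9  # words longer than this are omitted


def filter_words(words, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep only words whose length lies within [min_len, max_len].

    Hypothetical helper for illustration; not part of Lexica.
    """
    kept = []
    for word in words:
        word = word.strip()
        if min_len <= len(word) <= max_len:
            kept.append(word)
    return kept


if __name__ == "__main__":
    sample = ["ab", "cat", "house", "Donaudampfschiff", "wonderful"]
    print(filter_words(sample))  # ['cat', 'house', 'wonderful']
```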

Other dictionaries have also been included, such as the Japanese dictionary. It comes from the JMdict project and is used under the CC-BY-SA-3.0 license. This was made possible by the work of @wichmann.

Anatomy of a random board generator in Lexica

Once a dictionary is available, the next trick is to create a set of probability distributions to be used by random board generators, so that the generated boards tend to have nice properties (i.e. lots of words).

The format used by these probability distributions is inherited from the original Lexic project:

a 12 3 2 1 1
b 5 3 1
c 3 1
d 8 4 1
e 24 12 3 1 1
...

Boards are generated by consulting this probability distribution and performing a weighted random choice of letters. This is done by looking at the first number remaining in each row and performing roulette wheel selection based on these values. The higher the value, the more likely a letter will be chosen.

Once a letter is chosen (e.g. d in the example distribution above), the first number in that row is removed (leaving d 4 1 in this example). You will note that each successive number is lower, meaning that a letter is less likely to be chosen again than it was to be chosen the first time.

Once a letter has had all of its numbers exhausted, that letter will not appear on the board any more. Thus, in the example above, c 3 1 means that the letter c can be chosen at most twice per board.
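The selection procedure described above can be sketched in Python as follows. This is only an illustration of the idea, not Lexica's actual implementation (which is written in Java): the names `parse_distribution` and `generate_board` are hypothetical, and the sample contains just the five rows shown in the example.

```python
import random

# Only the rows shown in the example above; a real distribution has more.
SAMPLE = """\
a 12 3 2 1 1
b 5 3 1
c 3 1
d 8 4 1
e 24 12 3 1 1
"""


def parse_distribution(text):
    """Parse lines such as 'd 8 4 1' into {'d': [8, 4, 1]}."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [int(n) for n in parts[1:]]
    return table


def generate_board(table, size, rng=random):
    """Pick `size` letters by repeated roulette wheel selection."""
    # Copy the rows so the caller's table is left untouched.
    remaining = {letter: list(row) for letter, row in table.items()}
    board = []
    for _ in range(size):
        # Letters whose numbers are exhausted can no longer be chosen.
        letters = [letter for letter, row in remaining.items() if row]
        if not letters:
            raise ValueError("distribution exhausted before the board was full")
        # The first remaining number of each row is that letter's weight.
        weights = [remaining[letter][0] for letter in letters]
        chosen = rng.choices(letters, weights=weights, k=1)[0]
        remaining[chosen].pop(0)  # e.g. 'd 8 4 1' becomes 'd 4 1'
        board.append(chosen)
    return board


if __name__ == "__main__":
    table = parse_distribution(SAMPLE)
    print(generate_board(table, 16))
```

Passing a seeded `random.Random` instance as `rng` makes the output repeatable, which is convenient when comparing distributions.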

Generating random board generators

The original source code of Lexic included hard-coded versions of these letter frequencies, without a way to generate them. This version of Lexica includes a script to count the number of times each letter occurs.

To run the algorithm: ./gradlew analyseLanguage_LANG, replacing LANG with the language code (e.g. ./gradlew analyseLanguage_de).
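This README does not spell out how the counts are turned into rows of the distribution format. Purely as an illustration, the Python sketch below assumes that the k-th number in a row is the number of words containing that letter at least k times, which naturally gives decreasing numbers. This is an assumption, not necessarily what the Gradle task does, and the function names are hypothetical.

```python
from collections import Counter


def count_letter_occurrences(words):
    """Return {letter: [n1, n2, ...]}, where nk is the number of words
    that contain the letter at least k times (an assumed scheme)."""
    table = {}
    for word in words:
        for letter, times in Counter(word.lower()).items():
            row = table.setdefault(letter, [])
            # Grow the row so it has one slot per repeat count seen so far.
            while len(row) < times:
                row.append(0)
            # A word with `times` occurrences counts towards slots 1..times.
            for k in range(times):
                row[k] += 1
    return table


def format_rows(table):
    """Render the table in the 'letter n1 n2 ...' format used above."""
    return "\n".join(
        f"{letter} {' '.join(str(n) for n in row)}"
        for letter, row in sorted(table.items())
    )


if __name__ == "__main__":
    print(format_rows(count_letter_occurrences(["bee", "bed", "add"])))
```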