Number:  WLIP-0002
Title:   Word list file format
Type:    Standard
Status:  Draft
Authors: Paweł Broda <pwl.broda@gmail.com>
Created: 2019-10-02

Abstract

WLIP-0002 describes the word list file format: both the file content and the file name.

Motivation

This standard makes it easy to move backups between wallets. Additionally, standardizing the file name as described below allows more than one list to exist for a given language (although this feature should be used only in exceptional cases).

Body

Both the file name and the file content are encoded in UTF-8.

File content format

The very first line contains the character set definition enclosed in square brackets []. The remaining lines hold the actual words, one word per line. For a list of 2048 words, the words occupy lines 2-2049.

The character set definition has a strict format, illustrated here with the Polish word list.
English character set: abcdefghijklmnopqrstuvwxyz
Polish character set: aąbcćdeęfghijklłmnńoóprsśtuwyzźż

The Polish character set can therefore be defined as:

  • BASE_CHARACTER_SET: english
  • REDUNDANT_CHARACTER_SET: remove 'qvx' from the English alphabet
  • EXTRA_CHARACTER_SET: add ą,ć,ę,ł,ń,ó,ś,ź,ż to the English alphabet; because these letters are not present in the base character set, a mapping must be provided for each of them, i.e. ą:a,ć:c,ę:e,ł:l,ń:n,ó:o,ś:s,ź:z,ż:z

Combining all of the above gives the first line: [english-qvx+ą:a+ć:c+ę:e+ł:l+ń:n+ó:o+ś:s+ź:z+ż:z]
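
The structure of the first line can be illustrated with a short Python sketch. It is only an illustration, not part of the standard: the function names and the `BASE_CHARACTER_SETS` table are hypothetical, and "english" is the only base character set named in this document.

```python
# Hypothetical lookup table of base character sets.
# Only "english" is named in this document.
BASE_CHARACTER_SETS = {"english": "abcdefghijklmnopqrstuvwxyz"}


def parse_charset_definition(line):
    """Parse a first line such as '[english-qvx+ą:a]'.

    Returns (base_name, removed, mappings), where mappings maps each
    extra character to its counterpart in the base character set.
    """
    if not (line.startswith("[") and line.endswith("]")):
        raise ValueError("definition must be enclosed in square brackets")
    # '+' (U+002B) introduces each EXTRA_CHARACTER_SET mapping.
    head, *extras = line[1:-1].split("+")
    # '-' (U+002D) begins the REDUNDANT_CHARACTER_SET, if there is one.
    base_name, _, removed = head.partition("-")
    if base_name not in BASE_CHARACTER_SETS:
        raise ValueError(f"unknown base character set: {base_name!r}")
    mappings = {}
    for item in extras:
        # ':' (U+003A) separates an extra character from its mapping.
        extra, sep, target = item.partition(":")
        if sep != ":" or not extra or not target:
            raise ValueError(f"malformed mapping: {item!r}")
        mappings[extra] = target
    return base_name, removed, mappings


def allowed_characters(line):
    """Return the set of characters that words in the list may use."""
    base_name, removed, mappings = parse_charset_definition(line)
    return (set(BASE_CHARACTER_SETS[base_name]) - set(removed)) | set(mappings)
```

Applied to the Polish first line above, `allowed_characters` yields the 32 letters of the Polish character set listed earlier.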

No empty lines are allowed in the file.
The line separator is U+000A (i.e. LF, known as line feed or simply '\n').
REDUNDANT_CHARACTER_SET begins with U+002D (i.e. '-').
EXTRA_CHARACTER_SET defines its mappings using U+002B and U+003A (i.e. '+' and ':').

See the References section below for an example of a properly formatted list (including one with mappings).
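
The content rules above can be checked mechanically. The following Python sketch is one possible checker; the function name and the `expected_words` parameter are hypothetical. It assumes that a single trailing LF after the last word is acceptable, which this document does not state either way.

```python
def validate_word_list(data, expected_words=None):
    """Return a list of problems found in a word list file (empty if none).

    `data` is the raw file content as bytes. `expected_words` is optional,
    for example 2048.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    problems = []
    if "\r" in text:
        problems.append("line separator must be LF (U+000A), found CR")
    # Assumption: one trailing LF after the last word is tolerated.
    body = text[:-1] if text.endswith("\n") else text
    lines = body.split("\n")
    header, words = lines[0], lines[1:]
    if not (header.startswith("[") and header.endswith("]")):
        problems.append("line 1 must be a character set definition in []")
    # Words start on line 2, right after the character set definition.
    for number, word in enumerate(words, start=2):
        if word == "":
            problems.append(f"line {number} is empty")
    if expected_words is not None and len(words) != expected_words:
        problems.append(f"expected {expected_words} words, found {len(words)}")
    return problems
```

A wallet could run such a check before accepting a list, passing `expected_words=2048` when it expects a list of that size.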

File name format

LANGUAGE-HASH-OPTIONAL_DESCRIPTION

where:
LANGUAGE: the language name in its native spelling, for example 'English' (without the quotation marks)

HASH: exactly 8 characters that uniquely identify a list. Compute the SHA-3 hash of the entire word list file and keep the 8 least significant characters of the result. Because the hash covers the entire file, every line, including the character set definition with its mappings, contributes to it. Hash characters are drawn from the set '0123456789abcdef'.

OPTIONAL_DESCRIPTION: an optional description, written in the language of the list and starting with a lowercase letter, for example 'obsolete'. Use it only in exceptional cases, as the aim is to have one official list per language. In most cases this field will remain unused.

Additionally, the file name is at most 255 bytes long (the 256th byte is optionally left for the end-of-string character used in some programming languages).
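
The file name rules can be sketched in Python as follows. This sketch rests on two assumptions that this document does not settle: it uses SHA3-256 (the text says only "SHA-3"), and it reads "least significant 8 characters" as the last 8 characters of the hexadecimal digest. The function name is hypothetical.

```python
import hashlib


def word_list_file_name(language, data, description=""):
    """Build LANGUAGE-HASH-OPTIONAL_DESCRIPTION for a word list file.

    `data` is the entire file content as bytes, first line included.
    """
    # Assumption: SHA3-256; the document does not name a SHA-3 variant.
    digest = hashlib.sha3_256(data).hexdigest()
    # Assumption: the "least significant" characters are the last 8
    # characters of the lowercase hexadecimal digest.
    file_hash = digest[-8:]
    parts = [language, file_hash]
    if description:
        parts.append(description)
    name = "-".join(parts)
    # The limit is expressed in bytes, so measure the UTF-8 encoding.
    if len(name.encode("utf-8")) > 255:
        raise ValueError("file name exceeds 255 bytes")
    return name
```

Calling `word_list_file_name("Polski", data)` with the bytes of a Polish list returns a name of the form `Polski-` followed by 8 hexadecimal characters; passing `description="obsolete"` appends `-obsolete`.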

See the References section below for properly named lists.

References

Examples:

   English-missing_hash
   English-a1d03317-obsolete
   Polski-missing_hash

where 'Polski' means 'Polish' in the Polish language.