The `Lua-UCA` package


This package adds support for the Unicode Collation Algorithm to Lua 5.3 and later. It is mainly intended for use with Lua\TeX\ and a working \TeX\ distribution, but it also works as a standalone Lua module. In that case, you need to install the required Lua-uni-algos package by hand.

Usage

To sort a table using Czech collation rules:

kpse.set_program_name "luatex"
local ducet = require "lua-uca.lua-uca-ducet"
local collator = require "lua-uca.lua-uca-collator"
local languages = require "lua-uca.lua-uca-languages"

local collator_obj = collator.new(ducet)
-- load Czech rules
collator_obj = languages.cs(collator_obj)

local t = {"cihla",  "chochol", "hudba", "jasan", "čáp"}

table.sort(t, function(a,b) 
  return collator_obj:compare_strings(a,b) 
end)

for _, v in ipairs(t) do
  print(v)
end

The output:

cihla
čáp
hudba
chochol
jasan

More samples of the library usage can be found in the source repository of this package on GitHub. % See HACKING.md file in the repo for more information.

Use with Xindex processor

Xindex is a flexible index processor written in Lua by Herbert Voß. It has built-in Lua-UCA support starting with version 0.23. The support can be requested using the -u option:

 xindex -u -l no -c norsk filename.idx

Use with LuaJIT

The default version of lua-uca-ducet fails with LuaJIT. You can use an alternative version of this file, lua-uca-ducet-jit, instead.
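A minimal sketch of choosing between the two files at run time. The helper name `ducet_module_name` is hypothetical, and the module path `lua-uca.lua-uca-ducet-jit` is an assumption that follows the naming pattern of the other modules in this document; LuaJIT is detected through the global `jit` table that it defines.

```lua
-- Return the name of the DUCET module suited to the running interpreter.
-- NOTE: "lua-uca.lua-uca-ducet-jit" is an assumed module path, derived
-- from the file name lua-uca-ducet-jit mentioned above.
local function ducet_module_name(is_luajit)
  if is_luajit then
    return "lua-uca.lua-uca-ducet-jit"
  end
  return "lua-uca.lua-uca-ducet"
end

-- LuaJIT defines a global table named `jit`; PUC-Rio Lua leaves it nil.
local running_on_luajit = type(jit) == "table"
local module_name = ducet_module_name(running_on_luajit)
print(module_name)
```

The returned name can then be passed to `require` in place of the hard-coded `"lua-uca.lua-uca-ducet"` string used in the other examples.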

Change sorting rules

The simplest way to change the default sorting order is to use the tailor_string method of the collator_obj object. It updates the collator object using a special syntax, which is a subset of the format used by the Unicode Locale Data Markup Language.

collator_obj:tailor_string "&a<b"

Full example with Czech rules:

kpse.set_program_name "luatex"
local ducet = require "lua-uca.lua-uca-ducet"
local collator = require "lua-uca.lua-uca-collator"
local languages = require "lua-uca.lua-uca-languages"

local collator_obj = collator.new(ducet)
local tailoring = function(s) collator_obj:tailor_string(s) end

tailoring "&c<č<<<Č"
tailoring "&h<ch<<<cH<<<Ch<<<CH"
tailoring "&R<ř<<<Ř"
tailoring "&s<š<<<Š"
tailoring "&z<ž<<<Ž"

Note that the letter sequences ch, Ch, cH and CH will be sorted after h.

It is also possible to expand a letter to multiple letters, as in this example for DIN 2:

tailoring "&Ö=Oe"
tailoring "&ö=oe"

Some languages, like Norwegian, sort uppercase letters before lowercase ones. This can be enabled using the collator_obj:uppercase_first() method:

local tailoring = function(s) collator_obj:tailor_string(s) end
collator_obj:uppercase_first()
tailoring("&D<<đ<<<Đ<<ð<<<Ð")
tailoring("&th<<<þ")
tailoring("&TH<<<Þ")
tailoring("&Y<<ü<<<Ü<<ű<<<Ű")
tailoring("&ǀ<æ<<<Æ<<ä<<<Ä<ø<<<Ø<<ö<<<Ö<<ő<<<Ő<å<<<Å<<<aa<<<Aa<<<AA")
tailoring("&oe<<œ<<<Œ")

Some languages, for example Canadian French, sort accents backwards, like gêne < gëne < gêné. In this case, you can set the collator_obj.accents_backward variable to true.
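A small sketch of how the flag could be combined with sorting. The helper name `sort_accents_backward` is hypothetical; it uses only the `accents_backward` field and the `compare_strings` method described in this document, and it expects a collator object built as in the Usage section.

```lua
-- Sort `words` in place with backward accent ordering enabled.
-- `collator_obj` is assumed to be a Lua-UCA collator created with
-- collator.new(ducet); `sort_accents_backward` is a hypothetical helper.
local function sort_accents_backward(collator_obj, words)
  -- switch the collator to backward accent comparison
  collator_obj.accents_backward = true
  table.sort(words, function(a, b)
    return collator_obj:compare_strings(a, b)
  end)
  return words
end
```

Note that the flag stays set on the collator object after the call, so a separate collator is probably the safer choice when other languages are sorted in the same program.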

% More information on a new language support is in the HACKING.md document in the Lua-UCA GitHub repo.

Script reordering

Many languages sort other scripts after the script that the language itself uses. Because Latin-based scripts are sorted first by default, scripts need to be reordered in such cases.

The collator_obj:reorder function takes a table of scripts that need to be reordered. For example, Cyrillic can be sorted before Latin using:

collator_obj:reorder {"cyrillic"}

In German or Czech, numbers should be sorted after all other characters. This can be done using:

collator_obj:reorder {"others", "digits"}

The special keyword "others" means that the scripts that follow it in the table will be sorted at the very end.

Headers for index entries

In some languages, for example Czech, multiple letters may count as one character. This is the case for the Czech letter ch.

Lua-UCA provides the collator_obj:get_lowest_char() function. It returns a table of Unicode codepoints for the correct first character in a given language, which can be used, for example, as an index header.

local czech = collator.new(ducet)
languages.cs(czech)
-- first we need to convert string to codepoints
local codepoints = czech:string_to_codepoints("Chrobák")
local first_char = czech:get_lowest_char(codepoints)
-- it should print letters "ch"
print(utf8.char(table.unpack(first_char)))
-- you can also specify position of the character
local second_char = czech:get_lowest_char(codepoints, 2)
-- it should print letter "h", as it is second codepoint in the string
print(utf8.char(table.unpack(second_char)))
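The header lookup above could be used to split a sorted list of index entries into sections. The following is only a sketch: `group_by_header` and its `header_of` callback are hypothetical names, not part of Lua-UCA. The callback is injected so that the grouping logic does not depend on a particular collator.

```lua
-- Group already sorted entries under their index headers.
-- `header_of` is a hypothetical callback that returns the header string
-- for one entry. With Lua-UCA it could be written as:
--   local function header_of(entry)
--     local codepoints = czech:string_to_codepoints(entry)
--     return utf8.char(table.unpack(czech:get_lowest_char(codepoints)))
--   end
local function group_by_header(entries, header_of)
  local groups = {}
  for _, entry in ipairs(entries) do
    local header = header_of(entry)
    local last = groups[#groups]
    -- start a new group whenever the header changes
    if not last or last.header ~= header then
      last = { header = header, entries = {} }
      groups[#groups + 1] = last
    end
    last.entries[#last.entries + 1] = entry
  end
  return groups
end
```

Each element of the returned table holds a `header` string and the list of `entries` that belong to it, in the order in which they were passed in.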

Unicode normalization

By default, no Unicode normalization is used internally. You can explicitly request normalization, which uses the Uninormalize package. Note that it will significantly increase the processing time.

There are two normalization methods, NFC and NFD. They can be enabled using the collation.use_nfc() and collation.use_nfd() functions.
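To illustrate why normalization can matter, the following sketch uses only the standard `utf8` library of Lua 5.3 and no Lua-UCA calls. It builds the letter é in its precomposed form (as in NFC) and its decomposed form (as in NFD); both render the same, but they are different codepoint sequences.

```lua
-- U+00E9 LATIN SMALL LETTER E WITH ACUTE (precomposed form)
local composed = utf8.char(0xE9)
-- "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
local decomposed = "e" .. utf8.char(0x301)

-- the two strings differ byte for byte
print(composed == decomposed)   -- false
-- one codepoint versus two codepoints
print(utf8.len(composed))       -- 1
print(utf8.len(decomposed))     -- 2
```

When the input may mix both forms, normalizing it to a single form first gives the collator a consistent sequence of codepoints to work with.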

What is missing

Algorithm for setting implicit sort weights of characters that are not explicitly listed in DUCET.
Special handling of CJK scripts.

\iffalse

Copyright

Michal Hoftich, 2021–2024. See LICENSE file for more details.

\fi
