Skip to content

Solving CLDR redundancy

Patryk Gołębiowski edited this page May 15, 2016 · 3 revisions

Problem

Represent CLDR data without redundancy.

Investigation

If the CLDR JSON data was merged naively, a huge tree with lots of redundancy would be yielded. The schema of such a tree is represented on the following image:

Legend

  • Orange pentagon represents the root.
  • Yellowish circles represent locale nodes.
  • Greenish triangles represent localized data subtrees (categories, subcategories, etc.).
  • Greenish squares represent localized values.
  • Purple triangle represents CLDR data that is not localized (less than 1% of the data).

Facts

  • Greenish triangles are very similar. They're not exactly the same, but almost. Huge redundancy here.
  • Greenish squares often represent same values at corresponding places.
  • Purple triangle has no redundancy within itself.

Solution

Value redundancy (greenish squares)

Considering leaves of the tree, many values are duplicated. The optimal solution is very easy: simply store List<string> in the root (orange pentagon). It will represent a mapping between string values and their indexes - unique identifiers. All leaves will now store those indexes (ints) instead of strings.

It is true that finding the identifier of a certain string is very non-optimal. However, this would only be needed during tree building - later the set of strings will be fixed and the lookup will be performed only by identifiers found in the leaves (so by indexes). Which is blindingly fast. And definitely optimal. For optimizing the creation stage, a temporary dictionary would do the work.

Subtree-without-values redundancy (greenish triangles)

To handle redundancy here we need to do some reorganization in the tree structure.

  1. Create a Dictionary<CldrLocale, int> in the root. This will be a mapping between locales (represented by yellowish circles) and their newly assigned unique identifiers.

  2. Remove yellowish circles (the paths are shortened by one element in the middle - greenish triangles are now directly under the root).

  3. Merge greenish triangles so that the resulting supertriangle contains all the paths of these greenish triangles. So all the greenish triangles can be considered as its subtriangles.

  4. Of course, while we can merge branches easily, for leaves we need a different scenario. All the leaf values will now be stored in their parents (so each greenish square goes to his parent). After triangle merging, such ex-leaf parents contain sets of values (from various locales). Here a new Dictionary<int, int> comes in handy. It will be a mapping between locale identifiers and value identifiers.

Comment

This structure can be further optimized, using relationships between locales and introducing a fallback strategy - some values could be removed and inherited from other locales, if they're same. But this is another story - the problem with redundancy is solved.