CJK char frequency #36

Open
2 of 5 tasks
yhahn opened this issue Jun 1, 2014 · 24 comments

Comments

@yhahn
Member

yhahn commented Jun 1, 2014

TL;DR

  • Good news: there are definitely commonly used Chinese characters that we can bundle together into special glyph PBFs.
  • Bad news: this comes to about 3000-4000 characters, or roughly 2-3MB of overhead, and it's not likely we can reduce this overhead much.
  • Other good news: this approach will likely work for Hangul (Korean), which has similar size/range waste issues, though not as bad as CJK.

Notes from weekend analysis (source committed here https://github.com/mapbox/fontserver/tree/char-spec/spec).

Background

Prep

  • Sample of text that we need to render. In spec/fixtures/ I've added 12 z14 vector tiles of major Chinese cities.
  • The 4096 most frequent Chinese characters across all name tags in OSM. This is dumped to cjk-osm.json.
  • The 4096 most frequent Chinese characters according to an academic analysis of modern Chinese character frequency (http://lingua.mtsu.edu/chinese-computing/statistics/). This is dumped to cjk-modern.json.
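
A minimal sketch of how a frequency list like cjk-osm.json could be derived. The function name, the input shape (a plain array of name strings), and the restriction to the CJK Unified Ideographs block (U+4E00 to U+9FFF, i.e. 19968-40959) are assumptions for illustration, not the actual script in the repo:

```javascript
// Count how often each CJK Unified Ideograph (U+4E00-U+9FFF) occurs in a
// list of names and return the n most frequent characters.
// `names` stands in for OSM name tags; the real input format is an assumption.
function topCjkChars(names, n) {
  const counts = new Map();
  for (const name of names) {
    // for...of iterates by code point, not by UTF-16 code unit.
    for (const ch of name) {
      const cp = ch.codePointAt(0);
      if (cp >= 0x4e00 && cp <= 0x9fff) {
        counts.set(ch, (counts.get(ch) || 0) + 1);
      }
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // most frequent first
    .slice(0, n)
    .map(([ch]) => ch);
}

// Made-up sample names; Latin text is ignored by the block filter.
const sample = ['北京市', '北京西站', '南京市', 'Main Street'];
console.log(topCjkChars(sample, 2));
```

Calling it with n = 4096 over all name tags would produce a list of the same shape as the two charlists described above.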

Approach

Assuming any character among the 4096 most frequent can be fetched from a set of CJK common PBFs, we can analyze the range count for our 12 vector tiles:

none (78 ranges)
[ '0-255',
  '19968-20223',
  '20224-20479',
  '20736-20991',
  '20992-21247',
  '21248-21503',
  '21504-21759',
  '21760-22015',
  '22016-22271',
  '22272-22527',
  '22528-22783',
  '22784-23039',
  '23040-23295',
  '23296-23551',
  '23552-23807',
  '23808-24063',
  '24064-24319',
  '24320-24575',
  '24576-24831',
  '24832-25087',
  '25088-25343',
  '25344-25599',
  '256-511',
  '25600-25855',
  '25856-26111',
  '26112-26367',
  '26368-26623',
  '26624-26879',
  '26880-27135',
  '27136-27391',
  '27392-27647',
  '27648-27903',
  '27904-28159',
  '28160-28415',
  '28416-28671',
  '28672-28927',
  '28928-29183',
  '29184-29439',
  '29440-29695',
  '29696-29951',
  '29952-30207',
  '30208-30463',
  '30464-30719',
  '30720-30975',
  '30976-31231',
  '31232-31487',
  '31488-31743',
  '31744-31999',
  '32000-32255',
  '32256-32511',
  '32512-32767',
  '32768-33023',
  '33280-33535',
  '33536-33791',
  '33792-34047',
  '34048-34303',
  '34304-34559',
  '34560-34815',
  '34816-35071',
  '35072-35327',
  '35584-35839',
  '35840-36095',
  '36096-36351',
  '36608-36863',
  '36864-37119',
  '37120-37375',
  '37888-38143',
  '38144-38399',
  '38400-38655',
  '38656-38911',
  '38912-39167',
  '39168-39423',
  '39424-39679',
  '39936-40191',
  '40448-40703',
  '40704-40959',
  '8192-8447',
  '8448-8703' ]

osm (26 ranges)
[ '0-255',
  '20224-20479',
  '20992-21247',
  '256-511',
  '28416-28671',
  '29440-29695',
  '29952-30207',
  '33536-33791',
  '38912-39167',
  '8192-8447',
  '8448-8703',
  'cjk-common-0',
  'cjk-common-1',
  'cjk-common-10',
  'cjk-common-11',
  'cjk-common-12',
  'cjk-common-13',
  'cjk-common-14',
  'cjk-common-2',
  'cjk-common-3',
  'cjk-common-4',
  'cjk-common-5',
  'cjk-common-6',
  'cjk-common-7',
  'cjk-common-8',
  'cjk-common-9' ]

modern (34 ranges)
[ '0-255',
  '22272-22527',
  '256-511',
  '26880-27135',
  '27136-27391',
  '27648-27903',
  '27904-28159',
  '28416-28671',
  '29184-29439',
  '29952-30207',
  '30976-31231',
  '32512-32767',
  '33280-33535',
  '33536-33791',
  '34304-34559',
  '39424-39679',
  '8192-8447',
  '8448-8703',
  'cjk-common-0',
  'cjk-common-1',
  'cjk-common-10',
  'cjk-common-11',
  'cjk-common-12',
  'cjk-common-13',
  'cjk-common-14',
  'cjk-common-15',
  'cjk-common-2',
  'cjk-common-3',
  'cjk-common-4',
  'cjk-common-5',
  'cjk-common-6',
  'cjk-common-7',
  'cjk-common-8',
  'cjk-common-9' ]

The scripts assume we split the cjk-common PBF into chunks of 256 characters. It looks like you will always want most if not all of these 4096 characters as a baseline (if you run the script with just 1 or 2 tiles you will often still end up with 12-16 of the common ranges). You basically end up grabbing the common characters no matter what.
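
The range counting described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script in spec/: `requiredRanges`, `rangeName`, and the idea that the charlist is a frequency-ordered array (so that a character at index i lives in chunk `cjk-common-` + floor(i / 256)) are all hypothetical:

```javascript
// Name of the 256-codepoint range a code point falls into, e.g. '19968-20223'.
function rangeName(cp) {
  const start = Math.floor(cp / 256) * 256;
  return `${start}-${start + 255}`;
}

// Distinct PBFs needed to render `text`. `commonChars` is a frequency-ordered
// array standing in for cjk-osm.json / cjk-modern.json (format assumed);
// a character at index i is served from chunk 'cjk-common-' + floor(i / 256).
function requiredRanges(text, commonChars = [], chunkSize = 256) {
  const chunkOf = new Map(
    commonChars.map((ch, i) => [ch, Math.floor(i / chunkSize)])
  );
  const ranges = new Set();
  for (const ch of text) {
    ranges.add(
      chunkOf.has(ch)
        ? `cjk-common-${chunkOf.get(ch)}`
        : rangeName(ch.codePointAt(0))
    );
  }
  // Default (lexicographic) sort, matching the ordering of the lists above.
  return [...ranges].sort();
}

const label = '北京 Beijing';
console.log(requiredRanges(label));               // plain 256-codepoint ranges only
console.log(requiredRanges(label, ['北', '京'])); // both ideographs come from the common chunk
```

Without a charlist the two ideographs land in two different ranges; with it, both collapse into a single common chunk, which is the effect the osm and modern lists above show at scale.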

So, combining the common glyphs into a single pack and eliminating the non-CJK ranges from the list to reduce noise, we're down to:

osm (8 ranges)
[ '20224-20479',
  '20992-21247',
  '28416-28671',
  '29440-29695',
  '29952-30207',
  '33536-33791',
  '38912-39167',
  'cjk-common' ]

modern (15 ranges)
[ '22272-22527',
  '26880-27135',
  '27136-27391',
  '27648-27903',
  '27904-28159',
  '28416-28671',
  '29184-29439',
  '29952-30207',
  '30976-31231',
  '32512-32767',
  '33280-33535',
  '33536-33791',
  '34304-34559',
  '39424-39679',
  'cjk-common' ]

Conclusion + questions

Overall, having a cjk-common glyph PBF that takes precedence whenever a character falls into a fixed list of n characters (I picked 4096 based on a few runs: 3000 is not enough, 5000 has diminishing returns) looks like a good approach. You'll have the most commonly used characters loaded and cached, and will fire off requests as normal for other ranges as you hit less common characters.
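
On the client side, the "common pack takes precedence, everything else falls back to normal ranges" behaviour might look something like this. It is only a sketch: `GlyphRequestPlanner` and its methods are hypothetical names, and no real llmr or fontserver API is implied:

```javascript
// Tracks which glyph PBFs have been requested already and returns only the
// new ones needed for each label. All names here are hypothetical.
class GlyphRequestPlanner {
  constructor(commonChars) {
    this.common = new Set(commonChars); // fixed, versioned charlist
    this.loaded = new Set();            // PBF names already requested
  }

  // PBF that serves a single character: the common pack takes precedence.
  pbfFor(ch) {
    if (this.common.has(ch)) return 'cjk-common';
    const start = Math.floor(ch.codePointAt(0) / 256) * 256;
    return `${start}-${start + 255}`;
  }

  // Returns the PBFs that still need to be fetched for `text`.
  plan(text) {
    const needed = [];
    for (const ch of text) {
      const pbf = this.pbfFor(ch);
      if (!this.loaded.has(pbf)) {
        this.loaded.add(pbf);
        needed.push(pbf);
      }
    }
    return needed;
  }
}

const planner = new GlyphRequestPlanner(['北', '京', '市']);
console.log(planner.plan('北京市')); // first label pulls in the common pack
console.log(planner.plan('京畿'));   // only the range for the uncommon character
```

Because the charlist decides which PBF a character lives in, changing the list later changes the answers of `pbfFor` for every client, which is why the list would need to be versioned.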

The discrepancy between OSM's top characters and the other analysis's top characters is worth some study though. Note that once we set this list, changing it is very painful: it will affect any implementations using these endpoints and likely means we need to spec and version any endpoints around this strictly.

  • "Place language" differs significantly from normal usage in English, German, etc. This is likely (?) the case for Chinese as well and could lead to this discrepancy.
  • OSM data quality in China is an unknown to me. Setting a character frequency list based on the current state of OSM could be a bad idea without some overall confidence that current OSM data in China is representative of Chinese-language maps as a whole and of future OSM data/map data.

Next actions

  • Similar analysis for Hangul
  • Branch of fontserver that can generate a PBF from a charlist rather than a start/end range
  • Branch of llmr that uses a CJK charlist to request the common CJK PBF and falls back to normal range glyphs otherwise
  • Test IRL
  • Consult/further research on Chinese char freq questions

cc @mikemorris @kkaefer @ansis @nickidlugash

@yhahn
Member Author

yhahn commented Jun 1, 2014

Added to the test above a first example of what happens when there are no common character range PBFs around (78 range requests).

@yhahn
Member Author

yhahn commented Jun 19, 2014

@mikemorris answering some of your questions:

we'll need to build separate common glyphs for korean and japanese?

These languages all have a shared characteristic: Their writing systems all completely or partly use Chinese characters — Hànzì in Chinese, kanji in Japanese, hanja in Korean, and Chữ Nôm in Vietnamese. Chinese is written in Chinese characters only and requires approximately 4,000 characters for general literacy although there are up to 40,000 characters for reasonably complete coverage.

From http://en.wikipedia.org/wiki/CJK_characters

Basically: Korean and Japanese writing systems often include Chinese characters for a kind of "old-school" writing, and then have their own character/alphabet systems for the majority of normal use. For example, in Korea newspaper headlines often have some CJK characters, and then the article will be in the Hangul character set.

Ideally we do not need to build separate common CJK glyph range PBFs for other languages that use CJK. We would only need to do this if the frequency distribution of character usage in Korean or Japanese of CJK characters is very different from Chinese. Let's stick with a single set of common CJK for now.

Korean: Hangul

There is the separate issue of whether Korean Hangul (a very different character set and also a very large portion of Unicode) needs a common glyph frequency analysis. It is a big character range in Unicode (http://jrgraphix.net/r/Unicode/AC00-D7AF) though not nearly as huge as CJK:

> parseInt('d7af',16) - parseInt('ac00',16);
11183

I did a quick analysis of this using a Seoul OSM extract and recall from the results that likely a 256-512 set of common glyphs would have a very good impact. I haven't run an analysis that covers North/South Korea, and I think that would be the next step here.

Japanese: Hiragana + Katakana

I am less familiar with Japanese but these are two character sets that do not cause as many headaches as CJK/Hangul. http://jrgraphix.net/r/Unicode/3040-309F, http://jrgraphix.net/r/Unicode/30A0-30FF

> parseInt('30ff',16) - parseInt('3040',16);
191
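
To relate these block sizes to glyph requests, here is a small sketch that computes the inclusive number of code points in a block and how many 256-codepoint ranges it spans. `blockStats` is a made-up helper name; note that `end - start`, as computed in the REPL lines above, is one less than the inclusive count:

```javascript
// Inclusive size of a Unicode block and the number of 256-codepoint ranges
// it spans. Block boundaries are passed as hex strings, as in the REPL lines.
function blockStats(startHex, endHex) {
  const start = parseInt(startHex, 16);
  const end = parseInt(endHex, 16);
  return {
    codepoints: end - start + 1,
    ranges: Math.floor(end / 256) - Math.floor(start / 256) + 1,
  };
}

console.log(blockStats('ac00', 'd7af')); // Hangul Syllables block
console.log(blockStats('3040', '30ff')); // Hiragana + Katakana
console.log(blockStats('4e00', '9fff')); // CJK Unified Ideographs (19968-40959)
```

The Hangul block covers 11184 code points spread over 44 ranges, while Hiragana and Katakana together are 192 code points inside the single range 12288-12543, which is consistent with kana not causing the same request overhead.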

@mikemorris
Contributor

Emailed info@geofabrik.de asking for a Korea (North and South) sub-region extract at http://download.geofabrik.de/asia.html. Is this something @joto could help out with?

@lxbarth

lxbarth commented Jun 19, 2014

@mikemorris - per chat, use the Overpass API to download extracts. You can create a download URL with bbox and curl it.

@ajashton says there's also http://overpass-turbo.eu/ where you can use the wizard to easily make complex queries, e.g. place=city in "North Korea".
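
As a rough illustration of "create a download URL with bbox and curl it", the sketch below builds an Overpass API URL for place=city nodes inside a bounding box. The endpoint is the public overpass-api.de interpreter; the helper name and the bbox values are assumptions for illustration and are not taken from this thread:

```javascript
// Builds an Overpass API URL for place=city nodes inside a bounding box.
// Overpass QL expects the bbox as (south, west, north, east).
function overpassUrl({ south, west, north, east }) {
  const query =
    `[out:json];node["place"="city"](${south},${west},${north},${east});out;`;
  return (
    'https://overpass-api.de/api/interpreter?data=' + encodeURIComponent(query)
  );
}

// Rough, illustrative bbox around the Korean peninsula (not from the thread).
console.log(overpassUrl({ south: 33, west: 124, north: 43, east: 131 }));
```

The printed URL can then be passed to curl to download the extract.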

@mikemorris
Contributor

Thanks @lxbarth, got extracts of North Korea and South Korea to analyze now.

@mikemorris
Contributor

http://extract.bbbike.org/ is another resource for creating extracts, recommended by the folks at geofabrik

@mikemorris
Contributor

Per chat with @nickidlugash, should we possibly do another analysis for traditional Chinese characters used in Taiwan, Hong Kong and Macau?

@yhahn
Member Author

yhahn commented Jun 23, 2014

Start by adding more tile fixtures for these areas to

and running analysis of what the request profile looks like using the existing common-cjk ranges. If these fare badly/worse than the current tests, then yes, I think you should look into finding out why.

@mikemorris
Contributor

[For Hangul], a 256-512 set of common glyphs would have a very good impact

Committed results of a full analysis of North and South Korea. North Korea has very limited OSM coverage, but the entirety of glyphs in use ended up being just 445. It also appears that none of the glyphs in the Hangul Compatibility Jamo are used.

@mikemorris
Contributor

If these fare badly/worse than the current tests, then yes, I think you should look into finding out why.

@yhahn These look significantly worse than mainland China to me. I think we'll need a cjk-traditional common set as well.

@mikemorris
Contributor

About 30% of simplified Chinese characters match simplified kanji (see shinjitai).[28] This makes it easier for people who know simplified characters to be able to read and understand Japanese kanji. For example, the character 国 (country) is written the same way in Japanese (国) although in traditional Chinese it is 國. However, those who understand traditional Chinese will understand a much greater proportion of Japanese Kanji, as the current standard Japanese character set is much more similar to traditional Chinese.
https://en.wikipedia.org/wiki/Debate_on_traditional_and_simplified_Chinese_characters#Aesthetics

@mikemorris
Contributor

Adding all Hangul Unicode ranges:

@mikemorris
Contributor

Need to add some tile fixtures for Japan, but here are current results. Switching primary sorting to Unicode index instead of frequency adds 3-5 ranges to each, so these results only sort on index if frequency is equal. The extraneous ranges (CJK Symbols, Bopomofo for Taiwan, etc) are pretty well arranged within the Unicode spec, so adding them to the CJK common set just ended up bloating it and causing more ranges to be loaded unnecessarily.
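
The ordering described here (frequency first, Unicode index only as a tie-breaker) could be expressed with a comparator along these lines. The function name and the `[char, count]` pair shape are assumptions, and the counts are made up:

```javascript
// Order [char, count] pairs by descending frequency, falling back to
// ascending code point only when the counts are equal.
function byFrequencyThenIndex(a, b) {
  return (b[1] - a[1]) || (a[0].codePointAt(0) - b[0].codePointAt(0));
}

// Made-up counts for illustration.
const counts = [['路', 40], ['市', 12], ['京', 12], ['北', 40]];
const ordered = [...counts].sort(byFrequencyThenIndex).map(([ch]) => ch);
console.log(ordered);
```

Sorting primarily by index instead would keep neighbouring code points together but, per the results in this comment, pushes some frequent characters out of the common set and costs a few extra ranges.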

china
none (78 ranges)
cjk-osm (26 ranges)
cjk-modern (34 ranges)
hangul-osm (78 ranges)

taiwan
none (90 ranges)
cjk-osm (33 ranges)
cjk-modern (106 ranges)
hangul-osm (90 ranges)

hong-kong
none (86 ranges)
cjk-osm (25 ranges)
cjk-modern (102 ranges)
hangul-osm (86 ranges)

macau
none (82 ranges)
cjk-osm (23 ranges)
cjk-modern (79 ranges)
hangul-osm (82 ranges)

north-korea
none (34 ranges)
cjk-osm (34 ranges)
cjk-modern (34 ranges)
hangul-osm (5 ranges)

south-korea
none (45 ranges)
cjk-osm (45 ranges)
cjk-modern (45 ranges)
hangul-osm (8 ranges)

@mikemorris
Contributor

Compressing into a single common set for each yields this:

china
none (78 ranges)
cjk-osm (5 ranges)
cjk-modern (34 ranges)
hangul-osm (78 ranges)

taiwan
none (90 ranges)
cjk-osm (9 ranges)
cjk-modern (106 ranges)
hangul-osm (90 ranges)

hong-kong
none (86 ranges)
cjk-osm (5 ranges)
cjk-modern (102 ranges)
hangul-osm (86 ranges)

macau
none (82 ranges)
cjk-osm (3 ranges)
cjk-modern (79 ranges)
hangul-osm (82 ranges)

north-korea
none (34 ranges)
cjk-osm (34 ranges)
cjk-modern (34 ranges)
hangul-osm (2 ranges)

south-korea
none (45 ranges)
cjk-osm (45 ranges)
cjk-modern (45 ranges)
hangul-osm (4 ranges)

@mikemorris
Contributor

Trimming cjk-common from 6405 to 4096 hits Taiwan REALLY hard, with a moderate impact on the rest. This could possibly be alleviated with a cjk-traditional-common set for Taiwan, Hong Kong and Macau.

china cjk-osm (15 ranges)
taiwan cjk-osm (71 ranges)
hong-kong cjk-osm (31 ranges)
macau cjk-osm (9 ranges)

Trimming hangul-common from 1110 to 1024 is manageable, but still adds a few ranges.

north-korea hangul-osm (2 ranges)
south-korea hangul-osm (7 ranges)

@mikemorris
Contributor

Best attempt so far at splitting cjk-common isn't all that impressive, feels like I'm just spinning wheels here and that even the best solution here is still incredibly fragile.

In this test, cjk-osm is built only from the China OSM extract, and cjk-extended-osm is built from China, Taiwan and Japan extracts deduped against cjk-osm.

cjk-combined-osm is the union of cjk-osm and cjk-extended-osm.
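
In set terms, the split described above is a difference followed by a union. A minimal sketch, with hypothetical names and tiny made-up charlists standing in for the real ones:

```javascript
// extended = chars from the wider extracts that are NOT already in the base
// list; combined = the base list followed by the extended list.
// Inputs are frequency-ordered arrays (hypothetical stand-ins for the
// real charlists); `base` is assumed to contain no duplicates.
function splitCommonSets(base, wider) {
  const baseSet = new Set(base);
  const extended = [...new Set(wider)].filter((ch) => !baseSet.has(ch));
  const combined = [...base, ...extended];
  return { extended, combined };
}

const cjkOsm = ['国', '北', '京'];           // e.g. from the China extract
const widerChars = ['國', '北', '東', '京']; // e.g. from China + Taiwan + Japan
const { extended, combined } = splitCommonSets(cjkOsm, widerChars);
console.log(extended);
console.log(combined);
```

Because the extended list is deduped against the base list, the combined size is simply the sum of the two, as in the range sizes listed below (4096 + 2048 = 6144).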

range sizes
cjk-osm 4096
cjk-extended-osm 2048
cjk-combined-osm 6144
hangul-osm 1024

china
cjk-osm (10 ranges)
cjk-extended-osm (79 ranges)
cjk-combined-osm (7 ranges)

taiwan
cjk-osm (85 ranges)
cjk-extended-osm (91 ranges)
cjk-combined-osm (17 ranges)

hong-kong
cjk-osm (20 ranges)
cjk-extended-osm (87 ranges)
cjk-combined-osm (8 ranges)

macau
cjk-osm (6 ranges)
cjk-extended-osm (83 ranges)
cjk-combined-osm (4 ranges)

north-korea
hangul-osm (2 ranges)

south-korea
hangul-osm (7 ranges)

@mikemorris
Contributor

Only real way to fix Taiwan is to not trim at all.

range sizes
cjk-osm 4096
cjk-extended-osm 2309
cjk-combined-osm 6405

china
cjk-osm (10 ranges)
cjk-extended-osm (79 ranges)
cjk-combined-osm (6 ranges)

taiwan
cjk-osm (85 ranges)
cjk-extended-osm (91 ranges)
cjk-combined-osm (10 ranges)

hong-kong
cjk-osm (20 ranges)
cjk-extended-osm (87 ranges)
cjk-combined-osm (6 ranges)

macau
cjk-osm (6 ranges)
cjk-extended-osm (83 ranges)
cjk-combined-osm (4 ranges)

@mikemorris
Contributor

Added test fixtures for Japan:

japan
none (95 ranges)
cjk-osm (14 ranges)
cjk-modern (111 ranges)
hangul-osm (95 ranges)
ranges [ '0-255',
  '1024-1279',
  '12288-12543',
  '13312-13567',
  '256-511',
  '55296-55551',
  '57088-57343',
  '64000-64255',
  '65280-65535',
  '8192-8447',
  '8448-8703',
  '8704-8959',
  '9472-9727',
  'cjk-common' ]

Interesting that there appear to be requests for characters in the High Surrogates and Low Surrogates ranges, which all appear as � to me.
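
One possible explanation, which this thread does not confirm, is that the analysis walks strings by UTF-16 code unit. In that case a supplementary-plane ideograph is seen as two separate surrogate values rather than one code point. The sketch below uses U+20BB7 (𠮷, a variant of 吉 used in some Japanese names) purely as an example; the helper names are hypothetical:

```javascript
// Name of the 256-wide range a numeric value falls into.
function rangeName(n) {
  const start = Math.floor(n / 256) * 256;
  return `${start}-${start + 255}`;
}

// Ranges seen when walking a string by UTF-16 code unit (charCodeAt).
function rangesByCodeUnit(text) {
  const out = new Set();
  for (let i = 0; i < text.length; i++) out.add(rangeName(text.charCodeAt(i)));
  return [...out];
}

// Ranges seen when walking a string by code point (for...of + codePointAt).
function rangesByCodePoint(text) {
  const out = new Set();
  for (const ch of text) out.add(rangeName(ch.codePointAt(0)));
  return [...out];
}

const kanji = '\u{20BB7}'; // one character, two UTF-16 code units
console.log(rangesByCodeUnit(kanji));  // two surrogate ranges
console.log(rangesByCodePoint(kanji)); // one supplementary-plane range
```

For this character the code-unit walk yields exactly the two surrogate ranges listed above (55296-55551 and 57088-57343), whereas the code-point walk yields the single range 133888-134143. A lone surrogate has no glyph of its own, which would also explain why those characters display as �.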

@mikemorris
Contributor

TypeKit went with "dynamic subsetting", allowing requests for any number and combination of glyphs rather than predefined blocks.

Instead of redownloading an entirely new font, we can now simply request the additional glyphs, and perform the update right in your browser. Need one glyph? We can do that! And when you need another, no need to download the first again.

http://blog.typekit.com/2015/06/15/announcing-east-asian-web-font-support/

It also sounds like this was a really easy issue for them to solve:

After many years — working across four teams, on three continents, and in five time zones — we are proud to announce that we’ve extended Typekit’s web font service to support Chinese, Japanese and Korean fonts.

@kkaefer
Contributor

kkaefer commented Aug 6, 2015

Super interesting read, thanks for posting!

@huangyingjie

Hi @mikemorris, what is the current schedule for this? I need it badly. Is there any way to take part?

@glzcc

glzcc commented Dec 8, 2016

image
I downloaded the file. How do I use it?

@jayantchens

@yhahn @mikemorris Thank you for your answer. Could you tell me the exact details? Developing with webGL.js (JavaScript) in web pages?
