## Download and filter Unihan database

I first tried the unihan-etl parser (https://github.com/cihai/unihan-etl) and wasted some time on it. I found it buggy and bloated, so it is ***NOT RECOMMENDED***. The commands I used are kept below, commented out, for reference only.

```
#!pip install -q unihan-etl
#!~/.local/bin/unihan-etl -F json --no-expand
#!~/.local/bin/unihan-etl -F csv
#!cp -vf ~/.local/share/unihan_etl/unihan.csv ./
```

Download from upstream: https://www.unicode.org/Public/UCD/latest/ucd/

In [1]:
!curl -s -o LICENSE "https://www.unicode.org/license.txt"
![ -f Unihan.zip ] || wget https://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip
!rm -f Unihan_*.txt && unzip Unihan.zip

--2023-10-17 22:52:33--  https://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip
Resolving www.unicode.org (www.unicode.org)... 64.182.27.164
Connecting to www.unicode.org (www.unicode.org)|64.182.27.164|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7999959 (7.6M) [application/zip]
Saving to: ‘Unihan.zip’


2023-10-17 22:52:37 (2.25 MB/s) - ‘Unihan.zip’ saved [7999959/7999959]

Archive:  Unihan.zip
  inflating: Unihan_DictionaryIndices.txt  
  inflating: Unihan_DictionaryLikeData.txt  
  inflating: Unihan_IRGSources.txt   
  inflating: Unihan_NumericValues.txt  
  inflating: Unihan_OtherMappings.txt  
  inflating: Unihan_RadicalStrokeCounts.txt  
  inflating: Unihan_Readings.txt     
  inflating: Unihan_Variants.txt     


Parse to .csv:

In [2]:
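import pandas as pd
import glob

# Collect every field of every code point:
# {codepoint: {'ucn': ..., 'char': ..., <field>: <value>, ...}}
unihan_mp = {}
for filename in glob.glob('Unihan_*.txt'):
    # Each data line is "U+XXXX<TAB>field<TAB>value"; '#' starts a comment.
    df = pd.read_csv(filename, sep='\t', comment='#', dtype='str', names=['ucn', 'col', 'val'])
    for row in df.itertuples():
        c = int(row.ucn[2:], 16)  # 'U+548C' -> 0x548C
        unihan_mp.setdefault(c, {'ucn': row.ucn, 'char': chr(c)})[row.col] = row.val

# One row per code point (sorted by code point), one column per Unihan field.
df = pd.DataFrame([unihan_mp[c] for c in sorted(unihan_mp.keys())]).set_index('ucn')
df.to_csv('unihan.csv')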
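
The cell above reads the files with `pd.read_csv(..., comment='#')`, which discards the rest of any line after a `#`, and with the default quoting, which treats `"` as a quote character. Whether any Unihan value is actually affected by either setting was not checked here. As a hedged alternative, a plain-Python parser that only skips lines *starting* with `#` and splits on the first two tabs might look like the sketch below (`parse_unihan_lines` is a hypothetical helper name):

```python
def parse_unihan_lines(lines):
    """Build {codepoint: {'ucn': ..., 'char': ..., field: value}} from Unihan text lines."""
    table = {}
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and whole-line comments only.
        if not line or line.startswith("#"):
            continue
        # Split on the first two tabs so the value is kept verbatim.
        ucn, field, value = line.split("\t", 2)
        cp = int(ucn[2:], 16)  # 'U+548C' -> 0x548C
        table.setdefault(cp, {"ucn": ucn, "char": chr(cp)})[field] = value
    return table


# Usage with a few in-memory lines (the last value is made up to show
# that '#' and '"' inside a value survive unchanged):
sample = [
    "# Unihan_Readings.txt\n",
    "\n",
    "U+548C\tkMandarin\thé\n",
    "U+548C\tkDefinition\tharmony, peace; peaceful, calm\n",
    'U+8AAC\tkDefinition\t"speak" #1\n',
]
parsed = parse_unihan_lines(sample)
print(parsed[0x548C]["kMandarin"])    # hé
print(parsed[0x8AAC]["kDefinition"])  # "speak" #1
```

The resulting dictionary has the same shape as `unihan_mp`, so it could be passed to `pd.DataFrame` in the same way.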

In [4]:
!grep -E '# (Date:|Unicode version:)' Unihan_Readings.txt >VERSION.txt
!rm -f Unihan.zip Unihan_*.txt

## Explorations

In [5]:
import pandas as pd
import re
pd.options.display.max_rows = 1000

df = pd.read_csv('unihan.csv', dtype='str').set_index('ucn')

In [6]:
df.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| char | 98682 | 98682 | 㐀 | 1 |
| kSemanticVariant | 3418 | 3398 | `U+756B<kMatthews` | 2 |
| kCantonese | 29807 | 1869 | jyu4 | 153 |
| kDefinition | 23259 | 17456 | name of a variety of grass | 37 |
| kJapanese | 51582 | 14980 | コウ | 924 |
| kMandarin | 41419 | 1512 | yì | 431 |
| kCangjie | 29188 | 27045 | EYRN | 6 |
| kMojiJoho | 52515 | 52515 | MJ000004 | 1 |
| kIRG_GSource | 66572 | 65621 | GHC | 553 |
| kIRG_JSource | 16226 | 16226 | JA-2121 | 1 |


Sample chars:

In [7]:
df[df.char.isin(['和', '說', '説', '说', '裡', '裏'])].T

| ucn | U+548C | U+88CF | U+88E1 | U+8AAA | U+8AAC | U+8BF4 |
|---|---|---|---|---|---|---|
| char | 和 | 裏 | 裡 | 說 | 説 | 说 |
| kSemanticVariant | `U+548A<kLau,kMatthews U+9FA2<kLau,kMatthews` | `U+88E1<kHKGlyph,kLau,kMatthews` | `U+88CF<kHKGlyph,kLau,kMatthews` | | | |
| kCantonese | wo4 | leoi5 | leoi5 | syut3 | syut3 | syut3 |
| kDefinition | harmony, peace; peaceful, calm | inside, interior, within | inside, interior, within | speak, say, talk; scold, upbraid | speak | speak, say, talk; scold, upbraid |
| kJapanese | ワ オ カ やわらぐ やわらげる なごむ なごやか あえる | リ うら うち | リ うら | セツ ゼイ ネ セイ セ タツ セチ エツ とく | セツ ゼイ エツ タツ セイ とく よろこぶ | |
| kMandarin | hé | lǐ | lǐ | shuō | shuō | shuō |
| kCangjie | HDR | YWGV | LWG | YRCRU | YRCRU | IVCRU |
| kMojiJoho | MJ008199 | MJ023994 | MJ024015 | MJ024533 MJ024533:E0101 MJ058743:E0102 | MJ024535 | |
| kIRG_GSource | G0-3A4D | G1-406F | GE-4C30 | GE-4C73 | G1-4B35 | G0-4B35 |
| kIRG_JSource | J0-4F42 | J0-4E22 | J0-4E23 | | J0-4062 | |
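
The `kSemanticVariant` values in this sample pack several pieces of information into one string: space-separated code points, each optionally followed by `<` and a comma-separated list of source tags. A minimal sketch of unpacking that format is shown below. `parse_variants` is a hypothetical helper, and the regular expression assumes the format seen in these samples rather than the full Unihan specification.

```python
import re

# A code point, optionally followed by "<source1,source2,...".
VARIANT_RE = re.compile(r"(U\+[0-9A-F]+)(?:<(\S+))?")


def parse_variants(value):
    """Return [(ucn, char, [source, ...]), ...] for one variant field value."""
    result = []
    for ucn, sources in VARIANT_RE.findall(value):
        char = chr(int(ucn[2:], 16))
        result.append((ucn, char, sources.split(",") if sources else []))
    return result


# Usage with the value shown for U+88CF in the sample output:
print(parse_variants("U+88E1<kHKGlyph,kLau,kMatthews"))
# [('U+88E1', '裡', ['kHKGlyph', 'kLau', 'kMatthews'])]
```

A value with no source list, such as a bare `U+8AAC`, yields an empty list of sources.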


### kZVariant

Z-variants are characters that share the same meaning ("x-axis") and the same principal shape ("y-axis") but differ in stylistic form ("z-axis"). They are encoded separately, rather than unified, for historical or compatibility reasons.

In [8]:
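groups = set()
# Walk every character that declares at least one Z-variant.
for row in df[df.kZVariant.notnull()].itertuples():
    # A group is the character itself plus every code point named in its kZVariant field.
    grp = tuple(sorted(re.findall(r'(U\+[0-9A-F]+)', row.Index + ' ' + row.kZVariant)))
    if grp not in groups:  # print each distinct group only once
        groups.add(grp)
        for u in grp:
            print(df.loc[u].char, end='')
        # Append the Mandarin reading of the declaring row; wrap after every 10 groups.
        print('[%s]' % row.kMandarin, end=' ' if len(groups) % 10 != 0 else '\n')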

㖈䎛[nan] 㘽㦳[nan] 㩁搉[què] 㫚曶[hū] 㮣槩[gài] 䱍䱎[gèng] 𣢧䶾[nan] 併倂[bìng] 値值[zhí] 𠮟叱[chì]
吳吴呉[wú] 塡填[tián] 墫壿[zūn] 𡉟壯[zhuàng] 奨奬獎[jiǎng] 娛娯娱[yú] 媯嬀[guī] 帡帲[píng] 𢖽志[zhì] 𢗿怽[mo]
恆恒[héng] 悅悦[yuè] 戶户戸[hù] 挩捝[tuō] 挿插揷[chā] 揺搖摇[yáo] 敓敚[duó] 晚晩[wǎn] 梲棁[zhuó] 榝樧[shā]
涗涚[shuì] 溈潙[wéi] 𤽜皌[mò] 研硏[yán] 𥑘砞[mò] 𥘯祙[mèi] 稅税[shuì] 𥡴稽[jī] 絕絶[jué] 絙絚[huán]
緒緖[xù] 胼腁[pián] 脫脱[tuō] 苿茉[wèi] 蒍蔿[wěi] 蘷虁[kuí] 蛻蜕[tuì] 訮詽[yán] 說説[shuō] 謠謡[yáo]
豜豣[jiān] 跥跺[duò] 躗躛[wèi] 軿輧[píng] 𨓜逸[yì] 郎郞[láng] 郷鄉鄕[xiāng] 銳鋭[ruì] 鎭鎮[zhèn] 𨺓隆[lóng]
隷隸[lì] 黑黒[hēi] 𢫮𢬎[nan] 𦰥𱽨[bāng] 𩿣𩿲[mò] 𬻋𱽌[nan] 
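
The cell above treats each row's own variant list as a group, so if two characters in the same family declare different subsets of each other, they would be printed as separate, overlapping groups. Whether that happens in the current data was not checked. As one possible refinement, not the method used in this notebook, the declarations can be merged transitively with a small union-find structure. `merge_variant_groups` is a hypothetical helper, and the first three input rows in the usage example are made up for illustration.

```python
import re


def merge_variant_groups(pairs):
    """Merge (ucn, variant_field) declarations into transitive groups.

    Returns a sorted list of tuples of 'U+XXXX' strings.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for ucn, field in pairs:
        find(ucn)  # register the character even if the field is empty
        for other in re.findall(r"U\+[0-9A-F]+", field):
            union(ucn, other)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    # Order members by code point, then order the groups themselves.
    return sorted(
        tuple(sorted(g, key=lambda u: int(u[2:], 16))) for g in groups.values()
    )


# Usage: asymmetric declarations still end up in a single group.
rows = [
    ("U+5433", "U+5434"),
    ("U+5434", "U+5433 U+5449"),
    ("U+5449", "U+5434"),
    ("U+8AAA", "U+8AAC"),
]
print(merge_variant_groups(rows))
# [('U+5433', 'U+5434', 'U+5449'), ('U+8AAA', 'U+8AAC')]
```

With the notebook's DataFrame, the input could be built from the rows where `kZVariant` is not null, pairing each index value with its `kZVariant` string.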