# HSK 3.0 wordlist

**Primary sources**
  * http://www.moe.gov.cn/jyb_xwfb/gzdt_gzdt/s5987/202103/t20210329_523304.html
  * http://www.moe.gov.cn/jyb_xwfb/gzdt_gzdt/s5987/202103/W020210329527301787356.pdf
    * Scanned document without OCR layer
    * Error: `1856 火药` misindexed in .pdf, should be `1836 火药`
    * SHA256 e451fdf0899d9267bbd122db66e7e75bdd2851ad1e6732e47b6e960290d73a63
  * https://www.chinesetest.cn/standardsAction.do?means=standardInfo
    * Parseable pinyin and part of speech for most terms available
    * Indexing doesn't match the .pdf; entries are only roughly sorted by pinyin
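
The SHA256 digest listed for the .pdf can be checked after downloading. A minimal sketch, assuming hypothetical helpers `sha256_of` and `verify` (not part of this notebook) and a local copy of the file at a path of your choosing:

```python
import hashlib

# Digest quoted in the source list above for the ministry's .pdf
EXPECTED_PDF_SHA256 = 'e451fdf0899d9267bbd122db66e7e75bdd2851ad1e6732e47b6e960290d73a63'

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected=EXPECTED_PDF_SHA256):
    """True if the file at `path` has the expected digest (case-insensitive)."""
    return sha256_of(path) == expected.lower()
```

Calling `verify('downloads/hsk30.pdf')` (the file name is an assumption) should return `True` only for a byte-identical copy of the scanned document.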

**Third party sources**

**https://github.com/elkmovie/hsk30**
* OCRed list by Pleco, only characters, no pinyin
* https://www.plecoforums.com/threads/hsk-3-0-flashcards.6706/

In [1]:
!pip install -q opencc

import os, io, json, re
import opencc
import pandas as pd

![ -f downloads/pleco.txt ] || curl -s -o downloads/pleco.txt https://raw.githubusercontent.com/elkmovie/hsk30/main/wordlist.txt
![ -f downloads/pleco-charlist.txt ] || curl -s -o downloads/pleco-charlist.txt https://raw.githubusercontent.com/elkmovie/hsk30/main/charlist.txt
!sha256sum downloads/pleco.txt downloads/pleco-charlist.txt

# Parse and convert to .csv

level = 0
idx = 0
rows = []

for line in open('downloads/pleco.txt'):
    line = line.strip()
    if not line or line.startswith('#'):
        continue
    if '级词汇表' in line:
        level += 1
        idx = 0
    else:
        idx += 1
        assert ' ' in line
        assert line.split()[0] == str(idx) or line == '1856 火药', (idx,line)  # typo in original .pdf
        rows.append({
            'ID': f'L{level}-{idx:04d}',
            'Level': str(level) if level < 7 else '7-9',
            'No': line.split()[0],  # note: has 1856 火药
            'OCR': line[len(str(idx)):].strip()
        })

pleco_df = pd.DataFrame(rows)
pleco_df.to_csv('downloads/pleco.csv', index=False)

4a8cfdc3a8fa85ced837c71aa187454c4d0ee36a1f70765655a93acb6a463af3  downloads/pleco.txt
4c779e47844147bc5aac7ff835e5ee8e2f312125314153e9c5e844501ed79897  downloads/pleco-charlist.txt


**https://github.com/shawkynasr/HSK-official-Query-System/**
* Probably scraped from the [HSK website](https://www.chinesetest.cn/standardsAction.do?means=getStandardWordsList&leves=&words=&pinyin=&words_type=&pager.offset=0)
* Pinyin and POS for most terms
* Two files, 2022 version is more accurate:
  * corrected error: 四级,会,huìlǜ,名 -> 四级,汇率,huìlǜ,名
  * corrected error: 一级,妈妈,māma∣mā,名 -> 一级,妈妈∣妈,māma∣mā,名
  * dropped brackets that contained POS; still has brackets for variants ("有（一）些") and examples for prefix/suffix entries ("们（朋友们）")
  * some off-by-one indexing errors fixed
* Grammar points
* Character list seems wrong
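
The two bracket conventions mentioned above behave differently: a bracket in the middle of an entry marks an optional part, while a trailing bracket on a prefix/suffix entry holds an example. The notebook resolves these by hand through `data/variants.csv`; the following is only a rough sketch of the distinction, with a hypothetical helper `expand_optional` and the assumption (a heuristic, not verified against the full list) that a trailing bracket is always an example:

```python
import re

def expand_optional(term):
    """Expand one fullwidth-bracketed optional part.

    '有（一）些' -> ['有些', '有一些']
    A trailing bracket, as in '们（朋友们）', is treated as an example,
    so only the headword is returned.
    """
    m = re.fullmatch(r'([^（）]*)（([^（）]+)）([^（）]*)', term)
    if not m:
        return [term]              # no bracket: nothing to expand
    head, inner, tail = m.groups()
    if tail == '':                 # assumption: trailing bracket = example
        return [head]
    return [head + tail, head + inner + tail]
```

Entries that need more than this (numbered homographs, `｜` alternatives, ellipses) are exactly the ones the manual variants table covers.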

In [2]:
%%bash -e

[[ -d ../downloads/hsk30 && ! -d downloads ]] && ln -s ../downloads/hsk30 downloads
[[ -d downloads ]] || mkdir -p downloads

for url in \
    downloads/shawkynasr2021.csv::https://raw.githubusercontent.com/shawkynasr/HSK-official-Query-System/main/%E8%AF%8D%E6%B1%87.csv \
    downloads/shawkynasr2022.csv::https://raw.githubusercontent.com/shawkynasr/HSK-official-Query-System/main/%E8%AF%8D%E6%B1%87%202022.csv \
    downloads/shawkynasr-grammar.csv::https://raw.githubusercontent.com/shawkynasr/HSK-official-Query-System/main/%E8%AF%AD%E6%B3%95.csv \
    downloads/shawkynasr-hanzi.csv::https://raw.githubusercontent.com/shawkynasr/HSK-official-Query-System/main/%E6%B1%89%E5%AD%97.csv; do
  filename="${url%::*}"
  url="${url#*::}"
  [[ -f "$filename" ]] || curl -s -o "$filename" "$url"
done

sha256sum downloads/shawkynasr*.csv

f79d832ffe2a4015d5a24d6aa9c6f734e90053a02456fa0fc37eeefb65c630cc  downloads/shawkynasr2021.csv
e4fd1e046f3fb94c2dd7ca7a34ae671230ca173cca50af5ff7d2913399150038  downloads/shawkynasr2022.csv
f5d2b25e889ce9b54034612dcf16500d53f6168086d3642b0bb3749f0644d783  downloads/shawkynasr-grammar.csv
2a2393d7896e4d43959a0a25f931286f00d73385655252fba0fa7f32f4534fee  downloads/shawkynasr-hanzi.csv


In [3]:
sh2021_df = pd.read_csv('downloads/shawkynasr2021.csv')
sh2022_df = pd.read_csv('downloads/shawkynasr2022.csv')

# Ignoring brackets and some formatting differences, the lists match pleco's:
def norm(c):
    c = c.str.replace('（.*', '', regex=True)
    for x,y in ['∣｜', '1¹', '2²']:
        c = c.str.replace(x, y)
    return c.sort_values()
assert list(sh2022_df['No.']) == list(range(1, 11093))
assert set(norm(sh2022_df.iloc[:, 2])) == set(norm(pleco_df.OCR))
assert len(set(norm(pleco_df.OCR)) ^ set(norm(sh2021_df.iloc[:, 2]))) <= 3  # {'妈妈', '妈妈｜妈', '汇率'}

In [4]:
# Grammar points
df = pd.read_csv('downloads/shawkynasr-grammar.csv').rename(columns={
    'No.': 'No',
    '级别': 'Level',
    '语法项目': 'Group',
    '类别': 'Category',
    '细目': 'Details',
    '语法内容': 'Content',
})
df['Level'] = df['Level'].map({'一级':'1', '三级':'3', '二级':'2', '五级':'5', '六级':'6', '四级':'4', '高等':'7-9'})
df.to_csv('hsk30-grammar.csv', index=False)

**https://github.com/andycburke/HSK-3.0-Word-List/**
* Alternative OCRed list
* Obsolete/abandoned: all corrections are now incorporated in pleco's list; still has one unfixed error: 人手 (https://github.com/andycburke/HSK-3.0-Word-List/pull/2)

In [5]:
![ -f downloads/andycburke.csv ] || curl -s -o downloads/andycburke.csv https://raw.githubusercontent.com/andycburke/HSK-3.0-Word-List/main/HSK-3.0-Word-List.csv
df = pd.read_csv('downloads/andycburke.csv', dtype='str').fillna('')
assert list(df.HSK_3_0_Level) == list(pleco_df.Level)
assert list(df.HSK_3_0_No) == list(pleco_df.No)  # has 1856 火药
assert len(df[df.OCR != pleco_df.OCR]) <= 2
# L7-3498: correct 入手 rùshǒu    (pleco)
# L5-0101: correct 称²（动) chēng (pleco)
#print(df[df.OCR != pleco_df.OCR], '\npleco:\n', pleco_df[df.OCR != pleco_df.OCR])

**https://github.com/krmanik/HSK-3.0-words-list**
  * Based on pleco's list, some confusing old files in repo
  * Pinyin derived from hanzi, so has extra pinyin; entries with same hanzi merged together
  * Has traditional forms, but they are unreliable in ambiguous cases and evidently unchecked, even for basic characters
  * Variants missing
  * 入手 error from pleco's list not fixed

In [6]:
krmanik = ['\t'.join(['Level', 'Simplified', 'Traditional', 'Pinyin', 'Zhuyin', 'Audio', 'Sim'])]

for lv in ['1', '2', '3', '4', '5', '6', '7-9']:
    if not os.path.exists(f'downloads/krmanik{lv}.tsv'):
        !curl -o "downloads/krmanik{lv}.tsv" "https://raw.githubusercontent.com/krmanik/HSK-3.0-words-list/main/HSK%20list%20with%20meaning/Anki%20xiehanzi__HSK%20{lv}.tsv"
    for line in open(f'downloads/krmanik{lv}.tsv'):
        krmanik.append(lv + '\t' + line.rstrip('\r\n'))

df = pd.read_csv(io.StringIO('\n'.join(krmanik)), sep='\t')
df['Pinyin'] = df.Pinyin.str.replace('</?span[^>]*>', '', regex=True)
df.to_csv('downloads/krmanik.csv', index=False)

if 0 and os.path.exists('hsk30-expanded.csv'):
    #hdf = pd.read_csv('hsk30.csv', dtype='str').fillna('')
    hdf = pd.read_csv('hsk30-expanded.csv', dtype='str').fillna('')[lambda X: X.Example != '1']
    for lv in sorted(set(df.Level)):
        miss = set(hdf[lambda X: (X.Level == lv)].Simplified) - set(df[df.Level == lv].Simplified)
        extra = set(df[df.Level == lv].Simplified) - set(hdf[lambda X: X.Level == lv].Simplified)
        print('L%s missing: %s, extra: %s' % (lv, miss, extra))

    trads = {}
    for row in hdf.itertuples():
        trads.setdefault(row.Simplified, []).extend(row.Traditional.split('|'))
    k = 0
    for row in df.itertuples():
        if row.Traditional not in trads.get(row.Simplified, []):
            k += 1
            print(k, row.Simplified, row.Traditional, 'not in', set(trads.get(row.Simplified, [])))

## Merge and cleanup

In [7]:
# Main data source: a scrape of vocab from https://www.chinesetest.cn/standardsAction.do?means=getStandardWordsList&leves=&words=&pinyin=&words_type=&pager.offset=0
# https://raw.githubusercontent.com/shawkynasr/HSK-official-Query-System/main/%E8%AF%8D%E6%B1%87%202022.csv
sh_df = pd.read_csv('downloads/shawkynasr2022.csv', dtype='str').fillna('').rename(columns={
    'No.': 'No',
    '词语': 'Hanzi',
    '级别': 'Level',
    '拼音': 'Pinyin',
    '词性': 'POS',
})
sh_df['Level'] = sh_df['Level'].map({'一级':1, '三级':3, '二级':2, '五级':5, '六级':6, '四级':4, '高等':7})

# Auxiliary source (for verification and term indexes in original .pdf): pleco's OCRed list
# https://raw.githubusercontent.com/elkmovie/hsk30/main/wordlist.txt
pleco_df = pd.read_csv('downloads/pleco.csv', dtype='str').fillna('').set_index('ID')
level_ocr_to_id = {}
for row in pleco_df.reset_index().itertuples():
    key = (int(row.Level[0]), row.OCR)
    level_ocr_to_id.setdefault(key, []).append(row.ID)

# Ambiguous entries crossreferenced against original pdf
# http://www.moe.gov.cn/jyb_xwfb/gzdt_gzdt/s5987/202103/W020210329527301787356.pdf
PDF_DISAMBIG_MP = {
   #(level, hanzi, pinyin, pos) -> term_id
   (1, '地', 'de', '助'):	'L1-0066',
   (1, '地', 'dì', '名'):	'L1-0069',
   (1, '干', 'gān', '形'):	'L1-0111',
   (1, '干', 'gàn', '动'):	'L1-0113',
   (1, '还', 'hái', '副'):	'L1-0132',
   (1, '还', 'huán', '动'):	'L1-0153',
   (2, '长', 'cháng', '形'):	'L2-0055',
   (2, '长', 'zhǎng', '动'):	'L2-0715',
   (2, '倒', 'dǎo', '动'):	'L2-0109',
   (2, '倒', 'dào', '动'):	'L2-0111',
   (2, '得', 'dé', '动'):	'L2-0115',
   (2, '得', 'de', '助'):	'L2-0118',
   (2, '实在', 'shízài', '副'): 'L2-0487',
   (2, '实在', 'shízai', '形'): 'L2-0488',
   (3, '背', 'bèi', '名'):	'L3-0030',
   (3, '调', 'diào', '动'):	'L3-0153',
   (3, '调', 'tiáo', '动'):	'L3-0708',
   (3, '精神', 'jīngshén', '名'): 'L3-0386',
   (3, '精神', 'jīngshen', '形、名'): 'L3-0387',
   (4, '倒车', 'dǎo∥chē', ''):	'L4-0169',
   (4, '倒车', 'dào∥chē', ''):	'L4-0170',
   (4, '划', 'huá', '动'):	'L4-0328',
   (4, '划', 'huà', '动'):	'L4-0329',
   (4, '卷', 'juǎn', '动'):	'L4-0428',
   (4, '卷', 'juàn', '量'):	'L4-0429',
   (4, '挑', 'tiāo', '动'):	'L4-0711',
   (4, '挑', 'tiǎo', '动'):	'L4-0714',
   (5, '编辑', 'biānjí', '动'): 'L5-0035',
   (5, '编辑', 'biānji', '名'): 'L5-0036',
   (5, '扇', 'shān', '动'):	'L5-0644',
   (5, '扇', 'shàn', '量、名'): 'L5-0645',
   (5, '吐', 'tǔ', '动'):	'L5-0772',
   (5, '吐', 'tù', '动'):	'L5-0773',
   (6, '露', 'lòu', '动'):	'L6-0557',
   (6, '露', 'lù', '动'):	'L6-0562',
   (6, '蒙', 'mēng', '动'):	'L6-0573',
   (6, '蒙', 'méng', '动'):	'L6-0574',
   (7, '担', 'dān', '动'):	'L7-0736',
   (7, '担', 'dàn', '量'):	'L7-0748',
   (7, '大意', 'dàyì', '名'):	'L7-0719',
   (7, '大意', 'dàyi', '形'):	'L7-0720',
   (7, '地道', 'dìdào', '名'):	'L7-0830',
   (7, '地道', 'dìdao', '形'):	'L7-0831',    
   (7, '缝', 'féng', '动'):	'L7-1175',
   (7, '缝', 'fèng', '名'):	'L7-1179',
   (7, '晃', 'huǎng', '动'):	'L7-1781',
   (7, '晃', 'huàng', '动'):	'L7-1784',
   (7, '哄', 'hōng', '拟声'): 'L7-1657',
   (7, '哄', 'hǒng', '动'):	'L7-1671',
   (7, '哄', 'hòng', '动'):	'L7-1672',
   (7, '闷', 'mēn', '形、动'): 'L7-2831',
   (7, '闷', 'mèn', '形'):	'L7-2836',
   (7, '拧', 'níng', '动'):	'L7-3014',
   (7, '拧', 'nǐng', '动'):	'L7-3017',
}

# Manual expansions for variants/special entries
variants_df = pd.read_csv('data/variants.csv', dtype='str').fillna('')
variants_mp = variants_df.groupby('ID').apply(lambda X: X.drop(columns='ID').to_dict(orient='records')).to_dict()
for vals in variants_mp.values():  # drop NAs
    for v in vals:
        for col in list(v.keys()):
            if v[col].strip() == '':
                v.pop(col)

# Normalize pinyin for matching: drop ∥· and undo tone changes to 不 一
def normalize_pinyin(pinyin, hz):
    mp = {'pò∥àn': "pò'àn", 'gǎn∥ēn': "gǎn'ēn", 'yìxīn-yíyì': 'yīxīn-yīyì'}
    if pinyin in mp:
        return mp[pinyin]

    pinyin = re.sub('[∥·]', '', pinyin)

    if '不' in hz and 'bú' in pinyin:
        assert hz.count('不') == pinyin.count('bú') + pinyin.count('bù'), (pinyin, hz)
        pinyin = pinyin.replace('bú', 'bù')

    if '一' in hz and re.search('(yí|yì)', pinyin):
        assert hz.count('一') == pinyin.count('yí') + pinyin.count('yì'), (pinyin, hz)
        pinyin = pinyin.replace('yí', 'yī')
        pinyin = pinyin.replace('yì', 'yī')

    return pinyin


rows = []
seen_ids = set()

for row in sh_df.sort_values(['Level', 'Pinyin']).itertuples():
    # Determine term_id (level+index from .pdf) by matching up with pleco's list
    hz = row.Hanzi.replace('∣', '｜').replace('1', '¹').replace('2', '²')
    ids = (
        level_ocr_to_id.get((row.Level, hz), []) +
        level_ocr_to_id.get((row.Level, f'{hz}（{row.POS}）'), [])
    )
    if len(ids) >= 2:
        key = (row.Level, row.Hanzi, row.Pinyin, row.POS)
        ids = [PDF_DISAMBIG_MP.pop(key)]  # pop, so unused entries are caught by the final assert
    assert len(ids) == 1
    term_id = ids[0]
    assert term_id in pleco_df.index
    assert term_id not in seen_ids
    seen_ids.add(term_id)

    # Cleanup formatting in pinyin slightly
    pinyin = row.Pinyin
    for x, y in [("[’‘]", "'"), (' *（', ' ('), ('） *', ') '), ('∣', '|')]:
        pinyin = re.sub(x, y, pinyin).strip()
    assert re.match(r"^[-a-zāáǎàēéěèīíǐìōóǒòūúǔùüǖǘǚǜ ∥·'|/…()]+$", pinyin.lower()), pinyin

    # Translate part of speech to english
    POS_MAP = {
        '名': 'N',     # noun
        '动': 'V',     # verb
        '形': 'Adj',   # adjective/state verb; Vs in TOCFL
        '副': 'Adv',   # adverb
        '代': 'Pron',  # pronoun; sometimes Det/N in TOCFL
        '数': 'Num',   # numeral
        '量': 'M',     # measure word/classifier
        '介': 'Prep',  # preposition
        '连': 'Conj',  # conjunction
        '助': 'Aux',   # auxiliary word; usually Ptc in TOCFL
        '叹': 'Intj',  # interjection/exclamation/particle 喂 啊 哎呀; Ptc in TOCFL
        '前缀': 'Prefix',
        '后缀': 'Suffix',
        '拟声': 'Phonetic',  # onomatopoeia
    }
    pos = ''
    if row.POS:
        pos = '/'.join([POS_MAP[s] for s in row.POS.split('、')])

    hanzi = re.sub(r'[\u2223∣|]', '|', row.Hanzi)
    assert re.match(r'^([\u4E00-\u9FFF]|[〇|…（）12])+$', hanzi)

    variants = variants_mp.get(term_id, [])
    if any(c in '12（）|…/'  for c in (hanzi+pinyin)):
        # all special entries have Variants entry
        assert term_id in variants_mp, row
    else:
        assert term_id not in variants_mp, row

    for variant in variants:
        variant['Pinyin'] = normalize_pinyin(variant['Pinyin'], variant['Simplified'])

    rows.append({
        'ID': term_id,
        'Simplified': hanzi,
        'Pinyin': normalize_pinyin(pinyin, hanzi),
        'POS': pos,
        'Level': str(row.Level) if row.Level <= 6 else '7-9',
        'WebNo': row.No,
        'WebPinyin': pinyin,
        'OCR': pleco_df.loc[ids[0], 'OCR'],
        'Variants': json.dumps(variants_mp[term_id], ensure_ascii=False) if term_id in variants_mp else '',
    })

hsk30_df = pd.DataFrame(rows).sort_values('ID').reset_index(drop=True).set_index('ID').copy()
hsk30_df.to_csv('hsk30.csv')

assert len(PDF_DISAMBIG_MP) == 0  # all disambiguation entries matched
assert len(hsk30_df) == 11092
assert list(hsk30_df.index) == list(pleco_df.index)
assert list(hsk30_df.Level.value_counts().sort_index()) == [500, 772, 973, 1000, 1071, 1140, 5636]
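
The tone-sandhi handling in `normalize_pinyin` can be tried in isolation. A minimal sketch, assuming a hypothetical helper `undo_sandhi` that mirrors only the separator and 不/一 steps; it omits the special-case map and the count assertions, so it is not a drop-in replacement. The sample words are illustrative and not necessarily entries of the list:

```python
import re

def undo_sandhi(pinyin, hanzi):
    """Drop the ∥ and · separators, then restore citation tones for 不 and 一."""
    pinyin = re.sub('[∥·]', '', pinyin)
    if '不' in hanzi:
        pinyin = pinyin.replace('bú', 'bù')
    if '一' in hanzi:
        # Naive: also rewrites an unrelated 'yì' syllable in the same word,
        # which is why the full function asserts syllable counts first.
        pinyin = pinyin.replace('yí', 'yī').replace('yì', 'yī')
    return pinyin

samples = [('búyào', '不要'), ('yídìng', '一定'), ('yìqǐ', '一起'), ('jiàn∥miàn', '见面')]
normalized = [undo_sandhi(py, hz) for py, hz in samples]
```

The normalized form is what gets matched against dictionaries, while the unmodified web pinyin is kept in the `WebPinyin` column.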

## Add traditional forms and join with CEDICT

In [8]:
# ID, simplified => space separated trad variants. Main/TW variant first.
TRAD_VARIANTS_MP = {
  ('L1-0213', '里'): '裡 裏',	    # lǐ N
  ('L1-0251', '哪里'): '哪裡 哪裏',	# nǎlǐ Pron
  ('L1-0256', '那里'): '那裡 那裏',	# nàlǐ Pron
  ('L1-0271', '你'): '你',	        # nǐ Pron, TW has 妳 but that's rather esoteric for HSK
  ('L2-0028', '表'): '表 錶',	    # biǎo
  ('L2-0377', '面'): '面',	        # miàn N/M
  ('L2-0378', '面'): '麵',	        # miàn N
  ('L2-0574', '喂'): '喂',	        # wèi Intj
  ('L3-0693', '台'): '台 臺',	    # tái N/M
  ('L2-0742', '周'): '週',	        # zhōu M
  ('L3-0772', '系'): '系',	        # xì N
  ('L3-0898', '证'): '證',	        # zhèng N
  ('L3-0954', '准'): '準',	        # zhǔn Adj/Adv
  ('L4-0099', '冲'): '衝 沖',	    # chōng V
  ('L4-0430', '卷'): '捲 卷',	    # juǎn V
  ('L4-0482', '了解'): '了解 瞭解',	# liǎojiě V
  ('L4-0683', '松'): '鬆',	        # sōng Adj/V
  ('L4-0760', '喂'): '餵',	        # wèi V
  ('L5-0094', '尝'): '嚐 嘗',	    # cháng V
  ('L5-0120', '丑'): '醜',	        # chǒu Adj
  ('L5-0224', '发布'): '發布 發佈',	# fābù V
  ('L5-0326', '胡同儿'): '胡同兒',	# hútòngr N, theoretical variant: 衚衕兒 but taiwanese wouldn't write with erhua
  ('L5-0406', '尽可能'): '盡可能',	# jìn kěnéng; alt. 儘可能 has different tone [jǐn]
  ('L7-3998', '坛'): '壇',	        # tán N; alt. 罈 esoteric
  ('L7-4684', '须'): '須',	        # xū V; alt. 鬚 beard

  # complicated/rare variant cases, leave unchanged as main trad variant
  ('L7-4577', '效仿'): '效仿',
  ('L7-4741', '熏'): '熏',
  ('L7-4742', '熏陶'): '熏陶 薰陶',
}

UNTONE_MP = {
    'a': 'a', 'ā': 'a', 'á': 'a', 'ǎ': 'a', 'à': 'a',
    'e': 'e', 'ē': 'e', 'é': 'e', 'ě': 'e', 'è': 'e',
    'o': 'o', 'ō': 'o', 'ó': 'o', 'ǒ': 'o', 'ò': 'o',
    'i': 'i', 'ī': 'i', 'í': 'i', 'ǐ': 'i', 'ì': 'i',
    'u': 'u', 'ū': 'u', 'ú': 'u', 'ǔ': 'u', 'ù': 'u',
    'ü': 'ü', 'ǖ': 'ü', 'ǘ': 'ü', 'ǚ': 'ü', 'ǜ': 'ü'
}

opencc_tw2s = opencc.OpenCC('tw2s')
opencc_s2tw = opencc.OpenCC('s2tw')

# At this point, additional data sources are needed from zhongwen repo
# to cross-reference against -- run the notebook from inside that repo.
cedict_df = pd.read_csv('../cedict/cedict.csv', dtype='str').fillna('')
cedict_idx = cedict_df.assign(idx=cedict_df.index).groupby('Simplified').idx.apply(list)

# Characters from Table of General Standard Chinese Characters for verification.
# HSK uses characters from only first two levels
tgh_chars = set(pd.read_csv('../chars/tgh.csv')[lambda X: X.level <= 2].char)
assert len(tgh_chars) == 6500

tw_words = \
  set(pd.read_csv('../dangdai/dangdai.csv').Traditional) | \
  set(pd.read_csv('../modernchinese/modernchinese.csv').Traditional) | \
  set(pd.read_csv('../pavc/pavc.csv').Traditional) | \
  set(pd.read_csv('../tocfl/tocfl-expanded.csv').Traditional) | \
  set(pd.read_csv('../tbcl/tbcl-expanded.csv').Traditional)

# Check if pinyin from hsk (py1) matches cedict's (py2)
# Optionally matching untoned vowels with tones if untone==True.
def pinyin_matches(py1, py2, hz='', untone=False, yi=True, bu=True):  # py2 cedict
    py1 = py1.lower()
    py2 = py2.lower()
    i, j = 0, 0
    while i < len(py1) or j < len(py2):
        a = ''
        if i < len(py1):
            a = py1[i]
            if a in "-∥·', ":
                i += 1
                continue

        b = ''
        if j < len(py2):
            b = py2[j]
            if b in "-∥·', ":
                j += 1
                continue

        match = (a == b)
        match |= untone and (UNTONE_MP.get(a, a) == b or a == UNTONE_MP.get(b, b))
        if i > 0 and j > 0:
            match |= yi and py1[i-1:i+1] in ['yí', 'yì'] and py2[j-1:j+1] == 'yī' and '一' in hz
            match |= bu and py1[i-1:i+1] == 'bú' and py2[j-1:j+1] == 'bù' and '不' in hz

        if match:
            i += 1
            j += 1
        else:
            return False

    return i == len(py1) and j == len(py2)

# Join with CEDICT and determine traditional form.
# Updates row in place: adds 'Traditional' and 'CEDICT' columns.
def cedict_join(row):
    matches = [cedict_df.iloc[i] for i in cedict_idx.get(row['Simplified'], [])]
    trad_variants = TRAD_VARIANTS_MP.get((row['ID'], row['Simplified']), '').split()

    if trad_variants:
        matches = [m for m in matches if m.Traditional in trad_variants]
        matches.sort(key=lambda m: trad_variants.index(m.Traditional))
        # Keep applying other filters, at least pinyin filter is needed for wei4

    if len(matches) == 0:
        if not trad_variants:
            trad_variants = [opencc_s2tw.convert(row['Simplified'])]
        if row['Simplified'] != opencc_tw2s.convert(trad_variants[0]):
            # No CEDICT match in this branch, so only the back-translation can be reported
            print(f"Simp diff:\t{row['ID']},{row['Simplified']},{row['Pinyin']},{row['POS']}: tw2s {opencc_tw2s.convert(trad_variants[0])}")
        if row.get('Example') != '1':
            print(f"Missing:\t{row['ID']} {row['Simplified']} {row['Pinyin']} {row['POS']} {trad_variants}")
        row['Traditional'] = '|'.join(trad_variants)
        row['CEDICT'] = ''
        return

    if len(matches) >= 2:
        filt = [m for m in matches if pinyin_matches(row['Pinyin'], m.Pinyin, row['Simplified'])]
        if len(filt) == 0:
            filt = [m for m in matches if pinyin_matches(row['Pinyin'], m.Pinyin, row['Simplified'], untone=True)]
        if len(filt) == 0:
            mm = [m.to_dict() for m in matches]
            print(f"No matching pinyin for: {row['ID']} {row['Simplified']} {row['Pinyin']}: {mm}")
        else:
            matches = filt

    # Filter name matches by capitalization - CEDICT has a lot of names
    if len(matches) >= 2:
        filt = [m for m in matches if m.Pinyin[0].islower() == row['Pinyin'][0].islower()]
        assert len(filt) > 0
        matches = filt

    # Filter by words contained in datasets for traditional chinese learners.
    # TODO: further verification, add MOE dict
    if len(matches) >= 2 and not trad_variants:
        filt = [m for m in matches if m.Traditional in tw_words]
        if len(filt) > 0:
            matches = filt

    # Filter weird/obsolete char matches. TODO: further verification
    if len(matches) >= 2 and not trad_variants:
        filt = [m for m in matches if not re.match('^((old|archaic|Japanese)? ?variant of).*', m.Definitions)]
        assert len(filt) > 0, row
        matches = filt

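    # Sort candidates whose traditional form equals tw2s(simplified) to the front,
    # i.e. forms that simplification leaves unchanged; ties break on the traditional form.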
    if len(matches) >= 2 and not trad_variants:
        matches.sort(key=lambda m: (int(m.Traditional != opencc_tw2s.convert(m.Simplified)), m.Traditional))

    if not trad_variants:
        trad_variants = [m.Traditional for m in matches]

    row['CEDICT'] = '/'.join([f'{m.Traditional}|{m.Simplified}[{m.PinyinNumbered}]' for m in matches])
    row['Traditional'] = '|'.join(trad_variants)

    if len(matches) >= 2 and len(set(m.Traditional for m in matches)) > 1:
        print(f"Multiple:\t{row['ID']},{row['Simplified']},{row['Pinyin']},{row['POS']}: {row['CEDICT']}")
    if row['Simplified'] != opencc_tw2s.convert(trad_variants[0]):
        # simplified back translation diff
        print(f"Simp diff:\t{row['ID']},{row['Simplified']},{row['Pinyin']},{row['POS']}:",
              f"{row['CEDICT']}; our {row['Traditional']}, tw2s {opencc_tw2s.convert(trad_variants[0])}; "
              f"s2tw {opencc_s2tw.convert(row['Simplified'])}")


df = hsk30_df.copy()
df.insert(1, 'Traditional', '')
df['CEDICT'] = ''

for row in df.reset_index().fillna('').to_dict(orient='records'):
    if row['Variants']:
        variants = json.loads(row['Variants'])
        for variant in variants:
            vrow = dict(row)
            vrow.update(variant)
            cedict_join(vrow)
            variant['Traditional'] = vrow['Traditional']
            variant['CEDICT'] = vrow['CEDICT']
        df.loc[row['ID'], 'Variants'] = json.dumps(variants, ensure_ascii=False)
        df.loc[row['ID'], 'Traditional'] = opencc_s2tw.convert(row['Simplified'])
        df.loc[row['ID'], 'CEDICT'] = [v['CEDICT'] for v in variants if 'CEDICT' in v][0]
        assert all(c in tgh_chars or c in '|（）〇12…' for c in row['Simplified']), row
    else:
        cedict_join(row)
        df.loc[row['ID'], 'Traditional'] = row['Traditional']
        df.loc[row['ID'], 'CEDICT'] = row['CEDICT']
        assert all(c in tgh_chars for c in row['Simplified']), row

hsk30_trad_df = df
hsk30_trad_df.to_csv('hsk30.csv')

Missing:	L1-0044 车上 chē shang  ['車上']
Multiple:	L1-0213,里,lǐ,N: 裡|里[li3]/裏|里[li3]
Multiple:	L1-0251,哪里,nǎlǐ,Pron: 哪裡|哪里[na3 li3]/哪裏|哪里[na3 li3]
Multiple:	L1-0256,那里,nàlǐ,Pron: 那裡|那里[na4 li5]/那裏|那里[na4 li5]
Multiple:	L2-0028,表,biǎo,N: 表|表[biao3]/錶|表[biao3]
Missing:	L2-0034 不太 bù tài  ['不太']
Missing:	L2-0044 不一会儿 bù yīhuìr  ['不一會兒']
Missing:	L2-0264 见过 jiànguo  ['見過']
Missing:	L2-0507 送到 sòngdào  ['送到']
Missing:	L2-0722 这时候 zhè shíhou  ['這時候']
Missing:	L3-0192 放到 fàngdào  ['放到']
Missing:	L3-0508 能不能 néng bu néng  ['能不能']
Multiple:	L3-0693,台,tái,N/M: 台|台[tai2]/臺|台[tai2]
Multiple:	L4-0099,冲,chōng,V: 衝|冲[chong1]/沖|冲[chong1]
Multiple:	L4-0328,划,huá,V: 划|划[hua2]/劃|划[hua2]
Multiple:	L4-0337,汇报,huìbào,V/N: 匯報|汇报[hui4 bao4]/彙報|汇报[hui4 bao4]
Multiple:	L4-0428,卷,juǎn,V: 卷|卷[juan3]/捲|卷[juan3]
Multiple:	L4-0482,了解,liǎojiě,V: 了解|了解[liao3 jie3]/瞭解|了解[liao3 jie3]
Missing:	L4-0853 眼里 yǎnli N ['眼裡']
Missing:	L4-0897 有劲儿 yǒujìnr  ['有勁兒']
Multiple:	L5-0094,尝,cháng,V: 嚐|尝[chang2]/嘗|尝[chang2]
Missing:	L5-010
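
`UNTONE_MP` above lists the toned vowels by hand. As an alternative sketch (not what the notebook uses), the same mapping can be derived from Unicode decomposition; `untone` is a hypothetical helper name:

```python
import unicodedata

# Combining macron, acute, caron and grave: the four pinyin tone marks
TONE_MARKS = {'\u0304', '\u0301', '\u030c', '\u0300'}

def untone(pinyin):
    """Strip tone marks but keep the diaeresis of ü."""
    decomposed = unicodedata.normalize('NFD', pinyin)
    kept = ''.join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize('NFC', kept)
```

Comparing `untone(a) == untone(b)` is coarser than `pinyin_matches(..., untone=True)`, which only lets a toned vowel match its untoned counterpart position by position, so this is a convenience for quick checks rather than a replacement.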

## Expand variants

Expand both simplified and traditional variants.

In [9]:
expanded_rows = []
for row in hsk30_trad_df.reset_index().to_dict(orient='records'):
    variants = row.pop('Variants')
    variants = json.loads(variants) if variants else [{}]
    for variant in variants:  # expand simplified
        vrow = dict(row)
        vrow.update(variant)
        assert len(vrow['Traditional'].split()) == 1  # not empty and not spaces
        assert '/' not in vrow['Traditional']
        trad_variants = vrow['Traditional'].split('|')

        for trad in trad_variants:  # expand traditional
            vrow = dict(row)
            vrow.update(variant)
            vrow['Traditional'] = trad
            if '/' in vrow['CEDICT']:
                vrow['CEDICT'] = [v for v in vrow['CEDICT'].split('/') if v.startswith(trad + '|')][0]
            expanded_rows.append(vrow)

            for col in ['Simplified', 'Traditional', 'Pinyin']:
                assert all(c not in '|/…()（）12∥·' for c in vrow[col]), vrow
            assert all(c in tgh_chars or c == '〇' for c in vrow['Simplified']), vrow

expanded_df = pd.DataFrame(expanded_rows)
expanded_df.to_csv('hsk30-expanded.csv', index=False)
print('hsk30-expanded.csv: %d rows\n' % len(expanded_df))

hsk30-expanded.csv: 11165 rows
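
As a concrete illustration of the traditional-variant step, here is a minimal sketch with a hypothetical helper `expand_traditional` (not a function from the notebook), applied to the `L1-0213` values reported in the join output:

```python
def expand_traditional(row):
    """Yield one copy of `row` per '|'-separated traditional form."""
    for trad in row['Traditional'].split('|'):
        out = dict(row)
        out['Traditional'] = trad
        # Keep only the CEDICT reference that belongs to this traditional form
        refs = [r for r in row.get('CEDICT', '').split('/') if r.startswith(trad + '|')]
        if refs:
            out['CEDICT'] = refs[0]
        yield out

row = {'ID': 'L1-0213', 'Simplified': '里', 'Traditional': '裡|裏',
       'CEDICT': '裡|里[li3]/裏|里[li3]'}
expanded = list(expand_traditional(row))
```

Both resulting rows keep the same `ID`, which is why the expanded file has more rows than the 11092 list entries.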



## Character list

In [10]:
char_level = {}
char_to_words = {}
char_to_ids = {}
char_to_trad = {}
for row in expanded_df.itertuples():
    lv = int(row.Level[0])
    assert len(row.Simplified) == len(row.Traditional)
    for ch, tch in zip(row.Simplified, row.Traditional):
        char_level[ch] = min(char_level.get(ch, lv), lv)
        wl = char_to_words.setdefault(ch, [])
        if row.Simplified not in wl:
            wl.append(row.Simplified)
        char_to_ids.setdefault(ch, set()).add(row.ID)
        char_to_trad.setdefault(ch, [])
        if tch not in char_to_trad[ch]:
            char_to_trad[ch].append(tch)

df = pd.read_csv('downloads/shawkynasr-hanzi.csv', encoding='GB2312').rename(columns={
    'No.': 'No',
    '汉字': 'Hanzi',
    '级别': 'Level',
    '拼音': 'Pinyin',
})
df['Level'] = df['Level'].map({'一级':1, '三级':3, '二级':2, '五级':5, '六级':6, '四级':4, '高等':7})

miss_chars = set(char_level.keys()) - set(df.Hanzi)
print('In wordlist, but not char list:', miss_chars)
assert miss_chars == {'〇'}  # variant outside unihan anyway

extra_chars = set(df.Hanzi) - set(char_level.keys())
print(f'Extra {len(extra_chars)} chars not in wordlist: {"".join(extra_chars)}')
assert set(df[df.Hanzi.isin(extra_chars)].Level) == {7}  # all are L7 chars

for c in extra_chars:
    char_level[c] = 7
del char_level['〇']
print('Final list: %d chars' % len(char_level))
assert set(char_level.keys()) == set(df.Hanzi)

# Levels in the scraped character list don't match the wordlist-derived ones (debug aid):
#for row in df.itertuples():
#    if char_level[row.Hanzi] != row.Level:
#        print(row.Hanzi, char_level[row.Hanzi], row.Level)

char_writing_level = {}
char_level_index = {}
lv = 0
for line in open('downloads/pleco-charlist.txt'):
    line = line.strip()
    if not line or line.startswith('#'):
        continue
    if '级汉字表' in line:
        lv += 1
        idx = 0
    elif '等手写字表' in line:
        lv = (lv + 10) - (lv % 10)
        idx = 0
    else:
        idx += 1
        m = re.match('^([0-9]+)\t([\u4E00-\u9FFF])$', line)
        assert m, line
        assert int(m[1]) == idx
        hanzi = m[2]
        hanzi = {'洎': '泊'}.get(hanzi, hanzi)  # OCR error https://github.com/elkmovie/hsk30/issues/9
        if lv >= 10:
            assert hanzi in char_level
            char_writing_level[hanzi] = lv//10
        else:
            assert char_level[hanzi] == lv, (hanzi, char_level[hanzi], lv)
            char_level_index[hanzi] = (lv, idx)

rows = []
for ch in sorted(char_level.keys(), key=lambda ch: char_level_index[ch]):
    assert len(ch) == 1
    examples = char_to_words.get(ch, [])
    rows.append({
        'Hanzi': ch,
        'Level': '7-9' if str(char_level[ch]) == '7' else str(char_level[ch]),
        'WritingLevel': str(char_writing_level.get(ch, '')),
        'Traditional': '/'.join(char_to_trad.get(ch, [])),
        #'Freq': len(char_to_ids.get(ch, [])),
        'Freq': len(char_to_words.get(ch, [])),
        'Examples': ' '.join(examples[:min(len(examples), 5)]),
    })
chars_df = pd.DataFrame(rows).set_index('Hanzi')
chars_df.to_csv('hsk30-chars.csv')

print(chars_df.Level.value_counts().sort_index())
print(chars_df.WritingLevel.value_counts().sort_index())

In wordlist, but not char list: {'〇'}
Extra 29 chars not in wordlist: 吴冯州渝袁粤赵蜀秦潘韩浙杭欧邓宋澳淮魏浦郭孟孔洲吕沪刘曹唐
Final list: 3000 chars
Level
1       300
2       300
3       300
4       300
5       300
6       300
7-9    1200
Name: count, dtype: int64
WritingLevel
     1800
1     300
2     400
3     500
Name: count, dtype: int64
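
The character level used above is the lowest level of any word containing the character. A toy sketch of that rule, with a hypothetical helper `char_levels` and three words taken from the join output (the resulting levels reflect only this sample, not the full list):

```python
def char_levels(words):
    """Map each character to the lowest level of any word containing it."""
    levels = {}
    for level, word in words:
        for ch in word:
            levels[ch] = min(levels.get(ch, level), level)
    return levels

sample = [(4, '眼里'), (1, '哪里'), (2, '不太')]
levels = char_levels(sample)
```

Here 里 ends up at level 1 because 哪里 is a level 1 word, even though 眼里 is only introduced at level 4.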


## Check readings

In [11]:
# Check pinyin against the possible syllable readings in CEDICT.
# Expect only minor neutral (5th) tone mismatches; any other difference needs a closer look.
if os.path.exists('../cedict/syllables.csv'):
    readings_mp = {} #{'一': set(['yì','yí']), '不': set(['bú'])}
    syll_df = pd.read_csv('../cedict/syllables.csv', dtype='str').fillna('')
    for row in syll_df.itertuples():
        readings_mp.setdefault(row.Traditional, set()).add(row.Pinyin.lower())
    readings_mp = {x: set([y.strip().lower() for y in readings_mp[x] if y.strip()]) for x in readings_mp}

    def gen_readings(trad):
        if trad == '':
            yield ''
        elif trad[0] not in readings_mp or ord(trad[0]) < 0x3000:
            yield from gen_readings(trad[1:])
        else:
            for x in readings_mp[trad[0]]:
                for y in gen_readings(trad[1:]):
                    yield x + ("'" if y and y[0] in 'aāáǎàeēéěèoōóǒò' else '') + y
                    yield x + ("-" if y and y[0] in 'aāáǎàeēéěèoōóǒò' else '') + y
                    if y:
                        yield x + ' ' + y
                        yield x + '-' + y

    for row in expanded_df.fillna('').itertuples():
        trad, pinyin = row.Traditional, row.Pinyin
        readings = list(set(gen_readings(trad)))
        if pinyin.lower() not in readings:
            print(list(row._asdict().values())[2:5], 'vs.', readings[:min(len(readings), 10)])

['困难', '困難', 'kùnnan'] vs. ['kùn nán', 'kùn-nàn', 'kùn nàn', 'kùnnán', 'kùnnàn', 'kùn-nán']
['学问', '學問', 'xuéwen'] vs. ['xué wèn', 'xué-wèn', 'xuéwèn']
['买卖', '買賣', 'mǎimai'] vs. ['mǎimài', 'mǎi-mài', 'mǎi mài']
['熟悉', '熟悉', 'shúxi'] vs. ['shóu xī', 'shóuxī', 'shóu-xī', 'shúxī', 'shú-xī', 'shú xī']
['队伍', '隊伍', 'duìwu'] vs. ['duìwǔ', 'duì wǔ', 'duì-wǔ']
['比试', '比試', 'bǐshi'] vs. ['bíshì', 'bī-shì', 'bǐ-shì', 'bìshì', 'bí-shì', 'bǐshì', 'bì-shì', 'bī shì', 'bí shì', 'bǐ shì']
['大大咧咧', '大大咧咧', 'dàdaliēliē'] vs. ['dàidài liěliě', 'dàdá lie liē', 'dà-dài lie liě', 'dà-dàiliě liē', 'dádà-lieliě', 'dài dà-liělie', 'dài-dà liē-lie', 'dádà lie-lie', 'dá-dàiliē-liě', 'dài-dài liěliě']
['灯笼', '燈籠', 'dēnglong'] vs. ['dēnglóng', 'dēng-lǒng', 'dēng-lóng', 'dēnglǒng', 'dēng lóng', 'dēng lǒng']
['动静', '動靜', 'dòngjing'] vs. ['dong jìng', 'dòng-jìng', 'dòngjìng', 'dòng jìng', 'dongjìng', 'dong-jìng']
['风筝', '風箏', 'fēngzheng'] vs. ['fēngzhēng', 'fēng zhēng', 'fēng-zhēng']
['固执', '固執', 'gùzhi'] vs. ['gù 

In [12]:
# Check all characters were converted to traditional

unihan_df = pd.read_csv('../unihan/unihan.csv', dtype='str').fillna('').set_index('char')

# https://www.unicode.org/reports/tr38/#SCTC
def classify(c):
    if c == '〇': return 'both'
    assert c in unihan_df.index, c
    tv = unihan_df.kTraditionalVariant.loc[c]
    sv = unihan_df.kSimplifiedVariant.loc[c]
    if tv == '' and sv == '': return 'both'
    if tv == '': return 'T'
    if sv == '': return 'S'
    return 'complex'

for row in expanded_df.itertuples():
    trad, simp = row.Traditional, row.Simplified
    assert len(trad) == len(simp), (row._asdict())
    for sc, tc in zip(simp, trad):
        assert classify(tc) in ('T', 'both', 'complex'), (row.ID, sc, tc)
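
The labels returned by `classify` follow from which Unihan variant fields are present: a character with only a simplified variant is a traditional form (`T`), one with only a traditional variant is a simplified form (`S`), one with neither is shared (`both`), and one with both needs a manual look (`complex`). A self-contained sketch of the same decision table, using plain dicts with hand-written toy entries instead of `unihan.csv` (whose exact value format is not shown here):

```python
def classify(char, trad_variants, simp_variants):
    """Classify a character from Unihan-style variant tables.

    The two dicts map a character to its kTraditionalVariant and
    kSimplifiedVariant value; a missing key means the field is absent.
    """
    tv = trad_variants.get(char, '')
    sv = simp_variants.get(char, '')
    if not tv and not sv:
        return 'both'     # same form in both scripts
    if not tv:
        return 'T'        # only a simplified variant exists: traditional form
    if not sv:
        return 'S'        # only a traditional variant exists: simplified form
    return 'complex'      # both fields set

# Toy tables for illustration only
TRAD = {'难': '難'}
SIMP = {'難': '难'}
labels = {c: classify(c, TRAD, SIMP) for c in '難难困'}
```

The final check in the cell above therefore fails only when a character in the `Traditional` column is classified `S`, meaning it was left unconverted.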