KRADFILE lists roughly 6,000 unique Kanji (6,355, as counted below), each consisting of one or more radicals.
There are 254 unique radicals.

I'm thinking:

1. Represent each Kanji as a "bag-of-radicals"
2. Learn a vector for each radical
3. To get the vector for a Kanji, sum (or average) the vectors of its radicals

Once we have vectors for Kanji, we can measure how similar or different two Kanji are. For example, the following pair shares the radical 雨:

# 雪 (snow) vs 電 (electricity)

- 雪 ['ヨ', '雨']
- 電 ['雨', '田', '乙']
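
The three steps above can be sketched on this pair with toy data. The 3-dimensional vectors below are invented purely for illustration (real ones would come from training), and `kanji_vector` is a hypothetical helper name.

```python
# Toy radical vectors; the numbers are made up for illustration only.
radical_vectors = {
    'ヨ': [0.1, 0.0, 0.3],
    '雨': [0.5, 0.2, 0.0],
    '田': [0.0, 0.4, 0.1],
    '乙': [0.2, 0.1, 0.1],
}

# Bag-of-radicals decompositions, as listed above.
bags = {
    '雪': ['ヨ', '雨'],
    '電': ['雨', '田', '乙'],
}

def kanji_vector(kanji):
    """Sum the vectors of the radicals that make up `kanji`."""
    columns = zip(*(radical_vectors[r] for r in bags[kanji]))
    return [sum(col) for col in columns]

# Radicals the two Kanji have in common.
shared = set(bags['雪']) & set(bags['電'])

print(kanji_vector('雪'))
print(kanji_vector('電'))
print(shared)
```

Because both Kanji contain 雨, its vector contributes to both sums, which is what pulls the two Kanji vectors toward each other.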

For learning, I was thinking of using Word2Vec.
Each Kanji would be a "sentence".
Each radical would be a "word".

# Questions

- How is this better than our previous approaches?
- What is Word2Vec actually learning, and _how_?
- Is the training data sufficient? Can we improve it somehow?
- Is the order of words in a sentence important? KRAD doesn't enforce an order...
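
On the last question, a rough way to reason about order is to look at which (center, context) pairs a skip-gram style model would see. The sketch below is a simplified model, not gensim's actual implementation: it uses a fixed symmetric window (5 is, I believe, gensim's default) and ignores negative sampling and subsampling. `skipgram_pairs` is a hypothetical helper.

```python
from itertools import permutations

def skipgram_pairs(sentence, window=5):
    """Return the set of (center, context) pairs within `window` positions."""
    pairs = set()
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.add((center, sentence[j]))
    return pairs

radicals = ['雨', '田', '乙']  # the radicals of 電
reference = skipgram_pairs(radicals)

# The window covers the whole "sentence", so every ordering gives the same pairs.
print(all(skipgram_pairs(list(p)) == reference for p in permutations(radicals)))

# With a window of 1 the pairs do depend on the order.
print(skipgram_pairs(['雨', '田', '乙'], window=1)
      == skipgram_pairs(['田', '雨', '乙'], window=1))
```

Under this simplification, order only matters for Kanji with more radicals than the window can cover. As far as I understand, gensim also shrinks the window at random for each center word, so adjacent radicals would still be paired more often than distant ones, and shuffling the radicals would average that effect out.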

In [1]:
import gzip

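def parse_krad():
    """Yield (kanji, radicals) pairs from the gzipped, EUC-JP encoded KRADFILE."""
    with gzip.GzipFile('kradfile.gz') as fin:
        krad = fin.read().decode('euc-jp')
    for line in krad.split('\n'):
        if line.startswith('#'):
            # Skip comment lines.
            continue
        elif ' : ' in line:
            # Data lines look like "<kanji> : <radical> <radical> ...".
            kanji, radicals = line.split(' : ')
            radicals = radicals.split(' ')
            yield kanji, radicals
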
krad_dict = dict(parse_krad())

In [2]:
sentences = list(krad_dict.values())
sentences[:10]

[['｜', '一', '口'],
 ['｜', '一', '口'],
 ['女', '土'],
 ['一', '口', '亅', '阡'],
 ['衣', '口', '亠'],
 ['心', '爪', '冖', '夂'],
 ['矢', '厶', '扎', '乞'],
 ['一', '口', '女', '个'],
 ['｜', '込', '二', '夂'],
 ['人', '大', '二', '癶', '艾', 'ノ']]

In [3]:
len(sentences)

6355

In [4]:
import collections
counter = collections.Counter()
for sent in sentences:
    counter.update(sent)
len(counter)

254

In [5]:
krad_dict['雪']

['ヨ', '雨']

In [6]:
krad_dict['電']

['雨', '田', '乙']

In [7]:
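import random

from gensim.models.word2vec import Word2Vec

# gensim 3.x API; gensim 4 renamed `size` to `vector_size`.
model = Word2Vec(size=100, min_count=1)
model.build_vocab(sentences, keep_raw_vocab=True)

def jumble(sent, iterations=10):
    """Yield every sentence `iterations` times, each time in a random radical order."""
    for _ in range(iterations):
        for s in sent:
            shuffled = list(s)  # copy, so the lists in `sentences` are not reordered in place
            random.shuffle(shuffled)
            yield shuffled

# model.train(sentences, total_examples=len(sentences), epochs=100)
# NOTE: jumble() returns a generator, which can be consumed only once, so any
# epoch after the first has no data left to read. The raw word count in the
# output (257,000) is consistent with a single pass over the 10 shuffled copies.
model.train(jumble(sentences), total_examples=len(sentences) * 10, epochs=10)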

(133272, 257000)
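
A generator can be iterated only once, so if the intent is for every epoch to see freshly shuffled data, one option is a restartable iterable. The class below is a sketch, not part of the original notebook: `ShuffledCorpus`, `copies` and `seed` are hypothetical names.

```python
import random

class ShuffledCorpus:
    """Restartable iterable: each pass yields every sentence in a fresh random order."""

    def __init__(self, sentences, copies=10, seed=None):
        self.sentences = sentences
        self.copies = copies
        self.rng = random.Random(seed)

    def __iter__(self):
        for _ in range(self.copies):
            for sent in self.sentences:
                shuffled = list(sent)  # copy, so the original stays untouched
                self.rng.shuffle(shuffled)
                yield shuffled

    def __len__(self):
        return len(self.sentences) * self.copies

corpus = ShuffledCorpus([['ヨ', '雨'], ['雨', '田', '乙']], copies=2, seed=0)
first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would be empty on the second pass
print(len(first_pass), len(second_pass))
```

Assuming gensim accepts any restartable iterable of token lists here, the training call would become `model.train(ShuffledCorpus(sentences), total_examples=len(sentences) * 10, epochs=10)`. The reported word counts would then differ from the output shown above.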

In [8]:
len(model.wv.vocab)

254

In [9]:
import numpy as np

def kanjivec(k):
    """Return the mean of the radical vectors for Kanji `k`."""
    # TODO: weight each radical by its stroke count?
    radicals = krad_dict[k]
    vector = np.zeros((model.wv.vector_size,))
    for rad in radicals:
        vector += model.wv[rad]
    vector /= len(radicals)
    return vector

def kanjisim(k1, k2):
    """Euclidean distance between two Kanji vectors.

    Despite the name, this is a distance: smaller values mean more similar.
    """
    return np.linalg.norm(kanjivec(k1) - kanjivec(k2))

kanjisim('愛', '受')

0.6431980119578202
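
Note that `kanjisim` returns a Euclidean distance, so lower means closer. Cosine similarity is a common alternative for embedding vectors because it ignores vector length. The helper below is a dependency-free sketch and is not something the notebook uses; `cosine_similarity` is a hypothetical name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

It also accepts NumPy arrays, so `cosine_similarity(kanjivec('愛'), kanjivec('受'))` should work with the functions defined above. Here higher means more similar, the opposite of `kanjisim`.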

In [10]:
def find_similar(target, n=10):
    """Return the `n` Kanji closest to `target` as (kanji, distance) pairs, nearest first."""
    d = {k: kanjisim(target, k) for k in krad_dict if k != target}
    best = sorted(d.items(), key=lambda x: x[1])
    return best[:n]

find_similar('驪')

[('鶻', 0.2535393381398565),
 ('傅', 0.2651712946052006),
 ('鰤', 0.28658093164707327),
 ('鯖', 0.2967796693262081),
 ('鴎', 0.30603099186896016),
 ('鰊', 0.3080451830199496),
 ('駭', 0.31056915260175655),
 ('嵎', 0.3160042276736563),
 ('駲', 0.31697914730156396),
 ('黜', 0.3180580704107598)]
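
One way to sanity-check a neighbour list like this is to compare it with plain radical overlap, which needs no training at all. The sketch below uses a toy decomposition dict containing only the two Kanji from the earlier example. `shared_radicals` and `jaccard` are hypothetical helpers, and in the notebook `krad_dict` would take the place of `toy_krad`.

```python
def shared_radicals(k1, k2, decomposition):
    """Radicals that appear in both Kanji, sorted for stable output."""
    return sorted(set(decomposition[k1]) & set(decomposition[k2]))

def jaccard(k1, k2, decomposition):
    """Size of the radical intersection divided by the size of the union."""
    a, b = set(decomposition[k1]), set(decomposition[k2])
    return len(a & b) / len(a | b)

# Toy stand-in for krad_dict.
toy_krad = {
    '雪': ['ヨ', '雨'],
    '電': ['雨', '田', '乙'],
}

print(shared_radicals('雪', '電', toy_krad))
print(jaccard('雪', '電', toy_krad))
```

If the learned vectors rank a Kanji as close even though it shares no radicals with the target, the similarity comes from radical co-occurrence learned by Word2Vec and not from direct overlap.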