CWN contains 2,775 single-character words, which together have 17,009 senses. From these, the senses with at least two bisyllabic synonyms (linked either by a synonym relation or through a shared synset) are selected. This yields 436 senses, which belong to 291 unique single-character words. Of the selected senses, 340 have exactly two synonymous bisyllabic words and 96 have more than two. Noise words are then added to each sense entry: words that have no synonym relation with the corresponding sense but nevertheless contain the same single-character word. In total, 13,530 noise words are added.

In [49]:
from CwnGraph import CwnImage, CwnSense
from itertools import chain
from tqdm.auto import tqdm

In [50]:
cwn = CwnImage.latest()

In [51]:
chars = cwn.find_lemma("^[\u4e00-\u9fff]$")

In [52]:
char_senses = list(chain.from_iterable(x.senses for x in chars))

In [116]:
len(set(x.head_word for x in char_senses))

2775

In [53]:
len(char_senses)

17009

In [72]:
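sense_synos = {}
for sense_x in tqdm(char_senses):
    # Skip proper nouns (POS tag Nb).
    if sense_x.pos == "Nb":
        continue
    src_char = sense_x.head_word
    # Collect synonyms from direct synonym relations and from the synset, if any.
    synos = set(sense_x.synonym)
    if sense_x.synset:
        synos |= set(sense_x.synset.senses)
    synos.discard(sense_x)
    # Keep only two-character senses whose head word contains the source character.
    synos = [x for x in synos
             if type(x) is CwnSense and
                len(x.head_word) == 2 and
                src_char in x.head_word]
    # Require at least two such synonyms.
    if len(synos) > 1:
        sense_synos[sense_x] = synos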

In [73]:
len(sense_synos)

436
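
The selection criterion can be illustrated without loading CWN. The snippet below is a minimal sketch on plain strings: `select_bisyllabic_synonyms` and the candidate list are hypothetical and made up for illustration (only 拍照 and 拍攝 appear as synonyms of 拍 in this notebook), and the real loop works on `CwnSense` objects rather than strings.

```python
def select_bisyllabic_synonyms(src_char, candidate_words):
    """Keep two-character words that contain the source character."""
    return [w for w in candidate_words if len(w) == 2 and src_char in w]


# Hypothetical candidate synonyms for the character 拍 (illustrative only).
candidates = ["拍照", "拍攝", "攝影", "照", "拍攝者"]

kept = select_bisyllabic_synonyms("拍", candidates)
# A sense is selected only if at least two candidates survive the filter.
qualifies = len(kept) > 1

print(kept)       # ['拍照', '拍攝']
print(qualifies)  # True
```

Words that are synonymous but do not contain the source character (攝影 here) are dropped, as are words that are not exactly two characters long.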

In [113]:
len(set(x.head_word for x in sense_synos))

291

In [77]:
sorted_senses = sorted(sense_synos.keys(), key=lambda x: -len(sense_synos[x]))
[(x, len(sense_synos[x])) for x in sorted_senses[:10]]

[(<CwnSense[03010601](多，Neqa): 形容數量大或種類多。>, 6),
 (<CwnSense[05205701](靠，I): 咒罵語。>, 6),
 (<CwnSense[03018211](後，Nd): 晚於現在或前述事件的時段。>, 5),
 (<CwnSense[09258602](交，Nc): 交通大學的建築物及建築物所在的位置。>, 5),
 (<CwnSense[06053401](若，Cbb): 連接假設條件。>, 5),
 (<CwnSense[06648204](減，VC): 比喻使後述事件的程度降低。>, 5),
 (<CwnSense[03052801](原，A,D): 表沒有經過改變的，或者改變之前的。>, 4),
 (<CwnSense[03054401](遭，VJ): 遇到不幸或不利的事。>, 4),
 (<CwnSense[04000602](且，Cbb): 連接兩對等句子，表示進一層關係。>, 4),
 (<CwnSense[04020102](那，Ncd): 離說話者或參考點較遠的位置。>, 4)]

In [114]:
from collections import Counter
Counter(len(x) for x in sense_synos.values()).most_common()

[(2, 340), (3, 77), (4, 13), (5, 4), (6, 2)]

In [115]:
77+13+4+2

96
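
As a quick consistency check, the distribution above can be summed in three ways: over all senses, over senses with more than two synonyms, and weighted by the number of synonyms. The sketch below copies the counts from the `Counter` output by hand; the variable names are illustrative.

```python
# {number of synonyms: number of senses}, copied from the Counter output.
syno_hist = {2: 340, 3: 77, 4: 13, 5: 4, 6: 2}

n_senses = sum(syno_hist.values())
n_more_than_two = sum(n for k, n in syno_hist.items() if k > 2)
n_syno_words = sum(k * n for k, n in syno_hist.items())

print(n_senses)         # 436
print(n_more_than_two)  # 96
print(n_syno_words)     # 995
```

The weighted sum, 995, is the total number of synonym words that the selected senses carry.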

## Adding noise words

In [82]:
# All two-character lemmas in CWN ("^..$" matches any two characters).
bisylls = list({x.lemma for x in cwn.find_lemma("^..$")})

In [112]:
len(bisylls)

17435

In [86]:
list(sense_synos.items())[100]

(<CwnSense[04045801](些，Nf): 表不確定的小數量。>,
 [<CwnSense[05161701](些許，Neqa): 表不確定的小數量。>,
  <CwnSense[05052701](一些，Neqa): 表不確定的小數量。>,
  <CwnSense[03052002](有些，Dfa): 表不確定的小數量。>])

In [94]:
def make_noise(sense_pair):
    """Return bisyllabic words containing the source character, minus the synonyms' head words."""
    src_sense, syno_senses = sense_pair
    src_char = src_sense.head_word
    candidates = {x for x in bisylls if src_char in x}
    return candidates - {x.head_word for x in syno_senses}
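
To see what the noise filter does in isolation, here is a minimal sketch that works on plain strings instead of `CwnSense` objects. `make_noise_words` and `toy_vocab` are hypothetical stand-ins for `make_noise` and `bisylls`; the vocabulary is a tiny hand-picked list, not the real CWN lemma set.

```python
def make_noise_words(src_char, syno_words, vocabulary):
    """Words in `vocabulary` that contain `src_char` but are not synonyms."""
    candidates = {w for w in vocabulary if src_char in w}
    return candidates - set(syno_words)


# Toy vocabulary (assumption: stands in for the full list of bisyllabic lemmas).
toy_vocab = ["拍照", "拍攝", "拍賣", "球拍", "排球", "排隊"]

noise = make_noise_words("拍", ["拍照", "拍攝"], toy_vocab)
print(sorted(noise))  # ['拍賣', '球拍']
```

Note that the filter compares head words only, so any bisyllabic word sharing a head word with a selected synonym is excluded from the noise set regardless of which of its senses is meant.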

In [106]:
char_dataset = []
for sense_pair in sense_synos.items():
    src_sense, syno_senses = sense_pair
    noises = make_noise(sense_pair)
    dataitem = {
        "src_sense": [src_sense.id],
        "syno_senses": [x.id for x in syno_senses],
        "src_char": [src_sense.head_word],
        "syno_words": [x.head_word for x in syno_senses],
        "noise_words": list(noises)
    }
    char_dataset.append(dataitem)

In [107]:
char_dataset[:2]

[{'src_sense': ['08060701'],
  'syno_senses': ['08037501', '09297601'],
  'src_char': ['排'],
  'syno_words': ['排放', '排出'],
  'noise_words': ['排汗',
   '排除',
   '排擠',
   '排泄',
   '排序',
   '排行',
   '綵排',
   '排成',
   '安排',
   '排氣',
   '排球',
   '肋排',
   '排骨',
   '牛排',
   '排演',
   '排名',
   '豬排',
   '排入',
   '排便',
   '排乾',
   '排水',
   '排擋',
   '排列',
   '排版',
   '排污',
   '排卵',
   '排隊',
   '排斥']},
 {'src_sense': ['08028001'],
  'syno_senses': ['06684701', '06687901'],
  'src_char': ['拍'],
  'syno_words': ['拍照', '拍攝'],
  'noise_words': ['拍動', '重拍', '球拍', '輕拍', '拍賣', '拍子', '拍擊', '拍打']}]

In [108]:
len(char_dataset)

436

In [109]:
sum(len(x["syno_words"]) for x in char_dataset)

995

In [110]:
sum(len(x["noise_words"]) for x in char_dataset)

13530
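
The three totals above give a rough picture of how the dataset is balanced. The following sketch only does arithmetic on the reported counts; the variable names are illustrative.

```python
n_senses = 436
n_syno_words = 995
n_noise_words = 13530

# Average number of synonym and noise words per selected sense.
avg_syno = n_syno_words / n_senses
avg_noise = n_noise_words / n_senses
# Share of noise words among all candidate words (synonyms plus noise).
noise_share = n_noise_words / (n_syno_words + n_noise_words)

print(round(avg_syno, 2))     # 2.28
print(round(avg_noise, 2))    # 31.03
print(round(noise_share, 3))  # 0.931
```

On average, each sense therefore has about 2.3 synonym words against about 31 noise words.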

## Save dataset

In [111]:
import json
with open("../data/char_senses_with_noises.json", "w", encoding="UTF-8") as fout:
    json.dump(char_dataset, fout)
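
The saved file is a JSON list with one record per sense. As a minimal sketch of reading it back, the snippet below round-trips a single toy record (values taken from the 拍 example above, with a shortened noise list) through a temporary file. The file name `toy_char_senses.json` is hypothetical, and `ensure_ascii=False` is an optional choice made here to keep the characters readable in the file; the original call above does not set it.

```python
import json
import os
import tempfile

# One toy record in the same shape as the entries of char_dataset.
toy_dataset = [{
    "src_sense": ["08028001"],
    "syno_senses": ["06684701", "06687901"],
    "src_char": ["拍"],
    "syno_words": ["拍照", "拍攝"],
    "noise_words": ["拍賣", "球拍"],
}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_char_senses.json")  # hypothetical file name
    with open(path, "w", encoding="UTF-8") as fout:
        json.dump(toy_dataset, fout, ensure_ascii=False)
    with open(path, encoding="UTF-8") as fin:
        loaded = json.load(fin)

# Each field is a list, including the single-valued src_sense and src_char.
print(loaded[0]["src_char"], loaded[0]["syno_words"])
```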