# Tokenization (or Segmentation)


## Score-based

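When a text has no boundary markers such as whitespace and no glossary is available, extracting tokens becomes very difficult. The problem is even more severe for Classical Chinese text that has not been punctuated.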

Various unsupervised segmentation methods have been proposed, but they tend to be computationally complex and their results are poor relative to that cost.

So far, the approach based on the cohesion score (`cohesion`) provided by [soynlp](https://github.com/lovit/soynlp) seems to be the most effective.

However, its token extraction step is optimized for Hangul, so it is difficult to use as-is.


## Cohesion-based Segmentation

$$
cohesion(c_1, c_2, \ldots, c_n) = \sqrt[n-1]{ \prod_{i=1}^{n-1} P(c_1, \ldots, c_{i+1} \mid c_1, \ldots, c_i) }
$$

$$
= \sqrt[n-1]{ \frac{Freq(c_1, \ldots, c_n)}{Freq(c_1, \ldots, c_{n-1})} \cdot \frac{Freq(c_1, \ldots, c_{n-1})}{Freq(c_1, \ldots, c_{n-2})} \cdots \frac{Freq(c_1, c_2)}{Freq(c_1)} } = \sqrt[n-1]{ \frac{Freq(c_1, \ldots, c_n)}{Freq(c_1)} }
$$
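
The cancellation in the second line can be checked numerically. The following is a minimal sketch, not the soynlp implementation: `chained_cohesion`, `toy_cohesion`, and the counts in `freq` are hypothetical names and numbers invented for this illustration, and both functions assume a term of at least two characters whose prefixes all have non-zero counts.

```python
from collections import Counter
import math

def chained_cohesion(term, freq):
    """Multiply the conditional probabilities, then take the (n-1)-th root."""
    prod = 1.0
    for i in range(1, len(term)):
        # P(c_1..c_{i+1} | c_1..c_i), estimated as a ratio of prefix counts
        prod *= freq[term[:i + 1]] / freq[term[:i]]
    return prod ** (1 / (len(term) - 1))

def toy_cohesion(term, freq):
    """Closed form: (n-1)-th root of Freq(c_1..c_n) / Freq(c_1)."""
    return (freq[term] / freq[term[0]]) ** (1 / (len(term) - 1))

# Hypothetical counts, invented purely for illustration.
freq = Counter({"甘": 100, "甘草": 81, "甘草各": 16})

for term in ("甘草", "甘草各"):
    a = chained_cohesion(term, freq)
    b = toy_cohesion(term, freq)
    # Both forms should agree up to floating-point error.
    print(term, round(a, 4), round(b, 4), math.isclose(a, b))
```

Because the intermediate prefix counts cancel, only the count of the full term and the count of its first character are needed, which is why the implementation below keeps just an n-gram counter and a unigram counter.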

## Lib

In [1]:
def n_gram( text, n=2 ):
    """Return all contiguous character n-grams of text, in order."""
    return [ text[i:i+n] for i in range( len( text ) - n + 1 ) ]

In [2]:
from collections import Counter

def allgrams( corpus, n_min=2, n_max=8):
    """
    Count every n-gram for n_min <= n <= n_max.
    n_min=2, n_max=8 : 2-gram, 3-gram, ... 8-gram (1-grams are not counted)
    corpus = [doc1, doc2, ... docn]
    """
    ns = range( n_min, n_max+1 ) 
    rst = Counter()
    for line in corpus:
        for n in ns:
            rst.update(  n_gram( line, n ) )
    return rst
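
As a quick illustration of what such a counter holds, here is a small self-contained sketch. `count_ngrams` is a hypothetical, compact stand-in for `allgrams` written only for this example, and the two toy strings are made up.

```python
from collections import Counter

def count_ngrams(docs, n_min=2, n_max=3):
    """Count all character n-grams with n_min <= n <= n_max (illustrative helper)."""
    counts = Counter()
    for doc in docs:
        for n in range(n_min, n_max + 1):
            counts.update(doc[i:i + n] for i in range(len(doc) - n + 1))
    return counts

# Toy input, invented for illustration.
toy_docs = ["甘草各一錢", "甘草煎服"]
counts = count_ngrams(toy_docs)

print(counts["甘草"])    # bigram that occurs in both strings
print(counts["甘草各"])  # trigram that occurs in the first string only
print(counts["甘"])      # single characters are not counted when n_min=2
```

Since single characters are excluded when `n_min=2`, the denominator of the cohesion score has to come from a separate unigram counter.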

In [3]:
import math

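def cohesion_calcurator( allgram_counter, unigram_counter, alpha=1 ):
    """
    Return a function that scores a term by its cohesion.
    alpha=1 takes the (n-1)-th root, as in the formula above; alpha=0 takes the n-th root.
    """
    def cohesion( term ):
        size = len( term )
        if size - alpha <= 0:
            return 0.0  # root is undefined (e.g. a single character with alpha=1)
        numerator = allgram_counter.get( term, 0 )
        denominator = unigram_counter.get( term[0] ) or 1  # avoid division by zero
        return math.pow( numerator / denominator, 1 / ( size - alpha ) )
    return cohesion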


## Training

In [4]:
from tqdm.notebook import tqdm

corpus_path = "../data/DYBG_tn.txt"
with open( corpus_path, 'r', encoding="utf-8" ) as f:
    corpus = f.read().split()  # whitespace-delimited chunks of the text

corpus_allgram = allgrams( corpus )  # 2-gram to 8-gram counts
corpus_unigram = Counter()           # single-character counts, used as the cohesion denominator
for line in tqdm( corpus ):
    corpus_unigram.update( list( line ) )

print( "# Top 10 grams")
print( corpus_allgram.most_common(10) )

# Top 10 grams
[('本草', 3548), ('入門', 2833), ('爲末', 2532), ('一錢', 2492), ('各一', 2179), ('方見', 1848), ('甘草', 1834), ('煎服', 1738), ('二錢', 1653), ('右剉', 1538)]


## Test

In [5]:
# 雜病篇卷之四 > 內傷 > 勞倦傷治法 > 補中益氣湯 1.17.1
# https://mediclassics.kr/books/8/volume/12#content_267
data1 = "治勞役太甚或飮食失節身熱而煩自汗倦怠黃芪 一錢半人參白朮甘草各一錢當歸身陳皮各五分升麻柴胡各三分右剉作一貼水煎服"

In [6]:
candidates = list( allgrams( [data1] ).keys() )
print( candidates[:100] )

['治勞', '勞役', '役太', '太甚', '甚或', '或飮', '飮食', '食失', '失節', '節身', '身熱', '熱而', '而煩', '煩自', '自汗', '汗倦', '倦怠', '怠黃', '黃芪', '芪 ', ' 一', '一錢', '錢半', '半人', '人參', '參白', '白朮', '朮甘', '甘草', '草各', '各一', '錢當', '當歸', '歸身', '身陳', '陳皮', '皮各', '各五', '五分', '分升', '升麻', '麻柴', '柴胡', '胡各', '各三', '三分', '分右', '右剉', '剉作', '作一', '一貼', '貼水', '水煎', '煎服', '治勞役', '勞役太', '役太甚', '太甚或', '甚或飮', '或飮食', '飮食失', '食失節', '失節身', '節身熱', '身熱而', '熱而煩', '而煩自', '煩自汗', '自汗倦', '汗倦怠', '倦怠黃', '怠黃芪', '黃芪 ', '芪 一', ' 一錢', '一錢半', '錢半人', '半人參', '人參白', '參白朮', '白朮甘', '朮甘草', '甘草各', '草各一', '各一錢', '一錢當', '錢當歸', '當歸身', '歸身陳', '身陳皮', '陳皮各', '皮各五', '各五分', '五分升', '分升麻', '升麻柴', '麻柴胡', '柴胡各', '胡各三', '各三分']


In [9]:
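# Note: alpha=0 takes the n-th root, not the (n-1)-th root used in the formula above.
cohesion_calc = cohesion_calcurator( corpus_allgram, corpus_unigram, alpha=0 )

# Score each candidate once and keep only those that occur in the corpus (score > 0).
term_with_cohesion = [ ( term, cohesion_calc( term ) ) for term in candidates ]
term_with_cohesion = [ t_s for t_s in term_with_cohesion if t_s[1] > 0 ]
term_with_cohesion = sorted( term_with_cohesion, key=lambda x: -x[1] )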

print("# Extracted Tokens")
for t_s in term_with_cohesion[:20]:
    print( t_s )


# Extracted Tokens
('柴胡', 0.9510544172151794)
('剉作一貼', 0.9218559078552367)
('剉作一', 0.8990372174142683)
('剉作', 0.8557176258057264)
('陳皮', 0.8533820185387183)
('甘草', 0.8331693000304605)
('右剉作一貼', 0.7963630180141201)
('剉作一貼水煎', 0.7915415602984852)
('剉作一貼水煎服', 0.7878279162957555)
('剉作一貼水', 0.761134086768265)
('分右剉作一貼', 0.7533170048562811)
('右剉作一', 0.7531721296222287)
('作一貼', 0.7513026526250618)
('當歸', 0.7453188907306991)
('煎服', 0.735706737506584)
('右剉作一貼水煎服', 0.733111289044791)
('右剉作一貼水煎', 0.728687337816663)
('分右剉作一', 0.7127138182828562)
('右剉作一貼水', 0.6953899084412595)
('分右剉作一貼水煎', 0.6950036741023483)


In [10]:
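data_ = data1
mark_in = " _¶{:d}_ "   # placeholder written into the text, padded with spaces
mark_out = "_¶{:d}_"    # same placeholder without padding, used when decoding
token_box = []

# Replace terms with numbered placeholders, highest cohesion first,
# so that text already claimed cannot be matched again by a lower-scoring term.
for i, t_ in enumerate( term_with_cohesion ):
    data_ = data_.replace( t_[0], mark_in.format(i) )
    token_box.append( (mark_out.format(i), t_[0]) )

print("# Segment")
print("* encoding")
print(data_)

# Restore the original terms; the padding spaces remain as token boundaries.
for m, t in token_box:
    data_ = data_.replace( m, t )

print("* decoding")
print(data_)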



# Segment
* encoding
 _¶318_  _¶29_  _¶146_  _¶324_  _¶175_  _¶61_  _¶218_   _¶59_ 半 _¶55_  _¶101_  _¶5_  _¶32_  _¶13_ 身 _¶4_  _¶60_  _¶40_  _¶0_  _¶199_  _¶47_  _¶1_ 水 _¶14_ 
* decoding
 治勞  役太甚或飮食失節  身熱  而煩  自汗  倦怠  黃芪   一錢 半 人參  白朮  甘草  各一錢  當歸 身 陳皮  各五分  升麻  柴胡  各三  分右  剉作一貼 水 煎服 
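
The placeholder-based replacement above could also be written without rewriting the string, by tracking which character positions have already been claimed. The following is only a sketch of that alternative, not the method used in this notebook or in soynlp: `segment_by_spans` is a hypothetical helper, and the scores in `toy_scores` are invented for illustration.

```python
def segment_by_spans(text, scored_terms):
    """Greedily give the highest-scoring terms non-overlapping spans of text."""
    taken = [False] * len(text)
    spans = []
    for term, _score in sorted(scored_terms, key=lambda x: -x[1]):
        if not term:
            continue
        start = text.find(term)
        while start != -1:
            end = start + len(term)
            # Claim the span only if no higher-scoring term already covers it.
            if not any(taken[start:end]):
                taken[start:end] = [True] * len(term)
                spans.append((start, end))
            start = text.find(term, start + 1)

    # Emit claimed spans in text order; unclaimed stretches become their own tokens.
    tokens, pos = [], 0
    for start, end in sorted(spans):
        if pos < start:
            tokens.append(text[pos:start])
        tokens.append(text[start:end])
        pos = end
    if pos < len(text):
        tokens.append(text[pos:])
    return tokens

# Hypothetical (term, score) pairs, invented for illustration.
toy_scores = [("甘草", 0.83), ("當歸", 0.75), ("各一錢", 0.5), ("一錢", 0.4)]
print(segment_by_spans("甘草各一錢當歸身", toy_scores))
```

With these toy scores, `一錢` is skipped because its characters are already covered by the higher-scoring `各一錢`, and the trailing `身` is left as an unclaimed token.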


## REF

* [Cohesion score + L-Tokenizer: an unsupervised tokenizer for Korean documents with good word spacing](https://lovit.github.io/nlp/2018/04/09/cohesion_ltokenizer/)
* [ratsgo's blog > Cohesion Probability](https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/05/05/cohesion/)