# Co-occurrence & Association measures

In computational linguistics, this phenomenon is studied under the name *collocations*.

* Evert, Stefan (2009), Corpora and Collocations, in Lüdeling and Kytö (eds.), 1212-1248.


* co-word
* measure
  - mutual information (MI)
  - t-score
  - simple log-likelihood (simple-ll)

$$
\text{MI} = \log_2 { O \over E }
$$

$$
\text{t-score} = { { O - E } \over { \sqrt O } }
$$

$$
\text{simple-ll} = 2 \times \left( O \cdot \log { O \over E } - ( O - E ) \right)
$$
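The three measures can be computed directly from an observed frequency $O$ and an expected frequency $E$. The following is a minimal sketch, not the notebook's own code: the function name `association_scores` and the example counts are illustrative, and the natural logarithm is assumed for simple-ll because the formula above does not state a base.

```python
import math

def association_scores(O, E):
    """Return (MI, t-score, simple-ll) for observed O > 0 and expected E > 0."""
    mi = math.log2(O / E)
    t_score = (O - E) / math.sqrt(O)
    simple_ll = 2 * (O * math.log(O / E) - (O - E))  # natural log (assumed)
    return mi, t_score, simple_ll

# Hypothetical counts: a pair observed 8 times where 2 co-occurrences were expected.
mi, t, ll = association_scores(8, 2)
print(f"MI={mi:.2f}  t-score={t:.2f}  simple-ll={ll:.2f}")
```

With these hypothetical counts the script prints `MI=2.00  t-score=2.12  simple-ll=10.18`. MI depends only on the ratio $O / E$, whereas the t-score and simple-ll also grow with the absolute size of $O$.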

__# t-score__

$$
\text{t-score} = { { \bar x - \mu } \over { \sqrt { s^2 \over N } } }
$$

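Suppose each observation follows a probability distribution, and let $O$ be the observed frequency and $E$ the expected frequency, so that $\bar x = O / N$ and $\mu = E / N$. The expression above can then be written as follows.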

$$
{ { { O \over N } - { E \over N } } \over { \sqrt{ s^2 } \over \sqrt N } } = { { O - E } \over { \sqrt N \cdot \sqrt{ s^2 } } }
$$

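If the distribution is binomial (each observation is a Bernoulli trial with success probability $p$), the variance is $s^2 = p(1-p)$. Because the probability that a given word occurs is very small compared with the probability that it does not, $1 - p \approx 1$, and therefore: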

$$s^2 = p(1-p) \approx p = { O \over N }$$

The original expression therefore simplifies to:

$$
\text{t-score} \approx { { O - E } \over { \sqrt N \cdot \sqrt { O \over N } } } = { { O - E } \over { \sqrt O } }
$$
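To get a feel for how much the approximation costs, the sketch below compares the t-score computed with the full variance $s^2 = p(1-p)$ against the simplified form $(O - E) / \sqrt O$. The function names and the values of $O$, $E$ and $N$ are hypothetical, chosen only to illustrate a rare event.

```python
import math

def t_score_exact(O, E, N):
    """t-score using the Bernoulli variance s^2 = p(1 - p), with p = O / N."""
    p = O / N
    return (O / N - E / N) / math.sqrt(p * (1 - p) / N)

def t_score_approx(O, E):
    """Simplified form obtained by approximating 1 - p with 1."""
    return (O - E) / math.sqrt(O)

# Hypothetical values: 30 co-occurrences observed, 10 expected, N = 100000.
O, E, N = 30, 10.0, 100_000
exact = t_score_exact(O, E, N)
approx = t_score_approx(O, E)
print(exact, approx, abs(exact - approx) / approx)
```

The two values differ by a factor of $1 / \sqrt{1 - p}$, so the relative error shrinks as $p = O / N$ gets smaller.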

__# the simple log-likelihood (simple-ll)__

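$$\text{simple-ll} = -2 \log \lambda = -2 \cdot \log { Pr_0 (X=O) \over Pr(X=O) }$$

Assuming a Poisson distribution, where $Pr$ takes the observed frequency $O$ as its mean and $Pr_0$ takes the expected frequency $E$ (the null hypothesis):

$$Pr(X=O) = e^{-O}\cdot{ O^O \over O! }$$

$$Pr_0(X=O) = e^{-E}\cdot{ E^O \over O! }$$

$$ \lambda = { Pr_0(X=O) \over Pr(X=O) } = e^{O-E}\cdot \left({ E \over O }\right)^O$$

$$\text{simple-ll} = -2 \log \lambda = -2 \left( ( O - E ) + O \cdot \log { E \over O } \right) = 2 \left( O \cdot \log { O \over E } - ( O - E ) \right)$$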
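The derivation can be checked numerically: evaluating $-2 \log \lambda$ from the two Poisson probabilities should give the same value as the closed form $2 ( O \cdot \log { O \over E } - ( O - E ) )$ stated at the start of this section. The sketch below works in log space with `math.lgamma` to avoid overflow in $O!$; the function names and the example values are hypothetical.

```python
import math

def poisson_log_pmf(k, mean):
    """log Pr(X = k) for a Poisson distribution with the given mean."""
    return -mean + k * math.log(mean) - math.lgamma(k + 1)

def simple_ll_from_ratio(O, E):
    """-2 log(lambda), where lambda = Pr_0(X=O) / Pr(X=O)."""
    log_lambda = poisson_log_pmf(O, E) - poisson_log_pmf(O, O)
    return -2 * log_lambda

def simple_ll_closed_form(O, E):
    """Closed form: 2 * (O * log(O / E) - (O - E))."""
    return 2 * (O * math.log(O / E) - (O - E))

# Hypothetical counts: 8 observed, 2 expected.
O, E = 8, 2.0
from_ratio = simple_ll_from_ratio(O, E)
closed = simple_ll_closed_form(O, E)
print(from_ratio, closed)
```

The $O!$ terms cancel in the ratio, which is why the closed form contains no factorial.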


In [1]:
import json
import numpy as np
np.set_printoptions( precision=2, edgeitems=6, linewidth=240 )

# Load the formula records (a list of dicts) from JSON.
data_path = "../data/kntk_formulas.json"
with open( data_path, 'r', encoding='utf-8' ) as f:
    fmls = json.load( f )

In [16]:
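def cleansing( term ):
    # Turn a term of the form "head(body)" into "head|body".
    # Anything after ")" is discarded; a ValueError is raised if the term
    # does not contain exactly one ")" preceded by exactly one "(".
    a, _ = term.split(")")
    h, b = a.split("(")
    return h + "|" + b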
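As a quick illustration of what `cleansing` produces, the sketch below repeats the function and applies it to one term. The input string is a hypothetical example of the "name(annotation)" shape the function expects; the actual terms in the data file are not shown in this notebook.

```python
def cleansing(term):
    # Same logic as the cell above: "head(body)" -> "head|body".
    a, _ = term.split(")")
    h, b = a.split("(")
    return h + "|" + b

# Hypothetical input in the assumed "name(annotation)" format.
example = cleansing("감초(甘草)")
print(example)
```

Joining the two parts with `|` keeps each term free of spaces, so the terms can later be written one formula per line with a space as the separator.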

In [17]:
herb_list = []
symp_list = []

output_paths = [ "../data/tntk_formulas_herbs", "../data/tntk_formulas_symps" ]

# Write one formula per line: herbs to the first file, symptoms to the second.
with open( output_paths[0], 'w', encoding="utf-8" ) as fi_herb, \
     open( output_paths[1], 'w', encoding="utf-8" ) as fi_symp:
    for fml in fmls:
        herbs_ = [ ig.get('herb') for ig in fml.get( 'ingredients' ) ]
        symps_ = [ ig.get('symptom') for ig in fml.get( 'diseases' ) ]

        # Drop missing (None or empty) entries before cleansing.
        herbs = [ cleansing( h ) for h in herbs_ if h ]
        symps = [ cleansing( s ) for s in symps_ if s ]

        # Skip formulas that lack either herbs or symptoms.
        if not herbs or not symps:
            continue
        herb_list.append( herbs )
        symp_list.append( symps )
        fi_herb.write( " ".join( herbs ) + "\n" )
        fi_symp.write( " ".join( symps ) + "\n" )

print( len( herb_list ), len( symp_list ) )


18137 18137


In [2]:
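from sklearn.feature_extraction.text import CountVectorizer

# NOTE: fml_corpus is not defined in this excerpt; it is assumed to be a list of
# strings, one per formula, each holding that formula's herb names separated by spaces.
min_df = 5   # keep only herbs that appear in at least 5 formulas
vectorizer = CountVectorizer( min_df=min_df )
X = vectorizer.fit_transform( fml_corpus )
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
herb_names = vectorizer.get_feature_names_out().tolist()

print( X.shape )
print( herb_names )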

(219, 34)
['감초', '갱미', '건강', '계지', '과루근', '귤피', '대조', '대황', '도인', '마황', '망초', '모려', '반하', '복령', '부자', '생강', '석고', '세신', '시호', '아교', '오미자', '용골', '인삼', '작약', '정력', '지실', '치자', '택사', '행인', '향시', '황금', '황기', '황련', '후박']


In [3]:
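# (herb x formula) @ (formula x herb): entry (i, j) counts the formulas that contain
# both herb i and herb j, provided each herb is counted at most once per formula.
herb_co = X.T @ X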
print( "# HERB NETWORK MATRIX according to co-occurrence", herb_co.shape )
print( herb_co.toarray() )

# HERB NETWORK MATRIX according to co-occurrence (34, 34)
[[106   6  19  46   3   1 ...  12   1  12   5   6   3]
 [  6   7   1   1   0   0 ...   0   0   0   0   0   0]
 [ 19   1  31   5   1   0 ...   2   0   6   0   5   0]
 [ 46   1   5  61   2   0 ...   6   0   4   5   1   3]
 [  3   0   1   2   5   0 ...   0   0   2   0   0   0]
 [  1   0   0   0   0   5 ...   0   0   0   0   0   0]
 ...
 [ 12   0   2   6   0   0 ...  16   0   0   0   0   2]
 [  1   0   0   0   0   0 ...   0   6   0   0   0   0]
 [ 12   0   6   4   2   0 ...   0   0  20   0   8   0]
 [  5   0   0   5   0   0 ...   0   0   0   7   0   0]
 [  6   0   5   1   0   0 ...   0   0   8   0  13   0]
 [  3   0   0   3   0   0 ...   2   0   0   0   0  10]]
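The co-occurrence counts above supply the observed frequency $O$ for each herb pair, but the notebook does not show how $E$ is obtained. The sketch below assumes the usual independence expectation $E = f_a \cdot f_b / N$, where $f_a$ and $f_b$ are the numbers of formulas containing each herb and $N$ is the number of formulas; this definition is an assumption, not something stated in the cells above. The toy corpus and the names `formulas`, `herb_freq`, `pair_freq` and `scores` are hypothetical, and the standard library is used so the example runs without the original data.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each formula is a list of herb names.
formulas = [
    ["감초", "계지", "생강"],
    ["감초", "계지"],
    ["감초", "건강"],
    ["계지", "작약"],
]
N = len(formulas)

# Number of formulas containing each herb (the diagonal of the matrix above)
# and number of formulas containing each unordered pair (the off-diagonal entries).
herb_freq = Counter(h for f in formulas for h in set(f))
pair_freq = Counter(
    pair for f in formulas for pair in combinations(sorted(set(f)), 2)
)

def scores(a, b):
    """Return (O, E, MI, t-score, simple-ll) for herbs a and b; requires O > 0."""
    O = pair_freq[tuple(sorted((a, b)))]
    E = herb_freq[a] * herb_freq[b] / N  # assumed: expectation under independence
    mi = math.log2(O / E)
    t = (O - E) / math.sqrt(O)
    ll = 2 * (O * math.log(O / E) - (O - E))
    return O, E, mi, t, ll

result = scores("감초", "계지")
print(result)
```

In this toy corpus the pair co-occurs slightly less often than independence predicts ($O = 2$, $E = 2.25$), so MI and the t-score come out negative. Pairs with $O = 0$ need separate handling, because both $\log (O / E)$ and $\sqrt O$ in the denominator are undefined there.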
