# Co-occurrence & Association measures

In computational linguistics, these co-occurrence patterns are called collocations.

## REF

* Evert, Stefan (2009), Corpora and Collocations, in Lüdeling and Kytö (eds.), 1212-1248.

## Background


* co-word
* measure
  - mutual information (MI)
  - t-score
  - simple log-likelihood ratio (simple-ll)

$$
MI = \log_2 { O \over E }
$$

$$
t-score = { { O - E } \over { \sqrt O } }
$$

$$
simple-ll = 2  \left( O \cdot \log { O \over E } - ( O - E ) \right)
$$
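
As a quick illustration, the three measures can be computed directly from an observed frequency $O$ and an expected frequency $E$. The sketch below is a literal transcription of the formulas above. The function names and the sample values O = 8, E = 2 are made up for illustration and do not come from the corpus.

```python
import math

def mutual_information(o, e):
    # MI = log2(O / E)
    return math.log2(o / e)

def t_score_plain(o, e):
    # t-score = (O - E) / sqrt(O)
    return (o - e) / math.sqrt(o)

def simple_ll(o, e):
    # simple-ll = 2 * (O * ln(O / E) - (O - E))
    return 2 * (o * math.log(o / e) - (o - e))

o, e = 8, 2  # illustrative values only
print(mutual_information(o, e))
print(round(t_score_plain(o, e), 4))
print(round(simple_ll(o, e), 4))
```

All three values are positive here because O exceeds E, that is, the pair occurs more often than independence would predict.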

__# t-score__

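A comparison between the expected population and the observed population:

* Expected population: text in which word A and word B are expected to occur independently of each other
* Observed population: the actual text, in which word A and word B occur in association with each other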


$$
t-score = { { \bar x - \mu } \over {\sqrt { s^2 \over N } }  }
$$

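Assuming the values follow a probability distribution, let O be the observed frequency and E the expected frequency, so that $\bar x = O / N$ and $\mu = E / N$. The expression above can then be rewritten as follows.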

$$
{ { { O \over N } - { E \over N } } \over { { \sqrt { s^2 } } \over { \sqrt N } } } = { { O - E } \over { { \sqrt N } \cdot { \sqrt { s^2 } } } }
$$

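If the distribution is binomial (each token is a Bernoulli trial), the per-trial variance is $s^2 = p(1-p)$. Because the probability $p$ that a word occurs is very small compared with the probability that it does not, $1 - p \approx 1$, and so the following holds.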

$$s^2 = p(1-p) \approx p = { O \over N }$$

The original expression therefore simplifies to:

$$
t-score \approx { { O  - E } \over { {\sqrt N} \cdot {\sqrt { O \over N}} } } = { { O - E } \over { \sqrt O } }
$$

![](../image/D0200_Co-occurrence_Measures.t-test.png)
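
To get a feel for how little the approximation $s^2 \approx p$ changes the result, the following sketch compares the form that keeps $s^2 = p(1-p)$ with the simplified $(O - E) / \sqrt{O}$. The corpus size `n` and the frequencies are hypothetical values chosen for illustration, and the function names are not part of the notebook.

```python
import math

def t_exact(o, e, n):
    # Uses the Bernoulli variance s^2 = p * (1 - p) with p = O / N.
    p = o / n
    return (o - e) / (math.sqrt(n) * math.sqrt(p * (1 - p)))

def t_approx(o, e):
    # Uses s^2 ~= p, which gives (O - E) / sqrt(O).
    return (o - e) / math.sqrt(o)

n = 10_000          # hypothetical corpus size
o, e = 41, 9.357    # illustrative observed / expected frequencies
print(t_exact(o, e, n), t_approx(o, e))
```

The two values differ by a factor of $1 / \sqrt{1 - p}$, which stays close to 1 as long as the pair is rare relative to N.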

__# the simple log-likelihood (simple-ll)__

* $H_o$ : word A and word B occur independently of each other.
* $H$ : word A and word B occur in association with each other.


$$ LogLikelihoodRatio = -2 \cdot \log { Pr_o (X=O) \over Pr(X=O ) }$$

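Assuming a Poisson distribution:

* The likelihood under $H_o$ is $Pr_o(X=O; E)$: since independence is assumed, the Poisson parameter $\alpha$ (the expected value) is set to E, the expected frequency under independence.
* The likelihood under $H$ is $Pr(X=O; O)$: since the words are associated, the Poisson parameter $\alpha$ (the expected value) is set to the actually observed frequency O.

In a Poisson distribution, if $\alpha$ is the expected number of times an event occurs within a fixed interval, the probability that the event occurs $n$ times is: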

$$Pr(n; \alpha)=e^{-\alpha}\cdot{ \alpha^n \over n! } $$

Applying this gives:

$$Pr(X=O; O) = e^{-O}\cdot{ O^O \over O! }$$

$$Pr_o(X=O; E) = e^{-E}\cdot{ E^O \over O! }$$

$$ \lambda = e^{O-E}\cdot \left({ E \over O }\right)^O$$

$$simpleLL = -2 \log \lambda = 2 \left( (E - O) + O \cdot \log { O \over E  } \right) = 2 \left( O \cdot \log { O \over E  } - ( O - E ) \right) $$

This value approximately follows a $\chi^2$ distribution with one degree of freedom.
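
The derivation can be checked numerically. The sketch below evaluates $-2 \log \lambda$ directly from the two Poisson probabilities and compares it with the closed form. It also converts a score to an upper-tail probability of the $\chi^2$ distribution with one degree of freedom. The function names are illustrative, and the values of O and E are copied from the first row of the `상한` simple-ll output later in this notebook.

```python
import math

def poisson_log_pmf(n, alpha):
    # log Pr(n; alpha) = -alpha + n * log(alpha) - log(n!)
    return -alpha + n * math.log(alpha) - math.lgamma(n + 1)

def simple_ll_from_poisson(o, e):
    # -2 * log( Pr_o(X=O; E) / Pr(X=O; O) )
    return -2 * (poisson_log_pmf(o, e) - poisson_log_pmf(o, o))

def simple_ll_closed_form(o, e):
    # 2 * (O * ln(O / E) - (O - E))
    return 2 * (o * math.log(o / e) - (o - e))

def chi2_df1_p_value(x):
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    return math.erfc(math.sqrt(x / 2))

o, e = 41, 9.357005017367811
print(simple_ll_from_poisson(o, e))
print(simple_ll_closed_form(o, e))
print(chi2_df1_p_value(simple_ll_closed_form(o, e)))
```

Both routes give the same score up to floating-point error. Because the $\chi^2$ fit is only approximate, the resulting p-value is best read as a rough guide.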

## Data

### Corpus

In [1]:
import json
import numpy as np
np.set_printoptions( precision=2, edgeitems=6, linewidth=240 )

data_path = "../data/kntk_formulas.json"
# Load the formula records; the context manager closes the file handle.
with open( data_path, 'r', encoding='utf-8' ) as f:
    fmls = json.load( f )

In [2]:
def cleansing( term ):
    # Turn "head(body)" into "head|body"; assumes exactly one pair of parentheses.
    a, _ = term.split(")")
    h, b = a.split("(")
    return h + "|" + b
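
As a small usage sketch, `cleansing` can be tried on a single term. The sample string below is hypothetical: the actual term format in the JSON file is assumed to be a name followed by one parenthesized annotation.

```python
def cleansing(term):
    # Turn "head(body)" into "head|body".
    a, _ = term.split(")")
    h, b = a.split("(")
    return h + "|" + b

sample = "인삼(人蔘)"  # hypothetical term; the real data format is an assumption
print(cleansing(sample))
```

A term without parentheses would raise a `ValueError` at the first unpacking, so the function relies on every term carrying the annotation.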

In [3]:
herb_list = []
symp_list = []

output_paths = [ "../data/tntk_formulas_herbs", "../data/tntk_formulas_symps" ]
fi_herb = open( output_paths[0], 'w', encoding="utf-8")
fi_symp = open( output_paths[1], 'w', encoding="utf-8")

for fml in fmls:
    # Collect the raw herb and symptom terms of this formula.
    herbs_ = [ ig.get('herb') for ig in fml.get( 'ingredients' ) ]
    symps_ = [ ig.get('symptom') for ig in fml.get( 'diseases' ) ]
    
    # Drop missing or empty entries.
    herbs_ = list(filter(None, herbs_ ))
    symps_ = list(filter(None, symps_ ))
                           
    herbs = list( map( cleansing, herbs_ ) )
    symps = list( map( cleansing, symps_ ) )
    
    # Skip formulas that lack either herbs or symptoms.
    if ( len( herbs ) < 1 ) or  ( len( symps ) < 1 ): continue
    herb_list.append( herbs )
    symp_list.append( symps )
    fi_herb.write( " ".join( herbs ) + "\n" )
    fi_symp.write( " ".join( symps ) + "\n" )

fi_herb.close()
fi_symp.close()


### Reverse Index

In [4]:
data_size = len( herb_list )

herb_ridx = {}
symp_ridx = {}

In [5]:
# Register each formula index under every herb and symptom name it contains.
for idx in range( data_size ):
    h_targets = herb_list[ idx ]
    
    for _h in h_targets:
        h, _ = _h.split("|")   # keep only the part before "|"
        if herb_ridx.get( h ): herb_ridx[ h ].append( idx )
        else: herb_ridx[ h ] = [ idx ]

    s_targets = symp_list[ idx ]
    
    for _s in s_targets:
        s, _ = _s.split("|")   # keep only the part before "|"
        if symp_ridx.get( s ): symp_ridx[ s ].append( idx )
        else: symp_ridx[ s ] = [ idx ]
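
The same reverse-index idea can be sketched more compactly with `collections.defaultdict`. This is an alternative formulation shown on made-up data, not the notebook's own code; `build_reverse_index` and `toy_herbs` are hypothetical names.

```python
from collections import defaultdict

def build_reverse_index(docs):
    # Map each term (the part before "|") to the list of document indices containing it.
    ridx = defaultdict(list)
    for idx, terms in enumerate(docs):
        for term in terms:
            head, _ = term.split("|")
            ridx[head].append(idx)
    return dict(ridx)

toy_herbs = [["A|a", "B|b"], ["A|a"], ["B|b", "C|c"]]  # made-up data
print(build_reverse_index(toy_herbs))
```

Each key ends up with the indices of the documents that mention it, which is all that the association measures need to count co-occurrences.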


## Lib

### Association Measures

In [6]:
import math 

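def t_score( o, e ):
    # Divides by sqrt( o + 1 ), not sqrt( o ) as in the formula above,
    # which keeps the denominator nonzero when o == 0.
    return ( o - e ) / math.sqrt( o + 1 )

def sim_ll( o, e ):
    # Signed simple-ll: positive when o >= e, negative when o < e.
    if e == o or o == 0 : return 0
    rst = 2 * ( o * math.log( o / e ) - ( o - e ) )
    if o >= e : return rst
    else : return -1 * rst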


In [7]:
def calc_measure( data_tgt, data_org, by, n, measure_func=t_score, o_min=6 ):
    # Score every key of data_tgt against the entry data_org[ by ]; n is the number of formulas.
    basement = data_org.get( by )
    if not basement: 
        print( "No entry found for:", by )
        return 
    org_p = len( basement ) / n
    basement_set = set( basement )
    rst = []
    for tg, ridx in data_tgt.items():
        tgt_p = len( ridx ) / n
        o = len( basement_set & set( ridx ) )   # observed co-occurrence count
        if o < o_min: continue
        e = n * org_p * tgt_p                   # expected count under independence
        m = measure_func( o, e )
        rst.append( (tg, o, e, m) )
    return rst
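
To make the two quantities inside `calc_measure` concrete, here is a minimal sketch that computes O and E for one pair of index lists. The helper name and the index lists are made up for illustration.

```python
def observed_and_expected(idx_a, idx_b, n):
    # O: number of formulas containing both items.
    o = len(set(idx_a) & set(idx_b))
    # E: expected count if the two items were independent, N * p(A) * p(B).
    e = n * (len(idx_a) / n) * (len(idx_b) / n)
    return o, e

# Made-up index lists over n = 10 formulas.
symptom_idx = [0, 1, 2, 3]
herb_idx = [1, 2, 3, 7, 8]
o, e = observed_and_expected(symptom_idx, herb_idx, n=10)
print(o, e)
```

Here the pair shares three formulas while independence would predict two, so any of the measures would report a mild positive association.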
    

### Visualization

In [8]:
from bokeh.plotting import figure, show, output_notebook, ColumnDataSource

def assoc_plot_tooltip( x, y, size, labels, title=""  ):
    
    source = ColumnDataSource(data=dict( x=x, y=y, size=size, label=labels ))

    TOOLTIPS = [
        ("label", "@label"),
        ("index", "$index"),
        ("(x,y,size)", "($x, $y,@size)"),
    ]
    
    # Note: Bokeh 3.0 and later use width/height in place of plot_width/plot_height.
    p = figure( plot_width=600, plot_height=600, title=title, tooltips=TOOLTIPS )
    p.circle('x', 'y', size='size', color="navy", alpha=0.5, source=source)

    return p


## Exercise

### Analysis of Herbs Used for a Symptom

In [9]:
term = "상한"
associations = calc_measure( data_tgt=herb_ridx, data_org=symp_ridx, by=term, n=data_size, measure_func=t_score, o_min=6 )
assoc_sorted = sorted( associations, key=lambda x: x[3], reverse=True)
for e in assoc_sorted[:10]:
    print( e )

labels, _o, _e, y = zip( *assoc_sorted )
x = [ math.sqrt(o+1) for o in _o ] 
size = [6]*len(assoc_sorted)

output_notebook() 
p =  assoc_plot_tooltip( x=x, y=y, size=size, labels=labels, title=term + " Associations ( t-score )"  ) 
p.xaxis.axis_label = "sqrt( Observed Value + 1 )"
p.yaxis.axis_label = "T-Score"
p.line( [min(x), max(x)], [0, 0], line_dash='dotted', line_width=2  )
show( p )

('마황', 41, 9.357005017367811, 4.882620128653813)
('감초', 130, 80.87125765010751, 4.2923981489756)
('황금', 54, 22.447593317527705, 4.254525649141421)
('계지', 31, 7.144511220157688, 4.217094471186524)
('생강', 39, 12.90621381705905, 4.125789855776224)
('작약', 30, 8.227711308375143, 3.910418486273419)
('대조', 26, 6.337872856591498, 3.7839781330291817)
('지실', 21, 7.294315487677124, 2.922061756171232)
('석고', 20, 6.741192038374594, 2.893309100430075)
('건강', 32, 16.536086453106908, 2.6919218221447108)


In [10]:
term = "상한"
associations = calc_measure( data_tgt=herb_ridx, data_org=symp_ridx, by=term, n=data_size, measure_func=sim_ll, o_min=6 )
assoc_sorted = sorted( associations, key=lambda x: x[3], reverse=True)
for e in assoc_sorted[:10]:
    print( e )

labels, _o, _e, y = zip( *assoc_sorted )
x = [ math.sqrt(o+1) for o in _o ] 
size = [6]*len(assoc_sorted)

output_notebook() 
p = assoc_plot_tooltip( x=x, y=y, size=size, labels=labels, title=term + " Associations ( Simple LL )"  ) 
p.xaxis.axis_label = "sqrt( Observed Value + 1 )"
p.yaxis.axis_label = "Simple LL"
p.line( [min(x), max(x)], [0, 0], line_dash='dotted', line_width=2  )
show( p )

('마황', 41, 9.357005017367811, 57.86464797874547)
('계지', 31, 7.144511220157688, 43.2828762944299)
('작약', 30, 8.227711308375143, 34.0767924352613)
('대조', 26, 6.337872856591498, 34.07651926045704)
('생강', 39, 12.90621381705905, 34.068942874581346)
('황금', 54, 22.447593317527705, 31.697655756611283)
('감초', 130, 80.87125765010751, 25.15826801112395)
('지실', 21, 7.294315487677124, 17.00056889109233)
('석고', 20, 6.741192038374594, 16.982204235739058)
('망초', 11, 2.70800022054364, 14.253066839837409)


### Analysis of Symptoms Treated by an Herb

In [11]:
term = "인삼"
associations = calc_measure( data_tgt=symp_ridx, data_org=herb_ridx, by=term, n=data_size, measure_func=t_score, o_min=6 )
assoc_sorted = sorted( associations, key=lambda x: x[3], reverse=True)
for e in assoc_sorted[:10]:
    print( e )

labels, _o, _e, y = zip( *assoc_sorted )
x = [ math.sqrt(o+1) for o in _o ] 
size = [6]*len(assoc_sorted)
    
output_notebook() 
p = assoc_plot_tooltip( x=x, y=y, size=size, labels=labels, title=term + " Associations ( t-score )"  )
p.xaxis.axis_label = "sqrt( Observed Value + 1 )"
p.yaxis.axis_label = "T-Score"
p.line( [min(x), max(x)], [0, 0], line_dash='dotted', line_width=2  )
show( p )

('비위허약', 69, 22.203672051607214, 5.593230997571372)
('허로', 59, 23.603903622429286, 4.569616393036135)
('기허', 47, 17.20284501295694, 4.300848863213578)
('건망', 38, 11.201852566576612, 4.291137873910621)
('위허', 33, 8.60142250647847, 4.184321519628802)
('사지권태', 33, 11.801951811214643, 3.6354352721594108)
('오심번열', 32, 11.201852566576612, 3.6204927534313565)
('정충', 34, 12.8021172189447, 3.583096164729049)
('자한', 72, 42.60704636930033, 3.440185012419761)
('경계', 33, 13.40221646358273, 3.3609921484247147)


In [12]:
term = "인삼"
associations = calc_measure( data_tgt=symp_ridx, data_org=herb_ridx, by=term, n=data_size, measure_func=sim_ll, o_min=6 )
assoc_sorted = sorted( associations, key=lambda x: x[3], reverse=True)
for e in assoc_sorted[:10]:
    print( e )

labels, _o, _e, y = zip( *assoc_sorted )
x = [ math.sqrt(o+1) for o in _o ] 
size = [6]*len(assoc_sorted)
    
output_notebook() 
p = assoc_plot_tooltip( x=x, y=y, size=size, labels=labels, title=term + " Associations ( simple LL )"  )
p.xaxis.axis_label = "sqrt( Observed Value + 1 )"
p.yaxis.axis_label = "Simple LL"
p.line( [min(x), max(x)], [0, 0], line_dash='dotted', line_width=2  )
show( p )

('비위허약', 69, 22.203672051607214, 62.878481493304776)
('위허', 33, 8.60142250647847, 39.94512264720767)
('건망', 38, 11.201852566576612, 39.238236173549296)
('허로', 59, 23.603903622429286, 37.31059710790777)
('기허', 47, 17.20284501295694, 34.882535466774904)
('정신단소', 19, 4.000661630920218, 29.204533773644567)
('원기허약', 21, 5.400893201742295, 25.836026187747166)
('오심번열', 32, 11.201852566576612, 25.581735881759514)
('사지권태', 33, 11.801951811214643, 25.467917594677644)
('정충', 34, 12.8021172189447, 24.023231693333614)


### Relationship Between Measures

In [13]:
term = "상한"
associations1 = calc_measure( data_tgt=herb_ridx, data_org=symp_ridx, by=term, n=data_size, measure_func=t_score, o_min=6 )
associations2 = calc_measure( data_tgt=herb_ridx, data_org=symp_ridx, by=term, n=data_size, measure_func=sim_ll, o_min=6 )

labels1, _o, _e, y = zip( *associations1 )
labels2, _o, _e, y = zip( *associations2 )

labels = list( set( labels1 ) & set( labels2 ) )
coordinates = [  ( associations1[ labels1.index( l ) ][3],  associations2[ labels2.index( l ) ][3] ) for l in labels ]
x, y = zip( *coordinates )
size = [6] * len(labels)

output_notebook() 
p = assoc_plot_tooltip( x=x, y=y, size=size, labels=labels, title=term + " Associations Compare"  )
p.xaxis.axis_label = "T-Score"
p.yaxis.axis_label = "Simple LL"
p.line( [0, 0], [min(y), max(y)], line_dash='dotted', line_width=2 )
p.line( [min(x), max(x)], [0, 0], line_dash='dotted', line_width=2  )
show( p )

In [14]:
term = "시호"
associations1 = calc_measure( data_tgt=symp_ridx, data_org=herb_ridx, by=term, n=data_size, measure_func=t_score, o_min=6 )
associations2 = calc_measure( data_tgt=symp_ridx, data_org=herb_ridx, by=term, n=data_size, measure_func=sim_ll, o_min=6 )

labels1, _o, _e, y = zip( *associations1 )
labels2, _o, _e, y = zip( *associations2 )

labels = list( set( labels1 ) & set( labels2 ) )
coordinates = [  ( associations1[ labels1.index( l ) ][3],  associations2[ labels2.index( l ) ][3] ) for l in labels ]
x, y = zip( *coordinates )
size = [6] * len(labels)

output_notebook() 
p = assoc_plot_tooltip( x=x, y=y, size=size, labels=labels, title=term + " Associations Compare"  )
p.xaxis.axis_label = "T-Score"
p.yaxis.axis_label = "Simple LL"
p.line( [0, 0], [min(y), max(y)], line_dash='dotted', line_width=2 )
p.line( [min(x), max(x)], [0, 0], line_dash='dotted', line_width=2  )
show( p )
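
Besides the scatter plots, the two measures can be compared by the rankings they produce. The sketch below re-scores a few (label, O, E) triples copied from the printed `상한` results above, using the same `t_score` and `sim_ll` definitions as this notebook. The `ranking` helper is a hypothetical name introduced for illustration.

```python
import math

def t_score(o, e):
    return (o - e) / math.sqrt(o + 1)

def sim_ll(o, e):
    if e == o or o == 0:
        return 0
    rst = 2 * (o * math.log(o / e) - (o - e))
    return rst if o >= e else -rst

# (label, O, E) copied from the printed results for the term "상한".
rows = [
    ("마황", 41, 9.357005017367811),
    ("감초", 130, 80.87125765010751),
    ("황금", 54, 22.447593317527705),
    ("계지", 31, 7.144511220157688),
    ("생강", 39, 12.90621381705905),
    ("작약", 30, 8.227711308375143),
    ("대조", 26, 6.337872856591498),
]

def ranking(measure):
    # Labels sorted from strongest to weakest association under the given measure.
    scored = sorted(rows, key=lambda r: measure(r[1], r[2]), reverse=True)
    return [label for label, _, _ in scored]

print(ranking(t_score))
print(ranking(sim_ll))
```

Among these seven herbs, `감초` ranks second under the t-score but last under simple-ll. That fits its large O combined with a comparatively small O/E ratio.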