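# Chapter 9: Vector Space Model (I)
> enwiki-20150112-400-r10-105752.txt.bz2 is a bzip2-compressed text file of 105,752 English Wikipedia articles, sampled at random at a rate of 1/10 from the articles of roughly 400 words or more as of January 12, 2015. Using this text as a corpus, we want to learn vectors (distributed representations) that represent the meanings of words. In the first half of Chapter 9, we apply principal component analysis to a word-context co-occurrence matrix built from the corpus, implementing the process of learning word vectors as a sequence of separate steps. In the second half of Chapter 9, we use the learned word vectors (300 dimensions) to compute word similarities and analogies.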

In [3]:
import locale

# Encoding issue: it does not occur in an interactive python session on athena, so it looks like a screen or jupyter problem
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

'en_US.UTF-8'

In [84]:
locale.getpreferredencoding()

'UTF-8'

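### 80. Corpus formatting
> The simplest way to convert a sentence into a sequence of words is to split it on whitespace. With this approach, however, symbols such as sentence-final periods and parentheses end up inside the words. Therefore, after splitting the text of each line of the corpus into a list of tokens on whitespace, apply the following processing to each token to remove the symbols from the words.
- Remove the following characters where they occur at the beginning or end of a token: .,!?;:()[]'"
- Remove tokens that become empty strings
>
> After applying the above processing, join the tokens with spaces and save them to a file.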



In [117]:
def trimmer(sentence):
    words = sentence.split(' ')
    for token in words:
        # Strip the listed symbols from both ends of the token only.
        token = token.strip('.,!?;:()[]\'"')
        # Skip tokens that became empty.
        if token:
            yield token

with open('enwiki-20150112-400-r10-105752.txt', 'r') as rf, open('wiki_corpus80.txt', 'wt') as wf:
    for line in rf:
        wf.write(' '.join(trimmer(line.rstrip('\n'))) + '\n')
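
As a quick check of the stripping rule, the sketch below applies the same character set to a single sentence. The helper name `strip_symbols` and the sample sentence are made up for illustration. Because `str.strip` only touches the two ends of a token, internal punctuation such as the apostrophe in `Land's` or the inner period in `U.S` survives.

```python
# Characters removed from both ends of every token (same set as in the task).
STRIP_CHARS = '.,!?;:()[]\'"'

def strip_symbols(sentence):
    """Return the cleaned tokens of one line (hypothetical helper)."""
    tokens = (token.strip(STRIP_CHARS) for token in sentence.split(' '))
    # Drop tokens that became empty, e.g. a lone "...".
    return [token for token in tokens if token]

sample = 'He visited the U.S. (briefly) ... and "Land\'s End", too!'
print(' '.join(strip_symbols(sample)))
# prints: He visited the U.S briefly and Land's End too
```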

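### 81. Handling country names made of multiple words
> Obtain a list of country names from the Internet, and for the multi-word country names that appear in the corpus from 80, replace the spaces with underscores. For example, "United States" should become "United_States" and "Isle of Man" should become "Isle_of_Man".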

In [43]:
!wget https://raw.githubusercontent.com/umpirsky/country-list/master/data/en_US/country.json

--2018-07-24 19:04:46--  https://raw.githubusercontent.com/umpirsky/country-list/master/data/en_US/country.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.72.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.72.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4559 (4.5K) [text/plain]
Saving to: 'country.json'


2018-07-24 19:04:46 (168 MB/s) - 'country.json' saved [4559/4559]



In [109]:
import json

with open('country.json') as f:
    country_list = list(json.load(f).values())

country_list = [country for country in country_list if ' ' in country]

In [114]:
country_list[30:40]

['Hong Kong SAR China',
 'Isle of Man',
 'Macau SAR China',
 'Marshall Islands',
 'Myanmar (Burma)',
 'New Caledonia',
 'New Zealand',
 'Norfolk Island',
 'North Korea',
 'Northern Mariana Islands']

In [119]:
with open('wiki_corpus80.txt', 'r') as rf, open('wiki_corpus81.txt', 'w') as wf:
    for line in rf:
        for country in country_list:
            undered = country.replace(' ', '_')
            line = line.replace(country, undered)
        wf.write(line)
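
The loop above runs one `str.replace` per country on every line, and it also matches inside longer tokens. One possible alternative, shown here only as a sketch, compiles all names into a single regular expression, tries longer names first, and requires whitespace or a line boundary on both sides of a match. The function names and the four-entry sample list are assumptions for illustration; the notebook itself loads the full list from `country.json`.

```python
import re

# A small illustrative subset, not the full list from country.json.
sample_countries = ['United States', 'Isle of Man', 'New Zealand', 'Papua New Guinea']

def build_country_pattern(countries):
    """Compile one alternation, longest names first so they win over shorter ones."""
    ordered = sorted(countries, key=len, reverse=True)
    alternation = '|'.join(re.escape(name) for name in ordered)
    # (?<!\S) and (?!\S) require whitespace or a line boundary around the match.
    return re.compile(r'(?<!\S)(?:' + alternation + r')(?!\S)')

def join_country_names(line, pattern):
    """Replace spaces with underscores inside every matched country name."""
    return pattern.sub(lambda m: m.group(0).replace(' ', '_'), line)

pattern = build_country_pattern(sample_countries)
print(join_country_names('He moved from the Isle of Man to the United States', pattern))
# prints: He moved from the Isle_of_Man to the United_States
```

Note that names containing parentheses, such as `Myanmar (Burma)`, cannot match the corpus from 80 with either approach, because the parentheses were already stripped from the tokens.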

In [129]:
!grep -m 2 --color=auto Isle_of_Man wiki_corpus81.txt

Beltane is the Gaelic May Day festival Most commonly it is held on 30 April but sometimes on 1 May or about halfway between the spring equinox and the summer solstice Historically it was widely observed throughout Ireland Scotland and the Isle_of_Man In Irish it is Bealtaine in Scottish Gaelic Bealltainn and in Manx Gaelic Boaltinn or Boaldyn It is one of the four Gaelic seasonal festivals–along with Samhain Imbolc and Lughnasadh–and is similar to the Welsh Calan Mai
Bonfires continued to be a key part of the festival in the modern era and were generally lit on mountains and hills Ronald Hutton writes that To increase the potency of the holy flames in Britain at least they were often kindled by the most primitive of all means of friction between wood In the 19th century for example John Ramsay described Scottish Highlanders kindling a need-fire or force-fire at Beltane Such a fire was deemed sacred In the 19th century the ritual of driving cattle between two fires

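### 82. Context extraction
> For every word t appearing in the corpus created in 81, write out all pairs of word t and context word c in tab-separated format. The context words are defined as follows.
- The d words before and after a word t are extracted as context words c (the context words do not include the word t itself)
- Each time a word t is selected, the context width d is chosen at random from {1,2,3,4,5}.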

In [172]:
import random

In [173]:
def contexts(sentence):
    words = sentence.split(' ')
    length = len(words)
    for i, word_t in enumerate(words):
        # Draw a new context width d from {1, ..., 5} for every target word.
        context_width = random.randint(1, 5)
        begin = max(0, i - context_width)
        end = min(length, i + context_width + 1)
        # Up to d words on each side, skipping position i (the target itself).
        context = words[begin:i] + words[i + 1:end]
        yield word_t, context

with open('wiki_corpus81.txt') as rf, open('context.txt', 'w') as wf:
    for line in rf:
        for word_t, context in contexts(line.rstrip('\n')):
            for c in context:
                wf.write('\t'.join((word_t, c)) + '\n')  

In [175]:
!head context.txt

Anarchism	is
Anarchism	a
is	Anarchism
is	a
is	political
is	philosophy
is	that
is	advocates
a	Anarchism
a	is
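
Because the width is drawn at random, the output above differs from run to run. To make the windowing itself easy to verify, the sketch below takes the random source and the maximum width as parameters; with `max_width=1` the draw is always 1 and the result is deterministic. The function name `context_pairs` and its parameters are assumptions for illustration, not part of the notebook.

```python
import random

def context_pairs(words, rng=random, max_width=5):
    """Yield (target, context) pairs; the width is redrawn for every target."""
    for i, target in enumerate(words):
        width = rng.randint(1, max_width)
        begin = max(0, i - width)
        # Up to `width` words on each side, without the target position itself.
        for context in words[begin:i] + words[i + 1:i + width + 1]:
            yield target, context

# With max_width=1 every draw is 1, so each word pairs only with its neighbours.
words = 'Anarchism is a political philosophy'.split(' ')
for target, context in context_pairs(words, max_width=1):
    print(target + '\t' + context)
```

Passing a seeded `random.Random` instance as `rng` would make a full-width run reproducible as well.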


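### 83. Counting word/context frequencies
> Using the output of 82, compute the following distributions and constant.
- f(t,c): the number of co-occurrences of word t and context word c
- f(t,∗): the number of occurrences of word t
- f(∗,c): the number of occurrences of context word c
- N: the total number of occurrences of word/context-word pairs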

In [5]:
from collections import Counter

In [6]:
f_t_c = Counter()
f_t = Counter()
f_c = Counter()

with open('context.txt') as f:
    for line in f:
        word, context_w = line.rstrip('\n').split('\t')
        f_t_c[word, context_w] += 1
        f_t[word] += 1
        f_c[context_w] += 1

In [8]:
f_t_c.most_common(5)

[(('of', 'the'), 2856375),
 (('the', 'of'), 2854872),
 (('the', 'the'), 1607499),
 (('the', 'in'), 1313046),
 (('in', 'the'), 1312895)]

In [17]:
count = !wc -l context.txt

In [18]:
count

['688818827 context.txt']

In [19]:
N = int(count[0].split()[0])
N

688818827
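
The four quantities can be checked on a handful of pairs. The sketch below uses a made-up list of pairs in place of `context.txt` and hypothetical variable names. It also shows that N equals the sum of all f(t,c), so it can be derived from the counter instead of `wc -l`.

```python
from collections import Counter

# Toy (t, c) pairs standing in for the lines of context.txt.
pairs = [('is', 'a'), ('is', 'a'), ('a', 'is'), ('is', 'Anarchism')]

pair_count = Counter(pairs)                    # f(t, c)
target_count = Counter(t for t, _ in pairs)    # f(t, *)
context_count = Counter(c for _, c in pairs)   # f(*, c)

# N is the total number of pairs, i.e. the sum of all co-occurrence counts.
total = sum(pair_count.values())
print(total, pair_count['is', 'a'], target_count['is'], context_count['a'])
# prints: 4 2 3 2
```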

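### 84. Building the word-context matrix
> Using the output of 83, build the word-context matrix X. Each element $X_{tc}$ of X is defined as follows.
- If $f(t,c) \geq 10$, then $X_{tc} = \mathrm{PPMI}(t,c) = \max\left\{\log \frac{N \times f(t,c)}{f(t,*) \times f(*,c)},\ 0\right\}$
- If $f(t,c) < 10$, then $X_{tc} = 0$
>
> Here, PPMI(t,c) is a statistic called Positive Pointwise Mutual Information. Note that the number of rows and columns of X is on the order of millions, so it is impossible to hold every element of the matrix in main memory. Fortunately, most elements of X are 0, so it is enough to write out only the nonzero elements.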

In [85]:
import math
from scipy import sparse

In [86]:
f_t_c['of', 'the']

2856375

In [87]:
def ppmi(t, c):
    return max(math.log((N*f_t_c[t,c]) / (f_t[t]*f_c[c])), 0)
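
To make the definition concrete, the sketch below evaluates the same formula on hypothetical counts, with the counts passed as arguments rather than read from the global counters. The function names and all numbers are assumptions chosen so the arithmetic is easy to follow.

```python
import math

def ppmi_value(n_total, f_tc, f_t, f_c):
    """PPMI from raw counts: max(log(N * f(t,c) / (f(t,*) * f(*,c))), 0)."""
    return max(math.log((n_total * f_tc) / (f_t * f_c)), 0.0)

def matrix_entry(n_total, f_tc, f_t, f_c, min_count=10):
    """X_tc as defined in the task: zero below the co-occurrence threshold."""
    if f_tc < min_count:
        return 0.0
    return ppmi_value(n_total, f_tc, f_t, f_c)

# Hypothetical counts, not taken from the corpus.
print(matrix_entry(n_total=1000, f_tc=20, f_t=50, f_c=100))   # ratio 4, so log(4)
print(matrix_entry(n_total=1000, f_tc=20, f_t=200, f_c=400))  # ratio 0.25, clipped to 0
print(matrix_entry(n_total=1000, f_tc=5, f_t=10, f_c=10))     # below the threshold, 0
```

Only the first case would be stored in the sparse matrix; the other two are zero and can be skipped.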

In [88]:
class Vocab(object):
    def __init__(self):
        self._word2index = {}
        self.index2word = {}
        
    def word2index(self, word):
        if word not in self._word2index:
            word_ix = len(self._word2index)
            self._word2index[word] = word_ix
            self.index2word[word_ix] = word
        else:
            word_ix = self._word2index[word]
        return word_ix
    
    def vocab_count(self):
        return len(self.index2word)

In [156]:
vocab = Vocab()

In [157]:
for key in f_t_c.keys():
    word, context = key
    vocab.word2index(word)
    vocab.word2index(context)

In [111]:
length = vocab.vocab_count()
X = sparse.lil_matrix((length, length))

for item in f_t_c.items():
    (word, context), count = item
    if count >= 10:
        ppmi_val = ppmi(word, context)
        if ppmi_val != 0:
            word_ix = vocab.word2index(word)
            context_ix = vocab.word2index(context)
            X[word_ix, context_ix] = ppmi_val


In [116]:
print(X[100000, :])

  (0, 29)	0.6241173140915302
  (0, 44)	0.44760882695192133
  (0, 29698)	7.887262544256288


In [117]:
vocab.index2word[29698]

'End'

In [118]:
vocab.index2word[100000]

"Land's"

In [150]:
X.shape

(1849348, 1849348)

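### 85. Dimensionality reduction by principal component analysis
> Apply principal component analysis to the word-context matrix obtained in 84, and compress the word meaning vectors to 300 dimensions.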

In [151]:
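from sklearn.decomposition import TruncatedSVD

# TruncatedSVD accepts sparse input directly. Unlike PCA it does not center
# the columns, so this is an SVD-based stand-in for the requested PCA.
svd = TruncatedSVD(300)
matrix = svd.fit_transform(X.tocsr())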
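
What a rank-k SVD produces can be seen on a tiny dense matrix with plain NumPy. This is only a sketch under assumptions: the 4 x 3 matrix is made up, and `np.linalg.svd` replaces the iterative solver that scikit-learn would use on a large sparse matrix. The reduced word vectors are the left singular vectors scaled by the singular values, which is how the scikit-learn documentation describes the output of `fit_transform`; signs and exact values may differ between solvers.

```python
import numpy as np

# Toy 4 x 3 word-context matrix (made-up values): rows 0-1 and rows 2-3
# have similar context profiles.
X_toy = np.array([[2.0, 0.0, 1.0],
                  [1.9, 0.1, 1.1],
                  [0.0, 3.0, 0.0],
                  [0.1, 2.9, 0.2]])

k = 2  # number of dimensions to keep
U, S, Vt = np.linalg.svd(X_toy, full_matrices=False)

# Rank-k word vectors: left singular vectors scaled by the singular values.
word_vectors = U[:, :k] * S[:k]
print(word_vectors.shape)  # (4, 2)

# Rows with similar contexts stay close together after the reduction.
print(np.linalg.norm(word_vectors[0] - word_vectors[1]))
print(np.linalg.norm(word_vectors[0] - word_vectors[2]))
```

True PCA would subtract the column means before the decomposition, which turns a sparse matrix into a dense one; that is a practical reason to skip centering at this scale.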

### 86. Displaying a word vector
> Load the word meaning vectors obtained in 85 and display the vector for "United States". Note that "United States" is represented internally as "United_States".

In [158]:
word_ix = vocab.word2index('United_States')

In [163]:
United_States = matrix[word_ix]
United_States

array([ 1.50587813e+01,  1.15192787e+01, -6.98531470e+00, -4.70355272e+00,
        3.54215407e+00,  3.29927278e+00, -4.24907953e-01,  1.27579809e+00,
       -8.57274852e+00,  4.63593177e+00, -4.63137695e+00, -5.41656260e+00,
        5.37415688e+00, -6.84042383e-01, -6.73631003e+00, -5.43350830e-01,
        2.40992074e+00, -1.39031067e-01, -1.01080800e+00, -3.56400022e+00,
       -8.37431036e-01, -5.00790820e-01, -3.05963873e+00,  2.95387265e+00,
        6.40611467e+00, -5.44044953e+00, -2.01400273e+00,  2.59234714e+00,
       -1.21773635e+00,  3.00144947e+00,  4.74963548e+00,  6.63254768e-01,
       -1.60842992e+00,  2.12897891e+00,  4.01010149e+00, -2.00211717e-01,
       -4.04357238e+00,  1.25722582e-02,  1.50073940e+00,  1.55335984e+00,
        3.08613080e+00, -1.05250647e+00, -3.98379239e+00,  2.13727891e+00,
        1.47992335e+00, -1.56295379e-01, -3.84480656e-01, -1.27361901e+00,
        3.63000528e+00,  1.60165771e+00,  4.40680358e-01,  1.90306199e+00,
        2.94326639e+00, -

### 87. Word similarity
> Load the word meaning vectors obtained in 85 and compute the cosine similarity between "United States" and "U.S.". Note that "U.S." is represented internally as "U.S".

In [161]:
from sklearn.metrics.pairwise import cosine_similarity

In [165]:
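# Look up both vectors here instead of reusing variables set in other cells:
# with cells run out of order, a stale word_ix can make both names point at
# the same row, and the similarity then comes out as exactly 1.
United_States = matrix[vocab.word2index('United_States')]
U_S = matrix[vocab.word2index('U.S')]
cosine_similarity(United_States.reshape(1, -1), U_S.reshape(1, -1))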

array([[1.]])
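
For reference, cosine similarity is the dot product divided by the product of the two norms. The sketch below spells this out in plain Python on made-up vectors; the function name `cosine` is an assumption for illustration. A result of exactly 1 means the two vectors are parallel, so for two distinct words it is worth confirming that they were read from different rows.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # same direction, close to 1
print(cosine([1.0, 0.0], [0.0, 1.0]))            # orthogonal, 0
print(cosine([3.0, 4.0], [4.0, 3.0]))            # 24 / 25
```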

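### 88. The 10 most similar words
> Load the word meaning vectors obtained in 85 and output the 10 words with the highest cosine similarity to "England", together with their similarities.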

In [194]:
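import numpy as np

def most_similar(vocab, matrix, vec, top=10):
    # Cosine similarity between vec and every row of matrix.
    similarity = cosine_similarity(vec.reshape(1, -1), matrix)[0]
    # Indices of the `top` largest similarities, best first.
    top_n = np.argsort(similarity)[-top:][::-1]
    return [(vocab.index2word[word_ix], similarity[word_ix]) for word_ix in top_n]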

In [177]:
word_ix = vocab.word2index('England')
England = matrix[word_ix]

In [195]:
top10 = most_similar(vocab, matrix, England)
for word, similarity in top10:
    print('{0:15} {1}'.format(word, similarity))

England         0.9999999999999999
Scotland        0.770625032838204
Wales           0.7244994426181566
Ireland         0.6404434870696478
Australia       0.6213859509363562
Yard's          0.6031340187409365
Yorkshire       0.5965747027784329
Britain         0.587418027622513
Somerset        0.5865673407153547
Cornwall        0.5564804453050681
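
The list above starts with England itself, because the query vector is one of the rows being ranked. If ten other words are wanted, the query can be excluded from the candidates. The sketch below shows one way to do that in plain Python; the function names, the `exclude` parameter, and the two-dimensional vectors are all assumptions for illustration.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as plain lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_words(vectors, query, top=10, exclude=()):
    """Rank the words in `vectors` (dict: word -> list) by similarity to `query`."""
    scored = ((cosine(vec, query), word)
              for word, vec in vectors.items() if word not in exclude)
    return [(word, score) for score, word in heapq.nlargest(top, scored)]

# Made-up 2-dimensional vectors, not taken from the trained model.
toy = {
    'England': [1.0, 0.1],
    'Scotland': [0.9, 0.2],
    'Wales': [0.8, 0.3],
    'banana': [0.0, 1.0],
}
for word, score in nearest_words(toy, toy['England'], top=2, exclude={'England'}):
    print('{0:15} {1:.3f}'.format(word, score))
```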


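### 89. Analogy by additive compositionality
> Load the word meaning vectors obtained in 85, compute vec("Spain") - vec("Madrid") + vec("Athens"), and output the 10 words most similar to that vector, together with their similarities.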

In [196]:
Spain = matrix[vocab.word2index('Spain')]
Madrid = matrix[vocab.word2index('Madrid')]
Athens = matrix[vocab.word2index('Athens')]
vec = Spain - Madrid + Athens
top10 = most_similar(vocab, matrix, vec)
for word, similarity in top10:
    print('{0:15} {1}'.format(word, similarity))

Spain           0.9059768008200785
Portugal        0.8726097123370322
Sweden          0.8697913990016827
Greece          0.8687233223498887
Norway          0.8521270358194808
Denmark         0.8488707695932212
Belgium         0.8455743607584609
Finland         0.8296915173017858
Italy           0.8266023476484177
Turkey          0.817492054474928
