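In this notebook we train a feedforward neural network that predicts the part-of-speech tag of each word from a fixed-width context window (three words to the left and three to the right).  
The model is quite simple, but because it predicts quickly it is still used in many NLP tools, even though many more powerful models exist today:  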
easyccg: https://github.com/mikelewis0/easyccg  
syntaxnet: https://github.com/tensorflow/models/tree/master/research/syntaxnet  
stanford parser: https://nlp.stanford.edu/software/lex-parser.shtml  
(Illustration)  
<img src='images/ff.png'>
(Image: https://xbt.net/blog/what-is-enigma/)

List of POS (part-of-speech) tags:  
- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary
- CCONJ: coordinating conjunction (appears as `CONJ` in the predictions shown later in this notebook)
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other

In [18]:
# Load the libraries we use

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, Input, Reshape, Concatenate
from keras.optimizers import SGD
from collections import Counter

# 👇 The next cell (GPU session setup) can be ignored

In [2]:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(
        visible_device_list="0", # specify GPU number
        allow_growth=True
    )
)
set_session(tf.Session(config=config))

In [3]:
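# Read a file in CoNLL format into a list of (words, tags) pairs
def read_conll(file):
    res = []
    words = []
    tags = []
    with open(file) as f:
        for line in f:
            line = line.strip()
            # On a blank line, emit the sentence built so far
            if len(line) == 0:
                if words:
                    res.append((words, tags))
                words = []
                tags = []
            # Otherwise extract the word and its tag
            else:
                items = line.split('\t')
                words.append(items[1].lower())  # lowercase the word
                tags.append(items[3])
    # Emit the last sentence if the file does not end with a blank line
    if words:
        res.append((words, tags))
    return res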
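
As a quick check of the expected input format, the sketch below parses a tiny hand-written CoNLL-style sample from a list of lines instead of a file. The sample sentences, the column layout (ID, FORM, LEMMA, POS) and the helper name `read_conll_lines` are made up for illustration. The only things taken from `read_conll` are that columns are tab-separated, that the word is in column 1 and the tag in column 3 (0-indexed), and that a blank line ends a sentence. The helper also flushes a final sentence that is not followed by a blank line.

```python
# Hypothetical helper: follows the parsing logic of read_conll, but over an iterable of lines
def read_conll_lines(lines):
    res, words, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if words:
                res.append((words, tags))
                words, tags = [], []
        else:
            items = line.split('\t')
            words.append(items[1].lower())
            tags.append(items[3])
    if words:  # last sentence when there is no trailing blank line
        res.append((words, tags))
    return res

# Made-up sample data
sample = [
    "1\tThe\tthe\tDET",
    "2\tdog\tdog\tNOUN",
    "3\tbarks\tbark\tVERB",
    "",
    "1\tHello\thello\tINTJ",
]
sents = read_conll_lines(sample)
print(sents)
```

Each element of the result is a `(words, tags)` pair with lists of equal length, which is the shape the rest of the notebook relies on.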

In [4]:
# Training data
train_sents = read_conll('data/train.conll')
# Test data (for evaluation)
test_sents = read_conll('data/test.conll')
# Development data
dev_sents = read_conll('data/dev.conll')

In [5]:
def get_prefix(word):
    word = word.rjust(2, ' ')
    return word[:2]

def get_suffix(word):
    word = word.ljust(2, ' ')
    return word[-2:]

In [6]:
# Quick test
word = 'dog'
print(get_prefix(word), get_suffix(word))
word = 'is'
print(get_prefix(word), get_suffix(word))
word = 'a'
print(get_prefix(word), get_suffix(word))

do og
is is
 a a 


In [7]:
def sliding_windows(lst):
    res = []
    for i in range(len(lst) - 6):
        res.append(lst[i:i+7])
    return res

In [8]:
PAD = 'PAD'

train_sents = [([PAD] * 3 + words + [PAD] * 3, tags) for words, tags in train_sents]
test_sents = [([PAD] * 3 + words + [PAD] * 3, tags) for words, tags in test_sents]
dev_sents = [([PAD] * 3 + words + [PAD] * 3, tags) for words, tags in dev_sents]
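
To see why three `PAD` tokens are added on each side, here is a small self-contained example (the sentence is made up). After padding, a sentence of n words yields exactly n windows of width 7, and the word to be tagged sits at the center (index 3) of each window.

```python
PAD = 'PAD'

def sliding_windows(lst):
    # Same behavior as the definition above: every contiguous run of 7 items
    return [lst[i:i + 7] for i in range(len(lst) - 6)]

words = ['the', 'dog', 'barks']            # made-up sentence
padded = [PAD] * 3 + words + [PAD] * 3     # length 3 + 3 + 3 = 9
windows = sliding_windows(padded)          # 9 - 6 = 3 windows

for window in windows:
    # window[3] is the word being tagged; the rest is its context
    print(window[3], '<-', window)
```

Because the number of windows equals the number of words, the windows line up one-to-one with the tags of the sentence.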

In [9]:
def make_vocab(tokens, min_count=2, UNK='UNK'):
    # Keep only tokens that occur at least min_count times, then add UNK as the last ID
    counter = Counter(tokens)
    token_list = [token for token, count in counter.most_common() if count >= min_count]
    token_list.append(UNK)
    token_dict = {t: i for i, t in enumerate(token_list)}
    return token_dict

In [11]:
UNK = 'UNK'
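# For a word embedding to be learned well, the word should appear in a variety of contexts.
# Words that appear only rarely in the training data (fewer than 2 times) are replaced with UNK.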
word_dict = make_vocab(word for words, _ in train_sents for word in words)
prefix_dict = make_vocab(get_prefix(word) for words, _ in train_sents for word in words)
suffix_dict = make_vocab(get_suffix(word) for words, _ in train_sents for word in words)
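
The effect of the frequency cutoff can be seen on a toy token list (made up for this example). Tokens seen fewer than twice get no ID of their own and are looked up as `UNK`, using the same `dict.get(token, dict[UNK])` pattern that the notebook uses when it builds the input matrices. The cutoff parameter is called `min_count` in this sketch because it acts as a lower bound.

```python
from collections import Counter

def make_vocab(tokens, min_count=2, UNK='UNK'):
    # Keep tokens seen at least min_count times, then add UNK as the last ID
    counter = Counter(tokens)
    token_list = [t for t, c in counter.most_common() if c >= min_count]
    token_list.append(UNK)
    return {t: i for i, t in enumerate(token_list)}

toy_tokens = ['the', 'dog', 'the', 'cat', 'dog', 'the', 'zebra']  # made-up data
vocab = make_vocab(toy_tokens)

# 'zebra' was seen once and 'unicorn' never, so both fall back to UNK
ids = [vocab.get(t, vocab['UNK']) for t in ['the', 'zebra', 'unicorn']]
print(vocab)  # {'the': 0, 'dog': 1, 'UNK': 2}
print(ids)    # [0, 2, 2]
```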

In [12]:
# Dictionary that maps each POS tag to an integer ID
tag_set = set(tag for _, tags in train_sents for tag in tags)
tag_dict = {w: i for i, w in enumerate(tag_set)}

In [14]:
print('word_dict size', len(word_dict))
print('prefix_dict size', len(prefix_dict))
print('suffix_dict size', len(suffix_dict))
print('tag_dict size', len(tag_dict))

word_dict size 21569
prefix_dict size 671
suffix_dict size 662
tag_dict size 17


In [36]:
def make_matrices(words_and_tags):
    word_xs = []
    prefix_xs = []
    suffix_xs = []
    ys = []
    for words, tags in words_and_tags:
        for window in sliding_windows(words):
            word_xs.append([word_dict.get(word, word_dict[UNK]) for word in window])
            prefix_xs.append([prefix_dict.get(get_prefix(word), prefix_dict[UNK]) for word in window])
            suffix_xs.append([suffix_dict.get(get_suffix(word), suffix_dict[UNK]) for word in window])
        ys.extend(tag_dict[tag] for tag in tags)

    word_xs = np.array(word_xs, 'i')
    prefix_xs = np.array(prefix_xs, 'i')
    suffix_xs = np.array(suffix_xs, 'i')
    ys = np.array(ys, 'i')
    ys = keras.utils.to_categorical(ys, len(tag_dict))
    print('dimensions of xs', word_xs.shape, prefix_xs.shape, suffix_xs.shape)
    print('dimensions of ys', ys.shape)
    return [word_xs, prefix_xs, suffix_xs], ys

In [37]:
train_xs, train_ys = make_matrices(train_sents)
test_xs, test_ys = make_matrices(test_sents)
dev_xs, dev_ys = make_matrices(dev_sents)

dimensions of xs (929552, 7) (929552, 7) (929552, 7)
dimensions of ys (929552, 17)
dimensions of xs (55371, 7) (55371, 7) (55371, 7)
dimensions of ys (55371, 17)
dimensions of xs (45422, 7) (45422, 7) (45422, 7)
dimensions of ys (45422, 17)
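
Each row of `ys` is a one-hot vector over the 17 tags. For reference, the same kind of encoding can be reproduced in plain NumPy; the sketch below uses made-up tag IDs and is only meant to show what the `(N, 17)` shape contains, not how Keras implements it.

```python
import numpy as np

n_tags = 17
tag_ids = np.array([3, 0, 16, 3])                     # made-up tag IDs for 4 words
one_hot = np.eye(n_tags, dtype='float32')[tag_ids]    # row i has a single 1 in column tag_ids[i]

print(one_hot.shape)            # (4, 17)
print(one_hot.argmax(axis=1))   # recovers the original IDs
```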


In [30]:
def make_model(word_input, prefix_input, suffix_input, word_vocab_size, prefix_vocab_size, suffix_vocab_size, n_tags,
              WORD_EMBED_DIM=128, PREFIX_EMBED_DIM=32, SUFFIX_EMBED_DIM=32, HIDDEN1_DIM=256, HIDDEN2_DIM=128):
    word_embed = Embedding(word_vocab_size, WORD_EMBED_DIM)(word_input)
    prefix_embed = Embedding(prefix_vocab_size, PREFIX_EMBED_DIM)(prefix_input)
    suffix_embed = Embedding(suffix_vocab_size, SUFFIX_EMBED_DIM)(suffix_input)
    embed = Concatenate()([word_embed, prefix_embed, suffix_embed])
    embed_reshaped = Reshape((-1,))(embed)
    hidden1 = Dense(HIDDEN1_DIM, activation='tanh')(embed_reshaped)
    hidden2 = Dense(HIDDEN2_DIM, activation='tanh')(hidden1)
    logits = Dense(n_tags, activation='softmax')(hidden2)
    model = keras.Model(inputs=[word_input, prefix_input, suffix_input], outputs=logits)
    return model

In [31]:
word_input, prefix_input, suffix_input = Input(shape=(7,)), Input(shape=(7,)), Input(shape=(7,))
model = make_model(word_input, prefix_input, suffix_input, len(word_dict), len(prefix_dict), len(suffix_dict), len(tag_dict))
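
The layer sizes implied by these arguments can be worked out by hand. The arithmetic below is a sketch that assumes the default dimensions of `make_model` and the vocabulary sizes printed earlier; each `Dense` layer is counted as weights plus biases.

```python
window = 7
word_vocab, prefix_vocab, suffix_vocab, n_tags = 21569, 671, 662, 17  # sizes printed earlier
word_dim, prefix_dim, suffix_dim = 128, 32, 32                        # make_model defaults
hidden1, hidden2 = 256, 128

# Embedding tables: one row per vocabulary entry
word_embed_params = word_vocab * word_dim
prefix_embed_params = prefix_vocab * prefix_dim
suffix_embed_params = suffix_vocab * suffix_dim

# After Concatenate and Reshape, each window becomes one flat vector
flat_dim = window * (word_dim + prefix_dim + suffix_dim)

# Dense layers: weight matrix plus bias vector
dense1_params = flat_dim * hidden1 + hidden1
dense2_params = hidden1 * hidden2 + hidden2
output_params = hidden2 * n_tags + n_tags

total = (word_embed_params + prefix_embed_params + suffix_embed_params
         + dense1_params + dense2_params + output_params)
print(flat_dim, word_embed_params, total)
```

Almost all parameters sit in the word embedding table, which is one reason rare words are mapped to `UNK`.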

In [32]:
keras.utils.plot_model(model, 'images/model.png')

ImportError: Failed to import `pydot`. Please install `pydot`. For example with `pip install pydot`.

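For a window of word IDs ${\bf x} = x_{-3}, x_{-2}, x_{-1}, x, x_{+1}, x_{+2}, x_{+3}$:  
$Embedding({\bf x}) = [ {\bf e}_{x_{-3}} | {\bf e}_{x_{-2}} | {\bf e}_{x_{-1}} | {\bf e}_{x} | {\bf e}_{x_{+1}} | {\bf e}_{x_{+2}} | {\bf e}_{x_{+3}} ]^T = {\bf E}^T$,  
$Reshape({\bf E}^T) = [ {\bf e}_{x_{-3}}, {\bf e}_{x_{-2}}, {\bf e}_{x_{-1}}, {\bf e}_{x}, {\bf e}_{x_{+1}}, {\bf e}_{x_{+2}}, {\bf e}_{x_{+3}} ]^T = {\bf e}$ (stacked vertically into one vector),  
$f({\bf x}) = \mathrm{softmax}(W_3 \tanh (W_2 \tanh (W_1 {\bf e} + b_1) + b_2) + b_3)$.  
In the model above, the prefix and suffix embeddings are concatenated to each word embedding before the reshape; they are omitted from the formula for brevity.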
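
The formula can be traced with plain NumPy to check the shapes. This is only a sketch with random, untrained weights and a single (word) embedding table. The names `E_table`, `W1`, `W2`, `W3` and the toy vocabulary size are made up, while the window width of 7 and the dimensions 128, 256, 128 and 17 follow the code in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, window, embed_dim = 50, 7, 128     # vocab_size is a toy value
hidden1, hidden2, n_tags = 256, 128, 17

# Random parameters standing in for the learned ones
E_table = rng.normal(size=(vocab_size, embed_dim))
W1, b1 = rng.normal(size=(hidden1, window * embed_dim)) * 0.01, np.zeros(hidden1)
W2, b2 = rng.normal(size=(hidden2, hidden1)) * 0.01, np.zeros(hidden2)
W3, b3 = rng.normal(size=(n_tags, hidden2)) * 0.01, np.zeros(n_tags)

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

x = rng.integers(0, vocab_size, size=window)   # one window of word IDs
E = E_table[x]                                 # Embedding: one row per position, shape (7, 128)
e = E.reshape(-1)                              # Reshape: one long vector, shape (896,)
h1 = np.tanh(W1 @ e + b1)
h2 = np.tanh(W2 @ h1 + b2)
f = softmax(W3 @ h2 + b3)                      # probability distribution over the 17 tags

print(E.shape, e.shape, f.shape, f.sum())
```

The output `f` has one probability per tag and sums to 1, which is what `model.predict` returns for each window.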

Visualization of the computation graph
<img src='images/model.png'>

In [33]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_7 (InputLayer)            (None, 7)            0                                            
__________________________________________________________________________________________________
input_8 (InputLayer)            (None, 7)            0                                            
__________________________________________________________________________________________________
input_9 (InputLayer)            (None, 7)            0                                            
__________________________________________________________________________________________________
embedding_7 (Embedding)         (None, 7, 128)       2760832     input_7[0][0]                    
__________________________________________________________________________________________________
embedding_

In [34]:
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(),
              metrics=['accuracy'])
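
With one-hot targets, the categorical cross-entropy for a single example reduces to the negative log of the probability assigned to the correct tag. The pure-Python sketch below illustrates this with made-up distributions; the function is a hypothetical stand-in for illustration, not the Keras implementation.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # -sum_k y_true[k] * log(y_pred[k]); eps guards against log(0)
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]          # one-hot target: class 1 (made-up, 3 classes)
confident = [0.05, 0.90, 0.05]    # most mass on the correct class
unsure = [0.40, 0.30, 0.30]       # little mass on the correct class

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

# A uniform guess over 17 tags gives a loss of log(17)
one_hot17 = [1.0] + [0.0] * 16
loss_uniform = categorical_crossentropy(one_hot17, [1 / 17] * 17)

print(loss_confident, loss_unsure, loss_uniform)
```

The uniform-guess value, log(17) ≈ 2.83, is a useful baseline when reading the training log: a loss below it means the model is already doing better than guessing uniformly.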

In [25]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 4787645517783878182, name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 7082170778
 locality {
   bus_id: 1
 }
 incarnation: 18102384936197633674
 physical_device_desc: "device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2", name: "/device:GPU:1"
 device_type: "GPU"
 memory_limit: 6375404340
 locality {
   bus_id: 1
 }
 incarnation: 9913575533825859606
 physical_device_desc: "device: 1, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0, compute capability: 5.2"]

In [38]:
# Training
model.fit(train_xs, train_ys, batch_size=1024, epochs=200, verbose=1, validation_data=(dev_xs, dev_ys))

# If training is too much hassle, use this instead (loads pretrained parameters)
# model.load_weights('models/weights.h5')

Train on 929552 samples, validate on 45422 samples
Epoch 1/200
Epoch 2/200
111616/929552 [==>...........................] - ETA: 36s - loss: 2.0891 - acc: 0.3932

KeyboardInterrupt: 

In [27]:
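# Applying the trained model to the first 10 test windows returns a matrix
# (test_xs is a list of three arrays, so slice each array rather than the list itself)
model.predict([x[:10] for x in test_xs])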

array([[8.10557082e-02, 1.32073769e-02, 7.10472639e-04, 7.18352795e-01,
        5.89033652e-06, 3.89746914e-04, 5.57578518e-04, 2.58904400e-08,
        1.58037595e-03, 2.01247330e-03, 1.02006830e-03, 7.10095046e-04,
        4.66211577e-06, 2.16930785e-04, 3.60750452e-09, 1.80175677e-01,
        7.04472061e-16],
       [7.86494547e-06, 2.18097651e-09, 1.24326913e-08, 6.10316864e-13,
        5.12558381e-06, 3.25297265e-07, 2.90912703e-05, 5.02471835e-07,
        8.15867799e-16, 5.47013144e-07, 9.99931097e-01, 5.55866075e-10,
        3.77816776e-14, 2.23222092e-12, 2.88937202e-14, 2.54073948e-05,
        2.39162648e-17],
       [5.88764760e-06, 9.99991417e-01, 3.21845732e-08, 2.37876435e-07,
        4.31857744e-11, 3.83840426e-09, 3.75526753e-07, 3.62298209e-07,
        8.90485342e-12, 1.44225919e-06, 1.40812717e-09, 4.49098270e-09,
        4.32071739e-12, 1.78185413e-08, 1.58697949e-16, 3.02612193e-07,
        8.71168284e-21],
       [2.78762332e-06, 2.99380929e-07, 6.22993355e-08, 1.895

In [29]:
# Shape of that matrix
Out[27].shape

(10, 17)

In [30]:
# Reverse dictionary, from ID back to POS tag
rev_tag_dict = {v: k for k, v in tag_dict.items()}

In [32]:
# Take the most probable tag ID in each row and map it back to a tag name
[rev_tag_dict[i] for i in np.argmax(Out[27], 1)]

['DET', 'PUNCT', 'PRON', 'VERB', 'PART', 'ADJ', 'PROPN', 'PUNCT', 'ADV', 'ADV']

In [35]:
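# Predict POS tags for a list of words
def predict(words):
    words = [PAD] * 3 + words + [PAD] * 3
    windows = sliding_windows(words)
    # Build the same three inputs (word, prefix and suffix IDs) that the model was trained on
    word_xs = np.array([[word_dict.get(w, word_dict[UNK]) for w in win] for win in windows], 'i')
    prefix_xs = np.array([[prefix_dict.get(get_prefix(w), prefix_dict[UNK]) for w in win] for win in windows], 'i')
    suffix_xs = np.array([[suffix_dict.get(get_suffix(w), suffix_dict[UNK]) for w in win] for win in windows], 'i')
    probabilities = model.predict([word_xs, prefix_xs, suffix_xs])
    result_ids = np.argmax(probabilities, 1)
    result = [rev_tag_dict[i] for i in result_ids]
    return result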

In [36]:
predict(['this', 'is', 'a', 'test', 'sentence', '.'])

['PRON', 'VERB', 'DET', 'NOUN', 'NOUN', 'PUNCT']

In [43]:
import random
for _ in range(5):
    # randrange excludes the upper bound; randint(0, len(...)) could index out of range
    i = random.randrange(len(test_sents))
    words, tags = test_sents[i]
    words = words[3:-3]  # strip the padding added earlier
    print('sentence:\t', ' '.join(words))
    print('predict:\t', ' '.join(predict(words)))
    print('answer:\t', ' '.join(tags))

sentence:	 the president , at a news conference friday , also renewed a call for the ouster of panama 's noriega .
predict:	 DET NOUN PUNCT ADP DET NOUN NOUN PROPN PUNCT ADV VERB DET NOUN ADP DET NOUN ADP PROPN PART PROPN PUNCT
answer:	 DET NOUN PUNCT ADP DET NOUN NOUN PROPN PUNCT ADV VERB DET NOUN ADP DET NOUN ADP PROPN PART PROPN PUNCT
sentence:	 also , the company 's hair-growing drug , rogaine , is selling well -- at about $ 125 million for the year , but the company 's profit from the drug has been reduced by upjohn 's expensive print and television campaigns for advertising , analysts said .
predict:	 ADV PUNCT DET NOUN PART ADJ NOUN PUNCT ADJ PUNCT AUX VERB ADV PUNCT ADP ADP SYM NUM NUM ADP DET NOUN PUNCT CONJ DET NOUN PART NOUN ADP DET NOUN AUX AUX VERB ADP PROPN PART ADJ NOUN CONJ NOUN NOUN ADP NOUN PUNCT NOUN VERB PUNCT
answer:	 ADV PUNCT DET NOUN PART ADJ NOUN PUNCT PROPN PUNCT AUX VERB ADV PUNCT ADP ADV SYM NUM NUM ADP DET NOUN PUNCT CONJ DET NOUN PART NOUN ADP DET NOUN AUX
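
Comparing predictions and answers by eye gets tedious, so a per-token accuracy is a natural summary. The sketch below defines a hypothetical helper `token_accuracy` and applies it to the first ten tags of the second sentence shown above, where the only mismatch is "rogaine" (predicted `ADJ`, answer `PROPN`).

```python
def token_accuracy(predicted, gold):
    # Fraction of positions where the predicted tag equals the gold tag
    if len(predicted) != len(gold):
        raise ValueError('predicted and gold must have the same length')
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# First ten tags of the second example sentence above
predicted = 'ADV PUNCT DET NOUN PART ADJ NOUN PUNCT ADJ PUNCT'.split()
gold = 'ADV PUNCT DET NOUN PART ADJ NOUN PUNCT PROPN PUNCT'.split()

accuracy = token_accuracy(predicted, gold)
print(accuracy)  # 0.9
```

Applied to every sentence in `test_sents`, the same count of correct tags divided by the total number of tokens would give the test accuracy of the tagger.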

In [39]:
model.save_weights('models/weights.h5')