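In this notebook we train a model that predicts the part-of-speech tag of each word with a feedforward neural network, using only a fixed window of context (three words to the left and three to the right). Note that the model cells below use LSTM layers over whole sentences rather than a fixed window.  
The fixed-window model is quite simple, but because it predicts quickly it is still used in a range of NLP tools, even now that many more powerful models exist:  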
easyccg: https://github.com/mikelewis0/easyccg  
syntaxnet: https://github.com/tensorflow/models/tree/master/research/syntaxnet  
stanford parser: https://nlp.stanford.edu/software/lex-parser.shtml  
(Illustration)
<img src='images/ff.png'>
(Image: https://xbt.net/blog/what-is-enigma/)
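
The fixed-window idea described above can be sketched in a few lines. This is only an illustration, not code from the notebook: the `PAD` token and the `extract_windows` helper are hypothetical names, and the window width of 3 follows the description in the introduction.

```python
# Hypothetical helper (not part of the notebook): for each word, collect the
# 3 words to its left, the word itself, and the 3 words to its right.
PAD = '<pad>'  # assumed padding token for positions outside the sentence

def extract_windows(words, width=3):
    """Return one list of 2 * width + 1 tokens per word, centered on that word."""
    padded = [PAD] * width + list(words) + [PAD] * width
    return [padded[i:i + 2 * width + 1] for i in range(len(words))]

sentence = ['this', 'is', 'a', 'test', 'sentence', '.']
windows = extract_windows(sentence)
for word, window in zip(sentence, windows):
    print(word, '->', window)
```

A feedforward tagger would look up an embedding for each of the seven tokens in a window, concatenate them, and classify the center word from that single vector.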

List of POS (part-of-speech) tags:  
- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary
- CCONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other

In [52]:
# Load the libraries we use

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, Input, Reshape, LSTM, Bidirectional
from keras.optimizers import Adagrad
from collections import Counter


In [2]:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(
        visible_device_list="0", # specify GPU number
        allow_growth=True
    )
)
set_session(tf.Session(config=config))

In [3]:
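# Function that reads a CoNLL-format file
def read_conll(file):
    res = []
    words = []
    tags = []
    with open(file, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            # On a blank line, emit the sentence built so far
            if len(line) == 0:
                if words:
                    res.append((words, tags))
                words = []
                tags = []
            # Otherwise extract the word and its tag
            else:
                items = line.split('\t')
                words.append(items[1].lower())  # lowercase the word
                tags.append(items[3])
    # Emit the last sentence if the file does not end with a blank line
    if words:
        res.append((words, tags))
    return res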
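
To make the expected input concrete, here is a small sketch that applies the same parsing logic to an in-memory sample. The sample rows are made up for illustration, and the column layout (word form in column 2, tag in column 4, as in CoNLL-U) is an assumption based on the indices `items[1]` and `items[3]` used above; `parse_conll_lines` is a hypothetical name.

```python
import io

# Made-up sample: tab-separated columns (ID, FORM, LEMMA, UPOS),
# with a blank line between sentences.
SAMPLE = (
    "1\tThis\tthis\tPRON\n"
    "2\tis\tbe\tAUX\n"
    "3\tfine\tfine\tADJ\n"
    "\n"
    "1\tHello\thello\tINTJ\n"
)

def parse_conll_lines(lines):
    """Same logic as read_conll, but over any iterable of lines."""
    sentences, words, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            if words:
                sentences.append((words, tags))
            words, tags = [], []
        else:
            items = line.split('\t')
            words.append(items[1].lower())
            tags.append(items[3])
    if words:  # sentence not followed by a blank line
        sentences.append((words, tags))
    return sentences

sample_sents = parse_conll_lines(io.StringIO(SAMPLE))
print(sample_sents)
```

Each sentence comes back as a `(words, tags)` pair of equal-length lists, which is the shape the rest of the notebook relies on.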

In [4]:
# Training data
train_sents = read_conll('data/train.conll')
# Test data (for evaluation)
test_sents = read_conll('data/test.conll')
# Development data
dev_sents = read_conll('data/dev.conll')

In [12]:
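# Dictionary that maps each word to a positive integer ID
UNK = 'UNK'
word_count = Counter(word for words, _ in train_sents for word in words)
# Keep words seen at least twice; everything else is mapped to UNK
word_set = [word for word, count in word_count.most_common() if count >= 2]
word_set.append(UNK)
# IDs start at 1 so that 0 stays free for padding
word_dict = {w: i for i, w in enumerate(word_set, 1)}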
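
As a quick illustration of how the frequency cutoff and the `UNK` entry interact, the sketch below builds the same kind of dictionary from two toy sentences. The toy data and the `toy_` names are invented for this example only.

```python
from collections import Counter

# Toy corpus in the same (words, tags) format as train_sents
toy_sents = [
    (['the', 'cat', 'sat'], ['DET', 'NOUN', 'VERB']),
    (['the', 'dog', 'sat'], ['DET', 'NOUN', 'VERB']),
]

toy_count = Counter(w for ws, _ in toy_sents for w in ws)
# 'cat' and 'dog' occur only once, so they fall below the cutoff
toy_vocab = [w for w, c in toy_count.most_common() if c >= 2] + ['UNK']
toy_word_dict = {w: i for i, w in enumerate(toy_vocab, 1)}
toy_unk = toy_word_dict['UNK']

# Words outside the vocabulary are mapped to the UNK ID
ids = [toy_word_dict.get(w, toy_unk) for w in ['the', 'bird', 'sat']]
print(toy_word_dict)
print(ids)
```

Because rare training words are themselves replaced by `UNK`, the model sees the `UNK` embedding during training and can learn something useful for unseen words at prediction time.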

In [21]:
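# Dictionary that maps each POS tag to an integer ID
tag_set = set(tag for _, tags in train_sents for tag in tags)
# Sort so the tag IDs are the same on every run (needed when reloading saved weights)
tag_dict = {w: i for i, w in enumerate(sorted(tag_set))}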

In [22]:
print('word_dict size', len(word_dict))
print('tag_dict size', len(tag_dict))

word_dict size 21568
tag_dict size 17


In [30]:
UNKNOWN_ID = word_dict[UNK]
def make_matrices(sents):
    # Pad every sentence to the longest one in this split; ID 0 marks padding
    max_length = max(len(sent) for sent, _ in sents)
    xs = np.zeros((len(sents), max_length), 'i')
    # One-hot tag vectors; padded positions stay all-zero
    ys = np.zeros((len(sents), max_length, len(tag_dict)), 'i')
    for i, sent in enumerate(sents):
        for j, (word, tag) in enumerate(zip(*sent)):
            xs[i, j] = word_dict.get(word, UNKNOWN_ID)
            ys[i, j, tag_dict[tag]] = 1

    print('dimensions of xs', xs.shape)
    print('dimensions of ys', ys.shape)
    return xs, ys

In [31]:
train_xs, train_ys = make_matrices(train_sents)
test_xs, test_ys = make_matrices(test_sents)
dev_xs, dev_ys = make_matrices(dev_sents)

dimensions of xs (39604, 141)
dimensions of ys (39604, 141, 17)
dimensions of xs (2407, 65)
dimensions of ys (2407, 65, 17)
dimensions of xs (1913, 249)
dimensions of ys (1913, 249, 17)
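
The shapes above show that each split is padded to its own longest sentence (141, 65, and 249 tokens). The following sketch reproduces the padding and one-hot encoding on toy data with plain Python lists, so the layout is easy to inspect. The toy dictionaries and sentences are invented for illustration.

```python
# Toy dictionaries; word IDs start at 1 because 0 is the padding ID
toy_word_dict = {'the': 1, 'cat': 2, 'sat': 3}
toy_tag_dict = {'DET': 0, 'NOUN': 1, 'VERB': 2}
toy_sents = [
    (['the', 'cat', 'sat'], ['DET', 'NOUN', 'VERB']),
    (['cat'], ['NOUN']),
]

max_length = max(len(ws) for ws, _ in toy_sents)
# xs: (num_sentences, max_length), filled with the padding ID 0
xs = [[0] * max_length for _ in toy_sents]
# ys: (num_sentences, max_length, num_tags), all-zero by default
ys = [[[0] * len(toy_tag_dict) for _ in range(max_length)] for _ in toy_sents]

for i, (ws, ts) in enumerate(toy_sents):
    for j, (w, t) in enumerate(zip(ws, ts)):
        xs[i][j] = toy_word_dict[w]
        ys[i][j][toy_tag_dict[t]] = 1

print(xs)     # word IDs, padded with 0
print(ys[1])  # one-hot rows for the short sentence; padding rows are all zero
```

The zeros in `xs` are what `mask_zero=True` in the `Embedding` layer refers to: positions with ID 0 are treated as padding.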


In [53]:
VOCAB_SIZE = len(word_dict) + 1     # vocabulary size (+1 for padding ID 0)
EMBED_DIM = 64                      # embedding dimension
HIDDEN1_DIM = 128                   # hidden layer 1 (LSTM units per direction)
HIDDEN2_DIM = 64                    # hidden layer 2
NUM_TAGS = len(tag_dict)

model = Sequential()
model.add(Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True))
model.add(Bidirectional(LSTM(HIDDEN1_DIM, return_sequences=True)))
model.add(Dense(HIDDEN2_DIM, activation='tanh'))
model.add(Dense(NUM_TAGS, activation='softmax'))
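
To get a feel for where the parameters of this model sit, the layer sizes can be worked out by hand. The sketch below assumes the Keras defaults (LSTM with bias, `Bidirectional` concatenating the two directions) and uses the dictionary sizes printed earlier (21568 words, 17 tags); `model.summary()` is the authoritative source if the numbers differ in your environment.

```python
# Sizes taken from the notebook output and the constants above
VOCAB_SIZE = 21568 + 1
EMBED_DIM = 64
HIDDEN1_DIM = 128
HIDDEN2_DIM = 64
NUM_TAGS = 17

# Embedding: one EMBED_DIM vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMBED_DIM

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((EMBED_DIM + HIDDEN1_DIM) * HIDDEN1_DIM + HIDDEN1_DIM)
# Bidirectional wraps two independent LSTMs (forward and backward)
bilstm_params = 2 * lstm_params

# Dense layers: weights plus bias; the first one sees the concatenated
# forward and backward states, hence 2 * HIDDEN1_DIM inputs
dense1_params = (2 * HIDDEN1_DIM) * HIDDEN2_DIM + HIDDEN2_DIM
dense2_params = HIDDEN2_DIM * NUM_TAGS + NUM_TAGS

total_params = embedding_params + bilstm_params + dense1_params + dense2_params
print('embedding:', embedding_params)
print('bi-LSTM:  ', bilstm_params)
print('dense 1:  ', dense1_params)
print('dense 2:  ', dense2_params)
print('total:    ', total_params)
```

Under these assumptions the embedding table accounts for most of the parameters, which is typical for taggers with a vocabulary of this size.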

In [49]:
keras.utils.plot_model(model, 'images/model_lstm.png')

Visualization of the computation graph
<img src='images/model_lstm.png'>

In [54]:
model.compile(loss='categorical_crossentropy',
              optimizer=Adagrad(),
              metrics=['accuracy'])

In [55]:
# Training
model.fit(train_xs, train_ys, batch_size=64, epochs=20, verbose=1, validation_data=(dev_xs, dev_ys))

# If you would rather skip training, use this instead (loads pretrained parameters)
# model.load_weights('models/weights.h5')

Train on 39604 samples, validate on 1913 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20

KeyboardInterrupt: 

In [63]:
# Unidirectional LSTM
model2 = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    LSTM(HIDDEN1_DIM, return_sequences=True),
    Dense(HIDDEN2_DIM, activation='tanh'),
    Dense(NUM_TAGS, activation='softmax')
])

In [64]:
model2.compile(loss='categorical_crossentropy',
              optimizer=Adagrad(),
              metrics=['accuracy'])

In [65]:
model2.fit(train_xs, train_ys, batch_size=64, epochs=20, verbose=1, validation_data=(dev_xs, dev_ys))

Train on 39604 samples, validate on 1913 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20

KeyboardInterrupt: 

In [56]:
# Reverse dictionary, from ID back to POS tag
rev_tag_dict = {v: k for k, v in tag_dict.items()}

In [57]:
UNKNOWN_ID = word_dict['UNK']
def predict(words):
    ids = [word_dict.get(word, UNKNOWN_ID) for word in words]
    matrix = np.array([ids], 'i')
    probabilities = model.predict(matrix)[0]
    result_ids = np.argmax(probabilities, 1)
    result = [rev_tag_dict[i] for i in result_ids]
    return result

In [58]:
predict(['this', 'is', 'a', 'test', 'sentence', '.'])

['PRON', 'VERB', 'DET', 'NOUN', 'NOUN', 'PUNCT']
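
The decoding step inside `predict` is a per-token argmax followed by a lookup in the reverse dictionary. The sketch below shows that step in isolation on a hand-written probability table; the numbers, the three-tag dictionary, and the `decode` helper are invented for illustration and are not model output.

```python
# Toy reverse dictionary and a made-up (3 tokens x 3 tags) probability table
toy_rev_tag_dict = {0: 'DET', 1: 'NOUN', 2: 'VERB'}
toy_probabilities = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
]

def decode(probabilities, rev_tag_dict):
    """Pick the highest-probability tag ID for each token and map it to a tag."""
    result = []
    for row in probabilities:
        best_id = max(range(len(row)), key=row.__getitem__)
        result.append(rev_tag_dict[best_id])
    return result

decoded = decode(toy_probabilities, toy_rev_tag_dict)
print(decoded)
```

Each token is decoded independently here; nothing in this step enforces consistency between neighboring tags.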

In [62]:
import random
for _ in range(5):
    i = random.randrange(len(test_sents))  # randint includes the upper bound and could go out of range
    words, tags = test_sents[i]
    print('sentence:\t', ' '.join(words))
    print('predict: \t', ' '.join(predict(words)))
    print('answer:  \t', ' '.join(tags))

sentence:	 i disagree with the statement by mr. lantos that one should not draw an adverse inference against former hud officials who assert their fifth amendment privilege against self-incrimination in congressional hearings .
predict: 	 PRON VERB ADP DET NOUN ADP PROPN PROPN SCONJ NOUN AUX PART VERB DET ADJ NOUN ADP ADJ PROPN NOUN PRON VERB PRON PROPN NOUN NOUN ADP NOUN ADP ADJ NOUN PUNCT
answer:  	 PRON VERB ADP DET NOUN ADP PROPN PROPN SCONJ PRON AUX PART VERB DET ADJ NOUN ADP ADJ PROPN NOUN PRON VERB PRON PROPN PROPN NOUN ADP NOUN ADP ADJ NOUN PUNCT
sentence:	 mr. perkins believes , however , that the market could be stabilized if california investor marvin davis steps back in to the united bidding with an offer of $ 275 a share .
predict: 	 PROPN PROPN VERB PUNCT ADV PUNCT SCONJ DET NOUN AUX AUX VERB SCONJ PROPN NOUN PROPN PROPN VERB ADV ADP ADP DET PROPN VERB ADP DET NOUN ADP SYM NUM DET NOUN PUNCT
answer:  	 PROPN PROPN VERB PUNCT ADV PUNCT SCONJ DET NOUN AUX AUX VERB SCONJ PRO
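
Reading predictions and answers side by side makes it hard to spot the differences. A small sketch like the following lists the mismatches and computes token-level accuracy for one sentence. It uses the first complete example printed above; the `token_accuracy` helper is a hypothetical name, not part of the notebook.

```python
def token_accuracy(words, predicted, gold):
    """Return (correct, total, mismatches) for one tagged sentence."""
    mismatches = [(w, p, g) for w, p, g in zip(words, predicted, gold) if p != g]
    total = len(gold)
    return total - len(mismatches), total, mismatches

# First example from the output above
words = ('i disagree with the statement by mr. lantos that one should not draw '
         'an adverse inference against former hud officials who assert their '
         'fifth amendment privilege against self-incrimination in congressional '
         'hearings .').split()
predicted = ('PRON VERB ADP DET NOUN ADP PROPN PROPN SCONJ NOUN AUX PART VERB '
             'DET ADJ NOUN ADP ADJ PROPN NOUN PRON VERB PRON PROPN NOUN NOUN '
             'ADP NOUN ADP ADJ NOUN PUNCT').split()
gold = ('PRON VERB ADP DET NOUN ADP PROPN PROPN SCONJ PRON AUX PART VERB '
        'DET ADJ NOUN ADP ADJ PROPN NOUN PRON VERB PRON PROPN PROPN NOUN '
        'ADP NOUN ADP ADJ NOUN PUNCT').split()

correct, total, mismatches = token_accuracy(words, predicted, gold)
print('accuracy: %d/%d = %.4f' % (correct, total, correct / total))
for word, p, g in mismatches:
    print('%s: predicted %s, answer %s' % (word, p, g))
```

Summing `correct` and `total` over all test sentences would give an accuracy figure that ignores padding positions entirely.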

In [None]:
model.save_weights('models/weights.h5')