# Intent Detection with CNNs and RNNs

This notebook shows how to solve an intent detection task on the ATIS dataset using a CNN and an RNN. ATIS, short for Airline Travel Information System, is a standard benchmark dataset for intent detection. The dataset is located in the `Data2` folder under the `Data` folder.

## Setup

### Installing packages

In [1]:
!pip install -q tensorflow==2.6.0 numpy==1.19.5 scikit-learn==0.22.2.post1

### Imports

In [2]:
import os

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import (LSTM, Conv1D, Dense, Embedding, GlobalMaxPooling1D, MaxPooling1D, TextVectorization)
from tensorflow.keras.models import Sequential

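### Uploading the data

First, upload the data. There is a `Data` folder at the same level as this notebook, with a `data2` folder inside it. Upload the following two files. If you are not running on Colab, specify the correct paths when loading the files instead.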

- atis.train.w-intent.iob
- atis.test.w-intent.iob

In [3]:
from google.colab import files
uploaded = files.upload()

Saving atis.test.w-intent.iob to atis.test.w-intent.iob
Saving atis.train.w-intent.iob to atis.train.w-intent.iob


In [4]:
!head atis.train.w-intent.iob

BOS i want to fly from boston at 838 am and arrive in denver at 1110 in the morning EOS	 O O O O O O B-fromloc.city_name O B-depart_time.time I-depart_time.time O O O B-toloc.city_name O B-arrive_time.time O O B-arrive_time.period_of_day atis_flight
BOS what flights are available from pittsburgh to baltimore on thursday morning EOS	O O O O O O B-fromloc.city_name O B-toloc.city_name O B-depart_date.day_name B-depart_time.period_of_day atis_flight
BOS what is the arrival time in san francisco for the 755 am flight leaving washington EOS	O O O O B-flight_time I-flight_time O B-fromloc.city_name I-fromloc.city_name O O B-depart_time.time I-depart_time.time O O B-fromloc.city_name atis_flight_time
BOS cheapest airfare from tacoma to orlando EOS	O B-cost_relative O O B-fromloc.city_name O B-toloc.city_name atis_airfare
BOS round trip fares from pittsburgh to philadelphia under 1000 dollars EOS	O B-round_trip I-round_trip O O B-fromloc.city_name O B-toloc.city_name B-cost_relative B-fare_amo

In [5]:
!head atis.test.w-intent.iob

BOS O
i O
would O
like O
to O
find O
a O
flight O
from O
charlotte B-fromloc.city_name


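Annoyingly, the training data and the test data use different formats: the training file stores one sentence per line, while the test file stores one token per line.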
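To make the difference concrete, here is a minimal sketch that parses one example of each layout into the same `(words, intent)` pair. The sample strings and the helper names `parse_sentence_per_line` and `parse_token_per_line` are illustrative assumptions, not part of the dataset or any library. The sketch also assumes that the last tag of each sentence holds the intent label, as the `head` output above suggests.

```python
# Illustrative sketch: both layouts carry the same information.
# The sample strings are made up to mirror the two file formats.

def parse_sentence_per_line(line):
    """Training layout: 'BOS w1 ... wn EOS<TAB>tags ... intent' on one line."""
    text, tags = line.strip().split("\t")
    words = text.split()[1:-1]   # drop BOS / EOS
    intent = tags.split()[-1]    # the last tag is the intent label
    return words, intent


def parse_token_per_line(block):
    """Test layout: one 'word tag' pair per line; the EOS row carries the intent."""
    rows = [row.split() for row in block.strip().splitlines()]
    words = [word for word, _ in rows][1:-1]   # drop BOS / EOS
    intent = rows[-1][1]
    return words, intent


train_style = "BOS cheapest airfare EOS\tO B-cost_relative O atis_airfare"
test_style = "BOS O\ncheapest B-cost_relative\nairfare O\nEOS atis_airfare"

print(parse_sentence_per_line(train_style))  # (['cheapest', 'airfare'], 'atis_airfare')
print(parse_token_per_line(test_style))      # (['cheapest', 'airfare'], 'atis_airfare')
```

Because both parsers return the same structure, everything downstream can treat the two files identically.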

Finally, download the GloVe embeddings.

In [6]:
# Download and extract GloVe
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip -d DATAPATH

--2021-09-21 09:38:59--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-09-21 09:39:00--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-09-21 09:41:39 (5.15 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: DATAPATH/glove.6B.50d.txt  
  inflating: DATAPATH/glove.6B.100d.txt  
  inflating: DATAPATH/glove.6B.200d.txt  
  inflating: DATAPATH/glove.6B.300d.txt  


## Loading the data

Once the data is uploaded, load the training and test data. We define a separate loader function for each file.


In [7]:
train_data_path = "atis.train.w-intent.iob"
test_data_path = "atis.test.w-intent.iob"

### Training data

In [8]:
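def load_train_data(filename, remove_validation=True):
    """Load the one-sentence-per-line file: words, a tab, then tags ending with the intent."""
    sents, labels, intents = [], [], []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            # Left of the tab: "BOS w1 ... wn EOS"; right: slot tags followed by the intent label.
            words, labs = [i.split(' ') for i in line.strip().split('\t')]
            # Skip sentences whose intent label contains "#".
            if remove_validation and "#" in labs[-1]:
                continue
            sents.append(words[1:-1])   # drop BOS / EOS
            labels.append(labs[1:-1])   # drop the first tag and the trailing intent label
            intents.append(labs[-1])
    return sents, labels, intents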

In [9]:
sents, labels, intents = load_train_data(train_data_path)
train_texts = [" ".join(words) for words in sents]
train_labels = intents

print("Number of training sentences :", len(train_texts))
print("Number of unique intents :", len(set(train_labels)))

for i in zip(train_texts[:5], train_labels[:5]):
    print(i)

Number of training sentences : 4952
Number of unique intents : 17
('i want to fly from boston at 838 am and arrive in denver at 1110 in the morning', 'atis_flight')
('what flights are available from pittsburgh to baltimore on thursday morning', 'atis_flight')
('what is the arrival time in san francisco for the 755 am flight leaving washington', 'atis_flight_time')
('cheapest airfare from tacoma to orlando', 'atis_airfare')
('round trip fares from pittsburgh to philadelphia under 1000 dollars', 'atis_airfare')


### Test data

In [10]:
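def load_test_data(filename, remove_validation=True):
    """Load the one-token-per-line file; a blank line ends each sentence."""
    sents, labels, intents = [], [], []
    with open(filename, encoding="utf-8") as f:
        words, tags = [], []
        for line in f:
            line = line.strip()
            if line:
                word, tag = line.split()
                words.append(word)
                tags.append(tag)
            elif tags:
                # Blank line: the sentence is complete, and its last tag is the intent label.
                # The `elif tags` guard avoids an IndexError on consecutive blank lines.
                if not (remove_validation and "#" in tags[-1]):
                    sents.append(words[1:-1])   # drop BOS / EOS
                    labels.append(tags[1:-1])
                    intents.append(tags[-1])
                words, tags = [], []
    return sents, labels, intents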

In [11]:
sents, labels, intents = load_test_data(test_data_path)

test_texts = [" ".join(words) for words in sents]
test_labels = intents

new_labels = set(test_labels) - set(train_labels)

# Remove examples whose label appears only in the test data
vals = []
for i in range(len(test_labels)):
    if test_labels[i] in new_labels:
        print(test_labels[i])
        vals.append(i)
for i in vals[::-1]:
    test_labels.pop(i)
    test_texts.pop(i)

print("Number of testing sentences :", len(test_texts))
print("Number of unique intents :", len(set(test_labels)))

for i in zip(test_texts[:5], test_labels[:5]):
    print(i)

atis_day_name
atis_day_name
Number of testing sentences : 876
Number of unique intents : 15
('i would like to find a flight from charlotte to las vegas that makes a stop in st. louis', 'atis_flight')
('on april first i need a ticket from tacoma to san jose departing before 7 am', 'atis_airfare')
('on april first i need a flight going from phoenix to san diego', 'atis_flight')
('i would like a flight traveling one way from phoenix to san diego on april first', 'atis_flight')
('i would like a flight from orlando to salt lake city for april first on delta airlines', 'atis_flight')


## Preprocessing

In [12]:
BASE_DIR = 'DATAPATH'
GLOVE_PATH = os.path.join(BASE_DIR, 'glove.6B.100d.txt')
MAX_SEQUENCE_LENGTH = 300
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.3

Use `TextVectorization`, which we also used in the "Text Classification with Neural Networks" notebook, to convert words to IDs.

In [13]:
vectorize_layer = TextVectorization(
    max_tokens=MAX_NUM_WORDS,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH
)
vectorize_layer.adapt(train_texts)
vectorize_layer.vocabulary_size()

896

In [14]:
x_train = vectorize_layer(train_texts).numpy()
x_test = vectorize_layer(test_texts).numpy()
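As a rough mental model of what `output_mode='int'` does, the following pure-Python sketch maps words to integer IDs and pads each sequence to a fixed length. It is a simplification under stated assumptions. ID 0 is reserved for padding and ID 1 for out-of-vocabulary words, as in `TextVectorization`. Words are ranked by frequency, with ties broken alphabetically; the tie-break is a choice made here for determinism and is not necessarily what Keras does. Only lowercasing is applied, whereas the real layer also strips punctuation by default. The helper names `build_vocab` and `vectorize` are hypothetical.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved IDs, mirroring TextVectorization's int mode


def build_vocab(texts, max_tokens):
    counts = Counter(word for text in texts for word in text.lower().split())
    # Most frequent first; the alphabetical tie-break is an assumption of this sketch.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    kept = ranked[: max_tokens - 2]  # two slots are taken by padding and OOV
    return {word: i + 2 for i, word in enumerate(kept)}


def vectorize(text, vocab, length):
    ids = [vocab.get(word, OOV_ID) for word in text.lower().split()]
    ids = ids[:length]                            # truncate long inputs
    return ids + [PAD_ID] * (length - len(ids))   # right-pad short inputs


texts = ["flights from boston", "flights to denver from boston", "cheapest flights"]
vocab = build_vocab(texts, max_tokens=20000)
print(vocab)
# {'flights': 2, 'boston': 3, 'from': 4, 'cheapest': 5, 'denver': 6, 'to': 7}
print(vectorize("flights from seattle", vocab, length=6))
# [2, 4, 1, 0, 0, 0]
```

The unseen word `seattle` maps to the OOV ID, and the remaining positions are filled with the padding ID. This is also why `x_train` is mostly zeros when `MAX_SEQUENCE_LENGTH` is much longer than a typical sentence.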

Next, use `LabelEncoder` to convert the labels to IDs.

In [15]:
le = LabelEncoder()
le.fit(train_labels)
y_train = le.transform(train_labels)
y_test = le.transform(test_labels)
y_train[:10]

array([ 9,  9, 11,  2,  2,  9,  1,  9,  9, 13])
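Conceptually, `LabelEncoder` sorts the unique labels and uses each label's position in that sorted list as its ID. The sketch below reproduces that idea in plain Python on a handful of made-up labels; `fit_labels` and `transform_labels` are hypothetical helper names, not scikit-learn functions.

```python
def fit_labels(labels):
    """Return the sorted unique classes; a label's ID is its position in this list."""
    return sorted(set(labels))


def transform_labels(labels, classes):
    index = {label: i for i, label in enumerate(classes)}
    # Like LabelEncoder.transform, this fails on a label that was not seen during fitting.
    return [index[label] for label in labels]


train = ["atis_flight", "atis_flight", "atis_flight_time", "atis_airfare", "atis_airfare"]
classes = fit_labels(train)
print(classes)                           # ['atis_airfare', 'atis_flight', 'atis_flight_time']
print(transform_labels(train, classes))  # [1, 1, 2, 0, 0]
```

Because transforming an unseen label fails, the test examples whose label never occurs in the training data had to be removed before this step.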

Split the training data to create a validation set.

In [16]:
x_train, x_valid, y_train, y_valid = train_test_split(
    x_train, y_train, test_size=VALIDATION_SPLIT, random_state=42
)

### Preparing the embedding matrix

In [17]:
# Prepare the embedding matrix.
# First, build a mapping from each word to its GloVe vector.
embeddings_index = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors in Glove embeddings.' % len(embeddings_index))

# Build the embedding matrix:
# each row corresponds to a word ID, each column to a GloVe dimension.
num_words = min(MAX_NUM_WORDS, vectorize_layer.vocabulary_size()) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for i, word in enumerate(vectorize_layer.get_vocabulary()):
    if i >= num_words:
        break
    embedding_vector = embeddings_index.get(word)
    # Words not found in GloVe keep their all-zero row.
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Load the pre-trained word embeddings into an Embedding layer.
# Note that trainable=False keeps the embeddings from being updated during training.
embedding_layer = Embedding(
    num_words,
    EMBEDDING_DIM,
    embeddings_initializer=Constant(embedding_matrix),
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False,
    mask_zero=True,
)

Found 400000 word vectors in Glove embeddings.
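The same row-by-ID lookup can be seen on a toy scale. The sketch below uses made-up 3-dimensional vectors and a five-entry vocabulary, both assumptions for illustration only, and plain lists instead of NumPy arrays.

```python
import io

# Made-up 3-dimensional vectors in GloVe's text format: a word, then its components.
toy_glove = io.StringIO("flight 0.1 0.2 0.3\nboston 0.4 0.5 0.6\n")
toy_vocab = ["", "[UNK]", "flight", "from", "boston"]  # list index = word ID
dim = 3

vectors = {}
for line in toy_glove:
    word, *coefs = line.split()
    vectors[word] = [float(c) for c in coefs]

# Row i holds the vector for word ID i; words without a GloVe entry stay all-zero.
matrix = [[0.0] * dim for _ in toy_vocab]
for i, word in enumerate(toy_vocab):
    if word in vectors:
        matrix[i] = vectors[word]

for word, row in zip(toy_vocab, matrix):
    print(repr(word), row)
```

Here `from` has no entry in the toy GloVe file, so its row stays at zero, just like the padding and OOV rows.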


## Training and evaluating the models

### CNN with pre-trained embeddings

In [18]:
cnnmodel = Sequential()
cnnmodel.add(embedding_layer)
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation="relu"))
cnnmodel.add(Dense(len(le.classes_), activation="softmax"))

cnnmodel.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["acc"]
)

cnnmodel.summary()

cnnmodel.fit(
    x_train,
    y_train,
    batch_size=128,
    epochs=1,
    validation_data=(x_valid, y_valid)
)
score, acc = cnnmodel.evaluate(x_test, y_test)
print("Test accuracy with CNN:", acc)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          89700     
_________________________________________________________________
conv1d (Conv1D)              (None, 296, 128)          64128     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 59, 128)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 55, 128)           82048     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 11, 128)           0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 7, 128)            82048     
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0
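
The output shapes and parameter counts in this summary can be checked by hand. The sketch below assumes the Keras defaults used in the cell above: `Conv1D` with `padding="valid"` and stride 1, and `MaxPooling1D(5)` with a stride equal to its pool size. The helper names are hypothetical.

```python
def conv1d_len(length, kernel):
    return length - kernel + 1   # "valid" padding, stride 1


def pool1d_len(length, pool):
    return length // pool        # non-overlapping windows, remainder dropped


def conv1d_params(kernel, in_channels, filters):
    return kernel * in_channels * filters + filters   # weights + biases


length = 300   # MAX_SEQUENCE_LENGTH
shapes = []
for layer in ("conv", "pool", "conv", "pool", "conv"):
    length = conv1d_len(length, 5) if layer == "conv" else pool1d_len(length, 5)
    shapes.append(length)

print(shapes)                      # [296, 59, 55, 11, 7]
print(897 * 100)                   # 89700: embedding rows x EMBEDDING_DIM
print(conv1d_params(5, 100, 128))  # 64128: first Conv1D, fed by 100-dim embeddings
print(conv1d_params(5, 128, 128))  # 82048: later Conv1D layers, fed by 128 filters
```

Each convolution shortens the sequence by 4 positions and each pooling step divides it by 5, so after three convolutions only 7 positions remain for the global max pooling.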

### CNN

Next, train the same architecture without the pre-trained embeddings.

In [19]:
cnnmodel = Sequential()
cnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation="relu"))
cnnmodel.add(Dense(len(le.classes_), activation="softmax"))

cnnmodel.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["acc"]
)

cnnmodel.summary()

cnnmodel.fit(
    x_train,
    y_train,
    batch_size=128,
    epochs=1,
    validation_data=(x_valid, y_valid)
)
score, acc = cnnmodel.evaluate(x_test, y_test)
print("Test accuracy with CNN:", acc)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 128)         2560000   
_________________________________________________________________
conv1d_3 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, None, 128)         0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, None, 128)         0         
_________________________________________________________________
conv1d_5 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)              

### LSTM

Next, train a model that uses an LSTM. First, train it without the pre-trained embeddings.

In [20]:
rnnmodel = Sequential()
rnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel.add(Dense(len(le.classes_), activation="softmax"))
rnnmodel.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

rnnmodel.summary()

rnnmodel.fit(
    x_train,
    y_train,
    batch_size=32,
    epochs=1,
    validation_data=(x_valid, y_valid)
)
score, acc = rnnmodel.evaluate(x_test, y_test, batch_size=32)
print("Test accuracy with RNN:", acc)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 128)         2560000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense_4 (Dense)              (None, 17)                2193      
Total params: 2,693,777
Trainable params: 2,693,777
Non-trainable params: 0
_________________________________________________________________
Test accuracy with RNN: 0.7214611768722534


### LSTM with pre-trained embeddings

In [21]:
rnnmodel2 = Sequential()
rnnmodel2.add(embedding_layer)
rnnmodel2.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel2.add(Dense(len(le.classes_), activation="softmax"))
rnnmodel2.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

rnnmodel2.summary()

rnnmodel2.fit(
    x_train,
    y_train,
    batch_size=32,
    epochs=1,
    validation_data=(x_valid, y_valid)
)
score, acc = rnnmodel2.evaluate(x_test, y_test, batch_size=32)
print("Test accuracy with RNN:", acc)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          89700     
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               117248    
_________________________________________________________________
dense_5 (Dense)              (None, 17)                2193      
Total params: 209,141
Trainable params: 119,441
Non-trainable params: 89,700
_________________________________________________________________
Test accuracy with RNN: 0.8116438388824463
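
The parameter counts reported for the two LSTM models follow from the standard LSTM and dense layer formulas. The sketch below recomputes them; the helper names are hypothetical, and the embedding sizes are taken from the summaries above.

```python
def lstm_params(input_dim, units):
    # Four weight sets (input, forget and output gates plus the cell candidate),
    # each with an input kernel, a recurrent kernel and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    return input_dim * units + units   # weights + biases


print(lstm_params(128, 128))   # 131584: LSTM on 128-dim trainable embeddings
print(lstm_params(100, 128))   # 117248: LSTM on 100-dim GloVe embeddings
print(dense_params(128, 17))   # 2193: output layer over 17 intents

# Totals for the two models.
print(20000 * 128 + lstm_params(128, 128) + dense_params(128, 17))  # 2693777
print(897 * 100 + lstm_params(100, 128) + dense_params(128, 17))    # 209141
```

Most of the first model's parameters sit in its 20000 x 128 embedding table, while the GloVe model trains only the 119,441 LSTM and dense parameters.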
