## Text Classification Using a Convolutional Neural Network

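Rotten Tomatoes is a website that gathers film critics' reviews in one place. For each film it tallies whether positive or negative reviews predominate and turns the result into a score. It also covers film-related information and news in general, and it is the best-known review aggregator in the English-speaking world.
(1 is positive, 0 is negative, 0.5 is neutral)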

## How Text Classification Works
The CNN used here is shown in the figure below.
<img src="img/figure1.png" width=600>

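embedding table: a dictionary of the words that appear in the reviews; each word has 128 feature values.

Layer 1: Embedding layer<br>
→ Fetches the feature values (weights) for each word from the embedding table.

Layer 2: Convolves the embedded word vectors with filters of several sizes
(for example, spanning 3, 4, and 5 words at a time).

Layer 3: Max-pooling, then flatten.

Output layer: fully connected layer.

Required data<br>
・An embedding table built from every word that appears in the reviews<br>
・Input data: each review split into words, with each word mapped to its index in the embedding table<br>
・A positive/negative label for each review<br>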
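
The layer sequence above can be sketched in plain NumPy to make the shapes concrete. This is only an illustration under assumed toy sizes (a 10-word vocabulary, 4-dimensional embeddings, 2 filters per size) with random weights; it is not the TensorFlow model defined later in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration; the notebook uses 128-dim embeddings).
vocab_size, embedding_size = 10, 4
filter_sizes, num_filters, num_classes = [3, 4, 5], 2, 2

embedding_table = rng.uniform(-1.0, 1.0, (vocab_size, embedding_size))
review = np.array([5, 2, 7, 1, 0, 0, 0])   # word indices, 0 = padding

# Layer 1: embedding lookup -> (sequence_length, embedding_size)
embedded = embedding_table[review]

pooled = []
for size in filter_sizes:
    filters = rng.normal(0.0, 0.1, (num_filters, size, embedding_size))
    # Layer 2: slide each filter over windows of `size` consecutive words.
    windows = np.stack([embedded[i:i + size] for i in range(len(review) - size + 1)])
    feature_map = np.maximum(0.0, np.einsum("wse,fse->wf", windows, filters))  # ReLU
    # Layer 3: max over all window positions, one value per filter.
    pooled.append(feature_map.max(axis=0))

features = np.concatenate(pooled)          # flattened, length 3 * num_filters
W_out = rng.normal(0.0, 0.1, (features.size, num_classes))
scores = features @ W_out                  # output layer (fully connected)
print(embedded.shape, features.shape, scores.shape)
```

Each filter size contributes `num_filters` pooled values regardless of sentence length, which is why the flattened feature vector has a fixed size.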


## 1 Data Preparation

### 1-1 Importing Libraries

In [14]:
import urllib.request
import numpy as np
import re
import itertools
from collections import Counter

### 1-2 Removing Unwanted Characters

In [15]:
def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string=str(string)
    
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    
    # Strip leading and trailing whitespace and newlines, and return the string in lowercase
    return string.strip().lower()

### 1-3 Splitting the Reviews into Words and Labeling Them (positive = [0, 1], negative = [1, 0])


In [16]:
def load_data_and_labels():
    """
    Loads MR polarity data from files, splits the data into words and generates labels.
    Returns split sentences and labels.
    """
    # Fetch the positive review sentences
    pos_file = urllib.request.urlopen('https://raw.githubusercontent.com/yoonkim/CNN_sentence/master/rt-polarity.pos')

    # Fetch the negative review sentences
    neg_file = urllib.request.urlopen('https://raw.githubusercontent.com/yoonkim/CNN_sentence/master/rt-polarity.neg')

    # Load the data line by line and strip leading/trailing whitespace and newlines
    positive_examples = list(pos_file.readlines())
    positive_examples = [s.strip() for s in positive_examples]
    negative_examples = list(neg_file.readlines())
    negative_examples = [s.strip() for s in negative_examples]
    
    # Remove unwanted characters and split each sentence into words
    x_text = positive_examples + negative_examples
    x_text = [clean_str(sent) for sent in x_text]
    x_text = [s.split(" ") for s in x_text]
    
    # Create the labels (one-hot: [0, 1] = positive, [1, 0] = negative)
    #positive_labels = [1 for _ in positive_examples]
    #negative_labels = [0 for _ in negative_examples]
    positive_labels = [[0, 1] for _ in positive_examples]  # _ is a throwaway loop variable, like i in "for i in ..."
    negative_labels = [[1, 0] for _ in negative_examples]
    
    y = np.concatenate([positive_labels, negative_labels], 0)  # join the two label arrays along axis 0
    return [x_text, y]
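
The labels built here are one-hot rows rather than single 0/1 values. A small sketch with hypothetical counts (two positive and three negative reviews) shows the resulting array and how `argmax` recovers a class index, which is what the accuracy computation in the model relies on:

```python
import numpy as np

positive_labels = [[0, 1] for _ in range(2)]   # two hypothetical positive reviews
negative_labels = [[1, 0] for _ in range(3)]   # three hypothetical negative reviews
y = np.concatenate([positive_labels, negative_labels], 0)

class_ids = np.argmax(y, axis=1)               # 1 = positive, 0 = negative
print(y.shape, class_ids.tolist())
```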

In [17]:
load_data_and_labels()

[[["b'the",
   'rock',
   'is',
   'destined',
   'to',
   'be',
   'the',
   '21st',
   'century',
   "'s",
   'new',
   'conan',
   'and',
   'that',
   'he',
   "'s",
   'going',
   'to',
   'make',
   'a',
   'splash',
   'even',
   'greater',
   'than',
   'arnold',
   'schwarzenegger',
   ',',
   'jean',
   'claud',
   'van',
   'damme',
   'or',
   'steven',
   'segal',
   "'"],
  ["b'the",
   'gorgeously',
   'elaborate',
   'continuation',
   'of',
   'the',
   'lord',
   'of',
   'the',
   'rings',
   'trilogy',
   'is',
   'so',
   'huge',
   'that',
   'a',
   'column',
   'of',
   'words',
   'cannot',
   'adequately',
   'describe',
   'co',
   'writer',
   'director',
   'peter',
   'jackson',
   "'s",
   'expanded',
   'vision',
   'of',
   'j',
   'r',
   'r',
   'tolkien',
   "'s",
   'middle',
   'earth',
   "'"],
  ["b'effective", 'but', 'too', 'tepid', "biopic'"],
  ["b'if",
   'you',
   'sometimes',
   'like',
   'to',
   'go',
   'to',
   'the',
   'movies',
   '
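
The `b'` prefix on the first token of each review in the output above, and the stray `'` on the last token, appear to come from `urlopen` returning bytes, which `clean_str` then converts with `str()`. A minimal sketch of decoding each line first; the helper name `to_text` and the `latin-1` encoding are assumptions, not part of the notebook:

```python
def to_text(raw, encoding="latin-1"):
    """Decode one raw line; `to_text` and the encoding are illustrative assumptions."""
    return raw.decode(encoding).strip()

raw_line = b"effective but too-tepid biopic\n"
as_str = str(raw_line.strip())     # what clean_str currently receives
decoded = to_text(raw_line)        # plain text without the bytes repr

print(as_str)
print(decoded)
```

Decoding before cleaning would change the vocabulary, so the vocabulary size and word indices shown in the later cells would differ.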

### 1-4 Padding: Extending Every Sentence to the Length of the Longest One

In [18]:
def pad_sentences(sentences, padding_word=""):
    """
    Pads all sentences to the same length. The length is defined by the longest sentence.
    Returns padded sentences.
    """
    sequence_length = max(len(x) for x in sentences)
    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        num_padding = sequence_length - len(sentence)
        new_sentence = sentence + [padding_word] * num_padding
        padded_sentences.append(new_sentence)
    return padded_sentences

### 1-5 Building the Vocabulary for the Embedding Table

In [19]:
def build_vocab(sentences):
    """
    Builds a vocabulary mapping from word to index based on the sentences.
    Returns vocabulary mapping and inverse vocabulary mapping.
    """
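    # Extract the words (a dictionary of each word and its number of occurrences)
    word_counts = Counter(itertools.chain(*sentences))
    # Store the words in a list, most frequent first
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    # Assign an index to each word and store the mapping as a dictionary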
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    return [vocabulary, vocabulary_inv]


### 1-6 Building the Input and Label Data

In [20]:
def build_input_data(sentences, labels, vocabulary):
    """
    Maps sentences and labels to vectors based on a vocabulary.
    """
    x = np.array([[vocabulary[word] for word in sentence] for sentence in sentences])  # map each word to its index
    y = np.array(labels)
    return [x, y]
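
To see how padding, vocabulary building, and index mapping fit together, here is a compact sketch on two toy sentences. It re-implements the steps inline instead of calling the functions above, so it can run on its own; the sentences are made up for illustration.

```python
from collections import Counter
import itertools
import numpy as np

sentences = [["a", "good", "film"], ["bad"]]

# Pad to the longest sentence with the empty string, as pad_sentences does.
max_len = max(len(s) for s in sentences)
padded = [s + [""] * (max_len - len(s)) for s in sentences]

# Most frequent token first, as build_vocab does.
counts = Counter(itertools.chain(*padded))
vocabulary_inv = [word for word, _ in counts.most_common()]
vocabulary = {word: i for i, word in enumerate(vocabulary_inv)}

# Replace each word by its index, as build_input_data does.
x = np.array([[vocabulary[w] for w in s] for s in padded])
print(vocabulary[""], x.shape)
```

Because the padding token is the most frequent one, it receives index 0, which matches the trailing zeros in the `x_train` output further down.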

### 1-7 Running the Preprocessing

In [21]:
"""
Loads and preprocessed data for the MR dataset.
Returns input vectors, labels, vocabulary, and inverse vocabulary.
"""

# Preprocess the data
sentences, labels = load_data_and_labels()
sentences_padded = pad_sentences(sentences)
vocabulary, vocabulary_inv = build_vocab(sentences_padded)
x, y = build_input_data(sentences_padded, labels, vocabulary)

vocab_size = len(vocabulary)

# Shuffle the data
np.random.seed(10)
shuffle_indices = np.random.permutation(np.arange(len(y)))
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices]

# Split into training and test data
# Of the 10,662 examples in total, 9,662 are used for training and 1,000 for testing
x_train, x_dev = x_shuffled[:-1000], x_shuffled[-1000:]
y_train, y_dev = y_shuffled[:-1000], y_shuffled[-1000:]

sentence_size = x_train.shape[1]

print ('Train/Dev split: %d/%d' % (len(y_train), len(y_dev)))
print ('train shape:', x_train.shape)
print ('dev shape:', x_dev.shape)
print ('vocab_size', vocab_size)
print ('sentence max words', sentence_size)

Train/Dev split: 9662/1000
train shape: (9662, 57)
dev shape: (1000, 57)
vocab_size 20399
sentence max words 57


In [22]:
x_train

array([[    7, 17145,    50, ...,     0,     0,     0],
       [  692,   861,   177, ...,     0,     0,     0],
       [  105,     9,  1498, ...,     0,     0,     0],
       ..., 
       [    7,  5294,    11, ...,     0,     0,     0],
       [    7,    14,   101, ...,     0,     0,     0],
       [  104,  1516,     1, ...,     0,     0,     0]])

In [23]:
vocabulary

{'': 0,
 ',': 1,
 'the': 2,
 "'": 3,
 'a': 4,
 'and': 5,
 'of': 6,
 'b': 7,
 'to': 8,
 'is': 9,
 "'s": 10,
 'it': 11,
 'that': 12,
 'in': 13,
 'as': 14,
 'but': 15,
 'film': 16,
 'with': 17,
 'for': 18,
 'movie': 19,
 'its': 20,
 'this': 21,
 'you': 22,
 'an': 23,
 'be': 24,
 "b'a": 25,
 'on': 26,
 "n't": 27,
 "b'the": 28,
 'by': 29,
 'not': 30,
 'are': 31,
 'about': 32,
 'more': 33,
 'has': 34,
 'one': 35,
 'like': 36,
 'at': 37,
 'than': 38,
 'from': 39,
 'all': 40,
 'have': 41,
 'his': 42,
 'so': 43,
 'or': 44,
 'story': 45,
 'if': 46,
 'i': 47,
 'out': 48,
 'who': 49,
 'too': 50,
 'up': 51,
 'into': 52,
 'good': 53,
 'there': 54,
 'just': 55,
 'what': 56,
 'does': 57,
 'most': 58,
 'much': 59,
 'comedy': 60,
 'no': 61,
 'time': 62,
 'will': 63,
 'even': 64,
 'can': 65,
 "b'": 66,
 'well': 67,
 'some': 68,
 'he': 69,
 'characters': 70,
 'only': 71,
 'little': 72,
 'way': 73,
 'their': 74,
 'director': 75,
 'funny': 76,
 'do': 77,
 'make': 78,
 'been': 79,
 'they': 80,
 'enough': 81,

## 2 Defining the Convolutional Neural Network

### 2-1 Importing Libraries

In [24]:
import tensorflow as tf
import numpy as np

from sklearn.utils import shuffle
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1234)
random_state = 42

### 2-2 Defining the CNN Class

In [25]:
class TextCNN(object):
    """
    A CNN for text classification.
    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
    """
    def __init__(
      self, sequence_length, num_classes, vocab_size,
      embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):

        # Placeholders for the input, the output, and dropout
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")

        # L2 regularization loss (a device to help training generalize; explained below)
        l2_loss = tf.constant(0.0)

        # Embedding layer
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W = tf.Variable(
                tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
                name="W")
            self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)  # tf.nn.embedding_lookup is explained below
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)  # add a channel dimension of size 1 for conv2d

        # Create a convolution layer and a max-pooling layer for each filter size
        pooled_outputs = []
        for i, filter_size in enumerate(filter_sizes):
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                # Convolution layer
                filter_shape = [filter_size, embedding_size, 1, num_filters]
                
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                conv = tf.nn.conv2d(
                    self.embedded_chars_expanded,
                    W,
                    strides=[1, 1, 1, 1],
                    padding="VALID",
                    name="conv")
                
                # Apply the nonlinearity (activation function)
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                # Max-pool over the outputs
                pooled = tf.nn.max_pool(
                    h,
                    ksize=[1, sequence_length - filter_size + 1, 1, 1],
                    strides=[1, 1, 1, 1],
                    padding='VALID',
                    name="pool")
                pooled_outputs.append(pooled)

        # Combine the pooled features
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(pooled_outputs, 3) 
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

        # Apply dropout (explained below)
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

        # Define the final layer
        with tf.name_scope("output"):
            W = tf.get_variable(
                "W",
                shape=[num_filters_total, num_classes],
                initializer=tf.contrib.layers.xavier_initializer())
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")

        # Softmax cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

        # Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
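
The tensor shapes inside the convolution and pooling loop follow from simple arithmetic. The sketch below works them out for the values used in this notebook (sentence length 57, filter sizes 3, 4, 5, and 128 filters per size), assuming `"VALID"` padding and stride 1 as in the class above; the batch dimension is omitted.

```python
sequence_length = 57          # sentence length reported by the preprocessing step
filter_sizes = [3, 4, 5]
num_filters = 128

conv_heights = []
for filter_size in filter_sizes:
    # "VALID" padding with stride 1: one output row per full window of words.
    conv_height = sequence_length - filter_size + 1
    conv_heights.append(conv_height)
    # conv output (height, width, channels) -> pooled output
    print(filter_size, (conv_height, 1, num_filters), "->", (1, 1, num_filters))

# Concatenating the pooled outputs along the channel axis gives the flat feature size.
num_filters_total = num_filters * len(filter_sizes)
print(num_filters_total)
```

The pooling window `ksize=[1, sequence_length - filter_size + 1, 1, 1]` spans the whole convolution output, so each filter is reduced to a single value.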

#### Supplement 1: tf.nn.embedding_lookup

<img src='https://4.bp.blogspot.com/-G43l7AMmApM/WAXExL3OJNI/AAAAAAAADgY/JHaBo4Jo14AUFdmpU5nkoUzp8ZbFKumzwCLcB/s1600/run2.png'>

tf.nn.embedding_lookup(params, ids) returns the rows of params at the indices given in ids. For example, with ids=[2,1] it fetches params[2] and params[1], so the return value is [params[2], params[1]].
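
For a single embedding matrix, the same row selection can be written with NumPy integer indexing. This is an analogy to make the behavior concrete, not the TensorFlow implementation; the values in `params` are made up.

```python
import numpy as np

params = np.array([[0.0, 0.1],    # row 0
                   [1.0, 1.1],    # row 1
                   [2.0, 2.1]])   # row 2
ids = np.array([2, 1])

looked_up = params[ids]           # rows 2 and 1, in that order
print(looked_up.tolist())

# A batch of index sequences gains one trailing embedding dimension.
batch_ids = np.array([[2, 1], [0, 0]])
print(params[batch_ids].shape)
```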

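#### Supplement 2: L2 Regularization
Regularization is one way to prevent overfitting. Overfitting occurs when the network fits the training data too closely, including its noise, and it becomes more likely as the network's degrees of freedom (the number of weights) increase. However, degrees of freedom translate directly into expressive power, so reducing them is not a desirable way to address overfitting. Regularization is used instead to constrain the weights.<br>
Concretely, a term proportional to the sum of squared weights is added to the error, and the total is minimized.
$$E^{'}=E+\lambda\|\boldsymbol{w}\|^2$$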
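
A small numeric sketch of the penalized loss. `tf.nn.l2_loss`, which the class above uses, computes half the sum of squares, so the sketch follows that convention; the weight vector, the data loss, and the value of lambda are hypothetical.

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])
data_loss = 0.30          # hypothetical value of E
lam = 0.01                # hypothetical regularization strength

l2_term = np.sum(w ** 2) / 2          # same convention as tf.nn.l2_loss
total_loss = data_loss + lam * l2_term
print(l2_term, total_loss)
```

Note that the notebook instantiates the model with `l2_reg_lambda=0.0`, so the penalty term contributes nothing in the run shown here.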

#### Supplement 3: Dropout
tf.nn.dropout(input data, keep probability, noise_shape=None, seed=None, name=None)<br>
Dropout prevents overfitting by deactivating a fraction of the input units at random during training.<br>
https://qiita.com/mine820/items/db31a11bbe0fc4159199
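
A minimal NumPy sketch of what dropout does, assuming the usual "inverted" form in which surviving values are scaled by 1 / keep probability so that the expected activation stays the same. The function name `dropout` and the input are illustrative only.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: zero each element with probability 1 - keep_prob, rescale the rest."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.ones((2, 8))
dropped = dropout(x, keep_prob=0.5, rng=rng)    # training: each value is 0.0 or 2.0
unchanged = dropout(x, keep_prob=1.0, rng=rng)  # evaluation: nothing is dropped
print(dropped)
```

This mirrors how the training loop feeds `dropout_keep_prob` as 0.5 during training and 1.0 when evaluating on the held-out data.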

In [26]:
cnn = TextCNN(
            sequence_length=x_train.shape[1],
            num_classes=y_train.shape[1],
            vocab_size=vocab_size,
            embedding_size=128,
            filter_sizes=[3,4,5],
            num_filters=128,
            l2_reg_lambda=0.0
            )



### 2-3 Defining the Training Process

In [27]:
global_step = tf.Variable(0, name="global_step", trainable=False)
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(cnn.loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)



### 2-4 Running the Training

In [28]:
n_epochs = 30
batch_size = 100
n_batches = x_train.shape[0]//batch_size

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

print("学習開始")    # "Training started"
for epoch in range(n_epochs):
    x_train, y_train = shuffle(x_train, y_train, random_state=random_state)
    current_step = tf.train.global_step(sess, global_step)
    
    for i in range(n_batches):
       
        start = i * batch_size
        end = start + batch_size
        
        x_batch = x_train[start:end]
        y_batch = y_train[start:end]
        
        feed_dict = {
              cnn.input_x: x_batch,
              cnn.input_y: y_batch,
              cnn.dropout_keep_prob: 0.5,
            }
        
        _, step, loss, accuracy = sess.run(
                [train_op, global_step,  cnn.loss, cnn.accuracy],
                feed_dict)
       
    feed_dict = {
              cnn.input_x: x_dev,
              cnn.input_y: y_dev,
              cnn.dropout_keep_prob: 1.0,
            }
    step, loss, accuracy = sess.run(
                [global_step, cnn.loss, cnn.accuracy],
                feed_dict)
    print(" epoch {}, step {}, loss {:g}, acc {:g}".format(epoch + 1, step, loss, accuracy))

学習開始


KeyboardInterrupt: 
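
The mini-batch indexing in the training loop can be checked with plain arithmetic. The sketch below uses the training-set size and batch size from this notebook and shows that floor division leaves some examples out of each epoch:

```python
n_samples, batch_size = 9662, 100
n_batches = n_samples // batch_size

# (start, end) pairs exactly as computed inside the inner training loop.
slices = [(i * batch_size, i * batch_size + batch_size) for i in range(n_batches)]
leftover = n_samples - slices[-1][1]
print(n_batches, slices[0], slices[-1], leftover)
```

Since the training data is reshuffled at the start of every epoch, the examples that fall outside the last batch are not necessarily the same ones each time.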