### Multi-KernelSize CNN & GRU Char-Level Text Classification

##### Core Idea:
1. The data is encoded with a deterministic mapping and is not tokenized,  
   so character-level text classification is used to solve the problem.
2. The model combines a multi-kernel-size CNN with a GRU, chosen after comparing several methods from the papers listed under Reference.
3. All the code is kept in one file to keep it clean and readable.
4. To speed up training, the CNN uses small kernel sizes and the number of training epochs is limited.
5. `tf.train.Saver` was used during the experiment stage; in the submitted version it is switched off.

##### Reference:
1. Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
2. Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level convolutional networks for text classification. 
   In Advances in neural information processing systems (pp. 649-657).
3. Lai, S., Xu, L., Liu, K., & Zhao, J. (2015, January). 
   Recurrent Convolutional Neural Networks for Text Classification. In AAAI (Vol. 333, pp. 2267-2273).

##### Model Explanation:

1. The input is the one-hot encoding of a document, with 3 dimensions: (batch_size, sentence_length, one_hot_length).
2. The first CNN layer uses kernel sizes 7 and 5 in parallel to extract different character n-grams.
3. The higher CNN layers extract higher levels of abstraction.
4. The max-pooling layers down-sample without overlap; the code below uses pool size 2 with stride 2.
5. Batch normalization is applied after each CNN layer.
6. A GRU is used instead of flattening the CNN output, on the assumption that it helps the model learn positional information in the sentence.
7. After the GRU layer, a dropout layer is used to reduce overfitting.
8. The batch size is 128, the validation set has 1600 samples, the learning rate is 0.001, and the Adam optimizer is used.
   The dropout layer is switched between training and inference with a training flag (`is_train_flag`).
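
To make the down-sampling concrete, here is a minimal sketch that traces the number of time steps through the stack, following the layer sizes used in `char_cnn_classifier` further down (pool size 2, stride 2). The helper names and the input length of 400 are assumptions for illustration only; the real length is whatever `global_max_len` turns out to be for the data.

```python
def conv_same(length):
    """'same' padding with stride 1 keeps the sequence length."""
    return length


def conv_valid(length, kernel_size):
    """'valid' padding with stride 1 drops kernel_size - 1 positions."""
    return length - kernel_size + 1


def pool_valid(length, pool_size=2, stride=2):
    """Non-overlapping 'valid' max-pooling."""
    return (length - pool_size) // stride + 1


def trace_lengths(max_len):
    """Return (stage name, time steps, channels) for each stage of the CNN."""
    stages = [("input", max_len, 26)]
    length = conv_same(max_len)
    stages.append(("conv1 k7+k5 concat", length, 64 + 64))
    length = pool_valid(length)
    stages.append(("pool1", length, 128))
    length = conv_same(length)
    stages.append(("conv2 k5+k3 concat", length, 128 + 128))
    length = pool_valid(length)
    stages.append(("pool2", length, 256))
    length = conv_valid(length, 3)
    stages.append(("conv3 k3", length, 256))
    length = pool_valid(length)
    stages.append(("pool3", length, 256))
    length = conv_valid(length, 3)
    stages.append(("conv4 k3", length, 256))
    length = pool_valid(length)
    stages.append(("pool4", length, 256))
    return stages


# 400 is a hypothetical maximum sentence length, not the value from the data set.
for name, steps, channels in trace_lengths(400):
    print("%-20s steps=%-4d channels=%d" % (name, steps, channels))
```

With the hypothetical length of 400, the GRU would receive 23 time steps of 256 channels each.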

##### Sample training output:
Epoch 4 Step 241: Loss: 0.358734, Accuracy: 0.906250  
Epoch 4 Step 242: Loss: 0.580555, Accuracy: 0.871795  
Epoch 4 Validate Loss: 0.612117, Validate Accuracy: 0.805658  
Epoch 5 Step 241: Loss: 0.286239, Accuracy: 0.921875  
Epoch 5 Step 242: Loss: 0.487394, Accuracy: 0.897436  
Epoch 5 Validate Loss: 0.703344, Validate Accuracy: 0.800123  
Epoch 6 Step 241: Loss: 0.108257, Accuracy: 0.953125  
Epoch 6 Step 242: Loss: 0.180196, Accuracy: 0.897436  
Epoch 6 Validate Loss: 0.690398, Validate Accuracy: 0.799508  
Epoch 7 Step 241: Loss: 0.135933, Accuracy: 0.968750  
Epoch 7 Step 242: Loss: 0.154102, Accuracy: 0.923077  
Epoch 7 Validate Loss: 0.717810, Validate Accuracy: 0.81734  

##### Training instructions:
1. The model parameters were tuned until the validation accuracy reached 0.817, which suggests the model works.
2. The best checkpoint is included in the folder; you can use it to continue training and to validate.
3. To use the checkpoint, set `restore_train_model = True`; set it to `False` to train from scratch.

In [None]:
import codecs
import pandas as pd
import numpy as np
import tensorflow as tf

""" global parameters """
global_batch_size = 128
global_learning_rate = 0.001
global_validation_switch = True
global_training_proportion = 0.95
global_epoch_num = 10
base_validation_acc = 0.75

""" file paths """
train_x_file = "./.txt"
train_y_file = "./.txt"
test_x_file = "./.txt"
test_y_file = "./.txt"

In [None]:
""" data file handles """
fp_train_x = codecs.open(train_x_file)
fp_train_y = codecs.open(train_y_file)
fp_test_x = codecs.open(test_x_file)

""" tf saver """
restore_train_model = True
save_train_model = False
save_interval = 3
checkpoint_dir = "./saver/"

""" generate y label list """
train_y_lines = fp_train_y.readlines()
whole_train_y = [int(i.strip()) for i in train_y_lines]
whole_train_y = pd.get_dummies(whole_train_y)

""" generate train x """
train_x_lines = fp_train_x.readlines()
test_x_lines = fp_test_x.readlines()
global_sample_num = len(train_x_lines)

tmp_list = list()
for i in train_x_lines:
    tmp_list.append(len(i))
for i in test_x_lines:
    tmp_list.append(len(i))

global_max_len = max(tmp_list) + 1
print("Global_max_sentence_len: ", global_max_len)

##### Each character is represented as a one-hot vector, and every sentence is padded to max_length with all-zero vectors.

In [None]:
""" generate one_hot embedding """
alphabet = 'abcdefghijklmnopqrstuvwxyz'
char_one_hot = dict()
for i in range(len(alphabet)):
    letter = [0 for _ in range(len(alphabet))]
    letter[i] = 1
    char_one_hot[alphabet[i]] = letter
char_one_hot["#"] = [0 for _ in range(len(alphabet))]
# print(char_one_hot)

""" generate one_hot train set x """
whole_train_x = list()
for i in train_x_lines:
    tmp_vector = list()
    for j in i.strip():
        tmp_vector.append(char_one_hot[j])
    for j in range(global_max_len - len(i.strip())):
        tmp_vector.append(char_one_hot["#"])
    whole_train_x.append(tmp_vector)

""" generate one_hot test set x """
whole_test_x = list()
for i in test_x_lines:
    tmp_vector = list()
    for j in i.strip():
        tmp_vector.append(char_one_hot[j])
    for j in range(global_max_len - len(i.strip())):
        tmp_vector.append(char_one_hot["#"])
    whole_test_x.append(tmp_vector)

""" convert x and y into np"""
whole_train_x = np.asarray(whole_train_x)
print("Train X shape: ", np.shape(whole_train_x))
whole_train_y = np.array(whole_train_y)
print("Train Y shape: ", np.shape(whole_train_y))
whole_test_x = np.asarray(whole_test_x)
print("Test X shape: ", np.shape(whole_test_x))

fp_train_x.close()
fp_train_y.close()
fp_test_x.close()
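
As a small worked example of the encoding above, the following sketch encodes one short string and pads it with all-zero rows. The function name `one_hot_encode` and the toy input are assumptions for illustration; the notebook itself builds the same structure through the `char_one_hot` dictionary.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def one_hot_encode(text, max_len, alphabet=ALPHABET):
    """Encode text as max_len rows of len(alphabet) values; padding rows are all zeros."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in text:
        row = [0] * len(alphabet)
        row[index[ch]] = 1  # raises KeyError for characters outside the alphabet
        rows.append(row)
    # Pad with all-zero rows up to max_len (the role of "#" in the notebook).
    rows.extend([0] * len(alphabet) for _ in range(max_len - len(text)))
    return rows


encoded = one_hot_encode("cab", max_len=5)
print(len(encoded), len(encoded[0]))  # 5 26
# Position of the 1 in each row, or None for a padding row.
print([row.index(1) if 1 in row else None for row in encoded])  # [2, 0, 1, None, None]
```

Stacking such matrices for all sentences gives the (samples, max_length, 26) array that the model takes as input.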

##### Each layer is printed to help you understand the model structure.

In [None]:
def generate_batch_index(sample_size, batch_size, num_iter=1, is_shuffle=True):
    index = list(range(sample_size))
    for j in range(num_iter):
        if is_shuffle:
            np.random.shuffle(index)
        for i in range(int(sample_size / batch_size) + (1 if sample_size % batch_size else 0)):
            yield index[i * batch_size:(i + 1) * batch_size]


def char_cnn_classifier(x, istrain):
    with tf.variable_scope('char_cnn_classifier', reuse=tf.AUTO_REUSE):
        print("Input layer:", x)

        layer_1_k7 = tf.layers.conv1d(x, 64, 7, padding='same', strides=1)
        print("Conv layer_1_k7:", layer_1_k7)
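        # NOTE: `training` is not passed, so batch normalization runs with its default
        # training=False here (and in the layers below) and uses its moving statistics.
        layer_1_k7 = tf.layers.batch_normalization(layer_1_k7)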

        layer_1_k5 = tf.layers.conv1d(x, 64, 5, padding='same', strides=1)
        print("Conv layer_1_k5:", layer_1_k5)
        layer_1_k5 = tf.layers.batch_normalization(layer_1_k5)

        layer_1_concat = tf.concat([layer_1_k7, layer_1_k5], axis=2)
        print("Concat layer_1:", layer_1_concat)

        x = tf.layers.max_pooling1d(layer_1_concat, 2, 2, padding='valid')
        print("Pooling layer 1:", x)

        layer_2_k5 = tf.layers.conv1d(x, 128, 5, padding='same', strides=1)
        print("Conv layer_2_k5:", layer_2_k5)
        layer_2_k5 = tf.layers.batch_normalization(layer_2_k5)

        layer_2_k3 = tf.layers.conv1d(x, 128, 3, padding='same', strides=1)
        print("Conv layer_2_k3:", layer_2_k3)
        layer_2_k3 = tf.layers.batch_normalization(layer_2_k3)

        layer_2_concat = tf.concat([layer_2_k5, layer_2_k3], axis=2)
        print("Concat layer_2:", layer_2_concat)

        x = tf.layers.max_pooling1d(layer_2_concat, 2, 2, padding='valid')
        print("Pooling layer 2:", x)

        x = tf.layers.conv1d(x, 256, 3, padding='valid', strides=1, activation=tf.nn.relu)
        print("Conv layer 3:", x)
        x = tf.layers.batch_normalization(x)

        x = tf.layers.max_pooling1d(x, 2, 2, padding='valid')
        print("Pooling layer 3:", x)

        x = tf.layers.conv1d(x, 256, 3, padding='valid', strides=1, activation=tf.nn.relu)
        print("Conv layer 4:", x)
        x = tf.layers.batch_normalization(x)

        x = tf.layers.max_pooling1d(x, 2, 2, padding='valid')
        print("Pooling layer 4:", x)

        """ Here we use a GRU instead of flattening the CNN output """
        gru_cell = tf.nn.rnn_cell.GRUCell(256)
        # dynamic_rnn returns (outputs, final_state); only the final GRU state is kept
        _, x = tf.nn.dynamic_rnn(gru_cell, x, dtype=tf.float32)
        x = tf.layers.dropout(x, rate=0.5, training=istrain)

        """ Flatten for CNN: the original Char-CNN approach, not used here """
        # x = tf.layers.flatten(x)
        # print("Flatten layer 5:", x)
        #
        # x = tf.layers.dense(x, 1024)
        # x = tf.layers.dropout(x, rate=0.5, training=istrain)
        # print("Dense layer 6:", x)
        #
        # x = tf.layers.dense(x, 1024)
        # x = tf.layers.dropout(x, rate=0.5, training=istrain)
        # print("Dense layer 7:", x)
        #
        x = tf.layers.dense(x, 12)
        print("FC layer:", x)

    return x
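
The batching logic of `generate_batch_index` can be checked in isolation. The sketch below is a pure-Python variant for illustration: it swaps `np.random.shuffle` for the standard library's `random`, and the name `batch_indices` and the `seed` parameter are assumptions that do not appear in the notebook.

```python
import math
import random


def batch_indices(sample_size, batch_size, shuffle=True, seed=None):
    """Yield lists of sample indices; the last batch may be smaller than batch_size."""
    index = list(range(sample_size))
    if shuffle:
        random.Random(seed).shuffle(index)
    # Ceiling division: one extra batch when sample_size is not a multiple of batch_size.
    num_batches = math.ceil(sample_size / batch_size)
    for i in range(num_batches):
        yield index[i * batch_size:(i + 1) * batch_size]


batches = list(batch_indices(10, 4, seed=0))
print([len(b) for b in batches])  # [4, 4, 2]
# Every sample index appears exactly once per pass.
print(sorted(i for b in batches for i in b) == list(range(10)))  # True
```

A smaller final batch is consistent with the training log, where the accuracies at Step 242 (for example 0.871795, which equals 34/39) look like fractions of a 39-sample batch rather than a 128-sample one.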


In [None]:
sentence_input_int = tf.placeholder(tf.float32, shape=[None, global_max_len, 26])
sentence_label_input = tf.placeholder(tf.float32, shape=[None, 12])
is_train_flag = tf.placeholder(tf.bool)

dis_out = char_cnn_classifier(sentence_input_int, is_train_flag)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=dis_out, labels=sentence_label_input))

correct_prediction = tf.equal(tf.argmax(dis_out, 1), tf.argmax(sentence_label_input, 1))
accuracy_op = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
disc_train_op = tf.train.AdamOptimizer(learning_rate=global_learning_rate).minimize(loss)

saver = tf.train.Saver()
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    """ restore the train saver"""
    if restore_train_model is True:
        checkpoint = tf.train.get_checkpoint_state(checkpoint_dir)
        if checkpoint and checkpoint.model_checkpoint_path:
            print("###INFO### Restore train model from", checkpoint.model_checkpoint_path)
            saver.restore(sess, checkpoint.model_checkpoint_path)

    if global_validation_switch is False:
        global_training_proportion = 1.0
    print("Global_training_proportion: ", global_training_proportion)

    for global_idx in range(global_epoch_num):
        iter_tmp = 0
        """ batch training """
        for batch_index in generate_batch_index(int(global_sample_num*global_training_proportion),
                                                global_batch_size, num_iter=1, is_shuffle=True):
            batch_x = whole_train_x[batch_index]
            batch_y = whole_train_y[batch_index]
            batch_y = np.reshape(batch_y, [len(batch_index), 12])
            iter_tmp += 1

            feed_dict = {sentence_input_int: batch_x, sentence_label_input: batch_y, is_train_flag: True}
            _, batch_loss, batch_acc = sess.run([disc_train_op, loss, accuracy_op], feed_dict=feed_dict)
            print('Epoch %i Step %i: Loss: %f, Accuracy: %f' % (global_idx, iter_tmp, batch_loss, batch_acc))

        if global_validation_switch is True:
            """ epoch validation """
            batch_x = whole_train_x[int(global_sample_num*global_training_proportion):]
            batch_y = whole_train_y[int(global_sample_num*global_training_proportion):]
            batch_y = np.reshape(batch_y, [len(batch_y), 12])
            print(len(batch_y))
            iter_tmp += 1
            feed_dict = {sentence_input_int: batch_x, sentence_label_input: batch_y, is_train_flag: False}
            validation_loss, validation_acc = sess.run([loss, accuracy_op], feed_dict=feed_dict)
            print('Epoch %i Validate Loss: %f, Validate Accuracy: %f' % (global_idx, validation_loss, validation_acc))

        """ save the train model"""
        if (save_train_model is True) and (global_idx % save_interval == 0):
            saver.save(sess, checkpoint_dir + 'model_saver.ckpt', global_step=global_idx)

        # validation_acc only exists when validation is switched on
        if global_validation_switch is True and validation_acc > base_validation_acc:
            """ update the max validation accuracy value """
            base_validation_acc = validation_acc
            """ predict on test set x and write into ytest.txt """
            feed_dict = {sentence_input_int: whole_test_x, is_train_flag: False}
            predict_test_x = sess.run(dis_out, feed_dict=feed_dict)
            predict_test_x = np.argmax(predict_test_x, 1)
            print(predict_test_x)
            fp_test_y = codecs.open(test_y_file, "w")
            for i in predict_test_x:
                fp_test_y.write(str(i)+"\n")
            fp_test_y.close()
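
For readers who want to see what the `loss` and `accuracy_op` tensors compute, here is a minimal pure-Python sketch of softmax cross-entropy with one-hot labels and argmax accuracy. It uses 3 classes instead of 12 for brevity, and the function names and toy values are assumptions for illustration, not part of the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    probs = softmax(logits)
    # Fine for these small toy logits; extreme logits could underflow to log(0).
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))


def batch_metrics(batch_logits, batch_labels):
    """Return (mean loss, accuracy) over a batch, like reduce_mean in the graph."""
    pairs = list(zip(batch_logits, batch_labels))
    losses = [cross_entropy(l, y) for l, y in pairs]
    hits = [l.index(max(l)) == y.index(max(y)) for l, y in pairs]
    return sum(losses) / len(losses), sum(hits) / len(hits)


# Toy batch: the first prediction is correct, the second is not.
logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
labels = [[1, 0, 0], [0, 1, 0]]
loss, acc = batch_metrics(logits, labels)
print(acc)  # 0.5
```

The wrongly classified second sample contributes the larger share of the mean loss, which is why loss and accuracy move together in the training log without being the same quantity.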
