## Training a QA-Robot with a Neural Network
In this part, we train a neural network model for the **QA-robot** in competition 1. The part is organized into three sections: (a) the word embedding model, (b) generating the training data and computing the features, and (c) the graph and the session of our neural network.

### Word embedding
To process natural-language sentences, we have to encode each Chinese word before training the neural network model. We chose FastText as the word embedding model. All the sentences in the training data were segmented into words by jieba and converted into a format that can be read by the gensim FastText model.

In [None]:
# Process the training data as in the TA's example code and save it in *.npy format.
# Read in the segmented word files and convert them into the format expected by gensim.
import numpy as np
# allow_pickle=True is needed on NumPy >= 1.16.3 when the arrays hold nested Python lists
cut_programs = np.load('cut_Programs.npy', allow_pickle=True)
cut_Question = np.load('cut_Questions.npy', allow_pickle=True)
print(sum([len(p) for p in cut_programs]))

with open('split_word_Programs.txt', 'w', encoding='utf-8') as output:
    for i, program in enumerate(cut_programs):
        episodes = len(program)
        print('Processing the %d programs' % i)
        for episode in range(episodes):
            for line in cut_programs[i][episode]:
                line_space = " ".join(line)
                output.write(line_space)
                output.write('\n')

with open('split_word_Questions.txt', 'w', encoding='utf-8') as output:
    n = len(cut_Question)

    for i in range(n):
        for j in range(6):
            for line in cut_Question[i][j]:
                line_space = " ".join(line)
                output.write(line_space)
                output.write('\n')
                
# Train a FastText model
from gensim.models import word2vec, FastText

vec_size = 128
sentences = word2vec.LineSentence('split_word_Programs.txt')
# gensim < 4.0 API; in gensim >= 4.0 the `size` argument is named `vector_size`
model = FastText(sentences, size=vec_size, window=5, min_count=1)
model.save("fasttext"+str(vec_size)+".model")

### Generating training data
To train a QA-robot model on the corpus from the normal programs, the data must be transformed into a format of one question and six answer candidates. We define two consecutive sentences as a correct question-answer pair. Adding five other sentences selected at random completes a QA sample. Because the code is lengthy, it is not shown in this report; please refer to `gen_data.py` and `gen_data_infer.py` for the implementation details.
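
The sampling scheme described above can be sketched as follows. This is only a minimal illustration, not the code in `gen_data.py`: the function name `make_qa_sample`, its arguments, and the one-hot label layout are assumptions made for this example.

```python
import random


def make_qa_sample(sentences, q_idx, n_candidates=6, rng=random):
    """Build one (question, candidates, label) sample (hypothetical helper).

    sentences[q_idx] is the question and sentences[q_idx + 1] its correct answer;
    the other candidates are drawn without replacement from the remaining sentences.
    """
    if not 0 <= q_idx < len(sentences) - 1:
        raise ValueError("q_idx must be followed by at least one sentence")
    # Exclude the question and its true answer from the distractor pool
    pool = [i for i in range(len(sentences)) if i not in (q_idx, q_idx + 1)]
    distractors = rng.sample(pool, n_candidates - 1)
    candidate_ids = distractors + [q_idx + 1]
    rng.shuffle(candidate_ids)
    # One-hot label marking the position of the true answer
    label = [1.0 if i == q_idx + 1 else 0.0 for i in candidate_ids]
    return sentences[q_idx], [sentences[i] for i in candidate_ids], label
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which helps when comparing models trained on the same generated data.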

### Computing the features
The main problem in feature extraction is how to deal with variable-length sentences. We therefore tried four types of features that turn the word vectors of a sentence into a fixed-size representation:
1. Padding the word vectors to the maximum sentence length in the training data. The data are padded with **wrapping values** instead of **zeros** to avoid numerical imbalance. The dimension of this feature is [*MAX_LEN*, *VECTOR_SIZE*].
2. Averaging all the word vectors (along the axis of words). The dimension of this feature is [*VECTOR_SIZE*].
3. Calculating the standard deviation of all the word vectors of the sentence. The dimension of this feature is also [*VECTOR_SIZE*].
4. Calculating the Word Mover's Distance (WMD) between the question and the answer candidate. The dimension of this feature is [*1*].

The following code is part of `gen_data.py` and `gen_data_infer.py`. Note that `self.w2vembedd2` is a FastText model and `self.w2vembedd` is a Doc2Vec model, which was not used in the final version.

Several combinations of the features and the corresponding models were tested; more details and results are discussed in the following parts.
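
As a rough illustration of the first three feature types, the NumPy sketch below pads a sequence of word vectors by wrapping and computes the mean and standard deviation. The helper names `wrap_pad` and `mean_std_features` are hypothetical, and reading "wrapping values" as cyclic repetition of the sentence is an assumption. The WMD feature is left out because it needs a trained embedding model (gensim exposes it as `wmdistance` on the keyed vectors).

```python
import numpy as np


def wrap_pad(wv_seq, max_len):
    """Repeat the word vectors cyclically until the sequence has max_len rows.

    wv_seq has shape [n_words, vec_size]; longer sequences are truncated.
    """
    n_words, vec_size = wv_seq.shape
    if n_words == 0:
        # Nothing to wrap: fall back to zeros (assumption for the empty case)
        return np.zeros((max_len, vec_size))
    idx = np.arange(max_len) % n_words
    return wv_seq[idx]


def mean_std_features(wv_seq):
    """Return the per-dimension mean and standard deviation over the words."""
    return wv_seq.mean(axis=0), wv_seq.std(axis=0)
```

Wrapping keeps the statistics of the padded positions close to those of the real words, whereas zero padding would pull averages toward zero for short sentences.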

In [None]:
def sentence2vec(self, sentence_idx):
    # Default to zero vectors so the output shape is fixed even when no word is found
    sentvec = np.zeros(VECTOR_SIZE)   # Doc2Vec-based average (unused in the final version)
    sent = np.zeros(VECTOR_SIZE)      # mean of the FastText word vectors
    sent_var = np.zeros(VECTOR_SIZE)  # std. of the FastText word vectors
    word_c2 = word_c = 0
    wv_seq = np.empty([0, VECTOR_SIZE])
    for w in self.sentences[sentence_idx]:
        if MAX_LEN <= word_c:
            # Keep at most MAX_LEN word vectors per sentence
            break
        if w.isalnum():
            try:
                wv_seq = np.vstack([wv_seq, self.w2vembedd2.wv[[w]]])
                word_c += 1
            except KeyError:
                pass
            try:
                sentvec += self.w2vembedd.wv[w]
                word_c2 += 1
            except KeyError:
                pass
    if word_c > 0:
        sent = np.mean(wv_seq, 0)
        sent_var = np.std(wv_seq, 0)
    if word_c2 > 0:
        sentvec = sentvec / word_c2
    return np.concatenate([sentvec, sent, sent_var], 0), wv_seq

### The graph and the session of our neural network
We assume that each type of feature preserves its own characteristics, so each type is processed by a separate module. The following figures show the overall scheme of our model:

|  ![alt text](fig1.png "Title")  | ![alt text](fig2.png "Title")  |
|--------|--------|

![alt text](fig3.png "Title")  
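
The `question` and `answer` modules in the code of this section apply two 1x1 convolutions to the padded word vectors, one with ReLU and one with a sigmoid gate, multiply them, and average over the words. The NumPy sketch below shows that computation for a single sentence. It is only an illustration: the name `gated_mean_pool` and the weight arguments are hypothetical, and biases are omitted for brevity.

```python
import numpy as np


def gated_mean_pool(wv_seq, w_feat, w_gate):
    """Gated average pooling over the words of one sentence.

    wv_seq: [n_words, vec_size]; w_feat and w_gate: [vec_size, n_filters].
    A 1x1 convolution acts on each word independently, so it reduces to a
    matrix product here.
    """
    feat = np.maximum(wv_seq @ w_feat, 0.0)          # ReLU features per word
    gate = 1.0 / (1.0 + np.exp(-(wv_seq @ w_gate)))  # sigmoid gate in (0, 1)
    return (feat * gate).mean(axis=0)                # average over the words
```

The gate lets the model down-weight uninformative words before pooling, so the pooled vector is not a plain average of all word features.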

In [None]:
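import tensorflow as tf  # TensorFlow 1.x API (tf.layers, tf.contrib)

# `denselayer` is a helper that is not shown in this report.
def question(q_sentence, training_flag):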
    l1_regularizer = tf.contrib.layers.l1_regularizer(
        scale=1e-5, scope=None)
    frame_out = tf.layers.conv2d(q_sentence, 256, [1,1], kernel_initializer=tf.contrib.layers.xavier_initializer(), activation=tf.nn.relu)
    frame_att = tf.layers.conv2d(q_sentence, 256, [1,1], kernel_initializer=tf.contrib.layers.xavier_initializer(
        ), activation=tf.nn.sigmoid, kernel_regularizer=l1_regularizer)
    frame_out *= frame_att
    pool_size = frame_out.get_shape().as_list()[2]
    frame_out = tf.layers.average_pooling2d(
        frame_out, pool_size=[1,pool_size], strides=1)
    return frame_out


def answer(a_sentence, training_flag):
    l1_regularizer = tf.contrib.layers.l1_regularizer(
        scale=1e-5, scope=None)
    frame_out = tf.layers.conv2d(a_sentence, 256, [1,1], kernel_initializer=tf.contrib.layers.xavier_initializer(), activation=tf.nn.relu)
    frame_att = tf.layers.conv2d(a_sentence, 256, [1,1], kernel_initializer=tf.contrib.layers.xavier_initializer(
        ), activation=tf.nn.sigmoid, kernel_regularizer=l1_regularizer)
    frame_out *= frame_att
    pool_size = frame_out.get_shape().as_list()[2]
    frame_out = tf.layers.average_pooling2d(
        frame_out, pool_size=[1,pool_size], strides=1)
    return frame_out


def similar(logits_q, logits_a, training_flag, p,is_batchnorm=False,name='similarity'):

    qa_pair = tf.concat([logits_q * logits_a,logits_q**2,logits_a**2], -1)
    qa_pair = denselayer(qa_pair, 256, tf.nn.relu, is_batchnorm=is_batchnorm,
                         is_dropout=True, prob=p, is_training=training_flag, repeat=1)
    fc_out = denselayer(qa_pair, 64, tf.nn.relu, is_batchnorm=is_batchnorm,
                        is_dropout=False, prob=1, is_training=training_flag, repeat=1)
    return fc_out

def selection(simi_all, training_flag, p):
    qa_pair = simi_all
    qa_pair = denselayer(qa_pair, 64, tf.nn.selu, is_batchnorm=False,
                         is_dropout=False, prob=1, is_training=training_flag, repeat=1)
    fc_out = denselayer(qa_pair, 6, None, is_batchnorm=False,
                        is_dropout=False, prob=1, is_training=training_flag, repeat=1)
    return fc_out

dim = 1024
def dim_reduce(x_input, name, window=1,dim = 1024):
    with tf.variable_scope(name):
        l1_regularizer = tf.contrib.layers.l1_regularizer(
            scale=1e-5, scope=None)
        x = tf.layers.conv1d(x_input, dim, [
                             window], kernel_initializer=tf.contrib.layers.xavier_initializer(), activation=tf.nn.selu)
        
        xa = tf.layers.conv1d(x, dim, [
                             1], kernel_initializer=tf.contrib.layers.xavier_initializer(), activation=tf.nn.selu)
        x_att = tf.layers.conv1d(x, dim, [1], kernel_initializer=tf.contrib.layers.xavier_initializer(
        ), activation=tf.nn.sigmoid, kernel_regularizer=l1_regularizer)
        x = xa * x_att
        return x

In [None]:
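from gensim import models

# The data modules (gen_dataset, gen_dataset_infer), learning_rate, graph_dir and
# batch_size are not shown in this report.
if __name__ == '__main__':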
    model = models.Word2Vec.load('doc2vec128.model')
    model2 = models.FastText.load('fasttext128.model')
    #model.init_sims(replace=True)
    model2.init_sims(replace=True)
    train_dataset = gen_dataset.dataset(w2v=model, w2v2=model2, val_ratio=0.2)
    test_dataset = gen_dataset_infer.dataset(
        w2v=model, w2v2=model2, val_ratio=0)
    tf.reset_default_graph()
    with tf.Graph().as_default() as g:
        x_input = tf.placeholder(
            tf.float32, [None, 7, gen_dataset.VECTOR_SIZE * 3])
        x_seq = tf.placeholder(
            tf.float32, [None, 7, gen_dataset.MAX_LEN, gen_dataset.VECTOR_SIZE])
        y_label = tf.placeholder(tf.float32, [None, 6])
        sim_tensor = tf.placeholder(tf.float32, [None, 6])
        training_flag = tf.placeholder(tf.bool)
        prob = tf.placeholder(tf.float32)
        
        #module for the all WVs
        simi_mat_red = tf.reshape(question(x_seq[:,0:1,:,:], training_flag) * answer(x_seq[:,1:,:,:], training_flag),[-1,256])
        
        #module for the avg. of WVs
        x1q = dim_reduce(
            x_input[:, 0:1, gen_dataset.VECTOR_SIZE:gen_dataset.VECTOR_SIZE * 2], 'red_1_q')
        x1a = dim_reduce(
            x_input[:, 1:, gen_dataset.VECTOR_SIZE:gen_dataset.VECTOR_SIZE * 2], 'red_1_a')
        
        #module for the std. of WVs
        x3q = dim_reduce(
            x_input[:, 0:1, gen_dataset.VECTOR_SIZE*2:gen_dataset.VECTOR_SIZE * 3], 'red_3_q')
        x3a = dim_reduce(
            x_input[:, 1:, gen_dataset.VECTOR_SIZE*2:gen_dataset.VECTOR_SIZE * 3], 'red_s3_a')

        x = tf.concat([x1q,x1a], 1)
        x3 = tf.concat([x3q,x3a], 1)

        x_question = tf.reshape(tf.tile(x[:, 0:1, :],[1,6,1]),[-1,x.shape[-1]])
        x_ans = tf.reshape(x[:, 1:, :],[-1,x.shape[-1]])
        x3_question = tf.reshape(tf.tile(x3[:, 0:1, :],[1,6,1]),[-1,x3.shape[-1]])
        x3_ans = tf.reshape(x3[:, 1:, :],[-1,x3.shape[-1]])

        a = similar(x_question, x_ans, training_flag, prob)
        b = similar(x3_question, x3_ans, training_flag, prob,is_batchnorm=True)

        sim_exp = tf.reshape(sim_tensor, [-1,1])
        simi_all = tf.concat([simi_mat_red,sim_exp], -1)
        simi_single = tf.reshape(simi_all, [-1, simi_all.shape[-1]])
        single_out = denselayer(simi_single, 64, tf.nn.selu, is_batchnorm=False,
                         is_dropout=False, prob=1, is_training=training_flag, repeat=1)
        single_out = denselayer(single_out, 1, None, is_batchnorm=False,
                         is_dropout=False, prob=1, is_training=training_flag, repeat=1)
        simi_all = tf.reshape(simi_all, [-1, simi_all.shape[-1] * 6])

        out = selection(simi_all, training_flag, prob)
        sigmoid_out = tf.nn.sigmoid(out)
        pred = tf.argmax(sigmoid_out, 1)
        reg_ws = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
        loss_single = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            logits=single_out,  labels=tf.reshape(y_label,[-1,1])))
        loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            logits=out,  labels=y_label))
        loss = tf.reduce_sum(reg_ws) + loss_single + loss

        acc = tf.equal(tf.argmax(y_label, 1), pred)
        acc = tf.reduce_mean(tf.cast(acc, tf.float32))
        optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
        writer = tf.summary.FileWriter(graph_dir, tf.get_default_graph())
        init = tf.global_variables_initializer()
        saver = tf.train.Saver()
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    best_acc = 0
    early_count = 0
    ans = np.load('label.npy')
    with tf.Session(config=config, graph=g) as sess:
        writer = tf.summary.FileWriter(graph_dir, graph=sess.graph)
        sess.run(init)
        for i in range(3000):
            s, label, sim, wv_seq, end_flag = train_dataset.getbatch(
                mode='train', batch_size=batch_size)
            if end_flag != gen_dataset.IS_EPOCH_END:
                lo, _, a = sess.run([loss, optimizer, acc], feed_dict={
                                    x_input: s, y_label: label, sim_tensor: sim, x_seq:wv_seq, training_flag: True, prob: 0.8})

### Results

We found that a batch size that is too small (below 128) makes training unstable. Therefore, we tried batch sizes of 512, 1024, and 2048. To select the input features, we first trained several models, each using only one type of feature. Surprisingly, the average and the standard deviation of the word vectors gave better results. We then tried different combinations of the features to train different models.

At the end of the competition, we selected two results from the deep learning methods. One is from the best parameters; the other is from voting over the results of models trained with different parameters.

| |  Best  | Voting  |
|----|--------|--------|
|Public score|0.612|0.648|
|Private score|0.688|0.692|
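
The report does not specify how the votes were combined. As one possible reading, the sketch below applies a simple majority vote over the candidate indices predicted by several models; the function name `majority_vote` and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions by majority vote (hypothetical helper).

    predictions: list with one entry per model, each a list of predicted
    candidate indices (one per question).
    """
    n_questions = len(predictions[0])
    voted = []
    for q in range(n_questions):
        votes = Counter(model_pred[q] for model_pred in predictions)
        # Among tied candidates, the one voted for by the earliest model wins
        voted.append(votes.most_common(1)[0][0])
    return voted
```

Listing the strongest model first means that it decides any ties, which is a simple way to fall back on the best single result.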