## Train a QA-Robot by Neural Network
In this part, we train a neural network model for the QA-robot in competition 1. It is organized in three sections: (a) the word embedding model, (b) generating the training data and computing the features, and (c) the graph and the session of our neural network.

### Word embedding
To process natural-language sentences, we have to encode each Chinese word before training the neural network model. We chose FastText as the word embedding model. All sentences in the training data were segmented into words with jieba and converted into a format that the gensim FastText model can read.

In [None]:
# The training data are preprocessed as in the TA's example code and saved in *.npy format.
# Read the segmented word files and convert them into the line-per-sentence format used by gensim.
import numpy as np
# allow_pickle=True is needed on recent NumPy versions if the files store nested Python lists
cut_programs = np.load('cut_Programs.npy', allow_pickle=True)
cut_Question = np.load('cut_Questions.npy', allow_pickle=True)
print(sum([len(p) for p in cut_programs]))

# Write one sentence per line, with words separated by spaces
with open('split_word_Programs.txt', 'w', encoding='utf-8') as output:
    for i, program in enumerate(cut_programs):
        print('Processing program %d' % i)
        for episode in program:
            for line in episode:
                output.write(" ".join(line))
                output.write('\n')

with open('split_word_Questions.txt', 'w', encoding='utf-8') as output:
    n = len(cut_Question)
    for i in range(n):
        for j in range(6):
            for line in cut_Question[i][j]:
                output.write(" ".join(line))
                output.write('\n')

# Train a FastText model on the segmented program sentences
from gensim.models import word2vec, doc2vec, FastText

vec_size = 128
sentences = word2vec.LineSentence('split_word_Programs.txt')
# Note: gensim >= 4.0 renamed the `size` argument to `vector_size`
model = FastText(sentences, size=vec_size, window=5, min_count=1)
model.save("fasttext" + str(vec_size) + ".model")

### Generating training data
To train a QA-Robot model on the corpus of ordinary programs, the data must be transformed into a format of 1 question and 6 answer candidates. We treat two consecutive sentences as a correct question-answer pair. Adding 5 other randomly selected sentences as wrong candidates completes one QA sample. Because the code is long and tedious, it is not shown in this report. Please refer to `gen_data.py` and `gen_data_infer.py` for the implementation details.
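The sampling scheme described above can be sketched roughly as follows. This is only an illustration and not the code in `gen_data.py`: the function name `make_qa_sample`, its parameters, and the shuffling of the candidates are assumptions made for this example.

```python
import random

def make_qa_sample(sentences, q_idx, n_candidates=6, rng=random):
    """Build one (question, candidates, label) sample (hypothetical helper).

    The sentence right after the question is taken as the correct answer;
    the remaining candidates are drawn at random from the other sentences.
    """
    question = sentences[q_idx]
    answer_idx = q_idx + 1
    # Wrong candidates may be any sentence except the question and the true answer
    pool = [i for i in range(len(sentences)) if i not in (q_idx, answer_idx)]
    wrong = rng.sample(pool, n_candidates - 1)  # sampled without replacement
    candidates = [sentences[answer_idx]] + [sentences[i] for i in wrong]
    # Shuffle so that the correct answer is not always the first candidate
    order = list(range(n_candidates))
    rng.shuffle(order)
    shuffled = [candidates[i] for i in order]
    label = order.index(0)  # position of the correct answer after shuffling
    return question, shuffled, label
```

Calling `make_qa_sample(sentences, i)` for every index `i` except the last one would yield one training sample per pair of consecutive sentences.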

### Computing the features
The main problem in feature extraction is how to deal with sentences of variable length. We therefore tried four types of features to bring the word vectors of different sentences into a uniform shape:
1. Padding the word vectors to the maximum sentence length in the training data. The data are padded with **wrapping values** instead of **zeros** to avoid numerical imbalance. The dimension of this feature is [MAX_LEN, VECTOR_SIZE].
2. Averaging all the word vectors (along the axis of words). The dimension of this feature is [VECTOR_SIZE].
3. Calculating the standard deviation of all the word vectors of the sentence. The dimension of this feature is also [VECTOR_SIZE].
4. Calculating the Word Mover's Distance (WMD) between the question and the answer candidate. The dimension of this feature is [1].

This code is part of `gen_data.py` and `gen_data_infer.py`. Note that `self.w2vembedd2` is a FastText model and `self.w2vembedd` is a Doc2Vec model, which was not used in the final version.

Several combinations of the features were tested; more details and results are discussed in the following parts.

In [None]:
def sentence2vec(self, sentence_idx):
    # Zero-vector defaults keep the return value well-defined when no word is found
    sentvec = np.zeros(self.w2vembedd.wv.vector_size)  # mean Doc2Vec word vector
    sent = np.zeros(VECTOR_SIZE)                        # mean FastText word vector
    sent_var = np.zeros(VECTOR_SIZE)                    # std of the FastText word vectors
    word_c2 = word_c = 0
    wv_seq = np.empty([0, VECTOR_SIZE])                 # FastText vectors, one row per word
    for w in self.sentences[sentence_idx]:
        if MAX_LEN <= word_c:  # keep at most MAX_LEN words
            break
        if w.isalnum():  # skip punctuation and other non-alphanumeric tokens
            try:
                wv_seq = np.vstack([wv_seq, self.w2vembedd2.wv[[w]]])
                word_c += 1
            except KeyError:
                pass
            try:
                sentvec += self.w2vembedd.wv[w]
                word_c2 += 1
            except KeyError:
                pass
    # Alternative (not used): sentvec = self.w2vembedd.infer_vector(self.sentences[sentence_idx])
    if word_c > 0:
        sent = np.mean(wv_seq, 0)
        sent_var = np.std(wv_seq, 0)
    if word_c2 > 0:
        sentvec = sentvec / word_c2

    return np.concatenate([sentvec, sent, sent_var], 0), wv_seq
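
The excerpt above returns the unpadded word-vector sequence `wv_seq`, so the wrap padding of feature 1 is not visible in it. A minimal sketch of how such padding could look, together with the mean and standard-deviation features, is given below. The helper names `wrap_pad` and `sentence_features` are hypothetical, and the cyclic repetition from the start of the sentence is an assumption; the actual padding in `gen_data.py` may differ.

```python
import numpy as np

def wrap_pad(seq, max_len):
    """Return `max_len` rows by cyclically repeating the rows of `seq`.

    `seq` has shape [n_words, vector_size]. Longer sequences are truncated.
    An empty sequence yields zeros, since there is nothing to wrap.
    """
    n = seq.shape[0]
    if n == 0:
        return np.zeros((max_len, seq.shape[1]))
    idx = np.arange(max_len) % n  # 0, 1, ..., n-1, 0, 1, ...
    return seq[idx]

def sentence_features(seq, max_len):
    """Compute the padded sequence, the mean, and the standard deviation."""
    padded = wrap_pad(seq, max_len)
    if seq.shape[0] == 0:
        zeros = np.zeros(seq.shape[1])
        return padded, zeros, zeros.copy()
    # Mean and std are taken over the original words only, not the padded copies
    return padded, seq.mean(axis=0), seq.std(axis=0)
```

With wrap padding, a short sentence fills the padded positions with its own word vectors, so the padded tensor has a similar value range for short and long sentences, which zero padding would not give.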

### The graph and the session of our neural network
We assume that each type of feature preserves its own characteristics. The following figure shows the overall scheme of our model:
![alt text](imagename.png "Title")