Natural language processing has been a topic of discussion since Alan Turing proposed a test in which a judge converses with both a human and a machine and tries to tell which is which. A machine is said to “pass” the test if the judge cannot tell the two apart. Building a machine that passes this test has been, and continues to be, a very difficult task for computers and their programmers. The main difficulty lies in the many rules and vocabularies we apply almost innately in everyday conversation. For example, some words mean different things in different contexts.
    
We can evaluate how well our models perform using questions that humans have already answered, comparing the human answers to the answers our machine gives. Specifically, we can compare the words and sentence structure of the machine’s answer to those of the human’s. For example, we could give the machine one point every time it uses a word that also appears in the human’s answer, perhaps with the caveat that it only gets the point for words that follow another matching word (or are the first word in the sentence).
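The scoring idea above can be sketched in a few lines. This is only one possible reading of the rule: a word "matches" if it appears anywhere in the human's answer, and it earns a point only if it is the first word or directly follows another matching word. The function name `overlap_score` is hypothetical, and punctuation is deliberately left untouched to keep the sketch short.

```python
def overlap_score(machine_answer: str, human_answer: str) -> int:
    """Count machine words that also appear in the human answer.

    A word only scores if it is the first word of the machine answer
    or if the word right before it also matched.
    """
    human_words = set(human_answer.lower().split())
    score = 0
    previous_matched = True  # the first word is always eligible
    for word in machine_answer.lower().split():
        matched = word in human_words
        if matched and previous_matched:
            score += 1
        previous_matched = matched
    return score


full_match = overlap_score("the heart pumps blood",
                           "the heart pumps blood through the body")
partial_match = overlap_score("a heart pumps blood", "the heart pumps blood")
```

In the second call, "heart" matches but follows the non-matching "a", so only "pumps" and "blood" score.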
    
There are many ways to build a program that tries to understand and produce human language. One of the earliest applications of machine learning to this problem involved question classification. Researchers realized as far back as 1992 that in order to answer a question, a machine needs to understand the constraints the question places on its answer. To do this, Drs. Xin Li and Dan Roth proposed a hierarchical classifier that assigns each question to a class (https://dl.acm.org/citation.cfm?id=1072378).
    
The problem with this approach, of course, is that every possible question needs to have a classification. Furthermore, since our machine learning classifier needs to understand how the type of question affects the answer, it needs to be trained to understand the types of words (e.g., noun, adjective), the sentence structures, and so on. For example, for the question “How does the human heart work?”, it first needs to understand how to answer a question about how anything works. Only then can it answer the question of how the human heart works.
    
What I propose is that, instead of concerning ourselves with how a question is classified, the types of words in it, or its sentence structure, we train a machine to answer a question by classifying the question in terms of how it would be answered. This allows us to train our program on questions for which we may not have defined a classification. For example, to answer a question from the SQuAD dataset, the program wouldn’t necessarily learn that a question is related to an article, but rather that it is related to another question and to the answer we’re looking for.


## Defining our inputs and outputs

The first task in any machine learning setup is to define our inputs and outputs.
For this task, it is readily obvious that our questions should be inputs and our answers should be the outputs.

However, in order for our bot to formulate an answer,
any neural network we create to generate answers needs source text from which it can pick out that answer.

Therefore, we will use two neural networks. The first neural network's job will be to identify which article in the dataset best answers a given question. The second neural network's job will be to take as input both the question posed and the relevant article, and then output the portion of the article that answers the question.
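To make the two-stage design concrete, here is a minimal sketch of how the pieces could fit together. It is not the networks built below: `answer_question`, `toy_pick_article` and `toy_pick_span` are hypothetical names, and the two "models" are simple word-overlap stand-ins so that the example runs on its own.

```python
from typing import Callable, Dict


def answer_question(
    question: str,
    articles: Dict[str, str],
    pick_article: Callable[[str, Dict[str, str]], str],
    pick_span: Callable[[str, str], str],
) -> str:
    """Stage 1 chooses an article title; stage 2 extracts the answer from it."""
    title = pick_article(question, articles)
    return pick_span(question, articles[title])


def _overlap(a: str, b: str) -> int:
    """Number of distinct lowercase words shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def toy_pick_article(question: str, articles: Dict[str, str]) -> str:
    # Stand-in for the first network: the title whose text shares most words.
    return max(articles, key=lambda title: _overlap(question, articles[title]))


def toy_pick_span(question: str, article: str) -> str:
    # Stand-in for the second network: the sentence that shares most words.
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    return max(sentences, key=lambda s: _overlap(question, s))


articles = {
    "Heart": "The heart pumps blood through the body. It has four chambers.",
    "Moon": "The moon orbits the earth. It has no atmosphere.",
}
answer = answer_question("What does the heart pump", articles,
                         toy_pick_article, toy_pick_span)
print(answer)
```

Keeping the two stages behind plain callables means either stand-in can later be swapped for a trained model without changing `answer_question`.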


## Setup

Import the libraries used throughout, just to verify that everything is working.

In [109]:
import numpy as np
from scipy.sparse import csr_matrix
from keras.models import Sequential
from keras.layers import Dense,LSTM,Activation,InputLayer,Input,BatchNormalization
from keras.callbacks import ModelCheckpoint
from sklearn.feature_extraction.text import CountVectorizer
# Imported so we can verify that we are using a GPU, e.g. with device_lib.list_local_devices().
from tensorflow.python.client import device_lib
import sys
assert sys.version_info[0] >= 3
print("Set up!")

Set up!


## Preprocessing

For the preprocessing step, we will create two dictionaries. One will be used to map questions to articles, while the other will be used to map questions and the contents of the articles to the answers.
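The preprocessing code itself is not shown here, so the following is only a sketch of how the two dictionaries could be built. It assumes the standard SQuAD v1.1 JSON layout (`data` → `paragraphs` → `qas`); the function name `build_mappings` and the tiny inline sample are made up for illustration.

```python
# A tiny sample in the SQuAD v1.1 JSON layout (the content itself is made up).
squad_sample = {
    "data": [
        {
            "title": "Heart",
            "paragraphs": [
                {
                    "context": "The heart pumps blood through the body.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What does the heart pump?",
                            "answers": [{"text": "blood", "answer_start": 16}],
                        }
                    ],
                }
            ],
        }
    ]
}


def build_mappings(squad):
    """Return (question -> article title, (question, context) -> answer text)."""
    question_to_title = {}
    question_context_to_answer = {}
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                question = qa["question"]
                question_to_title[question] = article["title"]
                if qa["answers"]:
                    # Assumption: keep only the first listed answer.
                    first_answer = qa["answers"][0]["text"]
                    question_context_to_answer[(question, context)] = first_answer
    return question_to_title, question_context_to_answer


question_to_title, question_context_to_answer = build_mappings(squad_sample)
```

The first dictionary would feed the first network and the second dictionary the second network, once their keys and values are vectorized.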

### Loading preprocessed data

In [2]:
from scipy.sparse import load_npz
X_1_train = load_npz("X_1_train.npz")
X_1_cross_validation = load_npz("X_1_cross_validation.npz")
dev_questions = load_npz("dev_questions.npz")
dev_articleTitles = load_npz("dev_articleTitles.npz")
X_2_train = load_npz("X_2_train.npz")
X_2_cross_validation = load_npz("X_2_cross_validation.npz")
X_2_dev = load_npz("X_2_dev.npz")
dev_answers = load_npz("dev_answers.npz")

### Batch generator

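Because the number of features is so large, converting an entire sparse matrix to a dense array at once would take too much memory. We therefore use a batch generator that densifies only one batch of rows at a time.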
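A rough back-of-the-envelope calculation shows the scale of the problem. The numbers below come from the shapes printed later in this notebook (335948 input features and 65699 training targets); the 4 bytes per value is an assumption (float32), and I am assuming `X_2_train` has the same number of rows as its targets.

```python
rows, features = 65699, 335948  # shapes printed later in this notebook
bytes_per_value = 4             # assumption: float32 values
rows_per_batch = 200            # the batch size used for training below

# Memory needed if the whole matrix were dense at once.
dense_bytes = rows * features * bytes_per_value
# Memory needed for a single dense batch.
batch_bytes = rows_per_batch * features * bytes_per_value

print(f"full dense matrix: {dense_bytes / 1e9:.1f} GB")
print(f"one dense batch:   {batch_bytes / 1e6:.1f} MB")
```

Under these assumptions the full dense matrix would need roughly 88 GB, while a single batch of 200 rows needs roughly 269 MB.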

In [49]:
import scipy.sparse
import numpy as np
def nn_batch_generator(X_data, y_data, number_of_batches):
    """Yield random dense (X, y) batches from two sparse matrices, forever."""
    assert X_data.shape[0] == y_data.shape[0]
    indices = np.arange(0,X_data.shape[0])
    samples_per_epoch = X_data.shape[0] # Number of samples per Keras epoch.
    batch_size = int(samples_per_epoch / number_of_batches) # Number of rows in a batch.
    while True:
        # Randomly select batch_size row indices to fill the batch.
        batch_indices = np.random.choice(indices,size=batch_size,replace=False)
        # Densify only the selected rows of X.
        batch_X = X_data[batch_indices].toarray()
        # Densify the matching rows of Y.
        batch_Y = y_data[batch_indices].toarray()
        yield batch_X,batch_Y
# Quick manual check (loops forever, so interrupt it by hand):
#for batch_x,batch_y in nn_batch_generator(scipy.sparse.csr_matrix([[1,2,3],[4,5,6]]),scipy.sparse.csr_matrix([[7,8,9],[10,11,12]]),2):
    #print(batch_x)
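As a quick sanity check of the idea, here is a small self-contained variant that can be run on toy data. It is a sketch rather than the generator used for training: the name `sparse_batches` is hypothetical, it takes the batch size directly, and it uses a seeded NumPy random generator so the run is repeatable.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_batches(X, y, batch_size, seed=0):
    """Yield dense (X, y) batches of batch_size random rows, forever."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    while True:
        # Sample row indices without replacement within a batch.
        rows = rng.choice(n_rows, size=batch_size, replace=False)
        # Index both matrices with the same rows so inputs and targets stay aligned.
        yield X[rows].toarray(), y[rows].toarray()


# Toy data: row i of X is [2i, 2i + 1] and row i of y is [i].
X = csr_matrix(np.arange(12).reshape(6, 2))
y = csr_matrix(np.arange(6).reshape(6, 1))
batch_X, batch_y = next(sparse_batches(X, y, batch_size=4))
print(batch_X.shape, batch_y.shape)
```

Because the toy rows encode their own index, it is easy to confirm that each dense input row still lines up with its target row.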

## First neural network

This is a neural network that takes a question and outputs an article title.
Note for future reference: It may be a good idea to build a neural network that can also parse the content and thereby decide how relevant an article title is.

In [None]:
questions_article_model = Sequential()
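from math import log
base = 5
# inputShape_first and article_titles_shape are assumed to be defined earlier in the session.
numberOfLayers = int(log(inputShape_first[0],base))
print(numberOfLayers)
# Note: ^ is bitwise XOR in Python, not exponentiation, so this width is 5 XOR (numberOfLayers - 1).
# Use base**(numberOfLayers - 1) if a power of 5 is intended.
questions_article_model.add(Dense(5^(numberOfLayers - 1),input_shape=inputShape_first))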
questions_article_model.add(BatchNormalization())
for i in reversed(range(1,numberOfLayers)):
    outputDims = 5**i
    print("outputDims",outputDims)
    questions_article_model.add(Dense(outputDims))
questions_article_model.add(Dense(article_titles_shape[1]))
questions_article_model.add(Activation("sigmoid"))
questions_article_model.summary()

In [None]:
questions_article_model.compile("adam","categorical_crossentropy",metrics=['accuracy'])

In [None]:

checkpoint = ModelCheckpoint('model_1-{epoch:03d}.h5', verbose=1, monitor='val_acc',save_best_only=True, mode='auto') 

In [None]:
batch_size = 200
steps_per_epoch = int(X_1_train.shape[0] / batch_size)
questions_article_model.fit_generator(nn_batch_generator(X_1_train,Y_1_train,steps_per_epoch),epochs=9,validation_data=(X_1_cross_validation,Y_1_cross_validation),callbacks=[checkpoint],steps_per_epoch=steps_per_epoch,verbose=True)
print("Weights: ",questions_article_model.get_weights())

### Load the model with the best weights.

In [None]:
from keras.models import load_model
model_1 = load_model("model_1-001.h5")
print("Model ready")

In [None]:
# Get the model's weights and check that none of them are NaN.
modelWeights = model_1.get_weights()
for array in modelWeights:
    assert not np.isnan(array).any()

### Test it against the dev set.

In [None]:
import numpy as np
print(model_1.evaluate(x=dev_questions_vectorized,y=dev_articleTitles_vectorized))
predictions = model_1.predict(dev_questions_vectorized)
# Check that none of the predictions are NaN.
assert not np.isnan(predictions).any()

## Second neural network - using Dense layers

This neural network generates answers from the questions and articles. It takes a question together with the relevant article and outputs the part of the article that answers the question.

In [75]:
#Build the neural network.
from math import log
answers_network = Sequential()
inputShape_second = X_2_train.shape[1]
print(inputShape_second)
answers_shape = Y_2_train.shape
print(answers_shape)
numberOfLayers = int(log(inputShape_second,5))
print(numberOfLayers)
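# Note: ^ is bitwise XOR in Python, not exponentiation. With numberOfLayers == 7,
# 5^(numberOfLayers - 1) is 5 XOR 6 == 3, which is the 3-unit first layer in the
# summary below; 5**(numberOfLayers - 1) would give 15625 units instead.
answers_network.add(Dense(5^(numberOfLayers - 1),input_shape=(inputShape_second,)))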
answers_network.add(BatchNormalization())
for i in reversed(range(1,numberOfLayers)):
    outputDims = 5**i
    print("outputDims",outputDims)
    answers_network.add(Dense(outputDims))
answers_network.add(Dense(32))
answers_network.add(Dense(answers_shape[1]))
answers_network.add(Activation("sigmoid"))
answers_network.summary()

335948
(65699, 89)
7
outputDims 15625
outputDims 3125
outputDims 625
outputDims 125
outputDims 25
outputDims 5
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_46 (Dense)             (None, 3)                 1007847   
_________________________________________________________________
batch_normalization_6 (Batch (None, 3)                 12        
_________________________________________________________________
dense_47 (Dense)             (None, 15625)             62500     
_________________________________________________________________
dense_48 (Dense)             (None, 3125)              48831250  
_________________________________________________________________
dense_49 (Dense)             (None, 625)               1953750   
_________________________________________________________________
dense_50 (Dense)             (None, 125)               78250     
_______________________________
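The layer sizes in the summary above can be checked by hand. The sketch below reproduces the first two rows from the printed input size, and shows what the first layer would cost if `**` (exponentiation) were used where the cell uses `^` (bitwise XOR). The helper `dense_params` is a hypothetical name; it simply counts weights plus biases of a fully connected layer.

```python
from math import log

input_features = 335948                         # printed by the cell above
number_of_layers = int(log(input_features, 5))  # same formula as the cell above

xor_width = 5 ^ (number_of_layers - 1)     # bitwise XOR, as written in the cell
power_width = 5 ** (number_of_layers - 1)  # exponentiation


def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


first_layer_params = dense_params(input_features, xor_width)
second_layer_params = dense_params(xor_width, 5 ** 6)
power_first_layer_params = dense_params(input_features, power_width)

print(number_of_layers, xor_width, power_width)
print(first_layer_params, second_layer_params, power_first_layer_params)
```

The XOR width of 3 gives exactly the 1007847 and 62500 parameters reported for `dense_46` and `dense_47`. A 15625-unit first layer on the same input would need about 5.2 billion parameters, so switching to `**` would also require rethinking the layer sizes.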

### Train the neural network.

In [90]:
answers_network.compile("adam","categorical_crossentropy",metrics=['accuracy'])

In [73]:
answers_network_checkpoint = ModelCheckpoint('answers_network-{epoch:03d}.h5', verbose=1, monitor='val_acc',save_best_only=True, mode='auto') 

In [92]:
from scipy.sparse import csr_matrix
batch_size = 200
# The batch generator calls .toarray(), so the targets must be a sparse matrix.
Y_2_train = csr_matrix(Y_2_train)
steps_per_epoch = int(X_2_train.shape[0] / batch_size)
answers_network.fit_generator(nn_batch_generator(X_2_train,Y_2_train,steps_per_epoch),epochs=8,validation_data=(X_2_cross_validation,Y_2_cross_validation),callbacks=[answers_network_checkpoint],steps_per_epoch=steps_per_epoch,verbose=True)
print("Weights: ",answers_network.get_weights())

Epoch 1/8

Epoch 00001: val_acc improved from -inf to 0.65365, saving model to answers_network-001.h5
Epoch 2/8

Epoch 00002: val_acc improved from 0.65365 to 0.65740, saving model to answers_network-002.h5
Epoch 3/8

Epoch 00003: val_acc did not improve from 0.65740
Epoch 4/8

Epoch 00004: val_acc did not improve from 0.65740
Epoch 5/8

KeyboardInterrupt: 

### Loading the model with best fit.

In [97]:
answers_network.load_weights('answers_network-002.h5')

In [125]:
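batch_size = 200
# Use the dev set's own row count to work out how many batches cover it.
steps = int(X_2_dev.shape[0] / batch_size)
print(X_2_dev.shape)
# Note: X_2_dev has 336029 columns, but answers_network was built for 335948 input
# features, so the dev set must be vectorized with the same vocabulary as the training set.
# nn_batch_generator expects the number of batches as its third argument.
nn_generator = nn_batch_generator(X_2_dev,dev_answersBinarized,steps)
answers_network.evaluate_generator(nn_generator,steps=steps)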

(87599, 336029)


TypeError: only integer scalar arrays can be converted to a scalar index