Natural language processing has been a topic of conversation since Alan Turing first proposed a test in which a judge converses with both a human and a machine and must tell which one is which. A machine is said to “pass” the test if the judge cannot tell the two apart. Building a machine that passes this test has been, and continues to be, a very difficult task for computers and their programmers. The main difficulty lies in the many rules and vocabularies we apply almost innately in our everyday conversations. For example, some words mean different things in different contexts.
    
We can evaluate how well our models perform using questions that humans have already answered, comparing the words and sentence structure of the machine’s answer with those of the human’s answer. For example, we could give the machine one point every time it uses a word that also appears in the human’s answer, perhaps with the caveat that it only gets the point for words that follow another matching word (or are the first word in the sentence).
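
As a minimal sketch of one possible reading of that scoring rule: a word counts as matching if it appears anywhere in the human's answer (ignoring case), and it earns a point only if it is the first word or the word before it also matched. The function name `overlap_score` and the whitespace tokenization are assumptions made for illustration.

```python
def overlap_score(machine_answer, human_answer):
    """Score a machine answer against a human answer, one point per eligible matching word."""
    human_words = set(human_answer.lower().split())
    score = 0
    # The first word has no predecessor, so it is always eligible for a point.
    previous_matched = True
    for word in machine_answer.lower().split():
        matched = word in human_words
        if matched and previous_matched:
            score += 1
        previous_matched = matched
    return score


# "the" and "heart" score; "pumps" does not match; "blood" matches but follows a miss.
example_score = overlap_score("the heart pumps blood", "the heart moves blood")
print(example_score)
```

Under these assumptions a single wrong word also costs the point for the word after it, which rewards answers that reproduce longer runs of the human's wording.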
    
However, there are many ways to build a program that tries to understand and speak human language. One of the earliest applications of machine learning to this problem involved question classification. Researchers realized as far back as 1992 that in order to answer a question, a machine needs to understand the constraints placed on that question. To do this, Drs. Xin Li and Dan Roth proposed a hierarchical classifier for questions (https://dl.acm.org/citation.cfm?id=1072378).
    
The problem, of course, with this approach is that each possible question needs to have a classification. Furthermore, since our machine learning classifier needs to understand how the type of question affects the answer, it needs to be trained to understand the types of words (e.g., noun, adjective), the sentence structures, and so on. For example, for the question “How does the human heart work?” it first needs to understand how to answer a question about how anything works. Only then can it answer the question of how the human heart works.
    
What I propose is that instead of concerning ourselves with how a question is classified, the types of words in the question, or its sentence structure, we train a machine to answer a question by classifying the question in terms of how it would be answered. This allows us to train our program on questions for which we may not have defined a classification. For example, to answer a question from our SQuAD dataset, the program wouldn’t necessarily learn that a question is related to an article, but that it is related to another question and to the answer we’re looking for.


## Defining our inputs and outputs

The first task in any machine learning setup is to define our inputs and outputs.
For this task, it is readily obvious that our questions should be inputs and our answers should be the outputs.

However, in order to formulate an answer, any neural network we create to generate answers needs source text from which it can pick that answer out.

Therefore, we will use two neural networks. The first neural network's job will be to identify which article in the dataset best answers a given question. The second neural network's job will be to take as input both the question posed and the relevant article, and then output the portion of the article that answers the question.
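
To make the two-stage design concrete, here is a toy sketch in which simple word-overlap functions stand in for the two neural networks. The names `pick_article`, `extract_answer`, and `answer_question`, as well as the toy articles, are assumptions made for illustration and are not part of the code used later.

```python
def pick_article(question, articles):
    """Stand-in for the first network: choose the article sharing the most words with the question."""
    question_words = set(question.lower().split())
    return max(
        articles,
        key=lambda title: len(question_words & set(articles[title].lower().split())),
    )


def extract_answer(question, article_text):
    """Stand-in for the second network: return the sentence sharing the most words with the question."""
    question_words = set(question.lower().split())
    sentences = [s.strip() for s in article_text.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(question_words & set(s.lower().split())))


def answer_question(question, articles):
    """Chain the two stages: select an article, then select a portion of it."""
    title = pick_article(question, articles)
    return title, extract_answer(question, articles[title])


toy_articles = {
    "Heart": "The heart pumps blood through the body. It has four chambers.",
    "Moon": "The moon orbits the earth. It has no atmosphere.",
}
title, answer = answer_question("What does the heart pump?", toy_articles)
print(title, "->", answer)
```

The point of the sketch is only the data flow: the second stage never sees the whole dataset, just the question and the single article chosen by the first stage.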


## Preprocessing

For the preprocessing step, we will create two dictionaries. One will be used to map questions to articles, while the other will be used to map questions and the contents of the articles to the answers.

### Loading JSON dataset

In [4]:
import json
questions_articleTitles = {} # Maps each question to the title of its article.
questions_answers = {} # Maps each question to the text of its first listed answer.
with open('train-v1.1.json') as file:
    data = json.load(file)
    articles = data["data"]
    # Iterate through articles, looking for question/answer pairs.
    for article in articles:
        article_title = article["title"]
        article_paragraphs = article["paragraphs"]
        for paragraph in article_paragraphs:
            qas_pairs = paragraph["qas"]
            for qas_pair in qas_pairs:
                # Note: There's another attribute called "context", which may come in handy.
                answer = qas_pair["answers"][0]
                answer_text = answer["text"]
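                # Character offsets of the answer within the paragraph's "context".
                # Note: this span is computed here but not stored in either dictionary.
                answer_start = answer["answer_start"]
                answer_end = answer_start + len(answer_text) - 1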
                question = qas_pair["question"]
                questions_answers[question] = answer_text
                questions_articleTitles[question] = article_title
print("Finished loading data.")

Finished loading data.


In [5]:
print(len(questions_articleTitles), "question-article pairs.")
print(len(questions_answers), "question-answer pairs.")

(87355, 'question-article pairs.')
(87355, 'question-answer pairs.')
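
The loading cell computes where each answer starts and ends but keeps only the answer text. As a hedged sketch, the same traversal can store the character span alongside the paragraph's `context`, which is the form the second network would need. The tiny `toy_squad` dictionary below is made up for illustration and only mimics the nesting used in the loading cell; `extract_spans` is a hypothetical helper.

```python
toy_squad = {
    "data": [
        {
            "title": "Heart",
            "paragraphs": [
                {
                    "context": "The heart pumps blood through the body.",
                    "qas": [
                        {
                            "question": "What does the heart pump?",
                            "answers": [{"text": "blood", "answer_start": 16}],
                        }
                    ],
                }
            ],
        }
    ]
}


def extract_spans(dataset):
    """Map each question to (context, start, end), with `end` inclusive."""
    spans = {}
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answer = qa["answers"][0]  # Use the first listed answer, as above.
                start = answer["answer_start"]
                end = start + len(answer["text"]) - 1
                # Keying by question text means a repeated question overwrites the earlier entry.
                spans[qa["question"]] = (context, start, end)
    return spans


question_spans = extract_spans(toy_squad)
context, start, end = question_spans["What does the heart pump?"]
print(context[start:end + 1])
```

Slicing the context with the stored span is a cheap sanity check that the offsets really point at the answer text.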


### Text preprocessing using scikit-learn and TensorFlow

In [6]:
import numpy as np
# Combine all text values to fit into the vectorizer.
questions = questions_articleTitles.keys()
articleTitles = questions_articleTitles.values()
answers = questions_answers.values()
X = questions
m = map(lambda x: x.lower(),X) # Lowercased copy of each question.
# Note: It could be useful to add article titles to the neural network's input 
#    to ensure the neural network doesn't suggest an article that doesn't answer the question, or doesn't exist.
print("X prepared")


X prepared


### Get n-grams

The following section turns each input into word bigrams (n-grams with n = 2). Because duplicates are removed with `set`, the original order of the n-grams is not necessarily preserved.

In [7]:
# Get word bigrams (pairs of adjacent words).
def ngrams(text):
    _ngrams = []
    sentenceWords = text.split(" ")
    for i in range(0,len(sentenceWords) - 1):
        ngram = sentenceWords[i] + " "
        ngram = ngram + sentenceWords[i + 1]
        _ngrams.append(ngram)
    return list(set(_ngrams)) # set() removes duplicates but does not keep order.
print(ngrams("I like cheese"))
# list(...) keeps the results indexable and concatenable on Python 3, where map returns an iterator.
X_ngrams = list(map(ngrams,X))
print(X_ngrams[0])
articleTitles_ngrams = list(map(ngrams,articleTitles))
answers_ngrams = list(map(ngrams,answers))
combined_ngrams = X_ngrams + articleTitles_ngrams + answers_ngrams
print("Done computing n-grams and n-gram vocabulary.")

['I like', 'like cheese']
[u'from Johns', u'Did Eran', u'or against', u'Public Health', u'Johns Hopkins', u'Health argue', u'Elhaik, from', u'Eran Elhaik,', u'Hopkins University', u'of Public', u'University School', u'against Khazar', u'School of', u'for or', u'Khazar descent?', u'argue for']
Done computing n-grams and n-gram vocabulary.
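
If the order of the n-grams matters, duplicates can be removed without shuffling. The following is a minimal sketch, assuming Python 3.7 or later (where dictionaries keep insertion order); `ordered_ngrams` is a hypothetical helper, and it also generalizes the window size beyond pairs.

```python
def ordered_ngrams(text, n=2):
    """Return the distinct word n-grams of `text` in order of first appearance."""
    words = text.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    # dict.fromkeys drops repeats while keeping the first occurrence of each n-gram.
    return list(dict.fromkeys(grams))


print(ordered_ngrams("I like cheese"))
print(ordered_ngrams("a b a b"))
print(ordered_ngrams("I like cheese", n=3))
```

Inputs shorter than `n` words simply produce an empty list, because the range of starting positions is empty.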


In [13]:
# Compute n-gram vocabulary.
combinedString = [string for element in combined_ngrams for string in element]
print("Computed n-gram vocabulary; sample: " + combinedString[2])

Computed n-gram vocabulary; sample: or against


In [14]:

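# Vectorize all strings.
# TODO: Fit vectorizer on ngram vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
# Note: CountVectorizer rejects a vocabulary list that contains repeated terms, and
# combinedString has repeats, hence the ValueError below. Deduplicating it first
# (for example with sorted(set(combinedString))) would avoid that error.
x_first_vectorizer = CountVectorizer(vocabulary=combinedString)
x_first_vectorizer.fit(X)
X_first_vectorized = x_first_vectorizer.transform(X)
articleTitles_vectorized = x_first_vectorizer.transform(articleTitles)
print("Vectorized.")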

ValueError: Duplicate term in vocabulary: u'at the'
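
The error arises because the same n-gram (here `at the`) occurs in many questions, so the flattened list contains it more than once. As a library-free sketch of the idea, the vocabulary can be built from the distinct n-grams only, with each one mapped to a column index, and a text can then be turned into a count vector over those columns. The helpers `bigrams`, `build_vocabulary`, and `vectorize` are hypothetical names, and lowercasing plus whitespace splitting are assumptions; this is not scikit-learn's implementation.

```python
from collections import Counter


def bigrams(text):
    """Lowercased word bigrams of `text`, repeats included."""
    words = text.lower().split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]


def build_vocabulary(texts):
    """Map each distinct bigram to a column index (sorted for reproducibility)."""
    unique_terms = sorted({gram for text in texts for gram in bigrams(text)})
    return {term: index for index, term in enumerate(unique_terms)}


def vectorize(text, vocabulary):
    """Count how often each vocabulary bigram occurs in `text`."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(bigrams(text)).items():
        if term in vocabulary:  # Bigrams outside the vocabulary are ignored.
            vector[vocabulary[term]] = count
    return vector


corpus = ["at the park", "at the zoo at the park"]
vocab = build_vocabulary(corpus)
print(vocab)
print(vectorize("at the zoo at the park", vocab))
```

Because every term has exactly one column, the length of each vector equals the number of distinct n-grams rather than the total number of n-gram occurrences.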

### Splitting training data into training + cross-validation sets

In [121]:
# convert data to numpy arrays, get input shape
X_first_array = X_first_vectorized.toarray()
from sklearn.preprocessing import normalize
X_first_array = normalize(X_first_array)
print("X_first_array normalized.")
articleTitles_array = articleTitles_vectorized.toarray()
inputShape_first = X_first_array.shape[1:] # Input shape of the first neural network.
print("First neural network X shape: " + str(X_first_array.shape))
print("First neural network input shape: " + str(inputShape_first))
print("Article titles array shape: " + str(articleTitles_array.shape))

X_first_array normalized.
First neural network X shape: (87355, 36752)
First neural network input shape: (36752,)
Article titles array shape: (87355, 36752)
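
The cell above only converts and normalizes the data; the split named in this section's heading is left to the `validation_split` argument passed to Keras later on, which holds out the last fraction of the samples. A minimal sketch of that kind of hold-out split, with `holdout_split` as a hypothetical helper and a plain list standing in for the arrays:

```python
def holdout_split(samples, validation_fraction=0.25):
    """Keep the first part for training and the last `validation_fraction` for validation."""
    split_at = int(len(samples) * (1.0 - validation_fraction))
    return samples[:split_at], samples[split_at:]


# A list of indices with the same number of rows as X_first_array.
train, validation = holdout_split(list(range(87355)), validation_fraction=0.25)
print(len(train), len(validation))  # 65516 21839
```

Since the held-out part is taken from the end without shuffling, any ordering in the data (for example, questions grouped by article) carries over into the split.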


## First neural network

In [133]:
from keras.models import Sequential
from keras.layers import Dense,LSTM,Activation,InputLayer,Input,BatchNormalization

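# Note: Dense layers default to a linear activation, so the only nonlinearity
# in this stack is the final sigmoid.
questions_article_model = Sequential()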
questions_article_model.add(Dense(256,input_shape=inputShape_first))
questions_article_model.add(Dense(64))
questions_article_model.add(Dense(32))
questions_article_model.add(Dense(16))
questions_article_model.add(Dense(256))
questions_article_model.add(Dense(articleTitles_array.shape[1]))
questions_article_model.add(Activation("sigmoid"))
questions_article_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_107 (Dense)            (None, 256)               9408768   
_________________________________________________________________
dense_108 (Dense)            (None, 64)                16448     
_________________________________________________________________
dense_109 (Dense)            (None, 32)                2080      
_________________________________________________________________
dense_110 (Dense)            (None, 16)                528       
_________________________________________________________________
dense_111 (Dense)            (None, 256)               4352      
_________________________________________________________________
dense_112 (Dense)            (None, 36752)             9445264   
_________________________________________________________________
activation_26 (Activation)   (None, 36752)             0         
Total para
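
The parameter counts in the summary can be checked by hand: a dense layer with `n_inputs` inputs and `n_units` units has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic, with `dense_params` as a hypothetical helper and the layer widths read off the summary above:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a fully connected layer."""
    return (n_inputs + 1) * n_units


# Input width followed by the width of each Dense layer, in order.
layer_sizes = [36752, 256, 64, 32, 16, 256, 36752]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)
print(sum(counts))
```

Almost all of the parameters sit in the first and last layers, because both are connected to the 36752-wide n-gram vocabulary.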

In [134]:
questions_article_model.compile("SGD","categorical_crossentropy",metrics=['accuracy'])

In [135]:
from keras.callbacks import ModelCheckpoint
checkpoint = ModelCheckpoint('model-{epoch:03d}.h5', verbose=1, monitor='val_loss',save_best_only=True, mode='auto') 

In [None]:
questions_article_model.fit(X_first_array,articleTitles_array,epochs=8,validation_split=0.25,callbacks=[checkpoint],verbose=True)

Train on 65516 samples, validate on 21839 samples
Epoch 1/8