# Project Proposal: Machine Learning System Design and Implementation for Question Answering

**Author**: Nam Phung \
**Advisor**: Professor Susan Fox \
*Macalester College, Department of Mathematics, Statistics, and Computer Science* \
Fall 2019

**Keywords**: Natural Language Processing, Question Answering, Deep Learning, Software Engineering.
<!--TOC-->

## 1. Introduction

For this project, we will explore **Question Answering** (QA), one of the primary tasks in Natural Language Processing. The goal of a QA system is to answer a specific query given a context (a paragraph, document, web page, etc.). In other words, the system should extract the information in the context that is relevant to the query issued by the user. QA systems are widely used across application domains, most notably in chatbots designed to streamline information gathering and to provide support and recommendations. In particular, we will focus on **Deep Learning** techniques for this problem, which have become extremely popular thanks to the large amount of available text data and steadily increasing computational power. The goal of the project is to build a complete pipeline for a QA system that can serve as the *minimum viable product* (MVP) for a client-facing service.

## 2. Project Goals

This project is primarily concerned with the implementation of deep learning models for question answering. QA is an active research area with clear commercial applications, such as chatbots, digital assistants, and search engines. Implementing and deploying deep learning models to production is therefore important, since it is how users gain access to these innovations. We are interested in applying software engineering practices to implementing machine learning models and serving them through a client-facing service. In particular, the main goals of the project are as follows.

1. Exploring the main deep learning architectures that have proven successful in NLP tasks.
2. Implementing an end-to-end training pipeline using a cloud computing service.
3. Building an API-driven system to serve the trained model.
4. Building and deploying a client-side web application that allows users to provide their own context paragraph and ask questions that can be answered with that paragraph.
5. Learning about designing machine learning systems.

## 3. Related Works
Almost all proposed approaches to Question Answering (and to NLP in general) use some variation of **Recurrent Neural Networks** (RNNs). RNNs generalize the feed-forward neural network architecture to sequence data by introducing feedback connections. In particular, the output that an RNN computes at timestep $t$ of a sequence depends on both the input at that timestep and a value computed at the previous timestep. The task of Question Answering can then be modeled using a sequence-to-sequence (seq2seq) architecture, where an Encoder RNN encodes the context and the query into a fixed-length vector, and a Decoder then generates a response to the question [1]. Although this architecture showed some initial success in tasks such as translation, it must compress all the information in the input sequence into a single vector. This bottleneck often makes it difficult to model very long input sequences. Bahdanau et al. introduced the concept of *attention*, which allows the model to choose specific parts of the input sequence to attend to during decoding [2]. The attention mechanism has been a key factor in recent developments in QA. Chen et al. (2016) proposed the Stanford Attentive Reader, which uses a deep bidirectional LSTM together with an attention mechanism to predict the *start* and *end* indices of the span of the context that answers the question [3]. Seo et al. (2017) proposed a more complex architecture for QA called Bidirectional Attention Flow (BiDAF) [4], where attention is applied to both the context and the question to produce new representations, which are then passed through a fully-connected layer to produce the start and end indices for the answer.
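
To make the start/end span formulation concrete, the sketch below shows one common way to turn per-token start and end probabilities into an answer span. It is only an illustration: the function name `best_span`, the `max_len` cutoff, and the probability values are assumptions made for this example, not the decoding procedure of any of the cited models.

```python
from typing import List, Tuple


def best_span(start_probs: List[float], end_probs: List[float],
              max_len: int = 15) -> Tuple[int, int, float]:
    """Pick (start, end, score) maximizing start_probs[i] * end_probs[j].

    Only spans with i <= j and at most max_len tokens are considered.
    """
    best = (0, 0, -1.0)
    for i, p_start in enumerate(start_probs):
        # The end index may not precede the start index.
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


tokens = ("The University of Chicago is a private research "
          "university in Chicago").split()

# Hypothetical model outputs: one probability per context token.
start_probs = [0.02, 0.05, 0.01, 0.01, 0.01, 0.05, 0.80, 0.02, 0.01, 0.01, 0.01]
end_probs = [0.01, 0.02, 0.01, 0.10, 0.01, 0.01, 0.02, 0.05, 0.70, 0.01, 0.06]

start, end, score = best_span(start_probs, end_probs)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

Restricting the search to `i <= j` rules out invalid spans, and the length cutoff keeps the search cheap for long contexts.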

Many more recent works make small refinements to these models, augmenting them with new variants of attention. However, RNN-based architectures still suffer from a limited ability to model context. Much of the most prominent recent research in NLP is therefore concerned with contextual representations, with some promising results. In 2017, Vaswani et al. proposed the Transformer, a new encoder-decoder architecture based solely on attention. It has since eclipsed RNN variants in NLP tasks, as it was shown to be superior on various tasks while also speeding up training by almost an order of magnitude [5]. We will focus on the Transformer architecture and the related BERT model [6] for this project.
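
The core operation of the Transformer is scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d_k})V$. The sketch below implements it for a single query vector in plain Python. It is a minimal illustration that assumes one attention head, no masking, and no learned projections; the function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query: Vector, keys: List[Vector],
                                 values: List[Vector]) -> Tuple[Vector, Vector]:
    """Return (output, weights) for one query attending over keys/values."""
    d_k = len(query)
    # Similarity of the query to each key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(dim)]
    return output, weights


# Toy example: the query is aligned with the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = scaled_dot_product_attention([4.0, 0.0], keys, values)
print(weights, output)
```

Because each position's output depends only on weighted sums over all positions, the computation for different positions can run in parallel, unlike the step-by-step recurrence of an RNN.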

## 4. Model Training

### 4.1. Data

We will use the Stanford Question Answering Dataset (SQuAD) [7] for this project. SQuAD is a reading comprehension dataset in which each example consists of a context paragraph, a question, and the ground-truth answer to the question. The dataset was crowdsourced using Amazon Mechanical Turk and contains about 150,000 questions in total, roughly one third of which cannot be answered using the information in the paragraph. When a question can be answered, the answer is a span of text from the paragraph. Below is an example from the dataset.

>**Question**: What kind of university is the University of Chicago? \
**Context**: The University of Chicago (UChicago, Chicago, or U of C) is a <span style='background: yellow;'>private research university</span> in Chicago. The university, established in 1890, consists of The College, various graduate programs, interdisciplinary committees organized into four academic research divisions and seven professional schools. Beyond the arts and sciences, Chicago is also well known for its professional schools, which include the Pritzker School of Medicine, the University of Chicago Booth School of Business, the Law School, the School of Social Service Administration, the Harris School of Public Policy Studies, the Graham School of Continuing Liberal and Professional Studies and the Divinity School. The university currently enrolls approximately 5,000 students in the College and around 15,000 students overall. \
**Answer**: private research university.
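
In code, an example like this can be represented as a record that stores the answer text together with its character offset in the context. The sketch below is a simplified, flattened illustration. The field names (`answers`, `answer_start`, `is_impossible`) follow the released SQuAD JSON files as we understand them and should be treated as assumptions, and the helper `extract_answer` is hypothetical.

```python
# Shortened context: only the first sentence of the paragraph above.
context = ("The University of Chicago (UChicago, Chicago, or U of C) is a "
           "private research university in Chicago.")
answer_text = "private research university"

example = {
    "question": "What kind of university is the University of Chicago?",
    "context": context,
    # The answer is stored as text plus its character offset in the context.
    "answers": [{"text": answer_text,
                 "answer_start": context.find(answer_text)}],
    "is_impossible": False,  # True for questions with no answer in the context
}


def extract_answer(record: dict) -> str:
    """Recover the answer span from the context using the stored offset."""
    if record["is_impossible"]:
        return ""
    answer = record["answers"][0]
    start = answer["answer_start"]
    return record["context"][start:start + len(answer["text"])]


print(extract_answer(example))
```

Storing the offset rather than only the text removes ambiguity when the answer string occurs more than once in the paragraph.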

### 4.2. Evaluation Metric

The official SQuAD competition uses two metrics for evaluating models: exact match (EM) and F1 score.

**Exact Match**: This metric measures the percentage of predictions that exactly match any of the three ground-truth answers. For a single example, the score is 1 if the model output exactly matches a ground-truth answer, and 0 otherwise.
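
A minimal sketch of the per-example EM computation is shown below. It assumes that answers are normalized before comparison (lowercased, with punctuation, articles, and extra whitespace removed), which is, to our understanding, what the official evaluation script does. The function names are our own, and an empty list of ground truths is used here to stand for an unanswerable question.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, then strip punctuation, articles, and extra whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truths: List[str]) -> int:
    """Return 1 if the prediction matches any ground truth, else 0."""
    if not ground_truths:
        # Unanswerable question: correct only if the system predicts no answer.
        return int(normalize_answer(prediction) == "")
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gt) for gt in ground_truths))


print(exact_match("Private research university.",
                  ["private research university",
                   "a private research university"]))
```

The dataset-level EM is then the average of these per-example scores.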

**F1 Score**: This is the harmonic mean of precision and recall. In particular, we treat each answer as a bag of words and measure the overlap between the system's prediction and the ground-truth answer. The F1 score for one example is computed as follows:
$$
F1 = \frac{2PR}{P+R}, \quad P = \frac{tp}{tp + fp}, \quad R = \frac{tp}{tp + fn},
$$
where $P$ and $R$ are precision and recall, respectively, and $tp$, $fp$, and $fn$ are the numbers of true-positive, false-positive, and false-negative words, respectively, obtained by comparing the predicted answer with the ground-truth answer. When a question cannot be answered using the context, both **EM** and **F1** are 1 if the system also predicts no-answer, and 0 otherwise.
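
The bag-of-words F1 for a single prediction and a single ground-truth answer can be sketched as follows. This is a simplified illustration: it tokenizes by lowercasing and splitting on whitespace only, whereas a full evaluation would typically also strip punctuation and articles. The function name `f1_score` is our own.

```python
from collections import Counter


def f1_score(prediction: str, ground_truth: str) -> float:
    """Word-overlap F1 between a predicted and a ground-truth answer."""
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not pred_tokens or not truth_tokens:
        # No-answer case: 1 only if both the prediction and the truth are empty.
        return float(pred_tokens == truth_tokens)
    # Multiset intersection counts each shared word at most as often
    # as it appears in both answers: these are the true positives.
    overlap = Counter(pred_tokens) & Counter(truth_tokens)
    tp = sum(overlap.values())
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)   # len(pred_tokens) = tp + fp
    recall = tp / len(truth_tokens)     # len(truth_tokens) = tp + fn
    return 2 * precision * recall / (precision + recall)


score = f1_score("a private research university",
                 "private research university")
print(round(score, 3))
```

In this example the prediction contains one extra word, so $tp = 3$, $fp = 1$, and $fn = 0$, giving $P = 3/4$, $R = 1$, and $F1 = 6/7 \approx 0.857$, whereas the exact match score would be 0 without article removal.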


## References

1. Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems (NIPS).
2. Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
3. Chen, D., Bolton, J., and Manning, C. (2016). A thorough examination of the CNN/Daily Mail reading comprehension task. ACL.
4. Seo, M., Kembhavi, A., Farhadi, A., and Hajishirzi, H. (2016). Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems.
6. Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
7. Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.