# Chata Interview Report
## I. Task Summary & Objectives
- Given disfluent questions, rewrite these questions with a deep learning model. 
- Evaluate the model for overfitting.
- Detail the development process.

## II. Define the Task
1. Sample data:
- Example input: what is denmark a region of uh no france?
- Example output: what is france a region of?
2. Problem inspection:
- The sample shows that this is a *sequence-to-sequence modeling* problem: the input is a disfluent/erroneous English sentence and the output is a grammatically correct sentence.
- The input string is *not long* (i.e., a sentence rather than a document), which rules out the need for hierarchical LSTM structures.
- Input preprocessing might not be needed because the raw data is well structured.
- Output preprocessing is not necessary, though some sentences come with a trailing space. Most NLG evaluation tools should handle this easily.
3. Solution outlines:
- Establish two vocabularies: disfluent and fluent English. **Note that** using only *one vocabulary set* (English) won't give good results. Mathematically speaking, we are modeling Pr(y|x1,x2,x3,...), where y represents a latent space that captures the structure of the disfluent sentence. If we use the same vocabulary set for both disfluent and fluent sentences, text generation intuitively becomes harder, because we now have to model Pr(x[i]|y) such that x[i] is close to the x1,x2,x3,... above. It makes more sense to split them into two separate vocabulary sets. You can try it if you want.
- Apply any sequence-to-sequence model, such as LSTM-LSTM, Transformer-based, or GPT-based models. The expectation is that the Transformer/GPT models should perform better, but it is unknown how good the LSTM-LSTM would be.
- Look for any papers that work on this dataset. As far as I know, there are only 2 citations for this dataset, and neither proposes a cutting-edge solution for this type of data. Hence, I will just apply the simple/canonical models mentioned above.





## III. Build the Vocabulary
1. spaCy tokenizer or not?

I know most NLP researchers use NLTK/spaCy for tokenization, but in this case I will pass. spaCy is good for tokenizing or pretraining, but it is not good at handling unknown words. Based on Chata's introduction to AutoQL, if the goal is to rephrase incorrect input that may arrive in different languages, it is better to use byte-pair encoding or similar subword techniques to avoid out-of-vocabulary situations. First, this dataset uses quite a large number of names and places. Second, handling a large vocabulary where each word has an embedding of dimension 512 or above would be very expensive.

So the answer is NO: I will use SentencePiece with unigram modeling for this task.

**Implementation: Tools/vocab_builder.py**

2. Preprocessing or not?

Since this dataset targets the precision of the output sentence, any preprocessing step such as replacing numbers with a `<num>` token would be inappropriate. Hence the answer is NO for this dataset. For other datasets, where the focus is on understanding the text content, such preprocessing steps would be more meaningful.

Here I will just convert all letters to lowercase.
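
The choices in this section (lowercasing only, SentencePiece with a unigram model) can be sketched as follows. This is a minimal illustration, not the contents of Tools/vocab_builder.py: the function names, the corpus file layout, and `vocab_size=4000` are all assumptions made for the example.

```python
from pathlib import Path


def normalize(sentence: str) -> str:
    """Lowercase and trim; numbers, names, and places are left untouched."""
    return sentence.strip().lower()


def write_corpus(sentences, path) -> int:
    """Write one normalized sentence per line and return the line count."""
    lines = [normalize(s) for s in sentences if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def train_unigram_vocab(corpus_path, model_prefix, vocab_size=4000):
    """Train a unigram SentencePiece model (needs `pip install sentencepiece`).

    vocab_size=4000 is a placeholder, not the value used in this project.
    """
    import sentencepiece as spm  # lazy import: the helpers above need only the stdlib

    spm.SentencePieceTrainer.train(
        input=str(corpus_path),
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="unigram",
    )
    return spm.SentencePieceProcessor(model_file=f"{model_prefix}.model")
```

To get the two separate vocabularies from Section II, `train_unigram_vocab` would be called once on a file of disfluent questions and once on a file of fluent questions, each with its own `model_prefix`.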

## IV. Build the Models
1. LSTM-LSTM
Here I will try both unidirectional and bidirectional LSTM encoders to encode the disfluent sentences, then decode them with another LSTM decoder.

**Pros:**
- Fast inference time.

**Cons:**
- Long time to train.
- Model performance might not be good.
- I am not a big fan of RNN/LSTM. It is old and has seen very few improvements.

The implementation is in models.py (LSTM_ED).

2. Transformer/GPT/Perceiver
Since this is meant to show off my knowledge of SOTA models such as Transformer/BERT, I will implement most Transformer modules with some modifications adopted from the GPT-2 model.

In particular, I will use the Perceiver network (my own implementation; you cannot find it anywhere on the internet) as an encoder that can encode an arbitrarily long stream of data, including raw pixel images and raw audio files. Then I will use a GPT-2-based model to decode the latent space into complete sentences.

I call this hybrid the PERFORMER model: PERceiver Encoder + TransFORMER Decoder.

**Pros:**
- Quick convergence in training.
- Better performance.
- Applies to any problem (computer vision, image captioning, NLP problems).

**Cons:**
- Slow inference time (can be resolved with larger batch size).

The implementation is in models.py (Performer).

3. Memory Consumption Issue on GPUs

I know that a Transformer architecture consumes a lot of memory, so I chose the Perceiver architecture for this task. In addition, I enabled automatic mixed precision (AMP) to cut memory use further, since half-precision values take half the space of single-precision ones.
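
To make the memory argument concrete, here is a back-of-the-envelope calculation for the attention score matrix of a single layer. It is only a sketch: the batch size, head count, sequence length, and number of latents are hypothetical values picked for round numbers, and it assumes the encoder uses Perceiver-style cross-attention from a fixed set of latents, which may differ from the details in models.py.

```python
def attention_score_bytes(batch, heads, num_queries, num_keys, bytes_per_value):
    """Memory for one layer's attention score matrix: batch x heads x queries x keys."""
    return batch * heads * num_queries * num_keys * bytes_per_value


# Hypothetical sizes chosen only to make the arithmetic concrete.
BATCH, HEADS, SEQ_LEN, NUM_LATENTS = 32, 8, 512, 64
FP32, FP16 = 4, 2  # bytes per value

# Standard self-attention: every input token attends to every input token.
self_attn = attention_score_bytes(BATCH, HEADS, SEQ_LEN, SEQ_LEN, FP32)

# Perceiver-style cross-attention: a fixed set of latents attends to the input.
cross_attn = attention_score_bytes(BATCH, HEADS, NUM_LATENTS, SEQ_LEN, FP32)

# The same cross-attention scores stored in half precision.
cross_attn_fp16 = attention_score_bytes(BATCH, HEADS, NUM_LATENTS, SEQ_LEN, FP16)

print(f"self-attention, fp32:  {self_attn / 2**20:.0f} MiB")
print(f"cross-attention, fp32: {cross_attn / 2**20:.0f} MiB")
print(f"cross-attention, fp16: {cross_attn_fp16 / 2**20:.0f} MiB")
```

With these numbers the score matrix shrinks from 256 MiB to 32 MiB when the queries are 64 latents instead of 512 tokens, and to 16 MiB when stored in half precision. The first saving grows with sequence length, because the cost is linear rather than quadratic in the input size.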

## V. Train/Test the Models
1. How to validate the model?
- Train: train.json
- Validation: dev.json
- Test/Inference: test.json

Here I compute the loss on dev.json during training and keep the model with the lowest validation loss. Inference is then run on test.json to compute the BLEU/METEOR/ROUGE/CIDEr scores. All loss values are saved in the stats section of the PyTorch checkpoint.

**See train.py for more details.**
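
The selection rule described above (record every loss, save only when the dev loss improves) can be sketched like this. It is an illustration of the idea rather than the code in train.py: the class name, the layout of the `stats` dictionary, and the loss values are all made up for the example.

```python
import math


class BestCheckpointTracker:
    """Record per-epoch losses and flag the epochs that improve the dev loss."""

    def __init__(self):
        self.stats = {"train_loss": [], "dev_loss": []}
        self.best_dev_loss = math.inf
        self.best_epoch = None

    def update(self, train_loss: float, dev_loss: float) -> bool:
        """Store both losses; return True when the caller should save a checkpoint."""
        self.stats["train_loss"].append(train_loss)
        self.stats["dev_loss"].append(dev_loss)
        if dev_loss < self.best_dev_loss:
            self.best_dev_loss = dev_loss
            self.best_epoch = len(self.stats["dev_loss"]) - 1
            return True
        return False


tracker = BestCheckpointTracker()
saved_epochs = []
# Made-up (train_loss, dev_loss) pairs, for illustration only.
for epoch, (train_loss, dev_loss) in enumerate([(2.1, 2.3), (1.6, 1.9), (1.2, 2.0)]):
    if tracker.update(train_loss, dev_loss):
        saved_epochs.append(epoch)  # a real loop would write the checkpoint here

print(saved_epochs, tracker.best_epoch)
```

In the last epoch the train loss keeps falling while the dev loss rises, so no checkpoint is written and the epoch-1 model remains the one used for inference.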

2. Experimental results

| Models    | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE\_L | CIDEr |
| --------- | ------ | ------ | ------ | ------ | ------ | -------- | ----- |
| Input     | 0.654  | 0.616  | 0.578  | 0.540  | 0.524  | 0.780    | 4.655 |
| LSTM w/o Norm      | 0.261  | 0.163  | 0.103  | 0.061  | 0.095  | 0.288    | 0.239 |
| Bi\_LSTM w/o Norm | 0.283  | 0.176  | 0.111  | 0.066  | 0.099  | 0.289    | 0.265 |
| LSTM w Norm       | 0.312  | 0.200  | 0.131  | 0.083  | 0.110  | 0.324    | 0.372 |
| Bi\_LSTM w Norm   | 0.305  | 0.194  | 0.125  | 0.078  | 0.108  | 0.322    | 0.351 |
| Performer | **0.738**  | **0.672**  | **0.614**  | **0.562**  | **0.402**  | **0.753**    | **5.153** |

It is clear from the experimental results that the Transformer-based model (Performer) performs much better on this task than the LSTM models. Compared with the Input row (presumably the unmodified disfluent questions scored against the ground truth), the Performer is ahead on BLEU and CIDEr but behind on METEOR and ROUGE\_L. It is unknown why the LSTM models were so inferior, although the models themselves are correct (they follow the original models exactly). I tried adding normalization (line 70 in models.py) to help the models converge better, but the results are still bad. For this reason, I will stop discussing the LSTM models; from here on, I will only use the Performer model and fine-tune it.
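
To see why the Input row already scores well, it helps to look at clipped n-gram precision, the quantity at the core of BLEU. The sketch below is not the evaluation code used for the table (the scores there most likely come from a standard NLG evaluation toolkit, which is an assumption on my part) and it leaves out BLEU's brevity penalty and corpus-level aggregation. The sentence pair is the sample from Section II with the question mark split off by hand.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams also found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


disfluent = "what is denmark a region of uh no france ?"
fluent = "what is france a region of ?"
for n in (1, 2):
    print(n, round(clipped_precision(disfluent, fluent, n), 3))
```

Because a disfluent question mostly reuses the words of its fluent rewrite, the unmodified input shares many n-grams with the ground truth. A model therefore has to beat a fairly strong baseline, not a score near zero.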

3. Overfitting/Overconfidence

Overfitting occurred in the LSTM models after about 10 epochs (not shown), while the Performer model (Transformer-based) was still converging.

- Performer Loss Value Graph:

![Performer](Results/Loss_PERFORMER.png)

- LSTM and Bidirectional LSTM Loss Value Graphs:

![LSTM](Results/Loss_LSTM_ED.png)

![BILSTM](Results/Loss_LSTM_BI_ED.png)
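
The visual check on these loss curves can also be written as a simple rule. The sketch below is a heuristic under stated assumptions, not the procedure used in train.py: the function names and the `patience=3` threshold are hypothetical, and the loss lists in any real use would come from the stats saved in the checkpoint.

```python
def epochs_since_best(dev_losses):
    """Number of epochs elapsed since the lowest dev loss so far."""
    best_index = min(range(len(dev_losses)), key=dev_losses.__getitem__)
    return len(dev_losses) - 1 - best_index


def looks_overfit(train_losses, dev_losses, patience=3):
    """Heuristic: dev loss stalled for `patience` epochs while train loss kept falling."""
    if len(dev_losses) <= patience:
        return False
    stalled = epochs_since_best(dev_losses) >= patience
    still_fitting = train_losses[-1] < train_losses[-1 - patience]
    return stalled and still_fitting
```

A curve where both losses are still decreasing returns `False`, which matches the "still converging" reading of the Performer graph. A curve where the train loss keeps dropping after the dev loss has turned upward returns `True`.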

## VI. Fine-tune the Best Model (Performer)
1. Use the trained model as the pretrained starting point for fine-tuning.
2. Reset the optimizer (AdamW) and set up a learning rate scheduler:
- Initial learning rate: 3e-4
- Reduce the learning rate by a factor of 10 at the following epochs: [5,10,20]
3. Train the model one more time, save the best model based on validation loss.
4. Experimental results:

| Models             | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE\_L | CIDEr |
| ------------------ | ------ | ------ | ------ | ------ | ------ | -------- | ----- |
| Input              | 0.654  | 0.616  | 0.578  | 0.540  | 0.524  | 0.780    | 4.655 |
| LSTM w/o Norm      | 0.261  | 0.163  | 0.103  | 0.061  | 0.095  | 0.288    | 0.239 |
| Bi\_LSTM w/o Norm  | 0.283  | 0.176  | 0.111  | 0.066  | 0.099  | 0.289    | 0.265 |
| LSTM w Norm        | 0.312  | 0.200  | 0.131  | 0.083  | 0.110  | 0.324    | 0.372 |
| Bi\_LSTM w Norm    | 0.305  | 0.194  | 0.125  | 0.078  | 0.108  | 0.322    | 0.351 |
| Performer          | 0.738  | 0.672  | 0.614  | 0.562  | 0.402  | 0.753    | 5.153 |
| Performer Finetune | 0.750  | 0.687  | 0.630  | 0.580  | 0.413  | 0.765    | 5.360 |
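
The learning rate schedule used for this fine-tuning run (start at 3e-4, divide by 10 at epochs 5, 10, and 20) can be written in closed form. The function below is a plain-Python sketch with a hypothetical name. In PyTorch the same schedule corresponds to `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[5, 10, 20]` and `gamma=0.1`, though whether train.py uses that class, and whether epochs are counted from 0, are assumptions.

```python
from bisect import bisect_right


def step_lr(epoch, base_lr=3e-4, milestones=(5, 10, 20), gamma=0.1):
    """Learning rate at `epoch` when it is multiplied by `gamma` at each milestone."""
    return base_lr * gamma ** bisect_right(milestones, epoch)


for epoch in (0, 4, 5, 10, 20):
    print(f"epoch {epoch:>2}: lr = {step_lr(epoch):.0e}")
```

By the last milestone the rate is a thousand times smaller than at the start, so the later epochs only make small adjustments to the already trained weights.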

5. Qualitative performance:

Example #1:
- Input: in what country is norse found no wait normandy not norse?
- Output: in what country is normandy not norman?
- Ground-truth: in what country is normandy located?

Example #5:
- Input: when no what century did the normans first gain their separate identity?
- Output: in what century did the normans first gain their separate separate identity?
- Ground-truth: what century did the normans first gain their separate identity?

Example #10:
- Input: who was the duke in the kingdom of sicily sorry in the battle of hastings?
- Output: who was the duke in the kingdom of primality hastings?
- Ground-truth: who was the duke in the battle of hastings?
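
The three examples show two recurring error types: tokens that never appeared in the input ("norman?", "in", "primality") and tokens emitted twice in a row ("separate separate"). A small helper can flag both automatically. This is a sketch with hypothetical function names and whitespace tokenization. It is only a heuristic, since a correct rewrite can also introduce a new word, as the ground truth of Example #1 does with "located?".

```python
def novel_tokens(source: str, output: str):
    """Tokens in the model output that never occur in the disfluent source."""
    seen = set(source.split())
    return [tok for tok in output.split() if tok not in seen]


def adjacent_repeats(output: str):
    """Tokens that appear twice in a row, a common decoding artifact."""
    toks = output.split()
    return [b for a, b in zip(toks, toks[1:]) if a == b]


source = "who was the duke in the kingdom of sicily sorry in the battle of hastings?"
output = "who was the duke in the kingdom of primality hastings?"
print(novel_tokens(source, output), adjacent_repeats(output))
```

Running the same two checks over the whole test set would give a rough count of how often each error type occurs, which is cheaper than reading every output by hand.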