# Names and Net IDs

- Kyle Hwang (ksh6947)
- Michael Lin (qlb968)
- Dylan Wu (dwg0364)

# Abstract

Our final project seeks to use `pytorch` to replicate the "Attention is All You Need" paper, which introduced the Transformer as a way to improve upon existing sequence transduction language models. We will attempt to implement the model's architecture and train the model on a subset of the WMT 2014 English-French dataset. We will then perform analyses on the training results, and, where appropriate, compare the model performance against results from the paper.


# Goals and Discussion

## Essential Goals

- Generally, as we train the model for longer, the BLEU score should increase (we will include a chart to show this).

> Discuss what happened

## Desired Goals

- Produce translations that are not simply word-for-word mappings between the two language vocabularies but that reflect the context and attention mappings of the whole sentence. We could test this with inputs where a gendered noun and adjective must be translated correctly (i.e., the Transformer can differentiate between genders).

  - "I am a tall man" vs. "I am a tall woman"

> We were not able to implement a direct test, but based on the sample outputs that we print during training, the model's output matches the true translation even when the sentence contains gendered words, so we are led to assume that this goal has been achieved. Early on, the model could not do this, much less output a coherent sentence. After testing against the Transformer that `torch` offers, we were able to tweak our model so that it successfully translates from a non-English language to English, including gendered words.

- Have a consistent BLEU score across languages when training the same model. We would test with another language dataset (most likely English to French) and expect similar BLEU score performance under the same training settings and time.

> Discuss what happened

## Stretch Goals

- Based on the findings from the paper "Reformer: The Efficient Transformer", we would try to implement and quantify the impact of the suggested changes from that paper compared to the Transformer.

> Unfortunately, we did not finish the Transformer model with enough time to spare, so we were forced to abandon this goal.

- Try pretraining the model on English-to-French, then fine tune the model to translate English-to-Spanish (the motivation is that since both French and Spanish are Romance languages, the pretrained model could have already learned important parts of the mapping from English to any Romance language).

> Unfortunately, we did not finish the Transformer model with enough time to spare, so we were forced to abandon this goal.

## Other Challenges

Initially, we used cross entropy loss on the class labels, but the model had trouble converging and did not learn anything meaningful. After some tinkering, we tried cross entropy loss on the embedded output of the model instead. Intuitively this does not make much sense, because cross entropy loss treats its input as a probability space and the embedding space is not one, yet empirically it performed extremely, and suspiciously, well. So we took a different approach: since cross entropy loss on the class labels was not working for us, we opted for mean squared error on the embedded outputs. This strays from the source material, but given the experiences above, we believed it was worth trying.
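
To make the two shapes of loss concrete, here is a minimal sketch (not our actual training code) that contrasts cross entropy on class labels with mean squared error on embeddings. The tensor sizes, `pad_id`, the `to_vocab` projection, and the choice to detach the target embeddings are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch, seq_len, d_model, vocab_size = 2, 5, 16, 100
pad_id = 0  # assumed padding token id

# Stand-ins for the decoder output and the target token ids.
decoder_out = torch.randn(batch, seq_len, d_model)
target_ids = torch.randint(1, vocab_size, (batch, seq_len))

embedding = torch.nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
to_vocab = torch.nn.Linear(d_model, vocab_size)  # hypothetical output projection

# (a) Cross entropy on class labels: project to vocabulary logits and
#     compare against the target token ids, skipping padding positions.
logits = to_vocab(decoder_out)
ce_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    target_ids.reshape(-1),
    ignore_index=pad_id,
)

# (b) Mean squared error on embeddings: compare the decoder output directly
#     against the embedded targets. Detaching the targets is an assumption.
target_emb = embedding(target_ids).detach()
mse_loss = F.mse_loss(decoder_out, target_emb)
```

Variant (a) scores a distribution over the vocabulary, while variant (b) treats translation as regression in embedding space, which is why it no longer needs the outputs to behave like probabilities.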


# Code and Documentation

## `data.py`

This contains classes that process our data into the format we want, which makes the data easier to use. This includes preprocessing the data into a format that can be fed into the Transformer model, as well as tokenizing the initial text data and padding it into a tokenized list of length `max_length`.
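
The padding step can be sketched roughly as follows. This is an illustrative sketch rather than the code in `data.py`: the helper names `pad_or_truncate` and `padding_mask` and the padding id `PAD_ID = 0` are hypothetical.

```python
from typing import List

PAD_ID = 0  # hypothetical padding token id


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = PAD_ID) -> List[int]:
    """Return a list of exactly max_length ids, clipping or right-padding as needed."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


def padding_mask(token_ids: List[int], pad_id: int = PAD_ID) -> List[bool]:
    """True for real tokens, False for padding positions."""
    return [tok != pad_id for tok in token_ids]


padded = pad_or_truncate([5, 6, 7], max_length=5)
mask = padding_mask(padded)
```

A mask like this is typically what lets attention and the loss ignore padded positions.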

## `decoder.py`

This contains the decoder portion of the Transformer model as well as each individual decoder layer. This is what processes the outputs of the encoder and the previous output tokens to obtain the most likely next translated word.

## `encoder.py`

This contains the encoder portion of the Transformer model as well as each individual encoder layer. This is what processes the tokenized input using positional encoding, multihead self attention and feed forward neural networks. This output is then fed as a type of encoded information to the decoder.

## `multihead_attention.py`

This contains the Multi-Head Attention portion of the Transformer model, present in both the encoder and the decoder. This is what lets the model capture relationships between the tokens of an input. This includes the self attention mechanism and the weights that compute Q, K, and V from the embedded inputs with positional encoding.
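
As a rough illustration of the mechanism, the sketch below implements scaled dot-product attention as described in the paper, plus a helper that splits the model dimension into heads. The function names and the boolean mask convention (True means "may attend") are assumptions and may not match `multihead_attention.py`.

```python
import math
import torch


def split_heads(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Reshape (batch, seq_len, d_model) into (batch, num_heads, seq_len, d_head)."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads
    return x.reshape(batch, seq_len, num_heads, d_head).transpose(1, 2)


def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional boolean mask."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask is False are excluded from attention.
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


# Self attention over a random input, using the same tensor for Q, K, and V
# (the learned Q/K/V projections are omitted to keep the sketch short).
x = torch.randn(2, 5, 16)
heads = split_heads(x, num_heads=4)
causal = torch.tril(torch.ones(5, 5, dtype=torch.bool))
out, weights = scaled_dot_product_attention(heads, heads, heads, mask=causal)
```

The lower-triangular mask here is the kind a decoder would use so that a position cannot attend to later positions.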

## `optimizer.py`

This contains a specific version of the Adam optimizer, Dynamic LR Adam. This is relevant because it is what the paper used, and we use it to optimize our model.
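
For reference, the learning rate schedule described in the paper increases the rate linearly during warmup and then decays it with the inverse square root of the step number. The sketch below assumes that "Dynamic LR Adam" follows that schedule; the function name `transformer_lr` is hypothetical, and the defaults are the paper's base-model values.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid dividing by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises until warmup_steps, then decays.
schedule = [transformer_lr(s) for s in (1, 2000, 4000, 8000)]
```

One way to plug this into PyTorch would be `torch.optim.lr_scheduler.LambdaLR` with a base learning rate of 1.0, though a custom optimizer wrapper works as well.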

## `position_wise_feed_forward.py`

This is the position-wise feed-forward network used in the encoder and decoder. It contains two fully connected linear layers with a ReLU activation in between.
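
A minimal sketch of such a block is shown below. The class name and the default sizes (`d_model=512`, `d_ff=2048`, the paper's base configuration) are assumptions and may differ from our file.

```python
import torch
from torch import nn


class PositionWiseFeedForward(nn.Module):
    """Two linear layers with a ReLU in between, applied to every position independently."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); nn.Linear acts on the last dimension,
        # so each position is transformed with the same weights.
        return self.linear2(torch.relu(self.linear1(x)))


ffn = PositionWiseFeedForward(d_model=16, d_ff=64)
x = torch.randn(2, 5, 16)
y = ffn(x)
```

"Position-wise" means the output at a position depends only on the input at that position, so the layer does not mix information across the sequence.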

## `positional_encoder.py`

This contains our `PositionalEncoding` class. This is required as this is how the Transformer understands the positions of the inputs, something that is naturally included in an LSTM, but excluded in the Transformer since it opts to use the self attention mechanism to encode context.
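
The paper's sinusoidal encoding can be sketched as follows. This assumes our `PositionalEncoding` class uses the fixed sine and cosine formulation from the paper and that `d_model` is even; the function name is hypothetical.

```python
import math
import torch


def sinusoidal_positional_encoding(max_length: int, d_model: int) -> torch.Tensor:
    """Return a (max_length, d_model) table; assumes d_model is even.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    position = torch.arange(max_length, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


pe = sinusoidal_positional_encoding(max_length=10, d_model=16)
```

The table is added to the token embeddings before the first encoder or decoder layer, which gives each position a distinct signature.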

## `transformer_runner.py`

This contains the runner of the model. This is how the model is run by using the trainer in `transformer_trainer.py`. This includes loading the data, initializing the model and passing the parameters to the trainer to train.

## `transformer_trainer.py`

This contains the trainer of the model. This is where the model gets trained, the loss gets calculated, and the models get stored to local storage.

## `transformer.py`

This contains the Transformer model, which puts everything together. This is the end goal of the project: creating a Transformer model that includes everything from the embedding, positional encoding, self-attention, and multihead attention to the encoder and decoder. This is the model that we used in the trainer to train for a translation task.


# Reflections

## What was interesting?
Despite having a much smaller dataset than the one used in the actual paper, we were able to see the model converge and output translations similar to, or the same as, the target. This is exciting because one of our largest concerns early in the project was that we would not see meaningful results, given our limited computational power and the amount of data we are using. This goes to show how efficient Transformer models are when trained on sequence-based datasets such as natural language.

## What was difficult?
The most frustrating thing about this project was making everything align properly and finding out why components did not behave as expected. When first starting the project, it was pretty simple: we each worked on our respective tasks, implementing an individual part of the Transformer model. Later in the process, we ran into much difficulty putting everything together. In fact, creating the individual parts probably took around 2 weeks, give or take, but the rest of the time was spent on alignment, making it work, and correcting slight misinterpretations we had of the paper. Along this train of thought, another difficulty was one common to large models: time. To see whether our model was really learning anything, we had to let it run, and only then could we see the results. If giving advice to someone else doing this same project, we would encourage them to truly understand the nuances of the model so that they have a more direct and clear path towards the end.

## What's left to do?
We believe that there are still more tweaks to be made for this model to truly work as it should, so that is the first work we would do. Then, as we were unfortunately unable to accomplish our stretch goals, we would look into those. With a million dollars, a good amount of the money would go towards computational resources. With faster training, we would be able to iterate on and improve our model faster.