## Turning Equations into LaTeX using an attention-based seq2seq model
CS230<br>
Adam Jensen, Henrik Marklund<br>
oojensen@stanford.edu, marklund@stanford.edu<br>

### 1. Background
We were typing up our CS229 homework and realized that we were spending more time on LaTeX than on the actual homework. We did a quick informal survey among students in the Huang basement, who concurred: yes, typesetting is a major inconvenience! Many said they spent over 5 and 10 hours per homework in CS221 and CS229, respectively. Another typical response was: “I chose not to typeset the last CS229 homework, as I did not have time”.<br>

There is currently no good solution for converting handwritten notes into LaTeX, and STEM students around the world struggle as a consequence. The long-term goal is to train a model that takes a scan of an A4 page and turns it into typeset LaTeX. <br>

We limit the scope of this project to:
- Creating a seq-to-seq model with attention that turns images of digitally rendered equations into LaTeX. <br>

This is also a request-for-research at OpenAI: https://openai.com/requests-for-research/#im2latex 


### 2. Dataset
Harvard researchers crawled Wikipedia for mathematical equations and gathered about 100k equations from which one can generate images. Dataset: Harvard im2latex-100k (described in __[Deng et al., 2016](https://arxiv.org/pdf/1609.04938.pdf)__). Guillaume Genthial at Stanford was kind enough to send us his generated images (as generating them takes quite some time). We do some additional processing (padding and additional downsampling).

Here is an example image with the corresponding LaTeX:


__LaTeX__:
\widetilde \gamma \_ { \mathrm { h o p f } } \simeq \sum \_ { n > 0 } \widetilde { G } \_ { n } { \frac { ( - a ) ^ { n } } { 2 ^ { 2 n - 1 } } }

##### Histogram: Sequence lengths

When running the code you will see more examples and more details about the dataset.

### 3. Progress through the project (sequential):

__1. Dataset loaded and processed.__ <br>
We have approx. 80k images with corresponding LaTeX loaded and preprocessed. For now we skip sequences that are too long and images that are too big.<br><br>
__2. Encoder-Decoder model up and running in Keras.__ <br>
__3. Overfit to 10 examples__<br>
After introducing batch normalization we managed to overfit to 10 examples. At this point it was still hard to overfit to many more examples.<br><br>
__4. Training with decreasing loss on 40k images / sequences__ <br>
Training is really slow, which makes it important that we are systematic and smart about our experiments going forward.
We implemented gradient clipping and a learning rate schedule.<br><br>
__5. For debugging: Created an analogous but less complex problem.__ <br>
Since it was hard to know why overfitting to a larger number of examples was so difficult, we created a simpler but analogous problem: turning pictures of text into text (treating each character as a separate token to keep the problem analogous). Training was a lot easier, and we could much more easily overfit on a larger number of training examples.<br><br>
__6. Switched to TensorFlow and the Seq2seq library.__ <br>
__7. Got the Keras model to work without attention.__ <br>
__8. Got the TensorFlow model with attention to work, which improved accuracy considerably.__ <br>

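As an illustration of step 1, here is a minimal sketch of skipping overly long sequences and overly large images. It assumes each example is stored as an image size plus a token list; the `Example` record, the function names and all three thresholds are hypothetical and are not the values used in this project.

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    """One image/formula pair; only the image size is needed for filtering."""
    height: int
    width: int
    tokens: List[str]


# Hypothetical thresholds, chosen only for illustration.
MAX_TOKENS = 50
MAX_HEIGHT = 60
MAX_WIDTH = 400


def keep_example(ex: Example) -> bool:
    """Return True if the example is small enough to train on."""
    return (
        len(ex.tokens) <= MAX_TOKENS
        and ex.height <= MAX_HEIGHT
        and ex.width <= MAX_WIDTH
    )


def filter_dataset(examples: List[Example]) -> List[Example]:
    """Drop examples whose formula is too long or whose image is too big."""
    return [ex for ex in examples if keep_example(ex)]
```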

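Step 4 mentions gradient clipping and a learning rate schedule. The sketch below shows both ideas in plain Python, assuming clipping by global L2 norm and a step-decay schedule. The clipping threshold and the shape of the schedule used in this project are not stated here, so every constant and function name is a placeholder.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale the gradients so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def step_decay(epoch: int, base_lr: float = 1e-3,
               drop: float = 0.5, epochs_per_drop: int = 5) -> float:
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return base_lr * drop ** (epoch // epochs_per_drop)
```

Clipping keeps a single large gradient from destabilizing the recurrent decoder, while the decaying learning rate allows large steps early and finer steps later in training.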

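For the simpler debugging problem in step 5, each character is treated as its own token. A minimal sketch of such a character-level tokenization follows. The `<pad>`, `<start>` and `<end>` markers and the helper names are assumptions made for illustration, not necessarily what the project code uses.

```python
from typing import Dict, List

SPECIAL_TOKENS = ["<pad>", "<start>", "<end>"]  # hypothetical marker names


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Map the special tokens and every distinct character to an integer id."""
    chars = sorted({ch for text in texts for ch in text})
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + chars)}


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Encode text as <start> chars <end>, padded with <pad> up to max_len."""
    ids = [vocab["<start>"]] + [vocab[ch] for ch in text] + [vocab["<end>"]]
    if len(ids) > max_len:
        raise ValueError("text is longer than max_len allows")
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```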


### 4. The model

#### Overview
Our model is based on a typical seq-to-seq model for translation. We started out with a seq-to-seq translation model in Keras using LSTMs (__[described by Francois Chollet](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)__). We replaced the encoder with a convolutional neural network, as described in the paper __[Image to Latex by Genthial & Sauvestre (2016)](http://cs231n.stanford.edu/reports/2017/pdfs/815.pdf)__. The conv. network design is one of the versions __[here](https://github.com/guillaumegenthial/im2latex/blob/master/model/encoder.py)__. We have one model for training and one model for inference (using the weights from the first model).<br><br>


#### Encoder


#### Decoder


##### Without attention


##### With attention
Embedding Size: 80

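To make the attention step concrete: at each decoding step the decoder scores every encoder position, normalizes the scores with a softmax, and uses the weighted sum of encoder features as its context vector. The sketch below assumes plain dot-product scoring; the scoring function actually used in our TensorFlow model may differ, and all names are illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state: Vector,
           encoder_features: List[Vector]) -> Tuple[Vector, Vector]:
    """Return (context vector, attention weights) for one decoding step.

    encoder_features holds one feature vector per image position,
    i.e. the CNN feature map flattened into a sequence.
    """
    # Dot-product score between the decoder state and each position.
    scores = [sum(f * s for f, s in zip(feat, decoder_state))
              for feat in encoder_features]
    weights = softmax(scores)
    dim = len(encoder_features[0])
    # Context is the attention-weighted average of the encoder features.
    context = [sum(w * feat[d] for w, feat in zip(weights, encoder_features))
               for d in range(dim)]
    return context, weights
```

The returned weights are what one would plot over the input image to visualize where the decoder is looking at each step.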

#### Greedy vs. Beam Search

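The difference between the two decoding strategies can be sketched as follows. Greedy decoding commits to the most likely token at every step, while beam search keeps the `beam_width` best partial sequences and can recover from a locally attractive but globally worse choice. The `next_probs` callback stands in for the decoder and, like the `<end>` marker, is a hypothetical interface used only for this illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: given the tokens decoded so far, return a
# probability for each possible next token (stands in for the decoder).
NextTokenFn = Callable[[List[str]], Dict[str, float]]

END = "<end>"  # hypothetical end-of-sequence marker


def greedy_decode(next_probs: NextTokenFn, max_len: int) -> List[str]:
    """Pick the single most likely token at every step."""
    seq: List[str] = []
    while len(seq) < max_len:
        probs = next_probs(seq)
        token = max(probs, key=probs.get)
        seq.append(token)
        if token == END:
            break
    return seq


def beam_decode(next_probs: NextTokenFn, max_len: int,
                beam_width: int) -> List[str]:
    """Keep the beam_width best partial sequences by total log-probability."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_len):
        candidates: List[Tuple[float, List[str]]] = []
        for score, seq in beams:
            if seq and seq[-1] == END:
                candidates.append((score, seq))  # finished; carry over as is
                continue
            for token, p in next_probs(seq).items():
                if p > 0.0:
                    candidates.append((score + math.log(p), seq + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == END for _, seq in beams):
            break
    return beams[0][1]
```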
### 5. Training

#### Learning Schedule
<img src="model_visualizations/learning_rate_schedule.jpg" height="40%" width="40%" alt="Learning rate schedule" title="Learning rate" />

#### Other parameters
Mini-batch size:
Gradient clipping: 


### 6. Results

#### Experiments


#### Example predictions


#### Error analysis

#### Visualizing the attention


### 7. Next steps



### 8. References