# Review Report on "[HOC97] Long Short-Term Memory"

## Introduction

Analytical methods to gain insight into sequential data are becoming increasingly important in today's world. Predictive time series analytics and natural language processing are examples of this major development, which we can see in many economic fields in the course of digitalization. As I am a student in a digital health degree, I am particularly interested in applications in the medical domain. For example, evaluating medical observations such as vital signs and ECG, document analysis and categorization for electronic health records, and translation of speech and writing are high-impact applications. In my master's thesis I am going to tackle the challenge of classifying nurse activities from motion sensors, which is also a kind of time series data. This kind of data is often analyzed by recurrent neural networks (RNNs), as they can hold a memory of past events via feedback loops. In this literature review, I will examine the recurrent network architecture of Long Short-Term Memory (LSTM) networks developed by Sepp Hochreiter and Jürgen Schmidhuber (1997).

## Content (250/300)

The research paper proposes a novel recurrent neural network architecture to tackle the challenge of vanishing and exploding gradients. The problem occurs when error signals are carried over multiple time steps, as is characteristic of recurrent networks. With every additional back-propagated time step, the error signal is multiplied by another factor determined by the recurrent weights and the derivatives of the activation functions, so its magnitude changes exponentially with the number of steps. Repeated multiplication leads either to diverging values if the factors have absolute values larger than $1$, or to values converging to $0$ if the factors lie in the interval $(-1, 1)$. Both circumstances make learning significantly harder, since the network can no longer propagate a usable error signal across long time lags. The concrete risk of these so-called vanishing and exploding errors depends on the chosen activation functions and weights.
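The effect of repeated multiplication can be illustrated with a few lines of Python. This is only a minimal sketch under a simplifying assumption: the per-step factor is treated as a single constant scalar, whereas in a real network it depends on the weights and activation derivatives at each step. The function name `backpropagated_scale` is my own and does not appear in the paper.

```python
def backpropagated_scale(factor, steps):
    """Scale of an error signal after `steps` multiplications by `factor`.

    Assumption: the same scalar factor applies at every time step.
    """
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale


# Compare a shrinking, a neutral, and a growing per-step factor.
for factor in (0.9, 1.0, 1.1):
    print(f"factor {factor}: scale after 100 steps = "
          f"{backpropagated_scale(factor, 100):.3e}")
```

With a factor of $0.9$ the signal shrinks to the order of $10^{-5}$ after 100 steps, with $1.1$ it grows to the order of $10^{4}$, and only a factor of exactly $1.0$ leaves it unchanged.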

The central idea presented in the paper to circumvent the problem is the use of memory cells instead of feeding the error directly back through ordinary recurrent units. Each memory cell is built around a self-connected unit with the identity activation function and a fixed weight of $1.0$ (constant error carousel, CEC). Because the error flowing through this self-connection is multiplied by exactly $1$ at every step, it neither vanishes nor explodes. The unit is "guarded" by two multiplicative gate units, an input gate and an output gate, which regulate write and read access to the CEC, so the network can learn when to store information and when to use it for calculations. Several memory cells that share the same gates can be combined into a memory block to store more information at a time. Neural networks using these memory cells are called Long Short-Term Memory (LSTM) networks.
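The following Python sketch shows how I understand the forward pass of a single memory cell. It is a simplified illustration, not the authors' implementation: the gate activations `y_in` and `y_out` are passed in directly instead of being computed from learned weights, and `tanh` is used as an assumed squashing function for both the cell input and the cell output. All names are hypothetical.

```python
import math


def memory_cell_step(state, net_c, y_in, y_out):
    """One forward step of a simplified memory cell (input and output gate only).

    state : internal CEC state from the previous time step
    net_c : net input to the cell
    y_in  : input gate activation in [0, 1] (write access)
    y_out : output gate activation in [0, 1] (read access)
    """
    # CEC: self-connection with weight 1.0 and identity activation, so the
    # old state is carried over unchanged and the gated input is added.
    new_state = 1.0 * state + y_in * math.tanh(net_c)
    # The output gate scales what the rest of the network gets to read.
    output = y_out * math.tanh(new_state)
    return new_state, output


# Write once with the input gate open and the output gate closed.
state, _ = memory_cell_step(0.0, net_c=2.0, y_in=1.0, y_out=0.0)
stored = state

# Keep both gates closed for 1000 steps while distracting input arrives.
for _ in range(1000):
    state, out = memory_cell_step(state, net_c=-3.0, y_in=0.0, y_out=0.0)

# Open the output gate to read the stored value.
final_state, read_out = memory_cell_step(state, net_c=-3.0, y_in=0.0, y_out=1.0)
print(stored, final_state, read_out)
```

With the input gate closed, the state after 1000 steps is identical to the value written at the beginning, which is the behaviour the CEC is designed to provide.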

## Innovation (80/300)

At the time the work was published, vanishing and exploding errors led to significant limitations of recurrent neural networks. Comparisons of recurrent neural network architectures show that few of them can cope with very long time lags because of these effects. LSTM not only copes with such lags successfully, but also finds solutions faster and more reliably. The problem had already been identified six years earlier (Hochreiter 1991), and LSTM was the first solution to it.

## Technical quality (220/200)

The technical quality of the work is generally high. The mathematical foundations of the developed concept are sound, and the experiments demonstrate that it works. Edge cases have been considered and potential flaws identified, together with proposed mitigation techniques. The model has been tested and compared on both conventional problems and carefully designed new ones. For comparison, multiple representative network types were implemented as described in the corresponding papers and run against LSTM in the authors' own setup, ensuring comparability of the results. After its initial version, each experiment was modified to be even harder in order to push the limits of the architecture's capabilities.

Despite the great benchmark setup, there are also some critical remarks. Firstly, the problems described and tested are purely artificial. They show the capabilities of each architecture in a very pure way, yet a method can behave completely differently on real data. Also, the network architectures seem to have been chosen somewhat arbitrarily. This could, however, be due to a lack of knowledge on my part about the problem complexities. The network topology is explained to some degree for each experiment, e.g. with regard to solving the abuse problem and keeping the number of weights comparable to the other networks. Also, different hyperparameters were tested and compared. As inconclusive as this may seem, it is fairly common in the machine learning domain.

## Application and X-factor (208/200)

The work described in the paper was very promising at the time it was published, and LSTM is still a widely used network architecture. It effectively prevents errors in recurrent models from vanishing or diverging, even for very long time lags. However, as models have grown much bigger and deeper, new problems with recurrent networks have emerged that also affect LSTM. The problem that memory cells solve re-appears on a macro scale when building large networks with chained memory cells. Hence even LSTMs are limited in how much they can remember. Another issue with recurrent neural networks in general is the higher computational cost compared to other network architectures. This becomes increasingly important in the era of cloud computing.

Convolutional neural networks have been shown to outperform LSTM while using fewer parameters (Elbayad et al. 2018, Bai et al. 2018). LSTM tries to mitigate the problem of vanishing gradients, whereas CNNs avoid the problem altogether by not having a sequential backpropagation path. Other solutions to the problem are attention-based, like the Transformer network (Vaswani et al. 2017) or hierarchical attention architectures with a maximum backpropagation path length of $\log(n)$, where $n$ is the number of attention layers (Yang et al. 2016).
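The difference in path length can be sketched numerically. The example below rests on assumptions of my own that are not taken from the cited papers in this exact form: a plain recurrent network in which the error has to cross one recurrent step per time step, and a causal convolutional stack with one convolution per layer whose dilation doubles from layer to layer. The function names are hypothetical.

```python
def rnn_path_length(seq_len):
    """Recurrent steps an error signal crosses from the last to the first time step."""
    return seq_len - 1


def dilated_cnn_layers(seq_len, kernel_size=2):
    """Layers needed until the receptive field covers `seq_len` inputs.

    Assumption: one causal convolution per layer, dilation 2**layer.
    """
    layers, receptive_field = 0, 1
    while receptive_field < seq_len:
        # Each layer widens the receptive field by (kernel_size - 1) * dilation.
        receptive_field += (kernel_size - 1) * 2 ** layers
        layers += 1
    return layers


for n in (16, 1024):
    print(f"sequence length {n}: RNN path = {rnn_path_length(n)}, "
          f"CNN layers = {dilated_cnn_layers(n)}")
```

Under these assumptions the recurrent path grows linearly with the sequence length, while the depth of the convolutional stack grows only logarithmically.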

## Presentation (85/100)

When reading the paper, I found it reasonably easy to follow and well structured. Subsections and paragraphs were logically organized and labelled, which eased orientation while reading as well as searching for specific information. Mathematical conclusions were explained intuitively in the text, with pointers to further reading for deeper understanding and background. Furthermore, illustrations of the developed concept were given and appropriately designed to accompany the explanations. The test setups could have been explained better with regard to architectural design decisions, as mentioned above.

## References

[HOC97][0]: Hochreiter, S, Schmidhuber, J. 1997, 'Long Short-Term Memory'. *Neural Computation*, 9(8), p.1735–1780.

[HOC91][1]: Hochreiter, S. 1991, 'Untersuchungen zu dynamischen neuronalen Netzen', Diplomarbeit, Technische Universität München.

[ELB18][2]: Elbayad, M, Besacier, L, Verbeek, J. 2018, 'Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction'. *CoNLL 2018 - Conference on Computational Natural Language Learning*, p.97–107.

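[BAI18][3]: Bai, S, Kolter, JZ, Koltun, V. 2018, 'An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling'. arXiv:1803.01271.

[VAS17][4]: Vaswani, A, Shazeer, N, Parmar, N, Uszkoreit, J, Jones, L, Gomez, AN, Kaiser, Ł, Polosukhin, I. 2017, 'Attention Is All You Need'. *Advances in Neural Information Processing Systems 30 (NIPS 2017)*.

[YAN16][5]: Yang, Z, Yang, D, Dyer, C, He, X, Smola, A, Hovy, E. 2016, 'Hierarchical Attention Networks for Document Classification'. *Proceedings of NAACL-HLT 2016*.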

[0]:https://www.bioinf.jku.at/publications/older/2604.pdf
[1]:http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf
[2]:https://arxiv.org/pdf/1808.03867.pdf
[3]:https://arxiv.org/pdf/1803.01271.pdf
[4]:https://arxiv.org/pdf/1706.03762.pdf
[5]:http://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf