Recurrent Neural Networks

* Objectives:
    * Review of what neural networks are (Multilayer Perceptron)
    * Simple RNN vs MLP
    * Benefits of Intralayer Recurrent Connections
    * Example of RNN in text data
    * Multilayer RNNs
    * Keras Neural Network API For RNN
    * LSTM

1) Recurrent Neural Networks Basics
* What distinguishes recurrent neural networks is that connections between neurons can form a **directed cycle**. This gives the network the ability to maintain a state based on previous inputs, so it can model **temporal, sequential** behavior
* RNN deep-learning models can process:
    * **Text** - understood as sequences of words or sequences of characters
    * **Timeseries**
    * **Sequence data**
* Two fundamental deep-learning algorithms for **sequence processing**:
    * Recurrent neural network
    * **1D Convnets** - 1-dimensional version of 2D convnets
* RNN Use Cases:
    * **Pattern recognition**: Handwriting, Captioning Images
        * **Document/Timeseries classification** - such as identifying the topic of an article or the author of a book
        * **Sentiment Analysis** - such as classifying the sentiment of tweets or movie reviews as positive or negative
    * **Sequential data**: Speech Recognition, Stock price prediction, generating text, and news stories
        * **Timeseries comparisons** - such as estimating how closely related two documents or two stock tickers are
        * **Sequence-to-sequence learning** - such as decoding an English sentence into French
        * **Timeseries forecasting** - such as predicting the future weather at a certain location, given recent weather data

2) Working with Text Data
* We can use text to produce a basic form of natural-language understanding, sufficient for applications including **document classification**, **sentiment analysis**, **author identification**, and even **question-answering (QA)** (in a constrained context)
* No deep-learning model **truly** understands text in a **human sense**; rather, these models can **map the statistical structure of written language**, which is sufficient to solve many simple textual tasks
* Deep learning for natural language processing is **pattern recognition applied to <u>words, sentences, and paragraphs</u>**, in much the same way that computer vision is **pattern recognition applied to <u>pixels</u>**
* Like all other neural networks, deep-learning models don't take as input **raw text**: they only take **numeric tensors**
* **Vectorizing text** - the process of transforming text into numeric tensors. This can be done in **multiple ways**:
    * Segment text into **words**, and transform each word into a vector
    * Segment text into **characters**, and transform each character into a vector
    * Extract **n-grams of words or characters**, and transform each n-gram into a vector
        * **N-grams** - overlapping groups of multiple consecutive words or characters
            * Extracting **n-grams** is a form of feature engineering, which deep learning does away with, replacing it with **hierarchical feature learning**
            * 1-D convnets and recurrent neural networks are capable of learning representations for groups of words and characters **without being explicitly told about the existence of such groups** by looking at continuous word or character sequences
        * e.g. "bag-of-2-grams" for sentence "The cat sat on the mat." $\rightarrow$ `{"The", "The cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the", "the mat", "mat"}`
        * e.g. "bag-of-3-grams" for sentence "The cat sat on the mat." $\rightarrow$ `{"The", "The cat", "cat", "cat sat", "The cat sat","sat", "sat on", "on", "cat sat on", "on the", "the", "sat on the", "the mat", "mat", "on the mat"}`
        * **Bag** - refers to the fact that we're dealing with a **set of tokens** rather than a list or sequence (tokens have no specific order)
            * **Bag of words** - the tokenization method of breaking text into words or n-grams and treating the resulting tokens as a set with no specific order
            * Since bag-of-words isn't an **order-preserving** tokenization method:
                * The general structure of the sentences is lost
                * It tends to be used in **shallow language-processing models** (e.g. logistic regression and random forest) rather than in deep-learning models
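
The two bag-of-n-grams examples above can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper name `bag_of_ngrams` and the regex-based tokenizer (which drops punctuation and keeps case) are assumptions made for illustration, not part of any library.

```python
import re

def bag_of_ngrams(text, max_n):
    """Return the set of all word n-grams of length 1..max_n found in `text`."""
    # Hypothetical tokenizer: keep alphabetic runs only, so "mat." becomes "mat".
    words = re.findall(r"[A-Za-z]+", text)
    bag = set()
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive words over the sentence.
        for start in range(len(words) - n + 1):
            bag.add(" ".join(words[start:start + n]))
    return bag

sentence = "The cat sat on the mat."
bag_2 = bag_of_ngrams(sentence, 2)  # 6 single words + 5 word pairs
bag_3 = bag_of_ngrams(sentence, 3)  # the above + 4 word triples
```

Because the result is a `set`, the order of the tokens in the sentence is discarded, which is exactly the property the term "bag" refers to.
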
* **Tokens** - collectively, the **different units** into which you can **break down text** (words, characters, or n-grams)
    * **Tokenization** - the process of breaking text into tokens
    ![text_token_vector](text_token_vector.png)
    * All text-vectorization processes consist of **applying some tokenization scheme** and then associating **numeric vectors with the generated tokens**
    * These vectors, packed into **sequence tensors**, are fed into deep neural networks
    * There are multiple ways to **associate a vector with a token**:
        * **One-hot encoding** - consists of associating a unique integer index with every word and then turning this integer index `i` into a binary vector of size `N` (the size of the vocabulary). The vector is all zeros except for the `i`th entry, which is 1
            * The most common, most basic way to turn a token into a vector
            * Keras has built-in utilities for one-hot encoding, which take care of important features such as stripping special characters from strings and only taking into account the `N` most common words in the dataset (a common restriction to avoid dealing with very large input vector spaces)
            * Vectors obtained through **one-hot encoding** are binary, sparse (mostly made of zeros), and very high-dimensional (same dimensionality as the number of words in the vocabulary) -- **sparse vectors**
            * (-) One-hot encoding words generally leads to vectors that are 20,000-dimensional or greater (e.g. capturing a vocabulary of 20,000 tokens)
            * **One-hot hashing trick** - **hash words into vectors of fixed size** instead of explicitly assigning an index to each word and keeping a reference of these indices in a dictionary
                * Use when the **number of unique tokens** in your vocabulary is **too large** to handle explicitly
                * Saves memory and allows online encoding of data
                * (-) Susceptible to **hash collisions**, where two different words may end up with the same hash, and thus any ML model looking at these hashes won't be able to tell the difference between these words
                * The likelihood of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed
        * **Token embedding / Word embeddings** - uses **dense word vectors** which are low-dimensional floating-point vectors
            * Typically used exclusively for words
        ![onehot_vs_wordemb](onehot_vs_wordemb.png)
            * Word embeddings are **learned from data**
            * It is common to see word embeddings that are 256-dimensional, 512-dimensional, or 1024-dimensional when **dealing with very large vocabularies**
            * (+) Word embeddings **pack more information** into **far fewer dimensions**
            * There are two ways to obtain word embeddings:
                * Learn word embeddings **jointly** with the main task you care about (e.g. document classification or sentiment prediction). In this setup, you start with **random word vectors** and then **learn vectors in the same way you learn weights of neural network**
                * Load into your model word embeddings that were **precomputed using a different ML task** than the one you're trying to solve (**Pretrained word embeddings**)
            * Method 1: **Learning Word Embeddings with the Embedding Layer**
                * Simplest way to associate a dense vector with a word is to choose the vector at **random**. 
                    * (-) The problem with this approach is that the resulting embedding space has **no structure** (e.g. the words "accurate" and "exact" may end up with completely different embeddings, even though they're interchangeable in most sentences)
                    * (-) It's difficult for a NN to make sense of such a **noisy, unstructured embedding space**
                * The **geometric relationships** between word vectors should reflect the **semantic relationships** between the words
                ![word_emb_space](word_emb_space.png)
                    * Word embeddings are meant to map **human language** into a **geometric space**
                    * In a reasonable embedding space, we should expect **synonyms** to be embedded into **similar word vectors**
                    * We should expect the geometric distance (e.g. L2 distance) between any two word vectors to relate to the semantic distance between associated words (words meaning **different things** are embedded at points **far away from each other**, whereas **related words are closer**)
                    * May also want the **specific directions** in embedding space to be meaningful
                        * e.g. dog to wolf or cat to tiger vector can be interpreted as "pet to wild animal" vector
                        * e.g. wolf to tiger or dog to cat vector can be interpreted as "canine to feline" vector
                * There may be an ideal word-embedding space that perfectly maps human language and can be used for any natural language task, but nothing of the sort has been computed yet
                    * Also, there is no such thing as a single human language: there are many different languages, and they are **not isomorphic**, since a language is a **reflection of a specific culture and a specific context**
                    * More pragmatically, what makes a good word-embedding space depends heavily on your task
                        * e.g. The **perfect word-embedding space** for an English-language **movie review sentiment analysis model** may look **different** from the perfect embedding space for an English-language **legal documentation classification model** because the importance of certain semantic relationships **varies from task to task**
                * It is reasonable to **learn** a new embedding space with every new task using the weights of a layer (**`Embedding` layer**)
            * Method 2: **Using Pretrained Word Embedding**
                * Problem: Very little training data is available to learn an appropriate task-specific embedding of the vocabulary
                * **Load embedding vectors** from a **precomputed embedding space** that you know is highly structured and exhibits useful properties (capturing generic aspects of language structure, much as pretrained convnets capture generic visual features)
                * Such word embeddings are generally computed from **word-occurrence statistics** (observations about what words co-occur in sentences or documents), using a variety of techniques
                    * In the early 2000s, the idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was first explored
                    * In 2013, Google released one of the most famous and successful word-embedding schemes, the **Word2Vec** algorithm. **Word2Vec** dimensions capture specific semantic properties (e.g. gender)
                * There are a variety of **precomputed word embeddings** that you can use in a Keras `Embedding` layer:
                    * **Word2Vec**
                    * **Global Vectors for Word Representation (GloVe)** - an embedding technique developed in 2014 by Stanford researchers, based on **factorizing a matrix of word co-occurrence statistics**. Its developers have made available precomputed embeddings for millions of English tokens, obtained from Wikipedia and Common Crawl data
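
The three ways of associating a vector with a token described in this section (one-hot encoding, the one-hot hashing trick, and a dense embedding lookup) can be compared side by side in `numpy`. This is only a sketch under stated assumptions: the sample sentences, `HASH_DIM`, `EMBEDDING_DIM`, and all function names are illustrative, whitespace splitting stands in for a real tokenizer, and the embedding matrix is filled with random values where a real model would use learned or pretrained weights.

```python
import zlib
import numpy as np

samples = ["The cat sat on the mat.", "The dog ate my homework."]

# Word-level one-hot encoding: assign each distinct word a unique index.
token_index = {}
for sample in samples:
    for word in sample.split():  # naive whitespace tokenization
        if word not in token_index:
            token_index[word] = len(token_index)

def one_hot(word):
    """Binary vector of vocabulary size with a single 1 at the word's index."""
    vector = np.zeros(len(token_index))
    vector[token_index[word]] = 1.0
    return vector

# Hashing trick: no dictionary is kept; the index comes from a hash of the word.
# zlib.crc32 is used because it is stable across runs (built-in hash() is not).
HASH_DIM = 1000  # illustrative size, much larger than the number of tokens

def hashed_one_hot(word, dim=HASH_DIM):
    vector = np.zeros(dim)
    vector[zlib.crc32(word.encode("utf-8")) % dim] = 1.0
    return vector

# Dense embedding: a (vocabulary_size, embedding_dim) matrix of floats.
# Random values stand in for weights that would normally be learned.
EMBEDDING_DIM = 8
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(token_index), EMBEDDING_DIM))

def embed(word):
    """Look up the dense vector for a word (a row selection)."""
    return embedding_matrix[token_index[word]]
```

An embedding lookup is equivalent to multiplying the one-hot vector by the embedding matrix (`one_hot(w) @ embedding_matrix`), which is why the dense vectors can be trained like any other layer weights.
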

3) Understanding **Recurrent Neural Networks (RNN)**
* A major characteristic of all neural networks so far (such as densely connected networks and convnets) is that they have **no memory**
    * With such networks, in order to **process a sequence or temporal series of data points**, you have to **show the entire sequence to the network at once** (turning it into a single data point) -- **Feedforward Networks**
        * e.g. An entire movie review is transformed into a single large vector and processed in one go
* Biological intelligence **processes information incrementally** while **maintaining an internal model of what it's processing**, built from **past information** and **constantly updated as new information comes in**
    * e.g. Reading a present sentence, processing it word by word while **keeping memories** of what came before giving a **fluid representation of the meaning** conveyed by this sentence
* A **recurrent neural network** adopts the same principle in an extremely simplified version: it processes sequences by **iterating through the sequence elements** and **maintaining a <u>state</u>** containing information relative to what it has seen so far
![rnn_loop](rnn_loop.png)
    * In effect, an RNN is a type of neural network that has an **internal loop**
    * The state of the RNN is **reset between processing two different, independent sequences** (e.g. two different IMDB reviews), so you still **consider one sequence a single data point** (a single input to the network)
    * What changes is that this data point is **no longer processed in a single step**; rather, the network internally **loops over sequence elements**
    ![mlp_vs_rnn](mlp_vs_rnn.png)
        * The double arrow in RNN indicates a weight in each direction (2 weights)
        * How many weights are in each architecture?
            * Vanilla MLP weights:
                * $W_{h\rightarrow y} = 4$
                * $W_{x\rightarrow h} = 8$
            * Vanilla RNN weights:
                * $W_{h\rightarrow y} = 4$
                * $W_{h\rightarrow h} = 16$
                * $W_{x\rightarrow h} = 8$
            * In contrast to the RNN, each input shown to the MLP is processed independently, with no state kept in between inputs
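
The weight counts listed above follow directly from the layer sizes. The figure is not reproduced here, so the sizes used below (2 inputs, 4 hidden units, 1 output) are an assumption inferred from the counts 8, 16, and 4; the function names are hypothetical and biases are not counted.

```python
def mlp_weight_count(n_inputs, n_hidden, n_outputs):
    """Weights of a one-hidden-layer MLP (biases excluded)."""
    return {
        "x->h": n_inputs * n_hidden,   # every input connects to every hidden unit
        "h->y": n_hidden * n_outputs,  # every hidden unit connects to every output
    }

def rnn_weight_count(n_inputs, n_hidden, n_outputs):
    """Same as the MLP plus the intralayer recurrent connections."""
    counts = mlp_weight_count(n_inputs, n_hidden, n_outputs)
    # Each hidden unit feeds every hidden unit (itself included) at the next step.
    counts["h->h"] = n_hidden * n_hidden
    return counts

mlp = mlp_weight_count(2, 4, 1)  # {"x->h": 8, "h->y": 4}
rnn = rnn_weight_count(2, 4, 1)  # adds "h->h": 16
```
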
* Pseudocode for RNN:
    ```python
    state_t = 0 # state at t
    for input_t in input_sequence: # iterates over sequence elements
        output_t = activation(dot(W, input_t) + dot(U, state_t) + b)
        state_t = output_t # the previous output becomes the state for the next iteration
    ```
    * The pseudocode above sketches the forward pass of a toy RNN; a runnable `numpy` version follows below
    * This RNN takes as **input a sequence of vectors**, which you'll **encode as a 2D tensor of size**: `(timesteps, input_features)`
    * It loops over timesteps, and at each timestep, it considers its current state at `t` and the input at `t` of shape `(input_features,)`, and combines them to obtain the output at `t`
    * Then, **set the state** for the **next step** to be this **previous output**
    * For the **first timestep**, the previous output **isn't defined**; hence, there is **no current state**
    * Thus, initialize first timestep with the state as **an all-zero vector** called **<u>initial state</u>** of the network
    * The computation of `output_t` is a **transformation of the input and state**, parameterized by **two matrices, `W` and `U`, and a bias vector `b`**
* `Numpy` implementation of simple RNN:
    ```python
    import numpy as np

    timesteps = 100 # number of timesteps in the input sequence
    input_features = 32 # dimensionality of the input feature space
    output_features = 64 # dimensionality of the output feature space

    inputs = np.random.random((timesteps, input_features)) # input data: random noise for the sake of the example

    state_t = np.zeros((output_features, )) # initial state: an all-zero vector

    # creates random weight matrices
    W = np.random.random((output_features, input_features))
    U = np.random.random((output_features, output_features))
    b = np.random.random((output_features, ))

    successive_outputs = []
    for input_t in inputs: # input_t is a vector of shape (input_features,)
        output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b) # combines the input with the current state (the previous output) to obtain the current output

        successive_outputs.append(output_t) # stores this output in a list
        state_t = output_t # updates the state of the network for the next timestep
    final_output_sequence = np.stack(successive_outputs, axis=0) # the final output is a 2D tensor of shape (timesteps, output_features)
    ```
    * In summary, an RNN is a **`for` loop** that **reuses quantities computed during the previous iteration** of the loop, nothing more
* There are many different RNNs that are essentially characterized by their **step function**
    * e.g. `output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)`
* A simple RNN, unrolled over time:
![rnn_unroll](rnn_unroll.png)
    * In this example, the final output is a 2D tensor of shape `(timesteps, output_features)`, where each timestep is the output of the loop at time `t`
    * Each timestep `t` in the output tensor contains information about timesteps `0` to `t` in the input sequence (**about the entire past**)
    * For this reason, in many cases, you **<u>don't</u> need this full sequence of outputs**; you **just need the <u>last output</u>** (`output_t` at the end of the loop) because it **already contains information about the <u>entire sequence</u>**
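
The claim that the last output already carries information about the entire sequence can be checked numerically. The sketch below reuses the same `tanh` step function as the `numpy` example; the sizes, the weight scale, and the helper name `simple_rnn_forward` are arbitrary choices for illustration. It changes only the first timestep of the input and measures how much the last output moves.

```python
import numpy as np

def simple_rnn_forward(inputs, W, U, b):
    """Run the tanh RNN step over `inputs` and return every timestep's output."""
    state_t = np.zeros(U.shape[0])  # initial state: all zeros
    outputs = []
    for input_t in inputs:
        state_t = np.tanh(W @ input_t + U @ state_t + b)
        outputs.append(state_t)
    return np.stack(outputs, axis=0)  # shape (timesteps, output_features)

rng = np.random.default_rng(0)
timesteps, input_features, output_features = 5, 4, 3
# Moderate random weights (an arbitrary choice) keep tanh away from saturation.
W = rng.normal(scale=0.5, size=(output_features, input_features))
U = rng.normal(scale=0.5, size=(output_features, output_features))
b = np.zeros(output_features)

inputs = rng.normal(size=(timesteps, input_features))
full_sequence = simple_rnn_forward(inputs, W, U, b)
last_output = full_sequence[-1]  # what you keep when the full sequence isn't needed

# Perturb only the first timestep and rerun the same network.
perturbed = inputs.copy()
perturbed[0] += 1.0
last_output_perturbed = simple_rnn_forward(perturbed, W, U, b)[-1]
change_in_last_output = np.abs(last_output_perturbed - last_output).max()
```

A nonzero `change_in_last_output` shows that the first input still influences the final state, although the influence typically shrinks as the sequence gets longer.
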

4) A recurrent layer in Keras
* **Keras** - a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano
    * Has since become TensorFlow's default high-level API (`tf.keras`)
    * Available Recurrent Layers:
        * **Recurrent**
        * **SimpleRNN**
        * **Long Short-Term Memory (LSTM)** (1997)
        * **Gated Recurrent Unit (GRU)** (2014)
* **`SimpleRNN`** - the process implemented in `numpy` above corresponds to an actual layer in Keras (`from keras.layers import SimpleRNN`)
    * One minor difference is that `SimpleRNN` processes **batches of sequences**, not a single sequence as in the `numpy` example
    * This means it takes inputs of shape `(batch_size, timesteps, input_features)`
    * Like all recurrent layers in Keras, `SimpleRNN` can be run in two different modes (controlled by `return_sequences`):
        * Return the **full sequence** of successive outputs for each timestep: a 3D tensor of shape `(batch_size, timesteps, output_features)`
        * Return only the **last output** for each input sequence: a 2D tensor of shape `(batch_size, output_features)`
    * It is sometimes useful to **stack** recurrent layers one after the other in order to increase the representational power of a network
        * In such a setup, you have to get **all of the intermediate layers** to return **full sequence** of outputs
    * (-) Doesn't perform well compared to the baseline, partly because its inputs only cover a limited number of words rather than the full sequences
    * (-) `SimpleRNN` isn't good at processing long sequences, such as text
    * (-) Although it should theoretically be able to retain at time `t` information about inputs seen many timesteps before, in practice such **long-term dependencies are <u>impossible</u> to learn**
        * (-) This is due to the **vanishing gradient problem**, similar to what happens in feedforward networks that are many layers deep: as you keep adding layers, the **network eventually becomes untrainable**
        ![seq_rnn](seq_rnn.png)
        * The theoretical reasons for the **vanishing gradient effect** were studied by Hochreiter, Schmidhuber, and Bengio in the early 1990s
        * Since the long-term information has to sequentially travel through all cells before getting to the present processing cell, it can be **easily corrupted** by being **multiplied many times by small numbers (magnitude < 1)**
    * In general, `SimpleRNN` is too simplistic to be of real use
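
The repeated multiplication by small numbers behind the vanishing gradient effect can be observed directly. The sketch below is a toy setup, not a training run: it builds a recurrent matrix `U` whose singular values all equal 0.5 (an assumption chosen to make the effect easy to see), feeds random inputs through a `tanh` recurrence, and tracks the norm of the Jacobian of the current state with respect to the initial state.

```python
import numpy as np

rng = np.random.default_rng(0)
units, timesteps = 16, 50

# Recurrent matrix with every singular value equal to 0.5:
# a random orthogonal matrix (from a QR decomposition) scaled by 0.5.
orthogonal, _ = np.linalg.qr(rng.normal(size=(units, units)))
U = 0.5 * orthogonal

state = np.zeros(units)
jacobian_product = np.eye(units)  # d state_t / d state_0, built up step by step
gradient_norms = []
for _ in range(timesteps):
    # Random noise stands in for the input term dot(W, input_t) + b.
    state = np.tanh(U @ state + rng.normal(size=units))
    # Jacobian of one step: diag(1 - tanh^2) @ U (row-wise scaling of U).
    step_jacobian = (1.0 - state ** 2)[:, None] * U
    jacobian_product = step_jacobian @ jacobian_product
    gradient_norms.append(np.linalg.norm(jacobian_product, 2))
```

Each step multiplies the gradient by a matrix whose spectral norm is at most 0.5, so `gradient_norms` is bounded by `0.5 ** t` and is numerically negligible long before the 50th timestep.
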
* **Long Short-Term Memory (`LSTM`)** - developed by Hochreiter and Schmidhuber (1997); the `LSTM` layer **saves information for later**, thus preventing older signals from gradually vanishing during processing
    ![lstm](lstm.png)
    * Both `LSTM` and `GRU` were designed to solve the **vanishing gradient effect**
    * An `LSTM` can be seen as a set of **switch gates**; a bit like **ResNet**, it can **bypass units and thus remember over longer spans of timesteps**
    ![seq_lstm](seq_lstm.png)
    * The `LSTM` layer is a variant of the `SimpleRNN` layer that adds a way to carry information across many timesteps
    * Imagine a conveyor belt running parallel to the sequence you're processing. **Information from the sequence** can **jump onto the conveyor belt at any point**, be **transported to a later timestep**, and **jump off intact when you need it**
    ![simple_to_lstm](simple_to_lstm.png)
    * The diagram above adds an **additional data flow** that **carries** information **across timesteps**
        * Call its values at different timesteps `Ct`, where `C` stands for **carry**
        * This information will have the following impact on the cell:
            * It will be **combined with the input connection and the recurrent connection** (via a dense transformation: a dot product with a weight matrix followed by a bias add and the application of an activation function)
            * It will **affect the state being sent to the next timestep** (via an activation function and a multiplication operation)
        * Conceptually, the **carry dataflow** is a way to **modulate the next output and the next state**
        * The way the next value of the carry dataflow is computed uses **three distinct transformations with their own weight matrices** 
    * Pseudocode details of LSTM architecture:
        ```python
        output_t = activation(dot(state_t, U_o) + dot(input_t, W_o) + dot(c_t, V_o) + b_o)
        i_t = activation(dot(state_t, U_i) + dot(input_t, W_i) + b_i)
        f_t = activation(dot(state_t, U_f) + dot(input_t, W_f) + b_f)
        k_t = activation(dot(state_t, U_k) + dot(input_t, W_k) + b_k)
        c_t_next = i_t * k_t + c_t * f_t # new carry state: new information plus the retained part of the old carry
        ```
        * Obtain the **new carry state** (the next `c_t`) by **combining `i_t, f_t, and k_t`**
    * Interpreting each of the operations of `LSTM`:
    ![anatomy_lstm](anatomy_lstm.png)
        * Multiplying `c_t` and `f_t` is a way to **deliberately forget irrelevant information** in the carry dataflow
        * `i_t` and `k_t` provide **information about the present**, updating the carry track with **new information**
        * These interpretations don't mean much, since what the operations **actually** do is determined by the **contents of the weights parameterizing** them
    * The weights are **learned in an end-to-end fashion**, starting over with each training round, making it **impossible to credit this or that operation with a specific purpose**
        * The specification of an RNN cell determines your **hypothesis space** (the space in which you'll search for a good model configuration during training)
        * It **doesn't** determine what the **cell does**; that is done by the **cell weights**
        * The same cell with different weights can be doing very different things
        * The combination of operations making up an RNN cell is better interpreted as a **set of constraints** on your search, **<u>not</u> as a design** in an engineering sense
        * The **choice of such constraints** (how to implement RNN cells) is better left to **optimization algorithms** (e.g. **genetic algorithms** or **reinforcement learning processes**) than to human engineers
    * In summary, an `LSTM` cell allows past information to be reinjected at a later time, thus fighting the vanishing-gradient problem
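
The LSTM pseudocode in this section can be turned into a small runnable `numpy` step. This is a sketch that follows the structure of the pseudocode, not Keras's implementation. The activation choices are assumptions, since the pseudocode leaves them generic: sigmoid for `i_t` and `f_t` (so they act as gates between 0 and 1) and `tanh` for `k_t` and `output_t`. The sizes, the weight scale, and the parameter dictionary layout are also illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(input_t, state_t, c_t, p):
    """One timestep following the pseudocode; returns (output_t, next carry)."""
    output_t = np.tanh(state_t @ p["U_o"] + input_t @ p["W_o"] + c_t @ p["V_o"] + p["b_o"])
    i_t = sigmoid(state_t @ p["U_i"] + input_t @ p["W_i"] + p["b_i"])  # how much new info to admit
    f_t = sigmoid(state_t @ p["U_f"] + input_t @ p["W_f"] + p["b_f"])  # how much old carry to keep
    k_t = np.tanh(state_t @ p["U_k"] + input_t @ p["W_k"] + p["b_k"])  # candidate new info
    c_next = i_t * k_t + c_t * f_t  # new carry: admitted new info + retained old carry
    return output_t, c_next

rng = np.random.default_rng(0)
input_features, units, timesteps = 4, 3, 6

# One (U, W, b) triple per transformation, plus V_o for the carry-to-output path.
params = {}
for gate in "oifk":
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(units, units))
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(input_features, units))
    params[f"b_{gate}"] = np.zeros(units)
params["V_o"] = rng.normal(scale=0.1, size=(units, units))

state_t = np.zeros(units)  # initial state
c_t = np.zeros(units)      # initial carry
for input_t in rng.normal(size=(timesteps, input_features)):
    state_t, c_t = lstm_step(input_t, state_t, c_t, params)
```

When `f_t` is close to 1 and `i_t` is close to 0, the carry passes through a step almost unchanged, which is the "conveyor belt" behavior described above.
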
* **Concrete `LSTM` example in Keras**
    * Only specify the output dimensionality of the `LSTM` layer
        * Leave every other argument (there are many) at the Keras defaults
        * Keras has good defaults, and the things will almost always "just work" without you having to spend time tuning parameters by hand
    * `LSTM` will perform better if you tune hyperparameters such as **embeddings dimensionality** or the `LSTM` output dimensionality
    * Could also add some regularization
    * `LSTM` is good at analyzing **global, long-term structure**, which isn't very helpful for a **sentiment analysis** problem
    * There are far more difficult NLP problems where the strength of `LSTM` is more apparent:
        * **Question-answering**
        * **Machine translation**
        * Natural language text compression
        * Handwriting recognition
        * Speech recognition
* Issues with `LSTM`/`GRU` technique:
    * (-) Still has sequential path from older past cells to the current one with an even more complicated path using **additive** and **forget** branches
    * (-) Remembers sequences of 100s, but **not** 1000s or 10,000s or more
    * (-) `LSTM`/`GRU` RNNs are not hardware friendly
        * (-) `RNN`/`LSTM` are difficult to train because their computation is memory-bandwidth-bound, which is a hardware designer's worst nightmare and ultimately limits the applicability of neural network solutions
        * `LSTM` requires 4 linear layers (MLP layers) per cell, run at each sequence timestep
        * Linear layers require large amounts of memory bandwidth to compute; in fact, they often cannot use many compute units because the system does not have enough memory bandwidth to feed them
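
The cost of the four linear layers per cell shows up in the parameter count. The sketch below assumes the common LSTM formulation in which each of the four transformations has an input matrix, a recurrent matrix, and a bias (it does not include the extra `V_o` term from the pseudocode earlier in these notes); the function names and sizes are illustrative.

```python
def simple_rnn_param_count(input_dim, units):
    # One linear layer: input weights + recurrent weights + bias.
    return units * (input_dim + units) + units

def lstm_param_count(input_dim, units):
    # Four linear layers of the same shape, one per transformation.
    return 4 * simple_rnn_param_count(input_dim, units)

input_dim, units = 32, 32
simple_count = simple_rnn_param_count(input_dim, units)  # 2080
lstm_count = lstm_param_count(input_dim, units)          # 8320
```

All of these weights have to be read from memory at every timestep, which is where the memory-bandwidth pressure described above comes from.
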

5) Advanced use of **recurrent neural networks**
* **Three** advanced techniques for **improving the performance and generalization power** of recurrent neural networks (demonstrated using **temperature-forecasting** problem):
    1. **Recurrent dropout** - specific, built-in way to use **dropout** to fight overfitting in recurrent layers
    2. **Stacking recurrent layers** - **increases the representational power** of the network (at the cost of higher computational loads)
    3. **Bidirectional recurrent layers** - presents the same information to a recurrent network in **different** ways, **increasing accuracy** and **mitigating forgetting issues**
* Timeseries problem pre-processing:
    * Plot the data at different time scales (periodicities) to see what the trends look like
    * No vectorization needed since data is already numerical
    * However, each timeseries in the data is on a **different scale** (e.g. temperature is between -20 and 30, while atmospheric pressure is around 1,000)
        * Normalize each timeseries independently so that they all take small values on a similar scale
        * Preprocess the data by subtracting the mean of each timeseries and dividing by its standard deviation (both computed on the training data only)
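
The normalization step above can be sketched in `numpy`. The data here is synthetic (random temperature-like and pressure-like columns), and the 700/300 split is a hypothetical choice; the point is only that the mean and standard deviation come from the training portion and are then applied to everything.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_train = 1000, 700  # hypothetical train/rest split

# Two synthetic timeseries on very different scales.
temperature = rng.uniform(-20, 30, size=n_samples)
pressure = rng.normal(1000, 10, size=n_samples)
data = np.column_stack([temperature, pressure])  # shape (n_samples, 2)

# Statistics are computed per column, on the training rows only.
mean = data[:n_train].mean(axis=0)
std = data[:n_train].std(axis=0)

# The same statistics are applied to training and held-out rows alike.
normalized = (data - mean) / std
```

Using training-only statistics avoids leaking information from the validation and test periods into the model.
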
* Try to establish a **baseline, non-ML score**:
    * Use a simple, common-sense approach as a **sanity check**; it establishes a baseline that we'll have to beat in order to **demonstrate the usefulness of more advanced machine-learning models**
    * Such common-sense baselines can be useful when you're **approaching a new problem for which there is no known solution yet**
    * For example, consider an **unbalanced classification task**, where some classes are much more common than others
        * If your dataset contains 90% instances of class A and 10% instances of class B, then a common-sense approach to the classification task is to **always predict "A" when presented with a new sample**
        * Such a classifier is 90% accurate overall, and any learning-based approach should therefore **beat this 90% score** in order to demonstrate usefulness
        * Sometimes such elementary baselines can prove surprisingly hard to beat
    * In this case, the temperature timeseries can safely be assumed to be **continuous** (the temperatures tomorrow are likely to be close to the temperatures today) as well as **periodical with a daily period**
        * Thus a common-sense approach is to **always predict that the temperature 24 hours** from now will be **equal to the temperature right now**
    * Evaluate this approach using **mean absolute error (MAE)** metric: `np.mean(np.abs(preds - targets))`
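
The common-sense baseline and its MAE can be sketched on synthetic data. The series below is an assumption made for illustration (a daily sine wave plus Gaussian noise, sampled hourly); on a real dataset the same two slices would be taken from the recorded temperature column.

```python
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24 * 60)  # 60 days of hourly readings (synthetic)
temperature = (
    10
    + 8 * np.sin(2 * np.pi * hours / 24)   # daily cycle
    + rng.normal(0, 1, size=hours.size)    # measurement noise
)

delay = 24  # predict 24 hours ahead
preds = temperature[:-delay]    # "temperature right now"
targets = temperature[delay:]   # actual temperature 24 hours later
baseline_mae = np.mean(np.abs(preds - targets))
```

Any learned model should achieve a lower validation MAE than `baseline_mae` before its extra complexity can be considered worthwhile.
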
* Also try to establish a **basic ML approach**:
    * Use a simple, cheap ML model (e.g. small, densely connected networks) before looking into complicated and computationally expensive models such as RNNs
    * This is the best way to make sure **any further complexity thrown** at the problem is **legitimate** and **delivers real benefits**
    * It is possible that the validation losses for this simple ML model are close to the no-learning baseline making the no-learning baseline difficult to outperform
* Now try **Gated Recurrent Unit (`GRU`)**
    * `GRU` layers work using the same principle as `LSTM`, but they're **somewhat streamlined** and thus **cheaper to run**, although they may **<u>not</u> have as much representational power** as `LSTM`
    * The **trade-off** between **computational expensiveness** and **representational power** is seen everywhere in ML
* **Using <u>recurrent dropout</u> to fight overfitting** - using the same dropout mask at every timestep
    * **Dropout** - randomly zeros out input units of a layer in order to break coincidental correlations in the training data that the layer is exposed to
    * It has long been known that **applying dropout before a recurrent layer hinders learning** rather than helping with regularization
    * In 2015, Yarin Gal (PhD on Bayesian deep learning) determined the proper way to use dropout with a recurrent network:
        * The same **dropout mask** (the same pattern of dropped units) should be **applied at every timestep** instead of a dropout mask that varies **randomly** from timestep to timestep
        * In order to regularize the representations formed by the recurrent gates of layers (e.g. `GRU` and `LSTM`), a **temporally constant dropout mask** should be applied to the **<u>inner recurrent activations</u> of the layer** (**a <u>recurrent</u> dropout mask**)
        * Using the **same dropout mask at every timestep** allows the network to properly **propagate its learning error through time**
            * If you used a **temporally random dropout mask**, it would **disrupt** the error signal and be harmful to the learning process
        * Yarin Gal built this mechanism directly into Keras: every recurrent layer has **two dropout-related arguments**, `dropout` and `recurrent_dropout`
            * **`dropout`** - a float specifying the dropout rate for input units of the layer
            * **`recurrent_dropout`** - a float specifying the dropout rate of the recurrent units
        * Networks regularized with dropout always take **longer to fully converge**, so it is best to train the network for **twice as many epochs**
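
The idea of a temporally constant dropout mask can be sketched on the toy `tanh` RNN. This is an illustration of the principle, not Keras's implementation: the function name, the sizes, and the use of "inverted dropout" scaling (dividing by the keep probability) are assumptions. The key detail is that both masks are sampled once per sequence, outside the timestep loop.

```python
import numpy as np

def rnn_with_recurrent_dropout(inputs, W, U, b, dropout, recurrent_dropout, rng):
    """Training-time forward pass of a tanh RNN with time-constant dropout masks."""
    input_features, units = W.shape[1], U.shape[0]
    # Sample each mask ONCE per sequence; dividing by the keep probability
    # keeps the expected magnitude of the masked vectors unchanged.
    input_mask = (rng.random(input_features) >= dropout) / (1.0 - dropout)
    recurrent_mask = (rng.random(units) >= recurrent_dropout) / (1.0 - recurrent_dropout)

    state_t = np.zeros(units)
    outputs = []
    for input_t in inputs:
        # The same two masks are reused at every timestep.
        state_t = np.tanh(W @ (input_t * input_mask) + U @ (state_t * recurrent_mask) + b)
        outputs.append(state_t)
    return np.stack(outputs), input_mask, recurrent_mask

rng = np.random.default_rng(0)
timesteps, input_features, units = 12, 5, 4
W = rng.normal(scale=0.1, size=(units, input_features))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)
inputs = rng.normal(size=(timesteps, input_features))

outputs, input_mask, recurrent_mask = rnn_with_recurrent_dropout(
    inputs, W, U, b, dropout=0.2, recurrent_dropout=0.5, rng=rng
)
```

Sampling the masks inside the loop instead would give the temporally random variant that the notes above describe as harmful to learning.
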
* **Stacking recurrent layers** - increasing the number of units in the layers or adding more layers to recurrent neural network
    * If the neural network model is no longer overfitting, but seems to have hit a performance bottleneck, consider **increasing the capacity of the network**
        * It is generally a good idea to increase the capacity of your network until overfitting becomes the primary obstacle (assuming you've already mitigated overfitting using dropout)
    * Increasing network capacity is typically done by increasing the number of units in the layers or adding more layers
        * e.g. The **Google Translate** algorithm has used a stack of seven large `LSTM` layers
    * To stack recurrent layers on top of each other in Keras, all intermediate layers should **return their full sequence of outputs (a 3D tensor)** rather than their output at the last timestep (`return_sequences=True`)
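
Why intermediate layers must return their full sequence can be seen with a toy `numpy` layer. The helper `simple_rnn_layer` is hypothetical and uses freshly sampled random weights; it processes a single sequence, so the "3D tensor" of the batched Keras case corresponds to a 2D array here.

```python
import numpy as np

def simple_rnn_layer(inputs, units, rng, return_sequences):
    """Toy tanh RNN layer with random weights (illustration only)."""
    input_features = inputs.shape[1]
    W = rng.normal(scale=0.1, size=(units, input_features))
    U = rng.normal(scale=0.1, size=(units, units))
    b = np.zeros(units)
    state_t = np.zeros(units)
    outputs = []
    for input_t in inputs:
        state_t = np.tanh(W @ input_t + U @ state_t + b)
        outputs.append(state_t)
    full_sequence = np.stack(outputs)  # (timesteps, units)
    return full_sequence if return_sequences else full_sequence[-1]

rng = np.random.default_rng(0)
sequence = rng.normal(size=(20, 8))  # (timesteps, input_features)

# Intermediate layers keep the time axis so the next layer has a sequence to iterate over.
hidden_1 = simple_rnn_layer(sequence, 16, rng, return_sequences=True)   # (20, 16)
hidden_2 = simple_rnn_layer(hidden_1, 16, rng, return_sequences=True)   # (20, 16)
# Only the last layer collapses the sequence to its final output.
final = simple_rnn_layer(hidden_2, 4, rng, return_sequences=False)      # (4,)
```

If an intermediate layer returned only its last output, the next recurrent layer would receive a single vector with no timesteps left to loop over.
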
* **Using Bidirectional RNNs** - exploits the order sensitivity of RNNs by **processing a sequence both ways**, which can **catch patterns that may be overlooked** by a unidirectional RNN
    ![bidirectional_rnn](bidirectional_rnn.png)
    * Can offer greater performance than a regular RNN on certain tasks
    * Frequently used in NLP (often called the "Swiss Army knife" of deep learning for NLP)
    * RNNs are notably order dependent, or time dependent (shuffling or reversing the timesteps can completely change the representations the RNN extracts from the sequence)
    * It consists of using **two regular RNNs** (e.g. `GRU` or `LSTM` layers), each of which **processes the input sequence in one direction** (**chronologically** and **antichronologically**), and then **merging their representations**
    * By processing a sequence both ways, a bidirectional RNN can catch patterns that may be overlooked by a unidirectional RNN
    * A `GRU` layer is typically better at remembering the recent past than the distant past, and in timeseries forecasting problems (e.g. weather) more recent data points are naturally more predictive than older ones, so chronological processing suits such problems
        * This isn't true for many other problems, such as natural language, since the **importance of a word in understanding a sentence** <u>isn't</u> usually dependent on **its position in the sentence**
    * On a text dataset, reversed-order processing works just as well as chronological processing, confirming the hypothesis that word order **does** matter in understanding language, but **which** order you use isn't crucial
        * More importantly, an RNN trained on reversed sequences will learn different representations than one trained on the original sequences
        * In ML, representations that are **different yet useful** are always worth exploiting, and the more they differ, the better
        * It offers new angles to look at your data, capturing aspects of the data that were missed by other approaches (boosting performance on a task) -- **ensembling**
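A minimal sketch of a bidirectional text classifier using the Keras `Bidirectional` wrapper. The vocabulary size (10,000) and layer sizes are assumptions for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(None,), dtype="int32"),   # sequences of word indices
    layers.Embedding(input_dim=10000, output_dim=32),
    # Runs one LSTM chronologically and a second one on the reversed sequence,
    # then merges the two representations (concatenation by default).
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(1, activation="sigmoid"),
])
```

With the default merge mode, the two 32-unit representations are concatenated, so the wrapped layer outputs 64 features per sample.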
* Other ways to **improve performance**:
    * Adjust the number of units in each recurrent layer in the stacked setup
    * Adjust the learning rate used by the `RMSprop` optimizer
    * Try using `LSTM` instead of `GRU` layers
    * Try using a bigger densely connected regressor on top of the recurrent layers (e.g. a bigger `Dense` layer or even a stack of `Dense` layers)
    * Run best-performing models on the test set to check that you aren't overfitting to the validation set

6) **Sequence processing** with **convnets**
* The same properties that make convnets excel at computer vision also make them highly relevant to **sequence processing**
* **Time** can be treated as a **spatial dimension**, like the height or width of a 2D image
* Such **1D convnets** can be **competitive** with RNNs on certain sequence-processing problems
    * (+) Usually considerably cheaper computational cost
    * Recently, 1D convnets, typically used with **dilated kernels**, have been used with great success for **audio generation** and **machine translation**
    * It has been long known that **small 1D convnets** can offer a **fast alternative to RNNs** for simple tasks (e.g. **text classification** and **timeseries forecasting**)
* Understanding **1D convolution** for sequence data:
    * The convolution layers introduced previously were 2D convolutions, extracting 2D patches from image tensors and applying an identical transformation to every patch (for image convnets)
    ![1d_conv](1d_conv.png)
    * In the same way, you can use **1D convolutions**, extracting **local 1D patches** (subsequences) from sequences
    * Such 1D convolution layers can **recognize local patterns in a sequence**
    * Because the **same input transformation** is performed on every patch, a pattern learned at a **certain position** in a sentence can later be recognized at a **different position** (making 1D convnets **translation invariant for temporal translations**)
        * e.g. A 1D convnet processing **sequences of characters** using **convolution windows of size 5** should be able to **learn words or word fragments of length 5 or less**, and it should be able to **recognize these words in any context** in an input sequence
        * A character-level 1D convnet is thus able to learn about **word morphology**
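To make the translation-invariance point concrete, here is a small NumPy sketch (not a Keras layer). The helper `conv1d` and the toy pattern are hypothetical; the same filter slides over the sequence and responds equally strongly wherever the pattern appears.

```python
import numpy as np

def conv1d(sequence, kernel):
    """Valid 1D convolution (cross-correlation, as used in deep-learning layers)."""
    window = len(kernel)
    return np.array([
        np.dot(sequence[i:i + window], kernel)
        for i in range(len(sequence) - window + 1)
    ])

pattern = np.array([1.0, 2.0, 1.0])  # local pattern to detect
kernel = pattern.copy()              # a filter that matches the pattern

early = np.zeros(10)
early[1:4] = pattern                 # pattern near the start of the sequence
late = np.zeros(10)
late[6:9] = pattern                  # same pattern near the end

resp_early = conv1d(early, kernel)
resp_late = conv1d(late, kernel)

# The peak response is identical; only its position in the output moves.
print(resp_early.argmax(), resp_early.max())
print(resp_late.argmax(), resp_late.max())
```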
* **1D pooling** for sequence data:
    * Similar to 2D average pooling and max pooling used in convnets to spatially downsample image tensors
    * Extracting **1D patches (subsequences)** from an input and outputting the **maximum value** (max pooling) or **average value** (average pooling)
    * Just as with 2D convnets, this is used for reducing the length of 1D inputs (**subsampling**)
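A small NumPy sketch of non-overlapping 1D pooling. The helper `pool1d` is hypothetical and drops trailing values that do not fill a complete window, which is one possible convention.

```python
import numpy as np

def pool1d(sequence, pool_size, mode="max"):
    """Non-overlapping 1D pooling over complete windows only."""
    n_windows = len(sequence) // pool_size
    windows = np.asarray(sequence[:n_windows * pool_size]).reshape(n_windows, pool_size)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

signal = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0, 7.0])

max_pooled = pool1d(signal, 2, "max")  # length 7 -> 3
avg_pooled = pool1d(signal, 2, "avg")
print(max_pooled, avg_pooled)
```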
* Implementing a 1D convnet in Keras:
    * Takes input 3D tensors with shape `(samples, time, features)` and returns similarly shaped 3D tensors
    * 1D convnets are structured in the same way as their 2D counterparts: they consist of a stack of `Conv1D` and `MaxPooling1D` layers, ending in either a global pooling layer or a `Flatten` layer
    * One difference is that you can **afford to use larger convolution windows** with 1D convnets
        * With a 2D convolution layer, a 3x3 convolution window contains 3x3 = 9 feature vectors; but with 1D convolution layer, a convolution window of size 3 contains **only** 3 feature vectors (thus can **easily afford 1D convolution windows of size 7 or 9**)
    * In the example, the validation accuracy is somewhat **lower** than that of the `LSTM`, but the **runtime is faster on both CPU and GPU**
        * This is a convincing demonstration that a 1D convnet can offer a fast, cheap alternative to a recurrent network on a word-level sentiment-classification task
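A minimal sketch of a 1D convnet for word-level sentiment classification, following the structure described above. The vocabulary size, sequence length, and layer sizes are assumptions for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

max_features = 10000  # assumed vocabulary size
max_len = 500         # assumed length of each padded review

model = keras.Sequential([
    keras.Input(shape=(max_len,), dtype="int32"),
    layers.Embedding(max_features, 128),      # -> (batch, 500, 128)
    layers.Conv1D(32, 7, activation="relu"),  # window of size 7 -> (batch, 494, 32)
    layers.MaxPooling1D(5),                   # subsample -> (batch, 98, 32)
    layers.Conv1D(32, 7, activation="relu"),  # -> (batch, 92, 32)
    layers.GlobalMaxPooling1D(),              # collapse the time axis -> (batch, 32)
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
```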
* **Combining CNNs and RNNs** to process **long** sequences:
    * Because 1D convnets **process input patches independently**, they **aren't** sensitive to the **order of the timesteps** (beyond a local scale, the size of the convolution windows) unlike RNNs
    * To recognize longer-term patterns, you can try stacking many conv layers and pooling layers, but that's still a **fairly weak** way to **induce order sensitivity**
        * One way to evidence this weakness is to try 1D convnets on the temperature-forecasting problem, where order-sensitivity is key to producing good predictions
        * (-) The convnet looks for patterns anywhere in the input timeseries and has **no knowledge of the temporal position** of a pattern it sees (e.g. toward beginning, toward end, etc.)
        * This matters for forecasting problems, because more recent data points should be interpreted differently from older data points
    * One strategy to combine the **speed and lightness of convnets** with the **order-sensitivity of RNNs** is to use a 1D convnet as a preprocessing step before an RNN
    ![cnn_rnn](cnn_rnn.png)
        * This is especially **beneficial** when you're dealing with **sequences that are so long** they **can't** realistically be processed with RNNs (e.g. sequences with thousands of steps)
        * The convnet will turn the long input sequence into **much shorter (downsampled) sequences** of **higher-level** features
        * This sequence of **extracted features** then becomes the **input to the RNN part of the network**
    * This technique isn't seen often in research papers and practical applications (possibly because it isn't well known). It is **effective** and ought to be **more common**
    * Because this strategy allows you to **manipulate much longer sequences**, you can either: 
        * Look at the data from longer ago by **increasing the `lookback` parameter** of the data generator
        * Look at **high-resolution timeseries** by **decreasing the `step` parameter** of the generator
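A minimal sketch of the convnet-then-RNN pattern for a timeseries regression. The 14 input features, window sizes, and dropout rates are assumptions for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(None, 14)),
    # Convolutional front end: extracts local features and shortens the sequence.
    layers.Conv1D(32, 5, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(32, 5, activation="relu"),
    # Recurrent back end: order-sensitive processing of the downsampled sequence.
    layers.GRU(32, dropout=0.1, recurrent_dropout=0.5),
    layers.Dense(1),
])
model.compile(optimizer="rmsprop", loss="mae")
```

With these settings, a 120-step input reaches the `GRU` as a 34-step sequence, which is what makes longer `lookback` values or smaller `step` values affordable.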

7) **Recurrent Attention** - a better way to look into the past is to use attention modules to summarize all past encoded vectors
* If sequential processing is to be avoided, then we can find units that "look-ahead" or better "look-back", since most of the time we deal with **real-time causal data** where we know the past and want to affect future decisions
* These techniques are especially useful for NLP, but aren't particularly applicable to numerical forecasting problems
* **Hierarchical Neural Attention Encoder (Attention)** (2015-16)
    * **Memories** - basic building block that saves multiple values in a table
        ![memories](memories.png)
        * Basically just a multi-dimensional array that can be of any size and any number of dimensions
        * It can also be composed of multiple banks or heads
        * Generally, a memory has a **write/read function** that writes to all locations and reads from all locations
        * An **attention-like module** can focus reading and writing to specific locations 
    * **Attention Module** - gating functions for memories. If you want to specify which values in an array should be passed through attention, use a linear layer to gate each input by some weighting function
    ![attention_mod](attention_mod.png)
        * Attention modules can be **soft** when the **weights are real-valued** and the inputs are thus multiplied by values
        * Attention is **hard when weights are binary**, and inputs are either zeroed out or passed through unchanged
        * Outputs are called **attention head outputs**
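A small NumPy sketch contrasting soft and hard attention over a memory. The relevance scores are assumed values (in a real module they might come from a linear layer), and the hard variant shown here passes only the single best slot, which is one possible choice of binary weights.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())  # shift for numerical stability
    return exp / exp.sum()

memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])        # 3 stored vectors of size 2
scores = np.array([0.1, 0.2, 3.0])     # assumed relevance score per slot

# Soft attention: real-valued weights, every slot contributes a little.
soft_weights = softmax(scores)
soft_output = soft_weights @ memory

# Hard attention: binary weights, a slot is either passed through or zeroed.
hard_weights = np.zeros_like(scores)
hard_weights[scores.argmax()] = 1.0
hard_output = hard_weights @ memory

print(soft_weights, soft_output)
print(hard_weights, hard_output)
```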
    * **Hierarchical Neural Attention Encoder** uses **attention modules** to summarize **all past encoded vectors** into a context vector `Ct`
    ![attention_enc](attention_enc.png)
        * There is a hierarchy of attention modules (similar to Temporal convolutional network (TCN))
        * Lower-level attention modules each look at a **small portion of the recent past** (e.g. 100 vectors), while **layers above can look at 100** of those attention modules, effectively integrating the information of **100x100 vectors**
            * This **extends the ability** of hierarchical neural attention encoder to **10,000 past vectors**
            * This is the way to look back more into the past and be able to influence the future
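The 100x100 arithmetic can be sketched as a two-level summarization in NumPy. This is only an illustration under simple assumptions (dot-product scoring against a hypothetical query vector, random data), not the encoder's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def attend(vectors, query):
    """Soft attention: softmax over dot-product scores, then a weighted sum."""
    scores = vectors @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vectors

dim, group = 8, 100
past = rng.normal(size=(group * group, dim))  # 10,000 past encoded vectors
query = rng.normal(size=dim)                  # hypothetical query vector

# Lower level: each of 100 modules summarizes 100 consecutive vectors.
chunks = past.reshape(group, group, dim)
summaries = np.stack([attend(chunk, query) for chunk in chunks])

# Upper level: one module summarizes the 100 summaries into a context vector.
context = attend(summaries, query)
print(summaries.shape, context.shape)
```

Any single past vector reaches the context vector in two hops, rather than through thousands of recurrent steps.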
    * Looking at the **length of the path** needed to **propagate a representation vector** to the output of the network:
        * In hierarchical networks, it is **proportional to `log(N)`** where `N` is the **number of hierarchy layers**
        * In contrast, an RNN needs `T` steps, where `T` is the maximum length of the sequence to be remembered, and `T >> N`
        * It is easier to remember sequences if you hop 3-4 times, as opposed to hopping 100 times
        * This architecture is similar to a **neural Turing machine**, but lets the neural network **decide what is read out from memory via attention**. This means the NN will decide which vectors from the past are important for future decisions
    * The recurrent attention architecture **stores all previous representations in memory** (unlike neural Turing machines)
        * (-) This can be rather inefficient, since it is like storing the representation of every frame in a video (most representation vectors do not change from frame to frame, so much of the same information is stored repeatedly)
        * Can try to add another unit that prevents correlated data from being stored
        * e.g. A unit that **prevents** storage of vectors **too similar** to previously stored ones
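One way such a unit could work is a similarity gate on writes. The sketch below is hypothetical: the class name `FilteredMemory`, the use of cosine similarity, and the 0.95 threshold are all assumptions, and stored vectors are assumed to be non-zero.

```python
import numpy as np

class FilteredMemory:
    """Memory that refuses to store vectors too similar to existing entries."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # cosine similarity above this counts as a duplicate
        self.slots = []

    def write(self, vector):
        v = np.asarray(vector, dtype=float)
        unit = v / np.linalg.norm(v)
        for stored in self.slots:
            similarity = np.dot(unit, stored / np.linalg.norm(stored))
            if similarity > self.threshold:
                return False  # too similar to something already stored: skip
        self.slots.append(v)
        return True

memory = FilteredMemory(threshold=0.95)
frames = ([1.0, 0.0], [0.99, 0.05], [0.0, 1.0])  # second frame barely differs from the first
results = [memory.write(frame) for frame in frames]
print(results, len(memory.slots))
```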

8) **Sequence Masking**