## Sentence Classification

* <span style="color:yellow">**Handling variable-length input with recurrent neural networks (RNNs)**</span>.
* <span style="color:green">**Working with RNNs and their variants (LSTMs and GRUs)**</span>.
* <span style="color:pink">**Using common evaluation metrics for classification problems**</span>.
* <span style="color:red">**Developing and configuring a training pipeline using AllenNLP**</span>.
* <span style="color:blue">**Building a language detector as a sentence classification task**</span>.



### 4.1 Recurrent neural networks (RNNs)
* The first step in sentence classification is to represent variable-length sentences using recurrent neural networks (RNNs).
* In this section, I'm going to present the concept of recurrent neural networks, one of the most important concepts in deep NLP.

##### 4.1.1 Handling variable-length input

* Neural networks can handle only numbers and arithmetic operations. => That is why words and documents must be converted to numbers through embeddings.
* One idea for handling variable-length input is to convert each word to its embedding, then average the embeddings into a single fixed-length vector.


![alt text for screen readers](figure1.PNG "Text to show on mouseover")


* This method is quite simple and is actually used in many NLP applications.
* **BUT it has one critical issue: it cannot take word order into account.**<br>
**=> Solution: use recurrent neural networks (RNNs).**
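
The word-order issue with averaging can be seen in a minimal sketch. The two-dimensional `EMBEDDINGS` table and the helper `average_embeddings` are made up for illustration; real embeddings are learned and have far more dimensions.

```python
# Hypothetical toy embeddings, chosen only for illustration.
EMBEDDINGS = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

def average_embeddings(words):
    """Average the word vectors of a sentence into one fixed-length vector."""
    vectors = [EMBEDDINGS[w] for w in words]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

a = average_embeddings(["dog", "bites", "man"])
b = average_embeddings(["man", "bites", "dog"])

# Both sentences contain the same words, so their averages coincide
# even though the sentences mean different things.
print(a, b)
```

Because averaging only sees which words occur, not where they occur, the two sentences end up with the same representation.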

##### 4.1.2 RNN abstraction

* **Reading process:** <br>
1. Read a word. <br>
2. Based on what has been read so far, figure out what the word means. <br>
3. Update the mental state. <br>
4. Move on to the next word. <br>


![alt text for screen readers](figure2.PNG "Text to show on mouseover")
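
The reading process above can be sketched as a loop that carries a state from word to word. This is only an illustration: the scalar `EMBEDDINGS`, the helper names `init_state`, `update`, and `rnn`, and the particular update formula (a `tanh` of a weighted sum) are all assumptions, not something fixed by these notes.

```python
import math

# Hypothetical scalar "embeddings", one number per word.
EMBEDDINGS = {"dog": 1.0, "bites": -0.5, "man": 0.3}

def init_state():
    """The 'mental state' before any word has been read."""
    return 0.0

def update(state, word):
    """Combine what has been read so far with the current word."""
    return math.tanh(0.5 * state + EMBEDDINGS[word])

def rnn(words):
    state = init_state()
    for word in words:                # 1. read a word
        state = update(state, word)   # 2-3. interpret it and update the state
    return state                      # 4. the loop moves on to the next word

print(rnn(["dog", "bites", "man"]))
print(rnn(["man", "bites", "dog"]))
```

Unlike averaging, the final state depends on the order in which the words were read, so the two sentences get different representations.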


##### 4.1.3 Simple RNNs and nonlinearity

### 4.2 Long short-term memory units (LSTMs) and gated recurrent units (GRUs)

* Simple RNNs are rarely used in real-world NLP applications because of the **vanishing gradients problem**.
* In this section, I'll show the issue associated with simple RNNs and how more popular RNN architectures, namely LSTMs and GRUs, solve this particular problem.

##### 4.2.1 Vanishing gradients problem

* RNNs are trained with the **backpropagation algorithm**.
* **Vanishing gradients problem**: because the error message needs to pass back through many layers, it becomes so weak and obscure (or so strong and skewed because of some misunderstanding) that the inner functions have a difficult time figuring out what they did wrong. <br>
**=> Because of the vanishing gradients problem, simple RNNs are difficult to train and rarely used in practice nowadays.**
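
How the message weakens can be illustrated with a one-dimensional RNN, `h_t = tanh(w * h_{t-1} + x_t)`. The function `gradient_through_time`, the weight `0.5`, and the constant inputs are assumptions chosen for this sketch. By the chain rule, the gradient of the last state with respect to the first is a product of one factor per step, and each factor here has magnitude below 1.

```python
import math

def gradient_through_time(weight, inputs):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(weight * h_{t-1} + x_t)."""
    hidden = 0.0
    grad = 1.0
    for x in inputs:
        hidden = math.tanh(weight * hidden + x)
        # Chain rule: d h_t / d h_{t-1} = weight * (1 - h_t ** 2)
        grad *= weight * (1.0 - hidden ** 2)
    return grad

short_grad = gradient_through_time(0.5, [0.1] * 5)    # 5-word "sentence"
long_grad = gradient_through_time(0.5, [0.1] * 50)    # 50-word "sentence"
print(short_grad, long_grad)
```

The gradient for the longer sequence is many orders of magnitude smaller, so the earliest steps receive almost no training signal. With a weight whose magnitude is well above 1, the same product can instead blow up, which is the "so strong and skewed" case.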

##### 4.2.2 Long short-term memory (LSTM)

* Instead of passing the information through an activation function every time and changing its shape completely, how about adding and subtracting information relevant to the part of the sentence being processed at each step?
* Long short-term memory units (LSTMs) are a type of RNN cell proposed based on this insight.
* Instead of passing around states, LSTM cells share a "memory" that each cell can remove old information from and/or add new information to, something like an assembly line in a factory.

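```python
# Pseudocode: forget(), add(), and update_hidden() stand for learned functions.
def update_lstm(state, word):
    cell_state, hidden_state = state
    cell_state *= forget(hidden_state, word)  # erase part of the old memory
    cell_state += add(hidden_state, word)     # write new information

    hidden_state = update_hidden(hidden_state, cell_state, word)

    return (cell_state, hidden_state)
```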


* The LSTM state comprises two halves: the cell state (the "memory" part) and the hidden state (the "mental representation" part).
* The function `forget()` returns a value between 0 and 1, so multiplying by this number means erasing old memory from `cell_state`. How much to erase is determined from `hidden_state` and `word` (the input). Controlling the flow of information by multiplying by a value between 0 and 1 is called **gating**. LSTMs were the first RNN architecture to use this gating mechanism.
* The function `add()` returns a new value that is added to the memory. The value is again determined from `hidden_state` and `word`.
* Finally, `hidden_state` is updated using a function whose value is computed from the previous hidden state, the updated memory, and the input word.


![alt text for screen readers](figure8.PNG "Text to show on mouseover")
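
To make the update concrete, here is a runnable toy version with scalar states. It is a sketch under stated assumptions: the constants inside `forget`, `add`, and `update_hidden` are made-up stand-ins for learned weights, and the "words" are plain numbers rather than embedding vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# All numeric constants below are hypothetical stand-ins for learned weights.
def forget(hidden_state, word):
    """Gate in (0, 1): how much of the old memory to keep."""
    return sigmoid(0.5 * hidden_state + 0.8 * word + 0.1)

def add(hidden_state, word):
    """New information to write: a gate in (0, 1) times a candidate in (-1, 1)."""
    gate = sigmoid(0.3 * hidden_state + 0.6 * word)
    candidate = math.tanh(0.7 * hidden_state + 0.9 * word)
    return gate * candidate

def update_hidden(hidden_state, cell_state, word):
    """Expose a gated view of the updated memory as the new hidden state."""
    gate = sigmoid(0.4 * hidden_state + 0.5 * word)
    return gate * math.tanh(cell_state)

def update_lstm(state, word):
    cell_state, hidden_state = state
    cell_state = cell_state * forget(hidden_state, word)
    cell_state = cell_state + add(hidden_state, word)
    hidden_state = update_hidden(hidden_state, cell_state, word)
    return (cell_state, hidden_state)

state = (0.0, 0.0)
for word in [0.2, -1.0, 0.7]:
    state = update_lstm(state, word)
print(state)
```

Note that the memory is only ever scaled and added to; it is never squashed through an activation function on its way from one step to the next.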


##### 4.2.3 Gated recurrent units (GRUs)

* Gated recurrent units (GRUs) are another RNN architecture that uses the gating mechanism. The philosophy behind GRUs is similar to that of LSTMs.
* BUT, GRUs use only one set of states instead of two halves.

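```python
# Pseudocode: update_hidden() and get_switch() stand for learned functions.
def update_gru(state, word):
    new_state = update_hidden(state, word)

    switch = get_switch(state, word)  # value between 0 and 1

    state = switch * new_state + (1 - switch) * state

    return state
```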

* Instead of erasing or updating the memory, GRUs use a switching mechanism.
1. The cell first computes the new state from the old state and the input.
2. It then computes *switch*, a value between 0 and 1. The state is chosen between the new state and the old one based on the value of switch.
3. The state is updated by: `state = switch * new_state + (1 - switch) * old_state`


![alt text for screen readers](figure9.PNG "Text to show on mouseover")
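
The switching mechanism can also be run as a toy example with a scalar state. As before, this is only a sketch: the constants in `update_hidden` and `get_switch` are hypothetical stand-ins for learned weights, and it follows the simplified description in these notes (full GRU cells also include a reset gate, which is omitted here).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# All numeric constants below are hypothetical stand-ins for learned weights.
def update_hidden(state, word):
    """Candidate new state computed from the old state and the input."""
    return math.tanh(0.6 * state + 0.9 * word)

def get_switch(state, word):
    """Switch in (0, 1): how far to move from the old state to the candidate."""
    return sigmoid(0.4 * state + 1.2 * word)

def update_gru(state, word):
    new_state = update_hidden(state, word)
    switch = get_switch(state, word)
    return switch * new_state + (1 - switch) * state

state = 0.0
for word in [0.2, -1.0, 0.7]:
    state = update_gru(state, word)
print(state)
```

Because `switch` lies between 0 and 1, each updated state is an interpolation between the old state and the candidate: a switch near 0 keeps the old state almost unchanged, while a switch near 1 replaces it.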


### 4.3 Accuracy, precision, recall, and F-measure

### 4.4 Building AllenNLP training pipelines