### Text Data and Recurrent Neural Networks

In [8]:
source("src/lib.R")

In this notebook we'll summarise two related topics in Neural Networks:
* Text processing and how to feed text data into Neural Networks
* The neural network models that have been developed to handle data with a sequential structure (e.g. text or time series)

#### Text vectorization
In the previous notebooks we have seen that neural network models must be fed with numeric tensors. *Text vectorization* is the process that transforms text data into such numeric data.

There are different ways to *vectorize* text data, for example:
* Segment text into **words** and associate each word to a numerical vector.
* Segment text into **characters** and transform each character into a numerical vector.
* Extract **N-grams** of words/characters and transform each N-gram into a vector. An **N-gram** is a sequence of at most **N** consecutive words/characters in a text. Consider the text *"UniCredit is a pan European Winner"*: it contains the following **3-grams**: *UniCredit, is, a, pan, European, Winner, UniCredit is, is a, a pan, pan European, European Winner, UniCredit is a, is a pan, a pan European, pan European Winner*.
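
The N-gram definition above can be checked with a few lines of base R. This is only a minimal sketch: `word_ngrams` is a hypothetical helper written for this illustration (it is not part of Keras), and it assumes that words are separated by whitespace.

```r
# Hypothetical helper (not a Keras function): return every sequence of
# at most n consecutive words, shortest sequences first.
word_ngrams <- function(text, n) {
  words <- strsplit(text, "\\s+")[[1]]   # assumption: whitespace-separated words
  grams <- character(0)
  for (k in seq_len(min(n, length(words)))) {
    # all windows of exactly k consecutive words
    for (i in seq_len(length(words) - k + 1)) {
      grams <- c(grams, paste(words[i:(i + k - 1)], collapse = " "))
    }
  }
  grams
}

grams <- word_ngrams("UniCredit is a pan European Winner", 3)
print(grams)
```

With six words and N = 3, the sketch yields 6 single words, 5 pairs and 4 triples, i.e. the 15 entries listed above.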

In the following we'll focus our attention on the first *vectorization* strategy: **words** vectorization.

The **first step** in this process is to segment text data into words, also known as *tokenization*. Let's see how to tokenize text data using **Keras**:

In [13]:
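# Two short example documents
samples <- c("UniCredit is a pan European Winner", "CIB fully plugged into UniCredit")
# Build the vocabulary from the samples
tokenizer <- text_tokenizer() %>%
  fit_text_tokenizer(samples)

# Number of documents in which each word appears
tokenizer$word_docs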

The **second step** is to associate a numeric vector to each token produced by the first step. The simplest way to do this is called **One-Hot encoding**:
* Associate to each word $w$ a unique integer index $H_w$
* Transform the integer index into a binary $N$-dimensional vector (where $N$ is the number of unique tokens) in which the only nonzero component is the one corresponding to index $H_w$
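
The two sub-steps can be sketched in base R, without Keras, to make the construction explicit. This is an illustrative sketch under stated assumptions: the names `vocab`, `word_index` and `one_hot` are hypothetical, words are lower-cased and split on whitespace, and indices are assigned in order of first appearance, which need not match the indices a Keras tokenizer assigns.

```r
samples <- c("UniCredit is a pan European Winner", "CIB fully plugged into UniCredit")

# Sub-step 1: associate each unique word w with an integer index H_w.
# Assumption: lower-case the text and split on whitespace.
tokens <- strsplit(tolower(samples), "\\s+")
vocab <- unique(unlist(tokens))
word_index <- setNames(seq_along(vocab), vocab)
N <- length(vocab)   # number of unique tokens

# Sub-step 2: build an N-dimensional binary vector whose only nonzero
# component is the one at position H_w.
one_hot <- function(word) {
  v <- integer(N)
  v[word_index[[word]]] <- 1L
  v
}

print(word_index)
print(one_hot("cib"))
```

Each encoded word is a vector of length `N` containing exactly one `1`, so the size of the representation grows with the vocabulary.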

Both sub-steps are implemented in the **Keras** framework. First, associate each word with an integer index:

In [15]:
(sequences <- texts_to_sequences(tokenizer, samples))