# Natural Language Processing (NLP)

This section is based on the following blog post by Andrej Karpathy:

[The Unreasonable Effectiveness of Recurrent Neural Networks, by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

The code Karpathy used for the article is on GitHub:

[https://github.com/karpathy/char-rnn](https://github.com/karpathy/char-rnn)

In essence, it is a **character-level language model**: remarkably, the network learns to generate text even though it is trained one character at a time.

The basic idea is shown in the following picture from Karpathy's blog post:

![Karpathy: An example RNN with 4-dimensional input and output layers, and a hidden layer of 3 units (neurons). The vocabulary is `[h,e,l,o]`](pics/charseq_karpathy.jpeg)
*Karpathy: An example RNN with 4-dimensional input and output layers, and a hidden layer of 3 units (neurons). The vocabulary is `[h,e,l,o]`.*

This notebook builds a simplified project, similar to the one described in the article, with the following goal: given a sequence of characters, predict the same sequence shifted by one character, e.g., `[h,e,l,l] -> [e,l,l,o]`.
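
The shifted-target idea can be sketched in a few lines. This is a minimal illustration only; the helper name `split_input_target` is an assumption made for this example, not something taken from Karpathy's code.

```python
def split_input_target(sequence):
    """Return (input, target), where target is the input shifted by one character."""
    # Input: every character except the last one.
    # Target: every character except the first one.
    return sequence[:-1], sequence[1:]


inputs, targets = split_input_target("hello")
print(list(inputs), "->", list(targets))
```

At every position, the target character is simply the character that follows the input character in the original text.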

Some points to consider:
- We are going to use the complete works by Shakespeare for training.
- We are going to create a one-hot encoding for the alphabet characters and punctuation.
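
As a rough sketch of the encoding mentioned above, the snippet below builds a character vocabulary, maps characters to integers, and expands them into one-hot vectors. The short string `text` is a stand-in for the Shakespeare corpus, and all variable names are assumptions made for illustration.

```python
import numpy as np

text = "hello"  # stand-in for the full corpus (assumption for this example)

# Vocabulary: every distinct character, sorted for a stable ordering.
vocab = sorted(set(text))

# Lookup tables between characters and integer indices.
char_to_idx = {ch: i for i, ch in enumerate(vocab)}
idx_to_char = np.array(vocab)

# Integer encoding of the text, one index per character.
encoded = np.array([char_to_idx[ch] for ch in text])

# One-hot encoding: row i of the identity matrix is the one-hot vector for index i.
one_hot = np.eye(len(vocab), dtype=np.float32)[encoded]

print(vocab)
print(encoded)
print(one_hot.shape)  # (number of characters, vocabulary size)
```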

Steps followed:
1. Load text/data; a large dataset with millions of characters is required
2. Text processing and vectorization: integers are assigned to letters and symbols (e.g., punctuation)
3. Create batches: sequences must be long enough to capture relationships, but not so long that they add noise
4. Create the model: we'll have 3 layers
    - Embedding layer: one-hot encoding vectors are compressed to a smaller space of fixed size (dimensions)
    - GRU layer: a simplified variant of the LSTM unit (i.e., with fewer parameters), which can give comparable or better results (see RNN folder: `../19_07_Keras_RNN`)
    - Dense layer: probabilities per character
5. Train the model
6. Inference
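
The three-layer model from step 4 could look roughly like the following Keras sketch. The function name `build_model` and the sizes used here are assumptions for illustration, not the values used later in the notebook. The final `Dense` layer outputs unnormalized scores (logits); applying a softmax turns them into probabilities per character.

```python
import numpy as np
import tensorflow as tf


def build_model(vocab_size, embed_dim, rnn_units):
    """Embedding -> GRU -> Dense, producing one score per vocabulary character."""
    return tf.keras.Sequential([
        # Maps each integer character index to a dense vector of size embed_dim.
        tf.keras.layers.Embedding(vocab_size, embed_dim),
        # return_sequences=True yields an output at every time step,
        # which is needed to predict the next character at each position.
        tf.keras.layers.GRU(rnn_units, return_sequences=True),
        # One logit per character in the vocabulary.
        tf.keras.layers.Dense(vocab_size),
    ])


# Hypothetical sizes, chosen only to keep the example small.
model = build_model(vocab_size=84, embed_dim=64, rnn_units=128)

# Dummy batch: 2 sequences of 10 character indices each.
dummy_batch = np.zeros((2, 10), dtype="int32")
logits = model(dummy_batch)
print(logits.shape)  # (batch size, sequence length, vocabulary size)
```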

### Embeddings

A nice description of what embeddings are is given in this video (in Spanish) on the DotCSV YouTube channel:

[INTRO al Natural Language Processing (NLP) #2 - ¿Qué es un EMBEDDING?](https://www.youtube.com/watch?v=RkYuH_K7Fx4)

Embeddings are not exclusive to language, but they are commonly used in it, thanks to approaches like `word2vec`, published in

"Efficient Estimation of Word Representations in Vector Space", Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013, Google.

The idea is that, first, **we create a one-hot encoding to represent our vocabulary** in order to start working on the text; the size of the one-hot vector is the size of the vocabulary (i.e., the number of words, say 10k). This representation has several problems, such as:
- It is large and sparse.
- Words that are semantically close to each other are the same distance apart as words that should be far away.
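
The second problem is easy to check numerically: any two distinct one-hot vectors are exactly the same Euclidean distance apart, namely $\sqrt{2}$. The three-word vocabulary below is a made-up example.

```python
from itertools import combinations

import numpy as np

vocab = ["king", "queen", "banana"]  # hypothetical toy vocabulary
one_hot = np.eye(len(vocab))         # row i is the one-hot vector of word i

# Euclidean distance between every pair of distinct words.
distances = {
    (a, b): float(np.linalg.norm(one_hot[i] - one_hot[j]))
    for (i, a), (j, b) in combinations(enumerate(vocab), 2)
}

for pair, dist in distances.items():
    print(pair, round(dist, 4))
```

"king" is no closer to "queen" than it is to "banana", so the representation carries no semantic information.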

To solve those issues, a shallow neural net can be applied to the one-hot vectors to compress them into a space with fewer dimensions (e.g., 300) and continuous values:

`[0,0,0,1,0,0,0] -> [0.54, 0.01]`
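
Seen as linear algebra, this compression is a multiplication of the one-hot vector by a weight matrix, which amounts to selecting one row of that matrix. The sketch below uses a random matrix `W` as a stand-in for learned weights; the names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 7, 2
# In a real model W is learned during training; here it is random.
W = rng.normal(size=(vocab_size, embed_dim))

# One-hot vector for the item at index 3: [0, 0, 0, 1, 0, 0, 0]
one_hot = np.zeros(vocab_size)
one_hot[3] = 1.0

via_matmul = one_hot @ W  # multiply the one-hot vector by the weights
via_lookup = W[3]         # or simply read row 3 of the matrix

print(via_matmul)
print(np.allclose(via_matmul, via_lookup))
```

This equivalence is why embedding layers are typically implemented as a table lookup on integer indices rather than as an explicit matrix product.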

The nice thing is that vectors that are close to each other in the embedding space correspond to words that are semantically close. Thus, we can apply ordinary vector arithmetic to them, so that `V(king) - V(man) + V(woman)` should be close to `V(queen)`.
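
The arithmetic can be illustrated with hand-made vectors. The 2-dimensional values below are invented for this example (axis 0 loosely meaning "royalty", axis 1 "gender"); they are not real `word2vec` embeddings, which are learned from data and only satisfy such relations approximately.

```python
import numpy as np

# Hand-crafted toy embeddings: [royalty, gender]. Purely illustrative values.
vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

query = vectors["king"] - vectors["man"] + vectors["woman"]

# Nearest neighbour by Euclidean distance, excluding the words used in the query.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
nearest = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - query))

print(nearest)
```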

## 1. Load Text/Data

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
```