# Natural Language Processing (NLP)

This section is based on the following blog post by Andrej Karpathy:

[The Unreasonable Effectiveness of Recurrent Neural Networks, by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

The code used by Karpathy for the article is on GitHub:

[https://github.com/karpathy/char-rnn](https://github.com/karpathy/char-rnn)

It is a **character-level language model**: remarkably, the network learns to generate text even though it is trained one character at a time.

The basic idea is shown in the following picture from Karpathy's blog post:

![Karpathy: An example RNN with 4-dimensional input and output layers, and a hidden layer of 3 units (neurons). The vocabulary is `[h,e,l,o]`](pics/charseq_karpathy.jpeg)
*Karpathy: An example RNN with 4-dimensional input and output layers, and a hidden layer of 3 units (neurons). The vocabulary is `[h,e,l,o]`.*

This notebook builds a simplified project, similar to the one described in the article, with the following goal: given a sequence of characters, predict the same sequence shifted by one character, e.g., `[h,e,l,l] -> [e,l,l,o]`.
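
The shift-by-one target can be illustrated with a minimal sketch. The helper name `split_input_target` is an assumption made for this example and does not come from the article or its code:

```python
def split_input_target(seq):
    """Return (input, target), where target is the input shifted by one step."""
    return seq[:-1], seq[1:]

chars = list("hello")
inputs, targets = split_input_target(chars)
print(inputs)   # ['h', 'e', 'l', 'l']
print(targets)  # ['e', 'l', 'l', 'o']
```

At every position, the target is simply the character that follows the input character.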

Some points to consider:
- We are going to use the complete works of Shakespeare for training. The reasons are: (1) the text has more than one million characters and (2) it is very well structured.
- We are going to create a one-hot encoding for the alphabet characters and punctuation; then, we are going to use an embedding to compress those one-hot vectors.

Steps followed:
1. Load text/data; a large dataset with millions of characters is required
2. Text processing and vectorization: integers are assigned to letters and symbols (e.g., punctuation)
3. Create batches: build sequences long enough to learn relationships, but not so long that they add noise
4. Create the model: we'll have 3 layers
    - Embedding layer: one-hot encoding vectors are compressed to a smaller space of fixed size (dimensions)
    - GRU layer: a simplified variant of LSTM units (i.e., with fewer parameters), which can lead to better results (see RNN folder: `../19_07_Keras_RNN`)
    - Dense layer: probabilities per character
5. Train the model
6. Inference
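
Step 3 can be sketched with plain NumPy. This is only an illustration under stated assumptions: the helper `make_sequences`, the toy array, and the sequence length of 4 are hypothetical and chosen for readability, not taken from the notebook.

```python
import numpy as np

def make_sequences(encoded, seq_len):
    """Cut a 1-D array of character indices into (input, target) pairs.

    Each chunk holds seq_len + 1 indices so that input and target,
    which are offset by one position, both have length seq_len.
    Trailing indices that do not fill a whole chunk are dropped.
    """
    chunk = seq_len + 1
    n_chunks = len(encoded) // chunk
    chunks = np.asarray(encoded[:n_chunks * chunk]).reshape(n_chunks, chunk)
    return chunks[:, :-1], chunks[:, 1:]

toy = np.arange(11)  # stand-in for an encoded text
X, y = make_sequences(toy, seq_len=4)
print(X)  # [[0 1 2 3], [5 6 7 8]]
print(y)  # [[1 2 3 4], [6 7 8 9]]
```

Each row of `X` is one training sequence and the matching row of `y` is the same sequence shifted by one character.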

### Embeddings

A nice description of what embeddings are is given in this video (in Spanish) on the DotCSV YouTube channel:

[INTRO al Natural Language Processing (NLP) #2 - ¿Qué es un EMBEDDING?](https://www.youtube.com/watch?v=RkYuH_K7Fx4)

Embeddings are not exclusive to language, but they are commonly used in it thanks to approaches like `word2vec`, published in:

"Efficient Estimation of Word Representations in Vector Space", Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013, Google.

The idea is that, first, **we create a one-hot encoding to represent our vocabulary** in order to start working on the text; the size of the one-hot vector is the size of the vocabulary (i.e., the number of words, say 10k). This representation has several problems, such as:
- It is large and sparse.
- Words that are semantically close to each other are the same distance apart as words that should be far away.
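
The second problem can be checked numerically. The sketch below assumes a toy vocabulary of 5 items and shows that every pair of distinct one-hot vectors is separated by the same Euclidean distance, the square root of 2:

```python
import numpy as np

vocab_size = 5                 # toy size, chosen only for the example
one_hot = np.eye(vocab_size)   # row i is the one-hot vector of item i

# Pairwise Euclidean distances between all one-hot vectors
diff = one_hot[:, None, :] - one_hot[None, :, :]
dist = np.linalg.norm(diff, axis=-1)
print(np.round(dist, 3))       # 0 on the diagonal, 1.414 everywhere else
```

Since all distances are equal, the representation carries no information about which items are similar.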

To solve those issues, a shallow neural network can be applied to the one-hot vectors to compress them into a space with fewer dimensions (e.g., 300) and continuous values:

`[0,0,0,1,0,0,0] -> [0.54, 0.01]`

The nice thing is that vectors that are close to each other in the embedding space tend to be semantically close as well. Thus, we can apply ordinary vector arithmetic to them, in such a way that `V(king) - V(man) + V(woman)` should be close to `V(queen)`.
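
As a minimal sketch of the compression step, multiplying a one-hot vector by a weight matrix is equivalent to selecting one row of that matrix. The matrix `E` below is filled with random numbers purely for illustration; in a real model its values are learned during training, and the sizes (7 and 2) are assumptions that mirror the small example above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 7, 2

# Hypothetical embedding matrix: one row of embed_dim values per vocabulary item
E = rng.normal(size=(vocab_size, embed_dim))

index = 3
one_hot = np.zeros(vocab_size)
one_hot[index] = 1.0           # [0, 0, 0, 1, 0, 0, 0]

# The product with a one-hot vector picks out a single row of E
via_matmul = one_hot @ E
via_lookup = E[index]
print(np.allclose(via_matmul, via_lookup))  # True
```

This equivalence is why embedding layers are typically implemented as a table lookup on the integer index rather than as an explicit matrix product.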

## 1. Load Text/Data

In [43]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf

In [47]:
with open('./shakespeare.txt','r') as f:
    #lines = f.readlines() 
    text = f.read()

In [76]:
# The text has symbols in it, such as \n
text[0:100]

"\n                     1\n  From fairest creatures we desire increase,\n  That thereby beauty's rose mi"

In [75]:
# If we print it, the symbols are interpreted
print(text[100100:100500])

houldst not abhor my state.
    If thy unworthiness raised love in me,
    More worthy I to be beloved of thee.


                     151
  Love is too young to know what conscience is,  
  Yet who knows not conscience is born of love?
  Then gentle cheater urge not my amiss,
  Lest guilty of my faults thy sweet self prove.
  For thou betraying me, I do betray
  My nobler part to my gross body's 


## 2. Text Vectorization

In [58]:
# We create a set with all characters and symbols
vocab = sorted(set(text))

In [74]:
vocab[0:10]

['\n', ' ', '!', '"', '&', "'", '(', ')', ',', '-']

In [64]:
# Number of characters/symbols we have - important for the final dense layer
len(vocab)

84

In [66]:
# Now, we need to bidirectionally associate each character in the vocabulary
# with a number (related to one-hot encoding):
# character <-> number
# We can do that with enumerate and dictionaries
for pair in enumerate(vocab):
    print(pair)

(0, '\n')
(1, ' ')
(2, '!')
(3, '"')
(4, '&')
(5, "'")
(6, '(')
(7, ')')
(8, ',')
(9, '-')
(10, '.')
(11, '0')
(12, '1')
(13, '2')
(14, '3')
(15, '4')
(16, '5')
(17, '6')
(18, '7')
(19, '8')
(20, '9')
(21, ':')
(22, ';')
(23, '<')
(24, '>')
(25, '?')
(26, 'A')
(27, 'B')
(28, 'C')
(29, 'D')
(30, 'E')
(31, 'F')
(32, 'G')
(33, 'H')
(34, 'I')
(35, 'J')
(36, 'K')
(37, 'L')
(38, 'M')
(39, 'N')
(40, 'O')
(41, 'P')
(42, 'Q')
(43, 'R')
(44, 'S')
(45, 'T')
(46, 'U')
(47, 'V')
(48, 'W')
(49, 'X')
(50, 'Y')
(51, 'Z')
(52, '[')
(53, ']')
(54, '_')
(55, '`')
(56, 'a')
(57, 'b')
(58, 'c')
(59, 'd')
(60, 'e')
(61, 'f')
(62, 'g')
(63, 'h')
(64, 'i')
(65, 'j')
(66, 'k')
(67, 'l')
(68, 'm')
(69, 'n')
(70, 'o')
(71, 'p')
(72, 'q')
(73, 'r')
(74, 's')
(75, 't')
(76, 'u')
(77, 'v')
(78, 'w')
(79, 'x')
(80, 'y')
(81, 'z')
(82, '|')
(83, '}')


In [67]:
# Following that, we create the character -> number dictionary with a dict comprehension
char_to_ind = {char:ind for ind,char in enumerate(vocab)}

In [68]:
# Reverse mapping (number -> character): indexing the array with a number returns its character
ind_to_char = np.array(vocab)

In [71]:
char_to_ind['A']

26

In [72]:
ind_to_char[26]

'A'

In [80]:
# Now, with the dictionary and the array, we can encode our text!
encoded_text = np.array([char_to_ind[c] for c in text])

In [83]:
# We check that we have several million characters (necessary for good enough results)
encoded_text.shape

(5445609,)
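
A quick way to sanity-check the two mappings is to encode a string and decode it back. The sketch below rebuilds the mappings on a single line of the text so that it runs without `shakespeare.txt`; the names `sample`, `encoded`, and `decoded` are introduced only for this example.

```python
import numpy as np

sample = "From fairest creatures we desire increase"
vocab = sorted(set(sample))
char_to_ind = {char: ind for ind, char in enumerate(vocab)}
ind_to_char = np.array(vocab)

# Encode: one integer per character
encoded = np.array([char_to_ind[c] for c in sample])

# Decode: index the array with all integers at once and join the characters
decoded = "".join(ind_to_char[encoded])
print(decoded == sample)  # True
```

Because `ind_to_char` is a NumPy array, a whole sequence of indices can be decoded with a single indexing operation, which is convenient later when turning model predictions back into text.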