# Main

**Resources:**
* [Colah's blog](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

The fact that neural networks with hidden layers are universal approximators is not what makes them so powerful.

## Embeddings
A word embedding is a parameterised function mapping words in some language to high-dimensional vectors.
- Typically the function is a lookup table, parameterised by a matrix W with one row per word (initialised with random vectors)
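
The lookup-table view can be made concrete with a few lines of NumPy. This is only a minimal sketch: the toy vocabulary, the embedding size and the helper name `embed` are assumptions for illustration and do not come from the notebook. It also shows that indexing a row of W gives the same result as multiplying a one-hot vector by W.

```python
import numpy as np

# Toy vocabulary and embedding size: illustrative assumptions only.
vocab = ["cat", "dog", "mat", "sat"]
word_to_index = {word: i for i, word in enumerate(vocab)}
embedding_dim = 3

# W holds one row per word, initialised with small random values.
rng = np.random.RandomState(0)
W = rng.uniform(-0.05, 0.05, size=(len(vocab), embedding_dim))

def embed(words):
    """Look up the embedding vector (a row of W) for each word."""
    indices = [word_to_index[w] for w in words]
    return W[indices]

vectors = embed(["cat", "sat", "mat"])
print(vectors.shape)

# The lookup is equivalent to multiplying a one-hot vector by W.
one_hot_dog = np.eye(len(vocab))[word_to_index["dog"]]
print(np.allclose(one_hot_dog @ W, embed(["dog"])[0]))
```

During training, the rows of W would be updated by gradient descent like any other weights; here they simply stay at their random initial values.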

A common way to learn W is to train a network on a task such as classifying ‘valid’ sentences from corrupted (‘broken’) ones, or predicting the next word in a sentence.
After training, words with similar meanings have similar vectors.
* Visualize word embeddings with [t-SNE](http://lvdmaaten.github.io/tsne/)

We still need to see examples of every word being used, but because similar words have similar vectors, the network can generalise to new combinations of words.
- Moreover, analogies between words seem to be encoded as difference vectors between their embeddings (e.g. a roughly constant male-female difference vector)
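
The difference-vector idea can be illustrated with hand-made vectors. The sketch below is a toy: the 2-D vectors (axes loosely meaning "status" and "gender") are invented for illustration and are not learned embeddings, and `cosine` and `analogy` are hypothetical helper names.

```python
import numpy as np

# Hand-made 2-D vectors (axes: "status", "gender"); NOT learned embeddings.
toy = {
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "boy":   np.array([-0.5, 1.0]),
    "girl":  np.array([-0.5, -1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via the vector b - a + c."""
    target = toy[b] - toy[a] + toy[c]
    # Exclude the three query words from the candidate answers.
    candidates = [w for w in toy if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(toy[w], target))

# The male-female difference is the same vector for every pair in this toy set.
print(toy["woman"] - toy["man"], toy["queen"] - toy["king"])
print(analogy("man", "king", "woman"))
```

With real embeddings the difference vectors are only approximately constant, so the nearest neighbour of `b - a + c` is a best guess rather than an exact match.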

Learning a good representation on task **A** and then using it on task **B** works incredibly well.
- represent one kind of data and use it on multiple tasks
- map multiple kinds of data into a single representation (bilingual word-embedding in machine translation)

Shared embeddings are extremely exciting and show why the representation-focused perspective of deep learning is so compelling.


A modular approach to building neural networks, composing smaller modules into larger networks, is popular in NLP.

## Setup

In [52]:
%matplotlib inline
from keras.layers import TimeDistributed, Activation, Embedding, Dropout, LSTM, Dense
from keras.optimizers import Adam
from keras.utils.data_utils import get_file
from keras.models import Sequential
import numpy as np
from numpy.random import choice

In [4]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read().lower()

Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt


In [6]:
print('corpus length: ', len(text))

corpus length:  600893


In [22]:
chars = sorted(list(set(text)))
vocab_size = len(chars) + 1
# print(chars)

In [23]:
chars.insert(0, "\0")
# print(chars)

In [14]:
''.join(chars[1:-6])

'\n !"\'(),-.0123456789:;=?[]_abcdefghijklmnopqrstuvwx'

In [27]:
char_indices = dict((c, i) for i, c in enumerate(chars))
# print(char_indices)

In [25]:
indices_char = dict((i, c) for i, c in enumerate(chars))
# print(indices_char)

In [28]:
# represent every character of the text by its integer index
idx = [char_indices[c] for c in text]

In [31]:
''.join(indices_char[i] for i in idx[:100])

'preface\n\n\nsupposing that truth is a woman--what then? is there not ground\nfor suspecting that all ph'

## Preprocess and create model

In [36]:
maxlen = 40
sentences = []
next_chars = []

for i in range(0, len(idx) - maxlen+1):
    sentences.append(idx[i: i + maxlen])
    next_chars.append(idx[i+1: i+maxlen+1])

print('number of sequences: ', len(sentences))

number of sequences:  600854
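
To see what the windowing produces, here is the same idea on a tiny list of ids. This is a sketch under assumptions: `make_windows` is a hypothetical helper, and unlike the loop above it stops one step earlier, so the last target window is never cut short.

```python
def make_windows(ids, maxlen):
    """Pair each window of maxlen ids with the same window shifted by one."""
    inputs, targets = [], []
    # Stop at len(ids) - maxlen so every target window has full length.
    for i in range(len(ids) - maxlen):
        inputs.append(ids[i: i + maxlen])
        targets.append(ids[i + 1: i + maxlen + 1])
    return inputs, targets

toy_ids = [0, 1, 2, 3, 4, 5]
inputs, targets = make_windows(toy_ids, maxlen=3)
for x, y in zip(inputs, targets):
    print(x, "->", y)
```

Each target is the input shifted one position to the right, so at every timestep the model is asked to predict the next character.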


In [44]:
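# Stack the windows into (n_sequences, maxlen) integer arrays.
# The final target window is one element short (the loop above runs one step
# past the last full window), so the trailing windows are dropped to keep
# both arrays rectangular.
sentences = np.array(sentences[:-2])
next_chars = np.array(next_chars[:-2])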

In [45]:
# number of latent factors (embedding dimensions) per character
n_fac = 24

In [49]:
model=Sequential([
    Embedding(vocab_size, n_fac, input_length=maxlen),
    LSTM(512,
         input_dim=n_fac,
         return_sequences=True,
         dropout_U=0.2,
         dropout_W=0.2,
         consume_less='gpu'),
    Dropout(0.2),
    LSTM(512,
         return_sequences=True,
         dropout_U=0.2,
         dropout_W=0.2,
         consume_less='gpu'),
    Dropout(0.2),
    TimeDistributed(Dense(vocab_size)),
    Activation('softmax')  
])



In [53]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
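
For intuition about the loss chosen above, sparse categorical cross-entropy takes integer targets (rather than one-hot vectors) and averages the negative log-probability the model assigns to each correct class. The NumPy function below is a hand-written sketch of that definition, not the Keras implementation, and the probabilities are made up.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, targets):
    """Mean negative log-probability of the integer targets.

    probs:   (timesteps, vocab_size) array of softmax outputs
    targets: (timesteps,) array of integer class ids
    """
    # Pick, for each timestep, the probability assigned to the true class.
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked)))

# Made-up softmax outputs for two timesteps over a three-character vocabulary.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([0, 1])
loss = sparse_categorical_crossentropy(probs, targets)
print(loss)
```

Because the targets stay as integers, the `next_chars` array does not have to be expanded into one-hot vectors of size `vocab_size`.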

## Train model

In [54]:
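def print_example():
    # Generate 320 characters, feeding the last maxlen characters back in each step.
    seed_string = 'ethics is a basic foundation of all that'
    for i in range(320):
        # Slice (not index) the last maxlen characters so x has shape (1, maxlen).
        x = np.array([char_indices[c] for c in seed_string[-maxlen:]])[np.newaxis, :]
        # Predicted distribution for the character following the last input position.
        preds = model.predict(x)[0][-1]
        # Renormalise so the probabilities sum to 1 before sampling.
        preds = preds / np.sum(preds)
        next_char = choice(chars, p=preds)
        seed_string = seed_string + next_char
    print(seed_string)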
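
The sampling step inside `print_example` can be tried without a trained model. In this sketch `sample_index` is a hypothetical helper and the probability vectors are made up. Casting to float64 before renormalising is a precaution, since float32 softmax outputs may not sum to 1 closely enough for NumPy's `choice`.

```python
import numpy as np

def sample_index(preds, rng):
    """Draw one class index from a (possibly unnormalised) probability vector."""
    preds = np.asarray(preds, dtype=np.float64)
    preds = preds / preds.sum()  # renormalise so the entries sum to 1
    return int(rng.choice(len(preds), p=preds))

rng = np.random.RandomState(0)  # seeded only to make the sketch repeatable

# A made-up float32 "softmax output" over a four-character vocabulary.
fake_preds = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)
samples = [sample_index(fake_preds, rng) for _ in range(10)]
print(samples)
```

Sampling from the distribution, rather than always taking the most likely character, keeps the generated text from collapsing into a repeating loop.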

In [58]:
# TODO: fit and experiment with different parameters
# Should be trained on a GPU
# model.fit(sentences, np.expand_dims(next_chars, -1), batch_size=64, nb_epoch=1)

In [None]:
print_example()

In [None]:
model.save_weights('data/char_rnn.h5')