# Main

**Notes**
* Setup
* 3 Character Model
* Simple RNN


**Resources:**
* [Colah's blog](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

The fact that neural networks with hidden layers are universal approximators isn’t what makes them so powerful.

## Embeddings
A word embedding is a parameterized function mapping words in some language to high-dimensional vectors.
- Typically the function is a lookup table, parameterized by a matrix W with one row per word (initialized with random vectors)
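
The lookup-table view can be sketched in a few lines of NumPy. This is only an illustration: the four-word vocabulary, the dimension of 3 and the helper `embed` are made up for the example, and no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; a real one would hold thousands of words.
vocab = ["cat", "dog", "sat", "mat"]
word_to_row = {w: i for i, w in enumerate(vocab)}

embedding_dim = 3
# W has one row per word, initialised with small random values.
W = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))

def embed(word):
    """Look up the vector for a word: plain row selection in W."""
    return W[word_to_row[word]]

# A lookup is equivalent to multiplying a one-hot vector by W.
one_hot = np.zeros(len(vocab))
one_hot[word_to_row["dog"]] = 1.0
assert np.allclose(embed("dog"), one_hot @ W)
```

Training would then adjust the rows of `W` by gradient descent, exactly like any other weight matrix.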

A common way to learn W is to train on an auxiliary task, such as classifying ‘valid’ versus ‘broken’ (corrupted) sentences or predicting the next word in a sentence.
Words with similar meanings end up with similar vectors (rows of W).
* Visualize word embeddings (http://lvdmaaten.github.io/tsne/)

The network still needs to see examples of every word being used, but similarities between word vectors allow it to generalise to new combinations of words.
- Moreover, analogies between words seem to be encoded in the difference vectors between words (e.g. a roughly constant male-female difference vector)
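
A toy illustration of the difference-vector idea, using hand-made 2-d vectors rather than learned embeddings. The vectors, the axis meanings and the helper `nearest` are all invented for this sketch; real embeddings only show this structure approximately.

```python
import numpy as np

# Hand-made vectors (axis 0: "royalty", axis 1: "gender"); illustrative only.
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def nearest(query, exclude=()):
    """Return the word whose vector has the highest cosine similarity to query."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], query))

# In this toy space the male-female difference is the same for both pairs.
gender = vectors["man"] - vectors["woman"]
assert np.allclose(gender, vectors["king"] - vectors["queen"])

# king - man + woman lands on queen.
query = vectors["king"] - vectors["man"] + vectors["woman"]
result = nearest(query, exclude=("king", "man", "woman"))
```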

Learning a good representation on task **A** and then using it on task **B** works incredibly well.
- represent one kind of data and use it on multiple tasks
- map multiple kinds of data into a single representation (bilingual word-embedding in machine translation)

Shared embeddings are extremely exciting, and they are why the representation-focused perspective of deep learning is so compelling.


A modular approach to building neural networks, composing reusable modules, is popular in NLP.

## Setup

In [80]:
%matplotlib inline
from keras.layers import TimeDistributed, Activation, Embedding
from keras.layers import Dropout, LSTM, Dense, Input, Flatten, merge
from keras.optimizers import Adam
from keras.utils.data_utils import get_file
from keras.models import Sequential, Model
import numpy as np
from numpy.random import choice

In [4]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read().lower()

Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt


In [6]:
print('corpus length: ', len(text))

corpus length:  600893


In [22]:
chars = sorted(list(set(text)))
vocab_size = len(chars) + 1
# print(chars)

In [23]:
chars.insert(0, "\0")
# print(chars)

In [14]:
''.join(chars[1:-6])

'\n !"\'(),-.0123456789:;=?[]_abcdefghijklmnopqrstuvwx'

In [27]:
char_indices = dict((c, i) for i, c in enumerate(chars))
# print(char_indices)

In [25]:
indices_char = dict((i, c) for i, c in enumerate(chars))
# print(indices_char)

In [28]:
# we use character indices as id
idx = [char_indices[c] for c in text]

In [31]:
''.join(indices_char[i] for i in idx[:100])

'preface\n\n\nsupposing that truth is a woman--what then? is there not ground\nfor suspecting that all ph'
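
The encoding steps above can be checked on a tiny string. This is a self-contained sketch with a made-up `text`; index 0 is reserved for `"\0"` as in the cells above (presumably as a padding value, which is an assumption).

```python
text = "hello world"

# Index 0 is reserved for "\0"; real characters start at index 1.
chars = ["\0"] + sorted(set(text))
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

# Encode the text as integer ids, then decode it back.
encoded = [char_indices[c] for c in text]
decoded = "".join(indices_char[i] for i in encoded)
```

The round trip is lossless because `char_indices` and `indices_char` are inverse mappings.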

## 3 character model

In [61]:
# Step through the text 3 characters at a time and, for each position,
# take the characters at offsets 0, 1, 2 (inputs) and 3 (target)
cs = 3
c1_dat = [idx[i] for i in range(0, len(idx)-1-cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-1-cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-1-cs, cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx)-1-cs, cs)]

In [66]:
# Our inputs
x1 = np.stack(c1_dat[:-2])
x2 = np.stack(c2_dat[:-2])
x3 = np.stack(c3_dat[:-2])

In [67]:
# Our output
y = np.stack(c4_dat[:-2])
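
To see what this slicing produces, here is the same indexing applied to a short made-up string, with characters kept as-is instead of integer ids so the windows are easy to read.

```python
text = "abcdefghijkl"
cs = 3

# Same range as above: start positions 0, 3, 6 for a 12-character text.
starts = range(0, len(text) - 1 - cs, cs)

inputs = [(text[i], text[i + 1], text[i + 2]) for i in starts]
targets = [text[i + 3] for i in starts]
```

The windows do not overlap, and the target of one window is the first input of the next.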

In [68]:
# Number of latent factors to create
n_fac = 42

In [74]:
# Create inputs and embedding outputs for each of our 3 character inputs
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_in, n_out, input_length=1)(inp)
    return inp, Flatten()(emb)

In [75]:
c1_in, c1 = embedding_input('c1', vocab_size, n_fac)
c2_in, c2 = embedding_input('c2', vocab_size, n_fac)
c3_in, c3 = embedding_input('c3', vocab_size, n_fac)

### Create and train model

In [78]:
# size of hidden state
n_hidden = 256

# layer operation from input to hidden (green arrow)
dense_in = Dense(n_hidden, activation='relu')
c1_hidden = dense_in(c1)

# layer operation from hidden to hidden (orange arrow)
dense_hidden = Dense(n_hidden, activation='tanh')
c2_dense = dense_in(c2)
hidden_2 = dense_hidden(c1_hidden)
c2_hidden = merge([c2_dense, hidden_2])

c3_dense = dense_in(c3)
hidden_3 = dense_hidden(c2_hidden)
c3_hidden = merge([c3_dense, hidden_3])

# layer operation from hidden to output (blue arrow)
dense_out = Dense(vocab_size, activation='softmax')


In [79]:
# the 3rd hidden state is the input to our output layer
c4_out = dense_out(c3_hidden)
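
The wiring above can be written out as a plain NumPy forward pass, which makes the weight sharing explicit. This is a sketch under assumptions: it assumes `merge` is used in its default summing mode, omits bias terms, and uses toy sizes and random weights instead of the notebook's values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_fac, n_hidden = 10, 4, 8   # toy sizes, not the notebook's

# One embedding table per input position (three separate Embedding layers above).
E = [rng.normal(scale=0.1, size=(vocab_size, n_fac)) for _ in range(3)]
W_in = rng.normal(scale=0.1, size=(n_fac, n_hidden))        # dense_in
W_hid = rng.normal(scale=0.1, size=(n_hidden, n_hidden))    # dense_hidden
W_out = rng.normal(scale=0.1, size=(n_hidden, vocab_size))  # dense_out

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(c1, c2, c3):
    """Predict a distribution over the 4th character from three character ids."""
    h = relu(E[0][c1] @ W_in)                       # c1_hidden
    h = relu(E[1][c2] @ W_in) + np.tanh(h @ W_hid)  # c2_hidden (merge as sum)
    h = relu(E[2][c3] @ W_in) + np.tanh(h @ W_hid)  # c3_hidden
    return softmax(h @ W_out)                       # c4_out

probs = forward(1, 2, 3)
```

The same `W_in` and `W_hid` are applied at every step; only the embedding tables differ per position.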

In [81]:
model = Model([c1_in, c2_in, c3_in], c4_out)

In [82]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [85]:
# reduce learning rate
model.optimizer.lr=0.000001

In [86]:
# train and experiment with learning rate
# model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=4)

## Simple RNN

In [87]:
# size of our unrolled RNN
cs = 8

In [88]:
# For each offset 0 through 7, create a list of every 8th char starting at that offset
# These will be the 8 inputs to our model
c_in_dat = [[idx[i+n] for i in range(0, len(idx)-1-cs, cs)] for n in range(cs)]

In [90]:
# list of the next char in each of these series
c_out_dat = [idx[i+cs] for i in range(0, len(idx)-1-cs, cs)]

In [92]:
xs = [np.stack(c[:-2]) for c in c_in_dat]

In [93]:
len(xs), xs[0].shape

(8, (75109,))

In [94]:
y = np.stack(c_out_dat[:-2])

In [95]:
# each column is one series of 8 characters from the text
[xs[n][:cs] for n in range(cs)]

[array([43,  1, 36,  2, 46, 41, 47,  2]),
 array([45,  1, 41, 47,  2,  9, 35, 47]),
 array([32, 46, 34, 45, 28,  9, 32, 35]),
 array([33, 48,  2, 48,  2, 50, 41, 32]),
 array([28, 43, 47, 47, 50, 35, 24, 45]),
 array([30, 43, 35, 35, 42, 28,  2, 32]),
 array([32, 42, 28,  2, 40, 47, 36,  2]),
 array([ 1, 46, 47, 36, 28,  2, 46, 41])]

In [96]:
# next char after each sequence
y[:cs]

array([ 1, 36,  2, 46, 41, 47,  2, 42])

In [97]:
n_fac = 42

### Create and train model

In [100]:
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name+'_in')
    emb = Embedding(n_in, n_out, input_length=1, name=name+'_emb')(inp)
    return inp, Flatten()(emb)

In [101]:
c_ins = [embedding_input('c'+str(n), vocab_size, n_fac) for n in range(cs)]

In [102]:
n_hidden = 256

In [106]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu', kernel_initializer='identity')
dense_out = Dense(vocab_size, activation='softmax')

In [104]:
# The first char of each sequence goes through dense_in(),
# to create our first hidden activations
hidden = dense_in(c_ins[0][1])

In [105]:
# For each successive layer we combine the output of dense_in() on the
# next character with the output of dense_hidden()
# on the current hidden state
for i in range(1, cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden])


In [107]:
# Putting the final hidden state through dense_out()
c_out = dense_out(hidden)
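
The loop above is an RNN unrolled by hand: the same two weight matrices are reused at every step. The NumPy sketch below mirrors it with toy sizes, no biases, random stand-in inputs instead of embeddings, and `merge` again assumed to sum its inputs. It also hints at why an identity initialisation for `dense_hidden` can be useful: before any training, the state is passed forward unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, cs = 4, 6, 8   # toy sizes

W_in = rng.normal(scale=0.1, size=(n_in, n_hidden))  # dense_in
W_hid = np.eye(n_hidden)                             # dense_hidden, identity init

def relu(z):
    return np.maximum(z, 0.0)

def final_hidden(inputs):
    """Run the unrolled recurrence over a sequence of input vectors."""
    hidden = relu(inputs[0] @ W_in)
    for x in inputs[1:]:
        # Same weights at every step; new input and old state are summed.
        hidden = relu(x @ W_in) + relu(hidden @ W_hid)
    return hidden

inputs = rng.normal(size=(cs, n_in))
h = final_hidden(inputs)

# With identity hidden weights and ReLU, the non-negative state passes through
# unchanged, so the final state is the sum of the per-step input activations.
assert np.allclose(h, sum(relu(x @ W_in) for x in inputs))
```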

In [108]:
model = Model([c[0] for c in c_ins], c_out)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [None]:
# model.fit(xs, y, batch_size=64, nb_epoch=4)  # nb_epoch value is an assumption (same as the 3 character model); it was left unset

## Simple example

In [36]:
maxlen = 40
sentences = []
next_chars = []

for i in range(0, len(idx) - maxlen+1):
    sentences.append(idx[i: i + maxlen])
    next_chars.append(idx[i+1: i+maxlen+1])

print('number of sequences: ', len(sentences))

number of sequences:  600854
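
A small sketch of the same windowing on a made-up string shows how each target is its input shifted by one character. The loop here stops one step earlier than the cell above so that every target window has full length; that change is a choice made for this example.

```python
text = "abcdefgh"
maxlen = 4

sentences, next_chars = [], []
for i in range(len(text) - maxlen):
    sentences.append(text[i: i + maxlen])           # input window
    next_chars.append(text[i + 1: i + maxlen + 1])  # same window, shifted by one

pairs = list(zip(sentences, next_chars))
```

With `return_sequences=True`, the model is trained to predict the next character at every position of the window, not only after the last one.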


In [44]:
sentences = np.concatenate([[np.array(o)] for o in sentences[:-2]])
next_chars = np.concatenate([[np.array(o)] for o in next_chars[:-2]])

In [45]:
n_fac = 24

In [49]:
model=Sequential([
    Embedding(vocab_size, n_fac, input_length=maxlen),
    LSTM(512,
         input_dim=n_fac,
         return_sequences=True,
         dropout_U=0.2,
         dropout_W=0.2,
         consume_less='gpu'),
    Dropout(0.2),
    LSTM(512,
         return_sequences=True,
         dropout_U=0.2,
         dropout_W=0.2,
         consume_less='gpu'),
    Dropout(0.2),
    TimeDistributed(Dense(vocab_size)),
    Activation('softmax')  
])



In [53]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

### Train model

In [54]:
def print_example():
    seed_string = 'ethics is a basic foundation of all that'
    for i in range(320):
        x = np.array([char_indices[c] for c in seed_string[-40:]])[np.newaxis, :]
        preds = model.predict(x)[0][-1]
        preds = preds / np.sum(preds)
        next_char = choice(chars, p=preds)
        seed_string = seed_string + next_char
    print(seed_string)
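
The sampling loop can be tried without a trained network by swapping in a stub. In the sketch below `fake_predict` is a hypothetical stand-in for `model.predict`, and the three-character vocabulary and window of 4 are made up; only the generate-and-append pattern matches the function above.

```python
import numpy as np

rng = np.random.default_rng(0)
chars = list("ab ")
char_indices = {c: i for i, c in enumerate(chars)}
window = 4

def fake_predict(x):
    """Hypothetical stand-in for the model: favours the id after the last one."""
    probs = np.full(len(chars), 0.1)
    probs[(x[-1] + 1) % len(chars)] = 0.8
    return probs

def generate(seed, n_steps):
    out = seed
    for _ in range(n_steps):
        # The last `window` characters form the model input.
        x = np.array([char_indices[c] for c in out[-window:]])
        probs = fake_predict(x)
        probs = probs / probs.sum()           # renormalise against rounding drift
        i = rng.choice(len(chars), p=probs)   # sample instead of taking the argmax
        out += chars[i]
    return out

sample = generate("ab a", 10)
```

Sampling from the predicted distribution, rather than always taking the most likely character, keeps the generated text from collapsing into a repeating loop.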

In [58]:
# TODO: fit and experiment with different parameters
# Should be trained on a GPU
# model.fit(sentences, np.expand_dims(next_chars, -1), batch_size=64, nb_epoch=1)

In [None]:
print_example()

In [None]:
model.save_weights('date/char_rnn.h5')