# **Recurrent Neural Networks**


---



In the spirit of:

*Yuh can' say it better than Karpathy* 

-John Fogarty

[The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

## Notebook Goals
Demonstrate a simple, old-school RNN language model that gives you the jones to take it further.

## Code/Concepts Agenda

1) DEPS and DATA: what you need to get started #nltk4life

2) PREPROCESSING: we should do more, but I'll hand wave and say something about corpus length and UNKs

3) WORD2VEC: learn to roll your own - not just for sushi any more

4) REPRESENTATION LAYERS: what, we don't have 94GB of RAM for our onehots!?

5) MODEL: the sound of one layer clapping

6) FUN: dead presidents fail to inspire (yet)

In [68]:
#DEPS
import numpy as np
import pandas as pd

from keras.models import Sequential
from keras.layers import SimpleRNN, LSTM, Dense

import nltk
nltk.download("inaugural")

!pip install gensim
import gensim

#DATA LOADING
from nltk.corpus import inaugural
inaugural.fileids()

text_names = inaugural.fileids()

print(text_names)

documents_all = [ inaugural.words(text_name) for text_name in text_names ]

documents_train = [ inaugural.words(text_name)[:25] for text_name in text_names ]

#PRE-PROCESSING
# Build real lists (not lazy map objects) so each document can be iterated more than once
documents_all = [ [ word.lower() for word in document ] for document in documents_all ]
documents_train = [ [ word.lower() for word in document ] for document in documents_train ]

[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[u'1789-Washington.txt', u'1793-Washington.txt', u'1797-Adams.txt', u'1801-Jefferson.txt', u'1805-Jefferson.txt', u'1809-Madison.txt', u'1813-Madison.txt', u'1817-Monroe.txt', u'1821-Monroe.txt', u'1825-Adams.txt', u'1829-Jackson.txt', u'1833-Jackson.txt', u'1837-VanBuren.txt', u'1841-Harrison.txt', u'1845-Polk.txt', u'1849-Taylor.txt', u'1853-Pierce.txt', u'1857-Buchanan.txt', u'1861-Lincoln.txt', u'1865-Lincoln.txt', u'1869-Grant.txt', u'1873-Grant.txt', u'1877-Hayes.txt', u'1881-Garfield.txt', u'1885-Cleveland.txt', u'1889-Harrison.txt', u'1893-Cleveland.txt', u'1897-McKinley.txt', u'1901-McKinley.txt', u'1905-Roosevelt.txt', u'1909-Taft.txt', u'1913-Wilson.txt', u'1917-Wilson.txt', u'1921-Harding.txt', u'1925-Coolidge.txt', u'1929-Hoover.txt', u'1933-Roosevelt.txt', u'1937-Roosevelt.txt', u'1941-Roosevelt.txt', u'1945-Roosevelt.txt', u'1949-Truman.txt', u'1953-Eise

# DEPS

At this point numpy, pandas, and keras need no introduction.

### Keras Layers
However, we are going to pull in some new layer types, in particular SimpleRNN, which handles basic recurrent connections behind the scenes.

**Feed Forward (e.g. Dense)**

Input ---> Hidden/Processing Layers ---> Output

**Recurrent**

Input ---> Hidden/Processing Layers ----> Output
           
           ^         |
         /   \     \ | /
           |        \ /
           |         |
            <---------
            
**SimpleRNN**

Input[ t ] + SimpleRNN[ t - 1 ] ---> SimpleRNN[ t ]
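
The recurrence can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Keras's implementation: the helper `rnn_step`, the weight names `W_x`, `W_h`, `b`, and all the numbers are made up here, and `tanh` is used as the activation (the Keras default for SimpleRNN).

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x . x_t + W_h . h_prev + b)."""
    h_t = []
    for j in range(len(b)):
        total = b[j]
        total += sum(W_x[j][k] * x_t[k] for k in range(len(x_t)))
        total += sum(W_h[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

# Hypothetical toy sizes: 3 input features, 2 hidden units.
W_x = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
W_h = [[0.4, 0.0], [-0.3, 0.2]]
b = [0.0, 0.1]

sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h = [0.0, 0.0]  # the hidden state starts at zero
states = []
for x_t in sequence:
    # the state from step t-1 is fed back in alongside the input at step t
    h = rnn_step(x_t, h, W_x, W_h, b)
    states.append(h)
```

The only thing separating this from a feed-forward layer is the `W_h` term, which lets earlier inputs influence later outputs.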

### NLTK
Standard toolkit for basic, non-neural NLP (Natural Language Processing). Includes a large number of gold-standard corpora: the NISTs of NLP. Here we take a collection of inaugural speeches from past presidents.

### GENSIM
Good library focused on topic modeling and other vector space transformations used in NLP. Has a convenient Word2Vec model built-in.

# PREPROCESSING
Much like with the (non NIST) datasets we've seen, the real world and its measurements are messy. Language is no exception.

The **golden rule for preprocessing is: does this feature contribute to my desired outcome?**

For example, if you want to learn the meanings of words, capitalized words (excepting proper nouns, of course...) generally mean the same thing as their all-lowercase forms. So, ditch the capitalization.

The inaugural speech corpus is NOT BIG DATA, so that is the extent of the preprocessing here.

With a larger data set, typical additions are: 
the removal of rare words and hapax legomena (words that only appear once...), which are replaced with a token representing the unknown, usually "UNK".

Also, especially in tasks where intent or general meaning is paramount, inflected forms are collapsed into their roots or stems: goes, went, going -> go
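
As a rough sketch of the UNK step described above (this notebook does not actually apply it), the helper below lowercases a toy corpus and replaces every token seen fewer than `min_count` times. The function name `replace_rare` and the example sentences are hypothetical.

```python
from collections import Counter

def replace_rare(documents, min_count=2, unk="UNK"):
    """Replace every token seen fewer than min_count times with unk."""
    counts = Counter(token for doc in documents for token in doc)
    return [[tok if counts[tok] >= min_count else unk for tok in doc]
            for doc in documents]

docs = [["We", "the", "People"], ["the", "people", "govern"]]
lowered = [[tok.lower() for tok in doc] for doc in docs]
cleaned = replace_rare(lowered, min_count=2)
print(cleaned)
```

With `min_count=2` every hapax legomenon becomes "UNK", which shrinks the vocabulary the output layer has to cover.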



In [69]:
#GENERATE WORD EMBEDDINGS
features_n = 50

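# NOTE: these are the gensim 3.x argument names; gensim 4.x renamed
# size -> vector_size and iter -> epochs
w2v = gensim.models.Word2Vec(
  documents_all,
  size = features_n,   # dimension of each word vector
  window = 3,          # context words considered on each side
  iter = 80,           # training passes over the corpus
  workers = 4,
  min_count = 1        # keep every word, even those seen only once
)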

W0913 19:23:34.880012 140042713429888 base_any2vec.py:723] consider setting layer size to a multiple of 4 for greater performance


# WORD2VEC

For information on this particular algorithm see this [TensorFlow tutorial](https://www.tensorflow.org/tutorials/representation/word2vec).

Fundamentally, the idea is that **"you shall know a word by the company it keeps"**, J. R. Firth's one-liner that neatly summarizes the distributional hypothesis of Zellig Harris.

Of particular importance here is the variable features_n, which sets the dimension of the word embeddings.
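
To make the distributional hypothesis concrete, here is a count-based toy, not Word2Vec itself: each word is represented by the counts of its neighbours, and words that keep the same company end up with similar vectors. The helpers `context_counts` and `cosine` and the three sentences are invented for illustration. Word2Vec instead learns dense vectors of length features_n.

```python
import math
from collections import Counter, defaultdict

def context_counts(sentences, window=1):
    """Count, for each word, the words appearing within `window` positions."""
    contexts = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][sent[j]] += 1
    return contexts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sentences = [
    "the cat sleeps".split(),
    "the dog sleeps".split(),
    "the senate votes".split(),
]
ctx = context_counts(sentences, window=1)
sim_cat_dog = cosine(ctx["cat"], ctx["dog"])        # identical contexts
sim_cat_senate = cosine(ctx["cat"], ctx["senate"])  # only "the" is shared
```

Here "cat" and "dog" share all of their neighbours, so they score higher than "cat" and "senate".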

In [0]:
#PREPARE REPRESENTATION LAYER HELPERS
tokens = { token for document in documents_train for token in document }

tokens_n = len(tokens)

# sort so the token <-> integer mapping is the same on every run
token2int = { token: i for (i, token) in enumerate(sorted(tokens)) }
int2token = { i: token for (token, i) in token2int.items() }

def int2onehot(i):
  onehot = [0 for _ in range(tokens_n)]
  onehot[i] = 1
  return onehot

tokens_all = [ token2int[word] for document in documents_train for word in document ]


#TRANSFORM HIGH-LEVEL REPRESENTATION INTO NETWORK IN-OUT REPRESENTATION

time_steps = 5 #number of prior words fed to the network for each prediction

def data_generator(int_list, time_steps):
  i = 0
  while True:
    current_ints = int_list[i : i + time_steps]
    next_int = int_list[i + time_steps]
    
    # input: a batch of one (time_steps x features_n) sequence of word vectors
    in_vec = np.array([ w2v.wv[int2token[current_int]] for current_int in current_ints]).reshape(1, len(current_ints), features_n)
    # target: onehot encoding of the word that follows the window
    out_vec = np.array(int2onehot(next_int)).reshape(1, tokens_n)
    
    i += 1
    
    # wrap around so the generator never runs out
    if i == len(int_list) - (time_steps + 1):
      i = 0
    
    yield in_vec, out_vec

# REPRESENTATION LAYERS

The most straightforward way to pass your training and target data to a keras model is to prepare lists of the vectors themselves. However, this does not scale well with the standard way of representing categorical data: one-hot encoding.

Especially with NLP tasks, your corpus may have 20-40k unique tokens, even after preprocessing. If you represent each word as a 20-40k long array of integers (of any size) you will quickly run out of RAM. For instance, holding a one-hot encoding of this tiny inaugural corpus in memory takes about 25GB of space.
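
A back-of-the-envelope calculation shows where the memory goes. The numbers are round, illustrative assumptions rather than measurements of the inaugural corpus: `n_tokens` is hypothetical and `vocab_size` is simply picked from the 20-40k range mentioned above.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value):
    """Bytes needed for a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_value

n_tokens = 100000    # hypothetical corpus length, chosen for round numbers
vocab_size = 30000   # middle of the 20-40k range
features_n = 50      # embedding size used in this notebook

onehot_gb = dense_bytes(n_tokens, vocab_size, 8) / 1e9     # 64-bit integers
embedded_gb = dense_bytes(n_tokens, features_n, 4) / 1e9   # 32-bit floats
print(onehot_gb)
print(embedded_gb)
```

Under these assumptions the one-hot array needs gigabytes while the embedded version needs megabytes, because the cost scales with the width of each row.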

One way of dealing with this issue is to use richer, more compact representations of the data, such as the Word2Vec embeddings we generated above.

Another trick, especially useful in Keras, is to train the model with a data generator. A data generator is a normal Python generator function that yields one batch of inputs and targets each time it is asked.

Here, we assign a unique integer to each token in the corpus, and the corpus is then transformed into a single array of integer IDs. Then we create a generator that iterates endlessly over this array, producing a sequence of Word2Vec input vectors and a onehot output vector.
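
The windowing idea is easier to see with the embedding and onehot lookups stripped away. The sketch below is a simplified stand-in: `window_generator` is a hypothetical name, and it yields raw integer IDs rather than vectors.

```python
from itertools import islice

def window_generator(int_list, time_steps):
    """Endlessly yield (context ids, next id) pairs, wrapping at the end."""
    last_start = len(int_list) - time_steps - 1
    i = 0
    while True:
        yield int_list[i:i + time_steps], int_list[i + time_steps]
        i = 0 if i == last_start else i + 1

ids = [10, 11, 12, 13, 14]
# islice takes a finite number of items from the endless generator
pairs = list(islice(window_generator(ids, time_steps=2), 4))
print(pairs)
```

Only one window exists in memory at a time, so the cost no longer depends on the length of the corpus.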

In [71]:
#RNN MODEL
model = Sequential()  

model.add( SimpleRNN(30, activation="relu", input_shape=(None, features_n), dropout=0.1 ) )

model.add( Dense(tokens_n, activation='softmax') )

model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

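# NOTE: newer Keras releases deprecate fit_generator; there, model.fit accepts the generator directly
model.fit_generator(data_generator(tokens_all, time_steps), steps_per_epoch=len(tokens_all)-(time_steps+1), epochs = 2)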



Epoch 1/2




Epoch 2/2


<keras.callbacks.History at 0x7f5d60393d50>

# MODEL

The model is simplicity itself - a single SimpleRNN layer that feeds a dense output layer. The input is a sequence of Word2Vec-encoded words [w1, w2, w3], which is used to predict the next word in the sequence (w4), encoded as a onehot.

The output is produced using softmax, so the activations sum to one and can be interpreted as probabilities. Ideally, the network learns to concentrate the probability on just a few possible words.
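
For reference, softmax itself is only a few lines. This is a plain-Python sketch of the standard formula, not the Keras implementation, and the input scores are arbitrary.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

Larger scores receive exponentially more of the probability mass, which is what lets the network concentrate on a few candidate words.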

In statistical NLP, a language model is a probability distribution over sequences of words. Specifically, it estimates p(w_n | w_n-1, w_n-2, ... , w_2, w_1): given the sequence of all n-1 prior words, what is the probability of each possible next word w_n?

Here the recurrent neural network is learning that distribution by being fed sequences of input words, and adjusting its weights to select the next word in the sequence.
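
For contrast, the simplest possible language model just counts. The sketch below conditions on only the single previous word (a bigram model), which is a deliberate simplification of the full history the RNN sees. The function `bigram_model` and the token list are made up for illustration.

```python
from collections import Counter, defaultdict

def bigram_model(tokens):
    """Estimate p(next | previous) from raw bigram counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
            for prev, nxts in counts.items()}

tokens = "my fellow citizens , my fellow americans , my friends".split()
model_probs = bigram_model(tokens)
print(model_probs["my"])
```

The RNN plays the same role as this lookup table, but it generalizes across similar words through their embeddings instead of memorizing exact counts.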

In [72]:
def token_to_input(input_token):
  # single word -> batch of one sequence of length one
  return w2v.wv[ input_token ].reshape(1,1,features_n)
  
def output_to_token(output_vector):
  # greedy choice: the token with the highest predicted probability
  return int2token[ int(np.argmax(output_vector)) ]

def step_model(input_token):
  input_vector = token_to_input(input_token)
  output_vector = model.predict(input_vector)[0]
  return output_to_token(output_vector)

sent = ["my"]

for i in range(100):
  ws = len(sent)
  # feed the whole sentence so far, then append the predicted next word
  next_input = np.array([ w2v.wv[ w ] for w in sent ]).reshape(1,ws,features_n)
  next_output = model.predict(next_input)[0]
  sent.append(output_to_token(next_output) )
  
print(" ".join(sent))




my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , , the my fellow citizens , ,


# FUN

A language model can be used for many things: word-sense disambiguation, auto-correction, author and language detection, etc. A silly but fun use of such a model is text generation.

Given a starting word, have the model predict and choose the next word. Then given this new sequence, predict another word, then take this new sequence...
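
The generation loop can be tried without a trained network. In the sketch below a hand-written table of next-word probabilities (`toy_probs`, entirely hypothetical) stands in for the model, and it looks only at the last word rather than the whole sequence.

```python
def generate(start, next_word_probs, length):
    """Greedy generation: repeatedly append the most probable next word."""
    sent = [start]
    for _ in range(length):
        candidates = next_word_probs.get(sent[-1])
        if not candidates:
            break  # no known continuation
        sent.append(max(candidates, key=candidates.get))
    return sent

# Hypothetical next-word probabilities, standing in for a trained model.
toy_probs = {
    "my": {"fellow": 0.7, "friends": 0.3},
    "fellow": {"citizens": 0.6, "americans": 0.4},
    "citizens": {",": 0.9, "of": 0.1},
    ",": {"my": 0.8, "the": 0.2},
}
text = " ".join(generate("my", toy_probs, length=7))
print(text)
```

Because the most probable word is always chosen, generation is deterministic and can easily fall into a repeating loop.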

