## New Hypothesis

The last model I tried to implement was too complex for the computation I wanted to run, and it wasn't optimized for a GPU. In this iteration, I'm going to practice adapting the code to run on a TPU/GPU, and I'm going to apply what I've learned about TF so far to see whether that produces a model that matches Bengio's more closely.

### Preparing Text Data for Deep Learning w/ Keras
To get my model to work with the Brown corpus, I need to:
1. Split words with `text_to_word_sequence`
2. Encode with `one_hot`
3. Hash encode with `hashing_trick`
4. Use the Keras `Tokenizer` API

In [13]:
from keras.preprocessing.text import text_to_word_sequence
# load the document into memory
def load_doc(filename):
    # open the file read-only; the with block closes it even if read() fails
    with open(filename, 'r') as file:
        # read and return all the text
        return file.read()

In [16]:
text = load_doc("data/brown-train.txt")
# tokenize the document
result = text_to_word_sequence(text)
print(result)
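
To make the tokenization step concrete, here is a rough pure-Python approximation of what `text_to_word_sequence` does with its default arguments: lowercase the text, replace punctuation with spaces, and split on whitespace. The helper name `simple_word_sequence` and the sample sentence are my own, and the punctuation set is an assumption based on the Keras defaults, so treat this as a sketch rather than the library's implementation.

```python
import string

# Assumption: filter all ASCII punctuation except the apostrophe, plus tab and newline.
FILTERS = string.punctuation.replace("'", "") + "\t\n"

def simple_word_sequence(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

# Hypothetical sample sentence, not taken from the training file.
sample = "The Grand Jury said Friday: no irregularities took place."
print(simple_word_sequence(sample))
```

Wrapping the result in `set(...)` and taking its length gives the vocabulary size in the same way as the cell that follows.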



In [18]:
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

22266


In [20]:
from keras.preprocessing.text import one_hot

# integer encode the document (the hash space is 1.3x the vocabulary size to reduce collisions)
result = one_hot(text, round(vocab_size*1.3))
print(result)

[13879, 20306, 19491, 7598, 11409, 498, 14420, 1599, 22102, 26776, 25641, 18782, 25635, 6073, 24085, 21561, 18591, 7731, 27615, 13851, 23878, 14822, 145, 13879, 11409, 21651, 498, 11771, 35, 5000, 11368, 27615, 13879, 386, 6915, 15227, 6002, 27674, 7714, 13652, 27221, 26776, 13879, 6073, 8901, 13879, 6760, 1111, 24390, 26776, 13879, 386, 26776, 2146, 7731, 6719, 13879, 26808, 11771, 6002, 13879, 6073, 13369, 25881, 13879, 19059, 18235, 35, 11409, 27674, 11157, 14268, 22971, 20306, 5187, 22915, 22722, 13772, 12215, 21093, 12741, 10754, 26776, 8919, 23878, 7731, 11771, 13879, 9939, 7903, 25635, 6002, 13369, 3692, 22971, 4813, 9470, 27383, 10342, 10276, 3180, 28151, 18027, 7396, 26776, 8480, 10754, 13369, 21457, 7731, 13879, 11409, 498, 19492, 13879, 11854, 26501, 11771, 13879, 6073, 13879, 5725, 26776, 4848, 1111, 13879, 28853, 26776, 27146, 386, 7731, 13879, 11409, 498, 15847, 25626, 25555, 27615, 25054, 26776, 21556, 14822, 1111, 6073, 28350, 20903, 19915, 1323, 24895, 1111, 28000, 261
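
A hash space 1.3 times the vocabulary size does not eliminate collisions. As a back-of-the-envelope check, assuming the hash spreads words uniformly and that codes fall in the range 1 to n - 1 (which is how I understand `one_hot` to assign them), the expected number of distinct codes can be estimated like this. The function name `expected_distinct_codes` is my own.

```python
def expected_distinct_codes(vocab_size, n):
    """Expected number of distinct codes when vocab_size words are hashed
    uniformly at random into the n - 1 available codes (1 .. n - 1)."""
    buckets = n - 1
    return buckets * (1 - (1 - 1 / buckets) ** vocab_size)

vocab_size = 22266               # value printed above
n = round(vocab_size * 1.3)      # hash space passed to one_hot
distinct = expected_distinct_codes(vocab_size, n)

print(f"hash space: {n}")
print(f"expected distinct codes: {distinct:.0f}")
print(f"expected words sharing a code with an earlier word: {vocab_size - distinct:.0f}")
```

Under those assumptions a sizeable share of the vocabulary ends up sharing a code with another word, so the hashed encoding is lossy even with the extra headroom.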

In [21]:
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence

text = load_doc("data/brown-train.txt")
# tokenize the document
result = text_to_word_sequence(text)

words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

22266
[21750, 4081, 16369, 28150, 5247, 22704, 18608, 2855, 11855, 17951, 7342, 27964, 15669, 9476, 7780, 17795, 4431, 13257, 22428, 3214, 9505, 19114, 8161, 21750, 5247, 28895, 22704, 4135, 25037, 1051, 10666, 22428, 21750, 16967, 10534, 27846, 7296, 26525, 24318, 20585, 25891, 17951, 21750, 9476, 5790, 21750, 8622, 17745, 23214, 17951, 21750, 16967, 17951, 16638, 13257, 18190, 21750, 6262, 4135, 7296, 21750, 9476, 928, 7144, 21750, 11048, 25902, 25037, 5247, 26525, 23792, 6255, 13810, 4081, 23726, 20421, 24258, 226, 3169, 7274, 18485, 23404, 17951, 19009, 9505, 13257, 4135, 21750, 2899, 24792, 15669, 7296, 928, 12051, 13810, 22302, 22867, 14965, 17806, 23736, 21440, 28708, 15651, 16083, 17951, 23650, 23404, 928, 5240, 13257, 21750, 5247, 22704, 26233, 21750, 16875, 6553, 4135, 21750, 9476, 21750, 482, 17951, 14163, 17745, 21750, 13120, 17951, 1393, 16967, 13257, 21750, 5247, 22704, 12192, 10053, 15431, 22428, 15986, 17951, 632, 27353, 17745, 9476, 17154, 558, 18417, 24669, 23976, 177
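
The `md5` option matters for reproducibility: Python's built-in `hash` is salted per process for strings, so codes from the default hash can change between runs, while an MD5 digest is stable. A minimal standard-library sketch of the idea follows. The helper `md5_code` is hypothetical and only approximates how I understand `hashing_trick` to map a word to an index.

```python
import hashlib

def md5_code(word, n):
    """Map a word to a stable integer in 1 .. n - 1 using its MD5 digest."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1   # code 0 is left unused

n = 28946  # round(22266 * 1.3), as in the cell above
words = ["the", "county", "grand", "jury", "the"]  # hypothetical toy input
codes = [md5_code(w, n) for w in words]
print(codes)  # repeated words always map to the same code
```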

In [23]:
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.text import Tokenizer

text = load_doc("data/brown-train.txt")
# tokenize the document
result = text_to_word_sequence(text)

# create the tokenizer
t = Tokenizer()
# fit the tokenizer; result is a list of words, so each word is treated as its own "document"
t.fit_on_texts(result)

In [24]:
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

277526
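
Only part of that output is shown, so here is a small pure-Python sketch of the bookkeeping I expect the `Tokenizer` to do: count each word, then assign indices starting at 1 in order of decreasing frequency. The function `build_word_index` and the toy word list are hypothetical, and ties are broken by first appearance here, which may differ from Keras. It also illustrates that when every list entry is treated as its own document, the document count is simply the length of the list.

```python
from collections import Counter

def build_word_index(tokens):
    """Return (word_counts, word_index), indexing words from 1 by descending count."""
    word_counts = Counter(tokens)
    ranked = word_counts.most_common()  # most frequent first
    word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}
    return word_counts, word_index

tokens = ["the", "jury", "said", "the", "election", "the", "jury"]
word_counts, word_index = build_word_index(tokens)
document_count = len(tokens)  # one "document" per list entry

print(word_counts)
print(word_index)
print(document_count)
```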


In [25]:
# build a count matrix: one row per entry in result, one column per word index
encoded_docs = t.texts_to_matrix(result, mode='count')
print(encoded_docs)

[[0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
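
Each row of this matrix corresponds to one entry of `result`, so with a list of single words as input every row holds at most one nonzero count. The sketch below builds the same kind of count matrix in plain Python for a toy input, then estimates the memory a dense matrix would need at the sizes printed earlier. The helper `count_matrix` is hypothetical, and the estimate assumes 8-byte floats and a vocabulary of 22266 words plus one reserved column; both are my assumptions rather than values reported by Keras.

```python
def count_matrix(docs, word_index):
    """One row per document, one column per word index (column 0 is unused)."""
    width = len(word_index) + 1
    matrix = []
    for doc in docs:
        row = [0.0] * width
        for word in doc.split():
            row[word_index[word]] += 1.0
        matrix.append(row)
    return matrix

word_index = {"the": 1, "jury": 2, "said": 3}

# Single words as documents: each row has exactly one nonzero entry.
print(count_matrix(["the", "jury", "said", "the"], word_index))
# A whole sentence as one document: a single row holding the word counts.
print(count_matrix(["the jury said the"], word_index))

# Rough memory estimate for the dense matrix (assumed sizes, see above).
n_rows, n_cols, bytes_per_float = 277526, 22266 + 1, 8
estimated_gb = n_rows * n_cols * bytes_per_float / 1e9
print(f"{estimated_gb:.1f} GB")
```

At sizes like these a dense count matrix is expensive, so integer sequences (for example from the tokenizer's `texts_to_sequences`) are likely a lighter input for the model.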
