# 1. Split Words with text_to_word_sequence
* Words are called tokens, and the process of splitting text into tokens is called tokenization.
* By default, **text_to_word_sequence()**:
    * Splits words on spaces (split=' ')
    * Filters out punctuation
    * Converts text to lowercase (lower=True)

In [1]:
from keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'The quick brown fox jumped over the lazy dog.'

# tokenize the document
result = text_to_word_sequence(text)
print(result)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
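
The default behaviour described above can be approximated in plain Python. This is a minimal sketch, not the Keras implementation: the helper name `simple_word_sequence` is hypothetical, and the filter string is assumed to match the Keras default.

```python
# Assumed to mirror the default `filters` argument of text_to_word_sequence.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Hypothetical helper: lowercase, strip filtered characters, split on `split`."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character.
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Drop the empty strings produced by consecutive separators.
    return [token for token in text.split(split) if token]


tokens = simple_word_sequence("Hello, World! Keras makes tokenization easy.")
print(tokens)
```

Replacing punctuation with the split character, rather than deleting it, keeps words such as `well-done` apart as two tokens.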


# 2. Encoding with one_hot
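* Keras provides the **one_hot()** function, which we can use to tokenize and integer encode a text document in one step.
* Despite its name, it does not produce a one-hot encoding: it hashes each word into a fixed-size range of integers, so different words can collide on the same integer.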

In [2]:
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'The quick brown fox jumped over the lazy dog.'

# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

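# integer encode the document; the hash space is made about 30% larger
# than the vocabulary to reduce the chance of collisions
result = one_hot(text, round(vocab_size * 1.3))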
print(result)

8
[2, 4, 9, 9, 9, 2, 2, 2, 3]
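
Because the encoding is hash based, several distinct words can share one integer. The sketch below groups the tokens by the integers printed above to make those collisions visible. The `encoded` list is copied from that output; `one_hot()` relies on Python's built-in `hash` by default, so a fresh run will most likely print different integers.

```python
from collections import defaultdict

# Tokens of the example sentence, in order.
words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
# Integers copied from the output above (not recomputed here).
encoded = [2, 4, 9, 9, 9, 2, 2, 2, 3]

# Collect the distinct words that were mapped to each integer.
buckets = defaultdict(set)
for word, index in zip(words, encoded):
    buckets[index].add(word)

# An integer shared by more than one distinct word is a collision.
collisions = {index: sorted(ws) for index, ws in buckets.items() if len(ws) > 1}
print(collisions)
```

In this run, only four distinct integers represent eight distinct words, which is why a larger hash space is usually preferable.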


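# 3. Hash Encoding with hashing_trick
* The **hashing_trick()** function hashes each word to an integer, which avoids the need to keep track of a vocabulary. This is faster and requires less memory, at the cost of possible collisions.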

In [3]:
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'The quick brown fox jumped over the lazy dog.'

# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

# integer encode the document with a stable md5 hash
result = hashing_trick(text, round(vocab_size * 1.3), hash_function='md5')
print(result)

8
[6, 4, 1, 2, 7, 5, 6, 2, 6]
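
The idea behind the md5 variant can be sketched with the standard library alone. This is an illustration under assumptions, not the Keras source: the helper name `md5_hashing_trick` is hypothetical, and the mapping `hash % (n - 1) + 1` (which keeps index 0 unused) is how I understand Keras to assign indices.

```python
import hashlib


def md5_hashing_trick(words, n):
    """Hypothetical helper: map each word to an integer in [1, n - 1] via md5."""

    def stable_hash(word):
        # md5 gives the same digest in every process, unlike Python's hash().
        return int(hashlib.md5(word.encode()).hexdigest(), 16)

    # Assumed mapping: index 0 is never produced.
    return [stable_hash(word) % (n - 1) + 1 for word in words]


words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
encoded = md5_hashing_trick(words, 10)
print(encoded)
```

No vocabulary is stored anywhere: the same word always maps to the same integer, and that is the whole state of the encoder.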


# 4. Tokenizer API
* Once fit, the **Tokenizer** provides 4 attributes that we can use to query what has been learned about our documents:
    * **word_counts**: A dictionary of words and their counts.
    * **word_docs**: A dictionary of words and how many documents each appeared in.
    * **word_index**: A dictionary of words and their uniquely assigned integers.
    * **document_count**: An integer count of the total number of documents that were used to fit the **Tokenizer**.

In [4]:
from keras.preprocessing.text import Tokenizer

# define 5 documents
docs = ['Well done!','Good work','Great effort','nice work','Excellent']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)

# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
5
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
defaultdict(<class 'int'>, {'done': 1, 'well': 1, 'work': 2, 'good': 1, 'effort': 1, 'great': 1, 'nice': 1, 'excellent': 1})
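
To make the four attributes concrete, here is a plain-Python sketch that rebuilds them for the same five documents. It is only an approximation of what the **Tokenizer** does: the `tokenize` helper is hypothetical and strips all of `string.punctuation`, which is a simplification of the Keras filter list.

```python
import string
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent']


def tokenize(text):
    """Hypothetical helper: lowercase, turn punctuation into spaces, split."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()


word_counts = Counter()  # word -> total occurrences across all documents
word_docs = Counter()    # word -> number of documents containing the word
for doc in docs:
    tokens = tokenize(doc)
    word_counts.update(tokens)
    word_docs.update(set(tokens))  # count each word once per document

document_count = len(docs)

# Most frequent word gets index 1; ties keep first-seen order (stable sort).
ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

print(dict(word_counts))
print(document_count)
print(word_index)
print(dict(word_docs))
```

Indexing starts at 1 here because index 0 is left unused, matching the `word_index` printed above.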


* The **texts_to_matrix()** function on the **Tokenizer** can be used to create one vector per input document.
* The modes available are:
    * **binary**: Whether or not each word is present in the document. This is the default.
    * **count**: The count of each word in the document.
    * **tfidf**: The Term Frequency-Inverse Document Frequency (TF-IDF) score for each word in the document.
    * **freq**: The frequency of each word as a ratio of words within each document.

In [5]:
# encode each document as a vector of word counts
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

[[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]
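
The **binary**, **count** and **freq** modes can be sketched in plain Python to show how each row is built. This is a rough illustration, not the Keras implementation: the helper `to_matrix` is hypothetical, documents are assumed to be already tokenized, and **tfidf** is left out because its exact formula is not given here.

```python
# Word index copied from the Tokenizer output above.
word_index = {'work': 1, 'well': 2, 'done': 3, 'good': 4,
              'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}

# The five documents, assumed to be already tokenized.
docs_tokens = [['well', 'done'], ['good', 'work'], ['great', 'effort'],
               ['nice', 'work'], ['excellent']]


def to_matrix(docs_tokens, word_index, mode="binary"):
    """Hypothetical helper: one row per document, one column per word index."""
    width = len(word_index) + 1  # column 0 stays empty, as in the output above
    matrix = []
    for tokens in docs_tokens:
        row = [0.0] * width
        for token in tokens:
            row[word_index[token]] += 1.0  # raw counts first
        if mode == "binary":
            row = [1.0 if count > 0 else 0.0 for count in row]
        elif mode == "freq":
            total = len(tokens) or 1  # avoid dividing by zero for empty documents
            row = [count / total for count in row]
        elif mode != "count":
            raise ValueError(f"unsupported mode: {mode}")
        matrix.append(row)
    return matrix


for row in to_matrix(docs_tokens, word_index, mode="count"):
    print(row)
```

For these documents the **binary** and **count** matrices are identical because no word repeats within a document; they differ as soon as one does.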


# Summary
* We discovered how to use the Keras text preprocessing API to prepare our text data for deep learning.