Source: [How to Prepare Text Data for Deep Learning with Keras](https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/)

## Tutorial Overview

This tutorial is divided into 4 parts:
  - Split words with `text_to_word_sequence`
  - Encoding with `one_hot`
  - Hash encoding with `hashing_trick`
  - Tokenizer API

### Split Words with `text_to_word_sequence`

By default, this function automatically does 3 things:
  - Splits words on spaces (`split=" "`)
  - Filters out punctuation (``filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'``)
  - Converts text to lowercase (`lower=True`)

In [3]:
from keras.preprocessing.text import text_to_word_sequence

text = "The quick brown fox jumped over the lazy dog. tim's"
r = text_to_word_sequence(text)
print(r)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', "tim's"]
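
The three default steps can be approximated in plain Python. This is only a sketch of the behaviour described above, not the Keras source; the helper name `split_words` is hypothetical and exists just for illustration.

```python
# Default filter characters, as listed above.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text, filters=FILTERS, lower=True, split=" "):
    """Hypothetical stand-in for text_to_word_sequence (illustration only)."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character.
    text = text.translate({ord(ch): split for ch in filters})
    # Split, then drop the empty strings left by consecutive separators.
    return [word for word in text.split(split) if word]


words = split_words("The quick brown fox jumped over the lazy dog. tim's")
print(words)
```

The apostrophe is not in the filter list, which is why `tim's` survives as a single token.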


### Encoding with `one_hot`

Keras provides the `one_hot()` function, which tokenizes and integer encodes a text document in one step. Despite its name, it does not create a one-hot encoding of the document.   
Instead, the function is a wrapper for the `hashing_trick()` function described in the next section, and it returns an integer-encoded version of the document. Because it relies on a hash function, collisions are possible, so not every word is guaranteed a unique integer value.

In [5]:
from keras.preprocessing.text import text_to_word_sequence, one_hot


text = "The quick brown fox jumped over the lazy dog. tim's"
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print("Vocab size:", vocab_size)
r = one_hot(text, round(vocab_size*1.3))
print(r)

Vocab size: 9
[2, 2, 6, 5, 8, 3, 2, 1, 4, 9]
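
The collision risk is visible in the output above. As a small check, the sketch below pairs each token with the integer printed for it and lists every integer shared by more than one distinct word. The `codes` list is copied from the run shown here; a different run may produce different integers.

```python
from collections import defaultdict

words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', "tim's"]
codes = [2, 2, 6, 5, 8, 3, 2, 1, 4, 9]  # copied from the output printed above

# Group the distinct words that were assigned to each integer.
words_by_code = defaultdict(set)
for word, code in zip(words, codes):
    words_by_code[code].add(word)

# A collision is any integer shared by two or more different words.
collisions = {code: sorted(ws) for code, ws in words_by_code.items() if len(ws) > 1}
print(collisions)  # {2: ['quick', 'the']}
```

Here `the` and `quick` both map to 2, so a model fed this encoding cannot tell them apart.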


### Hash Encoding with `hashing_trick`

A limitation of integer and count-based encodings is that they must maintain a vocabulary of words and their mapping to integers.

An alternative to this approach is to use a one-way hash function to convert words to integers. This avoids the need to keep track of a vocabulary, which is faster and requires less memory.

Keras provides the `hashing_trick()` function, which tokenizes and then integer encodes the document, just like the `one_hot()` function. It provides more flexibility, allowing you to specify the hash function as `hash` (the default), the built-in `md5` option, or your own function.

In [6]:
from keras.preprocessing.text import hashing_trick, text_to_word_sequence

text = "The quick brown fox jumped over the lazy dog. tim's"
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
r = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(r)

9
[7, 2, 8, 11, 7, 7, 7, 8, 11, 10]
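
To make the idea concrete, here is a minimal standard-library sketch of an MD5-based hashing trick. It assumes each word's digest is reduced modulo `n - 1` and shifted by one so that index 0 stays unused; that detail is an assumption about how such an encoder is typically written, and `md5_hashing_trick` is a hypothetical name, not a Keras function.

```python
from hashlib import md5


def md5_hashing_trick(words, n):
    """Hypothetical helper: map each word to an integer in [1, n - 1]."""
    codes = []
    for word in words:
        # MD5 is stable across runs and machines, unlike Python's salted hash().
        digest = int(md5(word.encode("utf-8")).hexdigest(), 16)
        codes.append(digest % (n - 1) + 1)  # index 0 is left unused
    return codes


words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', "tim's"]
codes = md5_hashing_trick(words, 12)
print(codes)
```

No vocabulary is stored anywhere: the same word always hashes to the same integer, which is what makes the encoding reusable without a lookup table.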


### Tokenizer API

Keras provides a more sophisticated API for preparing text that can be fit and reused to prepare multiple text documents. This may be the preferred approach for large projects.

Keras provides the Tokenizer class for preparing text documents for deep learning. The Tokenizer must be constructed and then fit on either raw text documents or integer encoded text documents.

In [7]:
from keras.preprocessing.text import Tokenizer

# there are 5 documents
docs = ['Well done!',
    'Good work',
    'Great effort',
    'nice work',
    'Excellent!']

t = Tokenizer()
t.fit_on_texts(docs)

Once fit, the Tokenizer exposes 4 attributes that describe what it learned from the documents:
  - `word_counts`: A dictionary of words and their counts.
  - `word_docs`: A dictionary of words and how many documents each appeared in.
  - `word_index`: A dictionary of words and their uniquely assigned integers.
  - `document_count`: An integer count of the total number of documents that were used to fit the Tokenizer.

In [8]:
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
5
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
{'well': 1, 'done': 1, 'work': 2, 'good': 1, 'effort': 1, 'great': 1, 'nice': 1, 'excellent': 1}
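
The printed values suggest how the attributes relate: `word_index` ranks words by their total count, with ties kept in first-seen order. The sketch below rebuilds all four values in plain Python under that assumption. The `tokenize` helper is hypothetical and much simpler than the real Tokenizer.

```python
from collections import OrderedDict

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']


def tokenize(doc):
    # Simplified tokenizer: lowercase, drop "!" and split on whitespace.
    return doc.lower().replace("!", "").split()


word_counts = OrderedDict()  # word -> total occurrences across all documents
word_docs = {}               # word -> number of documents containing the word
document_count = 0           # number of documents seen

for doc in docs:
    document_count += 1
    tokens = tokenize(doc)
    for word in tokens:
        word_counts[word] = word_counts.get(word, 0) + 1
    for word in set(tokens):
        word_docs[word] = word_docs.get(word, 0) + 1

# Most frequent words get the lowest indices; index 0 is never assigned.
# sorted() is stable, so words with equal counts keep their first-seen order.
ranked = sorted(word_counts, key=word_counts.get, reverse=True)
word_index = {word: i for i, word in enumerate(ranked, start=1)}
print(word_index)
```

With these five documents, `work` is the only word that appears twice, so it receives index 1.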


Once the Tokenizer has been fit on training data, it can be used to encode documents in the train or test datasets.

The `texts_to_matrix()` function on the Tokenizer creates one vector per input document. The length of each vector is the size of the vocabulary plus one, because index 0 is reserved and never assigned to a word.

This function provides a suite of standard bag-of-words text encoding schemes, selected via the `mode` argument.

The modes available include:
  - 'binary': Whether or not each word is present in the document. This is the default.
  - 'count': The count of each word in the document.
  - 'tfidf': The Term Frequency-Inverse Document Frequency (TF-IDF) score for each word.
  - 'freq': The frequency of each word as a ratio of words within each document.
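
To see how three of these modes differ, the sketch below encodes a single tokenized document by hand. It reuses the `word_index` learned earlier and leaves column 0 unused. The `encode` helper and the sample tokens are hypothetical, and the `tfidf` mode is omitted because its exact weighting is not given here.

```python
word_index = {'work': 1, 'well': 2, 'done': 3, 'good': 4,
              'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}


def encode(tokens, mode="binary"):
    """Hypothetical helper: encode one tokenized document as a vector."""
    counts = [0.0] * (len(word_index) + 1)  # column 0 stays unused
    for word in tokens:
        if word in word_index:
            counts[word_index[word]] += 1.0
    if mode == "count":
        return counts
    if mode == "binary":
        return [1.0 if c > 0 else 0.0 for c in counts]
    if mode == "freq":
        total = len(tokens) or 1  # avoid dividing by zero on empty input
        return [c / total for c in counts]
    raise ValueError("unsupported mode: " + mode)


tokens = ['nice', 'work', 'work']  # made-up document for illustration
binary_vec = encode(tokens, "binary")
count_vec = encode(tokens, "count")
freq_vec = encode(tokens, "freq")
print(binary_vec)
print(count_vec)
print(freq_vec)
```

Repeating `work` changes the `count` and `freq` vectors but not the `binary` one, which only records presence.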

In [11]:
from keras.preprocessing.text import Tokenizer

# there are 5 documents
docs = ['Well done!',
    'Good work',
    'Great effort',
    'nice work',
    'Excellent!']
t = Tokenizer()
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

# encode documents as a bag-of-words count matrix
encoded_docs = t.texts_to_matrix(docs, mode="count")
print(encoded_docs)

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
5
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
{'well': 1, 'done': 1, 'work': 2, 'good': 1, 'effort': 1, 'great': 1, 'nice': 1, 'excellent': 1}
[[ 0.  0.  1.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.]]
