------------------------------
#### Text preprocessing with Keras
----------------------------
1. Keras text_to_word_sequence.
2. Keras hashing_trick.
3. Encoding with one_hot in Keras.
4. Keras Tokenizer.

#### text_to_word_sequence
Keras provides the text_to_word_sequence() function to split text into a list of word tokens.

When preprocessing text, this is often the first step, applied before any further processing.

- text_to_word_sequence() splits the text on whitespace.
- It also filters out punctuation marks and converts all characters to lowercase.
- The default list of characters that it removes is ! " # $ % & ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ \t \n.
- A single call therefore handles splitting, punctuation filtering, and lowercasing.
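
The behaviours listed above (lowercasing, character filtering, whitespace splitting) can be approximated in plain Python. The sketch below is not the Keras source: the helper name `simple_word_sequence` and the constant `DEFAULT_FILTERS` are illustrative names chosen here, and the filter string is assumed to match the Keras default.

```python
# Illustrative stand-in for text_to_word_sequence(); not the Keras implementation.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # assumed Keras default


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, replace every filter character with the separator, then split."""
    if lower:
        text = text.lower()
    # Map each filtered character to the separator so it acts as a word boundary.
    table = str.maketrans({ch: split for ch in filters})
    # Drop the empty strings produced by consecutive separators.
    return [tok for tok in text.translate(table).split(split) if tok]


sample = "Text to Word @DXC all-good St. Xavier can't"
default_tokens = simple_word_sequence(sample)                 # default filter list
custom_tokens = simple_word_sequence(sample, filters="<=>")   # only <, = and > removed
print(default_tokens)
print(custom_tokens)
```

With the default filter list, `@DXC` becomes `dxc` and `all-good` splits into two tokens. With a custom `filters` string, those characters survive because the custom list replaces the default rather than extending it.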

In [29]:
from keras.preprocessing.text import text_to_word_sequence

In [32]:
# define the text 
text = "Text to Word @DXC all-good St. Xavier can't Sequence Function works really well ! @ # % ^ &"

In [35]:
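# tokenizing the text; filters='<=>' replaces the default filter list,
# so only <, = and > are removed and the other punctuation is kept
text_to_word_sequence(text, filters='<=>')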

['text',
 'to',
 'word',
 '@dxc',
 'all-good',
 'st.',
 'xavier',
 "can't",
 'sequence',
 'function',
 'works',
 'really',
 'well',
 '!',
 '@',
 '#',
 '%',
 '^',
 '&']

#### hashing_trick
Keras hashing_trick() function converts a text to a sequence of indexes in a fixed size hashing space.

This is useful because deep learning models cannot consume raw text, so converting the text into a list of words with text_to_word_sequence() is only the first step.

With the hashing_trick() function, we get back a list of integer word indexes instead.

In [36]:
from keras.preprocessing.text import hashing_trick

In [37]:
# define the text 
text = 'An example for keras hashing trick function'

In [None]:
# tokenizing the text 
tokens = text_to_word_sequence(text)
tokens

['an', 'example', 'for', 'keras', 'hashing', 'trick', 'function']

In [40]:
length     = 10000      # size of the hashing space
final_text = hashing_trick(text, length)   # takes the raw text, not the token list

print(final_text)

[6890, 3294, 4261, 1351, 7882, 5735, 8623]


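In this output each of the seven words received a distinct index. That is not guaranteed: two different words can hash to the same index, and such collisions become more likely as the hashing space shrinks relative to the vocabulary.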
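
The index computation can be sketched without Keras. The helper below is a hypothetical stand-in, not the library code: it assumes each word is hashed, reduced modulo `n - 1`, and shifted by one so that index 0 stays unused, and it uses MD5 from `hashlib` so that the result is repeatable across runs.

```python
import hashlib


def stable_hash_indexes(words, n):
    """Map each word to an index in [1, n - 1] with an MD5-based hash (illustrative)."""
    indexes = []
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Reduce the hash into the hashing space; the +1 keeps index 0 free.
        indexes.append(int(digest, 16) % (n - 1) + 1)
    return indexes


words = ['an', 'example', 'for', 'keras', 'hashing', 'trick', 'function']
indexes = stable_hash_indexes(words, 10000)
print(indexes)
```

Python's built-in `hash` is salted per process for strings, so indexes built on it can change between sessions. hashing_trick() accepts `hash_function='md5'` when repeatable indexes are needed.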

#### Encoding with one_hot in Keras
It is very common to encode text data to integer data when working with deep learning models.

The __one_hot()__ function in Keras allows us to do that with ease. 

The function takes two mandatory arguments.

The first is the text to encode and the second is the size of the vocabulary.

In [41]:
from keras.preprocessing.text import one_hot

In [42]:
#define the text
text = 'One hot encoding in Keras'

In [43]:
#tokenize the text
tokens = text_to_word_sequence(text)
length = len(tokens)

In [44]:
# store the result under a new name so the one_hot function is not overwritten
encoded = one_hot(text, length)
print(encoded)

[2, 3, 1, 3, 1]


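This function hashes each word with Python's built-in hash function, so the indexes can differ from one Python session to the next. By default it also filters out the characters !"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n, which cover basic punctuation, tabs, and newlines. Despite its name, the function returns a list of integer indexes, not one-hot vectors.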
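
The repeated values in the output above are collisions, and they are unavoidable when the vocabulary size equals the number of distinct words. Assuming indexes are drawn from 1 to n - 1 (index 0 unused), five distinct words with n = 5 leave only four possible values, so at least two words must share one. The sketch below illustrates this with a hypothetical MD5-based helper, `hashed_index`, rather than the Keras function itself.

```python
import hashlib


def hashed_index(word, n):
    """Illustrative word-to-index hash with values in [1, n - 1]."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1


words = ['one', 'hot', 'encoding', 'in', 'keras']

small = [hashed_index(w, len(words)) for w in words]   # only 4 possible indexes
large = [hashed_index(w, 10000) for w in words]        # 9999 possible indexes

# With 5 words and 4 slots, the pigeonhole principle forces a repeat.
print(small, "distinct:", len(set(small)))
print(large, "distinct:", len(set(large)))
```

Choosing a vocabulary size well above the number of distinct words makes collisions much less likely, although a hash-based encoding can never rule them out entirely.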

#### Keras Tokenizer

In [45]:
from keras.preprocessing.text import Tokenizer

In [51]:
# define the text
text = ['You are learning a lot lot', 
        'That is a good thing',
        'This will help you a lot']

In [52]:
# creating tokenizer
tokenizer = Tokenizer()

In [53]:
# fit the tokenizer on the document
tokenizer.fit_on_texts(text)

In [54]:
# print the attributes the tokenizer learned from the documents
print(tokenizer.word_counts,'\n')
print(tokenizer.word_docs,'\n')
print(tokenizer.word_index,'\n')
print(tokenizer.document_count)

OrderedDict([('you', 2), ('are', 1), ('learning', 1), ('a', 3), ('lot', 3), ('that', 1), ('is', 1), ('good', 1), ('thing', 1), ('this', 1), ('will', 1), ('help', 1)]) 

defaultdict(<class 'int'>, {'lot': 2, 'are': 1, 'you': 2, 'a': 3, 'learning': 1, 'is': 1, 'that': 1, 'good': 1, 'thing': 1, 'help': 1, 'this': 1, 'will': 1}) 

{'a': 1, 'lot': 2, 'you': 3, 'are': 4, 'learning': 5, 'that': 6, 'is': 7, 'good': 8, 'thing': 9, 'this': 10, 'will': 11, 'help': 12} 

3
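
The numbering in word_index follows from the counts: the most frequent word gets index 1, and words with equal counts keep the order in which they were first seen. The plain-Python sketch below reproduces the three dictionaries printed above under that assumption. It is not the Tokenizer source, and it splits on whitespace only because the example texts contain no punctuation.

```python
from collections import Counter

texts = ['You are learning a lot lot',
         'That is a good thing',
         'This will help you a lot']

word_counts = Counter()   # total occurrences of each word
word_docs = Counter()     # number of documents that contain each word
for doc in texts:
    tokens = doc.lower().split()
    word_counts.update(tokens)
    word_docs.update(set(tokens))

# Sort by count, highest first; the sort is stable, so ties keep first-seen order.
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

print(dict(word_counts))
print(dict(word_docs))
print(word_index)
```

Note the difference between the first two dictionaries: 'lot' occurs three times in total but appears in only two documents.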


In [58]:
encoded_text = tokenizer.texts_to_matrix(text, mode='count')
print(encoded_text)

[[0. 1. 2. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1.]]
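
Each row of the matrix above is one document and each column is a word index, with column 0 left unused because indexes start at 1. A minimal sketch of the 'count' mode, assuming the word_index printed earlier and a hypothetical helper named `count_matrix`:

```python
word_index = {'a': 1, 'lot': 2, 'you': 3, 'are': 4, 'learning': 5, 'that': 6,
              'is': 7, 'good': 8, 'thing': 9, 'this': 10, 'will': 11, 'help': 12}

texts = ['You are learning a lot lot',
         'That is a good thing',
         'This will help you a lot']


def count_matrix(texts, word_index):
    """Build one row per document holding the count of each word index."""
    width = len(word_index) + 1   # column 0 is reserved, so indexes start at 1
    rows = []
    for doc in texts:
        row = [0.0] * width
        for token in doc.lower().split():
            if token in word_index:
                row[word_index[token]] += 1.0
        rows.append(row)
    return rows


matrix = count_matrix(texts, word_index)
for row in matrix:
    print(row)
```

The 2 in the first row sits in column 2 because 'lot' (index 2) occurs twice in the first document.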


In [19]:
tokenizer.texts_to_sequences(text)

[[3, 4, 5, 1, 2, 2], [6, 7, 1, 8, 9], [10, 11, 12, 3, 1, 2]]

In [20]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [21]:
sentences = [
'Life is so beautiful',
'Hope keeps us going',
'Let us celebrate life!'
]

In [22]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

In [23]:
tokenizer.word_index

{'life': 1,
 'us': 2,
 'is': 3,
 'so': 4,
 'beautiful': 5,
 'hope': 6,
 'keeps': 7,
 'going': 8,
 'let': 9,
 'celebrate': 10}

In [24]:
test_data = [
'Our life is to celebrate',
'Hoping for the best!',
'Let peace prevail everywhere'
]

In [25]:
sequences = tokenizer.texts_to_sequences(sentences)
sequences

[[1, 3, 4, 5], [6, 7, 2, 8], [9, 2, 10, 1]]
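
The sequences above come from the training sentences themselves. For text the tokenizer has not seen, such as test_data, a word missing from word_index has no index to map to. The sketch below assumes unknown words are simply dropped, which is how a Tokenizer created without an oov_token behaves. The helper `encode_known_words` is hypothetical and handles only the '!' that occurs in these examples.

```python
word_index = {'life': 1, 'us': 2, 'is': 3, 'so': 4, 'beautiful': 5,
              'hope': 6, 'keeps': 7, 'going': 8, 'let': 9, 'celebrate': 10}

test_data = ['Our life is to celebrate',
             'Hoping for the best!',
             'Let peace prevail everywhere']


def encode_known_words(text, word_index):
    """Return the indexes of known words only; unknown words are skipped."""
    tokens = text.lower().replace('!', ' ').split()
    return [word_index[t] for t in tokens if t in word_index]


test_sequences = [encode_known_words(t, word_index) for t in test_data]
print(test_sequences)  # [[1, 3, 10], [], [9]]
```

The second sentence encodes to an empty list because none of its words were in the training sentences ('hoping' is not the same token as 'hope'). Passing an oov_token when constructing the Tokenizer reserves an index for such words instead of discarding them.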

In [26]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [28]:
# pad every sequence to length 15; zeros go in front by default (padding='pre')
padded = pad_sequences(sequences, maxlen=15)
print("\nPadded Sequences:")
print(padded)


Padded Sequences:
[[ 0  0  0  0  0  0  0  0  0  0  0  1  3  4  5]
 [ 0  0  0  0  0  0  0  0  0  0  0  6  7  2  8]
 [ 0  0  0  0  0  0  0  0  0  0  0  9  2 10  1]]
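
The zeros appear on the left because padding is applied at the front of each sequence by default. A plain-Python sketch of that default behaviour follows. The helper `pad_pre` is hypothetical, assumes a positive `maxlen`, and also drops items from the front when a sequence is longer than `maxlen`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)   # fill the front with value
    return padded


sequences = [[1, 3, 4, 5], [6, 7, 2, 8], [9, 2, 10, 1]]
padded = pad_pre(sequences, maxlen=15)
truncated = pad_pre(sequences, maxlen=2)
for row in padded:
    print(row)
print(truncated)
```

pad_sequences() exposes `padding='post'` and `truncating='post'` to do the same work at the end of each sequence instead.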
