<a href="https://colab.research.google.com/github/iamlekh/NLP/blob/master/prepare_text_data_with_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Split Words with text_to_word_sequence**

In [1]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
# define the document
text = ' The quick brown fox jumped over the lazy dog. '
# tokenize the document
result = text_to_word_sequence(text)
print(result)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
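
To make the three steps behind this output explicit (lowercase, filter punctuation, split on spaces), here is a minimal pure-Python sketch. `simple_word_sequence` is a hypothetical helper, not part of Keras, and `FILTERS` is assumed to mirror the default `filters` argument of `text_to_word_sequence`.

```python
# Characters treated as separators; assumed to match the Keras default filters.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=' '):
    """Hypothetical stand-in that imitates text_to_word_sequence."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character.
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Repeated separators produce empty strings, so drop them.
    return [tok for tok in text.split(split) if tok]

tokens = simple_word_sequence(' The quick brown fox jumped over the lazy dog. ')
print(tokens)
# ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
```

Note that the apostrophe is not in the filter string, so under this assumption a word such as "don't" would stay a single token.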


## **Encoding with one_hot**

In [2]:
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document (despite its name, one_hot hashes each word to an integer)
result = one_hot(text, round(vocab_size*1.3))
print(result)

8
[7, 2, 3, 3, 5, 5, 7, 4, 5]
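
The encoded output above contains repeated integers for different words, which is what a hash-based encoding produces when the hash space is small. The sketch below only inspects the printed values; the token list is assumed to be what `text_to_word_sequence` returns for the same sentence. Because `one_hot` relies on Python's built-in `hash` by default, the exact integers may differ from run to run.

```python
from collections import defaultdict

# Tokens of the sentence and the codes copied from the cell output above.
tokens = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
codes = [7, 2, 3, 3, 5, 5, 7, 4, 5]

# Group the distinct words that were assigned to each integer.
words_by_code = defaultdict(set)
for token, code in zip(tokens, codes):
    words_by_code[code].add(token)

# Keep only the integers shared by more than one distinct word.
collisions = {code: sorted(words)
              for code, words in words_by_code.items()
              if len(words) > 1}
print(collisions)
# {3: ['brown', 'fox'], 5: ['dog', 'jumped', 'over']}
```

Five of the eight unique words collide in this run, so a larger hash space than `vocab_size*1.3` may be worth trying when collisions matter.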


## **Hash Encoding with hashing_trick**

In [3]:
from tensorflow.keras.preprocessing.text import hashing_trick
from tensorflow.keras.preprocessing.text import text_to_word_sequence
# define the document
text = ' The quick brown fox jumped over the lazy dog. '
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

8
[6, 4, 1, 2, 7, 5, 6, 2, 6]
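
Unlike the default hash, MD5 gives the same integers on every run. The following is a minimal sketch of the idea using only `hashlib`. It assumes Keras converts the MD5 hex digest to an integer and reduces it modulo `n - 1` before adding 1, which keeps index 0 reserved. `md5_index` is a hypothetical helper, not a Keras function.

```python
import hashlib

def md5_index(word, n):
    """Hypothetical helper: map a word to a stable integer in [1, n - 1]."""
    digest = int(hashlib.md5(word.encode('utf-8')).hexdigest(), 16)
    return digest % (n - 1) + 1

tokens = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
n = round(len(set(tokens)) * 1.3)  # 8 unique words give a hash space of 10

encoded = [md5_index(tok, n) for tok in tokens]
print(encoded)
```

Both occurrences of "the" receive the same integer, and every value falls between 1 and 9, as in the output above.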


## **Tokenizer API**

Once fit, the Tokenizer provides four attributes that you can use to query what has been
learned about your documents:

1. *word_counts*: A dictionary of words and their counts.

2. *word_docs*: A dictionary of words and how many documents each appeared in.

3. *word_index*: A dictionary of words and their uniquely assigned integers.

4. *document_count*: An integer count of the total number of documents that were used to fit the Tokenizer.
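
To show what each attribute holds, here is a rough pure-Python sketch that rebuilds them with `collections.Counter` for the five example documents used in this section. The regular-expression tokenization is a simplification, and the variables merely share names with the Tokenizer attributes; this is not how Keras implements them.

```python
import re
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work work', 'Excellent!']

# Simplified tokenization: lowercase, then keep runs of letters.
tokenized = [re.findall(r'[a-z]+', doc.lower()) for doc in docs]

# word_counts: total occurrences of each word across all documents.
word_counts = Counter(tok for doc in tokenized for tok in doc)

# word_docs: number of documents in which each word appears at least once.
word_docs = Counter(tok for doc in tokenized for tok in set(doc))

# document_count: number of documents seen during fitting.
document_count = len(docs)

# word_index: most frequent word gets 1; ties keep first-seen order
# because Python's sort is stable.
ranked = sorted(word_counts, key=word_counts.get, reverse=True)
word_index = {word: i for i, word in enumerate(ranked, start=1)}

print(word_counts['work'], word_docs['work'], document_count)  # 3 2 5
print(word_index)
```

The word "work" occurs three times but in only two documents, which is the difference between `word_counts` and `word_docs`.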


---

The `texts_to_matrix()` function turns each document into a fixed-length vector. Its `mode` argument selects how each word is scored:

*  binary: Whether or not each word is present in the document. This is the default.
*  count: The count of each word in the document.
*  tfidf: The Term Frequency-Inverse Document Frequency (TF-IDF) score for each word in the document.
*  freq: The frequency of each word as a ratio of words within each document.
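
The four scoring schemes can be worked through by hand for one document. The sketch below scores the words of "nice work work" using the document statistics from the five-document example in this section. The TF-IDF formula, `(1 + ln(count)) * ln(1 + document_count / (1 + word_docs))`, is an assumption about the variant Keras uses; it does reproduce the 1.25276297 and 0.98082925 values in the printed `tfidf` matrix. `score` is a hypothetical helper.

```python
import math

# Statistics taken from the five-document example in this section.
document_count = 5
word_docs = {'work': 2, 'nice': 1}  # documents containing each word
doc = ['nice', 'work', 'work']      # tokens of the document being scored

def score(word, mode):
    """Hypothetical helper: score a word that occurs in `doc`."""
    count = doc.count(word)
    if mode == 'binary':
        return 1.0
    if mode == 'count':
        return float(count)
    if mode == 'freq':
        return count / len(doc)
    if mode == 'tfidf':
        tf = 1 + math.log(count)
        idf = math.log(1 + document_count / (1 + word_docs[word]))
        return tf * idf
    raise ValueError(f'unknown mode: {mode}')

for mode in ('binary', 'count', 'freq', 'tfidf'):
    print(mode, {word: round(score(word, mode), 4) for word in ('nice', 'work')})
```

Words that do not occur in a document simply score 0 in every mode, which is why most matrix entries are zero.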






In [4]:
from tensorflow.keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work work',
        'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print("word_counts ---> ",t.word_counts,'\n')
print("document_count --->",t.document_count,'\n')
print("word_index --->",t.word_index,'\n')
print("word_docs --->",t.word_docs,'\n')

# encode the documents with each scoring mode
encoded_docs = t.texts_to_matrix(docs, mode='count')
print("count ---> ",encoded_docs,"\n")
encoded_docs = t.texts_to_matrix(docs, mode='binary')
print("binary ---> ",encoded_docs,"\n")
encoded_docs = t.texts_to_matrix(docs, mode='tfidf')
print("tfidf ---> ",encoded_docs,"\n")
encoded_docs = t.texts_to_matrix(docs, mode='freq')
print("frequency ---> ",encoded_docs,"\n")

word_counts --->  OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 3), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)]) 

document_count ---> 5 

word_index ---> {'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8} 

word_docs ---> defaultdict(<class 'int'>, {'done': 1, 'well': 1, 'good': 1, 'work': 2, 'great': 1, 'effort': 1, 'nice': 1, 'excellent': 1}) 

count --->  [[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 2. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]] 

binary --->  [[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]] 

tfidf --->  [[0.         0.         1.25276297 1.25276297 0.         0.
  0.         0.         0.        ]
 [0.         0.98082925 0.         0.         1.25276297 0.
  0.         0.         0.        ]
 [0.         0.         0.         0. 