<a href="https://colab.research.google.com/github/lambhua/Deep-Learning--tensorflow-NLP/blob/main/Tokenizer_%2CTextVectorizer_and_Embedding_in_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **One of the most important steps in deep learning for NLP tasks is vectorizing and embedding text. This notebook walks through these steps as simply as possible.**

In [1]:
import nltk
import numpy
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [24]:
tokens=["My name anoop","I am 25 years old"] # Example strings to preprocess


In [4]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [25]:
processed=[]
for i in tokens:
  d=nltk.word_tokenize(i)
  processed.append(d)


In [26]:
processed

[['My', 'name', 'anoop'], ['I', 'am', '25', 'years', 'old']]

**Using the Keras Tokenizer to vectorize text**

Tokenizer lowercases text by default, which is why the keys of word_index are lowercase.

In [27]:
tokenizer=Tokenizer(num_words=20)
tokenizer.fit_on_texts(processed)

In [28]:
word_index=tokenizer.word_index
word_index

{'my': 1,
 'name': 2,
 'anoop': 3,
 'i': 4,
 'am': 5,
 '25': 6,
 'years': 7,
 'old': 8}

In [42]:
sequences=tokenizer.texts_to_sequences(processed)
sequences

[[1, 2, 3], [4, 5, 6, 7, 8]]

In [43]:
# Pad each sequence with trailing zeros so that all sequences have length 6
padded=pad_sequences(sequences,maxlen=6,padding='post')
padded  # vectorized text sequences

array([[1, 2, 3, 0, 0, 0],
       [4, 5, 6, 7, 8, 0]], dtype=int32)
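To make the mechanics of the cells above visible, here is a rough pure-Python sketch of word indexing and post-padding. It is not the Keras implementation; the helper names (`build_word_index`, `to_sequences`, `pad_post`) are made up for this illustration, and it assumes no OOV token is configured.

```python
from collections import Counter

def build_word_index(token_lists):
    """Assign ids starting at 1, most frequent word first (ties keep first-seen order)."""
    counts = Counter(tok.lower() for toks in token_lists for tok in toks)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(token_lists, word_index):
    """Replace each known token by its id; unknown tokens are dropped."""
    return [[word_index[t.lower()] for t in toks if t.lower() in word_index]
            for toks in token_lists]

def pad_post(sequences, maxlen, value=0):
    """Keep at most the last maxlen ids of each sequence, then pad at the end."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            seq = seq[-maxlen:]  # drop ids from the front when too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

processed_demo = [['My', 'name', 'anoop'], ['I', 'am', '25', 'years', 'old']]
word_index_demo = build_word_index(processed_demo)
sequences_demo = to_sequences(processed_demo, word_index_demo)
padded_demo = pad_post(sequences_demo, maxlen=6)
print(padded_demo)  # [[1, 2, 3, 0, 0, 0], [4, 5, 6, 7, 8, 0]]
```

Id 0 is never assigned to a word, so it can safely act as the padding value.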

**Using TextVectorization and Embedding layers to vectorize text**

In [31]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import TextVectorization,Embedding,Dropout,Conv1D,GlobalAveragePooling1D

In [37]:
text=["My name anoop","I am 25 years old"]  # define the input strings before they are used
vocab=[]
for i in text:
  d=nltk.word_tokenize(i)
  vocab=vocab+d  # flat list of all tokens, used as the vocabulary


In [39]:
input_dim=len(vocab)
input_dim

8

In [38]:
vocab

['My', 'name', 'anoop', 'I', 'am', '25', 'years', 'old']

In [41]:
text=tf.convert_to_tensor(text)
text

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'My name anoop', b'I am 25 years old'], dtype=object)>

In [49]:
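model=Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
# Map each string to a sequence of integer token ids.
# Note: standardize lowercases the input, but vocab still holds 'My' and 'I',
# so those two words do not match and are mapped to the OOV index 1.
model.add(TextVectorization(max_tokens=1000, standardize='lower_and_strip_punctuation', split='whitespace', output_mode='int', output_sequence_length=None, pad_to_max_tokens=False, vocabulary=vocab))
# input_length is omitted: the sequence length is variable here (output_sequence_length=None)
model.add(Embedding(input_dim=30, output_dim=3))
model.add(Dropout(.2))
model.add(Conv1D(64,3))  # 64 filters, kernel size 3
model.add(GlobalAveragePooling1D())  # average over the sequence axis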


In [50]:
model.compile('adam','mse')

In [51]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, None)              0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, None, 3)           90        
                                                                 
 dropout (Dropout)           (None, None, 3)           0         
                                                                 
 conv1d (Conv1D)             (None, None, 64)          640       
                                                                 
 global_average_pooling1d (  (None, 64)                0         
 GlobalAveragePooling1D)                                         
                                                                 
Total params: 730 (2.85 KB)
Trainable params: 730 (2.8
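The parameter counts in the summary can be checked by hand. The sketch below assumes the Conv1D defaults (stride 1, 'valid' padding, a bias term); the helper names are made up for this illustration.

```python
def embedding_params(input_dim, output_dim):
    """One output_dim-sized vector per token id."""
    return input_dim * output_dim

def conv1d_params(in_channels, filters, kernel_size):
    """kernel_size * in_channels weights per filter, plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def conv1d_valid_length(seq_len, kernel_size):
    """Output length with stride 1 and 'valid' padding."""
    return seq_len - kernel_size + 1

emb = embedding_params(30, 3)
conv = conv1d_params(3, 64, 3)
total = emb + conv
print(emb, conv, total)           # 90 640 730
print(conv1d_valid_length(5, 3))  # 3
```

With a kernel size of 3, a batch whose longest sentence has fewer than 3 tokens would leave the convolution no valid positions, so this model expects at least 3 tokens per batch.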

In [52]:
pred=model.predict(text)



In [53]:
pred

array([[ 4.62712394e-03,  7.26274587e-03, -5.56086237e-03,
        -2.90694484e-03,  4.34175134e-03, -2.09856778e-03,
         9.47237201e-03,  2.46382365e-03, -1.04862163e-02,
        -3.59398103e-03,  4.56065638e-03, -1.16134351e-02,
         7.96197346e-05, -2.20770016e-03,  7.99816381e-03,
         7.44300010e-03, -3.16967559e-03, -5.79420291e-03,
         2.41349146e-04, -7.82136992e-03,  2.42950511e-03,
        -4.23364341e-03, -5.74928196e-03, -4.77139838e-03,
        -4.17674240e-03,  1.13888802e-02,  4.16937895e-04,
        -6.23484328e-03, -2.22637341e-03, -8.30651540e-03,
        -2.51025683e-03,  7.48977438e-03,  3.38881812e-03,
         3.91091919e-03,  1.84474885e-03, -2.18895730e-03,
        -8.38974491e-03, -6.15619868e-03, -4.26335866e-03,
         5.42076258e-03,  6.04793802e-03,  1.02310407e-03,
         8.48336052e-03,  1.28133222e-02, -7.58195668e-03,
         4.35629860e-04, -4.69516171e-03, -3.71954549e-04,
         3.08300019e-03, -3.52252997e-03,  5.07271243e-0

In [54]:
pred.shape

(2, 64)

**Saving the model**

In [55]:
model.save('anoop.keras')

**Loading the saved model**

In [57]:
model1=tf.keras.models.load_model('anoop.keras')



In [None]:
model1.predict(text)



array([[-7.17547082e-04, -2.11744453e-03, -2.55411584e-03,
         2.38280930e-03, -5.57656819e-03,  2.96368566e-03,
        -5.80826018e-04,  2.93328636e-03, -2.89438316e-03,
         4.98660607e-03, -8.75625468e-04,  3.40516237e-03,
        -1.47768098e-03,  6.45941542e-03,  9.44783620e-04,
         3.40556115e-04, -1.58469379e-03,  4.08577081e-03,
        -2.44306237e-03,  5.22705400e-03, -5.21561829e-03,
        -6.03056338e-04, -1.47436594e-03, -2.13419087e-04,
        -2.77724955e-03,  4.40655043e-04, -5.82949317e-04,
        -3.03099491e-03,  5.21902752e-04,  3.81007348e-03,
         2.64187600e-03,  5.40415524e-03, -7.78996060e-03,
         1.74167028e-04, -5.03765000e-03, -3.26164183e-03,
         2.65483023e-03, -3.02560441e-03,  3.41647002e-03,
         1.13724626e-03,  3.47029418e-03, -5.16389904e-04,
        -3.81181086e-03,  3.81590944e-04, -2.04070355e-03,
        -1.35912455e-03,  3.00821383e-03,  2.26932322e-03,
        -3.63340043e-03,  1.79494964e-03,  1.33584661e-0

Pre-trained text embeddings from TFHub can also be used. Besides google/nnlm-en-dim50/2, which is used below, there are many others:

- google/nnlm-en-dim128/2: trained with the same NNLM architecture on the same data as google/nnlm-en-dim50/2, but with a larger embedding dimension. Larger embeddings can improve results on your task, but the model may take longer to train.
- google/nnlm-en-dim128-with-normalization/2: the same as google/nnlm-en-dim128/2, but with additional text normalization such as removing punctuation. This can help if the text in your task contains additional characters or punctuation.
- google/universal-sentence-encoder/4: a much larger model yielding 512-dimensional embeddings, trained with a deep averaging network (DAN) encoder.

And many more! Find more text embedding models on TFHub.

Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that no matter the length of the input text, the output shape of the embeddings is: (num_examples, embedding_dimension).

With pre-trained embedding layers like these, you don't have to preprocess the text yourself.
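The fixed output shape (num_examples, embedding_dimension) can be illustrated with a toy sketch. Everything here is hypothetical: the word vectors are derived from character codes, and plain averaging stands in for whatever combination the real TFHub model applies to its learned vectors.

```python
EMBEDDING_DIM = 4  # hypothetical size; the TFHub model used in this notebook outputs 50

def toy_word_vector(word, dim=EMBEDDING_DIM):
    """Deterministic stand-in for a learned word vector."""
    seed = sum(ord(c) for c in word)
    return [((seed * (k + 1)) % 7) / 7.0 for k in range(dim)]

def toy_sentence_embedding(sentence, dim=EMBEDDING_DIM):
    """Average the word vectors, so the result has length dim for any sentence."""
    words = sentence.lower().split()
    if not words:
        return [0.0] * dim
    vectors = [toy_word_vector(w, dim) for w in words]
    return [sum(col) / len(words) for col in zip(*vectors)]

sentences = ["My name anoop", "I am 25 years old"]
embeddings = [toy_sentence_embedding(s) for s in sentences]
print(len(embeddings), len(embeddings[0]), len(embeddings[1]))  # 2 4 4
```

Both sentences end up as vectors of the same length even though one has three words and the other five.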


In [None]:
import tensorflow_hub as hub
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)

In [None]:
model2=Sequential()
model2.add(hub_layer)


In [None]:
model2.predict(text)



array([[ 0.16158575,  0.03487716,  0.04701673,  0.26000825,  0.11360174,
        -0.05174125, -0.02798709, -0.3440884 , -0.0166243 , -0.17252597,
         0.2915599 ,  0.0504854 ,  0.16284578, -0.20347922,  0.03978457,
        -0.02678947,  0.06873696, -0.05014047, -0.14492953, -0.2781547 ,
        -0.29362342, -0.08327091,  0.08857469, -0.09184322, -0.22503239,
        -0.03789838, -0.00526696, -0.3014082 ,  0.17550017, -0.09049083,
        -0.00784622,  0.10334724,  0.03452793,  0.041433  , -0.03380363,
         0.04795556,  0.16658159,  0.01602894, -0.17990495, -0.03481016,
         0.23810524,  0.08191874, -0.14780585, -0.074142  ,  0.04400147,
         0.10603544, -0.24475896, -0.09352271,  0.10226095,  0.13133794],
       [ 0.15819348, -0.10955666, -0.17832954, -0.07462916,  0.21725678,
        -0.11359747,  0.25439048,  0.00842642, -0.13804607,  0.1250199 ,
        -0.00525587,  0.15726002,  0.00819244,  0.11809565, -0.00574656,
        -0.131393  , -0.11280521, -0.08900156, -0.

In [47]:
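# NOTE: model5 is not defined anywhere in this notebook. The output below is
# consistent with a model containing only the TextVectorization layer above.
model5.predict(text)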



array([[1, 3, 4, 0, 0],
       [1, 6, 7, 8, 9]])
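The 1s at the start of each row are the out-of-vocabulary index: the input is lowercased before lookup, but the vocabulary still contains 'My' and 'I'. The following pure-Python sketch mimics that lookup to show the effect of lowercasing the vocabulary. It is an approximation, not the Keras implementation; `build_lookup` and `vectorize` are made-up helper names, and `string.punctuation` only approximates the characters Keras strips.

```python
import string

def build_lookup(vocabulary):
    """Index 0 is reserved for padding and 1 for OOV, so words start at 2."""
    return {word: i for i, word in enumerate(vocabulary, start=2)}

def vectorize(texts, vocabulary):
    """Lowercase, strip punctuation, split on whitespace, look up ids, pad with 0."""
    lookup = build_lookup(vocabulary)
    strip = str.maketrans('', '', string.punctuation)
    rows = [[lookup.get(w, 1) for w in t.lower().translate(strip).split()]
            for t in texts]
    width = max(len(r) for r in rows)
    return [r + [0] * (width - len(r)) for r in rows]

texts = ["My name anoop", "I am 25 years old"]
vocab_as_is = ['My', 'name', 'anoop', 'I', 'am', '25', 'years', 'old']
vocab_lower = [w.lower() for w in vocab_as_is]

print(vectorize(texts, vocab_as_is))  # [[1, 3, 4, 0, 0], [1, 6, 7, 8, 9]]
print(vectorize(texts, vocab_lower))  # [[2, 3, 4, 0, 0], [5, 6, 7, 8, 9]]
```

Lowercasing the vocabulary before passing it to the layer gives every word its own id instead of collapsing 'My' and 'I' into the OOV bucket.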