<a href="https://colab.research.google.com/github/kameshcodes/tensorflow-codes/blob/main/8_tensorflow_textvectorization_layer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

$$\textbf{TextVectorization layer in Keras}$$

---

# TextVectorization layer

---

## 1. Import Libraries

In [23]:
import os
import tensorflow as tf
import matplotlib.pyplot as plt

## 2. Example

In [24]:
sentences = [
    "I love my cats",
    "I love my cow",
    "You are my best cow"
]

In [25]:
# initialize the TextVectorization layer
vectorizer = tf.keras.layers.TextVectorization(output_mode='int')

In [26]:
# adapt and build vocab
vectorizer.adapt(sentences)

In [27]:
vectorizer.get_vocabulary()

['', '[UNK]', 'my', 'love', 'i', 'cow', 'you', 'cats', 'best', 'are']

In [28]:
for idx, element in enumerate(vectorizer.get_vocabulary()):
  print(idx, element)

0 
1 [UNK]
2 my
3 love
4 i
5 cow
6 you
7 cats
8 best
9 are
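The vocabulary order above can be reproduced approximately in plain Python. This is a rough sketch of what `adapt` appears to do, not the library's implementation: it assumes the default standardization (lowercasing and stripping punctuation, approximated here with a simple regex), whitespace splitting, and ordering by descending frequency. The reverse-alphabetical tie-break is inferred from the output shown above rather than from documentation, and `build_vocab` is a hypothetical helper name.

```python
import re
from collections import Counter

sentences = [
    "I love my cats",
    "I love my cow",
    "You are my best cow",
]

def build_vocab(texts):
    """Hypothetical helper: approximate the vocabulary learned by adapt()."""
    counts = Counter()
    for text in texts:
        # approximate the default standardization: lowercase, drop punctuation
        cleaned = re.sub(r"[^\w\s]", "", text.lower())
        counts.update(cleaned.split())
    # reverse-alphabetical order first (assumed tie-break), then a stable
    # sort by descending frequency keeps that order among equal counts
    tokens = sorted(sorted(counts, reverse=True), key=lambda t: -counts[t])
    # index 0 is reserved for padding, index 1 for out-of-vocabulary words
    return ["", "[UNK]"] + tokens

print(build_vocab(sentences))
```

For the three sentences above, this yields the same list as `vectorizer.get_vocabulary()`.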


In [29]:
vectorizer(sentences)

<tf.Tensor: shape=(3, 5), dtype=int64, numpy=
array([[4, 3, 2, 7, 0],
       [4, 3, 2, 5, 0],
       [6, 9, 2, 8, 5]])>

To vectorize a string, pass it to the layer once the layer has been adapted; the layer returns the integer sequence as a `tf.Tensor`. Words that are not in the learned vocabulary map to index `1` (`[UNK]`).

In [30]:
sample_sentence = 'cats best companions'

# convert string into vector
vectorizer(sample_sentence)

<tf.Tensor: shape=(3,), dtype=int64, numpy=array([7, 8, 1])>

In [31]:
vectorizer(sample_sentence).numpy()

array([7, 8, 1])
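To check what each index stands for, the integer sequence can be mapped back through the vocabulary list. The snippet below is a small sketch that rebuilds the same layer so it runs on its own; the names `vocab`, `ids`, and `decoded` are chosen for illustration only.

```python
import tensorflow as tf

sentences = [
    "I love my cats",
    "I love my cow",
    "You are my best cow",
]

vectorizer = tf.keras.layers.TextVectorization(output_mode="int")
vectorizer.adapt(sentences)

# position i in the vocabulary list is the token for index i
vocab = vectorizer.get_vocabulary()
ids = vectorizer("cats best companions").numpy()

# look each index up in the vocabulary; unseen words come back as '[UNK]'
decoded = [vocab[i] for i in ids]
print(decoded)
```

Since `companions` was never seen during `adapt`, it decodes to `[UNK]` rather than to the original word, so this round trip is lossy.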

## 3. How to Vectorize Sentences

### 3.1 Using the vectorizer object

In [32]:
sentences_2 = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
    ]

vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int')
vectorize_layer.adapt(sentences_2)

for idx, word in enumerate(vectorize_layer.get_vocabulary()):
  print(idx, word)

0 
1 [UNK]
2 my
3 love
4 dog
5 you
6 i
7 think
8 is
9 do
10 cat
11 amazing


In [33]:
# vectorize with the layer that was adapted on sentences_2
vectorize_layer(sentences_2).numpy()

array([[ 6,  3,  2,  4,  0,  0,  0],
       [ 6,  3,  2, 10,  0,  0,  0],
       [ 5,  3,  2,  4,  0,  0,  0],
       [ 9,  5,  7,  2,  4,  8, 11]])

In [34]:
sample_input = 'I love my dog'

vector = vectorize_layer(sample_input)
print(vector)

tf.Tensor([6 3 2 4], shape=(4,), dtype=int64)


### 3.2 Using `map`

In [35]:
tf.data.Dataset.from_tensor_slices(sentences_2)

<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

In [36]:
# convert the sentences to tf.data.Dataset

sentences_dataset = tf.data.Dataset.from_tensor_slices(sentences_2)
sequences = sentences_dataset.map(vectorize_layer)
sequences

<_MapDataset element_spec=TensorSpec(shape=(None,), dtype=tf.int64, name=None)>

In [37]:
for sentence, sequence in zip(sentences_2, sequences):
  print(f'{sentence} ---> {sequence}')

I love my dog ---> [6 3 2 4]
I love my cat ---> [ 6  3  2 10]
You love my dog! ---> [5 3 2 4]
Do you think my dog is amazing? ---> [ 9  5  7  2  4  8 11]


- Note that the vectors have varying lengths when the layer is applied with `map`, because each sentence is vectorized on its own.
- To get vectors of uniform length, either pad or truncate them.
- Padding is more common because it preserves every token.
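One way to pad while staying inside the `tf.data` pipeline is `Dataset.padded_batch`, which pads each batch to the length of its longest element. The following is a minimal sketch, assuming a batch size of 2 (an arbitrary choice for illustration); the names `dataset` and `batches` are not from the original notebook.

```python
import tensorflow as tf

sentences_2 = [
    "I love my dog",
    "I love my cat",
    "You love my dog!",
    "Do you think my dog is amazing?",
]

vectorize_layer = tf.keras.layers.TextVectorization(output_mode="int")
vectorize_layer.adapt(sentences_2)

# each element is a variable-length integer sequence
dataset = tf.data.Dataset.from_tensor_slices(sentences_2).map(vectorize_layer)

# pad every batch of 2 sequences to the longest sequence in that batch
batches = [batch.numpy() for batch in dataset.padded_batch(2)]
for batch in batches:
    print(batch.shape)
```

Note that the padding is per batch: the first batch keeps length 4, while the second is padded to length 7.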


### 3.3 Padding
- `0` is the special token reserved for padding.
- When the `vectorize_layer` object is called on a whole list of sentences, its output is already post-padded with `0`.

In [38]:
# Apply the layer to the string input list
sequences_post = vectorize_layer(sentences_2)

# Print the results
print('INPUT:')
print(sentences_2)
print()

print('OUTPUT:')
print(sequences_post)

INPUT:
['I love my dog', 'I love my cat', 'You love my dog!', 'Do you think my dog is amazing?']

OUTPUT:
tf.Tensor(
[[ 6  3  2  4  0  0  0]
 [ 6  3  2 10  0  0  0]
 [ 5  3  2  4  0  0  0]
 [ 9  5  7  2  4  8 11]], shape=(4, 7), dtype=int64)


- Note that vectors are post padded with `0` by the `vectorize_layer` object.
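When a fixed width is needed regardless of the longest sentence, the layer's `output_sequence_length` argument can pad or truncate every row to that length. A minimal sketch, assuming a target length of 5 (an arbitrary value picked for illustration) and a hypothetical layer name `fixed_layer`:

```python
import tensorflow as tf

sentences_2 = [
    "I love my dog",
    "I love my cat",
    "You love my dog!",
    "Do you think my dog is amazing?",
]

# every output row is padded or truncated to exactly 5 tokens
fixed_layer = tf.keras.layers.TextVectorization(
    output_mode="int",
    output_sequence_length=5,
)
fixed_layer.adapt(sentences_2)

fixed = fixed_layer(sentences_2).numpy()
print(fixed.shape)
print(fixed)
```

Short sentences gain trailing zeros, while the seven-word sentence loses its last two tokens, so this trades information for a predictable shape.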

#### Padding using `keras.utils`

In [39]:
sentences_dataset = tf.data.Dataset.from_tensor_slices(sentences_2)
sequences = sentences_dataset.map(vectorize_layer)

for sentence, sequence in zip(sentences_2, sequences):
  print(f'{sentence} ---> {sequence}')

I love my dog ---> [6 3 2 4]
I love my cat ---> [ 6  3  2 10]
You love my dog! ---> [5 3 2 4]
Do you think my dog is amazing? ---> [ 9  5  7  2  4  8 11]


In [40]:
sequences_pre = tf.keras.utils.pad_sequences(sequences, padding='pre')

sequences_pre

array([[ 0,  0,  0,  6,  3,  2,  4],
       [ 0,  0,  0,  6,  3,  2, 10],
       [ 0,  0,  0,  5,  3,  2,  4],
       [ 9,  5,  7,  2,  4,  8, 11]], dtype=int32)

In [41]:
sequences_post = tf.keras.utils.pad_sequences(sequences, padding='post')

sequences_post

array([[ 6,  3,  2,  4,  0,  0,  0],
       [ 6,  3,  2, 10,  0,  0,  0],
       [ 5,  3,  2,  4,  0,  0,  0],
       [ 9,  5,  7,  2,  4,  8, 11]], dtype=int32)

In [42]:
# Print the results
print('INPUT:')
for sequence in sequences:
  print(sequence.numpy())
print()

print('OUTPUT:')
print(sequences_post)

INPUT:
[6 3 2 4]
[ 6  3  2 10]
[5 3 2 4]
[ 9  5  7  2  4  8 11]

OUTPUT:
[[ 6  3  2  4  0  0  0]
 [ 6  3  2 10  0  0  0]
 [ 5  3  2  4  0  0  0]
 [ 9  5  7  2  4  8 11]]
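The calls above pad to the longest sequence. `pad_sequences` also accepts `maxlen` and `truncating`, which together control the final width and which end is cut. The sketch below uses plain Python lists copied from the output above, with `maxlen=5` as an arbitrary illustrative value.

```python
import tensorflow as tf

# integer sequences copied from the vectorized sentences above
sequences = [
    [6, 3, 2, 4],
    [6, 3, 2, 10],
    [5, 3, 2, 4],
    [9, 5, 7, 2, 4, 8, 11],
]

# pad short rows at the end and cut long rows at the end
cut_post = tf.keras.utils.pad_sequences(
    sequences, maxlen=5, padding="post", truncating="post"
)

# same padding, but long rows lose tokens from the front (the default)
cut_pre = tf.keras.utils.pad_sequences(
    sequences, maxlen=5, padding="post", truncating="pre"
)

print(cut_post)
print(cut_pre)
```

With `truncating="post"` the long sentence keeps its first five tokens; with `truncating="pre"` it keeps its last five.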


#### Using `ragged=True`

In [43]:
vectorize_layer = tf.keras.layers.TextVectorization(ragged=True)
vectorize_layer.adapt(sentences_2)
ragged_sequences = vectorize_layer(sentences_2)
print(ragged_sequences)

<tf.RaggedTensor [[6, 3, 2, 4], [6, 3, 2, 10], [5, 3, 2, 4], [9, 5, 7, 2, 4, 8, 11]]>


In [44]:
# pad_sequences defaults to padding='pre', so the zeros are added at the front
sequences_pre = tf.keras.utils.pad_sequences(ragged_sequences.numpy())
print(sequences_pre)

[[ 0  0  0  6  3  2  4]
 [ 0  0  0  6  3  2 10]
 [ 0  0  0  5  3  2  4]
 [ 9  5  7  2  4  8 11]]
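A ragged tensor can also be densified without leaving TensorFlow by calling `RaggedTensor.to_tensor()`, which fills the missing positions with a default value (zero unless specified) at the end of each row. The sketch below builds the ragged tensor directly from the values shown above with `tf.ragged.constant`, so it runs without re-adapting the layer; the names `ragged` and `dense` are illustrative.

```python
import tensorflow as tf

# same values as the ragged output above, built by hand for a standalone example
ragged = tf.ragged.constant([
    [6, 3, 2, 4],
    [6, 3, 2, 10],
    [5, 3, 2, 4],
    [9, 5, 7, 2, 4, 8, 11],
])

# fill the missing positions with zeros to get a regular (4, 7) tensor
dense = ragged.to_tensor()
print(dense)
```

Unlike `pad_sequences`, this only pads on the right, so it is not a substitute when pre-padding is required.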
