Author: Kaveh Mahdavi <kavehmahdavi74@yahoo.com>
License: BSD 3 clause
last update: 28/12/2022

# Representing text as Tensors

I explore different neural network architectures for dealing with natural language text by using:
* bag-of-words
* embeddings
* recurrent neural network

In [8]:
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds

In [9]:
# To use GPU memory cautiously, I set the TensorFlow option that grows GPU memory allocation only when needed.
physical_devices = tf.config.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

## Represent text

To solve Natural Language Processing (NLP) tasks with artificial neural networks (ANNs), I need some way to represent text as tensors.

* **Character-level representation:** I represent text by treating each character as a number. Given C different characters in the text corpus, the word *Hello* can be represented by a tensor of shape C×5, where each letter corresponds to one column in one-hot encoding.
* **Word-level representation:** I create a vocabulary of all words in our text, and then represent words using one-hot encoding. This approach is better than character-level representation because each letter by itself does not have much meaning. By using higher-level semantic concepts - words - we simplify the task for the neural network. However, given a large dictionary size, we need to deal with high-dimensional sparse tensors.
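
To make the two representations concrete, here is a minimal plain-Python sketch that does not depend on TensorFlow. The toy corpus and the helper `one_hot` are my own illustrative assumptions, not part of the notebook. The encoded word is stored as 5 rows of length C, which is the transpose of the C×5 layout described above.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

# Hypothetical toy corpus; the real character set would come from the full dataset.
corpus = "Hello world"

# Character-level: C is the number of distinct characters in the corpus.
charset = sorted(set(corpus))
char_to_index = {ch: i for i, ch in enumerate(charset)}
hello = [one_hot(char_to_index[ch], len(charset)) for ch in "Hello"]
print(len(hello), "vectors of length", len(charset))

# Word-level: the vocabulary is the set of distinct words instead.
vocab = sorted(set(corpus.split()))
word_to_index = {w: i for i, w in enumerate(vocab)}
hello_word = one_hot(word_to_index["Hello"], len(vocab))
print(hello_word)
```

With a realistic vocabulary of tens of thousands of words, each word-level vector would be that long and contain a single 1, which is the sparsity problem mentioned above.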

### Load Dataset

In [10]:
dataset = tfds.load('ag_news_subset')

In [11]:
classes = ['World', 'Sports', 'Business', 'Sci/Tech']
ds_train = dataset['train']
ds_test = dataset['test']

print("Size of train dataset: {}".format(len(ds_train)))
print("Size of test dataset:  {}".format(len(ds_test)))

Size of train dataset: 120000
Size of test dataset:  7600


In [12]:
for i, x in zip(range(3), ds_train):
    print(f"{x['label']} ({classes[x['label']]}) -> {x['title']} {x['description']}")

3 (Sci/Tech) -> b'AMD Debuts Dual-Core Opteron Processor' b'AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.'
1 (Sports) -> b"Wood's Suspension Upheld (Reuters)" b'Reuters - Major League Baseball\\Monday announced a decision on the appeal filed by Chicago Cubs\\pitcher Kerry Wood regarding a suspension stemming from an\\incident earlier this season.'
2 (Business) -> b'Bush reform may have blue states seeing red' b'President Bush #39;s  quot;revenue-neutral quot; tax reform needs losers to balance its winners, and people claiming the federal deduction for state and local taxes may be in administration planners #39; sights, news reports say.'


## Approaches to Represent Text as Tensor

### 1. Bag-of-Words as Data Preprocessing

To represent text as tensors, I first vectorize it into numbers. At the word level, this takes two steps:
* Use a tokenizer to split text into tokens.
* Build a vocabulary of those tokens.

I do not take into account words that occur only rarely in the text: only a few sentences contain them, so the model cannot learn much from them.
I limit the vocabulary size by passing the `max_tokens` argument to the `TextVectorization` constructor.
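
The two steps and the vocabulary cap can be sketched in plain Python as follows. This is a simplified stand-in under my own assumptions, not the actual `TextVectorization` implementation: the helpers `tokenize`, `build_vocabulary` and `encode` and the toy corpus are hypothetical. Indices 0 and 1 are reserved for padding and unknown words, mirroring the `''` and `'[UNK]'` entries that the layer places at the start of its vocabulary.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, drop punctuation, split on whitespace (simplified)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_vocabulary(texts, max_tokens):
    """Keep only the most frequent tokens; index 0 is padding, index 1 is unknown."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocabulary):
    """Map each token to its vocabulary index, or to 1 ([UNK]) if it was dropped."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return [index.get(tok, 1) for tok in tokenize(text)]

# Hypothetical toy corpus: 'the', 'cat', 'sat' occur twice, the rest once.
toy_texts = ["The cat sat.", "The cat ran.", "A dog sat."]
toy_vocabulary = build_vocabulary(toy_texts, max_tokens=5)
print(toy_vocabulary)
print(encode("The dog sat", toy_vocabulary))
```

Because the cap leaves room for only three real words, the rare word "dog" falls outside the vocabulary and is encoded as the unknown index 1.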

#### 1.1. Vectorize & Build a Vocabulary

In [13]:
vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=50000)
vectorizer.adapt(ds_train.take(500).map(lambda x: x['title'] + ' ' + x['description']))

vocabulary = vectorizer.get_vocabulary()
vocabulary_size = len(vocabulary)
print(vocabulary[:15])
print(f"Number of vocabulary: {vocabulary_size}")
vectorizer('I love artificial intelligence')

['', '[UNK]', 'the', 'to', 'a', 'in', 'of', 'and', 'on', 'for', '39s', 'with', 'that', 'its', 'as']
Number of vocabulary: 5335


<tf.Tensor: shape=(4,), dtype=int64, numpy=array([ 112, 3695, 5071, 3908])>

#### 1.2. Bagging

I convert each word number into a one-hot encoding and add all those vectors up.

In [93]:
def get_bag_of_words(text, vocab_size):
    return tf.reduce_sum(tf.one_hot(vectorizer(text), vocab_size), axis=0)

batch_size = 128
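# Join title and description with a space, as in the adapt() call above;
# otherwise the last title word and the first description word fuse into one token.
ds_train_bow = ds_train.map(lambda x: (get_bag_of_words(x['title'] + ' ' + x['description'], vocabulary_size),
                                       x['label'])).batch(batch_size)
ds_test_bow = ds_test.map(lambda x: (get_bag_of_words(x['title'] + ' ' + x['description'], vocabulary_size),
                                     x['label'])).batch(batch_size)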
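
To see what the sum of one-hot vectors produces, here is a small plain-Python sketch of the same idea without TensorFlow. The token ids and the vocabulary size of 5 are hypothetical values chosen for illustration.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

def bag_of_words(token_ids, vocab_size):
    """Sum the one-hot vectors of all token ids, position by position."""
    bow = [0] * vocab_size
    for token_id in token_ids:
        bow = [a + b for a, b in zip(bow, one_hot(token_id, vocab_size))]
    return bow

# Hypothetical ids for a four-word text in which word 2 appears twice.
token_ids = [2, 4, 2, 1]
bow = bag_of_words(token_ids, vocab_size=5)
print(bow)  # [0, 1, 2, 0, 1]
```

Each position of the result counts how often the corresponding vocabulary entry occurs, so the word order of the original text is lost.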

#### 1.3. Build Classifier

In [94]:
model = keras.models.Sequential([
    keras.layers.Dense(4, activation='softmax', input_shape=(vocabulary_size,))
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(ds_train_bow, validation_data=ds_test_bow)

model.summary()

Model: "sequential_21"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_18 (Dense)            (None, 4)                 21344     
                                                                 
Total params: 21,344
Trainable params: 21,344
Non-trainable params: 0
_________________________________________________________________
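
The parameter count in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output unit. The helper `dense_param_count` below is a hypothetical name used only for this check.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 5335 vocabulary entries feeding 4 output classes.
print(dense_param_count(5335, 4))  # 21344
```

So the size of this linear classifier grows in direct proportion to the vocabulary size.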


### 2. Bag-of-Words with n-grams

Some words are part of multi-word expressions. For example, the word 'on-line' has a completely different meaning from the words 'on' and 'line' in other contexts. If I always represent 'on' and 'line' by the same vectors, it can confuse the model.

In an n-gram representation, the frequency of each word, bi-word, or tri-word is a useful feature for training classifiers. For example, a bigram representation adds all word pairs to the vocabulary, in addition to the original words.
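
As a rough illustration of what a bigram representation adds, the following plain-Python sketch lists the unigrams and bigrams of a short sentence. The helper `ngrams_up_to` is a hypothetical name of my own, and the ordering of the result is an assumption made for readability.

```python
def ngrams_up_to(tokens, n):
    """Return all k-grams for k = 1..n, unigrams first, joined with spaces."""
    grams = []
    for k in range(1, n + 1):
        grams.extend(" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return grams

tokens = "i love artificial intelligence".split()
bigram_tokens = ngrams_up_to(tokens, 2)
print(bigram_tokens)
print(len(bigram_tokens))  # 4 unigrams + 3 bigrams = 7
```

A sentence of four words therefore yields seven tokens, and the vocabulary grows accordingly, because every distinct word pair becomes a candidate entry.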

#### 2.1 Generate a bi-gram Bag-of-Words

To use an n-gram representation on the AG News dataset, I pass the `ngrams` parameter to the `TextVectorization` constructor.

In [7]:
vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=50000,ngrams=2)
vectorizer.adapt(ds_train.take(500).map(lambda x: x['title'] + ' ' + x['description']))

vocabulary = vectorizer.get_vocabulary()
vocabulary_size = len(vocabulary)
print(vocabulary[:15])
print(f"Number of vocabulary: {vocabulary_size}")
vectorizer('I love artificial intelligence')

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089



['', '[UNK]', 'the', 'to', 'a', 'in', 'of', 'and', 'on', 'for', '39s', 'with', 'that', 'its', 'as']
Number of vocabulary: 20274


<tf.Tensor: shape=(7,), dtype=int64, numpy=array([  130, 11718, 18382, 12901,     1,     1, 18381])>

#### 2.2. Bagging

In [96]:
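# Join title and description with a space, as in the adapt() call above;
# get_bag_of_words now uses the bi-gram vectorizer defined in 2.1.
ds_train_bow_ngram = ds_train.map(lambda x: (get_bag_of_words(x['title'] + ' ' + x['description'], vocabulary_size),
                                             x['label'])).batch(batch_size)
ds_test_bow_ngram = ds_test.map(lambda x: (get_bag_of_words(x['title'] + ' ' + x['description'], vocabulary_size),
                                           x['label'])).batch(batch_size)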

#### 2.3. Build Classifier

In [97]:
model = keras.models.Sequential([
    keras.layers.Dense(4, activation='softmax', input_shape=(vocabulary_size,))
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(ds_train_bow_ngram, validation_data=ds_test_bow_ngram)

model.summary()

Model: "sequential_22"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_19 (Dense)            (None, 4)                 81100     
                                                                 
Total params: 81,100
Trainable params: 81,100
Non-trainable params: 0
_________________________________________________________________


### 3. Bag-of-Words as a Layer

Since the vectorizer is also a Keras layer, I can define a network that includes it and train it end-to-end.

This way, I don't need to vectorize the dataset using `map`; I can pass the original dataset directly to the input of the network.


In [21]:
# Vectorize Text
vectorizer_layer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=50000)
vectorizer_layer.adapt(ds_train.take(500).map(lambda x: x['title'] + ' ' + x['description']))

vocabulary = vectorizer_layer.get_vocabulary()
vocabulary_size = len(vocabulary)

In [22]:
def to_tuple(_x):
    return _x['title'] + ' ' + _x['description'], _x['label']

batch_size = 128
ds_train_embed = ds_train.map(to_tuple).batch(batch_size)
ds_test_embed = ds_test.map(to_tuple).batch(batch_size)

inp = keras.Input(shape=(1,), dtype=tf.string)
x = vectorizer_layer(inp)
x = tf.reduce_sum(tf.one_hot(x, vocabulary_size), axis=1)
out = keras.layers.Dense(4, activation='softmax')(x)
model = keras.models.Model(inp, out)

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(ds_train_embed, validation_data=ds_test_embed)

model.summary()


Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_7 (TextV  (None, None)             0         
 ectorization)                                                   
                                                                 
 tf.one_hot_2 (TFOpLambda)   (None, None, 5335)        0         
                                                                 
 tf.math.reduce_sum_2 (TFOpL  (None, 5335)             0         
 ambda)                                                          
                                                                 
 dense_5 (Dense)             (None, 4)                 21344     
                                                                 
Total params: 21,344
Trainable params: 21,344
Non-trainable params: 0

### 4. Auto-compute Bag-of-Words

Up to this point, I computed the BoW vectors manually by summing the one-hot encodings of individual words, to show how the calculation works.

To make the model easier to define and train, TensorFlow can calculate BoW vectors automatically when the `output_mode='count'` parameter is passed to the vectorizer constructor.


In [20]:
model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocabulary_size, output_mode='count'),
    keras.layers.Dense(4, input_shape=(vocabulary_size,), activation='softmax')
])
print("Training vectorizer:")
model.layers[0].adapt(ds_train.take(500).map(lambda x: x['title'] + ' ' + x['description']))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(ds_train_embed, validation_data=ds_test_embed)

model.summary()

Training vectorizer:
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_6 (TextV  (None, 5334)             0         
 ectorization)                                                   
                                                                 
 dense_4 (Dense)             (None, 4)                 21340     
                                                                 
Total params: 21,340
Trainable params: 21,340
Non-trainable params: 0
_________________________________________________________________
