# V2: SEQ2SEQ

This notebook follows [an online tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention#create_a_tfdata_dataset).

In [1]:
import tensorflow as tf

2024-09-05 19:32:43.812420: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-09-05 19:32:43.818567: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-09-05 19:32:43.831683: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-05 19:32:43.850292: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-05 19:32:43.854793: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-05 19:32:43.866789: I tensorflow/core/platform/cpu_feature_gu

**Notes**:
- This notebook follows [an online tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention) (and [at least one other](https://www.tensorflow.org/text/tutorials/text_generation) of the TensorFlow tutorials).
- This [blog post](https://janakiev.com/blog/jupyter-virtual-envs/) was referenced to set up the virtual environment.

In [2]:
import numpy as np
from typing import Any, Tuple
from IPython.display import display, Markdown
from pathlib import Path


In [3]:
data_file_paths = list(Path('data/processed/en/').glob('*.txt'))
dataset_raw = tf.data.TextLineDataset(
	data_file_paths,
)

We've now loaded the `.txt` training data files using [`tf.data.TextLineDataset`](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset). Each line in the source files is mapped to a new training example. 

Although some preprocessing has been done by `/data/process_data.py`, paragraphs haven't been filtered by length yet. Let's do that now, and derive the unpunctuated, lowercase context from each target line:

In [4]:
def filter_paragraphs(context, target):
	return tf.strings.length(context) > 5

punctuation_chars = r'\?!.,"\-\':'
def add_context(target):
	context = tf.strings.regex_replace(target, '[{}]+'.format(punctuation_chars), '')
	context = tf.strings.strip(
		tf.strings.regex_replace(context, '[ ]+', ' ')
	)
	context = tf.strings.lower(context)
	return context, target

dataset_raw = dataset_raw.map(add_context).filter(filter_paragraphs)

Now let's inspect the data:

In [5]:
for text, label in dataset_raw.take(4).as_numpy_iterator():
	print(text, label)

b'illustration' b'Illustration '
b'alices adventures in wonderland' b"Alice's Adventures in Wonderland"
b'by lewis carroll' b'by Lewis Carroll'
b'the millennium fulcrum edition 30' b'THE MILLENNIUM FULCRUM EDITION 3.0'


2024-09-05 19:32:45.481673: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
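
The context-building rule can also be checked without TensorFlow. The sketch below re-implements `add_context` with Python's `re` module. The helper name `make_context` is hypothetical, and it assumes that `re` and `tf.strings.regex_replace` agree for this simple character class.

```python
import re

# Same character class as `punctuation_chars` in the notebook.
PUNCTUATION_CHARS = r'\?!.,"\-\':'

def make_context(target: str) -> str:
    """Hypothetical pure-Python counterpart of `add_context`."""
    # Drop every run of punctuation characters.
    context = re.sub('[{}]+'.format(PUNCTUATION_CHARS), '', target)
    # Collapse repeated spaces, then trim and lowercase.
    context = re.sub('[ ]+', ' ', context).strip()
    return context.lower()

for line in ["Alice's Adventures in Wonderland", 'THE MILLENNIUM FULCRUM EDITION 3.0']:
    print(repr(make_context(line)), '<-', repr(line))
```

Note that digits separated by a period are merged (`3.0` becomes `30`), which matches the dataset output above.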


### Batching

In [6]:
BUFFER_SIZE = 100_000
BATCH_SIZE = 16
dataset_train = dataset_raw.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

# Inspired by https://stackoverflow.com/a/74609848.
validate_size = 64
dataset_validate = dataset_train.take(validate_size)
dataset_train = dataset_train.skip(validate_size)

### Preparing to process data

The [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer takes a `standardize` option that preprocesses input data. The default removes punctuation, but we don't want that. Let's redefine it:

In [7]:

def standardize_tf_text(text):
	punctuation_regex = '[{p}]'.format(p = punctuation_chars)

	# Surround punctuation with spaces for easier tokenization
	text = tf.strings.regex_replace(text, punctuation_regex, r' \0 ')

	# Remove repeated spaces
	text = tf.strings.regex_replace(text, r'\s+', ' ')

	# Add a special "capitalize the next letter" token
	text = tf.strings.regex_replace(text, r'(\s|^)([A-Z])', r' [CAP] \2')

	# Lowercase everything
	text = tf.strings.lower(text)

	# Remove leading and trailing spaces
	text = tf.strings.strip(text)

	# Add sequence markings
	return tf.strings.join(['[START]', text, '[END]'], separator=' ')

print(standardize_tf_text('This is a test! It\'s working?!'))

tf.Tensor(b"[START] [cap] this is a test ! [cap] it ' s working ? ! [END]", shape=(), dtype=string)
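
To make each standardization step easier to follow, here is a rough pure-Python equivalent of `standardize_tf_text`. The name `standardize_text` is hypothetical, and the sketch assumes that Python's `re` behaves like TensorFlow's regular expressions for these patterns (Python writes the whole match as `\g<0>` where the notebook uses `\0`).

```python
import re

PUNCTUATION_REGEX = r'[\?!.,"\-\':]'

def standardize_text(text: str) -> str:
    """Hypothetical pure-Python counterpart of `standardize_tf_text`."""
    # Surround punctuation with spaces so that each mark becomes its own token.
    text = re.sub(PUNCTUATION_REGEX, r' \g<0> ', text)
    # Collapse runs of whitespace.
    text = re.sub(r'\s+', ' ', text)
    # Mark word-initial capitals with a [CAP] token.
    text = re.sub(r'(\s|^)([A-Z])', r' [CAP] \2', text)
    # Lowercasing also turns [CAP] into [cap].
    text = text.lower().strip()
    # The sequence markers are added last, so they stay uppercase.
    return ' '.join(['[START]', text, '[END]'])

print(standardize_text("This is a test! It's working?!"))
```

Because the markers are added after lowercasing, the vocabulary ends up containing `[cap]` in lowercase but `[START]` and `[END]` in uppercase.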


The text standardization function can now be used to preprocess text:

In [8]:
# Keep only the 2000 most commonly used tokens
max_vocab_size = 2000

target_text_processor = tf.keras.layers.TextVectorization(
	standardize=standardize_tf_text,
	max_tokens=max_vocab_size,
	# Allow entries of different lengths
	ragged=True,
)
target_text_processor.adapt(dataset_train.map(lambda context, target: target))

print('First 14 target words:', target_text_processor.get_vocabulary()[:14])

# The target data should be roughly equivalent to the context data, except that it has
# additional (punctuation) tokens, so the same processor is reused for the context.
context_text_processor = target_text_processor

First 14 target words: ['', '[UNK]', '[cap]', ',', 'the', '.', '[START]', '[END]', 'and', 'to', 'of', 'i', 'a', 'in']


2024-09-05 19:32:55.540937: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


We can use these layers to convert to/from token IDs:


In [9]:
example_text = 'hello world this is a test tensorflow is processing this'
example_tokens = context_text_processor(example_text)
print('Example tokens', example_tokens)

context_vocab = np.array(context_text_processor.get_vocabulary())
tokens = context_vocab[example_tokens.numpy()]
print('Back to text', ' '.join(tokens))

Example tokens tf.Tensor([  6   1 257  42  24  12   1   1  24   1  42   7], shape=(12,), dtype=int64)
Back to text [START] [UNK] world this is a [UNK] [UNK] is [UNK] this [END]
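
The mapping between tokens and IDs can be pictured with a small pure-Python sketch. It assumes, as the vocabulary printed above suggests, that index 0 is reserved for padding and index 1 for `[UNK]`. The function names and the alphabetical tie-breaking are illustrative choices, not TensorFlow's implementation.

```python
from collections import Counter

def build_vocabulary(token_lists, max_tokens):
    """Index 0 is padding, index 1 is [UNK], the rest are the most frequent tokens."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ['', '[UNK]'] + [token for token, _ in ranked[:max_tokens - 2]]

def encode(tokens, vocabulary):
    index = {token: i for i, token in enumerate(vocabulary)}
    return [index.get(token, 1) for token in tokens]  # 1 = [UNK]

def decode(ids, vocabulary):
    return [vocabulary[i] for i in ids]

corpus = [['this', 'is', 'a', 'test'], ['this', 'is', 'fine'], ['this', 'works']]
vocabulary = build_vocabulary(corpus, max_tokens=4)
ids = encode(['this', 'is', 'tensorflow'], vocabulary)
print(vocabulary, ids, decode(ids, vocabulary))
```

Words that fall outside the vocabulary limit all collapse to the same ID, which is why `hello`, `test`, and `tensorflow` come back as `[UNK]` in the output above.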


### Processing the data

Now, we'll:
1. Map the data through the text processors we just made.
2. Shift the target data, so that our network is provided with a history of generated tokens.

In [10]:
def process_text(context, target):
	return context_text_processor(context), target_text_processor(target)

def add_target_history(context, target):
	# .to_tensor(): Converts from RaggedTensors to Tensors.
	# We give our network the history as target_in
	target_in = target[:, :-1].to_tensor()
	target_out = target[:, 1:].to_tensor()
	return (context.to_tensor(), target_in), target_out
dataset_train = dataset_train.map(process_text).map(add_target_history).repeat()
dataset_validate = dataset_validate.map(process_text).map(add_target_history)

In [11]:

def inspect_dataset(dataset: tf.data.Dataset):
	target_vocab = np.array(target_text_processor.get_vocabulary())
	for (context, target_in), target_out in dataset.take(1):
		context_words = context_vocab[context[0]]
		print('context', ','.join(context_words))
		print('target_in', ','.join(target_vocab[target_in[0]]))
		print('target_out', ','.join(target_vocab[target_out[0]]))

inspect_dataset(dataset_train)

context [START],yet,the,character,of,his,face,had,been,at,all,times,remarkable,[END],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
target_in [START],[cap],yet,the,character,of,his,face,had,been,at,all,times,remarkable,.,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
target_out [cap],yet,the,character,of,his,face,had,been,at,all,times,remarkable,.,[END],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
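
The shift performed by `add_target_history` can be sketched on plain lists. The helper names below are hypothetical. IDs 6 and 7 are `[START]` and `[END]` in the vocabulary printed earlier, and the remaining IDs are made up for illustration.

```python
def pad(rows, pad_id=0):
    """Right-pad token ID rows to a common length (similar to `.to_tensor()`)."""
    width = max(len(row) for row in rows)
    return [row + [pad_id] * (width - len(row)) for row in rows]

def add_history(target_rows):
    # The decoder sees every token except the last and must predict
    # every token except the first.
    target_in = pad([row[:-1] for row in target_rows])
    target_out = pad([row[1:] for row in target_rows])
    return target_in, target_out

batch = [[6, 2, 40, 5, 7], [6, 41, 7]]
target_in, target_out = add_history(batch)
print(target_in)
print(target_out)
```

At every position, `target_out` holds the token that follows the corresponding entry of `target_in`, so the network is always asked to predict one step ahead.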


## Model

### The encoder

See https://www.tensorflow.org/text/tutorials/nmt_with_attention#the_encoder

In [12]:
class Encoder(tf.keras.Layer):
	def __init__(self, text_processor, units: int):
		"""
		Creates a new Encoder layer. `units` is the size of the embedding vectors and
		of the output vector produced for each input token.
		"""
		super(Encoder, self).__init__()
		self.text_processor = text_processor
		self.vocab_size = text_processor.vocabulary_size()
		self.units = units

		# Converts tokens -> vectors
		self.embedding = tf.keras.layers.Embedding(
			# mask_zero: Treats zero as a padding value that should be ignored
			self.vocab_size, units, mask_zero = True,
		)
		gru = tf.keras.layers.GRU(
			units, return_sequences = True,
			# Use the recurrent_initializer suggested by the tutorial (& the default
			# for kernel_initializer).
			recurrent_initializer='glorot_uniform'
		)
		self.rnn = tf.keras.layers.Bidirectional(
			# merge_mode determines how the forward and backward layers are combined
			#            'concat' is another option here
			merge_mode = 'sum',
			layer=gru,
		)

	def call(self, x):
		x = self.embedding(x)
		x = self.rnn(x)
		return x

	def prepare_for_input(self, texts):
		"""
		Utility method that converts `texts` to a form that can be provided to the `call` method.
		"""
		texts = tf.convert_to_tensor(texts)
		if len(texts.shape) == 0:
			texts = texts[None]
		context = self.text_processor(texts).to_tensor()
		return context


Try it:

In [13]:
ENCODER_UNITS = 48
encoder = Encoder(context_text_processor, ENCODER_UNITS)

for (context, target_history), target_next in dataset_validate.take(1):
	encoder_result = encoder(context)
	print('Context tokens shape (batch, s):', context.shape)
	print('Encoder output shape (batch, s, ENCODER_UNITS):', encoder_result.shape)

Context tokens shape (batch, s): (16, 46)
Encoder output shape (batch, s, ENCODER_UNITS): (16, 46, 48)


2024-09-05 19:33:01.819845: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


### The attention layer

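Attention can be thought of as a soft, trainable lookup table: each `query` is compared against a set of keys, and the result is a weighted average of the corresponding `values`. Here, the queries come from the target side and the values (which also serve as the keys) come from the encoder output.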

In [14]:
class CrossAttention(tf.keras.Layer):
	def __init__(self, units, **kwargs):
		super().__init__()
		self.attention_layer = tf.keras.layers.MultiHeadAttention(
			key_dim=units,
			num_heads=1,
			**kwargs
		)
		# Keeps "the mean activation within each example close to 0 and the
		# activation standard deviation close to 1" -- https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization?hl=en
		self.norm_layer = tf.keras.layers.LayerNormalization()
		self.add_layer = tf.keras.layers.Add()
		self.supports_masking = True

	def call(self, query, value):
		attention_output = self.attention_layer(
			query = query,
			value = value,
			#use_causal_mask=True,
			# Return the attention scores for later plotting
			# return_attention_scores = True,
		)

		x = self.add_layer([ query, attention_output ])
		x = self.norm_layer(x)
		return x


In [15]:
attention_layer = CrossAttention(ENCODER_UNITS)

# Test with an example
for (context, target_history), target_next in dataset_validate.take(1):
	embed_layer = tf.keras.layers.Embedding(target_text_processor.vocabulary_size(), output_dim=ENCODER_UNITS, mask_zero=True)
	target_embed = embed_layer(target_history)
	encoded_context = encoder(context)
	attention_result = attention_layer(target_embed, encoded_context)

	print('Encoded context sequence shape (batch, s, units):', encoded_context.shape)
	print('Target history sequence shape (batch, t, units):', target_embed.shape)
	print('Attention result shape (batch, t, units):', attention_result.shape)

	# Used later 
	test_encoded_context = encoded_context

Encoded context sequence shape (batch, s, units): (16, 52, 48)
Target history sequence shape (batch, t, units): (16, 63, 48)
Attention result shape (batch, t, units): (16, 63, 48)
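
The shapes above follow from how attention combines its inputs. As a simplified sketch, the code below implements single-head scaled dot-product attention on plain lists. It deliberately omits the learned projections that `MultiHeadAttention` applies to the query, key, value, and output, as well as the residual connection and normalization in `CrossAttention`. The function names are hypothetical.

```python
import math

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention: one output row per query."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for query in queries:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output = [sum(w * value[i] for w, value in zip(weights, values))
                  for i in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # s = 3 encoder positions
queries = [[2.0, 0.0], [0.0, 2.0]]               # t = 2 decoder positions
outputs, weights = attend(queries, keys=context, values=context)
print(len(outputs), len(outputs[0]))  # t rows, each with `units` entries
```

The output has one row per query, which is why the attention result in the notebook has length `t` rather than `s`.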




### The decoder

The decoder produces queries for the attention layer. The decoder operates on `target_history`. At each step during training, it should have no information about future target output (that's what we're trying to determine). As such, we use a unidirectional RNN.


In [16]:
class CustomDense(tf.keras.layers.Dense):
	def __init__(self, *args, **kwargs):
		super(CustomDense, self).__init__(*args, **kwargs)
	
	def compute_mask(self, _inputs, mask=None):
		return mask

class Decoder(tf.keras.Layer):
	def __init__(self, text_processor, units):
		super(Decoder, self).__init__()
		self.text_processor = text_processor
		self.vocab_size = text_processor.vocabulary_size()
		self.units = units

		self.embedding_layer = tf.keras.layers.Embedding(
			# mask_zero: Treats zero as a padding value that should be ignored
			self.vocab_size, units, mask_zero = True,
		)
		self.rnn_layer = tf.keras.layers.GRU(
			units, return_sequences = True, return_state = True, recurrent_initializer='glorot_uniform',
		)
		self.attention_layer = CrossAttention(units)

		# Creates logits with the estimated probability of each output token
		self.output_layer = CustomDense(self.vocab_size)

		# Conversion:
		self.word_to_id = tf.keras.layers.StringLookup(
			vocabulary = text_processor.get_vocabulary(),
			mask_token = '',
			oov_token = '[UNK]',
		)
		self.id_to_word = tf.keras.layers.StringLookup(
			vocabulary = text_processor.get_vocabulary(),
			mask_token = '',
			oov_token = '[UNK]',
			invert = True,
		)
		# Pre-computing these simplifies exporting
		self.start_id = self.word_to_id('[START]')
		self.end_id = self.word_to_id('[END]')

		self.supports_masking = True
	
	def build(self, input_shape):
		# Nothing that needs a size allocation based on the input shape
		pass
	
	def call(self, context, target_history, state = None, return_state = False):
		x = self.embedding_layer(target_history)
		x, state = self.rnn_layer(x, initial_state = state)
		x = self.attention_layer(x, context)

		logits = self.output_layer(x)
		if return_state:
			return logits, state
		else:
			return logits
	
	## Conversion/testing ##

	def tokens_to_text(self, tokens):
		text = tf.strings.reduce_join(self.id_to_word(tokens), separator = ' ')
		text = tf.strings.regex_replace(text, r'\s*\[START\]\s*', '')
		text = tf.strings.regex_replace(text, r'\s*\[END\].*$', '')
		return text

	def generate_next_token(self, context, target_history, done_vec, state, temperature = 0.0):
		# Note: done_vec is a boolean tensor of shape (batch, 1), indicating whether each item in the batch is done

		logits, state = self(context, target_history, state = state, return_state = True)

		# logits has shape (batch, t, target_vocab_size). Only generate the token corresponding
		# to the last logits in the sequence (at t - 1)
		if temperature > 0:
			next_token = tf.where(
				done_vec,
				tf.constant(0, dtype=tf.int64), # Emit 0 after a sequence is done
				tf.random.categorical(logits[:, -1, :] / temperature, num_samples = 1), # Otherwise, pick the token from a categorical distribution
			)
		else:
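			# Greedy decoding: pick the most likely token at the last position only.
			# The result has shape (batch, 1).
			next_token = tf.math.argmax(logits[:, -1:, :], axis=-1)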
		done_vec = done_vec|(next_token == self.end_id)
		return next_token, done_vec, state
	
	def get_initial_state(self, context):
		# context has shape (batch_size, s, units)
		batch_size = tf.shape(context)[0]
		start_tokens = tf.fill([batch_size, 1], self.start_id)
		done_vec = tf.zeros([batch_size, 1], dtype = tf.bool)

		# From the Tensorflow source code:
		# > RNN expect the states in a list, even if single state.
		# Note: Without the [0] we get a type mismatch while exporting.
		initial_state = self.rnn_layer.get_initial_state(batch_size)[0]

		return start_tokens, done_vec, initial_state

Let's try it!

In [17]:
def test_generation_loop():
	decoder = Decoder(target_text_processor, ENCODER_UNITS)
	next_token, done_vec, state = decoder.get_initial_state(test_encoded_context[:3, :, :])
	tokens = [next_token]

	for i in range(8):
		next_token, done_vec, state = decoder.generate_next_token(test_encoded_context[:3, :, :], next_token, done_vec, state)
		tokens.append(next_token)
	
	# Merge all batch outputs into a single dimension
	tokens = tf.concat(tokens, -1) # -1 = last axis

	print('Output:', decoder.tokens_to_text(tokens).numpy())

test_generation_loop()

Output: b'villefort dim terror laid fixed pressed pain stapletonvillefort dim terror laid fixed pressed pain stapletonvillefort dim terror laid fixed pressed pain stapleton'
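
The generation loop can be summarized without TensorFlow. The sketch below shows greedy decoding with a hypothetical scoring function standing in for the decoder. The real decoder also carries the GRU state and the encoded context from step to step, which this sketch leaves out.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_steps):
    """Repeatedly pick the highest-scoring token until [END] or `max_steps`."""
    tokens = []
    previous = start_id
    for _ in range(max_steps):
        scores = next_token_scores(previous)
        previous = max(range(len(scores)), key=scores.__getitem__)
        if previous == end_id:
            break
        tokens.append(previous)
    return tokens

# Toy "model": always prefers the token ID that follows the previous one.
def toy_scores(previous, vocab_size=5):
    return [1.0 if i == previous + 1 else 0.0 for i in range(vocab_size)]

print(greedy_decode(toy_scores, start_id=0, end_id=4, max_steps=8))
```

With an untrained decoder the scores are effectively random, which is why the output above is a sequence of unrelated words.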


## The model

We can now combine the encoder and decoder into a model that can be trained and then used to add punctuation:

In [18]:
class Punctuator(tf.keras.Model):
	def __init__(self, units, context_text_processor, target_text_processor):
		super().__init__()
		self.encoder = Encoder(context_text_processor, units)
		self.decoder = Decoder(target_text_processor, units)
	
	def call(self, inputs):
		context, target_history = inputs
		context = self.encoder(context)
		logits = self.decoder(context, target_history)
		return logits
	
	def fix_punctuation_raw(self, input):
		"""
		Adds punctuation to `input`, where `input` is a `Tensor` with shape (batch_size, s) where s is the
		context length.
		"""
		context = self.encoder(input)

		next_token, done_vec, state = self.decoder.get_initial_state(context)

		# Although a TensorArray would allow more efficient exporting, the ONNX exporter seems to
		# have trouble with it. For now, use a Python list.
		tokens = []
		max_iterations = 34

		for i in range(max_iterations):
			# token_history has size: (batch, t, target_vocab_size)
			# token_history = tf.concat(tokens, 1)
			# print('history', model.decoder.id_to_word(token_history))
			next_token, done_vec, state = self.decoder.generate_next_token(context, next_token, done_vec, state, temperature=0)
			#tokens = tokens.write(i + 1, next_token)
			tokens.append(next_token)

			if tf.executing_eagerly() and tf.reduce_all(done_vec):
				break
		
		tokens = tf.concat(tokens, -1)
		# When exporting to ONNX, tokens_to_text can only operate on a single dimension. As such,
		# all inputs are collapsed:
		return tokens

	def fix_punctuation(self, text: list[str]):
		inputs = self.encoder.prepare_for_input(text)
		tokens = self.fix_punctuation_raw(inputs)
		return self.decoder.tokens_to_text(tokens)

In [19]:
model = Punctuator(ENCODER_UNITS, context_text_processor, target_text_processor)

for (example_context_tok, example_target_hist), _ in dataset_validate.take(1):
	test_logits = model((example_context_tok, example_target_hist))
	print('Context tokens shape (batch, s):', example_context_tok.shape)
	print('Target history tokens shape (batch, t):', example_target_hist.shape)
	print('Logits shape (batch, t, vocab_size)', test_logits.shape)

Context tokens shape (batch, s): (16, 52)
Target history tokens shape (batch, t): (16, 69)
Logits shape (batch, t, vocab_size) (16, 69, 2000)


In [20]:
model.summary()

To avoid penalizing masked outputs, we use a custom loss function (see the tutorial):

In [21]:
base_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def masked_loss(y_true, y_predict):
	loss = base_loss_fn(y_true, y_predict)
	
	unmasked = y_true != 0
	unmasked = tf.cast(unmasked, loss.dtype)
	# Only consider output with a corresponding label.
	loss *= unmasked

	count_unmasked = tf.math.reduce_sum(unmasked)

	# reduce_sum: Adds all entries of a vector.
	return tf.math.reduce_sum(loss)/count_unmasked

In [22]:
def masked_accuracy(y_true, predict_logits):
	predicted_index = tf.math.argmax(predict_logits, axis=-1)
	predicted_index = tf.cast(predicted_index, y_true.dtype)

	match = tf.cast(y_true == predicted_index, tf.float32)
	unmasked = tf.cast(y_true != 0, tf.float32)
	count_unmasked = tf.math.reduce_sum(unmasked)

	return tf.math.reduce_sum(match * unmasked) / count_unmasked
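
The effect of masking can be checked by hand on a tiny example. The sketch below works on probabilities rather than logits to keep the arithmetic simple. The function name is hypothetical, and label 0 is assumed to mean padding, as in the notebook.

```python
import math

def masked_loss_and_accuracy(y_true, probabilities):
    """y_true: token IDs (0 = padding); probabilities: one distribution per position."""
    total_loss, correct, count = 0.0, 0, 0
    for label, distribution in zip(y_true, probabilities):
        if label == 0:
            continue  # Padding: contributes to neither the loss nor the accuracy.
        total_loss += -math.log(distribution[label])
        predicted = max(range(len(distribution)), key=distribution.__getitem__)
        correct += int(predicted == label)
        count += 1
    return total_loss / count, correct / count

y_true = [2, 1, 0, 0]
probabilities = [
    [0.1, 0.2, 0.7],    # correct: label 2 has the highest probability
    [0.6, 0.3, 0.1],    # wrong: label 1 is not the most likely token
    [0.9, 0.05, 0.05],  # padding, ignored
    [0.2, 0.2, 0.6],    # padding, ignored
]
loss, accuracy = masked_loss_and_accuracy(y_true, probabilities)
print(round(loss, 4), accuracy)
```

Dividing by the number of unmasked positions, rather than by the padded sequence length, keeps heavily padded batches from looking artificially easy.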


In [23]:
#model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=[masked_accuracy, masked_loss])
model.compile(optimizer='adam', loss=masked_loss, metrics=[masked_accuracy, masked_loss])

In [24]:
print('From the tutorial:')
vocab_size = float(target_text_processor.vocabulary_size())

print('expected loss', tf.math.log(vocab_size).numpy())
print('expected accuracy', 1/vocab_size)

From the tutorial:
expected loss 7.6009026
expected accuracy 0.0005
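
These baseline numbers follow from assuming that an untrained model spreads its probability roughly evenly over the vocabulary. A short check of the arithmetic, using the vocabulary size of 2000 set earlier:

```python
import math

vocab_size = 2000
# A uniform prediction assigns 1 / vocab_size to the true token.
uniform_probability = 1 / vocab_size
expected_loss = -math.log(uniform_probability)   # equals log(vocab_size)
expected_accuracy = uniform_probability          # chance of guessing correctly

print(expected_loss, expected_accuracy)
```

A measured initial loss close to `log(vocab_size)` suggests that the loss and masking are wired up correctly before training starts.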


In [25]:
model.evaluate(dataset_validate, steps=20, return_dict=True)


20/20 ━━━━━━━━━━━━━━━━━━━━ 4s 33ms/step - loss: 7.6039 - masked_accuracy: 1.0857e-04 - masked_loss: 7.6039


{'loss': 7.603926658630371,
 'masked_accuracy': 0.00015984181663952768,
 'masked_loss': 7.6039276123046875}

In [26]:


def test_punctuation(text):
	return '[test]: ' + model.fix_punctuation(text).numpy().decode('utf-8')

class DemoCallback(tf.keras.callbacks.Callback):
	def on_epoch_end(self, epoch_index: int, logs = None):
		print('\r', test_punctuation([ 'test this is a sample will it work if it does then how well' ]))
		if epoch_index % 10 == 0:
			# From the test data
			print(test_punctuation([
				'not that alice had any idea of doing that she felt as if she would never be able to talk again, she was getting so much out of breath and still the queen cried faster faster and dragged her along'
			]))
			print(test_punctuation([ 'tensorflow is a library that is used for machine learning it is available for more languages than just python' ]))
			print(test_punctuation([ 'the joplin note taking app can be used to take multimedia notes' ]))
			print(test_punctuation([ 'here are a few words javascript typescript python joplin interesting loud and sequence these words are all very useful' ]))

test_punctuation(tf.constant([ 'this is an example they said' ]))

'[test]: sensation dressed feelings drive dressed feelings curious express promised sympathy breast grey sister presence lion wore mystery smoke dim hungry treasure terror bring pleasant pleasant pleasant repeat carrying play misfortune inside wild facts monte'

In [27]:
history = model.fit(
	dataset_train,
	epochs = 30,
	steps_per_epoch = 500,
	validation_data = dataset_validate,
	callbacks=[DemoCallback()]
)

Epoch 1/30
499/500 ━━━━━━━━━━━━━━━━━━━━ 0s 68ms/step - loss: 5.2544 - masked_accuracy: 0.1673 - masked_loss: 5.2544



 [test]: [cap] i have not [UNK] , and [cap] i have [UNK] , and [cap] i have [UNK] , and [cap] i have [UNK] , and [cap] i have [UNK] , and [cap] i have
[test]: [cap] i had [UNK] to [UNK] .
[test]: [cap] i [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and
[test]: [cap] [UNK] , [cap] i have been [UNK] to [UNK] .
[test]: [cap] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK]
500/500 ━━━━━━━━━━━━━━━━━━━━ 49s 82ms/step - loss: 5.2516 - masked_accuracy: 0.1676 - masked_loss: 5.2516 - val_loss: 3.9222 - val_masked_accuracy: 0.2783 - val_masked_loss: 3.8619
Epoch 2/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 0s 58ms/step - loss: 3.7639 - masked_accuracy: 0.3097 - masked_loss: 3.7639

2024-09-05 19:34:35.454265: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]


 [test]: [cap] [UNK] this is a [UNK] , and it , it as it it is it ?
500/500 ━━━━━━━━━━━━━━━━━━━━ 33s 66ms/step - loss: 3.7634 - masked_accuracy: 0.3098 - masked_loss: 3.7634 - val_loss: 2.7005 - val_masked_accuracy: 0.4905 - val_masked_loss: 2.6590
Epoch 3/30
 [test]: [cap] [UNK] this is a [UNK] will be work , if it does then how well .
500/500 ━━━━━━━━━━━━━━━━━━━━ 32s 64ms/step - loss: 2.4607 - masked_accuracy: 0.5350 - masked_loss: 2.4607 - val_loss: 1.7115 - val_masked_accuracy: 0.6387 - val_masked_loss: 1.6852
Epoch 4/30
 [test]: [cap] [UNK] this is a [UNK] will it work if it does then how well .
500/500 ━━━━━━━━━━━━━━━━━━━━ 34s 67ms/step - loss: 1.5719 - masked_accuracy: 0.6765 - masked_loss: 1.5719 - val_loss: 1.1549 - val_masked_accuracy: 0.7292 - val_masked_

2024-09-05 19:39:03.046974: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]


 [test]: [cap] [UNK] this is a [UNK] will it work if it does then , how well .
500/500 ━━━━━━━━━━━━━━━━━━━━ 33s 65ms/step - loss: 0.5594 - masked_accuracy: 0.8557 - masked_loss: 0.5594 - val_loss: 0.6103 - val_masked_accuracy: 0.8311 - val_masked_loss: 0.6009
Epoch 11/30
 [test]: [cap] [UNK] this is a [UNK] will it work , if it does then , how well .
[test]: [cap] not that [cap] alice had any idea of doing that she felt as if she would never be able to talk again . [cap] she was getting , she so much out of
[test]: [cap] [UNK] is a library , that is used for [UNK] [UNK] it is [UNK] for more [UNK] than just [UNK] .
[test]: [cap] the [UNK] note , taking [UNK] can be used to take [UNK] notes .
[test]: [cap] here are a few words [UNK] [UNK] [UNK] [UNK] interesting loud and [UNK] these words are all very [UNK] .
500/500 ━━━━━━━━━━━━━━━━━━━━ 35s 70ms/step - loss: 0.53

2024-09-05 19:48:27.759985: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
	 [[{{node IteratorGetNext}}]]


 [test]: [cap] [UNK] this is a [UNK] will it work , if it does then how well ?
500/500 ━━━━━━━━━━━━━━━━━━━━ 38s 77ms/step - loss: 0.4023 - masked_accuracy: 0.8844 - masked_loss: 0.4023 - val_loss: 0.3833 - val_masked_accuracy: 0.8730 - val_masked_loss: 0.3774
Epoch 27/30
 [test]: [cap] [UNK] , this is a [UNK] will it work , if it does then how well ?
500/500 ━━━━━━━━━━━━━━━━━━━━ 39s 78ms/step - loss: 0.4390 - masked_accuracy: 0.8784 - masked_loss: 0.4390 - val_loss: 0.4142 - val_masked_accuracy: 0.8667 - val_masked_loss: 0.4078
Epoch 28/30
 [test]: [cap] [UNK] this is a [UNK] will it work , if it does then how well ?
500/500 ━━━━━━━━━━━━━━━━━━━━ 40s 80ms/step - loss: 0.4243 - masked_accuracy: 0.8811 - masked_loss: 0.4243 - val_loss: 0.3983 - val_masked_accuracy: 0.8730

In [28]:

print(test_punctuation([
	'not that alice had any idea of doing that she felt as if she would never be able to talk again she was getting so much out of breath and still the queen cried faster faster and dragged her along'
]))
print(test_punctuation([ 'this is a test of a the punctuation system for i am curious how well it works' ]))

[test]: [cap] not that [cap] alice had any idea of doing that she felt as if she would never be able to talk again she was getting so much out of breath , and still
[test]: [cap] this is a [UNK] of a the [UNK] [UNK] for [cap] i am curious how well it [UNK] .


## Exporting

Based on the [Export](https://www.tensorflow.org/text/tutorials/nmt_with_attention#export) section of the tutorial:

In [53]:
class Export(tf.Module):
	def __init__(self, model):
		self.model = model
	
	@tf.function(input_signature=[tf.RaggedTensorSpec(dtype=tf.int64, shape=[None])])
	def fix_punctuation(self, input):
		return self.model.fix_punctuation_raw(
			tf.reshape(input, [1, -1])
		)

Run `fix_punctuation` once to compile it:

In [54]:
export = Export(model)

In [57]:
sample_inputs = context_text_processor('this sentence shall be punctuated for the following reasons first punctatuion makes things easier to read second um')
model.decoder.tokens_to_text(export.fix_punctuation(sample_inputs))

<tf.Tensor: shape=(), dtype=string, numpy=b'[cap] this sentence shall be [UNK] for the following reasons first [UNK] makes things [UNK] to read second [UNK] .'>

Now we save the model:

In [58]:
tf.saved_model.save(export, 'punctuator-seq2seq', signatures={ 'serving_default': export.fix_punctuation })



INFO:tensorflow:Assets written to: punctuator-seq2seq/assets




See [the documentation](https://www.tensorflow.org/guide/saved_model#specifying_signatures_during_export) for information about the `signatures` option.

In [63]:
import json

web_output_dir = Path('web')
vocab_output_file = web_output_dir / 'wordEncodings.ts'

vocab_output_file.write_text('''
// Auto-generated file!
// Created by v2-seq2seq.ipynb
export default {};
'''.format(json.dumps(target_text_processor.get_vocabulary())))

19729
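
The vocabulary file relies on `str.format` replacing the `{}` placeholder with a JSON array, which is also a valid TypeScript array literal. The sketch below shows the same idea on a small vocabulary. The helper name `render_vocab_module` is hypothetical, and the template is shortened.

```python
import json

def render_vocab_module(vocabulary):
    """Hypothetical helper that renders a vocabulary as TypeScript source."""
    template = '// Auto-generated file!\nexport default {};\n'
    # str.format replaces `{}` with the JSON array. A literal brace in the
    # template would have to be written as `{{` or `}}`.
    return template.format(json.dumps(vocabulary))

source = render_vocab_module(['', '[UNK]', '[cap]', ','])
print(source)

# The array literal is valid JSON, so it can be parsed back as a sanity check.
array_text = source.split('export default ', 1)[1].rstrip().rstrip(';')
assert json.loads(array_text) == ['', '[UNK]', '[cap]', ',']
```

The number printed by the notebook cell above is the return value of `write_text`, that is, the number of characters written.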

### Testing the saved model

In [61]:
reloaded = tf.saved_model.load('punctuator-seq2seq')
# Warmup
reloaded.fix_punctuation(sample_inputs)
print('Imported and warmed up!')

Imported and warmed up!


In [62]:
%%time
model.decoder.tokens_to_text(reloaded.fix_punctuation(sample_inputs))


CPU times: user 41.1 ms, sys: 15.7 ms, total: 56.8 ms
Wall time: 24.3 ms


<tf.Tensor: shape=(), dtype=string, numpy=b'[cap] this sentence shall be [UNK] for the following reasons first [UNK] makes things [UNK] to read second [UNK] .'>