# V2: SEQ2SEQ

This notebook follows [an online tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention#create_a_tfdata_dataset).

In [1]:
import tensorflow as tf

2024-09-05 14:20:20.263898: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.

**Notes**:
- This notebook follows [an online tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention) (and [at least one other](https://www.tensorflow.org/text/tutorials/text_generation) of the TensorFlow tutorials).
- This [blog post](https://janakiev.com/blog/jupyter-virtual-envs/) was referenced to set up the virtual environment.

In [2]:
import numpy as np
from typing import Any, Tuple
from IPython.display import display, Markdown
from pathlib import Path


In [3]:
data_file_paths = list(Path('data/processed/en/').glob('*.txt'))
dataset_raw = tf.data.TextLineDataset(
	data_file_paths,
)

We've now loaded the `.txt` training data files using [`tf.data.TextLineDataset`](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset). Each line in the source files is mapped to a new training example. 

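Although some preprocessing has been done by `/data/process_data.py`, paragraphs haven't yet been filtered by length, and there is no unpunctuated input to train on. Let's derive a lowercase, punctuation-free `context` from each line and drop very short paragraphs: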

In [4]:
def filter_paragraphs(context, target):
	return tf.strings.length(context) > 5

punctuation_chars = r'\?!.,"\-\':'
def add_context(target):
	context = tf.strings.regex_replace(target, '[{}]+'.format(punctuation_chars), '')
	context = tf.strings.strip(
		tf.strings.regex_replace(context, '[ ]+', ' ')
	)
	context = tf.strings.lower(context)
	return context, target

dataset_raw = dataset_raw.map(add_context).filter(filter_paragraphs)

Now let's inspect the data:

In [5]:
for text, label in dataset_raw.take(4).as_numpy_iterator():
	print(text, label)

b'illustration' b'Illustration '
b'alices adventures in wonderland' b"Alice's Adventures in Wonderland"
b'by lewis carroll' b'by Lewis Carroll'
b'the millennium fulcrum edition 30' b'THE MILLENNIUM FULCRUM EDITION 3.0'
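
To make the `context`/`target` relationship concrete, here is a minimal plain-Python sketch of what `add_context` does to a single line, using the standard `re` module instead of `tf.strings`. The names `add_context_py` and `PUNCTUATION_CLASS` are hypothetical and exist only for this illustration. It also assumes that Python's `re` and the regex engine behind `tf.strings.regex_replace` agree on this simple pattern.

```python
import re

# Same punctuation set as punctuation_chars, written as a character class.
PUNCTUATION_CLASS = r"""[?!.,"\-':]+"""

def add_context_py(target: str):
    """Plain-Python mirror of add_context (illustrative only)."""
    context = re.sub(PUNCTUATION_CLASS, '', target)   # drop punctuation
    context = re.sub(r'[ ]+', ' ', context).strip()   # collapse repeated spaces
    return context.lower(), target

print(add_context_py("Alice's Adventures in Wonderland"))
print(add_context_py("THE MILLENNIUM FULCRUM EDITION 3.0"))
```

Note that removing `.` also merges `3.0` into `30`, as in the inspected data.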




### Batching

In [6]:
BUFFER_SIZE = 100_000
BATCH_SIZE = 16
dataset_train = dataset_raw.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

# Inspired by https://stackoverflow.com/a/74609848.
validate_size = 64
dataset_validate = dataset_train.take(validate_size)
dataset_train = dataset_train.skip(validate_size)
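
The `take`/`skip` split can be pictured with plain lists. The sketch below is an illustration in standard-library Python, not the `tf.data` implementation; `shuffled_batches` and `split_batches` are hypothetical helper names. One caveat, as far as the `tf.data` documentation indicates: `Dataset.shuffle` reshuffles on every iteration by default (`reshuffle_each_iteration=True`), so a split made after shuffling may hold out different batches each time the dataset is iterated. Passing `reshuffle_each_iteration=False` is one way to keep the split fixed.

```python
import random

def shuffled_batches(items, batch_size, rng):
    """Shuffle the items, then group them into batches."""
    order = list(items)
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def split_batches(batches, validate_size):
    """take(validate_size) for validation, skip(validate_size) for training."""
    return batches[:validate_size], batches[validate_size:]

rng = random.Random(0)
batches = shuffled_batches(range(20), batch_size=4, rng=rng)
validate, train = split_batches(batches, validate_size=2)

# Shuffling again before splitting generally changes which items are held out.
validate_again, _ = split_batches(shuffled_batches(range(20), 4, rng), 2)
print(sorted(sum(validate, [])), sorted(sum(validate_again, [])))
```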

### Preparing to process data

The [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer takes a `standardize` option that preprocesses input data. The default removes punctuation, but we don't want that. Let's redefine it:

In [7]:

def standardize_tf_text(text):
	punctuation_regex = '[{p}]'.format(p = punctuation_chars)

	# Surround punctuation with spaces for easier tokenization
	text = tf.strings.regex_replace(text, punctuation_regex, r' \0 ')

	# Remove repeated spaces
	text = tf.strings.regex_replace(text, r'\s+', ' ')

	# Add a special "capitalize the next letter" token
	text = tf.strings.regex_replace(text, r'(\s|^)([A-Z])', r' [CAP] \2')

	# Lowercase everything
	text = tf.strings.lower(text)

	text = tf.strings.strip(text)

	# Add sequence markings
	return tf.strings.join(['[START]', text, '[END]'], separator=' ')

print(standardize_tf_text('This is a test! It\'s working?!'))

tf.Tensor(b"[START] [cap] this is a test ! [cap] it ' s working ? ! [END]", shape=(), dtype=string)
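
For readers who want to step through the standardization without TensorFlow, here is a rough plain-Python equivalent built on `re`. The name `standardize_py` is hypothetical, and the sketch assumes the same punctuation set as `punctuation_chars`. It shows why `[CAP]` ends up as `[cap]` while `[START]` and `[END]` stay uppercase: the sequence markers are added after lowercasing.

```python
import re

PUNCTUATION = r"""[?!.,"\-':]"""

def standardize_py(text: str) -> str:
    """Plain-Python mirror of standardize_tf_text (illustrative only)."""
    text = re.sub(PUNCTUATION, r' \g<0> ', text)             # space out punctuation
    text = re.sub(r'\s+', ' ', text)                         # collapse whitespace
    text = re.sub(r'(\s|^)([A-Z])', r' [CAP] \2', text)      # mark capital letters
    text = text.lower().strip()                              # [CAP] becomes [cap]
    return ' '.join(['[START]', text, '[END]'])              # added after lowercasing

print(standardize_py("This is a test! It's working?!"))
```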


The text standardization function can now be used to preprocess text:

In [8]:
# Keep only the 3000 most commonly used tokens
max_vocab_size = 3000

# context_text_processor = tf.keras.layers.TextVectorization(
# 	standardize=standardize_tf_text,
# 	max_tokens=max_vocab_size,
# 	# Allow entries of different lengths
# 	ragged=True,
# )
# context_text_processor.adapt(dataset_train.map(lambda context, target: context))

# print('First 14 context words:', context_text_processor.get_vocabulary()[:14])
# print('Context vocab length', len(context_text_processor.get_vocabulary()))

target_text_processor = tf.keras.layers.TextVectorization(
	standardize=standardize_tf_text,
	max_tokens=max_vocab_size,
	# Allow entries of different lengths
	ragged=True,
)
target_text_processor.adapt(dataset_train.map(lambda context, target: target))

print('First 14 target words:', target_text_processor.get_vocabulary()[:14])

# The context data is the target data without punctuation and capitalization, so the
# target vocabulary (which also includes the punctuation tokens) can be reused for the context.
context_text_processor = target_text_processor

First 14 target words: ['', '[UNK]', '[cap]', ',', 'the', '.', '[START]', '[END]', 'and', 'to', 'of', 'i', 'a', 'you']




We can use these layers to convert to/from token IDs:


In [9]:
example_text = 'hello world this is a test tensorflow is processing this'
example_tokens = context_text_processor(example_text)
print('Example tokens', example_tokens)

context_vocab = np.array(context_text_processor.get_vocabulary())
tokens = context_vocab[example_tokens.numpy()]
print('Back to text', ' '.join(tokens))

Example tokens tf.Tensor([  6   1 266  41  23  12   1   1  23   1  41   7], shape=(12,), dtype=int64)
Back to text [START] [UNK] world this is a [UNK] [UNK] is [UNK] this [END]
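
The round trip above (text to IDs and back, with unknown words collapsing to `[UNK]`) can be sketched with a dictionary. This is a simplified stand-in for `TextVectorization`, not its implementation; `build_vocab`, `encode`, and `decode` are hypothetical names, and the sketch only assumes the layout seen in the vocabulary output: index 0 is the padding token `''` and index 1 is `[UNK]`.

```python
from collections import Counter

def build_vocab(token_lists, max_tokens):
    """Reserve '' (padding) and '[UNK]', then keep the most frequent tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

def encode(tokens, vocab):
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in tokens]   # 1 is the [UNK] id

def decode(ids, vocab):
    return [vocab[i] for i in ids]

corpus = [['a', 'b', 'a'], ['a', 'c', 'b']]
vocab = build_vocab(corpus, max_tokens=4)          # 'c' does not fit
ids = encode(['a', 'c', 'b'], vocab)
print(vocab, ids, decode(ids, vocab))
```

Decoding is lossy: once a word has been mapped to `[UNK]`, the original word cannot be recovered from the IDs alone.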


### Processing the data

Now, we'll:
1. Map the data through the text processors we just made.
2. Shift the target data, so that our network is provided with a history of generated tokens.

In [10]:
def process_text(context, target):
	return context_text_processor(context), target_text_processor(target)

def add_target_history(context, target):
	# .to_tensor(): Converts from RaggedTensors to Tensors.
	# We give our network the history as target_in
	target_in = target[:, :-1].to_tensor()
	target_out = target[:, 1:].to_tensor()
	return (context.to_tensor(), target_in), target_out
dataset_train = dataset_train.map(process_text).map(add_target_history).repeat()
dataset_validate = dataset_validate.map(process_text).map(add_target_history)

In [11]:

def inspect_dataset(dataset: tf.data.Dataset):
	target_vocab = np.array(target_text_processor.get_vocabulary())
	for (context, target_in), target_out in dataset.take(1):
		context_words = context_vocab[context[0]]
		print('context', ','.join(context_words))
		print('target_in', ','.join(target_vocab[target_in[0]]))
		print('target_out', ','.join(target_vocab[target_out[0]]))

inspect_dataset(dataset_train)

context [START],excited,beyond,his,usual,calm,[UNK],franz,rose,with,the,[UNK],and,was,about,to,join,the,loud,[UNK],[UNK],that,followed,but,suddenly,his,purpose,was,arrested,his,hands,fell,by,his,sides,and,the,[UNK],[UNK],[UNK],on,his,lips,[END],,,,,,,,,,,,,,,,,,,,,,
target_in [START],[cap],excited,beyond,his,usual,calm,[UNK],,,[cap],franz,rose,with,the,[UNK],,,and,was,about,to,join,the,loud,,,[UNK],[UNK],that,followed,but,suddenly,his,purpose,was,arrested,,,his,hands,fell,by,his,sides,,,and,the,half,-,uttered,[UNK],[UNK],on,his,lips,.,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
target_out [cap],excited,beyond,his,usual,calm,[UNK],,,[cap],franz,rose,with,the,[UNK],,,and,was,about,to,join,the,loud,,,[UNK],[UNK],that,followed,but,suddenly,his,purpose,was,arrested,,,his,hands,fell,by,his,sides,,,and,the,half,-,uttered,[UNK],[UNK],on,his,lips,.,[END],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
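
The shift performed by `add_target_history` is easiest to see on a single unpadded sequence. The sketch below uses plain Python lists of token strings; `shift_target` is a hypothetical name used only here. At position `i`, the network sees `target_in[:i + 1]` and is trained to predict `target_out[i]`.

```python
def shift_target(target_tokens):
    """Split one target sequence into decoder input and expected output."""
    target_in = target_tokens[:-1]    # everything except the final token
    target_out = target_tokens[1:]    # the same sequence, one step ahead
    return target_in, target_out

target = ['[START]', '[cap]', 'hello', '.', '[END]']
target_in, target_out = shift_target(target)
for seen, expected in zip(target_in, target_out):
    print(f'after {seen!r} predict {expected!r}')
```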


## Model

### The encoder

See https://www.tensorflow.org/text/tutorials/nmt_with_attention#the_encoder

In [12]:
class Encoder(tf.keras.Layer):
	def __init__(self, text_processor, units: int):
		"""
		Creates a new Encoder layer. [units] is the dimensionality of the token embeddings
		and of the GRU output for each input position.
		"""
		super(Encoder, self).__init__()
		self.text_processor = text_processor
		self.vocab_size = text_processor.vocabulary_size()
		self.units = units

		# Converts tokens -> vectors
		self.embedding = tf.keras.layers.Embedding(
			# mask_zero: Treats zero as a padding value that should be ignored
			self.vocab_size, units, mask_zero = True,
		)
		gru = tf.keras.layers.GRU(
			units, return_sequences = True,
			# Use the recurrent_initializer suggested by the tutorial (& the default
			# for kernel_initializer).
			recurrent_initializer='glorot_uniform'
		)
		self.rnn = tf.keras.layers.Bidirectional(
			# merge_mode determines how the forward and backward layers are combined
			#            'concat' is another option here
			merge_mode = 'sum',
			layer=gru,
		)

	def call(self, x):
		x = self.embedding(x)
		x = self.rnn(x)
		return x

	def convert_input(self, texts):
		texts = tf.convert_to_tensor(texts)
		if len(texts.shape) == 0:
			texts = texts[None]
		context = self.text_processor(texts).to_tensor()
		context = self(context)
		return context


Try it:

In [13]:
ENCODER_UNITS = 64
encoder = Encoder(context_text_processor, ENCODER_UNITS)

for (context, target_history), target_next in dataset_validate.take(1):
	encoder_result = encoder(context)
	print('Context tokens shape (batch, s):', context.shape)
	print('Encoder output shape (batch, s, ENCODER_UNITS):', encoder_result.shape)

Context tokens shape (batch, s): (16, 66)
Encoder output shape (batch, s, ENCODER_UNITS): (16, 66, 64)
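
The last dimension of the encoder output equals `ENCODER_UNITS` because the bidirectional layer uses `merge_mode = 'sum'`. A toy sketch of the two merge modes on a single time step, with plain lists standing in for the forward and backward GRU outputs (`merge` is a hypothetical helper, not a Keras function):

```python
def merge(forward, backward, merge_mode):
    """Combine forward and backward outputs for one time step."""
    if merge_mode == 'sum':
        return [f + b for f, b in zip(forward, backward)]   # size stays `units`
    if merge_mode == 'concat':
        return forward + backward                           # size becomes 2 * units
    raise ValueError(f'unsupported merge_mode: {merge_mode}')

forward = [0.1, 0.2, 0.3]
backward = [1.0, 2.0, 3.0]
print(len(merge(forward, backward, 'sum')), len(merge(forward, backward, 'concat')))
```

With `'concat'`, downstream layers would need to accept vectors twice as wide.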




### The attention layer

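Attention can be thought of as a trainable, soft lookup table: each `query` is compared against a set of keys, and the result is a weighted average of the corresponding values. The layer below takes a `query` (from the decoder) and a `value` (the encoder output). Because no separate `key` is passed, the values also serve as keys.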
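
As a rough illustration of the lookup-table idea, here is scaled dot-product attention for a single query in plain Python. It is a simplification: `MultiHeadAttention` additionally applies learned projections to the query, key, and value, which are omitted here. The names `softmax` and `attend` are hypothetical helpers for this sketch.

```python
import math

def softmax(scores):
    peak = max(scores)                              # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return the attention-weighted average of values, plus the weights."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attend([5.0, 0.0], keys, values)   # query points at the first key
print(weights, output)
```

The query aligns with the first key, so most of the weight (and therefore most of the output) comes from the first value.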

In [14]:
class CrossAttention(tf.keras.Layer):
	def __init__(self, units, **kwargs):
		super().__init__()
		self.attention_layer = tf.keras.layers.MultiHeadAttention(
			key_dim=units,
			num_heads=1,
			**kwargs
		)
		# Keeps "the mean activation within each example close to 0 and the
		# activation standard deviation close to 1" -- https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization?hl=en
		self.norm_layer = tf.keras.layers.LayerNormalization()
		self.add_layer = tf.keras.layers.Add()
		self.supports_masking = True

	def call(self, query, value):
		attention_output = self.attention_layer(
			query = query,
			value = value,
			#use_causal_mask=True,
			# Return the attention scores for later plotting
			# return_attention_scores = True,
		)

		x = self.add_layer([ query, attention_output ])
		x = self.norm_layer(x)
		return x


In [15]:
attention_layer = CrossAttention(ENCODER_UNITS)

# Test with an example
for (context, target_history), target_next in dataset_validate.take(1):
	embed_layer = tf.keras.layers.Embedding(target_text_processor.vocabulary_size(), output_dim=ENCODER_UNITS, mask_zero=True)
	target_embed = embed_layer(target_history)
	encoded_context = encoder(context)
	attention_result = attention_layer(target_embed, encoded_context)

	print('Encoded context sequence shape (batch, s, units):', encoded_context.shape)
	print('Target history sequence shape (batch, t, units):', target_embed.shape)
	print('Attention result shape (batch, t, units):', attention_result.shape)

	# Used later 
	test_encoded_context = encoded_context

Encoded context sequence shape (batch, s, units): (16, 58, 64)
Target history sequence shape (batch, t, units): (16, 69, 64)
Attention result shape (batch, t, units): (16, 69, 64)




### The decoder

The decoder produces queries for the attention layer. The decoder operates on `target_history`. At each step during training, it should have no information about future target output (that's what we're trying to determine). As such, we use a unidirectional RNN.


In [16]:
class CustomDense(tf.keras.layers.Dense):
	def __init__(self, *args, **kwargs):
		super(CustomDense, self).__init__(*args, **kwargs)
	
	def compute_mask(self, _inputs, mask=None):
		return mask

class Decoder(tf.keras.Layer):
	def __init__(self, text_processor, units):
		super(Decoder, self).__init__()
		self.text_processor = text_processor
		self.vocab_size = text_processor.vocabulary_size()
		self.units = units

		self.embedding_layer = tf.keras.layers.Embedding(
			# mask_zero: Treats zero as a padding value that should be ignored
			self.vocab_size, units, mask_zero = True,
		)
		self.rnn_layer = tf.keras.layers.GRU(
			units, return_sequences = True, return_state = True, recurrent_initializer='glorot_uniform',
		)
		self.attention_layer = CrossAttention(units)

		# Creates logits with the estimated probability of each output token
		self.output_layer = CustomDense(self.vocab_size)

		# Conversion:
		self.word_to_id = tf.keras.layers.StringLookup(
			vocabulary = text_processor.get_vocabulary(),
			mask_token = '',
			oov_token = '[UNK]',
		)
		self.id_to_word = tf.keras.layers.StringLookup(
			vocabulary = text_processor.get_vocabulary(),
			mask_token = '',
			oov_token = '[UNK]',
			invert = True,
		)
		# Pre-computing these simplifies exporting
		self.start_id = self.word_to_id('[START]')
		self.end_id = self.word_to_id('[END]')

		self.supports_masking = True
	
	def build(self, input_shape):
		# Nothing that needs a size allocation based on the input shape
		pass
	
	def call(self, context, target_history, state = None, return_state = False):
		x = self.embedding_layer(target_history)
		x, state = self.rnn_layer(x, initial_state = state)
		x = self.attention_layer(x, context)

		logits = self.output_layer(x)
		if return_state:
			return logits, state
		else:
			return logits
	
	## Conversion/testing ##

	def tokens_to_text(self, tokens):
		return tf.strings.reduce_join(self.id_to_word(tokens), separator = ' ')

	def generate_next_token(self, context, target_history, done_vec, state, temperature = 0.0):
		# Note: done_vec is a boolean tensor of shape (batch, 1), indicating whether each item in the batch is done

		logits, state = self(context, target_history, state = state, return_state = True)

		# logits has shape (batch, t, target_vocab_size). Only generate the token corresponding
		# to the last logits in the sequence (at t - 1)
		if temperature > 0:
			# Sample the next token from a categorical distribution
			next_token = tf.random.categorical(logits[:, -1, :] / temperature, num_samples = 1)
		else:
			# Greedy decoding: pick the most likely token at the last position
			next_token = tf.math.argmax(logits[:, -1:, :], axis=-1)
		# Emit 0 (padding) for sequences that were already done
		next_token = tf.where(done_vec, tf.constant(0, dtype=tf.int64), next_token)
		done_vec = done_vec|(next_token == self.end_id)
		return next_token, done_vec, state
	
	def get_initial_state(self, context):
		# context has shape (batch_size, s, units)
		batch_size = tf.shape(context)[0]
		start_tokens = tf.fill([batch_size, 1], self.start_id)
		done_vec = tf.zeros([batch_size, 1], dtype = tf.bool)

		# From the TensorFlow source code:
		# > RNN expect the states in a list, even if single state.
		# Note: Without the [0] we get a type mismatch while exporting.
		initial_state = self.rnn_layer.get_initial_state(batch_size)[0]

		return start_tokens, done_vec, initial_state

Let's try it!

In [17]:
def test_generation_loop():
	decoder = Decoder(target_text_processor, ENCODER_UNITS)
	next_token, done_vec, state = decoder.get_initial_state(test_encoded_context[:3, :, :])
	tokens = [next_token]

	for i in range(8):
		next_token, done_vec, state = decoder.generate_next_token(test_encoded_context[:3, :, :], next_token, done_vec, state)
		tokens.append(next_token)
	
	# Merge all batch outputs into a single dimension
	tokens = tf.concat(tokens, -1) # -1 = last axis

	print('Output:', decoder.tokens_to_text(tokens).numpy())

test_generation_loop()

Output: b'[START] lucien touched couldn god sunk gazed reward relief [START] lucien perfectly divine exercise john stock europe command [START] solitude tenderness finger spare see devil did bosom'
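
The output is gibberish because the decoder is untrained, but the token-selection step itself can be illustrated in isolation. The sketch below shows the two strategies that `generate_next_token` switches between: greedy selection when `temperature` is 0 and sampling otherwise. It uses plain Python rather than `tf.random.categorical`, and `pick_next_token` is a hypothetical name.

```python
import math
import random

def pick_next_token(logits, temperature, rng):
    """Choose a token index from unnormalized scores."""
    if temperature > 0:
        scaled = [l / temperature for l in logits]
        peak = max(scaled)
        weights = [math.exp(s - peak) for s in scaled]   # unnormalized softmax
        return rng.choices(range(len(logits)), weights=weights, k=1)[0]
    # temperature == 0: always pick the highest-scoring token
    return max(range(len(logits)), key=lambda i: logits[i])

rng = random.Random(0)
logits = [1.0, 5.0, 2.0]
greedy = pick_next_token(logits, 0.0, rng)
sampled = pick_next_token(logits, 1.0, rng)
print(greedy, sampled)
```

Lower temperatures concentrate the probability mass on the highest-scoring token, while higher temperatures make the samples more varied.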


## The model

We can now combine the encoder and decoder into a model that can be trained and then used to punctuate text:

In [18]:
class Punctuator(tf.keras.Model):
	def __init__(self, units, context_text_processor, target_text_processor):
		super().__init__()
		self.encoder = Encoder(context_text_processor, units)
		self.decoder = Decoder(target_text_processor, units)
	
	def call(self, inputs):
		context, target_history = inputs
		context = self.encoder(context)
		logits = self.decoder(context, target_history)
		return logits
	
	def fix_punctuation(self, text):
		context = self.encoder.convert_input(text)

		next_token, done_vec, state = self.decoder.get_initial_state(context)

		# A TensorArray allows more efficient exporting
		tokens = tf.TensorArray(tf.int64, size=0, dynamic_size=True)
		tokens = tokens.write(0, next_token)

		# Use a tf.range to allow dynamic loop size while exporting
		for i in tf.range(60):
			# token_history has size: (batch, t, target_vocab_size)
			# token_history = tf.concat(tokens, 1)
			# print('history', model.decoder.id_to_word(token_history))
			next_token, done_vec, state = self.decoder.generate_next_token(context, next_token, done_vec, state, temperature=0)
			tokens = tokens.write(i + 1, next_token)

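			# Stop early once every sequence in the batch has produced [END]. When this method is
			# traced (e.g. in a tf.function being prepared for export), AutoGraph converts this
			# conditional break into graph control flow.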
			if tf.reduce_all(done_vec):
				break
		
		# stack() returns shape (t, batch, 1); reorder to (batch, t)
		tokens = tokens.stack()
		batch_size = tf.shape(context)[0]
		tokens = tf.transpose(tf.reshape(tokens, [-1, batch_size]))
		return self.decoder.tokens_to_text(tokens)

In [19]:
model = Punctuator(ENCODER_UNITS, context_text_processor, target_text_processor)

for (example_context_tok, example_target_hist), _ in dataset_validate.take(1):
	test_logits = model((example_context_tok, example_target_hist))
	print('Context tokens shape (batch, s):', example_context_tok.shape)
	print('Target history tokens shape (batch, t):', example_target_hist.shape)
	print('Logits shape (batch, t, vocab_size)', test_logits.shape)

Context tokens shape (batch, s): (16, 63)
Target history tokens shape (batch, t): (16, 94)
Logits shape (batch, t, vocab_size) (16, 94, 3000)


In [20]:
model.summary()

To avoid penalizing masked outputs, we use a custom loss function (see the tutorial):

In [21]:
base_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def masked_loss(y_true, y_predict):
	loss = base_loss_fn(y_true, y_predict)
	
	unmasked = y_true != 0
	unmasked = tf.cast(unmasked, loss.dtype)
	# Only consider output with a corresponding label.
	loss *= unmasked

	count_unmasked = tf.math.reduce_sum(unmasked)

	# reduce_sum: Adds all entries of a vector.
	return tf.math.reduce_sum(loss)/count_unmasked

In [22]:
def masked_accuracy(y_true, predict_logits):
	predicted_index = tf.math.argmax(predict_logits, axis=-1)
	predicted_index = tf.cast(predicted_index, y_true.dtype)

	match = tf.cast(y_true == predicted_index, tf.float32)
	unmasked = tf.cast(y_true != 0, tf.float32)
	count_unmasked = tf.math.reduce_sum(unmasked)

	return tf.math.reduce_sum(match * unmasked) / count_unmasked
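
Both `masked_loss` and `masked_accuracy` follow the same pattern: compute a per-token value, zero out positions whose label is the padding ID `0`, and divide by the number of real tokens. A minimal plain-Python sketch of that pattern follows; `masked_mean` is a hypothetical helper, not part of the notebook.

```python
def masked_mean(per_token_values, labels, pad_id=0):
    """Average the values at positions whose label is not padding."""
    kept = [v for v, label in zip(per_token_values, labels) if label != pad_id]
    # Like the TensorFlow versions above, this assumes at least one real token.
    return sum(kept) / len(kept)

labels = [5, 7, 0]                       # the last position is padding
per_token_loss = [1.0, 3.0, 100.0]       # the padded position has a large, meaningless loss
per_token_match = [1.0, 0.0, 0.0]
print(masked_mean(per_token_loss, labels), masked_mean(per_token_match, labels))
```

Without the mask, the padded position would dominate the loss and make the accuracy look lower than it is.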


In [23]:
#model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=[masked_accuracy, masked_loss])
model.compile(optimizer='adam', loss=masked_loss, metrics=[masked_accuracy, masked_loss])

In [24]:
print('From the tutorial:')
vocab_size = float(target_text_processor.vocabulary_size())

print('expected loss', tf.math.log(vocab_size).numpy())
print('expected accuracy', 1/vocab_size)

From the tutorial:
expected loss 8.006368
expected accuracy 0.0003333333333333333
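
These baselines follow from assuming that an untrained model spreads its probability roughly uniformly over the vocabulary. Each token then gets probability `1 / vocab_size`, so the cross-entropy is `-ln(1 / vocab_size) = ln(vocab_size)`. The arithmetic can be checked without TensorFlow (the value `3000` is the vocabulary size reported above):

```python
import math

vocab_size = 3000
expected_loss = math.log(vocab_size)     # -ln(1 / vocab_size)
expected_accuracy = 1 / vocab_size       # chance of guessing the right token
print(round(expected_loss, 4), expected_accuracy)
```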


In [25]:
model.evaluate(dataset_validate, steps=20, return_dict=True)


20/20 ━━━━━━━━━━━━━━━━━━━━ 5s 47ms/step - loss: 8.0115 - masked_accuracy: 4.4516e-04 - masked_loss: 8.0115


{'loss': 8.012113571166992,
 'masked_accuracy': 0.0003151628188788891,
 'masked_loss': 8.012113571166992}

In [26]:


def test_punctuation(text):
	return '[test]: ' + model.fix_punctuation(text).numpy().decode('utf-8')

class DemoCallback(tf.keras.callbacks.Callback):
	def on_epoch_end(self, epoch_index: int, logs = None):
		print('\r', test_punctuation([ 'test this is a sample will it work if it does then how well' ]))
		if epoch_index % 10 == 0:
			# From the test data
			print(test_punctuation([
				'not that alice had any idea of doing that she felt as if she would never be able to talk again, she was getting so much out of breath and still the queen cried faster faster and dragged her along'
			]))
			print(test_punctuation([ 'tensorflow is a library that is used for machine learning it is available for more languages than just python' ]))
			print(test_punctuation([ 'the joplin note taking app can be used to take multimedia notes' ]))
			print(test_punctuation([ 'here are a few words javascript typescript python joplin interesting loud and sequence these words are all very useful' ]))

test_punctuation(tf.constant([ 'this is an example they said' ]))

'[test]: [START] usual william tail long date take mice date take political barrymore common supposed understood past why obey ease after task awakened china perfect giant dearest faria real informed dreadful lower feast accomplished advantage real sound was date them lock ceremony insisted am revenge merchant taught pick left villefort laugh resumed surprise ask peculiar madness off stayed unfortunate tears rays uncle'

In [27]:
history = model.fit(
	dataset_train,
	epochs = 40,
	steps_per_epoch = 300,
	validation_data = dataset_validate,
	callbacks=[DemoCallback()]
)

Epoch 1/40
300/300 ━━━━━━━━━━━━━━━━━━━━ 0s 90ms/step - loss: 5.8100 - masked_accuracy: 0.1244 - masked_loss: 5.8100



 [test]: [START] [cap] i have not not be [UNK] , and [cap] i have not not be [UNK] , and [cap] i have not not be [UNK] , and [cap] i have not not be [UNK] , and [cap] i have not not be [UNK] , and [cap] i have not not be [UNK] , and [cap] i have not not be
[test]: [START] [cap] i was a [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of the [UNK] of
[test]: [START] [cap] i have not [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and [UNK] [UNK] , and
[test]: [START] [cap] i have not [UNK] , and [UNK] , and [cap] i have not [UNK] , and [UNK] , and [cap] i have not [UNK] , and [UNK] , and [cap] i have not [UNK] , and [UNK] , and [cap]



 [test]: [START] [cap] [UNK] , [cap] [UNK] , [cap] [UNK] , [cap] i have not a [UNK] ? [END]
300/300 ━━━━━━━━━━━━━━━━━━━━ 32s 108ms/step - loss: 4.2345 - masked_accuracy: 0.2590 - masked_loss: 4.2345 - val_loss: 3.9079 - val_masked_accuracy: 0.2953 - val_masked_loss: 3.8478
Epoch 3/40
 [test]: [START] [cap] [UNK] this is a [UNK] will be it , if it is a good [UNK] . [END]
300/300 ━━━━━━━━━━━━━━━━━━━━ 33s 109ms/step - loss: 3.7194 - masked_accuracy: 0.3357 - masked_loss: 3.7194 - val_loss: 2.8522 - val_masked_accuracy: 0.4811 - val_masked_loss: 2.8083
Epoch 4/40
 [test]: [START] [cap] [UNK] this is a [UNK] will it , if if it does , then , said [cap] well . [END]
300/300 ━━━━━━━━━━━━━━━━━━━━ 33s 111ms/step - loss: 2.6325 - masked_accuracy: 0.5243 - masked_loss: 2.6325 - val_loss: 1.8455 - val_masked_acc



 [test]: [START] [cap] [UNK] this is a [UNK] will it work , if it does then how well ? [END]
300/300 ━━━━━━━━━━━━━━━━━━━━ 33s 110ms/step - loss: 0.6174 - masked_accuracy: 0.8517 - masked_loss: 0.6174 - val_loss: 0.5377 - val_masked_accuracy: 0.8536 - val_masked_loss: 0.5294
Epoch 11/40
 [test]: [START] [cap] [UNK] this is a [UNK] will it work if it does then , how well . [END]
[test]: [START] [cap] not that [cap] alice had any idea of doing that she felt as if she would never be able to talk again , she was getting so much out of breath , and still the queen , cried [cap] [UNK] , [UNK] , and dragged her along . [END]
[test]: [START] [cap] [UNK] is a library , that is used for [UNK] learning it is [UNK] for more [UNK] than just [UNK] . [END]
[test]: [START] [cap] the [cap] [UNK] [cap] douglass , taking [cap] [UNK] , can be used to take [UNK] notes . [END]
[test]: [START] [cap] here are a few words [cap] [UNK



 [test]: [START] [cap] [UNK] . [cap] this is a [UNK] will it work if it does then how well ? [END]
300/300 ━━━━━━━━━━━━━━━━━━━━ 32s 106ms/step - loss: 0.4256 - masked_accuracy: 0.8821 - masked_loss: 0.4256 - val_loss: 0.4554 - val_masked_accuracy: 0.8626 - val_masked_loss: 0.4484
Epoch 27/40
 [test]: [START] [cap] [UNK] . [cap] this is a [UNK] will it work if it does then how well ? [END]
300/300 ━━━━━━━━━━━━━━━━━━━━ 32s 107ms/step - loss: 0.4050 - masked_accuracy: 0.8870 - masked_loss: 0.4050 - val_loss: 0.3608 - val_masked_accuracy: 0.8834 - val_masked_loss: 0.3552
Epoch 28/40
 [test]: [START] [cap] [UNK] . [cap] this is a [UNK] will it work if it does then how well ? [END]
300/300 ━━━━━━━━━━━━━━━━━━━━ 33s 110ms/step - loss: 0.4048 - masked_accuracy: 0.8863 - masked_loss: 0.4048 - val_loss: 0.3790 - val_m

In [28]:

print(test_punctuation([
	'not that alice had any idea of doing that she felt as if she would never be able to talk again she was getting so much out of breath and still the queen cried faster faster and dragged her along'
]))
print(test_punctuation([ 'this is a test of a the punctuation system for i am curious how well it works' ]))

[test]: [START] [cap] not that [cap] alice had any idea of doing that she felt as if she would never be able to talk again . [cap] she was getting so much out of breath , and still the queen cried , [cap] [UNK] [UNK] and dragged her along . [END]
[test]: [START] [cap] this is a [UNK] of a [UNK] system , for [cap] i am curious how well it works . [END]


## Exporting

Based on the [Export](https://www.tensorflow.org/text/tutorials/nmt_with_attention#export) section of the tutorial:

In [29]:
class Export(tf.Module):
	def __init__(self, model):
		self.model = model
	
	@tf.function(input_signature=[tf.TensorSpec(dtype=tf.string, shape=[None])])
	def fix_punctuation(self, inputs):
		# Use the wrapped model rather than the notebook-level global
		return self.model.fix_punctuation(inputs)

Create the export wrapper, then run `fix_punctuation` once so that the `tf.function` is traced and compiled:

In [30]:
export = Export(model)

In [31]:
sample_inputs = tf.constant([ 'this sentence shall be punctuated for the following reasons first punctatuion makes things easier to read second um' ])
export.fix_punctuation(sample_inputs)



<tf.Tensor: shape=(), dtype=string, numpy=b'[START] [cap] this sentence shall be [UNK] , for the following reasons first [UNK] makes things [UNK] to read second [UNK] . [END]'>

Now we save the model:

In [32]:
tf.saved_model.save(export, 'punctuator-seq2seq', signatures={ 'serving_default': export.fix_punctuation })



INFO:tensorflow:Assets written to: punctuator-seq2seq/assets




See [the documentation](https://www.tensorflow.org/guide/saved_model#specifying_signatures_during_export) for information about the `signatures` option.

In [34]:
reloaded = tf.saved_model.load('punctuator-seq2seq')
# Warmup
reloaded.fix_punctuation(tf.constant(['this is a test is it not']))
print('Imported and warmed up!')

Imported and warmed up!




In [41]:
%%time
reloaded.fix_punctuation(tf.constant(['this sentence should end with a full stop']))


CPU times: user 17.1 ms, sys: 8.51 ms, total: 25.6 ms
Wall time: 10.2 ms


<tf.Tensor: shape=(), dtype=string, numpy=b'[START] [cap] this sentence should end with a full stop ? [END]'>
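
The exported function returns text in the tokenized training format (`[START]`, `[cap]`, spaced punctuation, `[END]`). A small post-processing step can turn that back into readable text. The following is a sketch under stated assumptions: `detokenize` is a hypothetical helper that is not part of the notebook, and it only reattaches `?`, `!`, `.`, `,` and `:`. Apostrophes, hyphens, and quotes are left spaced because their original attachment is ambiguous.

```python
import re

def detokenize(tokens_text: str) -> str:
    """Convert '[START] [cap] hello . [END]'-style output into plain text."""
    words = []
    capitalize_next = False
    for word in tokens_text.split():
        if word in ('[START]', '[END]'):
            continue                       # drop sequence markers
        if word == '[cap]':
            capitalize_next = True         # applies to the next real word
            continue
        if capitalize_next:
            word = word[:1].upper() + word[1:]
            capitalize_next = False
        words.append(word)
    text = ' '.join(words)
    # Remove the space that standardization inserted before punctuation
    return re.sub(r'\s+([?!.,:])', r'\1', text)

print(detokenize('[START] [cap] this sentence should end with a full stop ? [END]'))
```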