Conversation

jessechancy
Contributor

Randomly samples using the probability distribution provided by the input function.

Member

@mattdangerw mattdangerw left a comment

Looks good! Left some initial comments

)
if not isinstance(prompt, tf.Tensor):
prompt = tf.convert_to_tensor(prompt)
input_is_1d = prompt.shape.rank == 1
Member

I think this would be simpler if you left the upranking in the main function. Currently the upranking and downranking are split apart from each other, which is bad for readability.

Contributor Author

I've moved the section out of the helper function into the main function.

return prompt, input_is_1d
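As a sketch of the suggestion above, keeping the uprank and downrank next to each other in the main function (the function name and the elided generation loop here are illustrative, not the PR's exact code):

```python
import tensorflow as tf

def generate(token_probability_fn, prompt, max_length):
    # Uprank a 1D prompt to 2D on entry, remembering the original rank.
    prompt = tf.convert_to_tensor(prompt)
    input_is_1d = prompt.shape.rank == 1
    if input_is_1d:
        prompt = prompt[tf.newaxis, :]
    # ... the generation loop would append tokens to `prompt` here ...
    # Downrank back to 1D on exit, right next to the uprank above.
    if input_is_1d:
        prompt = tf.squeeze(prompt, axis=0)
    return prompt
```

This keeps the rank bookkeeping in one place, so a reader never has to chase it into a helper.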


def _mask_tokens_after_end_token(
Member

We don't need to lead with an underscore here. We choose which functions to "export" in our API by listing them in `__init__.py`. Underscores are only needed for private class methods.

Contributor Author

Removed underscores for both helper functions

# Build a mask including end_token and replace tokens after end_token
# with `pad_token_id`.
valid_indices = tf.sequence_mask(end_indices + 1, maxlen=max_length)
prompt = tf.where(valid_indices, prompt, pad_token_id)
Member

return directly

Contributor Author

Does this mean I should do `return tf.where(valid_indices, prompt, pad_token_id)`? Just edited.
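For reference, a self-contained sketch of the masking helper with the direct return; the first-occurrence `argmax` trick here assumes an end token is present in each row, and the signature mirrors the snippets in this thread:

```python
import tensorflow as tf

def mask_tokens_after_end_token(prompt, max_length, end_token_id, pad_token_id):
    # Index of the first `end_token_id` in each row (assumes one is present;
    # rows without an end token would need separate handling).
    end_indices = tf.math.argmax(
        tf.cast(tf.equal(prompt, end_token_id), tf.int32), axis=-1
    )
    # Build a mask including end_token, then replace tokens after end_token
    # with `pad_token_id` and return directly.
    valid_indices = tf.sequence_mask(end_indices + 1, maxlen=max_length)
    return tf.where(valid_indices, prompt, pad_token_id)
```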

pad_token_id=0,
):
"""
Text generation utility based on random sampling.
Member

... randomly sampling the entire probability distribution.

Contributor Author

edited

from keras_nlp.utils.text_generation import random_sampling


class TextGenerationTest(tf.test.TestCase):
Member

We should split a separate class for greedy, random, etc. They should mirror each other, but not be mixed in the same unit tests.

return prompt


def random_sampling(
Member

We need to make sure we have naming uniformity. Do we want

`greedy_search`, `random_search`, `top_k_search`? Or `greedy_sampling`, `random_sampling`, `top_k_sampling`?

Or something else?

Contributor Author

The way I was going to name them was greedy_search, beam_search, random_sampling, top_k_sampling, and top_p_sampling, because the latter three use probabilistic sampling techniques. But if uniformly calling them search is better, I'll change it to that.

Contributor

I would prefer a name random_sampling_search. @mattdangerw Matt, what do you think?

append generated tokens.
Returns:
a 2D Tensor, the prompt with shape [batch_size, max_length].
Contributor

This is not correct? I don't see padding in this function, so the width is not necessarily max_length.

Contributor Author

edited, this now says "a 1D or 2D Tensor, with the same shape as prompt."

inputs = tf.constant([1])
outputs = greedy_search(self.token_probability_fn, inputs, max_length=5)
self.assertEquals(outputs.shape, [5])
outputs = random_sampling(
Contributor

Now each test case is testing different generation algos, which could become unreadable when we have more utilities. Let's use parameterized test to pass in the utility to test. For example: https://github.com/keras-team/keras/blob/v2.9.0/keras/optimizers/optimizer_experimental/optimizer_test.py#L283

There is one thing I am not clear on: for different algos, the expected generation results can differ, so how should we test them in a clear way?

Contributor Author

The decoding methods I'll be adding are randomized, which makes them harder to test. Most of the tests don't test for value, but rather shape, so those can be reused for different decoding methods.

My testing would be based on seeding the decoding methods so that they generate a specific value, which is seen here: https://github.com/jessechancy/keras-nlp/blob/b08d6c6b49392e4791e2b9ece7f16c941837fa55/keras_nlp/utils/text_generation_test.py#L82.

I also have specific tests for top-k and top-p, which run the algorithm and make sure that the tokens that are supposed to be cut off won't ever be selected.

Member

I had a comment above, I think we should just split a separate test class for each utility. Parameterized will get messy.

Re random testing, we can always set a random seed for the test and check the entire output. Though that won't check we are actually sampling the distribution correctly, that's definitely harder.
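One way to make such tests deterministic is stateless sampling: with a fixed seed the draw is reproducible, so a test can assert exact outputs. A minimal sketch (the helper name is illustrative, not the PR's code):

```python
import tensorflow as tf

def sample_next_token(probs, seed):
    # Stateless categorical draw: the same seed always yields the same token.
    return tf.random.stateless_categorical(tf.math.log(probs), 1, seed=seed)

probs = tf.constant([[0.1, 0.2, 0.3, 0.4]])
# Two draws with the same seed are identical, so a seeded test can check
# the entire output; it still won't verify the distribution is sampled
# correctly, only that generation is reproducible.
a = sample_next_token(probs, seed=[42, 0])
b = sample_next_token(probs, seed=[42, 0])
```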

Contributor

@chenmoneygithub chenmoneygithub left a comment

Thanks! Mainly looks good, some minor comments!

rtol=0.2,
)

def test_seeded_end_token_id(self):
Contributor

This test case's name is a little confusing to me, what are we testing here? Are we testing generation with a given seed?

Contributor Author

This one is testing whether the end token is detected and anything after it is filled with the pad token. It needs to be seeded because it's still randomized when the end token is detected.

Contributor

I see, let's just call it test_end_token_id or test_handle_end_token_id; the seed here is something we don't need to expose to readers. The current name suggests that we are "seeding" the end_token_id.

Member

@mattdangerw mattdangerw left a comment

lgtm once we add the line to utils/__init__.py



def validate_prompt(prompt):
"""
Member

Hmm, with the whole docstring for a helper it's hard to tell it's just a helper function. In general I don't think you would need the whole args/return structure for something small like this.

Just `Helper function to validate input to text_generation utils.`


def mask_tokens_after_end_token(prompt, max_length, end_token_id, pad_token_id):
"""
Mask the tokens after the end token.
Member

Same here. Mention it's a helper, maybe kill the whole args/returns section.

return prompt


def random_search(
Member

We need to add this to the __init__.py for the utils dir, so this gets exported.
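The export would be a one-line addition per utility in the utils package's `__init__.py` (exact module path assumed from the imports shown earlier in this conversation):

```python
# keras_nlp/utils/__init__.py (sketch)
from keras_nlp.utils.text_generation import greedy_search
from keras_nlp.utils.text_generation import random_search
```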

Member

@mattdangerw mattdangerw left a comment

Thanks!

@mattdangerw mattdangerw merged commit 65349af into keras-team:master Jun 21, 2022