
Conversation

abheesht17
Collaborator

@abheesht17 abheesht17 commented Apr 16, 2022

Resolves #67

@mattdangerw, @chenmoneygithub, this PR is now ready for review :)

@abheesht17 abheesht17 mentioned this pull request Apr 16, 2022
@abheesht17 abheesht17 changed the title [WIP] Add Rouge-L Metric Add Rouge-L Metric Apr 16, 2022
@mattdangerw
Member

Thank you! This looks awesome!

Before diving into review, could you share a colab showing the expected end-to-end use case for this, with some string translations and references?

You should be able to add these lines to the top of the colab.

!git clone --branch your-branch-name your-remote-url
!cd keras-nlp && pip install . -q

It looks like we would need people to tokenize themselves before calling into this metric, which I think is OK (and maybe even preferable). But we should be clear what our expected end-to-end flow is. See as reference...

https://colab.sandbox.google.com/github/huggingface/notebooks/blob/master/course/videos/rouge_metric.ipynb

Maybe use TextVectorization for now to show how this could be done at a word level?
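
Something like this, perhaps (untested sketch; it assumes the branch is installed as above and that the metric consumes token IDs):

    import tensorflow as tf
    import keras_nlp  # installed from the branch, as above

    # Word-level tokenization with TextVectorization (toy data for illustration).
    vectorizer = tf.keras.layers.TextVectorization(output_mode="int")
    vectorizer.adapt(["the quick brown fox", "the fox jumped over the lazy dog"])

    # Ragged token-ID batches (the draft metric consumes token IDs).
    references = tf.RaggedTensor.from_tensor(vectorizer(["the quick brown fox"]))
    hypotheses = tf.RaggedTensor.from_tensor(vectorizer(["the fox jumped"]))

    rouge_l = keras_nlp.metrics.RougeL()
    rouge_l.update_state(references, hypotheses)
    print(rouge_l.result())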

It would also be helpful to compare this argument list and docstring to the ones we expect for the Rouge-N variant. Could you show what the full arg list and usage for Rouge-N could look like? I know we are still figuring out how to implement it, so that part could be a markdown code block, not actually runnable.
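
For example (purely hypothetical, just to anchor the discussion; arg names borrowed from the current Rouge-L draft):

    # Hypothetical Rouge-N signature, for comparison (not runnable).
    rouge_n = keras_nlp.metrics.RougeN(
        order=2,                 # n-gram order to match
        alpha=0.5,               # precision/recall weighting for the F1 score
        metric_type="f1_score",
        mask_token_ids=None,
        dtype=None,
        name="rouge_n",
    )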

@abheesht17
Collaborator Author

Sure thing! Will share a notebook ASAP :)

@abheesht17
Collaborator Author

Contributor

@chenmoneygithub chenmoneygithub left a comment

Generally looks good! Left some comments.

1.1. `mask_token_ids` not provided.
>>> tf.random.set_seed(42)
>>> rouge_l = keras_nlp.metrics.RougeL(name="rouge_l")
>>> references = tf.random.uniform(
Contributor

Prefer using a well-defined example rather than random data, so that users can manually calculate the F1 score.

Collaborator Author

Done!

alpha=0.5,
metric_type="f1_score",
mask_token_ids=None,
dtype=None,
Contributor

In the docstring it defaults to float32, which mismatches the default value of None. Please fix it.


def update_state(self, y_true, y_pred, sample_weight=None):
    # Both y_true and y_pred have shape: [batch_size, seq_len]. Note that
    # they can also be ragged tensors with shape [num_samples, (seq_len)].
Contributor

Rename num_samples to batch_size; we should be consistent with the naming.

Collaborator Author

My bad. Changed!

return config


def rouge_l(y_true, y_pred, alpha=0.5):
Contributor

We do not actually need to make this a standalone util. Just write

f1_scores, precisions, recalls = tf_text.metrics.rouge_l(
    y_pred, y_true, alpha=alpha
)

in the RougeL class.

Collaborator Author

I was following the format of metrics given in Keras. For example, check this out: https://github.com/keras-team/keras/blob/master/keras/metrics/metrics.py#L220, https://github.com/keras-team/keras/blob/master/keras/metrics/metrics.py#L3331.

The class helps in aggregating the ROUGE score (the user can iterate over the dataset, and the class will return the avg. ROUGE score).

The function allows string inputs, I think.

I've changed it to this for the time being:

f1_scores, precisions, recalls = tf_text.metrics.rouge_l(
    y_pred, y_true, alpha=alpha
)

but let me know which one is more appropriate and I'll change it to that. Thanks!
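
To illustrate the streaming use case the class enables (rough sketch; token-ID inputs and this branch installed are assumed):

    import tensorflow as tf
    import keras_nlp  # assuming this branch is installed

    rouge_l = keras_nlp.metrics.RougeL()

    # Toy token-ID batches; in practice these would come from iterating a dataset.
    batches = [
        (tf.ragged.constant([[1, 2, 3, 4]]), tf.ragged.constant([[1, 2, 3]])),
        (tf.ragged.constant([[5, 6]]), tf.ragged.constant([[5, 6, 7]])),
    ]
    for y_true, y_pred in batches:
        rouge_l.update_state(y_true, y_pred)

    print(rouge_l.result())  # average ROUGE-L over all batches seen so far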

Contributor

Thanks! My guess is there was some need to directly call the metric computing function, which I am not sure still applies. Let's keep it simple for now, and we can add the util if we find it is necessary.

Collaborator Author

Great! Let me know if further changes are required.



class RougeLTest(tf.test.TestCase):
def test_vars_after_initializing_class(self):
Contributor

This is a bit verbose; we can rename it to test_initialization().

Collaborator Author

@abheesht17 abheesht17 left a comment

@chenmoneygithub, thank you for the review comments! I've addressed them.


Contributor

@chenmoneygithub chenmoneygithub left a comment

Thanks!

@mattdangerw
Member

@abheesht17 thanks for the colab! That is super helpful. Thoughts...

I am still somewhat torn on whether we might want to somehow accept strings. The Hugging Face implementation is certainly a little more usable on the face of it, as is the rouge-score package it is based on. Though it seems like they will need to bake in a lot of assumptions (including language stemming!).

This brings up a lot of questions. Curious your thoughts here.

  • If you wanted to report rouge score, say for a paper, would you need to recreate the tokenization and stemming exactly like the package I linked, for comparability with other models?
  • How could you do that from our package?
  • Do people regularly report rouge with other tokenizers?
  • How do people handle other languages with ROUGE generally (especially ones without whitespace splitting for tokens)?

Second, a more minor point: why do you need to expand_dims on the first metric? Could we remove that requirement, or is that baked into keras.metrics somehow?

@chenmoneygithub chenmoneygithub self-requested a review April 25, 2022 22:01
@chenmoneygithub chenmoneygithub dismissed their stale review April 25, 2022 22:01

changes requested

@chenmoneygithub
Contributor

@abheesht17 Hi, we had some discussions around this, and also reached out to the Google research team for their insights. Briefly, we will have the ROUGE metric work on string inputs, and by default provide a standard tokenizer so that different works report ROUGE scores based on the same calculation mechanism.

To make our work compatible with existing ROUGE score reporting, we can make this metric depend on the rouge_score package. We still want to deliver Keras metrics so that ROUGE can be easily integrated into the training flow. To do that, one possible way is to use tf.py_function() to wrap the rouge function call from the existing rouge_score package.
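
Roughly along these lines (sketch only; scalar string inputs assumed):

    import tensorflow as tf
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

    def rouge_l_f1(y_true, y_pred):
        # Inside tf.py_function the arguments are eager string tensors,
        # so we can pull out Python strings and call the pure-Python package.
        score = scorer.score(
            y_true.numpy().decode("utf-8"), y_pred.numpy().decode("utf-8")
        )["rougeL"]
        return score.fmeasure

    f1 = tf.py_function(
        rouge_l_f1,
        [tf.constant("the cat sat"), tf.constant("the cat sat down")],
        tf.float32,
    )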

We can leave tokenizer customization as a TODO until we know how to make our tokenizers compatible with the rouge_score package.

Sorry about the inconvenience!

@abheesht17
Collaborator Author

Hello, @chenmoneygithub! Thank you! Sorry for not replying earlier to @mattdangerw's questions.

So, essentially, the gist is:

  1. We will still keep ROUGE-L as a subclass of keras.metrics.Metric.
  2. We will use Google Research's ROUGE package.
  3. We will take string inputs.
  4. For future work, we can figure out how to make our tokenisers compatible with the package's tokenisers and give the user an option to pass an arg for that.

I had a brief look at the tokenisers provided by the package. Their default tokeniser merely returns a list of string tokens:

>>> df = DefaultTokenizer()
>>> df.tokenize("hello, this is fun")
['hello', 'this', 'is', 'fun']

If you see this file, we need to do two things to implement a custom tokeniser:

  1. The tokeniser should be a subclass of the Tokenizer class present in the ROUGE package.
  2. It should have a tokenize() method which returns a list of string tokens.

I don't think (1) is very important; after all, in the internal ROUGE package implementation, they will just call the tokenize method. See here: https://github.com/google-research/google-research/blob/master/rouge/rouge_scorer.py#L73 and https://github.com/google-research/google-research/blob/master/rouge/rouge_scorer.py#L125.
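
In other words, something as minimal as this should be accepted (hypothetical sketch):

    # Hypothetical tokenizer for rouge_score: all the scorer really needs is
    # a tokenize() method that returns a list of string tokens.
    class WhitespaceTokenizer:
        def tokenize(self, text):
            return text.lower().split()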

Currently, our tokenisers return a RaggedTensor on passing a string:

String output.
>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = "The quick brown fox."
>>> tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
...     vocabulary=vocab, dtype="string")
>>> tokenizer(inputs)
<tf.RaggedTensor [[b'the', b'qu', b'##ick', b'br', b'##own', b'fox', b'.']]>

So, I guess in the tokenizer class in KerasNLP, we can simply add an option, return_list, and if this is True, we can convert the RaggedTensor output to a list. I'll try this out.
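
Continuing the snippet above, the conversion itself would be straightforward (sketch; return_list is a proposed arg, not an existing one):

    ragged = tokenizer(inputs)  # <tf.RaggedTensor [[b'the', b'qu', ...]]>
    # to_list() yields nested Python lists of bytes; decode to get str tokens.
    tokens = [t.decode("utf-8") for t in ragged.to_list()[0]]
    # ['the', 'qu', '##ick', 'br', '##own', 'fox', '.']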

@abheesht17
Collaborator Author

abheesht17 commented Apr 26, 2022

Small update:

The functionality for split_summaries and providing a custom tokeniser was only recently added to the rouge-score package. The authors have not released a version on PyPI with the above two functionalities. So, I'm just going for a basic implementation right now.

See these two commits:

  1. google-research/google-research@61ce9f0 (custom tokeniser)
  2. google-research/google-research@ed3e2bc (split summaries)

@abheesht17
Collaborator Author

@mattdangerw, @chenmoneygithub, how do I convert a string Tensor to a Python string (graph ops)? The rouge-score package does not accept string tensors (I tried passing a tensor with a single string).

>>> scorer.score(tf.constant("hello"), tf.constant("bye"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\abheesht\anaconda3\envs\keras_nlp\lib\site-packages\rouge_score\rouge_scorer.py", line 88, in score
    target_tokens = tokenize.tokenize(target, self._stemmer)
  File "C:\Users\abheesht\anaconda3\envs\keras_nlp\lib\site-packages\rouge_score\tokenize.py", line 42, in tokenize
    text = text.lower()
  File "C:\Users\abheesht\anaconda3\envs\keras_nlp\lib\site-packages\tensorflow\python\framework\ops.py", line 513, in __getattr__
    self.__getattribute__(name)
AttributeError: 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'lower'

I tried a bunch of things; none of them seem to work. str(tensor_var) works in eager mode, but won't work in graph mode. tf.strings.as_string did not work either.

>>> tf.strings.as_string(tf.constant("weifwf"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\abheesht\anaconda3\envs\keras_nlp\lib\site-packages\tensorflow\python\ops\gen_string_ops.py", line 74, in as_string
    _ops.raise_from_not_ok_status(e, name)
  File "C:\Users\abheesht\anaconda3\envs\keras_nlp\lib\site-packages\tensorflow\python\framework\ops.py", line 7186, in raise_from_not_ok_status     
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: Value for attr 'T' of string is not in the list of allowed values: float, double, int32, uint8, int16, int8, int64, bfloat16, uint16, half, uint32, uint64, complex64, complex128, bool, variant
        ; NodeDef: {{node AsString}}; Op<name=AsString; signature=input:T -> output:string; attr=T:type,allowed=[DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT16, DT_INT8, DT_INT64, DT_BFLOAT16, DT_UINT16, DT_HALF, DT_UINT32, DT_UINT64, DT_COMPLEX64, DT_COMPLEX128, DT_BOOL, DT_VARIANT]; attr=precision:int,default=-1; attr=scientific:bool,default=false; attr=shortest:bool,default=false; attr=width:int,default=-1; attr=fill:string,default=""> [Op:AsString]

@abheesht17 abheesht17 changed the title Add Rouge-L Metric Add Rouge Metric May 23, 2022
@abheesht17 abheesht17 changed the title Add Rouge Metric Add ROUGE Metric May 23, 2022
@abheesht17
Collaborator Author

abheesht17 commented May 23, 2022

@mattdangerw, @chenmoneygithub, I've made the required changes. Apologies once again for the delay!

Still confused about the graph ops stuff though. Since we are using tf.py_function, we can't really pass an object of this class to model.compile. I tried it out with @tf.function and it fails, which means it won't work with model.compile.

P.S. Haven't added examples in the doc-string yet.
P.P.S. It's nice to be back! :)

Member

@mattdangerw mattdangerw left a comment

Thanks! This looks good to me.

Made a colab to play around, and it looks like there is at least one shape issue. This metric seems to only support shape (batch_size) inputs, but we should also support (batch_size, 1).

https://colab.sandbox.google.com/gist/mattdangerw/104626168c0bce36f12679b2dd38ce23/rouge-test.ipynb

We should also discuss whether we want to make this two metrics (maybe with a common base class, unsure).


if rouge_score is None:
    raise ImportError(
        "ROUGE metric requires the `rouge_score` package."
Member

space after period

if rouge_score is None:
    raise ImportError(
        "ROUGE metric requires the `rouge_score` package."
        "Please install it with `pip install rouge_score`."
Member

rouge_score -> rouge-score

    score = score.recall
else:
    score = score.fmeasure
return score
Member

A Keras metric can just return a dict of scalars, I believe. Should we just return a dict here, and from the metric overall?

Collaborator Author

Ah, I was not aware of this. Will return a dictionary.

Collaborator Author

@mattdangerw, looks like this doesn't work; an error is thrown when I return a dictionary from result(). Have a look at this snippet of code:

>>> import keras_nlp
>>> y_true = "hey, this is great fun"
>>> y_pred = "great fun indeed"
>>> rouge = keras_nlp.metrics.RougeN(order=2)
>>> rouge(y_true, y_pred)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/abheesht/python_envs/keras_nlp/lib/python3.8/site-packages/keras/metrics/base_metric.py", line 200, in __call__
    return distributed_training_utils.call_replica_local_fn(
  File "/home/abheesht/python_envs/keras_nlp/lib/python3.8/site-packages/keras/distribute/distributed_training_utils.py", line 60, in call_replica_local_fn
    return fn(*args, **kwargs)
  File "/home/abheesht/python_envs/keras_nlp/lib/python3.8/site-packages/keras/metrics/base_metric.py", line 196, in replica_local_fn
    result_t._metric_obj = self  # pylint: disable=protected-access
AttributeError: 'dict' object has no attribute '_metric_obj'

However, this works:

>>> import keras_nlp
>>> y_true = "hey, this is great fun"
>>> y_pred = "great fun indeed"
>>> rouge = keras_nlp.metrics.RougeN(order=2)
>>> rouge.update_state(y_true, y_pred)
>>> rouge.result()
{'rouge_n_precision': <tf.Tensor: shape=(), dtype=float32, numpy=0.5>, 'rouge_n_recall': <tf.Tensor: shape=(), dtype=float32, numpy=0.25>, 'rouge_n_f1_score': <tf.Tensor: shape=(), dtype=float32, numpy=0.33333334>}

Reverting back to the metric_type implementation.

def result(self):
    if self._number_of_samples == 0:
        return 0.0
    rouge_l_score = self._rouge_score / self._number_of_samples
Member

Why is this variable called rouge_l_score when it could be ROUGE-L or ROUGE-N? Seems like a confusing name.

Collaborator Author

Yeah, this is a typo. Corrected in the latest commit!

ROUGE-L and ROUGE-LSum.

Args:
variant: string. One of "rougeN", "rougeL", "rougeLsum". Defaults to
Member

This feels too tricky, particularly with the order of RougeN being a hidden parameter of the string passed to this argument.

What about making a separate RougeL and RougeN metric class?

Collaborator Author

Discussed offline with Matt. We are going ahead with separate classes for ROUGE-N and ROUGE-L!

Member

@mattdangerw mattdangerw left a comment

Looking good! Left some comments. Most of the comments on RougeL will also apply to RougeN.

Making a private base class for common logic will make refactoring simpler.

Still trying to get answers on the best return type for this metric.

# strings in the tensor/list.

# Check if input is a raw string/list.
if isinstance(y_true, str):
Member

Why not just

if not isinstance(y_true, tf.Tensor):
    y_true = tf.convert_to_tensor(y_true)
if not isinstance(y_pred, tf.Tensor):
    y_pred = tf.convert_to_tensor(y_pred)

I don't think we should do the rank coercion in this test case. That would seem to support scalar inputs only when they are not tensors, but not when they are tensors, which is weird behavior.

Convert to tensor first, then fix rank.


def update_state(self, y_true, y_pred, sample_weight=None):
    # Three possible shapes for y_true and y_pred: Python string,
    # [batch_size] and [batch_size, 1]. In the latter two cases, we have
Member

It seems like we also support scalar inputs; is that true?

We should probably move some of this discussion on supported shape into the docstring.

Collaborator Author

Yep, I've written "Python string" on line number 96.

Sure, will move it to the doc-string!


# If the shape of y_true and y_pred is [batch_size, 1], squeeze it to
# [batch_size].
if y_true.shape.rank == 2:
Member

This would fail if the shape is [batch_size, 2] right now, in a not very helpful way. The friendliest thing here might be to check whether we have a supported shape (rank 0, rank 1, or rank 2 with shape[-1] == 1), and if not, give a friendly error message.
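
Rough sketch of what I mean (names are placeholders):

    def validate_and_fix_rank(inputs, tensor_name):
        # Placeholder helper: accept rank 0, rank 1, or rank 2 with shape[-1] == 1.
        if not isinstance(inputs, (tf.Tensor, tf.RaggedTensor)):
            inputs = tf.convert_to_tensor(inputs)
        if inputs.shape.rank == 0:
            return inputs[tf.newaxis]
        elif inputs.shape.rank == 1:
            return inputs
        elif inputs.shape.rank == 2 and inputs.shape[1] == 1:
            return tf.squeeze(inputs, axis=1)
        raise ValueError(
            f"{tensor_name} must be of rank 0 (scalar), rank 1, or rank 2 "
            f"with shape [batch_size, 1]. Found rank: {inputs.shape.rank}"
        )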

Collaborator Author

Right! Changes made 👍🏼

name="rouge_l_test",
)

config = rouge.get_config()
Member

This feels fragile; we are essentially testing what the base metric class has in its config. Currently that is only dtype and name, but it could change.

Maybe just assert the contents of the config you expect, metric_type and use_stemmer.

from keras_nlp.metrics import RougeL


class RougeLTest(tf.test.TestCase):
Member

We should test this with a model (which maybe just passes through inputs), and a batched tf.data.Dataset.

Contributor

We should test passing this to model.compile()'s metrics arg.
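
e.g., something like (sketch; exact assertions up to you):

    import tensorflow as tf
    from tensorflow import keras

    from keras_nlp.metrics import RougeL

    class RougeLTest(tf.test.TestCase):
        def test_metric_in_model_compile(self):
            # Pass-through model: the outputs are the string inputs themselves.
            inputs = keras.Input(shape=(), dtype="string")
            model = keras.Model(inputs, inputs)
            model.compile(metrics=[RougeL()])

            x = tf.constant(["the tiny little cat was found under the bed"])
            y = tf.constant(["the cat was under the bed"])
            model.evaluate(x, y, return_dict=True)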

tf.cast(batch_size, dtype=self.dtype)
)

def result(self):
Member

Open question: how are these ROUGE scores usually aggregated when reported in a paper? We may want to do a little bit of a dive into this, to understand what the critical user journeys are when reporting an aggregate score.

Looking at the aggregation code in the package, they have a lot more there...
https://github.com/google-research/google-research/blob/master/rouge/scoring.py#L61

OK if we don't need all that, but we should make sure we understand what people will want when using this metric.

Collaborator Author

Hmmm, good point. I went through some examples online. In particular, I went through PyTorch Ignite's ROUGE metric. Have a look: https://pytorch.org/ignite/_modules/ignite/metrics/nlp/rouge.html#Rouge (_BaseRouge class).

They take the average:

    def compute(self) -> Mapping:
        if self._num_examples == 0:
            raise NotComputableError("Rouge metric must have at least one example before be computed")

        return {
            f"{self._metric_name()}-P": float(self._precision / self._num_examples),
            f"{self._metric_name()}-R": float(self._recall / self._num_examples),
            f"{self._metric_name()}-F": float(self._fmeasure / self._num_examples),
        }

Member

Sounds good to me! We can always see if people open up issues. Thanks for checking!

rouge_score = None


class RougeL(keras.metrics.Metric):
Member

I think it would still be a good idea to do some code sharing between these two metrics.

What if we do this...

  • Move everything back into rouge.py and rouge_test.py.
  • Add a base class RougeBase that contains most of the logic.
  • In `__init__.py`, only export RougeL and RougeN.

That would be similar to how core Keras handles Conv2D and Conv3D, for example.

Contributor

+1

Collaborator Author

@abheesht17 abheesht17 Jun 3, 2022

Done! However, I have kept two separate files for unit tests, rouge_n_test.py and rouge_l_test.py, since RougeN and RougeL are what will eventually be exposed to the user. Let me know if you want only one test script (for RougeBase).

not specified, it defaults to tf.float32.
name: string. Name of the metric instance.
**kwargs: Other keyword arguments.
"""
Member

Add some docstring examples! I think the >>> style, with actual output, would be useful in this case.

Collaborator Author

Yeah, I forgot to do this. I've added examples in the new commit!

Contributor

@chenmoneygithub chenmoneygithub left a comment

Thanks for the PR!


between the reference text and the hypothesis text.

Args:
order: The order of n-grams which are to be matched. It should lie in
Contributor

Just curious: is this [1, 9] a requirement from the rouge-score package?

Collaborator Author

Yep!

>>> from rouge_score import rouge_scorer
>>> rg = rouge_scorer.RougeScorer(rouge_types=["rouge10"])
>>> rg.score("hey", "hey, hello")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/abheesht/python_envs/keras_nlp/lib/python3.8/site-packages/rouge_score/rouge_scorer.py", line 119, in score
    raise ValueError("Invalid rouge type: %s" % rouge_type)
ValueError: Invalid rouge type: rouge10

rouge_score = None


class RougeN(keras.metrics.Metric):
Contributor

Reading through both class implementations, most code can be shared between them, so it looks doable to me to have a RougeBase class.

Collaborator Author

@abheesht17 abheesht17 left a comment

@mattdangerw, @chenmoneygithub, thanks for the comments! I've addressed them.


Member

@mattdangerw mattdangerw left a comment

Looks good! Left some minor comments. The main thing we need to dig into now is what is going wrong when returning a dict. This will probably require some looking into the base metric class in core Keras.

Thanks!


class RougeBase(keras.metrics.Metric):
    """ROUGE metric.
    This class implements all the variants of the ROUGE metric - ROUGE-N,
Member

white space before and after this paragraph

use_stemmer: bool. Whether Porter Stemmer should be used to strip word
suffixes to improve matching. Defaults to False.
dtype: string or tf.dtypes.Dtype. Precision of metric computation. If
not specified, it defaults to tf.float32.
Member

fix alignment of this line

def __init__(
    self,
    variant="rouge2",
    metric_type="f1_score",
Member

We still need to figure out why returning a dict is not working, and see if there is a bug that needs to be fixed in core Keras or elsewhere.

We should not ship an API signature we don't want because of a bug we need to fix!

("rouge" + str(order) for order in range(1, 10))
) + (
"rougeL",
"rougeLsum",
Member

rougeLsum we aren't supporting right now, correct?

# [batch_size] and [batch_size, 1]. In the latter two cases, we have
# strings in the tensor/list.

def validate_and_fix_rank(input_, tensor_name):
Member

Generally, a trailing underscore is not a naming pattern we follow.

Just call this inputs?

Succinctly put, ROUGE-L is a score based on the length of the longest
common subsequence present in the reference text and the hypothesis text.

Note on input shapes:
Member

I would just comment on the shapes here (not the types). So just say supports scalar and batch inputs of shape (), (batch_size,) and (batch_size, 1).

Member

@mattdangerw mattdangerw left a comment

Thanks! This looks good to me pending the dict discussion.

# class wraps the `results()` method.
obj = super().__new__(cls)

class MetricDict(dict):
Contributor

Is this class necessary? It seems this is just an alias to dict.

Also, let's create a TODO here for future cleanup; this code is hard to maintain.

Collaborator Author

Actually, the reason for defining a class is so that we can do object.var_name-style attribute assignments.
If we use a plain dictionary, this error crops up:

    def replica_local_fn(*args, **kwargs):
      """Updates the state of the metric in a replica-local context."""
      if any(
          isinstance(arg, keras_tensor.KerasTensor)
          for arg in tf.nest.flatten((args, kwargs))):
        update_op = None
      else:
        update_op = self.update_state(*args, **kwargs)  # pylint: disable=not-callable
      update_ops = []
      if update_op is not None:
        update_ops.append(update_op)
      with tf.control_dependencies(update_ops):
        result_t = self.result()  # pylint: disable=not-callable
    
        # We are adding the metric object as metadata on the result tensor.
        # This is required when we want to use a metric with `add_metric` API on
        # a Model/Layer in graph mode. This metric instance will later be used
        # to reset variable state after each epoch of training.
        # Example:
        #   model = Model()
        #   mean = Mean()
        #   model.add_metric(mean(values), name='mean')
>       result_t._metric_obj = self  # pylint: disable=protected-access
E       AttributeError: 'dict' object has no attribute '_metric_obj'
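
A quick illustration of the difference:

    # A plain dict has no __dict__, so attribute assignment fails...
    d = {}
    try:
        d._metric_obj = None
    except AttributeError as e:
        print(e)  # 'dict' object has no attribute '_metric_obj'

    # ...but instances of a trivial dict subclass accept attributes just fine.
    class MetricDict(dict):
        pass

    md = MetricDict()
    md._metric_obj = None  # works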

Member

Yeah, definitely the plan would be to remove this code after 2.10 is out!

Member

@mattdangerw mattdangerw left a comment

Approving! Thanks!

Just a few more minor comments.

for metric_type, expected_val in zip(
    self.metric_types, [1, 0.689, 0.807]
):
    self.assertAlmostEqual(
Member

Can we assert a whole dict structure in here? If so, I would find that a lot more readable than the way it's done here. (Here and elsewhere.)

assertAlmostEqual(rouge_output, {
    "rouge-l_precision": x,
    "rouge-l_recall": y,
    "rouge-l_f1_score": z,
})

Collaborator Author

Right - I can make a custom function for this!
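
Something like this, perhaps (hypothetical helper):

    def assertDictAlmostEqual(self, actual, expected, delta=1e-3):
        # Hypothetical helper: compare keys exactly and values approximately.
        self.assertEqual(set(actual.keys()), set(expected.keys()))
        for key in expected:
            self.assertAlmostEqual(float(actual[key]), expected[key], delta=delta)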

rouge_recall = self._rouge_recall / self._number_of_samples
rouge_f1_score = self._rouge_f1_score / self._number_of_samples
return {
    f"{self.name}_precision": rouge_precision,
Member

Quick note: I think we should remove the f"{self.name}_" part. We would like to make a change to core Keras to actually join the metric name when reporting metrics in a dict.

So if the metric is called "rouge-2", we would join the name when returning the metric dict, giving "rouge-2/recall" or something like that.

Collaborator Author

Should I remove it now, or later when the fix for the bug has been released?

Collaborator Author

Removed it for now. Let me know if you want to revert it back to f"{self.name}_".

@mattdangerw mattdangerw merged commit 0e3d12e into keras-team:master Jun 17, 2022