Add BLEU Score #222
@@ -0,0 +1,385 @@
# Copyright 2022 The KerasNLP Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""BLEU metric implementation."""

import collections
import math

import tensorflow as tf
from tensorflow import keras

from keras_nlp.utils.tensor_utils import tensor_to_list
from keras_nlp.utils.tensor_utils import tensor_to_string_list

REPLACE_SUBSTRINGS = [
    ("<skipped>", ""),
    ("-\n", ""),
    ("\n", " "),
    ("&quot;", '"'),
    ("&amp;", "&"),
    ("&lt;", "<"),
    ("&gt;", ">"),
]


REGEX_PATTERNS = [
    # language-dependent part (assuming Western languages)
    (r"([\{-\~\[-\` -\&\(-\+\:-\@\/])", r" \1 "),
    # tokenize period and comma unless preceded by a digit
    (r"([^0-9])([\.,])", r"\1 \2 "),
    # tokenize period and comma unless followed by a digit
    (r"([\.,])([^0-9])", r" \1 \2"),
    # tokenize dash when preceded by a digit
    (r"([0-9])(-)", r"\1 \2 "),
    # If the last character is "." or ",", add a space before it.
    (r"([\.,])$", r" \1 "),
    # one space only between words
    (r"\s+", r" "),
]


class Bleu(keras.metrics.Metric):
    """BLEU metric.

    This class implements the BLEU metric. BLEU is generally used to
    evaluate machine translation systems. By default, this implementation
    replicates SacreBLEU, but user-defined tokenizers can be passed to deal
    with other languages.

    For BLEU score, we count the number of matching n-grams in the candidate
    translation and the reference text. We find the "clipped count" of
    matching n-grams so as to not give a high score to a
    (reference, prediction) pair with redundant, repeated tokens. Secondly,
    BLEU score tends to reward shorter predictions more, which is why a
    brevity penalty is applied to penalise short predictions.

    Note on input shapes:
    For `y_true` and `y_pred`, this class supports the following shapes:
    If `y_pred` is a scalar value, `y_true` has to be a 1D dense tensor.
    For batched inputs, if `y_pred` is a 1D dense tensor, `y_true` has to be
    a dense/ragged tensor with shape `(batch_size, None)`.

    Args:
        tokenizer: callable. A function that takes a string `tf.RaggedTensor`
            (of any shape), and tokenizes the strings in the tensor. This
            function should use TensorFlow graph ops. If the tokenizer is
            not specified, the default tokenizer is used. The default
            tokenizer replicates the behaviour of SacreBLEU's
            `"tokenizer_13a"` tokenizer
            (https://github.com/mjpost/sacrebleu/blob/v2.1.0/sacrebleu/tokenizers/tokenizer_13a.py).
        max_order: int. The maximum n-gram order to use. For example, if
            `max_order` is set to 3, unigrams, bigrams, and trigrams will be
            considered. Defaults to 4.
        smooth: bool. Whether to apply add-one smoothing (Lin et al., 2004)
            to the BLEU score. This adds 1 to the matched n-gram count
            (i.e., the numerator) and 1 to the total n-gram count (i.e.,
            the denominator) for every order while calculating precision.
            Defaults to False.
        dtype: string or tf.dtypes.Dtype. Precision of metric computation.
            If not specified, it defaults to tf.float32.
        name: string. Name of the metric instance.
        **kwargs: Other keyword arguments.

    References:
        - [Papineni et al., 2002](https://aclanthology.org/P02-1040/)
        - [SacreBLEU](https://github.com/mjpost/sacrebleu)
        - [Lin et al., 2004](https://aclanthology.org/P04-1077/)
    """

    def __init__(
        self,
        tokenizer=None,
        max_order=4,
        smooth=False,
        dtype=None,
        name="bleu",
        **kwargs,
    ):
        super().__init__(name=name, dtype=dtype, **kwargs)

        if not tf.as_dtype(self.dtype).is_floating:
            raise ValueError(
                "`dtype` must be a floating point type. "
                f"Received: dtype={dtype}"
            )

        def default_tokenizer(inputs):
            """Default tokenizer.

            Replicates the behaviour of SacreBLEU's default tokenizer,
            namely, `tokenizer_13a`.
            """
            for pattern, replacement in REPLACE_SUBSTRINGS + REGEX_PATTERNS:
                inputs = tf.strings.regex_replace(
                    input=inputs,
                    pattern=pattern,
                    rewrite=replacement,
                    replace_global=True,
                    name=None,
                )
            inputs = tf.strings.split(inputs)
            return inputs

        if tokenizer is None:
            self.tokenizer = default_tokenizer
        else:
            self.tokenizer = tokenizer
        self.max_order = max_order
        self.smooth = smooth

        self._matches = self.add_weight(
            shape=(self.max_order,),
            name="bleu_matches",
            initializer="zeros",
            dtype=self.dtype,
        )
        self._possible_matches = self.add_weight(
            shape=(self.max_order,),
            name="bleu_possible_matches",
            initializer="zeros",
            dtype=self.dtype,
        )
        self._translation_length = self.add_weight(
            name="bleu_translation_length",
            initializer="zeros",
            dtype=self.dtype,
        )
        self._reference_length = self.add_weight(
            name="bleu_reference_length",
            initializer="zeros",
            dtype=self.dtype,
        )
        self._bleu = self.add_weight(
            name="bleu",
            initializer="zeros",
            dtype=self.dtype,
        )

    def _get_ngrams(self, segment, max_order):
        """Extracts all n-grams up to a given maximum order from a segment.

        Uses Python ops. Inspired by
        https://github.com/tensorflow/nmt/blob/master/nmt/scripts/bleu.py.

        Args:
            segment: list. Tokenized text segment from which n-grams will
                be extracted.
            max_order: int. Maximum length in tokens of the n-grams
                returned by this method.
        """
        ngram_counts = collections.Counter()
        for order in range(1, max_order + 1):
            for i in range(0, len(segment) - order + 1):
                ngram = tuple(segment[i : i + order])
                ngram_counts[ngram] += 1
        return ngram_counts

    def _corpus_bleu(
        self,
        reference_corpus,
        translation_corpus,
        matches_by_order,
        possible_matches_by_order,
        translation_length,
        reference_length,
        max_order=4,
        smooth=False,
    ):
        """Corpus BLEU implementation using Python ops.

        Computes BLEU score of translated segments against one or more
        references. Inspired by
        https://github.com/tensorflow/nmt/blob/master/nmt/scripts/bleu.py.

        Args:
            reference_corpus: list of lists of references for each
                translation. Each reference should be tokenized into a list
                of tokens.
            translation_corpus: list of translations to score. Each
                translation should be tokenized into a list of tokens.
            matches_by_order: list of floats containing the initial number
                of matches for each order.
            possible_matches_by_order: list of floats containing the
                initial number of possible matches for each order.
            translation_length: float. Initial number of tokens in all the
                translations.
            reference_length: float. Initial number of tokens in all the
                references.
            max_order: int. Maximum n-gram order to use when computing
                BLEU score.
            smooth: boolean. Whether or not to apply add-one smoothing
                (Lin et al., 2004).
        """
        for (references, translation) in zip(
            reference_corpus, translation_corpus
        ):
            reference_length += min(len(r) for r in references)
            translation_length += len(translation)

            merged_ref_ngram_counts = collections.Counter()
            for reference in references:
                merged_ref_ngram_counts |= self._get_ngrams(
                    reference, max_order
                )
            translation_ngram_counts = self._get_ngrams(translation, max_order)
            overlap = translation_ngram_counts & merged_ref_ngram_counts
            for ngram in overlap:
                matches_by_order[len(ngram) - 1] += overlap[ngram]
            for order in range(1, max_order + 1):
                possible_matches = len(translation) - order + 1
                if possible_matches > 0:
                    possible_matches_by_order[order - 1] += possible_matches

        precisions = [0] * max_order
        for i in range(0, max_order):
            if smooth:
                precisions[i] = (matches_by_order[i] + 1.0) / (
                    possible_matches_by_order[i] + 1.0
                )
            else:
                if possible_matches_by_order[i] > 0:
                    precisions[i] = (
                        float(matches_by_order[i])
                        / possible_matches_by_order[i]
                    )
                else:
                    precisions[i] = 0.0

        if min(precisions) > 0:
            p_log_sum = sum((1.0 / max_order) * math.log(p) for p in precisions)
            geo_mean = math.exp(p_log_sum)
        else:
            geo_mean = 0

        ratio = float(translation_length) / reference_length

        if ratio > 1.0:
            bp = 1.0
        else:
            bp = math.exp(1 - 1.0 / ratio)

        bleu = geo_mean * bp

        return (
            bleu,
            matches_by_order,
            possible_matches_by_order,
            translation_length,
            reference_length,
        )

    def update_state(self, y_true, y_pred, sample_weight=None):
        def validate_and_fix_rank(inputs, tensor_name, base_rank=0):
            if not isinstance(inputs, (tf.Tensor, tf.RaggedTensor)):
                inputs = tf.convert_to_tensor(inputs)

            if inputs.shape.rank == base_rank:
                return inputs[tf.newaxis]
            elif inputs.shape.rank == base_rank + 1:
                return inputs
            else:
                raise ValueError(
                    f"{tensor_name} must be of rank {base_rank} or "
                    f"{base_rank + 1}. Found rank: {inputs.shape.rank}"
                )

        def calculate_bleu_score(references, translation):
            if references.dtype == tf.string:
                references = tensor_to_string_list(references)
                translation = tensor_to_string_list(translation)
            else:
                references = tensor_to_list(references)
                translation = tensor_to_list(translation)

            matches = self._matches.numpy().tolist()
            possible_matches = self._possible_matches.numpy().tolist()
            translation_length = self._translation_length.numpy()
            reference_length = self._reference_length.numpy()

            (
                bleu_score,
                matches,
                possible_matches,
                translation_length,
                reference_length,
            ) = self._corpus_bleu(
                reference_corpus=references,
                translation_corpus=translation,
                matches_by_order=matches,
                possible_matches_by_order=possible_matches,
                translation_length=translation_length,
                reference_length=reference_length,
                max_order=self.max_order,
                smooth=self.smooth,
            )
            return (
                tf.constant(bleu_score, dtype=self.dtype),
                tf.constant(matches, dtype=self.dtype),
                tf.constant(possible_matches, dtype=self.dtype),
                tf.constant(translation_length, dtype=self.dtype),
                tf.constant(reference_length, dtype=self.dtype),
            )

        y_true = validate_and_fix_rank(y_true, "y_true", 1)
        y_pred = validate_and_fix_rank(y_pred, "y_pred", 0)

        # Tokenize the inputs.
        y_true = self.tokenizer(y_true)
        y_pred = self.tokenizer(y_pred)

        (
            bleu_score,
            matches,
            possible_matches,
            translation_length,
            reference_length,
        ) = tf.py_function(
            func=calculate_bleu_score,
            inp=[y_true, y_pred],
            Tout=[self.dtype] * 5,
        )

        self._matches.assign(matches)
        self._possible_matches.assign(possible_matches)
        self._translation_length.assign(translation_length)
        self._reference_length.assign(reference_length)
        self._bleu.assign(bleu_score)

    def result(self):
        return self._bleu

    def reset_state(self):
        self._matches.assign(
            tf.zeros(shape=(self.max_order,), dtype=self.dtype)
        )
        self._possible_matches.assign(
            tf.zeros(shape=(self.max_order,), dtype=self.dtype)
        )
        self._translation_length.assign(0.0)
        self._reference_length.assign(0.0)
        self._bleu.assign(0.0)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "max_order": self.max_order,
                "smooth": self.smooth,
            }
        )
        return config
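The clipped n-gram matching and brevity penalty that the class docstring describes can be illustrated with a minimal pure-Python sketch. This is an illustration written for this review, not part of the PR; the `bleu` helper below mirrors the logic of `_corpus_bleu` for a single (references, translation) pair, with pre-tokenized token lists as input and no TensorFlow dependency:

```python
import collections
import math


def get_ngrams(segment, max_order):
    # Count all n-grams of order 1..max_order in a list of tokens.
    counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(len(segment) - order + 1):
            counts[tuple(segment[i : i + order])] += 1
    return counts


def bleu(references, translation, max_order=4, smooth=False):
    # references: list of token lists; translation: a single token list.
    ref_len = min(len(r) for r in references)
    trans_len = len(translation)

    # Counter union (|=) keeps the per-n-gram maximum across references;
    # intersection (&) with the translation counts gives the clipped counts.
    merged = collections.Counter()
    for ref in references:
        merged |= get_ngrams(ref, max_order)
    overlap = get_ngrams(translation, max_order) & merged

    matches = [0] * max_order
    possible = [0] * max_order
    for ngram, count in overlap.items():
        matches[len(ngram) - 1] += count
    for order in range(1, max_order + 1):
        possible[order - 1] = max(trans_len - order + 1, 0)

    # Modified n-gram precision per order, with optional add-one smoothing.
    precisions = []
    for m, p in zip(matches, possible):
        if smooth:
            precisions.append((m + 1.0) / (p + 1.0))
        else:
            precisions.append(m / p if p > 0 else 0.0)

    if min(precisions) > 0:
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    else:
        geo_mean = 0.0

    # Brevity penalty: penalise translations shorter than the references.
    ratio = trans_len / ref_len
    bp = 1.0 if ratio > 1.0 else math.exp(1 - 1.0 / ratio)
    return geo_mean * bp
```

A perfect match scores 1.0, while a degenerate translation of repeated tokens is held near zero by the clipped counts, which is exactly the redundancy behaviour the docstring motivates.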