Replacing 1 by # #7

astariul · 2019-10-15T02:06:43Z

In this line (code for evaluation of CNNDM)

unilm/src/gigaword/eval.py

Line 239 in d22a233

sentence = fix_tokenization(l.strip()).replace('1', '#')

1 is replaced by #.

I don't understand it. Can someone explain the reason for this post-processing?

donglixp · 2019-10-15T04:21:20Z

The file unilm/src/gigaword/eval.py follows the Gigaword data preprocessing from https://github.com/harvardnlp/sent-summary, which replaces all digits with #. However, our BPE tokenizer uses # to mark subwords. So during preprocessing we map the special token # to 1, and then map it back to # after prediction. Note that the unilm/src/cnndm/eval.py script does not use this preprocessing.
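
For readers following along, the round trip described above can be sketched roughly as follows. This is only an illustration, not the repository's code: the helper names `mask_digits`, `protect_hashes`, and `restore_hashes` are hypothetical, and the sketch assumes that once digits are masked the text contains no literal 1, so mapping 1 back to # is unambiguous.

```python
import re


def mask_digits(text: str) -> str:
    """Gigaword-style masking: replace every ASCII digit with '#'.

    Hypothetical helper; stands in for the sent-summary preprocessing.
    """
    return re.sub(r"[0-9]", "#", text)


def protect_hashes(text: str) -> str:
    """Swap '#' for '1' so the tokenizer does not see '#' in the input."""
    return text.replace("#", "1")


def restore_hashes(text: str) -> str:
    """Undo protect_hashes on model output, as eval.py does with replace('1', '#')."""
    return text.replace("1", "#")


raw = "profits rose 12.5 percent in 2019"
masked = mask_digits(raw)           # what the Gigaword references look like
protected = protect_hashes(masked)  # what the model is fed and learns to emit
restored = restore_hashes(protected)

print(masked)
print(protected)
print(restored)
```

The restored string matches the masked Gigaword form, which is what the reference summaries contain, so the evaluation compares like with like. On data that still holds real digits, such as CNN/DailyMail, the same replacement would corrupt every genuine 1, which is consistent with the cnndm script skipping this step.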

donglixp closed this as completed Oct 15, 2019
