Extract human response from 6k multi-ref dataset #48

Open
andy920262 opened this issue Sep 2, 2020 · 3 comments
Comments

@andy920262

Hi,

I'm trying to reproduce the human response results in the paper and have encountered some problems.
I copied test.scored_refs.txt to the dstc/data folder and used the first column as the keys.
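For reference, extracting the keys amounts to something like this (a minimal sketch; the output name keys.txt is just illustrative, not a filename from the repo):

with open("dstc/data/test.scored_refs.txt", encoding="utf-8") as f_in, \
     open("dstc/data/keys.txt", "w", encoding="utf-8") as f_out:  # output name is an example
    for line in f_in:
        f_out.write(line.split("\t", 1)[0] + "\n")  # first tab-separated column = key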
The eval result after running python extract_human.py and python batch_eval.py is:

n_lines = 5994
NIST = [2.871, 3.246, 3.3125, 3.3229]
BLEU = [0.378, 0.1678, 0.0966, 0.0655]
METEOR = 0.10657856237003654
entropy = [6.61382916462754, 10.109370475853682, 11.032526832134234, 11.125019724262556]
diversity = [0.12143963906484984, 0.5817823864609064]
avg_len = 14.64330997664331

which differs from the paper; even the avg_len is wrong.
I'm wondering which step went wrong and how to reproduce the results.

Thanks!

@dreasysnail
Contributor

dreasysnail commented Sep 3, 2020

Hi, the human reference file has been uploaded. Please find it here: data/human.ref.6k.txt. You might want to use this human reference file to compute against the other references. Also, your total line count is not 6000; I'm not sure why, but it may be worth examining.
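A quick way to examine the line count is a minimal sketch like this (substitute the path of whichever response file you extracted; human.ref.6k.txt is used here only as an example):

with open("data/human.ref.6k.txt", encoding="utf-8") as f:
    print(sum(1 for _ in f))  # expect 6000 lines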

@andy920262
Author

Thanks for the update.

The file and the avg_len seem correct, but the eval result is still wrong.
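One way to double-check avg_len independently is a sketch like this (plain whitespace tokenization, which may differ slightly from the eval script's tokenizer):

with open("data/human.ref.6k.txt", encoding="utf-8") as f:
    lengths = [len(line.split()) for line in f]
print(sum(lengths) / len(lengths))  # mean response length in whitespace tokens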

I observed 2 problems and fixed them with the following script (a Python equivalent is sketched after the list below):

# drop the original key column, then (via rev) drop the last column (the human response)
cat ../data/test.refs.txt | cut -f 2- | rev | cut -f 2- | rev > ./data/test.refs.tmp.txt
# generate 6000 distinct keys and prepend them as the new first column
seq 6000 > ./data/keys.6k.txt
paste ./data/keys.6k.txt ./data/test.refs.tmp.txt > ./data/test.refs.txt
  1. The human response comes from the last column of test.refs.txt, so I exclude the last column from the references.
  2. The first column contains some duplicated keys, so I replace them with distinct numbers.
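For clarity, the same fix as a single Python sketch (assuming tab-separated columns, as in the shell pipeline above):

with open("../data/test.refs.txt", encoding="utf-8") as f_in, \
     open("./data/test.refs.txt", "w", encoding="utf-8") as f_out, \
     open("./data/keys.6k.txt", "w", encoding="utf-8") as f_keys:
    for i, line in enumerate(f_in, start=1):
        cols = line.rstrip("\n").split("\t")
        # drop the original key (first column) and the human response (last column),
        # then prepend a distinct numeric key (same effect as seq 6000)
        f_out.write("\t".join([str(i)] + cols[1:-1]) + "\n")
        f_keys.write(f"{i}\n")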

and then run:

$ python3 dstc.py human.6k.resp.txt --ref ./data/test.refs.txt --keys ./data/keys.6k.txt --vshuman -1

The result looks almost correct, except for NIST4, which is 4.25 in the paper:

n_lines = 6000
NIST = [2.9939, 3.412, 3.491, 3.5033]
BLEU = [0.3961, 0.179, 0.1071, 0.0748]
METEOR = 0.10636074642754038
entropy = [6.864962939185212, 10.213254208172751, 10.970525196688564, 10.99510001622831]
diversity = [0.1454816096487322, 0.6296332006446193]
avg_len = 13.100166666666667

NIST2 and NIST4 are very close in all the other experiments, so 3.5 seems more reasonable.
Maybe the NIST4 score is a typo in the paper?

@theyorubayesian

@andy920262

I obtained the same results as you, so it's possible an error was made in the paper.
