Extract human response from 6k multi-ref dataset #48

Open
andy920262 opened this issue Sep 2, 2020 · 3 comments
Comments

@andy920262

Hi,

I'm trying to reproduce the human response results in the paper and have encountered some problems.
I copied test.scored_refs.txt to the dstc/data folder and used the first column as the keys.
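For reference, extracting the keys amounts to something like this (a minimal sketch; the output name keys.txt is just illustrative, not a filename from the repo):

with open("dstc/data/test.scored_refs.txt", encoding="utf-8") as f_in, \
     open("dstc/data/keys.txt", "w", encoding="utf-8") as f_out:  # output name is an example
    for line in f_in:
        f_out.write(line.split("\t", 1)[0] + "\n")  # first tab-separated column = key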
The eval result after running python extract_human.py and python batch_eval.py is:

n_lines = 5994
NIST = [2.871, 3.246, 3.3125, 3.3229]
BLEU = [0.378, 0.1678, 0.0966, 0.0655]
METEOR = 0.10657856237003654
entropy = [6.61382916462754, 10.109370475853682, 11.032526832134234, 11.125019724262556]
diversity = [0.12143963906484984, 0.5817823864609064]
avg_len = 14.64330997664331

which differs from the paper; even the avg_len is wrong.
I'm wondering which step went wrong and how to reproduce the results.

Thanks!

@dreasysnail
Contributor

dreasysnail commented Sep 3, 2020

Hi, the human reference file has been uploaded. Please find it here: data/human.ref.6k.txt. You might want to use this human reference file to compute against the other references. Also, your total line count is not 6000; I'm not sure why, but it may be worth examining.
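A quick way to examine the line count is a minimal sketch like this (substitute the path of whichever response file you extracted; human.ref.6k.txt is used here only as an example):

with open("data/human.ref.6k.txt", encoding="utf-8") as f:
    print(sum(1 for _ in f))  # expect 6000 lines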

@andy920262
Author

Thanks for the update.

The file and the avg_len seem correct, but the eval result is still wrong.
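One way to double-check avg_len independently is a sketch like this (plain whitespace tokenization, which may differ slightly from the eval script's tokenizer):

with open("data/human.ref.6k.txt", encoding="utf-8") as f:
    lengths = [len(line.split()) for line in f]
print(sum(lengths) / len(lengths))  # mean response length in whitespace tokens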

I observed 2 problems and fixed them with the following script (a Python equivalent is sketched after the list below):

# drop the original key column, then (via rev) drop the last column (the human response)
cat ../data/test.refs.txt | cut -f 2- | rev | cut -f 2- | rev > ./data/test.refs.tmp.txt
# generate 6000 distinct keys and prepend them as the new first column
seq 6000 > ./data/keys.6k.txt
paste ./data/keys.6k.txt ./data/test.refs.tmp.txt > ./data/test.refs.txt
  1. The human response comes from the last column of test.refs.txt, so I exclude the last column from the references.
  2. The first column contains some duplicated keys, so I replace them with distinct numbers.
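For clarity, the same fix as a single Python sketch (assuming tab-separated columns, as in the shell pipeline above):

with open("../data/test.refs.txt", encoding="utf-8") as f_in, \
     open("./data/test.refs.txt", "w", encoding="utf-8") as f_out, \
     open("./data/keys.6k.txt", "w", encoding="utf-8") as f_keys:
    for i, line in enumerate(f_in, start=1):
        cols = line.rstrip("\n").split("\t")
        # drop the original key (first column) and the human response (last column),
        # then prepend a distinct numeric key (same effect as seq 6000)
        f_out.write("\t".join([str(i)] + cols[1:-1]) + "\n")
        f_keys.write(f"{i}\n")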

and then run:

$ python3 dstc.py human.6k.resp.txt --ref ./data/test.refs.txt --keys ./data/keys.6k.txt --vshuman -1

The result looks almost correct, except for NIST4, which is 4.25 in the paper:

n_lines = 6000
NIST = [2.9939, 3.412, 3.491, 3.5033]
BLEU = [0.3961, 0.179, 0.1071, 0.0748]
METEOR = 0.10636074642754038
entropy = [6.864962939185212, 10.213254208172751, 10.970525196688564, 10.99510001622831]
diversity = [0.1454816096487322, 0.6296332006446193]
avg_len = 13.100166666666667

NIST2 and NIST4 are very close in all the other experiments, so 3.5 seems more reasonable.
Maybe the NIST4 score is a typo in the paper?

@theyorubayesian

@andy920262

I obtained the same results as you, so it's possible an error was made in the paper.
