Predictions written to disk don't match metrics #343
Comments
I'll try reproducing this, esp w/o ELMo. |
Failed to reproduce without save/restore, even with ELMo. Will try with save/restore. |
Checked that Alex's predictions and mine are different (both after restore), even though the reported accuracy scores are the same. |
Couldn't reproduce even with save/restore. I suspect this is an index-related bug. |
What's the path to your predictions on NYU? And these are dev predictions? |
I just overwrote them on the server. One sec. |
/nfs/jsalt/exp/sam-worker2/final/mtl-glue-elmo/wnli_val.tsv
I rebuilt all pickles from scratch this time to be sure. Do you have stale pickles? That's my only guess at this point for this particular run mismatch...
|
Aaagh nevermind. There was a filename mismatch, and that was from the other run. |
(The predictions and metrics do, in fact, match.) |
Cool, I was just evaluating my CoLA predictions offline and they seemed to match. |
Tx. Sorry for the scare. Now I have very little idea what's wrong, but this
nondeterminism thing is worrying.
…On Fri, Aug 10, 2018 at 3:07 PM Alex Wang ***@***.***> wrote:
Cool, I was just evaluating my CoLA predictions offline and they seemed to
match.
|
I just tried evaluating Alex's 72.9 run. The numbers reported at the end of the eval run match what's on the sheet, so I was able to reproduce those, but the actual outputs written to disk for the dev set don't make sense.
The ground-truth column is the same in the output files and in the original input TSVs, so it's not a loading issue. It may be relevant that WNLI is so small: if something is wrong in the first or last batch, that would be magnified for WNLI.
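For anyone who wants to run the same sanity check offline, here is a minimal sketch that recomputes accuracy straight from a predictions TSV so it can be compared with the number printed at the end of the eval run. The column names `prediction` and `true_label` and the helper `accuracy_from_tsv` are assumptions for illustration, not the project's actual output format; adjust them to whatever header the written file really has.

```python
import csv
import io


def accuracy_from_tsv(handle, pred_col="prediction", gold_col="true_label"):
    """Recompute accuracy from a predictions TSV.

    pred_col and gold_col are hypothetical column names; change them to
    match the header of the file actually written to disk.
    """
    # QUOTE_NONE keeps stray quote characters in sentence text from
    # merging rows, which would silently shift every later example.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = 0
    correct = 0
    for row in reader:
        total += 1
        if row[pred_col] == row[gold_col]:
            correct += 1
    if total == 0:
        raise ValueError("no rows found in predictions file")
    return correct / total, total


# Tiny in-memory stand-in for a dev-set predictions file.
sample = (
    "index\tprediction\ttrue_label\n"
    "0\t1\t1\n"
    "1\t0\t1\n"
    "2\t0\t0\n"
    "3\t1\t1\n"
)
acc, n = accuracy_from_tsv(io.StringIO(sample))
print(f"accuracy={acc:.3f} over {n} examples")
```

To check a real file, pass `open(path, newline="")` instead of the `io.StringIO` sample. Comparing the returned example count against the size of the dev set is also a quick way to spot a dropped first or last batch.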