
Inability to reproduce paper results #92

Closed

CameronDiao opened this issue Aug 25, 2022 · 5 comments
CameronDiao commented Aug 25, 2022

Thanks to the authors for constructing this benchmark.

I'm having trouble reproducing some of the test scores reported in Table 2 of the paper. Comparing my runs (averaged across 3 seeds: 42, 43, and 44) against the published results:

| Task | Model | My result | Published |
| --- | --- | --- | --- |
| Graham Scan | MPNN | 0.6355 | 0.9104 |
| Graham Scan | PGN | 0.3622 | 0.5687 |
| Binary Search | MPNN | 0.2026 | 0.3683 |
| Binary Search | PGN | 0.4390 | 0.7695 |

Here are the values I used for my reproduction experiments:
[Screenshot of the hyperparameter values used for the reproduction runs]

Values for batch size, train items, learning rate, and hint teacher forcing noise were taken from Sections 4.1 and 4.2 of the paper. Values for eval_every, dropout, use_ln, and use_lstm (which are not specified in the paper) were the defaults in the provided run file. Additionally, I used processor type "pgn_mask" for the PGN experiments.
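Spelled out as a sketch (flag names follow the wording above and may differ slightly from the run file's exact spelling; the actual numbers are elided here, since they came from the paper and the run-file defaults):

```python
# Sketch only: names are approximate and numeric values are elided.
config = {
    "batch_size": ...,                  # paper, Section 4.1
    "train_items": ...,                 # paper, Section 4.1
    "learning_rate": ...,               # paper, Section 4.1
    "hint_teacher_forcing_noise": ...,  # paper, Section 4.2
    "eval_every": ...,                  # run-file default
    "dropout": ...,                     # run-file default
    "use_ln": ...,                      # run-file default
    "use_lstm": ...,                    # run-file default
    "processor_type": "pgn_mask",       # for the PGN experiments
}
```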

What settings should I use to reproduce the paper results more closely? Are there hyperparameter settings, whether specified in the paper or not, that I am getting wrong?

Finally, I noticed that the most recent commit fixes the axis used for the mean reduction in PGN. Could that cause PGN to perform differently than reported in the paper, and perhaps explain the discrepancy in my results?

PetarV- (Collaborator) commented Aug 25, 2022

Hi Cameron,

Thank you for your interest in our work!
As you rightly noted, some of our final chosen hyperparameters did not propagate to the run file in the public GitHub repository, which caused a bit of a discrepancy. Sorry for the inconvenience! We are already preparing a commit to fix that.

In the meantime, I think the key hyperparameter to change from your setting is hint_mode, which should be encoded_decoded_nodiff. You already figured out the other important hyperparameter to change (hint_teacher_forcing_noise).

Further, you can think of both pgn and pgn_mask as PGNs: "mask" is a hyperparameter for the PGN that masks out the possible predictions for the edge targets so that they must follow the graph's edges. Sometimes this is a perfect inductive bias; sometimes it is very wrong. What we did in the paper is report, per task, the better result of those two in the "PGN" column.
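To make that concrete, here is a minimal sketch of the masking idea (illustrative only, not the code in this repository):

```python
import jax.numpy as jnp

def masked_pointer_logits(logits, adj):
    """Sketch of the `pgn_mask` idea: restrict pointer predictions to edges.

    logits: [num_nodes, num_nodes] scores for "node i points to node j".
    adj:    [num_nodes, num_nodes] binary adjacency of the input graph.
    """
    # Non-edges get a large negative score, so a subsequent softmax
    # effectively only considers the graph's edges.
    return jnp.where(adj > 0, logits, -1e9)
```

If the true pointer always follows an existing edge, this mask is exactly the right inductive bias; if it can point to a non-neighbour, the model can never predict the correct target.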

The mean reduction patch only affects processors that use mean aggregation, which we never used in our official experiments, as the max aggregator was always superior.
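For concreteness, a minimal sketch (again, not this repository's code) of the two aggregations, where msgs[b, i, j] holds the message sent from node j to node i:

```python
import jax.numpy as jnp

def aggregate(msgs, adj, reduction="max"):
    """msgs: [batch, n, n, h]; adj: [batch, n, n] binary adjacency."""
    if reduction == "max":
        # Non-edges contribute -inf and can never win the max.
        masked = jnp.where(adj[..., None] > 0, msgs, -jnp.inf)
        return jnp.max(masked, axis=2)  # reduce over senders j
    # Mean over actual neighbours. Reducing over axis=1 instead of axis=2
    # would silently average over receivers rather than senders, which is
    # the kind of bug the mean-reduction patch fixed.
    masked = msgs * adj[..., None]
    degree = jnp.maximum(jnp.sum(adj, axis=2), 1.0)[..., None]  # [batch, n, 1]
    return jnp.sum(masked, axis=2) / degree
```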

I hope this is helpful. If you have any other issues, please don't hesitate to contact us.

Thanks,
Petar

PetarV- (Collaborator) commented Aug 26, 2022

To follow up on this, PR #94 integrates these hyperparameters into the main codebase.

CameronDiao (Author) commented

Thank you for the quick response! I was able to replicate the paper results much more closely with the new specifications.

CameronDiao (Author) commented

Hello, I just wanted to confirm: were the paper settings for GAT number of heads = 1 and head size = 128?

CameronDiao reopened this Sep 20, 2022
PetarV- (Collaborator) commented Sep 20, 2022

Hi Cameron, I am not completely sure at this point, but what we report as "GAT" is actually the maximum performance out of gat, gat_full, gatv2, and gatv2_full, and I think we also swept the number of heads over [1, 4, 8].

Basically, we reported the best performance we were able to get out of all of these GAT variants as "GAT", due to limited horizontal space in the table.
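In sketch form (run_experiment below is a hypothetical stand-in for launching one training run and returning its test score; it is not a function in this repository, and the num_heads parameter name is likewise illustrative):

```python
def run_experiment(processor_type: str, num_heads: int) -> float:
    """Hypothetical stand-in: launch one training run, return its test score."""
    raise NotImplementedError("hook this up to your actual training runner")

variants = ["gat", "gat_full", "gatv2", "gatv2_full"]
head_counts = [1, 4, 8]

# Report the best score across all variants and head counts as "GAT".
gat_score = max(
    run_experiment(processor_type=variant, num_heads=heads)
    for variant in variants
    for heads in head_counts
)
```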
