Reproduction Results Show Significant EM Difference in Iter1 Compared to Paper #51

@JayPritchet

Description

Hi, thanks for your awesome work!

I'm trying to reproduce the paper's results with CodeGen-350M-mono and noticed some discrepancies in the metrics, particularly the EM score at Iter1.

Paper Results:

| Metric | Infile | Iter1 | Iter2 | Iter3 | Iter4 |
|--------|--------|-------|-------|-------|-------|
| EM     | 22.19  | 31.75 | 33.88 | 33.75 | 33.81 |
| ES     | 52.24  | 59.82 | 61.03 | 60.96 | 61.06 |

My Reproduction Results:

| Metric | Infile | Iter1 | Iter2 | Iter3 | Iter4 |
|--------|--------|-------|-------|-------|-------|
| EM     | 22.25  | 33.12 | 34.06 | 34.12 | 33.81 |
| ES     | 52.23  | 60.33 | 60.94 | 61.23 | 60.92 |
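For context on how I'm reading the numbers: here is a minimal sketch of how I understand EM and ES to be computed. I'm assuming EM is an exact match after whitespace normalization and ES is a Levenshtein-based edit similarity scaled to 0-100 (one common normalization, in the spirit of `fuzz.ratio`); the function names are my own, not the repo's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def exact_match(pred: str, target: str) -> bool:
    """EM: exact token-sequence match after whitespace normalization."""
    return pred.split() == target.split()

def edit_similarity(pred: str, target: str) -> float:
    """ES: edit distance normalized by the longer string, scaled to 0-100
    (assumed definition; the paper may normalize differently)."""
    denom = max(len(pred), len(target))
    if denom == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(pred, target) / denom)
```

If the repo scores ES with a different normalization (e.g. summed lengths), small per-sample differences like the ones above could compound across the benchmark.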

My Hyperparameters:

  • max_retrieval_length = 900
  • window_size = 20
  • slice_size = 2
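For reference, my understanding is that `window_size` and `slice_size` control a sliding-window pass over repository files, with `slice_size` acting as the overlap factor (stride = `window_size // slice_size`). This is a sketch of that assumed interpretation; the function and variable names are illustrative, not the repo's actual API.

```python
def build_windows(lines, window_size=20, slice_size=2):
    """Split a file's lines into overlapping retrieval windows.

    Each window spans `window_size` lines; the start advances by
    window_size // slice_size lines per step (assumed meaning of
    slice_size). Returns (window_text, start_line) pairs.
    """
    step = max(1, window_size // slice_size)
    windows = []
    for start in range(0, max(1, len(lines)), step):
        chunk = lines[start:start + window_size]
        if chunk:
            windows.append(("\n".join(chunk), start))
    return windows
```

If the defaults for these values changed between commits, retrieval candidates (and hence Iter1 scores) would shift even with identical prompts.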

Observations:

  1. The Infile metrics are very close to the paper results (EM: 22.25 vs 22.19, ES: 52.23 vs 52.24)
  2. However, EM at Iter1 differs significantly (33.12 vs 31.75, a gap of 1.37 points)
  3. Other iterations show smaller but noticeable variations in both EM and ES metrics

I'm running the code from the /zfj/RepoCoder branch for my tests. The repository has received a number of updates since its initial commit, which leads me to wonder:

  • Could these recent commits be the primary reason for the elevated performance metrics I'm seeing?
  • If not, what other factors should I investigate? (For example, were there specific hyperparameters or environmental settings used for the paper's experiments that differ from the current defaults?)

Thank you for your time and assistance!
