Hi, thanks for your awesome work!
I'm trying to reproduce the results from the paper using CodeGen-350M-mono, and I noticed some discrepancies in the metrics, particularly in the EM score for Iter1.
Paper Results:
| Metric | Infile | Iter1 | Iter2 | Iter3 | Iter4 |
|---|---|---|---|---|---|
| EM | 22.19 | 31.75 | 33.88 | 33.75 | 33.81 |
| ES | 52.24 | 59.82 | 61.03 | 60.96 | 61.06 |
My Reproduction Results:
| Metric | Infile | Iter1 | Iter2 | Iter3 | Iter4 |
|---|---|---|---|---|---|
| EM | 22.25 | 33.12 | 34.06 | 34.12 | 33.81 |
| ES | 52.23 | 60.33 | 60.94 | 61.23 | 60.92 |
My Hyperparameters:
- max_retrieval_length = 900
- window_size = 20
- slice_size = 2
Observations:
- The Infile metrics are very close to the paper results (EM: 22.25 vs 22.19, ES: 52.23 vs 52.24)
- However, the EM for Iter1 differs noticeably (33.12 vs 31.75, a gap of 1.37 points)
- Other iterations show smaller but noticeable variations in both EM and ES metrics
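In case the gap comes from scoring rather than generation, here is a minimal sketch of how I compute EM and ES, so we can check whether our metric definitions match. This is an assumption on my side: I normalize whitespace before the exact-match check, and I approximate the Levenshtein-based edit similarity with the standard library's `difflib.SequenceMatcher` to keep the sketch dependency-free (the repository may use a fuzzy-matching library instead, which can give slightly different ES values).

```python
from difflib import SequenceMatcher

def exact_match(pred: str, target: str) -> bool:
    # Compare after collapsing whitespace; whether EM normalizes
    # whitespace is one place implementations can diverge.
    return " ".join(pred.split()) == " ".join(target.split())

def edit_similarity(pred: str, target: str) -> float:
    # Normalized similarity in [0, 100]. A Levenshtein-based ratio
    # (e.g. from a fuzzy-matching library) is the usual choice;
    # difflib's ratio is a stdlib stand-in used here for illustration.
    return 100.0 * SequenceMatcher(None, pred, target).ratio()

def score(preds, targets):
    """Corpus-level EM (%) and mean ES over aligned prediction/target pairs."""
    n = len(preds)
    em = 100.0 * sum(exact_match(p, t) for p, t in zip(preds, targets)) / n
    es = sum(edit_similarity(p, t) for p, t in zip(preds, targets)) / n
    return em, es
```

If the scoring matches on both sides, that would point the finger at the generation or retrieval stage instead.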
I'm running my tests with the code from the /zfj/RepoCoder branch. I've noticed the repository has received a number of updates since the initial commit, which leads me to wonder:
- Could these recent commits be the primary reason for the higher metrics I'm seeing?
- If not, what other factors should I investigate? (For example, were there specific hyperparameters or environmental settings used for the paper's experiments that differ from the current defaults?)
Thank you for your time and assistance!