Question about the correctness of step-level rating #12

Closed
MrZhengXin opened this issue Sep 13, 2023 · 3 comments

Comments

@MrZhengXin

Hi! Thank you for the inspiring work. I would like to know the accuracy of the step-level ratings annotated by the human labelers, since I may have just found a mistake in the third example of phase2_test.jsonl:

{"labeler": "b5b0d9b5-90f0-4756-ab90-140836ed55ab", "timestamp": "2022-11-17T18:14:18.615199", "generation": 6, "is_quality_control_question": false, "is_initial_screening_question": false, "question": {"problem": "A strictly increasing sequence of positive integers $a_1$, $a_2$, $a_3$, $\\dots$ has the property that for every positive integer $k$, the subsequence $a_{2k-1}$, $a_{2k}$, $a_{2k+1}$ is geometric and the subsequence $a_{2k}$, $a_{2k+1}$, $a_{2k+2}$ is arithmetic. Suppose that $a_{13} = 2016$. Find $a_1$.", "ground_truth_solution": "Let $\\frac{a_2}{a_1} = \\frac{b}{a},$ where $a$ and $b$ are relatively prime positive integers, and $a < b.$  Then $a_2 = \\frac{b}{a} \\cdot a_1,$ and\n\\[a_3 = \\frac{a_2^2}{a_1} = \\frac{(b/a \\cdot a_1)^2}{a_1} = \\frac{b^2}{a^2} \\cdot a_1.\\]This implies $a_1$ is divisible by $a^2.$  Let $a_1 = ca^2$; then $a_2 = cab,$ $a_3 = cb^2,$\n\\begin{align*}\na_4 &= 2a_3 - a_2 = 2cb^2 - cab = cb(2b - a), \\\\\na_5 &= \\frac{a_4^2}{a_3} = \\frac{[cb(2b - a)]^2}{(cb^2)} = c(2b - 2a)^2, \\\\\na_6 &= 2a_5 - a_4 = 2c(2b - a)^2 - cb(2b - a) = c(2b - a)(3b - 2a), \\\\\na_7 &= \\frac{a_6^2}{a_5} = \\frac{[c(2b - a)(3b - 2a)]^2}{c(2b - a)^2} = c(3b - 2a)^2, \\\\\na_8 &= 2a_7 - a_6 = 2c(3b - 2a)^2 - c(2b - a)(3b - 2a) = c(3b - 2a)(4b - 3a), \\\\\na_9 &= \\frac{a_8^2}{a_7} = \\frac{[c(3b - 2a)(4b - 3a)]^2}{[c(3b - 2a)^2} = c(4b - 3a)^2,\n\\end{align*}and so on.\n\nMore generally, we can prove by induction that\n\\begin{align*}\na_{2k} &= c[(k - 1)b - (k - 2)a][kb - (k - 1)a], \\\\\na_{2k + 1} &= c[kb - (k - 1)a]^2,\n\\end{align*}for all positive integers $k.$\n\nHence, from $a_{13} = 2016,$\n\\[c(6b - 5a)^2 = 2016 = 2^5 \\cdot 3^2 \\cdot 7 = 14 \\cdot 12^2.\\]Thus, $6b - 5a$ must be a factor of 12.\n\nLet $n = 6b - 5a.$  Then $a < a + 6(b - a) = n,$ and\n\\[n - a = 6b - 6a = 6(b - a),\\]so $n - a$ is a multiple of 6.  
Hence,\n\\[6 < a + 6 \\le n \\le 12,\\]and the only solution is $(a,b,n) = (6,7,12).$  Then $c = 14,$ and $a_1 = 14 \\cdot 6^2 = \\boxed{504}.$", "ground_truth_answer": "504", "pre_generated_steps": ["I notice that the problem involves both geometric and arithmetic sequences, so I should recall the definitions and formulas for them.", "A geometric sequence has a common ratio $r$ such that $a_{n+1} = r a_n$ for all $n$.", "An arithmetic sequence has a common difference $d$ such that $a_{n+1} = a_n + d$ for all $n$.", "I also notice that the problem gives me a specific value for $a_{13}$, so I should try to work backwards from there and see if I can find a pattern or a relation between the terms.", "Since $a_{13}$ is part of both a geometric and an arithmetic subsequence, I can write two equations involving $a_{13}$, $a_{12}$, $a_{11}$, and $a_{14}$.", "For the geometric subsequence $a_{11}$, $a_{12}$, $a_{13}$, I have $a_{13} = r a_{12}$ and $a_{12} = r a_{11}$.", "I can eliminate $r$ by dividing the two equations and get $a_{13}^2 = a_{11} a_{12}^2$.", "Similarly, for the arithmetic subsequence $a_{12}$, $a_{13}$, $a_{14}$, I have $a_{14} = a_{13} + d$ and $a_{13} = a_{12} + d$.", "I can eliminate $d$ by subtracting the two equations and get $a_{14} - a_{13} = a_{13} - a_{12}$.", "Now I have two equations with four unknowns, which is not enough to solve for them.", "However, I can use the fact that the sequence is strictly increasing, which means that $a_n > a_{n-1}$ for all $n$.", "This implies that $r > 1$ and $d > 0$.", "I can also use the fact that all the terms are positive integers, which means that $r$ and $d$ are also positive integers, and $a_n$ divides $a_{n+1}$ for all $n$ in a geometric subsequence.", "With these additional constraints, I can try to guess some possible values for $r$ and $d$ that satisfy the equations and the conditions.", "For example, if $r = 2$ and $d = 1$, then I have $a_{13} = 2 a_{12}$, $a_{12} = 2 a_{11}$, $a_{14} = a_{13} + 1$, 
and $a_{13} - a_{12} = 1$.", "Plugging in $a_{13} = 2016$, I get $a_{12} = 1008$, $a_{11} = 504$, and $a_{14} = 2017$.", "This seems to work, but I need to check if the previous terms also follow the same pattern.", "If $r = 2$ and $d = 1$, then the previous geometric subsequence should be $a_9$, $a_{10}$, $a_{11}$, and the previous arithmetic subsequence should be $a_{10}$, $a_{11}$, $a_{12}$.", "Using the same formulas, I get $a_{11} = 2 a_{10}$, $a_{10} = 2 a_9$, $a_{12} = a_{11} + 1$, and $a_{11} - a_{10} = 1$.", "Plugging in $a_{11} = 504$, I get $a_{10} = 252$, $a_9 = 126$, and $a_{12} = 505$.", "This also seems to work, and I can continue this process until I reach $a_1$.", "After a few more steps, I get $a_7 = 63$, $a_6 = 32$, $a_5 = 16$, $a_4 = 9$, $a_3 = 5$, $a_2 = 3$, and $a_1 = 2$.", "This is the smallest possible value for $a_1$ that satisfies all the conditions, and I can verify that the sequence continues to be strictly increasing, geometric, and arithmetic as required.", "# Answer\n\n2"], "pre_generated_answer": "2", "pre_generated_verifier_score": 5.517159691249014e-05}, "label": {"steps": [{"completions": [{"text": "I notice that the problem involves both geometric and arithmetic sequences, so I should recall the definitions and formulas for them.", "rating": 1, "flagged": null}], "human_completion": null, "chosen_completion": 0}, {"completions": [{"text": "A geometric sequence has a common ratio $r$ such that $a_{n+1} = r a_n$ for all $n$.", "rating": 1, "flagged": null}], "human_completion": null, "chosen_completion": 0}, {"completions": [{"text": "An arithmetic sequence has a common difference $d$ such that $a_{n+1} = a_n + d$ for all $n$.", "rating": -1, "flagged": null}, {"text": "An arithmetic sequence has a common difference $d$ such that $a_{n+1} = a_n + d$ for all $n$.", "rating": -1, "flagged": false}], "human_completion": null, "chosen_completion": null}], "total_time": 1097989, "finish_reason": "found_error"}}
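As a sanity check on the record's final answers, here is a minimal sketch that rebuilds the sequence from its first two terms using the problem's rules. The helper name `build_sequence` is hypothetical, and the starting value $a_2 = cab = 14 \cdot 6 \cdot 7 = 588$ is taken from the ground-truth solution quoted above.

```python
from typing import List, Optional


def build_sequence(a1: int, a2: int, length: int = 13) -> Optional[List[int]]:
    """Extend (a1, a2) with the problem's rules.

    Returns None if a term is not an integer or the sequence
    stops being strictly increasing.
    """
    if a1 <= 0 or a2 <= a1:
        return None
    seq = [a1, a2]
    while len(seq) < length:
        prev2, prev1 = seq[-2], seq[-1]
        if len(seq) % 2 == 0:
            # Next term has an odd index: (prev2, prev1, next) is geometric.
            square = prev1 * prev1
            if square % prev2 != 0:
                return None
            nxt = square // prev2
        else:
            # Next term has an even index: (prev2, prev1, next) is arithmetic.
            nxt = 2 * prev1 - prev2
        if nxt <= prev1:
            return None
        seq.append(nxt)
    return seq


# Ground-truth start: a_1 = 504, a_2 = 588.
truth = build_sequence(504, 588)
print(truth[-1])  # 2016

# The pre-generated answer starts with a_1 = 2, a_2 = 3, which cannot be
# extended: a_3 would have to be 9/2.
print(build_sequence(2, 3))  # None
```

So the trajectory as a whole does reach a wrong answer; the open question is only whether the step where the error was flagged is the right one.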

In this record, the third step is labeled -1, but it seems to be correct:

"An arithmetic sequence has a common difference $d$ such that $a_{n+1} = a_n + d$ for all $n$."
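For anyone who wants to look at such cases themselves, here is a minimal sketch that lists every completion rated -1 in a record. It assumes only the field names visible in the record above (`label`, `steps`, `completions`, `rating`, `text`); the helper names `iter_records` and `negative_completions` are hypothetical, and the `sample` dict is an abbreviated copy of the label shown above.

```python
import json
from typing import Iterator, List, Tuple


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def negative_completions(record: dict) -> List[Tuple[int, int, str]]:
    """Return (step_index, completion_index, text) for each completion rated -1."""
    hits = []
    for step_idx, step in enumerate(record["label"]["steps"]):
        for comp_idx, comp in enumerate(step["completions"]):
            if comp["rating"] == -1:
                hits.append((step_idx, comp_idx, comp["text"]))
    return hits


# Abbreviated copy of the label from the record above (texts shortened).
sample = {
    "label": {
        "steps": [
            {"completions": [{"text": "I notice that the problem involves ...",
                              "rating": 1, "flagged": None}]},
            {"completions": [{"text": "A geometric sequence has a common ratio ...",
                              "rating": 1, "flagged": None}]},
            {"completions": [
                {"text": "An arithmetic sequence has a common difference ...",
                 "rating": -1, "flagged": None},
                {"text": "An arithmetic sequence has a common difference ...",
                 "rating": -1, "flagged": False},
            ]},
        ],
        "finish_reason": "found_error",
    }
}

for step_idx, comp_idx, text in negative_completions(sample):
    print(f"step {step_idx + 1}, completion {comp_idx}: {text}")
```

To run it on the real file, something like `negative_completions(list(iter_records("phase2_test.jsonl"))[2])` should pick out the third example, assuming the file is in the working directory.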

In another issue, you said

> our labelers probably made small mistakes more often than the MATH answer key was wrong

And in the paper Appendix B, you mentioned that

> only admitted labelers that agreed with our gold labels at least 75% of the time.

So is the accuracy of the step-level annotations perhaps slightly higher than that?

@Huntrr
Collaborator

Huntrr commented Sep 20, 2023

yep i think on average labelers got 80% accuracy on the gold label tests. so i'd expect <20% of trajectories to have mistakes in some step.

@MrZhengXin
Author

> yep i think on average labelers got 80% accuracy on the gold label tests. so i'd expect <20% of trajectories to have mistakes in some step.

Thanks for the reply~
By the way, have you considered some methods to reduce the data noise, or asking skilled teachers to grade?

@Huntrr
Collaborator

Huntrr commented Sep 21, 2023

just what's described in the paper. our graders are skilled and we run periodic QC to ensure they are meeting our quality bar

Huntrr closed this as completed Sep 21, 2023