add third party code for text-to-sql evaluation #532
Conversation
…to evaluation.py, evaluation_test.py to be consistent with the original third-party repo
Modified the README and renamed files sql_evaluation.py => evaluation.py, sql_evaluation_test.py => evaluation_test.py to match the code in the third-party repo. We had used different file names for historical reasons; now we have changed them back to match the files in the original repo.
…text_to_sql_evaluation merge main into my branch
…test. change assertAlmostEqual(ci[0], ...), assertAlmostEqual(ci[1], ...) to assertGreater(val, ci[0]) and assertGreater(ci[1], val)
Change the test criteria in meta_eval_wmt_da_test.py so that the test passes. Use the criterion ci[0] < val < ci[1] instead of hardcoding the values of ci[0] and ci[1].
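For illustration, a minimal sketch of the assertion change described above, assuming a unittest-style test; the names and values below are placeholders, not the actual ones in meta_eval_wmt_da_test.py:

```python
import unittest

class BootstrapCITest(unittest.TestCase):
    def test_point_estimate_inside_ci(self):
        # Placeholder metric value and bootstrap confidence interval.
        val, ci = 0.75, (0.71, 0.79)

        # Before: hardcoded CI bounds, which are brittle because
        # bootstrapping yields slightly different intervals each run:
        #   self.assertAlmostEqual(ci[0], 0.71, places=2)
        #   self.assertAlmostEqual(ci[1], 0.79, places=2)

        # After: only require that ci[0] < val < ci[1].
        self.assertGreater(val, ci[0])
        self.assertGreater(ci[1], val)

if __name__ == "__main__":
    unittest.main()
```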
Thanks @shuaichenchang! Could we handle 3d5dfec as a separate PR, as it's not related to this addition of third-party code?
Maybe approve this case and keep the habit in mind for future PRs.
I'm OK with discussing this as part of this thread this time, but I'm actually not sure if the changes to the tests are appropriate, and was hoping to have the discussion in a separate thread. These changes to the unit tests seem to loosen the check on confidence intervals significantly. I'm wondering if it's appropriate to do so, or if the tests are failing because the code is broken and we need to fix the underlying code. @pfliu-nlp: I think this is code that you implemented, could you confirm? If the code is not broken and the confidence intervals simply vary run-to-run due to variance in bootstrapping, then I think it would be better to modify the tests so that they cover a "reasonable" range (e.g. so that they'll fail only a tiny fraction of the time), but not loosen them so much that they're basically not checking the underlying code.
Yes, @shuaichenchang has consulted me offline; the current modified implementation is something I suggested.
I think the current implementation is fine. In a lot of other places, we don't even check the CI at all.
Thanks for the clarification, Pengfei!

> I think the current implementation is fine. In a lot of other places, we don't even check the CI at all.

That's true, but this lack of appropriate checks has also been a source of bugs in the past.

Are you certain that the implementation is correct, and that the failed tests here are just a result of variance in bootstrapping or something similar? If so, then I'd be OK with approving this PR, but I suggest we at least leave a TODO there so we take note of this discussion for the future, as I suggested in the revisions here.
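As a sketch of the "reasonable range" check suggested above, one option is to pin the CI bounds within a tolerance wide enough to absorb bootstrap variance but tight enough to catch real regressions. The expected values and tolerance below are illustrative, not the actual numbers from the test:

```python
import unittest

class ReasonableRangeCITest(unittest.TestCase):
    def test_ci_bounds_within_tolerance(self):
        ci = (0.712, 0.791)  # placeholder bootstrap CI
        # The tolerance should make the test fail only a tiny fraction
        # of the time while still detecting genuinely broken code.
        self.assertAlmostEqual(ci[0], 0.71, delta=0.02)
        self.assertAlmostEqual(ci[1], 0.79, delta=0.02)

if __name__ == "__main__":
    unittest.main()
```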
Thanks for the suggestion, which looks good.
Yeah, I cannot think of other potential reasons (but that's just from my side). Additionally, some other context: this task is relatively challenging to add, so I tend to try to reduce blockers so that @shuaichenchang finds it relatively easy to add things. If there are places that could be further improved, we can leave them as issues, and then I, for example, could refine them later.
Yeah, I understand the importance of being pragmatic. I created an issue to double-check this, though: #537. A difference of 0.08 in the confidence values seems a bit bigger than I would expect from our bootstrapping, so it'd be nice to know for sure that this is not a bug.
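One way to check whether a 0.08 shift can come from bootstrap variance alone is to re-run the bootstrap with different seeds and measure how much the bounds move. A minimal sketch, assuming a simple percentile bootstrap over per-example scores (not the project's actual implementation):

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `scores`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic per-example scores stand in for real metric values.
data_rng = random.Random(42)
scores = [data_rng.gauss(0.75, 0.2) for _ in range(100)]

# Re-run the bootstrap with different seeds and measure how much the
# CI bounds move between runs. A spread on the order of 0.08 would
# suggest variance alone explains the failure; a much smaller spread
# would point to a possible bug.
lows, highs = zip(*(bootstrap_ci(scores, seed=s) for s in range(10)))
print(max(lows) - min(lows), max(highs) - min(highs))
```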
One observation: the tests pass on Python 3.9 but fail on Python 3.10.
@neubig @pfliu-nlp Since the issue was created, let's focus on the original change in this PR. @shuaichenchang Could you re-request reviews from @neubig and @odashi from the top-right of this page? That encourages me to continue the review as fast as possible.
…e needs to be addressed later
It seems the ignore setting for third-party code in the configuration isn't working.
@shuaichenchang It seems the configuration file needs to be updated. Let me take a look. I'll fix it in a separate PR.
@shuaichenchang The fix was merged into main.
@shuaichenchang @neubig @pfliu-nlp I created a separate PR #550 to fix the configuration file.
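For reference, vendored third-party code is usually excluded from lint checks with an exclude entry in the flake8 configuration. The excerpt below is a hypothetical sketch; the actual configuration file and directory name in this repo may differ.

```ini
# Hypothetical setup.cfg / .flake8 excerpt; the real directory name
# for the vendored third-party code in this repo may differ.
[flake8]
extend-exclude =
    third_party/
```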
I tested this branch locally and it passes all tests; it's strange that there is a markdown link error here. Maybe it needs a re-commit? (cc @tetsuok)
@pfliu-nlp It's because this branch is behind main.
@tetsuok I see (I assumed that @shuaichenchang had merged the latest main into this PR, but it seems he didn't); let me do it.
Add third-party code for text-to-sql evaluation. The code is adapted from https://github.com/taoyds/test-suite-sql-eval