
add third party code for text-to-sql evaluation #532

Merged: 7 commits merged into main on Oct 9, 2022

Conversation

@shuaichenchang (Collaborator)

Add third-party code for text-to-sql evaluation. The code is adapted from https://github.com/taoyds/test-suite-sql-eval

@shuaichenchang shuaichenchang marked this pull request as draft October 6, 2022 00:27
@shuaichenchang shuaichenchang marked this pull request as ready for review October 6, 2022 01:12
…to evaluation.py, evaluation_test.py to be consistent with the original third-party repo
@shuaichenchang (Collaborator, Author)

Modified the README and renamed the files sql_evaluation.py => evaluation.py and sql_evaluation_test.py => evaluation_test.py to match the code in the third-party repo. We had used different file names for historical reasons; now we have changed them back to match the files in the original repo.

…text_to_sql_evaluation

merge main into my branch
…test. change assertAlmostEqual(ci[0],...), assertAlmostEqual(ci[1],...) to assertGreater(val, ci[0]) and assertGreater(ci[1], val)
@shuaichenchang (Collaborator, Author)

Changed the test criteria in meta_eval_wmt_da_test.py so it passes: use the criterion ci[0] < val < ci[1] instead of hardcoding the values of ci[0] and ci[1]. E.g., changed

```python
self.assertAlmostEqual(ci[0], 0.6488, 2)
self.assertAlmostEqual(ci[1], 0.8999, 2)
```

to

```python
self.assertGreater(val, ci[0])
self.assertGreater(ci[1], val)
```
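For reference, a minimal self-contained sketch of what the revised check looks like (the class name and the stand-in values here are hypothetical; the real test computes val and ci from the metric):

```python
import unittest


class MetaEvalTest(unittest.TestCase):  # hypothetical test class
    def test_value_inside_ci(self):
        # Stand-ins for the values the real test computes:
        val = 0.7744           # point estimate of the metric
        ci = (0.6488, 0.8999)  # bootstrap confidence interval (low, high)
        # Looser but stable across runs: require only ci[0] < val < ci[1]
        # instead of pinning ci[0] and ci[1] to hardcoded constants.
        self.assertGreater(val, ci[0])
        self.assertGreater(ci[1], val)


if __name__ == "__main__":
    unittest.main()
```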

@neubig (Contributor) commented Oct 7, 2022

Thanks @shuaichenchang ! Could we handle 3d5dfec as a separate PR, as it's not related to this addition of third-party code?

@pfliu-nlp (Collaborator)

> Thanks @shuaichenchang ! Could we handle 3d5dfec as a separate PR, as it's not related to this addition of third-party code?

Maybe we can approve this one and keep the habit in mind for future PRs.

@neubig (Contributor) commented Oct 7, 2022

I'm OK with discussing it as part of this thread this time. But I'm actually not sure whether the changes to the tests are appropriate, and I was hoping to have that discussion in the separate thread.

These changes to the unit tests seem to be loosening the check on confidence intervals significantly. I'm wondering if it's appropriate to do so, or if the tests are failing because the code is broken and we need to fix the underlying code. @pfliu-nlp: I think this is code that you implemented; could you confirm that?

If the code is not broken and it's just the case that the confidence intervals vary run-to-run due to some variance in bootstrapping, then I think it would be better to modify the tests so that they cover a "reasonable" range (e.g. so that they'll fail only a tiny fraction of the time) but not loosen them so much that they're basically not checking the underlying code.
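As a hypothetical illustration of such a "reasonable" range (the delta of 0.05 is an assumed margin, to be tuned from the observed run-to-run variance, not a value from this PR):

```python
import unittest


class ReasonableRangeTest(unittest.TestCase):  # hypothetical test class
    def test_ci_in_reasonable_range(self):
        ci = (0.66, 0.89)  # stand-in for the computed bootstrap interval
        # Keep the old reference endpoints, but widen the tolerance from
        # places=2 to an assumed delta of 0.05: loose enough to absorb
        # bootstrap noise, tight enough to catch real regressions.
        self.assertAlmostEqual(ci[0], 0.6488, delta=0.05)
        self.assertAlmostEqual(ci[1], 0.8999, delta=0.05)
```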

@pfliu-nlp (Collaborator) commented Oct 7, 2022

"These changes to the unit tests seem to be loosening the check on confidence intervals significantly. I'm wondering if it's appropriate to do so, or if the tests are failing because the code is broken and we need to fix the underlying code. @pfliu-nlp : I think this is code that you implemented, could you confirm that?"

Yes, @shuaichenchang has consulted me offline; the current modified implementation is something I suggested.

"not loosen them so much so that they're basically not checking the underlying code."

I think the current implementation is fine. In a lot of other places, we don't even check the ci at all.

@neubig (Contributor) left a review comment

Thanks for the clarification Pengfei!

> I think the current implementation is fine. In a lot of other places, we don't even check the ci at all.

That's true, but this lack of appropriate checks has also been a source of bugs in the past.

Are you certain that the implementation is correct, and that the failed tests here are just a result of variance in bootstrapping or something similar? If so, I'd be OK with approving this PR, but I suggest that we at least leave a TODO there so we take note of this discussion for the future, as I suggested in the revisions here.

Review comments on integration_tests/meta_eval_wmt_da_test.py (outdated, resolved)
@pfliu-nlp (Collaborator)

Thanks for the suggestion, which looks good.

"Are you certain that the implementation is correct, and the failed tests here are just a result of variance in bootstrapping or something similar? If so, then I'd be OK with approving this PR, but suggest that we at least leave a TODO there so we know take a note of this discussion for the future"

Yeah, I cannot think of any other potential reasons (but that's just from my side).
I think having a simple solution first, plus creating an issue or TODO, is sometimes a good idea! Let's go with it.

Additionally, some other context: this task is relatively challenging to add, so I tend to try to reduce blockers so that @shuaichenchang can add these things relatively easily. If there are places that could be further improved, we can leave them as issues, and then I, for example, could refine them later.

@neubig (Contributor) commented Oct 7, 2022

Yeah, I understand the importance of being pragmatic. I created an issue to double-check this though: #537

A difference of 0.08 in the confidence values seems a bit bigger than I would expect from our bootstrapping, so it'd be nice to know for sure that this is not a bug.
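One hypothetical way to settle this empirically: rerun a bootstrap of similar shape under several seeds and measure how far the endpoints actually move (the CI routine below is a generic percentile bootstrap, not the ExplainaBoard implementation):

```python
import random
import statistics


def bootstrap_ci(values, rng, n_resamples=1000, alpha=0.05):
    # Generic percentile-bootstrap CI over the mean; a stand-in only.
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    return (means[int(n_resamples * alpha / 2)],
            means[int(n_resamples * (1 - alpha / 2)) - 1])


data_rng = random.Random(0)
scores = [data_rng.random() for _ in range(200)]  # dummy scores
cis = [bootstrap_ci(scores, random.Random(seed)) for seed in range(20)]
lows, highs = zip(*cis)
# If the spread printed here is far below 0.08, a 0.08 shift in the real
# test would point at something other than resampling variance.
print(max(lows) - min(lows), max(highs) - min(highs))
```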

@pfliu-nlp (Collaborator)

> Yeah, I understand the importance of being pragmatic. I created an issue to double-check this though: #537
>
> A difference of 0.08 in the confidence values seems a bit bigger than I would expect from our bootstrapping, so it'd be nice to know for sure that this is not a bug.

One observation: the branch can pass the test on Python 3.9 while failing on Python 3.10.
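A hypothetical diagnostic for that version gap: with a fixed seed, random.Random yields the same resample sequence on both interpreters, so running the same script under each version shows whether the difference survives identical sampling:

```python
import random
import statistics
import sys

# Run this same script under python3.9 and python3.10. With the seed
# fixed, the resamples are identical across versions, so differing
# endpoints here would rule out sampling noise as the cause.
rng = random.Random(1234)
data = [i / 100 for i in range(100)]  # dummy scores
means = sorted(
    statistics.mean(rng.choices(data, k=len(data))) for _ in range(1000)
)
# Indices 25 and 974 are the 2.5th / 97.5th percentile endpoints.
print(sys.version_info[:2], round(means[25], 6), round(means[974], 6))
```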

@odashi (Contributor) commented Oct 7, 2022

@neubig @pfliu-nlp Now that the issue has been created, we can focus on the original change in this PR.

@shuaichenchang Could you re-request reviews from @neubig and @odashi from the top-right side of this page? It helps me continue the review as quickly as possible.

@shuaichenchang (Collaborator, Author) commented Oct 8, 2022

It seems the third-party ignore rule in .markdownlint.yaml doesn't skip the checks on files in the third-party folder. @tetsuok
https://github.com/neulab/ExplainaBoard/blob/main/.markdownlint.yaml#L27

@tetsuok (Contributor) commented Oct 9, 2022

@shuaichenchang It seems the configuration file needs to be updated. Let me take a look; I'll fix it in a separate PR.

@tetsuok (Contributor) commented Oct 9, 2022

@shuaichenchang The fix was merged into main. You can merge main into this branch to fix the markdownlint error.

@tetsuok (Contributor) commented Oct 9, 2022

@shuaichenchang @neubig @pfliu-nlp I created a separate PR #550 to fix integration_tests/meta_eval_wmt_da_test.py as soon as possible, because I'm observing more frequent test failures in other pull requests, which is becoming a drain on our productivity.

@pfliu-nlp (Collaborator)

I tested this branch locally and it passed all checks; it's strange that there is a markdownlint error here. Maybe it needs a re-commit? (cc @tetsuok)

```text
black....................................................................Passed
flake8...................................................................Passed
docstring................................................................Passed
isort....................................................................Passed
Upgrade type hints.......................................................Passed
mypy.....................................................................Passed
markdownlint-cli2........................................................Passed
markdownlint-cli2-fix....................................................Passed
```

@tetsuok (Contributor) commented Oct 9, 2022

@pfliu-nlp It's because this branch is behind main. We need to merge main into this branch and push the commit to the remote.

@pfliu-nlp (Collaborator)

@tetsuok I see (I assumed that @shuaichenchang had merged the latest main into this PR, but it seems he hadn't); let me do it.

@odashi odashi merged commit 57359c4 into main Oct 9, 2022
@odashi odashi deleted the third_party_code_for_text_to_sql_evaluation branch October 9, 2022 20:11