add third party code for text-to-sql evaluation #532
Conversation
…to evaluation.py, evaluation_test.py to be consistent with the original third-party repo
Modified the README and renamed files sql_evaluation.py => evaluation.py, sql_evaluation_test.py => evaluation_test.py to match the code in the third-party repo. We had used different file names for historical reasons; now we have changed them back to match the files in the original repo.
…text_to_sql_evaluation merge main into my branch
…test. change assertAlmostEqual(ci[0], ...), assertAlmostEqual(ci[1], ...) to assertGreater(val, ci[0]) and assertGreater(ci[1], val)
Change the test criteria in meta_eval_wmt_da_test.py so that the test passes. Use the criterion ci[0] < val < ci[1] instead of hardcoding the values of ci[0] and ci[1].
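For illustration, a minimal sketch of the assertion change described above, assuming a unittest-style test; the names and values below are placeholders, not the actual ones in meta_eval_wmt_da_test.py:

```python
import unittest

class BootstrapCITest(unittest.TestCase):
    def test_point_estimate_inside_ci(self):
        # Placeholder metric value and bootstrap confidence interval.
        val, ci = 0.75, (0.71, 0.79)

        # Before: hardcoded CI bounds, which are brittle because
        # bootstrapping yields slightly different intervals each run:
        #   self.assertAlmostEqual(ci[0], 0.71, places=2)
        #   self.assertAlmostEqual(ci[1], 0.79, places=2)

        # After: only require that ci[0] < val < ci[1].
        self.assertGreater(val, ci[0])
        self.assertGreater(ci[1], val)

if __name__ == "__main__":
    unittest.main()
```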
Thanks @shuaichenchang! Could we handle 3d5dfec as a separate PR, as it's not related to this addition of third-party code?
Maybe approve this case and keep the habit in mind for future PRs.
I'm OK with discussing this as part of this thread this time, but I'm actually not sure if the changes to the tests are appropriate, and was hoping to have the discussion in a separate thread. These changes to the unit tests seem to loosen the check on confidence intervals significantly. I'm wondering if it's appropriate to do so, or if the tests are failing because the code is broken and we need to fix the underlying code. @pfliu-nlp: I think this is code that you implemented, could you confirm? If the code is not broken and the confidence intervals simply vary run-to-run due to variance in bootstrapping, then I think it would be better to modify the tests so that they cover a "reasonable" range (e.g. so that they'll fail only a tiny fraction of the time), but not loosen them so much that they're basically not checking the underlying code.
Yes, @shuaichenchang has consulted me offline; the current modified implementation is something I suggested.
I think the current implementation is fine. In a lot of other places, we don't even check the CI at all.
Thanks for the clarification, Pengfei!

> I think the current implementation is fine. In a lot of other places, we don't even check the CI at all.

That's true, but this lack of appropriate checks has also been a source of bugs in the past.

Are you certain that the implementation is correct, and that the failed tests here are just a result of variance in bootstrapping or something similar? If so, then I'd be OK with approving this PR, but I suggest we at least leave a TODO there so we take note of this discussion for the future, as I suggested in the revisions here.
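As a sketch of the "reasonable range" check suggested above, one option is to pin the CI bounds within a tolerance wide enough to absorb bootstrap variance but tight enough to catch real regressions. The expected values and tolerance below are illustrative, not the actual numbers from the test:

```python
import unittest

class ReasonableRangeCITest(unittest.TestCase):
    def test_ci_bounds_within_tolerance(self):
        ci = (0.712, 0.791)  # placeholder bootstrap CI
        # The tolerance should make the test fail only a tiny fraction
        # of the time while still detecting genuinely broken code.
        self.assertAlmostEqual(ci[0], 0.71, delta=0.02)
        self.assertAlmostEqual(ci[1], 0.79, delta=0.02)

if __name__ == "__main__":
    unittest.main()
```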
Thanks for the suggestion, which looks good.
Yeah, I cannot think of other potential reasons (but that's just from my side). Additionally, some other context: this task is relatively challenging to add, so I tend to try to reduce blockers so that @shuaichenchang finds it relatively easy to add things. If there are places that could be further improved, we can leave them as issues, and then I, for example, could refine them later.
Yeah, I understand the importance of being pragmatic. I created an issue to double-check this, though: #537. A difference of 0.08 in the confidence values seems a bit bigger than I would expect from our bootstrapping, so it'd be nice to know for sure that this is not a bug.
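One way to check whether a 0.08 shift can come from bootstrap variance alone is to re-run the bootstrap with different seeds and measure how much the bounds move. A minimal sketch, assuming a simple percentile bootstrap over per-example scores (not the project's actual implementation):

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `scores`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic per-example scores stand in for real metric values.
data_rng = random.Random(42)
scores = [data_rng.gauss(0.75, 0.2) for _ in range(100)]

# Re-run the bootstrap with different seeds and measure how much the
# CI bounds move between runs. A spread on the order of 0.08 would
# suggest variance alone explains the failure; a much smaller spread
# would point to a possible bug.
lows, highs = zip(*(bootstrap_ci(scores, seed=s) for s in range(10)))
print(max(lows) - min(lows), max(highs) - min(highs))
```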
One observation: the tests pass on Python 3.9 but fail on Python 3.10.
@neubig @pfliu-nlp Since the issue was created, let's focus on the original change in this PR. @shuaichenchang Could you re-request reviews from @neubig and @odashi from the top-right of this page? That encourages me to continue the review as fast as possible.
…e needs to be addressed later
It seems the ignore setting for third-party code in the configuration isn't working.
@shuaichenchang It seems the configuration file needs to be updated. Let me take a look. I'll fix it in a separate PR.
@shuaichenchang The fix was merged into main.
@shuaichenchang @neubig @pfliu-nlp I created a separate PR #550 to fix the configuration file.
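For reference, vendored third-party code is usually excluded from lint checks with an exclude entry in the flake8 configuration. The excerpt below is a hypothetical sketch; the actual configuration file and directory name in this repo may differ.

```ini
# Hypothetical setup.cfg / .flake8 excerpt; the real directory name
# for the vendored third-party code in this repo may differ.
[flake8]
extend-exclude =
    third_party/
```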
I tested this branch locally and it passes all tests; it's strange that there is a markdown link error here. Maybe it needs a re-commit? (cc @tetsuok)
@pfliu-nlp It's because this branch is behind main.
@tetsuok I see (I assumed that @shuaichenchang had merged the latest main into this PR, but it seems he didn't); let me do it.
Add third-party code for text-to-sql evaluation. The code is adapted from https://github.com/taoyds/test-suite-sql-eval