
Replace existing port tests with automated regression tests #42

Open
aazuspan opened this issue Jul 4, 2023 · 6 comments
Labels
enhancement New feature or request testing Related to test code or data

Comments

@aazuspan
Contributor

aazuspan commented Jul 4, 2023

Currently, we test our estimator accuracy against manually generated results from yaImpute and pynnmap. Once we have all major functionality implemented and are confident in our results, we can switch to regression testing to ensure no errors are introduced. This was briefly discussed in #40:

Eventually, I suppose we will reach a point where we're confident everything is working and could theoretically test against previous versions of sknnr ... pytest-regressions and syrupy both look like interesting tools that might solve this problem for us in the future by testing against automatically generated results.

We'll need to do some experimenting to find the right tool for our use case, but the advantage should be a massive simplification of our testing system (e.g. the removal of KNNTestDataset) and the ability to easily test against a wider range of parameters (e.g. values of n_neighbors and n_components).
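The core idea is simple enough to sketch in a few lines, independent of any particular plugin (everything below is illustrative, not a proposed API): record a baseline output on the first run, then compare later runs against it. Tools like syrupy and pytest-regressions automate exactly this bookkeeping, including regenerating baselines via a CLI flag.

```python
# Minimal sketch of regression testing, independent of any plugin:
# the first run records a baseline, later runs compare against it.
import json
from pathlib import Path

import numpy as np


def check_regression(name: str, result: np.ndarray, snapshot_dir: Path) -> bool:
    """Record `result` on the first run; compare against the stored copy after."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    snapshot = snapshot_dir / f"{name}.json"
    if not snapshot.exists():
        # First run establishes the baseline
        snapshot.write_text(json.dumps(np.asarray(result).tolist()))
        return True
    stored = np.asarray(json.loads(snapshot.read_text()))
    return bool(np.allclose(result, stored))
```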

@aazuspan aazuspan added the testing Related to test code or data label Jul 4, 2023
@aazuspan aazuspan added this to the 0.1.0 milestone Jul 4, 2023
@aazuspan aazuspan added the enhancement New feature or request label Jul 4, 2023
@grovduck
Member

I started experimenting a bit with syrupy, and it seems like a pretty straightforward API. We may need to put special care into how we serialize big arrays, since syrupy's default serializer appears to store each element on a separate line in the resulting .ambr file. pytest-regressions might have better support for the types of data we'll want to test (i.e. arrays and dataframes), or we could use JSON serialization together with syrupy to make the output more compact. I was playing around with orjson and this test:

import numpy as np
import orjson

def test_arr_1(snapshot):
    # OPT_SERIALIZE_NUMPY lets orjson serialize the array directly to
    # compact JSON, which syrupy's `snapshot` fixture stores on one line
    arr = np.arange(24).reshape(4, 6)
    assert orjson.dumps(arr, option=orjson.OPT_SERIALIZE_NUMPY) == snapshot

results in this serialization:

# name: test_arr_1
  b'[[0,1,2,3,4,5],[6,7,8,9,10,11],[12,13,14,15,16,17],[18,19,20,21,22,23]]'
# ---

whereas this test:

from numpy.testing import assert_array_equal

def test_arr_2(snapshot):
    arr = np.arange(24).reshape(4, 6)
    assert_array_equal(arr, snapshot)

results in this snipped serialization:

# name: test_arr_2
  0
# ---
# name: test_arr_2.1
  1
# ---
# name: test_arr_2.10
  10
# ---
<snip>
# name: test_arr_2.7
  7
# ---
# name: test_arr_2.8
  8
# ---
# name: test_arr_2.9
  9
# ---

@aazuspan
Contributor Author

Wow, that default serialization for arrays looks horribly inefficient! I like your solution, although if we can avoid having to manually serialize altogether by going with a different tool, I'm definitely open to that. I can't remember why now, but I also got the impression that pytest-regressions looked like a better fit than syrupy when I was briefly poking around.

@grovduck
Member

@aazuspan, I've been playing around with pytest-regressions on the synthetic-knn project and I like it so far. NumPy arrays are stored as binary .npz files in a separate test directory, and tests seem pretty fast from what I can tell. So I'm happy moving forward with this issue using pytest-regressions. I'm guessing we'd probably want to use the dataframe_regression fixture for the yaImpute generated files, though.
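For context, the .npz round-trip behind that array storage (presumably pytest-regressions' array fixture) is easy to picture. A hedged sketch, where the helper names are mine for illustration and not the plugin's API:

```python
# Sketch of an .npz baseline round-trip like the one pytest-regressions
# performs for arrays (helper names are illustrative, not the plugin API).
from pathlib import Path

import numpy as np


def save_baseline(path: Path, **arrays: np.ndarray) -> None:
    """Persist named arrays as a compressed .npz baseline file."""
    np.savez_compressed(path, **arrays)


def matches_baseline(path: Path, **arrays: np.ndarray) -> bool:
    """Check each named array against the stored baseline within tolerance."""
    with np.load(path) as stored:
        return all(
            key in stored.files and np.allclose(stored[key], value)
            for key, value in arrays.items()
        )
```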

ps. At the risk of overburdening you, please feel free to weigh in on any issues or PRs that I'm putting up over in synthetic-knn. I'll try to be mostly independent on this one, unless it piques your interest and then I'd be overjoyed if you want to jump in 😄.

@aazuspan
Contributor Author

That's awesome @grovduck! I took a quick look and the regression test looks remarkably simple to implement!

I'm guessing we'd probably want to use the dataframe_regression fixture for the yaImpute generated files, though.

Interesting, so we would still be using the yaImpute outputs as a reference with pytest-regressions? My hope was that we could eventually switch to just testing against previous sknnr outputs once we're confident in all the porting, but I think you have a much clearer idea of the capabilities of pytest-regressions and how to use it than I do now!

ps. At the risk of overburdening you, please feel free to weigh in on any issues or PRs that I'm putting up over in synthetic-knn. I'll try to be mostly independent on this one, unless it piques your interest and then I'd be overjoyed if you want to jump in 😄.

I'm very curious, so I'll keep an eye on development over there!

@grovduck
Member

Interesting, so we would still be using the yaImpute outputs as a reference with pytest-regressions? My hope was that we could eventually switch to just testing against previous sknnr outputs once we're confident in all the porting, but I think you have a much clearer idea of the capabilities of pytest-regressions and how to use it than I do now!

Sorry, that wasn't very clear. I think we'd want to do the following:

  1. Remove the yaImpute generated test files from the repo (but don't yet delete them)

  2. Change the tests to generate the regression files automatically

  3. Do a one-time comparison between the yaImpute files and the pytest-regression files to ensure that they are indeed "identical" (identical will probably be subject to rounding and precision). The nice thing is that the dataframe_regression fixture persists the data as CSV, so this dataframe:

    import pandas as pd

    df = pd.DataFrame(
        {
            "x": [1, 2, 3, 4],
            "y": [5, 6, 7, 8],
            "z": [9, 10, 11, 12],
        }
    ).rename_axis("ID")
    

    becomes the CSV file:

    ID,x,y,z
    0,1,5,9
    1,2,6,10
    2,3,7,11
    3,4,8,12
    

    so it should be pretty easy to do the comparison between the yaImpute files and the saved regression files.

  4. Commit the new pytest-regression files and delete the yaImpute files
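Since both sides of step 3 are plain CSV, the one-time comparison could be a few lines of pandas. A sketch (file paths are illustrative, and the tolerance would need tuning against the real files):

```python
# Sketch of the one-time check between a yaImpute-generated CSV and the
# CSV persisted by dataframe_regression (paths/names are illustrative).
import pandas as pd


def frames_match(yaimpute_csv: str, regression_csv: str, rtol: float = 1e-5) -> bool:
    """True if both CSVs hold the same data up to rounding/precision."""
    expected = pd.read_csv(yaimpute_csv, index_col=0)
    observed = pd.read_csv(regression_csv, index_col=0)
    try:
        # check_exact=False compares values within the relative tolerance
        pd.testing.assert_frame_equal(expected, observed, check_exact=False, rtol=rtol)
        return True
    except AssertionError:
        return False
```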

There's a chance that we'll need to go back to creating other files through yaImpute for other estimators, but we won't plan to have those enter the sknnr repo; instead, we'll just compare against the generated content (as in step 3). Does that make sense?

@aazuspan
Contributor Author

Ah, that makes perfect sense! Thanks for laying out the game plan.
