
Replace existing port tests with automated regression tests #42

Open
aazuspan opened this issue Jul 4, 2023 · 6 comments
Labels
enhancement New feature or request testing Related to test code or data

Comments

@aazuspan
Contributor

aazuspan commented Jul 4, 2023

Currently, we test our estimator accuracy against manually generated results from yaImpute and pynnmap. Once we have all major functionality implemented and are confident in our results, we can switch to regression testing to ensure no errors are introduced. This was briefly discussed in #40:

Eventually, I suppose we will reach a point where we're confident everything is working and could theoretically test against previous versions of sknnr ... pytest-regressions and syrupy both look like interesting tools that might solve this problem for us in the future by testing against automatically generated results.

We'll need to do some experimenting to find the right tool for our use case, but the advantage should be a massive simplification of our testing system (e.g. the removal of KNNTestDataset) and the ability to easily test against a wider range of parameters (e.g. values of n_neighbors and n_components).
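The core idea is simple enough to sketch in a few lines, independent of any particular plugin (everything below is illustrative, not a proposed API): record a baseline output on the first run, then compare later runs against it. Tools like syrupy and pytest-regressions automate exactly this bookkeeping, including regenerating baselines via a CLI flag.

```python
# Minimal sketch of regression testing, independent of any plugin:
# the first run records a baseline, later runs compare against it.
import json
from pathlib import Path

import numpy as np


def check_regression(name: str, result: np.ndarray, snapshot_dir: Path) -> bool:
    """Record `result` on the first run; compare against the stored copy after."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    snapshot = snapshot_dir / f"{name}.json"
    if not snapshot.exists():
        # First run establishes the baseline
        snapshot.write_text(json.dumps(np.asarray(result).tolist()))
        return True
    stored = np.asarray(json.loads(snapshot.read_text()))
    return bool(np.allclose(result, stored))
```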

@aazuspan aazuspan added the testing Related to test code or data label Jul 4, 2023
@aazuspan aazuspan added this to the 0.1.0 milestone Jul 4, 2023
@aazuspan aazuspan added the enhancement New feature or request label Jul 4, 2023
@grovduck
Member

I started experimenting a bit with syrupy, and it seems like a pretty straightforward API. We may need to put special care into how we serialize big arrays, since syrupy's default serializer appears to store each element on a separate line in the resulting .ambr file. pytest-regressions might have better support for the types of data we'll want to test (i.e. arrays and dataframes), or we could use JSON serialization together with syrupy to make the output more compact. I was playing around with orjson and this test:

import numpy as np
import orjson

def test_arr_1(snapshot):
    # OPT_SERIALIZE_NUMPY lets orjson serialize the array directly to
    # compact JSON, which syrupy's `snapshot` fixture stores on one line
    arr = np.arange(24).reshape(4, 6)
    assert orjson.dumps(arr, option=orjson.OPT_SERIALIZE_NUMPY) == snapshot

results in this serialization:

# name: test_arr_1
  b'[[0,1,2,3,4,5],[6,7,8,9,10,11],[12,13,14,15,16,17],[18,19,20,21,22,23]]'
# ---

whereas this test:

from numpy.testing import assert_array_equal

def test_arr_2(snapshot):
    arr = np.arange(24).reshape(4, 6)
    assert_array_equal(arr, snapshot)

results in this snipped serialization:

# name: test_arr_2
  0
# ---
# name: test_arr_2.1
  1
# ---
# name: test_arr_2.10
  10
# ---
<snip>
# name: test_arr_2.7
  7
# ---
# name: test_arr_2.8
  8
# ---
# name: test_arr_2.9
  9
# ---

@aazuspan
Contributor Author

Wow, that default serialization for arrays looks horribly inefficient! I like your solution, although if we can avoid having to manually serialize altogether by going with a different tool, I'm definitely open to that. I can't remember why now, but I also got the impression that pytest-regressions looked like a better fit than syrupy when I was briefly poking around.

@grovduck
Member

@aazuspan, I've been playing around with pytest-regressions on the synthetic-knn project and I like it so far. NumPy arrays are stored as binary .npz files in a separate test directory, and tests seem pretty fast from what I can tell. So I'm happy moving forward with this issue using pytest-regressions. I'm guessing we'd probably want to use the dataframe_regression fixture for the yaImpute generated files, though.
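For context, the .npz round-trip behind that array storage (presumably pytest-regressions' array fixture) is easy to picture. A hedged sketch, where the helper names are mine for illustration and not the plugin's API:

```python
# Sketch of an .npz baseline round-trip like the one pytest-regressions
# performs for arrays (helper names are illustrative, not the plugin API).
from pathlib import Path

import numpy as np


def save_baseline(path: Path, **arrays: np.ndarray) -> None:
    """Persist named arrays as a compressed .npz baseline file."""
    np.savez_compressed(path, **arrays)


def matches_baseline(path: Path, **arrays: np.ndarray) -> bool:
    """Check each named array against the stored baseline within tolerance."""
    with np.load(path) as stored:
        return all(
            key in stored.files and np.allclose(stored[key], value)
            for key, value in arrays.items()
        )
```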

ps. At the risk of overburdening you, please feel free to weigh in on any issues or PRs that I'm putting up over in synthetic-knn. I'll try to be mostly independent on this one, unless it piques your interest and then I'd be overjoyed if you want to jump in 😄.

@aazuspan
Contributor Author

That's awesome @grovduck! I took a quick look and the regression test looks remarkably simple to implement!

I'm guessing we'd probably want to use the dataframe_regression fixture for the yaImpute generated files, though.

Interesting, so we would still be using the yaImpute outputs as a reference with pytest-regressions? My hope was that we could eventually switch to just testing against previous sknnr outputs once we're confident in all the porting, but I think you have a much clearer idea of the capabilities of pytest-regressions and how to use it than I do now!

ps. At the risk of overburdening you, please feel free to weigh in on any issues or PRs that I'm putting up over in synthetic-knn. I'll try to be mostly independent on this one, unless it piques your interest and then I'd be overjoyed if you want to jump in 😄.

I'm very curious, so I'll keep an eye on development over there!

@grovduck
Member

Interesting, so we would still be using the yaImpute outputs as a reference with pytest-regressions? My hope was that we could eventually switch to just testing against previous sknnr outputs once we're confident in all the porting, but I think you have a much clearer idea of the capabilities of pytest-regressions and how to use it than I do now!

Sorry, that wasn't very clear. I think we'd want to do the following:

  1. Remove the yaImpute generated test files from the repo (but don't yet delete them)

  2. Change the tests to generate the regression files automatically

  3. Do a one-time comparison between the yaImpute files and the pytest-regression files to ensure that they are indeed "identical" (identical will probably be subject to rounding and precision). The nice thing is that the dataframe_regression fixture persists the data as CSV, so this dataframe:

    import pandas as pd

    df = pd.DataFrame(
        {
            "x": [1, 2, 3, 4],
            "y": [5, 6, 7, 8],
            "z": [9, 10, 11, 12],
        }
    ).rename_axis("ID")
    

    becomes the CSV file:

    ID,x,y,z
    0,1,5,9
    1,2,6,10
    2,3,7,11
    3,4,8,12
    

    so it should be pretty easy to do the comparison between the yaImpute files and the saved regression files.

  4. Commit the new pytest-regression files and delete the yaImpute files
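Since both sides of step 3 are plain CSV, the one-time comparison could be a few lines of pandas. A sketch (file paths are illustrative, and the tolerance would need tuning against the real files):

```python
# Sketch of the one-time check between a yaImpute-generated CSV and the
# CSV persisted by dataframe_regression (paths/names are illustrative).
import pandas as pd


def frames_match(yaimpute_csv: str, regression_csv: str, rtol: float = 1e-5) -> bool:
    """True if both CSVs hold the same data up to rounding/precision."""
    expected = pd.read_csv(yaimpute_csv, index_col=0)
    observed = pd.read_csv(regression_csv, index_col=0)
    try:
        # check_exact=False compares values within the relative tolerance
        pd.testing.assert_frame_equal(expected, observed, check_exact=False, rtol=rtol)
        return True
    except AssertionError:
        return False
```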

There's a chance that we'll need to go back to creating other files through yaImpute for other estimators, but we won't plan to have those enter the sknnr repo; instead, we'll just compare against the generated content (as in step 3). Does that make sense?

@aazuspan
Contributor Author

Ah, that makes perfect sense! Thanks for laying out the game plan.
