Formalize a test plan #2895

Open · dbluhm opened this issue Apr 16, 2024 · 8 comments

@dbluhm (Member) commented Apr 16, 2024

On the ACA-PUG call on 2024-04-16, one topic of discussion was that there is a gap in our testing strategy. We go through a process of merging changes and bug fixes, publish a number of release candidates, and yet we still often find bugs shortly after the official release.

To some extent, finding bugs in a release is unavoidable. However, by tuning our testing strategy, we should be able to catch more bugs sooner in the release process.

This issue is intended as a starting point for gathering ideas on how we can formalize our testing strategy to produce more resilient releases.

Tagging a few key contributors:
@amanji @ianco @jamshale @swcurran @nodlesh @ff137

@ianco (Member) commented Apr 16, 2024

There are a few test processes that we use:

  • aca-py unit tests, which test "units" of code in isolation (mocks are used, so they provide no integration or functional coverage) and tend to be rather brittle for certain types of code changes
  • aca-py integration tests, based on the alice/faber demo code base, which test aca-py functionality by running actual aca-py instances; they take a long time to run and there are still gaps in coverage
  • AATH interoperability tests, which focus on coverage of functionality as defined in the RFCs; they can test interoperability between different agents, but also aca-py to aca-py (if a feature is implemented in aca-py but not in other agents)

AATH tests tend not to get updated as new functionality is added to aca-py unless there is a specific ticket to do so, and people (other than Sheldon and Stephen) don't regularly look at the AATH results.

My personal feeling is that:

  • We should put less focus on unit tests. They should be required where there is a specific chunk of code to be tested, but I don't agree that chasing "100% coverage" is necessarily a productive use of time. Because of the extensive use of mocks, the tests are very brittle for certain types of code changes and require a lot of time to maintain
  • For integration tests, we should run a smaller subset of the tests for each PR. (There is a lot of duplication between tests, and there is already a "@gha" tag, so we can just review the set of tests it includes; a sketch of running such a tagged subset follows this list.) For a release (and probably on a regular basis) someone should run the full set of integration tests to make sure no issues are sneaking in. I think there should be more focus on integration tests as new features are added (I'm not sure how many of the aca-py developers are familiar with the integration test framework). Also, there is a test suite that @dbluhm has developed that may be a good addition to the integration test suite. See https://github.com/Indicio-tech/acapy-minimal-example
  • Someone should regularly monitor AATH, and we should ensure that new RFC features have coverage in AATH
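
As a rough illustration of the PR-versus-release split above, a CI job could select the tagged subset using Behave's tag expressions. This is only a minimal sketch assuming a `features/` directory and a `@GHA` tag; the actual aca-py demo wraps its integration tests in its own runner scripts, so the paths and tag name here are placeholders:

```python
# run_subset.py -- minimal sketch of selecting integration test subsets by tag.
# Assumes a Behave-based suite under ./features and a "@GHA" tag on the PR subset;
# the real aca-py demo wraps this in its own runner scripts, so adjust accordingly.
import sys

from behave.__main__ import main as behave_main


def run(profile: str) -> int:
    if profile == "pr":
        # Smaller, faster subset for every pull request.
        args = ["--tags=@GHA", "--stop", "features/"]
    else:
        # Full suite for releases and scheduled runs.
        args = ["features/"]
    return behave_main(args)


if __name__ == "__main__":
    sys.exit(run(sys.argv[1] if len(sys.argv) > 1 else "pr"))
```

A PR workflow would call it with `pr`, while a nightly or release workflow would call it with anything else to get the full suite.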

@dbluhm (Member, Author) commented Apr 16, 2024

I was writing some thoughts but @ianco and I seem to be of a very similar mindset and a lot of what I had to say is redundant after his comments 😄

I fully agree with Ian's points on unit testing. We used to track test coverage, and I think that encouraged us to descend into mock hell. It's quite common for a unit test case to be 95% mock setup. And, most of the time, the construction of the mocks encodes the same faulty assumptions originally present in the code under test anyway. A bug can be present in a chunk of code, get fixed, and then the fix must be immediately followed up by correcting the unit tests that ratified the buggy behavior.
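
To make the "mock hell" pattern concrete, here is a small, entirely hypothetical example (not from the codebase) in that style: nearly all of the test is mock setup, and the assertion encodes the same faulty assumption as the code under test, so the bug is ratified rather than caught:

```python
# Hypothetical, self-contained illustration -- not an actual aca-py test.
from unittest import IsolatedAsyncioTestCase
from unittest.mock import AsyncMock, MagicMock


async def send_credential(manager, responder, cred_ex_id):
    """Code under test: buggy -- it should send cred_ex_id but sends thread_id."""
    record = await manager.get_record(cred_ex_id)
    await responder.send(record.thread_id)


class TestSendCredential(IsolatedAsyncioTestCase):
    async def test_send(self):
        # Nearly all of the test is mock setup...
        record = MagicMock(thread_id="thid", cred_ex_id="cid")
        manager = AsyncMock()
        manager.get_record.return_value = record
        responder = AsyncMock()

        await send_credential(manager, responder, "cid")

        # ...and the assertion encodes the same faulty assumption as the code,
        # so the test passes and must be rewritten when the bug is fixed.
        responder.send.assert_awaited_once_with("thid")
```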

I find the integration tests unwieldy. I understand them better now than I did in the past and have been able to make some small additions. But I admit that the strategy is quite foreign to me and I often find myself creating my own testing scenarios from scratch using tools like https://github.com/Indicio-tech/acapy-minimal-example because I can reason about what's happening more easily. I strongly suspect that this is more a manifestation of the fact that I wrote the AME tool and therefore understand it than it is a reflection of the merits of the approach.

Fully agree that AATH is underutilized right now; I am not in the habit of checking the results. AATH is how we've discovered issues with the 0.12.0 release so far, so I really ought to be paying more attention to the results.

@dbluhm (Member, Author) commented Apr 16, 2024

Testing backwards compatibility is sometimes desirable. Additionally, testing the behavior of an ACA-Py instance after an upgrade is sometimes useful. Accomplishing this could be complicated and so maybe these scenarios are better served by one-off testing?
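
One lightweight option could be a post-upgrade smoke test that assumes CI (or a one-off script) has already provisioned a wallet with the previous release and restarted the agent on the release candidate. The sketch below only checks that the upgraded agent comes up and still serves pre-existing records; the admin URL is a placeholder:

```python
# Sketch of a post-upgrade smoke check. Assumes the wallet was provisioned by the
# previous release and the agent has been restarted on the release candidate,
# with the admin API exposed at ADMIN_URL (placeholder).
import os

import requests

ADMIN_URL = os.environ.get("ADMIN_URL", "http://localhost:8031")


def test_agent_is_up_after_upgrade():
    status = requests.get(f"{ADMIN_URL}/status", timeout=10)
    assert status.status_code == 200


def test_existing_records_still_readable():
    # Records created before the upgrade should still be listed afterwards.
    resp = requests.get(f"{ADMIN_URL}/connections", timeout=10)
    assert resp.status_code == 200
    assert "results" in resp.json()
```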

@jamshale (Contributor) commented Apr 16, 2024

I agree that it's hard to write unit tests that aren't brittle and over-mocked. I still think they are valuable for smaller blocks of code, to ensure the intended logic is correct. But I agree that putting less focus on unit test coverage would be beneficial, and that brittle tests which don't add much value could potentially be removed.

The integration tests are important, but there are some gaps in coverage, and some new features are being added that aren't covered. We should have more focus on integration test coverage for new features and bug fixes.

I think we should change the PR tests to have more like one, or a few, full tests for the main use cases, such as a test that does the entire flow for an issuer, verifier, and holder, through to revocation. This could be done for a standalone agent and in multi-tenant mode, and with the main communication protocols. It would avoid the tests standing up a bunch of agents just to test connections and create credentials with different configs for every PR. Then a daily test could run through the whole GHA workflow on main, and a release could potentially trigger a workflow with even more tests.
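
As a pseudocode-level outline of what such a single full-flow PR test could look like (the fixtures and helper functions here are hypothetical placeholders that would wrap admin API calls such as /out-of-band/create-invitation, /issue-credential-2.0/send, /present-proof-2.0/send-request, and /revocation/revoke):

```python
# Outline only: "issuer", "holder", and "verifier" would be fixtures wrapping admin
# API clients, and connect/issue_credential/request_proof/revoke_credential are
# hypothetical helpers, not existing test utilities.
import pytest


@pytest.mark.parametrize("multitenant", [False, True])
def test_full_issue_verify_revoke_flow(issuer, holder, verifier, multitenant):
    """One end-to-end smoke test per PR instead of many near-duplicate scenarios."""
    connect(issuer, holder)
    connect(verifier, holder)

    cred_ex = issue_credential(issuer, holder, {"name": "Alice", "degree": "Maths"})

    # The proof should verify while the credential is live...
    assert request_proof(verifier, holder, ["name", "degree"]).verified

    # ...and fail (or report revoked) once revocation has been published.
    revoke_credential(issuer, cred_ex, publish=True)
    assert not request_proof(verifier, holder, ["name", "degree"]).verified
```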

AATH should be given more importance as well. I don't know the best way to go about this; maybe getting a consistent maintainer more involved there and dedicating a portion of their time to that project.

@nodlesh commented Apr 16, 2024

I think that unless there is more than one active maintainer on AATH, it will always be behind the implementations. I'm split between AATH and the Mobile Wallet Test Harness (including BC Wallet tests), so it's difficult to stay ahead of the curve and also maintain things to keep them running.

It is my experience that a larger devoted test team can pick up some slack, but there will still (usually) be lag behind implementation. What does seem to work, and I'm going to sound like a broken record here for those that know me, is implementors taking a Test Driven Development approach: tests are written first based on requirements/RFCs, then the implementation is done to make the tests pass, then implementation and tests are refactored as required. These tests can be at the unit, integration, or interop level. This way, as part of the process, we have tests for new or updated features. The best level to test at can be chosen by evaluating where the test will maximize TRIMS (Targeted, Reliable, Informative, Maintainable, & Speedy). It is also my opinion that the behavioural specification of tests, as in Behave (which is part of both the integration tests and AATH), is the best approach when doing TDD. I like that the specification is executable. Your mileage may vary.
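
For anyone less familiar with the executable-specification style mentioned here, a Behave step file looks roughly like this; the Gherkin wording and the context helpers are illustrative placeholders, not steps from AATH or the aca-py integration suite:

```python
# features/steps/connection_steps.py -- generic sketch of Behave step definitions.
# The Gherkin it backs (written first, from the RFC, then implemented) might read:
#
#   Scenario: Establish a connection
#     Given Faber has created an out-of-band invitation
#     When Alice accepts the invitation
#     Then Faber and Alice have an active connection
#
# The step wording and the context attributes below are made up for illustration.
from behave import given, when, then


@given("Faber has created an out-of-band invitation")
def step_create_invitation(context):
    context.invitation = context.faber.create_invitation()  # hypothetical helper


@when("Alice accepts the invitation")
def step_accept_invitation(context):
    context.alice.receive_invitation(context.invitation)  # hypothetical helper


@then("Faber and Alice have an active connection")
def step_check_connection(context):
    assert context.faber.connection_state(context.alice) == "active"
```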

I can see how unit tests, if overly mock-heavy, can be a maintenance problem. So we should be very selective about what goes into unit tests. We should also evaluate existing unit tests against TRIMS and cut away problematic tests. I agree that code coverage is a really bad metric here.

IMO, the integration tests and the interop tests are siblings. When deciding to write an integration test, I think a great question to ask is whether the things being tested would apply to comms with other frameworks and would make a good interop test, or whether the goal is to test things that are only ACA-Py related. From a test-goal perspective, I really don't think there should be much overlap between the integration tests and the interop tests, though error and negative tests would probably be strictly integration or unit tests.

There was one thing I was thinking about a while ago for AATH: making test validators that are plugins to AATH. This would allow different end goals to be applied to the same test. At the command line, if running interop tests, I could load an interop validator that only really cares about the end state of the protocols (are the agents connected, etc.). If running more of an integration test, one could use a different validator (or validators) that would assert on, let's say, the data format or other particulars one would want to check. Just an idea at this point; I don't know if it helps us any.
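
To make the pluggable-validator idea a little more concrete, one possible shape (entirely hypothetical, not existing AATH code) could be:

```python
# Entirely hypothetical sketch of the "pluggable validator" idea -- not AATH code.
from abc import ABC, abstractmethod


class Validator(ABC):
    """A validator inspects the shared test context after a scenario runs."""

    @abstractmethod
    def validate(self, context) -> None:
        """Raise AssertionError if the scenario outcome is not acceptable."""


class InteropValidator(Validator):
    """Only cares about protocol end state, e.g. that the agents are connected."""

    def validate(self, context) -> None:
        assert context.connection_state == "active"


class DataFormatValidator(Validator):
    """Stricter checks for integration-style runs, e.g. payload particulars."""

    def validate(self, context) -> None:
        assert context.credential_payload.get("format") == "anoncreds"


def run_validators(context, validators: list[Validator]) -> None:
    # Validators selected at the command line (e.g. --validators=interop, a made-up
    # option) are applied to the same underlying scenario.
    for validator in validators:
        validator.validate(context)
```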

In the CI/CD pipelines, at the PR, nightly, and official-build levels, we can pull from a set of tests across the unit, integration, and interop suites, with PRs getting some smoke tests executed, scaling up to all tests executed for an official build.

If we want more weight on the interop tests in AATH, we can increase the frequency of the runs. I think we moved them to once every three days as opposed to nightly. We could even run just the ACA-Py to ACA-Py tests nightly instead of the runs against all the other frameworks.

@WadeBarnes (Member) commented

In the interest of separation of concerns, functional and integration tests can be developed in one repository, such as AATH, and integrated with and executed as part of the ACA-Py CI/CD pipeline(s) (in any number of PR, push, and/or release workflows). We do this with Indy. All of the integration and functional testing is defined in hyperledger/indy-test-automation, and those tests are integrated and executed as part of the Indy-Node release process to qualify the code for release.

This allows efforts to be more focused on the task (integration and functional testing), which helps to improve readability and maintainability by concentrating on a particular testing style, methodology, and framework.

@ff137 (Contributor) commented Apr 18, 2024

From our side, we work rather closely with the OpenAPI spec, because it's how we generate the cloudcontroller library -- an OpenAPI-generated client for interfacing with ACA-Py.

That's the only reason I could catch this bug from earlier: #2894
and how I'm picking up on model naming issues: #2901

As a side-note: Perhaps there's some benefit for the cloudcontroller to become a hyperledger project, partially because it could benefit from broader maintenance, but mainly because using the autogenerated client could simplify some integrations in other projects, and it could give the ACA-Py team an easier way to test and get internal feedback for methods / models that can be improved. Just a thought.

Besides that, one proposal I have, which I can contribute, is to add a GitHub workflow that asserts that changes to the OpenAPI spec are tracked. The workflow can detect if there is a diff in the OpenAPI spec and only pass if the PR includes a commit for the changes to the openapi.json/yaml files.
The only incentive for that is so that all changes to OpenAPI models are known and tracked when they are made.
I've thought of adding this before, but I don't want to make the test workflow more cumbersome for developers. The script to generate the spec is fairly straightforward and just requires docker. But it's a design decision for the Hyperledger team whether you would like to include such enforcement in the test workflows. If there's sufficient motivation for it, I can put a workflow together.
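
A minimal version of that check could be a script the workflow runs after regenerating the spec, failing the job when the regenerated file differs from the committed one. This is only a sketch; the file paths are assumptions, and the generation step itself would be the existing docker-based script:

```python
# check_openapi_spec.py -- minimal sketch of the "spec must be committed" check.
# Assumes the workflow has already regenerated the spec (via the existing
# docker-based script) to GENERATED, and that the committed copy lives at
# COMMITTED; both paths are assumptions and must match the repo layout.
import json
import sys
from pathlib import Path

COMMITTED = Path("open-api/openapi.json")
GENERATED = Path("open-api/openapi.generated.json")


def load(path: Path) -> dict:
    return json.loads(path.read_text())


def main() -> int:
    if load(COMMITTED) != load(GENERATED):
        print("OpenAPI spec is out of date: regenerate it and commit the result.")
        return 1
    print("OpenAPI spec is up to date.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The workflow would then simply fail the PR on a nonzero exit code, so the only developer-facing burden is committing the regenerated file.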

@ff137 (Contributor) commented Apr 18, 2024

Apart from that, some thoughts I have wrt broader testing strategy:

Maintaining such a large codebase is naturally very difficult, especially in Python.

Unit tests are important because they reinforce that methods behave as they should. It's not fun, but it forces developers to consider their code in more depth: are all possible inputs handled, could the method benefit from refactoring / modularization, how does the method handle bad requests and exceptions, etc.
Writing code without going through this exercise leads to technical debt.

If writing unit tests is overly cumbersome, then that probably suggests the methods being tested are overly complicated, and that should motivate better code practices: deduplication, modularity, readability, and all those good things that make writing tests easier.
If unit tests require a lot of boilerplate, there are usually better ways to utilise fixtures / mocking / patching.
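
For example, a generic pytest sketch (not from the aca-py code; the setting name and profile shape are made up) of moving repeated mock setup into a fixture:

```python
# Generic pytest sketch: shared mock setup moved into a fixture.
from unittest.mock import AsyncMock, MagicMock

import pytest


@pytest.fixture
def profile():
    """One place to build the mocked profile instead of repeating it per test."""
    profile = MagicMock()
    profile.session = AsyncMock()
    profile.settings = {"debug.auto_accept_invites": True}
    return profile


def auto_accept(profile) -> bool:
    """Tiny example 'unit' that reads a setting from the profile."""
    return bool(profile.settings.get("debug.auto_accept_invites"))


def test_auto_accept_enabled(profile):
    # The test body states the behavior under test, not the mock plumbing.
    assert auto_accept(profile) is True
```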

End-to-end testing is the gold standard, because it proves everything integrates correctly, but you can't see a coverage report for your e2e tests. So it requires careful design to know which success and failure scenarios to cover in your e2e test cases. If the unit tests cover everything properly, it's a lot easier to pick up from that and design which scenarios need to be covered by the e2e tests. When both unit and e2e tests cover everything, then the test results actually mean something!

I'm of the mind that in order to do the big things right, you have to do the small things right ... so having a proper unit test strategy is a must. I can't say how pedantic or tedious it ought to be, but certainly, at a minimum, the definition of "done" for a new feature should be that it's "sufficiently" covered with both unit tests and e2e tests. My 2c worth!
