
Inference testing using random data #199

Merged

merged 2 commits into mlfoundations:main from lopho:integration_test on Nov 12, 2022

Conversation

lopho (Contributor) commented Nov 4, 2022

I have started to work on integration tests with random data.
This test runs on all pretrained models at fp32 with JIT True/False where applicable.

Related issue: #198

  • Inference testing on pre-made input and ground truth
    • all models as listed by list_models()
    • Image
    • Text
  • Random test data generator
    • Random image data in PIL format
    • Random text data
  • Determine best way to store and recall test data: ~4.6MB for all models with 1 sample per config
  • Parallelize tests, unlikely due to RAM constraints

To generate test data:

python tests/util_test.py --all

populates the tests/data folder with one torch .pt file per model config, to be used by the test.
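
For illustration only (this is a guess at the shape of such a generator, not the PR's actual util_test.py; the helper name, caption text, and file layout are assumptions), producing one .pt per model config could look roughly like this:

import os
import numpy as np
import torch
from PIL import Image
import open_clip

def save_test_data(model_name, pretrained, data_dir='tests/data', seed=0):
    # build the model and its validation-time preprocessing
    model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=pretrained)
    model.eval()
    # a random PIL image at the model's expected input resolution
    size = model.visual.image_size
    size = size if isinstance(size, int) else size[0]
    rng = np.random.default_rng(seed)
    image = Image.fromarray(rng.integers(0, 256, (size, size, 3), dtype=np.uint8))
    text = 'a random test caption'
    # run inference once and store the inputs together with the reference outputs
    with torch.no_grad():
        image_features = model.encode_image(preprocess(image).unsqueeze(0))
        text_features = model.encode_text(open_clip.tokenize([text]))
    os.makedirs(data_dir, exist_ok=True)
    torch.save(
        {'image': image, 'text': text,
         'image_features': image_features, 'text_features': text_features},
        os.path.join(data_dir, f'{model_name}_{pretrained}.pt'))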

lopho marked this pull request as draft on November 4, 2022 20:26
lopho mentioned this pull request on Nov 4, 2022
lopho (Contributor, Author) commented Nov 6, 2022

@rom1504 could you take a look?
Random test data generation and the tests using that data are ready.

The next step would be to determine a good approach to storing and recalling test data across tags or commits.

rom1504 (Collaborator) commented Nov 6, 2022

Can you explain why you want to generate test data multiple times ?
Also why so much test data (50MB) and not just a few samples?

lopho (Contributor, Author) commented Nov 6, 2022

50MB is for a single sample per model config; it is the total across all configs.
It can definitely be less because, as you noticed, I'm generating input data for each model config.
This can be reduced by creating one input for each input size. For text it can be one input for all models. Regardless of the input data, the output has to be stored for every model config.


@pytest.mark.skip('storing and recalling of test data not ready')
@pytest.mark.parametrize('model_name, pretrained, precision, jit', util_test.model_config_permutations)
def test_inference_with_data(model_name, pretrained, precision, jit):
rom1504 (Collaborator) commented on the diff:

How slow is this?
Let's try to keep the tests running in less than 5 min.
Either remove redundant tests or use the (matrix) parallel feature of GH Actions
(and possibly also the parallel feature of pytest).

lopho (Contributor, Author) replied on Nov 6, 2022:

======================================================= test session starts =======================================================
platform linux -- Python 3.9.2, pytest-7.0.1, pluggy-1.0.0 -- /REDACTED/open_clip/.venv/bin/python
cachedir: .pytest_cache
rootdir: /REDACTED/open_clip
plugins: xdist-2.5.0, forked-1.4.0
collected 62 items                                                                                                                

tests/test_inference.py::test_inference_with_data[RN50-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN50-yfcc15m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50-cc12m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50-quickgelu-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50-quickgelu-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN50-quickgelu-yfcc15m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50-quickgelu-cc12m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN101-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN101-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN101-yfcc15m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN101-quickgelu-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN101-quickgelu-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN101-quickgelu-yfcc15m-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50x4-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50x4-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN50x16-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50x16-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[RN50x64-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[RN50x64-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion400m_e31-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion400m_e31-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion400m_e32-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion400m_e32-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion2b_e16-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion2b_e16-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion2b_s34b_b79k-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-laion2b_s34b_b79k-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-laion400m_e31-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-laion400m_e31-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-laion400m_e32-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-32-quickgelu-laion400m_e32-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-laion400m_e31-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-laion400m_e31-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-laion400m_e32-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-laion400m_e32-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-plus-240-laion400m_e31-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-plus-240-laion400m_e31-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-plus-240-laion400m_e32-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-B-16-plus-240-laion400m_e32-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion400m_e31-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion400m_e31-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion400m_e32-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion400m_e32-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion2b_s32b_b82k-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-laion2b_s32b_b82k-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-336-openai-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-L-14-336-openai-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-H-14-laion2b_s32b_b79k-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-H-14-laion2b_s32b_b79k-fp32-True] PASSED
tests/test_inference.py::test_inference_with_data[ViT-g-14-laion2b_s12b_b42k-fp32-False] PASSED
tests/test_inference.py::test_inference_with_data[ViT-g-14-laion2b_s12b_b42k-fp32-True] PASSED
tests/test_simple.py::test_inference[False] PASSED
tests/test_simple.py::test_inference[True] PASSED

======================================================== warnings summary =========================================================
tests/test_inference.py::test_inference_with_data[RN50-openai-fp32-True]
  /REDACTED/open_clip/tests/test_inference.py:38: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
  To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  ../aten/src/ATen/native/BinaryOps.cpp:600.)
    image_features = model.encode_image(prepped)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================ 62 passed, 1 warning in 374.16s (0:06:14) ============================================

0:06:14 on i7-4790K

lopho (Contributor, Author) replied:

batch size 1, single sample

lopho (Contributor, Author) replied:

Testing the configs in list_models(), all unit tests (training, inference, hf, ...) now take about 8:30-12:00 minutes; with setup overhead, 10-14 minutes.

rom1504 (Collaborator) replied:

Seems to be 3 min now, what changed?

tests/test_inference.py (outdated review thread, resolved)
rom1504 (Collaborator) commented Nov 6, 2022

The output should also be stored only once.

Is it really useful to test every single model config ?
I would rather we test all code paths but not necessarily all model configs.
For example one ResNet, one ViT, one timm model, ... but not 10 ViTs.
What do you think?

lopho (Contributor, Author) commented Nov 6, 2022

The outputs are different for each model, so we need to store them.
It's not possible to compare the outputs of, say, ViT-B-32 and RN50.

Regarding the configurations, correct me if I'm wrong, but some pretrained models are treated differently within the same model architecture, ex. openai.
If not, we could just pick one for each architecture.

lopho (Contributor, Author) commented Nov 6, 2022

Regarding input data: Since the data is random, having a larger pool of samples increases robustness by reducing the probability of encountering only false positives.

Overall I tried to design this to be an integration test, so it can run without internal knowledge of which configs create which architecture, and without special-casing each one.
This is also why the test data are PIL images and strings instead of input tensors, so the entire usage path is covered.

I'll go ahead and split input and output data, so we only need one set of inputs per input size across all models and inflection points.
I'm also taking a look at the GH Actions matrix to parameterize this per config.

rom1504 (Collaborator) commented Nov 7, 2022

Could you remove the skip on the test so we can see if it works and how long it takes to run?

lopho (Contributor, Author) commented Nov 7, 2022

I can, but the GH action will fail due to missing test data.

If you want to run the new test, be sure to remove @pytest.mark.skip before test_inference_with_data.
It is in place to prevent GH Actions from failing while this PR is open and the test data isn't ready.

lopho (Contributor, Author) commented Nov 7, 2022

Quoting from the OP:

  • GH action to trigger creating test data for every commit or release/tag
  • Modify test action to insert a predefined commit or tag to test against
  • Determine best way to store and recall test data: ~50MB for all models with 1 sample per config (60 configs)

For these steps I still need feedback from you.
Specifically where to store test data in the long term.
Alternatively I could try to get GH actions to check out older commits and create test data on the fly. But that would double the run time.

rom1504 (Collaborator) commented Nov 7, 2022

> I can, but the GH action will fail due to missing test data.

It's ok, it just needs to pass before merging

rom1504 (Collaborator) commented Nov 7, 2022

> GH action to trigger creating test data for every commit or release/tag

I don't think this is necessary.

Test data should be introduced manually for new models.

I think it would make sense to make it so the test data is < 5MB (so that means much less for each model) and then simply store it in the repo.

rom1504 (Collaborator) commented Nov 7, 2022

We could also consider storing test data in releases if we truly need it to be big, but if it's not needed, it's best to keep it small and in the repo.

lopho (Contributor, Author) commented Nov 7, 2022

Test data is already minimal per model.
1 input image + 1 output tensor, 1 input string + 1 output tensor.
220kB - 1.7MB per model depending on output size.
But as we said, we can reduce it by deduplicating input images to 1 per input size. That could shave off a few MB.

lopho (Contributor, Author) commented Nov 7, 2022

I'll refactor tomorrow to generate one input image per size, one input string for all models.
Maybe that'll shrink it enough to include it in the repo.

rom1504 (Collaborator) commented Nov 7, 2022

sounds good

lopho (Contributor, Author) commented Nov 8, 2022

Changed test data creation to only create one image per size, and one string for all models.
Test data is down to 3.2MB.
I created a set of test data against the latest 2.3.1 release, and pushed that as well.

CI failed due to a hash mismatch in the model download for RN50 though.
I'm not sure if that is due to the new download tests or something else.

It works on my machine (including the new tests).

EDIT: nevermind, I misread the warning as the error.
It seems the output differs. Not sure if that depends on the machine, on the torch/numpy/pillow version, or on something else entirely.
I will change the assertion to isclose() instead of equality and see if precision is the problem.

Maybe data should be created by the CI as well. I'm not certain this will alleviate the problem though.
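
A sketch of that looser check (the 1e-5 tolerance that comes up later in the thread is used here purely for illustration):

import torch

def assert_close(output, expected, rtol=1e-5, atol=1e-5):
    # element-wise comparison with a tolerance instead of exact equality,
    # to absorb small numeric differences across machines and library versions
    max_diff = (output - expected).abs().max().item()
    assert torch.allclose(output, expected, rtol=rtol, atol=atol), f'max abs diff {max_diff}'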

lopho (Contributor, Author) commented Nov 8, 2022

Do you have a suggestion for an acceptable maximum precision for isclose?
Otherwise, I'll stick to the defaults.

lopho (Contributor, Author) commented Nov 8, 2022

@rom1504 tests pass now (the earlier failure was either a timeout or a memory constraint; all tests up to ViT-g passed, and that one was interrupted during download).
I had to set the tolerance to 1e-5 (which I'm not very happy about).
The bulk of the runtime is spent on downloading the weights.
I'll try parallelizing it next, using either the GH Actions matrix or pytest's parallel mode.
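
For the pytest route, the parallel plugin already present in the session log above is pytest-xdist; a typical invocation spreads the parametrized cases across local worker processes, with the caveat that each worker loads its own model and therefore adds to RAM use:

python -m pytest -n 4 tests/test_inference.py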

lopho (Contributor, Author) commented Nov 11, 2022

I'm reconsidering running tests with pretrained weights.
No matter what, it will either take too long or, if done in parallel, hit the memory constraints.
Both because of the weight downloads.

Instead, I would run tests against randomly initialized non-openai models (as listed by list_models()), then a separate test for openai models.
I'm not sure yet how to do this while retaining a blackbox testing mentality, i.e. without internal knowledge of how openai models differ from non-openai models in terms of config.

rom1504 (Collaborator) commented Nov 11, 2022 via email

lopho (Contributor, Author) commented Nov 11, 2022

Testing all models as listed by list_models() is implemented.
Added a small script in util_test.py to help with data generation.

# generate test data for all models in tests/data
# will only overwrite old data with --overwrite
python tests/util_test.py --all
# for a specific model
python tests/util_test.py --model RN50

Total runtime for all tests (inference, training, hf, ...) is ~10-14min including setup.
Testing dependencies had to be updated to include timm==0.6.11.
Added exceptions for timm-convnext_xlarge & timm-vit_medium_patch16_gap_256 due to #219, and for ViT-e-14 as it exceeds the runner's system RAM.
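
For illustration, the exclusions could feed the parametrization roughly like this (the skip list is quoted from the sentence above; the helper itself is a sketch, not the PR's actual util_test.py):

import open_clip

# configs skipped in CI: the two timm models due to #219, ViT-e-14 due to runner RAM
EXCLUDED = {'timm-convnext_xlarge', 'timm-vit_medium_patch16_gap_256', 'ViT-e-14'}

def models_to_test():
    return [name for name in open_clip.list_models() if name not in EXCLUDED]

The resulting list would then be handed to @pytest.mark.parametrize('model_name', models_to_test()).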

Ready to merge?

lopho requested a review from rom1504 on November 11, 2022 05:39
lopho marked this pull request as ready for review on November 11, 2022 05:39
tests/util_test.py (outdated review thread, resolved)
rom1504 (Collaborator) commented Nov 11, 2022

I think it would be cool to use the GH Actions infra for parallel testing, to run N (probably 8) tests in parallel (it uses N actual containers, so no RAM issue).
That would decrease the run time of the tests and would allow us to add more things in the future.
Here's an example: https://github.com/PrismarineJS/mineflayer/blob/master/.github/workflows/ci.yml

rom1504 (Collaborator) commented Nov 11, 2022

I'm thinking something like that:

  • in GH Actions, have a "parallel test" action that has a matrix with, for example, 8 parallel jobs (just numbers from 0 to 7); it will run only this inference test
  • the normal test action will not run this inference test

then in the inference test, you use that number (0 to 7) to pick only 1/8 of the models if the environment variable exists, otherwise run all

How does that sound?
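
A sketch of that slicing (the environment variable names here are invented for illustration):

import os
import open_clip

def models_for_this_job():
    # when running as one job of an N-way GH Actions matrix, take every Nth model;
    # with no environment variable set, run everything
    models = open_clip.list_models()
    job = os.environ.get('TEST_JOB_INDEX')
    if job is None:
        return models
    n_jobs = int(os.environ.get('TEST_JOB_COUNT', '8'))
    return [m for i, m in enumerate(models) if i % n_jobs == int(job)]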

rom1504 (Collaborator) commented Nov 11, 2022

another (simpler) option is to simply put the names of the models to test in the gh action matrix and use that

rom1504 (Collaborator) commented Nov 11, 2022

overall nice work! it's almost ready now

lopho (Contributor, Author) commented Nov 12, 2022

Using torch.autocast('cpu') increases runtime by two orders of magnitude on my machine.

  • without autocast: 147.27s
  • with autocast: 10244.21s

It's an i7-4790K, which does not support AVX-512 BF16 or AVX-512 FP16.
Also, the outputs on the machine running the GH action are vastly different from those on my machine and are no longer comparable (ex. -0.3516 on my machine vs -0.2471 on the GH runner); this is probably also due to the lack of AVX-512 support.

I would like to keep it without autocast, to make the test usable regardless of the instruction set supported by the CPU, and to keep implicit changes to behaviour to a minimum (ex. the output is bfloat16 instead of float32 with autocast).
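
The comparison behind those numbers can be reproduced with something along these lines (an illustration, not the test code; absolute timings will differ per CPU):

import time
import torch

def timed_encode(model, images, use_autocast):
    # time one forward pass, optionally under CPU bfloat16 autocast,
    # which is very slow on CPUs without AVX-512 BF16 support
    start = time.perf_counter()
    with torch.no_grad():
        if use_autocast:
            with torch.autocast('cpu', dtype=torch.bfloat16):
                model.encode_image(images)
        else:
            model.encode_image(images)
    return time.perf_counter() - start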

rom1504 (Collaborator) commented Nov 12, 2022

Did you disable JIT? It tends to make things slow.

lopho (Contributor, Author) commented Nov 12, 2022

rom1504 (Collaborator) commented Nov 12, 2022

ok I guess let's not use autocast then

any thoughts regarding GH Actions parallelism?

rom1504 (Collaborator) commented Nov 12, 2022

looks like tests took only 3m ?!

did anything change ?

lopho (Contributor, Author) commented Nov 12, 2022

I'm working on parallelizing it right now.
Runtime depends on the luck of the draw regarding which machine and priority we get from GitHub.
vit-G failing is new; it seems it was very close to the RAM limit and the new tokenizer pushed it just over.

I've got it working with vit-g when running in parallel.
It's almost ready.

EDIT: yes, I mean ViT-G

rom1504 (Collaborator) commented Nov 12, 2022

it's vit-G which was failing

rom1504 (Collaborator) commented Nov 12, 2022

I think we could really merge now; let's maybe do that and do another PR for the parallel part?

lopho (Contributor, Author) commented Nov 12, 2022

This is a parallel run on my staging branch, if you want to take a look.
https://github.com/lopho/open_clip/actions/runs/3451850195

lopho (Contributor, Author) commented Nov 12, 2022

Ok, parallel can wait.

Let me squash the commits first before merging please.

rom1504 (Collaborator) commented Nov 12, 2022

It'll all be squashed during GitHub merge

lopho (Contributor, Author) commented Nov 12, 2022

ah, sorry, too late, it's already done.

rom1504 (Collaborator) commented Nov 12, 2022 via email

rom1504 merged commit 8d3429f into mlfoundations:main Nov 12, 2022
rom1504 (Collaborator) commented Nov 12, 2022

Merged

Parallelization with GH Actions would still be appreciated.

And the next big step would be the same thing but with one step of training (see the simple training test for a place to start), so we make sure training keeps doing the same thing.

If you want to continue working on this, it would be pretty helpful

There is active development in this repo, and keeping things working will be important

lopho deleted the integration_test branch November 13, 2022 23:08