Improve ABOUT file handling in d2d pipeline #1004
Reference: #1004 Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
#982 partially addresses this issue: the About file handling is much improved and takes less time.
After About resource paths became patterns, the run time increased to 4 hours, sometimes even 6-7 hours.
With the optimizations used here, it is now ~13 minutes.
Remaining:
@AyanSinhaMahapatra Why the wait? The attributes are already supported ... see https://github.com/nexB/aboutcode-toolkit/blob/89c16a5b762d38c5e7f4ba25097659bd60a0a08c/src/attributecode/model.py#L903
* Support regex in ABOUT resource paths
  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
* Refactor ABOUT file mapping in d2d for efficiency
  Reference: #1004
  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
* Restructure map_about_files
  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
* Address feedback and review comments
  Reference: #982
  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
* Update docstrings and use dataclass
  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
* Use license/notice files from About data
  Reference: #1004
  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
* Add tests for AboutFileIndex methods
  Reference: #982
  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
* Address feedback and update CHANGELOG
  Reference: #982
  Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
---------
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Using a query regex, possibly with two conditions for ignores/excludes, is likely to be slow. I think we may end up with multiple full table scans in a loop. This may explain why this pipeline step now runs about 5 to 12 times slower than before. I suggest inverting the processing. Right now, we process it this way: https://github.com/nexB/scancode.io/blob/c29c62884a383ae00cb67f5c1e03166e548e4056/scanpipe/pipes/d2d.py#L856C1-L856C1
Instead, we could do this, ensuring we only ever do a single pass over the resources:
scancode.io/scanpipe/pipes/d2d.py
Line 859 in c29c628
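To make the suggested inversion concrete, here is a minimal sketch in plain Python. It is not the scancode.io code: `AboutEntry`, `map_resources_single_pass`, and the example paths are hypothetical names made up for illustration, and it assumes the About path patterns and ignores are regexes that can be compiled up front. The idea is to loop over the resources once and test each path against the in-memory patterns, instead of running one regex query per About file.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AboutEntry:
    # Hypothetical stand-in for the data loaded from one .ABOUT file.
    name: str
    path_pattern: str
    ignored_patterns: list = field(default_factory=list)

    def __post_init__(self):
        # Compile once so the per-resource loop does no regex parsing.
        self._include = re.compile(self.path_pattern)
        self._ignores = [re.compile(p) for p in self.ignored_patterns]

    def matches(self, path):
        # A path matches when it fits the include pattern and no ignore pattern.
        if not self._include.match(path):
            return False
        return not any(ignore.match(path) for ignore in self._ignores)


def map_resources_single_pass(resource_paths, about_entries):
    """Return {about name: [matched paths]} using one pass over the resources."""
    mapped = {entry.name: [] for entry in about_entries}
    for path in resource_paths:  # the only iteration over the resources
        for entry in about_entries:
            if entry.matches(path):
                mapped[entry.name].append(path)
                break  # assumption: the first matching About entry wins
    return mapped


entries = [
    AboutEntry("flume", r"to/flume/.*", [r".*\.txt$"]),
    AboutEntry("misc", r"to/misc/.*"),
]
paths = ["to/flume/a.class", "to/flume/readme.txt", "to/misc/b.jar", "from/c.java"]
result = map_resources_single_pass(paths, entries)
```

In the real pipeline the outer loop would presumably be a single iterator over the project's resources rather than a list of strings, but the shape stays the same: the number of About files only affects the cheap in-memory matching, not the number of table scans.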