Improve ABOUT file handling in d2d pipeline #1004

Closed
pombredanne opened this issue Nov 5, 2023 · 2 comments · Fixed by #982

Comments

@pombredanne
Member

  1. Mapping may now take significantly longer because we are using patterns.
    Using a query regex, possibly with two conditions for ignores/excludes, is likely to be slow. I think we may end up running multiple full table scans in a loop. This may explain why this pipeline step now runs about 5 to 12 times slower than before. I suggest inverting the processing. Right now, we process this way (see https://github.com/nexB/scancode.io/blob/c29c62884a383ae00cb67f5c1e03166e548e4056/scanpipe/pipes/d2d.py#L856C1-L856C1):
  • for each ABOUT file
    • for each resource matching ABOUT file paths and excludes
      • .....
    • create/update Package

Instead we could do this, ensuring we only ever do a single pass over the resources:

  • for each ABOUT file
    • collect path patterns and translate and compile to regex
  • for each resource:
    • for each ABOUT file pattern
      • if the resource matches the ABOUT file paths and excludes
        • accumulate Resource for a later bulk operation
      • .....
  • for each mapped ABOUT file:
    • create/update Package
    • bulk create the Package -> Resources relationships
  • for each non-mapped ABOUT file:
    • save warning
  2. Companion files should be taken from the ABOUT file, not implied from naming conventions. The following should use the ABOUT file content instead:
    about_file_companions = (
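
A minimal sketch of the single-pass mapping proposed in point 1 above, using plain Python data instead of database queries. Everything here is an illustrative assumption rather than the scancode.io implementation: the function names, the `patterns`/`excludes` keys, and the use of `fnmatch`-style globs are all hypothetical.

```python
import fnmatch
import re
from collections import defaultdict


def compile_patterns(about_files):
    """Compile each ABOUT file's glob patterns to regexes, once, up front."""
    compiled = {}
    for about_path, spec in about_files.items():
        includes = [re.compile(fnmatch.translate(p)) for p in spec.get("patterns", [])]
        excludes = [re.compile(fnmatch.translate(p)) for p in spec.get("excludes", [])]
        compiled[about_path] = (includes, excludes)
    return compiled


def map_resources(about_files, resource_paths):
    """Return (mapped, unmapped) after a single pass over the resources.

    mapped: {about_path: [resource paths]} accumulated for a later bulk operation.
    unmapped: ABOUT files that matched no resource (candidates for a warning).
    """
    compiled = compile_patterns(about_files)
    mapped = defaultdict(list)

    for path in resource_paths:  # the only loop over the resources
        for about_path, (includes, excludes) in compiled.items():
            is_included = any(rx.match(path) for rx in includes)
            is_excluded = any(rx.match(path) for rx in excludes)
            if is_included and not is_excluded:
                mapped[about_path].append(path)

    unmapped = [about_path for about_path in about_files if about_path not in mapped]
    return dict(mapped), unmapped
```

In a real pipeline, the accumulated lists would feed the bulk creation of the Package -> Resources relationships, and each entry in `unmapped` would be saved as a warning.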
AyanSinhaMahapatra added a commit that referenced this issue Nov 8, 2023
Reference: #1004

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
AyanSinhaMahapatra added a commit that referenced this issue Jan 12, 2024
Reference: #1004

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra
Member

#982 addresses this issue partially: the About file handling is much improved and takes less time.
See the comparison below of the time taken to process About files (for a project with ~800k files, ~550k in the deployed code, and ~220 About files):
Before About paths were patterns:

2023-12-09 10:26:55.30 Step [map_about_files] starting
2023-12-09 10:26:55.33 Mapping 221 .ABOUT files found in the from/ codebase.
2023-12-09 11:11:58.06 Step [map_about_files] completed in 2703 seconds (45.0 minutes)

After About paths were patterns this increased to 4 hours, sometimes even 6-7 hours.

2023-11-15 16:20:07.50 Step [map_about_files] starting
2023-11-15 16:20:07.50 Mapping 220 .ABOUT files found in the from/ codebase.
2023-11-15 22:57:41.66 Step [map_about_files] completed in 23854 seconds (6.6 hours)

After the optimizations used here, this is now ~13 minutes:

2024-01-12 18:58:56.50 Step [map_about_files] starting
2024-01-12 18:58:56.55 Mapping 221 .ABOUT files found in the from/ codebase.
2024-01-12 19:11:39.79 Step [map_about_files] completed in 763 seconds (12.7 minutes)
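
For reference, the ratios implied by the three log excerpts above can be worked out directly from the reported durations. This is only arithmetic on the numbers in the logs; the dictionary keys are labels chosen for illustration.

```python
# Durations in seconds, copied from the map_about_files log lines above.
timings = {
    "before_patterns": 2703,
    "with_patterns": 23854,
    "after_optimization": 763,
}

# How much slower the step became once About paths were patterns.
slowdown_from_patterns = timings["with_patterns"] / timings["before_patterns"]

# How much faster the optimized step is than each earlier run.
speedup_vs_patterns = timings["with_patterns"] / timings["after_optimization"]
speedup_vs_original = timings["before_patterns"] / timings["after_optimization"]

print(f"slowdown with patterns: {slowdown_from_patterns:.1f}x")
print(f"speedup vs. patterns run: {speedup_vs_patterns:.1f}x")
print(f"speedup vs. original run: {speedup_vs_original:.1f}x")
```

The slowdown of roughly 8.8x falls within the "5 to 12 times slower" range mentioned in the issue description.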

Remaining: Companion files should be taken from the ABOUT file content, not implied from naming conventions.
This has to be implemented in aboutcode-toolkit first and then added here; it could be done when #834 is implemented.

@pombredanne
Member Author

Remaining: Companion files should be taken from the ABOUT file, not implied based on naming conventions this should be using the ABOUT file content
This has to be implemented in aboutcode-toolkit and then added here, could be done when #834 is implemented.

@AyanSinhaMahapatra Why the wait? The attributes are already supported ... see https://github.com/nexB/aboutcode-toolkit/blob/89c16a5b762d38c5e7f4ba25097659bd60a0a08c/src/attributecode/model.py#L903
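
As a rough sketch of what "taken from the ABOUT file" could look like, the snippet below reads companion file names from already-parsed ABOUT data instead of deriving them from naming conventions. It assumes the data is a plain dict and that `license_file` and `notice_file` are the relevant fields; the function name and the handling of list values are hypothetical, and it does not use the aboutcode-toolkit API.

```python
import posixpath

# Assumed ABOUT fields that name companion files; adjust as needed.
COMPANION_FIELDS = ("license_file", "notice_file")


def companion_paths(about_path, about_data):
    """Return companion file paths resolved relative to the ABOUT file's directory."""
    base_dir = posixpath.dirname(about_path)
    paths = []
    for field in COMPANION_FIELDS:
        value = about_data.get(field)
        if not value:
            continue
        # Accept either a single file name or a list of names (an assumption).
        names = [value] if isinstance(value, str) else list(value)
        for name in names:
            paths.append(posixpath.normpath(posixpath.join(base_dir, name)))
    return paths
```

For example, calling `companion_paths("from/pkg/foo.ABOUT", data)` with a `data` dict that sets `license_file` and `notice_file` would return those two files under `from/pkg/`, regardless of how they are named.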

AyanSinhaMahapatra added a commit that referenced this issue Jan 16, 2024
Reference: #1004

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
AyanSinhaMahapatra linked a pull request Jan 16, 2024 that will close this issue
tdruez pushed a commit that referenced this issue Jan 30, 2024
* Support regex in ABOUT resource paths

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Refactor ABOUT file mapping in d2d for efficiency

Reference: #1004

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Restructure map_about_files

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Address feedback and review comments

Reference: #982

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Update docstrings and use dataclass

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Use license/notice files from About data

Reference: #1004

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Add tests for AboutFileIndex methods

Reference: #982

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Address feedback and update CHANGELOG

Reference: #982

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

---------

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>