Improve indexing queue with multiple instances of ScanCode.io #236

Closed
9 tasks done
pombredanne opened this issue Dec 13, 2023 · 5 comments

Comments
@pombredanne
Member

pombredanne commented Dec 13, 2023

We should improve the PurlDB architecture to use not one but multiple ScanCode.io workers. The current PurlDB setup has only one dedicated ScanCode.io instance acting as a worker that performs fingerprinting for the PurlDB as needed at indexing time; the PurlDB calls this ScanCode.io instance when needed. This does not scale well and is a serious limiting factor for indexing. We need to implement an alternative design instead.

The original design is in:

The high level solution could cover:

@pombredanne
Member Author

See also #14

@pombredanne
Member Author

pombredanne commented Dec 14, 2023

There are two possible designs:

  1. Multiple SCIO instances on multiple machines run a small new command line wrapper that runs forever as a daemon and polls a new PurlDB endpoint to get a scan queue entry to process. It runs the scans and then sends the results back through some other TBD PurlDB endpoint. Here the PurlDB does not know about the workers ahead of time; we can have as many as needed. See the sketch after this list for what such a wrapper could look like.

  2. Multiple SCIO instances are dispatched to from a main SCIO instance and are basically just workers for that main instance. This would require some extra work as each SCIO instance would need to store data and files locally before sending these back to the central SCIO DB, and that instance would then be used to populate the PurlDB the same way as today. Here, the queue in the main ScanCode.io instance needs to know about all the workers since it dispatches work to them.
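To make design 1 more concrete, here is a minimal sketch of the polling wrapper, assuming hypothetical PurlDB endpoints and field names (scan_queue/next/, results/, uuid, download_url); the actual API shape is still TBD above:

```python
import time

import requests

PURLDB_API_URL = "https://purldb.example.com/api"  # hypothetical PurlDB base URL
PURLDB_API_KEY = "changeme"  # however worker credentials end up being handled


def run_scan(download_url):
    """Placeholder: download the package and run the ScanCode.io pipelines on it."""
    raise NotImplementedError


def run_worker(poll_interval=60):
    """Poll the PurlDB scan queue forever and process one entry at a time."""
    headers = {"Authorization": f"Token {PURLDB_API_KEY}"}
    while True:
        # Hypothetical endpoint that hands out the next scan queue entry, if any.
        response = requests.get(f"{PURLDB_API_URL}/scan_queue/next/", headers=headers)
        entry = response.json() if response.status_code == 200 else None
        if not entry:
            time.sleep(poll_interval)
            continue
        results = run_scan(entry["download_url"])
        # Hypothetical endpoint to post the results back for this queue entry.
        requests.post(
            f"{PURLDB_API_URL}/scan_queue/{entry['uuid']}/results/",
            headers=headers,
            json=results,
        )


if __name__ == "__main__":
    run_worker()
```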

@JonoYang
Contributor

An idea on how to delegate work to workers:

There would be a new API endpoint that is only accessible to workers. When called, this API will return the download URL and uuid for the package scan request. The worker will then download the Package at the given download URL and scan it. When the scan is complete, the worker will send a POST request containing the completed scan to the server.
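As an illustration of that idea, a worker-only endpoint on the PurlDB side could look roughly like this Django REST framework sketch; ScanQueueEntry, its status and download_url fields, and the permission class used here are hypothetical stand-ins, not the actual PurlDB code:

```python
from rest_framework.decorators import api_view, permission_classes
from rest_framework.permissions import IsAdminUser
from rest_framework.response import Response

# Hypothetical model holding pending package scan requests.
# from purldb.models import ScanQueueEntry


@api_view(["GET"])
@permission_classes([IsAdminUser])  # stand-in for a "workers only" permission
def next_scan_request(request):
    """Return the download URL and UUID of the next package to scan, if any."""
    entry = (
        ScanQueueEntry.objects.filter(status="pending")
        .order_by("created_date")
        .first()
    )
    if not entry:
        return Response(status=204)  # nothing to scan right now
    entry.status = "in_progress"
    entry.save(update_fields=["status"])
    return Response({"uuid": str(entry.uuid), "download_url": entry.download_url})
```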

@pombredanne
Member Author

For historical reference, we considered these other ideas, but this may not be the actual design:

Discussion and alternative design:

  • Use multiple workers (rqworker) possibly distributing them on multiple machines

    • Each worker would use its local database and post results back?
      • we would need a mode to support this
    • or share and access the main database over the network?
    • We need one single centralized queue, which is also the place for centralized projects
      • the central SCIO server does not know about its workers
    • Is Redis using SSL?
    • We should start by setting up a test to check whether the settings work
  • In most cases we likely want the download to happen in the pipeline to support distribution (there is a pending (stalled) WIP PR for this)

  • Some scans may be:

    • CPU heavy: a regular package scan pipeline
    • Database heavy: like a d2d pipeline
  • If we need no db storage in SCIO, then we should consider using plain SCTK instead

@pombredanne
Member Author

This is done. We now have:

  • An updated PurlDB that exposes its indexing and scan queue as a service (using an API endpoint). The code for the new scan queue was merged in Expose new scan queue to many workers #49 #290

  • A wrapper for a ScanCode.io instance to behave as a worker that polls the PurlDB queue for jobs, runs the actual indexing jobs, and posts its results back for storage in the PurlDB. The code for the scancode.io worker is at Create purldb scan worker scancode.io#1078

To test this feature:

  1. Install ScanCode.io and PurlDB from the respective branches of the PRs, or from the main branch once merged

  2. Add one package to the PurlDB using the API. This will create a scan queue entry, which will be picked up by the scancode.io worker and scanned. The scan results will be sent back to the PurlDB; see the example below.
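For illustration only, assuming a local PurlDB instance and a hypothetical endpoint path and payload fields (the real ones depend on the API merged in #290), step 2 could look like:

```python
import requests

PURLDB_API_URL = "http://127.0.0.1:8001/api"  # a local PurlDB instance; adjust as needed

# Hypothetical call to queue one package for indexing; the path and field names
# are placeholders, not the confirmed PurlDB API.
response = requests.post(
    f"{PURLDB_API_URL}/packages/",
    headers={"Authorization": "Token <purldb-api-key>"},
    json={"download_url": "https://repo1.maven.org/maven2/org/example/demo/1.0/demo-1.0.jar"},
)
print(response.status_code, response.json())
```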
