Improve indexing queue with multiple instances of ScanCode.io #236

Closed
9 tasks done
pombredanne opened this issue Dec 13, 2023 · 5 comments

Comments
@pombredanne
Member

pombredanne commented Dec 13, 2023

We should improve the PurlDB architecture to use not one but multiple ScanCode.io workers. The current PurlDB setup has only one dedicated ScanCode.io instance acting as a worker that performs fingerprinting for the PurlDB as needed at indexing time; the PurlDB calls this ScanCode.io instance when needed. This does not scale well and is a serious limiting factor for indexing. We need to implement an alternative design instead.

The original design is in:

The high level solution could cover:

@pombredanne
Member Author

See also #14

@pombredanne
Member Author

pombredanne commented Dec 14, 2023

There are two possible designs:

  1. Multiple SCIO instances on multiple machines run a small new command line wrapper that runs forever as a daemon and polls a new PurlDB endpoint to get a scan queue entry to process. It runs the scans and then sends the results back through some other TBD PurlDB endpoint. Here the PurlDB does not know about the workers ahead of time; we can have as many as needed. See the sketch after this list for what such a wrapper could look like.

  2. Multiple SCIO instances are dispatched to from a main SCIO instance and are basically just workers for that main instance. This would require some extra work as each SCIO instance would need to store data and files locally before sending these back to the central SCIO DB, and that instance would then be used to populate the PurlDB the same way as today. Here, the queue in the main ScanCode.io instance needs to know about all the workers since it dispatches work to them.
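To make design 1 more concrete, here is a minimal sketch of the polling wrapper, assuming hypothetical PurlDB endpoints and field names (scan_queue/next/, results/, uuid, download_url); the actual API shape is still TBD above:

```python
import time

import requests

PURLDB_API_URL = "https://purldb.example.com/api"  # hypothetical PurlDB base URL
PURLDB_API_KEY = "changeme"  # however worker credentials end up being handled


def run_scan(download_url):
    """Placeholder: download the package and run the ScanCode.io pipelines on it."""
    raise NotImplementedError


def run_worker(poll_interval=60):
    """Poll the PurlDB scan queue forever and process one entry at a time."""
    headers = {"Authorization": f"Token {PURLDB_API_KEY}"}
    while True:
        # Hypothetical endpoint that hands out the next scan queue entry, if any.
        response = requests.get(f"{PURLDB_API_URL}/scan_queue/next/", headers=headers)
        entry = response.json() if response.status_code == 200 else None
        if not entry:
            time.sleep(poll_interval)
            continue
        results = run_scan(entry["download_url"])
        # Hypothetical endpoint to post the results back for this queue entry.
        requests.post(
            f"{PURLDB_API_URL}/scan_queue/{entry['uuid']}/results/",
            headers=headers,
            json=results,
        )


if __name__ == "__main__":
    run_worker()
```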

@JonoYang
Contributor

An idea on how to delegate work to workers:

There would be a new API endpoint that is only accessible to workers. When called, this API will return the download URL and uuid for the package scan request. The worker will then download the Package at the given download URL and scan it. When the scan is complete, the worker will send a POST request containing the completed scan to the server.
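As an illustration of that idea, a worker-only endpoint on the PurlDB side could look roughly like this Django REST framework sketch; ScanQueueEntry, its status and download_url fields, and the permission class used here are hypothetical stand-ins, not the actual PurlDB code:

```python
from rest_framework.decorators import api_view, permission_classes
from rest_framework.permissions import IsAdminUser
from rest_framework.response import Response

# Hypothetical model holding pending package scan requests.
# from purldb.models import ScanQueueEntry


@api_view(["GET"])
@permission_classes([IsAdminUser])  # stand-in for a "workers only" permission
def next_scan_request(request):
    """Return the download URL and UUID of the next package to scan, if any."""
    entry = (
        ScanQueueEntry.objects.filter(status="pending")
        .order_by("created_date")
        .first()
    )
    if not entry:
        return Response(status=204)  # nothing to scan right now
    entry.status = "in_progress"
    entry.save(update_fields=["status"])
    return Response({"uuid": str(entry.uuid), "download_url": entry.download_url})
```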

@pombredanne
Member Author

For historical reference, we considered these other ideas, but this may not be the actual design:

Discussion and alternative design:

  • Use multiple workers (rqworker) possibly distributing them on multiple machines

    • Each worker would use its local database and post results back?
      • we would need a mode to support this
    • or share and access the main database over the network?
    • We need one single centralized queue, which is also the place for centralized projects
      • the central SCIO server does not know about its workers
    • Is Redis using SSL?
    • We should start by setting up a test to check whether the settings work
  • In most cases we likely want the download to happen in the pipeline to support distribution (there is a pending (stalled) WIP PR for this)

  • Some scans may be:

    • CPU heavy: a regular package scan pipeline
    • Database heavy: like a d2d pipeline
  • If we need no db storage in SCIO, then we should consider using plain SCTK instead

@pombredanne
Member Author

This is done. We now have:

  • An updated PurlDB that exposes its indexing and scan queue as a service (using an API endpoint). The code for the new scan queue was merged in Expose new scan queue to many workers #49 #290

  • A wrapper for a ScanCode.io instance to behave as a worker that polls the PurlDB queue for jobs, runs the actual indexing jobs, and posts its results back for storage in the PurlDB. The code for the scancode.io worker is at Create purldb scan worker scancode.io#1078

To test this feature:

  1. Install ScanCode.io and PurlDB from the respective branches of the PRs, or from the main branch once merged

  2. Add one package to the PurlDB using the API. This will create a scan queue entry, which will be picked up by the scancode.io worker and scanned. The scan results will be sent back to the PurlDB; see the example below.
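For illustration only, assuming a local PurlDB instance and a hypothetical endpoint path and payload fields (the real ones depend on the API merged in #290), step 2 could look like:

```python
import requests

PURLDB_API_URL = "http://127.0.0.1:8001/api"  # a local PurlDB instance; adjust as needed

# Hypothetical call to queue one package for indexing; the path and field names
# are placeholders, not the confirmed PurlDB API.
response = requests.post(
    f"{PURLDB_API_URL}/packages/",
    headers={"Authorization": "Token <purldb-api-key>"},
    json={"download_url": "https://repo1.maven.org/maven2/org/example/demo/1.0/demo-1.0.jar"},
)
print(response.status_code, response.json())
```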
