Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modernize multiprocessing (convert to asyncio) #643

Closed
terriko opened this issue May 1, 2020 · 13 comments
Closed

Modernize multiprocessing (convert to asyncio) #643

terriko opened this issue May 1, 2020 · 13 comments
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@terriko
Copy link
Contributor

terriko commented May 1, 2020

I can't remember the details, but I know @wzao1515 had to make some compromises with the multiprocessing in order to get it to behave on python 2.7. If someone's feeling ambitious, now that we support only 3.6+, it's possible that we can do the multiprocessing more elegantly.

@terriko terriko added the enhancement New feature or request label May 1, 2020
@wzao1515
Copy link
Contributor

wzao1515 commented May 1, 2020

Yes that's definitely possible since our effort is mainly bounded by I/O, and previously my pr #186 could be optimized from both algorithm and python3-specific aspect. I'll help if anyone is interested.

@pdxjohnny
Copy link
Member

The following resources might be helpful for this:

@Niraj-Kamdar
Copy link
Contributor

Possible optimizations:

  • Extractor: asyncio subprocess and thread pool for blocking call.
  • String and InputEngine: aiofiles module
  • version_scanner: concurrent.futures process pool
  • cve_scanner, cvedb, VersionSignature: aiosqlite module
  • downloading nvd dataset: aiohttp or parfive module.

Do we still want to maintain 2 versions? 1) for multithreading and 2) for single threaded.

@pdxjohnny
Copy link
Member

Nice. I think you'll find that running the checkers is something that might be good to have threaded. But the rest will probably be fine within the main event loop. I'd recommend sticking with ahiohttp over parfive, having a larger userbase means we can count on it to stay maintained.

@Niraj-Kamdar
Copy link
Contributor

I feel like running checkers are CPU bound task because it involves finding regex pattern in every line of the lines array(parsed by strings module). There won't be any benefit in running it with multiple threads because of python GIL. It would be nice if we can run checkers in truly parallel manner. So, I think process pool is right choice.

@Niraj-Kamdar
Copy link
Contributor

In case of aiohttp vs parfive, I am going to benchmark performance of both of the module. If there won't be that much difference, I will stick with aiohttp. In case of parfive, we probably won't have to worry about maintenance because it's a submodule of sunpy.

@pdxjohnny
Copy link
Member

I feel like running checkers are CPU bound task because it involves finding regex pattern in every line of the lines array(parsed by strings module). There won't be any benefit in running it with multiple threads because of python GIL. It would be nice if we can run checkers in truly parallel manner. So, I think process pool is right choice.

Yes, process pool is right choice

@Niraj-Kamdar
Copy link
Contributor

Do we still want to provide an option for multithreaded? Since now we are going to use asyncio everywhere. It won't be practical to maintain two versions and I think it won't serve any purpose. Why one want to use slower synchronous version?

I propose just maintain async functions for everything.

@pdxjohnny
Copy link
Member

I agree, make everything async, run whatever is CPU bound in executors. This essentially means mutlithreaded/processed is the default, provided there is performance gain to be had from running some parts (like the checkers) in executors, which I highly suspect there will be.

@terriko
Copy link
Contributor Author

terriko commented May 26, 2020

+1 to not maintaining non-async variants of the same code. We should try to be self-aware about making sure the debugging log messages are easy to follow, as people often have more trouble debugging async code, but that's probably a given anyhow.

@Niraj-Kamdar
Copy link
Contributor

Update: extractor, strings, file and cvedb is asynchronous now.
Remaining modules: VersionScanner, CVEScanner, checkers and cli
These remaining modules will complete the pipeline and I believe they will improve performance while scanning large directory significantly.

I will continue working on it in my free time and I am hoping to complete it before version 2.0 release but it can take long since my academics has resumed now.

@terriko terriko added this to the 2.0 milestone Sep 8, 2020
@terriko terriko modified the milestones: 2.0, future Sep 30, 2020
@terriko
Copy link
Contributor Author

terriko commented Sep 30, 2020

I'm listing this as "future" in case parts of it don't get done before 2.0, but anything that can be done for this release would be great.

@terriko terriko changed the title Modernize multiprocessing Modernize multiprocessing (convert to asyncio) Sep 30, 2020
@terriko
Copy link
Contributor Author

terriko commented Jun 16, 2022

I think most of this has been done and this issue can be closed. If anyone notices specific areas that can be improved, we can open separate issues for them.

@terriko terriko closed this as completed Jun 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants