File protocol regex search improvement #1594

Open
ehsandeep opened this issue Feb 10, 2022 · 3 comments
Assignees
Labels
Investigation Something to Investigate Priority: Low This issue can probably be picked up by anyone looking to contribute to the project, as an entry fix Status: On Hold Similar to blocked, but is assigned to someone Type: Enhancement Most issues will probably ask for additions or changes.

Comments

@ehsandeep
Member

Please describe your feature request:

Currently, we read everything into memory on the assumption that the data being processed is small. That is not always the case, and processing slows down as the number of input items grows.

buffer, err := ioutil.ReadAll(file)

Reference:

shared by @yabeow
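
For illustration only, here is a minimal sketch of matching against a stream instead of a fully buffered file. It relies on the standard library's `(*regexp.Regexp).MatchReader`, which takes an `io.RuneReader`. The names `matchReader` and `matchString` are hypothetical helpers, not part of the project. Note that the reader-based methods in `regexp` only report whether (or where) the first match occurs, so this does not cover extracting every match.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// matchReader reports whether re matches anywhere in r. The input is pulled
// through a bufio buffer instead of being loaded up front with ReadAll.
func matchReader(re *regexp.Regexp, r io.Reader) bool {
	// bufio.Reader implements io.RuneReader, which MatchReader requires.
	return re.MatchReader(bufio.NewReader(r))
}

// matchString is a hypothetical convenience wrapper used by the demo below.
func matchString(pattern, input string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return matchReader(re, strings.NewReader(input)), nil
}

func main() {
	found, err := matchString(`secret_[a-z]+`, "foo secret_key bar")
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println(found)
}
```

In real use the `io.Reader` would be the opened `*os.File` rather than a string.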

@ehsandeep ehsandeep added Type: Enhancement Most issues will probably ask for additions or changes. Investigation Something to Investigate Priority: Low This issue can probably be picked up by anyone looking to contribute to the project, as an entry fix labels Feb 10, 2022
@forgedhallpass
Contributor

Potential options to consider:

  • split large file into chunks and process them on separate threads
  • look into the feasibility of an interchangeable solution, controlled by a flag (default would remain the same, the flag would control the use of a shared library for more advanced users/use-cases)
  • look into Google's RE2?
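
A rough sketch of the first option (chunks processed concurrently), under a strong simplifying assumption: a match never spans a newline, so the input can be cut on line boundaries. The function names `countMatches` and `countInString`, the batch size, and the worker count are all made up for the example. Goroutines stand in for "separate threads".

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
	"sync"
	"sync/atomic"
)

// countMatches streams r line by line, groups lines into batches, and counts
// regex matches in each batch on a pool of worker goroutines.
// Assumption: no match spans a newline. Lines longer than bufio.Scanner's
// default token limit make Scan stop and are reported through the error.
func countMatches(re *regexp.Regexp, r io.Reader, batchSize, workers int) (int, error) {
	if batchSize < 1 {
		batchSize = 1
	}
	if workers < 1 {
		workers = 1
	}
	batches := make(chan []string)
	var total int64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// A compiled *regexp.Regexp is safe for concurrent use.
			for batch := range batches {
				n := 0
				for _, line := range batch {
					n += len(re.FindAllStringIndex(line, -1))
				}
				atomic.AddInt64(&total, int64(n))
			}
		}()
	}
	sc := bufio.NewScanner(r)
	batch := make([]string, 0, batchSize)
	for sc.Scan() {
		batch = append(batch, sc.Text())
		if len(batch) == batchSize {
			batches <- batch
			batch = make([]string, 0, batchSize)
		}
	}
	if len(batch) > 0 {
		batches <- batch
	}
	close(batches)
	wg.Wait()
	return int(total), sc.Err()
}

// countInString is a hypothetical wrapper so the demo needs no file on disk.
func countInString(pattern, input string, batchSize, workers int) (int, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return 0, err
	}
	return countMatches(re, strings.NewReader(input), batchSize, workers)
}

func main() {
	n, err := countInString(`[0-9]+`, "a1\nb22\nc\n333 4\n", 2, 3)
	fmt.Println(n, err)
}
```

Only one batch per worker is held in memory at a time, so peak memory depends on the batch size rather than on the file size.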

@Mzack9999 Mzack9999 self-assigned this Feb 10, 2022
@Mzack9999 Mzack9999 added the Status: In Progress This issue is being worked on, and has someone assigned. label Feb 15, 2022
@Mzack9999
Member

After investigation, the following implementations would be needed:

  • Currently, the matcher works on strings/byte slices only, so it is necessary to implement a regex-based engine that accepts an io.Reader and can handle matches that span chunk boundaries
  • rurego provides a 2x to 4x performance increase on large chunks of data => for better portability, the library should optionally be available statically linked within the GH generated binary
  • Hyperscan is another very good option => the bindings are not up to date, so we would need to fork and refactor them
  • Create bindings for https://github.com/google/re2
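
To make the first point concrete, below is a sketch of a sliding-window search over an `io.Reader` using only the standard `regexp` package. It is not the project's implementation. `Match`, `findInStream`, `findInString`, `chunkSize`, and `overlap` are hypothetical names, and the approach assumes that no match is longer than `overlap` bytes. Anchors and word boundaries at window edges are not handled, so results can differ from a whole-file scan for such patterns.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// Match is one hit together with its absolute byte offset in the stream.
type Match struct {
	Offset int64
	Text   string
}

// findInStream reads r in chunkSize pieces and keeps a tail of the previous
// window so that matches cut by a chunk boundary are still found once.
// Assumption: no match is longer than overlap bytes.
func findInStream(re *regexp.Regexp, r io.Reader, chunkSize, overlap int) ([]Match, error) {
	if chunkSize < 1 || overlap < 0 {
		return nil, errors.New("chunkSize must be >= 1 and overlap >= 0")
	}
	var (
		out   []Match
		buf   []byte
		base  int64 // absolute offset of buf[0]
		chunk = make([]byte, chunkSize)
	)
	for {
		n, err := io.ReadFull(r, chunk)
		buf = append(buf, chunk[:n]...)
		atEOF := err == io.EOF || err == io.ErrUnexpectedEOF
		if err != nil && !atEOF {
			return out, err
		}
		// A match starting at or after safe might still grow with more
		// input, so it is deferred to the next window unless at EOF.
		safe := len(buf) - overlap
		if atEOF {
			safe = len(buf)
		}
		keepFrom := safe
		if keepFrom < 0 {
			keepFrom = 0
		}
		for _, loc := range re.FindAllIndex(buf, -1) {
			if !atEOF && loc[0] >= safe {
				break
			}
			out = append(out, Match{base + int64(loc[0]), string(buf[loc[0]:loc[1]])})
			if loc[1] > keepFrom {
				keepFrom = loc[1] // never rescan text already reported
			}
		}
		if atEOF {
			return out, nil
		}
		// Drop the processed prefix and keep the tail for the next window.
		buf = append(buf[:0], buf[keepFrom:]...)
		base += int64(keepFrom)
	}
}

// findInString is a hypothetical wrapper so the demo needs no file on disk.
func findInString(pattern, input string, chunkSize, overlap int) ([]Match, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return findInStream(re, strings.NewReader(input), chunkSize, overlap)
}

func main() {
	// A 4-byte chunk size forces every match to straddle a chunk boundary.
	ms, err := findInString(`id=[0-9]+`, "id=12345 id=678 id=9", 4, 8)
	fmt.Println(ms, err)
}
```

The buffer never grows beyond roughly `chunkSize + overlap` bytes as long as the length assumption holds, at the cost of rescanning the overlapping tail on each window.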

@Mzack9999
Member

Blocked by #1634

@ehsandeep ehsandeep removed the Status: Blocked There is some issue that needs to be resolved first. label Mar 10, 2022
@Ice3man543 Ice3man543 changed the title File protocol regex search improvements f May 9, 2022
@Ice3man543 Ice3man543 changed the title f File protocol regex search improvement May 9, 2022
@forgedhallpass forgedhallpass added the Status: On Hold Similar to blocked, but is assigned to someone label Jul 20, 2022

3 participants