Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Binary Search in Replay #631

Merged
merged 64 commits into from
Sep 29, 2022
Merged

Implement Binary Search in Replay #631

merged 64 commits into from
Sep 29, 2022

Conversation

machawk1
Copy link
Member

@machawk1 machawk1 commented Mar 1, 2020

This implementation is heavily inspired by @ibnesayeed's MementoMap implementation.

In the future we might align this with @ibnesayeed upcoming API but for now, this drastically speeds up replay and thus closes #604. I recommend we merge then refine the implementation.

@ibnesayeed Can you have a look and let me know of some refinements with regard to semantics and implementation?

ipwb/replay.py Outdated Show resolved Hide resolved
ipwb/replay.py Outdated Show resolved Hide resolved
ipwb/replay.py Outdated Show resolved Hide resolved
ipwb/replay.py Outdated Show resolved Hide resolved
machawk1 and others added 2 commits March 3, 2020 14:44
Co-Authored-By: Sawood Alam <ibnesayeed@gmail.com>
Co-authored-by: Sawood Alam <ibnesayeed@gmail.com>
@machawk1
Copy link
Member Author

@ibnesayeed It would be interesting to use your code-in-github-comments GitHub action to evaluate the binsearch speedup in this PR.

@ibnesayeed
Copy link
Member

@ibnesayeed It would be interesting to use your code-in-github-comments GitHub action to evaluate the binsearch speedup in this PR.

Sure, please go ahead and craft a sequence of commands or python code that you would write to compare the two variations if you were to test it locally on your machine. From there it should be easy to convert it to something that we can make it run here and report the results back.

@machawk1
Copy link
Member Author

machawk1 commented Jul 14, 2020

@ibnesayeed It would be interesting to use your code-in-github-comments GitHub action to evaluate the binsearch speedup in this PR.

Sure, please go ahead and craft a sequence of commands or python code that you would write to compare the two variations if you were to test it locally on your machine. From there it should be easy to convert it to something that we can make it run here and report the results back.

This could probably be performed using GitHub Actions, right?

An easy way to do this might be to fabricate large data with https://github.com/machawk1/cdxjGenerator then run the master branch vs. the issue-631 branch.

High level test procedure outside of the code (brainstorming):

git clone https://github.com/oduwsdl/ipwb
git clone https://github.com/machawk1/cdxjGenerator
cd ipwb
pip install -r requirements.txt
pip install .
cd ../cdxjGenerator
pip install -r requirements.txt

python cdxjGenerator/cdxjGenerator.py 500000 > ./test.cdxj
(observe some values in various places in the file)
time (find values, record respectively)
git checkout issue-631
pip install -r requirements.txt
pip install .
time (find values, record respectively)
(compare values)

@ibnesayeed
Copy link
Member

@machawk1 any plans to publish cdxjGenerator as a package and with a CLI command like cdxjen? Currently, it would require cloning the repo, changing the working directory, installing requirements, and then executing the script, but when released as a package and CLI, it will become a 2-step process, pip install cdxjen && cdxjen.

@machawk1
Copy link
Member Author

machawk1 commented Jul 14, 2020

@ibnesayeed I thought I had already done so but in writing the above procedure, realized I had not, so created machawk1/cdxjGenerator#3. I realize I can create a package without submitting it to pypi but it will help move the process along if there is an end-goal.

@machawk1
Copy link
Member Author

Merging this in after multiple discussions and a long delay. It is not the optimal implementation but it is a step in the right direction.

@machawk1 machawk1 merged commit 2094cd2 into master Sep 29, 2022
@machawk1 machawk1 deleted the issue-604 branch September 29, 2022 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement memory-efficient in-file binary search for CDXJ indexes
2 participants