New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW]: Zoomerjoin: Superlatively Fast Fuzzy-Joins #5693
Comments
Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks. For a list of things I can do to help you, just type:
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
|
|
Wordcount for |
|
👋🏼 @beniaminogreen, @cjbarrie, @wincowgerDEV this is the review thread for the paper. All of our communications will happen here from now on. As a reviewer, the first step is to create a checklist for your review by entering
as the top of a new comment in this thread. These checklists contain the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. The first comment in this thread also contains links to the JOSS reviewer guidelines. The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues and pull requests on the software repository. Please also feel free to comment and ask questions on this thread. In my experience, it is better to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package. We aim for reviews to be completed within about 2-4 weeks. Please let me know if any of you require some more time. We can also use EditorialBot (our bot) to set automatic reminders if you know you'll be away for a known period of time. Please feel free to ping me if you have any questions/concerns. |
Review checklist for @cjbarrieConflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper
|
This is a very good package. Easy to use and well written. There are just a few small changes I would recommend.
Additionally, some small typos in the article:
For validating the package functionality, I did the following for now. It would be helpful if some version of this were incorporated in the documentation so users can test functionality easily: library(dplyr)
url <- "https://raw.githubusercontent.com/beniaminogreen/zoomerjoin/main/vignettes/bonica.csv"
data <- read.csv(url(url))
set.seed(123L)
corpus_1 <- data %>%
sample_n(500000) %>%
select(x) %>%
rename(field = x)
corpus_2 <- data %>%
sample_n(500000) %>%
select(x) %>%
rename(field = x)
start_time <- Sys.time()
join_out <- jaccard_inner_join(corpus_1, corpus_2,
by = "field", n_gram_width=6,
n_bands=20, band_width=6, threshold = .8)
print(Sys.time() - start_time)
|
Hi, Thanks for your detailed feedback and your kind words about the package. Writing in response to the four points you raise:
I'll also make sure to fix the typos you flag in the paper. Best and many thanks, |
@editorialbot generate pdf |
Writing to share some of the updates to the package over the past few days. In addition to fixing typos and fixing a bug in the euclidean joining code, I have:
Because of this, I have also uploaded this gist which runs the same benchmarking code over a smaller set of inputs to show the difference in scaling between the two methods for smaller datasets. It should run in ~20 mins on a modern laptop. |
Review checklist for @wincowgerDEVConflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper
|
@beniaminogreen, I ran into an issue with the installation and posted here: beniaminogreen/zoomerjoin#70 |
@beniaminogreen recommendation on line 77 of the manuscript, I don't think you want to claim that your package never creates false positives. True positive matches in character datasets are often subjective and may be complex (beyond the capabilities of jacard similarity), it may confuse readers when they do not get the expected output. |
@beniaminogreen, I recommend adding some code examples in text to demonstrate your points, it will help readers understand which function to use to achieve which outcomes you are talking about. |
@beniaminogreen, it isn't clear where the code for producing the euclidean synthetic dataset is described in the manuscript. I think there should be a link in the manuscript to where the code is that produced the figure. |
I agree with the comment that "no false positives" might mislead users as to what zoomerjoin can accomplish. I'd still like to flag the fact that zoomerjoin will never identify units that are less similar than the threshold the user selects as matches, but I'll see if there's a way to say this that sets expectations better. The code for the synthetic dataset and benchmarking can be found in this vignette, and I'll include a link in the paper. I also like the suggestion of adding code examples, so I'll add a small example of joining using the Euclidean and Jaccard distances to the article. I thought that including a contrubuting.md file was enough to satisfy the JOSS requirements, but since both of you have raised this as a concern, I will write up a more detailed set of instructions on contributing or reporting bugs. Best, |
@editorialbot generate pdf |
Updating the article to make it responsive to comments above. Specifically: have added code for the synthetic dataset, linked to benchmarking vignette, and added an example use of Added contributing instructions in a separate update to the repo. Best, |
Hi All, Thanks for your detailed feedback about the package. In response to some of the comments you've given, I have made the rust code in the package capture R's random seed, so you can set the seed and make sure that all the results are repicable. This helped me catch a locking issue with some of the rust code. I've bundled a version of the dime dataset with the package so that users can start experimenting with the joining function immediately after installing the package. It looks like both of you are having difficulty replicating some of the performance claims in the paper. Ideally, I would like to include benchmarks on large datasets to show off the scaling of the package, but I recognize that this makes it tough for replicators. Is there any way I can make it easier for you to run the benchmarks? Similarly, please don't hesitate to let me know if I can make it easier for you to sign off on any of the other steps. Best, and many thanks for your help, |
Done! archive is now 10.5281/zenodo.8370652 |
@editorialbot recommend-accept |
|
|
The paper's PDF and metadata files generation produced some warnings that could prevent the final paper from being published. Please fix them before the end of the review process.
|
IDREFS attribute rid references an unknown ID "ref-rextednr" |
on it. |
Just fixed - I had introduced a minor spelling error in a citation somewhere in the revision process. |
@editorialbot generate pdf |
@editorialbot recommend-accept |
|
|
👋 @openjournals/sbcs-eics, this paper is ready to be accepted and published. Check final proof 👉📄 Download article If the paper PDF and the deposit XML files look good in openjournals/joss-papers#4604, then you can now move forward with accepting the submission by compiling again with the command |
@editorialbot accept |
|
|
Hmm. I don't understand the error message in the github workflow log, but please let me know if there's anything I can do to fix it. |
@editorialbot accept |
|
Ensure proper citation by uploading a plain text CITATION.cff file to the default branch of your repository. If using GitHub, a Cite this repository menu will appear in the About section, containing both APA and BibTeX formats. When exported to Zotero using a browser plugin, Zotero will automatically create an entry using the information contained in the .cff file. You can copy the contents for your CITATION.cff file here: CITATION.cff
If the repository is not hosted on GitHub, a .cff file can still be uploaded to set your preferred citation. Users will be able to manually copy and paste the citation. |
🐘🐘🐘 👉 Toot for this paper 👈 🐘🐘🐘 |
🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨 Here's what you must now do:
Any issues? Notify your editorial technical team... |
@cjbarrie, @wincowgerDEV – many thanks for your reviews here and to @samhforbes for editing this submission! JOSS relies upon the volunteer effort of people like you and we simply wouldn't be able to do this without you ✨ @beniaminogreen – your paper is now accepted and published in JOSS ⚡🚀💥 |
🎉🎉🎉 Congratulations on your paper acceptance! 🎉🎉🎉 If you would like to include a link to your paper from your README use the following code snippets:
This is how it will look in your documentation: We need your help! The Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us please consider doing either one (or both) of the the following:
|
Submitting author: @beniaminogreen (Beniamino Green)
Repository: https://github.com/beniaminogreen/zoomerjoin
Branch with paper.md (empty if default branch): joss
Version: v0.1.0
Editor: @samhforbes
Reviewers: @cjbarrie, @wincowgerDEV
Archive: 10.5281/zenodo.8370652
Status
Status badge code:
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@cjbarrie & @wincowgerDEV, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review.
First of all you need to run this command in a separate comment to create the checklist:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @samhforbes know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Checklists
📝 Checklist for @cjbarrie
📝 Checklist for @wincowgerDEV
The text was updated successfully, but these errors were encountered: