Skip to content

Add records_are_mates function#79

Merged
marcelm merged 13 commits into
marcelm:mainfrom
rhpvorderman:are_mates
May 17, 2022
Merged

Add records_are_mates function#79
marcelm merged 13 commits into
marcelm:mainfrom
rhpvorderman:are_mates

Conversation

@rhpvorderman
Copy link
Copy Markdown
Collaborator

Fixes #78

I simply added records_are_mates(*args) because this is more suitable for the use case. I noted the most typical use case (in my opinion) in the example in the docstring. Having a SequenceRecord.are_mates(self, *args) function seemed a bit more clunky in the typical use case. records_are_mates(r1, r2, r3) vs r1.are_mates(r2,r3).

This function accepts any number of SequenceRecords as long as it is 2 or larger. Immediately future proof for some other type of FASTQ files that might pop up later.

Cython does generate less optimal code here.

static PyMethodDef __pyx_mdef_5dnaio_5_core_7records_are_mates = {"records_are_mates", (PyCFunction)(void*)(PyCFunctionWithKeywords)__pyx_pw_5dnaio_5_core_7records_are_mates, METH_VARARGS|METH_KEYWORDS, 0}

There are no keywords! METH_VARARGS is sufficient. And then I can also not simply access the tuple with an index and expect this to become PyTuple_GET_ITEM. No.... The index has to be converted to a Python Integer first. So I try to use PyTuple_GET_ITEM myself, and I get these complaints about object vs PyObject *. There is no difference Cython, internally it is all PyObject *, you made that up yourself, and now you are complaining to me?! So I do some casts, but then I can not see clearly if Cython handles the references properly. Then I gave up. This should have been pretty straightforward as Python guarantees that any object passed as an argument has an increased reference count. So we can simply borrow the reference with PyTuple_GET_ITEM and don't worry about it.

So the code is not as optimized as it could be (and as I wanted it to be). But it should be more optimal than two separate is_mate calls. (Although I do not have time to benchmark that now).

Copy link
Copy Markdown
Owner

@marcelm marcelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks mostly ok although I don’t like the code duplication between records_are_mates and record_ids_match that much. In theory, SequenceRecord.is_mate could just call records_are_mates, but I guess that adds some overhead, so I’m willing to accept the duplication if you think it’s worth it.

Comment thread src/dnaio/_core.pyx Outdated
Comment thread src/dnaio/_core.pyx Outdated
Comment thread src/dnaio/_core.pyx
Comment thread src/dnaio/_core.pyx Outdated
Comment thread src/dnaio/_core.pyi Outdated
@rhpvorderman
Copy link
Copy Markdown
Collaborator Author

rhpvorderman commented May 16, 2022

This looks mostly ok although I don’t like the code duplication between records_are_mates and record_ids_match that much.

I have refactored to remove the duplication. Adding strcspn calls it is quite expensive as shown here: #22. I will perform some benchmarks about records_are_mates now and add the result later to this post.

EDIT:

DEPRECATED record_names_match
average: 71.206, range: 69.919-78.228 stdev: 0.893
record.is_mate
average: 40.31, range: 39.407-42.978 stdev: 0.484
record.is_mate twice for three-way comparison
average: 87.885, range: 85.153-105.841 stdev: 1.391
records_are_mates two-way
average: 68.451, range: 67.723-71.707 stdev: 0.441
records_are_mates three-way
average: 80.743, range: 78.783-91.882 stdev: 1.111

Hmm. Records_are_mates barely justifies itself. There seems to be quite some overhead involved in Cython's parsing. It does seem that at the 4-way comparison records_are_mates will be the better choice. But that is not a use case now. I will check if I can scrape off some of the overhead.

@rhpvorderman
Copy link
Copy Markdown
Collaborator Author

Trimmed some of the fat and the results are satisfactory:

DEPRECATED record_names_match
average: 74.373, range: 69.232-197.702 stdev: 11.2
record.is_mate
average: 41.884, range: 40.16-48.01 stdev: 1.026
record.is_mate twice for three-way comparison
average: 88.579, range: 85.7-122.42 stdev: 2.497
records_are_mates two-way
average: 56.527, range: 53.86-108.057 stdev: 3.034
records_are_mates three-way
average: 64.969, range: 62.759-76.146 stdev: 1.865

I benchmarked with this script.
benchmark_matches.py.gz

I did a reference leak check with sys.getrefcount, and everything works as expected.

@marcelm
Copy link
Copy Markdown
Owner

marcelm commented May 17, 2022

Super, thanks for the improvements, I’m happy with this now :-).

Let me know if we should make a new release.

@marcelm marcelm merged commit 9e73924 into marcelm:main May 17, 2022
@rhpvorderman rhpvorderman deleted the are_mates branch May 17, 2022 10:06
@rhpvorderman
Copy link
Copy Markdown
Collaborator Author

Thanks!
In fact, this would be quite useful right now. I am currently in a project working with UMIs and these are provided in a separate FASTQ file. So if a release is not too much effort, that would be great.

@marcelm
Copy link
Copy Markdown
Owner

marcelm commented May 17, 2022

Version 0.9.0 is now available.

@rhpvorderman
Copy link
Copy Markdown
Collaborator Author

Thank you! Always a pleasure working with you!

@rhpvorderman
Copy link
Copy Markdown
Collaborator Author

I think it works well in practice. Thanks again for merging.
Q.E.D: rhpvorderman/fastqdedup@943e71e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Three-way is mate comparison

2 participants