Improve byte conversion speeds by rhpvorderman · Pull Request #12 · marcelm/dnaio

rhpvorderman · 2021-09-14T13:44:36Z

Hi. Using .encode('ascii') is a nice way to retrieve bytes, but a dedicated Cython method that does the same is twice as fast. This leads to noticeable results on my fastq filter project:

before:

Benchmark #1: fastq-filter -o /dev/null "median_quality:20" ~/test/500000reads.fastq
  Time (mean ± σ):     608.5 ms ±   9.6 ms    [User: 585.9 ms, System: 22.4 ms]
  Range (min … max):   596.4 ms … 621.6 ms    10 runs

After:

Benchmark #1: fastq-filter -o /dev/null "median_quality:20" ~/test/500000reads.fastq
  Time (mean ± σ):     568.6 ms ±  11.7 ms    [User: 547.4 ms, System: 21.2 ms]
  Range (min … max):   550.0 ms … 583.2 ms    10 runs

That is quite a substantial improvement in runtime: the mean drops from 608.5 ms to 568.6 ms, roughly 6.5%.

Also, I found that writing back to a file was quite slow, so I updated the fastq_bytes method. I tested several approaches, benchmarked them, and then chose the fastest. The process can be seen in the commits.

It turns out that encoding is not necessarily slow, but Python's methods of joining strings together are. Using b"".join([...]) already decreased the runtime by 30% compared to the + methods. But I decided it was still slow and opted for raw C-API calls and memcpy. That made it more than 3 times faster.
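
For illustration, the first two strategies mentioned here can be sketched in pure Python. This is not the code from the commits: the function names and the sample record are hypothetical, and the exact form of the original "+" variant is an assumption. Timings will also differ from the Cython build, so the sketch only shows the shape of the comparison.

```python
import timeit


def fastq_bytes_plus(name: str, sequence: str, qualities: str) -> bytes:
    """Assumed '+' variant: encode each field and chain concatenations."""
    return (
        b"@" + name.encode("ascii") + b"\n"
        + sequence.encode("ascii") + b"\n+\n"
        + qualities.encode("ascii") + b"\n"
    )


def fastq_bytes_join(name: str, sequence: str, qualities: str) -> bytes:
    """Join variant: collect all pieces and copy them in one b"".join call."""
    return b"".join([
        b"@", name.encode("ascii"), b"\n",
        sequence.encode("ascii"), b"\n+\n",
        qualities.encode("ascii"), b"\n",
    ])


# Hypothetical sample record: 100 bases with matching quality characters.
record = ("read1", "ACGT" * 25, "I" * 100)

# Both variants must produce identical output before timing means anything.
assert fastq_bytes_plus(*record) == fastq_bytes_join(*record)

if __name__ == "__main__":
    for func in (fastq_bytes_plus, fastq_bytes_join):
        seconds = timeit.timeit(lambda: func(*record), number=100_000)
        print(f"{func.__name__}: {seconds:.3f} s")
```

Each `+` creates a new intermediate bytes object, whereas `join` computes the total size once and copies every piece a single time, which is the likely reason for the difference reported above.
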
The result is quite noticeable when writing FASTQ files now. See below, where I benchmarked both read and read_and_write. The write time is the difference between the two.

Before:

Benchmark #1: python dnaio_read.py ~/test/big2.fastq
  Time (mean ± σ):      2.158 s ±  0.021 s    [User: 2.030 s, System: 0.127 s]
  Range (min … max):    2.135 s …  2.190 s    10 runs

Benchmark #1: python dnaio_read_and_write.py ~/test/big2.fastq /dev/null
  Time (mean ± σ):      4.353 s ±  0.036 s    [User: 4.199 s, System: 0.153 s]
  Range (min … max):    4.306 s …  4.406 s    10 runs

All records are written in about 2.2 seconds.

After:

Benchmark #1: python dnaio_read.py ~/test/big2.fastq
  Time (mean ± σ):      2.215 s ±  0.020 s    [User: 2.069 s, System: 0.145 s]
  Range (min … max):    2.175 s …  2.240 s    10 runs

Benchmark #1: python dnaio_read_and_write.py ~/test/big2.fastq /dev/null
  Time (mean ± σ):      3.535 s ±  0.035 s    [User: 3.369 s, System: 0.166 s]
  Range (min … max):    3.486 s …  3.578 s    10 runs

All records are written in about 1.3 seconds.

That is quite a good improvement: the write time drops from about 2.2 seconds to about 1.3 seconds, roughly 40%.
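
The write times quoted above follow from subtracting the read-only mean from the read-and-write mean. A small sketch of that arithmetic, using only the mean values from the benchmark output (the variable and function names are made up for illustration):

```python
# Mean wall-clock times in seconds, copied from the benchmark output above.
before = {"read": 2.158, "read_and_write": 4.353}
after = {"read": 2.215, "read_and_write": 3.535}


def write_time(run: dict) -> float:
    """Estimate the time spent writing as read_and_write minus read."""
    return run["read_and_write"] - run["read"]


write_before = write_time(before)
write_after = write_time(after)

# Relative reduction in write time between the two versions.
reduction = 1 - write_after / write_before

print(f"write time before: {write_before:.3f} s")
print(f"write time after:  {write_after:.3f} s")
print(f"reduction:         {reduction:.1%}")
```

This is only an estimate: the two scripts are separate runs, so the subtraction ignores run-to-run noise (the read-only benchmark itself differs by about 0.06 s between the two versions).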

marcelm

Cool! I used your benchmarking script to test another variant that uses f-strings because I recently heard they’re quite fast. So for reference, this version

def fastq_bytes(self):
    return f"@{self.name}\n{self.sequence}\n+\n{self.qualities}\n".encode("ascii")

comes quite close to your improvements (your version: 0.13, f-strings: 0.15). But in this case, giving up readability is fine.
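
To try the f-string variant outside dnaio, a self-contained sketch could look like the following. The `Record` dataclass is a hypothetical stand-in with the three attributes used above; it is not dnaio's actual record class.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a FASTQ record; not dnaio's own class."""
    name: str
    sequence: str
    qualities: str

    def fastq_bytes(self) -> bytes:
        # Format the four FASTQ lines as one str, then encode it once.
        return f"@{self.name}\n{self.sequence}\n+\n{self.qualities}\n".encode("ascii")


record = Record("read1", "ACGT", "IIII")
print(record.fastq_bytes())  # b'@read1\nACGT\n+\nIIII\n'
```

Note that `.encode("ascii")` raises `UnicodeEncodeError` if any field contains a non-ASCII character, so this variant also validates its input as a side effect.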

I haven’t quite finished my review, but here are some comments already.

rhpvorderman · 2021-09-15T07:56:28Z

> comes quite close to your improvements (your version: 0.13, f-strings: 0.15). But in this case, giving up readability is fine.

Oh, that is interesting. I hadn't considered f-strings because these were actually slower than the original fastq_bytes implementation. But that was using pure Python. I can imagine that Cython converts it automatically to some code that resembles the Python C-API calling code that I made. I think the hand-crafted code is faster because the '\n', '@' and '+' characters are not represented in a Python string but are written directly into memory as raw integers, thus saving some overhead.
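
The idea of writing the delimiters as raw integers into a preallocated buffer can be mirrored in pure Python with a `bytearray`. This is only an analogue of the layout, not the C-API and memcpy code from the commits; the function name is hypothetical, and in pure Python this version is not expected to be faster than `join`.

```python
def fastq_bytes_prealloc(name: str, sequence: str, qualities: str) -> bytes:
    """Fill one preallocated buffer: fields by slice copy, delimiters as ints."""
    name_b = name.encode("ascii")
    seq_b = sequence.encode("ascii")
    qual_b = qualities.encode("ascii")

    # Layout: '@' name '\n' sequence '\n' '+' '\n' qualities '\n'
    total = 1 + len(name_b) + 1 + len(seq_b) + 3 + len(qual_b) + 1
    buf = bytearray(total)

    pos = 0
    buf[pos] = 0x40  # '@' stored as a raw integer
    pos += 1
    buf[pos:pos + len(name_b)] = name_b  # bulk copy, the memcpy analogue
    pos += len(name_b)
    buf[pos] = 0x0A  # '\n'
    pos += 1
    buf[pos:pos + len(seq_b)] = seq_b
    pos += len(seq_b)
    buf[pos] = 0x0A  # '\n'
    buf[pos + 1] = 0x2B  # '+'
    buf[pos + 2] = 0x0A  # '\n'
    pos += 3
    buf[pos:pos + len(qual_b)] = qual_b
    pos += len(qual_b)
    buf[pos] = 0x0A  # final '\n'
    return bytes(buf)


print(fastq_bytes_prealloc("read1", "ACGT", "IIII"))  # b'@read1\nACGT\n+\nIIII\n'
```

The single-byte stores avoid creating any Python object for the delimiters, which is the overhead saving described above.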

marcelm · 2021-09-15T08:04:33Z

> Cython converts it automatically to some code that resembles the Python C-API calling code

Exactly, it creates a temporary tuple, fills it one by one with the properly formatted parts of the f-string and then runs PyUnicode_Join to create the final result.
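
In Python terms, the description above corresponds roughly to building a tuple of parts and joining it. The sketch below is an approximation of that behaviour, not the C code Cython actually generates, and both function names are hypothetical.

```python
def fastq_bytes_fstring(name: str, sequence: str, qualities: str) -> bytes:
    """The f-string form from the review comment, as a plain function."""
    return f"@{name}\n{sequence}\n+\n{qualities}\n".encode("ascii")


def fastq_bytes_tuple_join(name: str, sequence: str, qualities: str) -> bytes:
    """Approximate expansion: fill a tuple with the parts, then join once."""
    parts = ("@", name, "\n", sequence, "\n+\n", qualities, "\n")
    return "".join(parts).encode("ascii")


args = ("read1", "ACGT", "IIII")
# Both forms should yield the same bytes for any ASCII input.
assert fastq_bytes_fstring(*args) == fastq_bytes_tuple_join(*args)
```

Here the literal pieces such as `"\n"` are still full Python str objects in the tuple, which is the overhead that the hand-written version avoids.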

rhpvorderman · 2021-09-15T13:16:27Z

@marcelm, I addressed the second round of review comments.

marcelm · 2021-09-15T16:43:41Z

Perfect, thanks, this is a nice improvement!

rhpvorderman · 2021-09-17T07:40:39Z

Thanks for merging! Always a pleasure to work with you!

rhpvorderman · 2021-09-27T11:58:28Z

Sorry for bothering you, but when is a new release planned? I would like fastq-filter to depend on the next version of dnaio so I can use the qualities_as_bytes method. Thanks!

marcelm · 2021-09-28T06:52:09Z

Doing it now

marcelm · 2021-09-28T07:27:30Z

Version 0.6.0 is now on PyPI

rhpvorderman · 2021-09-28T07:55:20Z

Thank you for your quick response!

rhpvorderman added 7 commits September 13, 2021 14:41

Create differing methods and benchmarks

8902c46

Simplify fastqbytes4, add more benchmark tools

0394539

Add method with more API calls and less cython checking

29bbf8c

Replace fastq bytes method, add sequence_bytes and qualities_bytes me…

aa2fdcb

…thods

Add test for fastq_bytes

b35662e

Remove redundant sequence bytes

06220ec

Add comments

63382fe

rhpvorderman force-pushed the fasterbytes branch from bfb3524 to 63382fe Compare September 14, 2021 13:58

rhpvorderman mentioned this pull request Sep 14, 2021

Add Sequence.id attribute #10

Closed

marcelm reviewed Sep 14, 2021

Comment thread src/dnaio/_core.pyx

Comment thread src/dnaio/_core.pyx Outdated

Comment thread src/dnaio/_core.pyx Outdated

Comment thread src/dnaio/_core.pyx Outdated

Comment thread src/dnaio/_core.pyx Outdated

Address review comments

ae2290c

Extend docstring

0c0b23c

marcelm reviewed Sep 15, 2021

Comment thread src/dnaio/_core.pyx Outdated

marcelm reviewed Sep 15, 2021

Comment thread src/dnaio/_core.pyx Outdated

Upper-case ASCII and PEP-257 compliant docstrings

c164dff

marcelm merged commit ef8f341 into marcelm:main Sep 15, 2021

rhpvorderman deleted the fasterbytes branch September 17, 2021 07:40
