@ShaneHarvey ShaneHarvey commented Nov 4, 2022

Benchmark results for iterating a randomly generated file with readline() before this change:

client.server_info()['version']='5.0.13' pymongo.__version__='4.4.0.dev0'
object_size_mb= 1, chunk_size_kb= 128, 0.030s
object_size_mb= 1, chunk_size_kb= 256, 0.036s
object_size_mb= 1, chunk_size_kb= 512, 0.056s
object_size_mb= 1, chunk_size_kb=1024, 0.122s
object_size_mb= 1, chunk_size_kb=8192, 0.116s
object_size_mb= 5, chunk_size_kb= 128, 0.179s
object_size_mb= 5, chunk_size_kb= 256, 0.185s
object_size_mb= 5, chunk_size_kb= 512, 0.338s
object_size_mb= 5, chunk_size_kb=1024, 0.652s
object_size_mb= 5, chunk_size_kb=8192, 8.401s
object_size_mb=20, chunk_size_kb= 128, 0.869s
object_size_mb=20, chunk_size_kb= 256, 1.131s
object_size_mb=20, chunk_size_kb= 512, 1.694s
object_size_mb=20, chunk_size_kb=1024, 3.029s
object_size_mb=20, chunk_size_kb=8192, 51.000s
(cut off after 20MB because the 80MB benchmark took too long)

And after:

client.server_info()['version']='5.0.13' pymongo.__version__='4.4.0.dev0'
object_size_mb= 1, chunk_size_kb= 128, 0.012s
object_size_mb= 1, chunk_size_kb= 256, 0.020s
object_size_mb= 1, chunk_size_kb= 512, 0.012s
object_size_mb= 1, chunk_size_kb=1024, 0.012s
object_size_mb= 1, chunk_size_kb=8192, 0.014s
object_size_mb= 5, chunk_size_kb= 128, 0.070s
object_size_mb= 5, chunk_size_kb= 256, 0.080s
object_size_mb= 5, chunk_size_kb= 512, 0.075s
object_size_mb= 5, chunk_size_kb=1024, 0.050s
object_size_mb= 5, chunk_size_kb=8192, 0.043s
object_size_mb=20, chunk_size_kb= 128, 0.191s
object_size_mb=20, chunk_size_kb= 256, 0.191s
object_size_mb=20, chunk_size_kb= 512, 0.189s
object_size_mb=20, chunk_size_kb=1024, 0.189s
object_size_mb=20, chunk_size_kb=8192, 0.150s
object_size_mb=80, chunk_size_kb= 128, 0.625s
object_size_mb=80, chunk_size_kb= 256, 0.572s
object_size_mb=80, chunk_size_kb= 512, 0.638s
object_size_mb=80, chunk_size_kb=1024, 0.584s
object_size_mb=80, chunk_size_kb=8192, 0.610s

This is a huge performance gain, ranging from about 2x to more than 300x in the runs above, depending on the file size and chunk_size.
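To put numbers on that range, the speedup for each configuration can be computed directly from the two tables. The snippet below is only arithmetic on the timings listed above (the 80MB rows are left out because there is no "before" measurement for them); the variable names are illustrative.

```python
# Timings in seconds, copied from the benchmark output above,
# keyed by (object_size_mb, chunk_size_kb).
before = {
    (1, 128): 0.030, (1, 256): 0.036, (1, 512): 0.056,
    (1, 1024): 0.122, (1, 8192): 0.116,
    (5, 128): 0.179, (5, 256): 0.185, (5, 512): 0.338,
    (5, 1024): 0.652, (5, 8192): 8.401,
    (20, 128): 0.869, (20, 256): 1.131, (20, 512): 1.694,
    (20, 1024): 3.029, (20, 8192): 51.000,
}
after = {
    (1, 128): 0.012, (1, 256): 0.020, (1, 512): 0.012,
    (1, 1024): 0.012, (1, 8192): 0.014,
    (5, 128): 0.070, (5, 256): 0.080, (5, 512): 0.075,
    (5, 1024): 0.050, (5, 8192): 0.043,
    (20, 128): 0.191, (20, 256): 0.191, (20, 512): 0.189,
    (20, 1024): 0.189, (20, 8192): 0.150,
}

# Speedup = time before / time after, for each shared configuration.
speedups = {key: before[key] / after[key] for key in before}
smallest = min(speedups, key=speedups.get)
largest = max(speedups, key=speedups.get)

for (size_mb, chunk_kb), ratio in sorted(speedups.items()):
    print(f"object_size_mb={size_mb:2d}, chunk_size_kb={chunk_kb:4d}, {ratio:6.1f}x")
```

The smallest ratio is for the 1MB file with 256KB chunks (0.036s to 0.020s) and the largest is for the 20MB file with 8192KB chunks (51.000s to 0.150s).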

@ShaneHarvey ShaneHarvey requested a review from blink1073 November 7, 2022 17:16
@ShaneHarvey

Note that my first approach was to use a memoryview() for the buffered chunk (self.__buffer), but this was cumbersome for two reasons:

  1. The readchunk() method needs to return bytes, not a memoryview, so we would need to pay the cost of copying to bytes anyway.
  2. The readline() method uses bytes.index() to find the next newline, which memoryview does not support, so it's handy to keep the bytes around.
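As a rough illustration of the second point, here is a minimal sketch of scanning a bytes buffer for newlines while tracking an offset, so the unread tail is never re-sliced. This is not the PyMongo implementation: `iter_lines` is a hypothetical helper, and it uses `bytes.find()` with a start offset rather than `bytes.index()` only to avoid handling `ValueError` when no newline is left.

```python
def iter_lines(buffer: bytes):
    """Yield lines (newline included) from `buffer`, tracking an offset.

    Hypothetical helper for illustration only; not taken from PyMongo.
    """
    pos = 0
    size = len(buffer)
    while pos < size:
        # bytes.find() accepts a start offset, so the search resumes where
        # the previous line ended without copying the rest of the buffer.
        newline = buffer.find(b"\n", pos)
        end = size if newline == -1 else newline + 1
        # Only the line being returned is copied out of the buffer.
        yield buffer[pos:end]
        pos = end


chunk = b"first\nsecond\nlast-without-newline"
lines = list(iter_lines(chunk))
```

Because the buffer stays a plain bytes object, each returned line is already bytes and no conversion from a memoryview is needed.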

@ShaneHarvey ShaneHarvey merged commit da4df79 into mongodb:master Nov 7, 2022
@ShaneHarvey ShaneHarvey deleted the PYTHON-3508 branch November 7, 2022 18:51