
PackfileMaintenanceStep: Use second-largest pack-size for batch size #375

Merged
merged 1 commit into microsoft:master on Jun 2, 2020

Conversation

@derrickstolee (Contributor) commented May 4, 2020

When the total size of all pack-files is smaller than 2 gigabytes, we
currently use a strange mechanism for trying to repack everything
else into a single pack.

The "estimated output size" that we use for predicting the batch size
will under-count pack sizes when there are duplicates. This means that
the current model will actually no-op most of the time! By picking the
second-largest file size, the repack command will have a high
likelihood of doing something, and that size will grow geometrically
as we go. This strategy was recommended by @sluongng and it seems
to work better.

This only affects "small" repos. We will use 2 gigabytes as the
default batch-size for large repos.
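
For illustration, here is a minimal C sketch of that selection logic. The function and constant names are hypothetical (Scalar's actual implementation differs); the `+ 1` follows @sluongng's suggestion below so that the second-largest pack itself qualifies for the batch.

```c
#include <stddef.h>
#include <stdint.h>

#define TWO_GB ((uint64_t)2 << 30)

/* Hypothetical sketch, not Scalar's actual code: pick the repack
 * batch size from the on-disk sizes of the existing packs. */
static uint64_t choose_batch_size(const uint64_t *pack_sizes, size_t n)
{
	uint64_t total = 0, largest = 0, second = 0;

	for (size_t i = 0; i < n; i++) {
		total += pack_sizes[i];
		if (pack_sizes[i] > largest) {
			second = largest;
			largest = pack_sizes[i];
		} else if (pack_sizes[i] > second) {
			second = pack_sizes[i];
		}
	}

	if (total > TWO_GB)
		return TWO_GB;	/* large repos: default 2 GB batch size */
	return second + 1;	/* small repos: second-largest + 1 */
}
```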

@derrickstolee (Contributor, Author):

/azp run microsoft.scalar

@azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

@derrickstolee (Contributor, Author):

/azp run microsoft.scalar

@azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

@dscho (Member) commented Jun 1, 2020

I guess I lack the context to understand... why is the second-largest pack's size a sensible choice? From what little I understand, I would have expected something like the total size of all packs except the largest one...

@derrickstolee (Contributor, Author) commented Jun 1, 2020

> I guess I lack the context to understand... why is the second-largest pack's size a sensible choice? From what little I understand, I would have expected something like the total size of all packs except the largest one...

I felt like the comment @sluongng contributed on the list was enough to justify this work, but now that I revisit it, I feel it isn't fully justified. Perhaps @sluongng could share some more about their experience here.

@sluongng (Contributor) commented Jun 1, 2020

Hi folks,

This batch_size is more of a prediction of what the total repack size should at least be; if the prediction misses, then no repack happens.

So the current batch size calculation is:

batch_size = total_pack_size - max_pack_size

batch_size is first compared with expected_size [1] and eventually with total_size [2].
The problem here is that expected_size and total_size can vary depending on the nature of the packs in the ODB.

With the current calculation, I find it can 'miss' quite a lot of the time, resulting in total_size < batch_size, so fill_included_packs_batch() returns true and no repack happens. This leads to the number of packs growing (as the prefetch step runs) until total_size gets over batch_size, at which point all the small packs are repacked into one.
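
For reference, the loop in question looks roughly like this (a simplified paraphrase of fill_included_packs_batch() in git's midx.c at the time, with struct fields flattened into plain arrays):

```c
/* Simplified paraphrase of fill_included_packs_batch() (midx.c);
 * the packs have already been sorted by mtime via QSORT. */
size_t total_size = 0;
for (uint32_t i = 0; total_size < batch_size && i < num_packs; i++) {
	/* Scale each pack's on-disk size by the fraction of its objects
	 * that the multi-pack-index still references. Objects duplicated
	 * in a newer pack are referenced there instead, so older packs
	 * contribute far less than their full size. */
	size_t expected_size = pack_size[i] * referenced_objects[i]
			       / num_objects[i];

	if (expected_size >= batch_size)
		continue;	/* too large for this batch */

	total_size += expected_size;
	include_pack[i] = 1;
}

if (total_size < batch_size)
	return 1;	/* the "miss": no repack happens */
```

With batch_size = total_pack_size - max_pack_size, these under-counted expected_size values often cannot add up to batch_size, which is exactly the no-op described above.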

I don't like that, mainly because it's a bit hard to predict when the repack is going to happen, and because of the occasional spike in the storage size of the ODB (although it's not very significant).

Instead, if you choose batch_size to be second_biggest_pack_size + 1, a new second-biggest packfile will always be created. This does not repack all the small packfiles like the original approach does, but only the second-biggest packfile and the latest pack (note how QSORT over mtime is used in fill_included_packs_batch() [3]).

All in all, the second-biggest-pack-size approach gives me a more consistent result, while the tradeoff is that fewer packfiles are repacked each run. A way to correct this is to ramp up the run interval of the repack step so that the job runs more often.

I prefer this approach because the CPU / RAM / disk usage of the repack step is a lot more consistent: instead of having a chance of all small packs being repacked at once, you consistently spend roughly the same CPU and disk I/O every X minutes to repack 2-3 packfiles.

I don't know how to explain it better. I didn't have much time to dig into the midx format and why expected_size is calculated the way it is in git/git@ce1e4a1, but in retrospect, perhaps it could be tweaked somehow to make batch_size less of a prediction 🤔

@derrickstolee (Contributor, Author):

Thanks for the context, @sluongng!

To summarize: the "estimated output size" that we use for predicting the batch size will under-count pack sizes when there are duplicates. This means that the current model will actually no-op most of the time! By picking the second-largest file size, the repack command will have a high likelihood of doing something, and that size will grow geometrically as we go.
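
For example, a hypothetical trace (pack sizes in MB, with a new prefetch pack arriving between maintenance runs):

```
run 1: packs {1000, 30, 20}  -> batch_size = 31  -> repack 30 + 20  -> {1000, 50}
run 2: packs {1000, 50, 25}  -> batch_size = 51  -> repack 50 + 25  -> {1000, 75}
run 3: packs {1000, 75, 30}  -> batch_size = 76  -> repack 75 + 30  -> {1000, 105}
```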

And to reiterate on the "estimated output size": this is an important tool for allowing repacks in the GVFS protocol case with the 2gb size limit. When a user does not prefetch commits and trees for over 30 days, they will get a second "huge" pack of every commit and tree that was reachable 30 days ago. That will contain most of the objects from the older prefetch packs, but due to force-pushes some of those objects may still be interesting. Without this estimated output size, the oldest prefetch pack would never be repacked.

Does that help, @dscho?

@dscho (Member) commented Jun 1, 2020

Yes, thank you!

When the total size of all pack-files is smaller than 2 gigabytes, we
currently use a strange mechanism for trying to repack everything
else into a single pack. It was recommended that we should use the
second-largest pack-file size instead.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee (Contributor, Author):

I modified the PR description to include the reasoning we worked out in the comments. Thanks!

@dscho (Member) left a comment


Thanks!

@derrickstolee merged commit e56f42e into microsoft:master on Jun 2, 2020