
Pack file verification #4374

Merged · 15 commits · Aug 26, 2018

Conversation

@pks-t (Member) commented Oct 13, 2017

Warning: this is ugly and, at least to me, it feels in many places like I'm stacking hacks on hacks to work around our object system. The main problem is that we currently cannot parse objects which are not owned by the ODB system, because our objects point into the reference-counted memory of the ODB. So when I try to parse an object which is not owned by the ODB, it will crash later on after `git_odb_object_deref`, as it tries to free objects which it does not own or which are not even part of the heap. So yeah, this doesn't look as nice as it could in an ideal world.

On the other hand, this seems to work just fine right now. After nearly a whole day of cursing and debugging, I think I've finally got it right. So what does it do? This is mostly an implementation of `git index-pack --strict`, doing two things:

  1. check whether all objects of a pack can be parsed correctly
  2. check all references to other objects for commits, tags and trees

What this gets us is the ability to verify that a pack is complete (that we have all objects needed to resolve the complete graph) at the time we receive the pack.

I've discussed the design a bit in Slack with @carlosmn. My first thought was to just perform an object walk after fetching the complete pack file. But seeing that in most cases we're limited by network bandwidth while fetching the pack file, we agreed that it would be much nicer to just do as much of the heavy lifting as possible during retrieval of the pack. So the algorithm works as follows:

  1. for every non-deltified object passed to the indexer:
    • strike its object ID from expected_oids, which keeps track of OIDs we still expect to find as part of the pack file
    • parse the object
    • add object IDs referenced by the object to expected_oids
  2. when resolving delta objects, do the same with regard to the expected_oids map
  3. afterwards, check that the expected_oids map no longer has any entries, as all object references should now be resolved correctly

I'm putting this up now to get early feedback from CI and reviewers. I'm not happy with some of the things I had to do here, even though they work.

@pks-t (Member, Author) commented Oct 13, 2017

Oh, and to remind myself: the indexer currently leaks memory when `expected_oids` still has entries at the time the indexer is freed.

@pks-t force-pushed the pks/pack-file-verify branch 3 times, most recently from 7c6b204 to 2851dfe on November 12, 2017
@pks-t pks-t mentioned this pull request Nov 30, 2017
@pks-t (Member, Author) commented Nov 30, 2017

Yay, this is green now. Last week I fixed a bug where I accessed the object's type through the indexer instead of through the raw object. This led to a subtle bug with invalid memory accesses that seemingly occurred only in very specific situations.

@pks-t (Member, Author) commented Apr 6, 2018

Rebased on latest master

pks-t added 13 commits on June 22, 2018:
The `git_walk_objects` structure is currently only being used inside of
the pack-objects.c file, but being declared in its header. This has
actually been the case since its inception in 04a36fe (pack-objects:
fill a packbuilder from a walk, 2014-10-11) and has never really
changed.

Move the struct declaration into pack-objects.c to improve code
encapsulation.
Going forward, we will have to change how blob sizes are calculated based on whether the blob is a cached object that is part of the ODB or not. In order to not have to distinguish between those two object types repeatedly when accessing the blob's data or size, encapsulate all existing direct uses of those fields by instead using `git_blob_rawcontent` and `git_blob_rawsize`.
Currently, parsing objects is strictly tied to having an ODB object
available. This makes it hard to parse an object when all that is
available is its raw object and size. Furthermore, hacking around that
limitation by directly creating an ODB structure either on stack or on
heap does not really work that well due to ODB objects being reference
counted and then automatically free'd when reaching a reference count of
zero.

On some occasions, parsing raw objects without touching the ODB is actually required, though. One use case is, for example, object verification, where we want to assure that an object is valid before inserting it into the ODB or writing it into the git repository.

As a first step towards that, introduce a distinction between raw and ODB objects for blobs. Creation of ODB objects stays the same by simply using `git_blob__parse`, but a new function `git_blob__parse_raw` has been added that creates a blob from a pair of data and size. By setting a new flag inside of the blob, we can now distinguish whether it is a raw or an ODB object and treat it accordingly in several places.

Note that the blob data passed in is not being copied. Because of that,
callers need to make sure to keep it alive during the blob's life time.
This is being used to avoid unnecessarily increasing the memory
footprint when parsing largish blobs.
Currently, parsing objects is strictly tied to having an ODB object
available. This makes it hard to parse an object when all that is
available is its raw object and size. Furthermore, hacking around that
limitation by directly creating an ODB structure either on stack or on
heap does not really work that well due to ODB objects being reference
counted and then automatically free'd when reaching a reference count of
zero.

Implement a function `git_commit__parse_raw` to parse a commit object
from a pair of `data` and `size`.
Currently, parsing objects is strictly tied to having an ODB object
available. This makes it hard to parse an object when all that is
available is its raw object and size. Furthermore, hacking around that
limitation by directly creating an ODB structure either on stack or on
heap does not really work that well due to ODB objects being reference
counted and then automatically free'd when reaching a reference count of
zero.

Implement a function `git_tag__parse_raw` to parse a tag object from a
pair of `data` and `size`.
Currently, parsing objects is strictly tied to having an ODB object
available. This makes it hard to parse an object when all that is
available is its raw object and size. Furthermore, hacking around that
limitation by directly creating an ODB structure either on stack or on
heap does not really work that well due to ODB objects being reference
counted and then automatically free'd when reaching a reference count of
zero.

Implement a function `git_tree__parse_raw` to parse a tree object from a
pair of `data` and `size`.
Now that we have implemented functions to parse all git objects from raw data, we can implement a generic function `git_object__from_raw` to create a structure of type `git_object`. This allows us to parse and interpret objects from raw data without having to touch the ODB at all, which is especially useful for object verification prior to accepting objects into the repository.
The `processed` variable local to `git_indexer_append` counts how many objects have already been processed. But whenever it gets assigned to, we also assign the same value to the `stats->indexed_objects` struct member. It is thus effectively useless, as it always has the same value as the `indexed_objects` member, and it only makes the code a bit harder to understand. Just remove the variable.
The loop inside of `git_indexer_append` iterates over every object that is to be stored as part of the index. While the logic to retrieve each object from the packfile stream is rather involved, it is currently just part of the loop, making it unnecessarily hard to follow.

Move the logic into its own function `read_stream_object`, which unpacks a single object from the stream. Note that there is some subtlety here involving the special error `GIT_EBUFS`, which indicates to the indexer that no more data is currently available. Instead of returning an error and aborting the whole loop in that case, we have to catch that value and return successfully, waiting for more data to be read.
When passing `--strict` to `git-unpack-objects`, core git will verify
the pack file that is currently being read. In addition to the typical
checksum verification, this will especially cause it to verify object
connectivity of the received pack file. So it checks, for every received
object, if all the objects it references are either part of the local
object database or part of the pack file. In libgit2, we currently have
no such mechanism, which leaves us unable to verify received pack files
prior to writing them into our local object database.

This commit introduces the concept of `expected_oids` to the indexer. When pack file verification is turned on by a new flag, the indexer will try to parse each received object first. If the object has any links to other objects, it will check whether those links are already satisfied by known objects, either ones that are part of the object database or ones it has already seen as part of that pack file. If not, it will add them to the list of `expected_oids`. Furthermore, the indexer will remove the current object from `expected_oids` if it is currently being expected.

Like this, we are able to verify whether all object links are being
satisfied. As soon as we hit the end of the object stream and have
resolved all objects as well as deltified objects, we assert that
`expected_oids` is in fact empty. This should always be the case for a
valid pack file with full connectivity.
We strive to pass an options structure to many functions so we are able to extend options in the future without breaking the API. `git_indexer_new` doesn't have one right now, but we want to be able to add an option for enabling strict packfile verification.

Add a new `git_indexer_options` structure and adjust callers to use
that.
Right now, we simply turn on connectivity checks in the indexer as soon as we have access to an object database. But seeing that the connectivity checks may incur additional overhead, we want users to decide for themselves whether they want to allow those checks. Furthermore, it might also be desirable to check connectivity in cases where no object database is given at all, e.g. when a fully connected pack file is expected.

Add a flag `verify` to `git_indexer_options` to enable additional verification checks. Also avoid querying the ODB in case none is given, allowing users to enable the checks even when they do not have an ODB.
The new connectivity checks are not currently being verified at all due to being turned off by default. Create two test cases: one for a pack file which fails our checks and one for a pack file which succeeds.
@pks-t (Member, Author) commented Jun 22, 2018

Rebased to fix conflicts

@ethomson (Member) commented Jul 3, 2018

I've read through this a few times now and I think it's 👍 . There's a lot going on but it all looks correct to me and it's definitely valuable.

@carlosmn you're more familiar with this original code than I am. Any thoughts before we :shipit: ?

@ethomson (Member) commented
Okay, this has been open for a while with no additional comments. I'm going to :shipit:

@ethomson ethomson merged commit 50186ce into libgit2:master Aug 26, 2018
@pks-t (Member, Author) commented Aug 27, 2018 via email
