Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple hash algorithms #4438

Merged
merged 6 commits into from
Jun 27, 2019
Merged

Multiple hash algorithms #4438

merged 6 commits into from
Jun 27, 2019

Conversation

pks-t
Copy link
Member

@pks-t pks-t commented Dec 8, 2017

A first step towards an abstraction for the use of multiple of hash algorithms. It lays the groundwork for having git_hash_ctx dispatch to different hash algorithms based on which one is being requested. Next steps:

  • extend git_repository by a git_hash_algorithm field
  • implement required new algorithms
  • extend git_hash_init and git_hash_ctx_init to accept an algorithm which is then used for hashing by that particular context only
  • extend OIDs to be a union of all possible hashes
  • scavenge through the tree and adjust all callers. In most cases we'd want to retain SHA1 (e.g. pack trailers, patch IDs, filebuffers), for others we'd simply use whatever the repository format states

@pks-t
Copy link
Member Author

pks-t commented Jun 22, 2018

Rebased to fix a whole lot of conflicts

@pks-t pks-t force-pushed the pks/hash-algorithm branch 3 times, most recently from d590a31 to 62690ae Compare June 22, 2018 12:10
@pks-t
Copy link
Member Author

pks-t commented Jun 22, 2018

Todo: remove this weird separation between git_hash_init and git_hash_ctx_init. It is only used by Win32 to initialize crypto providers, but this API simply sucks. We should just make sure to always call git_hash_global_init and git_hash_sha1_global_init and initialize the crypto provider there.

@pks-t pks-t force-pushed the pks/hash-algorithm branch 2 times, most recently from 7eafcf9 to 3b4409e Compare June 22, 2018 13:18
src/hash.h Outdated
union {
git_hash_sha1_ctx sha1;
} ctx;
git_hash_algorithm alg;
Copy link
Contributor

@tiennou tiennou Jun 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put the algo first ? I'm thinking we might want a GIT_HASH_ALGORITHM_CUSTOM, and having that appear first would allow a user-provided struct to be overlayed transparently ?

@ethomson
Copy link
Member

Todo: remove this weird separation between git_hash_init and git_hash_ctx_init.

It may be weird, but the goal is to avoid allocations in git_hash_init. Building a git_hash_ctx may perform an allocation of an object that contains the hash state / accumulator registers / etc. So this cannot be shared across multiple hashes at once (ie, across threads). On Windows, the BCryptCreateHash can't be moved into the global initialization. (The Windows hashing stuff does load libraries and such in the global startup.)

The current implementation was chosen to be as similar to the existing init / update / finish pattern as possible, just adding a reusable object on top that you need to allocate and free. (I called that object a context, because our constructors are usually called "init" and needed to disambiguate between hash initialization.)

Indeed we could make git_hash_init do any necessary allocation work, and introduce a git_reset to clean it (ie, call the underlying hash't init method) and introduce a git_hash_dispose. (Now that we've settle on that name, we should fix the cleanup name that we're using.)

Certainly moving to having a "hash context" that uses the init / update / finish paradigm was less code churn and the changes were more obvious when we introduced them. I don't love the current API, but I wanted to at least give you some background on why it is the way it is.

@pks-t
Copy link
Member Author

pks-t commented Jun 29, 2018

Thanks for the clarafication, @ethomson. I didn't realize that hash_ctx_cng_init actually allocates memory, so you've convinced me that the API makes sense. That being said, I think that the names suck and don't quite help to grasp the situation. My proposal would be to have:

  • git_hash_ctx_init to initialize the context and potentially allocate memory (same)
  • git_hash_ctx_dispose to free the context's memory (previously git_hash_ctx_free)
  • git_hash_update to add bytes to compute the hash
  • git_hash_final to compute the final hash
  • git_hash_clear to reset the hash to its initial state (previously git_hash_init)

I think that renaming git_hash_init to git_hash_clear would make it a lot more obvious what the distinction between git_hash_init and git_hash_ctx_init are.

@pks-t
Copy link
Member Author

pks-t commented Feb 22, 2019

Rebased to fix conflicts. I've tried to reduce the scope a bit by only doing the split between generic and SHA1-specific hashing layers. I didn't do any API changes at all

@pks-t
Copy link
Member Author

pks-t commented Feb 22, 2019

By the way: I noticed that our "Generic" SHA1 implementation doesn't compile right now. This is being fixed in here, too, but it shows that we're not as thorough with testing the different implementations as we maybe should be. Or, alternatively, that we have too many backends and that some aren't used at all.

@pks-t
Copy link
Member Author

pks-t commented Apr 5, 2019

Rebased to fix conflicts

@pks-t
Copy link
Member Author

pks-t commented Jun 14, 2019

Rebased to fix conflicts. @ethomson, @tiennou: could we maybe get this merged soonish? This doesn't introduce any behaviour changes yet, but it's required to get started with the actual transition to SHA256. The goal was to get going in small chunks to make later parts easier to review.

Copy link
Contributor

@tiennou tiennou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor, completely ignorable comments. It's pretty unintrusive as well (for now), so 👍.

src/hash.h Outdated
typedef struct {
void *data;
size_t len;
} git_buf_vec;

typedef enum {
GIT_HASH_ALGO_UNDEF = 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: UNKNOWN ? It's never really undefined, as it's either SHA1 or something we don't know. Semantics 😉.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

src/hash.h Outdated
typedef enum {
GIT_HASH_ALGO_UNDEF = 0,
GIT_HASH_ALGO_SHA1,
} git_hash_algo;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Remark: we've started adding _t suffixes to some of these, but I've yet to grasp the rules.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea. From a quick glance it looks like _t prefixes are more common, so I'll just add it

* a Linking Exception. For full terms see the included COPYING file.
*/

#ifndef INCLUDE_hash_sha1_collisiondetect_h__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance this could be subsumed as "private implementation detail" inside their respective C files (or sha1.h) ? I don't think we really care about accessing that hashing context, so apart from a struct-sizing/compilation POV, that would reduce the amount of "small files". Do ignore that if you have a compelling reason though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, no. We allocate these structs on the stack, so their size needs to be known at compile time :/ Would've loved that, too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about moving the required include + struct inside the #ifdef cascade in sha1.h ? At least the ones that are 10-20 lines (ie. keep win32.h separate) ? I think I prefer more "complete", straightforward #define blocks vs 3-4 smaller ones.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd definitely agree if all of those struct declarations were small. But as you correctly point out, the "win32.h" one is a rather complex beast. I do have the goal of refactoring it in the future to internalize more of its implementation, but I'd like to keep that separate from this PR.

So for now, I think we should keep those headers separate to not have a hybrid of everything in "sha1.h" but not "win32.h".

@@ -5,20 +5,15 @@
* a Linking Exception. For full terms see the included COPYING file.
*/

#ifndef INCLUDE_hash_mbedtld_h__
#define INCLUDE_hash_mbedtld_h__
#ifndef INCLUDE_hash_sha1_mbedtld_h__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mbedtld 🙄.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, in fact I can point at you here :P

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(the eyesballs were my own blame-acceptance 😉 )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot see any smileys, thus I didn't get that :P

@pks-t
Copy link
Member Author

pks-t commented Jun 14, 2019

Fixed comments by @tiennou. Thanks for your review!

@pks-t
Copy link
Member Author

pks-t commented Jun 14, 2019

Rebased to fix conflicts with #5055

The structure `git_hash_prov` is only ever used by the Win32 SHA1
backend. As such, it doesn't make much sense to expose it via the
generic "hash.h" header, as it is an implementation detail of the Win32
backend only. Move the typedef of `git_hash_prov` into
"hash/sha1/win32.h" to fix this.
The hash source files have circular include dependencies right
now, which shows by our broken generic hash implementation. The
"hash.h" header declares two functions and the `git_hash_ctx`
typedef before actually including the hash backend header and can
only declare the remaining hash functions after the include due
to possibly static function declarations inside of the
implementation includes.

Let's break this cycle and help maintainability by creating a
real implementation file for each of the hash implementations.
Instead of relying on the exact include order, we now especially
avoid the use of `GIT_INLINE` for function declarations.
@pks-t
Copy link
Member Author

pks-t commented Jun 24, 2019

Rebased to fix SHA1DC breakage

As we will include additional hash algorithms in the future due
to upstream git discussing a move away from SHA1, we should
accomodate for that and prepare for the move. As a first step,
move all SHA1 implementations into a common subdirectory.

Also, create a SHA1-specific header file that lives inside the
hash folder. This header will contain the SHA1-specific header
includes, function declarations and the SHA1 context structure.
As a preparatory step to allow multiple hashing APIs to exist at
the same time, split the hashing functions into one layer for generic
hashing and one layer for SHA1-specific hashing. Right now, this is
simply an additional indirection layer that doesn't yet serve any
purpose. In the future, the generic API will be extended to allow for
choosing which hash to use, though, by simply passing an enum to the
hash context initialization function. This is necessary as a first step
to be ready for Git's move to SHA256.
Create a separate `git_hash_sha1_ctx` structure that is specific
to the SHA1 implementation and move all SHA1 functions over to
use that one instead of the generic `git_hash_ctx`. The
`git_hash_ctx` for now simply has a union containing this single
SHA1 implementation, only, without any mechanism to distinguish
between different algortihms.
Create an enum that allows us to distinguish between different
hashing algorithms. This enum is embedded into each
`git_hash_ctx` and will instruct the code to which hashing
function the particular request shall be dispatched.

As we do not yet have multiple hashing algorithms, we simply
initialize the hash algorithm to always be SHA1. At a later
point, we will have to extend the `git_hash_init_ctx` function to
get as parameter which algorithm shall be used.
@pks-t
Copy link
Member Author

pks-t commented Jun 27, 2019

I'm just going ahead now and merge this. It's been open for a rather long time, and I finally want to continue working on this. There's still the one commen from @tiennou about inlining struct decls into "sha1.h", but we might as well tackle that in a future iteration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants