Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hashing function #225

Closed
mcgodes opened this issue Jul 27, 2016 · 11 comments
Closed

Add hashing function #225

mcgodes opened this issue Jul 27, 2016 · 11 comments

Comments

@mcgodes
Copy link

mcgodes commented Jul 27, 2016

Would love to have a hash function available to me through the std library. Something like a UUID, md5, sha1, etc...

Something like:

local myObject = {
  key: "val",
  hello: "world",
};

local myObjectHash = std.md5(std.toString(myObject))

Don't think I want to write this as a function in jsonnet, as its been done elsewhere a number of times and I'm concerned about performance issues

@sparkprime
Copy link
Member

Are you using Jsonnet via the API or commandline? If the former then in the latest release you can register callbacks from the host language (C, Python, etc) and these are available in Jsonnet functions. Such functions must be pure -- always return the same output for the same input. Wrapping md5 or whatever would satisfy that criteria.

@mcgodes
Copy link
Author

mcgodes commented Jul 29, 2016

Alright, sweet! Thank you, that's very helpful.

@mcgodes mcgodes closed this as completed Jul 29, 2016
@chris-codaio
Copy link

I'm using jsonnet from the command line and would like to see common hash functions available in the std lib.

@sparkprime
Copy link
Member

Other than md5, what would you use?

@sparkprime
Copy link
Member

SHA1 doesn't seem to be that useful in a config language for example. However in principle anything can be added as a native function, if its absence is blocking adoption in a given domain.

@chris-codaio
Copy link

MD5 would be enough for my purposes.

My immediate use case is hashing the content of a ConfigMap Kubernetes resource and using it as a metadata label on the pod that consumes it. This would auto-redeploy the pod when the configmap changes - something kubernetes doesn't do on its own today.

@sparkprime
Copy link
Member

One can imagine use cases for generating unique names for things based on their content. MD5 would also suffice for that.

Here's a question though: MD5 on sequences of bytes is straightforward, but on strings there is a question of representation. Should it be MD5 on the unicode codepoints (32 bits each) or UTF8? You get a different hash code for the two cases.

For example, Python rejects unicode strings that don't fit into ASCII:

>>> len('å')
2
>>> len(u'å')
1
>>> md5.md5('å')
<md5 HASH object @ 0x7f508f26ce90>
>>> md5.md5(u'å')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)

@sparkprime
Copy link
Member

sparkprime commented Aug 27, 2016

Here's an implementation of md5 in Jsonnet that only works for ASCII strings.

Two things were annoying while writing this: 1) No hex literals and 2) No unsigned 32 bit ints operators. The first could be easily fixed but the second one wouldn't be sensible to add. I had to work around it by explicitly overflowing them at 2^32 for example.

It is quite slow, it took 1m45s to calculate the md5 of all 10 meg of /usr/share/dict/words (after replacing non-unicode codepoints with 0). However, it does copy the whole thing several times in the pre-processing stage. A native implementation is pretty much instant. So there we go.

{
    md5(str)::
        local s = [
            7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22,
            5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20,
            4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23,
            6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21,
        ];

        local K = [
            3614090360, 3905402710, 606105819, 3250441966, 4118548399, 1200080426,
            2821735955, 4249261313, 1770035416, 2336552879, 4294925233, 2304563134,
            1804603682, 4254626195, 2792965006, 1236535329, 4129170786, 3225465664,
            643717713, 3921069994, 3593408605, 38016083, 3634488961, 3889429448, 568446438,
            3275163606, 4107603335, 1163531501, 2850285829, 4243563512, 1735328473,
            2368359562, 4294588738, 2272392833, 1839030562, 4259657740, 2763975236,
            1272893353, 4139469664, 3200236656, 681279174, 3936430074, 3572445317, 76029189,
            3654602809, 3873151461, 530742520, 3299628645, 4096336452, 1126891415,
            2878612391, 4237533241, 1700485571, 2399980690, 4293915773, 2240044497,
            1873313359, 4264355552, 2734768916, 1309151649, 4149444226, 3174756917,
            718787259, 3951481745
        ];


        // Convert to array of bytes.
        local ascii = std.makeArray(
            std.length(str),
            function (i)
                local cp = std.codepoint(str[i]);
                assert cp < 128 : "String must be ASCII";
                cp);
        // Find smallest padding_bytes such that (std.length(str) + 1 + padding_bytes) % 64 == 56.
        local padding_bytes = (64 - (std.length(str) + 1 + 8) % 64) % 64;
        // Big-endian int64 in bytes.
        local bitlength = std.length(str) * 8;
        local M_bytes = ascii + [128] + std.makeArray(padding_bytes, function(i) 0) + [
            (bitlength >> ((i) * 8)) & 255,
            for i in std.range(0, 7)
        ];
        assert std.length(M_bytes) % 64 == 0;

        // Jsonnet bitwise operations are on signed long, so derived new ones
        // for the cases that differ:
        local UINT_MAX = (1 << 32) - 1;
        local cast_32_bit(x) =
            x & UINT_MAX;
        local unsigned_32bit_invert(x) =
            (x ^ UINT_MAX);

        // Jsonnet has no bitwise rotate operator. 
        local left_rotate (x, c) =
            cast_32_bit(x << c) | (x >> (32-c));

        local get_block(offset) = std.makeArray(16, function(j)
            M_bytes[offset + j * 4 + 0] << 0
            | M_bytes[offset + j * 4 + 1] << 8
            | M_bytes[offset + j * 4 + 2] << 16
            | M_bytes[offset + j * 4 + 3] << 24);

        local
            // Combined inner and outer loop.
            loop(chunk, M, i, A0, B0, C0, D0, A=A0, B=B0, C=C0, D=D0) =
                if i == 64 then
                    local new_A = cast_32_bit(A0 + A);
                    local new_B = cast_32_bit(B0 + B);
                    local new_C = cast_32_bit(C0 + C);
                    local new_D = cast_32_bit(D0 + D);
                    local new_chunk = chunk + 64;
                    if new_chunk >= std.length(M_bytes) then
                        [new_A, new_B, new_C, new_D]
                    else
                        loop(new_chunk, get_block(new_chunk), 0,
                             new_A, new_B, new_C, new_D) tailstrict
                else
                    local F =
                        if i < 16 then
                            (B & C) | (unsigned_32bit_invert(B) & D)
                        else if i < 32 then
                            (D & B) | (unsigned_32bit_invert(D) & C)
                        else if i < 48 then
                            B ^ C ^ D
                        else
                            C ^ (B | unsigned_32bit_invert(D));
                    local g =
                        if i < 16 then
                            i
                        else if i < 32 then
                            (5 * i + 1) % 16
                        else if i < 48 then
                            (3 * i + 5) % 16
                        else
                            (7 * i) % 16;
                    loop(chunk, M,
                              i + 1,
                              A0, B0, C0, D0,
                              D,
                              cast_32_bit(B + left_rotate(cast_32_bit(A + F + K[i] + M[g]), s[i])),
                              B,
                              C) tailstrict;

        local R = loop(0, get_block(0), 0, 1732584193, 4023233417, 2562383102, 271733878);
        local R2 = std.makeArray(16, function(i) (R[i / 4] >> ((i % 4) * 8)) & 255);

        '%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x' % R2,
}

@chris-codaio
Copy link

UTF-8 is fine for me, but worth seeing if anyone else has an opinion.

@ahakanbaba
Copy link
Contributor

Hi @sparkprime
Do you know that this md5 implementation actually works ? Do you have some tests ?

A hash function looks very useful actually. Do you see value in adding this to the stdlib ?

@sparkprime
Copy link
Member

Anyone still following this thread, note that a native std.md5("foo") now exists and is much faster than this version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants