
Return unique hash for input #137

Closed
DGollings opened this issue Dec 21, 2020 · 8 comments · Fixed by #138
DGollings commented Dec 21, 2020

So I'm busy ingesting shipments; they arrive as either CSV, JSON, XML or EDI.

The interface I'm working on should take an array of shipments, divide it into individual shipments, hash those, and store the original input for success/audit/retry/failure tracking. This would make it easier to ingest 99/100 shipments and retry (after localizing and fixing the issue) the one shipment that's invalid for whatever reason.

In order to decide whether something has been ingested correctly, I thought a solution could be hashing each 'unit' of input and storing the original input somewhere as well.

Quite easy for CSV.

Weird python-and-bash-esque pseudocode:

for line in csv:
  process(line) && hash(line) && gzip(line) -> store result, hash, line in db
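A minimal Go version of that pseudocode might look like this; storeRecord and the file name are hypothetical stand-ins, and error handling is abbreviated:

    package main

    import (
        "bufio"
        "bytes"
        "compress/gzip"
        "crypto/sha256"
        "encoding/hex"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open("shipments.csv") // hypothetical input file
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            line := scanner.Bytes()
            // process(line) would go here.
            sum := sha256.Sum256(line) // hash(line)

            var gz bytes.Buffer // gzip(line)
            zw := gzip.NewWriter(&gz)
            zw.Write(line)
            zw.Close()

            // store result, hash, line in db
            storeRecord(hex.EncodeToString(sum[:]), gz.Bytes())
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }

    // storeRecord is hypothetical; replace with your actual DB write.
    func storeRecord(hash string, gzippedLine []byte) {}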

It becomes less easy for JSON and XML; even a marshal/unmarshal round trip is not 100% identical to the input.

Even worse is EDI

So, even though I liked the idea of storing the original, it quickly becomes cumbersome. A decent alternative is hashing and storing the output of transform.Read().

But that comes with several issues

  • I can change the output, and thus the hash, via the schema (not really an issue)
  • it's not the original (but it is more consistent: all JSON), so kind of a bug/feature
  • I only see what I've told omniparser to see, so new fields that might have been added go unnoticed

None of these is a major issue, but they're all consequences of hashing a new representation of the input, not the input itself.

I was wondering: how hard would it be to hash the input of whatever generates the output?
So:
hash, data, err := transform.Read()

Is your internal data stable enough that you could, say, stream the IDR input through a SHA-256 hasher (it supports streaming) and return a stable/unchanging hash?

As in, in theory ["a", "b", "c"] and ["b", "c", "a"] should return the same hash, regardless of element ordering.
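For illustration, one way to get that order-insensitive behavior is to sort a copy of the elements before hashing. A sketch (the element encoding here is naive):

    import (
        "crypto/sha256"
        "encoding/hex"
        "sort"
        "strings"
    )

    // orderInsensitiveHash returns the same hash for ["a","b","c"] and
    // ["b","c","a"] by sorting a copy of the elements before hashing.
    func orderInsensitiveHash(elems []string) string {
        sorted := append([]string(nil), elems...)
        sort.Strings(sorted)
        sum := sha256.Sum256([]byte(strings.Join(sorted, "\x00")))
        return hex.EncodeToString(sum[:])
    }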

Also, I imagine being able to verify whether a file has been fully processed is interesting for more than one use case.


jf-tech commented Dec 21, 2020

First, what do you mean by "stable"? The example you give is a bit interesting, to say the least: "stable" in the Golang world usually means stability for maps, not for slices/arrays, because in my opinion slice/array element order matters; thus ["a", "b", "c"] shouldn't be considered the same as ["b", "c", "a"].

That said, if you have a special scenario here that warrants a different meaning of stability, you can always do it in code via a new custom_func, or use javascript if it's powerful enough:

	// Create a schema with the standard omni.v21 handler plus our own
	// custom_func ("my_json_stabler") merged in.
	schema, err := omniparser.NewSchema(
		schemaFileBaseName,
		schemaFileReader,
		omniparser.Extension{
			CreateSchemaHandler: omniv21.CreateSchemaHandler,
			CustomFuncs: customfuncs.Merge(
				customfuncs.CommonCustomFuncs,
				v21.OmniV21CustomFuncs,
				customfuncs.CustomFuncs{
					"my_json_stabler": myJSONStabler,
				}),
		})

func myJSONStabler(_ *transformctx.Ctx, node *idr.Node) (interface{}, error) {
    // Do whatever stability sort you need on the node, then marshal the
    // result into a string and return it.
}
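For example, one possible body, assuming a hypothetical nodeToValue helper that converts the *idr.Node into a plain Go value, and that encoding/json is imported (json.Marshal emits map keys in sorted order, which is what gives the stable output):

    func myJSONStabler(_ *transformctx.Ctx, node *idr.Node) (interface{}, error) {
        v := nodeToValue(node) // hypothetical *idr.Node -> map/slice/string conversion
        b, err := json.Marshal(v) // map keys come out sorted, hence "stable"
        if err != nil {
            return nil, err
        }
        return string(b), nil
    }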

Then in your schema you can have:

    "transform_declarations": {
        "FINAL_OUTPUT": { "xpath": "....xpath to streaming node....", "object": {
            "hash": { "custom_func": {
                "name": "uuidv3",
                "args": [
                    { "xpath": ".", "custom_func": { "name": "my_json_stabler" }},
                ],
            }},
            "rest of your fields": { ... },
            "rest of your fields": { ... },
            ...
        }}
    }

Basically, the my_json_stabler custom_func takes the "." streaming node in, stabilizes it, and marshals it into a string; then uuidv3 converts that string into a hash. Your code calling transform.Read() can then extract that hash field and do whatever you want with it.
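On the calling side, extracting that hash field might look roughly like this. A sketch only: the NewTransform call follows the shapes used elsewhere in this thread, it assumes encoding/json, io, and log are imported, and it assumes Read() returns one transformed record as JSON bytes and io.EOF at end of input:

    transform, err := schema.NewTransform(inputName, inputReader, &transformctx.Ctx{})
    if err != nil {
        log.Fatal(err)
    }
    for {
        b, err := transform.Read() // one transformed record as JSON bytes
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        var rec map[string]interface{}
        if err := json.Unmarshal(b, &rec); err != nil {
            log.Fatal(err)
        }
        hash, _ := rec["hash"].(string)
        _ = hash // dedupe/store/audit using hash
    }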

If I'm mistaken about what you want to achieve, please attach a sample input and your desired output so I can take a deeper look.


jf-tech commented Dec 21, 2020

BTW, I'm on vacation right now; I'll try to reply daily, but some days I might have spotty connectivity, so bear that in mind.

DGollings (Author) commented:

Sorry, by "stable" I mean your in-memory representation of the source input. Maybe "consistent" would be a better word for it.

So i"m easily able to run omniparser over and over and get the exact same sha256sum output per line of input. That part works, until I change the schema (of course)

Which is why I was wondering whether it is possible to get a sha256sum of the input before transforming it, using the actual input that produced an output: not the entire input file, and not an assumption of what the input probably is.

In case of input

[
    {
        "foo": "a"
    },
    (... other stuff ---)
    {
        "bar": "b"
    }
]

schema: object: { test: { foo: { xpath: foo }, bar: { xpath: bar } } }

I can either hash the output of:

{
    "test": {
        "foo": "a",
        "bar": "b"
    }
}

Or write some kind of custom code to somehow get the values of foo and bar and hash those before they are turned into the new object.

Or, thinking in ETL terms: have omniparser hash the output of the extract step, but before the transform, the idea being a more 'stable' (consistent) hash, regardless of schema.

Basically the question I'm trying to answer is: has this file and its contents been processed properly and in its entirety?

But then the hash will still change if I do anything other than a trivial schema change (e.g. a schema with bar, foo reordered).

I don't think what I thought would be nice to have (a hash of the input) is possible for anything other than CSV, which has well-defined rows; it becomes very difficult or impossible with more freeform transforms.
I'll close this issue. Thanks either way, and enjoy your holiday :).


jf-tech commented Jan 1, 2021

@DGollings

Turns out you don't even need to write custom code for this:

            "hash_json_input": { "custom_func": {
                "name": "javascript_with_context",
                "args": [ { "const": "_node" } ]
            }},
            "hash": { "custom_func": {
                "name": "uuidv3",
                "args": [ { "custom_func": { "name": "javascript_with_context", "args": [ { "const": "_node" } ] } } ]
            }},

You just need "hash" ("hash_json_input" is there just to show you what will be hashed).
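For the sample input earlier in the thread, the first record's output would then look roughly like this (the "hash" value is illustrative; uuidv3 output is deterministic for a given input):

    {
        "hash_json_input": "{\"foo\":\"a\"}",
        "hash": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    }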

DGollings (Author) commented:

Just now had a chance to try this out and it works, cool :)

However, from a purist's perspective it would be neater not to have to embed it into the template, as it in itself changes the output.

Given input {foo: bar}, the output is changed by adding a hash key/field:

{foo: bar, hash: value} != {foo: bar}

Neater would be something like transform := NewTransform(), adding Hash() (and maybe Input()?) to the Transform struct.

So:

    for {
        hash := transform.Hash() // no advance
        data := transform.Read() // reads and advances a 'line'
    }

or

    for transform.Next() {
        hash := transform.Hash() // no advance
        data := transform.Read() // no advance; Next() does that
    }

Seeing as uuidv3 is an existing function, it wouldn't be too big an interface/API change, right?
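As a hypothetical sketch (names taken from the suggestion above, not from the actual omniparser API), the second shape would be something like:

    // Hypothetical extension of the Transform interface.
    type Transform interface {
        Next() bool            // advances to the next 'line'/record
        Hash() (string, error) // hash of the current record's raw input; no advance
        Read() ([]byte, error) // transformed output of the current record; no advance
    }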


jf-tech commented Jan 8, 2021

@DGollings let me think about it; I need to be careful about changing the interface as it might break existing code/schemas. But let me reopen the issue so I don't forget.

jf-tech reopened this Jan 8, 2021

jf-tech commented Jan 8, 2021

The main issue with returning a hash from transform.Read is the lack of a definitive meaning for "hash": omniparser uses an extension and schema handler model, and it's completely up to the schema handler (the one included in this repo is the omni.2.1 handler) to define what a record is, what a transform is, etc. It's hard to explain what a transform.Hash() return value means, and it's not flexible in terms of API either: what if people don't want uuidv3 (an MD5-based hash), then what?

A couple of ideas I'm brewing right now (rough shapes sketched after this list):

  1. Add func transform.CurRawRecord() (interface{}, error), so each schema handler can define what its raw record is and return it as an interface{}. This API addition will break current schema handler implementations (I hope no one will actually be impacted, given omniparser was released only a month or so ago), but it will not break any existing usage of omniparser, since calling transform.CurRawRecord is optional.

     Con: still not the best in terms of API design.

  2. An observer/callback model: during the schema.NewTransform call, the caller can supply an event callback and specify which events they're interested in; then during omniparser/handler ingestion/transformation, the callback is invoked with the data. Not yet clear how to do this cleanly.
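Very roughly, as hypothetical Go shapes (all names illustrative; none of this is the actual API):

    // Idea 1: expose the handler-defined raw record for the current position.
    type Transform interface {
        Read() ([]byte, error)
        CurRawRecord() (interface{}, error) // each schema handler defines the concrete type
    }

    // Idea 2: an observer/callback supplied at NewTransform time.
    type Event int

    const (
        EventRawRecordIngested Event = iota // a raw record was read from input
        EventRecordTransformed              // a record finished transformation
    )

    type EventCallback func(event Event, data interface{})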

What's your preference, @DGollings ?


DGollings commented Jan 9, 2021

is lack of definitive meaning of hash

Agree; I call it "checksum" in my code. But in this case you do have a clear meaning: it's the md5sum of whatever was gathered as input. And you can always add the disclaimer "not guaranteed to be consistent across versions of omniparser".

it's completely up to the schema handler (...) to define what's a record, what's a transform, etc.

Well, that's not really an issue, as long as it's consistent with itself over time; how to enforce/guarantee that is another matter, though. The question I'm trying to answer is: "have I processed this record already?" Not necessarily
"is this record unique and cryptographically verifiable across all of time, space, and versions and/or implementations of omniparser" :)

In that case you're better off saving a sha512 of the entire file.
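(For reference, a minimal sketch of that, with error handling abbreviated and assuming crypto/sha512, encoding/hex, fmt, io, and os are imported; sha512 hashes in a streaming fashion, so the whole file never sits in memory:)

    f, _ := os.Open("shipments.edi") // hypothetical file name
    defer f.Close()
    h := sha512.New()
    io.Copy(h, f)
    fmt.Println(hex.EncodeToString(h.Sum(nil)))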

I'd even go so far as to say I will quite likely never check a checksum again after, say, one week, once a file has been archived and deemed successfully processed.

A couple ideas I'm brewing right now:

Option 1 is easier for the user; option 2 is easier for the maintainer(s) (ignoring that it probably needs quite a lot of documentation).
When in doubt, I'd pick the option that's easier for users :)

jf-tech added a commit that referenced this issue Jan 9, 2021
…iparser to access the raw ingested record.

See details in #137.
jf-tech added a commit that referenced this issue Jan 9, 2021
…iparser to access the raw ingested record. (#138)

See details in #137.