Return unique hash for input #137
First, what do you mean by "stable"? The example you give is a bit interesting, to say the least: "stable" in the Golang world usually refers to maps, not slices/arrays, because slice/array element order matters. That said, if you have a special scenario here that warrants a different meaning of stability, you can always do it in code via a new custom func.
Then in your schema you can have:
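The code and schema snippets from this comment were lost; a hedged sketch of what a custom func plus its schema usage could look like is below. The name hashFields and the field names foo/bar are illustrative assumptions, not the maintainer's actual snippet, and the extension registration wiring (which is version-dependent) is omitted:

```go
package myfuncs

import (
	"crypto/sha256"
	"encoding/hex"

	"github.com/jf-tech/omniparser/transformctx"
)

// hashFields is a hypothetical custom func: it hashes its string args and
// returns a sha256 hex digest. The (ctx, variadic strings) shape follows
// omniparser's custom func convention.
func hashFields(_ *transformctx.Ctx, fields ...string) (string, error) {
	h := sha256.New()
	for _, f := range fields {
		h.Write([]byte(f))
		h.Write([]byte{0}) // separator so ("ab") hashes differently from ("a","b")
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// The schema could then reference it like this (fragment is illustrative,
// shape follows omniparser's custom_func examples):
//
//   "hash": { "custom_func": {
//       "name": "hashFields",
//       "args": [ { "xpath": "foo" }, { "xpath": "bar" } ]
//   } }
```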
Basically, a custom func gives you full control over what gets hashed. If I'm mistaken about what you want to achieve, please attach a sample input and your desired output so I can take a deeper look.

BTW, I'm on vacation right now. I'll try to reply daily, but some days I might get spotty connectivity, so bear that in mind.
Sorry, by "stable" I mean your in-memory representation of the source input. Maybe "consistent" would be a better word for it. I'm easily able to run omniparser over and over and get the exact same sha256sum output per line of input. That part works, until I change the schema (of course). Which is why I was wondering whether it was possible to get a sha256sum of the input, before transforming it, but using the actual input that produced a given output record. Not the entire input file, or an assumption of what the input probably is.

Given an input and a schema, I can either hash the output of the transform, or write some kind of custom code to somehow get the values of foo and bar and hash those before they are turned into the new object. Or, assuming ETL: have omniparser hash the output of the extract, but before the transform, the idea being a more 'stable' (consistent) hash regardless of schema. Basically the question I'm trying to answer is: has this file and its contents been processed properly and in their entirety? But then the hash will still change if I do anything other than a trivial schema change (e.g. reordering the schema fields to bar, foo). I don't think what I thought would be nice to have (a hash of the input) is possible on anything other than a CSV, which has well-defined rows; it becomes very difficult or impossible with more freeform transforms.
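For reference, the "hash the transform output" option is straightforward with omniparser's public API (loop shape per the project README; the schema content and file name here are placeholders):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"

	"github.com/jf-tech/omniparser"
	"github.com/jf-tech/omniparser/transformctx"
)

func main() {
	// Schema content elided; any valid omniparser schema works here.
	schema, err := omniparser.NewSchema("my-schema", strings.NewReader(`{ ... }`))
	if err != nil {
		panic(err)
	}
	input, err := os.Open("shipments.csv") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer input.Close()
	transform, err := schema.NewTransform("my-input", input, &transformctx.Ctx{})
	if err != nil {
		panic(err)
	}
	for {
		record, err := transform.Read() // one transformed record as JSON bytes
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		sum := sha256.Sum256(record) // hash of the *output*, hence schema-dependent
		fmt.Println(hex.EncodeToString(sum[:]))
	}
}
```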
Turns out you don't even need to write custom code for this: you just need the existing built-in custom func (sketched below).
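The exact snippet referenced here was lost; given the mention of uuidv3 just below, presumably it is the built-in uuidv3 custom func, which deterministically hashes its string input, so no custom Go code is needed. A hypothetical schema fragment, embedded as a Go string for illustration (the args shape is an assumption and may differ across omniparser schema versions):

```go
// Hypothetical: add a hash field computed by the built-in uuidv3 custom func.
const hashFieldDecl = `
"hash": { "custom_func": {
    "name": "uuidv3",
    "args": [ { "xpath": "foo" }, { "xpath": "bar" } ]
} }`
```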
Just now had a chance to try this out and it works, cool :) However, from a purist's perspective it would be neater not to have to embed it in the template, as that in itself changes the output. Given input foo:bar, the output is changed by adding a hash key/field.

Neater would be something like transform := NewTransform() followed by one of the call shapes sketched below. Seeing as uuidv3 is an existing function, it wouldn't be too big an interface/API change, right?
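The inline snippets in this comment are gone; judging from the hash, data, err := transform.Read() line in the issue text further down, the two shapes being proposed were presumably along these lines (names are hypothetical, not actual omniparser API):

```go
// Option 1 (hypothetical): Read also returns a hash of the raw ingested record.
hash, data, err := transform.Read()

// Option 2 (hypothetical): a separate accessor, leaving Read() unchanged.
data, err := transform.Read()
hash := transform.InputHash() // method name invented for illustration
```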
@DGollings let me think about it - I need to be careful about changing the interface, as it might break existing code / schemas. But let me reopen the issue so I don't forget.
The main issue with returning a hash from Read() is that "hash" implies guarantees about what exactly is hashed and how stable it stays over time. A couple of ideas I'm brewing right now:

Con: still not the best in terms of API design.

What's your preference, @DGollings?
Agree; I call it a checksum in my code. But in this case you do have a clear meaning: it's the md5sum of whatever was gathered as input. And you can always add the disclaimer "not guaranteed to be consistent across versions of omniparser".

Well, that's not really an issue as long as it's consistent with itself over time; how to enforce/guarantee that is another matter though. The question, well, the question that I'm trying to answer, is: "have I processed this record already?" Not necessarily "have I processed this file already" - in that case you're better off saving a sha512 of the entire file. I'd even go so far as to say I will quite likely never check a checksum again after, say, one week, once a file has been archived and deemed successfully processed.
Option 1 is easier for the user, option 2 is easier for the maintainer(s) (ignoring that it probably needs quite a lot of documentation).
…iparser to access the raw ingested record. See details in #137.
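Going by that commit message, the eventual resolution gives callers access to the raw ingested record, which can then be hashed independently of the schema. A hedged usage sketch (the RawRecord accessor name and its shape are inferred from the commit message, not verified against the actual API):

```go
// Sketch: pair each transformed record with its raw ingested counterpart.
func process(transform omniparser.Transform) error {
	for {
		output, err := transform.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		// RawRecord() is inferred from the commit message above; treat the
		// name and return type as assumptions.
		raw, err := transform.RawRecord()
		if err != nil {
			return err
		}
		// Hash/persist raw for audit/retry tracking, independent of the schema;
		// output remains the schema-dependent transformed record.
		_, _ = raw, output
	}
	return nil
}
```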
So I'm busy ingesting shipments; they arrive as either CSV, JSON, XML or EDI.

The interface I'm working on should take an array of shipments, divide it into individual shipments, hash those, and store the original input for success/audit/retry/failure tracking reasons. This would make it easier to ingest 99/100 shipments and retry (after localizing and fixing the issue) the one shipment that's invalid for whatever reason.

In order to decide whether something has been ingested correctly, I thought a solution could be hashing each 'unit' of input and storing the original input somewhere as well.
Quite easy for CSV.
Weird Python-and-bash-esque pseudocode:
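A minimal Go sketch of the same idea, treating each CSV line as its own unit (file name hypothetical):

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("shipments.csv") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Bytes()
		sum := sha256.Sum256(line)
		// Store the hex sum plus the original line for success/audit/retry tracking.
		fmt.Printf("%s  %s\n", hex.EncodeToString(sum[:]), line)
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```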
It becomes less easy for JSON and XML; even a marshal/unmarshal round trip is not 100% identical to the input. Even worse is EDI.
So, even though I liked the idea of storing the original, it quickly becomes cumbersome. A decent alternative is hashing and storing the output of transform.Read().

But that comes with several issues. None of these is a major issue, but they are all consequences of hashing a new representation of the input, not the input itself.
I was wondering: how hard would it be to hash the input of whatever generates the output? So:

```go
hash, data, err := transform.Read()
```
Is your internal data stable enough that you could, say, 'for loop' the IDR input through the sha256 encoder (it supports streaming) and return a stable/unchanging hash?
As in: in theory, ["a", "b", "c"] should return the same hash for a, b, and c regardless of ordering.
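For what it's worth, Go's sha256 does support streaming via incremental writes; the order-insensitivity asked about here, though, has to come from canonicalizing the values first, since SHA-256 itself is order-sensitive. A sketch under that assumption:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// stableHash returns the same digest for the same set of values regardless
// of their input order, by sorting into a canonical order before hashing.
func stableHash(values []string) string {
	sorted := append([]string(nil), values...)
	sort.Strings(sorted)
	h := sha256.New()
	for _, v := range sorted {
		h.Write([]byte(v)) // hash.Hash supports incremental (streaming) writes
		h.Write([]byte{0}) // separator so ["ab"] hashes differently from ["a","b"]
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	fmt.Println(stableHash([]string{"a", "b", "c"}))
	fmt.Println(stableHash([]string{"c", "a", "b"})) // prints the same hash
}
```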
Also, I imagine being able to verify whether a file has been fully processed is interesting for more than one use case.