-
Notifications
You must be signed in to change notification settings - Fork 262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Morphlines hashDigest command #443
Conversation
@@ -62,6 +62,11 @@ | |||
<scope>test</scope> | |||
</dependency> | |||
|
|||
<dependency> | |||
<groupId>commons-codec</groupId> | |||
<artifactId>commons-codec</artifactId> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The policy for existing morphline mvn modules is to not add additional dependencies on further 3rd party libs, in order to avoid the potential for dependency version conflicts / jar hell. Consider providing similar functionality just using what's in the JDK or with existing dependencies, or consider adding a new separate mvn module, e.g. along the lines of the kite-morphlines-json module.
Another possibility to avoid a new external dep is to manually shade the codec classes into https://github.com/kite-sdk/kite/tree/master/kite-morphlines/kite-morphlines-core/src/main/java/org/kitesdk/morphline/shaded/org/apache/commons/codec/binary/binary but I'd only recommend doing so if the classes are small and self-contained, not for big involved chunks of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved by shading the relevant methods into a static class. Shading the whole classes had too many dependencies.
What's the probability of hash collision (two different messages returning the same id) with the MessageDigests vs. GenerateUUID impl? How much less likely is a collision with SHA-256 than with that we have already today? |
Regarding comparison vs GenerateUUID - the latter is guaranteed to be unique regardless of the input document - i.e. running the same document through a billion times will give a billion UUIDs. This new function will always return the same result for the same input - hence its desirability. Whether a collision is likely depends on the algorithm chosen and the number of inputs, but casually speaking is close to zero :-) |
Thanks for reviewing @whoschek - I've updated based on your feedback. |
} | ||
|
||
@Override | ||
protected void doNotify(Record notification) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now doNotify() nomore does anything and the method can be removed, which improves readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
Also, let's rename the name of the test case class to HashDigestTest in order to reflect the "hashDigest" name. |
} | ||
|
||
{ | ||
hashDigest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename hashCommand.conf to hashDigest.conf to be consistent with the previous renames?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Thanks again for the great contrib! I think we're almost there. After the final minor outstanding changes please squash the multiple commits into a single big commit, and then we're ready to merge. |
Also would be good to add corresponding docs in kite-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence but that can be done in a separate commit if you like. |
To read the html output that is generated from the confluence docs, or for debugging: cd kite; mvn post-site -pl kite-morphlines; open kite-morphlines/target/site/index.html |
Also, for better perf consider using UTF16 instead of UTF8 for string conversion. |
Regarding using UTF16 rather than UTF8 - this would give different results as the input byte array would be different. I think in general people would expect UTF8 (for example http://www.miraclesalad.com/webtools/md5.php) however actually we can make that configurable fairly simply. I'll set the default to UTF-8 but add a parameter called charset as per readCSV command. |
Thanks for reviewing @whoschek - much appreciated. |
Updated the docs onto the same commit. Let me know if there's any further changes required. |
Hi Tristan. I merged this. Thanks for this great contrib! |
Thanks Wolfgang for your feedback and swift turnaround. Tristan Stevens m +44(0)7808 986422 | On 27 Apr 2016 8:18 p.m., "Wolfgang Hoschek" notifications@github.com
|
Initial commit for hashCommand implementation and tests