Morphlines hashDigest command #443

Merged
merged 1 commit into from
Apr 27, 2016

Conversation

tmgstevens
Contributor

Initial commit for hashCommand implementation and tests

@tmgstevens
Contributor Author

@@ -62,6 +62,11 @@
<scope>test</scope>
</dependency>

<dependency>
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>

@whoschek whoschek Apr 15, 2016

The policy for existing morphline mvn modules is to not add additional dependencies on further 3rd party libs, in order to avoid the potential for dependency version conflicts / jar hell. Consider providing similar functionality just using what's in the JDK or with existing dependencies, or consider adding a new separate mvn module, e.g. along the lines of the kite-morphlines-json module.

Another possibility to avoid a new external dep is to manually shade the codec classes into https://github.com/kite-sdk/kite/tree/master/kite-morphlines/kite-morphlines-core/src/main/java/org/kitesdk/morphline/shaded/org/apache/commons/codec/binary/binary but I'd only recommend doing so if the classes are small and self-contained, not for big involved chunks of code.

Contributor Author

Resolved by shading the relevant methods into a static class. Shading the whole classes had too many dependencies.
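
As an illustration of the "just use what's in the JDK" option raised above, hex encoding of a digest needs very little code. The following is a minimal sketch only; `HexSketch` and `encodeHex` are hypothetical names and this is not the code that was shaded in this PR.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not the class added by this PR. */
final class HexSketch {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    private HexSketch() {}

    /** Encodes each byte as two lowercase hex characters. */
    static String encodeHex(byte[] data) {
        char[] out = new char[data.length * 2];
        for (int i = 0; i < data.length; i++) {
            int v = data[i] & 0xFF;            // treat the byte as unsigned
            out[2 * i] = DIGITS[v >>> 4];      // high nibble
            out[2 * i + 1] = DIGITS[v & 0x0F]; // low nibble
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // 'h' is 0x68 and 'i' is 0x69 in UTF-8
        System.out.println(encodeHex("hi".getBytes(StandardCharsets.UTF_8))); // prints 6869
    }
}
```

A self-contained static helper like this avoids both a new Maven dependency and the transitive class references that come with shading whole classes.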

@whoschek

What's the probability of a hash collision (two different messages returning the same id) with the MessageDigests vs. the GenerateUUID impl? How much less likely is a collision with SHA-256 than with what we already have today?

@tmgstevens tmgstevens changed the title Initial commit for hashCommand implementation and tests Morphlines hashDigest command Apr 15, 2016
@tmgstevens
Contributor Author

Regarding the comparison with GenerateUUID: the latter generates a new unique ID regardless of the input document, i.e. running the same document through a billion times will give a billion different UUIDs.

This new function will always return the same result for the same input - hence its desirability.

Whether a collision is likely depends on the algorithm chosen and the number of inputs but, casually speaking, the probability is close to zero :-)
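
To make the two points above concrete, here is a rough sketch using only JDK classes. It shows that a digest is deterministic while `UUID.randomUUID()` is not, and it estimates the collision probability with the standard birthday approximation p ≈ 1 - exp(-n² / 2^(b+1)) for n inputs and a b-bit digest. The class and method names are hypothetical, and the estimate assumes digest outputs are uniformly distributed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.UUID;

/** Hypothetical illustration; not part of the hashDigest command itself. */
final class DigestVsUuidSketch {
    private DigestVsUuidSketch() {}

    /** SHA-256 of the UTF-8 bytes of the input; same input, same output. */
    static byte[] sha256(String input) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation
            throw new IllegalStateException(e);
        }
    }

    /** Birthday approximation, assuming uniformly distributed digests. */
    static double collisionProbability(double n, int bits) {
        // expm1 keeps precision when the exponent is extremely close to zero
        return -Math.expm1(-(n * n) / Math.pow(2.0, bits + 1));
    }

    public static void main(String[] args) {
        // Same document twice: identical digests, different random UUIDs
        System.out.println(Arrays.equals(sha256("doc"), sha256("doc")));
        System.out.println(UUID.randomUUID().equals(UUID.randomUUID()));

        // Estimated collision probability for a billion distinct inputs
        System.out.println("MD5     (128 bits): " + collisionProbability(1e9, 128));
        System.out.println("SHA-256 (256 bits): " + collisionProbability(1e9, 256));
    }
}
```

Under these assumptions, a billion inputs give an estimate on the order of 10^-21 for a 128-bit digest and far smaller for a 256-bit one, which is consistent with "close to zero".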

@tmgstevens
Contributor Author

Thanks for reviewing @whoschek - I've updated based on your feedback.

}

@Override
protected void doNotify(Record notification) {

doNotify() no longer does anything, so the method can be removed, which improves readability.

Contributor Author

Removed

@whoschek

Also, let's rename the test case class to HashDigestTest in order to reflect the "hashDigest" name.

}

{
hashDigest {

Can we rename hashCommand.conf to hashDigest.conf to be consistent with the previous renames?

Contributor Author

Fixed

@whoschek

Thanks again for the great contrib! I think we're almost there. After the final minor outstanding changes please squash the multiple commits into a single big commit, and then we're ready to merge.

@whoschek

It would also be good to add corresponding docs in kite-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence, but that can be done in a separate commit if you like.

@whoschek

To read the html output that is generated from the confluence docs, or for debugging: cd kite; mvn post-site -pl kite-morphlines; open kite-morphlines/target/site/index.html

@whoschek

Also, for better perf consider using UTF16 instead of UTF8 for string conversion.

@tmgstevens
Contributor Author

tmgstevens commented Apr 26, 2016

Regarding using UTF16 rather than UTF8: this would give different results, as the input byte array would be different. I think in general people would expect UTF8 (for example http://www.miraclesalad.com/webtools/md5.php); however, we can make that configurable fairly simply. I'll set the default to UTF-8 but add a parameter called charset, as in the readCSV command.
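
The effect of the charset on the result can be seen with a few lines of JDK code. This is only a sketch under stated assumptions: `CharsetDigestSketch` and `md5Hex` are hypothetical names, and the method's `charset` argument merely mirrors the configurable parameter described above rather than reproducing the command's implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration of why the charset changes the digest. */
final class CharsetDigestSketch {
    private CharsetDigestSketch() {}

    /** MD5 of the input string encoded with the given charset, as lowercase hex. */
    static String md5Hex(String input, Charset charset) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(input.getBytes(charset));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xFF)); // two hex chars per byte
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same string, different byte sequences, therefore different digests
        System.out.println(md5Hex("hello", StandardCharsets.UTF_8)); // prints 5d41402abc4b2a76b9719d911017c592
        System.out.println(md5Hex("hello", StandardCharsets.UTF_16));
    }
}
```

The UTF-8 result matches what common online MD5 tools report for "hello", which supports keeping UTF-8 as the default.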

@tmgstevens
Contributor Author

Thanks for reviewing @whoschek - much appreciated.

@tmgstevens
Contributor Author

Updated the docs in the same commit. Let me know if there are any further changes required.

@whoschek whoschek merged commit 6a4ce72 into kite-sdk:master Apr 27, 2016
@whoschek

Hi Tristan. I merged this. Thanks for this great contrib!

@tmgstevens
Contributor Author

Thanks Wolfgang for your feedback and swift turnaround.
Tristan
