This repository has been archived by the owner. It is now read-only.

Implement From trait for finch #1

Merged
merged 5 commits into from Jun 17, 2018

Conversation

Projects
None yet
2 participants
@luizirber
Copy link
Owner

luizirber commented May 30, 2018

Trying out the From trait, and finch seemed like a good first try (since I can check if the implementations are compatible).

pinging @bovee just to make sure I interpreted the finch struct correctly: in sourmash we keep the abundance of a k-mer, but finch uses count and extra_count. What is the difference between them, and can I just sum them up to have the total count?

The other thing was figuring out what is the k-mer length, since this is not saved in the struct. I solved this by checking the first value from .into_vec()

@bovee

This comment has been minimized.

Copy link

bovee commented May 31, 2018

This is really cool!

I think you want just the count field (which is the total); the extra_count field is used to keep track of how many k-mers were in their canonical orientation in the original file (for quality/filtering purposes). You should be able to get the k-mer length for most of our sketches from the raw k-mer sequence, but this isn't guaranteed to exist in the spec (for example, we sometimes strip these out in our internal downstream code to save space).

As an aside (and this probably isn't clear because we don't have any developer docs!), the MinHashKmers struct is our working heap implementation and it doesn't track a lot of the metadata you probably want for generic interop (like the hash seed). You can still use it if you're using finch directly from the sketching step, but internally we mostly use Vec<KmerCount>s instead.

I'm actually in the middle of a big refactor (that I really need to finish and merge soon; hopefully still this week) to add support for the binary mash format that also cleans up some of our internal code around this. The new implementation of our "generic" sketch struct is: https://github.com/onecodex/finch-rs/blob/binary_support/src/serialization.rs#L33 (the exact commit may be subject to rebase changes, but the pub stuct Sketch itself is pretty final). There's also a MultiSketch that stores all the exact k-mer size, hash seed, hash type, etc details, but this may not be appropriate if you just want to convert one sketch at a time.

@luizirber luizirber force-pushed the feature/from_finch branch from 5895a9d to 922ef18 Jun 8, 2018

@luizirber

This comment has been minimized.

Copy link
Owner Author

luizirber commented Jun 8, 2018

Now that 0.1.6 is released I updated the PR and... everything just worked?

@bovee I don't have a higher level struct yet (what would be the sourmash signatures or the finch Sketch), but when I have it I'll take a look into doing a better job at integrating it properly.

(long story short: sourmash-rust only implement the low level data structures, but all the metadata and JSON serialization is done in Python. It may change in the future, tho =])

@luizirber luizirber merged commit 5574683 into master Jun 17, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@luizirber luizirber deleted the feature/from_finch branch Jun 17, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.