Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segment_matrix improvements (#30) #31

Merged
merged 4 commits into from
Apr 7, 2020

Conversation

lomereiter
Copy link
Contributor

Note: there are a few changes in the output JSON files (only one dataset made it into the commit since that's what I test on). This is because order of arrivals and departures arrays' elements is not fixed. For testing against the original results I run jq '.components[].arrivals |= sort_by(.upstream, .downstream) | '.components[].departures |= sort_by(.upstream, .downstream)' on each chunk.

@josiahseaman
Copy link
Member

Is this ready for review?
I'd like to not that occupants is just a precompute to make Schematize logic a bit simpler. It could be removed if the same code was changed in Schematize to be smarter (which may also slow the browser). Arrivals and departures, however are necessary I believe. They're derived from the order of links listed in the bin. I don't know that it would be simple to remove that requirement.

If you care about JSON size, I have a branch where I changed all the lists to dictionaries or sets which is a radically smaller file size https://github.com/graph-genome/component_segmentation/tree/experimental_v6_sparse_matrix. I ran into issues with this format particularly on the Schematize side. Now I think it wouldn't be worth it with all the other JSON format changes. Something to keep in mind though. If we can precompute things to make the browser faster that's good. However, if the large file size makes the file load slow in the browser, that's counter-productive. I'll leave it to your good judgement.

Copy link
Member

@josiahseaman josiahseaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python code here is really clean. I can't find anything to gripe about at all. If I understand correctly, you're saying the order the link columns are listed in the JSON is not deterministic. Example of them switching:

image

Since you've now seen the deep insides, I'd appreciate it if you read Minutiae: Link Column Ordering. This describes the specification, rather than the current reality. Short version: link column should stack from the inside out on a component as they're traversed, there are likely cases where no consistent sort is possible across all individuals.

I see no reason not to merge this in. Really, well done.
.

@josiahseaman
Copy link
Member

Question: since we have data checked into the repo, is it going to generate a diff every time the same command is run?

@josiahseaman josiahseaman merged commit bd82d7b into graph-genome:master Apr 7, 2020
@lomereiter
Copy link
Contributor Author

Question: since we have data checked into the repo, is it going to generate a diff every time the same command is run?

No, it won't. The order now corresponds to the traversal of the sorted dataframe, there are no random choices involved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants