segment_matrix improvements (#30) #31

lomereiter · 2020-04-07T18:09:58Z

Note: there are a few changes in the output JSON files (only one dataset made it into the commit, since that's the one I test on). This is because the order of elements in the arrivals and departures arrays is not fixed. To test against the original results, I run jq '.components[].arrivals |= sort_by(.upstream, .downstream) | .components[].departures |= sort_by(.upstream, .downstream)' on each chunk.
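For anyone who prefers to do the same comparison without jq, here is a minimal Python sketch of that normalization. It assumes each chunk is a JSON object with a `components` list whose entries carry `arrivals` and `departures` arrays of objects with `upstream` and `downstream` keys, as the jq filter implies; the function names `normalize_chunk` and `chunks_equivalent` are hypothetical and not part of this repository.

```python
import copy


def normalize_chunk(chunk):
    """Return a deep copy of a chunk with link arrays in a canonical order.

    Assumes the layout implied by the jq filter: chunk["components"] is a
    list, and each component may have "arrivals" and "departures" lists of
    dicts with "upstream" and "downstream" keys.
    """
    normalized = copy.deepcopy(chunk)
    for component in normalized.get("components", []):
        for field in ("arrivals", "departures"):
            if field in component:
                # Same sort key as sort_by(.upstream, .downstream) in jq.
                component[field] = sorted(
                    component[field],
                    key=lambda link: (link["upstream"], link["downstream"]),
                )
    return normalized


def chunks_equivalent(old_chunk, new_chunk):
    """Compare two chunks while ignoring arrival/departure ordering."""
    return normalize_chunk(old_chunk) == normalize_chunk(new_chunk)
```

Each chunk would be loaded with `json.load` and the old and new versions passed to `chunks_equivalent`. Any difference that remains after normalization is a real change in content rather than in ordering.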

josiahseaman · 2020-04-07T22:51:34Z

Is this ready for review?
I'd like to note that occupants is just a precomputation that makes the Schematize logic a bit simpler. It could be removed if the corresponding code in Schematize were made smarter (which might also slow the browser). Arrivals and departures, however, are necessary, I believe. They're derived from the order of links listed in the bin. I don't think it would be simple to remove that requirement.

If you care about JSON size, I have a branch where I changed all the lists to dictionaries or sets, which gives a radically smaller file size: https://github.com/graph-genome/component_segmentation/tree/experimental_v6_sparse_matrix. I ran into issues with this format, particularly on the Schematize side. Now I think it wouldn't be worth it, given all the other JSON format changes. Something to keep in mind, though. If we can precompute things to make the browser faster, that's good. However, if the large file size makes the file load slowly in the browser, that's counterproductive. I'll leave it to your good judgment.

josiahseaman

The Python code here is really clean. I can't find anything to gripe about at all. If I understand correctly, you're saying the order the link columns are listed in the JSON is not deterministic. Example of them switching:

Since you've now seen the deep insides, I'd appreciate it if you read Minutiae: Link Column Ordering. It describes the specification rather than the current reality. Short version: link columns should stack from the inside out on a component as they're traversed, and there are likely cases where no consistent sort is possible across all individuals.

I see no reason not to merge this in. Really, well done.

josiahseaman · 2020-04-07T23:07:39Z

Question: since we have data checked into the repo, is it going to generate a diff every time the same command is run?

lomereiter · 2020-04-08T05:58:46Z

> Question: since we have data checked into the repo, is it going to generate a diff every time the same command is run?

No, it won't. The order now corresponds to the traversal of the sorted dataframe; there are no random choices involved.
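To illustrate why a sort-then-traverse approach is reproducible, here is a rough sketch in plain Python. It is not the code from this PR, which works on a pandas dataframe; tuples of integers stand in for dataframe rows, and the function name `sort_dedup` is hypothetical.

```python
def sort_dedup(connections):
    """Sort connection tuples and drop exact duplicates.

    The result depends only on the set of input rows, never on the order
    in which they were supplied, so repeated runs emit identical output.
    """
    result = []
    for row in sorted(connections):
        # After sorting, duplicates are adjacent, so comparing with the
        # last kept row is enough to remove them.
        if not result or result[-1] != row:
            result.append(row)
    return result


# Two runs that discover the same links in a different order.
run_a = [(5, 1), (2, 3), (5, 1), (2, 1)]
run_b = [(2, 1), (5, 1), (2, 3), (2, 3)]
assert sort_dedup(run_a) == sort_dedup(run_b)
```

Because the emitted order is a function of the data alone, rerunning the same command on the same input should leave the checked-in JSON unchanged.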

lomereiter added 4 commits April 7, 2020 15:20

utils.find_groups

8622bef

avoid nested sets and python lists

5227764

replace usage of sets with numpy arrays

bbe1bb9

use a custom function to sort/dedup connections dataframe

9375a2f

josiahseaman self-requested a review April 7, 2020 23:01

josiahseaman added the optimization label Apr 7, 2020

josiahseaman approved these changes Apr 7, 2020


josiahseaman merged commit bd82d7b into graph-genome:master Apr 7, 2020

lomereiter mentioned this pull request Apr 10, 2020

Refactor segment_matrix to use numpy/pandas data structures #30

Closed
