Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support string indices for operators with by option #3108

Open
bentsherman opened this issue Aug 10, 2022 · 12 comments
Open

Support string indices for operators with by option #3108

bentsherman opened this issue Aug 10, 2022 · 12 comments

Comments

@bentsherman
Copy link
Member

Spawned from the language improvements mega-discussion (#3107). Operators that combine or group channels based on an index (i.e. by option) currently only support integer indices. They should also support string indices so that they can be used with maps.

To my knowledge, these operators include:

  • groupTuple
  • join
  • combine
@stale
Copy link

stale bot commented Jan 16, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 16, 2023
@bentsherman bentsherman removed the stale label Jan 17, 2023
@Midnighter
Copy link

Midnighter commented Mar 3, 2023

Would love this feature! I started working on a plugin just for adding these kinds of operators but if nextflow can do it natively, that'd be so much better.

I've seen plenty of issues in nf-core where such channel operations are done using the entire map as the key, hopefully string keys would quickly resolve them.

@Midnighter
Copy link

I think the cross operator also qualifies.

Just wanted to say, I found a "fun" example where grouping happens on two maps at different positions 😬 In the below example from the taxprofiler pipeline, position 0 and 2 are Maps. Should one be able to specify different keys for different positions?

.map {
    meta, reads, db_meta, db ->
        [[id: db_meta.db_name, single_end: meta.single_end], reads, db_meta, db]
}
.groupTuple(by: [0,2,3])

@bentsherman
Copy link
Member Author

@Midnighter , I went ahead and pushed a draft PR, currently only groupMap is implemented.

I don't think cross needs to be changed because it just does an outer product regardless of the item type. See the example in the Nextflow docs:. It already supports lists and maps alike.

Also, your issue #3175 is about using a map as a join key, whereas this issue is about using string keys. So your use case should already work, as long as you aren't modifying your maps in place. In others, tuples can already be joined on a map element, but maps cannot be joined because it requires string indices.

@Midnighter
Copy link

I see, I think I misunderstood the intent of this issue then. You are talking about joining channels that have single map elements only?

Yes, I'm mostly interested in joining channels where a map is an element of a tuple. I thought you were working on a solution that would allow setting string keys also in that case. Indeed I think that joining on the entire map is bad practice and joining on chosen keys from that map is much more explicit and will prevent many accidental errors.

@stale
Copy link

stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 12, 2023
@Midnighter
Copy link

I vaguely remember you talking about allowing closures for the by argument, which would solve this elegantly. I can't remember where, though.

@stale stale bot removed the stale label Aug 12, 2023
@bentsherman
Copy link
Member Author

That was for joining on an entire map, which I got mixed up with this issue. We fixed that one by being careful about modifying maps. Also I realized that joining on a closure doesn't really make sense 😅

This issue is only about joining on string keys instead of integers. I have a draft PR for it but the three operators in question are complicated and it's not a priority right now. Feel free to experiment check out the PR if you want to help, but I think your original issue was solved by other means.

@cjw85
Copy link
Contributor

cjw85 commented Apr 22, 2024

@bentsherman Why do you say

joining on a closure doesn't really make sense

I came passed here because I was searching for issues as I want to do exactly this. I've seen numerous examples both in our code and others with the pattern:

chan_a
    .map { meta, stuff -> [meta['thing'] , meta, stuff] }
    .join(chan_b.map { ... same again }
    .map { key, meta, stuff -> [meta, stuff] }

or similar where .join(...) is replaced with e.g. groupTuple(). So essentially two .maps being used to create a temporary key.

I suppose for .join() you would want two optional closures by_right and by_left, in order to allow for different structures in the two channels.

@bentsherman
Copy link
Member Author

for groupTuple it could make sense

for join, the issue I see is that if you join on e.g. { meta['id'] } but the meta-maps are otherwise different between the two samples, you would need to have the joined tuple be something like [ id, meta1, meta2, samples1, samples2 ]. I guess that's how a join works in general but I would expect the meta-maps to be identical most of the time. I suppose you could follow up with a map to merge/de-duplicate the meta maps

@cjw85
Copy link
Contributor

cjw85 commented Apr 22, 2024

I guess that's how a join works in general

Yup! Anything not in the key can (and should) be assumed to be distinct. Filtering the list of "column" down to deduplicate should be separate (think SQL or dataframe joins in either R, Python, or Rust).

The alternative here is to carry around additional keys in other parts of the tuple, though that gets you back into the situation of why the idiom of having a meta item at the front of channel items was born: the tuples becoming long and messy. And that thinking leads back to wanting channel items to be maps themselves.

@bentsherman
Copy link
Member Author

Or record types, which is where we are headed with #4553 . In any case, joining on a closure is a nice generic solution whether you are using tuples, maps, or record types

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants