Channel join loses duplicate keys #654

Closed
stevekm opened this issue Apr 11, 2018 · 5 comments

stevekm commented Apr 11, 2018

If I have a pipeline like this:

// Start with a Channel of Sample IDs and associated files
Channel.from(["Sample1", "Sample1.txt"],
            ["Sample2", "Sample2.txt"],
            ["Sample3", "Sample3.txt"],
            ["Sample4", "Sample4.txt"],
            ["Sample5", "Sample5.txt"],
            ["Sample6", "Sample6.txt"])
        .into{samples; samples2; samples3; samples4}

// Channel of sample ID pairs
Channel.from(["Sample1", "Sample2"],
            ["Sample3", "Sample4"],
            ["Sample6", "Sample4"])
        .into{sample_pairs; sample_pairs2}


samples3.join(sample_pairs2)
        .map { sample_ID, sample_file, pair_ID  ->
            return [ pair_ID, sample_ID, sample_file ]
        }
        .join(samples4)
        .map { pair_ID, sample_ID, sample_file, pair_file ->
            return [ sample_ID, sample_file, pair_ID, pair_file ]
        }
        .println()

The output looks like this:

./nextflow run main.nf
N E X T F L O W  ~  version 0.28.0
Launching `main.nf` [cheeky_heyrovsky] - revision: 799d622fe4
[Sample1, Sample1.txt, Sample2, Sample2.txt]
[Sample3, Sample3.txt, Sample4, Sample4.txt]

instead of like this:

[Sample1, Sample1.txt, Sample2, Sample2.txt]
[Sample3, Sample3.txt, Sample4, Sample4.txt]
[Sample6, Sample6.txt, Sample4, Sample4.txt]

In this case, Sample4 appears twice as the join key in the paired channel, and the second occurrence is lost at the second join.

My current workaround is here: https://github.com/stevekm/nextflow-demos/blob/253b8e3ea8547d88f571e7c718e34696cf5b9be1/join-pairs/main.nf

However, it is very cumbersome. I think this could be resolved by offering options for a 'left outer join', 'right outer join', or 'full outer join', since it appears that only an inner join is being used.
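
A possible lighter alternative to the linked workaround, sketched under assumptions: the combine operator with its by option pairs every item that shares a key rather than matching each key once, so a repeated Sample4 should not be dropped at the second step. This assumes a Nextflow release in which combine accepts by. The names sample_list and pair_list are illustrative, and each channel is built from a plain list so that no channel is consumed twice.

```groovy
// Illustrative input lists (hypothetical names, same data as above)
def sample_list = [["Sample1", "Sample1.txt"],
                   ["Sample2", "Sample2.txt"],
                   ["Sample3", "Sample3.txt"],
                   ["Sample4", "Sample4.txt"],
                   ["Sample5", "Sample5.txt"],
                   ["Sample6", "Sample6.txt"]]
def pair_list = [["Sample1", "Sample2"],
                 ["Sample3", "Sample4"],
                 ["Sample6", "Sample4"]]

Channel.from(sample_list)
        // First step: keys are unique on both sides, so a plain join is enough
        .join(Channel.from(pair_list))
        // Re-key each item on the pair ID for the second step
        .map { sample_ID, sample_file, pair_ID ->
            return [ pair_ID, sample_ID, sample_file ]
        }
        // Second step: combine by key 0 so repeated pair IDs are all matched
        .combine(Channel.from(sample_list), by: 0)
        .map { pair_ID, sample_ID, sample_file, pair_file ->
            return [ sample_ID, sample_file, pair_ID, pair_file ]
        }
        .view()
```

Emission order is not guaranteed, so the rows may print in any order.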

@pditommaso
Member

Yes, it currently behaves as an inner join. With remainder: true it implements a full outer join.
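
As a small illustration of that option, assuming the same Channel.from factory used earlier in this thread: with remainder: true, an item whose key has no partner is still emitted, with null standing in for the missing value. The channel contents below are made up for the example.

```groovy
// Made-up data: key "c" exists only on the left side
def left  = Channel.from(["a", 1], ["b", 2], ["c", 3])
def right = Channel.from(["a", "X"], ["b", "Y"])

// Default join keeps only keys present in both channels, dropping "c".
// With remainder: true the unmatched item is emitted as well, padded with null.
left.join(right, remainder: true).view()
```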


stevekm commented Apr 23, 2018

Would it be possible to get options for specifying the other types of joins?

@pditommaso
Member

It should be possible.


karl616 commented Oct 3, 2019

I have a problem with remainder: true. Not sure if it belongs here, but given this toy example:

left=[[a, 1], [a, 2], [b, 3]] 
right=[[a, X], [b, Y]]
left.join(right) => [[a, 1, X], [b, 3, Y]]
left.join(right, remainder: true) => [[a, 1, X], [a, 2, null], [b, 3, Y]]

This is given that X and Y are files. Not sure if that is the reason. From the second operation I would have expected: [[a, 1, X], [a, 2, X], [b, 3, Y]]


stale bot commented Apr 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
