Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Salmon --keepDuplicates by default #1259

Open
adamrtalbot opened this issue Mar 12, 2024 · 2 comments
Open

Salmon --keepDuplicates by default #1259

adamrtalbot opened this issue Mar 12, 2024 · 2 comments
Milestone

Comments

@adamrtalbot
Copy link
Contributor

adamrtalbot commented Mar 12, 2024

Description of feature

By default, Salmon will drop transcripts with identical sequence, as described here: COMBINE-lab/salmon#214 (comment)

This behaviour should reduce unnecessary duplicate count values in the results matrices, but may be unexpected behaviour for a new user. If they supply some transcripts and one is missing, but is not clear why or which one.

Instead, we could add the flag --keepDuplicates for the Salmon index by default (discussed here). This retains all transcripts which is more predictable behaviour. We could add an additional flag (--salmonDropDuplicates) to disable this behaviour. Alternatively, we could make it opt-in, with a flag --salmonKeepDuplicates or a more generic version (--salmonIndexExtraArgs) to allow users to enable this feature when they require it.

@rob-p apologies for the unsolicited message could you help us understand the downside of keeping the duplicate transcripts? What do we lose when we enable --keepDuplicates?

Side note, this is a breaking behaviour so will need to be communicated to users.

@rob-p
Copy link

rob-p commented Mar 12, 2024

Hi @adamrtalbot,

Thanks for pinging me here. I'm tagging @mikelove here too in case he has specific thoughts / input.

Basically, when we first added the feature to deduplicate reference sequences prior to index, we had a twitter poll, and the "deduplicate by default" category won. That's not a super scientific way to answer the question, but the reason we had the poll was because we could both think of good arguments pro and con. The biggest con argument is precisely what you state; to have references in your input set simply not appear in the output can be surprising to users, especially if you're not explicitly aware of the deduplication (which we document, but don't constantly shout at users about). On the pro side, many users were actually legitimately surprised that duplicates exist and, also, precisely how many such duplicates there are in common annotations. Most of those we observed should clearly be considered artifactual (e.g. annotations matching those on a reference chromosome, but appearing on a patch contig in a region with no variation).

Further, from the perspective of quantification, sequence-level duplicates are a priori indistinguishable — they are the same transcript with a different label. Therefore, their assigned abundance should always be considered equal. This was the impetus behind writing them out in the duplicate_clusters.tsv file — if you want to assign/project abundance to these transcripts after the fact, you can simply collect each duplicate cluster, and equally divide the abundance estimated for the retained transcript. Granted, this becomes a bit more difficult if you are using an uncertainty-aware differential analysis tool like swish, which makes use of inferential replicates. However, in that case, I think one could simply divide each inferential sample by the number of duplicates and impute identical inferential replicates for each of these duplicates.

In short, the problem, IMO, is the existence of sequence-level duplicates themselves. They are basically meaningless from an inferential perspective. However, I don't think that either keeping them or discarding them as a default decision is much better than the other. Ultimately, the important thing is that the user knows about them and can decide intelligently and in a problem-specific way if they are important and, if so, how to deal with them.

--Rob

@adamrtalbot
Copy link
Contributor Author

Thanks Rob, that's really helpful. I think we should allow users to set it based on how they want, ideally with the default remaining backwards compatible.

We're not exactly sure how best to implement this, but definitely one for the next major release.

@drpatelh drpatelh added this to the 3.15.0 milestone May 13, 2024
@drpatelh drpatelh modified the milestones: 3.15.0, 3.16.0 May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants