Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand GTF filtering to remove rows with empty transcript ID when required, fix STAR GTF usage #1107

Merged
merged 48 commits into from
Nov 15, 2023

Conversation

pinin4fjords
Copy link
Member

@pinin4fjords pinin4fjords commented Nov 7, 2023

In this PR:

Closes #1074 #1102 #1082

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/rnaseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@pinin4fjords pinin4fjords changed the base branch from master to dev November 7, 2023 18:38
Copy link

github-actions bot commented Nov 7, 2023

This PR is against the master branch ❌

  • Do not close this PR
  • Click Edit and change the base to dev
  • This CI test will remain failed until you push a new commit

Hi @pinin4fjords,

It looks like this pull-request is has been made against the nf-core/rnaseq master branch.
The master branch on nf-core repositories should always contain code from the latest release.
Because of this, PRs to master are only allowed if they come from the nf-core/rnaseq dev branch.

You do not need to close this PR, you can change the target branch to dev by clicking the "Edit" button at the top of this page.
Note that even after this, the test will continue to show as failing until you push a new commit.

Thanks again for your contribution!

@pinin4fjords pinin4fjords changed the title Stringtie gtf Prepare GTF for stringtie Nov 7, 2023
Copy link

github-actions bot commented Nov 7, 2023

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 3ac4a3a

+| ✅ 144 tests passed       |+
#| ❔   6 tests were ignored |#
!| ❗   5 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml
  • nextflow_config - Config manifest.version should end in dev: 3.13.0
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in WorkflowRnaseq.groovy: Optionally add in-text citation tools to this list.

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore or pyproject.toml
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/rnaseq/rnaseq/.github/workflows/awstest.yml
  • multiqc_config - multiqc_config

✅ Tests passed:

Run details

  • nf-core/tools version 2.10
  • Run at 2023-11-15 18:06:44

Copy link

github-actions bot commented Nov 9, 2023

Python linting (black) is failing

To keep the code consistent with lots of contributors, we run automated code consistency checks.
To fix this CI test, please run:

  • Install black: pip install black
  • Fix formatting errors in your pipeline: black .

Once you push these changes the test should pass, and you can hide this comment 👍

We highly recommend setting up Black in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!

Thanks again for your contribution!

@pinin4fjords pinin4fjords changed the title Prepare GTF for stringtie Expand GTF filtering to remove rows with empty transcript ID when required Nov 9, 2023
@pinin4fjords pinin4fjords linked an issue Nov 9, 2023 that may be closed by this pull request
@pinin4fjords pinin4fjords changed the title Expand GTF filtering to remove rows with empty transcript ID when required Expand GTF filtering to remove rows with empty transcript ID when required, fix STAR GTF usage Nov 9, 2023
@pinin4fjords
Copy link
Member Author

pinin4fjords commented Nov 14, 2023

Therefore, I think, the pipeline should have a fallback. If the PREPARE_GENOME.out.gtf_with_transcript_ids is entirely empty, the PREPARE_GENOME.out.gtf, should be used instead for SALMON_QUANT and STRINGTIE. Like so, there is at least the chance that the tools work, whereas they will certainly fail when presented with an empty file. Alternatively, the filter should not be hardcoded, but adapt to the provided file.

@drpatelh and I had a long discussion about this PR yesterday, and we think there should be a parameter to disable the GTF filtering entirely, mostly to avoid an unnecessary process where we're confident about the GTF.

I think we could address your point here by having the process die with a message when the file ends up empty. If the user is sure they want to proceed they can then disable the filter, and will get the relevant error messages from any active consuming processes that have a problem with that.

Edit: problem with that is when scaffold complement is good, but transcript ids are not. Will propose something soon.

@pinin4fjords pinin4fjords marked this pull request as draft November 14, 2023 12:49
@pinin4fjords
Copy link
Member Author

pinin4fjords commented Nov 15, 2023

@MatthiasZepper @drpatelh

I believe I have addressed both your feedback:

  • @MatthiasZepper - there is now a single filtered form of the GTF. Managing multiple was introducing complexity I don't think is necessary. If there are no lines in the filtered form, the pipeline will throw an error. There is a new option --skip_gtf_transcript_filter that will bypass the transcript_id check if the user wishes to do so on encountering such an error. That should preclude the circumstance you were worried about where an empty GTF is passed through.
  • @drpatelh - per your feedback there is now an option to disable GTF filtering entirely (--skip_gtf_filter), and I have added a conditional to only use filtering where it's useful (i.e. where consuming processes are enabled). This will allow users to skip the filtering process where they have high confidence in the GTF.

@pinin4fjords pinin4fjords marked this pull request as ready for review November 15, 2023 10:40
@pinin4fjords
Copy link
Member Author

@nf-core-bot fix linting

conf/modules.config Outdated Show resolved Hide resolved
Copy link
Member

@MatthiasZepper MatthiasZepper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart! I like both: The approach and your implementation of it! Great work!

nextflow_schema.json Outdated Show resolved Hide resolved
bin/filter_gtf.py Show resolved Hide resolved
@drpatelh
Copy link
Member

@nf-core-bot fix linting

Copy link
Member

@drpatelh drpatelh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Agree. Nice work! Hopefully, a step in the right direction to solving these painful GTF formatting issues we have seen numerous times.

@drpatelh drpatelh merged commit 7a04c5d into dev Nov 15, 2023
29 checks passed
@pinin4fjords pinin4fjords deleted the stringtie_gtf branch November 15, 2023 20:55
@ppericard
Copy link
Contributor

Thank you for everyone that worked on this problem !
This was indeed a real pain when working with GTF from multiple sources, not always as well formated as they should.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants