Document FeatureCounts requirements to GTF more appropriately #144

apeltzer · 2018-12-14T14:20:41Z

We recently had a project with a non-standard organism, where we had to download the genome and GFF3 from NCBI instead of using the ENSEMBL ones. This left featureCounts unable to produce appropriate counts because, for example, the gene_id attribute was missing in that GTF/GFF.

Proper format is:

https://github.com/nf-core/test-datasets/blob/rnaseq/reference/genes.gtf

@ggabernet will post an example of a GFF that didn't work well. I will then take care of writing down some docs on how to make sure the GFF/GTF works fine for an analysis...

ggabernet · 2018-12-14T14:27:01Z

Here is an example of part of a gtf that did not work for us:

ref_Amel_HAv3.1_top_level_head20.txt

apeltzer · 2018-12-14T14:36:51Z

There are multiple possibilities:

a.) Let users edit the options passed to featureCounts directly
b.) Document that gene_id and gene_biotype must always be present in the GFF/GTF

a.) would also require us to adapt the featureCounts merging processes in general, e.g. by passing this option to the merge_featureCounts process. That might not be straightforward, but it would be possible.
b.) is easy to do, and we could even add a quick check at the start of the pipeline that looks for these attributes in the provided GTF/GFF. That would at least cause an early stop with a more meaningful error message :-)
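
As a rough illustration of the early check described in b.), here is a minimal Python sketch. It is not the pipeline's implementation: the function names `attribute_keys` and `missing_attributes` are hypothetical, and it assumes that only `exon` lines matter (the feature type featureCounts counts by default) and that the required attributes are gene_id and gene_biotype, as stated above.

```python
import re
from collections import Counter

# GTF-style attributes look like: gene_id "G1"; gene_biotype "protein_coding";
GTF_PAIR = re.compile(r'(\w+)\s+"[^"]*"')


def attribute_keys(column9):
    """Return the attribute names in column 9 of a GTF or GFF3 line."""
    keys = set(GTF_PAIR.findall(column9))
    if keys:
        return keys
    # Fall back to GFF3-style attributes: ID=exon1;Parent=rna0
    for field in column9.split(";"):
        if "=" in field:
            keys.add(field.split("=", 1)[0].strip())
    return keys


def missing_attributes(lines, required=("gene_id", "gene_biotype"),
                       feature_type="exon"):
    """Count, per required attribute, the feature lines that lack it."""
    missing = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue  # malformed line or a feature type we do not check
        keys = attribute_keys(cols[8])
        for name in required:
            if name not in keys:
                missing[name] += 1
    return missing


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        problems = missing_attributes(handle)
    if problems:
        sys.exit(f"Annotation is missing required attributes: {dict(problems)}")
```

Run against an NCBI-style GFF3 line that only carries `ID=...;Parent=...`, this would report both attributes as missing and stop before any counting starts.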

ewels · 2018-12-14T14:40:08Z

iGenomes also has NCBI and UCSC references, they're just not listed in the iGenomes config. We should probably add these. I think that they're normalised for a lot of stuff like this.

apeltzer · 2018-12-14T14:40:10Z

Once there are some opinions (@ewels, looking at you :-P), I'll have a go!

ewels · 2018-12-14T14:41:06Z

Can you check with, for example:

s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf

See if you get the same problem?

Use https://ewels.github.io/AWS-iGenomes/ to get all required s3 URLs.

ggabernet · 2018-12-14T14:42:59Z

iGenomes doesn't have the species I'm working with unfortunately :(
For one of the species I found an ENSEMBL version now though, so if this one runs through then the problem really was the GTF format. For some other species I don't even have an ENSEMBL version.

ewels · 2018-12-14T21:11:39Z

Ah sorry, I was speed reading and didn't pick up on the non-model organism bit. GFF is a horrible format for exactly this reason, it's not really a specified format.

The good news is that now I'm sat down and reading properly, I realise that this is a problem that we already came across ages ago and built in a feature to handle. So your fix is already part of the pipeline! It's even got documentation: https://github.com/nf-core/rnaseq/blob/master/docs/usage.md#featurecounts-extra-gene-names

I guess that supplying the option --fcExtraAttributes gene when running the pipeline will fix the issue.

ewels · 2018-12-14T21:13:39Z

ps. This is where it's used:

rnaseq/main.nf, line 923 in e837637:

    def extraAttributes = params.fcExtraAttributes ? "--extraAttributes ${params.fcExtraAttributes}" : ''

From the SubRead documentation:

--extraAttributes
Extract extra attribute types from the provided GTF annotation and include them in the counting output. These attribute types will not be used to group features. If more than one attribute type is provided they should be separated by comma (in Rsubread featureCounts its value is a character vector).

ggabernet · 2018-12-17T10:02:03Z

Hi Phil, thank you for your answers. This did point towards the solution, though not completely. Because the attributes in this GTF file are named differently, I also had to change the featureCounts call to use -g Parent.

https://github.com/ggabernet/rnaseq/blob/57c4475415b38994b50c6630f856d67f39605b57/main.nf#L932

Would it be possible to provide a parameter that allows changing the attribute used in this call, in a similar way as for the extra attributes?

ewels · 2018-12-17T10:43:31Z

Absolutely - we can set this to biotype by default but use a params variable to make it customisable. @apeltzer, are you able to PR this?
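
To make the idea concrete, here is a small Python sketch of a command builder where the grouping attribute passed to `-g` is configurable alongside the extra attributes. It only illustrates the pattern and is not the pipeline's Nextflow code: the function name `featurecounts_args` and its parameter names are hypothetical, and the `gene_id` default is simply what featureCounts itself uses when `-g` is omitted, not necessarily the default the pipeline ends up with.

```python
def featurecounts_args(annotation, bam, output,
                       group_attribute="gene_id", extra_attributes=None):
    """Build a featureCounts argument list with a configurable -g attribute.

    group_attribute: attribute used to group features (passed to -g).
    extra_attributes: optional list of attribute names; featureCounts
        expects them comma-separated after --extraAttributes.
    """
    args = ["featureCounts", "-a", annotation, "-g", group_attribute]
    if extra_attributes:
        args += ["--extraAttributes", ",".join(extra_attributes)]
    args += ["-o", output, bam]
    return args


if __name__ == "__main__":
    # NCBI-style annotation: group by Parent and also report the gene attribute.
    print(" ".join(featurecounts_args(
        "genes.gff", "sample.bam", "sample.counts.txt",
        group_attribute="Parent", extra_attributes=["gene"])))
```

Keeping the default in one place means standard ENSEMBL-style annotations need no extra options, while annotations like the one above only need the attribute name overridden.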

apeltzer · 2018-12-17T11:15:33Z

Yup, will do

apeltzer · 2018-12-17T12:34:59Z

This is now possible in the dev branch of the pipeline :-)

ggabernet · 2018-12-17T12:36:35Z

Perfect, thank you!

apeltzer self-assigned this Dec 14, 2018

apeltzer mentioned this issue Dec 17, 2018

Configurable featurecounts options #145

Merged

apeltzer closed this as completed Dec 17, 2018
