Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging Matrices in libTSVAggrTransByGene fails? #36

Closed
Niklas123Niklas opened this issue Oct 17, 2017 · 20 comments
Closed

Merging Matrices in libTSVAggrTransByGene fails? #36

Niklas123Niklas opened this issue Oct 17, 2017 · 20 comments
Assignees

Comments

@Niklas123Niklas
Copy link

Niklas123Niklas commented Oct 17, 2017

Hey there,

I try to use kallisto to quant the reads, but it failes during execution of mentioned script.

new.mat<-merge(tsv.mat,g2t,by.x=colnames(tsv.mat)[1],by.y="transcript_id",all.x=TRUE,sort=F)
new.mat <- new.mat[new.mat$transcript_id %in% as.character(g2t$transcript_id),,drop=FALSE]  

The first new.mat is full with information, but after, I think filtering?, the new.mat is totally empty. May it be to wrong files in reference folder?

@Niklas123Niklas Niklas123Niklas changed the title Merging Matrices in libTSVAggrTransByGene failes= Merging Matrices in libTSVAggrTransByGene failes? Oct 17, 2017
@nunofonseca nunofonseca self-assigned this Oct 18, 2017
@nunofonseca
Copy link
Owner

Hi!
Maybe the cdna file that you used contains transcript ids that are not in the GTF? What files cdna and gtf files are you using?
Cheers.

@Niklas123Niklas
Copy link
Author

Niklas123Niklas commented Oct 18, 2017

Hi,
thanks for the hint!
Homo_sapiens.GRCh38.cdna.all.fa is the cdna file and Homo_sapiens.GRCh38.90.gtf is the gtf file, but I don't remember if I got it from the same source. I try to get new ones and retry it.

Or can you give me a suggetion which files are best to use?

@nunofonseca
Copy link
Owner

It seems that Ensembl recently decided that the transcript ids in the cdna file do not need to be exactly the same as the transcript ids in the gtf file (e.g., ENST00000434970.2 in one file and in the other file you get ENST00000434970). This is basically a data issue, anyway I pushed a change to libTSVAggrTransByGene to detect this issue and try to fix it. Please let me know if it fixes the issue.
Cheers.

@nunofonseca nunofonseca changed the title Merging Matrices in libTSVAggrTransByGene failes? Merging Matrices in libTSVAggrTransByGene fails? Oct 18, 2017
@maximilianh
Copy link

maximilianh commented Oct 18, 2017 via email

@nunofonseca
Copy link
Owner

nunofonseca commented Oct 18, 2017

Yes, it is related.
As I wrote above, I added a workaround for that in the script but I just realized that the problem pops up in another point of the pipeline - thank you Ensembl for making my day!

@nunofonseca
Copy link
Owner

A quick update on this issue. There is another problem with the latest versions of Ensembl - the cdna file contains transcripts that are not in the GTF file or, from another perspective, the GTF file has missing transcripts. For instance, in the Ensembl v90, there are 16k transcripts found in the human cdna file and not found in the matching GTF (e.g., ENST00000631435). Currently iRAP will fail because it checks for consistency between the two files (which no longer seems to exist). To work around this data issue, I' ll change the code today/tomorrow to inform and warn about these inconsistencies and carry on (but not to exit).

Cheers.

@Niklas123Niklas
Copy link
Author

Hey there,

Thank you very much for your efforts! I can try your patch tomorrorw and send results to you next week.

Cheers.

@nunofonseca
Copy link
Owner

Hi,
The code should now be able to cope with the inconsistencies. Please let me know if it works or not for you.
Cheers.

@Niklas123Niklas
Copy link
Author

Hey, i tried it today and I got following error

[DONE] Assembly and quantification
make: *** No rule to make target 'Master2/none/kallisto/transcripts.raw.kallisto.irap.tsv', needed by 'stage3'. Schluss.

@nunofonseca
Copy link
Owner

Hi, it should already be fixed in the latest version?
Cheers

@Niklas123Niklas
Copy link
Author

Oh, thanks.
It works so far now.

@Niklas123Niklas
Copy link
Author

Is there a possibillity to filter low expressed genes? I have a lot of genes that have zero expression and if I include them in DE, edgeR will fail. I removed those genes with expression sums over all samples below 10 in transcripts.raw.kallisto.tsv and now it continues DEA.

@Niklas123Niklas
Copy link
Author

In report generation it still fails with following error:
make: *** No rule to make target 'Master2/report/riq//raw_data/C1_0s1.f.fastqc.tsv', needed by 'Master2/report/fastq_qc_report.tsv'. Schluss.

@nunofonseca
Copy link
Owner

You may use the parameter
de_min_count=10
in the configuration file to remove genes with total expression below 10.

@Niklas123Niklas
Copy link
Author

Hey,

If I use this parameter, I get an error message. Is this a bug?

[INFO] Filtering out genes with low counts (<=5)...
[INFO] Filtering out genes with low counts (<=5)...done.
Fehler in conds[i] <- label2group[[conds[i]]] : Ersetzung hat Länge 0
Ruft auf: map.conds2cols
Ausführung angehalten
/home/niklas/irap_install/scripts/../aux/mk/irap_de.mk:161: recipe for target 'Master/tophat2/htseq2/edger/G0SvsG20S.genes_de.tsv' failed

@nunofonseca
Copy link
Owner

Hi,
Could you rerun the ./scripts/irap_DE_edger script with --debug option and share with me the irap_DE_edgeR.Rdata file that is generated? This will allow me to reproduce the error.
Thanks
PS: you should get the irap_DE_edger command by rerunning irap as follows
irap conf=path2your_conf_file [any options that you are passing in the command line] Master/tophat2/htseq2/edger/G0SvsG20S.genes_de.tsv -n

@Niklas123Niklas
Copy link
Author

Niklas123Niklas commented Nov 7, 2017

Hi,
here you go. Thank you very much!

Rdata.zip

It seems like not all values appear in the opt$label2group object.

@Niklas123Niklas
Copy link
Author

Niklas123Niklas commented Nov 20, 2017

Hey there,
any progress here?
I searched a bit in the R-script. The problem is: The opt$label2group variable does only contain the values for the selected groups.

"C1_0s1"  "C1_0s2"  "C2_0s1"  "C2_0s2"  "C1_90s1" "C1_90s2" "C2_90s1"
"G0S"     "G0S"     "G0S"     "G0S"     "G90S"    "G90S"    "G90S"

But colnames (data.f) contains
"C1_0s1" "C1_0s2" "C1_H2O22" "C1_20s1" "C1_90s1" "C1_90s2" "C2_0s1" "C2_0s2" "C2_H2O21" "C2_H2O22" "C2_20s1" "C2_20s2" "C2_90s1"

So either remove the data.f columns not needed (what I would not do), or add the missing items to the opt$label2group and use them in the design matrix and care for the correct contrast. I think if you remove the not needed columns from testing, you remove information from testing that are helpful. Maybe change this that the whole available information are used?

edit: Further investigation lead me to the error: in irap_de.R in line 195 it says
data.f <- data[rows.sel,],
but it should be:
data.f <- data.f[rows.sel,]

@nunofonseca
Copy link
Owner

Hi, thank you for the Rdata file. I'll look into this issue in the coming days - last week I did not have the time.
Cheers.

@nunofonseca
Copy link
Owner

Hi, indeed the problem was on the line that you mentioned. It should now be fixed in the latest release (0.8.5p5). Many thanks again for the report and the fix ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants