Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

input only 2~3 isomiRs / missing mismatch for 3'trim variation #17

Open
kicheolkim opened this issue Apr 25, 2020 · 7 comments
Open

input only 2~3 isomiRs / missing mismatch for 3'trim variation #17

kicheolkim opened this issue Apr 25, 2020 · 7 comments

Comments

@kicheolkim
Copy link

kicheolkim commented Apr 25, 2020

Hi,
I found 2 possible error/bug while using the package. I think first one is isomir package issue, but the second one may be related to miRTop.

input only 2~3 isomiRs

  • I was able to input my data successfully. But I realized the count was not matched with original input. For example in one miRNA, original input (rawData from miRTop) has over 100 different types of isomiRs. But I found only 3 of them in the input data. It's not cutoff since I see very low counts, but it seems that only the top 2 or 3 expressed isomirs were loaded into R dataset. Could you check this error? Let me know if you need any information or my data.
    Data I tested for input is here, and I used this for input dataset:
de <- data.frame(row.names=c("S400001603","S400001616","S400001617"), condition = c("cc","bb","aa"))
mirData <- IsomirDataSeqFromRawData(read_tsv("mirtop_rawData.tsv"), de)

I saw many warnings during input, is it possible that is related to this warning?

Warning messages:
1: In prop.test(value, total, pctco, alternative = "greater") :
  Chi-squared approximation may be incorrect
2: In prop.test(value, total, pctco, alternative = "greater") :
  Chi-squared approximation may be incorrect

missing mismatch for 3'trim variation

  • I also found an error in the rawData file exported from miRTop. If the sequence has 5' or 3' variation, mismatch (nucleotide variation) was not recorded together. Could you also check about it?
    Here is an example... Those sequences have different mismatches and 3' trimming together, but mism is 0 in the table.
seq mir mism add t5 t3 S400001603 S400001616 S400001617
TCAGGTAGTAGGTTGTAT hsa-let-7a-5p 0 0 0 agtt 2 1 1
TGAGGTAGTAGGTTGTAA hsa-let-7a-5p 0 0 0 agtt 5 6 11
TGAGGTAGTAGGTTGTAT hsa-let-7a-5p 0 0 0 agtt 197 166 351

I used the development version for both miRTop and isomiR package which I installed last week.

Thanks!!
Kicheol

@svattathil
Copy link

svattathil commented Jul 30, 2020

I have noticed something similar to the first issue described by @kicheolkim. I am using isoMirs version 1.16.2 and output from miraligner version 3.5, and observe that the counts I get from an IsomirDataSeq object created using IsomirDataSeqFromFiles differs from the counts I calculate using base R code. I also observe that the results differ using IsomiRs functions when I load different files at a time. I have attached some test files and some test code. Is there perhaps some filtering happening, or lines dropped during merging?

Thanks for making this package! Happy to provide more information if it would help.

rm(list=ls())
library("isomiRs")

## create file list
FILELIST <- as.list(paste0("test", 1:5, ".mirna"))

## convenience variables
own <- "own"
isom5 <- "isom5"
isom2 <- "isom2"
isom1 <- "isom1"

## use base R functions
origdats <- lapply(FILELIST, read.table, sep="\t", header=TRUE)

## use IsoMiRs package functions
covars <- data.frame(filename = as.character(FILELIST))
rownames(covars) <- paste0("test", 1:5)
obj5 <- IsomirDataSeqFromFiles(FILELIST, covars, uniqueMism = FALSE, rate=0, canonicalAdd=FALSE, uniqueHits=FALSE)
obj2 <- IsomirDataSeqFromFiles(FILELIST[1:2], head(covars,2), uniqueMism = FALSE, rate=0, canonicalAdd=FALSE, uniqueHits=FALSE)
obj1 <- IsomirDataSeqFromFiles(FILELIST[[1]], head(covars,1), uniqueMism = FALSE, rate=0, canonicalAdd=FALSE, uniqueHits=FALSE)

## compare
stats <- vector(mode="list")
stats[[own]] <- data.frame(mappedtotal = sapply(origdats, function(x) { sum(x$freq) }))
stats[[isom5]]  <- data.frame(mappedtotal = apply(counts(isoCounts(obj5)), 2, sum))
stats[[isom2]] <- data.frame(mappedtotal = apply(counts(isoCounts(obj2)), 2, sum))
stats[[isom1]] <- data.frame(mappedtotal = sum(metadata(obj1)[["rawData"]]$test1)) # error when applying isoCounts --  doesn't work with just one sample?
stats

testfiles.zip

@svattathil
Copy link

After looking at the code I bit, I see that there is some internal filtering happening (e.g. clean_noise and remove_gt_n_changes applied to raw data), which explains the discrepancy. Would be helpful to have these filters noted in the documentation, and perhaps have an option to output the raw data without any filtering. But it is also useful to have the filters!

@lpantano
Copy link
Owner

Hi,

I totally missed this, it was when somehow I got turned off notification, so I apologize. It was not intentional.

I need to look a the code, it is true is due to some filtering: see this for more explanation #19 (comment), the parameters that affect the filtering are those.

The mirtop issue is more concerning. Did you use mirtop to convert from other tool? if you could share the command you use, I can do more, or the inputs. Sorry I missed this, happy to help if still needed.

cheers

@kicheolkim
Copy link
Author

Hi, I'm glad you come back. I'm using sRNAtoolbox. So, I converted mirtop from sRNAtoolbox output. Here are the command lines that I used.

mirtop gff --sps hsa --hairpin hairpin.fa --gtf hsa.gff3 --format srnabench -d --out-format gff -o ./ ./sample1 ./sample2 ./sample3

mirtop export -o ./ --sps hsa --hairpin hairpin.fa --gtf hsa.gff3 mirtop.gff

mirtop counts --gff mirtop.gff --hairpin hairpin.fa --gtf hsa.gff3 -o ./

@lpantano
Copy link
Owner

Thank you. I am aware that data coming from that tool can generate some issues. Can you check those sequences in the output of sRNAbench and see how they are defined? Normally Mir top will take the definition from the tool, so I am wondering how are defined from the very beginning.

Let me know, I am very intrigued. You can share the output file as well from sRNAbench if you would like.

@kicheolkim
Copy link
Author

Sure, here is the information of 3 isomiRs. You can also find sRNAbench (sRNAtoolbox) output from here. Hopefully, this would be helpful to solve the problem.

sequence matureName hairpinName isoLabel sequenceVariant readCount RPMlib RPMtotal
TCAGGTAGTAGGTTGTAT hsa-let-7c-5p$hsa-let-7a-5p hsa-let-7c$hsa-let-7a-1$hsa-let-7a-3$hsa-let-7a-2 lv3p|lv3pT|lv3p#-4 - 2 1.1136043475113726 0.2590028081084455
TGAGGTAGTAGGTTGTAA hsa-let-7c-5p$hsa-let-7a-5p hsa-let-7c$hsa-let-7a-1$hsa-let-7a-3$hsa-let-7a-2 lv3p|lv3pT|lv3p#-4 - 5 2.7840108687784317 0.6475070202711137
TGAGGTAGTAGGTTGTAT hsa-let-7c-5p$hsa-let-7a-5p hsa-let-7c$hsa-let-7a-1$hsa-let-7a-3$hsa-let-7a-2 lv3p|lv3pT|lv3p#-4 - 197 109.69002822987021 25.511776598681884

Thank you!!

@lpantano
Copy link
Owner

Hi, thank you for this. Actually, I can see how these three sequences are annotated only as 3p variants with -4 as the nucleotide variation (lv3p|lv3pT|lv3p#-4). Because, mirtop, only translates outputs, it doesn't check 5p or mismatches, because are not described in the original file. We can add some warnings, but it becomes complex to validate other tools.

Let me know your thoughts. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants