Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

maximum sample number to run and etc. #34

Closed
juyeong-yi opened this issue Mar 8, 2022 · 9 comments
Closed

maximum sample number to run and etc. #34

juyeong-yi opened this issue Mar 8, 2022 · 9 comments

Comments

@juyeong-yi
Copy link

juyeong-yi commented Mar 8, 2022

Questions

  1. I have 61 samples, then could I get result from 61 samples? Specifically,

is there a maximum sample number when running miRge3.0 ?

The data size ranges from 700M to 1.2G per sample fastq.gz file.
Because when I ran miRge3.0 with 61 samples using optioin "-s", miRge3.0 couldn't make the result as "mapped.csv".
I only got miR counts & RPM csv files and hairpin miR / miRNA / mRNA / ncRNA / pre tRNA / rRNA / snoRNA / tRNA sam files - sam files maybe coudln't be assembled, so there is no "mapped.csv" file.

  1. In mapped.csv, there are RNA classes "exact miRNA", "hairpin miRNA", "mature tRNA", "primary tRNA", "tRNA", "snoRNA", "rRNA", "ncrna others", "mRNA", "isomiR miRNA".

I'm wondering that miR.Counts.csv & miR.RPM.csv are made by "exact miRNA" only or by all of the miRNAs including "exact miRNA", "hairpin miRNA", "isomiR miRNA".
In addition, is "isomiR miRNA" came from miRNA sam file? Because I couldn't find sam file for isomiR miRNA like Q1, however in mapped.csv there is "isomiR miRNA".

  1. I conducted miRge3.0 separately in tree times - separating the 61 samples in three groups (61 = 24 + 24 + 13). I could get three mapped.csv and I united them depending on their transcript ID such as "ENST00000542763.2".

I'm not sure the grouping samples could affect the result of each sample from miRge3.0.
In the end, I will merge the three groups in one matrix. Therefore, if there is no effect from grouping samples when conducting miRge3.0, I assume that merging three groups is valid to further analysis.

  1. I would like to ask that "ncrna others" (in mapped.csv) includes long non-coding RNAs, snRNAs, piRNAs and sometime snoRNAs.

I found ACA57 gene belongs to "ncrna others" in mapped.csv matrix, however this gene is affiliated in snoRNA according to gene card site (https://www.genecards.org/cgi-bin/carddisp.pl?gene=SCARNA11).

  1. Similar with Q3, I would like to be assured that there is no effect from grouping samples when predicting novel miRNAs by miRge3.0.

Of course, I couldn't get any novel prediction result from running 61 samples with option "-s".
Therefore, I conducted miRge3.0 three times with three groups (61 = 24 + 24 + 13) - group1 sampe number is 24, group2 sample number is 24, and group3 sample number is 13.

What I tried and found

Additionally, I tried to run miRge3.0 to find out the maximum sample number.

  1. Sample size 24 (n = 24) was done successfully. Then I could get mapped and unmmped.csv and novel prediction result from each sample.

  2. Sample size 30 (n = 30) was killed when predicting novel miRNAs. Then I could get mapped and unmmaped.csv but I couldn't get unmapped.log and novel miRNA prediction results.

  3. Sample size 40 (n = 40) was killed when summarizing and tabulating results. Then I could get miRNA csv files and other classes RNA sam files, so I couldn't get mapped.csv and unmapped.csv and novel prediction results.

  4. This is somewhat surprising result, sample size 50 (n = 50) was successfully done. So I could get mapped.csv and unmapped.csv and novel prediction.

Spec of server used

In last, I want to refer some information about server specification.

cpuinfo - cpu MHz : 1084.270
- cache size : 16896 KB
- cpu cores : 12
raminfo - MemTotal: 394843400 kB
- MemFree: 1677012 kB
- MemAvailable: 314266072 kB
- Buffers: 189160 kB
- Cached: 308566652 kB
- SwapCached: 38496 kB
- Active: 160510648 kB
- Inactive: 221927100 kB

Please let me know if other information or server spec is needed to answer my questions.
Thank you for reading and I'll be waiting for reply.

@mhalushka
Copy link
Owner

Here are some answers...

  1. There is no maximum sample number, but if samples are large, you do run into memory limitations. We have run 100 small samples (~100 MB) in one run, and that was fine. For larger sample sizes, or when samples are very different (will not collapse well), we would recommend doing smaller batches. We typically run 10 samples together of that size.
  2. miR.Counts.csv & miR.RPM.csv are made by all miRNAs (exact or isomiR). Hairpin is not included. Also, if >90% of reads for a miRNA are isomiRs (default setting), that miRNA is not reported in miR.Counts or miR.RPM. That is a quality step unique to miRge.
  3. Grouping, as you did, is fine and we use that strategy frequently.
  4. It is possible that the ncRNAs are not perfectly separated by RNA classes. That part of miRge is mostly designed to capture other RNAs to not interfere with the isomiR step.
  5. Novel miRNA prediction is much more memory intensive and it is not surprising that these did not run to completion in those batch sizes. I do not think batch size will affect novel miRNA prediction. I don't know how you got 50 to run when smaller batches failed. Was there other activity on the server at other times?
  6. Essentially, you can run your samples as smaller batches, which will process well and can be combined at the end without loss of data.

@juyeong-yi
Copy link
Author

juyeong-yi commented Mar 10, 2022

Thanks for your kind reply :)

However, I would like to get more explanation about your answer 2 & 4.

  1. "Also, if >90% of reads for a miRNA are isomiRs (default setting), that miRNA is not reported in miR.Counts or miR.RPM. That is a quality step unique to miRge."

I think this quality step causes a difference between "mapped.csv" and "miR.Counts & miR.RPM".
The number of total miRNA (exact miRNA + iso miRNA) would be larger in mapped.csv than in miR.Counts & miR.RPM.
In my assumption, if I want to analyse non-coding small RNA data after alignment, I need to use mapped.csv file.
(I'm little bit worried about what I have to use... It could be depend on my decision, but as a novice researcher I want to get advice.)
So here is my question, why is the quality step necessary?
If their is any biological or statistical meaning, then it will be very helpful for me to understand miRge3.0 and conduct more reliable research.

  1. "That part of miRge is mostly designed to capture other RNAs to not interfere with the isomiR step."

What you mean here seems to be related with the order of alignment step.
According to "Supplementary Data 1_revised" posted with miRge3.0 paper,
"There is an initial tight alignment to a modified miRNA library with no mismatches followed by alignments to libraries of hairpin miRNA, small nucleolar RNA (snoRNA), ribosomal RNA (rRNA) and other non-coding RNA allowing a single mismatch. Only the top best alignment for primary and mature tRNAs with 1 and 0 mismatches respectively are retained. The remaining unaligned reads are further aligned to messenger RNA (mRNA) library with 0 mismatches followed by a loose alignment to the original modified miRNA library allowing up to 2 mismatches to identify isomiRs."

Then could I assume that the reason why the ACA57 gene in my result is affiliated in "ncrna others" rather than "snoRNA"
is that mismatches occured more than one time in the step of snoRNA alignment ?

  1. Actually, I couldn't figure out the reason yet. I will post it to share if I can find it...! I hope..!

@mhalushka
Copy link
Owner

  1. The reason for the >90% threshold is that many of the “miRNAs” listed in miRBase are not real miRNAs, but are repeat elements. Because there are so many repeat element sequences in the genome, and because some repeat elements are transcribed, it is clear that these random sequences get pulled into the “miRNA” isomiRs of non-miRNAs. We have seen examples where there are 1-2 canonical sequences of a miRNA and 30,000 “isomiRs” of that miRNA. No bona fide (real) miRNAs have that pattern. We did a lot of analysis and found that all real, reasonably expressed miRNAs have at least 50% of reads match the canonical sequence. At lower expression levels that might be a little less, so we set the threshold as needing at least 10% of all read being canonical (>90% threshold). This greatly reduces false positive. Other alignment tools either do not have a problem with this because they allow no mismatch (but fail to capture many isomiR classes) or they have a problem with this and do not correct for it leading to false positive information. Our isomiR alignment step is “loose” so we are careful about this problem. If you want the best total miRNA value, you should use miR.Counts or miR.RPM. If you are looking at isomiRs, you will need to go back to the mapped.csv file, but avoid miRNAs that have 0 values in miR.Counts.

  2. ACA57 might have appeared in the ncRNA instead of snoRNA because we have it in the wrong library. I think we got our RNA libraries from multiple sources and do not recall anymore if we took snoRNAs out of a larger file or obtained them from a resource that was only snoRNAs. Either way, it should be counted properly, but in the wrong step.

I hope this is helpful.

Sincerely,
Marc

@juyeong-yi
Copy link
Author

Thank you professor Halushka,
I totally got understand what you mean.
Your explanation is so clear and informative!
Of course, it is so helpful and meaningful for my research.
I will develop my analysis strategy based on your advice.

Many thanks,
Ju-yeong Yi

@juyeong-yi
Copy link
Author

juyeong-yi commented Mar 22, 2022

Hi, actually I have a new issue on 2 again.

I ran miRge3.0 in three times with sample size 24 + 24 + 13, because my 61 sample fastq files ranges from 700M to 1.2G.
I made a matrix from three of mapped.csv files using R function "full join".
(Of course I got NA in some values, because of full_join, I just changed NA into 0. Now I know it could be a problem, so I will use inner_join next time.)
So, with full_join, there is no missing genes in the integrated matrix.

And I checked the gene lists of three miR.csv files. They have exactly the same number of genes and gene list. The number of genes in the three miR.csv files is 960.
However, I found that the number of exact miRNA in the integrated mapped.csv is 893 and the number of iso miRNA is about 1200.

Last time, I was appreciated to get information about the quality step for miR.csv.
According to "if >90% of reads for a miRNA are isomiRs (default setting), that miRNA is not reported in miR.Counts or miR.RPM." this sentence, I think the number of exact miRNA in the united mapped.csv is at least bigger than 960.

Furthermore I filtered exact miRNA and iso miRNA with the gene list from miR.csv to apply the quality step for miR.csv, and I found 578 genes of exact miRNA is matched with the gene list. About 800 genes of iso miRNA is matched with the gene list and among them 578 genes are matched with exact miRNA. It means that about 200 iso miRNA genes exist without matched exact miRNA genes. So I'm little bit confused with the quality step for miR.csv file...
If I missed some concepts, please advise me!

Thank you for reading my chaos.. and it is really cheerful whenever I got reply :)

@arunhpatil
Copy link
Collaborator

Hi @juyeong-yi,

Yes, in order to consider and report the miRNA in miR.Counts.csv and miR.RPM.csv, we expect the weightage coming from exact miRNAs mapped over the isomiRs (as the proportions described earlier by Marc).

Regrading the 578 genes and 800 genes, did you check if they are unique and not duplicate miRNAs? Since, when it comes to isomiRs, there are wide range of varibality in the read sequence and each of those reads are reported along with their miRNA names. Further, 578 genes are canonical read sequences, there is only one copy of the sequence. Thus, the 200 isomiRs are probably duplicate names. (Marc may add to this incase if I have misunderstood the question).

I was wondering if you could use counts and RPM to analyze the miRNAs expression, is there a particular reason for your focus on mapped.csv? In case you want to combine counts and RPM data from different folders you can use these python scripts to join them (attached). This was custom Python script, so you might need to change the output name according to your needs. The script uses Pandas outer join to join the dataframes from two or more individual runs. The usage is as follows:

Directory structure:

____| input_dir____
                            |_____miRge.2021-12-12_17-20-41
                            |_____miRge.2021-12-12_18-20-41
                            |_____miRge.2021-12-12_19-20-41

Usage: python pandas_combine_RPM.py input_dir
Output: combined_RPM.csv

scripts.zip

I hope this was helpful.

Thank you,
Arun.

@juyeong-yi
Copy link
Author

juyeong-yi commented Mar 23, 2022

Thank you @arunhpatil .
Briefly, the reason why I focus in mapped.csv, specifically why I made the integrated mapped.csv matrix is... I want to apply DESeq normalization to compare sample to sample. Furthermore I will apply SVA and RUV normalization, because I found there are unwanted and unknown batch effects - I mean batch effect caused from technical things not from biological signal. (And batch1, batch2, batch3 ... from experiment they are not what I mean here.)
I will apply these normalization after removing mRNAs and rRNAs, because the fastq files are made by Small RNA sequencing. So only small RNA would be the target of this sequencing.
Therefore, I want to use whole small RNA matrix for normalization, I think it is more proper to data originated from small RNA sequencing even though miR.Count and miR.RPM are convenient to use.
Of course I am novice in RNA analysis, I need your recommendation and opinion. Please feel free to show what you think about, it would be really helpful.

But, I want to ask again @mhalushka ..
Because I already made my integrated mapped.csv have unique transcript ID per RNA class.

What I want to ask here is, is that possible the number of exact miRNA in mapped.csv is smaller than the number of miRNA in miR.csv file?

I think I need to post how I made the integrated mapped.csv, of course Arun's panda combine scripts will be helpful!
step1) I made new column "species" to labeling RNA classes. I wanted to gather all RNA classes in one column "species".
step2) I united all RNA classed columns into "ID" column. And I removed sequence column and annotFlag column.

Sequence annotFlag exact miRNA hairpin miRNA mature tRNA primary tRNA snoRNA rRNA ncrna others mRNA isomiR miRNA Sample1 ~ Sample24
-->
ID Species Sample1 ~ Sample24 (Species column have "exact miRNA", "hairpin miRNA", "mature tRNA", "primary tRNA", "snoRNA rRNA", "ncrna others", "mRNA", "isomiR miRNA". ID column have transcript ID.)

step3) I merged ID and species colum to filter unique ID, because there are duplicated IDs per species. I made there are unique transcripts IDs per species. Also at first I have 61 sample and I divide them in three groups (24+ 24 + 13). So I processed the other two matrices in same way. And then I merged three matrices by ID;Species column with full_join function in R. In some values, NA showed up and I simply made it into "0". I think it is artifact 0, so maybe I need to use inner_join. I want to be assured there is no missing value when merging because I used full_join.
-->
ID;Species Sample1 ~ Sample24
-->
ID;Species Sample1 ~ Sample61

step4) I separated ID and species again. So there could be exist same transcript ID between miRNA and iso miRNA species.
-->
ID Species Sample1 ~ Sample61

step5) I wanted to apply the quality step "if >90% of reads for a miRNA are isomiRs (default setting), that miRNA is not reported in miR.Counts or miR.RPM." to the integrated mapped matrix. So I just brought miRNA list from miR.Count or miR.RPM (I checked they have exactly same miRNA list, of course.) and I filtered exact miRNA and iso miRNA in the integrated matrix according to the miRNA list.

Then I got some problem.
According to "if >90% of reads for a miRNA are isomiRs (default setting), that miRNA is not reported in miR.Counts or miR.RPM." this, miR.csv files have smaller list of miRNA than the mapped.csv.
And I also merged three mapped.csv then the integrated matrix should have larger miRNA list, I think.
But in my integrated matrix (of course they have unique transcript ID per RNA species), It has smaller number of exact miRNA.

Please note me if there is any misunderstanding point.
Thank you for reading my long Question and explanation.

Ju-yeong Yi

@mhalushka
Copy link
Owner

Hi. I am a little confused, so let me start with this question: What I want to ask here is, is that possible the number of exact miRNA in mapped.csv is smaller than the number of miRNA in miR.csv file?

The number of miRNAs in mapped.csv should not be smaller than in the miR.csv file. HOWEVER, we use a "merges" file to combine some miRNAs together. These are miRNAs with known SNPs, or miRNAs that are so similar we use this type of nomenclature "hsa-miR-27a-3p/27b-3p." So if a miRNA has fewer reads in the mapped.csv file than the miR.csv file, see if there were separate "SNP" versions of the miRNA in the mapped files. The SNPs were added so that we wouldn't inadvertently remove a miRNA if someone was homozygous for the rare allele, which would have give us no exact matched. I also think this is why you have more miRNAs in the mapped.csv file, reads could go to hypothetical SNPA or SNPB versions of the same miRNA.

This is generally why we think users should use the miR.csv file for downstream applications. If you want to look at non-miRNAs and non-tRFs, then you are correct that you will have to use the full mapped.csv file. However, using the full denominator of all reads in the mapped.csv file is a little dangerous as the ratio of different RNA classes can easily be a feature of technical factors such as RNA quality and library preparation such that different RNA lengths are more/less common. We try to do analyses within the same class of RNA.

I hope this is helpful.

@juyeong-yi
Copy link
Author

Thank you for response.
I think my long and unorganized writing made you confused.
I will organize it and make the question more specified with example.
And I will discuss it again.

I really appreciate every time about your nice explanation.
Ju-yeong Yi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants