maximum sample number to run and etc. #34
Comments
Here are some answers...
Thanks for your kind reply :) However, I would like to get more explanation about your answer 2 & 4.
I think this quality step causes a difference between "mapped.csv" and "miR.Counts & miR.RPM".
What you mean here seems to be related to the order of the alignment steps. Could I then assume that this is why the ACA57 gene in my result is classified as "ncrna others" rather than "snoRNA"?
I hope this is helpful. Sincerely,
Thank you, Professor Halushka. Many thanks,
Hi, actually I have a new issue regarding answer 2. I ran miRge3.0 three times with sample sizes 24 + 24 + 13, because my 61 sample fastq files range from 700M to 1.2G. I checked the gene lists of the three miR.csv files: they contain exactly the same list of 960 genes. Last time, I was glad to get information about the quality step for miR.csv. I then filtered the exact miRNA and isomiR miRNA rows with the gene list from miR.csv, to apply the miR.csv quality step, and found that 578 exact miRNA genes match the gene list. About 800 isomiR miRNA genes match the gene list, and 578 of them also match an exact miRNA. This means that about 200 isomiR miRNA genes exist without a matching exact miRNA gene. So I'm a little confused about the quality step for the miR.csv file... Thank you for reading my chaos, and it is really cheering whenever I get a reply :)
Hi @juyeong-yi, Yes, for a miRNA to be considered and reported in miR.Counts.csv and miR.RPM.csv, we expect the weight of its reads to come from exact miRNAs rather than isomiRs (in the proportions described earlier by Marc). Regarding the 578 genes and 800 genes, did you check that they are unique and not duplicate miRNAs? For isomiRs there is a wide range of variability in the read sequence, and each of those reads is reported along with its miRNA name. In contrast, the 578 genes are canonical read sequences, so there is only one copy of each sequence. Thus, the 200 isomiRs are probably duplicate names. (Marc may add to this in case I have misunderstood the question.) I was wondering whether you could use counts and RPM to analyze miRNA expression; is there a particular reason for your focus on mapped.csv? In case you want to combine counts and RPM data from different folders, you can use these Python scripts to join them (attached). This is a custom Python script, so you might need to change the output name according to your needs. The script uses a Pandas outer join to join the dataframes from two or more individual runs. The usage is as follows: Directory structure:
I hope this was helpful. Thank you,
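The attached scripts are not reproduced in this thread. As a rough illustration of the outer-join idea only, here is a dependency-free Python sketch (it does not use Pandas and is not the attached script). The column name `miRNA`, the sample names, and the choice to fill missing entries with 0 are assumptions made for the example; whether 0 is the right fill depends on whether absence from one run's table really means no reads.

```python
import csv
import io

def read_counts(text):
    """Parse CSV text with a 'miRNA' column (assumed name) and one column per sample."""
    reader = csv.DictReader(io.StringIO(text))
    samples = [c for c in reader.fieldnames if c != "miRNA"]
    rows = {r["miRNA"]: {s: int(r[s]) for s in samples} for r in reader}
    return samples, rows

def outer_join(tables, fill=0):
    """Keep every miRNA seen in any run; fill samples from runs that lack it."""
    all_samples = []
    for samples, _ in tables:
        all_samples.extend(samples)
    names = sorted({name for _, rows in tables for name in rows})
    merged = {}
    for name in names:
        merged[name] = {}
        for samples, rows in tables:
            row = rows.get(name)
            for s in samples:
                merged[name][s] = row[s] if row is not None else fill
    return all_samples, merged

# Two tiny made-up runs with partly different miRNA lists.
run1 = "miRNA,S1,S2\nhsa-let-7a-5p,10,20\nhsa-miR-21-5p,5,0\n"
run2 = "miRNA,S3\nhsa-let-7a-5p,7\nhsa-miR-99-5p,3\n"
samples, merged = outer_join([read_counts(run1), read_counts(run2)])
for name, row in merged.items():
    print(name, [row[s] for s in samples])
```

An inner join would instead keep only the miRNAs present in every run, which avoids filled values but silently drops rows.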
Thank you @arunhpatil. But I want to ask @mhalushka again. My question is: is it possible that the number of exact miRNAs in mapped.csv is smaller than the number of miRNAs in the miR.csv file? I think I need to explain how I made the integrated mapped.csv. Of course, Arun's pandas combine scripts will be helpful!

Sequence, annotFlag, exact miRNA, hairpin miRNA, mature tRNA, primary tRNA, snoRNA, rRNA, ncrna others, mRNA, isomiR miRNA, Sample1 ~ Sample24

step3) I merged the ID and species columns to filter unique IDs, because there are duplicated IDs per species. This gave me unique transcript IDs per species. I originally had 61 samples, which I divided into three groups (24 + 24 + 13), so I processed the other two matrices in the same way. Then I merged the three matrices by the ID;Species column with the full_join function in R. Some values came out as NA, and I simply set them to "0". I think these are artifact zeros, so maybe I need to use inner_join instead. I want to be sure there are no missing values after merging, because I used full_join.

step4) I separated ID and species again, so the same transcript ID can exist in both the miRNA and isomiR miRNA species.

step5) I wanted to apply the quality step "if >90% of reads for a miRNA are isomiRs (default setting), that miRNA is not reported in miR.Counts or miR.RPM." to the integrated mapped matrix. So I took the miRNA list from miR.Counts or miR.RPM (I checked that they have exactly the same miRNA list, of course) and filtered the exact miRNA and isomiR miRNA rows in the integrated matrix according to that list. That is where I ran into the problem. Please let me know if I have misunderstood anything.

Ju-yeong Yi
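The quality step quoted in step5 can be sketched as a simple fraction test. This is only an illustration of the stated rule, not miRge3.0's code: the function name, the example counts, and the choice to compute the fraction as isomiR reads over exact plus isomiR reads pooled per miRNA are assumptions, and the thread does not say whether the tool applies the rule per sample or across samples.

```python
def reported_mirnas(exact, isomir, max_isomir_fraction=0.9):
    """Return miRNA names whose isomiR share of reads is at most the threshold.

    exact and isomir map miRNA name -> read count (hypothetical inputs).
    """
    kept = []
    for name in sorted(set(exact) | set(isomir)):
        e = exact.get(name, 0)
        i = isomir.get(name, 0)
        total = e + i
        if total == 0:
            continue  # no reads at all, nothing to report
        if i / total > max_isomir_fraction:
            continue  # more than 90% isomiR reads: dropped under the quoted rule
        kept.append(name)
    return kept

# Made-up counts: "miR-b" is 95% isomiR and "miR-d" has isomiR reads only.
exact = {"miR-a": 50, "miR-b": 5, "miR-c": 30}
isomir = {"miR-a": 50, "miR-b": 95, "miR-d": 10}
kept = reported_mirnas(exact, isomir)
print(kept)
```

Under this reading, a miRNA that appears only in the isomiR rows would never pass the filter on its own, which is one reason to check names for duplicates or merged entries before comparing lists.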
Hi. I am a little confused, so let me start with this question: "is it possible that the number of exact miRNAs in mapped.csv is smaller than the number of miRNAs in the miR.csv file?"

The number of miRNAs in mapped.csv should not be smaller than in the miR.csv file. HOWEVER, we use a "merges" file to combine some miRNAs together. These are miRNAs with known SNPs, or miRNAs that are so similar that we use this type of nomenclature: "hsa-miR-27a-3p/27b-3p." So if a miRNA has fewer reads in the mapped.csv file than in the miR.csv file, see if there were separate "SNP" versions of the miRNA in the mapped files. The SNPs were added so that we wouldn't inadvertently remove a miRNA if someone was homozygous for the rare allele, which would have given us no exact matches. I also think this is why you have more miRNAs in the mapped.csv file: reads could go to hypothetical SNPA or SNPB versions of the same miRNA.

This is generally why we think users should use the miR.csv file for downstream applications. If you want to look at non-miRNAs and non-tRFs, then you are correct that you will have to use the full mapped.csv file. However, using the full denominator of all reads in the mapped.csv file is a little dangerous, as the ratio of different RNA classes can easily be a feature of technical factors such as RNA quality and library preparation, which make different RNA lengths more or less common. We try to do analyses within the same class of RNA. I hope this is helpful.
Thank you for the response. I really appreciate your clear explanations every time.
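To make the effect of merging concrete, here is a minimal Python sketch of how collapsing names reduces the number of distinct miRNAs while preserving total reads. The layout of the real "merges" file is not shown in this thread, so the `merges` dictionary below is a hypothetical stand-in, and the counts are made up; only the "hsa-miR-27a-3p/27b-3p" name comes from the reply above.

```python
from collections import defaultdict

def collapse_merged_names(counts, merges):
    """Sum read counts under the merged name; unmerged names pass through unchanged."""
    collapsed = defaultdict(int)
    for name, n in counts.items():
        collapsed[merges.get(name, name)] += n
    return dict(collapsed)

# Hypothetical mapping from individual names to a merged name.
merges = {
    "hsa-miR-27a-3p": "hsa-miR-27a-3p/27b-3p",
    "hsa-miR-27b-3p": "hsa-miR-27a-3p/27b-3p",
}
# Made-up per-name counts as they might appear before merging.
counts = {"hsa-miR-27a-3p": 10, "hsa-miR-27b-3p": 4, "hsa-miR-21-5p": 7}

collapsed = collapse_merged_names(counts, merges)
print(len(counts), "names before,", len(collapsed), "names after")
print(collapsed)
```

The same pattern would apply to SNP variants mapped back to one parent name, which is why name counts in mapped.csv and miR.csv need not agree.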
Questions
Is there a maximum sample number when running miRge3.0?
The data size ranges from 700M to 1.2G per sample fastq.gz file.
When I ran miRge3.0 with 61 samples using the option "-s", it could not produce "mapped.csv".
I only got the miR counts & RPM csv files and the hairpin miR / miRNA / mRNA / ncRNA / pre tRNA / rRNA / snoRNA / tRNA sam files. Maybe the sam files couldn't be assembled, so there is no "mapped.csv" file.
I'm wondering whether miR.Counts.csv & miR.RPM.csv are built from "exact miRNA" only or from all of the miRNAs, including "exact miRNA", "hairpin miRNA" and "isomiR miRNA".
In addition, does "isomiR miRNA" come from the miRNA sam file? I couldn't find a sam file for isomiR miRNA among those listed in Q1, yet mapped.csv contains "isomiR miRNA".
I'm not sure whether grouping the samples could affect the per-sample results from miRge3.0.
In the end, I will merge the three groups into one matrix. Therefore, if grouping the samples has no effect when running miRge3.0, I assume that merging the three groups is valid for further analysis.
I found that the ACA57 gene belongs to "ncrna others" in the mapped.csv matrix; however, this gene is classified as a snoRNA according to the GeneCards site (https://www.genecards.org/cgi-bin/carddisp.pl?gene=SCARNA11).
Of course, I couldn't get any novel prediction results from running 61 samples with the option "-s".
Therefore, I ran miRge3.0 three times with three groups (61 = 24 + 24 + 13): group 1 has 24 samples, group 2 has 24 samples, and group 3 has 13 samples.
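The 24 + 24 + 13 grouping corresponds to cutting the sample list into consecutive batches of at most 24. A minimal Python sketch, with hypothetical file names and no miRge3.0 command line implied:

```python
def split_into_batches(samples, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

# Hypothetical file names standing in for the 61 fastq.gz files.
samples = [f"sample{n:02d}.fastq.gz" for n in range(1, 62)]
batches = split_into_batches(samples, 24)
print([len(b) for b in batches])
```

Each batch would then be given to a separate run, and the per-run tables joined afterwards.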
What I tried and found
Additionally, I ran miRge3.0 with different sample sizes to find the maximum sample number.
Sample size 24 (n = 24) completed successfully. I got mapped.csv, unmapped.csv and the novel prediction results for each sample.
Sample size 30 (n = 30) was killed while predicting novel miRNAs. I got mapped.csv and unmapped.csv, but not unmapped.log or the novel miRNA prediction results.
Sample size 40 (n = 40) was killed while summarizing and tabulating results. I got the miRNA csv files and the sam files for the other RNA classes, but not mapped.csv, unmapped.csv or the novel prediction results.
Somewhat surprisingly, sample size 50 (n = 50) completed successfully, so I got mapped.csv, unmapped.csv and the novel prediction results.
Spec of server used
Lastly, I want to share some information about the server specification.
cpuinfo - cpu MHz : 1084.270
- cache size : 16896 KB
- cpu cores : 12
raminfo - MemTotal: 394843400 kB
- MemFree: 1677012 kB
- MemAvailable: 314266072 kB
- Buffers: 189160 kB
- Cached: 308566652 kB
- SwapCached: 38496 kB
- Active: 160510648 kB
- Inactive: 221927100 kB
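When reading these figures, note that MemFree is small mainly because most memory is held as page cache (Cached), while MemAvailable is the kernel's estimate of what new processes can use without swapping. The Python sketch below parses `/proc/meminfo`-style text and converts the posted values to GiB. It is only a reading aid: the thread does not establish that the killed runs were caused by memory limits.

```python
def parse_meminfo(text):
    """Parse '/proc/meminfo'-style lines ('Key: value kB') into a dict of kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a 'Key:' prefix
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values

def kib_to_gib(kib):
    """Convert kB as reported by the kernel (KiB) to GiB."""
    return kib / (1024 * 1024)

# Values copied from the server specification above.
posted = """MemTotal: 394843400 kB
MemFree: 1677012 kB
MemAvailable: 314266072 kB
Cached: 308566652 kB"""

info = parse_meminfo(posted)
for key in ("MemTotal", "MemFree", "MemAvailable", "Cached"):
    print(f"{key}: {kib_to_gib(info[key]):.1f} GiB")
```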
Please let me know if any other information or server specification is needed to answer my questions.
Thank you for reading. I'll be waiting for a reply.