Testing allelic analysis #2
Comments
Hello Mike,

Thank you for bringing this to my attention; I am currently looking into these issues.

Invalid VCF Fix

I was able to duplicate the error by removing the fileformat field in the header. I believe adding something like

Hopefully that fixes the current issue, but I have a feeling there may be other header-related problems. I'll continue to improve the file handling and catch issues like this. Please keep me posted on any other issues you run into; it's very helpful.

WASP to WASP2 Compatibility

As for your second question, I believe it's possible, but I recommend trying to get the new count pipeline working with your data. WASP2 currently outputs its count data as an 8-column TSV with a header.

ATAC-Seq Format:
The RNA-Seq counts include an additional column called feature, and replace peak with genes.

RNA-Seq Format:
We can denote which feature we want to analyze with the --feature flag. (If you don't include the flag, it will automatically analyze all the unique features found in the count file; for future reference, valid inputs would probably be something along the lines of "transcript", "exon", etc.) With this in mind, it might be a lot harder to convert RNA-Seq data to the appropriate format, and there are still many improvements to be made to the RNA-Seq pipeline, so the inputs may change in the future.

In regard to previous WASP count data, the easiest conversion would be to use the output of WASP's bam2h5.py script with the --txt_counts option. The text file will have columns: chromosome, snp_position, ref_allele, alt_allele, genotype, ref_allele_count, alt_allele_count, other_count.

counts.txt.gz Format:
There are 2 additional filtering steps that WASP2 does that will have to be done manually.
After these two steps, the count data should have the required information. The data should be compatible once you include the appropriate header info and columns. To avoid issues, I would drop any unnecessary columns, and

Please let me know if you have any additional questions or concerns! Thanks, |
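For anyone attempting the WASP conversion described above, here is a minimal sketch of reading a WASP --txt_counts file into typed records. It relies only on the eight column names listed in the comment. The whitespace delimiter, the absence of a header line, and the names `SnpCount`, `parse_wasp_counts` and `read_wasp_counts` are assumptions for illustration. It does not perform the two WASP2 filtering steps or produce the WASP2 header.

```python
import gzip
from typing import Iterator, List, NamedTuple, TextIO


class SnpCount(NamedTuple):
    """One row of a WASP --txt_counts file (column names from the thread)."""
    chromosome: str
    snp_position: int
    ref_allele: str
    alt_allele: str
    genotype: str
    ref_allele_count: int
    alt_allele_count: int
    other_count: int


def parse_wasp_counts(handle: TextIO) -> Iterator[SnpCount]:
    """Yield one SnpCount per non-empty line of an open text handle."""
    for line in handle:
        # Assumption: fields are whitespace-delimited and there is no header.
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 8:
            raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
        chrom, pos, ref, alt, gt, ref_n, alt_n, other_n = fields
        yield SnpCount(chrom, int(pos), ref, alt, gt,
                       int(ref_n), int(alt_n), int(other_n))


def read_wasp_counts(path: str) -> List[SnpCount]:
    """Read a gzipped counts file such as counts.txt.gz."""
    with gzip.open(path, "rt") as handle:
        return list(parse_wasp_counts(handle))
```

Keeping positions and counts as `int` from the start avoids any later int/float mismatch when the rows are joined against other coordinate-based files.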
Thanks for the detailed info Aaron! I may start working on the analysis step, because I can manually get the counts from the

As far as WASP2 counting, I got further but got stuck again: I got past the VCF issues by doing as you said, also adding FORMAT tag lines, then bgzip and tabix.
I'm getting the following error after getting past VCF issues:
|
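The header fix discussed above can be sketched as a small pre-check run before compressing and indexing the VCF. This is a hedged illustration, not the snippet from the original comment: the function name `ensure_fileformat` and the default version string `VCFv4.2` are assumptions, and it only handles the missing `##fileformat` line, not the FORMAT tag lines mentioned above.

```python
from typing import Iterable, List


def ensure_fileformat(lines: Iterable[str], version: str = "VCFv4.2") -> List[str]:
    """Return the VCF lines with a ##fileformat line as the first line.

    The VCF specification requires ##fileformat to be the first header
    line; the version used here is an assumption, so adjust it to match
    the file being repaired.
    """
    out = list(lines)
    if out and out[0].startswith("##fileformat="):
        return out  # already valid in this respect, leave untouched
    return [f"##fileformat={version}\n"] + out
```

The repaired text would still need to be written out, compressed with bgzip, and indexed with tabix, as described in the comment above.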
Hi Mike, Glad to hear we're getting a bit further into the pipeline! As for the current issue, I believe this is an issue with the preprocessing of the gtf. You're currently using the option
Usually this will filter the gtf to look for features such as transcripts, but in this case it will look for a feature named

I recommend removing the
One option that will be particularly useful for debugging is...
What this will do is store the intermediary files in some location. Being able to look at these files should help pinpoint the issue better. I believe the filtered files

Let me know how these fixes go! Hopefully they help solve the current issue. |
Oh, sorry about that ... I wasn't reading carefully obviously and just saw "features ... in gtf". In the README you have
This is totally correct and I was just reading too fast, but maybe for the lazy readers like me:
I've got a new error related to parsing the GTF, whether I set
The GTF in question is from Ensembl: http://ftp.ensembl.org/pub/release-100/gtf/drosophila_melanogaster/ I looked into
|
Ah, I think you just caught a bug. I believe the error is because the bam filtering step expects regions in bed format as opposed to gtf. I should be able to make a quick bugfix by the beginning of the week. In the meantime, I believe including the

Essentially, the pipeline filters the bam file to only include reads that overlap regions of interest. This could cut down a lot of time during the counting step, but could also take a lot of time itself. So the

Let me know if this works! |
Yes I think you're right about it expecting BED:
I read in the GTF into Bioconductor and wrote out a BED file with rtracklayer. Proceeding this way works only if I switch to
One note: the After fixing that, I can count alleles:
Happy to test out the GTF version when you have time. |
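The GTF-to-BED workaround above was done with rtracklayer in R. For readers without Bioconductor, a rough Python equivalent might look like the sketch below. The function name `gtf_to_bed` and its `feature` argument are hypothetical and are not part of WASP2. It emits only the three required BED columns.

```python
from typing import Iterable, Iterator, Optional


def gtf_to_bed(lines: Iterable[str], feature: Optional[str] = None) -> Iterator[str]:
    """Convert GTF records to 3-column BED lines.

    If `feature` is given (e.g. "gene"), only records whose third GTF
    column matches it are kept.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GTF comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        chrom, _source, feat, start, end = cols[:5]
        if feature is not None and feat != feature:
            continue
        # GTF is 1-based and end-inclusive; BED is 0-based and end-exclusive,
        # so only the start shifts. int() keeps coordinates as integers.
        yield f"{chrom}\t{int(start) - 1}\t{int(end)}"
```

Usage would be along the lines of writing `"\n".join(gtf_to_bed(open(path), feature="gene"))` to a `.bed` file, where `path` points at the uncompressed GTF.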
Glad to hear you got the counts working! I recently pushed a hotfix (a5ad506) that should solve the gtf filtering issue. Let me know if there are any other major issues! Your feedback has been incredibly helpful. |
I'm trying the new code, but I still seem to be getting the same error as before.
If it helps I can share a small reproducible example? The BAM file above is actually a subset of the full BAM file restricted to 10Mb-11Mb on chromosome 2L. I've uploaded these files (BAM + VCF) and the index files here: https://www.dropbox.com/sh/u5uhepsmbswdz8g/AACfsgrf6PeRG-7jPl6kaBqJa?dl=0 The GTF file can be obtained from here: http://ftp.ensembl.org/pub/release-100/gtf/drosophila_melanogaster/ |
Hi Mike, I was able to reproduce the bug, and fixed the issue (7fb8537). The gtf file was converting the index from int to float, and thus caused issues when intersecting with the vcf. I've also attached the results of the counting program to the Dropbox. Please let me know if there are any other issues. |
Thanks for taking a look! I was in an all-day meeting at the end of last week but will test this out on my end ASAP. |
Great, this works on my end as well. I'll close this one and let you know how the testing pipeline goes.
|
I've been able to make it all the way to the testing successfully. Does WASP2 have an AI test across samples? If not, I was considering using a p-value combination method (Fisher, Stouffer, harmonic mean, etc.). |
@AaronJeeHo has been working on testing for differences in AI between two samples or between groups of cells (using single-cell data). But we haven't thought about testing for AI across samples. Is there a use case you have in mind? I guess it could be interesting to test whether the AI distribution is shifted between two groups of samples. We recently compared AI in tumor and normal samples in a couple of projects, but we used a very simple (not ideal) approach where we just counted the number of samples with/without significant AI and then did Fisher's Exact Test. This obviously doesn't account for the problem of incomplete power. |
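The simple approach described above (counting samples with and without significant AI in each group, then applying Fisher's Exact Test) can be sketched with the standard library alone. This is a generic two-sided implementation for a 2x2 table, not code from WASP2; the function name and the table layout are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Assumed layout: rows are groups (e.g. tumor, normal), columns are
    the number of samples with and without significant AI.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables at least as unlikely as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

As noted in the comment, this treats each sample as a binary significant/not-significant call, so it inherits the incomplete-power problem of the per-sample tests.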
The use case is that I have replicates in some experiments, so I am looking for consistent AI across a number of heterozygotes. We've also been working on trends in AI in bulk and sc data; happy to chat about ideas on this. Re: testing consistent AI, I may explore different p-value combinations (combining across samples per gene). As we have simulations, I can find the best method (at least for the sim). |
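The three combination methods named in this exchange can be sketched as follows, for one gene's per-sample p-values. This is a standard-library illustration under stated assumptions, not something WASP2 provides: the function names are hypothetical, all p-values must lie strictly between 0 and 1, Fisher's and Stouffer's methods assume independent tests, Stouffer's is written for one-sided p-values with equal weights, and the harmonic mean is the raw, uncalibrated value.

```python
import math
from statistics import NormalDist
from typing import Sequence


def fisher_combined(pvalues: Sequence[float]) -> float:
    """Fisher's method: X = -2 * sum(ln p) ~ chi-square with 2k d.o.f."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    # For even degrees of freedom the chi-square survival function is
    # exp(-X/2) * sum_{i<k} (X/2)^i / i!, so no SciPy is needed.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def stouffer_combined(pvalues: Sequence[float]) -> float:
    """Stouffer's method with equal weights on one-sided p-values."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(len(pvalues))
    return 1.0 - nd.cdf(z)


def harmonic_mean_p(pvalues: Sequence[float]) -> float:
    """Raw harmonic mean of the p-values (no calibration applied)."""
    return len(pvalues) / sum(1.0 / p for p in pvalues)
```

Comparing these on simulated data, as proposed above, would mean calling each function on the same list of per-sample p-values for every gene and checking error rates and power against the known truth.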
Aaron,
Thanks for sharing the new version for testing allelic imbalance. I'm starting to try to port over our WASP AI testing to WASP2. I thought I would start from .merge.bam files and recount those.

As background, I have VCF files which represent a simulated diploid genome. Previously the chr-separated VCFs worked fine with the WASP pipeline all the way to CHT, but there may be some issues with those VCFs, where I need to add additional fields or tags.
As a side question, can I use the AI test in WASP2 with counts from WASP? E.g. these:
If not, with the WASP2 counting script, I am now getting this error:
The VCF in question has some non-required fields missing but it appears valid:
The VCF file looks like: