Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testing allelic analysis #2

Closed
mikelove opened this issue Jan 6, 2022 · 14 comments
Closed

Testing allelic analysis #2

mikelove opened this issue Jan 6, 2022 · 14 comments

Comments

@mikelove
Copy link

mikelove commented Jan 6, 2022

Aaron,

Thanks for sharing the new version for testing allelic imbalance. I'm starting to try to port over our WASP AI testing to WASP2. I thought I would start from .merge.bam files and recount those.

As background, I have VCF files which represent a simulated diploid genome. Previously the chr-separated VCF worked fine with WASP pipeline all the way to CHT, but there may be some issues with those VCF, where I need to add additional fields or tags.

As a side question, can I use the AI test in WASP2 with counts from WASP? E.g. these:

wasp_cht/alt_as_counts.sample_A_1.h5
wasp_cht/hap_read_counts.sample_A_1.adj
wasp_cht/hap_read_counts.sample_A_1.hetp
wasp_cht/hap_read_counts.sample_A_1.txt
wasp_cht/other_as_counts.sample_A_1.h5
wasp_cht/read_counts.sample_A_1.h5
wasp_cht/ref_as_counts.sample_A_1.h5

If not, with the WASP2 counting script, I am now getting this error:

python run_analysis.py count --rna -ft data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz \
  -a wasp_mapping/sample_A_1.merge.bam -g data/drosophila_wg.vcf -s sample \
  -r data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz -o testai
Namespace(command='count', stype='rna', singlecell=False, 
features=['data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz'], 
alignment='wasp_mapping/sample_A_1.merge.bam', 
genotypes='data/drosophila_wg.vcf', sample='sample', 
regions='data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz', 
barcodes=None, output='testai', nofilt=False, keeptemps=None)
Bulk Analysis
GTF filtered by feature
Filtering reads that overlap regions of interest
Bam file filtered!
Traceback (most recent call last):
  File "/proj/milovelab/bin/WASP2/src/analysis/run_analysis.py", line 273, in <module>
    main()
  File "/proj/milovelab/bin/WASP2/src/analysis/run_analysis.py", line 266, in main
    parse_counting(args.alignment, args.genotypes, args.regions, 
args.sample, args.output, args.stype, nofilt=args.nofilt, temp_loc=args.keeptemps, features=args.features)
  File "/proj/milovelab/bin/WASP2/src/analysis/run_analysis.py", line 53, in parse_counting
    intersect_df = preprocess_data(in_bam, in_vcf, in_region, in_sample, stype, nofilt, tmpdir, features)
  File "/proj/milovelab/bin/WASP2/src/analysis/run_analysis.py", line 34, in preprocess_data
    write_sample_snp(in_vcf, in_sample, out_dir)
  File "/proj/milovelab/bin/WASP2/src/analysis/filter_data.py", line 24, in write_sample_snp
    vcf = VariantFile(in_file)
  File "pysam/libcbcf.pyx", line 4054, in pysam.libcbcf.VariantFile.__init__
  File "pysam/libcbcf.pyx", line 4284, in pysam.libcbcf.VariantFile.open
ValueError: invalid file `b'data/drosophila_wg.vcf'` (mode=`b'r'`) - is it VCF/BCF format?

The VCF in question has some non-required fields missing but it appears valid:

vcf-validator data/drosophila_wg.vcf
Could not parse the fileformat version string [#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample], assuming VCFv4.2
The header tag 'reference' not present. (Not required but highly recommended.)
The "fileformat" field not present in the header, assuming VCFv4.2
The header tag 'contig' not present for CHROM=2L. (Not required but highly recommended.)
column sample at 2L:10591 .. FORMAT tag [GL] not listed in the header,FORMAT tag [GT] not listed in the header
The header tag 'contig' not present for CHROM=2R. (Not required but highly recommended.)
The header tag 'contig' not present for CHROM=3L. (Not required but highly recommended.)
The header tag 'contig' not present for CHROM=3R. (Not required but highly recommended.)
The header tag 'contig' not present for CHROM=4. (Not required but highly recommended.)
The header tag 'contig' not present for CHROM=X. (Not required but highly recommended.)

The VCF file looks like:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample
2L	10591	rs1	T	A	.	PASS	.	GT:GL	0|1:-100,0,-100
2L	11464	rs2	A	T	.	PASS	.	GT:GL	0|1:-100,0,-100
...
@AaronJeeHo
Copy link
Member

Hello Mike,

Thank you for bringing this to my attention, I am currently looking into these issues.

Invalid VCF Fix

I was able to duplicate the error by removing the fileformat field in the header. I believe adding something like ##fileformat=VCFv4.2 to the beginning of the header should allow the parser to recognize the file.

Hopefully that fixes the current issue, but I have a feeling there may be other header related problems. I'll continue to improve the file-handling and catch issues like this. Please keep me posted with any other issues you run into, it's very helpful.

WASP to WASP2 Compatibility

As for your second question, I believe it's possible, but I recommend trying to get the new count pipeline working with your data.

WASP2 currently outputs its count data as a 8 column TSV with a header.

ATAC-Seq Format:

chrom	pos	ref	alt	peak	ref_count	alt_count	other_count
chr1	1019397	C	A	chr1_1019383_1019826	0	2	0
chr1	1019636	A	G	chr1_1019383_1019826	0	1	0
chr1	1060101	T	G	chr1_1060008_1060158	0	0	0
chr1	1407232	G	C	chr1_1407182_1407478	0	2	0
chr1	1724489	G	C	chr1_1724249_1724783	4	0	0

The RNA-Seq counts include an additional column called feature, and replaces peak with genes

RNA-Seq Format:

chrom	pos	ref	alt	feature	genes	ref_count	alt_count	other_count
chr1	996168	G	T	transcript	RP11-54O7.17	0	0	0
chr1	997077	G	A	exon	RP11-54O7.17	1	0	0
chr1	999842	C	A	exon	HES4	0	2	0
chr1	999842	C	A	CDS	HES4	0	0	0
chr1	999842	C	A	transcript	HES4	0	0	0
chr1	1004625	A	G	transcript	ISG15	1	0	0

We can denote which feature we want to analyze with the --feature flag. (If you don't include the flag it will automatically just analyze all the unique features found in the count file, but for future reference, valid inputs would probably be something along the lines of "transcript", "exon" etc.).

With this in mind, it might be a lot harder to convert RNA-Seq to the appropriate format, and there's still a lot of improvements that can be made to the RNA-Seq pipeline, so the inputs may change in the future.

In regard to previous WASP count data the easiest conversion would be using the output of WASP's bam2h5.py script using the --txt_counts option

The text file will have columns: chromosome, snp_position, ref_allele, alt_allele, genotype, ref_allele_count, alt_allele_count, other_count

counts.txt.gz Format:

chr1 993941 A C 1|1 0 0 0
chr1 993947 A G 1|1 0 0 0
chr1 994678 C A 0|1 3 0 0
chr1 994949 C T 1|0 1 1 0
chr1 994997 C T 1|0 0 0 0
chr1 995153 C G 1|0 0 1 0

There are 2 additional filtering steps that WASP2 does that will have to be done manually.

  1. Filter file to only include heterozygous snp's. This should be pretty simple since the genotypes are provided.

  2. Intersect SNP's with peaks/genes. The software WASP2 uses is bedtools.

  • I recommend using the bedtools intersect command with -wb option.

The count data should now have the required information after these two steps. The data should be compatible once you include the appropriate header info, and columns. In order to avoid issues, I would drop any unnecessary columns, and
ensure the column ordering is correct.

Please let me know if you have any additional questions or concerns!

Thanks,
Aaron

@mikelove
Copy link
Author

Thanks for the detailed info Aaron!

I may start working on the analysis step, because I can manually get the counts from the txt_counts output which I already have.

As far as WASP2 counting, I got further but stuck again:

I got past the VCF issues by doing as you said, also adding FORMAT tag lines, then bgzip and tabix.

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GL,Number=G,Type=String,Description="Genotype likelihood">

I'm getting the following error after getting past VCF issues:

python /proj/milovelab/bin/WASP2/src/analysis/run_analysis.py count --rna -ft data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz   -a wasp_mapping/sample_A_1.merge.bam -g data/drosophila_wg.vcf.gz -s sample   -r data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz -o testai
Namespace(command='count', stype='rna', singlecell=False, features=['data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz'], alignment='wasp_mapping/sample_A_1.merge.bam', genotypes='data/drosophila_wg.vcf.gz', sample='sample', regions='data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz', barcodes=None, output='testai', nofilt=False, keeptemps=None)
Bulk Analysis
GTF filtered by feature
Filtering reads that overlap regions of interest
Bam file filtered!
Created Filtered VCF
Created Intersection File
Traceback (most recent call last):
  File "/proj/milovelab/bin/WASP2/src/analysis/run_analysis.py", line 273, in <module>
    main()
  File "/proj/milovelab/bin/WASP2/src/analysis/run_analysis.py", line 266, in main
    parse_counting(args.alignment, args.genotypes, args.regions, args.sample, args.output, args.stype, nofilt=args.nofilt, temp_loc=args.keeptemps, features=args.features)
  File "/proj/milovelab/bin/WASP2/src/analysis/run_analysis.py", line 53, in parse_counting
    intersect_df = preprocess_data(in_bam, in_vcf, in_region, in_sample, stype, nofilt, tmpdir, features)
  File "/proj/milovelab/bin/WASP2/src/analysis/run_analysis.py", line 39, in preprocess_data
    intersect_df = parse_gene_df(str(Path(out_dir) / "intersect.bed"))
  File "/proj/milovelab/bin/WASP2/src/analysis/filter_data.py", line 94, in parse_gene_df
    df = pd.read_csv(intersect_file, sep="\t", header=None, usecols=[0, 1, 3, 4, 12, 18])
  File "/nas/longleaf/home/milove/.conda/envs/WASP2/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/nas/longleaf/home/milove/.conda/envs/WASP2/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/nas/longleaf/home/milove/.conda/envs/WASP2/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/nas/longleaf/home/milove/.conda/envs/WASP2/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/nas/longleaf/home/milove/.conda/envs/WASP2/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/nas/longleaf/home/milove/.conda/envs/WASP2/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 549, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

@AaronJeeHo
Copy link
Member

Hi Mike,

Glad to hear we're getting a bit further into the pipeline!

As for the current issue, I believe this is an issue with the preprocessing of the gtf.

You're currently using the option

-ft data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz

Usually this will filter out the gtf to look for features such like transcripts, but in this case it will look for a feature named data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz, which doesn't exist and thus creates an empty gtf (Will need to add some extra validation/fail-safes).

I recommend removing the -ft flag altogether(skips filtering step), or changing it to something like...

-ft transcript

One option that will be particularly useful for debugging is...

--keeptemps [some directory]

What this will do is store the intermediary files to some location. Being able to look at these files should help pinpoint the issue better. I believe the filtered files filter.gtf and intersect.bed(bedtools intersect output), are empty when they shouldn't be.

Let me know how these fixes go! Hopefully they help solve the current issue

@mikelove
Copy link
Author

Oh, sorry about that ... I wasn't reading carefully obviously and just saw "features ... in gtf".

In the README you have

-ft/--features: (RNA ONLY): Features to analyze in gtf. By default analyzes transcript. Analyzes all features if flag denoted without feature.

This is totally correct and I was just reading too fast, but maybe for the lazy readers like me:

-ft/--features: (RNA ONLY): Name of features to analyze in the GTF file, e.g. transcript, exon, gene, etc. Analyzes all features if feature argument not specified.


I've got a new error related to parsing the GTF, whether I set -ft gene or don't specify -ft at all I get the following:

python /proj/milovelab/bin/WASP2/src/analysis/run_analysis.py count --rna -a wasp_mapping/sample_A_1.merge.bam -g data/drosophila_wg.vcf.gz -s sample   -r data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz -o testai --keeptemps tmp
Namespace(command='count', stype='rna', singlecell=False, features=['transcript'], alignment='wasp_mapping/sample_A_1.merge.bam', genotypes='data/drosophila_wg.vcf.gz', sample='sample', regions='data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz', barcodes=None, output='testai', nofilt=False, keeptemps='tmp')
Bulk Analysis
GTF filtered by feature
Filtering reads that overlap regions of interest
....
pysam.utils.SamtoolsError: 'samtools returned with error 1: stdout=None, stderr=[bed_read] Parse error reading "tmp/filter.gtf" at line 1\nsamtools view: Could not read file "tmp/filter.gtf"\n'

The GTF in question is from Ensembl:

http://ftp.ensembl.org/pub/release-100/gtf/drosophila_melanogaster/

I looked into tmp/filter.gtf and noticed it has no header, could that be the issue? E.g. here is the top line of filter.gtf file with no -ft specified:

head tmp/filter.gtf 
3R	FlyBase	transcript	567076.0	2532932.0	.	+	.	"gene_id ""FBgn0267431""; transcript_id ""FBtr0392909""; gene_name ""Myo81F""; gene_source ""FlyBase""; gene_biotype ""protein_coding""; transcript_source ""FlyBase""; transcript_biotype ""protein_coding"";"

@AaronJeeHo
Copy link
Member

Ah, I think you just caught a bug.

I believe the error is because the bam filtering step expects regions in bed format as opposed to gtf.

I should be able to make a quick bugfix by the beginning of the week.

In the meantime, I believe including the --nofilt flag should solve your problem.

Essentially, the pipeline filters the bam file to only include reads that overlap regions of interests, This could cut down a lot of time during the counting step, but could also take a lot of time to do as well. So the --nofilt option lets you skip the filtering step if preferred.

Let me know if this works!

@mikelove mikelove changed the title Invalid VCF file Testing allelic analysis Jan 15, 2022
@mikelove
Copy link
Author

Yes I think you're right about it expecting BED:

python /proj/milovelab/bin/WASP2/src/analysis/run_analysis.py count --rna --nofilt -a wasp_mapping/sample_A_1.merge.bam -g data/drosophila_wg.vcf.gz -s sample   -r data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz -o testai
...
    raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError: 
Command was:

	bedtools intersect -wb -b /tmp/tmpzfji27kc/filter.gtf -a /tmp/tmpzfji27kc/filter.vcf

Error message was:
***** ERROR: illegal number "567076.0". Exiting...

I read in the GTF into Bioconductor and wrote out a BED file with rtracklayer. Proceeding this way works only if I switch to --atac of course.

python /proj/milovelab/bin/WASP2/src/analysis/run_analysis.py count --rna --nofilt -a wasp_mapping/sample_A_1.merge.bam -g data/drosophila_wg.vcf.gz -s sample   -r genes.bed -o testai
usage: run_analysis.py [-h] {count,analysis} ...
run_analysis.py: error: --rna option selected, but .bed file given

One note: the output directory needs to be created or else it will fail on the last line (sometimes bioinf tools will create an output directory for you).

After fixing that, I can count alleles:

python /proj/milovelab/bin/WASP2/src/analysis/run_analysis.py count --atac --nofilt -a wasp_mapping/sample_A_1.merge.bam -g data/drosophila_wg.vcf.gz -s sample   -r genes.bed -o testai
Namespace(command='count', stype='atac', singlecell=False, features=None, alignment='wasp_mapping/sample_A_1.merge.bam', genotypes='data/drosophila_wg.vcf.gz', sample='sample', regions='genes.bed', barcodes=None, output='testai', nofilt=True, keeptemps=None)
Bulk Analysis
Created Filtered VCF
Created Intersection File
SNP DF Created
Counting Alleles for 2L
Counted 10605 SNP's in 256.69881796836853 seconds!

Counting Alleles for 2R
Counted 11128 SNP's in 229.88397359848022 seconds!

Counting Alleles for 3L
Counted 10570 SNP's in 250.6389660835266 seconds!

Counting Alleles for 3R
Counted 13270 SNP's in 425.5656509399414 seconds!

Counting Alleles for 4
Counted 424 SNP's in 7.0546629428863525 seconds!

Counting Alleles for X
Counted 7575 SNP's in 220.31168627738953 seconds!

Counted all SNP's in 1390.226710319519 seconds!
Counting pipeline completed in 1391.6359329223633 seconds!

Happy to test out the GTF version when you have time.

@AaronJeeHo
Copy link
Member

Glad to hear you got the counts working!

I recently pushed a hotfix(a5ad506) that should solve the gtf filtering issue.

Let me know if there are any other major issues!

Your feedback has been incredibly helpful.

@mikelove
Copy link
Author

I'm trying the new code, but still getting same error as previously it seems.

WASP2 > git show
commit a5ad5067d189b61cc7ac172d95f91360a27c801b
Author: Aaron Ho <aho@salk.edu>
Date:   Tue Jan 18 20:32:18 2022 -0800
~~~
python /proj/milovelab/bin/WASP2/src/analysis/run_analysis.py count --rna --nofilt -a test_A_1_2L.merge.bam -g data/drosophila_wg.vcf.gz -s sample -r data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz -o testai
Namespace(command='count', stype='rna', singlecell=False, features=['transcript'], alignment='test_A_1_2L.merge.bam', genotypes='data/drosophila_wg.vcf.gz', sample='sample', regions='data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz', barcodes=None, output='testai', nofilt=True, keeptemps=None)
Bulk Analysis
GTF filtered by feature
Created Filtered VCF
...
    raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError: 
Command was:

	bedtools intersect -wb -b /tmp/tmpcz7c2wgk/filter.gtf -a /tmp/tmpcz7c2wgk/filter.vcf

Error message was:
***** ERROR: illegal number "567076.0". Exiting...

If it helps I can share a small reproducible example? The BAM file above is actually a subset of the full BAM file restricted to 10Mb-11Mb on chromosome 2L. I've uploaded these files (BAM + VCF) and the index files here:

https://www.dropbox.com/sh/u5uhepsmbswdz8g/AACfsgrf6PeRG-7jPl6kaBqJa?dl=0

The GTF file can be obtained from here:

http://ftp.ensembl.org/pub/release-100/gtf/drosophila_melanogaster/

@AaronJeeHo
Copy link
Member

Hi Mike,
Sorry for the delayed response, it's been quite busy.

I was able to reproduce the bug, and fixed the issue (7fb8537). The gtf file was converting the index from int to float, and thus caused issues when intersecting with the vcf.

I've also attached the results of the counting program to the Dropbox.

Please let me know if there are any other issues.

@mikelove
Copy link
Author

Thanks for taking a look! I was in all day meeting end of last week but will test this out on my end ASAP.

@mikelove
Copy link
Author

mikelove commented Feb 3, 2022

Great this works on my end as well. I'll close this one and let you know how the testing pipeline goes

python /proj/milovelab/bin/WASP2/src/analysis/run_analysis.py count --rna -ft gene -a wasp_mapping/sample_A_1.merge.bam -g data/drosophila_wg.vcf.gz -s sample   -r data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz -o testai
Namespace(command='count', stype='rna', singlecell=False, features=['gene'], alignment='wasp_mapping/sample_A_1.merge.bam', genotypes='data/drosophila_wg.vcf.gz', sample='sample', regions='data/Drosophila_melanogaster.BDGP6.28.100.chr.gtf.gz', barcodes=None, output='testai', nofilt=False, keeptemps=None)
Bulk Analysis
GTF filtered by feature
Created Filtered VCF
Created Intersection File
SNP DF Created
Filtering reads that overlap regions of interest
Bam file filtered!
Counting Alleles for 2L
Counted 10953 SNP's in 255.29638981819153 seconds!

Counting Alleles for 2R
Counted 11739 SNP's in 174.22105026245117 seconds!

Counting Alleles for 3L
Counted 11011 SNP's in 211.66355562210083 seconds!

Counting Alleles for 3R
Counted 13921 SNP's in 461.71386790275574 seconds!

Counting Alleles for 4
Counted 430 SNP's in 4.299241542816162 seconds!

Counting Alleles for X
Counted 8102 SNP's in 186.62723541259766 seconds!

Counted all SNP's in 1293.9235508441925 seconds!
Counting pipeline completed in 1681.3137261867523 seconds!

@mikelove mikelove closed this as completed Feb 3, 2022
@mikelove
Copy link
Author

I've been able to make it all the way to the testing successfully.

Does WASP2 have an AI test across samples? If not I was considering to use a p-value combination method, Fisher, Stouffer, harmonic mean, etc.

@gmcvicker
Copy link

@AaronJeeHo has been working on testing for differences in AI between two samples or between groups of cells (using single cell data). But we haven't thought about testing for AI across samples. Is there a use case you have in mind?

I guess it could be interesting to test if the AI distribution is shifted between two groups of samples. We recently compared AI in tumor and normal samples in a couple of projects, but we used a very simple (not ideal) approach where we just counted number of samples with/without significant AI and then did Fisher's Exact Test. This obviously doesn't account for the problem of incomplete power.

@mikelove
Copy link
Author

mikelove commented Mar 3, 2022

The use case is that I have replicates in some experiments. So looking for consistent AI across a number of heterozygotes.

We've also been working on trends in AI in bulk and sc, happy to chat about ideas on this.

Re: testing consistent AI I may explore different p-value combinations (combining across samples per gene). As we have simulation I can find the best method (at least for the sim).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants