Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue writing REP metric to VCF #17

Closed
pblaney opened this issue Dec 21, 2021 · 3 comments
Closed

Issue writing REP metric to VCF #17

pblaney opened this issue Dec 21, 2021 · 3 comments

Comments

@pblaney
Copy link

pblaney commented Dec 21, 2021

Hi Kez,

During the steps to write the output of the call function to the VCF, I encountered the following error:

dysgu call --ibam PD26400a_T.final.bam --sites PD26400a_T_vs_PD26400b_N.consensus.somatic.sv.vcf --all-sites True --ho
m-ref-sites True --clip-length 15 --metrics Homo_sapiens_assembly38.fasta working_dir working_dir/PD26400a_T.final.dysgu_reads.bam 
> svs.vcf
2021-12-21 13:12:33,777 [INFO   ]  [dysgu-call] Version: 1.3.3
2021-12-21 13:12:33,777 [INFO   ]  Input file is: working_dir/PD26400a_T.final.dysgu_reads.bam
2021-12-21 13:12:33,777 [INFO   ]  call --ibam PD26400a_T.final.bam --sites PD26400a_T_vs_PD26400b_N.consensus.somatic.sv.vcf --all-sites True --hom-ref-sites True --clip-length 15 --metrics Homo_sapiens_assembly38.fasta working_dir working_dir/PD26400a_T.final.dysgu_reads.bam
[W::hts_idx_load3] The index file is older than the data file: PD26400a_T.final.bai
2021-12-21 13:12:34,079 [INFO   ]  Sample name: PD26400a_T
2021-12-21 13:12:34,079 [INFO   ]  Writing vcf to stdout
2021-12-21 13:12:34,079 [INFO   ]  Running pipeline
2021-12-21 13:12:34,782 [INFO   ]  Calculating insert size. Removed 27 outliers with insert size >= 938.0
2021-12-21 13:12:34,798 [INFO   ]  Inferred read length 151.0, insert median 458, insert stdev 89
2021-12-21 13:12:34,799 [INFO   ]  Max clustering dist 903
2021-12-21 13:12:34,799 [INFO   ]  Minimum support 3
2021-12-21 13:12:34,799 [INFO   ]  Reading --sites
[W::vcf_parse_info] INFO/END=34143939 is smaller than POS at chr1:84232901
2021-12-21 13:12:34,826 [INFO   ]  Building graph with clustering 903 bp
2021-12-21 13:14:03,262 [INFO   ]  Total input reads 4894147
2021-12-21 13:14:04,966 [INFO   ]  Added 57 variants from input sites
2021-12-21 13:14:06,505 [INFO   ]  Graph constructed
2021-12-21 13:19:44,026 [INFO   ]  Number of components 2578332. N candidates 416974
2021-12-21 13:19:44,330 [INFO   ]  Number of matching SVs from --sites 62
2021-12-21 13:20:02,988 [INFO   ]  Re-alignment of soft-clips done. N candidates 300203
2021-12-21 13:20:07,483 [INFO   ]  Number of candidate SVs merged: 25814
2021-12-21 13:20:07,483 [INFO   ]  Number of candidate SVs after merge: 274389
2021-12-21 13:20:07,573 [INFO   ]  Number of candidate SVs dropped with sv-len < min-size or support < min support: 243030
2021-12-21 13:20:07,893 [INFO   ]  Number or SVs near gaps dropped 174
2021-12-21 13:20:09,897 [INFO   ]  Loaded n=24 chromosome coverage arrays from working_dir
/conda/lib/python3.7/site-packages/sklearn/base.py:333: UserWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.0.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
  UserWarning,
2021-12-21 13:20:29,021 [INFO   ]  Model: pe, diploid: True, contig features: True. N features: 43
Traceback (most recent call last):
  File "/conda/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/conda/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/conda/lib/python3.7/site-packages/dysgu/main.py", line 445, in call_events
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1190, in dysgu.cluster.cluster_reads
  File "dysgu/io_funcs.pyx", line 656, in dysgu.io_funcs.to_vcf
  File "dysgu/io_funcs.pyx", line 279, in dysgu.io_funcs.make_main_record
ValueError: could not convert string to float: '.'

There was no issues with the fetch command and the beginning of the svs.vcf appears to be output correctly (i.e. full header was present, including all FILTER/INFO/FORMAT fields).

The issue stems from the REP metric calculation, specifically at the following code block:

if not small_output:
        info_extras += [
                    f"REP={'%.3f' % float(rep)}",
                    f"REPSC={'%.3f' % float(repsc)}",

                    ]

I'm running the most recent push to pypi (v1.3.3).

@kcleal
Copy link
Owner

kcleal commented Dec 22, 2021

Hi Patrick,
Thanks for reporting this. I was not able to reproduce the error, but as the error looked relatively trivial I have modified the code so that the exception is bypassed. v1.3.4 should address this issue, but please re-open if this is not the case.
Could I ask, was the file PD26400a_T_vs_PD26400b_N.consensus.somatic.sv.vcf generated by dysgu? I suspect the error originates from merging SV events between the --sites input and the discovered variants.
Also note in v1.3.4, the --hom-ref-sites option has been removed as it is redundant. The --all-sites True option will output a genotype for all input sites including variants with genotype 0/0.

@kcleal kcleal closed this as completed Dec 22, 2021
@pblaney
Copy link
Author

pblaney commented Dec 22, 2021

Hey Kez,
I'll give it another try in the next few days and get back to you if I encounter any further issues.

So that file was not generated by dysgu. It is a consensus call set from Manta, SvABA, and DELLY that was merged using SURVIVOR. This does bring a question to mind, what information does dysgu pull from the VCF when using the --sites parameter? My file PD26400a_T_vs_PD26400b_N.consensus.somatic.sv.vcf is a little messy/unconventional right now so I was curious how to clean it for ensure it worked as intended.

@kcleal
Copy link
Owner

kcleal commented Dec 23, 2021

Currently, the information pulled is quite limited, basically:

chrom, start, chrom2, stop, svt, idx, r.id, svlen, pval

Where svt is SVTYPE and pval is from dysgu (if available).
This is an interesting test case, and I would be interested to know what your conclusions are from trying this combination of callers. In my own tests, I found using calls from other software can work well but as long as the supplied --sites have relatively few false positives. i.e. --sites should be a list of "confident sites".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants