Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to run post analysis steps such as copynumber and novel SNP detection? #11

Closed
yingchen69 opened this issue May 17, 2023 · 8 comments

Comments

@yingchen69
Copy link

Hi,

The readme mentions that t1k post analysis steps can do copynumber and novel SNP detection. Is there any detail regarding how to do the tasks? There is a t1k-copynumber.py, but I am not sure if python t1k-copynumber.py -g T1K_genotyping_result_file can work.

Thanks a lot for the help!

Ying

@mourisl
Copy link
Owner

mourisl commented May 17, 2023

The novel SNP detection is automatically included in the workflow/wrapper, and the vcf file is the SNP detection results where the coordinate is the concatenated exons for each allele.

For the t1k-copynumber.py script, the "-g" option takes the the XXX_genotype.tsv file generated by T1K. You may also add the gene names to option "--nomissing" as a comma-separated list, where the genes are expected to be present on every chromosome. For example, we know there are four KIR framework genes, so the copy number inference command should be:

python3 t1k-copynumber.py -g XXX_genotype.tsv --nomissing KIR3DL3,KIR2DL4,KIR3DP1,KIR3DL2

Hope this helps and please let me know if you have further questions.

@yingchen69
Copy link
Author

Hi,

Thanks a lot for the quick reply!

I am still a bit confused by the SNP detection. Here is my command to get the genotype.tsv:

run-t1k -1 my_R1.fq.gz -2 my_R2.fq.gz --preset hla -f /hlaidx/hlaidx_dna_seq.fa --od outputdir -o mysampleid

I do not see any vcf in the output folder. Should I set more parameter to get SNP result in the output?

Best,

Ying

@mourisl
Copy link
Owner

mourisl commented May 17, 2023

Can you show me the output on the screen?

Another way to call the variants is through the "./analyzer" command, which can be run as:
"./analyzer -f /hlaidx/hlaidx_dna_seq.fa -a outputdir/mysampleid_allele.tsv -1 outputdir/mysampleid_aligned_1.fa -2 outputdir/mysampleid_aligned_1.fa.gz -s 0.97 -o outputdir/mysampleid" in your case. After that you shall see the outputdir/mysampleid_allele.tsv file.

@yingchen69
Copy link
Author

yingchen69 commented May 17, 2023 via email

@yingchen69
Copy link
Author

Hi,

I just got our server back and I checked that we do have the allele.vcf files. Sorry for the confusion :(

I got t1k-copynumber.py working, but I am not sure about the column names. From the code it seems the column names should be gene, number of alleles, allele 1, allele 1 cn, allele 1 ratio, allele 2, allele 2 cn, allele 2 ratio. What is the 'ratio'? Is it in log2?

Another question, how does T1K deal with duplicated reads? My data are all from whole exon sequencing and I see ~20% duplicates rate by flagstat. I tried several tools for HLA typing such as polysolver, OptiType, hisat-genotype,HLAHD, HLALA. I always got totally different HLA typing results if I removed duplicated reads from the bam files first.

Thanks a lot!

Ying

@mourisl
Copy link
Owner

mourisl commented May 18, 2023

Ratio is the log-ratio of the likelihood between the most likely copy number and the second likely copy number. I'm still trying to optimize t1k-copynumber.py, so please interpret its result with caution.

We don't remove the duplicated reads. The duplicated reads will contribute to the allele abundance estimation (or other type of allele score in other HLA genotypers), therefore it is expected that the deduplication will affect the genotyping results. Hope this helps.

@yingchen69
Copy link
Author

Hi,

Thanks a lot for the details!

Regarding the duplicate reads, is their a reason to keep them for analysis? Normally we think duplicate reads are artifacts due to PCR, optical...

Best,

Ying

@mourisl
Copy link
Owner

mourisl commented May 18, 2023

In RNA-seq data, I saw some genes are very highly expressed and some reads can become duplicated by chance. I think the deduplication should be left to the user when preprocessing the fastq file if they feel the duplication has become a severe issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants