GitHub - morisUtokyo/cTR: Clustering reads with tandem repeats from the same locus

Introduction

cTR is a tool for clustering PacBio HiFi reads with mosaic tandem repeats that are collected from multiple samples at the same locus. It outputs a set of representative reads that are highly similar to each other.

cTR feeds such a fasta file that the annotation begins with a sample ID and a read ID. For example, the first row of ACCCC_GTTTT.fasta in the "test" directory shows that the sample and read IDs are 0 and 0, and the rest indicates mosaic tandem repeat pattern of the read string:

> 0,0,(ACCCC)22(GTTTT)22
ACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTT
> 0,1,(ACCCC)22(GTTTT)22
ACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTT

cTR clusters reads from an identical sample into one or two groups, and outputs the representative each group. If two haplotypes are homozygous, one group is output; if two haplotypes are heterozygous, two groups are output. In one group, the maximum Levenshtein distance between a pair of reads is 0.01 times the average length of the reads. Each of two groups also meets this condition. The default maximum value is set ot 0.01 because the maximum error rate of PacBio reads is less than 0.01 as of today. But it can be changed (say, 0.005) by redefining MAX_DIFF_RATIO in clusterTR.h.

Subsequently, cTR further clusters representative reads from all samples into groups so that the maximum Levenshtein distance between a pair of reads in a group is smaller than a threshold times the average read length. The maximum threshold is set to 0.03 by default but can be also changed by redefining MAX_DIAMETER in cluster.h. Precisely, when the average length is L, the diameter of every cluster at most ceil(L MAX_DIAMETER), where ceil(x) is the minimum integer >= x. For example, when we set MAX_DIAMETER=0.03 and L=15 (60), the diameter of every cluster is at most 1=ceil(0.45) and (2=ceil(1.8), respectively).cTR outputs a representative of each group with an annotation of the form:

> GroupSize = N, Diameter = D, RadiusFromCentroid = R, CentroidReadName =  sampleID,readID, CentroidReadLength = L

For example, the first two rows of ACCCC_GTTTT_rep.fasta in the "test" directory is:

> GroupSize = 39, Diameter = 10, RadiusFromCentroid = 10, CentroidReadName =  6,2,(ACCCC)21(GTTTT)23, CentroidReadLength = 220
ACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCACCCCGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTT

Each representative read of a sample is put into one of the groups in all samples. cTR outputs a table that describe this association. For example, the first five rows in ACCCC_GTTTT_hap_cent_table.txt are:

0,0,(ACCCC)22(GTTTT)22     9,2,(ACCCC)22(GTTTT)23
0,2,(ACCCC)24(GTTTT)20     25,2,(ACCCC)23(GTTTT)20
1,0,(ACCCC)24(GTTTT)21     41,0,(ACCCC)24(GTTTT)22
1,2,(ACCCC)24(GTTTT)20     25,2,(ACCCC)23(GTTTT)20
2,0,(ACCCC)23(GTTTT)23     8,0,(ACCCC)24(GTTTT)23

cTR also generates a table that shows a list of representative reads in each group of all samples. For example, ACCCC_GTTTT_table.txt shows:

 6,2,(ACCCC)21(GTTTT)23     6,2,(ACCCC)21(GTTTT)23
 15,2,(ACCCC)21(GTTTT)23     6,2,(ACCCC)21(GTTTT)23
 17,0,(ACCCC)21(GTTTT)23     6,2,(ACCCC)21(GTTTT)23
 21,0,(ACCCC)21(GTTTT)23     6,2,(ACCCC)21(GTTTT)23
 26,2,(ACCCC)21(GTTTT)23     6,2,(ACCCC)21(GTTTT)23

Usage

cTR [-ace] [-f fasta file] [-d output directory]

-f: Input a fasta file name without .fasta
-i: Input directory
-d: Output the results to the specified directory
-a: Output a detailed analysis of groups
-c: Output a fasta file of representative reads of groups
-e: Output a fasta file of one or two representative tandem repeat alleles for each individual

For example, see test.sh in the test directory.

cTR use KSW2, a library to align a pair of reads that implements dynamic programming. Obtain a copy of the KSW2 program from:

https://github.com/lh3/ksw2

Suzuki, H. and Kasahara, M. (2018). Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics, 19:45.

Li, H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100.

For your convenience, a copy of the KSW2 program is placed on this github.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
test		test
EditDistance.c		EditDistance.c
LICENSE.txt		LICENSE.txt
Makefile		Makefile
NJtree.c		NJtree.c
README.md		README.md
cTR_test.sh		cTR_test.sh
cli.c		cli.c
clusterTR.c		clusterTR.c
clusterTR.h		clusterTR.h
handle_one_file.c		handle_one_file.c
kalloc.c		kalloc.c
kalloc.h		kalloc.h
kseq.h		kseq.h
ksw2.h		ksw2.h
ksw2_extd.c		ksw2_extd.c
ksw2_extd2_sse.c		ksw2_extd2_sse.c
ksw2_extf2_sse.c		ksw2_extf2_sse.c
ksw2_exts2_sse.c		ksw2_exts2_sse.c
ksw2_extz.c		ksw2_extz.c
ksw2_extz2_sse.c		ksw2_extz2_sse.c
ksw2_gg.c		ksw2_gg.c
ksw2_gg2.c		ksw2_gg2.c
ksw2_gg2_sse.c		ksw2_gg2_sse.c
main.c		main.c
region_complexity.c		region_complexity.c
repCentroids.c		repCentroids.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Usage

About

Releases 1

Packages

Languages

License

morisUtokyo/cTR

Folders and files

Latest commit

History

Repository files navigation

Introduction

Usage

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages