GitHub - n-long/UCEpipeline: find ultra-conserved elements from pairwise genome alignments

UCEpipeline

find ultra-conserved elements (ungapped, 100% identity matches) from pairwise genome alignments

Requires:
LAST aligner
GNU Parallel
samtools
Seqmagick
split_multifasta
blat2gff

Create LAST databases for each genome

lastdb -R11 dbname /path/to/genome.fa
OR
parallel "lastdb -R11 {.}.database {}" ::: *genomes.fa

See lastDB man page for options regarding soft-masking and additional handling of simple repeats.

Split each genome into separate files for each chromosome/contig/scaffold

split_multifasta --input_file /path/to/genome.fa --output_dir genome_dir

Pairwise genome alignments

ls genome_dir/ | parallel "lastal -j1 -r5 -q100 -b100 -k2 targetgenome.database genome_dir/{}" > genome1_genome2.maf

See lastal options. -j1 for gapless alignments, -r5 for normal match score, -q100 and -b100 for unfavorably high gap and mismatch scores, and -k1 to not skip any positions in sliding window comparison.

Convert MAF alignments to FASTA and deduplicate (optional min-length argument to drop small sequences)

parallel "perl maf2fasta.pl < {} | grep -v "=" > {.}.fa" ::: *.maf

parallel seqmagick mogrify --deduplicate-sequences --min-length 20 {} ::: *.fa

Concatenate FASTA files, deduplicate again, and rename headers to incrementing number

cat *last.fa > UCE.fa

seqmagick mogrify --deduplicate-sequences UCE.fa

awk '/^>/{print ">"++i; next}{print}' UCE.fa > UCEcands.fa

Map UCE candidates back to each genome

cat UCEcands.fa | parallel --pipe --recstart '>' lastal -T1 -u0 -r5 -q100 -b100 -k1 eachgenome.database - > genome_UCE.maf

Convert alignment to PSL and then to GFF (additional filter for the few <100% hits that get through)

parallel "maf-convert psl {} > {.}.psl" ::: *_UCE.maf

parallel "./psl2gff.pl < {} | awk '\$6=100' > {.}.gff" ::: *.psl

Parse GFFs for UCE sequence names

for i in *UCE.gff; do awk -F';' '{print $2}' $i | sed 's/Target=//g' | awk '{print $1}' | sort | uniq > $i.names; done

Do pairwise join on each name file to see which are common to all

join genome1.names genome2.names > output1
join output1 genome3.names > output2
etc. etc.

Extract reduced list from larger UCE candidate sequence file

cat output3 | parallel -j 24 "{} samtools faidx UCEcands.fa >> UCEreduced.fa"

Map only those elements surviving all vs. all comparison back to each genome

cat UCEreduced.fa | parallel --pipe --recstart '>' lastal -T1 -u0 -r5 -q100 -b100 -k1 eachgenome.database - > genome_UCE.maf

-j1 option is switched out for -T1 so that only alignments extending to the end of the query sequence are accepted.

parallel "maf-convert psl {} > {.}.psl" ::: *_UCE.maf

parallel "./psl2gff.pl < {} > {.}.gff" ::: *.psl

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
README.md		README.md
maf2fasta.pl		maf2fasta.pl
psl2gff.pl		psl2gff.pl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

UCEpipeline

Create LAST databases for each genome

Split each genome into separate files for each chromosome/contig/scaffold

Pairwise genome alignments

Convert MAF alignments to FASTA and deduplicate (optional min-length argument to drop small sequences)

Concatenate FASTA files, deduplicate again, and rename headers to incrementing number

Map UCE candidates back to each genome

Convert alignment to PSL and then to GFF (additional filter for the few <100% hits that get through)

Parse GFFs for UCE sequence names

Do pairwise join on each name file to see which are common to all

Extract reduced list from larger UCE candidate sequence file

Map only those elements surviving all vs. all comparison back to each genome

About

Releases

Packages

Languages

n-long/UCEpipeline

Folders and files

Latest commit

History

Repository files navigation

UCEpipeline

Create LAST databases for each genome

Split each genome into separate files for each chromosome/contig/scaffold

Pairwise genome alignments

Convert MAF alignments to FASTA and deduplicate (optional min-length argument to drop small sequences)

Concatenate FASTA files, deduplicate again, and rename headers to incrementing number

Map UCE candidates back to each genome

Convert alignment to PSL and then to GFF (additional filter for the few <100% hits that get through)

Parse GFFs for UCE sequence names

Do pairwise join on each name file to see which are common to all

Extract reduced list from larger UCE candidate sequence file

Map only those elements surviving all vs. all comparison back to each genome

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages