Find ultra-conserved elements (ungapped, 100% identity matches) from pairwise genome alignments.
Requires:
LAST aligner
GNU Parallel
samtools
Seqmagick
split_multifasta
blat2gff
lastdb -R11 dbname /path/to/genome.fa
OR
parallel "lastdb -R11 {.}.database {}" ::: *genomes.fa
See the lastdb man page for options regarding soft-masking and additional handling of simple repeats.
split_multifasta --input_file /path/to/genome.fa --output_dir genome_dir
ls genome_dir/ | parallel "lastal -j1 -r5 -q100 -b100 -k1 targetgenome.database genome_dir/{}" > genome1_genome2.maf
See the lastal options: -j1 for gapless alignments, -r5 for the match score, -q100 and -b100 for prohibitively high mismatch and gap costs, and -k1 to not skip any positions in the sliding window comparison.
Convert MAF alignments to FASTA and deduplicate (optional min-length argument to drop small sequences)
parallel "perl maf2fasta.pl < {} | grep -v '=' > {.}.fa" ::: *.maf
parallel seqmagick mogrify --deduplicate-sequences --min-length 20 {} ::: *.fa
cat *last.fa > UCE.fa
seqmagick mogrify --deduplicate-sequences UCE.fa
awk '/^>/{print ">"++i; next}{print}' UCE.fa > UCEcands.fa
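To illustrate what the renumbering step does, here is a minimal sketch run on a made-up two-record FASTA file. The scaffold names, sequences, and temporary directory are hypothetical stand-ins for the real UCE.fa.

```shell
# Hypothetical two-record FASTA standing in for UCE.fa
workdir=$(mktemp -d)
printf '>scaffold_12:100-160\nACGTACGT\n>scaffold_7:5-65\nTTGACCA\n' > "$workdir/UCE.fa"

# Replace each header with a running counter; sequence lines pass through unchanged
awk '/^>/{print ">"++i; next}{print}' "$workdir/UCE.fa" > "$workdir/UCEcands.fa"

renumbered=$(cat "$workdir/UCEcands.fa")
printf '%s\n' "$renumbered"
# prints:
# >1
# ACGTACGT
# >2
# TTGACCA
```

Short numeric names keep the Target field in later GFF output compact, at the cost of losing the original coordinates from the headers.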
cat UCEcands.fa | parallel --pipe --recstart '>' lastal -T1 -u0 -r5 -q100 -b100 -k1 eachgenome.database - > genome_UCE.maf
Convert alignment to PSL and then to GFF (additional filter for the few <100% hits that get through)
parallel "maf-convert psl {} > {.}.psl" ::: *_UCE.maf
parallel "./psl2gff.pl < {} | awk '\$6==100' > {.}.gff" ::: *.psl
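The identity filter is meant to keep only rows whose sixth column equals 100. The sketch below shows that test on two made-up GFF rows; the row contents are hypothetical, and column 6 is assumed to hold the percent identity written by psl2gff.pl.

```shell
workdir=$(mktemp -d)
# Hypothetical GFF rows (tab-separated); column 6 is assumed to hold percent identity
printf 'chr1\tlast\tmatch\t100\t199\t100\t+\t.\tID=a;Target=1 1 100\n' >  "$workdir/hits.gff"
printf 'chr1\tlast\tmatch\t500\t599\t99\t+\t.\tID=b;Target=2 1 100\n'  >> "$workdir/hits.gff"

# == compares; a single = would assign 100 to column 6 and keep every row
kept=$(awk -F'\t' '$6 == 100' "$workdir/hits.gff" | cut -f9)
printf '%s\n' "$kept"
# prints: ID=a;Target=1 1 100
```

Setting the field separator to a tab also keeps the attribute column intact, since it contains spaces.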
for i in *_UCE.gff; do awk -F';' '{print $2}' "$i" | sed 's/Target=//g' | awk '{print $1}' | sort -u > "${i%_UCE.gff}.names"; done
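As a worked example of the name extraction, the sketch below runs the same pipeline on three made-up GFF rows. The attribute layout (ID first, then Target=<name> <start> <end>) is an assumption inferred from the command itself, not a documented psl2gff.pl format.

```shell
workdir=$(mktemp -d)
# Hypothetical rows; the second ';'-separated attribute is assumed to be Target=<UCE name> <start> <end>
printf 'chr1\tlast\tmatch\t100\t199\t100\t+\t.\tID=a;Target=7 1 100\n'  >  "$workdir/genome1_UCE.gff"
printf 'chr2\tlast\tmatch\t300\t399\t100\t+\t.\tID=b;Target=12 1 100\n' >> "$workdir/genome1_UCE.gff"
printf 'chr3\tlast\tmatch\t900\t999\t100\t-\t.\tID=c;Target=7 1 100\n'  >> "$workdir/genome1_UCE.gff"

# Take the Target attribute, strip the key, keep the name, and deduplicate
names=$(awk -F';' '{print $2}' "$workdir/genome1_UCE.gff" | sed 's/Target=//' | awk '{print $1}' | sort -u)
printf '%s\n' "$names"
# prints:
# 12
# 7
```

The names come out in lexical rather than numeric order (12 before 7), which is the order join expects from its inputs.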
join genome1.names genome2.names > output1
join output1 genome3.names > output2
Repeat for each remaining genome, joining the previous output with the next names file.
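The chain of join commands can also be written as a loop over any number of names files. This is a minimal sketch using three made-up lists; the file contents and the intermediate file called shared are hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical name lists for three genomes, already in lexical sort order
printf '1\n12\n3\n7\n' > "$workdir/genome1.names"
printf '12\n3\n7\n'    > "$workdir/genome2.names"
printf '1\n12\n7\n'    > "$workdir/genome3.names"

# Start from the first list and intersect it with each remaining list in turn
set -- "$workdir"/*.names
cp "$1" "$workdir/shared"
shift
for f in "$@"; do
    join "$workdir/shared" "$f" > "$workdir/shared.tmp"
    mv "$workdir/shared.tmp" "$workdir/shared"
done

shared=$(cat "$workdir/shared")
printf '%s\n' "$shared"
# prints:
# 12
# 7
```

Only names present in every list survive, so the result is the set of candidates with a hit in all genomes.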
cat output3 | parallel -j 24 "samtools faidx UCEcands.fa {}" > UCEreduced.fa
cat UCEreduced.fa | parallel --pipe --recstart '>' lastal -T1 -u0 -r5 -q100 -b100 -k1 eachgenome.database - > genome_UCE.maf
The -j1 option is replaced with -T1 so that only alignments extending to the end of the query sequence are accepted.
parallel "maf-convert psl {} > {.}.psl" ::: *_UCE.maf
parallel "./psl2gff.pl < {} > {.}.gff" ::: *.psl