README.md Make small clarifications Jul 21, 2017

last-genome-alignments

Here are some pairwise genome alignments made with LAST.

2017 human-ape alignments

The human genome (hg38) was aligned to chimp (panTro5) and gorilla (gorGor5), as follows. This recipe favors accuracy over speed. A faster recipe would mask repeats during alignment, omit the -m50 option of lastal, or both.

First, an "index" of the human genome was prepared, suitable for comparing it to highly-similar sequences:

lastdb -P0 -uNEAR -R01 hg38-NEAR hg38_no_alt_analysis_set.fa

Then, substitution and gap frequencies were determined:

last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 hg38-NEAR panTro5.fa > hg38-panTro5.mat

Next, many-to-one ape-to-human alignments were made:

lastal -m50 -E0.05 -C2 -p hg38-panTro5.mat hg38-NEAR panTro5.fa | last-split -m1 > hg38-panTro5-1.maf

The above command was the slowest step (3 CPU-weeks). You can "easily" parallelize it by processing each sequence in panTro5.fa separately. Each process uses a lot of memory, though, so make sure that the parallel runs together fit within your available memory.
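
One way to prepare the per-sequence inputs can be sketched as below. This is a minimal sketch and not necessarily how these alignments were made: the function name split_fasta, the temporary directory, and the numbered output file names are all hypothetical.

```shell
# Split a multi-sequence FASTA file into one file per sequence.
# split_fasta and the partNNNNN.fa naming scheme are illustrative assumptions.
split_fasta() {
    fasta=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^>/ {
            if (out) close(out)      # close the previous file before opening the next
            out = sprintf("%s/part%05d.fa", dir, ++n)
        }
        out { print > out }          # write header and sequence lines to the current file
    ' "$fasta"
}

# Tiny made-up FASTA file, just to show the effect.
tmp=$(mktemp -d)
printf '>chr1\nACGT\nACGT\n>chr2\nTTGA\n' > "$tmp/demo.fa"
split_fasta "$tmp/demo.fa" "$tmp/parts"
ls "$tmp/parts"
```

Each resulting file could then take the place of panTro5.fa in the lastal command above, with the number of simultaneous jobs chosen so that their combined memory use stays within what the machine has.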

Next, one-to-one ape-to-human alignments were made:

maf-swap hg38-panTro5-1.maf |
awk '/^s/ {$2 = (++s % 2 ? "panTro5." : "hg38.") $2} 1' |
last-split -m1 |
maf-swap > hg38-panTro5-2.maf

The awk command prepends the assembly name to each chromosome name (e.g. chr7 -> hg38.chr7).
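
To see what the awk step does, it can be run on a small made-up alignment block. The sample lines below are invented for illustration and are not taken from the real files. They assume the layout after maf-swap: in each alignment the first "s" line is the ape sequence and the second is the human sequence, which is what the odd/even counter in the awk command relies on.

```shell
# Made-up MAF block in the order expected after maf-swap:
# ape sequence first, human sequence second.
prefixed=$(
    printf 'a score=100\ns chr1 100 10 + 1000 ACGTACGTAC\ns chr7 200 10 + 2000 ACGTACGTAC\n' |
    awk '/^s/ {$2 = (++s % 2 ? "panTro5." : "hg38.") $2} 1'
)
printf '%s\n' "$prefixed"
```

On this sample the two sequence names become panTro5.chr1 and hg38.chr7, and the "a" line passes through unchanged. Because awk rebuilds any line whose field is assigned, the modified "s" lines come out with single spaces between fields, so any column padding in the input is lost.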

Finally, simple-sequence alignments were discarded, the alignments were converted to tabular format, and alignments with error probability > 10^-5 were discarded:

last-postmask hg38-panTro5-2.maf |
maf-convert -n tab |
awk -F'=' '$2 <= 1e-5' > hg38-panTro5.tab
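
The final awk filter can be tried on a few made-up lines. The lines below are abbreviated and invented for illustration (real tabular output has more columns). The only feature that matters is a trailing mismap=... column, which is assumed here to hold the error probability.

```shell
# Abbreviated, made-up tabular lines; only the mismap=... column matters.
kept=$(
    printf '812\thg38.chr1\tpanTro5.chr1\tmismap=1e-10\n640\thg38.chr2\tpanTro5.chr2\tmismap=0.001\n701\thg38.chr3\tpanTro5.chr3\tmismap=1e-06\n' |
    awk -F'=' '$2 <= 1e-5'    # split on "=", keep lines whose second piece is <= 1e-5
)
printf '%s\n' "$kept"
```

The lines with mismap=1e-10 and mismap=1e-06 are kept, and the line with mismap=0.001 is dropped. Since the filter splits each line on "=", it relies on the mismap column being the only one that contains an "=" sign.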

2017 human-mouse alignments

The human genome (hg38) was aligned to mouse (mm10). This recipe is even slower and more sensitive.

First, an "index" of the human genome was prepared, suitable for comparing it to less-similar sequences:

lastdb -P0 -uMAM4 -R01 hg38-MAM4 hg38_no_alt_analysis_set.fa

Then, substitution and gap frequencies were determined:

last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 hg38-MAM4 mm10.fa > hg38-mm10.mat

Next, many-to-one mouse-to-human alignments were made:

lastal -m100 -E0.05 -C2 -p hg38-mm10.mat hg38-MAM4 mm10.fa | last-split -m1 > hg38-mm10-1.maf

Finally, one-to-one MAF alignments, and high-confidence tabular alignments, were made in the same way as above.
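
Spelled out for mouse, the remaining steps might look like the sketch below. It simply substitutes mm10 for panTro5 in the ape commands. The output file names hg38-mm10-2.maf and hg38-mm10.tab are assumed by analogy and are not stated above, and the function names add_prefixes and finish_alignments are hypothetical.

```shell
# Assumed names, by analogy with the ape files.
ref=hg38
qry=mm10

# Prefix chromosome names in swapped MAF input (query first, reference second).
add_prefixes() {
    awk -v q="$qry." -v r="$ref." '/^s/ {$2 = (++s % 2 ? q : r) $2} 1'
}

# The remaining steps, mirroring the ape recipe. Defined but not run here,
# because it needs the LAST programs and the many-to-one MAF file.
finish_alignments() {
    maf-swap "$ref-$qry-1.maf" | add_prefixes | last-split -m1 |
        maf-swap > "$ref-$qry-2.maf"
    last-postmask "$ref-$qry-2.maf" | maf-convert -n tab |
        awk -F'=' '$2 <= 1e-5' > "$ref-$qry.tab"
}

# Quick check of the prefixing step on a made-up alignment block.
demo=$(printf 'a score=50\ns chr2 5 4 + 100 ACGT\ns chr3 7 4 + 200 ACGT\n' | add_prefixes)
printf '%s\n' "$demo"
```

Calling finish_alignments in the directory that holds hg38-mm10-1.maf would then run the full sequence of steps.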