### Load library

In [1]:
from ma_mapper import sequence_alignment


# Sequence alignment wrapper
As explained on the main page, ma_mapper is a wrapper package that uses other bioinformatics packages for one specific task: mapping (overlaying) genome-wide data onto a multiple alignment of transposable elements (TEs). It is mostly compatible with input from external packages, but it also offers modules that streamline common tasks such as sequence alignment.

This package wraps MAFFT, so if MAFFT is installed it can be run from within ma_mapper.

## Extract coordinates from a RepeatMasker table
The sequence alignment function takes a file of TE sequences in FASTA format as input. A simple way to obtain DNA sequences from the human genome is to take TE coordinates in BED format and extract the corresponding sequences with SeqIO.

Use extract_coord_from_repeatmasker_table to extract TE coordinates from a RepeatMasker table.

In [2]:
coord_table=sequence_alignment.extract_coord_from_repeatmasker_table(
    subfamily = 'THE1C',
    repeatmasker_table = '/rds/project/rds-XrHDlpCeVDg/users/pakkanan/data/resource/repeatmasker_table/hg38_repeatlib2014/hg38.fa.out.tsv'
    )

In [3]:
coord_table

Unnamed: 0,genoName,genoStart,genoEnd,internal_id,score,strand
0,chr1,119563,119944,THE1C_0,10,-
1,chr1,296133,296514,THE1C_1,10,-
2,chr1,710552,710933,THE1C_2,10,-
3,chr1,1269889,1270250,THE1C_3,10,+
4,chr1,1610181,1610533,THE1C_4,10,+
...,...,...,...,...,...,...
10041,chrY,19242916,19243027,THE1C_10041,10,-
10042,chrY,21083462,21083833,THE1C_10042,10,-
10043,chrY,21293656,21293702,THE1C_10043,10,-
10044,chrY,21293746,21294046,THE1C_10044,10,-
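The columns of the coordinate table are in the same order as a BED6 record (chromosome, start, end, name, score, strand). The sketch below shows how such rows could be written out as tab-separated BED text, assuming the coordinates follow the BED convention of 0-based, half-open intervals. The helper `to_bed6` is hypothetical and not part of ma_mapper; the rows are copied from the output above.

```python
import csv
import io

# First rows of the coordinate table shown above (BED6 column order).
rows = [
    ("chr1", 119563, 119944, "THE1C_0", 10, "-"),
    ("chr1", 296133, 296514, "THE1C_1", 10, "-"),
    ("chr1", 1269889, 1270250, "THE1C_3", 10, "+"),
]

def to_bed6(rows):
    """Hypothetical helper: serialise BED6 tuples as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

bed_text = to_bed6(rows)
print(bed_text, end="")

# Assumption: intervals are 0-based and half-open, so length = end - start.
lengths = {name: end - start for _, start, end, name, _, _ in rows}
print(lengths)
```

Under that assumption, THE1C_0 spans 381 bp and THE1C_3 spans 361 bp.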


Extract the TE sequences from the human genome sequence and save them to a FASTA file.

In [4]:
te_seqeunces=sequence_alignment.sequence_io(
    coordinate_table=coord_table,
    source_fasta='/rds/project/rds-XrHDlpCeVDg/users/pakkanan/data/resource/human_genome_fasta/hg38_fasta/hg38.fa',
    save_to_file='/rds/project/rds-XrHDlpCeVDg/users/pakkanan/data/resource/repeatmasker_table/THE1C.fa')

In [5]:
te_seqeunces

[SeqRecord(seq=Seq('tgatatggttaggctttgtatccccacctgaatctcatcttgaattgtaatccc...aca'), id='THE1C_0::chr1:119563-119944(-)', name='', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('tgatatggttaggctttgtatccccacctgaatctcgtcttgaattgtaatccc...aca'), id='THE1C_1::chr1:296133-296514(-)', name='', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('tgatatggttaggctttgtatccccacctgaatctcgtcttgaattgtaatccc...aca'), id='THE1C_2::chr1:710552-710933(-)', name='', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('tgatatgctttggctatgtccccacccaaatcttatattgacttgtaatcccca...aca'), id='THE1C_3::chr1:1269889-1270250(+)', name='', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('tgatacagtttggctgtgtccccatccaaatctcatcttgcatttcccacaatc...aca'), id='THE1C_4::chr1:1610181-1610533(+)', name='', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('atatggtttggctgtgtccccacaaaatctctcttgaattgtagttcccataat...gca'), id='THE1C_5::chr1:2954589-2954963(-)', name='', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('tgatacagtttggctgtgtc
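The record identifiers above combine the TE name, the genomic interval, and the strand. As an illustration of what strand-aware extraction involves, here is a minimal pure-Python sketch. It assumes 0-based, half-open coordinates and that minus-strand elements are reverse-complemented; both are assumptions about the behaviour, not a description of how ma_mapper is implemented. `extract_interval`, `record_id`, and the toy genome are hypothetical.

```python
# Complement table covering both lower- and upper-case bases.
COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")

def extract_interval(genome, chrom, start, end, strand):
    """Return the sequence of a 0-based, half-open interval, strand-aware."""
    seq = genome[chrom][start:end]
    if strand == "-":
        # Reverse-complement so the sequence reads 5'->3' on the TE's strand.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq

def record_id(name, chrom, start, end, strand):
    """Build an identifier in the same style as the records shown above."""
    return f"{name}::{chrom}:{start}-{end}({strand})"

# Toy genome for illustration only, not real hg38 sequence.
genome = {"chrToy": "aaccggttacgt"}

for name, strand in (("TE_0", "+"), ("TE_1", "-")):
    print(record_id(name, "chrToy", 2, 8, strand),
          extract_interval(genome, "chrToy", 2, 8, strand))
```

In the toy example the plus-strand interval is `ccggtt` and the same interval on the minus strand is `aaccgg`.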

Align the TE sequences using the MAFFT wrapper.

MAFFT arguments such as nthread, nthreadtb, and nthreadtit can be passed directly; any other MAFFT options can be supplied as an extra command string via mafft_arg=.

In [6]:
sequence_alignment.mafft_align(
    input_filepath='/rds/project/rds-XrHDlpCeVDg/users/pakkanan/data/resource/repeatmasker_table/THE1C.fa',
    nthread=6,
    output_filepath='/rds/project/rds-XrHDlpCeVDg/users/pakkanan/data/resource/repeatmasker_table/THE1C.align'
)


nthread = 6
stacksize: -1 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00



Making a distance matrix ..
 10001 / 10046 (thread    0)
done.

Constructing a UPGMA tree (treeout, efffree=0) ... 
 10040 / 10046
done.

Progressive alignment 1/2... 
STEP  1773 / 10045 (thread    3)df
Reallocating..done. *alloclen = 2027
STEP  3940 / 10045 (thread    1)dd
Reallocating..done. *alloclen = 3148
STEP  4337 / 10045 (thread    3)d
Reallocating..done. *alloclen = 4337
STEP  6168 / 10045 (thread    0)d
Reallocating..done. *alloclen = 5702
STEP  7455 / 10045 (thread    3)d
Reallocating..done. *alloclen = 6917
STEP  8429 / 10045 (thread    3)f
Reallocating..done. *alloclen = 8467
STEP  9732 / 10045 (thread    5)d
Reallocating..done. *alloclen = 9863
STEP  10045 / 10045 (thread    4)d
done.

Making a distance matrix from msa.. 
 10000 / 10046 (thread    3)
done.

Constructing a UPGMA tree (treeout, efffree=1) ... 
 10040 / 10046
done.

Progressive al
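To make the relationship between the wrapper arguments and the MAFFT command line more concrete, the sketch below assembles a MAFFT invocation from thread settings and a free-form option string. MAFFT's own multithreading options are `--thread`, `--threadtb`, and `--threadit`, and MAFFT prints the alignment to standard output. The helpers `build_mafft_command` and `run_mafft`, and their parameter names, are hypothetical; this is not ma_mapper's actual implementation.

```python
import shlex
import subprocess

def build_mafft_command(input_filepath, nthread=None, nthreadtb=None,
                        nthreadit=None, mafft_arg=""):
    """Hypothetical helper: assemble a MAFFT command line as a token list."""
    cmd = ["mafft"]
    # Only add a thread option when the caller set it.
    for flag, value in (("--thread", nthread),
                        ("--threadtb", nthreadtb),
                        ("--threadit", nthreadit)):
        if value is not None:
            cmd += [flag, str(value)]
    # Free-form extra options, for example "--reorder".
    cmd += shlex.split(mafft_arg)
    cmd.append(input_filepath)
    return cmd

def run_mafft(input_filepath, output_filepath, **kwargs):
    """Run MAFFT (must be installed) and redirect its stdout to a file."""
    cmd = build_mafft_command(input_filepath, **kwargs)
    with open(output_filepath, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)

print(build_mafft_command("THE1C.fa", nthread=6, mafft_arg="--reorder"))
# ['mafft', '--thread', '6', '--reorder', 'THE1C.fa']
```

Building the command as a list rather than a shell string avoids quoting problems when file paths contain spaces.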

In [7]:
!more /rds/project/rds-XrHDlpCeVDg/users/pakkanan/data/resource/repeatmasker_table/THE1C.align

>THE1C_0::chr1:119563-119944(-)
-----tgat-a-t-g----g----t--ta------g-----g---------c-t------
-t----t------g---------------t----a-------t--------------c--
------------c--------------c----------c-----a---------------
------------c--------------------c--------------t-----------
----------------g-a-----a------------------------------t---c
-------t-------------------c--------------------------------
------------a-t-----------c---------t------------t----------
--------------g-----------------------a-----a---------------
t-------------t--------------------------------gtaatc------c
ccat------------a---------g---------t-----------------------
-----------c---------c--------------------------------------
----------------------------------c-------------------------
---c-------------------a----------------------------t-------
-------------a---------a------------t----------c------------
----c------------------c---------c---------a----------------
----------------------------------c------------------

The output alignment file (FASTA format) is ready to use in downstream analyses.
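Before downstream use, it can be worth checking that the alignment parses cleanly. The sketch below reads a wrapped, gapped FASTA alignment like the one shown above, confirms that all rows have the same width, and computes the fraction of non-gap characters per column. `read_fasta` is a hypothetical stdlib-only helper, and the toy alignment stands in for the real file.

```python
def read_fasta(lines):
    """Parse FASTA lines into {header: sequence}, joining wrapped lines."""
    records, header = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

# Toy alignment for illustration; a real file could be passed as an open handle.
toy = """>TE_0
--tg
at-a
>TE_1
ctg-
atca
""".splitlines()

alignment = read_fasta(toy)

# Every row of a multiple alignment should have the same length.
widths = {len(seq) for seq in alignment.values()}
assert len(widths) == 1, "alignment rows differ in length"

# Fraction of sequences with a base (not a gap) in each column.
n_rows = len(alignment)
coverage = [sum(base != "-" for base in column) / n_rows
            for column in zip(*alignment.values())]
print(coverage)
```

For a real alignment, `read_fasta(open(path))` would work the same way, since the function only iterates over lines.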