# 01 - Sequence preparation 

## Summary

In this notebook, we prepare the sequences for large-scale structure prediction. We will predict almost 3k structures, so we need some optimizations both to speed up the process and to avoid being IP-blocked by the MMseqs2 server.

We have the following groups of sequences:
- all extant seqs in tree, n = 385
- "_map" anc seqs (most probable), n = 384
- "_altall" anc seqs (next best res at ambiguous sites), n = 384
- "_alt2" to "_alt5" anc seqs (random sampled res at ambiguous sites), n = 1536

For the extant sequences, there is not much to do apart from sending them sequentially or using the local HH-suite/MMseqs2 solutions, which I have so far failed to get working. For the ancestors, we will group them by parent sequence so that we can reuse the initial MMseqs2 search.
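As a quick sanity check on the group sizes listed in the summary, the arithmetic can be spelled out as below. The numbers are copied from the list, not recomputed from the alignment files, and the idea that every ancestor carries six variants (map, altall, alt2 to alt5) is an inference from those counts.

```python
# Group sizes as quoted in the summary list (not recomputed from the data).
n_extant = 385
n_ancestors = 384

group_sizes = {
    "extant": n_extant,
    "map": n_ancestors,
    "altall": n_ancestors,
    "alt2-alt5": 4 * n_ancestors,  # four sampled variants per ancestor
}

n_ancestral = sum(size for name, size in group_sizes.items() if name != "extant")
total = sum(group_sizes.values())

print(group_sizes["alt2-alt5"])  # 1536
print(n_ancestral)               # 2304
print(total)                     # 2689
```

The 2304 ancestral sequences agree with the length of the `parent` column reported later in the notebook.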

## Methods

In [1]:
import pandas as pd
from Bio import AlignIO as alio
from Bio import SeqIO as sqio
import subprocess
import tqdm
import numpy as np
import tarfile
import os

In [2]:
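def alignment_to_table(file):
    """Read a FASTA alignment into a table of ids and ungapped sequences."""
    out = []
    # Passing the path (not an open handle) lets Biopython close the file itself.
    for seq in alio.read(file, 'fasta'):
        out.append(dict(id=seq.id, seq=str(seq.seq).replace('-', '')))
    return pd.DataFrame.from_records(out)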

In [3]:
aln_D = alignment_to_table('../sequences/AGNifAlign105.ext-anc.alt.D.fasta')
aln_K = alignment_to_table('../sequences/AGNifAlign105.ext-anc.alt.K.fasta')
aln_H = alignment_to_table('../sequences/AGNifAlign105.ext-anc.alt.H.fasta')

In [4]:
aln_DK = pd.merge(aln_D, aln_K, on='id', how='inner', suffixes=['_D', '_K'])
aln_DKH = pd.merge(aln_DK, aln_H, on='id', how='inner').rename(columns={'seq': 'seq_H'})
# The inner merges must not drop any id.
assert len(aln_DKH) == len(aln_D)
aln_DKH

Unnamed: 0,id,seq_D,seq_K,seq_H
0,1207_alt4,MSKKEEKEELIEEILDVYPEKARKNREKHIAVNDPDSGQCAVKSNV...,ASKEEVEKVLEWTKTEEYKEKNFKRKALVINPAKACQPLGAVLAAL...,MRQIAIYGKGGIGKSTTTQNTVAALAEMGKKIMIVGCDPKADSTRL...
1,1207_alt5,MSENEERKEIIEEVLEVYPEKARKNRKKHLAVNDPDAASCAVKSNV...,ASAEEVQKVKDWTNTEEYKEKNFKRKALVINPAKACQPLGAVLAAL...,MRQIAIYGKGGIGKSTTTQNTVAALAEMGKKVMIVGCDPKADSTRL...
2,1207_map,MSEKEETQKLIEEVLEVYPEKARKNRKKHIAVNDPEASSCAVKSNV...,CTKEEVEKVADWTNTEEYKEKNFKRKALVINPAKACQPLGAVLAAL...,MRQIAIYGKGGIGKSTTTQNTVAALAEMGKKVMIVGCDPKADSTRL...
3,1207_alt2,MSEDEQSKKLVEEVLNVYPEKARKNRAKHVAVNDPDAGSCVVKSNV...,HTPEEVERVKDWTNTEEYKEKNFARKALVINPAKACQPLGAMLAAL...,MRQIAIYGKGGIGKSTTTQNTVAALAEMGKKVMIVGCDPKADSTRL...
4,1207_alt3,MSTKEQTQKIVEEVLEIYPEKARKNRRKHLAVNDPGANSCSVKSNV...,HTKEEVQEVAEWTNTEEYKEKNFARKALVINPAKACQPLGALLAAL...,MRQIAIYGKGGIGKSTTTQNTVAALAEMGKKVMIVGCDPKADSTRL...
...,...,...,...,...
2684,Nif_archaeon_BMS3Bbin15,MLLKCDKTIPERKKHIVIKGENGCGGDSSGCEIACNVPTTPGDMTE...,MSIVTKQNRAVAINPTRSCAPIGAMLANYGVHGALTINHGSQGCAT...,MRQVAFYGKGGIGKSTTQQNTAASLARIGKKIMVVGCDPKADCTRL...
2685,Nif_Candidatus_Viridilinea_mediisalina,MKLKCNATLPDRALHIALKTSEGGCRRGDGTDCFIASNSATTPGDM...,MSCVTTQDRAVAINPTRSCAPIGAMLANYGIHGAITINHGSQGCAT...,MRQVAFYGKGGIGKSTTQQNTAAALASMGNKLMVVGCDPKADCTRL...
2686,Nif_Chloroflexales_bacterium_ZM16-3,MELKSSTTIPERAQHIALKVEGGKCQRGDGAGCAIVSNSATTPGDM...,MSCVTTQDRAVSINPTRSCAPIGAMLANYGIHGAITINHGSQGCAT...,MRQIAFYGKGGIGKSTTQQNTAAALASMGNKIMVVGCDPKADCTRL...
2687,Nif_Oscillochloris_trichoides,MQFKCNETLPERGTHIALKVAGGGCQRGDGTSCGIVSNSATTPGDM...,MSCVTLQDRAVAINPTRSCAPIGAMLANYGIHGAITINHGSQGCAT...,MRQVAFYGKGGIGKSTTQQNTAAAFASMGNKLMVVGCDPKADCTRL...


Now we will rename the ancestor ids to include the tag "Anc_" in front of them.

In [5]:
def place_anc_tag(x):
    # Ancestor ids start with a node number (e.g. '1207_map'); extant ids do not.
    try:
        int(x.split('_')[0])
        return 'Anc_' + x
    except ValueError:
        return x

aln_DKH['id'] = aln_DKH['id'].apply(place_anc_tag)
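To make the renaming rule concrete, here is a small sketch of an equivalent helper that avoids the `try`/`except`. The name `place_anc_tag_alt` is hypothetical and is not used elsewhere in the notebook; the two ids are taken from the table above.

```python
def place_anc_tag_alt(seq_id: str) -> str:
    """Prefix 'Anc_' when the first '_'-separated field is a node number."""
    first_field = seq_id.split("_")[0]
    # isdigit() is True only when every character is a digit,
    # so species prefixes such as 'Nif' are left untouched.
    return f"Anc_{seq_id}" if first_field.isdigit() else seq_id


print(place_anc_tag_alt("1207_alt4"))
print(place_anc_tag_alt("Nif_Oscillochloris_trichoides"))
```

The first call should add the tag and the second should return the id unchanged.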

Now we will build all the sequences for prediction.

In [6]:
aln_DKH['DDKK'] = aln_DKH.apply(lambda x: x['seq_D'] + ':' + x['seq_K'] + ':' + x['seq_D'] + ':' + x['seq_K'], axis=1)
aln_DKH['HH'] = aln_DKH.apply(lambda x: x['seq_H'] + ':' + x['seq_H'], axis=1)
aln_DKH.loc[15].HH

'MRQIAIYGKGGIGKSTTTQNTVAGLASLGKKVMIVGCDPKADSTRLILHAKAQATVMDKVRELGTVEDLELEDVLKRGYGDVKCVESGGPEPGVGCAGRGVITAINFLEEEGAYTPDLDYVFYDVLGDVVCGGFAMPIRENKAQEIYIVVSGEMMAMYAANNICKGIVKYASSGSVRLAGLICNSRNTDREADLIEALAKRLGTQMIHFVPRDNQVQRAELRRMTVIEYSPEHKQAEEYRQLAQKIADNKMFVVPTPLEMDELEDLLMEFGIMEAEDESIVGKAENA:MRQIAIYGKGGIGKSTTTQNTVAGLASLGKKVMIVGCDPKADSTRLILHAKAQATVMDKVRELGTVEDLELEDVLKRGYGDVKCVESGGPEPGVGCAGRGVITAINFLEEEGAYTPDLDYVFYDVLGDVVCGGFAMPIRENKAQEIYIVVSGEMMAMYAANNICKGIVKYASSGSVRLAGLICNSRNTDREADLIEALAKRLGTQMIHFVPRDNQVQRAELRRMTVIEYSPEHKQAEEYRQLAQKIADNKMFVVPTPLEMDELEDLLMEFGIMEAEDESIVGKAENA'
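The complexes above are plain strings in which chains are joined by `:`. A minimal sketch of that construction follows; `build_complex` is a hypothetical helper, the sequences are toy placeholders, and the `:` separator is assumed to be the chain delimiter expected by the downstream prediction tool, which is inferred from the cell above.

```python
def build_complex(chains, copies=1):
    """Join chain sequences with ':' and repeat the whole set `copies` times."""
    return ":".join(list(chains) * copies)


# Toy placeholders, not real NifD/NifK/NifH sequences.
seq_d, seq_k, seq_h = "MSEK", "CTKE", "MRQI"

ddkk = build_complex([seq_d, seq_k], copies=2)  # D:K:D:K order, as in the DDKK column
hh = build_complex([seq_h], copies=2)           # homodimer, as in the HH column

print(ddkk)
print(hh)
```

Splitting on `:` recovers the individual chains, which is a cheap way to confirm the chain count before submitting a job.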

We will split our dataset into extant and ancestral, as they require very different treatments.

In [7]:
aln_DKH['type'] = aln_DKH['id'].apply(lambda x: x.split('_')[0])
aln_DKH['type'].unique()

array(['Anc', 'Anf', 'Vnf', 'Nif'], dtype=object)

There are no G subunits in the dataset, and yet we have Vnf and Anf sequences... Let's try to avoid dealing with those for now.

In [8]:
aln_DKH_nif = aln_DKH.query('type == "Nif"').copy()
aln_DKH_anc = aln_DKH.query('type == "Anc"').copy()
aln_DKH_alt = aln_DKH.query('type != "Anc" and type != "Nif"').copy()
# aln_DKH_alt
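The three-way split above can be checked for completeness: every row should land in exactly one of the subsets. The sketch below does this on a toy table; the ids are made up to mimic the naming scheme and are not from the dataset.

```python
import pandas as pd

# Toy ids mimicking the naming scheme; the species names are invented.
toy = pd.DataFrame(
    {"id": ["Anc_1207_map", "Anc_1207_alt2", "Nif_sp_A", "Vnf_sp_B", "Anf_sp_C"]}
)
toy["type"] = toy["id"].str.split("_").str[0]

is_anc = toy["type"] == "Anc"
is_nif = toy["type"] == "Nif"
parts = {
    "anc": toy[is_anc],
    "nif": toy[is_nif],
    "alt": toy[~is_anc & ~is_nif],  # everything else, here Vnf and Anf
}

counts = toy["type"].value_counts().to_dict()
# The subsets are disjoint by construction, so their sizes must add up.
assert sum(len(part) for part in parts.values()) == len(toy)
print(counts)
```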

### Processing ancestors

The ancestors require extra work, as we need to organize all the sequences by parent sequence to enable the later re-alignment.

In [9]:
aln_DKH_anc['parent'] = aln_DKH_anc['id'].apply(lambda x: '_'.join(x.split('_')[:2]))
aln_DKH_anc['parent']

0       Anc_1207
1       Anc_1207
2       Anc_1207
3       Anc_1207
4       Anc_1207
          ...   
2299    Anc_1534
2300    Anc_1534
2301    Anc_1525
2302     Anc_821
2303     Anc_821
Name: parent, Length: 2304, dtype: object
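Once a `parent` column exists, grouping by it shows which variants share a parent and therefore could share one initial search. The sketch below uses a toy table with illustrative ids; on the real table one would expect six variants per parent (2304 rows over 384 ancestors), which is an assumption derived from the counts in the summary.

```python
import pandas as pd

# Illustrative ids only; the real table has many more parents and variants.
toy = pd.DataFrame(
    {"id": ["Anc_1207_map", "Anc_1207_altall", "Anc_1207_alt2",
            "Anc_821_map", "Anc_821_alt2"]}
)
fields = toy["id"].str.split("_")
toy["parent"] = fields.str[:2].str.join("_")  # 'Anc_1207_map' -> 'Anc_1207'
toy["variant"] = fields.str[-1]               # 'Anc_1207_map' -> 'map'

variants_per_parent = toy.groupby("parent")["variant"].apply(list).to_dict()
# Each parent should have exactly one most-probable ('map') sequence.
map_counts = (toy["variant"] == "map").groupby(toy["parent"]).sum()

print(variants_per_parent)
print(bool((map_counts == 1).all()))
```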

## DDKK

In [12]:
aln_DKH_nif_batches = np.array_split(aln_DKH_nif, 10)
for i, batch in enumerate(aln_DKH_nif_batches):
    print(f'processing ddkk-extant.{i}.csv')
    batch[['id', 'DDKK']].rename(columns={'DDKK': 'sequence'}).to_csv(f'input/ddkk-extant.{i}.csv', index=False)

processing ddkk-extant.0.csv
processing ddkk-extant.1.csv
processing ddkk-extant.2.csv
processing ddkk-extant.3.csv
processing ddkk-extant.4.csv
processing ddkk-extant.5.csv
processing ddkk-extant.6.csv
processing ddkk-extant.7.csv
processing ddkk-extant.8.csv
processing ddkk-extant.9.csv
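For reference, `np.array_split` produces nearly equal batches and never drops rows, even when the row count is not divisible by the number of batches. The sketch below splits row positions rather than the DataFrame itself; newer pandas releases may emit a deprecation warning when a DataFrame is passed directly, so this is one possible workaround, not what the notebook does. The row count of 385 is only illustrative (the extant count quoted in the summary).

```python
import numpy as np

n_rows, n_batches = 385, 10  # illustrative values, not read from the data

positions = np.arange(n_rows)
batches = np.array_split(positions, n_batches)
sizes = [len(batch) for batch in batches]

print(sizes)  # [39, 39, 39, 39, 39, 38, 38, 38, 38, 38]
# Every row position appears exactly once, in the original order.
assert np.array_equal(np.concatenate(batches), positions)
```

Each array of positions could then be used as `df.iloc[batch]` to obtain the corresponding slice of a DataFrame.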


In [15]:
aln_DKH_anc['variant'] = aln_DKH_anc['id'].apply(lambda x: x.split('_')[-1])
aln_DKH_anc.query('variant == "map"')

Unnamed: 0,id,seq_D,seq_K,seq_H,DDKK,HH,type,parent,variant
2,Anc_1207_map,MSEKEETQKLIEEVLEVYPEKARKNRKKHIAVNDPEASSCAVKSNV...,CTKEEVEKVADWTNTEEYKEKNFKRKALVINPAKACQPLGAVLAAL...,MRQIAIYGKGGIGKSTTTQNTVAALAEMGKKVMIVGCDPKADSTRL...,MSEKEETQKLIEEVLEVYPEKARKNRKKHIAVNDPEASSCAVKSNV...,MRQIAIYGKGGIGKSTTTQNTVAALAEMGKKVMIVGCDPKADSTRL...,Anc,Anc_1207,map
9,Anc_1213_map,MSEKKVKPVEGITKERTEKLIDETLEAYPEKARKKRAPHLAANDPA...,MSEAAAVVKKVTETTPEEVERVKEWINTEEYKEKNFAREALVINPA...,MRQIAIYGKGGIGKSTTTQNTVAGLASLGKKVMIVGCDPKADSTRL...,MSEKKVKPVEGITKERTEKLIDETLEAYPEKARKKRAPHLAANDPA...,MRQIAIYGKGGIGKSTTTQNTVAGLASLGKKVMIVGCDPKADSTRL...,Anc,Anc_1213,map
18,Anc_1215_map,MSERKPIKGVTTERTEKLIDETLAEMPEKAQKKRAPHLGANDPSAS...,MSAEAAVKKVTEHTPEEIERVKEWINTEEYKEKNFAREALVVNPAH...,MRQIAIYGKGGIGKSTTTQNTVAGLASLGKKVMIIGCDPKADSTRL...,MSERKPIKGVTTERTEKLIDETLAEMPEKAQKKRAPHLGANDPSAS...,MRQIAIYGKGGIGKSTTTQNTVAGLASLGKKVMIIGCDPKADSTRL...,Anc,Anc_1215,map
21,Anc_1214_map,MSERKPIKGVTKERTEKLIDETLAEMPEKAQKKRAPHLAANDPSAS...,MSAEAAVVKKVTEHTPEEIERVKEWINTEEYKEKNFAREALVVNPA...,MRQIAIYGKGGIGKSTTTQNTVAGLASLGKKVMIIGCDPKADSTRL...,MSERKPIKGVTKERTEKLIDETLAEMPEKAQKKRAPHLAANDPSAS...,MRQIAIYGKGGIGKSTTTQNTVAGLASLGKKVMIIGCDPKADSTRL...,Anc,Anc_1214,map
29,Anc_1216_map,MSEKIKKVDGITKESTQAMIDKTLEAYPEKARKKRAPHLAPNDQAS...,MANALGLEVKPVTETTPEEVERVKNWINTEEYKEKNFARQALVINP...,MRQIAIYGKGGIGKSTTTQNTVAGLASLGKKVMIVGCDPKADSTRL...,MSEKIKKVDGITKESTQAMIDKTLEAYPEKARKKRAPHLAPNDQAS...,MRQIAIYGKGGIGKSTTTQNTVAGLASLGKKVMIVGCDPKADSTRL...,Anc,Anc_1216,map
...,...,...,...,...,...,...,...,...,...
2273,Anc_1533_map,MLLKCDKTIPERKKHIVIKGENGCGKGDGSPCEIACNVPTTPGDMT...,MSAITKQKRAVAINPARSCAPIGAMLAAMGVHGAITIVHGSQGCAT...,MRQIAFYGKGGIGKSTTQQNTAAALASMGNKIMVVGCDPKADCTRL...,MLLKCDKTIPERKKHIVIKGENGCGKGDGSPCEIACNVPTTPGDMT...,MRQIAFYGKGGIGKSTTQQNTAAALASMGNKIMVVGCDPKADCTRL...,Anc,Anc_1533,map
2283,Anc_1535_map,MQLKCNQTLPERATHIALKGEDGKCQRGDGTGCFIASNVATTPGDM...,MSCVTTQDRAVAINPTRSCAPIGAMLANYGIHGAITINHGSQGCAT...,MRQVAFYGKGGIGKSTTQQNTAAALASMGNKLMVVGCDPKADCTRL...,MQLKCNQTLPERATHIALKGEDGKCQRGDGTGCFIASNVATTPGDM...,MRQVAFYGKGGIGKSTTQQNTAAALASMGNKLMVVGCDPKADCTRL...,Anc,Anc_1535,map
2286,Anc_1537_map,MQLKCNETLPERATHIALKVAGGGCQRGDGTGCFIVSNSATTPGDM...,MSCVTTQDRAVAINPTRSCAPIGAMLANYGIHGAITINHGSQGCAT...,MRQVAFYGKGGIGKSTTQQNTAAALASMGNKLMVVGCDPKADCTRL...,MQLKCNETLPERATHIALKVAGGGCQRGDGTGCFIVSNSATTPGDM...,MRQVAFYGKGGIGKSTTQQNTAAALASMGNKLMVVGCDPKADCTRL...,Anc,Anc_1537,map
2294,Anc_1536_map,MQLKCNETLPERAQHIALKVEGGKCQRGDGTGCFIVSNSATTPGDM...,MSCVTTQDRAVAINPTRSCAPIGAMLANYGIHGAITINHGSQGCAT...,MRQVAFYGKGGIGKSTTQQNTAAALASMGNKLMVVGCDPKADCTRL...,MQLKCNETLPERAQHIALKVEGGKCQRGDGTGCFIVSNSATTPGDM...,MRQVAFYGKGGIGKSTTQQNTAAALASMGNKLMVVGCDPKADCTRL...,Anc,Anc_1536,map


In [16]:
aln_DKH_anc_batches = np.array_split(aln_DKH_anc.query('variant == "map"'), 10)
for i, batch in enumerate(aln_DKH_anc_batches):
    print(f'processing ddkk-ancestral.{i}.csv')
    batch[['id', 'DDKK']].rename(columns={'DDKK': 'sequence'}).to_csv(f'input/ddkk-ancestral.{i}.csv', index=False)

processing ddkk-ancestral.0.csv
processing ddkk-ancestral.1.csv
processing ddkk-ancestral.2.csv
processing ddkk-ancestral.3.csv
processing ddkk-ancestral.4.csv
processing ddkk-ancestral.5.csv
processing ddkk-ancestral.6.csv
processing ddkk-ancestral.7.csv
processing ddkk-ancestral.8.csv
processing ddkk-ancestral.9.csv


In [36]:
aln_DKH_alt[['id', 'DDKK']].rename(columns=dict(DDKK='sequence')).to_csv('./input/ddkk-alternative.csv', index=None)

## HH

In [12]:
aln_DKH['variant'] = aln_DKH.id.apply(lambda x: x.split('_')[-1] if x[:3] == 'Anc' else 'map')
aln_DKH_map = aln_DKH.query('variant == "map"')

In [13]:
aln_DKH_map_batches = np.array_split(aln_DKH_map, 21)
for i, batch in enumerate(aln_DKH_map_batches):
    print(f'processing hh.{i}.csv')
    batch[['id', 'HH']].rename(columns={'HH': 'sequence'}).to_csv(f'input/hh.{i}.csv', index=False)

processing hh.0.csv
processing hh.1.csv
processing hh.2.csv
processing hh.3.csv
processing hh.4.csv
processing hh.5.csv
processing hh.6.csv
processing hh.7.csv
processing hh.8.csv
processing hh.9.csv
processing hh.10.csv
processing hh.11.csv
processing hh.12.csv
processing hh.13.csv
processing hh.14.csv
processing hh.15.csv
processing hh.16.csv
processing hh.17.csv
processing hh.18.csv
processing hh.19.csv
processing hh.20.csv
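Each batch file written above has two columns, `id` and `sequence`. Before submitting the batches, it may be worth reading a file back and counting chains per record. The sketch below does this on an in-memory CSV with toy rows, so it touches no files in `input/`; the ids and sequences are placeholders.

```python
import csv
import io

# Toy rows in the same two-column layout as the batch files.
rows = [
    {"id": "Anc_1207_map", "sequence": "MRQI:MRQI"},
    {"id": "Nif_sp_A", "sequence": "MRQV:MRQV"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "sequence"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back and count ':'-separated chains per record.
buffer.seek(0)
chains_per_id = {
    record["id"]: len(record["sequence"].split(":"))
    for record in csv.DictReader(buffer)
}
print(chains_per_id)
```

For the HH batches every record should report two chains, and for the DDKK batches four.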
