# Sampling the sequence space for mutations

In biological sequence spaces (DNA, RNA, and proteins), the search space is vast and high-dimensional. The further a sequence mutates away from its original, the less predictable the behavior of the system becomes: mutations layer on top of each other at the sequence level in ways that translate non-linearly to higher-order structures, a phenomenon known as epistasis. Exploring the local mutational landscape therefore requires both a well-chosen set of starting sequences and sufficient sampling of their surroundings.
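
To give a sense of scale, the number of sequences at a given Hamming distance from a starting sequence can be counted directly. The sketch below assumes substitutions only (no insertions or deletions) and a four-letter alphabet; the function name `count_neighbours` is illustrative and not part of the codebase.

```python
from math import comb


def count_neighbours(length: int, distance: int, alphabet_size: int = 4) -> int:
    """Count sequences exactly `distance` substitutions away from a fixed sequence."""
    # Choose which positions change, then one of the other letters at each of them
    return comb(length, distance) * (alphabet_size - 1) ** distance


length = 20
for d in (1, 2, 3, 10):
    print(f'distance {d}: {count_neighbours(length, d)} sequences')
print(f'whole space: {4 ** length} sequences')
```

For a 20-nucleotide sequence there are only 60 single mutants and 1,710 double mutants, but the full space holds 4^20 (roughly 1.1 × 10^12) sequences, so sampling has to concentrate on the neighbourhood of chosen starting points.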

Finding good starting sequences comes with a few caveats. The goal of a genetic circuit is to actuate some behavior through the interactions of its components in response to a signal, so sequences that combine weak and strong interactions are desirable. However, the vast majority of sequence space is non-functional in this sense: most pairs of sequences bind each other only extremely weakly. Finding sequences that bind each other strongly is therefore a prerequisite, but the sampled set of circuits should still be diverse enough to capture many different types of circuit.

For RNA and DNA, complementarity can be used to guarantee binding. Random sampling of the sequence space can be followed by induced complementarity, for example by designating one circuit component as a template and then reserving portions of the other components to be complementary to the template strand. The degree and patterning of this induced complementarity can be varied depending on the number of strands and the level of intervention desired. The `SeqGenerator` class therefore has four different ways to generate the components of a genetic circuit, termed 'protocols': `random`, `template_mix`, `template_mutate`, and `template_split`. Their differences are demonstrated below.
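
As a rough illustration of induced complementarity, the sketch below draws random RNA strands and then overwrites a window of one strand with the complement of the same window in a template. It is not the `SeqGenerator` implementation: the helper names, the window bounds, and the use of a plain (non-reversed) complement are all assumptions made for the example.

```python
import random

# Watson-Crick pairing for RNA, applied position by position (no reversal)
RNA_COMPLEMENT = str.maketrans('ACGU', 'UGCA')


def random_rna(length, rng):
    """Draw a uniformly random RNA sequence of the given length."""
    return ''.join(rng.choice('ACGU') for _ in range(length))


def induce_complementarity(component, template, start, end):
    """Replace component[start:end] with the complement of template[start:end]."""
    window = template[start:end].translate(RNA_COMPLEMENT)
    return component[:start] + window + component[end:]


rng = random.Random(0)  # fixed seed so the example is repeatable
template = random_rna(20, rng)
component = induce_complementarity(random_rna(20, rng), template, 5, 15)
print('template: ', template)
print('component:', component)
```

Widening or narrowing the window changes how much of the component is reserved for binding the template, which is one way to vary the degree of intervention.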

In [2]:
import sys
import os
import Bio.Seq  # import the submodule explicitly so Bio.Seq.complement_rna is available
import numpy as np


if __package__ is None:

    module_path = os.path.abspath(os.path.join('..'))
    sys.path.append(module_path)

    __package__ = os.path.basename(module_path)


from src.utils.common.setup_new import construct_circuit_from_cfg, prepare_config
from src.utils.data.data_format_tools.common import load_json_as_dict
from src.utils.data.data_format_tools.manipulate_fasta import load_seq_from_FASTA
from src.utils.data.fake_data_generation.seq_generator import RNAGenerator
from src.utils.evolution.evolver import Evolver
from src.utils.misc.type_handling import flatten_listlike, flatten_nested_dict
from src.utils.results.result_writer import ResultWriter


config = load_json_as_dict(os.path.join('..', 'tests', 'configs', 'simple_circuit.json'))


In [3]:
data_writer = ResultWriter(purpose='tests')

In [4]:
# Generate 10 circuits of 3 random 20-nucleotide RNA components each
paths_circuits = RNAGenerator(data_writer=data_writer).generate_circuits_batch(
    name='toy_RNA',
    num_circuits=10,
    num_components=3,
    slength=20,
    proportion_to_mutate=0.5,
    protocol='random',
    template=None)

# Load each generated FASTA file as a {component name: sequence} dict
samples = [load_seq_from_FASTA(list(p.values())[0], as_type='dict') for p in paths_circuits]

In [5]:
samples[0]

{'RNA_0': 'GGGGCGCUUCUGGUAAAUUU',
 'RNA_1': 'CGCAAAGCGUUAGCAUCCGU',
 'RNA_2': 'CUAUUCCAGAACAAAAAAAG'}

In [6]:
from Bio import Align

samples_names = sorted(set(flatten_listlike([list(s.keys()) for s in samples])))
aligner = Align.PairwiseAligner()
# Aligning a sequence against itself gives the maximum attainable score for its length
ref_alignments = aligner.align(samples[0][samples_names[0]], samples[0][samples_names[0]])
# Align every component against the complement of every component (itself included) in each circuit
alignments = [
    aligner.align(s[k1], Bio.Seq.complement_rna(s[k2]))
    for k2 in samples_names
    for k1 in samples_names
    for s in samples
]
print('Reference alignment (perfect complementarity): ', ref_alignments[0].score)
print('Alignment scores for complements in circuits: ')
print(np.unique([a.score for a in alignments]))
for a in alignments[:3]:
    print(a[0])

Reference alignment (perfect complementarity):  20.0
Alignment scores for complements in circuits: 
[ 6.  8.  9. 10. 11. 12. 13. 14.]
target            0 GGGGCGCUUCU-G-GUAA-A---UUU--- 20
                  0 ----|-|--|--|-|-||-|---|||--- 29
query             0 ----C-C--C-CGCG-AAGACCAUUUAAA 20

target            0 UUCUA-GAUGCCU-CG-AGGG-UG----- 20
                  0 ----|-|||-|-|-||-||---|------ 29
query             0 ----AAGAU-C-UACGGAG--CU-CCCAC 20

target            0 CCGAGGAAC-CGCUAAA-GA-C-A------- 20
                  0 --|-|---|-|-||----|--|-|------- 31
query             0 --G-G---CUC-CU---UG-GCGAUUUCUGU 20



Next, we compare the complementarity scores of the randomly generated sequences with those produced by the heuristic, complementarity-inducing protocols.
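
To read the scores in this comparison, it helps to know what the aligner is counting. The reference score of 20.0 for a 20-nucleotide self-alignment and the mismatch-free, heavily gapped alignments printed above are consistent with a scoring of 1 per match and 0 for mismatches and gaps. Under that assumption (which depends on the aligner settings and may differ between Biopython versions), the global score equals the length of the longest common subsequence between one strand and the plain, non-reversed complement of the other. A minimal pure-Python sketch, with illustrative function names:

```python
RNA_COMPLEMENT = str.maketrans('ACGU', 'UGCA')


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)  # extend a match
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a base in a or b
        prev = curr
    return prev[-1]


def complement_score(seq1, seq2):
    """Score seq1 against the position-wise complement of seq2."""
    return lcs_length(seq1, seq2.translate(RNA_COMPLEMENT))


print(complement_score('GGGGCGCUUCUGGUAAAUUU', 'CGCAAAGCGUUAGCAUCCGU'))
```

Because gaps are free in this scheme, even unrelated sequences score well above zero, which is why the scores are best interpreted relative to the reference score rather than in absolute terms.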

In [55]:
good_template = 'GCCCCGGGGCUCUCUAUACG'  # toy_mRNA_circuit_133814, ensemble_generate_circuits/2023_02_24_170946/generate_species_templates/circuits
bad_template = 'UAGCCCUUGAUAAGGGCUAA'   # ensemble_generate_circuits/2023_02_24_170946/generate_species_templates/circuits/toy_mRNA_circuit_0.fasta

templates = {'strong': good_template, 'weak': bad_template}
protocols = ['template_mix', 'template_mutate', 'template_split']
path_dict = {}
i = 0
for n, t in templates.items():
    path_dict[n] = {}
    for p in protocols:
        np.random.seed(i)
        i += 1
        num_circuits = 10 if p == 'template_mutate' else 1
        path_dict[n][p] = RNAGenerator(data_writer=data_writer).generate_circuits_batch(
            name=f'toy_RNA_{n}_{p}',
            num_circuits=num_circuits,
            num_components=3,
            slength=20,
            proportion_to_mutate=0.5,
            protocol=p,
            template=t)

# Load the generated sequences into a separate dict so that path_dict keeps the file paths
templated_samples = {}
for n, v in path_dict.items():
    templated_samples[n] = {}
    for prot, paths in v.items():
        templated_samples[n][prot] = [load_seq_from_FASTA(list(pv.values())[0], as_type='dict') for pv in paths]
templated_samples['strong']['template_mix'][:3]


[{'RNA_0': 'GGGCGCGCCCAGUGAAAUCC',
  'RNA_1': 'CCGGCCCGCGUGACAUUUGG',
  'RNA_2': 'CGCGGGCCGGACAGUUAAGC'}]

In [56]:

templated_alignments = {}
templated_alignments_flat = []
for n, v in templated_samples.items():
    templated_alignments[n] = {}
    for prot, seqs in v.items():
        templated_alignments[n][prot] = {}
        for i, s in enumerate(seqs):
            templated_alignments[n][prot][i] = {}
            for k1 in samples_names:
                templated_alignments[n][prot][i][k1] = {}
                for k2 in samples_names:
                    templated_alignments[n][prot][i][k1][k2] = aligner.align(s[k1], Bio.Seq.complement_rna(s[k2]))
                    templated_alignments_flat.append(templated_alignments[n][prot][i][k1][k2])
        # flatten_listlike(flatten_listlike([[[aligner.align(s[k1], Bio.Seq.complement_rna(
        #     s[k2])) for k1 in samples_names] for k2 in samples_names] for s in templated_samples[n][prot]]))



In [57]:

print('Alignment scores for complements in circuits: ')

print(np.unique([a.score for a in templated_alignments_flat]))

i = 0
print(templated_alignments['strong']['template_mix'][i]['RNA_0']['RNA_1'][0])
print(templated_alignments['strong']['template_mix'][i]['RNA_0']['RNA_2'][0])
print(templated_alignments['strong']['template_mix'][i]['RNA_1']['RNA_2'][0])

Alignment scores for complements in circuits: 
[ 7.  8.  9. 10. 11. 12. 13. 14. 15. 16.]
target            0 GGGC-G--CGCCCAG-UG-AAAUCC 20
                  0 ||-|-|--|||--|--||-|||-|| 25
query             0 GG-CCGGGCGC--A-CUGUAAA-CC 20

target            0 GGGCGCGCCCAG---UG--AAAU-CC- 20
                  0 |--|||-||--|---||--||-|-|-- 27
query             0 G--CGC-CC--GGCCUGUCAA-UUC-G 20

target            0 -CCGGCCCGCG--UGA-CA-UUU-GG 20
                  0 -|-|-||||-|--||--||-||--|- 26
query             0 GC-G-CCCG-GCCUG-UCAAUU-CG- 20



In [74]:
def convert_seqs_to_binary_complement(refseq, mutseq):
    # 1 where mutseq has the same base as refseq, 0 where it differs (e.g. was complemented)
    return (np.asarray(list(refseq)) == np.asarray(list(mutseq))) * 1

In [80]:
tmix = templated_samples['strong']['template_mix'][0]
print('Reference sequence: ', templates['strong'])
for name in ('RNA_0', 'RNA_1', 'RNA_2'):
    print('Mix sequence: ', tmix[name], '- pattern: ', str(convert_seqs_to_binary_complement(templates['strong'], tmix[name])))

Reference sequence:  GCCCCGGGGCUCUCUAUACG
Mix sequence:  GGGCGCGCCCAGUGAAAUCC - pattern:  [1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0]
Mix sequence:  CCGGCCCGCGUGACAUUUGG - pattern:  [0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1]
Mix sequence:  CGCGGGCCGGACAGUUAAGC - pattern:  [0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0]
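
The patterns printed above are consistent with an interleaved scheme: strand `i` keeps the template base at every position where `position % 3 == i` and carries the complement everywhere else. The sketch below is a reconstruction inferred from this output only, not the `template_mix` source code, and the function name is hypothetical.

```python
RNA_COMPLEMENT = str.maketrans('ACGU', 'UGCA')


def interleaved_complement(template, strand_index, num_strands):
    """Keep template bases where pos % num_strands == strand_index; complement the rest."""
    return ''.join(
        base if pos % num_strands == strand_index else base.translate(RNA_COMPLEMENT)
        for pos, base in enumerate(template)
    )


template = 'GCCCCGGGGCUCUCUAUACG'
for i in range(3):
    print(f'RNA_{i}:', interleaved_complement(template, i, 3))
```

With this layout, any two strands are complementary to each other at one position in three, spread evenly along their length.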


In [81]:
tsplit = templated_samples['strong']['template_split'][0]
print('Reference sequence: ', templates['strong'])
for name in ('RNA_0', 'RNA_1', 'RNA_2'):
    print('Split sequence: ', tsplit[name], '- pattern: ', str(convert_seqs_to_binary_complement(templates['strong'], tsplit[name])))

Reference sequence:  GCCCCGGGGCUCUCUAUACG
Split sequence:  CGGGGCGGGCUCUCUAUACG - pattern:  [0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Split sequence:  GCCCCGCCCGAGUCUAUACG - pattern:  [1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1]
Split sequence:  GCCCCGGGGCUCAGAUAUCG - pattern:  [1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1]
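
In contrast to the interleaved pattern of `template_mix`, the split patterns above are consistent with each strand copying the template and complementing one contiguous block of `len(template) // 3` bases, with the block shifted along for each strand. The sketch below is a reconstruction inferred from this output only, not the `template_split` source code, and the function name is hypothetical.

```python
RNA_COMPLEMENT = str.maketrans('ACGU', 'UGCA')


def block_complement(template, strand_index, num_strands):
    """Copy the template, complementing only the block assigned to this strand."""
    block = len(template) // num_strands  # 6 for a 20-nt template and 3 strands
    start = strand_index * block
    end = start + block
    return template[:start] + template[start:end].translate(RNA_COMPLEMENT) + template[end:]


template = 'GCCCCGGGGCUCUCUAUACG'
for i in range(3):
    print(f'RNA_{i}:', block_complement(template, i, 3))
```

Under this scheme the last two bases of the 20-nucleotide template fall outside every block and are shared unchanged by all three strands.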
