# Sampling the sequence space for mutations

In biological sequence spaces (DNA, RNA, and proteins), the search space is vast and high-dimensional. The further a biological sequence mutates away from its original sequence, the less predictable the final behavior of the system becomes: mutations layer on top of each other at the sequence level in a way that translates non-linearly to higher-order structures, a phenomenon known as epistasis. Exploring the local mutational landscape therefore requires both a well-chosen set of starting sequences and sufficient sampling of their surroundings.
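
To give a sense of scale, the size of the full sequence space and of a local mutational neighbourhood can be counted directly. The sketch below is only an illustration: the function names `sequence_space_size` and `neighbourhood_size` are hypothetical and not part of this repository, and the count assumes point substitutions only (no insertions or deletions).

```python
from math import comb


def sequence_space_size(length: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences of a given length (4 letters for RNA/DNA)."""
    return alphabet_size ** length


def neighbourhood_size(length: int, max_mutations: int, alphabet_size: int = 4) -> int:
    """Number of sequences within `max_mutations` substitutions of a reference.

    For exactly k substitutions there are C(length, k) ways to choose the
    positions and (alphabet_size - 1) alternative letters at each of them.
    The k = 0 term counts the reference sequence itself.
    """
    return sum(
        comb(length, k) * (alphabet_size - 1) ** k
        for k in range(max_mutations + 1)
    )


# A 20-nucleotide RNA: the whole space versus the 1- and 2-mutation neighbourhoods
print(sequence_space_size(20))    # 4**20 = 1099511627776
print(neighbourhood_size(20, 1))  # 1 + 20*3 = 61
print(neighbourhood_size(20, 2))  # 61 + 190*9 = 1771
```

Even for a short sequence, the neighbourhood within a couple of mutations is a vanishing fraction of the whole space, which is why the choice of starting sequences matters.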

Finding good starting sequences has a few caveats. The goal of a genetic circuit is to actuate some behavior through the interactions of its components in response to a signal, so sequences that combine weak and strong interactions are desirable. However, the vast majority of sequence space is non-functional: most pairs of sequences bind each other extremely weakly. Finding sequences that bind each other strongly is therefore a prerequisite, but the sampled circuits should still be diverse enough to capture many different types of circuits.

For RNA and DNA, complementarity can be used to guarantee binding. Random sampling of the sequence space can be followed by induced complementarity, for example by designating one circuit component as a template and then reserving portions of the other components to be complementary to the template strand. The degree and patterning of this induced complementarity can be varied depending on the number of strands and the level of intervention desired. The `SeqGenerator` class therefore has four different ways to generate the components of a genetic circuit, termed 'protocols'. Their differences are demonstrated below.
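
The idea of induced complementarity can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used by `SeqGenerator`: all function names here (`random_rna`, `induce_complementarity`, `generate_circuit`, `complementary_fraction`) are hypothetical, the first component is arbitrarily taken as the template, and complementarity is applied position by position, ignoring strand orientation for simplicity.

```python
import random

# Watson-Crick pairing for RNA
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def random_rna(length, rng):
    """Draw a uniformly random RNA sequence."""
    return "".join(rng.choice("ACGU") for _ in range(length))


def induce_complementarity(template, seq, proportion, rng):
    """Overwrite a random subset of positions in `seq` with the complement
    of the template base at the same position."""
    n_positions = round(proportion * len(template))
    positions = rng.sample(range(len(template)), n_positions)
    out = list(seq)
    for i in positions:
        out[i] = RNA_COMPLEMENT[template[i]]
    return "".join(out)


def generate_circuit(num_components, length, proportion, seed=0):
    """Random template plus components made partially complementary to it."""
    rng = random.Random(seed)
    template = random_rna(length, rng)
    circuit = {"RNA_0": template}
    for i in range(1, num_components):
        circuit[f"RNA_{i}"] = induce_complementarity(
            template, random_rna(length, rng), proportion, rng
        )
    return circuit


def complementary_fraction(a, b):
    """Fraction of positions at which b carries the complement of a."""
    return sum(RNA_COMPLEMENT[x] == y for x, y in zip(a, b)) / len(a)


circuit = generate_circuit(num_components=3, length=20, proportion=0.5)
for name, seq in circuit.items():
    print(name, seq, complementary_fraction(circuit["RNA_0"], seq))
```

Because the untouched positions are random, a component can end up more complementary to the template than the requested proportion, but never less. Varying `proportion`, or choosing contiguous rather than scattered positions, corresponds to changing the degree and patterning of the intervention.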

In [39]:
import sys
import os
import Bio
import numpy as np


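# When run as a notebook or script (no parent package), add the parent
# directory to sys.path so that the `src` package can be imported.
if __package__ is None:
    module_path = os.path.abspath(os.path.join('..'))
    sys.path.append(module_path)
    __package__ = os.path.basename(module_path)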


from src.utils.common.setup_new import construct_circuit_from_cfg, prepare_config
from src.utils.data.data_format_tools.common import load_json_as_dict
from src.utils.data.data_format_tools.manipulate_fasta import load_seq_from_FASTA
from src.utils.data.fake_data_generation.seq_generator import RNAGenerator
from src.utils.evolution.evolver import Evolver
from src.utils.misc.type_handling import flatten_listlike
from src.utils.results.result_writer import ResultWriter


config = load_json_as_dict(os.path.join('..', 'tests', 'configs', 'simple_circuit.json'))


In [None]:
data_writer = ResultWriter(purpose='tests')

In [15]:
# Generate 10 circuits, each with 3 RNA components of length 20,
# using the 'random' protocol (no template)
paths_circuits = RNAGenerator(data_writer=data_writer).generate_circuits_batch(
    name='toy_RNGA',
    num_circuits=10,
    num_components=3,
    slength=20,
    proportion_to_mutate=0.5,
    protocol='random',
    template=None)

# Load the sequences of each circuit from its FASTA file as a {name: sequence} dict
samples = [load_seq_from_FASTA(list(p.values())[0], as_type='dict')
           for p in paths_circuits]

In [19]:
samples[0]

{'RNA_0': 'GCAGAUUCUAUUGCAUCCCC',
 'RNA_1': 'CACUAAGGAAUCAAGCAGAA',
 'RNA_2': 'CGCCCCUUAGGGACGCAGUU'}

In [54]:
from Bio import Align
from Bio.Seq import complement_rna

samples_names = sorted(set(flatten_listlike([list(s.keys()) for s in samples])))
aligner = Align.PairwiseAligner()

# Reference: a sequence aligned with itself gives the maximum score
# (one point per base with the aligner's default scoring)
ref_alignments = aligner.align(samples[0][samples_names[0]], samples[0][samples_names[0]])

# Within each circuit, align every component against the complement of every component
alignments = [
    aligner.align(s[k1], complement_rna(s[k2]))
    for k2 in samples_names
    for k1 in samples_names
    for s in samples
]

print('Reference alignment (perfect complementarity): ', ref_alignments[0].score)
print('Alignment scores for complements in circuits: ')
# Distinct alignment scores across all pairs, kept for inspection
unique_scores = np.unique([a.score for a in alignments])
# Print the first optimal alignment found for each pair
for a in alignments:
    print(a[0])

Reference alignment (perfect complementarity):  20.0
Alignment scores for complements in circuits: 
target            0 GCAGAUUCUAUU-GCAU--CCCC------- 20
                  0 -|-|-|-|||---|-||--|---------- 30
query             0 -C-G-U-CUA--AG-AUAAC---GUAGGGG 20

target            0 UUGUAGGGU-CUA-CCGC---U--C-A-U- 20
                  0 ----|-----|-|-||-|---|--|-|-|- 30
query             0 ----A----AC-AUCC-CAGAUGGCGAGUA 20

target            0 UC-GGU--AUGAGACGGUCU-C--GU---- 20
                  0 ---|----||-|--|--|||-|--|----- 30
query             0 --AG--CCAU-A--C--UCUGCCAG-AGCA 20

target            0 GUGUUUCUAGCA--GA--GUAC----GC- 20
                  0 ------|-|-||--||--||-|----||- 29
query             0 ------C-A-CAAAGAUCGU-CUCAUGCG 20

target            0 CCCCAAAC----UA--GC-UCG--CGUU-A-- 20
                  0 ------------|---|--|||--||---|-- 32
query             0 --------GGGGU-UUG-AUCGAGCG--CAAU 20

target            0 GAAG-UGAA-CCAUCUUCGCC--A-------- 20
                  0 -----|