# README

- Requirements
    - numpy, pandas (``pip install numpy``, ``pip install pandas``)
    - D-Tailor (``pip install -e .`` from the cloned repository)
    - ViennaRNA for MFE calculations: https://www.tbi.univie.ac.at/RNA/#download
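
Because the MFE calculations depend on ViennaRNA, it can help to confirm that `RNAfold` is visible on your `PATH` before running anything. The following is a minimal sketch using only the standard library; the helper name `find_executable` is illustrative and not part of D-Tailor.

```python
import shutil


def find_executable(name: str = "RNAfold"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_executable()
if path is None:
    print("RNAfold was not found on PATH; MFE values cannot be computed")
else:
    print(f"Using RNAfold at {path}")
```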

In [2]:
import sys, logging

# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #

# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

## Example sequences

- NOTE: The CDS file had to be cleaned so that gene names no longer contain `|`. D-Tailor writes output files named after the genes, and file names containing `|` do not work in bash.
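
One way to do this kind of cleaning is sketched below. It only touches header lines (those starting with `>`) and leaves sequence lines alone. The function name, the `_` replacement character, and the demo header are all assumptions for illustration; the cleaned file used in this README may have been produced differently.

```python
def clean_fasta_headers(lines, replacement="_"):
    """Yield FASTA lines with `|` replaced in header lines only."""
    for line in lines:
        if line.startswith(">"):
            yield line.replace("|", replacement)
        else:
            yield line


# Hypothetical two-line FASTA record, for illustration only
raw = [">lcl|demo_cds_1\n", "ATGAAATAA\n"]
cleaned = list(clean_fasta_headers(raw))
print("".join(cleaned))
```

For a real file, the same generator can be fed an open file handle and its output written to a new file with `writelines`.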

In [25]:
# Random sequences to optimize
seqs_to_optimize = {
    'example1': 'ATGTCTGCAACTTCCGTCACTTTCCCAATGATCAACGAAACTTACCAACAGCCAACCGGGCTTTTCATCAACAATGAATTTGTTAGTGCAAAGTCAGGTAAGACTTTTGATGTTAACACTCCAATTGATGAGTCTCTCATTTGTAAAGTCCAACAGGCCGATGCTGAAGATGTTGAAATTGCCGTTCAAGCAGCATCTAAAGCTTACAAGACTTGGAGATTTACACCGCCAAATGAAAGAGGCAGATACTTGAACAAATTGGCCGATTTGATGGACGAAAAGAGAGACTTACTTGCCAAAATTGAATCCCTTGATAATGGTAAGGCCTTACATTGTGCAAAATTCGATGTCAATCTTGTCATTGAATATTTCAGATACTGTGCAGGTTACTGTGATAAAATCGATGGTAGAACAATTACAACCGATGTCGAACATTTTACCTACACTAGAAAGGAACCTTTAGGTGTCTGTGGTGCAATTACACCTTGGAACTTCCCATTGCTGATGTTTGCTTGGAAAATCGGCCCGGCTTTAGCAACCGGTAATACCATTATTTTGAAGCCTGCCAGTGCAACACCTCTATCAAACCTCTTTACTTGTACCTTGATCAAGGAGGCGGGCATTCCAGCCGGTGTTGTTAATGTTGTTCCAGGTTCCGGTAGAGGCTGTGGTAACTCCATTTTACAACATCCTAAAATTAAGAAGGTTGCGTTTACCGGATCTACAGAAGTTGGTAAAACTGTTATGAAGGAATGTGCTAATTCCATCAAAAAGGTTACTCTCGAATTGGGTGGTAAGTCTCCAAACATTGTTTTCAAAGACTGTAACGTTGAACAAACCATTCAAAATTTGATTACTGGTATTTTCTTCAATGGTGGTGAAGTCTGTTGTGCTGGTTCTAGAATTTACATTGAAGCAACCGATGAGAAATGGTATACTGAATTCTTGACCAAATTCAAGGAGACTGTTGAAAAATTAAAGATTGGTAACCCATTTGAAGAGGGTGTTTTCCAAGGTGCACAAACCACTCCAGATCAATTCCAAACTGTCTTGGACTACATCACCGCTGCTAACGAATCCAGCTTGAAACTATTAACTGGTGGTAAAAGAATTGGCAATAAGGGATACTTTGTTGAGCCAACTATCTTCTACGATGTTCCTCAAAATTCCAAGTTAACTCAAGAAGAAATCTTTGGTCCAGTTGCTGTTGTTTTACCTTTCAAGTCCACTGAAGAATTGATTGAAAAGGCAAATGATTCCGATTTTGGTTTAGGTTCCGGTATTCACACTGAAGATTTCAACAAGGCAATTTGGGTTTCCGAAAGGCTTGAAGCAGGTTCTGTTTGGATCAACACTTACAATGATTTCCACCCAGCTGCTCCATTCGGTGGTTACAAGGAATCCGGTATTGGCAGAGAAATGGGTATTGAAGCTTTCGACAACTATACTCAAACCAAGTTAGTTAGAGCTAGAGTTAACAAGCCAGCTTTTTAG',
    # 'example2': 'ATGGCAACCAGGACTATTAAAAACAAATACGAGTCTTACGATGCCGCACTATCCTTACAACGTCAAGTGTTGTGCTACAGTAAAGACCAAAAAGAGATCTATTATACGGCAGACCCAGCGGATCTTGAGGAAGAATCATCAGAGCCGGCTCCTGCCGCCTCCAGTTCAGCCCCTGCGCCTGTAGCGGCCGCTCCTGCGCCAGCACCAGCGGGGCCGGTTGCAGGAGTTGCCGATGCTCCTGTCCCAGCAGCGCTGGTCTTGAGGACCCTGGTCGCTCACAAGTTAAAAAAGCCGCTGGAAGATATCCCTATGACTAAAACAATTAAAGATCTAGTTGGAGGAAAATCAACAGTGCAAAATGAGATCCTAGGTGACCTGGGAAAAGAATTTGGAGCGACCCCGGAAAAACCGGAAGACACACCGCTGCAAGAACTAGCTGACCAATTCCAAGGCAGCTTCAACGGAACTTTAGGATCCCAAACTTCTTCACTGATTGCGAAGCTAATGAGTTCCAAAATGCCCGGTGGTTTCTCAGTAGGAGCTGCTCGTAAATACTTACAAAGTAGGTGGGGTCTAGAGCATGGACGTCAGGATGCTGTACTTTTGTATGCTCTAACAAACGAGCCCGCAGCAAGACTAGGATCTGAACCTGAAGCAAAGGCCTTCTTTGATGCCGCGGCACAAAAATATGCAGCCGCAGAAGGCGTTAGCCTTTCCTCTGGGGCTCCAGCTGGCGGTGCGGTCGCTGTTGGCGCAGTAGCTGTGGCTGCTGGGGCCGGCCCGGTCGCAGATGTGCCGGACGCCCCCGTGCCAGCCGCGTTAGTTCTACACACGCTAGTAGCTCATAAGTTAAAGAAACCGTTGTCTGACGTGCCAATGAGTAAACCTATCAAGGATCTTGTCGGAGGCAAATCTACGGTACAAAATGAAATATTAGGAGACCTTGGGAAGGAATTCGGGTCTACACCTGAAAAACCAGAGGATACTCCGCTGCAGGAGCTGGCTGACCAATTCCAAGACTCCTTTAATGGCACTCTTGGCAAACAATCTAGTACATTGATAGCCAAGTTGATGTCATCTAAAATGCCTGGAGGCTTCAGCGTAGGCGCTGCCAGAAAATATCTACAGTCCAGGTGGGGACTACAACAGGGGAGACAAGACGCTGTTCTATTGTATGCTCTAACCAATGAACCTGCAGCAAGACTAGGTAGTGAAGCTGAGGCTAAATCTTTCTTCGATACCGTGGCACAAAAGTACGCCGCAGCTGAGGGTGTGTCCTTATCAGCCGGTGGTGCAGGTGGAGCATCCGCTGGTGCCGGTGGTGCTGTGATCGATACAGCAGCTTTGGACGCCATTACTAAAGAAACCAAGGATCTAGCTAGACAACAGCTAGAGACTTTAGCCAGGTACTTAAAATTAGATCTTACAAAAGGCGATCGTTCCTTAATTAAAGAGAAAGAGGCTTCCAAAGTTTTACAGGCAGAATTAGATCTGTGGGCAGAGGAGCACGGCGAATTCTATGCATCCGGAATCAAACCAGTCTTTTCTCCCTTGAAAGCACGTCAGTACGATTCCTACTGGAATTGGGCCCGTCAAGACTCACTTTCGATGTACTTCGACATTATATTTGGTAAGCTGAAATCTATAGATAGAGAGACTGTGACTCAGTGTATTCAGATTATGAACAGAGCAAATCCCACTTTAATCGAGTTCATGCAGTATCATATAGACCATACTCCAACATACAAAGGCGAGACCTACGAGCTAGCAAAGAAATTGGGGCAACAATTGATAGATAATTGCAAAGAAATGCTTTCGAAGAATCAGTCTCCGGTGTTCAAGGATGTTTCTTACCCAACCGGCCCAAAGACTACTGTAGATGCTAAGGGCAACATTAATTATGAAGAAGTAAGTCGTGATAGCTGCAGAAAGTTCGAGGAATACGTACACGACATGGCTAAGGGTGGTGAAATGACAAAGG
AAGTCAAGCCTACTATAGAAGAGGATTTAGCTAAGGTTTATAAAGCCTTAAGTAGACAGGCATCAGCAGAGAATCAATTGCAAATCGAATCCCTGTACAAACAGTTAATAGAGTTCGTAGAGAAGTCAAACGAGATTGAGGTATCAAAATCAGTTAGTGCTGTTCTTGACAACGAATCTACTGACGACGAGACAGATGAAATCGCAAGCTTGAAGGACTTTAGTGAAATTAAGAAACCAGTTTCAAGCACTATCCCACCAGAAACTATACCGTTTCTTCACATTAAGAGCAAAACAAAATTGGACAGTTGGGTTTACGACAAAACAAAATCTGCACTTTTCCTTGATGGTCTAGAAAAGGGTGCAGTCAACGGTATCTCTTACAAGGGTAAGATAGCCTTAGTCACGGGTGCTGGTGCTGGTTCGATAGGTGCAGAGGTCCTAAAAGGGTTGCTAAGTGGTGGAGCAAAGGTAATCGTGACTACATCGAGATACTCCAAGAAGGTTACAGAGTATTATCAGAGCCTGTATTCCAGGTTCGGTGCCTCGGGATCGGCTCTTGTTGTTGTCCCCTTTAATCAAGGATCGAAGCAGGACGTAGTCGCTTTAGTTAAGTACATTTATGACGATGTCAAACAGGGCGGCCTGGGATGGGATTTAGACTTCGTCATACCGTTCGCCGCCATACCTGAGGCCGGCATCGAAGTAGAAAATATTGGTAGTAAATCTGAGTTAGCACATCGTATAATGCTGACAAATTTATTGCGTCTACTAGGAGAAGTCGCAAGCAATAAAAGAGCAAGGAACATAACTACAAGGCCCGCAGAGGTGATACTGCCACTTAGCCCGAACCATGGTACTTTTGGGTCCGATGGGTTGTACAGTGAATCCAAGCTAGCGTTGGAGACGTTGTTTAACCGTTGGCATTCTGAGAGCTGGTCTACCTTTCTTACAATTTGCGGTGCGGTAATAGGATGGACTCGTGGGACTGGATTGATGTCTGGGAACAATATGATAGCAGAAGGCATTGAGAAATTGGGAGTGAGGACATTTTCACAGAAGGAGATGGCTTTCAACATTCTAGGTCTTATGACCCCGGAGGTAGTCCGTATGTGTGAGGAGGGACCCGTGATGGCGGATCTGAATGGAGGCCTACAATTTATTGAAAACCTAAGGGAGTATACGAATCAGTTGCGTTCCGAAATCAACAATACTTCCGAAGTCAGAAGAGCAGTTTCCATCGAAACAGCGATTGAGCACAAGATCGTCAACGGAGAGAATGCAGACGCTCCGTTCAATAAAGCGGAAGTCAAGCCTAAAGCCAACCTGACTTTTGATTTCCCAGAAACATCCCCCTACGAAGAGATAAAAGCTAAAGCACCAGAACTAGAAGGAATGTTGGATCTTGAACGTGTGATCGTCGTGACTGGCTTTTCAGAAGTGGGTCCATGGGGAAATTCTCGTACGAGGTGGGAAATGGAGGCTTTCGGAGAGTTCTCCTTGGAGGGATGTATAGAGATGGCTTGGATCATGGGATTCATCAAGTACTTCAATGGGAATATAAAGGGAAAACAATACACTGGGTGGGTCGACGCTAAAACCAATGTGCCGGTGGACGACGTTGATGTCAAAAAGAAATATGAGGCTGAAATCTTAGCGCATTCTGGCATAAGACTGGTGGAACCGGAGTTGTTTCACGGGTATGACCCCAACAAAAAACAATTAATACAAGAAGTAGTGATTCAACACGATCTTGAACCATTCATCACAGACAAGCCGACTGCAATGCAATATCAACTGCAACATGGGGAAAAAGTAGAAGTGTTTCCCGATGAATCCGGGGAGGAGTATTCCGTCAAGATACTGAAGGGAGCTACGTTGTACGTACCAAAAGCGTTGAGGTTCGACAGGCTAGTGGCGGGTCAAATCCCAACTGGATGGAATGCCAAGCATTACGGGATATCAGACGACATAATTGATCAGGTTGATCCAATTACATTG
TACGTCCTTGTCGCCACCGTCGAAGCTTTGTTATCAGCGGGAATCACAGACCCGTATGAATTCTATAAGTATGTTCATGTATCAGAAGTTGGGAACTGCTCAGGCAGTGGGATGGGAGGTGTCTCTGCTCTAAAAGGTATGTTCAGGGACAGGTATAAGGAAATTCCCGTCCAAAATGATATCTTGCAAGAATCATTTATAAACACCATGTCCGCCTGGGTCAACATGTTGCTACTGTCTAGCAGTGGCCCAATCAAGACTCCCGTTGGTGCATGCGCTACGGCTTTAGAGAGCGTTGACGTCGGCGTTGAAACAATCCTATCCGGAAAGGCAAAGATAGTGCTAGTCGGAGGTTATGACGATTTTCAGGAAGAGGGGTCCTACGAATTCGCTAATATGAACGCCACTTCCAATGCGGTGGAAGAGTTTGCCCACGGAAGAACTCCTGACGAGATGTGCAGGCCTGCAACTACGACAAGGAATGGTTTTATGGAAGCTCAGGGCTCAGGAATCCAAGTCCTAATGACGGCGTCTCTTGCCCTGCAAATGGGTGTGCCGATATACGCAATAGTTGCCATGTCCAGTACTGCTTCCGATAAGATCGGACGTTCTGTTCCGGCTCCAGGTAAGGGAATTTTGACGACCGCTCGTGAATACCAGGGCGATTTGAAGTATAAATCAGCGAAGATGGACATAAAGTACCGTTCCAGGCAATTGAAGAACAGGATAGCGCAGATCAAAAACTGGGCCGAAGGGGAGCTTGATTACATTCAGGAGGAGGCCGCTCAACTTGCGGAATCAGATGCTTCTTTCAACAAAAGCGAGTTCCTAAGGGAAAGGACCGAAGAGATTGAACGTGACGCTATCAAACAAGTAAAAGATGCACAGAGGCAGCTGGGGAACGAATTCTGGAAAAATGACCCCCGTATTGCTCCGATCAGAGGTGCATTAGCGACATATAATTTGACAATTGACGATTTAGATGTATGTTCCTTCCACGGTACGTCCACCAAAGCAAACGATAAAAATGAGACGGCAACCGTCGACAAAATGATGCAGCACCTTGGACGTACAGAAGGGAATACAGTTTATGGGGTATTTCAGAAATTCCTAACTGGCCATCCGAAAGGGGCGGCAGGCGCATGGATGCTAAATGGCGCGATCCAAATACTTAATACTGGAATCGTGCCAGGGAACAGAAACGCGGACAATATAGACAAAGTCCTGGAGGACTACAAGTATGTATTGTTTCCCTCAAGAACCATAACAACGGATGGTATTAAAGCAGTGAGCGTGACTTCATTCGGCTTTGGACAGAAAGGCGCCCAGGCCATCGTAATACACCCGGACTACCTTTACGCCGCATTATCAAAAGAGGAATACGAGAGCTACACGGCGAAGGTAAGCAGCCGTCAGAAGAAATCCTATGCCTACATTCATAACGGCATGCTAAATAATTCAATATTCGTCGCTAAAGATCACGCACCGTACAACGACGACCAACAGGAGAGCGTGTATTTGGATCCATTAGCCAGGGTGTCCCCCAATAAAAAAGACGAACTAGTTTTCAACGACAACGAATTGCAGGAGAATGGCAAGTATATAAGCCCGGTAGCCGATAAAACGGCATCCGTACTATCTAACCTGACAAAAGAACAAATAGGAAGCAAAGGTGTTGGGGTAGATGTAGAGCTGATCGCCGAGATTAACATTAACAACGAGACTTTCATCGAACGTAATTTCACAGAAGAAGAAATTAAGTATTGCTCCGGGAGTGCAAACCCCAGATCCAGTTTCGCCGGGGCCTGGTCAGCTAAGGAAGCAGTCTTCAAGTCTTTAGAGGTTGAAAGCAAAGGAGCCGGAGCAAGTCTAAAGGATATAGAAATTGTGCATGCTGCAAATGGAGCTCCCACGGTCACCCTGCATGGTTCTGCCCTGCAAGCAGCCAATAAAAGGGGAGTTAAAAACGTGAAGGTCAGCATTTCTCATGATGACGTGCA
GTCTGTAGCGGTAGCTATCAGTGAGTTTTAA'
}

# E. coli CDS FASTA file (cleaned so that gene names contain no `|`)
ecoli_cds_fasta = './GCF_000005845.2_ASM584v2_cds_from_genomic_cleaned.fna'

## Get and set some required data first

##### CAI table for your organism's genomic CDS

- NOTE: Biopython's `CodonAdaptationIndex` complains about sequences with incomplete reading frames or with "N" in them. A modified version is therefore supplied here.

In [4]:
from CodonUsage import CodonAdaptationIndex

def get_codon_usage_table(fasta_file: str) -> dict:
    """Get a codon usage table of RSCUs from a FASTA file of genomic sequences.
    
    Args:
        fasta_file: Path to FASTA file

    Returns:
        dict: Codon to RSCU value

    """
    biop_cai = CodonAdaptationIndex()
    biop_cai.generate_index(fasta_file=fasta_file)

    final_table = {k.lower():v for k,v in biop_cai.index.items()}

    return final_table

In [5]:
cai_table = get_codon_usage_table(ecoli_cds_fasta)

##### Define what you want a ramp to be

In [6]:
# Ramp = first 120 bp (40 codons)
ramp_from_to = (0,120)
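
To make the ramp definition concrete, here is a small sketch of how a `(start, end)` tuple splits a CDS, together with a CAI computed as the geometric mean of per-codon weights (the usual definition of CAI). The weights and the 6 bp ramp are toy values chosen for readability, and `split_ramp` and `geometric_mean_cai` are hypothetical helpers. D-Tailor's own CAI feature may treat edge cases such as missing codons differently.

```python
import math


def split_ramp(seq: str, ramp_from_to: tuple):
    """Split a CDS into the ramp slice and everything after the ramp."""
    start, end = ramp_from_to
    return seq[start:end], seq[end:]


def geometric_mean_cai(seq: str, weights: dict) -> float:
    """Geometric mean of codon weights; codons missing from `weights` are skipped."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3].lower() for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy weights keyed by lowercase codon (hypothetical values)
toy_weights = {"atg": 1.0, "aaa": 1.0, "aag": 0.25}

ramp, rest = split_ramp("ATGAAAAAGAAG", (0, 6))
print(ramp, geometric_mean_cai(ramp, toy_weights))
print(rest, geometric_mean_cai(rest, toy_weights))
```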

## Analyze CAI and MFE for an entire CDS FASTA file

``AllAndRampAnalyzer`` -- for each CDS, calculates:

- The CAI of the entire sequence (`allCAI`)
- The CAI of the specified ramp (`rampCAI`)
- The CAI of the non-ramp portion (`restCAI`)
- The MFE of the ramp (`mfeStructureRNAFoldMFE`)

    - **IMPORTANT** - requires `RNAfold` to be executable from your path! Otherwise returns "NA"
    
**This analyzed information can then be used to set quintiles/levels for D-tailor to optimize to.**

Any errors seen are usually inconsistencies with the coding sequence itself.

I usually save the computed info as a json file for easier reuse.
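
A minimal sketch of that save-and-reload step follows. It assumes the per-sequence values are plain numbers or strings (so that `json` can serialize them). The demo dictionary and its values are made up and only mimic the feature names listed above.

```python
import json
import os
import tempfile


def save_info(info: dict, path: str) -> None:
    """Write the analysis dictionary to a JSON file."""
    with open(path, "w") as fh:
        json.dump(info, fh, indent=2)


def load_info(path: str) -> dict:
    """Read the analysis dictionary back from a JSON file."""
    with open(path) as fh:
        return json.load(fh)


# Made-up values for illustration; real values come from the analyzer
demo_info = {
    "geneA": {"allCAI": 0.71, "rampCAI": 0.65, "restCAI": 0.72,
              "mfeStructureRNAFoldMFE": -30.1},
}

demo_path = os.path.join(tempfile.mkdtemp(), "analysis.json")
save_info(demo_info, demo_path)
reloaded = load_info(demo_path)
```

The same two helpers could be applied to the `final_info` dictionary once it has been computed.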

In [7]:
import tempfile
from dtailor.RunningExamples.Analyzer.AllAndRampAnalyzer import AllAndRampAnalyzer

runner = AllAndRampAnalyzer(
                input_file=ecoli_cds_fasta,  # Path to CDS FASTA
                input_type="FASTA",
                cai_table=cai_table,  # CAI table you just calculated
                ramp=True,
                ramp_from_to=ramp_from_to,
                root_dir=tempfile.mkdtemp())

final_info = runner.run()  # Returns a dictionary of calculated features for each coding sequence in the FASTA

[2019-05-10 17:06] [dtailor.Functions] ERROR: Sequence does not start with a start codon
[2019-05-10 17:06] [dtailor.Functions] ERROR: Sequence does not start with a start codon
[2019-05-10 17:06] [dtailor.Functions] ERROR: Sequence does not start with a start codon
[2019-05-10 17:06] [dtailor.Functions] ERROR: Sequence does not stop with a stop codon
[2019-05-10 17:06] [dtailor.Functions] ERROR: Sequence does not stop with a stop codon
[2019-05-10 17:06] [dtailor.Functions] ERROR: CDS length is not a multiple of 3
[2019-05-10 17:06] [dtailor.Functions] ERROR: Sequence does not start with a start codon
[2019-05-10 17:06] [dtailor.Functions] ERROR: Sequence does not start with a start codon
[2019-05-10 17:06] [dtailor.Functions] ERROR: CDS length is not a multiple of 3
[2019-05-10 17:06] [dtailor.Functions] ERROR: Sequence does not start with a start codon
[2019-05-10 17:06] [dtailor.Functions] ERROR: Sequence does not start with a start codon
[2019-05-10 17:06] [dtailor.Functions] ERRO

## Set your design parameters

##### Calculate the quintiles - these are the levels to optimize to

In [8]:
import numpy as np

def get_quintile_levels(inlist: list) -> dict:
    """Get a dictionary of 5 levels representing the quintiles of a list with numeric values"""
    quintiles = []

    for x in [0, 20, 40, 60, 80, 100]:
        quintiles.append(np.percentile(sorted(inlist), x))

    levels = {}
    for i, x in enumerate(quintiles):
        if i == len(quintiles) - 1:
            continue
        else:
            levels[str(i + 1)] = (x, quintiles[i + 1])

    # TODO: this needs to change to allow set max and set mins
    if levels['5'][1] > 0 and levels['1'][0] > 0:
        levels['1'] = (0, levels['1'][1])
        levels['5'] = (levels['5'][0], 1)
    elif levels['1'][0] < 0:
        levels['1'] = (-9999, levels['1'][1])
        levels['5'] = (levels['5'][0], 0)

    return levels

In [9]:
# Gather all the CAI and MFE values
values_cai_all = []
values_mfe_ramp = []

for locus, info in final_info.items():
    if info:
        values_cai_all.append(info['allCAI'])
        values_mfe_ramp.append(info['mfeStructureRNAFoldMFE'])

# Calculate the quintiles
levels_cai_all = get_quintile_levels(values_cai_all)
levels_mfe_ramp = get_quintile_levels(values_mfe_ramp)

In [10]:
levels_cai_all

{'1': (0, 0.6621406925016591),
 '2': (0.6621406925016591, 0.6976944572806913),
 '3': (0.6976944572806913, 0.7274185081606481),
 '4': (0.7274185081606481, 0.7579032438981487),
 '5': (0.7579032438981487, 1)}

In [11]:
levels_mfe_ramp

{'1': (-9999, -36.2),
 '2': (-36.2, -32.1),
 '3': (-32.1, -28.5),
 '4': (-28.5, -24.4),
 '5': (-24.4, 0)}
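
To see how a measured value maps onto these levels, here is a small lookup sketch. The helper `find_level` is hypothetical, and treating both ends of each range as inclusive (with the first match winning) is an assumption; D-Tailor may handle boundary values differently. The level dictionaries are copied from the outputs above, with the CAI bounds rounded.

```python
def find_level(value: float, levels: dict):
    """Return the first level whose (low, high) range contains `value`, else None."""
    for name, (low, high) in levels.items():
        if low <= value <= high:
            return name
    return None


levels_mfe_demo = {'1': (-9999, -36.2), '2': (-36.2, -32.1), '3': (-32.1, -28.5),
                   '4': (-28.5, -24.4), '5': (-24.4, 0)}
levels_cai_demo = {'1': (0, 0.6621), '2': (0.6621, 0.6977), '3': (0.6977, 0.7274),
                   '4': (0.7274, 0.7579), '5': (0.7579, 1)}

print(find_level(-17.5, levels_mfe_demo))   # prints 5
print(find_level(0.6223, levels_cai_demo))  # prints 1
```

These two inputs are the starting MFE and ramp CAI reported in the design log further down, where D-Tailor assigns them the same levels.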

In [12]:
LONGEST_REPEAT = 19
LONGEST_HOMOPOLYMER = 10
MIN_GLOBAL_GC = 25
MAX_GLOBAL_GC = 65
MIN_LOCAL_GC = 35
MAX_LOCAL_GC = 65
LOCAL_GC_WINDOW = 50
MAX_SMALL_REPEAT_PERCENTAGE = 30

levels_longestrepeat = {
    '1': (0, LONGEST_REPEAT),
    '0': (LONGEST_REPEAT, 99999999999)}

levels_longesthomopolymer = {
    '1': (0, LONGEST_HOMOPOLYMER),
    '0': (LONGEST_HOMOPOLYMER, 99999999999)}

levels_globalgc = {
    '0': (0, MIN_GLOBAL_GC),
    '1': (MIN_GLOBAL_GC, MAX_GLOBAL_GC),
    '2': (MAX_GLOBAL_GC, 100)}

levels_localgc = {
    '0': (0, MIN_LOCAL_GC),
    '1': (MIN_LOCAL_GC, MAX_LOCAL_GC),
    '2': (MAX_LOCAL_GC, 100)}

levels_smallrepeatpercentage = {
    '1': (0, MAX_SMALL_REPEAT_PERCENTAGE),
    '0': (MAX_SMALL_REPEAT_PERCENTAGE, 1000)}
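
For intuition about what these thresholds are compared against, the sketch below computes three of the underlying quantities for a short sequence. These are plain reimplementations written for this README, not D-Tailor's own feature code, so details such as how local GC windows are summarized are assumptions.

```python
from itertools import groupby


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


def gc_percent(seq: str) -> float:
    """GC content of the whole sequence, as a percentage."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def window_gc(seq: str, window: int = 50) -> list:
    """GC percentage of every sliding window of length `window`."""
    return [gc_percent(seq[i:i + window]) for i in range(len(seq) - window + 1)]


print(longest_homopolymer("ATGAAAAAC"))
print(gc_percent("ATGC"))
print(window_gc("GGCCAATT", window=4))
```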

In [20]:
from collections import OrderedDict


def make_design_params(in_seq: str,
                       ramp_from_to: tuple,
                       levels_mfe_ramp: dict,
                       levels_cai_all: dict,
                       levels_longestrepeat: dict,
                       levels_longesthomopolymer: dict,
                       levels_globalgc: dict,
                       levels_localgc: dict) -> OrderedDict:
    """Build the ordered design parameters; each levels_* argument maps a level name to a (low, high) tuple"""
    
    design_params = OrderedDict()
    
    design_params["mfe"] = {
        'feattype'      : 'StructureRNAFoldMFE',
        'type'          : 'REAL',
        'mutable_region': ramp_from_to,
        'thresholds'    : levels_mfe_ramp
    }

    design_params["ramp"] = {
        'feattype'      : 'CAI',
        'type'          : 'REAL',
        'mutable_region': ramp_from_to,
        'thresholds'    : levels_cai_all
    }

    design_params["rest"] = {
        'feattype'      : 'CAI',
        'type'          : 'REAL',
        'mutable_region': (ramp_from_to[1], len(in_seq)),
        'thresholds'    : levels_cai_all
    }
    
    design_params['lrs'] = {
        'feattype'      : 'LongestRepeatedSubstr',
        'type'          : 'INTEGER',
        'mutable_region': (0, len(in_seq)),
        'thresholds'    : levels_longestrepeat
    }

    design_params['lh'] = {
        'feattype'      : 'LongestHomopolymer',
        'type'          : 'INTEGER',
        'mutable_region': (0, len(in_seq)),
        'thresholds'    : levels_longesthomopolymer
    }
    
    design_params['ggc'] = {
        'feattype'      : 'GlobalGC',
        'type'          : 'REAL',
        'mutable_region': (0, len(in_seq)),
        'thresholds'    : levels_globalgc
    }
    
    design_params['lgc'] = {
        'feattype'      : 'LocalGC',
        'type'          : 'REAL',
        'mutable_region': (0, len(in_seq)),
        'thresholds'    : levels_localgc
    }
    
    return design_params

## Design a sequence to meet specific targets

##### Specify your targets and output directory

In [21]:
targets = ['5.1.5.1.1.1.1', '5.5.5.1.1.1.1', '1.5.5.1.1.1.1']

outdir = './out'
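
Each target string holds one level per feature, in the order the features were added to the design parameters (`mfe.ramp.rest.lrs.lh.ggc.lgc`, as the design log also reports). The sketch below decodes a target into a readable mapping; `decode_target` and `FEATURE_ORDER` are hypothetical names used only for illustration.

```python
FEATURE_ORDER = ["mfe", "ramp", "rest", "lrs", "lh", "ggc", "lgc"]


def decode_target(target: str, feature_order=FEATURE_ORDER) -> dict:
    """Map each feature name to the level requested in a dotted target string."""
    levels = target.split(".")
    if len(levels) != len(feature_order):
        raise ValueError(
            f"expected {len(feature_order)} levels, got {len(levels)}: {target!r}")
    return dict(zip(feature_order, levels))


decoded = decode_target("5.1.5.1.1.1.1")
print(decoded)
```

Read this way, `'5.1.5.1.1.1.1'` asks for the weakest ramp structure (MFE level 5), a low-CAI ramp (level 1), and a high-CAI remainder (level 5), with the repeat, homopolymer, and GC features all at level 1.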

##### Run it!

In [22]:
from dtailor.DesignOfExperiments.Design import CustomDesign, RandomSampling
from dtailor.RunningExamples.Designer.GenericDesigner import GenericDesigner

In [23]:
import os.path as op
import sqlite3
import pandas as pd

def parse_sqlite_results(in_seq_id: str, outdir: str) -> dict:
    """Return {des_solution_id: sequence} for the solutions stored in the run's SQLite database."""
    # Parse out sequences from SQLite database
    output_file = op.join(outdir, in_seq_id) + '.sqlite'
    conn = sqlite3.connect(output_file)
    try:
        tmp = pd.read_sql(con=conn, sql="select * from generated_solution WHERE des_solution_id <> '';")
    finally:
        conn.close()  # Release the database file even if the query fails
    # TODO: only the first sequence per des_solution_id is kept, so extra solutions for a target are dropped...is that wanted?
    result = {k: v['sequence'] for k, v in
              tmp[['des_solution_id', 'sequence']].drop_duplicates(subset='des_solution_id').set_index(
                  'des_solution_id').to_dict(orient='index').items()}
    return result
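
If dropping the extra solutions is not wanted, one alternative is to collect every sequence per target. The sketch below uses only the standard library and runs against an in-memory database. The two-column demo table is a simplified stand-in, on the assumption that the real `generated_solution` table has at least the `des_solution_id` and `sequence` columns used above. `collect_all_solutions` is a hypothetical helper.

```python
import sqlite3
from collections import defaultdict


def collect_all_solutions(conn) -> dict:
    """Return {des_solution_id: [sequence, ...]} keeping every stored solution."""
    query = ("SELECT des_solution_id, sequence FROM generated_solution "
             "WHERE des_solution_id <> ''")
    grouped = defaultdict(list)
    for target, sequence in conn.execute(query):
        grouped[target].append(sequence)
    return dict(grouped)


# Simplified demo table with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE generated_solution (des_solution_id TEXT, sequence TEXT)")
conn.executemany(
    "INSERT INTO generated_solution VALUES (?, ?)",
    [("5.1.5.1.1.1.1", "atgaaataa"),
     ("5.1.5.1.1.1.1", "atgaagtaa"),
     ("", "atgaaatag")],  # rows without a target id are ignored
)
all_solutions = collect_all_solutions(conn)
conn.close()
```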

In [26]:
results = {}

for ident, seq in seqs_to_optimize.items():
    
    # Specify the parameters you want to optimize to
    design = CustomDesign(featuresObj=make_design_params(in_seq=seq, 
                                                         ramp_from_to=ramp_from_to,
                                                         levels_mfe_ramp=levels_mfe_ramp,
                                                         levels_cai_all=levels_cai_all,
                                                         levels_longestrepeat=levels_longestrepeat,
                                                         levels_longesthomopolymer=levels_longesthomopolymer,
                                                         levels_globalgc=levels_globalgc,
                                                         levels_localgc=levels_localgc),
                          targets=targets)

    # Get ready to run everything and run it
    runner = GenericDesigner(
        name=ident, 
        seed=seq,
        design=design, 
        cai_table=cai_table,  # Your pre computed CAI table
        mutable_region=None, cds_region=None,   # Just leave these as None for now..unsure what changing them does
        keep_aa=True,  # Don't mutate amino acids
        check_frame=True,  # Check if designed sequence has correct coding frame
        check_start=True,  # Check if designed sequence still has start codon
        check_end_stop=True,   # Check if designed sequence still has stop codon
        check_within_stop=True,  # Check if designed sequence does not have stop codons inside
        root_dir=outdir, 
        createDB=True)
    
    # Actually run it
    stats = runner.run()
    
    # Pull out the designed seqs from the SQLite db
    result = parse_sqlite_results(in_seq_id=ident, outdir=outdir)
    
    # Store in a dictionary
    results[ident] = result

[2019-05-10 17:15] [dtailor.SequenceDesigner] INFO: Starting scores:
[2019-05-10 17:15] [dtailor.SequenceDesigner] INFO: {'mfeStructureRNAFoldMFE': -17.5, 'rampCAI': 0.6222998089169588, 'restCAI': 0.5998345299571216, 'lrsLongestRepeatedSubstr': 11, 'lhLongestHomopolymer': 5, 'ggcGlobalGC': 40.02677376171352, 'lgcLocalGC': 56.0}
[2019-05-10 17:15] [dtailor.SequenceDesigner] INFO: {'mfeStructureRNAFoldMFELevel': '5', 'rampCAILevel': '1', 'restCAILevel': '1', 'lrsLongestRepeatedSubstrLevel': '1', 'lhLongestHomopolymerLevel': '1', 'ggcGlobalGCLevel': '1', 'lgcLocalGCLevel': '1'}
[2019-05-10 17:15] [dtailor.SequenceDesigner] INFO: Looking for combination: 5.1.5.1.1.1.1
[2019-05-10 17:16] [dtailor.SequenceDesigner] INFO: Solution found for 5.1.5.1.1.1.1, inserting into DB
[2019-05-10 17:16] [dtailor.SequenceDesigner] INFO: Looking for combination: 5.5.5.1.1.1.1
[2019-05-10 17:16] [dtailor.SequenceDesigner] INFO: SolutionIterator: starting from level (mfe.ramp.rest.lrs.lh.ggc.lgc): 5.1.5.1.1.

In [27]:
results

{'example1': {'5.1.5.1.1.1.1': 'atgtctgcaacttccgtcactttcccaatgatcaacgaaacttaccaacagccaaccgggcttttcatcaacaatgaatttgttagtgcaaagtcaggtaagacttttgatgttaacactccgattgatgagtctttaatttgcaaagtccagcaggcggatgcggaagatgtggaaattgcggtgcaagcggcgtcgaaagcgtataaaacctggcggtttacgccgccgaacgaacgtggccgctacctgaacaaactggcggatctgatggacgaaaaaagagatctgcttgcgaaaattgaaagtctcgataacggcaaagcgctgcattgcgcgaaatttgatgtgaacctcgtgattgaatattttcgctactgtgcaggctattgtgataaaattgatggccgcaccattacgaccgatgtggaacattttacctataccagaaaagaacctctgggcgtgtgcggcgcgattacgccgtggaactttccgctgctgatgtttgcctggaaaattggcccggccctggccaccggcaacaccattattttgaaacctgccagtgcgacaccactaagcaacctgtttacgtgcaccttaattaaagaagcgggcattccagccggtgttgtcaatgttgtgccaggctcgggcagaggctgcggcaactcgattctgcagcatccgaaaattaagaaagtggcgtttaccgggagcacagaagtgggtaaaaccgtgatgaaagaatgcgctaacagcattaaaaaagtgacgctggaattaggcggcaagtctccgaacattgtgttcaaagattgtaacgtggaacagaccattcaaaacttgattacgggcattttctttaacggcggcgaagtctgctgcgcgggcagccgcatttatattgaagccaccgatgaaaaatggtataccgaatttttgaccaaattcaaagaaaccgt
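
Since the designer was run with `keep_aa=True`, a quick sanity check is to confirm that a designed sequence still encodes the same protein as its seed. The sketch below does this with a hand-built standard genetic code table. The helper names are hypothetical, and the use of the standard code is an assumption. The short demo sequences are toy examples, not the full sequences above.

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks stop codons
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq: str) -> str:
    """Translate a DNA sequence codon by codon; unknown codons become 'X'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def same_protein(seed: str, designed: str) -> bool:
    """True if both sequences translate to the same amino acid string."""
    return translate(seed) == translate(designed)


print(translate("ATGCCAATTTAG"))
print(same_protein("ATGCCAATTTAG", "atgccgatttag"))  # synonymous change CCA -> CCG
print(same_protein("ATGCCAATTTAG", "atgctaatttag"))  # CCA -> CTA changes the protein
```

Applied to this notebook, the check would compare each entry of `results[ident]` against the matching seed in `seqs_to_optimize`.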