# Design


## Environment

- conda
- docker
- colab

- To use Google Drive as the Colab working directory, mount it first:
```python
from google.colab import drive
drive.mount('/content/drive')
```

- Then change the working directory:

```python
import os
os.chdir('/content/drive/MyDrive/design_build')
```
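
`os.chdir` fails if the target folder does not exist yet, for example on a fresh Drive. A minimal sketch of a helper that creates the folder first; `enter_workdir` is a hypothetical name, not part of Colab or the standard library:

```python
from pathlib import Path
import os

def enter_workdir(path):
    """Create the directory if needed, make it the working directory, and return it."""
    target = Path(path)
    target.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    os.chdir(target)
    return Path.cwd()

# In Colab, after drive.mount('/content/drive'):
# enter_workdir('/content/drive/MyDrive/design_build')
```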

In [125]:
!pwd
!wget https://media.addgene.org/snapgene-media/v1.7.9-0-g88a3305/sequences/222046/51c2cfab-a3b4-4d62-98df-0c77ec21164e/addgene-plasmid-50005-sequence-222046.gbk

/home/haseong/dev/design_build_python
--2024-05-29 11:03:00--  https://media.addgene.org/snapgene-media/v1.7.9-0-g88a3305/sequences/222046/51c2cfab-a3b4-4d62-98df-0c77ec21164e/addgene-plasmid-50005-sequence-222046.gbk
Resolving media.addgene.org (media.addgene.org)... 172.66.40.66, 172.66.43.190, 2606:4700:3108::ac42:2bbe, ...
Connecting to media.addgene.org (media.addgene.org)|172.66.40.66|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8623 (8.4K) [application/octet-stream]
Saving to: ‘addgene-plasmid-50005-sequence-222046.gbk’


2024-05-29 11:03:00 (257 MB/s) - ‘addgene-plasmid-50005-sequence-222046.gbk’ saved [8623/8623]



In [59]:
!ls

README.md				   data		 seq_analysis.ipynb
addgene-plasmid-50005-sequence-222046.gbk  design.ipynb  synbio
align.wig				   example.txt
build.ipynb				   images


![](images/puc19.png){width=500px}

## Biopython

[Biopython](https://biopython.org/) is a collection of freely available Python tools for computational biology and bioinformatics.

In [1]:
from Bio import SeqIO
from pandas import DataFrame

# SeqIO.read returns exactly one SeqRecord (use SeqIO.parse for multi-record files)
record = SeqIO.read("addgene-plasmid-50005-sequence-222046.gbk", "genbank")

# Collect one row per annotated feature
features = []
for feature in record.features:
    features.append({
        "Label": feature.qualifiers.get("label", [""])[0],
        "Strand": feature.strand,
        "Start": feature.location.start,
        "End": feature.location.end,
        "Type": feature.type
    })
print(features)
DataFrame(features)

[{'Label': '', 'Strand': 1, 'Start': ExactPosition(0), 'End': ExactPosition(2686), 'Type': 'source'}, {'Label': 'pBR322ori-F', 'Strand': 1, 'Start': ExactPosition(117), 'End': ExactPosition(137), 'Type': 'primer_bind'}, {'Label': 'L4440', 'Strand': 1, 'Start': ExactPosition(370), 'End': ExactPosition(388), 'Type': 'primer_bind'}, {'Label': 'CAP binding site', 'Strand': 1, 'Start': ExactPosition(504), 'End': ExactPosition(526), 'Type': 'protein_bind'}, {'Label': 'lac promoter', 'Strand': 1, 'Start': ExactPosition(540), 'End': ExactPosition(571), 'Type': 'promoter'}, {'Label': 'lac operator', 'Strand': 1, 'Start': ExactPosition(578), 'End': ExactPosition(595), 'Type': 'protein_bind'}, {'Label': 'M13/pUC Reverse', 'Strand': 1, 'Start': ExactPosition(583), 'End': ExactPosition(606), 'Type': 'primer_bind'}, {'Label': 'M13 rev', 'Strand': 1, 'Start': ExactPosition(602), 'End': ExactPosition(619), 'Type': 'primer_bind'}, {'Label': 'M13 Reverse', 'Strand': 1, 'Start': ExactPosition(602), 'End'



|   | Label | Strand | Start | End | Type |
|---|---|---|---|---|---|
| 0 |  | 1 | 0 | 2686 | source |
| 1 | pBR322ori-F | 1 | 117 | 137 | primer_bind |
| 2 | L4440 | 1 | 370 | 388 | primer_bind |
| 3 | CAP binding site | 1 | 504 | 526 | protein_bind |
| 4 | lac promoter | 1 | 540 | 571 | promoter |
| 5 | lac operator | 1 | 578 | 595 | protein_bind |
| 6 | M13/pUC Reverse | 1 | 583 | 606 | primer_bind |
| 7 | M13 rev | 1 | 602 | 619 | primer_bind |
| 8 | M13 Reverse | 1 | 602 | 619 | primer_bind |
| 9 | lacZ-alpha | 1 | 614 | 938 | CDS |
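
The Start and End values reported by Biopython are 0-based and end-exclusive, so they can be used directly as Python slice bounds, and End minus Start gives the feature length (the 2686 bp `source` feature spans 0 to 2686). A minimal sketch in plain Python; the 30 bp sequence and the two toy features are made up for illustration and are not taken from the plasmid, and both helper names are hypothetical:

```python
# Hypothetical toy sequence and feature table (illustration only).
seq = "ATGACCATGATTACGCCAAGCTTGCATGCC"
features = [
    {"Label": "toy_primer", "Start": 0, "End": 12},
    {"Label": "toy_site", "Start": 18, "End": 24},
]

def feature_sequence(sequence, start, end):
    """Slice a feature using 0-based, end-exclusive coordinates."""
    return sequence[start:end]

def genbank_range(start, end):
    """Convert 0-based, end-exclusive bounds to GenBank's 1-based inclusive notation."""
    return f"{start + 1}..{end}"

for f in features:
    sub = feature_sequence(seq, f["Start"], f["End"])
    print(f["Label"], genbank_range(f["Start"], f["End"]), len(sub), sub)
```

With this convention the `lac promoter` row (540, 571) corresponds to `541..571` in the GenBank file.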


## Primers

[primers](https://github.com/Lattice-Automation/primers) is a Python package for primer design that focuses on DNA assembly workflows such as Gibson Assembly and Golden Gate cloning. It can add extra sequence to the 5' ends of the primers it designs.

In [101]:
from primers import create, primers
from pandas import DataFrame
from random import choices

# Random 100 bp template
myseq_list = choices(["A", "T", "G", "C"], k=100)
myseq = "".join(myseq_list)
print(myseq)

# Design a primer pair, adding GGGG / TTTT to the 5' ends
fwd, rev = create(myseq, add_fwd = "GGGG", add_rev = "TTTT")
# p1, p2 = primers(myseq, add_fwd = "GGGG", add_rev = "TTTT")
print(fwd)
print(rev)

## display table form (the last field, scoring, is dropped)
fwd_info = fwd.dict()
DataFrame(list(fwd_info.values())[:-1], index = list(fwd_info.keys())[:-1])

## default argument values
create

GTTTTCGTCCCAGCGACGGACTCAGTCGCAGCTGTTTATTTCTGTAGTGTCGGTATGAATATTGATTAGTCAGGGTTACTGGTCCAGGATCTTTTGTCAT
Primer(seq='GGGGGTTTTCGTCCCAGCGA', len=20, tm=62.7, tm_total=71.3, gc=0.7, dg=-0.71, fwd=True, off_target_count=0, scoring=Scoring(penalty=9.1, penalty_tm=0.7, penalty_tm_diff=0, penalty_gc=4.0, penalty_len=3.0, penalty_dg=1.4, penalty_off_target=0.0))
Primer(seq='TTTTATGACAAAAGATCCTGGACC', len=24, tm=62.5, tm_total=63.2, gc=0.4, dg=-0.2, fwd=False, off_target_count=0, scoring=Scoring(penalty=3.9, penalty_tm=0.5, penalty_tm_diff=0, penalty_gc=2.0, penalty_len=1.0, penalty_dg=0.4, penalty_off_target=0.0))


- Default arguments and values

```plain
<function primers.primers.primers(seq: str, add_fwd: str = '', add_rev: str = '', add_fwd_len: Tuple[int, int] = (-1, -1), add_rev_len: Tuple[int, int] = (-1, -1), offtarget_check: str = '', optimal_tm: float = 62.0, optimal_gc: float = 0.5, optimal_len: int = 22, penalty_tm: float = 1.0, penalty_gc: float = 0.2, penalty_len: float = 0.5, penalty_tm_diff: float = 1.0, penalty_dg: float = 2.0, penalty_off_target: float = 20.0) -> Tuple[primers.primers.Primer, primers.primers.Primer]>
```
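
The primers printed above show what `add_fwd` and `add_rev` do: the forward primer is `GGGG` followed by the first bases of the template, and the reverse primer is `TTTT` followed by the reverse complement of the last bases. The plain-Python sketch below reproduces only that layout. The binding lengths (16 and 20) are copied from the printed result; the library picks them by scoring, which is not modelled here. `reverse_complement` and `tailed_primers` are hypothetical helper names.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def tailed_primers(template, fwd_len, rev_len, add_fwd="", add_rev=""):
    """Build a primer pair with fixed binding lengths and optional 5' tails."""
    fwd = add_fwd + template[:fwd_len]
    rev = add_rev + reverse_complement(template[-rev_len:])
    return fwd, rev

# Template printed by the cell above
template = (
    "GTTTTCGTCCCAGCGACGGACTCAGTCGCAGCTGTTTATTTCTGTAGTGT"
    "CGGTATGAATATTGATTAGTCAGGGTTACTGGTCCAGGATCTTTTGTCAT"
)
fwd, rev = tailed_primers(template, 16, 20, add_fwd="GGGG", add_rev="TTTT")
print(fwd)
print(rev)
```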

- Off-target binding: primers designed for a template with and without a repeated binding sequence

In [116]:
from primers import create
from random import choices

def print_primer_info(x):
    from pandas import DataFrame
    df = DataFrame(list(x.dict().values())[:-1], index = list(x.dict().keys())[:-1])
    print(df)

primer_binding_seq = "GTCATATGCATTCGATGCGTTAGG"
rnd_seq1 = "".join(choices(["A", "T", "G", "C"], k=100))
rnd_seq2 = "".join(choices(["A", "T", "G", "C"], k=100))

myseq = primer_binding_seq+rnd_seq1
print(myseq)
len(myseq)

fwd, rev = create(myseq)
print_primer_info(fwd)

## primer considering offtargets
myseq2 = primer_binding_seq+rnd_seq1+primer_binding_seq+rnd_seq2
fwd2, rev = create(myseq2)
print_primer_info(fwd2)

## optimal_tm is ignored 
fwd2, rev = create(myseq2, optimal_tm = 62)
print_primer_info(fwd2)

GTCATATGCATTCGATGCGTTAGGCGTACTTGAGCCGTTGGGTTGTCTTAGGTTGTTGGCTGCTCAGAGCTCTGGCCAGGTCGCGTCGGTTCAGTTGTATCTAAGCTGCATTGCTGCTGAGCTG
                                     0
seq               GTCATATGCATTCGATGCGT
len                                 20
tm                                62.5
tm_total                          62.5
gc                                 0.5
dg                               -0.56
fwd                               True
off_target_count                     0
                                            0
seq               GTCATATGCATTCGATGCGTTAGGCGT
len                                        27
tm                                       68.7
tm_total                                 68.7
gc                                        0.5
dg                                      -0.56
fwd                                      True
off_target_count                            0
                                            0
seq               GTCATATGCATTCGATGCGTTAGGCGT
len               
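
One plausible reading of why the second design above is longer (27 instead of 20 bases): once the binding sequence occurs twice in the template, a 20-base primer taken from its start matches two places, and extending it past the repeated region makes it unique again. The sketch below counts exact matches on both strands in plain Python. It only illustrates the idea and is not the scoring used by `primers`; `count_sites` is a hypothetical helper, `filler1` is the first 20 bases of the random sequence printed above, and `filler2` is made up.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def count_sites(primer, template):
    """Count exact (possibly overlapping) matches of primer on either strand."""
    total = 0
    for strand in (template, reverse_complement(template)):
        start = strand.find(primer)
        while start != -1:
            total += 1
            start = strand.find(primer, start + 1)
    return total

binding = "GTCATATGCATTCGATGCGTTAGG"
filler1 = "CGTACTTGAGCCGTTGGGTT"
filler2 = "AACCTTGGACGTACGATCCA"  # made-up spacer
template = binding + filler1 + binding + filler2

short_primer = template[:20]  # lies entirely inside the repeated region
long_primer = template[:27]   # extends 3 bases into filler1

print(count_sites(short_primer, template))
print(count_sites(long_primer, template))
```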

## pydna

- [pydna](https://github.com/BjornFJohansson/pydna): the pydna Python package provides human-readable formal descriptions of 🧬 cloning and genetic assembly strategies in Python 🐍, which allow for simulation and verification.

In [121]:
from pydna.dseqrecord import Dseqrecord

dsr = Dseqrecord("ATGCGTTGC")
dsr.figure()

Dseqrecord(-9)
ATGCGTTGC
TACGCAACG

In [122]:
from pydna.readers import read

p = read("addgene-plasmid-50005-sequence-222046.gbk")
p.list_features()

| Ft# | Label or Note | Dir | Sta | End | Len | type | orf? |
|---|---|---|---|---|---|---|---|
| 0 | nd | --> | 0 | 2686 | 2686 | source | no |
| 1 | L:pBR322ori-F | --> | 117 | 137 | 20 | primer_bind | no |
| 2 | L:L4440 | --> | 370 | 388 | 18 | primer_bind | no |
| 3 | L:CAP binding si | --> | 504 | 526 | 22 | protein_bind | no |
| 4 | L:lac promoter | --> | 540 | 571 | 31 | promoter | no |
| 5 | L:lac operator | --> | 578 | 595 | 17 | protein_bind | no |
| 6 | L:M13/pUC Revers | --> | 583 | 606 | 23 | primer_bind | no |
| 7 | L:M13 rev | --> | 602 | 619 | 17 | primer_bind | no |
| 8 | L:M13 Reverse | --> | 602 | 619 | 17 | primer_bind | no |
| 9 | L:lacZ-alpha | --> | 614 | 938 | 324 | CDS | yes |


In [123]:
extracted_site = p.extract_feature(10)
extracted_site.seq

Dseq(-57)
AAGC..ATTC
TTCG..TAAG

## Parts

- pUC19 from Addgene
- Remove BsaI site
- Insert a part into MCS

- Hinz, Aaron J., Benjamin Stenzler, and Alexandre J. Poulain. "Golden gate assembly of aerobic and anaerobic microbial bioreporters." Applied and environmental microbiology 88.1 (2022): e01485-21.

![plasmid pUC19](images/pUC19.png){width=500px}

![bsaI replacement](images/amp_bsaI1.png){height=200px}
![bsaI replacement2](images/amp_bsaI2.png){height=200px}
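
Whether a backbone is free of BsaI sites can be checked by searching both strands for the recognition sequence `GGTCTC`, whose reverse complement is `GAGACC`. A plain-Python sketch with made-up test sequences; `find_bsai_sites` is a hypothetical helper, and the single-base changes in `mutated` are arbitrary examples rather than the edits shown in the figures:

```python
BSAI_SITE = "GGTCTC"
BSAI_SITE_RC = "GAGACC"  # reverse complement of GGTCTC

def find_bsai_sites(seq, circular=False):
    """Return sorted (position, strand) pairs for BsaI recognition sites."""
    seq = seq.upper()
    # For a circular molecule, append the first 5 bases so that
    # sites spanning the origin are also found.
    search = seq + seq[:len(BSAI_SITE) - 1] if circular else seq
    hits = []
    for motif, strand in ((BSAI_SITE, 1), (BSAI_SITE_RC, -1)):
        start = search.find(motif)
        while start != -1:
            hits.append((start, strand))
            start = search.find(motif, start + 1)
    return sorted(hits)

with_sites = "AAAGGTCTCAAAATTTGAGACCTTT"
mutated = "AAAGGTCTGAAAATTTGAGACGTTT"  # one base changed in each site

print(find_bsai_sites(with_sites))
print(find_bsai_sites(mutated))
```

For a plasmid, pass `circular=True` so that a site straddling position 0 is not missed.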

### List of parts

- pUC19-J23100.gb
- pUC19-RB0030.gb
- pUC19-L2U3H03.gb
- pUC19-egfp.gb

![](images/puc19_egfp.png){width=500px}

## Golden Gate assembly


In [6]:
from pydna.readers import read

egfp = read("data/parts/pUC19-egfp.gb")
promoter = read("data/parts/pUC19-J23100.gb")
terminator = read("data/parts/pUC19-L2U3H03.gb")
rbs = read("data/parts/pUC19-RB0030.gb")

egfp_feature_list = egfp.list_features()
egfp_feature_list


'LOCUS       pUC19-egfp        3371 bp DNA     circular SYN 18-MAY-2024\n'
Found locus 'pUC19-egfp' size '3371' residue_type 'DNA'
Some fields may be wrong.
'LOCUS       pUC19-J23100        2686 bp DNA     circular SYN 18-MAY-2024\n'
Found locus 'pUC19-J23100' size '2686' residue_type 'DNA'
Some fields may be wrong.
'LOCUS       pUC19-gg-L2U3H03        2688 bp DNA     circular SYN 18-MAY-2024\n'
Found locus 'pUC19-gg-L2U3H03' size '2688' residue_type 'DNA'
Some fields may be wrong.
'LOCUS       pUC19-R-B0030        2750 bp DNA     circular SYN 18-MAY-2024\n'
Found locus 'pUC19-R-B0030' size '2750' residue_type 'DNA'
Some fields may be wrong.


| Ft# | Label or Note | Dir | Sta | End | Len | type | orf? |
|---|---|---|---|---|---|---|---|
| 0 | nd | --> | 0 | 3371 | 3371 | source | no |
| 1 | L:pBR322ori-F | --> | 117 | 137 | 20 | primer_bind | no |
| 2 | L:L4440 | --> | 370 | 388 | 18 | primer_bind | no |
| 3 | L:CAP binding si | --> | 504 | 526 | 22 | protein_bind | no |
| 4 | L:lac promoter | --> | 540 | 571 | 31 | promoter | no |
| 5 | L:lac operator | --> | 578 | 595 | 17 | protein_bind | no |
| 6 | L:M13/pUC Revers | --> | 583 | 606 | 23 | primer_bind | no |
| 7 | L:M13 rev | --> | 602 | 619 | 17 | primer_bind | no |
| 8 | L:M13 Reverse | --> | 602 | 619 | 17 | primer_bind | no |
| 9 | L:lacZ-alpha | --> | 614 | 1623 | 1009 | CDS | no |


In [73]:
from primers import create
from pydna.all import pcr

# PCR template: feature 13 of pUC19-egfp
myseq = egfp.extract_feature(13)

# Design primers with 5' tails, then simulate the PCR on the same template
fwd, rev = create(str(myseq.seq), add_fwd = "GGGG", add_rev = "TTTT")
print(fwd)

pcr_product = pcr(fwd.seq, rev.seq, myseq)
pcr_product.figure()


Primer(seq='GGGGATGGTGAGCAAGGGCG', len=20, tm=65.1, tm_total=72.1, gc=0.7, dg=0, fwd=True, off_target_count=0, scoring=Scoring(penalty=10.1, penalty_tm=3.1, penalty_tm_diff=0, penalty_gc=4.0, penalty_len=3.0, penalty_dg=0.0, penalty_off_target=0.0))


    5ATGGTGAGCAAGGGCG...CATGGACGAGCTGTACAAGTAA3
                        ||||||||||||||||||||||
                       3GTACCTGCTCGACATGTTCATTTTTT5
5GGGGATGGTGAGCAAGGGCG3
     ||||||||||||||||
    3TACCACTCGTTCCCGC...GTACCTGCTCGACATGTTCATT5

In [88]:
from primers import create
from pydna.all import pcr

frag1 = egfp.extract_feature(10)
frag2 = egfp.extract_feature(12)
frag3 = egfp.extract_feature(13)
frag4 = egfp.extract_feature(18)
frag5 = egfp.extract_feature(19)

targetseq = frag1+frag2+frag3+frag4+frag5
fwd, rev = create(str(targetseq.seq))

pcr_product = pcr(fwd.seq, rev.seq, targetseq)
pcr_product.figure()


5GGTCTCAGTCAATGGTGA...TACAAGTAAGGGATGAGACC3
                      ||||||||||||||||||||
                     3ATGTTCATTCCCTACTCTGG5
5GGTCTCAGTCAATGGTGA3
 ||||||||||||||||||
3CCAGAGTCAGTTACCACT...ATGTTCATTCCCTACTCTGG5

In [93]:
from Bio.Restriction import BsaI

cut_product = pcr_product.cut(BsaI)

len(cut_product)



3
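
BsaI cuts outside its recognition site: 1 nt past `GGTCTC` on the strand carrying the site and 5 nt past it on the other strand, which leaves a 4-nt 5' overhang. For an amplicon flanked by two inward-facing sites this gives three fragments, in line with the result above. The plain-Python sketch below works out the released fragment and its overhangs. `bsai_release` is a hypothetical helper, and the test amplicon is just the two ends shown in the PCR figure above joined together, with the elided middle left out.

```python
SITE_FWD = "GGTCTC"  # BsaI site on the top strand
SITE_REV = "GAGACC"  # the same site when it lies on the bottom strand

def bsai_release(amplicon):
    """Return (left overhang, released fragment, right overhang), top-strand notation."""
    i = amplicon.find(SITE_FWD)
    j = amplicon.rfind(SITE_REV)
    if i == -1 or j == -1 or j - 5 < i + 11:
        raise ValueError("expected GGTCTC ... GAGACC flanking the insert")
    left_overhang = amplicon[i + 7 : i + 11]   # site (6) + 1 spacer, then 4 nt
    right_overhang = amplicon[j - 5 : j - 1]   # 4 nt, then 1 spacer before the site
    fragment = amplicon[i + 7 : j - 1]         # spans overhang to overhang
    return left_overhang, fragment, right_overhang

# Shortened stand-in for the PCR product: its two printed ends joined directly
amplicon = "GGTCTCAGTCAATGGTGA" + "TACAAGTAAGGGATGAGACC"
left, fragment, right = bsai_release(amplicon)
print(left, right)
print(fragment)
```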

In [67]:
from pydna.gel import gel
from pydna.ladders import GeneRuler_1kb_plus
from pydna.dseqrecord import Dseqrecord


AttributeError: 'str' object has no attribute 'm'

### Reading SnapGene files

In [26]:
from snapgene_reader import snapgene_file_to_dict, snapgene_file_to_seqrecord
from Bio.Seq import Seq
from Bio.SeqUtils import MeltingTemp as mt
from Bio.SeqUtils import gc_fraction
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# data = snapgene_file_to_seqrecord("data/hs/pACBB-lycopene_vector.dna")
# type(data)

In [None]:
from Bio import SeqIO
from pandas import DataFrame

record = SeqIO.read("data/hs/pACBB-lycopene_vector.dna", "snapgene")

## convert the list of record.features  to pandas dataframe 
df = DataFrame([feature.qualifiers for feature in record.features])
df.shape
record.features[0]

## filtering CDS from record
cds = [feature for feature in record.features if feature.type == "CDS"]

## extract positions
positions = [feature.location.start for feature in cds]
positions[0]

In [None]:
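# read_snapgene_file("data/hs/pACBB-lycopene_vector.dna")
data = snapgene_file_to_dict("data/hs/pACBB-lycopene_vector.dna")
# data.keys()
# data["seq"]
data["features"]

# SeqRecord expects a Seq object rather than a plain string
myseqrec = SeqRecord(
    Seq(data["seq"]),
    id="pACBB-lycopene_vector",
    name="pACBB-lycopene_vector",
    description="pACBB-lycopene_vector"
)

# data["features"] is a list of dicts, so build one SeqFeature per entry.
# Assumed keys per entry: start, end, strand ("+", "-" or "."), type, qualifiers
strand_map = {"+": 1, "-": -1}
for f in data["features"]:
    location = FeatureLocation(f["start"], f["end"], strand=strand_map.get(f["strand"], 0))
    myseqrec.features.append(SeqFeature(location, type=f["type"], qualifiers=f["qualifiers"]))
myseqrec.features[0]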

In [64]:
from Bio import SeqIO

record = SeqIO.read("data/hs/addgene-plasmid-32548-sequence-37547.gbk", "gb")
record.features[1].extract(record).seq


Seq('AGCGAGTCAGTGAGCGAG')