# Analyzing the GC-Content of Codon-Optimized Sequences
GC-Content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine or cytosine. This measure is often cited as important in molecular biology, genomics, and systematics. We use it here to assess how similarly *Optipyzer* and other codon optimization tools, namely Integrated DNA Technologies' (IDT) tool, optimize the same sequences.
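
As a minimal sketch of the definition above, the function below computes GC-Content as a percentage. The name `gc_content` is illustrative and is not part of Optipyzer or IDT's tooling. The analysis cells later in this notebook store the same quantity as a fraction between 0 and 1 rather than as a percentage.

```python
def gc_content(sequence):
    """Return the percentage of bases in `sequence` that are G or C."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_count = seq.count("G") + seq.count("C")
    return 100 * gc_count / len(seq)


# illustrative input, not one of the study sequences
print(gc_content("ATGC"))
```

The function is case-insensitive and works for both DNA and RNA, since only G and C are counted.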

We identified 55 functional protein sequences that were randomly generated (Keefe and Szostak, 2001). These sequences were originally expressed in *Escherichia coli*; for this comparison, they were codon-optimized for expression in *Homo sapiens* on two platforms: IDT and *Optipyzer*. After optimization, the sequences were analyzed for their GC-Content.

## IDT Optimization
To use IDT's codon optimization tool, you must first register. Once registered, you can access their web interface. The optimization was run with the following parameters using their **bulk input**:
- Sequence type: Amino Acids
- Product type: Gene
- Organism: Homo sapiens
- Delimiter: FASTA

The sequences were pasted into the tool and submitted. Once the optimization completed, the optimized sequences were extracted using a console script. See [this](js/extract_idt_sequences.js) JavaScript file. The script was pasted directly into the browser console. The extracted output was then formatted as a FASTA file.
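
The final formatting step can be sketched as follows. This is a hedged illustration, not the procedure actually used: the output format of the extraction script is not shown here, so the sketch assumes the extracted sequences are already available as a mapping of identifier to sequence. The function name `to_fasta` and the example identifier are hypothetical.

```python
def to_fasta(sequences, width=60):
    """Format {identifier: sequence} pairs as FASTA text.

    Sequence lines are wrapped at `width` characters.
    """
    lines = []
    for seq_id, seq in sequences.items():
        lines.append(f">{seq_id}")
        # split the sequence into fixed-width chunks
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


# hypothetical identifier and sequence, for illustration only
print(to_fasta({"example_1": "ATGGCCTAA"}, width=6))
```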

## Optipyzer Optimization
The `optipyzer` package is used to optimize the sequences:

In [33]:
# install dependencies
!pip install optipyzer biopython tqdm pandas

In [24]:
# read in sequences
from Bio import SeqIO

records = list(SeqIO.parse("inputs/keefe_szostak.fasta", "fasta"))

In [15]:
from optipyzer.api import api
from tqdm import tqdm

SPECIES_ID='122563'
optimizer = api()

# run optimization
optimized_sequences = {}
for record in tqdm(records):
  result = optimizer.optimize(
    str(record.seq),
    {
      SPECIES_ID: 1
    },
    seq_type="protein",
    seed=99
  )
  optimized_sequences[record.id] = result

100%|██████████| 56/56 [01:40<00:00,  1.79s/it]


In [23]:
# write the optimized sequences to a FASTA file
out_path = "results"
file_name = "keefe_szostak_OPTIPYZER.fasta"
with open(f"{out_path}/{file_name}", "w") as fh:
  for seq_id, result in tqdm(optimized_sequences.items()):
    fh.write(f">{seq_id}\n")
    fh.write(result['optimized_sd'] + "\n")

100%|██████████| 56/56 [00:00<00:00, 114297.34it/s]
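
Codon optimization should change which codons are used, not the protein they encode, so one possible sanity check is to translate each optimized sequence back and compare it with the input protein. The sketch below does this with the standard genetic code. It assumes the optimized sequence is DNA read in frame from its first base. The names `translate` and `encodes_same_protein` are illustrative and are not part of the Optipyzer API.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA sequence; stop codons become '*'."""
    dna = dna.upper().replace("U", "T")
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def encodes_same_protein(protein, dna):
    """Compare a protein with a DNA translation, ignoring trailing stop codons."""
    return translate(dna).rstrip("*") == protein.upper().rstrip("*")


# illustrative input, not one of the study sequences
print(encodes_same_protein("MA", "ATGGCCTAA"))
```

Trailing stops are ignored in the comparison because tools may differ in whether they append a stop codon.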


In [25]:
# read in sequences
from Bio import SeqIO

idt_records = list(SeqIO.parse("results/keefe_szostak_IDT.fasta", "fasta"))
optipyzer_records = list(SeqIO.parse("results/keefe_szostak_OPTIPYZER.fasta", "fasta"))

In [31]:
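# compute the GC fraction for each IDT/Optipyzer sequence pair
# note: zip pairs records by position, so both files must list sequences in the same order
pairwise_gc_content = []
for idt, opti in zip(idt_records, optipyzer_records):
  idt_seq, optipyzer_seq = (str(idt.seq).lower(), str(opti.seq).lower())
  idt_gc = (idt_seq.count('g') + idt_seq.count('c'))/len(idt_seq)
  # normalize by the Optipyzer sequence's own length
  optipyzer_gc = (optipyzer_seq.count('g') + optipyzer_seq.count('c'))/len(optipyzer_seq)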
  pairwise_gc_content.append({
    'id': idt.id,
    'idt_gc': idt_gc,
    'optipyzer_gc': optipyzer_gc,
    'squared_diff': (idt_gc - optipyzer_gc)**2
  })

In [38]:
import pandas as pd
df = pd.DataFrame(pairwise_gc_content)
df.to_csv("results/pairwise_gc_content.csv")
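
One way the `squared_diff` values could be condensed into a single similarity figure is the mean squared difference and its square root (RMSE). The sketch below assumes rows shaped like the dictionaries built above. The function name `summarize_gc_differences` and the example values are hypothetical and are not results from this analysis.

```python
import math


def summarize_gc_differences(rows):
    """Return the mean squared difference and RMSE between paired GC fractions."""
    if not rows:
        raise ValueError("rows must not be empty")
    squared = [(row["idt_gc"] - row["optipyzer_gc"]) ** 2 for row in rows]
    mean_squared_diff = sum(squared) / len(squared)
    return {
        "mean_squared_diff": mean_squared_diff,
        "rmse": math.sqrt(mean_squared_diff),
    }


# made-up GC fractions, for illustration only
example_rows = [
    {"idt_gc": 0.50, "optipyzer_gc": 0.50},
    {"idt_gc": 0.60, "optipyzer_gc": 0.40},
]
print(summarize_gc_differences(example_rows))
```

An RMSE close to zero would indicate that the two tools produce sequences with similar GC-Content.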