# Analyzing the GC-Content of codon-optimized sequences
GC-Content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine or cytosine. This measure is often cited as important in molecular biology, genomics, and systematics. We use it here to assess how similarly *Optipyzer* and other codon optimization tools, namely the one from Integrated DNA Technologies (IDT), optimize the same sequences.
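
As a minimal sketch of the definition above, GC-Content can be computed by counting G and C bases and dividing by the sequence length. The helper name `gc_content` is an assumption made for illustration; it returns a fraction between 0 and 1, which can be multiplied by 100 to obtain a percentage.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are G or C."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Count guanine and cytosine, then normalize by this sequence's own length.
    return (seq.count("G") + seq.count("C")) / len(seq)


# Four of the six bases in this toy sequence are G or C.
example_fraction = gc_content("ATGCGC")
print(f"{example_fraction * 100:.1f}% GC")
```

Normalizing by the length of the same sequence that was counted keeps the value comparable across sequences of different lengths.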

We identified 55 functional protein sequences that were randomly generated (Keefe and Szostak, 2001). These sequences were originally expressed in *Escherichia coli*; for this analysis, they were codon-optimized for expression in *Homo sapiens* on two platforms: IDT and *Optipyzer*. After optimization, the sequences were analyzed for their GC-Content.

## IDT Optimization
To use IDT's codon optimization tool, you must first register. Once registered, you can access their web interface. The optimization was run with the following parameters using their **bulk input**:
- Sequence type: Amino Acids
- Product type: Gene
- Organism: Homo sapiens
- Delimiter: FASTA

The sequences were pasted into their tool and submitted. Once the optimization completed, the optimized sequences were extracted with a script pasted directly into the browser console; see [this JavaScript file](js/extract_idt_sequences.js). The extracted output was then formatted as a FASTA file.
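
The formatting step is not shown in this notebook, so the following is only a sketch of one way it could be done. It assumes the extracted sequences are available as a mapping from sequence id to nucleotide string; the function name `to_fasta` and the 60-character line width are illustrative choices, not part of the original workflow.

```python
def to_fasta(sequences: dict, width: int = 60) -> str:
    """Format a mapping of id -> sequence as FASTA text."""
    lines = []
    for seq_id, seq in sequences.items():
        # Header line, then the sequence wrapped to a fixed width.
        lines.append(f">{seq_id}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical ids and sequences, for illustration only.
print(to_fasta({"example_1": "ATGGCC", "example_2": "ATGAAATTT"}, width=6))
```

The resulting text can be written to a path such as `results/keefe_szostak_IDT.fasta`, the file read later in this notebook.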

## Optipyzer Optimization
The `optipyzer` package is used to optimize the sequences:

In [2]:
# install dependencies
%pip install optipyzer tqdm pandas biopython

Collecting biopython
  Downloading biopython-1.80.tar.gz (17.9 MB)
  Preparing metadata (setup.py) ... done
Using legacy 'setup.py install' for biopython, since package 'wheel' is not installed.
Installing collected packages: biopython
  Running setup.py install for biopython ... done
Successfully installed biopython-1.80


In [3]:
# read in sequences
from Bio import SeqIO

records = list(SeqIO.parse("inputs/keefe_szostak.fasta", "fasta"))

In [12]:
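from optipyzer.api import API
from tqdm import tqdm

SPECIES = "human"
optimizer = API()

# run optimization for a single target species, with a fixed seed
# so that repeated runs produce the same result
optimized_sequences = {}
for record in tqdm(records):
  result = optimizer.optimize(
    str(record.seq),
    {SPECIES: 1},
    seq_type="protein",
    seed=99
  )
  # keep the full result, keyed by the FASTA record id
  optimized_sequences[record.id] = result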

100%|██████████| 56/56 [01:19<00:00,  1.42s/it]


In [6]:
# write the optimized sequences to a FASTA file
out_path = "results"
file = "keefe_szostak_OPTIPYZER.fasta"
with open(f"{out_path}/{file}", "w") as fh:
  # seq_id avoids shadowing the built-in id()
  for seq_id, result in tqdm(optimized_sequences.items()):
    fh.write(f">{seq_id}\n")
    fh.write(result["optimized_sd"] + "\n")

100%|██████████| 56/56 [00:00<00:00, 113578.83it/s]


In [1]:
# read in sequences
from Bio import SeqIO

idt_records = list(SeqIO.parse("results/keefe_szostak_IDT.fasta", "fasta"))
optipyzer_records = list(SeqIO.parse("results/keefe_szostak_OPTIPYZER.fasta", "fasta"))
jcat_records = list(SeqIO.parse("results/keefe_szostak_JCAT.fasta", "fasta"))

In [3]:
def gc_fraction(seq):
  # fraction (0 to 1) of bases that are G or C, relative to this sequence's own length
  seq = str(seq).lower()
  return (seq.count('g') + seq.count('c')) / len(seq)

pairwise_gc_content = []
# records are paired by position, so all three files must list sequences in the same order
for idt, opti, jcat in zip(idt_records, optipyzer_records, jcat_records):
  idt_gc = gc_fraction(idt.seq)
  optipyzer_gc = gc_fraction(opti.seq)
  jcat_gc = gc_fraction(jcat.seq)
  pairwise_gc_content.append({
    'id': idt.id,
    'idt_gc': idt_gc,
    'optipyzer_gc': optipyzer_gc,
    'jcat_gc': jcat_gc,
    'squared_diff': (idt_gc - optipyzer_gc)**2
  })
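
Because `zip` pairs records purely by position, a mismatch in record order between the result files would silently compare unrelated sequences. The check below is a minimal sketch of a guard against that; it assumes the tools preserve the original record ids, which may not hold for every export, and `check_same_ids` is a hypothetical helper name.

```python
def check_same_ids(*id_lists):
    """Raise ValueError if any id list differs from the first in length or order."""
    reference = list(id_lists[0])
    for position, ids in enumerate(id_lists[1:], start=1):
        if list(ids) != reference:
            raise ValueError(f"id list {position} does not match the first list")
    return True


# Toy ids for illustration; in the notebook these could come from
# [r.id for r in idt_records] and the equivalent lists for the other tools.
check_same_ids(["seq_1", "seq_2"], ["seq_1", "seq_2"])
```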

In [4]:
import pandas as pd
df = pd.DataFrame(pairwise_gc_content)
df.to_csv("results/pairwise_gc_content.csv")
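
One way to condense the per-sequence values into a similarity summary is to average the squared differences and take the square root, giving a root-mean-square difference in GC-Content between the two tools. The sketch below works on a list of dictionaries with the same keys as `pairwise_gc_content`; the function name `summarize_gc_agreement` is hypothetical, and the example rows are made-up values used only to show the shape of the input, not results from this analysis.

```python
import math
from statistics import fmean


def summarize_gc_agreement(rows):
    """Summarize IDT vs. Optipyzer GC agreement from per-sequence rows."""
    mean_squared_diff = fmean([row["squared_diff"] for row in rows])
    return {
        "n": len(rows),
        "mean_idt_gc": fmean([row["idt_gc"] for row in rows]),
        "mean_optipyzer_gc": fmean([row["optipyzer_gc"] for row in rows]),
        "mean_squared_diff": mean_squared_diff,
        # Root-mean-square difference, in the same units as the GC fraction.
        "rmse": math.sqrt(mean_squared_diff),
    }


# Illustrative rows only; these are not values from the notebook.
example_rows = [
    {"id": "a", "idt_gc": 0.50, "optipyzer_gc": 0.52, "squared_diff": (0.50 - 0.52) ** 2},
    {"id": "b", "idt_gc": 0.55, "optipyzer_gc": 0.55, "squared_diff": 0.0},
    {"id": "c", "idt_gc": 0.60, "optipyzer_gc": 0.56, "squared_diff": (0.60 - 0.56) ** 2},
]
summary = summarize_gc_agreement(example_rows)
print(summary)
```

A smaller root-mean-square difference indicates that the two tools produce sequences with more similar GC-Content.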