## **PCR Primer Design using Python**
**Written by Kilian Zindel**

Polymerase Chain Reaction (PCR) is a method used to amplify a DNA strand, that is, to make billions of copies of it. It is often a precursor to many other procedures used in genetic testing and research. In this method, DNA is replicated using many repeated cycles of heating and cooling, called thermal cycles. Three processes occur during each cycle:
1. **Denaturation**: The temperature is raised to near-boiling, causing the double-stranded DNA to separate (or denature) into single strands.
2. **Annealing**: The temperature is decreased to around 50–65°C, and **PCR Primers** bind (or anneal) to their complementary matches on the DNA sequence.
3. **Extension**: The temperature is raised to around 68–72°C, at which point a polymerase enzyme binds at the **Primer** and adds nucleotides to its 3' end to complete the strand.

The cycle is then repeated in order to duplicate DNA exponentially. Under perfect conditions, it would only take 30 cycles to produce over a billion copies of the target DNA.

PCR Primers are short, single-stranded DNA sequences, typically spanning 18 to 25 nucleotides in length. They are essential for the extension phase of PCR because the Polymerase cannot start a new sequence, only add to the 3' end of an existing one. There are different kinds of primers designed for different use cases. In this tutorial I will be designing primers optimized for [Polymerase Chain Reaction (PCR)](https://www.biointeractive.org/classroom-resources/polymerase-chain-reaction-pcr). 

In [6]:
# The power of exponential growth 
number_of_cycles = 30
duplicate_DNA_strands = pow(2, number_of_cycles)
duplicate_DNA_strands

1073741824

![PCR-image](https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Polymerase_chain_reaction-en.svg/1024px-Polymerase_chain_reaction-en.svg.png)

#### Calculating the Melting Temperature (Tm) of a Primer

An important consideration when designing a PCR Primer is its melting temperature (Tm), the temperature at which half of the primer-template duplexes have dissociated. During the annealing process, the temperature plays a critical role in how primers bind to the template DNA because it affects the stability of the hydrogen bonds between the primer and the template. If the temperature is too high (e.g., >72°C), the hydrogen bonds break and the primer begins to dissociate from the DNA. If the temperature is too low (e.g., <50°C), primers are more likely to bind and stick to non-target sequences.

The range of 50–65°C is considered optimal for the annealing temperature (Ta) in PCR. The exact annealing temperature should be set 3–5°C below the primer's melting temperature, which means the primer's melting temperature should be 55°C or higher.
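
To make the arithmetic concrete, here is a minimal sketch of the "3–5°C below Tm" rule. The function name `suggested_annealing_range` is my own for illustration and is not part of any library.

```python
def suggested_annealing_range(tm):
    """Return (lowest, highest) suggested annealing temperature in °C.

    Assumes the rule of thumb stated above: Ta sits 3-5 °C below the primer's Tm.
    """
    return (tm - 5, tm - 3)

# A primer with Tm = 55 °C gives an annealing range that starts at 50 °C,
# the bottom of the 50-65 °C window.
print(suggested_annealing_range(55))  # (50, 52)
print(suggested_annealing_range(60))  # (55, 57)
```

This shows why a melting temperature below 55°C is a problem: the suggested annealing temperature would drop below 50°C.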

This temperature range is optimal for two main reasons:
1. Specificity: The temperature is high enough for primers to anneal specifically to their complementary sequence on the template DNA.
2. Stability: The temperature is low enough for primers to form stable bonds with the template DNA.

The **Melting Temperature (Tm)** depends on three main factors:
1. GC Content: higher GC content means a higher melting temperature
2. Length: longer primers have a higher melting temperature
3. Ionic Strength of the DNA Solution (total ion concentration): the primer-template duplex is more stable in a solution with high ionic strength, which raises the melting temperature

The melting temperature rises roughly linearly with GC content, so a short primer's melting temperature can be approximated using the Wallace rule, which adds 2°C for every A or T and 4°C for every G or C. Let's define two functions to calculate GC content and melting temperature.

In [36]:
def calc_gc_content(seq):
    # calculate the percentage of nucleotides in sequence that are either 'G' or 'C'
    c = seq.count('C')
    g = seq.count('G')
    gc_content = (g + c) / len(seq)
    return round(100 * gc_content, 0) 

def calc_tm(seq):
    # calculate the melting temperature of the sequence
    # Simple approximation (Wallace rule):
    # Tm ≈ 2°C*(A+T) + 4°C*(G+C)
    a = seq.count('A')
    t = seq.count('T')
    g = seq.count('G')
    c = seq.count('C')
    return 2*(a+t) + 4*(g+c)
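
As a quick worked example of the Wallace rule, the sketch below repeats the same arithmetic inline on a single 20-base primer, so it runs on its own. The variable names are only for illustration.

```python
primer = "ATGGCCTTGGTTGACGGTTT"  # example 20-base primer

# Count weak (A/T) and strong (G/C) bases
at = primer.count("A") + primer.count("T")
gc = primer.count("G") + primer.count("C")

gc_percent = 100 * gc / len(primer)
tm = 2 * at + 4 * gc  # Wallace rule: 2 °C per A/T, 4 °C per G/C

print(at, gc)      # 10 10
print(gc_percent)  # 50.0
print(tm)          # 60
```

With 10 A/T bases and 10 G/C bases, the estimate is 2·10 + 4·10 = 60°C, which falls inside the 55–65°C target range.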

#### Detecting runs of repeated nucleotides
We will also need a function to detect runs of more than 4 identical bases. It's best practice to avoid such primers because long runs of one base can cause "breathing" of the primer, in which a base bulges out. For example, a run of 5 guanine nucleotides might bind to a run of 4 cytosine nucleotides. This can lead to mispriming.

In [44]:
def check_max_run(seq, max_run=4):
    count = 1
    for i in range(1, len(seq)):
        # check if the previous and current bases are the same
        if seq[i-1] == seq[i]:
            # if so: increment the count
            count += 1
            if count > max_run:
                return True
        else:
            # otherwise: reset the count
            count = 1 
    return False
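
For comparison, the same idea can be written with `itertools.groupby` from the standard library, which groups consecutive identical characters. This is an alternative sketch, not the function used in the rest of the tutorial, and the name `longest_run` is my own.

```python
from itertools import groupby

def longest_run(seq):
    """Return the length of the longest stretch of identical consecutive bases."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)

print(longest_run("ATGGGGGCAT"))  # 5 -> too long, a run of five G's
print(longest_run("ATGCATGC"))    # 1 -> no repeated bases
```

Returning the run length instead of True/False makes it easy to compare against whatever limit you choose, for example `longest_run(primer) > 4`.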

#### Checking the GC Clamp

Because GC bonds are stronger than AT bonds, having 1–3 G/C nucleotides at the 3' end of the primer encourages complete primer binding, reduces the chance of the primer dissociating, and aids extension during PCR. The 3' end (the last 5 nucleotides) is critical because this is where the Taq polymerase begins adding nucleotides.

We also need to ensure that we don't overclamp (having more than 3 G/C nucleotides in the last 5). Overclamping can lead to the formation of primer dimers (when primers bind to each other) or other secondary structures like hairpins.

Let's write a function to check the 3' end and ensure it contains the recommended number of G/C nucleotides, and check if the last nucleotide is a G or C.

In [45]:
def check_gc_clamp(seq, max_gc_clamp=3):
    # check the last nucleotide 
    last_nt = seq[-1]
    if not (last_nt == 'G' or last_nt == 'C'):
        return False
    
    # check the GC clamp 
    clamp = seq[-5:]  # the last 5 nucleotides (the 3' end)
    g = clamp.count('G')
    c = clamp.count('C')
    gc_content = g + c
    if 1 <= gc_content <= max_gc_clamp:
        # The GC Clamp is optimal 
        return True 
    else: 
        return False 
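
To see what the clamp rule looks at, this small sketch counts the G/C bases in the last five positions of a few primers. The helper name `three_prime_gc_count` is hypothetical and only used here.

```python
def three_prime_gc_count(seq, window=5):
    """Count G/C bases in the last `window` bases (the 3' end) of a primer."""
    tail = seq[-window:]
    return tail.count("G") + tail.count("C")

for primer in ("ATGGCCTTGGTTGACGGTTT", "TTGGGTCTTATTACTCTCTG"):
    tail = primer[-5:]
    print(primer, tail, three_prime_gc_count(primer), primer[-1] in "GC")
```

The first primer ends in `GGTTT` (2 G/C bases, but the final base is a T), while the second ends in `CTCTG` (3 G/C bases and a final G), so only the second satisfies both parts of the clamp rule described above.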

#### Best Practices in Selecting Primers
- https://www.bocsci.com/resources/what-are-oligonucleotide-primers.html?srsltid=AfmBOoqrulSfsaXCtnkj0b8bwHnZRzN-KGALtljojbyG2wuYoq10FRNd
- https://www.thermofisher.com/blog/behindthebench/pcr-primer-design-tips/

There are a number of guidelines and best practices we should adhere to when selecting Primers
- Primers typically range from 18 to 24 nucleotides in length
- Optimal melting temperature ranges from 55°C to 65°C
- Optimal GC content is between 40% and 60%
- A stable 3' end which terminates in 2-3 G/C bases.
- Avoid runs of more than 4 identical bases
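
The guidelines above can be gathered into a single check. The sketch below is one possible way to do that, not a definitive implementation: the function name `check_guidelines` is hypothetical, Tm is estimated with the Wallace rule, and the 3' end rule is interpreted as "last base is G or C, with at most 3 G/C among the last 5 bases".

```python
from itertools import groupby

def check_guidelines(primer):
    """Return a dict mapping each guideline to True (pass) or False (fail)."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    at = primer.count("A") + primer.count("T")
    gc_percent = 100 * gc / len(primer)
    tm = 2 * at + 4 * gc  # Wallace rule approximation
    tail_gc = sum(base in "GC" for base in primer[-5:])
    longest = max(len(list(group)) for _, group in groupby(primer))
    return {
        "length": 18 <= len(primer) <= 24,
        "tm": 55 <= tm <= 65,
        "gc_content": 40 <= gc_percent <= 60,
        "gc_clamp": primer[-1] in "GC" and tail_gc <= 3,
        "max_run": longest <= 4,
    }

checks = check_guidelines("TTGGGTCTTATTACTCTCTG")
print(checks)
print(all(checks.values()))  # True
```

Keeping one entry per guideline makes it easy to see which rule rejected a given candidate.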

#### Now Let's go get some Sequences!
Here are some websites with the info we need:
- https://www.ncbi.nlm.nih.gov/gene
- https://www.uniprot.org/uniprotkb/P25084/entry

Instead of manually copying sequences from the website, we can use the Python `requests` library and the NCBI E-utilities API to retrieve the information we need.

In [30]:
import requests

def fetch_sequence(accession):
    # gets FASTA data from the NCBI database using an accession number 
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    params = {
        'db': 'nucleotide',
        'id': accession,
        'rettype': 'fasta',
        'retmode': 'text'
    }
    response = requests.get(base_url + "efetch.fcgi", params=params)
    # stop with a clear error if the request failed
    response.raise_for_status()
    
    # parse FASTA to extract sequence
    fasta_data = response.text 
    # remove the first line and all newline characters
    lines = fasta_data.splitlines()
    sequence = "".join(lines[1:])
    return sequence

# Accession Number for Example Sequence: 
accession = "NM_001268006.2" # Caenorhabditis elegans act-1 gene,

# Fetch Sequence from NCBI Database 
seq = fetch_sequence(accession)
seq

'TTTCATATGTTTTCGTCATAAATAAATAGTTACAAGAAATAATGGAGTCGTCTGACAATTTACATGATATAGATAATCTTGAAAACGGTAACATGGCTTGCCAGTGCTTCTTGGTTGGAGCCGGATACGTGGCTCTTGCAGCTGTGGCTTATCGTCTTTTGACGATTTTCTCGAATATTTTGGGCCCATACGTTCTTCTGTCGCCAATCGATTTGAAGAAAAGAGCTGGAGCTTCTTGGGCTGTTGTCACCGGAGCCACTGACGGAATCGGAAAAGCATACGCCTTCGAATTGGCTCGTCGTGGATTCAATGTCCTGCTCGTTTCGCGTACCCAATCAAAACTCGATGAGACGAAGAAGGAGATTCTCGAGAAGTATTCCAGCATTGAGGTCCGCACTGCCGCCTTCGACTTCACCAACGCTGCTCCTTCTGCTTACAAAGATCTTCTCGCCACCTTGAACCAAGTAGAGATCGGAGTTCTTATTAACAACGTTGGAATGAGCTACGAATATCCAGATGTACTTCACAAAGTTGACGGTGGAATCGAGCGTCTTGCAAACATCACCACCATCAACACTCTTCCACCAACATTGCTCTCCGCCGGAATCCTTCCACAAATGGTCGCACGAAAGGCTGGAGTCATTGTTAATGTTGGATCTTCAGCTGGAGCAAATCAAATGGCTCTCTGGGCTGTGTATTCAGCTACAAAGAAGTATGTCTCCTGGCTCACCGCTATCCTCCGAAAAGAATATGAACATCAAGGAATCACTGTCCAAACTATTGCTCCAATGATGGTCGCCACAAAGATGTCAAAAGTCAAGAGAACTTCATTCTTCACTCCAGACGGAGCCGTGTTCGCTAAATCAGCTCTGAACACTGTTGGAAATACCTCAGACACCACCGGATACATCACGCATCAACTTCAACTCGAGCTCATGGATCTCATTCCAACATTCATCCGCGACAAGATCCTCACAAATATGAGTGTCGGAACTCGTG
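
The FASTA parsing step can be tried offline, without calling NCBI. A FASTA record is a header line starting with `>` followed by the sequence wrapped over several lines. The record below is made up for illustration, and `parse_fasta` is a hypothetical helper name.

```python
def parse_fasta(fasta_text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = fasta_text.strip().splitlines()
    header = lines[0].lstrip(">")
    # join the wrapped sequence lines into one continuous string
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence

# Made-up record, not real NCBI output
example = ">example_record hypothetical header\nATGGCCTTGG\nTTGACGGTTT\n"
header, sequence = parse_fasta(example)
print(header)    # example_record hypothetical header
print(sequence)  # ATGGCCTTGGTTGACGGTTT
```

This sketch assumes the text contains exactly one record; a file with several records would need to be split on the `>` lines first.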

In [15]:
import pandas as pd
candidates = []
primer_length = 20

for start_pos in range(len(seq) - primer_length + 1):
    forward_primer = seq[start_pos:start_pos+primer_length]
    candidates.append({
        'start': start_pos,
        'primer_seq': forward_primer,  # the slice is already a string
    })
df_candidates = pd.DataFrame(candidates)

df_candidates

Unnamed: 0,start,primer_seq
0,0,ATGGCCTTGGTTGACGGTTT
1,1,TGGCCTTGGTTGACGGTTTT
2,2,GGCCTTGGTTGACGGTTTTC
3,3,GCCTTGGTTGACGGTTTTCT
4,4,CCTTGGTTGACGGTTTTCTT
...,...,...
696,696,AATTTGGGTCTTATTACTCT
697,697,ATTTGGGTCTTATTACTCTC
698,698,TTTGGGTCTTATTACTCTCT
699,699,TTGGGTCTTATTACTCTCTG


In [16]:
def reverse_complement(seq):
    complement_map = {'A':'T','T':'A','C':'G','G':'C','N':'N'}
    rc = ''.join(complement_map.get(base, 'N') for base in reversed(seq))
    return rc
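
A reverse primer binds the opposite strand, so its sequence is the reverse complement of the template region where the amplified product should end. The sketch below shows an equivalent way to compute it with `str.translate`. The name `reverse_complement_translate` is my own, and unlike the dictionary version it leaves unrecognized characters unchanged instead of mapping them to `N`.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement_translate(seq):
    """Complement every base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

region = "TTGGGTCTTATTACTCTCTG"  # last 20 bases of an example target region
reverse_primer = reverse_complement_translate(region)
print(reverse_primer)  # CAGAGAGTAATAAGACCCAA

# Applying the operation twice returns the original sequence
print(reverse_complement_translate(reverse_primer) == region)  # True
```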

In [17]:
def gc_content(seq):
    seq = seq.upper()
    return (seq.count('G') + seq.count('C')) / len(seq) * 100


In [None]:
 
# INPUTS: 

sequence = "ACTG..."  # Target sequence
# target region ???
min_length = 18
max_length = 24
gc_range = (40, 60)  # GC content range in %
tm_range = (55, 65)  # Melting temperature range in °C
max_self_complementarity = 5  # Maximum allowed score for self-complementarity
max_gc_clamp = 3  # Max G/C bases among the last 5 at the 3' end (avoids overclamping)
max_hairpin_delta_g = -9.0  # Threshold for hairpin stability (kcal/mol)

# get sequence from database using accession number
accession = "NM_000000"
sequence = fetch_sequence(accession)

# generate a list of possible forward and reverse primers
# TODO: implement generate_primers
primer_candidates = generate_primers(sequence)


# filter by criteria (melting temperature, desired length, GC content, etc.)

# check for secondary structures (hairpins, self-dimers, cross-dimers, etc.)

# check specificity with BLAST search 
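
The outline above calls a `generate_primers` helper. Here is one way it might look, as a sketch under my own assumptions: it slides a window of every allowed length along the sequence and records each window as a forward candidate and its reverse complement as a reverse candidate. The dictionary keys are illustrative.

```python
def generate_primers(sequence, min_length=18, max_length=24):
    """Enumerate forward and reverse primer candidates of every allowed length."""
    sequence = sequence.upper()
    complement = str.maketrans("ACGT", "TGCA")
    candidates = []
    for length in range(min_length, max_length + 1):
        for start in range(len(sequence) - length + 1):
            window = sequence[start:start + length]
            candidates.append({"start": start, "length": length,
                               "direction": "forward", "primer_seq": window})
            # the reverse primer is the reverse complement of the same window
            candidates.append({"start": start, "length": length,
                               "direction": "reverse",
                               "primer_seq": window.translate(complement)[::-1]})
    return candidates

demo = generate_primers("ATGGCCTTGGTTGACGGTTT", min_length=18, max_length=20)
print(len(demo))  # 12 -> (3 + 2 + 1) windows x 2 directions
print(demo[0])
```

Enumerating everything first and filtering afterwards keeps each filtering rule independent, at the cost of building a large list for long sequences.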




#### More Best Practices
- Forward and reverse primers should have melting temperatures within 5°C of each other
- Avoid intra-primer homology (3 bases that complement each other within the primer)
- Avoid inter-primer homology (forward and reverse primers with complementary sequences), which can lead to self-dimers and primer dimers instead of annealing to the desired DNA sequence
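
Two of these pair-level rules can be sketched in a few lines. This is a deliberately crude illustration: the function names are hypothetical, Tm uses the Wallace rule, and the dimer check only asks whether the last 3 bases of one primer could pair anywhere on the other. Real primer design tools score such interactions thermodynamically.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def wallace_tm(seq):
    """Wallace rule: 2 °C per A/T plus 4 °C per G/C."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def tm_compatible(forward, reverse, max_diff=5):
    """True if the two primers' estimated Tm values differ by at most max_diff °C."""
    return abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_diff

def three_prime_can_pair(primer_a, primer_b, n=3):
    """True if the last n bases of primer_a are complementary to some site in primer_b."""
    tail_rc = primer_a[-n:].translate(COMPLEMENT)[::-1]
    return tail_rc in primer_b

print(tm_compatible("ATGGCCTTGGTTGACGGTTT", "TTGGGTCTTATTACTCTCTG"))  # True (60 vs 56)
print(three_prime_can_pair("AAAAGCA", "TTTGCTT"))  # True: GCA pairs with TGC
```

Passing the same primer as both arguments to `three_prime_can_pair` gives a rough check for self-complementarity.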