In [3]:
import numpy as np

**Question 1**: Write a function to calculate the long ORF coverage fraction, given a DNA sequence as input, a minimum long ORF length, and which set of codons to use as stops.

The function <code>reverse_complement</code> takes a DNA sequence as input, reverses it (so that the complementary strand is read from 5' to 3'), and then complements each base using a dictionary of complementary base pairs. This function is defined first because it is called within <code>orf_fraction</code> to check the reverse complement of the genome for long ORFs.

In [4]:
# function to output the reverse complement of the sequence and access the other three frames
def reverse_complement(sequence: str) -> str:
    # reverse the sequence
    sequence = sequence[::-1]

    # dictionary of complementary base pairs
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

    # Use list comprehension to build the reverse complement
    reverse_complement_sequence = ''.join([complement[nucleotide] for nucleotide in sequence])

    return reverse_complement_sequence
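
As a quick sanity check on the reverse-complement logic, here is a minimal sketch of an equivalent approach using Python's built-in `str.maketrans` and `str.translate`. The name `reverse_complement_translate` is hypothetical and not part of the assignment code. One difference to be aware of: `translate` leaves characters outside A/C/G/T (for example `N`) unchanged, while the dictionary version above would raise a `KeyError`.

```python
# Hypothetical alternative to the dictionary-based version above:
# a translation table maps each base to its complement in a single pass.
_COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def reverse_complement_translate(sequence: str) -> str:
    # complement every base, then reverse so the result reads 5' to 3'
    return sequence.translate(_COMPLEMENT_TABLE)[::-1]

example = "ATGCCGTAA"
rc = reverse_complement_translate(example)
print(rc)  # TTACGGCAT

# applying the operation twice should recover the original strand
print(reverse_complement_translate(rc) == example)  # True
```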

<code>orf_fraction</code> takes in a DNA sequence, a list of defined codons to detect as stop codons, and a minimum number of amino acids to use as a benchmark for counting a sequence of uninterrupted DNA as an ORF. Before reading through any frames, variables <code>template_orf_length</code> and <code>reverse_orf_length</code> are initialized to zero, as these variables will be used to keep track of the total lengths of the long ORFs in the template strand and complementary strand, respectively. 

The function then iterates through the first three frames on the template DNA strand using a for loop. An amino acid counter is initialized to zero (<code>amino_acids = 0</code>) in order to keep track of the number of amino acids counted in the current ORF. A for loop is nested within this larger for loop in order to iterate through every three nucleotides in the DNA sequence, or to examine the DNA sequence one codon/amino acid at a time. 

An if statement is then used to check whether the codon is a stop codon. If the current codon is a stop codon and the current ORF length (tracked by <code>amino_acids</code>) is less than the minimum ORF length (<code>min_ORF</code>, 200 amino acids by default), the function does not count that section of the DNA as a long ORF; <code>amino_acids</code> is reset to zero and counting starts over. If the current codon is a stop codon and the current ORF length is at least <code>min_ORF</code>, the stretch is counted as a long ORF: its length in nucleotides (<code>amino_acids * 3</code>) is added to <code>template_orf_length</code>, which tracks the total ORF coverage for the template frames, and <code>amino_acids</code> is then reset to zero. If the current codon is not a stop codon, the amino acid counter increases by one and the function moves on to the next codon. An ORF that is still open when the end of the frame is reached is also counted if it meets the minimum length.

This process is then repeated for the reverse complement genome in a for loop below the one described above, with the total ORF coverage for the reverse complement genome being stored in <code>reverse_orf_length</code>. 

Finally, the total ORF coverage for all six frames is calculated as <code>orf_length = template_orf_length + reverse_orf_length</code>. The long ORF coverage fraction is then computed by dividing this ORF coverage by the length of the original sequence. 

In [5]:
# function to calculate the long ORF coverage fraction
def orf_fraction(sequence: str, stop_codons: list, min_ORF: int = 200) -> float:

    total_length = len(sequence)

    template_orf_length = 0
    reverse_orf_length = 0

    # read through three frames on the template DNA strand
    for i in range(0,3):

        # set seq as the current frame being read
        seq = sequence[i:]

        # start the counter for the number of amino acids on the template strand
        amino_acids = 0

        # iterate through the frame three nucleotides (one codon) at a time
        for j in range(0, len(seq) - 2, 3):
            codon = seq[j:j+3]

            # if we hit a stop codon, check two cases
            if codon in stop_codons:

                # if the amino acid count is less than the minimum ORF length, reset amino acid count to zero
                if amino_acids < min_ORF:
                    amino_acids = 0

                # if the amino acid count is at least the minimum ORF length, add on to the total ORF length
                elif amino_acids >= min_ORF:
                    template_orf_length += amino_acids * 3

                    # reset the amino acid count to zero
                    amino_acids = 0

            # increase the amino acid count by one if we do not hit a stop codon
            else:
                amino_acids += 1

        # count a long ORF that runs to the end of the frame without hitting a stop codon
        if amino_acids >= min_ORF:
            template_orf_length += amino_acids * 3

    # read through three frames on the reverse complement genome
    for i in range(0, 3):
        # start the counter for the number of amino acids on the complementary strand
        reverse_amino_acids = 0
        reverse_seq = reverse_complement(sequence)[i:]

        # iterate through the complementary frame three nucleotides (one codon) at a time
        for k in range(0, len(reverse_seq) - 2, 3):
            codon = reverse_seq[k:k+3]

            if codon in stop_codons:
                if reverse_amino_acids < min_ORF:
                    reverse_amino_acids = 0

                elif reverse_amino_acids >= min_ORF:
                    reverse_orf_length += reverse_amino_acids * 3
                    reverse_amino_acids = 0

            else:
                reverse_amino_acids += 1

        # count a long ORF that runs to the end of the frame without hitting a stop codon
        if reverse_amino_acids >= min_ORF:
            reverse_orf_length += reverse_amino_acids * 3

    orf_length = template_orf_length + reverse_orf_length

    fraction = round(orf_length / total_length, 3)
    return fraction
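
To make the counting rule concrete, the sketch below re-implements the scan for a single reading frame in compact form and runs it on a toy sequence with a very small minimum length. The helper name `frame_coverage` and the toy sequence are illustrative assumptions, not part of the assignment code. The logic is intended to mirror the inner loop of `orf_fraction`.

```python
def frame_coverage(frame_seq: str, stop_codons: list, min_orf: int) -> int:
    """Nucleotides covered by stop-free runs of at least min_orf codons in one frame."""
    covered = 0
    run = 0  # sense codons seen since the last stop codon
    for j in range(0, len(frame_seq) - 2, 3):
        if frame_seq[j:j + 3] in stop_codons:
            # a stop closes the current run; count it only if it is long enough
            if run >= min_orf:
                covered += run * 3
            run = 0
        else:
            run += 1
    # a run still open at the end of the frame also counts
    if run >= min_orf:
        covered += run * 3
    return covered

stops = ["TAA", "TAG", "TGA"]
# frame 0 reads as: GCC GCC GCC TAA GCC TGA GCC GCC
toy = "GCC" * 3 + "TAA" + "GCC" + "TGA" + "GCC" * 2

print(frame_coverage(toy, stops, min_orf=2))  # 15 (runs of 3 and 2 codons count)
print(frame_coverage(toy, stops, min_orf=3))  # 9  (only the run of 3 codons counts)
```

With `min_orf=2`, the single-codon run between the two stops is discarded, which is the same behavior the full function applies at a 200-codon threshold.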

**Question 2**: Write a function to generate a synthetic random DNA sequence of a given GC composition.

Using Python's random module, the function <code>syn_random_DNA</code> generates a synthetic random DNA sequence given a sequence length and composition of GC bases (as a decimal) as inputs. The function selects from the set of four nucleotides using weights that are defined by the GC composition input; it is assumed that G and C split the GC composition input evenly. The function then joins the generated nucleotides and returns the sequence. 

In [6]:
import random

# set of four nucleotides to select from in generating a synthetic random DNA sequence
nucleotides = ['G', 'A', 'T', 'C']

# function to generate a synthetic random DNA sequence given GC composition and expected sequence length
def syn_random_DNA(sequence_length: int, GC_comp: float) -> str:

    # set up the fractions of the sequence that are G, A, T, C with the given GC composition as input
    G = GC_comp/2
    C = GC_comp/2
    A = (1 - GC_comp) / 2
    T = (1 - GC_comp) / 2

    # using the defined weights/fractions, generate a random sequence of length 'sequence_length'
    random_sequence = ''.join(random.choices(nucleotides, weights = (G, A, T, C), k = sequence_length))
    return random_sequence
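
One way to check that the weighting behaves as intended is to measure the GC fraction of a generated sequence and compare it with the requested value. The sketch below is self-contained and repeats the same `random.choices` call; the helper `gc_fraction`, the fixed seed, and the sequence length of 100,000 are assumptions made only for this check.

```python
import random

def gc_fraction(sequence: str) -> float:
    # fraction of bases that are G or C
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

random.seed(0)  # fixed seed only so that the check is repeatable
target_gc = 0.7

# same weighting scheme as syn_random_DNA: G and C split the GC share evenly
weights = (target_gc / 2, (1 - target_gc) / 2, (1 - target_gc) / 2, target_gc / 2)
sample = "".join(random.choices(["G", "A", "T", "C"], weights=weights, k=100_000))

# the observed value should land close to the target for a sequence this long
print(round(gc_fraction(sample), 3))
```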

In [7]:
print('GC_comp', '\t', 'ORF_fraction','\n', '-' * 28)

# define set of stop codons
stop_codons = ["TAA", "TAG", "TGA"]

# list of varying GC compositions that we will be testing out
compositions = [0.4, 0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.98, 0.99, 1]

# iterate through the list of GC compositions and calculate the long ORF coverage fraction for the
# corresponding GC composition
for GC_comp in compositions:
    x = syn_random_DNA(10000, GC_comp)
    print(GC_comp, '\t', '\t', orf_fraction(x, stop_codons))

GC_comp 	 ORF_fraction 
 ----------------------------
0.4 	 	 0.0
0.5 	 	 0.0
0.6 	 	 0.069
0.7 	 	 0.379
0.75 	 	 1.228
0.8 	 	 2.384
0.85 	 	 4.0
0.9 	 	 5.328
0.95 	 	 5.996
0.98 	 	 5.998
0.99 	 	 5.999
1 	 	 5.999


The table above lists the long ORF coverage fractions for randomly generated DNA sequences of varying GC composition. As GC composition approaches 1, the long ORF coverage fraction increases toward its maximum of about 6: coverage is summed over all six reading frames and divided by the single-strand sequence length, so a value of 6 corresponds to 100% coverage in every frame. This makes sense because the stop codons (TAA, TAG, TGA) are AT-rich, so the higher the share of G and C nucleotides in the generated sequence, the less likely a stop codon is to occur.
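
The trend can also be estimated analytically. Under the same model used to generate the sequences (independent bases, with G and C each at half the GC fraction and A and T each at half the remainder), the probability that a random codon is a stop is P(TAA) + P(TAG) + P(TGA). The sketch below computes that probability and the chance that 200 consecutive random codons contain no stop; the function names are hypothetical. At 50% GC the stop probability works out to 3/64, that is, 3 stop codons out of 64 equally likely codons.

```python
def stop_codon_probability(gc: float) -> float:
    """Chance that a random codon is TAA, TAG, or TGA under the independent-base model."""
    a = t = (1 - gc) / 2
    g = gc / 2
    # P(TAA) + P(TAG) + P(TGA)
    return t * a * a + t * a * g + t * g * a

def prob_no_stop(gc: float, n_codons: int = 200) -> float:
    """Chance that n_codons consecutive random codons contain no stop codon."""
    return (1 - stop_codon_probability(gc)) ** n_codons

# columns: GC fraction, per-codon stop probability, chance of a stop-free 200-codon stretch
for gc in (0.5, 0.7, 0.9):
    print(gc, round(stop_codon_probability(gc), 4), round(prob_no_stop(gc), 4))
```

This is only a rough guide to the simulated values, since the simulation counts every stop-free run of at least 200 codons rather than fixed 200-codon windows.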

The results of this table support our first hypothesis: GC composition can have a strong effect on the probability of seeing a long ORF, so in GC-rich genomes many long ORFs may be false positives rather than actual protein-coding genes. This could potentially be the cause of Lestrade's inflated long-ORF-coverage statistic.

Long ORF coverage jumps sharply at around **70% GC composition**, whereas at 40-50% GC it was 0.0 in these random sequences. It should therefore be reasonably safe to treat long ORFs as statistically significant relative to random expectation in genomes whose GC composition is below this level.

**Question 3**: Get the 10 phage genomes that Moriarty challenged you with.

**Question 4:** Write a function that takes the name of a sequence file as input, parses the sequence file, and returns the DNA sequence.

The function <code>converter</code> takes a file name, opens the file as <code>file</code>, and iterates through every line in order to extract the cleaned DNA sequence. It does this by initializing an empty string named <code>sequence</code> and then using a for loop over the file's lines. Because the genome files are in FASTA format, the DNA sequence consists of every line that is not the sequence name, that is, every line that does not start with a <code> > </code> symbol. If a line starts with <code> > </code>, it is skipped; otherwise the line is stripped of whitespace and line breaks and appended to <code>sequence</code>. The final, cleaned DNA sequence is then returned.

In [31]:
# function that takes file name as input
def converter(file_name):

    # read and open the file
    with open(file_name, 'r') as file:
        sequence = ''
        for line in file:

            # skip over the sequence name
            if line.startswith('>'):
                continue

            # strip whitespace/linebreaks from every line and then add to the sequence string
            else:
                sequence += line.strip()

    # return the cleaned, isolated DNA sequence
    return sequence
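
The parser can be exercised without any of the phage files by writing a tiny FASTA record to a temporary file. The sketch below does that with a stand-in function, `read_fasta_sequence`, which is a hypothetical name that follows the same skip-the-header approach as `converter`; the record contents are made up for illustration.

```python
import os
import tempfile

def read_fasta_sequence(file_name: str) -> str:
    # same approach as converter: skip header lines, concatenate everything else
    parts = []
    with open(file_name, "r") as handle:
        for line in handle:
            if not line.startswith(">"):
                parts.append(line.strip())
    return "".join(parts)

# write a small record whose sequence is split across two lines
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">toy_phage example record\nACGTAC\nGTTTGA\n")
    path = tmp.name

toy_sequence = read_fasta_sequence(path)
os.remove(path)  # clean up the temporary file

print(toy_sequence)  # ACGTACGTTTGA
```

Collecting the lines in a list and joining once avoids repeated string concatenation, which may matter for larger genomes; for phage-sized files either approach should be fine.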

**Question 5:** For each of the 10 genomes, read the sequence in from its FASTA file and calculate its long ORF coverage fraction using the standard genetic code, using TAG recoded as a sense codon (TAA|TGA are stops), and using TGA recoded (TAA|TAG are stops). Output a table with a row for each phage, showing these coverage fractions for each of the three genetic codes, and also showing the ratios for TAG coverage/standard coverage and TGA coverage/standard coverage.

The code below sets up the table output by defining the list of files for the 10 genomes that will be read and analyzed. The standard stop codons, and the reduced stop codon sets for the cases where TAG or TGA is recoded as a sense codon, are also defined for later use.

In [34]:
# list of genome file names
genomes = ['arugula.fa', 'basil.fa', 'chickpea.fa', 'gooseberry.fa', 'huckleberry.fa', 'juniper.fa', 'lentil.fa', 'quince.fa', 'sage.fa', 'tangerine.fa']
stop_codons = ["TAA", "TAG", "TGA"]

# define the lists for the recoded stop codons as sense codons
TAG_recoded = ['TAA', 'TGA']
TGA_recoded = ['TAA', 'TAG']

Using a for loop to iterate through each of the 10 genomes, a table including the long ORF coverage fraction for the standard code, TAG recoded as a sense codon, and TGA recoded as a sense codon is generated. Additionally, the last two columns are the ratios of the TAG recoded:standard code and TGA recoded:standard code. 

In [35]:
print('Phage Name', '\t', 'Standard Code', '\t', 'TAG Recoded', '\t', 'TGA Recoded', '\t', 'TAG:standard', '\t', 'TGA:standard', '\n')

for genome in genomes:
    sequence = converter(genome)

    # calculate long ORF coverage fraction using the standard genetic code
    standard = orf_fraction(sequence, stop_codons)

    # TAG recoded as a sense codon
    TAG_fraction = orf_fraction(sequence, TAG_recoded)

    # TGA recoded as a sense codon
    TGA_fraction = orf_fraction(sequence, TGA_recoded)

    TAG_to_standard = round(TAG_fraction / standard, 3)

    TGA_to_standard = round(TGA_fraction / standard, 3)

    print(genome, '\t', standard, '\t' *2, TAG_fraction, '\t' * 2, TGA_fraction, '\t'*2, TAG_to_standard, '\t'* 2, TGA_to_standard, '\t')

Phage Name 	 Standard Code 	 TAG Recoded 	 TGA Recoded 	 TAG:standard 	 TGA:standard 

arugula.fa 	 0.373 		 0.395 		 0.737 		 1.059 		 1.976 	
basil.fa 	 0.78 		 0.854 		 0.857 		 1.095 		 1.099 	
chickpea.fa 	 0.842 		 0.883 		 0.919 		 1.049 		 1.091 	
gooseberry.fa 	 0.289 		 0.776 		 0.387 		 2.685 		 1.339 	
huckleberry.fa 	 0.798 		 0.86 		 0.838 		 1.078 		 1.05 	
juniper.fa 	 0.285 		 0.75 		 0.347 		 2.632 		 1.218 	
lentil.fa 	 0.369 		 0.384 		 0.739 		 1.041 		 2.003 	
quince.fa 	 0.793 		 0.873 		 0.883 		 1.101 		 1.113 	
sage.fa 	 0.844 		 0.929 		 0.922 		 1.101 		 1.092 	
tangerine.fa 	 0.697 		 0.741 		 0.755 		 1.063 		 1.083 	
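
The tab-separated output above drifts out of alignment when file names differ in length, and the ratio calculation would raise `ZeroDivisionError` for a genome with zero standard coverage. A possible alternative, sketched below under the assumption that fixed-width columns are acceptable, uses f-string format specifiers and a guarded division. The helpers `safe_ratio` and `format_row` are hypothetical; the example row reuses the gooseberry values from the table.

```python
import math

def safe_ratio(numerator: float, denominator: float) -> float:
    # return NaN instead of raising ZeroDivisionError when standard coverage is 0
    if denominator == 0:
        return math.nan
    return round(numerator / denominator, 3)

def format_row(name: str, standard: float, tag: float, tga: float) -> str:
    # fixed-width columns stay aligned regardless of the file name length
    return (f"{name:<16}{standard:>10.3f}{tag:>10.3f}{tga:>10.3f}"
            f"{safe_ratio(tag, standard):>10.3f}{safe_ratio(tga, standard):>10.3f}")

print(f"{'Phage':<16}{'Standard':>10}{'TAG':>10}{'TGA':>10}{'TAG:std':>10}{'TGA:std':>10}")
print(format_row("gooseberry.fa", 0.289, 0.776, 0.387))
```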


Using fewer stop codons, of course the coverage has to go up a little (because you'll falsely extend some true coding regions, if you don't count their true stop codon as a stop), so those ratios are always > 1. But in recoded phage, the difference is dramatic, and you will be able to identify the two TAG-recoded and two TGA-recoded phage genomes. Which are they?


**Gooseberry and juniper are TAG-recoded, and arugula and lentil are TGA-recoded.** For these recoded phages, the relevant ratio is roughly 2.0 to 2.7, compared with about 1.04 to 1.34 for the other genomes in the same column.
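
The same call can be made programmatically by flagging any genome whose ratio exceeds a cutoff. In the sketch below the ratios are copied from the table above, while the cutoff of 1.5 is an assumption chosen by eye to separate the two groups, not a statistically derived threshold.

```python
# (TAG:standard, TGA:standard) ratios copied from the table above
ratios = {
    "arugula": (1.059, 1.976),
    "basil": (1.095, 1.099),
    "chickpea": (1.049, 1.091),
    "gooseberry": (2.685, 1.339),
    "huckleberry": (1.078, 1.05),
    "juniper": (2.632, 1.218),
    "lentil": (1.041, 2.003),
    "quince": (1.101, 1.113),
    "sage": (1.101, 1.092),
    "tangerine": (1.063, 1.083),
}

THRESHOLD = 1.5  # assumed cutoff, picked by eye from the table

tag_recoded = sorted(name for name, (tag, _) in ratios.items() if tag > THRESHOLD)
tga_recoded = sorted(name for name, (_, tga) in ratios.items() if tga > THRESHOLD)

print(tag_recoded)  # ['gooseberry', 'juniper']
print(tga_recoded)  # ['arugula', 'lentil']
```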

Phage genomes are expected to have high ORF coverage because they are densely packed with information: much of the genome is protein-coding. The jump in the TAG:standard and TGA:standard ratios for the phages above, compared to the other phages, points to the likely conclusion that they are TAG-recoded and TGA-recoded, respectively. This supports our second hypothesis, which is that analyzing a recoded genome with the standard genetic code results in a lower long ORF coverage fraction.