# **Final Exam** 
**Python for Genomic Data Science (Coursera, Johns Hopkins University)**

**Final Exam Instructions**

Please thoroughly read the instructions below before you take the Final Exam.

Write a Python program that takes as input a file containing DNA sequences in multi-FASTA format, and computes the answers to the following questions. You can choose to write one program with multiple functions to answer these questions, or you can write several programs to address them. We will provide a multi-FASTA file for you, and you will run your program to answer the exam questions. 

While developing your program(s), please use the following example file to test your work: 
dna.example.fasta

You'll be given a different input file to launch the exam itself.

Here are the questions your program needs to answer. The quiz itself contains the specific multiple-choice questions you need to answer for the file you will be provided.

(1) How many records are in the file? A record in a FASTA file is defined as a single-line header, followed by lines of sequence data. The header line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is an optional description of the entry. There should be no space between the ">" and the first letter of the identifier. 

(2) What are the lengths of the sequences in the file? What is the longest sequence and what is the shortest sequence? Is there more than one longest or shortest sequence? What are their identifiers? 

(3) In molecular biology, a reading frame is a way of dividing the DNA sequence of nucleotides into a set of consecutive, non-overlapping triplets (or codons). Depending on where we start, there are six possible reading frames: three in the forward (5' to 3') direction and three in the reverse (3' to 5'). For instance, the three possible forward reading frames for the sequence AGGTGACACCGCAAGCCTTATATTAGC are: 

AGG TGA CAC CGC AAG CCT TAT ATT AGC

A GGT GAC ACC GCA AGC CTT ATA TTA GC

AG GTG ACA CCG CAA GCC TTA TAT TAG C 

These are called reading frames 1, 2, and 3 respectively. An open reading frame (ORF) is the part of a reading frame that has the potential to encode a protein. It starts with a start codon (ATG), and ends with a stop codon (TAA, TAG or TGA). For instance, ATGAAATAG is an ORF of length 9.

Given an input reading frame on the forward strand (1, 2, or 3) your program should be able to identify all ORFs present in each sequence of the FASTA file, and answer the following questions: what is the length of the longest ORF in the file? What is the identifier of the sequence containing the longest ORF? For a given sequence identifier, what is the longest ORF contained in the sequence represented by that identifier? What is the starting position of the longest ORF in the sequence that contains it? The position should indicate the character number in the sequence. For instance, the following ORF in reading frame 1:

>sequence1

ATGCCCTAG

starts at position 1.

Note that because the following sequence:

>sequence2

ATGAAAAAA

does not have any stop codon in reading frame 1, we do not consider it to be an ORF in reading frame 1. 

(4) A repeat is a substring of a DNA sequence that occurs in multiple copies (more than one) somewhere in the sequence. Although repeats can occur on both the forward and reverse strands of the DNA sequence, we will only consider repeats on the forward strand here. Also we will allow repeats to overlap themselves. For example, the sequence ACACA contains two copies of the sequence ACA - once at position 1 (index 0 in Python), and once at position 3. Given a length n, your program should be able to identify all repeats of length n in all sequences in the FASTA file. Your program should also determine how many times each repeat occurs in the file, and which is the most frequent repeat of a given length.
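The repeat definition above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `count_repeats` and `most_frequent` are hypothetical, counts are pooled over all sequences, and overlapping copies are counted, as the instructions allow.

```python
from collections import Counter


def count_repeats(sequences, n):
    """Count every substring of length n that occurs more than once.

    Overlapping occurrences are counted, and counts are pooled over all
    sequences (forward strand only).
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    # A repeat needs more than one copy, so drop substrings seen once
    return {kmer: c for kmer, c in counts.items() if c > 1}


def most_frequent(repeats):
    """Return (max_count, sorted list of repeats tied at that count)."""
    if not repeats:
        return 0, []
    top = max(repeats.values())
    return top, sorted(k for k, c in repeats.items() if c == top)


# The example from the instructions: ACA occurs twice in ACACA
repeats = count_repeats(["ACACA"], 3)
print(repeats)
print(most_frequent(repeats))
```

Returning every repeat tied at the maximum, rather than a single one, makes it easy to answer questions about how many different repeats share the top count.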



**Opening file**

In [11]:
# Function for Opening fasta file 
def open_fasta_file(filename):
    """
    Opens a FASTA file and reads its contents.
    
    Parameters:
        filename (str): Path to the FASTA file.
    
    Returns:
        list | None: A list of lines from the file, or None if the file
        is missing or cannot be read. In that case an error message is
        printed instead of an exception being raised.
    """
    try:
        with open(filename, "r") as file:
            return file.readlines()  # Read all lines and return as a list  
    except FileNotFoundError:
        print(f"Error: The file '{filename}' was not found.")
        return None
    except IOError:
        print(f"Error: Could not read the file '{filename}'.")
        return None

In [12]:
# Function to parse the fasta file and store sequences in a dictionary 

def parse_fasta(file):
    """
    Parses a FASTA file and stores sequences in a dictionary.

    Args:
        file (iterable of str): Lines of a FASTA file (an open file object or a list of lines).

    Returns:
        dict: A dictionary with sequence names as keys and sequences as values.
    """
    sequences = {}  # Dictionary to store sequences
    name = None  # Initialize sequence name

    for line in file:
        line = line.rstrip()  # Remove trailing whitespace, including the newline

        if line.startswith('>'):  # Identify sequence header
            words = line.split()  # Split header line into words
            name = words[0][1:]  # Extract sequence name (without '>')
            sequences[name] = ''  # Initialize sequence in dictionary
        else:
            if name:  # Ensure there's an active sequence name
                sequences[name] += line  # Append sequence data

    return sequences

In [14]:
# Parsing sequences into dictionary
filename = "dna2.fasta"
file = open_fasta_file(filename)

if file:
    fasta_dict = parse_fasta(file)

    # Print parsed sequences
    for name, seq in fasta_dict.items():
        print(f">{name}\n{seq}")

>gi|142022655|gb|EQ086233.1|91
CTCGCGTTGCAGGCCGGCGTGTCGCGCAACGACGTGTGGGGCCTGACGGGCAGGGAGGATCTCGGCGGCGCCAACTATGCGGTCTTTCGGCTCGAAAGCCAGTTCCAGACCTCCGACGGCGCGCTGACCGTGCCCGGCTCCGCATTCAGTTCGCAAGCCTACGTCGGGCTCGGCGGCGACTGGGGGACCGTGACGCTCGGGCGCCAGTTCGATTTCGTCGGCGATCTGATGCCGGCTTTCGCGATCGGCGCGAACACGCCGGCCGGCCTGCTCGCGTGGGGCTTGCCGGCGAATGCGTCGGCGGGCGGTGCGCTCGACAACCGCGTGTGGGGCGTCCAGGTGAACAATGCGGTGAAGTACGTGAGCCCGACGTTCGGCGGATTGTCGTTCGGCGGCCTGTGGGGCTTCGGCAACGTGCCCGGCACGGTCGCGCGCAGCAGCGTGCAAAGCGCGATGCTGTCCTACACGCAAGGCGCGTTCAGCGCCGCGCTCGCTTATTTCGGCCAGCACGATGTAACTGCCGGTGGCAATCTGCGCAATTTCTCGGGCGGTGCAGGCTACAACGTCGGGCAGTTCCGCGTCTTCGGCATGGTGTCGGACGTGCGGATCAGCGCCGCCGCGCCGCTGCGGGCCACGACCTATGACGGCGGCTTGACCTATGCGGTCACGCCGGCGTTGCAGCTCGGCGGCGGCTTCCAGTACCAGCAGCGCGGCGGCGACATCGGCTCGGCCAACCAGGTCACGTTGAGCGCCGACTATTCGCTGTCGAAGCGTACCGGCCTTTACGTGGTATTCGCACGCGGGCACGACAGTGCGTATGGCGCGCAGGTCGAGGCGGCGCTCGGCGGGGCGGCGTCCGGCTCGACGCAGACCGCGGTCCGGCTCGGGCTGCGGCATCAGTTCTGACGATGCGCGAGAAACACGGGCTGCCGCGTACGCCGCGCGCGAGCCCGTGTTTTTCCGCCGGAT

In [17]:
fasta_dict

{'gi|142022655|gb|EQ086233.1|91': 'CTCGCGTTGCAGGCCGGCGTGTCGCGCAACGACGTGTGGGGCCTGACGGGCAGGGAGGATCTCGGCGGCGCCAACTATGCGGTCTTTCGGCTCGAAAGCCAGTTCCAGACCTCCGACGGCGCGCTGACCGTGCCCGGCTCCGCATTCAGTTCGCAAGCCTACGTCGGGCTCGGCGGCGACTGGGGGACCGTGACGCTCGGGCGCCAGTTCGATTTCGTCGGCGATCTGATGCCGGCTTTCGCGATCGGCGCGAACACGCCGGCCGGCCTGCTCGCGTGGGGCTTGCCGGCGAATGCGTCGGCGGGCGGTGCGCTCGACAACCGCGTGTGGGGCGTCCAGGTGAACAATGCGGTGAAGTACGTGAGCCCGACGTTCGGCGGATTGTCGTTCGGCGGCCTGTGGGGCTTCGGCAACGTGCCCGGCACGGTCGCGCGCAGCAGCGTGCAAAGCGCGATGCTGTCCTACACGCAAGGCGCGTTCAGCGCCGCGCTCGCTTATTTCGGCCAGCACGATGTAACTGCCGGTGGCAATCTGCGCAATTTCTCGGGCGGTGCAGGCTACAACGTCGGGCAGTTCCGCGTCTTCGGCATGGTGTCGGACGTGCGGATCAGCGCCGCCGCGCCGCTGCGGGCCACGACCTATGACGGCGGCTTGACCTATGCGGTCACGCCGGCGTTGCAGCTCGGCGGCGGCTTCCAGTACCAGCAGCGCGGCGGCGACATCGGCTCGGCCAACCAGGTCACGTTGAGCGCCGACTATTCGCTGTCGAAGCGTACCGGCCTTTACGTGGTATTCGCACGCGGGCACGACAGTGCGTATGGCGCGCAGGTCGAGGCGGCGCTCGGCGGGGCGGCGTCCGGCTCGACGCAGACCGCGGTCCGGCTCGGGCTGCGGCATCAGTTCTGACGATGCGCGAGAAACACGGGCTGCCGCGTACGCCGCGCGCGAGCCCGTGTTTTTCCGCC

**Question 1. How many records are in the file?**

In [18]:
"""
How many records are in the file?
"""
# Print total number of sequences
print(f"Total number of sequences: {len(fasta_dict)}")

Total number of sequences: 18
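As a cross-check on the count above, the number of records should equal the number of header lines, since each record starts with exactly one line beginning with ">". The sketch below assumes a hypothetical helper `count_records` and uses a small in-memory example rather than the exam file. One reason the two counts could differ: a dictionary keyed by identifier silently merges records that share an identifier.

```python
import io


def count_records(handle):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    return sum(1 for line in handle if line.startswith(">"))


# Hypothetical two-record example, held in memory instead of a file
example = io.StringIO(">seq1 first entry\nATGC\nGGTA\n>seq2\nTTAA\n")
print(count_records(example))
```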


**Question 2. What is the length of the longest sequence in the file?**

**Question 3. What is the length of the shortest sequence in the file?**

In [21]:
# Sort sequences based on length (longest to shortest)
sorted_fasta = dict(sorted(fasta_dict.items(), key=lambda item: len(item[1]), reverse=True))

# Print sorted sequences
print("Sequences sorted by length (longest to shortest):\n")
for name, seq in sorted_fasta.items():
    print(f">{name}")
    print(f"Total length: {len(seq)} bp")
    print("\n")

Sequences sorted by length (longest to shortest):

>gi|142022655|gb|EQ086233.1|255
Total length: 4894 bp


>gi|142022655|gb|EQ086233.1|16
Total length: 4804 bp


>gi|142022655|gb|EQ086233.1|91
Total length: 4635 bp


>gi|142022655|gb|EQ086233.1|454
Total length: 4564 bp


>gi|142022655|gb|EQ086233.1|293
Total length: 4338 bp


>gi|142022655|gb|EQ086233.1|396
Total length: 4076 bp


>gi|142022655|gb|EQ086233.1|45
Total length: 3511 bp


>gi|142022655|gb|EQ086233.1|250
Total length: 2867 bp


>gi|142022655|gb|EQ086233.1|527
Total length: 2646 bp


>gi|142022655|gb|EQ086233.1|4
Total length: 2095 bp


>gi|142022655|gb|EQ086233.1|277
Total length: 1432 bp


>gi|142022655|gb|EQ086233.1|75
Total length: 1352 bp


>gi|142022655|gb|EQ086233.1|304
Total length: 1151 bp


>gi|142022655|gb|EQ086233.1|594
Total length: 967 bp


>gi|142022655|gb|EQ086233.1|584
Total length: 964 bp


>gi|142022655|gb|EQ086233.1|88
Total length: 890 bp


>gi|142022655|gb|EQ086233.1|322
Total length: 442 bp


>gi|1420

In [25]:
'''
Question 2. What is the length of the longest sequence in the file?

Question 3. What is the length of the shortest sequence in the file?
'''


# Find the longest sequence
longest_seq_name, longest_seq = max(fasta_dict.items(), key=lambda item: len(item[1]))

# Find the shortest sequence
shortest_seq_name, shortest_seq = min(fasta_dict.items(), key=lambda item: len(item[1]))

# Print results
print(f"Longest sequence: {longest_seq_name}, Length: ({len(longest_seq)} bp)")
print(f"Shortest sequence: {shortest_seq_name}, Length: ({len(shortest_seq)} bp)")


Longest sequence: gi|142022655|gb|EQ086233.1|255, Length: (4894 bp)
Shortest sequence: gi|142022655|gb|EQ086233.1|346, Length: (115 bp)


**The longest sequence in the file: 4894 bp**

**The shortest sequence in the file: 115 bp**
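The exam also asks whether more than one sequence shares the longest or shortest length, and `max` and `min` each return only a single item. A minimal sketch of a tie-aware check follows; the function name `length_extremes` and the toy data are assumptions made for illustration.

```python
def length_extremes(sequences):
    """Return (max_len, ids_with_max, min_len, ids_with_min) for a dict of id -> sequence."""
    lengths = {name: len(seq) for name, seq in sequences.items()}
    longest = max(lengths.values())
    shortest = min(lengths.values())
    # Collect every identifier tied at each extreme, in insertion order
    longest_ids = [name for name, n in lengths.items() if n == longest]
    shortest_ids = [name for name, n in lengths.items() if n == shortest]
    return longest, longest_ids, shortest, shortest_ids


# Hypothetical toy data with a tie for the shortest sequence
toy = {"s1": "ACGTACGT", "s2": "ACG", "s3": "TTT"}
print(length_extremes(toy))
```

Applied to `fasta_dict`, the two identifier lists would show directly whether either extreme is shared.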

**Finding ORFs from all sequences**
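Before searching for ORFs, it may help to see the three forward reading frames from the instructions reproduced in code. The sketch below uses a hypothetical helper, `split_into_codons`, and keeps the partial groups at either end so the output mirrors the layout of the worked example.

```python
def split_into_codons(sequence, frame):
    """Split a sequence into the groups shown for forward reading frame 1, 2, or 3.

    The leading partial group (if any) and the trailing partial group are
    kept, matching the layout of the worked example in the instructions.
    """
    offset = frame - 1
    parts = [sequence[:offset]] if offset else []
    parts += [sequence[i:i + 3] for i in range(offset, len(sequence), 3)]
    return parts


seq = "AGGTGACACCGCAAGCCTTATATTAGC"
for frame in (1, 2, 3):
    print(frame, " ".join(split_into_codons(seq, frame)))
```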

In [34]:
# Creating function for finding ORFs in reading frame
def find_orfs_in_frame(sequence, frame=1):
    """
    Finds Open Reading Frames (ORFs) in a given reading frame.

    Parameters:
        sequence (str): The DNA sequence (uppercase).
        frame (int): The reading frame (1, 2, or 3).

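    Returns:
        list: ORF strings (start codon through stop codon) found in this frame.
        Scanning resumes after each stop codon, so ATG codons nested inside
        a reported ORF are not listed as separate ORFs.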
    """
    start_codon = "ATG"
    stop_codons = {"TAA", "TAG", "TGA"}
    orfs = []

    seq = sequence[frame - 1:]  # Adjust reading frame
    seq_length = len(seq)

    i = 0
    while i < seq_length - 2:
        codon = seq[i:i+3]

        if codon == start_codon:
            start_index = i
            for j in range(i, seq_length - 2, 3):
                stop_codon = seq[j:j+3]
                if stop_codon in stop_codons:
                    orfs.append(seq[start_index:j+3])  # Include stop codon
                    i = j  # Move to next possible ORF
                    break
        i += 3  # Move to next codon

    return orfs


In [29]:
# Function to find ORFs in all sequences 

def find_orfs_in_fasta(fasta_dict):
    """
    Finds ORFs in each reading frame (1, 2, 3) for all sequences in a FASTA dictionary.

    Parameters:
        fasta_dict (dict): Dictionary with sequence names as keys and DNA sequences as values.

    Returns:
        dict: Dictionary with sequence names as keys and a sub-dictionary of ORFs in each frame.
    """
    orf_results = {}

    for name, sequence in fasta_dict.items():
        sequence = sequence.upper()  # Ensure uppercase for consistency
        orf_results[name] = {
            "Frame 1": find_orfs_in_frame(sequence, frame=1),
            "Frame 2": find_orfs_in_frame(sequence, frame=2),
            "Frame 3": find_orfs_in_frame(sequence, frame=3)
        }

    return orf_results

In [30]:
# Finding ORFs in all sequences
orf_dict = find_orfs_in_fasta(fasta_dict)
orf_dict

{'gi|142022655|gb|EQ086233.1|91': {'Frame 1': ['ATGCCGGCTTTCGCGATCGGCGCGAACACGCCGGCCGGCCTGCTCGCGTGGGGCTTGCCGGCGAATGCGTCGGCGGGCGGTGCGCTCGACAACCGCGTGTGGGGCGTCCAGGTGAACAATGCGGTGAAGTACGTGAGCCCGACGTTCGGCGGATTGTCGTTCGGCGGCCTGTGGGGCTTCGGCAACGTGCCCGGCACGGTCGCGCGCAGCAGCGTGCAAAGCGCGATGCTGTCCTACACGCAAGGCGCGTTCAGCGCCGCGCTCGCTTATTTCGGCCAGCACGATGTAACTGCCGGTGGCAATCTGCGCAATTTCTCGGGCGGTGCAGGCTACAACGTCGGGCAGTTCCGCGTCTTCGGCATGGTGTCGGACGTGCGGATCAGCGCCGCCGCGCCGCTGCGGGCCACGACCTATGACGGCGGCTTGACCTATGCGGTCACGCCGGCGTTGCAGCTCGGCGGCGGCTTCCAGTACCAGCAGCGCGGCGGCGACATCGGCTCGGCCAACCAGGTCACGTTGAGCGCCGACTATTCGCTGTCGAAGCGTACCGGCCTTTACGTGGTATTCGCACGCGGGCACGACAGTGCGTATGGCGCGCAGGTCGAGGCGGCGCTCGGCGGGGCGGCGTCCGGCTCGACGCAGACCGCGGTCCGGCTCGGGCTGCGGCATCAGTTCTGA',
   'ATGCATCATCCCGACGCGCAACGCCAGCTGGTTGCGGCCCGACGACTGCCCGGCCGTGCCGAGCACGTGCGCGTAGTCGAAATCGGTGCCGGTATGGCCGGTTGCATGCTGGTACGCGCCCTGGATGTACGTCGAGGTTCGCTTGCTCAGATCGTAGTCGAGCATCATCGACACCTGGTGCCAGTTCGGCGACGCATCGCCCGCCGCCGTCGCGACATGCGCATGCGTGTAGATATAGGCGGCGCCGAGCCAGAGGTCCGGCCGGAA

In [37]:
# Function to find the longest ORF of all frames
def get_longest_orf(orf_dict):
    """
    Finds the longest ORF across all reading frames from all sequences.

    Parameters:
        orf_dict (dict): Dictionary containing ORFs for each sequence.

    Returns:
        tuple: (Longest ORF sequence, Length of the longest ORF, Sequence name, Frame)
    """
    longest_orf = ""
    longest_length = 0
    longest_seq_name = ""
    longest_frame = ""

    for name, frames in orf_dict.items():
        for frame, orfs in frames.items():
            for orf in orfs:
                if len(orf) > longest_length:
                    longest_orf = orf
                    longest_length = len(orf)
                    longest_seq_name = name
                    longest_frame = frame

    return longest_orf, longest_length, longest_seq_name, longest_frame

In [38]:
# Find the longest ORF across all frames
longest_orf_seq, longest_orf_length, seq_name, frame = get_longest_orf(orf_dict)

# Print result
print(f"Longest ORF found in {seq_name}, {frame}:")
print(f"Length: {longest_orf_length} bp")
print(f"Sequence: {longest_orf_seq}")

Longest ORF found in gi|142022655|gb|EQ086233.1|45, Frame 1:
Length: 2394 bp
Sequence: ATGGAGAAACAGTCTCGCGTTACGCGCGACGGTCGCGGGAGAGTTCTATGCGGTCATCGCTGCCGCGGTCGCGATTGGACTGGTCATGACGTTCGTTCATTTCGACCCGATTCGAGCGCTCTACTGGAGCGCCGTCATCAATGGGATCACGGCAGTGCCCATCATGGTGGTGATGATGCTGATGGCGCAGAGCCGGCGCGTGATGGGCGAGTTCGCAATCAGAGGACCGCTTGCGTGGGGAGGGTGGCTCGCGACGCTCGCCATGGCGCTCGCGGCGGCCGGAATGCTGCTGCCGGGATGAGCCGGCAATCCGGATGGAGAATGCGCATGCCCGCGACGCACCGGCGACGCCTCGCCGGACGGCGGGCGTCGCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCGAGCGCTCCATCGACGACGGTGGCGGCCACGCCCCGGAATTCGACATGCCTGCATCCTCCGATACGGCGAACCGGCGGGCGTCATCAATCGCGCGCATCCAGCGCGGGCTGAAGCGCGGGCTCGGCCGGCGCTGCCGGTTCATGGCCGCCGTGGCGCGCGGCGGTGGAATGGCCGGGCCGGATCCTGAACCAGATCGCATACATCGCGGGCAGGAACACGAGCGTGAGGACCGTCCCGGCGAACGTGCCGCCGATCAGCGTGTACGCGAGCGTGCCCCAGAACACCGAATGCGTGAGCGGAATGAACGCGAGCACGGCCGCCATCGCGGTAAGAATCACCGGGCGCGCCCGCTGCACGGTCGCTTCGACGACCGCGTGGAACGGATCGAGTCCCGCGTGTTCGTTCTGGTGGATCTGGCCGATCAGGATCAGCGTGTTGCGCATCAGGATCCCCGACAGCGCG

In [39]:
# Function to get the longest ORF in specific frame

def get_longest_orf_by_frame(orf_dict, frame):
    """
    Finds the longest ORF from a specified reading frame across all sequences.

    Parameters:
        orf_dict (dict): Dictionary containing ORFs for each sequence.
        frame (str): Reading frame to search in ("Frame 1", "Frame 2", or "Frame 3").

    Returns:
        tuple: (Sequence name, Length of the longest ORF, Longest ORF sequence)
    """
    longest_orf = ""
    longest_length = 0
    longest_seq_name = ""

    for name, frames in orf_dict.items():
        if frame in frames:  # Ensure the requested frame exists in the dictionary
            for orf in frames[frame]:  # Iterate through all ORFs in the specified frame
                if len(orf) > longest_length:
                    longest_orf = orf
                    longest_length = len(orf)
                    longest_seq_name = name

    return longest_seq_name, longest_length, longest_orf

In [40]:
# The longest ORF in Frame 1
frame_to_check = "Frame 1"
seq_name, longest_orf_length, longest_orf_seq = get_longest_orf_by_frame(orf_dict, frame_to_check)

# Print result
print(f"Longest ORF in {frame_to_check}:")
print(f"Sequence Name: {seq_name}")
print(f"Length: {longest_orf_length} bp")
print(f"Sequence: {longest_orf_seq}")

Longest ORF in Frame 1:
Sequence Name: gi|142022655|gb|EQ086233.1|45
Length: 2394 bp
Sequence: ATGGAGAAACAGTCTCGCGTTACGCGCGACGGTCGCGGGAGAGTTCTATGCGGTCATCGCTGCCGCGGTCGCGATTGGACTGGTCATGACGTTCGTTCATTTCGACCCGATTCGAGCGCTCTACTGGAGCGCCGTCATCAATGGGATCACGGCAGTGCCCATCATGGTGGTGATGATGCTGATGGCGCAGAGCCGGCGCGTGATGGGCGAGTTCGCAATCAGAGGACCGCTTGCGTGGGGAGGGTGGCTCGCGACGCTCGCCATGGCGCTCGCGGCGGCCGGAATGCTGCTGCCGGGATGAGCCGGCAATCCGGATGGAGAATGCGCATGCCCGCGACGCACCGGCGACGCCTCGCCGGACGGCGGGCGTCGCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCATTCGCCGAGCGCTCCATCGACGACGGTGGCGGCCACGCCCCGGAATTCGACATGCCTGCATCCTCCGATACGGCGAACCGGCGGGCGTCATCAATCGCGCGCATCCAGCGCGGGCTGAAGCGCGGGCTCGGCCGGCGCTGCCGGTTCATGGCCGCCGTGGCGCGCGGCGGTGGAATGGCCGGGCCGGATCCTGAACCAGATCGCATACATCGCGGGCAGGAACACGAGCGTGAGGACCGTCCCGGCGAACGTGCCGCCGATCAGCGTGTACGCGAGCGTGCCCCAGAACACCGAATGCGTGAGCGGAATGAACGCGAGCACGGCCGCCATCGCGGTAAGAATCACCGGGCGCGCCCGCTGCACGGTCGCTTCGACGACCGCGTGGAACGGATCGAGTCCCGCGTGTTCGTTCTGGTGGATCTGGCCGATCAGGATCAGCGTGTTGCGCATCAGGATCCCCG

In [43]:
# The longest ORF in Frame 2
frame_to_check_fr2 = "Frame 2"
seq_name_fr2, longest_orf_length_fr2, longest_orf_seq_fr2 = get_longest_orf_by_frame(orf_dict, frame_to_check_fr2)

# Print result
print(f"Longest ORF in {frame_to_check_fr2}:")
print(f"Sequence Name: {seq_name_fr2}")
print(f"Length: {longest_orf_length_fr2} bp")
print(f"Sequence: {longest_orf_seq_fr2}")

Longest ORF in Frame 2:
Sequence Name: gi|142022655|gb|EQ086233.1|16
Length: 1458 bp
Sequence: ATGGCAATCCTGATTCGTGGCGGCACCGTGGTCGATGCGGACCGTTCCTACCGCGCGGACGTGCTCTGCGCAGCCCCGGAGGACGGCGGCACGATCCTGCAGATCGCCGGGCAGATCGATGCGCCGGCCGGCGCGACCGTCGTCGATGCGCACGACCAGTACGTGATGCCGGGCGGCATCGATCCGCATACGCACATGGAACTGCCGTTCATGGGCACGACCGCGAGCGACGATTTCTACTCGGGTACGGCCGCCGGGCTCGCGGGCGGCACGACGAGCATCATCGACTTCGTGATCCCGAGCCCGAAGCAGCCGCTGATGGACGCGTTCCATGCCTGGCGCGGCTGGGCCGAGAAGGCGGCGGCCGACTACGGCTTCCACGTGGCCGTGACGTGGTGGGACGAGAGTGTGCACCGCGACATGGGCACGCTCGTGCGCGAACACGGCGTGTCGAGCTTCAAGCACTTCATGGCGTACAAGAACGCGATCATGGCCGACGACGAGGTGCTCGTGAACAGCTTCTCGCGTTCGCTCGAACTCGGCGCGTTGCCGACCGTGCATGCGGAGAACGGCGAGCTCGTGTTCCAGTTGCAGAAGGCGCTGCTCGCGCGCGGGATGACGGGGCCGGAGGCGCATCCGCTGTCGCGGCCGCCGGAGGTCGAGGGTGAGGCGGCGAATCGTGCGATCCGCATTGCGCAGGTGCTCGGCGTGCCGGTGTATATCGTGCATGTGTCCGCGAAGGACGCGGTCGATGCGATCACGAAGGCGCGCAGCGAAGGGCTGCGCGTGTTCGGCGAGGTGCTGCCGGGCCATCTGGTGATCGACGAGGCCGTCTATCGCGATCCGGACTGGACACGTGCGGCCGCGCACGTGATGAGCCCGCCGTTCCGCTCGGCCGAGCACCG

In [42]:
# The longest ORF in Frame 3
frame_to_check_fr3 = "Frame 3"
seq_name_fr3, longest_orf_length_fr3, longest_orf_seq_fr3 = get_longest_orf_by_frame(orf_dict, frame_to_check_fr3)

# Print result
print(f"Longest ORF in {frame_to_check_fr3}:")
print(f"Sequence Name: {seq_name_fr3}")
print(f"Length: {longest_orf_length_fr3} bp")
print(f"Sequence: {longest_orf_seq_fr3}")

Longest ORF in Frame 3:
Sequence Name: gi|142022655|gb|EQ086233.1|527
Length: 1821 bp
Sequence: ATGAACAGCGGGGCGAGCAAGCCGCCGGCCGTCACGGGGTCCATCACGAGGGACAGCAGCGGAATGCCGATGATCGCGAATCCACCACCGAACGCGCCGCGCATGAACGCGATCACGAACACGCCGGCAAACGCGATCAGGATCGTGGCCAGCGTCAATTGCAGGCCCATCGCAGCAGGGGTCGCCATCACGACCTCCATGCCGGTTCGAATCGCGGCGTGGCGGACAGCCACGGAGCGGGTCGCACGCGCGGCATCGCCGCACGATGGATCCGGGTTGAACGCGTTGCACCCATGCTGCTTCTCCAATGAGGTACCGGGGCGATGCGGTACACCAACGCACCGCAGGCCGCATGGGCCGCACAAGCATTTCAGCCCCGGTACAATCGACTTGACGAAAGCAGAATGCACCGCCGTCTATCTCAGTGCAATTAAAACATTGACCTCGGTGCAATATTCATTGTTATCGGTGCAATCCATGTCGAATTCCGAATACCTGCAGTTGGCCGACGCGATCGCCGCCCAAATTGCCGACGGCACGCTCAGGCCGGGCGACCGCCTGCCTCCGCAGCGTCATTTCGCCGACCAGCATGCGATCGCCGCATCGACGGCGGGACGGGTTTACGCGGAACTGTTACGGCGCGGCCTTGTGGTCGGCGAAGTCGGCCGAGGCACTTTCGTGTCGGGTGAGACGCGACGCGGGGCCGCTGCGCCGGGCGAGCCGCGCGGCGTTCGGATCGATTTCGAGTTCAACTACCCGACCGTCCCGGCCCAGACCGCGTTGATCACCAGAAGCCTGCGCGGATTGCACCGACCTGCGGAGCTCGACGCCGCGTTACGCGAGGCGACGAGTACCGGGACCCCGGTCATCCGAAGCGTTGCCGCCGCGTATCTGGCGCAGCATG

**Question 4. What is the length of the longest ORF appearing in reading frame 2 of any of the sequences?**

In [44]:
'''
What is the length of the longest ORF appearing in reading frame 2 of any of the sequences?
'''

# Print result
print(f"Length of the longest ORF in reading frame 2: {longest_orf_length_fr2} bp")

Length of the longest ORF in reading frame 2: 1458 bp


**Question 5. What is the starting position of the longest ORF in reading frame 3 in any of the sequences?** 

**The position should indicate the character number where the ORF begins. For instance, the following ORF:**

> sequence1 : **ATGCCCTAG**

**starts at position 1.**
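One way to obtain start positions directly is to record them while scanning a frame, instead of searching for the ORF string afterwards. The following is a minimal sketch under stated assumptions: the name `orfs_with_positions` is hypothetical, positions are 1-based and counted in the full sequence, and only the outermost ORF ending at each stop codon is reported.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orfs_with_positions(sequence, frame=1):
    """Return (start_position, orf) pairs for one forward reading frame.

    start_position is 1-based and counted in the full sequence.
    Only the outermost ORF ending at each stop codon is reported.
    """
    sequence = sequence.upper()
    found = []
    start = None  # 0-based index of the first ATG since the last stop codon
    for i in range(frame - 1, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            found.append((start + 1, sequence[start:i + 3]))
            start = None
    return found


# The two examples from the instructions
print(orfs_with_positions("ATGCCCTAG", frame=1))
print(orfs_with_positions("ATGAAAAAA", frame=1))
```

The first call reports one ORF starting at position 1. The second returns an empty list because no stop codon follows the ATG in frame 1.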

In [46]:
# The sequence name with the longest ORF in the reading frame 3
print(f"The sequence name with the longest ORF in the reading frame 3: {seq_name_fr3}")

The sequence name with the longest ORF in the reading frame 3: gi|142022655|gb|EQ086233.1|527


In [49]:
# Sequence with the longest ORF in reading frame 3 (identifier found above)
sequence_target = fasta_dict.get(seq_name_fr3)

print(seq_name_fr3)
print(f'Total length {len(sequence_target)} bp')
print(sequence_target)


gi|142022655|gb|EQ086233.1|527
Total length 2646 bp
GAGAACCGGGAACCGGAACCATGACAGCCCCGCGCCGGTTTTACGCGAGATAGCCGGAAACGCCGTCCCAGAGCAGTTTCAATGCGGTCACCGCCAGCAATCCGTAGCAGCTCCGGTAGATCAGGCGCTGGTCCAGCCTGCCGTGAAGCCGCCAGCCGAACACCACGCCGGCCGGAATGGCAAGCAGGCACACCGCCATCAACGCCCAGACGTTCGCGGTCGGCTGCACGATCAGCAGCCACGGCACTGCCTTGATCGCATTGCCCACGGTGAAGAACAGGCTCGTCGTTCCCGCGTACATCTCCTTGCTGAGGCCAAGCGGCAGCAGATACATCGCGAGCGGCGGCCCGCCCGAGTGCGCGACCATCGTCGTGACGCCCGATGCAAGGCCGGCCGAGACTGCCTTCGGCGACGAACGCGGACGAACCGTCGGCTCCGCCCCGCCCCTCACCCACAGCCCGACGAAGACCAGCGTGACCACCGCCATCAAAAGCTCGATGGCGCGATGGTCGAGGAAGCGGAAAGCCAGGTAACCGAACCCGATACCGACCACCAGCCCCGGCAGGAGCAGCACGAGGTCGGGCTTCGACCATGTCGACGGCTTCCAGTACCGCAGCGCGAACAGGTCCATCGCGATGAACAGCGGGGCGAGCAAGCCGCCGGCCGTCACGGGGTCCATCACGAGGGACAGCAGCGGAATGCCGATGATCGCGAATCCACCACCGAACGCGCCGCGCATGAACGCGATCACGAACACGCCGGCAAACGCGATCAGGATCGTGGCCAGCGTCAATTGCAGGCCCATCGCAGCAGGGGTCGCCATCACGACCTCCATGCCGGTTCGAATCGCGGCGTGGCGGACAGCCACGGAGCGGGTCGCACGCGCGGCATCGCCGCACGATGGATCCGGGTTGAACGCGTTGCACCCATGCTGCTTCTCCAATGAGG

In [55]:
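# Locate the longest ORF of reading frame 3 within its parent sequence.
# str.find() returns a 0-based index (or -1 if absent), so 1 is added below
# to report a 1-based position, as the question requires.
# Note: find() returns the first match of the ORF string, which is assumed
# here to be the in-frame occurrence.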
orf_start_index = sequence_target.find(longest_orf_seq_fr3)

if orf_start_index != -1:
    # Convert to 1-based position
    orf_start_position = orf_start_index + 1
    print(f"The starting position of the longest ORF in reading frame 3: {orf_start_position}")
else:
    print("ORF not found in the sequence.")

The starting position of the longest ORF in reading frame 3: 636
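Because `str.find` returns the first match regardless of reading frame, a quick sanity check on the reported position is worthwhile. With 1-based positions, an ORF in forward frame k must start at a position p where (p - 1) mod 3 equals k - 1. The helper name `frame_of_position` below is hypothetical.

```python
def frame_of_position(position):
    """Return the forward reading frame (1, 2, or 3) for a 1-based start position."""
    return (position - 1) % 3 + 1


# Start position reported above for the longest ORF in reading frame 3
print(frame_of_position(636))  # prints 3
```

Position 636 maps to frame 3, which is consistent with the frame that was searched.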


**Question 6. What is the length of the longest ORF appearing in any sequence and in any forward reading frame?**

In [57]:
# Find the longest ORF across all frames
longest_orf_seq, longest_orf_length, seq_name, frame = get_longest_orf(orf_dict)

# Print result
print(f"Longest ORF found in {seq_name}, {frame}:")
print(f"Length: {longest_orf_length} bp")

Longest ORF found in gi|142022655|gb|EQ086233.1|45, Frame 1:
Length: 2394 bp


In [59]:
# Validating the result (sorting ORFs from longest to shortest)

# Collect all ORFs with their metadata into a list
orf_list = []

for name, frames in orf_dict.items():
    for frame, orfs in frames.items():
        for orf in orfs:
            orf_list.append((name, frame, len(orf), orf))  # Store (sequence name, frame, ORF length, ORF sequence)

# Sort ORFs by length in descending order
orf_list_sorted = sorted(orf_list, key=lambda x: x[2], reverse=True)

# Print sorted ORFs
for name, frame, length, orf in orf_list_sorted:
    print(f"Sequence: {name}, Frame: {frame}, ORF Length: {length}")


Sequence: gi|142022655|gb|EQ086233.1|45, Frame: Frame 1, ORF Length: 2394
Sequence: gi|142022655|gb|EQ086233.1|527, Frame: Frame 3, ORF Length: 1821
Sequence: gi|142022655|gb|EQ086233.1|16, Frame: Frame 3, ORF Length: 1644
Sequence: gi|142022655|gb|EQ086233.1|250, Frame: Frame 1, ORF Length: 1560
Sequence: gi|142022655|gb|EQ086233.1|16, Frame: Frame 1, ORF Length: 1509
Sequence: gi|142022655|gb|EQ086233.1|16, Frame: Frame 2, ORF Length: 1458
Sequence: gi|142022655|gb|EQ086233.1|255, Frame: Frame 1, ORF Length: 1443
Sequence: gi|142022655|gb|EQ086233.1|454, Frame: Frame 3, ORF Length: 1401
Sequence: gi|142022655|gb|EQ086233.1|16, Frame: Frame 3, ORF Length: 1317
Sequence: gi|142022655|gb|EQ086233.1|91, Frame: Frame 1, ORF Length: 1296
Sequence: gi|142022655|gb|EQ086233.1|396, Frame: Frame 2, ORF Length: 1281
Sequence: gi|142022655|gb|EQ086233.1|255, Frame: Frame 2, ORF Length: 1185
Sequence: gi|142022655|gb|EQ086233.1|396, Frame: Frame 1, ORF Length: 1059
Sequence: gi|142022655|gb|EQ086

This confirms that the longest ORF across all forward frames is in **gi|142022655|gb|EQ086233.1|45, Frame 1**, with a length of **2394 bp**.

**Question 7. What is the length of the longest forward ORF that appears in the sequence with the identifier gi|142022655|gb|EQ086233.1|16?**

In [64]:
# Target Sequence with the identifier "gi|142022655|gb|EQ086233.1|16"
target_seq_id = "gi|142022655|gb|EQ086233.1|16"

# Check if the sequence exists in orf_dict
if target_seq_id in orf_dict:
    orf_list = []  # Store (frame, ORF length, ORF sequence)
    
    # Extract ORFs from all frames
    for frame, orfs in orf_dict[target_seq_id].items():
        for orf in orfs:
            orf_list.append((frame, len(orf), orf))  # Store (frame, ORF length, ORF sequence)
    
    # Sort ORFs by length (longest to shortest)
    orf_list_sorted = sorted(orf_list, key=lambda x: x[1], reverse=True)

    # Print results
    print(f"All ORFs in '{target_seq_id}' sorted by length (longest to shortest):\n")
    for frame, length, orf in orf_list_sorted:
        print(f"Frame: {frame}, ORF Length: {length} bp")
        print(f"ORF Sequence: {orf}\n")

else:
    print(f"Sequence ID '{target_seq_id}' not found in orf_dict.")

All ORFs in 'gi|142022655|gb|EQ086233.1|16' sorted by length (longest to shortest):

Frame: Frame 3, ORF Length: 1644 bp
ORF Sequence: ATGCGGGCCATCCTGCATCGCCGCCTTTCGTTCCACCCGGGCCGGCATCGAGTGATGCCGGCGTTGACGTTTTCGTGGAGTGAGTCAGATGAATCACGCAGCGAATCCCGCCGATCCCGATCGCGCCGCGGCGCAGGGCGGCAGCCTGTACAACGACGATCTCGCGCCGACGACGCCGGCGCAGCGCACGTGGAAGTGGTATCACTTCGCGGCGCTGTGGGTCGGGATGGTGATGAACATCGCGTCGTACATGCTCGCGGCCGGGCTGATCCAGGAAGGCATGTCGCCGTGGCAGGCGGTGACGACGGTGCTGCTCGGCAACCTGATCGTGCTCGTGCCGATGCTGCTGATCGGCCATGCGGGCGCGAAGCACGGGATTCCGTACGCGGTGCTCGTGCGCGCGTCGTTCGGCACGCAGGGGGCGAAGCTGCCGGCGCTGCTGCGCGCGATCGTCGCGTGCGGCTGGTACGGGATCCAGACCTGGCTCGGCGGCAGCGCGATCTATACGCTGCTGAACATCCTGACCGGCAACGCGCTGCATGGCGCCGCGCTGCCGGTCATCGGCATCGGGTTCGGGCAGCTCGCATGCTTCCTCGTGTTCTGGGCGCTGCAGCTCTACTTCATCTGGCATGGCACCGATTCGATCCGCTGGCTCGAAAGCTGGTCGGCGCCGATCAAGGTCGTGATGTGCGTGGCGCTGGTGTGGTGGGCAACGTCGAAGGCGGGCGGCTTCGGCACGATGCTGTCGGCGCCGTCGCAGTTTGCCGCAGGCGGCAAGAAAGCCGGGCTGTTCTGGGCGACCTTCTGGCCGGGGCTGACCGCGATGGTCGGCTTCTGGGCGACGCTCGCGCTGAACATCCCCGAC

In [72]:
# Target Sequence with the identifier "gi|142022655|gb|EQ086233.1|16"
target_seq_id = "gi|142022655|gb|EQ086233.1|16"

# Check if the sequence exists in orf_dict
if target_seq_id in orf_dict:
    orf_list = []  # Store (frame, ORF length, ORF sequence)
    
    # Extract ORFs from all frames
    for frame, orfs in orf_dict[target_seq_id].items():
        for orf in orfs:
            orf_list.append((frame, len(orf), orf))  # Store (frame, ORF length, ORF sequence)
    
    # Sort ORFs by length (longest to shortest)
    orf_list_sorted = sorted(orf_list, key=lambda x: x[1], reverse=True)

    # Extract the longest ORF
    if orf_list_sorted:
        longest_orf_frame, longest_orf_length, longest_orf_seq = orf_list_sorted[0]
        print(f"\nLongest ORF in '{target_seq_id}':")
        print(f"Frame: {longest_orf_frame}, Length: {longest_orf_length} bp")
        print(f"Sequence: {longest_orf_seq}")

else:
    print(f"Sequence ID '{target_seq_id}' not found in orf_dict.")


Longest ORF in 'gi|142022655|gb|EQ086233.1|16':
Frame: Frame 3, Length: 1644 bp
Sequence: ATGCGGGCCATCCTGCATCGCCGCCTTTCGTTCCACCCGGGCCGGCATCGAGTGATGCCGGCGTTGACGTTTTCGTGGAGTGAGTCAGATGAATCACGCAGCGAATCCCGCCGATCCCGATCGCGCCGCGGCGCAGGGCGGCAGCCTGTACAACGACGATCTCGCGCCGACGACGCCGGCGCAGCGCACGTGGAAGTGGTATCACTTCGCGGCGCTGTGGGTCGGGATGGTGATGAACATCGCGTCGTACATGCTCGCGGCCGGGCTGATCCAGGAAGGCATGTCGCCGTGGCAGGCGGTGACGACGGTGCTGCTCGGCAACCTGATCGTGCTCGTGCCGATGCTGCTGATCGGCCATGCGGGCGCGAAGCACGGGATTCCGTACGCGGTGCTCGTGCGCGCGTCGTTCGGCACGCAGGGGGCGAAGCTGCCGGCGCTGCTGCGCGCGATCGTCGCGTGCGGCTGGTACGGGATCCAGACCTGGCTCGGCGGCAGCGCGATCTATACGCTGCTGAACATCCTGACCGGCAACGCGCTGCATGGCGCCGCGCTGCCGGTCATCGGCATCGGGTTCGGGCAGCTCGCATGCTTCCTCGTGTTCTGGGCGCTGCAGCTCTACTTCATCTGGCATGGCACCGATTCGATCCGCTGGCTCGAAAGCTGGTCGGCGCCGATCAAGGTCGTGATGTGCGTGGCGCTGGTGTGGTGGGCAACGTCGAAGGCGGGCGGCTTCGGCACGATGCTGTCGGCGCCGTCGCAGTTTGCCGCAGGCGGCAAGAAAGCCGGGCTGTTCTGGGCGACCTTCTGGCCGGGGCTGACCGCGATGGTCGGCTTCTGGGCGACGCTCGCGCTGAACATCCCCGACTTCACGCGCTTCGCGCATTCGCAGCGCGACCAGGTGATCGGCCA

**Answer:** The longest ORF in the sequence with the identifier gi|142022655|gb|EQ086233.1|16 is **1644 bp** long.

**Question 8. Find the most frequently occurring repeat of length 6 in all sequences. How many times does it occur in all?**

In [67]:
# Find the most frequent repeat of length 6 (hexamer) across all sequences

# Importing library
from collections import defaultdict

# Dictionary to store hexamer (6-mer) counts
hexamer_counts = defaultdict(int)

# Extract all 6-mers from every sequence in fasta_dict
for seq_name, sequence in fasta_dict.items():
    for i in range(len(sequence) - 5):  # Ensure a valid 6-mer
        hexamer = sequence[i:i+6]  # Extract 6-mer
        hexamer_counts[hexamer] += 1  # Count occurrences

# Find the hexamer with the highest frequency
most_frequent_hexamer = max(hexamer_counts, key=hexamer_counts.get)
highest_frequency = hexamer_counts[most_frequent_hexamer]

print("Most frequent hexamer repeat:")
print(f"Repeat: {most_frequent_hexamer}, Frequency: {highest_frequency}")

Most frequent hexamer repeat:
Repeat: GCGCGC, Frequency: 153


In [68]:
# To inspect every hexamer and its frequency, uncomment the lines below:
# print("Hexamer repeats and their frequencies:\n")
# for hexamer, count in sorted(hexamer_counts.items(), key=lambda x: x[1], reverse=True):
#     print(f"Repeat: {hexamer}, Frequency: {count}")

**Question 9. Find all repeats of length 12 in the input file. Let's use Max to specify the number of copies of the most frequent repeat of length 12. How many different 12-base sequences occur Max times?**

In [69]:
# Find all repeats of length 12 (12-mers)

# Import defaultdict
from collections import defaultdict

# Dictionary to store 12-mer (12-base) counts
twelvemer_counts = defaultdict(int)

# Extract all 12-mers from every sequence in fasta_dict
for seq_name, sequence in fasta_dict.items():
    for i in range(len(sequence) - 11):  # Ensure valid 12-mer
        twelvemer = sequence[i:i+12]  # Extract 12-mer
        twelvemer_counts[twelvemer] += 1  # Count occurrences

# Find the maximum frequency (Max)
max_frequency = max(twelvemer_counts.values())

# Count how many 12-mers occur exactly 'Max' times
num_max_occurrences = sum(1 for count in twelvemer_counts.values() if count == max_frequency)

# Print results
print(f"The most frequent 12-mer (Max frequency): {max_frequency}")
print(f"Number of different 12-base sequences that occur {max_frequency} times: {num_max_occurrences}")

The most frequent 12-mer (Max frequency): 10
Number of different 12-base sequences that occur 10 times: 4


**Question 10. Which one of the following repeats of length 7 has a maximum number of occurrences?**

* TGCGCGC
* GCGCGCA
* CGCGCCG
* CATCGCC

In [71]:
# Find the most frequent repeat of length 7 (7-mer) across all sequences

from collections import defaultdict

# Dictionary to store sevenmer (7-mer) counts
sevenmer_counts = defaultdict(int)

# Extract all 7-mers from every sequence in fasta_dict
for seq_name, sequence in fasta_dict.items():
    for i in range(len(sequence) - 6):  # Ensure a valid 7-mer
        sevenmer = sequence[i:i+7]  # Extract 7-mer
        sevenmer_counts[sevenmer] += 1  # Count occurrences

# Find the sevenmer with the highest frequency
most_frequent_sevenmer = max(sevenmer_counts, key=sevenmer_counts.get)
highest_frequency = sevenmer_counts[most_frequent_sevenmer]

print("Most frequent sevenmer repeat:")
print(f"Repeat: {most_frequent_sevenmer}, Frequency: {highest_frequency}")


Most frequent sevenmer repeat:
Repeat: CGCGCCG, Frequency: 63


In [None]:
# To inspect every 7-mer and its frequency, uncomment the lines below:
# print("Sevenmer repeats and their frequencies:\n")
# for sevenmer, count in sorted(sevenmer_counts.items(), key=lambda x: x[1], reverse=True):
#     print(f"Repeat: {sevenmer}, Frequency: {count}")