## Final Exam Instructions
Please read the instructions below carefully before you take the Final Exam.

Write a Python program that takes as input a file containing DNA sequences in multi-FASTA format, and computes the answers to the following questions. You can choose to write one program with multiple functions to answer these questions, or you can write several programs to address them. We will provide a multi-FASTA file for you, and you will run your program to answer the exam questions. 

While developing your program(s), please use the following example file to test your work: 
dna.example.fasta

You will be given a different input file when you start the exam itself.

Here are the questions your program needs to answer. The quiz itself contains the specific multiple-choice questions you need to answer for the file you will be provided.

(1) How many records are in the file? A record in a FASTA file is defined as a single-line header, followed by lines of sequence data. The header line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is an optional description of the entry. There should be no space between the ">" and the first letter of the identifier.

(2) What are the lengths of the sequences in the file? What is the longest sequence and what is the shortest sequence? Is there more than one longest or shortest sequence? What are their identifiers?

(3) In molecular biology, a reading frame is a way of dividing the DNA sequence of nucleotides into a set of consecutive, non-overlapping triplets (or codons). Depending on where we start, there are six possible reading frames: three in the forward (5' to 3') direction and three in the reverse (3' to 5'). For instance, the three possible forward reading frames for the sequence AGGTGACACCGCAAGCCTTATATTAGC are: 

AGG TGA CAC CGC AAG CCT TAT ATT AGC

A GGT GAC ACC GCA AGC CTT ATA TTA GC

AG GTG ACA CCG CAA GCC TTA TAT TAG C 

These are called reading frames 1, 2, and 3 respectively. An open reading frame (ORF) is the part of a reading frame that has the potential to encode a protein. It starts with a start codon (ATG), and ends with a stop codon (TAA, TAG or TGA). For instance, ATGAAATAG is an ORF of length 9.

Given an input reading frame on the forward strand (1, 2, or 3) your program should be able to identify all ORFs present in each sequence of the FASTA file, and answer the following questions: what is the length of the longest ORF in the file? What is the identifier of the sequence containing the longest ORF? For a given sequence identifier, what is the longest ORF contained in the sequence represented by that identifier? What is the starting position of the longest ORF in the sequence that contains it? The position should indicate the character number in the sequence. For instance, the following ORF in reading frame 1:

>sequence1

ATGCCCTAG

starts at position 1.

Note that because the following sequence:

>sequence2

ATGAAAAAA

does not have any stop codon in reading frame 1, we do not consider it to be an ORF in reading frame 1.

(4) A repeat is a substring of a DNA sequence that occurs in multiple copies (more than one) somewhere in the sequence. Although repeats can occur on both the forward and reverse strands of the DNA sequence, we will only consider repeats on the forward strand here. Also we will allow repeats to overlap themselves. For example, the sequence ACACA contains two copies of the sequence ACA - once at position 1 (index 0 in Python), and once at position 3. Given a length n, your program should be able to identify all repeats of length n in all sequences in the FASTA file. Your program should also determine how many times each repeat occurs in the file, and which is the most frequent repeat of a given length.

In [52]:
import sys
from Bio import SeqIO
from collections import Counter

## Question 1

**How many records are in the file?** A record in a FASTA file is defined as a single-line header, followed by lines of sequence data. The header line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is an optional description of the entry. There should be no space between the ">" and the first letter of the identifier.

In [58]:
file = "dna2.fasta"
records = list(SeqIO.parse(file, "fasta"))
print(f"There are {len(records)} sequence records in the file.")

There are 18 sequence records in the file.
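
As a cross-check that does not depend on Biopython, the record definition above (one header line starting with ">" per record) can be turned into a simple line count. This is only a minimal sketch: the function name `count_fasta_records` and the in-memory example data are made up for illustration and are not part of the exam material.

```python
import io

def count_fasta_records(handle):
    """Count the lines that begin with '>' in an open text handle."""
    return sum(1 for line in handle if line.startswith(">"))

# Tiny in-memory example (made-up data, not the exam file)
example = io.StringIO(">seq1 first record\nACGT\nACGT\n>seq2\nGGCC\n")
print(count_fasta_records(example))  # 2
```

For a real file, the same function could be called on `open("dna2.fasta")`; the result should agree with `len(records)` as long as every header starts in the first column.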


## Question 2

**What are the lengths of the sequences in the file? What is the longest sequence and what is the shortest sequence? Is there more than one longest or shortest sequence? What are their identifiers?**

In [59]:
# Sequence lengths
print("Sequence lengths:")
for record in records:
    print(f"{record.id}: {len(record.seq)}")

# Create a list of all lengths and find max and min sequence length
lengths = [len(record.seq) for record in records]
max_len = max(lengths)
min_len = min(lengths)

# Check for duplicates by finding records matching the max and min lengths
longest_records = [record.id for record in records if len(record.seq) == max_len]
shortest_records = [record.id for record in records if len(record.seq) == min_len]

# Print summary
print(f"\nThe longest sequence is {max_len} basepairs long.")
if len(longest_records) > 1:
    print("There is more than one longest sequence.")
else:
    print("There is only one longest sequence.")
print("Longest sequence ID(s):", ', '.join(longest_records))

print(f"\nThe shortest sequence is {min_len} basepairs long.")
if len(shortest_records) > 1:
    print("There is more than one shortest sequence.")
else:
    print("There is only one shortest sequence.")
print("Shortest sequence ID(s):", ', '.join(shortest_records))

Sequence lengths:
gi|142022655|gb|EQ086233.1|91: 4635
gi|142022655|gb|EQ086233.1|304: 1151
gi|142022655|gb|EQ086233.1|255: 4894
gi|142022655|gb|EQ086233.1|45: 3511
gi|142022655|gb|EQ086233.1|396: 4076
gi|142022655|gb|EQ086233.1|250: 2867
gi|142022655|gb|EQ086233.1|322: 442
gi|142022655|gb|EQ086233.1|88: 890
gi|142022655|gb|EQ086233.1|594: 967
gi|142022655|gb|EQ086233.1|293: 4338
gi|142022655|gb|EQ086233.1|75: 1352
gi|142022655|gb|EQ086233.1|454: 4564
gi|142022655|gb|EQ086233.1|16: 4804
gi|142022655|gb|EQ086233.1|584: 964
gi|142022655|gb|EQ086233.1|4: 2095
gi|142022655|gb|EQ086233.1|277: 1432
gi|142022655|gb|EQ086233.1|346: 115
gi|142022655|gb|EQ086233.1|527: 2646

The longest sequence is 4894 basepairs long.
There is only one longest sequence.
Longest sequence ID(s): gi|142022655|gb|EQ086233.1|255

The shortest sequence is 115 basepairs long.
There is only one shortest sequence.
Shortest sequence ID(s): gi|142022655|gb|EQ086233.1|346


## Question 3

In molecular biology, a reading frame is a way of dividing the DNA sequence of nucleotides into a set of consecutive, non-overlapping triplets (or codons). Depending on where we start, there are six possible reading frames: three in the forward (5' to 3') direction and three in the reverse (3' to 5'). For instance, the three possible forward reading frames for the sequence AGGTGACACCGCAAGCCTTATATTAGC are: 

AGG TGA CAC CGC AAG CCT TAT ATT AGC

A GGT GAC ACC GCA AGC CTT ATA TTA GC

AG GTG ACA CCG CAA GCC TTA TAT TAG C 

These are called reading frames 1, 2, and 3 respectively. An open reading frame (ORF) is the part of a reading frame that has the potential to encode a protein. It starts with a start codon (ATG), and ends with a stop codon (TAA, TAG or TGA). For instance, ATGAAATAG is an ORF of length 9.
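
The three forward reading frames listed above can be reproduced with a few lines of slicing. The sketch below is illustrative only; the helper name `split_into_codons` is hypothetical, and it keeps the leftover bases at either end as shorter pieces so that the output matches the layout shown above.

```python
def split_into_codons(seq, frame):
    """Split seq into triplets for forward reading frame 1, 2, or 3.

    Bases before the frame offset and trailing bases that do not fill a
    codon are kept as shorter pieces.
    """
    offset = frame - 1  # frame 1 starts at Python index 0
    pieces = [seq[:offset]] if offset else []
    pieces += [seq[i:i + 3] for i in range(offset, len(seq), 3)]
    return " ".join(pieces)

seq = "AGGTGACACCGCAAGCCTTATATTAGC"
for frame in (1, 2, 3):
    print(frame, split_into_codons(seq, frame))
```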

Given an input reading frame on the forward strand (1, 2, or 3) your program should be able to identify all ORFs present in each sequence of the FASTA file, and answer the following questions: **What is the length of the longest ORF in the file? What is the identifier of the sequence containing the longest ORF? For a given sequence identifier, what is the longest ORF contained in the sequence represented by that identifier? What is the starting position of the longest ORF in the sequence that contains it?** The position should indicate the character number in the sequence. For instance, the following ORF in reading frame 1:

>sequence1

ATGCCCTAG

starts at position 1.

Note that because the following sequence:

>sequence2

ATGAAAAAA

does not have any stop codon in reading frame 1, we do not consider it to be an ORF in reading frame 1. 
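
The two small cases above (`sequence1` and `sequence2`) make convenient sanity checks for any ORF finder. The following is a minimal sketch under the definition given here: an ORF starts at ATG and ends at the first in-frame stop codon. The name `orfs_in_frame` is hypothetical, and the choice to report every ATG that has a downstream stop (including ATGs nested inside a longer ORF) is an assumption, not something the instructions prescribe.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def orfs_in_frame(seq, frame):
    """Return (start_position, orf) pairs for one forward frame (1, 2, or 3).

    start_position is 1-based. An ATG with no in-frame stop codon
    downstream is not reported.
    """
    seq = seq.upper()
    found = []
    for i in range(frame - 1, len(seq) - 2, 3):
        if seq[i:i + 3] != "ATG":
            continue
        # Scan downstream, codon by codon, for the first stop codon
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOP_CODONS:
                found.append((i + 1, seq[i:j + 3]))
                break
    return found

print(orfs_in_frame("ATGCCCTAG", 1))  # [(1, 'ATGCCCTAG')]
print(orfs_in_frame("ATGAAAAAA", 1))  # []
```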

In [70]:
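# Function to find ORFs
def find_orfs(seq, start_codon="ATG", stop_codons=("TAA", "TAG", "TGA"), frame_filter=3):
    """Return the forward-strand ORFs of seq as a list of dicts.

    frame_filter selects the biological reading frame (1, 2, or 3) to report;
    pass None to report ORFs from all three forward frames. Note that the
    default of 3 means a call without this argument reports frame 3 only.
    """
    seq = str(seq).upper()
    orfs = []

    for frame in range(3):  # Python offsets 0, 1, 2 = biological frames 1, 2, 3
        bio_frame = frame + 1
        if frame_filter is not None and bio_frame != frame_filter:
            continue  # skip frames that were not requested

        # Walk through this frame one codon at a time
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i+3] != start_codon:
                continue
            # Scan downstream, in frame, for the first stop codon
            for j in range(i + 3, len(seq) - 2, 3):
                if seq[j:j+3] in stop_codons:
                    orfs.append({
                        "start": i + 1,       # 1-based indexing
                        "end": j + 3,         # 1-based, inclusive of the stop codon
                        "frame": bio_frame,
                        "length": j + 3 - i,  # includes start and stop codons
                        "sequence": seq[i:j+3]
                    })
                    break
    return orfs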

# Load records
records = list(SeqIO.parse(file, "fasta"))

# Track the longest ORF overall
longest_orf = {"length": 0}

# Store ORFs by ID
orfs_by_id = {}

for record in records:
    orfs = find_orfs(record.seq)
    orfs_by_id[record.id] = orfs
    for orf in orfs:
        if orf["length"] > longest_orf["length"]:
            longest_orf = orf.copy()
            longest_orf["id"] = record.id

# 1. Length of longest ORF in the file
print(f"The longest ORF is {longest_orf['length']} basepairs long.")

# 2. ID of the sequence containing the longest ORF
print(f"It is found in sequence: {longest_orf['id']}")

# 3. Given a sequence ID, find longest ORF (change ID as needed)
query_id = "gi|142022655|gb|EQ086233.1|16"
if query_id in orfs_by_id:
    orfs = orfs_by_id[query_id]
    if orfs:
        longest = max(orfs, key=lambda x: x["length"])
        print(f"\nLongest ORF in sequence '{query_id}' is {longest['length']} bp.")
        print(f"It starts at position {longest['start']}.")
    else:
        print(f"\nSequence '{query_id}' has no ORFs.")
else:
    print(f"\nSequence ID '{query_id}' not found.")


The longest ORF is 1821 basepairs long.
It is found in sequence: gi|142022655|gb|EQ086233.1|527

Longest ORF in sequence 'gi|142022655|gb|EQ086233.1|16' is 1644 bp.
It starts at position 1440.


## Question 4

A repeat is a substring of a DNA sequence that occurs in multiple copies (more than one) somewhere in the sequence. Although repeats can occur on both the forward and reverse strands of the DNA sequence, we will only consider repeats on the forward strand here. Also we will allow repeats to overlap themselves. For example, the sequence ACACA contains two copies of the sequence ACA - once at position 1 (index 0 in Python), and once at position 3. **Given a length n, your program should be able to identify all repeats of length n in all sequences in the FASTA file. Your program should also determine how many times each repeat occurs in the file, and which is the most frequent repeat of a given length.**
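
The ACACA example above can be checked directly by sliding a window of length n over the sequence and counting each substring, which allows overlapping copies by construction. This is a small sketch, with the hypothetical helper name `count_kmers`, applied to a single sequence rather than a whole FASTA file.

```python
from collections import Counter

def count_kmers(seq, n):
    """Count every length-n substring of seq, allowing overlaps."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

counts = count_kmers("ACACA", 3)
repeats = {kmer: c for kmer, c in counts.items() if c > 1}
print(repeats)  # {'ACA': 2}

# 1-based start positions of each copy of ACA
positions = [i + 1 for i in range(len("ACACA") - 2) if "ACACA"[i:i + 3] == "ACA"]
print(positions)  # [1, 3]
```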

In [66]:
# === USER INPUT ===
fasta_file = file
repeat_length = 7  # <- Set desired repeat length here

# === Read sequences and find repeats ===
all_repeats = []

for record in SeqIO.parse(fasta_file, "fasta"):
    seq = str(record.seq).upper()
    for i in range(len(seq) - repeat_length + 1):
        repeat = seq[i:i + repeat_length]
        all_repeats.append(repeat)

# === Count and analyze repeats ===
repeat_counts = Counter(all_repeats)

# Remove repeats that occur only once
filtered_counts = {k: v for k, v in repeat_counts.items() if v > 1}

# === Print results ===
if filtered_counts:
    print(f"Repeats of length {repeat_length} (count > 1):")
    for repeat, count in filtered_counts.items():
        print(f"{repeat}: {count}")

    # Most frequent repeat(s)
    max_count = max(filtered_counts.values())
    most_common = [r for r, c in filtered_counts.items() if c == max_count]

    print(f"\nMost frequent repeat(s) of length {repeat_length}:")
    for repeat in most_common:
        print(f"{repeat} (occurs {max_count} times)")
else:
    print(f"No repeated sequences of length {repeat_length} found more than once.")

Repeats of length 7 (count > 1):
CTCGCGT: 16
TCGCGTT: 14
CGCGTTG: 15
GCGTTGC: 14
CGTTGCA: 7
GTTGCAG: 12
TTGCAGG: 6
TGCAGGC: 15
GCAGGCC: 15
CAGGCCG: 20
AGGCCGG: 12
GGCCGGC: 35
GCCGGCG: 44
CCGGCGT: 21
CGGCGTG: 31
GGCGTGT: 9
GCGTGTC: 19
CGTGTCG: 27
GTGTCGC: 13
TGTCGCG: 11
GTCGCGC: 27
TCGCGCA: 26
CGCGCAA: 16
GCGCAAC: 13
CGCAACG: 9
GCAACGA: 6
CAACGAC: 5
AACGACG: 8
ACGACGT: 7
CGACGTG: 16
ACGTGTG: 4
CGTGTGG: 5
GTGTGGG: 6
TGTGGGG: 4
GTGGGGC: 6
TGGGGCC: 2
GGGCCTG: 4
GGCCTGA: 8
GCCTGAC: 3
CCTGACG: 4
CTGACGG: 7
TGACGGG: 5
GACGGGC: 8
ACGGGCA: 8
CGGGCAG: 14
GGGCAGG: 7
GGCAGGG: 6
GCAGGGA: 2
GAGGATC: 3
AGGATCT: 3
GGATCTC: 7
GATCTCG: 18
ATCTCGG: 9
TCTCGGC: 6
CTCGGCG: 19
TCGGCGG: 13
CGGCGGC: 41
GGCGGCG: 31
GCGGCGC: 31
CGGCGCC: 28
GGCGCCA: 4
GCGCCAA: 4
CGCCAAC: 3
GCCAACT: 2
AACTATG: 2
ACTATGC: 2
CTATGCG: 7
TATGCGG: 4
ATGCGGT: 13
TGCGGTC: 11
GCGGTCT: 3
CGGTCTT: 2
TTTCGGC: 5
TTCGGCT: 4
TCGGCTC: 8
CGGCTCG: 22
GGCTCGA: 16
GCTCGAA: 10
CTCGAAA: 7
TCGAAAG: 5
CGAAAGC: 8
GAAAGCC: 8
AAAGCCA: 2
AAGCCAG: 2
AGCCAGT: