
This notebook uploads a FASTA file, parses it into a dictionary that maps each sequence identifier (the first word of the header line) to its DNA sequence, and then searches those sequences for restriction sites.

## Upload your FASTA file
Please upload your FASTA file using the button below. This file will be used for the restriction site analysis.

In [None]:
from google.colab import files

print("Please upload your FASTA file (e.g., genome.fasta).")
uploaded = files.upload()

# Get the name of the first uploaded file (files.upload() returns a dict keyed by filename)
if uploaded:
    uploaded_fasta_file_name = next(iter(uploaded))
    print(f"File '{uploaded_fasta_file_name}' uploaded successfully.")
else:
    uploaded_fasta_file_name = None
    print("No file was uploaded. Please upload a FASTA file to proceed.")

Please upload your FASTA file (e.g., genome.fasta).


Saving APP.fasta to APP.fasta
File 'APP.fasta' uploaded successfully.


In [None]:
def read_fasta(file_path):
    """Parse a FASTA file into {sequence ID: uppercase sequence string}."""
    sequences = {}
    current_sequence_name = None
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                current_sequence_name = line[1:].split()[0] # Get header, strip '>' and take first word
                sequences[current_sequence_name] = []
            elif current_sequence_name:
                sequences[current_sequence_name].append(line.upper()) # Convert to uppercase for consistent matching

    # Join list of sequence lines into a single string
    for name, seq_list in sequences.items():
        sequences[name] = ''.join(seq_list)
    return sequences

print("FASTA reader function defined.")

FASTA reader function defined.
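
To check the parsing logic without uploading a file, the same steps can be run on in-memory text. The sketch below is a minimal illustration, assuming a hypothetical helper `parse_fasta_lines` (not part of this notebook) that mirrors `read_fasta` but accepts any iterable of lines. The record names `seq1` and `seq2` are made up for the example.

```python
import io

def parse_fasta_lines(lines):
    """Hypothetical helper mirroring read_fasta, but for any iterable of lines."""
    sequences = {}
    current_name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        if line.startswith('>'):
            # Keep only the first word of the header as the dictionary key
            current_name = line[1:].split()[0]
            sequences[current_name] = []
        elif current_name:
            # Uppercase so later site matching is case-insensitive
            sequences[current_name].append(line.upper())
    # Join the collected lines of each record into one string
    return {name: ''.join(parts) for name, parts in sequences.items()}

# Toy input: two records, mixed case, with a blank line in between
toy_fasta = io.StringIO(
    ">seq1 toy record\n"
    "gaattc\n"
    "AAGG\n"
    "\n"
    ">seq2\n"
    "ccc\n"
)
parsed = parse_fasta_lines(toy_fasta)
print(parsed)
```

Note that the header description (`toy record`) is discarded, so two records whose headers share the same first word would overwrite each other.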


Next, we'll define a function that takes the parsed sequences and a list of restriction sites and records the 0-based start index of every occurrence of each site. The count for a site is simply the length of its list.

In [None]:
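def count_restriction_sites(genome_sequences, restriction_sites):
    """Return {site: [0-based start indices]} for every occurrence of each site.

    Indices from all sequences are pooled into one list per site, so the result
    is only unambiguous when the FASTA file contains a single sequence.
    """
    # Initialize a dictionary to store lists of occurrence indices for each site
    site_occurrences_locations = {site: [] for site in restriction_sites}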

    for seq_name, sequence in genome_sequences.items():
        print(f"Processing {seq_name}...")
        for site in restriction_sites:
            i = 0
            while True:
                idx = sequence.find(site.upper(), i)
                if idx == -1:
                    break
                # Store the 0-based index of the start of the match
                site_occurrences_locations[site].append(idx)
                i = idx + 1 # Advance by one base so overlapping matches are also found

    return site_occurrences_locations

print("Restriction site counter function defined to also record locations.")

Restriction site counter function defined to also record locations.
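
Because the function above pools hits from every record into one list per site, a multi-record FASTA file loses track of which record each index belongs to. One possible way around this, sketched here with a hypothetical `find_sites_per_sequence` helper and toy sequences (neither is from this notebook), is to nest the result by sequence name. The search still advances one base at a time, so overlapping matches are reported.

```python
def find_sites_per_sequence(genome_sequences, restriction_sites):
    """Hypothetical variant: return {seq_name: {site: [0-based start indices]}}."""
    results = {}
    for seq_name, sequence in genome_sequences.items():
        per_site = {}
        for site in restriction_sites:
            target = site.upper()
            hits = []
            idx = sequence.find(target)
            while idx != -1:
                hits.append(idx)
                # Advance by one base so overlapping matches are also found
                idx = sequence.find(target, idx + 1)
            per_site[site] = hits
        results[seq_name] = per_site
    return results

# Toy records invented for this example
toy_genome = {
    'recA': 'TTGAATTCAAGAATTC',
    'recB': 'CCCGGGG',
}
hits = find_sites_per_sequence(toy_genome, ['GAATTC', 'CCCGGG', 'GG'])
for name, per_site in hits.items():
    print(name, per_site)
```

The short `GG` pattern is included only to show the overlapping behavior: in `recB` it is found at three consecutive indices.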


In [None]:
# Define the path to your FASTA file
# This will now use the file uploaded in the previous step
if 'uploaded_fasta_file_name' in globals() and uploaded_fasta_file_name:
    fasta_file = uploaded_fasta_file_name
else:
    print("Error: No FASTA file was uploaded or its name is not available.")
    print("Please ensure you've run the upload cell and successfully selected a file.")
    fasta_file = None # Set to None to prevent further execution if no file

if fasta_file:
    # Read the genome sequences
    genome_data = read_fasta(fasta_file)

    # Define the restriction enzymes and their sites as a dictionary
    # Key: Restriction enzyme name, Value: Restriction sequence
    restriction_enzymes = {
        'EcoRI': 'GAATTC',
        'HindIII': 'AAGCTT',
        'BamHI': 'GGATCC',
        'PvuII': 'CAGCTG',
        'SmaI': 'CCCGGG'
    }

    # Extract just the sequences to pass to the counting function
    restriction_sequences = list(restriction_enzymes.values())

    # Count the occurrences of each restriction site and get their locations
    all_occurrences_with_locations = count_restriction_sites(genome_data, restriction_sequences)

    # Display the results, mapping back to enzyme names
    print("\nRestriction enzyme, site, and cut locations:")
    for enzyme_name, sequence in restriction_enzymes.items():
        # Retrieve the list of locations for the current site
        locations = all_occurrences_with_locations.get(sequence.upper(), [])
        count = len(locations)

        print(f"Enzyme '{enzyme_name}' (Site: '{sequence}'):")
        print(f"  Total occurrences: {count}")
        if locations:
            # Display site start positions as a comma-separated list, adding 1 for 1-based indexing
            print(f"  Cut locations (1-indexed start): {', '.join(str(loc + 1) for loc in locations)}")
        else:
            print("  No cut locations found.")
else:
    print("Skipping analysis due to missing FASTA file.")

Processing AF293341.1...

Restriction enzyme, site, and cut locations:
Enzyme 'EcoRI' (Site: 'GAATTC'):
  Total occurrences: 1
  Cut locations (1-indexed start): 1286
Enzyme 'HindIII' (Site: 'AAGCTT'):
  Total occurrences: 0
  No cut locations found.
Enzyme 'BamHI' (Site: 'GGATCC'):
  Total occurrences: 0
  No cut locations found.
Enzyme 'PvuII' (Site: 'CAGCTG'):
  Total occurrences: 1
  Cut locations (1-indexed start): 408
Enzyme 'SmaI' (Site: 'CCCGGG'):
  Total occurrences: 1
  Cut locations (1-indexed start): 1318
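
The positions printed above mark the first base of each recognition site, not the bond that is actually cleaved. As a hedged illustration, the sketch below converts a site start into a top-strand cut position using the commonly documented cleavage points for these five enzymes (EcoRI G^AATTC, HindIII A^AGCTT, BamHI G^GATCC, PvuII CAG^CTG, SmaI CCC^GGG). The `CUT_OFFSETS` table and the `cut_position` helper are names introduced for this example, and only the top-strand cut is modeled.

```python
# Number of bases of the recognition site that lie 5' of the top-strand cut.
# These names are illustrative and not part of the notebook.
CUT_OFFSETS = {
    'EcoRI': 1,    # G^AATTC
    'HindIII': 1,  # A^AGCTT
    'BamHI': 1,    # G^GATCC
    'PvuII': 3,    # CAG^CTG (blunt)
    'SmaI': 3,     # CCC^GGG (blunt)
}

def cut_position(site_start_1_based, enzyme_name):
    """Return the 1-based index of the last base before the top-strand cut."""
    return site_start_1_based + CUT_OFFSETS[enzyme_name] - 1

# Site starts reported in the output above
reported_starts = {'EcoRI': 1286, 'PvuII': 408, 'SmaI': 1318}
for enzyme, start in reported_starts.items():
    last_base = cut_position(start, enzyme)
    print(f"{enzyme}: cut between bases {last_base} and {last_base + 1}")
```

Under these assumptions, the EcoRI cut falls between bases 1286 and 1287, the PvuII cut between 410 and 411, and the SmaI cut between 1320 and 1321.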


## Primer Design
Now, let's design a primer for each identified restriction site. For every site, we take the 20 bases immediately downstream of (after) the recognition sequence and generate their reverse complement, which anneals to that stretch of the given strand and serves as the primer.

In [None]:
def reverse_complement(dna_sequence):
    complement_map = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}
    complement_sequence = ''.join([complement_map[base] for base in dna_sequence.upper()])
    return complement_sequence[::-1]

primer_length = 20

print(f"Designing primers (approx. {primer_length}bp) for sequences immediately after each restriction cut site (1-indexed cut position)...")

# Assuming genome_data, restriction_enzymes, and all_occurrences_with_locations are available from previous cells

for enzyme_name, site_sequence in restriction_enzymes.items():
    # Get locations for the current site (using the uppercase site sequence as key)
    # The locations are 0-indexed where the site starts
    locations_0_indexed = all_occurrences_with_locations.get(site_sequence.upper(), [])

    if locations_0_indexed:
        print(f"\nEnzyme '{enzyme_name}' (Site: '{site_sequence}'):")
        for loc_0_indexed in locations_0_indexed:
            # Note: all five enzymes above cut inside their recognition sequence
            # (for example, EcoRI cuts G^AATTC), not at its end.
            # For simplicity, we take the sequence *after* the whole recognition sequence.
            # To start at the actual cut position instead, adjust `start_extract_index` accordingly.

            # Start extracting immediately after the restriction site sequence
            start_extract_index = loc_0_indexed + len(site_sequence)

            # Limitation: the recorded locations are not tied to a specific record, so with a
            # multi-sequence FASTA every location is tried against every sequence.
            # This is only correct when the file holds a single sequence, as in this example.
            for seq_header, genome_seq in genome_data.items():
                # Require a full primer_length window after the site
                if start_extract_index + primer_length <= len(genome_seq):
                    # Extract sequence for primer design
                    primer_template_sequence = genome_seq[start_extract_index : start_extract_index + primer_length]

                    # Generate reverse complement
                    designed_primer = reverse_complement(primer_template_sequence)

                    # Print the results (1-indexed cut position for user clarity)
                    print(f"  Cut at {loc_0_indexed + 1}: Template '{primer_template_sequence}' -> Primer '{designed_primer}' (complementary to sequence starting at {start_extract_index + 1} in {seq_header})")
                else:
                    print(f"  Cut at {loc_0_indexed + 1}: Not enough sequence ({primer_length}bp) after cut site in {seq_header} to design primer.")
    else:
        print(f"\nEnzyme '{enzyme_name}' (Site: '{site_sequence}'): No cut locations found for primer design.")


Designing primers (approx. 20bp) for sequences immediately after each restriction cut site (1-indexed cut position)...

Enzyme 'EcoRI' (Site: 'GAATTC'):
  Cut at 1286: Template 'CAGGGATGAATGGGCAAAAG' -> Primer 'CTTTTGCCCATTCATCCCTG' (complementary to sequence starting at 1292 in AF293341.1)

Enzyme 'HindIII' (Site: 'AAGCTT'): No cut locations found for primer design.

Enzyme 'BamHI' (Site: 'GGATCC'): No cut locations found for primer design.

Enzyme 'PvuII' (Site: 'CAGCTG'):
  Cut at 408: Template 'GGAGGAGAGAAGAAAGCGGG' -> Primer 'CCCGCTTTCTTCTCTCCTCC' (complementary to sequence starting at 414 in AF293341.1)

Enzyme 'SmaI' (Site: 'CCCGGG'):
  Cut at 1318: Template 'TTGCCTGGAGCAGTAGGACA' -> Primer 'TGTCCTACTGCTCCAGGCAA' (complementary to sequence starting at 1324 in AF293341.1)
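
As a cross-check on the EcoRI result above, the reverse complement can also be computed with `str.translate`, which leaves unexpected characters (such as IUPAC ambiguity codes) unchanged instead of raising `KeyError`. This is a minimal sketch: `revcomp` and `primer_after_site` are hypothetical names, and the toy sequence is invented for the example. The second helper returns `None` when fewer than `primer_length` bases follow the site.

```python
# Translation table for the four bases plus N; other characters pass through unchanged
_COMPLEMENT = str.maketrans('ACGTN', 'TGCAN')

def revcomp(dna_sequence):
    """Reverse complement of a DNA string (hypothetical alternative helper)."""
    return dna_sequence.upper().translate(_COMPLEMENT)[::-1]

def primer_after_site(sequence, site_start, site_length, primer_length=20):
    """Reverse-complement the window right after a site, or None if it is too short."""
    start = site_start + site_length
    window = sequence[start:start + primer_length]
    if len(window) < primer_length:
        return None
    return revcomp(window)

# Template taken from the EcoRI line of the output above
ecori_primer = revcomp('CAGGGATGAATGGGCAAAAG')
print(ecori_primer)

# Toy sequence: a GAATTC site at index 2 followed by only 4 bases
toy = 'TTGAATTCACGT'
short_primer = primer_after_site(toy, 2, 6, primer_length=4)
too_long = primer_after_site(toy, 2, 6, primer_length=20)
print(short_primer, too_long)
```

The first call reproduces the primer `CTTTTGCCCATTCATCCCTG` shown in the output above. Returning `None` for a short window avoids silently emitting a truncated primer near the end of a sequence.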
