# Overlapping Subsequence Finder

This notebook helps identify overlapping subsequences in a DNA sequence, which can be useful to check for potential issues in Gibson assembly.

You can paste your DNA sequence and choose the length of subsequences you want to analyze. The notebook will return any subsequences that appear more than once along with their counts.

For questions, feedback, etc. ask Alexa from the Sherbo Lab(lexavanv on GitHub)

## Instructions

1. Paste your DNA sequence into the `sequence` variable in the next code cell.  
2. Set the `length` variable to the size of overlap you want to check (recommended: 8–15).  
3. Run the code cell.  
4. The output will list overlapping subsequences and the number of times each occurs.  
5. You can copy the results into Benchling or another tool to check for potential issues in Gibson assembly.

In [None]:
from collections import defaultdict

def find_overlapping_subsequences(sequence, length):
    # create a dictionary to store sequences, defaultdict provides automatic value of 0 for a non exsistent key
    subseq_count = defaultdict(int)
    
    # iterate over sequence, stops when the final subsequences is the given length
    for i in range(len(sequence) - length + 1):
        #creates subsequences of the given length. i is start, i + length is end. 
        subseq = sequence[i:i + length]
        #add a count of 1 to the dictionary every time a subsequence appears. (ie a subsequence that only appears once will recieve a count of one, a subsequence that appears twice will receive a count of two, etc.)
        subseq_count[subseq] += 1
        # if the count of a subsequence is greater than 1
        if subseq_count[subseq] > 1:
            print('sequence', subseq, 'instances', subseq_count[subseq])

sequence = "cccttgtattactgtttatgtaagcagacagttttattgttcatgatgatatatttttatcttgtgcaatgtaacatcagagattttgagacacaacgtggctttcccccccccccctgcaggtccgacacggggatggatggcgttcccgatcatggtcctgcttgcttcgggagcgatactgagcgaagcaagtgcgtcgagcagtgcccgcttgttcctgaaatgccagtaaagcgctggctgctgaacccccagccggaactgaccccacaaggccctagcgtttgcaatgcaccaggtcatcattgacccaggcgtgttccaccaggccgctgcctcgcaactcttcgcaggcttcgccgacctgctcgcgccacttcttcacgcgggtggaatccgatccgcacatgaggcggaaggtttccagcttgagcgggtacggctcccggtgcgagctgaaatagtcgaacatccgtcgggccgtcggcgacagcttgcggtacttctcccatatgaatttcgtgtagtggtcgccagcaaacagcacgacgatttcctcgtcgatcaggacctggcaacgggacgttttcttgccacggtccaggacgcggaagcggtgcagcagcgacaccgattccaggtgcccaacgcggtcggacgtgaagcccatcgccgtcgcctgtaggcgcgacaggcattcctcggccttcgtgtaataccggccattgatcgaccagcccaggtcctggcaaagctcgtagaacgtgaaggtgatcggctcgccgataggggtgcgcttcgcgtactccaacacctgctgccacaccagttcgtcatcgtcggcccgcagctcgacgccggtgtaggtgatcttcacgtccttgttgacgtggaaaatgaccttgttttgcagcgcctcgcgcgggattttcttgttgcgcgtggtgaacagggcagagcgggccgtgtcgtttggcatcgctcgcatcgtgtccggccacggcgcaatatcgaacaaggaaagctgcatttccttgatctgctgcttcgtgtgtttcagcaacgcggcctgcttggcctcgctgacctgttttgccaggtcctcgccggcggtttttcgcttcttggtcgtcatagttcctcgcgtgtcgatggtcatcgacttcgccaaacctgccgcctcctgttcgagacgacgcgaacgctccacggcggccgatggcgcgggcagggcagggggagccagttgcacgctgtcgcgctcgatcttggccgtagcttgctggaccatcgagccgacggactggaaggtttcgcggggcgcacgcatgacggtgcggcttgcgatggtttcggcatcctcggcggaaaaccccgcgtcgatcagttcttgcctgtatgccttccggtcaaacgtccgattcattcaccctccttgcgggattgccccgactcacgccggggcaatgtgcccttattcctgatttgacccgcctggtgccttggtgtccagataatccaccttatcggcaatgaagtcggtcccgtagaccgtctggccgtccttctcgtacttggtattccgaatcttgccctgcacgaataccagctccgcgaagtcgctcttcttgatggagcgcatggggacgtgcttggcaatcacgcgcaccccccggccgttttagcggctaaaaaagctggcgctgggcctgtttctggcgctggacttcccgctgttccgtcagcagcttttcgcccacggccttgatgatcgcggcggccttggcctgcatatcccgattcaacggccccagggcgtccagaacgggcttcaggcgctcccgaaggtctcgggccgtctcttgggcttgatcggccttcttgcgcatctcacgcgctcctgcggcggcctgtagggcaggctcatacccctgccgaaccgcttttgtcagccggtcggccacggcttccggcgtctcaacgcgctttgagattcccagcttttcggccaatccctgcggtgcataggcgcgtggctcgaccgcttgcgggctgatggtgacgtggcccactggtggccgctccagggcctcgtagaacgcctgaatgcgcgtgtgacgtgccttgctgccctcgatgccccgttgcagccctagatcggccacagcggccgcaaacgtggtctggtcgcgggtcatctgcgctttgttgccgatgaactccttggccgacagcctgccgtcctgcgtcagcggcaccacgaacgcggtcatgtgcgggctggtttcgtcacggtggatgctggccgtcacgatgcgatccgccccgtacttgtccgccagccacttgtgcgccttctcgaagaacgccgcctgctgttcttggctggccgacttccaccattccgggctggccgtcatgacgtactcgaccgccaacacagcgtccttgcgccgcttctctggcagcaactcgcgcagtcggcccatcgcttcatcggtgctgctggccgcccagtgctcgttctctggcgtcctgctggcgtcagcgttgggcgtctcgcgctcgcggtaggcgtgcttgagactggccgccacgttgcccattttcgccagcttcttgcatcgcatgatcgcgtatgccgccatgcctgcccctcccttttggtgtccaaccggctcgacgggggcagcgcaaggcggtgcctccggcgggccactcaatgcttgagtatactcactagactttgcttcgcaaagtcgtgaccgcctacggcggctgcggcgccctacgggcttgctctccgggcttcgccctgcgcggtcgctgcgctcccttgccagcccgtggatatgtggacgatggccgcgagcggccaccggctggctcgcttcgctcggcccgtggacaaccctgctggacaagctgatggacaggctgcgcctgcccacgagcttgaccacagggattgcccaccggctacccagccttcgaccacatacccaccggctccaactgcgcggcctgcggccttgccccatcaatttttttaattttctctggggaaaagcctccggcctgcggcctgcgcgcttcgcttgccggttggacaccaagtggaaggcgggtcaaggctcgcgcagcgaccgcgcagcggcttggccttgacgcgcctggaacgacccaagcctatgcgagtgggggcagtcgaaggcgaagcccgcccgcctgccccccgagcctcacggcggcgagtgcgggggttccaagggggcagcgccaccttgggcaaggccgaaggccgcgcagtcgatcaacaagccccggaggggccactttttgccggagggggagccgcgccgaaggcgtgggggaaccccgcaggggtgcccttctttgggcaccaaagaactagatatagggcgaaatgcgaaagacttaaaaatcaacaacttaaaaaaggggggtacgcaacagctcattgcggcaccccccgcaatagctcattgcgtaggttaaagaaaatctgtaattgactgccacttttacgcaacgcataattgttgtcgcgctgccgaaaagttgcagctgattgcgcatggtgccgcaaccgtgcggcaccctaccgcatggagataagcatggccacgcagtccagagaaatcggcattcaagccaagaacaagcccggtcactgggtgcaaacggaacgcaaagcgcatgaggcgtgggccgggcttattgcgaggaaacccacggcggcaatgctgctgcatcacctcgtggcgcagatgggccaccagaacgccgtggtggtcagccagaagacactttccaagctcatcggacgttctttgcggacggtccaatacgcagtcaaggacttggtggccgagcgctggatctccgtcgtgaagctcaacggccccggcaccgtgtcggcctacgtggtcaatgaccgcgtggcgtggggccagccccgcgaccagttgcgcctgtcggtgttcagtgccgccgtggtggttgatcacgacgaccaggacgaatcgctgttggggcatggcgacctgcgccgcatcccgaccctgtatccgggcgagcagcaactaccgaccggccccggcgaggagccgcccagccagcccggcattccgggcatggaaccagacctgccagccttgaccgaaacggaggaatgggaacggcgcgggcagcagcgcctgccgatgcccgatgagccgtgttttctggacgatggcgagccgttggagccgccgacacgggtcacgctgccgcgccggtagcacttgggttgcgcaaacgccagcaacgcggcctttttacggttcctggccttttgctggccttttgctcacatgttctttcctgcgttatcccctgattctgtggataaccgtattaccgcctttgagtgagctgataccgctcgccgcagccgaacgaccgagcgcagcgagtcagtgagcgaggaagcggaagagcgcccaatacgcaaaccgcctctccccgcgcgttggccgattcattaatgcagctggcacgacaggtttcccgactggaaagcgggcagtgagcgcaacgcaattaggtgttgacggctagctcagtcctaggtatagtgctagctctagacttcgggcgcaggcccacatggagagcgcagatagtccgggatatccgctgttttagagctagaaatagcaagttaaaataaggctagtccgttatcaacttgaaaaagtggcaccgagtcggtgctttttttgcatcaaataaaacgaaaggctcagtcgaaagactgggcctttcgttttatctgttgtttgtcggtgaacgctctctactagagtcacactggctcaccttcgggtgggcctttctgcgtttatattaagccagccccgacacccgccaacacccgctgacgcgccctgacgggcttgtctgctcccggcatccgcttacagacaagctgtgaccgtctccgggagctgcatgtgtcagaggtcatggcgctgatgacgccggtcatcctcatcggcggcatgacgctcggctggttcacgcccaccgaggcggcggtggcggcggtgatctggtcgctgttcctggggctggtgcgctaccgctccatgacattgaagacgctggcgaaagcgaccttcgacaccatcgagacgacggcctcggtgctgttcatcgtcaccgcggcgtccatcttcgcgtggctgctcacggcgagccaggcggcgcagatgctgtcggacgccatccttggcttcacccagaacaagtgggtgttcctggcgctggccaatctgctcatcctgttcgtgggctgcttcatcgacaccatcgcggccatcacgatcctggtgccgatcctgctgcccatcgtgctcaagctcggcatcgacccgatccatttcgggctcatcatgacgctgaacctgatgatcgggctgctgcacccgccgctgggcatggtgctgttcgtgctggcgcgggtggcgcggctctcggtggaacgcaccaccatggccatcctgccctggctggtgccgttgatgatcgcgctgctcgcgatcacctacattcccggcctcaccctctggctgccccacgccatggggctcggacgctgagggcgatcaggcgcccgtccgtccctcgggggcggacggggcgatgacgcggcggccttcctcggccttgtaatagtactggctcgccaccagccagcccttcaacggccgcaggggcgggatgcaggtgagcagcaagaggggcagcgtggtcacgaggtggacccagtagggcgcctggaaggccacctccagccagatggcgaacagaaccgccggcacgcaggcgaagcacatgacgaagaaggccggcccatccgccggatcggcgaaggaatagtcgagcccgcagacctcgcaggccggggcaatcgtcaggaagccattgaaaaggtggccctcgccgcagcgggggcaacggcctctcacgcccgtcgaaaggggcgagagcctcggccaatgttgttcgttcatgtcttctcccgggccggtgcgggcgtctgcccatccggcgtcacgcgggtctggcgattgttcgccggcgcagtgtgctcccgtgaggtaagtctcggccgggctacccgcaagtgagaccgagtcgcagcggcaccaggccgcatccatcctgacgcagggggtgacacgcggccggccgccctttattcggctttcaggaggcgcaagccagccccgacacccgccaacacccgctgacgcgccctgacgggcttgtctgctcccggcatccgcttacagacaagctgtgaccgtctccgggagctgcatgtgtcagaggttaggtggcggtacttgggtcgatatcaaagtgcatcacttcttcccgtatgcccaactttgtatagagagccactgcgggatcgtcaccgtaatctgcttgcacgtagatcacataagcaccaagcgcgttggcctcatgcttgaggagattgatgagcgcggtggcaatgccctgcctccggtgctcgccggagactgcgagatcatagatatagatctcactacgcggctgctcaaacttgggcagaacgtaagccgcgagagcgccaacaaccgcttcttggtcgaaggcagcaagcgcgatgaatgtcttactacggagcaagttcccgaggtaatcggagtccggctgatgttgggagtaggtggctacgtctccgaactcacgaccgaaaagatcaagagcagcccgcatggatttgacttggtcagggccgagcctacatgtgcgaatgatgcccatacttgagccacctaactttgttttagggcgactgccctgctgcgtaacatcgttgctgctgcgtaacataacaccccttgtattactgtttatttaagcagacagttttattgttcatgatgatatatttttatcttgtgcaatgtaacatcagagattttgagacacaacgtggctttccccccccccccctgcaggtccgacacggggatggatggcgttcccgatcatggtcctgcttgcttcgggagcgatactgagcgaagcaagtgcgtcgagcagtgcccgcttgttcctgaaatgccagtaaagcgctggctgctgaacccccagccggaactgaccccacaaggccatcctcgagcgcttttaccgcttgcccgaggacctgatccaacgcttctacggggatcgcctcacgttcaaggacaaggcacgcatccttacgggtcgtccgcccgtatcggtgctgaaggccctttcctgcctccccgagacgaaatccgctatcggtcccagataatgctcgacgtcaccaagtctccccgccgcgccgccgtcatcgggtccggattcggcgggctctccctcgccatccgcctgcaggcggccggcatcgccaccaccatcttcgagcagcgggacaagcccggcggccgggcctatgtctatgaggataagggcttcaccttcgatggcgggccgaccgtcatcaccgacccctcctgcctcgaagaggtctatgaggcggccgggcggcggctcagcgactatgtggacctgatctcggtctcgcccttctaccggctgctgtggtcggacggtcggcagttcgactatgtgaacgagcagaccgcgctcgacgcccagatcgccgccttcaacccggcggatgtggaaggctaccggcgcttcttcgcctattccaaggcggtgttcgaggagggttacctcaagctcggcgccgtgccgttcctgaatttctcggacatgatgaaggccgggccgcagttggccaagctccaggcgtggcgctcggtctattccatggtgtcgagcttcataaaggacgagcacctgcggcaggcgttctccttccactcgctgctggtgggggggaatcccttctcaacctcctccatctacgccctcatccatgcgctggagcgcaaatggggcgtgttcttcccccgcggcggcaccggcgcgctggtgcgcggcatggtgaagctgttcaccgacctcggcggtaccatccacctctccgccaaggtcgatgagataacac"
length = 170  # Length of the subsequences to check
find_overlapping_subsequences(sequence, length)



sequence cccccccccccctgcaggtccgacacggggatggatggcgttcccgatcatggtcctgcttgcttcgggagcgatactgagcgaagcaagtgcgtcgagcagtgcccgcttgttcctgaaatgccagtaaagcgctggctgctgaacccccagccggaactgaccccaca instances 2
sequence ccccccccccctgcaggtccgacacggggatggatggcgttcccgatcatggtcctgcttgcttcgggagcgatactgagcgaagcaagtgcgtcgagcagtgcccgcttgttcctgaaatgccagtaaagcgctggctgctgaacccccagccggaactgaccccacaa instances 2
sequence cccccccccctgcaggtccgacacggggatggatggcgttcccgatcatggtcctgcttgcttcgggagcgatactgagcgaagcaagtgcgtcgagcagtgcccgcttgttcctgaaatgccagtaaagcgctggctgctgaacccccagccggaactgaccccacaag instances 2
sequence ccccccccctgcaggtccgacacggggatggatggcgttcccgatcatggtcctgcttgcttcgggagcgatactgagcgaagcaagtgcgtcgagcagtgcccgcttgttcctgaaatgccagtaaagcgctggctgctgaacccccagccggaactgaccccacaagg instances 2
sequence cccccccctgcaggtccgacacggggatggatggcgttcccgatcatggtcctgcttgcttcgggagcgatactgagcgaagcaagtgcgtcgagcagtgcccgcttgttcctgaaatgccagtaaagcgctggctgctgaacccccagccggaactgaccccacaaggc instances 2
sequence ccccccctgcaggtccgacacggggatggat