# Overview

This script processes all Sanger sequencing files from an Intein-assisted Bisection Mapping (IBM) experiment in a directory and returns the split sites, i.e. the positions where the target protein was split and the split intein was inserted.
  
![IBM Sanger Overview](images/Github_Sanger_overview.png)
  
In brief, a "signature sequence" that marks the end of the C-lobe intein is aligned to the sequencing read of the PCR product to identify the short sequence immediately downstream, which is extracted. The extracted sequence is then aligned to the intact CDS of the target protein to identify the DNA position where the CDS was split. The -1 and +1 DNA positions (the last base before the split and the first base after it) are then used to infer where the protein sequence was split.
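
The two-step mapping described above can be sketched in plain Python. This is a simplified illustration, not the script's implementation: exact substring search (`str.find`) stands in for the local alignment, and the function name `map_split_site`, the `window` parameter, and the toy sequences are invented for the example.

```python
def map_split_site(read, signature, target_cds, window=30):
    """Return (dna_minus1, dna_plus1) as 1-based CDS positions, or None.

    Simplification: exact substring search stands in for local alignment,
    so a single sequencing error in either match makes this return None.
    """
    read, signature, target_cds = read.upper(), signature.upper(), target_cds.upper()

    # Step 1: find the signature in the read and take the bases right after it.
    sig_start = read.find(signature)
    if sig_start == -1:
        return None
    downstream = read[sig_start + len(signature):][:window]
    if not downstream:
        return None

    # Step 2: locate those bases in the intact CDS.
    # cds_index is the 0-based index of the first base after the split.
    cds_index = target_cds.find(downstream)
    if cds_index == -1:
        return None

    # A 0-based index equals the 1-based position of the base just before it.
    return cds_index, cds_index + 1


# Toy sequences, invented for illustration only.
cds = "ATGGCTAAACCCGGGTTTTAA"
signature = "CATAATTCGTCA"
read = "NNNNGGTACC" + signature + cds[9:]  # the insert sits after CDS base 9

print(map_split_site(read, signature, cds))  # (9, 10)
```

The last step mirrors the indexing note in the script: the 0-based start of the downstream match is numerically the 1-based position of the -1 base.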

## Dependencies
1. Biopython
2. Pandas
  
## Input
1. **.ab1 files under a subdirectory**  
Data from Source BioScience (SBS) have sequence record IDs such as  
`459313601_IBMs0768_IBMo0858_B09.ab1`,  
which follows the pattern  
`<order no.>_<sample name or ID>_<primer name>_<well used in SBS>.ab1`  
  
  If your sequence record IDs follow a different pattern, you will need to customize the script below.
    
  The subdirectory in this case is `./IBM_seq_example/`.

  
2. **target CDS sequence**  
    Provide the sequence from the first bp of the CDS to ~100-200 bp after the stop codon.
  
  
3. **signature sequence**
    Take the last 30-50 bp at the C-terminal end of the C-lobe intein (see the figure above).
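
If your file names need custom handling, a small helper along these lines may be a useful starting point. It is a sketch that assumes the four-field SBS pattern shown above. `ReadName` and `parse_read_name` are hypothetical names, and the script itself splits `record.id` rather than the file name.

```python
from pathlib import Path
from typing import NamedTuple


class ReadName(NamedTuple):
    order: str
    sample: str
    primer: str
    well: str


def parse_read_name(filename):
    """Split '<order>_<sample>_<primer>_<well>.ab1' into its four fields."""
    stem = Path(filename).stem  # drop the directory and the .ab1 extension
    parts = stem.split("_")
    if len(parts) != 4:
        raise ValueError(
            f"unexpected name {filename!r}: expected 4 '_'-separated fields"
        )
    return ReadName(*parts)


name = parse_read_name("459313601_IBMs0768_IBMo0858_B09.ab1")
print(name.sample, name.primer, name.well)  # IBMs0768 IBMo0858 B09
```

Raising an error on an unexpected field count makes a naming mismatch visible immediately instead of silently mislabelling reads.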

## Output
A CSV file with the split sites mapped for each sequencing read.

# Setup

In [1]:
import os
from Bio import SeqIO, Align
from Bio.Seq import Seq
import pandas as pd

# Functions

In [2]:
def identify_split_site(folderDir, signature_sequence, target_CDS_seq):

    ss_output = pd.DataFrame()  # set up empty df to store all data

    # Parse all files in the same folder

    for file in os.listdir(folderDir):

        filename = os.fsdecode(file)

        if filename.endswith(".ab1"):
            ## Import the data & extract metadata from the sequencing result
            seqDir = os.path.join(folderDir,filename)
            record = SeqIO.read(seqDir, "abi")
            seq_order,strainID,seq_Primer = record.id.split('_')

            ## Sequence alignment part

            # Set up aligner for alignment
            aligner = Align.PairwiseAligner()
            aligner.mode = 'local'
            aligner.open_gap_score = -10
            aligner.extend_gap_score = -10

            # Reset information to be written
            dna_ss_minus1 = 'n/a'
            dna_ss_plus1 = 'n/a'
            aa_ss_minus1 = 'n/a'
            aa_ss_plus1 = 'n/a'
            aa_ss_middle = 'n/a'

            # Perform local alignment between the sequencing result and the signature sequence
            alignment = aligner.align(record.seq,signature_sequence)
            # Locate the indices where the signature sequence was aligned to
            if alignment: # Analyze only those where alignment is found
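                # path holds (target, query) index pairs; use the first and last points so gapped alignments are handled too
                sig_start, sig_end = alignment[0].path[0][0], alignment[0].path[-1][0]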
                # Extract the bp that would correspond to the target CDS
                extracted_seq = record.seq[sig_end:]   # Note: no need to +1 here because the indexing of strings starts from 0
                extracted_seq = extracted_seq[0:31] # Keep only the first 31 bp (the slice end is exclusive)
                # Perform local alignment between the extracted sequence and the target CDS sequence
                target_CDS = Seq(target_CDS_seq) # import the target CDS sequence as a Seq.Seq object
                alignment = aligner.align(target_CDS,extracted_seq)
                # Locate the start index of where the extracted sequence was aligned to, i.e. ID the split site
                if alignment:
                    # Caveat: a poor alignment of the signature sequence might return an aberrant alignment at this stage as well
                    
                    dna_ss_minus1 = alignment[0].path[0][0] # Note: indexing of string starts from 0, so this becomes -1
                    # Workout the rest of the split sites
                    dna_ss_plus1 = dna_ss_minus1 + 1
                    aa_ss_minus1 = dna_ss_minus1 / 3
                    aa_ss_plus1 = aa_ss_minus1 + 1
                    aa_ss_middle = aa_ss_minus1 + 0.5

            # Create a dictionary for import into pandas dataframe
            ss_data = {
                    'order': seq_order,
                    'primer': seq_Primer,
                    'raw_seq': str(record.seq), # raw sequence as a plain string (avoids the private _data attribute)
                    'read_length': len(record.seq), # length of the raw sequence
                    'dna_ss_minus1': dna_ss_minus1,
                    'dna_ss_plus1':  dna_ss_plus1,
                    'aa_ss_minus1': aa_ss_minus1,
                    'aa_ss_plus1': aa_ss_plus1,
                    'aa_ss_middle':  aa_ss_middle # split site (e.g. 196.5)
                    }

            ss_dfRow = pd.DataFrame(ss_data,index=[strainID])
            # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
            ss_output = pd.concat([ss_output, ss_dfRow], sort=True)
            
    return ss_output
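
The comment in the function notes that a poor signature alignment can propagate into an aberrant split site. One possible guard, sketched here under assumptions rather than taken from the script, is to score the signature hit and skip reads below a cut-off. The example swaps the local alignment for a simple ungapped sliding-window identity scan; `best_ungapped_hit` and `MIN_IDENTITY` are hypothetical, and the flanking bases and the 0.9 threshold are invented for illustration.

```python
def best_ungapped_hit(read, signature):
    """Slide the signature along the read; return (start, identity) of the best window.

    Returns (-1, 0.0) when the read is shorter than the signature or nothing matches.
    """
    if not signature:
        raise ValueError("signature must not be empty")
    read, signature = read.upper(), signature.upper()
    best_start, best_identity = -1, 0.0
    for start in range(len(read) - len(signature) + 1):
        window = read[start:start + len(signature)]
        matches = sum(a == b for a, b in zip(window, signature))
        identity = matches / len(signature)
        if identity > best_identity:  # strict '>' keeps the earliest best window
            best_start, best_identity = start, identity
    return best_start, best_identity


MIN_IDENTITY = 0.9  # hypothetical cut-off; tune it against your own reads

signature = "CCTGTTCTATGCCAATGATATTCTGACCCATAATTCGTCA"  # gp41-1 signature from this notebook
good_read = "NNNNGGTACCA" + signature + "ATCGAACTGG"     # flanking bases invented
noisy_read = "N" * 20 + "ACGT" * 3 + "N" * 20            # no real signature present

for label, read in [("good", good_read), ("noisy", noisy_read)]:
    start, identity = best_ungapped_hit(read, signature)
    print(label, start, identity >= MIN_IDENTITY)
```

An ungapped scan tolerates ambiguous `N` calls and single-base errors but not insertions or deletions, so it is only a rough filter compared with a scored local alignment.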

# Execution

In [3]:
# Provide the CDS of the target protein
target_CDS_seq = 'atgagtattcaacatttccgtgtcgcccttattcccttttttgcggcattttgccttcctgtttttgctcacccagaaacgctggtgaaagtaaaagatgctgaagatcagttgggtgcacgagtgggtTACatcgaactggatctcaacagcggtaagatccttgagagttttcgccccgaagaacgttttccaatgatgagcacttttaaagttctgctatgtggcgcggtattatcccgtattgacgccgggcaagagcaactcggtcgccgcatacactattctcagaatgacttggttgagtactcaccagtcacagaaaagcatcttacggatggcatgacagtaagagaattatgcagtgctgccataaccatgagtgataacactgcggccaacttacttctgacaacgatcggaggaccgaaggagctaaccgcttttttgcacaacatgggggatcatgtaactcgccttgatcgttgggaaccggagctgaatgaagccataccaaacgacgagcgtgacaccacgatgcctgtagcaatggcaacaacgttgcgcaaactattaactggcgaactacttactctagcttcccggcaacaattaatagactggatggaggcggataaagttgcaggaccacttctgcgctcggcccttccggctggctggtttattgctgataaatctggagccggtgagcgtGGAtctcgcggtatcattgcagcactggggccagatggtaagccctcccgtatcgtagttatctacacgacggggagtcaggcaactatggatgaacgaaatagacagatcgctgagataggtgcctcactgattaagcattggTAAtactagagaacgcatgagAAAGCCCCCGGAAGATCACCTTCCGGGGGCTTTtttattgcgctactagtagcggccgctgcaggagtcactaagggttagttagttagat'
# CDS for beta-lactamase

# Provide the signature sequence
signature_sequence= 'CCTGTTCTATGCCAATGATATTCTGACCCATAATTCGTCA' # signature sequence for gp41-1 intein

In [4]:
# Execute the split site identification function

split_site_data = identify_split_site("./IBM_seq_example/", signature_sequence, target_CDS_seq)

split_site_data.sort_values("aa_ss_middle")

| strainID | aa_ss_middle | aa_ss_minus1 | aa_ss_plus1 | dna_ss_minus1 | dna_ss_plus1 | order | primer | raw_seq | read_length |
|---|---|---|---|---|---|---|---|---|---|
| IBMs0758 | 192.5 | 192.0 | 193.0 | 576 | 577 | 459313601 | IBMo0858 | NNNNNNNNNNTTGANTTAGCGGTACCACCTGTTCTATGCCAATGAT... | 551 |
| IBMs0793 | 196.5 | 196.0 | 197.0 | 588 | 589 | 459313601 | IBMo0858 | NNNNNNNNNNNNGAGTTAGCGGTACCACCTGTTCTATGCCAATGAT... | 538 |
| IBMs0757 | 260.5 | 260.0 | 261.0 | 780 | 781 | 459313601 | IBMo0858 | NNNNNNNGNNTTGANTTAGCGGTAACCACCTGTTCTATGCCAATGA... | 347 |
| IBMs0753 | 261.5 | 261.0 | 262.0 | 783 | 784 | 459313601 | IBMo0858 | NNNNNNNNNNTTGANTTAGCGGTACCACCTGTTCTATGCCAATGAT... | 345 |
| IBMs0761 | 264.5 | 264.0 | 265.0 | 792 | 793 | 459313601 | IBMo0858 | NNNNNNNNNTTGAGTTAGCGGTACCACCTGTTCTATGCCAATGATA... | 334 |
| IBMs0810 | 267.5 | 267.0 | 268.0 | 801 | 802 | 459313601 | IBMo0858 | NNNNNNNGNNTTGANTTAGCGGNNNCACCTGTTCTATGCCAATGAT... | 325 |


In [5]:
# Export as a csv file
split_site_data.to_csv("example_bla_split_sites.csv")
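
Because the amino-acid columns are derived by dividing `dna_ss_minus1` by 3, a split that falls inside a codon would yield non-integer residue numbers. A possible post-processing check on the exported file is sketched below using only the standard library. `classify_rows` is a hypothetical helper, and the inline CSV text is a cut-down stand-in for the real export (one row taken from the table above, two rows invented).

```python
import csv
import io


def classify_rows(csv_text):
    """Label each read as 'in-frame', 'within codon', or 'unmapped'."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    pos_col = header.index("dna_ss_minus1")
    results = []
    for line in reader:
        sample, value = line[0], line[pos_col]  # first column is the index (sample ID)
        if value == "n/a":
            status = "unmapped"
        else:
            # leftover == 0 means a whole number of codons precedes the split
            _, leftover = divmod(int(value), 3)
            status = "in-frame" if leftover == 0 else "within codon"
        results.append((sample, status))
    return results


# Stand-in for the exported CSV: the first row is from the table above,
# the other two rows are invented to show the remaining cases.
sample_csv = """,dna_ss_minus1,dna_ss_plus1
IBMs0758,576,577
example_out_of_frame,577,578
example_unmapped,n/a,n/a
"""

for sample, status in classify_rows(sample_csv):
    print(sample, status)
```

To run this on the real export, pass the contents of `example_bla_split_sites.csv` instead of `sample_csv`.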