Analysis pipeline manual for IBM libraries sequenced by NGS

# Dependencies
1. Biopython
2. Pandas

# Overview
This document outlines the steps involved in analyzing NGS results on IBM final libraries to obtain information on library sequence coverages.

Each final IBM library should look like the following:
![IBM Library](images/Github_NGS_overview_1.png)  

The transposon, and hence the split inteins, can be inserted in either the forward or the reverse orientation.  

Similar to the analysis on Sanger sequencing results, there are "signature sequences" that mark the ends of N- and C-lobe inteins. The signature sequences are aligned to the fragmented reads to identify a short sequence immediately adjacent to them, which is extracted. The extracted sequences are then aligned to the intact target protein CDS sequence to identify the DNA position where the CDS was split. The -1 and +1 DNA positions are then used to infer where the protein sequence was split.

**Note that in entering the signature sequence for intein<sub>C</sub>, enter the sequence in reverse complement.**

Note how we define split sites for transposons / split inteins inserted in the forward or reverse orientation. Regardless of the insertion orientation, we treat the duplicated 5 bp at the 5' end as the sequence that was inserted due to the transposon. The -1 DNA split site is thus upstream of those 5 bp.

![IBM Library](images/Github_NGS_overview_2.png) 

The entire pipeline consists of 3 steps. The aligner used in the first 2 steps performs local alignment. There are 3 Jupyter notebooks involved, and each notebook corresponds to one of the 3 steps. They are meant to be executed separately.

**Step 1 Filtering**  
Notebook = `NGS_step1_filter_fastq.ipynb`  
Keep only the reads (fragments) that contain the N signature seq or the C signature seq, or their reverse complements.

![IBM Library](images/Github_NGS_overview_3.png) 

From each raw FASTQ file, any fragment that scores a perfect alignment (no gaps, no mismatches) with a signature sequence is retained, whether the signature is that of intein<sub>N</sub> or intein<sub>C</sub>, and whether it appears in the forward or reverse orientation.
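
The filter can be sketched as follows. This is a minimal illustration using only the Python standard library, not the local aligner used in the notebook. It assumes that a perfect, gap-free alignment of the whole signature is equivalent to the signature occurring as an exact substring of the read. The function names and the example sequences are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between or after records
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line
        qual = next(it, "").strip()
        yield header, seq, qual


def keep_read(seq: str, signatures: Iterable[str]) -> bool:
    """True if the read contains any signature or its reverse complement."""
    seq = seq.upper()
    for sig in signatures:
        sig = sig.upper()
        if sig in seq or reverse_complement(sig) in seq:
            return True
    return False


# Hypothetical toy data: one signature and three reads.
example_fastq = [
    "@read1", "TTTACGTGGGAAA", "+", "IIIIIIIIIIIII",
    "@read2", "CCCCCCCCCCCC", "+", "IIIIIIIIIIII",
    "@read3", "AACCCACGTAA", "+", "IIIIIIIIIII",
]
kept_headers = [
    header
    for header, seq, qual in read_fastq(example_fastq)
    if keep_read(seq, ["ACGTGGG"])
]
```

In this toy input, `read1` carries the signature as given and `read3` carries its reverse complement, so both are kept, while `read2` is dropped.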

**Step 2 Infer split site**  
Notebook = `NGS_step2_infer_split_site.ipynb`  
Within each filtered FASTQ file and for each FASTQ record:
  
**Step 2.1**  
Realign the signature seq to the read  
Determine whether the signature seq is from intein<sub>N</sub> or intein<sub>C</sub>, and whether the read is in the FW or RV orientation relative to the signature (intein) CDS sequence.
  
**Step 2.2**  
Flip the read such that it is now in the 5’ to 3’ direction of the signature (intein) CDS sequence.  
![IBM Library](images/Github_NGS_overview_4.png) 
  
**Step 2.3**  
Extract the sequences adjacent to the signature sequence. Which side to choose depends on the type of signature sequence (from intein<sub>N</sub> or intein<sub>C</sub>).
![IBM Library](images/Github_NGS_overview_5.png) 
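
Steps 2.1 to 2.3 can be sketched as below, with exact string matching standing in for the local aligner. The text does not state which side of the signature is extracted for each intein type (this is shown in the figure), so the sketch takes the side as an explicit argument. The names `orient_and_extract`, `side` and `flank_len`, and the default flank length of 20 bp, are assumptions made for illustration.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_and_extract(
    read: str, signature: str, side: str, flank_len: int = 20
) -> Optional[Tuple[str, str]]:
    """Return (orientation, flank), or None if the signature is absent.

    orientation is 'FW' when the read contains the signature as given and
    'RV' when it contains the reverse complement. In the 'RV' case the read
    is flipped first, so the flank is always reported in the 5' to 3'
    direction of the signature.
    """
    read, signature = read.upper(), signature.upper()
    if signature in read:
        orientation, oriented = "FW", read
    elif reverse_complement(signature) in read:
        orientation, oriented = "RV", reverse_complement(read)
    else:
        return None

    start = oriented.find(signature)
    end = start + len(signature)
    if side == "upstream":
        flank = oriented[max(0, start - flank_len):start]
    elif side == "downstream":
        flank = oriented[end:end + flank_len]
    else:
        raise ValueError("side must be 'upstream' or 'downstream'")
    return orientation, flank


# Hypothetical toy read: 5 bp + signature + 5 bp.
toy_read = "CCTCA" + "ACGTGGG" + "TATTC"
fw_result = orient_and_extract(toy_read, "ACGTGGG", "downstream", flank_len=3)
rv_result = orient_and_extract(
    reverse_complement(toy_read), "ACGTGGG", "upstream", flank_len=3
)
```

Here `fw_result` is `("FW", "TAT")` and `rv_result` is `("RV", "TCA")`: the reverse-orientation read is flipped before the flank is taken.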
  
**Step 2.4**  
Align the extracted sequence to the target protein CDS and take the best alignment, which must be perfect (no mismatches).  
If there is more than one perfect alignment, stop processing this record.  
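
A minimal sketch of the uniqueness check in Step 2.4, again using exact string search in place of the aligner. It searches both strands of the CDS, counts overlapping hits, and returns a result only when exactly one perfect match exists. The function names and the returned tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_all(haystack: str, needle: str) -> List[int]:
    """Return the 0-based start of every (possibly overlapping) exact match."""
    hits = []
    i = haystack.find(needle)
    while i != -1:
        hits.append(i)
        i = haystack.find(needle, i + 1)
    return hits


def unique_perfect_hit(extracted: str, cds: str) -> Optional[Tuple[str, int, int]]:
    """Return (strand, start, end) if there is exactly one perfect match.

    start and end are 0-based, half-open coordinates on the forward strand
    of the CDS. Returns None for zero hits or for more than one hit.
    """
    extracted, cds = extracted.upper(), cds.upper()
    if not extracted:
        return None
    hits = [("FW", i) for i in find_all(cds, extracted)]
    rc = reverse_complement(extracted)
    if rc != extracted:  # avoid counting a palindromic match twice
        hits += [("RV", i) for i in find_all(cds, rc)]
    if len(hits) != 1:
        return None
    strand, start = hits[0]
    return strand, start, start + len(extracted)


# Hypothetical toy CDS.
toy_cds = "ATGGCTAGCAAAGGTACCTAA"
fw_hit = unique_perfect_hit("GCAAAG", toy_cds)
rv_hit = unique_perfect_hit(reverse_complement("AAGGTA"), toy_cds)
ambiguous = unique_perfect_hit("ATG", "ATGATGATG")
```

The strand reported here is the information that Step 2.5 uses to decide whether the transposon was inserted in the FW or RV orientation.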
  
**Step 2.5**  
In the best alignment, if the extracted sequence is in the FW (default) direction, the transposon was inserted in FW (could be a productive or unproductive insertion); otherwise the transposon was inserted in RV (must be an unproductive insertion).  
![IBM Library](images/Github_NGS_overview_6.png) 

**Step 2.6**  
Infer the split sites at both the DNA and protein levels (if applicable).  
This process depends on the type of signature sequence (whether from intein<sub>N</sub> or intein<sub>C</sub>) and the transposon insertion orientation determined in **Step 2.5**.  
![IBM Library](images/Github_NGS_overview_7.png) 
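
The arithmetic that links a DNA split site to codons can be illustrated with a small helper. It assumes 1-based DNA positions counted from the A of the ATG, which the text does not state explicitly, and the function names are hypothetical. The helper only reports which codon each flanking base belongs to. How the 5 bp duplication, the signature type and the insertion orientation shift the -1 position, and how a split inside a codon is assigned to a protein split site, follow the rules in the figure and are not reproduced here.

```python
def codon_of(dna_pos: int) -> int:
    """Return the 1-based codon number containing a 1-based CDS position."""
    if dna_pos < 1:
        raise ValueError("dna_pos must be 1 or greater")
    return (dna_pos - 1) // 3 + 1


def describe_dna_split(minus_one: int) -> dict:
    """Describe a DNA split between positions minus_one and minus_one + 1."""
    plus_one = minus_one + 1
    left, right = codon_of(minus_one), codon_of(plus_one)
    return {
        "minus_one": minus_one,
        "plus_one": plus_one,
        "codon_minus_one": left,
        "codon_plus_one": right,
        # True when the split falls exactly on a codon boundary.
        "between_codons": left != right,
    }


on_boundary = describe_dna_split(9)    # split "9/10"
inside_codon = describe_dna_split(10)  # split "10/11"
```

A split at "9/10" separates codon 3 from codon 4, whereas a split at "10/11" falls inside codon 4.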
    
Information from Steps 2.1-2.6 is then stored in a CSV file. Since each library was sequenced with paired-end reads, there are 2 FASTQ files, and therefore 2 CSV files, for each library.

**Step 3 Data clean up**  
Notebook = `NGS_step3_clean_up_data.ipynb`  
This performs the following steps:
1. Concatenate the 2 CSV files (paired-end reads).
2. Remove any records where the forward and reverse reads reported different split/insertion sites.  
3. Deduplicate any records where the forward and reverse reads pointed to the same split/insertion site.
4. Remove sites mapped beyond the permitted transposition windows. 
5. Work out the protein split sites that are missing from the library.
6. Calculate percentage coverage on DNA and amino acid sequence levels.

Results from Step 3.4 are stored in one CSV file per target protein. The CSV file contains the IDs, intermediate processing information and split site information for the records, and can be used to plot sequence coverages of the libraries on both DNA and amino acid sequence levels.

Information from Steps 3.5 - 3.6 is stored in a single text file covering all target proteins.
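
The clean-up logic can be sketched with plain Python data structures rather than pandas. The sketch assumes that both reads of a pair share the same read ID, that split sites are written in the same X.5 form as the transposition borders, and that the borders themselves are permitted sites. All three are assumptions, as are the function names and the toy records.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float]  # (read ID, split site in X.5 form)


def clean_up(
    records_r1: Iterable[Record],
    records_r2: Iterable[Record],
    low_border: float,
    high_border: float,
) -> Dict[str, float]:
    """Combine paired-end records, drop conflicts, dedupe, apply the window."""
    sites_by_read = defaultdict(set)
    for read_id, site in list(records_r1) + list(records_r2):
        sites_by_read[read_id].add(site)

    kept = {}
    for read_id, sites in sites_by_read.items():
        if len(sites) != 1:
            continue  # forward and reverse reads disagree
        (site,) = sites  # agreeing pair collapses to a single record
        if low_border <= site <= high_border:
            kept[read_id] = site
    return kept


def coverage(
    observed: Iterable[float], low_border: float, high_border: float
) -> Tuple[List[float], float]:
    """Return (missing sites, percentage coverage) within the window."""
    if high_border < low_border:
        raise ValueError("high_border must not be below low_border")
    n_sites = int(round(high_border - low_border)) + 1
    possible = [low_border + i for i in range(n_sites)]
    seen = set(observed)
    missing = [site for site in possible if site not in seen]
    percent = 100.0 * (len(possible) - len(missing)) / len(possible)
    return missing, percent


# Hypothetical toy records from the two CSV files of one library.
r1 = [("a", 3.5), ("b", 4.5), ("c", 9.5)]
r2 = [("a", 3.5), ("b", 5.5), ("d", 2.5)]
kept = clean_up(r1, r2, low_border=2.5, high_border=5.5)
missing_sites, percent_covered = coverage(kept.values(), 2.5, 5.5)
```

In this toy data, read `b` is dropped because its two reads disagree and read `c` is dropped because it lies outside the window, leaving sites 2.5 and 3.5 observed out of four possible sites. The same two functions apply at the amino acid level with the corresponding borders.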

# Output
A single CSV file which 

For all libraries, there will be one text file that detail information 


# Setup and Inputs

(1) A CSV file named `IBM_NGS_target_protein_info.csv`, placed under the directory where the scripts are run. It should contain the following information:

1. Target protein name: "target_protein"
2. Length of CDS in bp, excluding stop codon: "end_DNA"
3. The DNA level, transposition border at the 5' end of the window: "five_prime_trans_border"
4. The DNA level, transposition border at the 3' end of the window: "three_prime_trans_border"
5. The amino acid sequence level, transposition border at the N-terminus of the window: "n_trans_border"
6. The amino acid sequence level, transposition border at the C-terminus of the window: "c_trans_border"

Note that (3) - (6) should be numbers of the form X.5, i.e. if the most 5' end where the DNA could be split is "50/51", the number should be "50.5".  
There is no direct relationship between the transposition border on the DNA and the amino acid sequence level.
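
A quick check of the border values might look like the sketch below. It assumes that the quoted names above are the column headers of the CSV file and that every border column must end in .5. The function names and the example rows are hypothetical and are not taken from the example file.

```python
import csv
import io
from typing import List, Optional, Tuple

BORDER_COLUMNS = (
    "five_prime_trans_border",
    "three_prime_trans_border",
    "n_trans_border",
    "c_trans_border",
)


def is_half_position(value: Optional[str]) -> bool:
    """True if value parses as a positive number of the form X.5."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    return number > 0 and number % 1 == 0.5


def check_borders(csv_text: str) -> List[Tuple[int, str, Optional[str]]]:
    """Return (line number, column, value) for every invalid border."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column in BORDER_COLUMNS:
            if not is_half_position(row.get(column)):
                problems.append((line_number, column, row.get(column)))
    return problems


# Hypothetical content; the second data row has an invalid border.
example_csv = (
    "target_protein,end_DNA,five_prime_trans_border,"
    "three_prime_trans_border,n_trans_border,c_trans_border\n"
    "proteinA,300,50.5,280.5,17.5,93.5\n"
    "proteinB,450,50.1,400.5,17.5,133.5\n"
)
border_problems = check_borders(example_csv)
```

To check the real file, pass the text of `IBM_NGS_target_protein_info.csv` to `check_borders`; an empty list means every border is in X.5 form.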

Check the example file in the directory.

(2) A FASTA file named `target_protein_CDS_plus.fa`, placed under the directory where the scripts are run. It should contain the "extended CDS" of the target proteins.

Extended CDS = the sequence running from 60 bp upstream of the ATG of the CDS (the 60 bp do not include the ATG) to 60 bp downstream of the stop codon (the 60 bp do not include the stop codon), in the context of the final library.

So the file should look like:  
`> <target protein 1>_CDS_plus`  
`<60 bp upstream><target protein 1 CDS><60 bp downstream> `  
`> <target protein 2>_CDS_plus`  
`<60 bp upstream><target protein 2 CDS><60 bp downstream>`  
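
Because every record carries a fixed 60 bp flank on each side, a position found in the extended CDS can be converted to a CDS position by a constant offset. The sketch below parses such a file with the standard library and performs that conversion. The function names are hypothetical, the toy sequence is not a real target, and the convention that the A of the ATG is position 1 is an assumption.

```python
from typing import Dict

FLANK = 60  # bp of context on each side of the CDS, as described above


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record name: upper-case sequence}."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def extended_to_cds_position(ext_index: int) -> int:
    """Convert a 0-based index in the extended CDS to a 1-based CDS position.

    The A of the ATG maps to 1; values below 1 lie in the upstream flank.
    """
    return ext_index - FLANK + 1


# Hypothetical toy record: 60 bp flank + 6 bp core + 60 bp flank.
toy_fasta = "> protA_CDS_plus\n" + "C" * FLANK + "ATGGCT" + "G" * FLANK + "\n"
toy_records = parse_fasta(toy_fasta)
toy_extended = toy_records["protA_CDS_plus"]
toy_core = toy_extended[FLANK:-FLANK]
```

In the toy record, index 60 of the extended sequence is the A of the ATG, so `extended_to_cds_position(60)` gives CDS position 1.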

Check the example file in the directory.

(3) Raw, de-multiplexed FASTQ files from the sequencer, with FW and RV reads in separate files. These should be the "raw data" obtained from commercial sequencing companies. The files should be placed under the directory `./raw_fastq_files/`.

Check the example directory; we provide one truncated file to illustrate the process.

(4) Additional directories for storing intermediate and final data.
1. `./filtered_fastq_files/`, for outputs from Step 1 
2. `./results_per_fastq/`, for outputs from Step 2 
3. `./results_per_target_protein/`, for outputs from Step 3 
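
The directory layout can be created in one go with a short helper, sketched here with `pathlib`. The helper name `make_pipeline_dirs` is hypothetical; the directory names are the ones listed above.

```python
from pathlib import Path
from typing import List

PIPELINE_DIRS = (
    "raw_fastq_files",
    "filtered_fastq_files",
    "results_per_fastq",
    "results_per_target_protein",
)


def make_pipeline_dirs(base: str = ".") -> List[Path]:
    """Create the input and output directories under base if missing."""
    created = []
    for name in PIPELINE_DIRS:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it exists
        created.append(path)
    return created
```

Calling `make_pipeline_dirs()` from the directory where the notebooks are run leaves existing directories and their contents untouched.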

# Note
The code deposited in this repository needs to be customized if batches of files are to be processed at the same time.