# Tn-seq Sequencing Data Processing Pipeline

Jennifer Stiens
2023
j.j.stiens@gmail.com


## Overview of Process

1. Download from genewiz and run sanity checks to make sure the downloads are intact
2. FastQC (optional)
3. Illumina adapter trimming and QC with Fastp
4. Barcode from the index read added to the header of the read 1 fastq files (reads converted into 2-line format, ".reads")
5. Trimming of the transposon tag
6. Mapping with BWA-mem
7. TA-site quantification, removal of PCR duplicates
8. Generation of insertion files (.wig)

This protocol uses primers based on those used in Mendum et al., 2019 (BMC Genomics), where the barcode is included in the i7 index read.
The primers are included as a spreadsheet.

### Necessary modules and scripts

The tnseq_pro.py module contains all the Python functions for processing the fastq and sam files.

The snakemake/tnseq/ directory contains the snakemake scripts needed for fastp and mapping.

### Download from genewiz and sanity checks

In [None]:
#download from genewiz sftp
sftp jstien01_student_bbk@gweusftp.azenta.com
lcd /d/in16/u/sj003/men_tnseq
cd 40-842749567
mget *

In [None]:
# count lines in each decompressed file (4 lines per read) to check that read numbers agree across files
cd fastq
FILES=*.fastq.gz
for file in $FILES; do echo "$file $(zcat $file | wc -l)"; done >> sanity_check.txt

# check files downloaded correctly
for file in $FILES; do md5sum -c $file.md5; done >> md5_check.txt

# view the first 10 lines of each file
for file in $FILES; do echo $file; zcat $file | head -10; done
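
The same checks can be sketched in Python, which is convenient when comparing read counts programmatically. This is a minimal sketch and not part of tnseq_pro.py; the function names `count_fastq_reads` and `md5_of_file` are hypothetical.

```python
import gzip
import hashlib


def count_fastq_reads(path):
    """Count reads in a gzipped fastq file, assuming 4 lines per read."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest returned by `md5_of_file` can be compared with the value stored in the matching .md5 file, and the read counts for the R1 and R2 files of a sample should be equal.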


### FastQC on all reads

In [None]:
cd fastq
module load fastqc
module load multiqc
FILES=*.fastq.gz
# fastqc does not create the output directory, so make it first
mkdir -p fastqc
for f in $FILES; do fastqc ${f} -o fastqc; done
cd fastqc
multiqc .


### Fastp for quality control and Illumina adapter trimming of read 1 only 

Specify the adapter sequence explicitly rather than allowing automatic adapter detection, which will trim the transposon tags needed to confirm the insertion site and its position. Few reads should actually be trimmed at this step, as most were probably already trimmed during genewiz processing.

config.yaml must be present in the working directory.

In [None]:
cd fastp
conda activate snakemake
snakemake -np -s ~/snakemake/tnseq/fastp/snakefile.smk
snakemake --cores 4 -s ~/snakemake/tnseq/fastp/snakefile.smk

#on server
#!nohup snakemake --cores 8 -s $my_path/snakemake/tnseq/fastp/snakefile.smk > nohup_fastp.out 2>&1 &

#bash command for each file
#fastp -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -i fastq/<infile.fastq> -o trimmed/<outfile.fastq>

### Move barcodes from P7 index read to the name portion of the header of R1 reads

P7 indexes were delivered as R2 fastq files from genewiz.

Trimmed reads will be converted from fastq format to a 2-line fasta-style format (header and sequence only, ".reads" files). This takes around 5 minutes per file on a laptop.

In [3]:
# iterate through files
import scripts.tnseq_pro as tn

#read all files in directory
#trimmed_files = "fastq/trimmed/"
trimmed_files = "tests/trimmed_reads/"
barcoded_dir = "tests/barcoded_reads"
tn.iterate_add_barcode(trimmed_files, barcoded_dir)
    

['tests/trimmed_reads/trimmed_test_5000_R1_001.fastq']
tests/barcoded_reads/barcode_trimmed_test_5000_R1_001.reads
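
The barcode transfer can be illustrated with a short sketch. This is not the code in tnseq_pro.py: the function names, the ":" separator and the ">" header prefix are assumptions made for illustration. Index reads are looked up by read name rather than by file order, since fastp may have discarded some R1 reads.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence) for each 4-line fastq record."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated fastq record")
        yield record[0].rstrip("\n"), record[1].rstrip("\n")


def add_barcodes(r1_handle, index_handle, out_handle, sep=":"):
    """Write 2-line reads: R1 name with the barcode appended, then the R1 sequence.

    The output layout (">name:BARCODE") is a hypothetical choice.
    Returns the number of reads written.
    """
    # Map read name (first header token, without "@") to the index sequence.
    barcodes = {head.split()[0][1:]: seq for head, seq in read_fastq(index_handle)}
    n_written = 0
    for head, seq in read_fastq(r1_handle):
        name = head.split()[0][1:]
        barcode = barcodes.get(name)
        if barcode is None:
            continue  # no index read found for this R1 read
        out_handle.write(f">{name}{sep}{barcode}\n{seq}\n")
        n_written += 1
    return n_written
```

Holding every barcode in a dictionary is simple but memory-hungry for large runs; a streaming approach would be needed if memory is limiting.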


## Remove transposon tag from reads with tag in first 20 bases

In [2]:
import scripts.tnseq_pro as tn

tn.iterate_tag_trim("tests/barcoded_reads", "tests/output/")

#tn.trim_tag_fastq("barcoded_output/<barcoded_fastq_file.reads>", "barcoded_output/", tag="ACTTATCAGCCAACCTGTTA", mismatch_max=2)
#tn.trim_tag_fastq("tests/barcoded/barcode_trimmed_test_5000_R1_001.reads", "tests/barcoded", tag="ACTTATCAGCCAACCTGTTA", mismatch_max=2)

File being processed:  tests/barcoded_reads/barcode_trimmed_test_5000_R1_001.reads
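
The idea behind tag trimming can be sketched as follows. This is an illustrative sketch rather than the implementation in tnseq_pro.py; it assumes that "tag in first 20 bases" means the tag starts within the first 20 bases, and it reuses the tag sequence and mismatch limit from the commented call above. The function names `hamming` and `trim_tag` are hypothetical.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def trim_tag(seq, tag="ACTTATCAGCCAACCTGTTA", mismatch_max=2, window=20):
    """Return the sequence downstream of the tag, or None if no tag is found.

    Every start offset in the first `window` bases is tested, and the offset
    with the fewest mismatches (at most `mismatch_max`) is used.
    """
    best = None  # (mismatches, start offset)
    for start in range(window):
        candidate = seq[start:start + len(tag)]
        if len(candidate) < len(tag):
            break  # read too short for the tag to fit at this offset
        mismatches = hamming(candidate, tag)
        if mismatches <= mismatch_max and (best is None or mismatches < best[0]):
            best = (mismatches, start)
    if best is None:
        return None
    return seq[best[1] + len(tag):]
```

Reads for which `trim_tag` returns None carry no recognisable transposon tag and would be dropped before mapping.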


## Map tag-clipped, barcoded reads with BWA-mem

Use snakemake pipeline to:

1. map (bwa-mem)
2. sort (samtools)
3. index (samtools)
4. filter for mapped reads only (samtools)
5. compile read statistics (samtools)

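The BWA index files must be present in the same directory as the reference fasta. The indexing command writes them alongside the fasta (with suffixes such as .amb, .ann and .pac); a .fasta.fai index is a separate file produced by samtools faidx.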
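
The five steps listed above correspond roughly to the commands built by the sketch below. This is an assumption about what such a pipeline typically runs, not the contents of the snakefile; the function name `mapping_commands` and all file names are hypothetical. The function only builds command strings and does not execute them.

```python
import shlex


def mapping_commands(reads, reference, sample, threads=2, out_dir="sorted_reads"):
    """Return the five shell commands for one sample, in pipeline order."""
    q = shlex.quote
    raw_sam = q(f"{out_dir}/{sample}.sam")
    sorted_bam = q(f"{out_dir}/{sample}.sorted.bam")
    mapped_sam = q(f"{out_dir}/mapped_{sample}.sam")
    flagstat = q(f"{out_dir}/{sample}.flagstat")
    return [
        # 1. map reads to the indexed reference (SAM is written to stdout)
        f"bwa-mem2 mem -t {int(threads)} {q(reference)} {q(reads)} > {raw_sam}",
        # 2. sort alignments by coordinate
        f"samtools sort -o {sorted_bam} {raw_sam}",
        # 3. index the sorted BAM
        f"samtools index {sorted_bam}",
        # 4. keep mapped reads only (-F 4 excludes unmapped reads), with header
        f"samtools view -h -F 4 -o {mapped_sam} {sorted_bam}",
        # 5. compile mapping statistics
        f"samtools flagstat {sorted_bam} > {flagstat}",
    ]
```

Printing the returned list for one sample gives a quick way to run or debug a single file outside snakemake.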

In [None]:
conda activate tnseq
# index the reference first for bwa-mem (index files are written next to the fasta)
bwa-mem2 index ref_seqs/Mbovis_AF2122-97.fasta

In [None]:

# with snakemake (maps, sorts, indexes and creates flagstats report)
#check config.yaml file

cd ~/tn_seq/menadione_tnseq/
conda activate snakemake
snakemake -np -s ~/snakemake/tnseq/mapping/snakefile.smk
snakemake --cores 2 -s ~/snakemake/tnseq/mapping/snakefile.smk

#on server
snakemake -np -s $my_path/snakemake/tnseq/mapping/snakefile.smk
nohup snakemake --cores 8 -s $my_path/snakemake/tnseq/mapping/snakefile.smk > nohup_map.out 2>&1 &

Sorted, indexed and filtered reads are in the 'sorted_reads' directory, along with "flagstat" files which can be used for mapping statistics.

## Quantification and filtering

A single function, 'sam_to_wig', is used for:

1. Reducing reads to unique template counts (eliminating duplicates based on same barcode/insertion)
2. Matching unique mapped reads to insertion site (TA-site)
3. Quantifying number of reads per TA-site in genome and creating .wig file
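
The three steps above can be sketched as below. This is a simplified illustration, not the 'sam_to_wig' implementation: it assumes each read has already been reduced to a (barcode, insertion position) pair with a 1-based coordinate, and it leaves out how that position is derived from the alignment. All function names are hypothetical.

```python
from collections import Counter


def find_ta_sites(genome):
    """Return the 1-based coordinate of every TA dinucleotide in the genome."""
    genome = genome.upper()
    return [i + 1 for i in range(len(genome) - 1) if genome[i:i + 2] == "TA"]


def count_templates(reads, ta_sites):
    """Collapse duplicates and count unique templates per TA site.

    `reads` is an iterable of (barcode, position) tuples. Reads sharing both
    barcode and position are treated as PCR duplicates and counted once.
    Returns (counts per TA site, number of unique reads with no TA match).
    """
    unique = set(reads)
    site_set = set(ta_sites)
    counts = Counter()
    unmatched = 0
    for _barcode, position in unique:
        if position in site_set:
            counts[position] += 1
        else:
            unmatched += 1
    return counts, unmatched


def write_wig(counts, ta_sites, chrom, handle):
    """Write a variableStep wig with one line per TA site, zeros included."""
    handle.write(f"variableStep chrom={chrom}\n")
    for site in ta_sites:
        handle.write(f"{site} {counts.get(site, 0)}\n")
```

Writing every TA site, including those with zero counts, keeps the wig files of different samples aligned line by line.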



In [3]:
import scripts.tnseq_pro as tn
tn.iterate_sam_to_wig("tests/sorted_reads", "tests/output", "ref_seqs/Mbovis_AF2122-97.fasta")


['tests/sorted_reads/mapped_test_5000_R1_001.sam']
test_5000
tests/sorted_reads/mapped_test_5000_R1_001.sam
number of unique reads assigned to TA sites:  1145
number of unique reads with no ta site match:  12


## Analyse insertion files

In [1]:
import scripts.tnseq_pro as tn
import glob
wig_dir = "tests/output"
wig_files = glob.glob(wig_dir + "/*.wig")
for file in wig_files:
    print("File being processed: ", file)
    tn.analyze_dataset(file)
        

File being processed:  tests/output/test_5000_insertions.wig
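
As a rough illustration of the kind of summary that can be drawn from an insertion file, the sketch below computes a few common Tn-seq statistics from wig lines. It is not a description of what 'analyze_dataset' reports; the function name `summarise_wig` and the chosen statistics are assumptions.

```python
def summarise_wig(lines):
    """Summarise a wig file given as an iterable of text lines."""
    counts = []
    for line in lines:
        line = line.strip()
        # skip blank lines, comments and wig header lines
        if not line or line.startswith(("#", "variableStep", "track")):
            continue
        _position, count = line.split()[:2]
        counts.append(float(count))
    n_sites = len(counts)
    n_hit = sum(1 for c in counts if c > 0)
    total = sum(counts)
    return {
        "ta_sites": n_sites,
        "sites_with_insertions": n_hit,
        "insertion_density": n_hit / n_sites if n_sites else 0.0,
        "total_reads": total,
        "mean_reads_per_hit_site": total / n_hit if n_hit else 0.0,
    }
```

Passing an open file handle, for example `summarise_wig(open(file))`, works because the function only iterates over lines.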
