# Tn-seq Sequencing Data Processing Pipeline

Jennifer Stiens
2023
j.j.stiens@gmail.com


## Overview of Process

1. Download from Genewiz and sanity-check the files to make sure the downloads are intact
2. FastQC and/or fastp
3. Illumina adapter trimming with fastp
4. Add barcodes to the read headers of the FASTQ files
5. Mapping with BWA-MEM
6. Read filtering for transposon tag and unique templates (determination of insertion site in genome)
7. TA-site quantification and generation of insertion files (.wig)

#### Necessary modules and scripts

The tnseq_pro.py module contains all the Python functions for processing the FASTQ and SAM files.

The snakemake/tnseq/ directory contains the necessary Snakemake scripts for fastp and mapping.

### Download from Genewiz and sanity checks

In [None]:
#download from genewiz sftp
sftp jstien01_student_bbk@gweusftp.azenta.com
lcd /d/in16/u/sj003/men_tnseq
cd 40-842749567
mget *

In [None]:
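#check line counts to see whether read numbers are consistent across files
cd fastq
FILES=*.fastq.gz
# count lines of the decompressed reads (4 lines per read), not of the gzip stream
for file in $FILES; do echo "$file $(zcat "$file" | wc -l)"; done >> sanity_check.txt

# check files downloaded correctly
for file in $FILES; do md5sum -c "$file.md5"; done >> md5_check.txt

#head (read from the pipe; passing the file name to head would print compressed bytes)
for file in $FILES; do echo "$file"; zcat "$file" | head -10; done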
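
The same checks can be sketched in Python, which is convenient when the results need to be tabulated. This is a minimal sketch, not part of tnseq_pro.py: the function names `count_fastq_reads`, `md5_of_file` and `check_downloads` are hypothetical, and it assumes each `.fastq.gz` file has a companion `.md5` file whose first field is the expected digest.

```python
import gzip
import hashlib
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file (four lines per record)."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(fastq_dir):
    """Yield (file name, read count, md5 matches) for each *.fastq.gz file.

    The md5 field is None when no companion .md5 file is found.
    """
    for fq in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        md5_file = Path(str(fq) + ".md5")
        expected = md5_file.read_text().split()[0] if md5_file.exists() else None
        ok = (expected == md5_of_file(fq)) if expected else None
        yield fq.name, count_fastq_reads(fq), ok
```

For example, `for name, n_reads, ok in check_downloads("fastq"): print(name, n_reads, ok)` prints one line per file.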


### FastQC on all reads

In [None]:
cd fastq
module load fastqc
module load multiqc
FILES=*.fastq.gz
# fastqc does not create the output directory itself
mkdir -p fastqc
for f in $FILES; do fastqc "${f}" -o fastqc; done
cd fastqc
multiqc .


### Fastp for quality control and Illumina adapter trimming of read 1 only 

Do not allow automatic adapter detection, or fastp will trim the transposon tags, which are needed to confirm the insertion site and its position. This step should not trim many reads, as most were probably already trimmed by Genewiz processing.

config.yaml must be present in the working directory.

In [None]:
cd fastp
conda activate snakemake
snakemake -np -s ~/snakemake/tnseq/fastp/snakefile.smk
snakemake --cores 4 -s ~/snakemake/tnseq/fastp/snakefile.smk
#bash command for each file
#fastp -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -i fastq/<infile.fastq> -o trimmed/<outfile.fastq>

### Move barcodes from P7 index read to the name portion of the header of R1 reads. 

P7 indexes were delivered as R2 fastq files from genewiz.

Trimmed reads will be converted from fastq format to fasta format (header and sequence only). This takes around 5 minutes per file on a laptop.

In [None]:
# iterate through files
import scripts.tnseq_pro as tn

#read all files in directory
trimmed_files = "fastq/trimmed/"
barcoded_dir = "barcoded/"
tn.iterate_add_barcode(trimmed_files, barcoded_dir)
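
The internals of `iterate_add_barcode` are not shown here, so the following is only a sketch of the idea: pair each R1 read with its R2 index read, append the barcode to the read name, and write FASTA. The function names `read_fastq` and `add_barcodes` and the `_BC:` separator are hypothetical. It also assumes R1 and R2 list the same reads in the same order; if trimming dropped reads from R1, the R2 records would have to be looked up by name instead.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (header, sequence) tuples from a plain or gzipped FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))
            if len(record) < 4:
                break
            yield record[0].rstrip("\n"), record[1].rstrip("\n")


def add_barcodes(r1_path, r2_path, out_fasta):
    """Write R1 reads as FASTA with the matching R2 barcode appended to the name.

    Returns the number of reads written.
    """
    n_written = 0
    with open(out_fasta, "w") as out:
        for (h1, seq), (h2, barcode) in zip(read_fastq(r1_path), read_fastq(r2_path)):
            # the read name is the header up to the first whitespace, without '@'
            name1 = h1[1:].split()[0]
            name2 = h2[1:].split()[0]
            if name1 != name2:
                raise ValueError(f"read names out of sync: {name1} vs {name2}")
            # hypothetical header layout: >name_BC:barcode
            out.write(f">{name1}_BC:{barcode}\n{seq}\n")
            n_written += 1
    return n_written
```

Attaching the barcode to the name with no whitespace means it stays in the read name column of the SAM file after mapping, where the filtering step can recover it.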
    

### Map trimmed and barcoded reads with BWA-mem

Use snakemake pipeline to:

1. map (bwa-mem)
2. sort (samtools)
3. index (samtools)
4. filter for mapped reads only (samtools)
5. compile read statistics (samtools)

The index files generated by bwa-mem2 index must be present in the same directory as the reference fasta.

In [None]:
conda activate tnseq
# have to index the reference first for bwa-mem2 (index files are written alongside the fasta)
bwa-mem2 index ref_seqs/Mbovis_AF2122-97.fasta

In [None]:

# with snakemake (maps, sorts, indexes and creates flagstats report)
#make config.yaml file

cd ~/tn_seq/menadione_tnseq/
conda activate snakemake
snakemake -np -s ~/snakemake/tnseq/mapping/snakefile.smk
snakemake --cores 2 -s ~/snakemake/tnseq/mapping/snakefile.smk
#!snakemake -np -s $my_path/snakemake/tnseq/snakefile.smk
#!nohup snakemake --cores 8 -s $my_path/snakemake/map_bwa/pe/snakefile.smk > nohup_map.out 2>&1 &

## Filtering and quantification

One function, 'sam_to_wig', is used for:

1. Filtering for mapped reads with a recognised transposon tag (indicating a transposon-gDNA junction)
2. Matching mapped reads to an insertion site (TA-site)
3. Reducing reads to unique template counts (eliminating duplicates based on same barcode/insertion)
4. Quantifying the number of reads per TA-site in the genome and creating a .wig file
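
The logic of the four steps above can be sketched as follows. This is not the code of `sam_to_wig`; the function names, the tuple layout of `reads`, and the assumption that the transposon tag sits at the very start of each read are all illustrative. A real implementation would also parse the SAM records, handle strand when computing the insertion coordinate, and probably tolerate mismatches in the tag.

```python
def find_ta_sites(genome):
    """Return 1-based coordinates of every TA dinucleotide in the genome string."""
    genome = genome.upper()
    sites = []
    pos = genome.find("TA")
    while pos != -1:
        sites.append(pos + 1)
        pos = genome.find("TA", pos + 1)
    return sites


def count_templates(reads, ta_sites, tag):
    """Count unique barcodes (templates) per TA site.

    reads: iterable of (read_sequence, insertion_coordinate, barcode) tuples
    tag:   transposon end sequence expected at the start of each read
    """
    ta_set = set(ta_sites)
    barcodes_at_site = {}
    for seq, coord, barcode in reads:
        if not seq.startswith(tag):   # step 1: require the transposon tag
            continue
        if coord not in ta_set:       # step 2: insertion must sit on a TA site
            continue
        # step 3: a set per site collapses reads sharing barcode and insertion
        barcodes_at_site.setdefault(coord, set()).add(barcode)
    # step 4: one count per TA site, zeros included
    return {site: len(barcodes_at_site.get(site, ())) for site in ta_sites}


def write_wig(counts, chrom, out_path):
    """Write per-site counts as a variableStep .wig file."""
    with open(out_path, "w") as out:
        out.write(f"variableStep chrom={chrom}\n")
        for site in sorted(counts):
            out.write(f"{site} {counts[site]}\n")
```

Keeping zero-count TA sites in the output gives every sample the same set of rows, which makes the .wig files directly comparable across samples.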



In [15]:
# iterate through samples and run sam_to_wig on each .sam file
import glob
import os
import re

import scripts.tnseq_pro as tn

# make list of .sam files in directory
sam_files = sorted(glob.glob(os.path.join("sorted_reads", "*.sam")))
print(sam_files)
bovis_fasta = "ref_seqs/Mbovis_AF2122-97.fasta"
for file in sam_files:
    # find sample name from file name, e.g. mapped_A1_R1_001.sam -> A1
    sample_filename = os.path.basename(file).split(".")[0]
    match = re.search(r'mapped_(\w*)_R1_001', sample_filename)
    if match is None:
        # skip files that do not follow the expected naming pattern
        print(f"skipping {file}: unexpected file name")
        continue
    sample_name = match.group(1)
    print(sample_name)
    print(file)
    tn.sam_to_wig(file, bovis_fasta, sample_name)

['sorted_reads/mapped_A1_R1_001.sam']
A1
sorted_reads/mapped_A1_R1_001.sam


KeyboardInterrupt: 

In [None]:
#add a counter to script to monitor progress
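
One possible way to add such a counter without changing the processing logic is to wrap whatever is being iterated over (for example the lines of the SAM file) in a small generator. This is a sketch only; `with_progress` is a hypothetical helper and is not part of tnseq_pro.py.

```python
import sys


def with_progress(iterable, every=100_000, label="reads", stream=sys.stderr):
    """Yield items unchanged, reporting the running count every `every` items."""
    count = 0
    for count, item in enumerate(iterable, start=1):
        if count % every == 0:
            print(f"{count:,} {label} processed", file=stream, flush=True)
        yield item
    # final summary once the iterable is exhausted
    print(f"done: {count:,} {label} processed", file=stream, flush=True)
```

Inside a loop this would be used as `for line in with_progress(sam_handle): ...`, where `sam_handle` stands for the open SAM file.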