### Note: the mapping script takes days to run! Run this notebook inside screen so that the script keeps running if you close this page.

If you close the notebook while using screen, log output may stop appearing in the notebook, but the log file passed to the script with --logFile will still be written.

How to run a Jupyter notebook with screen: https://docs.google.com/document/d/1lLcl3HhDBrn87M4-IhkpDLOmvmmf09-KeLSgn3U21hE/edit?usp=sharing

In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt

#makes viewing pandas tables better
pd.set_option('display.max_colwidth', 0)

## Make metafile for TNseq mapping script

### Info that will be the same for all pools

metafile_name: Name of file where we're going to write the metadata

combined_pool_name: What we're going to call the combination of individual pools

ReadModel: Full or relative path to a text file describing the expected structure of TnSeq reads. The file consists of 4 lines. Line 1 gives the number of initial variable bases to ignore, written as a run of 'N's (lowercase in the example). Line 2 is the sequence preceding the barcode. Line 3 gives the expected barcode length, written as a run of 'N's. Line 4 is the sequence between the barcode and the junction with the genome. Example:

    nnnnnn
    GATGTCCACGAGGTCTCT
    NNNNNNNNNNNNNNNNNNNN
    CGTACGCTGCAGGTCGACAATGATCCAAACTATCAGTGTTTGA	
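The four-line layout described above can be checked before starting a long run. The following is a minimal sketch and not part of the RBseq pipeline: `parse_read_model` is a hypothetical helper, it assumes all four lines are non-empty, and it assumes lines 2 and 4 contain only A, C, G, and T. It checks the format as described here, not how the mapping script itself parses the file.

```python
from pathlib import Path


def parse_read_model(path):
    """Check the 4-line read model layout and return its parts (hypothetical helper)."""
    # Assumption: blank lines carry no meaning, so they are dropped before counting.
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    if len(lines) != 4:
        raise ValueError(f"expected 4 non-empty lines, found {len(lines)}")
    leading, pre_barcode, barcode, post_barcode = lines

    # Lines 1 and 3 are placeholders and should contain only N (either case).
    for label, value in (("line 1", leading), ("line 3", barcode)):
        if set(value.upper()) != {"N"}:
            raise ValueError(f"{label} should contain only N characters: {value!r}")

    # Lines 2 and 4 are fixed flanking sequences, assumed to be plain A/C/G/T.
    for label, value in (("line 2", pre_barcode), ("line 4", post_barcode)):
        if set(value.upper()) - set("ACGT"):
            raise ValueError(f"{label} contains non-ACGT characters: {value!r}")

    return {
        "n_leading": len(leading),        # initial variable bases to ignore
        "pre_barcode": pre_barcode,       # sequence preceding the barcode
        "barcode_length": len(barcode),   # expected barcode length
        "post_barcode": post_barcode,     # sequence between barcode and genome junction
    }
```

In this notebook it could be called as `parse_read_model(ReadModel)` once the path has been set.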

InsertionSequence: Full or relative path to a text file giving the full sequence of each insertion. This should be a FASTA-format file with two entries labeled 'insert' and 'plasmid', giving the full sequence of the expected insertion into the genome and the remaining source plasmid sequence, if appropriate. Example:

    >insert
    agtcac...
    >plasmid
    actgact... 
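A quick way to confirm that the insertion file carries the expected entry labels is sketched below. Both function names are hypothetical, and the record name is assumed to be the text after `>` up to the first whitespace. Because the 'plasmid' entry is only needed where appropriate, the sketch reports missing labels and does not raise an error.

```python
def fasta_record_names(path):
    """Return the record names found in a FASTA file, in order."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                # Assumption: the name is the first whitespace-delimited token.
                names.append(header.split()[0] if header else "")
    return names


def missing_insertion_entries(path, required=("insert", "plasmid")):
    """Return the required labels that are absent from the FASTA file."""
    present = set(fasta_record_names(path))
    return [name for name in required if name not in present]
```

Calling `missing_insertion_entries(InsertionSequence)` should return an empty list when both entries are present.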
      
GenomeSequence: Full or relative path to a .fasta file of genome sequences for all species in the pool.

OutputDir: Full or relative path to the desired directory to save output files.

In [2]:
#CHANGE THESE AS NEEDED

metafile_name = 'TNSeq_mapping_metafile_07_31_2023.txt'
mapping_log_file = 'TNSeq_mapping_07_31_2023.log'

ReadModel = '/usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/Readmodel_file'
InsertionSequence = '/usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/insertion_seq_file'
GenomeSequence = '/usr2/people/shollyt22/shollyt22/TnSeq_Results_may_2023/TnSeq_M003593/analysed_files/Myceliopthora_genome_with_vector_seq.txt'
OutputDir = 'TNSeq_mapping_output_new_genome'
print(metafile_name, mapping_log_file, ReadModel)

TNSeq_mapping_metafile_07_31_2023.txt TNSeq_mapping_07_31_2023.log /usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/Readmodel_file


### Pool-specific information

In [3]:
#where you saved the merged FASTQs in TNseq_prepare_files.ipynb
FASTQ_directory = '/usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/TNSeq_merge_FASTQ/'

In [4]:
#get the list of fastqs we generated in the BBmerge step above
#copy and paste these into the cell below this one

!ls {FASTQ_directory}/*_merged.fastq

/usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/TNSeq_merge_FASTQ//OORB003_S1_L001_merged.fastq


For each pool, come up with a UNIQUE short name for the pool.

In [5]:
#CHANGE THESE AS NEEDED BUT KEEP THE [('NAME', 'PATH/TO/FASTQ.FASTQ')] FORMAT

short_names_and_fastqs = [('Tnseq_07312023', '/usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/TNSeq_merge_FASTQ/OORB003_S1_L001_merged.fastq')]
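Since each short name must be unique, a small check can catch copy-and-paste mistakes once the list holds several pools. This is a sketch only: `duplicated_short_names` is a hypothetical helper, and the pool names and FASTQ names in the example are made up for illustration.

```python
from collections import Counter


def duplicated_short_names(pairs):
    """Return short names that appear more than once in (name, fastq) pairs."""
    counts = Counter(name for name, _fastq in pairs)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical example data, not from this experiment.
example = [("poolA", "a_merged.fastq"), ("poolB", "b_merged.fastq"), ("poolA", "c_merged.fastq")]
print(duplicated_short_names(example))  # prints ['poolA']
```

Running `duplicated_short_names(short_names_and_fastqs)` on the list defined above should return an empty list.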

### Writing the above info into a metafile

In [6]:
metafile_columns = ['Pool', 'ShortName', 'Fastq', 'ReadModel', 'InsertionSequence', 'GenomeSequence', 'OutputDir']

with open(metafile_name, 'w') as f:
    
    #write column names
    f.write('\t'.join(metafile_columns)+'\n')
    
    #write a line for each pool
    for pool, fastq in short_names_and_fastqs:
        
        #we write the pool name twice because we're first processing each pool separately then combining them
        #if we were combining them all at once then the first pool variable would be the name of the combined pool
        to_write = '\t'.join([pool, pool, fastq, ReadModel, InsertionSequence, GenomeSequence, OutputDir])
        f.write(to_write+'\n')
        

In [7]:
#view metafile
pd.read_csv(metafile_name, sep='\t')

Unnamed: 0,Pool,ShortName,Fastq,ReadModel,InsertionSequence,GenomeSequence,OutputDir
0,Tnseq_07312023,Tnseq_07312023,/usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/TNSeq_merge_FASTQ/OORB003_S1_L001_merged.fastq,/usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/Readmodel_file,/usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/insertion_seq_file,/usr2/people/shollyt22/shollyt22/TnSeq_Results_may_2023/TnSeq_M003593/analysed_files/Myceliopthora_genome_with_vector_seq.txt,TNSeq_mapping_output_new_genome
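Because the mapping run takes days, it may be worth confirming that every input path in the metafile points to an existing file before starting. The sketch below is one way to do that. `missing_input_files` is a hypothetical helper, and it checks only the four input columns; whether OutputDir must already exist is not stated here, so that column is left unchecked.

```python
import csv
import os

# Columns of the metafile that should point to existing input files.
INPUT_COLUMNS = ("Fastq", "ReadModel", "InsertionSequence", "GenomeSequence")


def missing_input_files(metafile_path):
    """Return (ShortName, column, path) for every input path that is not a file."""
    missing = []
    with open(metafile_path, newline="") as handle:
        # The metafile is tab-separated with a header row.
        for row in csv.DictReader(handle, delimiter="\t"):
            for column in INPUT_COLUMNS:
                if not os.path.isfile(row[column]):
                    missing.append((row["ShortName"], column, row[column]))
    return missing
```

An empty list from `missing_input_files(metafile_name)` means all listed input files were found.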


## Map transposons in each pool to genome

View additional input and output options here: https://github.com/stcoradetti/RBseq/blob/master/README.txt

This takes a long time to run and will sit on the "Looking for 20bp sequence barcode" step for a while. You can check progress by switching to the output directory specified above and typing:

    wc -l *blastquery.fasta
    
This number divided by 2 is roughly how many reads have been processed. You can also check the logFile specified below.
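The same progress check can be run from a notebook cell without leaving Python. This is a rough sketch: `approx_reads_processed` is a hypothetical helper, and it assumes each read occupies two lines (a header and one sequence line) in the `*blastquery.fasta` files, which is what the divide-by-2 rule above implies.

```python
import glob
import os


def approx_reads_processed(output_dir):
    """Return {file name: approximate read count} for *blastquery.fasta files."""
    counts = {}
    pattern = os.path.join(output_dir, "*blastquery.fasta")
    for path in sorted(glob.glob(pattern)):
        # Count lines in binary mode so odd characters cannot break decoding.
        with open(path, "rb") as handle:
            n_lines = sum(1 for _ in handle)
        # Two lines per read: header plus sequence (assumption stated above).
        counts[os.path.basename(path)] = n_lines // 2
    return counts
```

Calling `approx_reads_processed(OutputDir)` while the mapping is running gives a snapshot; the counts should grow between calls.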

In [8]:
#always run this: the server could not otherwise find the directory containing the BLAST package
%env PATH=$PATH/auto/sahara/namib/home/shollyt22/anaconda3/lib/python3.9/site-packages/:/usr2/people/shollyt22/anaconda3/bin:/usr/lib64/qt-3.3/bin:/usr/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr2/people/kayleeec/bin/ncbi-blast-2.13.0+/bin:/auto/sahara/namib/home/shollyt22/anaconda3/lib/python3.9/site-packages:/usr2/people/kayleeec/bin/ncbi-blast-2.13.0+/bin:/auto/sahara/namib/home/shollyt22/anaconda3/lib/python3.9/site-packages:/usr2/people/kayleeec/bin/ncbi-blast-2.13.0+/bin:/auto/sahara/namib/home/shollyt22/anaconda3/lib/python3.9/site-packages/

env: PATH=$PATH/auto/sahara/namib/home/shollyt22/anaconda3/lib/python3.9/site-packages/:/usr2/people/shollyt22/anaconda3/bin:/usr/lib64/qt-3.3/bin:/usr/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr2/people/kayleeec/bin/ncbi-blast-2.13.0+/bin:/auto/sahara/namib/home/shollyt22/anaconda3/lib/python3.9/site-packages:/usr2/people/kayleeec/bin/ncbi-blast-2.13.0+/bin:/auto/sahara/namib/home/shollyt22/anaconda3/lib/python3.9/site-packages:/usr2/people/kayleeec/bin/ncbi-blast-2.13.0+/bin:/auto/sahara/namib/home/shollyt22/anaconda3/lib/python3.9/site-packages/


In [9]:
cmd = 'nice -n 17 python3 /usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/RBseq_Map_Insertions_v1.1.4_PBa_Jskerker.py --metafile {} --logFile {}'.format(metafile_name, mapping_log_file)

!{cmd}

RBseq_Map_Insertions.py
2023-08-09 12:21:11 Version: 1.1.4_PBa
2023-08-09 12:21:11 Release Date: August 8, 2020
2023-08-09 12:21:11 Options passed:  metafile:TNSeq_mapping_metafile_07_31_2023.txt  logFile:TNSeq_mapping_07_31_2023.log  minQual:10  matchBefore:6  matchAfter:6  matchJunction:4  maxFillerSeq:100  barcodeVariation:2  wobbleAllowed:1  scoreDiff:10  minFraction:0.9  filterNeighborhood:10  filterEditDistance:5  minPercentID:95  maxEvalue:0.1  useMappedFiles:False  noInsertHits:False  noBarcodes:False 
2023-08-09 12:21:11 Logging status updates in TNSeq_mapping_07_31_2023.log
2023-08-09 12:21:11 Loading TnSeq library metadata from TNSeq_mapping_metafile_07_31_2023.txt
2023-08-09 12:21:11 Finding barcodes in fastqs and mapping insertion locations
2023-08-09 12:21:11   Mapping reads from /usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/TNSeq_merge_FASTQ/OORB003_S1_L001_merged.fastq using insertion model /usr2/people/shollyt22/shollyt22/Barseq_July_2023

## Combine pools

Make a new metafile, this time combining the pools into one larger pool.

In [None]:
#CHANGE THESE AS NEEDED

combined_pool_name = 'POOLS30_to_41_combined'
metafile_name = 'TNSeq_mapping_metafile_pools30_to_41_combined.txt'
mapping_log_file = 'TNSeq_mapping_pools30_to_41_combined.log'

In [None]:
metafile_columns = ['Pool', 'ShortName', 'Fastq', 'ReadModel', 'InsertionSequence', 'GenomeSequence', 'OutputDir']

with open(metafile_name, 'w') as f:
    
    #write column names
    f.write('\t'.join(metafile_columns)+'\n')
    
    #write a line for each pool
    for pool, fastq in short_names_and_fastqs:
        
        #the pool name is now the combined pool name
        to_write = '\t'.join([combined_pool_name, pool, fastq, ReadModel, InsertionSequence, GenomeSequence, OutputDir])
        f.write(to_write+'\n')
        

Note how this metafile differs from the one above: the "Pool" column now holds the name of the combined pool.

In [10]:
#view metafile
pd.read_csv(metafile_name, sep='\t')

Unnamed: 0,Pool,ShortName,Fastq,ReadModel,InsertionSequence,GenomeSequence,OutputDir
0,Tnseq_07312023,Tnseq_07312023,/usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/TNSeq_merge_FASTQ/OORB003_S1_L001_merged.fastq,/usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/Readmodel_file,/usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/insertion_seq_file,/usr2/people/shollyt22/shollyt22/TnSeq_Results_may_2023/TnSeq_M003593/analysed_files/Myceliopthora_genome_with_vector_seq.txt,TNSeq_mapping_output_new_genome


This step takes ~ 2 hours for 4 pools.

In [11]:
cmd = 'nice -n 17 python3 /usr2/people/clairedubin/barseq/latest_pipeline/RBseq_Map_Insertions_v1.1.4_PBa_Jskerker.py --useMappedFiles --metafile {} --logFile {}'.format(metafile_name, mapping_log_file)

!{cmd}

RBseq_Map_Insertions.py
2023-08-09 12:40:45 Version: 1.1.4_PBa
2023-08-09 12:40:45 Release Date: August 8, 2020
2023-08-09 12:40:45 Options passed:  metafile:TNSeq_mapping_metafile_07_31_2023.txt  logFile:TNSeq_mapping_07_31_2023.log  minQual:10  matchBefore:6  matchAfter:6  matchJunction:4  maxFillerSeq:100  barcodeVariation:2  wobbleAllowed:1  scoreDiff:10  minFraction:0.9  filterNeighborhood:10  filterEditDistance:5  minPercentID:95  maxEvalue:0.1  useMappedFiles:True  noInsertHits:False  noBarcodes:False 
2023-08-09 12:40:45 Logging status updates in TNSeq_mapping_07_31_2023.log
2023-08-09 12:40:45 Loading TnSeq library metadata from TNSeq_mapping_metafile_07_31_2023.txt
2023-08-09 12:40:45 --useMappedFiles option passed. Skipping mapping step and proceeding to pool analysis
2023-08-09 12:40:45 Processing mapped reads to characterize mutant pool(s)
2023-08-09 12:40:45 Analyzing pool: Tnseq_07312023
2023-08-09 12:40:45   Reading entries from: TNSeq_mapping_output_new_genome/Tnseq_07

## The command below was run to reformat the InsertionSequence file, as required by the transposon mapping step above

In [73]:
#build a BLAST nucleotide database from the InsertionSequence file, as required for mapping transposons
!makeblastdb -in {InsertionSequence} -dbtype nucl



Building a new DB, current time: 08/02/2023 13:57:34
New DB name:   /usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/insertion_seq_file
New DB title:  /usr2/people/shollyt22/shollyt22/Barseq_July_2023/OORB003_TnSeq/for_analysis/insertion_seq_file
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 2 sequences in 0.0413408 seconds.


