### Note: the mapping script takes days to run! Run this notebook inside a screen session so that the script keeps running if you close this page.

If you close the notebook while using screen, log output may stop appearing in the notebook, but the log file passed to the script with --logFile will still be written.
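Since the log file keeps growing even when the notebook output stops, one way to check progress after reopening the notebook is to print the last few lines of that file. The sketch below is only an illustration: `tail_log` is a hypothetical helper name, and the path in the example is assumed to match the log file name set later in this notebook.

```python
from collections import deque
from pathlib import Path

def tail_log(log_path, n_lines=10):
    """Return the last n_lines lines of a text log file.

    Returns an empty list if the file does not exist yet.
    """
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open() as handle:
        # A deque with maxlen keeps only the most recent n_lines lines in memory.
        return [line.rstrip('\n') for line in deque(handle, maxlen=n_lines)]

# Assumed log file name; change it to whatever you pass to --logFile.
for line in tail_log('TNSeq_mapping_pools30_to_41_individual.log'):
    print(line)
```

Re-running the cell shows the newest log lines without restarting the mapping script.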

How to run a Jupyter notebook with screen: https://docs.google.com/document/d/1lLcl3HhDBrn87M4-IhkpDLOmvmmf09-KeLSgn3U21hE/edit?usp=sharing

In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt

#makes viewing pandas tables better
pd.set_option('display.max_colwidth', 0)

## Make metafile for TNseq mapping script

### Info that will be the same for all pools

metafile_name: Name of file where we're going to write the metadata

combined_pool_name: What we're going to call the combination of individual pools

ReadModel: Full or relative path to a text file describing the expected structure of TnSeq reads. The file consists of 4 lines. Line 1 gives the number of initial variable bases to ignore, written as one 'N' per base. Line 2 is the sequence preceding the barcode. Line 3 gives the expected barcode length, also written as one 'N' per base. Line 4 is the sequence between the barcode and the junction with the genome. Example:

    nnnnnn
    GATGTCCACGAGGTCTCT
    NNNNNNNNNNNNNNNNNNNN
    CGTACGCTGCAGGTCGACAATGATCCAAACTATCAGTGTTTGA	
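
Because a malformed read model would only surface after the long mapping run has started, it can help to sanity-check the file first. The following is a minimal sketch based only on the 4-line description above; `parse_read_model` is a hypothetical helper, not the parser the mapping script uses, and it assumes all four lines are non-empty.

```python
def parse_read_model(lines):
    """Split the 4 lines of a read-model file into named parts."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) != 4:
        raise ValueError('expected 4 non-empty lines, found {}'.format(len(cleaned)))
    skip, pre_barcode, barcode, post_barcode = cleaned
    # Lines 1 and 3 encode lengths as runs of N (upper or lower case).
    if set(skip.upper()) != {'N'} or set(barcode.upper()) != {'N'}:
        raise ValueError("lines 1 and 3 should contain only 'N' characters")
    return {
        'n_skipped_bases': len(skip),
        'pre_barcode_seq': pre_barcode,
        'barcode_length': len(barcode),
        'post_barcode_seq': post_barcode,
    }

# The example read model shown above, as a list of lines.
example = """nnnnnn
GATGTCCACGAGGTCTCT
NNNNNNNNNNNNNNNNNNNN
CGTACGCTGCAGGTCGACAATGATCCAAACTATCAGTGTTTGA
""".splitlines()

model = parse_read_model(example)
print(model['n_skipped_bases'], model['barcode_length'])
```

To check a real file, pass `open(ReadModel).read().splitlines()` once `ReadModel` has been defined.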

InsertionSequence: Full or relative path to a text file giving the full sequence of each insertion. This should be a FASTA-format file with two entries labeled 'insert' and 'plasmid', giving the full sequence of the expected insertion into the genome and the remaining source plasmid sequence, if appropriate. Example:

    >insert
    agtcac...
    >plasmid
    actgact... 
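
A quick way to confirm that the insertion file has the expected entry labels is to scan its header lines. This is a rough sketch under the assumptions stated above (entries named exactly 'insert' and 'plasmid'); the helper names are hypothetical, and since the plasmid entry is described as optional, the check reports missing names rather than raising an error.

```python
def fasta_entry_names(lines):
    """Return the names of FASTA records (text after '>' up to the first whitespace)."""
    names = []
    for line in lines:
        if line.startswith('>'):
            header = line[1:].split()
            names.append(header[0] if header else '')
    return names

def check_insertion_fasta(lines):
    """Return the set of expected entry names missing from the FASTA lines."""
    return {'insert', 'plasmid'} - set(fasta_entry_names(lines))

# Toy sequences for illustration only.
example = ['>insert', 'agtcac', '>plasmid', 'actgact']
print(check_insertion_fasta(example))  # prints set() when both entries are present
```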
      
GenomeSequence: Full or relative path to a .fasta file of genome sequences for all species in pool.

OutputDir: Full or relative path to the desired directory to save output files.

In [2]:
#CHANGE THESE AS NEEDED

metafile_name = 'TNSeq_mapping_metafile_pools30_to_41_individual.txt'
mapping_log_file = 'TNSeq_mapping_pools30_to_41_individual.log'

ReadModel = '/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt'
InsertionSequence = '/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt'
GenomeSequence = '/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt'
OutputDir = 'TNSeq_mapping_output_new_genome'
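
Given how long the mapping takes, it may be worth confirming that the input paths above exist before launching it. A minimal sketch, assuming a simple existence check is enough; `missing_paths` is a hypothetical helper and the example values are placeholders.

```python
import os

def missing_paths(paths):
    """Return the paths that do not exist on disk, in input order."""
    return [path for path in paths if not os.path.exists(path)]

# Placeholder values for illustration; in this notebook you would pass
# [ReadModel, InsertionSequence, GenomeSequence] instead.
example_inputs = [os.getcwd(), 'this_file_should_not_exist.txt']
print(missing_paths(example_inputs))
```

An empty list means every path was found.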

### Pool-specific information

In [4]:
#where you saved the merged FASTQs in TNseq_prepare_files.ipynb
FASTQ_directory = 'TNSeq_merge_FASTQ'

In [5]:
#list the merged FASTQs generated in the BBmerge step (TNseq_prepare_files.ipynb)
#copy and paste these into the cell below this one

!ls {FASTQ_directory}/*_merged.fastq

TNSeq_merge_FASTQ/REPEAT-RBAD30_S12_L002_merged.fastq
TNSeq_merge_FASTQ/REPEAT-RBAD31_S11_L002_merged.fastq
TNSeq_merge_FASTQ/REPEAT-RBAD32_S10_L002_merged.fastq
TNSeq_merge_FASTQ/REPEAT-RBAD33_S9_L002_merged.fastq
TNSeq_merge_FASTQ/REPEAT-RBAD34_S8_L002_merged.fastq
TNSeq_merge_FASTQ/REPEAT-RBAD35_S7_L002_merged.fastq
TNSeq_merge_FASTQ/REPEAT-RBAD36_S6_L002_merged.fastq
TNSeq_merge_FASTQ/REPEAT-RBAD37_S5_L002_merged.fastq
TNSeq_merge_FASTQ/REPEAT-RBAD38_S4_L002_merged.fastq
TNSeq_merge_FASTQ/REPEAT-RBAD39_S3_L002_merged.fastq
TNSeq_merge_FASTQ/REPEAT-RBAD40_S2_L002_merged.fastq
TNSeq_merge_FASTQ/REPEAT-RBAD41_S1_L002_merged.fastq
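
As an alternative to copying and pasting the file names by hand, the name and path pairs could be built programmatically. The sketch below assumes the pool number always follows 'RBAD' in the file name, as in the listing above; the function name, the regular expression, and the 'POOL' prefix are illustrative choices, not part of the pipeline.

```python
import os
import re

def build_short_names(fastq_paths, pattern=r'RBAD(\d+)', prefix='POOL'):
    """Pair each FASTQ path with a short name derived from its file name."""
    pairs = []
    for path in sorted(fastq_paths):
        match = re.search(pattern, os.path.basename(path))
        if match is None:
            raise ValueError('no pool number found in {}'.format(path))
        pairs.append((prefix + match.group(1), path))
    return pairs

example_paths = ['TNSeq_merge_FASTQ/REPEAT-RBAD31_S11_L002_merged.fastq',
                 'TNSeq_merge_FASTQ/REPEAT-RBAD30_S12_L002_merged.fastq']
print(build_short_names(example_paths))
```

With real data, `glob.glob(FASTQ_directory + '/*_merged.fastq')` would supply the list of paths.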


For each pool, come up with a UNIQUE short name for the pool.

In [6]:
#CHANGE THESE AS NEEDED BUT KEEP THE [('NAME', 'PATH/TO/FASTQ.FASTQ')] FORMAT

short_names_and_fastqs = [('POOL30', 'TNSeq_merge_FASTQ/REPEAT-RBAD30_S12_L002_merged.fastq'),
                          ('POOL31', 'TNSeq_merge_FASTQ/REPEAT-RBAD31_S11_L002_merged.fastq'),
                          ('POOL32', 'TNSeq_merge_FASTQ/REPEAT-RBAD32_S10_L002_merged.fastq'),
                          ('POOL33', 'TNSeq_merge_FASTQ/REPEAT-RBAD33_S9_L002_merged.fastq'),
                          ('POOL34', 'TNSeq_merge_FASTQ/REPEAT-RBAD34_S8_L002_merged.fastq'),
                          ('POOL35', 'TNSeq_merge_FASTQ/REPEAT-RBAD35_S7_L002_merged.fastq'),
                          ('POOL36', 'TNSeq_merge_FASTQ/REPEAT-RBAD36_S6_L002_merged.fastq'),
                          ('POOL37', 'TNSeq_merge_FASTQ/REPEAT-RBAD37_S5_L002_merged.fastq'),
                          ('POOL38', 'TNSeq_merge_FASTQ/REPEAT-RBAD38_S4_L002_merged.fastq'),
                          ('POOL39', 'TNSeq_merge_FASTQ/REPEAT-RBAD39_S3_L002_merged.fastq'),
                          ('POOL40', 'TNSeq_merge_FASTQ/REPEAT-RBAD40_S2_L002_merged.fastq'),
                          ('POOL41', 'TNSeq_merge_FASTQ/REPEAT-RBAD41_S1_L002_merged.fastq'),]
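
Since every short name has to be unique, a small check can catch copy-and-paste mistakes before the metafile is written. This is only a sketch; `duplicated_short_names` is a hypothetical helper and the example list is made up.

```python
from collections import Counter

def duplicated_short_names(names_and_fastqs):
    """Return short names that appear more than once, sorted alphabetically."""
    counts = Counter(short_name for short_name, _fastq in names_and_fastqs)
    return sorted(name for name, count in counts.items() if count > 1)

# Made-up example with one repeated name.
example = [('POOL30', 'a.fastq'), ('POOL31', 'b.fastq'), ('POOL30', 'c.fastq')]
print(duplicated_short_names(example))  # prints ['POOL30']
```

Calling `duplicated_short_names(short_names_and_fastqs)` should return an empty list for a valid set of names.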

### Writing the above info into a metafile

In [7]:
metafile_columns = ['Pool', 'ShortName', 'Fastq', 'ReadModel', 'InsertionSequence', 'GenomeSequence', 'OutputDir']

with open(metafile_name, 'w') as f:
    
    #write column names
    f.write('\t'.join(metafile_columns)+'\n')
    
    #write a line for each pool
    for pool, fastq in short_names_and_fastqs:
        
        #we write the pool name twice because we're first processing each pool separately then combining them
        #if we were combining them all at once then the first pool variable would be the name of the combined pool
        to_write = '\t'.join([pool, pool, fastq, ReadModel, InsertionSequence, GenomeSequence, OutputDir])
        f.write(to_write+'\n')
        

In [8]:
#view metafile
pd.read_csv(metafile_name, sep='\t')

Unnamed: 0,Pool,ShortName,Fastq,ReadModel,InsertionSequence,GenomeSequence,OutputDir
0,POOL30,POOL30,TNSeq_merge_FASTQ/REPEAT-RBAD30_S12_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
1,POOL31,POOL31,TNSeq_merge_FASTQ/REPEAT-RBAD31_S11_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
2,POOL32,POOL32,TNSeq_merge_FASTQ/REPEAT-RBAD32_S10_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
3,POOL33,POOL33,TNSeq_merge_FASTQ/REPEAT-RBAD33_S9_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
4,POOL34,POOL34,TNSeq_merge_FASTQ/REPEAT-RBAD34_S8_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
5,POOL35,POOL35,TNSeq_merge_FASTQ/REPEAT-RBAD35_S7_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
6,POOL36,POOL36,TNSeq_merge_FASTQ/REPEAT-RBAD36_S6_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
7,POOL37,POOL37,TNSeq_merge_FASTQ/REPEAT-RBAD37_S5_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
8,POOL38,POOL38,TNSeq_merge_FASTQ/REPEAT-RBAD38_S4_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
9,POOL39,POOL39,TNSeq_merge_FASTQ/REPEAT-RBAD39_S3_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome


## Map transposons in each pool to genome

View additional input and output options here: https://github.com/stcoradetti/RBseq/blob/master/README.txt

This takes a long time to run and will appear stuck on the "Looking for 20bp sequence barcode" step for a while. You can see how many reads have been processed by switching to the output directory specified above and typing:

    wc -l *blastquery.fasta
    
Dividing this number by 2 gives roughly the number of reads processed so far (each FASTA record is a header line plus a sequence line). You can also check the logFile specified below.
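
The same estimate can be made from inside a notebook cell. The sketch below follows the rule of thumb above (line count divided by 2); `estimate_reads_processed` is a hypothetical helper, and the directory name in the example is assumed to match `OutputDir`.

```python
import glob
import os

def estimate_reads_processed(output_dir):
    """Estimate reads processed per *blastquery.fasta file as line count // 2."""
    estimates = {}
    for path in sorted(glob.glob(os.path.join(output_dir, '*blastquery.fasta'))):
        with open(path) as handle:
            n_lines = sum(1 for _ in handle)
        estimates[os.path.basename(path)] = n_lines // 2
    return estimates

# Assumed output directory name; returns an empty dict if nothing is there yet.
print(estimate_reads_processed('TNSeq_mapping_output_new_genome'))
```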

In [None]:
cmd = 'nice -n 17 python3 /usr2/people/clairedubin/barseq/latest_pipeline/RBseq_Map_Insertions_v1.1.4_PBa_Jskerker.py --metafile {} --logFile {}'.format(metafile_name, mapping_log_file)

!{cmd}

RBseq_Map_Insertions.py
2022-05-20 11:55:37 Version: 1.1.4_PBa
2022-05-20 11:55:37 Release Date: August 8, 2020
2022-05-20 11:55:37 Options passed:  metafile:TNSeq_mapping_metafile_pools30_to_41_individual.txt  logFile:TNSeq_mapping_pools30_to_41_individual.log  minQual:10  matchBefore:6  matchAfter:6  matchJunction:4  maxFillerSeq:100  barcodeVariation:2  wobbleAllowed:1  scoreDiff:10  minFraction:0.9  filterNeighborhood:10  filterEditDistance:5  minPercentID:95  maxEvalue:0.1  useMappedFiles:False  noInsertHits:False  noBarcodes:False 
2022-05-20 11:55:37 Logging status updates in TNSeq_mapping_pools30_to_41_individual.log
2022-05-20 11:55:37 Loading TnSeq library metadata from TNSeq_mapping_metafile_pools30_to_41_individual.txt
2022-05-20 11:55:37 Finding barcodes in fastqs and mapping insertion locations
2022-05-20 11:55:37   Mapping reads from TNSeq_merge_FASTQ/REPEAT-RBAD30_S12_L002_merged.fastq using insertion model /usr2/people/clairedubin/barseq/latest_pipeline/Model_pJ

## Combine pools

Make a new metafile, this time combining the pools into one larger pool.

In [10]:
#CHANGE THESE AS NEEDED

combined_pool_name = 'POOLS30_to_41_combined'
metafile_name = 'TNSeq_mapping_metafile_pools30_to_41_combined.txt'
mapping_log_file = 'TNSeq_mapping_pools30_to_41_combined.log'

In [11]:
metafile_columns = ['Pool', 'ShortName', 'Fastq', 'ReadModel', 'InsertionSequence', 'GenomeSequence', 'OutputDir']

with open(metafile_name, 'w') as f:
    
    #write column names
    f.write('\t'.join(metafile_columns)+'\n')
    
    #write a line for each pool
    for pool, fastq in short_names_and_fastqs:
        
        #the pool name is now the combined pool name
        to_write = '\t'.join([combined_pool_name, pool, fastq, ReadModel, InsertionSequence, GenomeSequence, OutputDir])
        f.write(to_write+'\n')
        

Note how this metafile differs from the one above: the "Pool" column now holds the name of the combined pool.
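
To make the difference between the two metafiles concrete, here is a small sketch that builds the lines for either case from the same inputs. `metafile_lines` is a hypothetical helper and all values in the example are placeholders; the column names are the ones used in this notebook.

```python
def metafile_lines(names_and_fastqs, shared_fields, combined_pool_name=None):
    """Build tab-separated metafile lines, header first.

    If combined_pool_name is None, each pool's short name is also its 'Pool';
    otherwise every row shares combined_pool_name in the 'Pool' column.
    """
    columns = ['Pool', 'ShortName', 'Fastq', 'ReadModel',
               'InsertionSequence', 'GenomeSequence', 'OutputDir']
    lines = ['\t'.join(columns)]
    for short_name, fastq in names_and_fastqs:
        pool = combined_pool_name if combined_pool_name is not None else short_name
        lines.append('\t'.join([pool, short_name, fastq] + list(shared_fields)))
    return lines

# Placeholder values for illustration only.
pools = [('POOL30', 'a.fastq'), ('POOL31', 'b.fastq')]
shared = ['model.txt', 'insertion.txt', 'genome.fsa', 'out_dir']

individual = metafile_lines(pools, shared)
combined = metafile_lines(pools, shared, combined_pool_name='POOLS30_to_31_combined')
print(individual[1].split('\t')[0], combined[1].split('\t')[0])
```

Only the first column changes between the two versions; the ShortName and file columns stay the same.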

In [12]:
#view metafile
pd.read_csv(metafile_name, sep='\t')

Unnamed: 0,Pool,ShortName,Fastq,ReadModel,InsertionSequence,GenomeSequence,OutputDir
0,POOLS30_to_41_combined,POOL30,TNSeq_merge_FASTQ/REPEAT-RBAD30_S12_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
1,POOLS30_to_41_combined,POOL31,TNSeq_merge_FASTQ/REPEAT-RBAD31_S11_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
2,POOLS30_to_41_combined,POOL32,TNSeq_merge_FASTQ/REPEAT-RBAD32_S10_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
3,POOLS30_to_41_combined,POOL33,TNSeq_merge_FASTQ/REPEAT-RBAD33_S9_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
4,POOLS30_to_41_combined,POOL34,TNSeq_merge_FASTQ/REPEAT-RBAD34_S8_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
5,POOLS30_to_41_combined,POOL35,TNSeq_merge_FASTQ/REPEAT-RBAD35_S7_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
6,POOLS30_to_41_combined,POOL36,TNSeq_merge_FASTQ/REPEAT-RBAD36_S6_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
7,POOLS30_to_41_combined,POOL37,TNSeq_merge_FASTQ/REPEAT-RBAD37_S5_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
8,POOLS30_to_41_combined,POOL38,TNSeq_merge_FASTQ/REPEAT-RBAD38_S4_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome
9,POOLS30_to_41_combined,POOL39,TNSeq_merge_FASTQ/REPEAT-RBAD39_S3_L002_merged.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,TNSeq_mapping_output_new_genome


This step takes about 2 hours for 4 pools.

In [None]:
cmd = 'nice -n 17 python3 /usr2/people/clairedubin/barseq/latest_pipeline/RBseq_Map_Insertions_v1.1.4_PBa_Jskerker.py --useMappedFiles --metafile {} --logFile {}'.format(metafile_name, mapping_log_file)

!{cmd}

RBseq_Map_Insertions.py
2022-05-23 14:24:08 Version: 1.1.4_PBa
2022-05-23 14:24:08 Release Date: August 8, 2020
2022-05-23 14:24:08 Options passed:  metafile:TNSeq_mapping_metafile_pools30_to_41_combined.txt  logFile:TNSeq_mapping_pools30_to_41_combined.log  minQual:10  matchBefore:6  matchAfter:6  matchJunction:4  maxFillerSeq:100  barcodeVariation:2  wobbleAllowed:1  scoreDiff:10  minFraction:0.9  filterNeighborhood:10  filterEditDistance:5  minPercentID:95  maxEvalue:0.1  useMappedFiles:True  noInsertHits:False  noBarcodes:False 
2022-05-23 14:24:08 Logging status updates in TNSeq_mapping_pools30_to_41_combined.log
2022-05-23 14:24:08 Loading TnSeq library metadata from TNSeq_mapping_metafile_pools30_to_41_combined.txt
2022-05-23 14:24:08 --useMappedFiles option passed. Skipping mapping step and proceeding to pool analysis
2022-05-23 14:24:08 Processing mapped reads to characterize mutant pool(s)
2022-05-23 14:24:08 Analyzing pool: POOLS30_to_41_combined
2022-05-23 14:24:08   Readin