### Note: the mapping script takes days to run! Run this notebook inside screen so the script keeps running if you close this page.

If you close the notebook while using screen, the log output within the notebook might stop being printed. But the logfile specified in the line where you call the script will keep being written.
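
Because the logfile is the only reliable record once the notebook output stops printing, a small helper for peeking at its last lines can be handy. This is a minimal sketch; `last_lines` is a hypothetical helper name, not part of the mapping script.

```python
from collections import deque


def last_lines(lines, n_lines=10):
    """Return the last n_lines items of an iterable of lines, without newlines."""
    # a deque with maxlen keeps only the most recent n_lines entries in memory
    return [line.rstrip('\n') for line in deque(lines, maxlen=n_lines)]


# usage, once mapping_log_file has been defined further down in this notebook:
# with open(mapping_log_file) as handle:
#     print('\n'.join(last_lines(handle, 20)))
```

Passing an open file handle works because file objects iterate line by line, so even a large log is never loaded into memory all at once.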

How to run a Jupyter notebook with screen: https://docs.google.com/document/d/1lLcl3HhDBrn87M4-IhkpDLOmvmmf09-KeLSgn3U21hE/edit?usp=sharing

In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt

#makes viewing pandas tables better
pd.set_option('display.max_colwidth', 0)

In [3]:
!source .bashrc

/bin/bash: .bashrc: No such file or directory


In [2]:
!pip install pandas



## Make metafile for TNseq mapping script

### Info that will be the same for all pools

metafile_name: Name of file where we're going to write the metadata

combined_pool_name: What we're going to call the combination of individual pools

ReadModel: Full or relative path to a text file describing the expected structure of TnSeq reads.  The file consists of 4 lines.  Line 1 denotes the number of initial variable bases to ignore as 'N's.  Line 2 is the sequence preceding the barcode.  Line 3 is the expected length of sequence barcode denoted as 'N's.  Line 4 is the sequence between the barcode and the junction with the genome.  Example:

    nnnnnn
    GATGTCCACGAGGTCTCT
    NNNNNNNNNNNNNNNNNNNN
    CGTACGCTGCAGGTCGACAATGATCCAAACTATCAGTGTTTGA	
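
To sanity check a read model file before starting a long run, the four lines can be split into named parts. This is only an illustration of the layout described above, not how the mapping script itself parses the file. `parse_read_model` and the dictionary keys are hypothetical names, and the sketch assumes lines 1 and 3 are never empty.

```python
def parse_read_model(lines):
    """Split the four lines of a read model into named parts."""
    # drop surrounding whitespace and skip blank lines
    parts = [line.strip() for line in lines if line.strip()]
    if len(parts) != 4:
        raise ValueError('expected 4 non-empty lines, found {}'.format(len(parts)))
    leading, pre_barcode, barcode, post_barcode = parts
    # lines 1 and 3 are placeholders made only of n/N characters
    if set(leading.lower()) != {'n'} or set(barcode.lower()) != {'n'}:
        raise ValueError('lines 1 and 3 should consist only of n/N characters')
    return {
        'n_ignored_bases': len(leading),
        'pre_barcode_seq': pre_barcode,
        'barcode_length': len(barcode),
        'post_barcode_seq': post_barcode,
    }


# the example model from above; 'N' * 20 stands for the 20 bp barcode line
example = parse_read_model([
    'nnnnnn',
    'GATGTCCACGAGGTCTCT',
    'N' * 20,
    'CGTACGCTGCAGGTCGACAATGATCCAAACTATCAGTGTTTGA',
])
print(example['n_ignored_bases'], example['barcode_length'])
```

For a real file, `parse_read_model(open(ReadModel))` should work once `ReadModel` is set in the configuration cell.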

InsertionSequence: Full or relative path to a text file giving the full sequence of each insertion.  This should be a fasta format file with two entries labeled 'insert' and 'plasmid' giving the full sequence of the expected insertion into the genome and the remaining source plasmid sequence, if appropriate. Example:

    >insert
    agtcac...
    >plasmid
    actgact... 
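
A quick way to confirm that the insertion file carries both expected labels is to list its FASTA headers. The sketch below assumes a plain-text FASTA where each record starts with a `>` line; `fasta_record_names` and `missing_insertion_entries` are hypothetical helper names.

```python
def fasta_record_names(lines):
    """Return the first word after '>' for every FASTA header line."""
    names = []
    for line in lines:
        header = line[1:].strip() if line.startswith('>') else ''
        if header:
            names.append(header.split()[0])
    return names


def missing_insertion_entries(lines, required=('insert', 'plasmid')):
    """Return the required labels that are absent, sorted alphabetically."""
    return sorted(set(required) - set(fasta_record_names(lines)))


# toy records standing in for the real sequences
demo_missing = missing_insertion_entries(['>insert', 'agtcac', '>plasmid', 'actgact'])
print(demo_missing)
```

An empty list means both labels were found. For the real file, pass an open handle, for example `missing_insertion_entries(open(InsertionSequence))`.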
      
GenomeSequence: Full or relative path to a .fasta file of genome sequences for all species in pool.

OutputDir: Full or relative path to the desired directory to save output files.

In [2]:
#CHANGE THESE AS NEEDED

metafile_name = '../20032023_novaseq/TNSeq_mapping_metafile.txt'
mapping_log_file = '../20032023_novaseq/TNSeq_mapping.log'

ReadModel = '/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt'
InsertionSequence = '/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt'
GenomeSequence = '/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt'
OutputDir = '../20032023_novaseq/FullRun_Output'
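
Since the run takes days, it may be worth confirming that the input paths exist before launching it. A minimal sketch, with `missing_paths` as a hypothetical helper; `OutputDir` is left out of the suggested check because it is not confirmed here whether the mapping script creates it.

```python
import os


def missing_paths(paths):
    """Return the entries of a {label: path} dict whose path does not exist."""
    return {label: path for label, path in paths.items() if not os.path.exists(path)}


# usage with the variables defined in the cell above:
# missing_paths({'ReadModel': ReadModel,
#                'InsertionSequence': InsertionSequence,
#                'GenomeSequence': GenomeSequence})

# demo: the current directory exists, the made-up file name should not
demo = missing_paths({'here': os.curdir, 'nowhere': 'no_such_file_for_demo_12345.txt'})
print(demo)
```

An empty dictionary means every listed path was found.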

### Pool-specific information

In [3]:
#where you saved the merged FASTQs in TNseq_prepare_files.ipynb
FASTQ_directory = '../20032023_novaseq/TNSeq_merge_FASTQ'

In [4]:
#get the list of merged fastqs generated in the BBmerge step (TNseq_prepare_files.ipynb)
#copy and paste these into the cell below this one

!ls {FASTQ_directory}/*_merged.fastq

../20032023_novaseq/TNSeq_merge_FASTQ/ADRB501_deep_S1_L002_merged.fastq
../20032023_novaseq/TNSeq_merge_FASTQ/ADRB502_deep_S2_L002_merged.fastq
../20032023_novaseq/TNSeq_merge_FASTQ/ADRB503_deep_S3_L002_merged.fastq
../20032023_novaseq/TNSeq_merge_FASTQ/ADRB504_deep_S4_L002_merged.fastq
../20032023_novaseq/TNSeq_merge_FASTQ/ADRB505_deep_S5_L002_merged.fastq
../20032023_novaseq/TNSeq_merge_FASTQ/ADRB506_deep_S6_L002_merged.fastq
../20032023_novaseq/TNSeq_merge_FASTQ/KCRB001_deep_S7_L002_merged.fastq
../20032023_novaseq/TNSeq_merge_FASTQ/KCRB002_deep_S8_L002_merged.fastq
../20032023_novaseq/TNSeq_merge_FASTQ/KCRB003_deep_S9_L002_merged.fastq
../20032023_novaseq/TNSeq_merge_FASTQ/KCRB004_deep_S10_L002_merged.fastq
../20032023_novaseq/TNSeq_merge_FASTQ/KCRB005_deep_S11_L002_merged.fastq
../20032023_novaseq/TNSeq_merge_FASTQ/Undetermined_S0_L002_merged.fastq


For each pool, come up with a UNIQUE short name for the pool.

In [5]:
#CHANGE THESE AS NEEDED BUT KEEP THE [('NAME', 'PATH/TO/FASTQ.FASTQ')] FORMAT

short_names_and_fastqs = [('28C_ADRB501', '../20032023_novaseq/TNSeq_merge_FASTQ/ADRB501_deep_S1_L002_merged.fastq'),
                          ('28C_ADRB502', '../20032023_novaseq/TNSeq_merge_FASTQ/ADRB502_deep_S2_L002_merged.fastq'),
                          ('28C_ADRB503', '../20032023_novaseq/TNSeq_merge_FASTQ/ADRB503_deep_S3_L002_merged.fastq'),
                          ('42C_ADRB504', '../20032023_novaseq/TNSeq_merge_FASTQ/ADRB504_deep_S4_L002_merged.fastq'),
                          ('42C_ADRB505', '../20032023_novaseq/TNSeq_merge_FASTQ/ADRB505_deep_S5_L002_merged.fastq'),
                          ('42C_ADRB506', '../20032023_novaseq/TNSeq_merge_FASTQ/ADRB506_deep_S6_L002_merged.fastq'),
                          ('28C_KCRB001', '../20032023_novaseq/TNSeq_merge_FASTQ/KCRB001_deep_S7_L002_merged.fastq'),
                          ('28C_KCRB002', '../20032023_novaseq/TNSeq_merge_FASTQ/KCRB002_deep_S8_L002_merged.fastq'),
                          ('28C_KCRB003', '../20032023_novaseq/TNSeq_merge_FASTQ/KCRB003_deep_S9_L002_merged.fastq'),
                          ('42C_KCRB004', '../20032023_novaseq/TNSeq_merge_FASTQ/KCRB004_deep_S10_L002_merged.fastq'),
                          ('42C_KCRB005', '../20032023_novaseq/TNSeq_merge_FASTQ/KCRB005_deep_S11_L002_merged.fastq')]
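
Because every pool needs a UNIQUE short name, a quick duplicate check on the list can catch copy and paste mistakes before the metafile is written. A minimal sketch; `duplicated_short_names` is a hypothetical helper and the demo pairs are placeholders.

```python
from collections import Counter


def duplicated_short_names(names_and_fastqs):
    """Return short names that occur more than once, sorted alphabetically."""
    counts = Counter(name for name, _fastq in names_and_fastqs)
    return sorted(name for name, count in counts.items() if count > 1)


# placeholder pairs in the same [('NAME', 'PATH/TO/FASTQ.FASTQ')] format
demo_pairs = [('poolA', 'a.fastq'), ('poolB', 'b.fastq'), ('poolA', 'c.fastq')]
print(duplicated_short_names(demo_pairs))
```

Calling `duplicated_short_names(short_names_and_fastqs)` on the real list should return an empty list when all names are unique.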

In [6]:
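#TEST RUN: this overwrites the full list above with a single subsampled FASTQ; skip this cell for the full run
#CHANGE AS NEEDED BUT KEEP THE [('NAME', 'PATH/TO/FASTQ.FASTQ')] FORMAT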

short_names_and_fastqs = [('28C_ADRB501_test', '../20032023_novaseq/TNSeq_merge_FASTQ/ADRB501_deep_S1_L002_merged_subsample.fastq'),
                          ]

### Writing the above info into a metafile

In [7]:
metafile_columns = ['Pool', 'ShortName', 'Fastq', 'ReadModel', 'InsertionSequence', 'GenomeSequence', 'OutputDir']

with open(metafile_name, 'w') as f:
    
    #write column names
    f.write('\t'.join(metafile_columns)+'\n')
    
    #write a line for each pool
    for pool, fastq in short_names_and_fastqs:
        
        #we write the pool name twice because we're first processing each pool separately then combining them
        #if we were combining them all at once then the first pool variable would be the name of the combined pool
        to_write = '\t'.join([pool, pool, fastq, ReadModel, InsertionSequence, GenomeSequence, OutputDir])
        f.write(to_write+'\n')
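
The comments above mention two layouts: each pool processed separately (the pool name repeated in the first two columns) or all pools combined at once (the combined name in the first column). The sketch below shows one way to build either layout with the standard `csv` module. `metafile_rows`, `write_metafile`, and the `combined_pool_name` argument are hypothetical names for illustration, and the demo values are placeholders written to an in-memory buffer rather than to `metafile_name`.

```python
import csv
import io

METAFILE_COLUMNS = ['Pool', 'ShortName', 'Fastq', 'ReadModel',
                    'InsertionSequence', 'GenomeSequence', 'OutputDir']


def metafile_rows(names_and_fastqs, shared_fields, combined_pool_name=None):
    """Build one metafile row per pool.

    shared_fields: [ReadModel, InsertionSequence, GenomeSequence, OutputDir]
    combined_pool_name: if given, fills the Pool column of every row;
    otherwise each row repeats its own short name there.
    """
    rows = []
    for short_name, fastq in names_and_fastqs:
        pool = short_name if combined_pool_name is None else combined_pool_name
        rows.append([pool, short_name, fastq] + list(shared_fields))
    return rows


def write_metafile(handle, rows):
    """Write the header and rows as tab-separated text to an open handle."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerow(METAFILE_COLUMNS)
    writer.writerows(rows)


# demo with placeholder values
buffer = io.StringIO()
demo_rows = metafile_rows([('poolA', 'a.fastq'), ('poolB', 'b.fastq')],
                          ['model.txt', 'insert.fasta', 'genome.fasta', 'out'],
                          combined_pool_name='combined')
write_metafile(buffer, demo_rows)
print(buffer.getvalue())
```

Leaving `combined_pool_name` as `None` reproduces the per-pool layout written by the cell above.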
        

In [8]:
#view metafile
pd.read_csv(metafile_name, sep='\t')

Unnamed: 0,Pool,ShortName,Fastq,ReadModel,InsertionSequence,GenomeSequence,OutputDir
0,28C_ADRB501_test,28C_ADRB501_test,../20032023_novaseq/TNSeq_merge_FASTQ/ADRB501_deep_S1_L002_merged_subsample.fastq,/usr2/people/clairedubin/barseq/latest_pipeline/Model_pJC9BC.txt,/usr2/people/clairedubin/barseq/latest_pipeline/kluyv_insertion_corrected.txt,/usr2/people/clairedubin/barseq/genome_and_annotations/Klac_Kmar.fsa.txt,../20032023_novaseq/FullRun_Output


## Map transposons in each pool to genome

View additional input and output options here: https://github.com/stcoradetti/RBseq/blob/master/README.txt

This takes a long time to run and will appear stuck on the "Looking for 20bp sequence barcode" step for a while. You can see how many reads have been processed by switching to the output directory specified above and typing:

    wc -l *blastquery.fasta
    
This number divided by 2 is roughly how many reads have been processed (each read presumably takes two lines in the FASTA file: a header and a sequence). You can also check the logFile specified below.
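
The same progress check can be done from Python by counting FASTA header lines instead of halving the line count. This is a sketch that assumes the `*blastquery.fasta` files are plain FASTA with one `>` header per read; `count_fasta_records` and `blastquery_progress` are hypothetical helper names.

```python
import glob
import os


def count_fasta_records(lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith('>'))


def blastquery_progress(output_dir):
    """Map each *blastquery.fasta file in output_dir to its record count."""
    counts = {}
    pattern = os.path.join(output_dir, '*blastquery.fasta')
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            counts[os.path.basename(path)] = count_fasta_records(handle)
    return counts


# usage with the directory configured earlier in this notebook:
# blastquery_progress(OutputDir)
```

Counting headers gives the same answer as the `wc -l` approach when every sequence sits on a single line, and stays correct if sequences happen to wrap across several lines.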

In [9]:
cmd = 'nice -n 17 python3 /usr2/people/kayleeec/rhseq/RHSeq_without_barcodes/RHseq_Map_Insertions_v1-1.1.4_PBa_Jskerker.py --metafile {} --logFile {}'.format(metafile_name, mapping_log_file)

!{cmd}

RBseq_Map_Insertions.py
2023-03-29 19:52:09 Version: 1.1.4_PBa
2023-03-29 19:52:09 Release Date: August 8, 2020
2023-03-29 19:52:09 Options passed:  metafile:../20032023_novaseq/TNSeq_mapping_metafile.txt  logFile:../20032023_novaseq/TNSeq_mapping.log  minQual:10  matchBefore:6  matchAfter:6  matchJunction:4  maxFillerSeq:100  barcodeVariation:2  wobbleAllowed:1  scoreDiff:10  minFraction:0.9  filterNeighborhood:10  filterEditDistance:5  minPercentID:95  maxEvalue:0.1  useMappedFiles:False  noInsertHits:False  noBarcodes:False 
2023-03-29 19:52:09 Logging status updates in ../20032023_novaseq/TNSeq_mapping.log
2023-03-29 19:52:09 Loading TnSeq library metadata from ../20032023_novaseq/TNSeq_mapping_metafile.txt
2023-03-29 19:52:09 Finding barcodes in fastqs and mapping insertion locations
2023-03-29 19:52:09   Mapping reads from ../20032023_novaseq/TNSeq_merge_FASTQ/ADRB501_deep_S1_L002_merged_subsample.fastq using insertion model /usr2/people/clairedubin/barseq/latest_pipeline/Model_p

In [19]:
!pip install blastpy3

Collecting blastpy3
  Using cached blastpy3-0.3.0.tar.gz (9.6 kB)
  Preparing metadata (setup.py) ... done
Collecting pyfaidx>=0.5.8
  Downloading pyfaidx-0.7.2.1-py3-none-any.whl (28 kB)
Building wheels for collected packages: blastpy3
  Building wheel for blastpy3 (setup.py) ... done
  Created wheel for blastpy3: filename=blastpy3-0.3.0-py3-none-any.whl size=10722 sha256=52358464c3e0c38da3e7a707c970d37aab035b9afb886bf8728e3f4d21e3713d
  Stored in directory: /auto/sahara/namib/home/kayleeec/.cache/pip/wheels/b6/51/c1/3a73571decb662a4388e7e8965f1ba50575c0a49f0df8ae53a
Successfully built blastpy3
Installing collected packages: pyfaidx, blastpy3
Successfully installed blastpy3-0.3.0 pyfaidx-0.7.2.1
