ISMapper searches for IS positions in sequence data using paired end Illumina short reads, an IS query/queries of interest and a reference genome. ISMapper reports the IS positions it has found in each isolate, relative to the provided reference genome.
A note on version 2: ISMapper was recently updated. The code has been re-written in Python 3, and re-structured to enable easier debugging. Users can now provide multiple reference genomes (either as indepedent files or a multi-entry genbank file) to ISMapper. During the re-structure, the 'assembly improvement' mode was removed. However, this mode can still be found in version 1 if required.
For a more in depth description of how the program works, see the section Method below.
For support, please create an issue in the GitHub issue tracker.
Read more about ISMapper here:
Table of contents
- Python v3.6+ (not compatible with Python 2)
- BioPython v1.63 to v1.77 (not compatible with v1.78 or higher)
- BWA v0.7.5a or later
- SAMtools v0.1.19 or later
- bedtools v2.20.1 or later
- BLAST+ v2.2.28 or later
Two install methods are described below - the easiest method is to install with conda.
Installation from conda
To install from conda, first you will need to install conda.
The conda install requires the channels
conda-forge. You can install ISMapper as its own environment using
conda create -n ismapper-bioconda -c conda-forge -c bioconda ismapper
To use ISMapper, type
conda activate ismapper-bioconda
Installation with pip
Install Python and its dependencies first.
Download ISMapper from GitHub:
git clone https://github.com/jhawkey/IS_mapper/
Install with pip:
pip install IS_mapper/
Alternatively, you can install ISMapper directly from github with pip.
pip install git+https://github.com/jhawkey/IS_mapper
To install locally:
pip install --user git+https://github.com/jhawkey/IS_mapper
To install a specific release:
pip install git+https://github.com/jhawkey/IS_mapper@my_tag
Test ISMapper is installed:
ismap --version compiled_table.py -h
Running a test case
To check that all dependencies are working correctly and outputs are as expected:
Download Streptococcus suis P1/7 reads from SRA
cd IS_mapper/test/inputs wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR225/ERR225612/ERR225612_*.fastq.gz
Run ISMapper, using the reads you've just downloaded, as well as the provided IS query and reference genbank in test/inputs
ismap --reads IS_mapper/test/inputs/ERR225612_*.fastq.gz --queries IS_mapper/test/inputs/ISSsu3.fasta --reference IS_mapper/test/inputs/S_suis_P17.gbk --output_dir IS_mapper/test/
The resulting table files should match those found in
ISMapper requires three main inputs:
- paired end reads
- a reference genome
- an IS query
--reads Multiple read sets can be supplied using a wildcard (eg:
/path/to/reads/*.fastq.gz). They can also be supplied one after the other, separated by spaces. Regardless of how the reads are supplied, ISMapper will pair the reads belonging to the same isolate together.
--queries Multiple IS queries can also be supplied, either by spaces, or as a multifasta file. ISMapper will run queries sequentially in the same output folder.
--reference Multiple reference genomes can also be supplied, again either separated by spaces, or as a multigenbank file.
Note: All reference genomes must have a locus_tag for each CDS, tRNA or rRNA feature.
--output_dir Path to the directory where you would like your output files to go. A subdirectory for each isolate will be made within this folder. Within each isolate subdirectory, subdirectories will be made for each IS query supplied.
--log Prefix for the log file. If not supplied, the prefix will be current date and time.
--help_all Display the extended help and advanced options.
ismap --reads [isolateA_1.fastq.gz] [isolateA_2.fastq.gz] [isolateB_1.fastq.gz] [isolateB_2.fastq.gz] --queries IS_query.fasta --reference reference_genome.gbk --output_dir
All ISMapper options
Basic ISMapper options: --reads READS [READS ...] Paired end reads for analysing (can be gzipped) --queries QUERIES [QUERIES ...] Multifasta file for query gene(s) (eg: insertion sequence) that will be mapped to. --reference REFERENCE [REFERENCE ...] Reference genome for typing against in genbank format --output_dir OUTPUT_DIR Location for all output files (default is current directory). --log LOG Prefix for log file. If not supplied, prefix will be current date and time. --help_all HELP_ALL Display extended help Parameters for defining hits: --min_clip MIN_CLIP Minimum size for softclipped region to be extracted from initial mapping (default 10). --max_clip MAX_CLIP Maximum size for softclipped regions to be included (default 30). --cutoff CUTOFF Minimum depth for mapped region to be kept in bed file (default 6) --novel_gap_size NOVEL_GAP_SIZE Distance in base pairs between left and right flanks to be called a novel hit (default 15) --min_range MIN_RANGE Minimum percent size of the gap to be called a known hit (default 0.9, or 90 percent) --max_range MAX_RANGE Maximum percent size of the gap to be called a known hit (default 1.1, or 110 percent) --merging MERGING Value for merging left and right hits in bed files together to simply calculation of closest and intersecting regions (default 100). BWA parameters: --a Switch on all alignment reporting for bwa. --T T Mapping quality score for bwa (default 30). --t T Number of threads for bwa (default 1). Parameters for output table: --cds CDS qualifier containing gene information (default product). Also note that all CDS features MUST have a locus_tag --trna TRNA qualifier containing gene information (default product). Also note that all tRNA features MUST have a locus_tag --rrna RRNA qualifier containing gene information (default product). Also note that all rRNA features MUST have a locus_tag Reporting parameters: --temp Switch on keeping the temp folder instead of deleting it at the end of the run --bam Switch on keeping the final bam files instead of deleting them at the end of the run
A note on version 2
ISMapper version 2 is now exclusively for finding IS insertion sites relative to a reference genome. This version of ISMapper is unable to operate in 'assembly improvement mode', as was documented for version 1. If you need this functionality, please download the latest version 1 branch.
Single isolate output
For each isolate, a folder is made for each IS query given. Within the folder for the IS query are the output files pertaining to that query. There will be the same set of output files for each reference sequence provided.
The main output file is
[isolate]__[reference]_table.txt. This file contains one line per IS position found. A description of the column names is below:
region: a numeric code given to the IS position
orientation: the orientation of the IS position, either
x: the left coordinate of the position (always the smallest value, regardless of orientation)
y: the right coordinate of the position (always the largest value, regardless of orientation)
gap: the distance between the
ycoordinates. A negative value indicates that the flanking regions overlap. The sequence of this overlap is usually the target site duplication the IS has made when it inserted.
call: determination of whether the position hit is a novel one (not in the reference) or known (in the reference). A
*indicates that the gap size is larger than expected, so this hit is considered imprecise. A
?indicates that either the left or right flank was below the depth cutoff, and so this hit is uncertain.
percent_ID: if the IS hit is a known one, this is the percentage nucleotide identity to the IS query
percent_cov: if the IS hit is a known one, this is the percentage coverage of the region to the IS query
left_gene: locus tag ID of the gene closest to the left side of the IS position
left_description: description of the gene feature from the left side of the IS position. By default this is the
productof the gene listed in the reference GenBank.
left_strand: DNA strand (1: 5' -> 3', -1: 3' -> 5') of the gene on the left side of the IS position
left_distance: distance of the IS hit from the start codon of the gene on the left side of the IS position
right_gene: locus tag ID of the gene closest to the right side of the IS position
right_description: description of the gene feature from the right side of the IS position. By default this is the
right_strand: DNA strand (1: 5' -> 3', -1: 3' -> 5') of the gene on the right side of the IS hit
right_distance: distance of the IS hit from the start codon of the gene on the right side of the IS position
Falsedepending on whether the IS position has interrupted a gene
Once ISMapper has completed, the individual output tables for each isolate can be combined to create a single table containing all IS positions across all isolates using
Required parameters are:
--tables Each individual isolate table from a single IS query and single reference genbank.
--reference The reference genbank used to generate the individual tables. This must be a single entry genbank file.
--query The IS query sequence in fasta format.
This script generates two output files.
_full_compiled.txt file contains isolates as rows, and merged IS positions as columns. The first isolate row are the positions found in the reference genome. Values in the cells are either
- for absent,
+ for present,
* for present but an imprecise hit and
? for present but an unconfident hit. The final nine rows in the table show orientation of the IS position, and surrounding gene information for each IS hit.
_binary_compiled.txt contains the same information as the
_full_compiled.txt table, but the presence or absence of the IS positions are encoded as either
1. In this table, the additional rows showing orientation of the IS position and surrounding gene information is absent. By default, IS positions found to be imprecise (
*) are encoded as present
1 and unconfident positions (
?) are encoded as absent
0. However, these values can be changed using the
--unconfident flags when running
All options for
compiled_tabe.py are show below.
optional arguments: -h, --help show this help message and exit --tables TABLES [TABLES ...] tables to compile --reference REFERENCE gbk file of reference --query QUERY fasta file for insertion sequence query for compilation --gap GAP distance between regions to call overlapping, default is 0 --cds CDS qualifier containing gene information (default product). Also note that all CDS features MUST have a locus_tag --trna TRNA qualifier containing gene information (default product). Also note that all tRNA features MUST have a locus_tag --rrna RRNA qualifier containing gene information (default product). Also note that all rRNA features MUST have a locus_tag --imprecise IMPRECISE Binary value for imprecise (*) hit (can be 1, 0 or 0.5), default is 1 --unconfident UNCONFIDENT Binary value for questionable (?) hit (can be 1, 0 or 0.5), default is 0 --out_prefix OUT_PREFIX Prefix for output file
ISMapper finds locations of an IS query in short read data using a series of mapping steps.
First, all reads for an isolate are mapped to the IS query using BWA. The reads we are interested in are the ones whose pairs have mapped, but are unmapped themselves (so therefore must be flanking the IS query). These reads are selected using SamTools.
These unmapped pairs are selected and placed into two groups - left end (flanking the left end of the IS query) and right end (flanking the right end of the IS query). Each of these groups are independently mapped (with BWA) to the reference genome, which may or may not have a particular IS query location present.
From this mapping, Bedtools is used to find the depth at each position in the reference genome. We are looking for large peaks of left end and right end reads that indicate a possibly IS query location. Positions where the read depth falls below 6 (default, can be changed using the
cutoff parameter) are eliminated from further analysis, as these may be incorrect alignments.
Overlapping, or close, flanking regions are merged using the Bedtools merge function to prevent multiple hits representing the same region in the final output file. By default, close regions are defined as being <= 100 bp apart.
Once left and right flanking regions have been identified, Bedtools is used to find pairs of left and right flanks which either:
a) intersect or are <= 15 bp (default) apart, which indicates a novel position that is not present in the reference, or;
b) separated by a distance which is approximately equal to the size of the IS query, which indicates a position that is probably in the reference.
All potential IS positions are then assessed to determine if they are tabulated into the final results file. Any hits which are removed are placed in the
_removedHits.txt file, with the reason for their removal.