# InSilicoSeq

InSilicoSeq is a sequencing simulator producing realistic Illumina reads. Primarily intended for simulating metagenomic samples, it can also be used to produce sequencing data from a single genome.

References:
- Doc: https://insilicoseq.readthedocs.io/en/latest/iss/generate.html
- GitHub: https://github.com/HadrienG/InSilicoSeq

## Environment
- Colab

In [11]:
!lscpu |grep 'Model name'                   # CPU
!cat /proc/cpuinfo | grep processor | wc -l # threads

Model name:                      Intel(R) Xeon(R) CPU @ 2.20GHz
2
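
The `iss generate` cells further down pass `--cpus 16`, although this runtime reports 2 threads. A minimal sketch, assuming you would rather derive the value from the machine than hard-code it; `pick_cpus` is a hypothetical helper name and is not part of InSilicoSeq:

```python
import os


def pick_cpus(requested=None):
    """Return a CPU count that does not exceed what the machine reports."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    if requested is None:
        return available
    # Clamp the request to the range [1, available]
    return max(1, min(requested, available))


cpus = pick_cpus(16)
print(f"--cpus {cpus}")
```

The printed value can then be pasted into the `--cpus` option of the commands below.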


## Installation

In [1]:
# Install Miniconda, then enable the conda-forge and bioconda channels

! wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
! chmod +x miniconda.sh
! bash ./miniconda.sh -b -f -p /usr/local
! rm miniconda.sh
! conda config --add channels conda-forge
! conda config --add channels bioconda
! conda install -y mamba
! mamba update -qy --all
! mamba clean -qafy
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

--2023-02-02 10:18:46--  https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.130.3, 104.16.131.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.130.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 89026327 (85M) [application/x-sh]
Saving to: ‘miniconda.sh’


2023-02-02 10:18:47 (150 MB/s) - ‘miniconda.sh’ saved [89026327/89026327]

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): - \ done
Solving environment: / - \ done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - _openmp_mutex==4.5=1_gnu
    - brotlipy==0.7.0=py37h27cfd23_1003
    - ca-certificates==2021.7.5=h06a4308_1
    - certifi==2021.5.30=py37h06a4308_0
    - cffi==1.14.6=py37h400218f_0
    - chardet==4.0.0=py37h06a4308_1003
    - conda-package-handling==1.7.3=

In [2]:
!conda --version
!conda config --show channels

conda 22.9.0
channels:
  - bioconda
  - conda-forge
  - defaults


In [43]:
!conda install -y -c bioconda insilicoseq
!conda install -y -c bioconda simlord

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 22.9.0
  latest version: 22.11.1

Please update

In [44]:
!iss --version
!simlord --version

iss version 1.5.4
SimLoRD v1.0.4


In [49]:
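# Download the example data (mock bacterial genomes)
# Note: the first download below failed with "curl: (23) Failed writing header".
# With -J, older curl versions refuse to overwrite an existing file, so this
# likely means SRS121011.fasta was already present; remove it before re-running.

!curl -O -J -L https://osf.io/thser/download # SRS121011.fasta
!curl -O -J -L https://osf.io/37kg8/download # minigut.fasta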

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   459  100   459    0     0   1699      0 --:--:-- --:--:-- --:--:--  1706
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
curl: (23) Failed writing header
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   459  100   459    0     0   1171      0 --:--:-- --:--:-- --:--:--  1173
100 1733k  100 1733k    0     0  1320k      0  0:00:01  0:00:01 --:--:-- 2414k


## Experiment

In [16]:
!iss --help

usage: iss [subcommand] [options]

InSilicoSeq: A sequencing simulator

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  print software version and exit

available subcommands:
  
    model        generate an error model from a bam file
    generate     simulate reads from an error model


### Simulating NovaSeq, HiSeq or MiSeq reads

- MiSeq: a small benchtop sequencer with lower throughput and lower cost, suited to small-scale applications such as targeted sequencing and gene expression analysis.

- HiSeq: a high-throughput sequencer capable of sequencing large genomes, transcriptomes, and epigenomes. It supports single-end and paired-end reads and is suited to large-scale genomic projects.

- NovaSeq: a high-throughput sequencer with a flexible design that supports whole-genome, transcriptome, and targeted sequencing. It offers higher output and a lower cost per genome than the HiSeq platform, which has made it a popular choice among researchers.

| Model name | Read length |
|------------|-------------|
| MiSeq      | 300 bp      |
| HiSeq      | 125 bp      |
| NovaSeq    | 150 bp      |
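
The read lengths above determine how much of a genome a given number of reads covers. The sketch below applies the usual estimate (average depth = total sequenced bases / genome size). The function name `expected_coverage` and the 5,000,000 bp genome size are assumptions chosen for illustration, not values taken from InSilicoSeq:

```python
# Read lengths of the three models, as listed above.
READ_LENGTH_BP = {"miseq": 300, "hiseq": 125, "novaseq": 150}


def expected_coverage(n_reads, model, genome_size_bp):
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * READ_LENGTH_BP[model.lower()] / genome_size_bp


# Assumed genome size of 5,000,000 bp, used only as an example.
for model in READ_LENGTH_BP:
    depth = expected_coverage(10_000, model, 5_000_000)
    print(f"{model}: {depth:.2f}x")
```

With only 10,000 reads the depth stays well below 1x for a bacterial-sized genome, which is fine for a quick test but too low for most downstream analyses.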

In [17]:
!iss generate --help

usage: iss generate [-h] [--quiet | --debug] [--seed <int>] [--cpus <int>]
                    [--genomes <genomes.fasta> [<genomes.fasta> ...]]
                    [--draft <draft.fasta> [<draft.fasta> ...]]
                    [--n_genomes <int>] [--ncbi [<str> [<str> ...]]]
                    [--n_genomes_ncbi [<int> [<int> ...]]]
                    [--abundance <str> | --abundance_file <abundance.txt> | --coverage <str> | --coverage_file <coverage.txt>]
                    [--n_reads <int>] [--mode <str>] [--model <npz>]
                    [--gc_bias] [--compress] --output <fastq>

simulate reads from an error model

arguments:
  -h, --help            show this help message and exit
  --quiet, -q           Disable info logging. (default: False).
  --debug, -d           Enable debug logging. (default: False).
  --seed <int>          Seed all the random number generators
  --cpus <int>, -p <int>
                        number of cpus to use. (default: 2).
  --genomes <genomes.fast

In [27]:
!wget https://github.com/jingwora/bioinformatics-tools/raw/main/tools/InSilicoSeq/Escherichia_coli_042.fasta

--2023-02-02 11:02:14--  https://github.com/jingwora/bioinformatics-tools/raw/main/tools/InSilicoSeq/Escherichia_coli_042.fasta
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/jingwora/bioinformatics-tools/main/tools/InSilicoSeq/Escherichia_coli_042.fasta [following]
--2023-02-02 11:02:14--  https://raw.githubusercontent.com/jingwora/bioinformatics-tools/main/tools/InSilicoSeq/Escherichia_coli_042.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5316913 (5.1M) [text/plain]
Saving to: ‘Escherichia_coli_042.fasta’


2023-02-02 11:02:15 (64.5 MB/s) - ‘Escherichia_coli_042.fasta’ saved [5316913

In [35]:
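# --n_reads  10,000 reads in total (5,000 read pairs)
# --genomes  reference genome: Escherichia_coli_042.fasta
# --model    use the NovaSeq error model
# --output   output prefix: novaseq_reads_ecoli (writes _R1.fastq and _R2.fastq)
# --cpus     16 (more than the 2 threads this runtime reports)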

!iss generate --n_reads 10000 --genomes Escherichia_coli_042.fasta --model novaseq --output novaseq_reads_ecoli --cpus 16

INFO:iss.app:Starting iss generate
INFO:iss.app:Using kde ErrorModel
INFO:iss.util:Stitching input files together
INFO:iss.app:Using lognormal abundance distribution
INFO:iss.app:Using 16 cpus for read generation
INFO:iss.app:Generating 10000 reads
INFO:iss.app:Generating reads for record: FN554766.1
INFO:iss.util:Stitching input files together
INFO:iss.util:Stitching input files together
INFO:iss.util:Cleaning up
INFO:iss.app:Read generation complete


In [36]:
!du -sh novaseq_reads_ecoli_R1.fastq novaseq_reads_ecoli_R2.fastq

1.6M	novaseq_reads_ecoli_R1.fastq
1.6M	novaseq_reads_ecoli_R2.fastq
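
File sizes only hint at how many reads were written. A minimal sketch for counting the records directly, assuming an uncompressed FASTQ file with exactly four lines per record; `count_fastq_records` is a hypothetical helper, and the demo writes its own tiny file so it runs without the simulation output:

```python
import os
import tempfile


def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file (4 lines per record)."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


# Tiny two-record file, used only to demonstrate the helper.
demo = "@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_R1.fastq")
    with open(demo_path, "w") as fh:
        fh.write(demo)
    print(count_fastq_records(demo_path))  # 2 records in the demo file
```

Calling `count_fastq_records("novaseq_reads_ecoli_R1.fastq")` in the notebook would report how many forward reads the simulation produced.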


In [38]:
# HiSeq model

# --n_reads  10,000 reads in total
# --genomes  Escherichia_coli_042.fasta
# --model    hiseq
# --output   output prefix: hiseq_reads_ecoli
# --cpus     16

!iss generate --n_reads 10000 --genomes Escherichia_coli_042.fasta --model hiseq --output hiseq_reads_ecoli --cpus 16

INFO:iss.app:Starting iss generate
INFO:iss.app:Using kde ErrorModel
INFO:iss.util:Stitching input files together
INFO:iss.app:Using lognormal abundance distribution
INFO:iss.app:Using 16 cpus for read generation
INFO:iss.app:Generating 10000 reads
INFO:iss.app:Generating reads for record: FN554766.1
INFO:iss.util:Stitching input files together
INFO:iss.util:Stitching input files together
INFO:iss.util:Cleaning up
INFO:iss.app:Read generation complete


In [39]:
!du -sh hiseq_reads_ecoli_R1.fastq hiseq_reads_ecoli_R2.fastq

1.4M	hiseq_reads_ecoli_R1.fastq
1.4M	hiseq_reads_ecoli_R2.fastq


In [40]:
# MiSeq model

!iss generate --n_reads 10000 --genomes Escherichia_coli_042.fasta --model miseq --output miseq_reads_ecoli --cpus 16

INFO:iss.app:Starting iss generate
INFO:iss.app:Using kde ErrorModel
INFO:iss.util:Stitching input files together
INFO:iss.app:Using lognormal abundance distribution
INFO:iss.app:Using 16 cpus for read generation
INFO:iss.app:Generating 10000 reads
INFO:iss.app:Generating reads for record: FN554766.1
INFO:iss.util:Stitching input files together
INFO:iss.util:Stitching input files together
INFO:iss.util:Cleaning up
INFO:iss.app:Read generation complete


In [41]:
!du -sh miseq_reads_ecoli_R1.fastq miseq_reads_ecoli_R2.fastq

3.0M	miseq_reads_ecoli_R1.fastq
3.0M	miseq_reads_ecoli_R2.fastq
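
The three cells above differ only in the model name and output prefix. A minimal sketch of generating the same commands in a loop; `build_iss_command` is a hypothetical helper, and it uses only the options that appear in the commands above:

```python
import shlex


def build_iss_command(model, genomes, n_reads, output, cpus):
    """Build the argument list for one `iss generate` run."""
    return [
        "iss", "generate",
        "--n_reads", str(n_reads),
        "--genomes", *genomes,
        "--model", model,
        "--output", output,
        "--cpus", str(cpus),
    ]


for model in ("novaseq", "hiseq", "miseq"):
    cmd = build_iss_command(
        model, ["Escherichia_coli_042.fasta"], 10000, f"{model}_reads_ecoli", 2
    )
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To actually run it (requires iss to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids quoting problems if a file name contains spaces.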


## Simulating PacBio and Nanopore reads

In [45]:
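# Use SimLoRD to simulate PacBio reads (Nanopore is not covered in this notebook)
# -n 100                                       simulate 100 reads
# --read-reference Escherichia_coli_042.fasta  read the reference from this FASTA file
# ecoli_pacbio_reads                           output prefix (.fastq and .sam are written)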

!simlord -n 100 --read-reference Escherichia_coli_042.fasta ecoli_pacbio_reads

Time for reading/generating the reference: 0:00:00.039389 h
Time for simulation of 100 reads: 0:00:01.209782 h.


In [46]:
!du -sh ecoli_pacbio_reads.fastq ecoli_pacbio_reads.sam

1.7M	ecoli_pacbio_reads.fastq
1.9M	ecoli_pacbio_reads.sam
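
Long reads vary in length far more than Illumina reads, so a length summary is a useful sanity check. A minimal sketch, assuming an uncompressed four-line-per-record FASTQ file; `fastq_read_lengths` and `summarize` are hypothetical helpers, and the demo uses a tiny generated file rather than the SimLoRD output:

```python
import os
import statistics
import tempfile


def fastq_read_lengths(path):
    """Return the length of every sequence line (2nd line of each record)."""
    lengths = []
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                lengths.append(len(line.rstrip("\r\n")))
    return lengths


def summarize(lengths):
    """Basic length statistics for a non-empty list of read lengths."""
    return {
        "n": len(lengths),
        "min": min(lengths),
        "mean": statistics.mean(lengths),
        "max": max(lengths),
    }


# Tiny demo file with reads of length 4 and 6.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fastq")
    with open(demo_path, "w") as fh:
        fh.write("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
    print(summarize(fastq_read_lengths(demo_path)))
```

In the notebook, `summarize(fastq_read_lengths("ecoli_pacbio_reads.fastq"))` would describe the simulated PacBio reads.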


## Simulating from multiple files

In [52]:
!iss generate -n 10000 --genomes SRS121011.fasta minigut.fasta --n_genomes 5 --model novaseq --output novaseq_reads

INFO:iss.app:Starting iss generate
INFO:iss.app:Using kde ErrorModel
INFO:iss.util:Stitching input files together
INFO:iss.app:Using lognormal abundance distribution
INFO:iss.app:Using 2 cpus for read generation
INFO:iss.app:Generating 10000 reads
INFO:iss.app:Generating reads for record: AE016877.1
INFO:iss.app:Generating reads for record: NC_004668.1
INFO:iss.app:Generating reads for record: NC_007795.1
INFO:iss.app:Generating reads for record: NC_004461.1
INFO:iss.app:Generating reads for record: NC_000913.3
INFO:iss.util:Stitching input files together
INFO:iss.util:Stitching input files together
INFO:iss.util:Cleaning up
INFO:iss.app:Read generation complete


In [55]:
!du -sh novaseq_reads_R1.fastq novaseq_reads_R2.fastq

1.6M	novaseq_reads_R1.fastq
1.6M	novaseq_reads_R2.fastq
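
The log above shows reads being generated for five records, matching `--n_genomes 5`. Before choosing that number it helps to know how many records the input files contain. A minimal sketch, assuming plain-text FASTA input; `fasta_record_ids` is a hypothetical helper and the demo file is generated on the fly:

```python
import os
import tempfile


def fasta_record_ids(path):
    """Return the ID (first word after '>') of every record in a FASTA file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                ids.append(fields[0] if fields else "")
    return ids


# Demo file with two records; the sequences are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta")
    with open(demo_path, "w") as fh:
        fh.write(">genome_A example one\nACGT\nACGT\n>genome_B\nTTGA\n")
    ids = fasta_record_ids(demo_path)
    print(len(ids), ids)
```

Running it on `SRS121011.fasta` and `minigut.fasta` would list the record IDs available for random selection.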


## Using random genomes from NCBI
- Requires an internet connection

In [62]:
# !iss generate --ncbi bacteria -u 10 --model miseq  --output miseq_ncbi

In [63]:
# !iss generate -k bacteria viruses -u 10 4 --model miseq --output miseq_ncbi

## Using an abundance distribution

In [66]:
# --abundance options: uniform, halfnormal, exponential, lognormal (the default, as the earlier logs show) or zero-inflated-lognormal

!iss generate -n 10000 -g SRS121011.fasta --abundance exponential -m HiSeq -o HiSeq_reads

INFO:iss.app:Starting iss generate
INFO:iss.app:Using kde ErrorModel
INFO:iss.util:Stitching input files together
INFO:iss.app:Using exponential abundance distribution
INFO:iss.app:Using 2 cpus for read generation
INFO:iss.app:Generating 10000 reads
INFO:iss.app:Generating reads for record: CP009257.1
INFO:iss.app:Generating reads for record: AE016877.1
INFO:iss.app:Generating reads for record: CP000139.1
INFO:iss.app:Generating reads for record: CP017623.1
INFO:iss.app:Generating reads for record: CP010086.2
INFO:iss.app:Generating reads for record: NC_001263.1
INFO:iss.app:Generating reads for record: NC_004668.1
INFO:iss.app:Generating reads for record: NC_000913.3
INFO:iss.app:Generating reads for record: NC_000915.1
INFO:iss.app:Generating reads for record: NC_008530.1
INFO:iss.app:Generating reads for record: NC_003210.1
INFO:iss.app:Generating reads for record: NC_009515.1
INFO:iss.app:Generating reads for record: NC_003112.2
INFO:iss.app:Generating reads for record: NC_006085.1

In [67]:
!du -sh HiSeq_reads_R1.fastq HiSeq_reads_R2.fastq

1.4M	HiSeq_reads_R1.fastq
1.4M	HiSeq_reads_R2.fastq
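
To build intuition for what an abundance distribution does, the sketch below draws one random weight per genome and normalizes the weights so they sum to 1. This is an illustration only and is not InSilicoSeq's implementation; `draw_abundances` and the distribution parameters (rate 1.0, mu 1.0, sigma 2.0) are assumptions:

```python
import random


def draw_abundances(genome_ids, distribution="exponential", seed=0):
    """Draw a relative abundance per genome and normalize to sum to 1."""
    rng = random.Random(seed)
    samplers = {
        "uniform": lambda: 1.0,                             # equal share for all
        "exponential": lambda: rng.expovariate(1.0),        # assumed rate
        "lognormal": lambda: rng.lognormvariate(1.0, 2.0),  # assumed mu, sigma
    }
    raw = [samplers[distribution]() for _ in genome_ids]
    total = sum(raw)
    return {genome: value / total for genome, value in zip(genome_ids, raw)}


# Three record IDs taken from the log above.
genomes = ["CP009257.1", "AE016877.1", "CP000139.1"]
for name in ("uniform", "exponential", "lognormal"):
    print(name, draw_abundances(genomes, name))
```

Skewed distributions such as the exponential or lognormal give a few genomes most of the reads, which mimics the uneven composition of real metagenomic samples.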


## Human sample data

In [68]:
# Download the HG002 Illumina reads (R1 is about 40 GB, see the wget output below)

!wget https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/ILMN/downsampled/HG002_HiSeq30x_subsampled_R1.fastq.gz
!wget https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/ILMN/downsampled/HG002_HiSeq30x_subsampled_R2.fastq.gz

--2023-02-02 11:52:26--  https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/ILMN/downsampled/HG002_HiSeq30x_subsampled_R1.fastq.gz
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.209.240, 52.218.177.192, 52.218.232.64, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.209.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42868243557 (40G) [binary/octet-stream]
Saving to: ‘HG002_HiSeq30x_subsampled_R1.fastq.gz’


2023-02-02 12:09:09 (40.8 MB/s) - ‘HG002_HiSeq30x_subsampled_R1.fastq.gz’ saved [42868243557/42868243557]

--2023-02-02 12:09:09--  https://s3-us-west-2.amazonaws.com/human-pangenomics/NHGRI_UCSC_panel/HG002/hpp_HG002_NA24385_son_v1/ILMN/downsampled/HG002_HiSeq30x_subsampled_R2.fastq.gz
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.252.152, 52.92.147.176, 52.92.213.32, ...
Connecting to s3-us-west-2.amazonaws.com (s
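
The R1 file alone is 42,868,243,557 bytes according to the wget output, so it is worth checking free disk space before starting the download. A minimal sketch; `has_free_space` is a hypothetical helper, and the assumption that R2 is about the same size as R1 is not confirmed by the output above:

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has enough free bytes."""
    return shutil.disk_usage(path).free >= required_bytes


R1_SIZE_BYTES = 42_868_243_557   # size reported by wget for the R1 file
needed = 2 * R1_SIZE_BYTES       # assumes R2 is roughly as large as R1
print("enough space:", has_free_space(".", needed))
```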

In [1]:
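# 2,500 reads
# Note: --genomes expects FASTA genome sequences (see `iss generate --help`), but the
# two files below are gzipped FASTQ reads, so this command is unlikely to work as written.

# !iss generate --n_reads 2500 --genomes HG002_HiSeq30x_subsampled_R1.fastq.gz HG002_HiSeq30x_subsampled_R2.fastq.gz --model novaseq --output HG002_HiSeq30x_subsampled_2500 --cpus 4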

In [2]:
# 10,000 reads

# !iss generate --n_reads 10000 --genomes HG002_HiSeq30x_subsampled_R1.fastq.gz HG002_HiSeq30x_subsampled_R2.fastq.gz --model novaseq --output HG002_HiSeq30x_subsampled_10000 --cpus 4

In [3]:
# !du -sh HG002_HiSeq30x_subsampled_2500_R1.fastq HG002_HiSeq30x_subsampled_2500_R2.fastq
# !du -sh HG002_HiSeq30x_subsampled_10000_R1.fastq HG002_HiSeq30x_subsampled_10000_R2.fastq