CoLoG: Coronavirus Local Genomics analysis pipeline

CoLoG (Coronavirus Local Genomics) is a bioinformatics pipeline designed to perform basic SARS-CoV-2 genome analysis in local computing environments. It was developed to bring the genome-analysis component of Japan's COG-JP surveillance web service into an offline, site-operable workflow that laboratories can run securely on their own servers.

Installation

On a Linux system with Conda installed, execute the following commands to install the CoLoG pipeline environment.

1. Clone the repository and build the Conda environment

$ git clone git@github.com:khoriba/COLOG.git
$ conda env create -n colog_env -f ./COLOG/environment.yml

2. Activate the environment and run the setup script

$ conda activate colog_env
(colog_env)$ bash ./COLOG/setup.sh

3. Download the Nextclade reference dataset

The following command downloads the Nextclade reference dataset for the Wuhan-Hu-1 strain and places it under your Conda environment's reference directory. If your system requires a proxy connection, include the --proxy option as shown; otherwise, omit it.

(colog_env)$ nextclade dataset get \
  --proxy 'http://proxy-server.xx.xx:xxxx' \
  --name 'nextstrain/sars-cov-2/wuhan-hu-1/orfs' \
  --output-dir "$CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs"
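After the three installation steps, a quick check that the expected commands are reachable can save time later. The sketch below is only an illustration: check_commands is a hypothetical helper, not part of CoLoG, and the command names in the usage line are the ones invoked elsewhere in this README.

```shell
# Hypothetical helper (not shipped with CoLoG): report which of the
# given commands can be found on PATH in the current shell.
check_commands() {
  local cmd missing=0
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "[OK]      $cmd"
    else
      echo "[MISSING] $cmd" >&2
      missing=1
    fi
  done
  # Non-zero exit status if at least one command was not found
  return "$missing"
}

# Usage, inside the activated environment:
#   check_commands COLOG.sh mkchank.sh nextclade pangolin
```

If a command is reported as missing, confirm that colog_env is activated and that setup.sh completed without errors.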

Usage

1. Preparing Input Data

Before running CoLoG, organize your sequencing reads into one directory per sample. The following example command sequence separates paired-end read data (gzip-compressed FASTQ files) and, optionally, assembly data (FASTA files) into sample-specific directories. The mkchank.sh script automatically creates a subdirectory for each sample and moves the corresponding FASTQ files into it, so that afterwards each sample has its own directory containing its paired-end files.

# Check the input files
(colog_env)$ ls
JP01_R1_001.fastq.gz  JP02_R1_001.fastq.gz  JP03_R1_001.fastq.gz
JP01_R2_001.fastq.gz  JP02_R2_001.fastq.gz  JP03_R2_001.fastq.gz

# Run mkchank.sh
(colog_env)$ mkchank.sh

# Verify that the sample directories have been created
(colog_env)$ ls -R
.:
JP01  JP02  JP03

./JP01:
JP01_R1_001.fastq.gz  JP01_R2_001.fastq.gz

./JP02:
JP02_R1_001.fastq.gz  JP02_R2_001.fastq.gz

./JP03:
JP03_R1_001.fastq.gz  JP03_R2_001.fastq.gz
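To make the grouping step concrete, here is a minimal sketch of the same idea. It is not the source of mkchank.sh, whose implementation is not shown in this README. The function name group_fastq_by_sample is hypothetical, and the sketch assumes file names follow the <sample>_R1_001.fastq.gz / <sample>_R2_001.fastq.gz pattern from the listing above, with the sample name being everything before _R1_ or _R2_.

```shell
# Hypothetical illustration, NOT the actual mkchank.sh.
# Assumption: files are named <sample>_R1_*.fastq.gz and <sample>_R2_*.fastq.gz.
group_fastq_by_sample() {
  local dir="${1:-.}" f base sample
  for f in "$dir"/*_R[12]_*.fastq.gz; do
    [ -f "$f" ] || continue          # glob matched nothing, or not a regular file
    base="$(basename "$f")"
    sample="${base%%_R[12]_*}"       # drop everything from _R1_ / _R2_ onward
    mkdir -p "$dir/$sample"
    mv -- "$f" "$dir/$sample/"
  done
}

# Usage (in the directory that holds the FASTQ files):
#   group_fastq_by_sample .
```

Files that do not match the assumed naming pattern are left where they are, so a follow-up ls shows at a glance what was not sorted.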

2. Example: Running the Pipeline with Paired-End Read Data

The following example demonstrates how to execute the CoLoG pipeline using paired-end read data (gzip-compressed FASTQ files).

In this example:
• The working directory contains a folder named JP01.
• The JP01 folder contains paired-end FASTQ files:
  • Read 1 file: JP01_R1_001.fastq.gz
  • Read 2 file: JP01_R2_001.fastq.gz
• The number of threads used is 16 (default is 4).
• The default primer set, located at $CONDA_PREFIX/opt/reference/nCovid-19-primerN7.mod.bed, is used for primer trimming.

# Check the input data
(colog_env)$ ls JP01
JP01_R1_001.fastq.gz  JP01_R2_001.fastq.gz

# Run the analysis pipeline
(colog_env)$ COLOG.sh --threads 16 JP01

Note: If you have already organized your input data as described in Section 1, you can simply specify the folder containing each sample’s paired-end data to run the pipeline. CoLoG will automatically process the files within that directory.
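When several sample directories are ready, the same command can be repeated for each of them. The wrapper below is a sketch under assumptions: run_colog_for_samples, COLOG_CMD, and THREADS are hypothetical names introduced here, and it assumes one sample directory per COLOG.sh invocation, as in the example above.

```shell
# Hypothetical wrapper: run the pipeline once per sample directory.
# COLOG_CMD and THREADS are illustrative shell variables, not CoLoG options.
run_colog_for_samples() {
  local cmd="${COLOG_CMD:-COLOG.sh}" threads="${THREADS:-4}" sample
  for sample in "$@"; do
    if [ ! -d "$sample" ]; then
      echo "[WARN] not a directory, skipped: $sample" >&2
      continue
    fi
    # Keep going with the remaining samples if one run fails
    "$cmd" --threads "$threads" "$sample" \
      || echo "[WARN] pipeline failed for: $sample" >&2
  done
}

# Usage:
#   THREADS=16 run_colog_for_samples JP01 JP02 JP03
```

Samples are processed one after another, so the thread count applies to a single run at a time.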

3. Example: Running the Pipeline with Assembly Data

The following example demonstrates how to execute the CoLoG analysis pipeline using assembly data (FASTA files) as input.

In this example:
• The working directory contains a folder named JP01.
• The JP01 folder contains an assembly file (*.fasta).
• The number of threads used is 16 (default is 4).
• The default primer set, located at $CONDA_PREFIX/opt/reference/nCovid-19-primerN7.mod.bed, is used during processing.

# Check the input data
(colog_env)$ ls JP01
JP01.fasta

# Run the analysis pipeline
(colog_env)$ COLOG.sh --threads 16 JP01

4. Example: Running the Pipeline with a Specified Primer Set

You can specify a custom primer set (BED file) and the number of threads to be used when running the CoLoG pipeline. The following example demonstrates how to execute the analysis using a primer file located at /path/to/data/primer.bed and 16 threads.

(colog_env)$ COLOG.sh --primer /path/to/data/primer.bed --threads 16 JP01

Command-line options

Option	Argument	Description
`-p`, `--primer`	filename	Specify the absolute path to the BED file containing the primer set to be used. Default: `$CONDA_PREFIX/opt/reference/nCovid-19-primerN7.mod.bed`
`-t`, `--threads`	integer	Specify the number of threads to use for parallel processing. Default: `4`
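Because --primer expects an absolute path to a BED file, it can be worth checking the file before starting a long run. The following is a minimal sketch, not part of CoLoG: check_primer_bed is a hypothetical helper that only verifies that the path is absolute, the file is readable, and each data line has at least the three required BED columns (name, integer start, integer end) separated by tabs.

```shell
# Hypothetical pre-flight check for a primer BED file (not a CoLoG command).
check_primer_bed() {
  local bed="$1"
  case "$bed" in
    /*) ;;  # absolute path, as the --primer option requires
    *) echo "[ERROR] primer path is not absolute: $bed" >&2; return 1 ;;
  esac
  [ -r "$bed" ] || { echo "[ERROR] cannot read: $bed" >&2; return 1; }
  # Skip comments, header lines, and blank lines; flag malformed records.
  awk -F'\t' '
    /^(#|track|browser)/ || NF == 0 { next }
    NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
      printf("[ERROR] malformed BED line %d\n", NR) > "/dev/stderr"
      bad = 1
    }
    END { exit bad }
  ' "$bed"
}

# Usage:
#   check_primer_bed /path/to/data/primer.bed \
#     && COLOG.sh --primer /path/to/data/primer.bed --threads 16 JP01
```

The check says nothing about whether the primer coordinates match the reference genome; it only catches path and formatting mistakes.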

5. Main Analysis Results

The analysis pipeline outputs results to the directory specified at runtime.
Table 1 summarizes the major output files generated by CoLoG.

Table 1. Main Analysis Results

File / Directory	Description
`result-summary.txt`	Summary report (not inside ZIP)
`<sample>-result.zip`	Archive containing all analysis results
`contigs/`	Contig-level results
├─ `extract-contigs-skesa.megablast.table`	MEGABLAST result (SKESA contigs)
├─ `extract-contigs-a5miseq.megablast.table`	MEGABLAST result (A5-miseq contigs)
├─ `extract-contigs-skesa.fasta`	Extracted SKESA contigs
├─ `extract-contigs-a5miseq.fasta`	Extracted A5-miseq contigs
└─ other associated MEGABLAST hit lists
`mapping-data-and-mutation-site/`	Mapping-based alignment and variant data
├─ `realigned_reads.bam`, `.bam.bai`	Aligned reads (GATK realigned)
├─ `VarScan-modified.vcf`	Variant calls (VarScan)
├─ `VarScan-modified-rm-Mix-allele.vcf`	Filtered variant calls (mixed alleles removed)
├─ `GATK-indel-filtered-ref.vcf`	INDEL variant calls (GATK)
├─ `vcf-consensus.fasta`	Consensus sequence (SNP only)
├─ `vcf-consensus-with-indel-trim.fasta`	Consensus sequence (SNP+INDEL, trimmed)
├─ `large-deletion.txt`	Candidate large deletions
├─ `BED-depth-check.txt`	Coverage depth per position
└─ `ref-seq_MN908947.fasta`	Reference genome
`other/`	QC and lineage information
├─ `plot_depth.pdf`	Depth distribution plot
├─ `insert_size_histogram.pdf`	Insert size histogram
├─ `Nextclade-result.tsv`	Nextclade results
└─ `lineage_report.csv`	Pangolin lineage report
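As an example of working with these outputs, the Pangolin report can be reduced to a two-column sample/lineage list. The sketch below is an illustration only: print_lineages is a hypothetical helper, it relies on the taxon and lineage columns of Pangolin's lineage_report.csv, and it splits on plain commas, so it assumes those two fields contain no quoted commas.

```shell
# Hypothetical helper: print "taxon<TAB>lineage" from a Pangolin lineage report.
# Assumption: simple comma-separated values with a header row.
print_lineages() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to positions
    ("taxon" in col) && ("lineage" in col) {
      print $(col["taxon"]) "\t" $(col["lineage"])
    }
  ' "$1"
}

# Usage (adjust the path to where lineage_report.csv is located after
# unpacking the result archive):
#   print_lineages /path/to/lineage_report.csv
```

Looking columns up by header name rather than by position keeps the helper working if the column order differs between Pangolin versions.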

6. Merging Multiple Summary Results

When multiple analyses have been completed, you can merge their summary files (result-summary.txt) using the script result-summary-table.bash. The merged output file, result-summary-table.txt, provides a consolidated overview across all analyzed samples.

Example: the current directory contains three sample folders (JP01–JP03), each with its own summary file.

# Check the result-summary.txt files
(colog_env)$ ls */result-summary.txt
JP01/result-summary.txt  JP02/result-summary.txt  JP03/result-summary.txt

# Run the merging script
(colog_env)$ result-summary-table.bash

# Verify the merged summary
(colog_env)$ ls
result-summary-table.txt  JP01  JP02  JP03

7. Creating a Multi-FASTA File

The following example demonstrates how to generate a multi-FASTA file that combines the final assembly sequences obtained from multiple sample analyses, using the command collectFasta.bash. In this example, the current directory (project) contains three sample folders (JP01–JP03) and a file named list.txt.
The list.txt file defines which assembly results to include. Each line holds two fields separated by a tab character:

• the name of the directory containing the assembly result
• the sequence name to be used in the output multi-FASTA file

Running collectFasta.bash will generate the file RENAMED_project.fasta as the output.

# Check the list file
(colog_env)$ cat list.txt
JP01    virusA
JP02    virusB
JP03    virusC

# Confirm the directories
(colog_env)$ ls
list.txt  JP01  JP02  JP03

# Run the script
(colog_env)$ collectFasta.bash
[INFO] read key-value file: list.txt
[INFO] 3 pairs of key-value loaded

# Verify the generated multi-FASTA file
(colog_env)$ ls
list.txt  JP01  JP02  JP03  RENAMED_project.fasta
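Since list.txt must be tab-separated, a frequent mistake is separating the two fields with spaces. The sketch below checks the file before collectFasta.bash is run. check_list_file is a hypothetical helper, not part of CoLoG, and it assumes the directory names in list.txt are relative to the current directory, as in the example above.

```shell
# Hypothetical validator for list.txt (not a CoLoG command).
# Each line must be: <directory><TAB><sequence name>
check_list_file() {
  local list="${1:-list.txt}" dir name extra n=0 status=0 tab
  tab="$(printf '\t')"
  # "|| [ -n "$dir" ]" also handles a last line without a trailing newline
  while IFS="$tab" read -r dir name extra || [ -n "$dir" ]; do
    n=$((n + 1))
    if [ -z "$dir" ] || [ -z "$name" ] || [ -n "$extra" ]; then
      echo "[ERROR] line $n: expected <directory><TAB><name>" >&2
      status=1
    elif [ ! -d "$dir" ]; then
      echo "[ERROR] line $n: no such directory: $dir" >&2
      status=1
    fi
  done < "$list"
  return "$status"
}

# Usage:
#   check_list_file list.txt && collectFasta.bash
```

A line whose fields are separated by spaces is read as a single field and is therefore reported as malformed.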

Updating Databases for Nextclade and Pangolin

This section describes how to update the reference databases used by Nextclade and Pangolin in the CoLoG pipeline.

(1) Updating the Nextclade Database

The following command updates the Nextclade reference dataset used in the pipeline.
You can specify the proxy server, dataset name, and the database path referenced by the CoLoG pipeline.
The pipeline currently supports only the dataset 'nextstrain/sars-cov-2/wuhan-hu-1/orfs'.
If you do not need a proxy, the --proxy option can be omitted.

(colog_env)$ nextclade dataset get \
  --proxy 'http://proxy-server.xx.xx:xxxx' \
  --name 'nextstrain/sars-cov-2/wuhan-hu-1/orfs' \
  --output-dir "$CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs"

Note: The command overwrites the existing database. It is strongly recommended to back up the current directory before updating: $CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs
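One way to take the recommended backup is to copy the dataset directory to a time-stamped sibling before running the update. This is a minimal sketch: backup_dir is a hypothetical helper and the .bak-<timestamp> suffix is an arbitrary naming choice, not a CoLoG convention.

```shell
# Hypothetical helper: copy a directory to <dir>.bak-<timestamp> and print the new path.
backup_dir() {
  local src="${1%/}" stamp dest
  [ -d "$src" ] || { echo "[ERROR] no such directory: $src" >&2; return 1; }
  stamp="$(date +%Y%m%d-%H%M%S)"
  dest="${src}.bak-${stamp}"
  [ ! -e "$dest" ] || { echo "[ERROR] backup already exists: $dest" >&2; return 1; }
  # -a preserves permissions, timestamps, and symbolic links
  cp -a -- "$src" "$dest" && echo "$dest"
}

# Usage, before running "nextclade dataset get":
#   backup_dir "$CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs"
```

To roll back, remove the updated directory and move the backup copy back to the original name.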

(2) Updating the Pangolin Database

The following example updates the Pangolin database used for lineage classification. A proxy server can be specified if required; otherwise, it may be omitted.

(colog_env)$ https_proxy='http://proxy-server.xx.xx:xxxx' pangolin --update-data

Note: This command also overwrites the existing Pangolin database. The default location of the database is $CONDA_PREFIX/lib/python3.11/site-packages/pangolin_data/. Do not run pangolin --update, as it may update the Pangolin software itself, which could cause compatibility issues and prevent the CoLoG pipeline from functioning correctly.

Citation

Please cite the following if you are using the CoLoG pipeline:

Sekizuka T, Itokawa K, Hashino M, et al. A Genome Epidemiological Study of SARS-CoV-2 Introduction into Japan. mSphere. 2020;5(6):e00786-20. Published 2020 Nov 11. doi:10.1128/mSphere.00786-20
