CoLoG (Coronavirus Local Genomics) is a bioinformatics pipeline designed to perform basic SARS-CoV-2 genome analysis in local computing environments. It was developed to bring the genome-analysis component of Japan's COG-JP surveillance web service into an offline, site-operable workflow that laboratories can run securely on their own servers.
On a Linux system with Conda installed, execute the following commands to install the CoLoG pipeline environment.
$ git clone git@github.com:khoriba/COLOG.git
$ conda env create -n colog_env -f ./COLOG/environment.yml
$ conda activate colog_env
(colog_env)$ bash ./COLOG/setup.sh
The command below downloads the Nextclade reference dataset for the Wuhan-Hu-1 strain and places it under your Conda environment’s reference directory. If your system requires a proxy connection, include the --proxy option as shown; otherwise, omit it.
(colog_env)$ nextclade dataset get \
--proxy 'http://proxy-server.xx.xx:xxxx' \
--name 'nextstrain/sars-cov-2/wuhan-hu-1/orfs' \
--output-dir "$CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs"
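After installation, it can be useful to confirm that the expected commands are reachable before starting an analysis. The following is a minimal sketch and is not part of CoLoG: check_tools is a hypothetical helper that reports which of the given commands are missing from PATH. The tool names in the usage comment are simply the ones mentioned in this README.

```shell
# Hypothetical helper (not shipped with CoLoG): report commands missing from PATH.
# Returns 0 if every command is found, 1 otherwise.
check_tools() (
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            status=1
        fi
    done
    exit "$status"
)

# Example usage inside the activated environment:
#   check_tools nextclade pangolin COLOG.sh && echo "environment looks complete"
```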
Before running CoLoG, you need to organize your sequencing reads into individual directories by sample.
The following example command sequence separates paired-end read data (gzip-compressed FASTQ files) and optional assembly data (FASTA files) into sample-specific directories. The mkchank.sh script automatically creates a subdirectory for each sample and moves the corresponding FASTQ files into it. After the script runs, each sample has its own directory containing its paired-end files.
# Check the input files
(colog_env)$ ls
JP01_R1_001.fastq.gz JP02_R1_001.fastq.gz JP03_R1_001.fastq.gz
JP01_R2_001.fastq.gz JP02_R2_001.fastq.gz JP03_R2_001.fastq.gz
# Run mkchank.sh
(colog_env)$ mkchank.sh
# Verify that the sample directories have been created
(colog_env)$ ls -R
.:
JP01 JP02 JP03
./JP01:
JP01_R1_001.fastq.gz JP01_R2_001.fastq.gz
./JP02:
JP02_R1_001.fastq.gz JP02_R2_001.fastq.gz
./JP03:
JP03_R1_001.fastq.gz JP03_R2_001.fastq.gz
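For readers who want to see the kind of grouping shown above spelled out, here is an illustrative sketch. It is not the actual mkchank.sh, whose implementation is not shown in this README. The function name group_fastq_by_sample is hypothetical, and the sketch assumes that the sample name is everything before "_R1" or "_R2" in the file name.

```shell
# Illustrative sketch only; NOT the actual mkchank.sh.
# Assumption: the sample name is the part of the file name before "_R1" or "_R2".
group_fastq_by_sample() (
    cd "${1:-.}" || exit 1
    for f in *_R[12]*.fastq.gz; do
        [ -e "$f" ] || continue      # skip the unexpanded pattern when nothing matches
        sample=${f%%_R[12]*}         # JP01_R1_001.fastq.gz -> JP01
        mkdir -p "$sample"
        mv -- "$f" "$sample/"
    done
)

# Example usage (operates on the current directory when no argument is given):
#   group_fastq_by_sample /path/to/fastq_dir
```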
The following example demonstrates how to execute the CoLoG pipeline using paired-end read data (gzip-compressed FASTQ files).
In this example:
• The working directory contains a folder named JP01.
• The JP01 folder contains paired-end FASTQ files:
  • Read 1 file: R1.fastq.gz
  • Read 2 file: R2.fastq.gz
• The number of threads used is 16 (default is 4).
• The default primer set located at $CONDA_PREFIX/opt/reference/nCovid-19-primerN7.mod.bed is used for primer trimming.
# Check the input data
(colog_env)$ ls JP01
JP01_R1_001.fastq.gz JP01_R2_001.fastq.gz
# Run the analysis pipeline
(colog_env)$ COLOG.sh --threads 16 JP01
Note: If you have already organized your input data into per-sample directories as described above, you only need to specify the folder containing a sample’s paired-end data. CoLoG automatically processes the files within that directory.
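When several sample directories are present, the single-sample command can be wrapped in a loop. The sketch below assumes that each sample is processed by its own independent COLOG.sh invocation, as in the example above. The function run_all_samples and the COLOG_CMD variable are hypothetical; COLOG_CMD exists only so the loop can be dry-run with a different command such as echo.

```shell
# Sketch: run the pipeline once per sample directory.
# COLOG_CMD is a hypothetical override; it defaults to COLOG.sh.
run_all_samples() (
    threads=$1
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || { echo "skip (not a directory): $dir" >&2; continue; }
        echo "processing ${dir%/}"
        "${COLOG_CMD:-COLOG.sh}" --threads "$threads" "${dir%/}" \
            || echo "failed: ${dir%/}" >&2
    done
)

# Example usage:
#   run_all_samples 16 JP01 JP02 JP03
# Dry run that only prints the arguments that would be passed:
#   COLOG_CMD=echo; run_all_samples 16 JP01 JP02 JP03
```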
The following example demonstrates how to execute the CoLoG analysis pipeline using assembly data (FASTA files) as input.
In this example:
• The working directory contains a folder named JP01.
• Inside the JP01 folder, an assembly file (*.fasta) is placed.
• The number of threads used is 16 (default is 4).
• The default primer set located at $CONDA_PREFIX/opt/reference/nCovid-19-primerN7.mod.bed is used during processing.
# Check the input data
(colog_env)$ ls JP01
JP01.fasta
# Run the analysis pipeline
(colog_env)$ COLOG.sh --threads 16 JP01
You can specify a custom primer set (BED file) and the number of threads to be used when running the CoLoG pipeline. The following example demonstrates how to execute the analysis using a primer file located at /path/to/data/primer.bed and 16 threads.
(colog_env)$ COLOG.sh --primer /path/to/data/primer.bed --threads 16 JP01
| Option | Argument | Description |
|---|---|---|
| -p, --primer | filename | Specify the absolute path to the BED file containing the primer set to be used. Default: $CONDA_PREFIX/opt/reference/nCovid-19-primerN7.mod.bed |
| -t, --threads | integer | Specify the number of threads to use for parallel processing. Default: 4 |
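Because --primer expects an absolute path and --threads an integer, a small pre-flight check can catch mistakes before a long run starts. This is a sketch under those two stated constraints only; check_colog_args is a hypothetical helper, not something CoLoG provides, and it does not validate the BED file's contents.

```shell
# Hypothetical pre-flight check (not part of CoLoG) for the two documented options.
# Usage: check_colog_args <primer.bed> <threads>
check_colog_args() (
    primer=$1
    threads=$2
    case $primer in
        /*) ;;   # absolute path, as the option requires
        *)  echo "primer path must be absolute: $primer" >&2; exit 1 ;;
    esac
    [ -r "$primer" ] || { echo "primer file not readable: $primer" >&2; exit 1; }
    case $threads in
        ''|*[!0-9]*) echo "threads must be a positive integer: $threads" >&2; exit 1 ;;
    esac
    [ "$threads" -ge 1 ] || { echo "threads must be at least 1" >&2; exit 1; }
)

# Example usage:
#   check_colog_args /path/to/data/primer.bed 16 \
#       && COLOG.sh --primer /path/to/data/primer.bed --threads 16 JP01
```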
The analysis pipeline outputs results to the directory specified at runtime.
Table 1 summarizes the major output files generated by CoLoG.
Table 1. Main Analysis Results
| File / Directory | Description |
|---|---|
| result-summary.txt | Summary report (not inside ZIP) |
| <sample>-result.zip | Archive containing all analysis results |
| contigs/ | Contig-level results |
| ├─ extract-contigs-skesa.megablast.table | MEGABLAST result (SKESA contigs) |
| ├─ extract-contigs-a5miseq.megablast.table | MEGABLAST result (A5-miseq contigs) |
| ├─ extract-contigs-skesa.fasta | Extracted SKESA contigs |
| ├─ extract-contigs-a5miseq.fasta | Extracted A5-miseq contigs |
| └─ other associated MEGABLAST hit lists | |
| mapping-data-and-mutation-site/ | Mapping-based alignment and variant data |
| ├─ realigned_reads.bam, .bam.bai | Aligned reads (GATK realigned) |
| ├─ VarScan-modified.vcf | Variant calls (VarScan) |
| ├─ VarScan-modified-rm-Mix-allele.vcf | Filtered variant calls (mixed alleles removed) |
| ├─ GATK-indel-filtered-ref.vcf | INDEL variant calls (GATK) |
| ├─ vcf-consensus.fasta | Consensus sequence (SNP only) |
| ├─ vcf-consensus-with-indel-trim.fasta | Consensus sequence (SNP+INDEL, trimmed) |
| ├─ large-deletion.txt | Candidate large deletions |
| ├─ BED-depth-check.txt | Coverage depth per position |
| └─ ref-seq_MN908947.fasta | Reference genome |
| other/ | QC and lineage information |
| ├─ plot_depth.pdf | Depth distribution plot |
| ├─ insert_size_histogram.pdf | Insert size histogram |
| ├─ Nextclade-result.tsv | Nextclade results |
| └─ lineage_report.csv | Pangolin lineage report |
When multiple analyses have been completed, you can merge their summary files (result-summary.txt) using the script result-summary-table.bash.
The merged output file, result-summary-table.txt, provides a consolidated overview across all analyzed samples.
In the following example, the current directory contains three sample folders (JP01–JP03), each with its own summary file:
# Check the result-summary.txt files
(colog_env)$ ls */result-summary.txt
JP01/result-summary.txt JP02/result-summary.txt JP03/result-summary.txt
# Run the merging script
(colog_env)$ result-summary-table.bash
# Verify the merged summary
(colog_env)$ ls
result-summary-table.txt JP01 JP02 JP03
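Before merging, it may be worth checking that every sample directory actually contains a result-summary.txt, since this README does not state how result-summary-table.bash handles a missing file. The sketch below is a hypothetical helper, not part of CoLoG, and it assumes that every subdirectory of the given folder is a sample directory.

```shell
# Hypothetical helper: print sample directories that have no result-summary.txt.
# Assumption: every subdirectory of the given folder is a sample directory.
# Returns 1 if any summary is missing, so it can gate the merge step.
list_missing_summaries() (
    cd "${1:-.}" || exit 1
    missing=0
    for dir in */; do
        [ -d "$dir" ] || continue    # no subdirectories at all
        if [ ! -f "${dir}result-summary.txt" ]; then
            echo "${dir%/}"
            missing=1
        fi
    done
    exit "$missing"
)

# Example usage:
#   list_missing_summaries . && result-summary-table.bash
```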
The following example demonstrates how to generate a multi-FASTA file that combines the final assembly sequences obtained from multiple sample analyses using the command collectFasta.bash.
In this example, the current directory (named project) contains three sample folders (JP01–JP03) and a file named list.txt.
The list.txt file defines which assembly results to include by specifying:
- the directory name containing each assembly result, and
- the sequence name to be used in the output multi-FASTA file,
separated by a tab character.
Running collectFasta.bash will generate the file RENAMED_project.fasta as the output.
# Check the list file
(colog_env)$ cat list.txt
JP01 virusA
JP02 virusB
JP03 virusC
# Confirm the directories
(colog_env)$ ls
list.txt JP01 JP02 JP03
# Run the script
(colog_env)$ collectFasta.bash
[INFO] read key-value file: list.txt
[INFO] 3 pairs of key-value loaded
# Verify the generated multi-FASTA file
(colog_env)$ ls
list.txt JP01 JP02 JP03 RENAMED_project.fasta
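Since the list file must be tab-separated, and a tab is easy to confuse with spaces in an editor, a quick format check can help before running the script. The following sketch is not part of CoLoG; check_list_file is a hypothetical helper that verifies each line has exactly two tab-separated fields and that the first field names an existing directory relative to the current directory.

```shell
# Hypothetical check (not part of CoLoG) for the tab-separated list file.
# Each line must be <directory><TAB><sequence name>; the directory must exist.
check_list_file() (
    list=${1:-list.txt}
    tab=$(printf '\t')
    status=0
    lineno=0
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        case $line in
            *"$tab"*) ;;
            *) echo "line $lineno: no tab separator" >&2; status=1; continue ;;
        esac
        dir=${line%%"$tab"*}     # text before the first tab
        name=${line#*"$tab"}     # text after the first tab
        case $name in
            ''|*"$tab"*) echo "line $lineno: expected exactly two fields" >&2; status=1; continue ;;
        esac
        [ -d "$dir" ] || { echo "line $lineno: directory not found: $dir" >&2; status=1; }
    done < "$list"
    exit "$status"
)

# Example usage:
#   check_list_file list.txt && collectFasta.bash
```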
This section describes how to update the reference databases used by Nextclade and Pangolin in the CoLoG pipeline.
The following command updates the Nextclade reference dataset used in the pipeline.
You can specify the proxy server, dataset name, and the database path referenced by the CoLoG pipeline.
The pipeline currently supports only the dataset 'nextstrain/sars-cov-2/wuhan-hu-1/orfs'.
If you do not need a proxy, the --proxy option can be omitted.
(colog_env)$ nextclade dataset get \
--proxy 'http://proxy-server.xx.xx:xxxx' \
--name 'nextstrain/sars-cov-2/wuhan-hu-1/orfs' \
--output-dir "$CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs"
Note:
The command overwrites the existing database.
It is strongly recommended to back up the existing dataset directory before updating:
$CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs
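One simple way to make such a backup is to copy the directory to a timestamped sibling before running the update. The sketch below is only one possible approach; backup_dir is a hypothetical helper and the ".bak-<timestamp>" naming is an arbitrary choice, not a CoLoG convention.

```shell
# Hypothetical helper: copy a directory to a timestamped sibling before it is overwritten.
# Prints the path of the backup on success.
backup_dir() (
    src=${1%/}
    [ -d "$src" ] || { echo "nothing to back up: $src" >&2; exit 1; }
    dest="$src.bak-$(date +%Y%m%d-%H%M%S)"
    [ ! -e "$dest" ] || { echo "backup already exists: $dest" >&2; exit 1; }
    cp -Rp "$src" "$dest" && echo "$dest"
)

# Example usage before updating the Nextclade dataset:
#   backup_dir "$CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs"
```

To restore, remove the updated directory and rename the backup to the original path.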
The following example updates the Pangolin database used for lineage classification. A proxy server can be specified if required; otherwise, it may be omitted.
(colog_env)$ https_proxy='http://proxy-server.xx.xx:xxxx' pangolin --update-data
Note:
This command also overwrites the existing Pangolin database.
The default location of the database is:
$CONDA_PREFIX/lib/python3.11/site-packages/pangolin_data/
Do not run the command pangolin --update, as it may update the Pangolin software itself,
which could cause compatibility issues and prevent the CoLoG pipeline from functioning correctly.
Please cite the following if you are using the CoLoG pipeline:
