CoLoG: Coronavirus Local Genomics analysis pipeline

CoLoG (Coronavirus Local Genomics) is a bioinformatics pipeline designed to perform basic SARS-CoV-2 genome analysis in local computing environments. It was developed to bring the genome-analysis component of Japan's COG-JP surveillance web service into an offline, site-operable workflow that laboratories can run securely on their own servers.

Installation

On a Linux system with Conda installed, execute the following commands to install the CoLoG pipeline environment.

1. Clone the repository and build the Conda environment

$ git clone git@github.com:khoriba/COLOG.git
$ conda env create -n colog_env -f ./COLOG/environment.yml

2. Activate the environment and run the setup script

$ conda activate colog_env
(colog_env)$ bash ./COLOG/setup.sh

3. Download the Nextclade reference dataset

The following command downloads the Nextclade reference dataset for the Wuhan-Hu-1 strain and places it under your Conda environment’s reference directory. If your system requires a proxy connection, include the --proxy option as shown; otherwise, omit it.

(colog_env)$ nextclade dataset get \
  --proxy 'http://proxy-server.xx.xx:xxxx' \
  --name 'nextstrain/sars-cov-2/wuhan-hu-1/orfs' \
  --output-dir "$CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs"
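
As a quick sanity check after installation, a sketch like the following can confirm that the commands named in this README are on the PATH and that the dataset directory is not empty. The function names missing_commands and dataset_present are hypothetical helpers, not part of CoLoG.

```shell
#!/usr/bin/env bash
# Hypothetical post-install check; not part of the CoLoG distribution.

# Print each command from the argument list that cannot be found on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Succeed only if the directory exists and contains at least one entry.
dataset_present() {
  local dir="$1"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Example usage (inside the activated colog_env):
#   missing_commands COLOG.sh nextclade pangolin
#   dataset_present "$CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs" \
#     || echo "Nextclade dataset not found"
```

An empty result from missing_commands suggests the environment is active and the setup script has put the tools on the PATH.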

Usage

1. Preparing Input Data

Before running CoLoG, organize your sequencing reads into one directory per sample. The example below separates paired-end read data (gzip-compressed FASTQ files) and optional assembly data (FASTA files) into sample-specific directories. The mkchank.sh script creates a subdirectory for each sample and moves the corresponding FASTQ files into it, so that after it runs, each sample has its own directory containing its paired-end files.

# Check the input files
(colog_env)$ ls
JP01_R1_001.fastq.gz  JP02_R1_001.fastq.gz  JP03_R1_001.fastq.gz
JP01_R2_001.fastq.gz  JP02_R2_001.fastq.gz  JP03_R2_001.fastq.gz

# Run mkchank.sh
(colog_env)$ mkchank.sh

# Verify that the sample directories have been created
(colog_env)$ ls -R
.:
JP01  JP02  JP03

./JP01:
JP01_R1_001.fastq.gz  JP01_R2_001.fastq.gz

./JP02:
JP02_R1_001.fastq.gz  JP02_R2_001.fastq.gz

./JP03:
JP03_R1_001.fastq.gz  JP03_R2_001.fastq.gz
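
For readers who want to see the grouping step spelled out, the following is a minimal sketch of the behavior described above. It is not the actual mkchank.sh, whose implementation may differ. It assumes file names follow the <sample>_R1_001.fastq.gz / <sample>_R2_001.fastq.gz pattern shown in the listing, and the function name group_fastq_by_sample is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of per-sample grouping; the real mkchank.sh may differ.
# Assumes names like <sample>_R1_001.fastq.gz and <sample>_R2_001.fastq.gz.

group_fastq_by_sample() {
  local f sample
  for f in *_R[12]_*.fastq.gz; do
    [ -e "$f" ] || continue      # nothing matched: the glob stayed literal
    sample=${f%%_R[12]_*}        # drop everything from the first _R1_/_R2_ on
    mkdir -p "$sample"           # create the sample directory if needed
    mv -- "$f" "$sample/"        # move the read file into it
  done
}

# Example usage (in the directory holding the FASTQ files):
#   group_fastq_by_sample
```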

2. Example: Running the Pipeline with Paired-End Read Data

The following example demonstrates how to execute the CoLoG pipeline using paired-end read data (gzip-compressed FASTQ files).

In this example:

  • The working directory contains a folder named JP01.
  • The JP01 folder contains paired-end FASTQ files:
      • Read 1 file: JP01_R1_001.fastq.gz
      • Read 2 file: JP01_R2_001.fastq.gz
  • The number of threads used is 16 (default is 4).
  • The default primer set located at $CONDA_PREFIX/opt/reference/nCovid-19-primerN7.mod.bed is used for primer trimming.

# Check the input data
(colog_env)$ ls JP01
JP01_R1_001.fastq.gz  JP01_R2_001.fastq.gz

# Run the analysis pipeline
(colog_env)$ COLOG.sh --threads 16 JP01

Note: If you have already organized your input data as described in Section 1, you can simply specify the folder containing each sample’s paired-end data to run the pipeline. CoLoG will automatically process the files within that directory.
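
When several sample directories sit side by side, the same command can be repeated for each of them. The loop below is a sketch under that assumption. COLOG_CMD and THREADS are illustrative variables introduced here, and run_all_samples is a hypothetical helper, not something CoLoG provides.

```shell
#!/usr/bin/env bash
# Hypothetical batch wrapper; COLOG_CMD and THREADS are illustrative variables.
COLOG_CMD=${COLOG_CMD:-COLOG.sh}
THREADS=${THREADS:-4}

# Run the pipeline once for every subdirectory of the current directory.
run_all_samples() {
  local dir sample
  for dir in */; do
    [ -d "$dir" ] || continue    # no subdirectories present
    sample=${dir%/}              # strip the trailing slash
    echo "[run] $sample" >&2
    "$COLOG_CMD" --threads "$THREADS" "$sample" || echo "[fail] $sample" >&2
  done
}

# Example usage:
#   THREADS=16 run_all_samples
```

Samples are processed one after another, so a failure in one sample is reported on standard error without stopping the rest.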

3. Example: Running the Pipeline with Assembly Data

The following example demonstrates how to execute the CoLoG analysis pipeline using assembly data (FASTA files) as input.

In this example:

  • The working directory contains a folder named JP01.
  • The JP01 folder contains an assembly file (*.fasta).
  • The number of threads used is 16 (default is 4).
  • The default primer set located at $CONDA_PREFIX/opt/reference/nCovid-19-primerN7.mod.bed is used during processing.

# Check the input data
(colog_env)$ ls JP01
JP01.fasta

# Run the analysis pipeline
(colog_env)$ COLOG.sh --threads 16 JP01

4. Example: Running the Pipeline with a Specified Primer Set

You can specify a custom primer set (BED file) and the number of threads to be used when running the CoLoG pipeline. The following example demonstrates how to execute the analysis using a primer file located at /path/to/data/primer.bed and 16 threads.

(colog_env)$ COLOG.sh --primer /path/to/data/primer.bed --threads 16 JP01

Command-line options

Option         Argument  Description
-p, --primer   filename  Specify the absolute path to the BED file containing the primer set to be used.
                         Default: $CONDA_PREFIX/opt/reference/nCovid-19-primerN7.mod.bed
-t, --threads  integer   Specify the number of threads to use for parallel processing.
                         Default: 4
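
Because --primer expects an absolute path, it can help to validate the path before starting a long run. The check below is a sketch only. check_primer_path is a hypothetical helper, and it tests just the path form and readability, not the BED contents.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the value passed to --primer.
# Succeeds only if the path is absolute and names a readable regular file.
check_primer_path() {
  local bed="$1"
  case $bed in
    /*) ;;                       # absolute path: continue to the file test
    *) echo "primer path must be absolute: $bed" >&2; return 1 ;;
  esac
  if [ ! -f "$bed" ] || [ ! -r "$bed" ]; then
    echo "primer file not readable: $bed" >&2
    return 1
  fi
}

# Example usage:
#   check_primer_path /path/to/data/primer.bed \
#     && COLOG.sh --primer /path/to/data/primer.bed --threads 16 JP01
```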

5. Main Analysis Results

The analysis pipeline outputs results to the directory specified at runtime.
Table 1 summarizes the major output files generated by CoLoG.

Table 1. Main Analysis Results

File / Directory                            Description
result-summary.txt                          Summary report (not inside ZIP)
<sample>-result.zip                         Archive containing all analysis results
contigs/                                    Contig-level results
├─ extract-contigs-skesa.megablast.table    MEGABLAST result (SKESA contigs)
├─ extract-contigs-a5miseq.megablast.table  MEGABLAST result (A5-miseq contigs)
├─ extract-contigs-skesa.fasta              Extracted SKESA contigs
├─ extract-contigs-a5miseq.fasta            Extracted A5-miseq contigs
└─ other associated MEGABLAST hit lists
mapping-data-and-mutation-site/             Mapping-based alignment and variant data
├─ realigned_reads.bam, .bam.bai            Aligned reads (GATK realigned)
├─ VarScan-modified.vcf                     Variant calls (VarScan)
├─ VarScan-modified-rm-Mix-allele.vcf       Filtered variant calls (mixed alleles removed)
├─ GATK-indel-filtered-ref.vcf              INDEL variant calls (GATK)
├─ vcf-consensus.fasta                      Consensus sequence (SNP only)
├─ vcf-consensus-with-indel-trim.fasta      Consensus sequence (SNP+INDEL, trimmed)
├─ large-deletion.txt                       Candidate large deletions
├─ BED-depth-check.txt                      Coverage depth per position
└─ ref-seq_MN908947.fasta                   Reference genome
other/                                      QC and lineage information
├─ plot_depth.pdf                           Depth distribution plot
├─ insert_size_histogram.pdf                Insert size histogram
├─ Nextclade-result.tsv                     Nextclade results
└─ lineage_report.csv                       Pangolin lineage report

6. Merging Multiple Summary Results

When multiple analyses have been completed, you can merge their summary files
(result-summary.txt) using the script result-summary-table.bash. The merged output file result-summary-table.txt provides a consolidated overview across all analyzed samples.

Example:
If the current directory contains three sample folders (JP01 to JP03), each with its own summary file:

# Check the result-summary.txt files
(colog_env)$ ls */result-summary.txt
JP01/result-summary.txt  JP02/result-summary.txt  JP03/result-summary.txt

# Run the merging script
(colog_env)$ result-summary-table.bash

# Verify the merged summary
(colog_env)$ ls
result-summary-table.txt  JP01  JP02  JP03
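
Before merging, it may be worth confirming that every sample directory actually holds a result-summary.txt. The helper below is a sketch for that purpose. samples_without_summary is a hypothetical name and is not part of result-summary-table.bash.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list sample directories that lack result-summary.txt.
samples_without_summary() {
  local dir
  for dir in */; do
    [ -d "$dir" ] || continue                      # no subdirectories present
    [ -f "${dir}result-summary.txt" ] || printf '%s\n' "${dir%/}"
  done
}

# Example usage (in the directory holding the sample folders):
#   samples_without_summary
```

No output means every subdirectory has a summary file. Any name printed is a sample whose analysis probably did not finish.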

7. Creating a Multi-FASTA File

The following example demonstrates how to generate a multi-FASTA file that combines the final assembly sequences obtained from multiple sample analyses using the command collectFasta.bash. In this example, the current directory (project) contains three sample folders (JP01 to JP03) and a file named list.txt.
The list.txt file defines which assembly results to include by specifying:

  • the directory name containing each assembly result, and
  • the sequence name to be used in the output multi-FASTA file,
    separated by a tab character.

Running collectFasta.bash will generate the file RENAMED_project.fasta as the output.

# Check the list file
(colog_env)$ cat list.txt
JP01    virusA
JP02    virusB
JP03    virusC

# Confirm the directories
(colog_env)$ ls
list.txt  JP01  JP02  JP03

# Run the script
(colog_env)$ collectFasta.bash
[INFO] read key-value file: list.txt
[INFO] 3 pairs of key-value loaded

# Verify the generated multi-FASTA file
(colog_env)$ ls
list.txt  JP01  JP02  JP03  RENAMED_project.fasta
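
Since list.txt must be tab-separated, a stray space or a misspelled directory name is an easy mistake to make. The following sketch checks the file before collectFasta.bash is run. validate_list is a hypothetical helper, and how collectFasta.bash itself handles malformed lines is not documented here.

```shell
#!/usr/bin/env bash
# Hypothetical validator for list.txt; not part of collectFasta.bash.
# Prints one message per problem and returns non-zero if any were found.
validate_list() {
  local list="${1:-list.txt}" problems
  problems=$(
    # Each line needs exactly two non-empty, tab-separated fields.
    awk -F '\t' 'NF != 2 || $1 == "" || $2 == "" {
      printf "line %d: expected <directory><TAB><sequence name>\n", NR
    }' "$list"
    # Each directory named in the first column should exist.
    cut -f1 "$list" | while IFS= read -r dir; do
      [ -z "$dir" ] || [ -d "$dir" ] || printf 'missing directory: %s\n' "$dir"
    done
  )
  [ -z "$problems" ] && return 0
  printf '%s\n' "$problems"
  return 1
}

# Example usage:
#   validate_list list.txt && collectFasta.bash
```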

Updating Databases for Nextclade and Pangolin

This section describes how to update the reference databases used by Nextclade and Pangolin in the CoLoG pipeline.

(1) Updating the Nextclade Database

The following command updates the Nextclade reference dataset used in the pipeline.
You can specify the proxy server, dataset name, and the database path referenced by the CoLoG pipeline.
The pipeline currently supports only the dataset 'nextstrain/sars-cov-2/wuhan-hu-1/orfs'.
If you do not need a proxy, the --proxy option can be omitted.

(colog_env)$ nextclade dataset get \
  --proxy 'http://proxy-server.xx.xx:xxxx' \
  --name 'nextstrain/sars-cov-2/wuhan-hu-1/orfs' \
  --output-dir "$CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs"

Note: The command overwrites the existing database. It is strongly recommended to back up the current directory before updating: $CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs
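
One way to take the recommended backup is to copy the dataset directory to a timestamped sibling before updating. The sketch below does this. backup_dir is a hypothetical helper, and the ".bak-<timestamp>" naming is an assumption rather than a CoLoG convention.

```shell
#!/usr/bin/env bash
# Hypothetical backup helper; the ".bak-<timestamp>" suffix is an assumption.
# Copies a directory to a timestamped sibling and prints the backup path.
backup_dir() {
  local src="${1%/}" dest
  [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
  dest="$src.bak-$(date +%Y%m%d-%H%M%S)"
  [ ! -e "$dest" ] || { echo "backup already exists: $dest" >&2; return 1; }
  cp -a -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Example usage (before running nextclade dataset get):
#   backup_dir "$CONDA_PREFIX/opt/reference/nextstrain/sars-cov-2/wuhan-hu-1/orfs"
```

The same helper could be pointed at any other directory that an update overwrites, provided there is enough disk space for the copy.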

(2) Updating the Pangolin Database

The following example updates the Pangolin database used for lineage classification. If a proxy server is required, set it through the https_proxy environment variable as shown; otherwise, omit the variable assignment.

(colog_env)$ https_proxy='http://proxy-server.xx.xx:xxxx' pangolin --update-data

Note: This command also overwrites the existing Pangolin database. The default location of the database is: $CONDA_PREFIX/lib/python3.11/site-packages/pangolin_data/

Do not run the command pangolin --update, as it may update the Pangolin software itself, which could cause compatibility issues and prevent the CoLoG pipeline from functioning correctly.

Citation

Please cite the following if you are using the CoLoG pipeline:

Sekizuka T, Itokawa K, Hashino M, et al. A Genome Epidemiological Study of SARS-CoV-2 Introduction into Japan. mSphere. 2020;5(6):e00786-20. Published 2020 Nov 11. doi:10.1128/mSphere.00786-20
