Matthew-Mosior/Sigma-to-Mosaic

Sigma-to-Mosaic: File Format Converter from sigma_out.gvector.txt to mosaic.txt

Introduction

File format converters serve a wide range of data conversion needs, from low-level instructions to OS-level file format conversions. This converter, implemented in two languages (Shell and Haskell), transforms the output of SigmaW (Strain-level Inference of Genomes from Metagenomic Analysis), sigma_out.gvector.txt, into the format accepted by the Mosaic Community Challenge: Strains.

Shell Implementation

Setting up the Reference Genome Directory

A prerequisite to getting useful output from this shell script is to set up your reference genome directory correctly.
First, your reference genome directory should have the following structure:

[database directory] - [genome directory] - [fasta file]

To create this required reference genome directory, use the shell script GCFrefgenomedirectory.sh. This shell script will correctly set up your reference genome directory, assuming you have downloaded GCF (RefSeq assembly) sequences.

This shell script will change your initial reference genome directory setup of [database directory] - [fasta file] to the required reference genome directory setup: [database directory] - [genome directory] - [fasta file].

Provide the path to the directory that contains the initial [database directory] as a command line argument as shown in the following example:

sh GCFrefgenomedirectory.sh /usr/home/ncbi/ncbi-genomes-2018-02-17
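
The restructuring described above can be sketched roughly as follows. This is a hypothetical illustration, not the contents of GCFrefgenomedirectory.sh: the function name restructure_refgenomes, the accepted file extensions, and the choice to name each genome directory after its FASTA file are all assumptions.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the actual GCFrefgenomedirectory.sh.
# Moves every FASTA file that sits directly in the database directory into
# its own genome directory, named after the file without its extension.
restructure_refgenomes() {
    db_dir=$1
    for fasta in "$db_dir"/*.fna "$db_dir"/*.fa "$db_dir"/*.fasta; do
        [ -f "$fasta" ] || continue          # skip patterns that matched nothing
        name=$(basename "$fasta")
        genome_dir="$db_dir/${name%.*}"      # assumed naming: file name minus extension
        mkdir -p "$genome_dir"
        mv "$fasta" "$genome_dir/"
    done
}

# Demo on a throwaway directory with two empty placeholder files
# (the file names are made up for illustration).
demo=$(mktemp -d)
touch "$demo/genome_one.fna" "$demo/genome_two.fna"
restructure_refgenomes "$demo"
find "$demo" -type f | sort    # each file now sits one level deeper
rm -rf "$demo"
```

After the call, each FASTA file sits in its own [genome directory] under the [database directory], which is the layout the converter expects.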

Usage

This script takes one or more sigma_out.gvector.txt files as command-line arguments. Keep in mind that the Strains community challenge has four datasets:

Simulated_Low_Complexity
Simulated_Medium_Complexity
Simulated_High_Complexity
RealData (Mouse fecal samples)

Each of these datasets contains four sets of paired-end sequencing reads, so the full layout is:

Simulated_Low_Complexity
sim_low_S1_PE1.fq
sim_low_S1_PE2.fq
sim_low_S2_PE1.fq
sim_low_S2_PE2.fq
sim_low_S3_PE1.fq
sim_low_S3_PE2.fq
sim_low_S4_PE1.fq
sim_low_S4_PE2.fq

Simulated_Medium_Complexity
...

Simulated_High_Complexity
...

RealData (Mouse fecal samples)
...

Since each set of paired-end sequencing reads (e.g. sim_low_S1_PE1.fq and sim_low_S1_PE2.fq) is run together to produce a single sigma_out.gvector.txt file, run the script once per dataset as follows:

sh SigmatoMosaic.sh sigma1_out.gvector.txt sigma2_out.gvector.txt sigma3_out.gvector.txt sigma4_out.gvector.txt
*Since every output file is named sigma_out.gvector.txt, you'll need to rename them so that each name is unique, as shown above.
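
One way to do the renaming is sketched below. It assumes a hypothetical layout in which each sample was run in its own directory (S1 to S4 here), so that each directory holds one sigma_out.gvector.txt. The function name collect_gvectors and the directory names are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: copy each run's sigma_out.gvector.txt into one
# destination directory under a unique, numbered name.
collect_gvectors() {
    dest=$1
    shift
    n=1
    for run_dir in "$@"; do
        cp "$run_dir/sigma_out.gvector.txt" "$dest/sigma${n}_out.gvector.txt" || return 1
        n=$((n + 1))
    done
}

# Demo with placeholder files standing in for real run output.
work=$(mktemp -d)
for s in S1 S2 S3 S4; do
    mkdir "$work/$s"
    echo "placeholder for $s" > "$work/$s/sigma_out.gvector.txt"
done
collect_gvectors "$work" "$work/S1" "$work/S2" "$work/S3" "$work/S4"
ls "$work"    # lists S1..S4 plus sigma1_out.gvector.txt .. sigma4_out.gvector.txt
rm -rf "$work"
```

Copying rather than moving leaves the original run directories untouched, which makes it easy to repeat the step if a file is assigned the wrong number.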

If you have sigma_out.gvector.txt files with many identified organisms (lines that start with "*"), it may be wise to do the following:

nohup sh SigmatoMosaic.sh sigma1_out.gvector.txt sigma2_out.gvector.txt sigma3_out.gvector.txt sigma4_out.gvector.txt &

Here, nohup keeps the script running after you log out, and & runs it in the background, so you can continue to work in the current terminal session.
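
To judge whether a file has "many" identified organisms, one option is to count the lines that start with "*" before deciding how to run the script. The sketch below does that. The function name count_organisms is hypothetical, and the demo input is placeholder text rather than real Sigma output.

```shell
#!/bin/sh
# Count the lines that start with "*" (the identified organisms).
count_organisms() {
    grep -c '^\*' "$1"
}

# Demo on a placeholder file; the line contents are NOT real Sigma output.
sample=$(mktemp)
printf '%s\n' '+ placeholder header line' \
              '* placeholder organism line 1' \
              '* placeholder organism line 2' > "$sample"
count_organisms "$sample"    # prints 2
rm -f "$sample"
```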

Running SigmatoMosaic.sh will output a single file, mosaic.txt.

Please see example files sigma_out.gvector.txt and mosaic.txt (examples of input and output).

Update to Roadmap (05/31/2018)

SigmatoMosaic.sh now has the following features:
-Detection of incorrectly formatted input files.
-Placeholder zeros when an organism wasn't identified in a given file, so each relative abundance maps to a specific sigma_out.gvector.txt input file.
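
The placeholder-zero idea can be illustrated with the sketch below. It is not the logic of SigmatoMosaic.sh. It assumes a simplified, hypothetical input of tab-separated (organism, abundance) pairs, not the real sigma_out.gvector.txt or mosaic.txt layouts, and merge_with_zeros is an invented name.

```shell
#!/bin/sh
# Hypothetical sketch: merge per-file (organism <TAB> abundance) tables into
# one row per organism and one column per input file, writing 0 wherever an
# organism is missing from a file.
merge_with_zeros() {
    n=0
    for f in "$@"; do
        n=$((n + 1))
        # Tag every record with the index of the file it came from.
        awk -F '\t' -v col="$n" '{ print col "\t" $1 "\t" $2 }' "$f"
    done | awk -F '\t' -v nfiles="$#" '
        {
            if (!($2 in seen)) { seen[$2] = 1; order[++norg] = $2 }
            value[$2, $1] = $3
        }
        END {
            for (i = 1; i <= norg; i++) {
                line = order[i]
                for (c = 1; c <= nfiles; c++)
                    line = line "\t" (((order[i], c) in value) ? value[order[i], c] : 0)
                print line
            }
        }'
}

# Demo: orgA appears only in the first file, so it gets a 0 in column two.
d=$(mktemp -d)
printf 'orgA\t0.6\norgB\t0.4\n' > "$d/first.tsv"
printf 'orgB\t1.0\n' > "$d/second.tsv"
merge_with_zeros "$d/first.tsv" "$d/second.tsv"
rm -rf "$d"
```

Because every column position corresponds to one input file, a zero in column two unambiguously means "not identified in the second file", which is the mapping the feature list describes.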

Haskell Implementation

Setting up the Reference Genome Directory

A prerequisite to getting useful output from this Haskell script is to set up your reference genome directory correctly.
First, your reference genome directory should have the following structure:

[database directory] - [genome directory] - [fasta file]

To create this required reference genome directory, use the shell script GCFrefgenomedirectory.sh. This shell script will correctly set up your reference genome directory, assuming you have downloaded GCF (RefSeq assembly) sequences.

This shell script will change your initial reference genome directory setup of [database directory] - [fasta file] to the required reference genome directory setup: [database directory] - [genome directory] - [fasta file].

Provide the path to the directory that contains the initial [database directory] as a command line argument as shown in the following example:

sh GCFrefgenomedirectory.sh /usr/home/ncbi/ncbi-genomes-2018-02-17

Usage

This script takes one or more sigma_out.gvector.txt files as command-line arguments. Keep in mind that the Strains community challenge has four datasets:

Simulated_Low_Complexity
Simulated_Medium_Complexity
Simulated_High_Complexity
RealData (Mouse fecal samples)

Each of these datasets contains four sets of paired-end sequencing reads, so the full layout is:

Simulated_Low_Complexity
sim_low_S1_PE1.fq
sim_low_S1_PE2.fq
sim_low_S2_PE1.fq
sim_low_S2_PE2.fq
sim_low_S3_PE1.fq
sim_low_S3_PE2.fq
sim_low_S4_PE1.fq
sim_low_S4_PE2.fq

Simulated_Medium_Complexity
...

Simulated_High_Complexity
...

RealData (Mouse fecal samples)
...

Since each set of paired-end sequencing reads (e.g. sim_low_S1_PE1.fq and sim_low_S1_PE2.fq) is run together to produce a single sigma_out.gvector.txt file, run the script once per dataset as follows:

runghc SigmatoMosaic.hs sigma1_out.gvector.txt sigma2_out.gvector.txt sigma3_out.gvector.txt sigma4_out.gvector.txt
*Since every output file is named sigma_out.gvector.txt, you'll need to rename them so that each name is unique, as shown above.

If you have sigma_out.gvector.txt files with many identified organisms (lines that start with "*"), it may be wise to do the following:

nohup runghc SigmatoMosaic.hs sigma1_out.gvector.txt sigma2_out.gvector.txt sigma3_out.gvector.txt sigma4_out.gvector.txt &

Here, nohup keeps the script running after you log out, and & runs it in the background, so you can continue to work in the current terminal session.

Running SigmatoMosaic.hs will output a single file, mosaic.txt.

Please see example files sigma_out.gvector.txt and mosaic.txt (examples of input and output).

For maximum performance, compile the source code with GHC rather than interpreting it with runghc.
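
A minimal sketch of the compile step follows, assuming GHC is installed and SigmatoMosaic.hs is in the current directory. The wrapper function build_converter is hypothetical; the only real command it relies on is ghc with the standard -O2 optimisation flag.

```shell
#!/bin/sh
# Hypothetical wrapper: compile the converter with optimisations enabled,
# failing early with a clear message if GHC or the source file is missing.
build_converter() {
    src=${1:-SigmatoMosaic.hs}
    command -v ghc >/dev/null 2>&1 || { echo "ghc not found on PATH" >&2; return 1; }
    [ -f "$src" ] || { echo "$src not found" >&2; return 1; }
    ghc -O2 "$src"
}

# Usage (left commented out so the sketch has no side effects):
#   build_converter && ./SigmatoMosaic sigma1_out.gvector.txt sigma2_out.gvector.txt \
#       sigma3_out.gvector.txt sigma4_out.gvector.txt
```

The compiled binary takes the same arguments as the runghc invocation, so it can be dropped into the nohup command shown earlier in this section.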

Credits

Shell implementation and documentation added April 2018.

Haskell implementation and documentation added August 2018.

Author : Matthew Mosior