Commit
Merge pull request #9 from luispedro/better_output_docs
Better output docs
psj1997 committed Jun 8, 2020
2 parents 62abb45 + 2d11572 commit 1f4c3c0
Showing 5 changed files with 63 additions and 95 deletions.
47 changes: 17 additions & 30 deletions README.md
@@ -2,8 +2,6 @@

Command line tool to query input genome to GMGC project.



## Install

GMGC-Finder requires [prodigal](https://github.com/hyattpd/Prodigal)
@@ -14,19 +12,17 @@ Install from source
python setup.py install
```



## Parameters

* `-i/--input` : path to the input genome file(.fasta/.gz/.bz2).
* `-i/--input`: path to the input genome file(.fasta/.gz/.bz2).

* `-o/--output` : Output directory (will be created if non-existent).
* `-o/--output`: Output directory (will be created if non-existent).

* `-nt_input` : path to the input DNA gene file(.fasta/.gz/.bz2).
* `-nt_input`: path to the input DNA gene file(.fasta/.gz/.bz2).

* `-aa_input` : path to the input Protein gene file(.fasta/.gz/.bz2).
* `-aa_input`: path to the input Protein gene file(.fasta/.gz/.bz2).

The input must contain a genome file or both DNA and Protein gene file.
The input must contain a genome file or both DNA and Protein gene files.

## Examples

@@ -42,29 +38,20 @@ Input is DNA/protein gene sequence.
gmgc-finder -nt_input genes.fna -aa_input genes.faa -o output
```

If input is metagenome , you can use [NGLess](https://github.com/ngless-toolkit/ngless) for assemble and gene prediction. For more details , you can [read the docs](https://gmgc-finder.readthedocs.io/en/latest/usage/).
If your input is a metagenome, you can use
[NGLess](https://github.com/ngless-toolkit/ngless) for assembly and gene
prediction. For more details, [read the
docs](https://gmgc-finder.readthedocs.io/en/latest/usage/).

## Output

The output folder contains(for more details , you can [read the docs](https://genome2gmgc.readthedocs.io/en/latest/output/)) :

(1) prodigal_out.faa , prodigal_out.fna , gene.coords.gbk : output of prodigal. .faa file means protein sequence predicted by prodigal and .fna file means nucleotide sequence predicted by prodigal.

(2) hit_table.tsv : results of the query. There are five columns in the file: query_name,gene_id,align_category,gene_dna,gene_protein.

(3) genome_bin.tsv : times of a genome bin that input genes hitting it。

(4) summary.txt : Summary of the query.



## Align_category

* EXACT : above 95% nucleotide identity with at least 95% coverage

* SIMILAR : above 80% nucleotide identity with at least 80% coverage

* MATCH : above 50% nucleotide identity with at least 50% coverage
The output folder will contain

* NO MATCH : no match in GMGC
1. Outputs of gene prediction (prodigal).
2. Complete data table, listing all the hits in GMGC, per gene.
3. Complete table, listing all the genome bins (MAGs) that are found in the results.
4. Human readable summary.

For more details, [read the
docs](https://genome2gmgc.readthedocs.io/en/latest/output/). A description of
the outputs is also written to the output folder for convenience.
27 changes: 3 additions & 24 deletions docs/index.md
@@ -1,6 +1,8 @@
# GMGC-Finder

GMGC-Finder is a command line tool to query input genome to GMGC projec . It will return the summary of alignment categories and genome bins.
GMGC-Finder is a command line tool to query an input genome against the Global
Microbial Gene Catalog (GMGC). It returns a summary of alignment
categories and genome bins.

## Commands

@@ -14,26 +16,3 @@ GMGC-Finder is a command line tool to query input genome to GMGC projec . It wil

The input must contain a genome file or both DNA and Protein gene file.

## Output

The output folder contains :

(1) prodigal_out.faa , prodigal_out.fna , gene.coords.gbk : output of prodigal. .faa file means protein sequence predicted by prodigal and .fna file means nucleotide sequence predicted by prodigal.

(2) hit_table.tsv : results of the query. There are five columns in the file: query_name,gene_id,align_category,gene_dna,gene_protein.

(3) genome_bin.tsv : times of a genome bin that input genes hitting it

(4) summary.txt : Summary of the query.



## Align_category

* EXACT : above 95% nucleotide identity with at least 95% coverage

* SIMILAR : above 80% nucleotide identity with at least 80% coverage

* MATCH : above 50% nucleotide identity with at least 50% coverage

* NO MATCH : no match in GMGC
3 changes: 2 additions & 1 deletion docs/install.md
@@ -1,6 +1,7 @@
# Install

GMGC-Finder requires [prodigal](https://github.com/hyattpd/Prodigal).You need to install prodigal first and add it into your system path.
GMGC-Finder requires [prodigal](https://github.com/hyattpd/Prodigal). You need
to install prodigal first and add it to your system path.

Install from source

62 changes: 29 additions & 33 deletions docs/output.md
@@ -1,59 +1,55 @@
# Output

Explaination of the files in the output
Explanation of the files in the output


## prodigal\_out.faa , prodigal\_out.fna , gene.coords.gbk

## prodigal_out.faa , prodigal_out.fna , gene.coords.gbk
These three files are the output of the prodigal.
These three files are the output of prodigal.

prodigal_out.faa is the protein sequence.
- `prodigal_out.faa` protein sequence
- `prodigal_out.fna` DNA sequence
- `gene.coords.gbk` gene information in GenBank format

prodigal_out.fna is the dna sequence.

gene.coords.gbk is the gene information





## hit_table.tsv :
## hit\_table.tsv :

The results of the queries to the GMGC.

There are five columns in the file.

- query_name: the name/id of the input genome contig
- gene_id: the gene_id with the best hit_score in GMGC
- align_category: there are four different classes of alignment
- gene_dna : the dna sequence of the hitted gene in GMGC
- gene_protein : the protein sequence of the hitted gene in GMGC

Align_category
- `query_name`: the name/id of the input genome contig
- `gene_id`: the gene\_id with the best score in GMGC
- `align_category`: there are four different classes of alignment (see below)
- `gene_dna`: the DNA sequence of the best hit in GMGC
- `gene_protein`: the protein sequence of the best hit in GMGC
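
As a rough illustration of how this table might be consumed, the sketch below tallies the `align_category` column. It assumes the file is tab-separated and has a header row with the five column names listed above; that layout is an assumption, not something this page confirms. The sample rows, including the `query_*` and `unigene_*` names and the sequences, are made up.

```python
import csv
import io
from collections import Counter


def count_align_categories(handle):
    """Count rows per align_category in a tab-separated hit table.

    Assumes a header row naming the columns (an assumption about the file).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["align_category"] for row in reader)


# Made-up rows for illustration; real gene IDs and sequences will differ.
SAMPLE_HIT_TABLE = (
    "query_name\tgene_id\talign_category\tgene_dna\tgene_protein\n"
    "query_1\tunigene_A\tEXACT\tATGAAATAA\tMK\n"
    "query_2\tunigene_B\tSIMILAR\tATGCCCTAA\tMP\n"
    "query_3\tunigene_A\tEXACT\tATGAAATAA\tMK\n"
)

category_counts = count_align_categories(io.StringIO(SAMPLE_HIT_TABLE))
for category, n in sorted(category_counts.items()):
    print(f"{category}\t{n}")
```

To try this on a real run, the in-memory sample could be swapped for an open file handle on `hit_table.tsv` in the output folder, provided the header assumption holds.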

- EXACT : above 95% nucleotide identity with at least 95% coverage
- SIMILAR : above 80% nucleotide identity with at least 80% coverage
- MATCH : above 50% nucleotide identity with at least 50% coverage
- NO MATCH : no match in GMGC
### Alignment category

- `EXACT`: at least 95% nucleotide identity with at least 95% coverage. As
unigenes in the GMGC represent 95% nucleotide clusterings (species-level
threshold), this would mean that the query gene would have clustered with
the GMGC unigene.
- `SIMILAR`: at least 80% amino acid identity with at least 80% coverage.
- `MATCH`: at least 50% amino acid identity with at least 50% coverage.
- `NO MATCH`: no match in GMGC.
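
The thresholds above can be read as a cascade, checked from the strictest category down. The following is a minimal sketch of that reading, not GMGC-Finder's actual implementation: the function name and its four percentage arguments are hypothetical, and how the tool computes identity and coverage is not described here.

```python
def classify_hit(nt_identity, nt_coverage, aa_identity, aa_coverage):
    """Map identity/coverage percentages (0-100) to an alignment category.

    Hypothetical helper: it only mirrors the documented thresholds.
    """
    # EXACT is defined on nucleotide identity and coverage.
    if nt_identity >= 95 and nt_coverage >= 95:
        return "EXACT"
    # SIMILAR and MATCH are defined on amino acid identity and coverage.
    if aa_identity >= 80 and aa_coverage >= 80:
        return "SIMILAR"
    if aa_identity >= 50 and aa_coverage >= 50:
        return "MATCH"
    return "NO MATCH"


print(classify_hit(97.0, 99.0, 99.0, 99.0))
print(classify_hit(85.0, 90.0, 88.0, 92.0))
print(classify_hit(60.0, 70.0, 55.0, 65.0))
print(classify_hit(30.0, 40.0, 35.0, 45.0))
```

Checking the categories in this order means a hit that satisfies several thresholds is reported under the strictest one.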


## `genome_bin.tsv`


## genome_bin.tsv

Times of a genome bin that input genes hitting it
Genome bins (MAGs) found in the results (and a count of how many genes
are contained in them).

There are two columns in the file.

* genome_bin : the name of genome bins in GMGC
* times_gene_hit : the times of input genes hitting it



- `genome_bin`: the name of the genome bin in GMGC
- `times_gene_hit`: the number of input genes that hit it

Note that, while not all GMGC unigenes are contained in a
genome bin, some are contained in many. Thus, the total counts will not (except
by coincidence) correspond to the number of genes queried.
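
A toy example may make the counting clearer. The sketch below is not the tool's code: the unigene-to-bin mapping, the `unigene_*` and `bin_*` names, and the helper function are all invented for illustration. Three query genes hit three unigenes; one unigene belongs to three bins, one to a single bin, and one to none.

```python
from collections import Counter


def count_bin_hits(hit_unigenes, unigene_to_bins):
    """Count, per genome bin, how many query hits land on a unigene in that bin."""
    counts = Counter()
    for unigene in hit_unigenes:
        # A unigene may belong to zero, one, or many bins.
        for genome_bin in unigene_to_bins.get(unigene, ()):
            counts[genome_bin] += 1
    return counts


# Hypothetical mapping: unigene_B is not part of any genome bin.
UNIGENE_TO_BINS = {
    "unigene_A": ["bin_1", "bin_2", "bin_3"],
    "unigene_C": ["bin_1"],
}

# One best-hit unigene per query gene (three genes queried).
hits = ["unigene_A", "unigene_B", "unigene_C"]

bin_counts = count_bin_hits(hits, UNIGENE_TO_BINS)
total = sum(bin_counts.values())
print(dict(bin_counts), "total:", total, "genes queried:", len(hits))
```

Here the per-bin counts sum to 4 although only 3 genes were queried, which is the mismatch described in the note above.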

## summary.txt

Summary of the query
Human-readable summary of the results.

19 changes: 12 additions & 7 deletions docs/usage.md
@@ -14,21 +14,27 @@ The input must contain a genome file or both DNA and Protein gene file.

## Examples

Input is genome sequence.
1. Input is a genome sequence (`input.fasta`).

```bash
gmgc-finder -i input.fasta -o output
```

Input is DNA/protein gene sequence.
GMGC-finder will call `prodigal` to predict genes and then process each gene.

2. Input is DNA/protein gene sequences (`genes.fna` and `genes.faa`,
respectively).

```bash
gmgc-finder -nt_input genes.fna -aa_input genes.faa -o output
```

If input is metagenome , you can use [NGLess](https://github.com/ngless-toolkit/ngless) for assemble and gene prediction.
# Processing metagenomes using NGLess

If your input is a metagenome, you can use
[NGLess](https://github.com/ngless-toolkit/ngless) for assembly and gene
prediction and then pass the results to GMGC-finder.

# NGLess

## Install

@@ -41,10 +47,9 @@ conda install -c bioconda ngless
## Assembly and gene prediction

```bash
ngless "0.6"

ngless "1.0"

sample = 'SAMEA2621155.sampled'
sample = 'SAMEA2621155'
input = load_mocat_sample(sample)

preprocess(input, keep_singles=False) using |read|: