Merge pull request #9 from luispedro/better_output_docs

Better output docs
BigDataBiology · Jun 8, 2020 · 1f4c3c0
2 parents 62abb45 + 2d11572
commit 1f4c3c0

Showing 5 changed files with 63 additions and 95 deletions.
diff --git a/README.md b/README.md
@@ -2,8 +2,6 @@
 
 Command line tool to query input genome to GMGC project. 
 
-
-
 ## Install
 
 GMGC-Finder requires [prodigal](https://github.com/hyattpd/Prodigal)
@@ -14,19 +12,17 @@ Install from source
 python setup.py install
 ```
 
-
-
 ## Parameters
 
-* `-i/--input` : path to the input genome file(.fasta/.gz/.bz2).
+* `-i/--input`: path to the input genome file(.fasta/.gz/.bz2).
 
-* `-o/--output` : Output directory (will be created if non-existent).
+* `-o/--output`: Output directory (will be created if non-existent).
 
-* `-nt_input` : path to the input DNA gene file(.fasta/.gz/.bz2).
+* `-nt_input`: path to the input DNA gene file(.fasta/.gz/.bz2).
 
-* `-aa_input` : path to the input Protein gene file(.fasta/.gz/.bz2).
+* `-aa_input`: path to the input Protein gene file(.fasta/.gz/.bz2).
 
-The input must contain a genome file or both DNA and Protein gene file.
+The input must contain a genome file or both DNA and Protein gene files.
 
 ## Examples
 
@@ -42,29 +38,20 @@ Input is DNA/protein gene sequence.
 gmgc-finder -nt_input genes.fna -aa_input genes.faa -o output
 ```
 
-If input is metagenome , you can use [NGLess](https://github.com/ngless-toolkit/ngless) for assemble and gene prediction. For more details , you can [read the docs](https://gmgc-finder.readthedocs.io/en/latest/usage/).
+If your input is a metagenome, you can use
+[NGLess](https://github.com/ngless-toolkit/ngless) for assembly and gene
+prediction. For more details, [read the
+docs](https://gmgc-finder.readthedocs.io/en/latest/usage/).
 
 ## Output
 
-The output folder contains（for more details , you can [read the docs](https://genome2gmgc.readthedocs.io/en/latest/output/)） :
-
-(1) prodigal_out.faa , prodigal_out.fna , gene.coords.gbk :  output of prodigal.  .faa file means protein sequence predicted by prodigal and .fna file means nucleotide sequence predicted by prodigal.
-
-(2) hit_table.tsv : results of the query. There are five columns in the file: query_name,gene_id,align_category,gene_dna,gene_protein.
-
-(3) genome_bin.tsv : times of a genome bin that input genes hitting it。
-
-(4) summary.txt : Summary of the query.
-
-
-
-## Align_category
-
-* EXACT : above 95% nucleotide identity with at least 95% coverage
-
-* SIMILAR : above 80% nucleotide identity with at least 80% coverage
-
-* MATCH : above 50% nucleotide identity with at least 50% coverage
+The output folder will contain:
 
-* NO MATCH : no match in GMGC
+1. Outputs of gene prediction (prodigal).
+2. Complete data table, listing all the hits in GMGC, per gene.
+3. Complete table, listing all the genome bins (MAGs) that are found in the results.
+4. Human-readable summary.
 
+For more details, [read the
+docs](https://genome2gmgc.readthedocs.io/en/latest/output/). A description of
+the outputs is also written to the output folder for convenience.
diff --git a/docs/index.md b/docs/index.md
@@ -1,6 +1,8 @@
 # GMGC-Finder
 
-GMGC-Finder is a command line tool to query input genome to GMGC projec . It will return the summary of  alignment categories and genome bins. 
+GMGC-Finder is a command line tool to query an input genome against the Global
+Microbial Gene Catalog (GMGC). It returns a summary of alignment
+categories and genome bins.
 
 ## Commands
 
@@ -14,26 +16,3 @@ GMGC-Finder is a command line tool to query input genome to GMGC projec . It wil
 
 The input must contain a genome file or both DNA and Protein gene file.
 
-## Output
-
-The output folder contains :
-
-(1) prodigal_out.faa , prodigal_out.fna , gene.coords.gbk :  output of prodigal.  .faa file means protein sequence predicted by prodigal and .fna file means nucleotide sequence predicted by prodigal.
-
-(2) hit_table.tsv : results of the query. There are five columns in the file: query_name,gene_id,align_category,gene_dna,gene_protein.
-
-(3) genome_bin.tsv : times of a genome bin that input genes hitting it
-
-(4) summary.txt : Summary of the query.
-
-
-
-## Align_category
-
-* EXACT : above 95% nucleotide identity with at least 95% coverage
-
-* SIMILAR : above 80% nucleotide identity with at least 80% coverage
-
-* MATCH : above 50% nucleotide identity with at least 50% coverage
-
-* NO MATCH : no match in GMGC
diff --git a/docs/install.md b/docs/install.md
@@ -1,6 +1,7 @@
 # Install
 
-GMGC-Finder requires [prodigal](https://github.com/hyattpd/Prodigal).You need to install prodigal first and add it into your system path.
+GMGC-Finder requires [prodigal](https://github.com/hyattpd/Prodigal). You need
+to install prodigal first and add it into your system path.
 
 Install from source
 

diff --git a/docs/output.md b/docs/output.md
@@ -1,59 +1,55 @@
 # Output
 
-Explaination of the files in the output
+Explanation of the files in the output
 
 
+## prodigal\_out.faa, prodigal\_out.fna, gene.coords.gbk
 
-## prodigal_out.faa , prodigal_out.fna , gene.coords.gbk
-These three files are the output of the prodigal.
+These three files are the output of prodigal.
 
-prodigal_out.faa is the protein sequence.
+- `prodigal_out.faa`: protein sequences
+- `prodigal_out.fna`: DNA sequences
+- `gene.coords.gbk`: gene information in GenBank format
 
-prodigal_out.fna is the dna sequence.
 
-gene.coords.gbk is the gene information
-
-
-
-
-
-## hit_table.tsv :
+## hit\_table.tsv
 
 The results of the queries to the GMGC.
 
 There are five columns in the file.
 
-- query_name: the name/id of the input genome contig
-- gene_id: the gene_id with the best hit_score in GMGC
-- align_category: there are four different classes of alignment
-- gene_dna : the dna sequence of the hitted gene in GMGC
-- gene_protein : the protein sequence of the hitted gene in GMGC
-
-Align_category
+- `query_name`: the name/id of the input genome contig
+- `gene_id`: the `gene_id` of the best-scoring hit in GMGC
+- `align_category`: there are four different classes of alignment (see below)
+- `gene_dna`: the DNA sequence of the best hit in GMGC
+- `gene_protein`: the protein sequence of the best hit in GMGC
 
-- EXACT : above 95% nucleotide identity with at least 95% coverage
-- SIMILAR : above 80% nucleotide identity with at least 80% coverage
-- MATCH : above 50% nucleotide identity with at least 50% coverage
-- NO MATCH : no match in GMGC
+### Alignment category
 
+- `EXACT`: at least 95% nucleotide identity with at least 95% coverage. As
+   unigenes in the GMGC represent 95% nucleotide clusterings (species-level
+   threshold), this would mean that the query gene would have clustered with
+   the GMGC unigene.
+- `SIMILAR`: at least 80% amino acid identity with at least 80% coverage.
+- `MATCH`: at least 50% amino acid identity with at least 50% coverage.
+- `NO MATCH`: no match in GMGC.
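
Review note: the thresholds in the list above can be read as a simple cascade. The following is a minimal sketch of that reading, not GMGC-Finder's actual code. The function name `classify_hit`, its four arguments, and the 0-100 percentage scale are all assumptions made for illustration.

```python
def classify_hit(nt_identity, nt_coverage, aa_identity, aa_coverage):
    """Map identity/coverage percentages (0-100) to an alignment category.

    Hypothetical helper: thresholds follow the documented categories,
    checked from the strictest to the loosest.
    """
    # EXACT is defined on nucleotide identity and coverage.
    if nt_identity >= 95 and nt_coverage >= 95:
        return "EXACT"
    # SIMILAR and MATCH are defined on amino acid identity and coverage.
    if aa_identity >= 80 and aa_coverage >= 80:
        return "SIMILAR"
    if aa_identity >= 50 and aa_coverage >= 50:
        return "MATCH"
    return "NO MATCH"
```

Checking the strictest category first means a hit that satisfies several thresholds is reported under the most specific one.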
 
 
+## genome\_bin.tsv
 
-
-## genome_bin.tsv
-
-Times of a genome bin that input genes hitting it
+Genome bins (MAGs) found in the results (and a count of how many genes
+are contained in them).
 
 There are two columns in the file.
 
-* genome_bin : the name of genome bins in GMGC
-* times_gene_hit : the times of input genes hitting it 
-
-
-
+- `genome_bin`: the name of the genome bin in GMGC
+- `times_gene_hit`: the number of input genes that hit it
 
+Note that, while not all GMGC unigenes are contained in a genome bin, some are
+contained in many. Thus, the total counts will not (except
+by coincidence) correspond to the number of genes queried.
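
Review note: a small sketch of why the per-bin counts need not add up to the number of queried genes. The gene names, bin names, and the `count_bin_hits` helper are invented for illustration and are not taken from GMGC-Finder.

```python
from collections import Counter


def count_bin_hits(gene_to_bins):
    """Count, for each genome bin, how many query genes hit it.

    `gene_to_bins` maps a query gene to the bins containing its best hit
    (hypothetical structure; a gene may map to zero, one, or many bins).
    """
    counts = Counter()
    for bins in gene_to_bins.values():
        # Use a set so a gene is counted at most once per bin.
        counts.update(set(bins))
    return counts


# Made-up example: three query genes, one with no bin and one in three bins.
hits = {
    "gene_1": ["bin_A", "bin_B", "bin_C"],
    "gene_2": [],
    "gene_3": ["bin_A"],
}
counts = count_bin_hits(hits)
total = sum(counts.values())
```

Here three genes are queried, but the counts sum to four because `gene_1` contributes to three bins and `gene_2` to none.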
 
 ## summary.txt
 
-Summary of the query
+Human-readable summary of the results.
 
diff --git a/docs/usage.md b/docs/usage.md
@@ -14,21 +14,27 @@ The input must contain a genome file or both DNA and Protein gene file.
 
 ## Examples
 
-Input is genome sequence.
+1. Input is a genome sequence (`input.fasta`).
 
 ```bash
 gmgc-finder -i input.fasta -o output
 ```
 
-Input is DNA/protein gene sequence.
+GMGC-finder will call `prodigal` to predict genes and then process each gene.
+
+2. Input is DNA/protein gene sequences (`genes.fna` and `genes.faa`,
+   respectively).
 
 ```bash
 gmgc-finder -nt_input genes.fna -aa_input genes.faa -o output
 ```
 
-If input is metagenome , you can use [NGLess](https://github.com/ngless-toolkit/ngless) for assemble and gene prediction.
+# Processing metagenomes using NGLess
+
+If your input is a metagenome, you can use
+[NGLess](https://github.com/ngless-toolkit/ngless) for assembly and gene
+prediction and then pass the results to GMGC-finder.
 
-# NGLess
 
 ## Install
 
@@ -41,10 +47,9 @@ conda install -c bioconda ngless
 ## Assembly and gene prediction
 
 ```bash
-ngless "0.6"
-
+ngless "1.0"
 
-sample = 'SAMEA2621155.sampled'
+sample = 'SAMEA2621155'
 input = load_mocat_sample(sample)
 
 preprocess(input, keep_singles=False) using |read|: