Fix the bug of copying output.md
psj1997 committed Jun 12, 2020
1 parent a339c99 commit 6b441c6
Showing 2 changed files with 55 additions and 1 deletion.
54 changes: 54 additions & 0 deletions gmgc_finder/output.md
@@ -0,0 +1,54 @@
# Output

Explanation of the files in the output

## Prodigal output

These three files are the output of Prodigal.

- `prodigal_out.faa` protein sequence
- `prodigal_out.fna` DNA sequence
- `gene.coords.gbk` gene information in GenBank format


## Hit Table (`hit_table.tsv`)

The results of the queries to the GMGC.

There are five columns in the file.

- `query_name`: the name/id of the input genome contig
- `gene_id`: the Unigene with the best score in GMGC
- `align_category`: there are four different classes of alignment (see below)
- `gene_dna`: the DNA sequence of the best hit in GMGC
- `gene_protein`: the protein sequence of the best hit in GMGC
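
As a rough illustration of how this table could be consumed, the sketch below counts rows per `align_category`. It assumes the file is tab-separated with a header row that uses exactly the column names listed above; the function name `count_align_categories` and the sample rows are made up for the example and are not part of gmgc-finder.

```python
import csv
import io
from collections import Counter


def count_align_categories(handle) -> Counter:
    """Count rows per align_category in a tab-separated hit table.

    Assumes the first row is a header containing an `align_category` column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["align_category"] for row in reader)


# Tiny made-up table standing in for a real hit_table.tsv
example = io.StringIO(
    "query_name\tgene_id\talign_category\tgene_dna\tgene_protein\n"
    "contig_1\tunigene_A\tEXACT\tATG\tM\n"
    "contig_2\tunigene_B\tSIMILAR\tATG\tM\n"
    "contig_3\tNA\tNO MATCH\t\t\n"
    "contig_4\tunigene_A\tEXACT\tATG\tM\n"
)
counts = count_align_categories(example)
```

For a real run, the same function could be given `open("hit_table.tsv", newline="")` instead of the in-memory sample.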

### Alignment category

- `EXACT`: at least 95% nucleotide identity with at least 95% coverage. As
unigenes in the GMGC represent 95% nucleotide clusterings (species-level
threshold), this would mean that the query gene would have clustered with
the GMGC unigene.
- `SIMILAR`: at least 80% amino acid identity with at least 80% coverage.
- `MATCH`: at least 50% amino acid identity with at least 50% coverage.
- `NO MATCH`: no match in GMGC.
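
The thresholds above can be read as a cascade from the strictest class to the loosest. The following is only a sketch of that reading, not the tool's actual code: the function name, its parameters, and the assumption that identity and coverage are given as percentages are all hypothetical.

```python
def classify_alignment(nt_identity: float, nt_coverage: float,
                       aa_identity: float, aa_coverage: float) -> str:
    """Map identity/coverage percentages to an alignment category.

    Checks run from strictest to loosest, so a hit gets the best class
    it qualifies for. EXACT uses nucleotide values; SIMILAR and MATCH
    use amino acid values, as in the list above.
    """
    if nt_identity >= 95 and nt_coverage >= 95:
        return "EXACT"
    if aa_identity >= 80 and aa_coverage >= 80:
        return "SIMILAR"
    if aa_identity >= 50 and aa_coverage >= 50:
        return "MATCH"
    return "NO MATCH"
```

Note that both conditions of a class must hold: a hit with 99% amino acid identity but only 60% coverage would be a `MATCH` under this reading, not `SIMILAR`.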


## Genome bins (`genome_bin.tsv`)

Genome bins (MAGs) found in the results (and a count of how many genes are
contained in each of them).

There are two columns in the file.

- `genome_bin`: the name of the genome bin in the GMGC
- `times_gene_hit`: the number of input genes that hit this bin

Note that while not all GMGC unigenes are contained in a genome bin, some are
contained in many. Thus, the total counts will not (except by coincidence)
correspond to the number of genes queried.
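
A small sketch may make the counting clearer. It assumes a mapping from unigene to the genome bins that contain it; the names `count_bin_hits` and `unigene_to_bins`, and all the identifiers in the sample data, are hypothetical and only serve the illustration.

```python
from collections import Counter


def count_bin_hits(hit_unigenes, unigene_to_bins) -> Counter:
    """Count, per genome bin, how many hit unigenes it contains.

    A unigene missing from the mapping contributes nothing; a unigene
    found in several bins contributes one count to each of them.
    """
    counts = Counter()
    for unigene in hit_unigenes:
        counts.update(unigene_to_bins.get(unigene, ()))
    return counts


# Made-up data: three queried genes hit three unigenes.
hits = ["u1", "u2", "u3"]
unigene_to_bins = {
    "u1": ["binA", "binB", "binC"],  # contained in many bins
    "u2": ["binA"],
    # "u3" is not contained in any genome bin
}
bin_counts = count_bin_hits(hits, unigene_to_bins)
total = sum(bin_counts.values())
```

Here three genes were queried, but the counts add up to four, because `u1` is counted once per bin and `u3` is not counted at all.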

## Summary (`summary.txt`)

Human-readable summary of the results.

2 changes: 1 addition & 1 deletion setup.py
@@ -30,7 +30,7 @@
'tqdm',
],
package_data={
-'docs': ['*.md']},
+'gmgc_finder': ['*.md']},
zip_safe=False,
entry_points={
'console_scripts': ['gmgc-finder=gmgc_finder.main:main'],

3 comments on commit 6b441c6

@luispedro
Member

Making a copy of output.md (docs/output.md and gmgc_finder/output.md) is a bad idea. It will get out of sync very fast. Use links if you have to, or copy the file in setup.py or whatever, but there should be a single copy.

(If there are technical reasons why the above are a bad idea, at least add a check in the testing that the two versions of the file are identical :)
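
A check like the one suggested here could look roughly as follows. This is a sketch, not code from the repository: the helper names are hypothetical, the two paths are the ones mentioned in the comment, and the check assumes it is run from the repository root.

```python
import filecmp
import tempfile
from pathlib import Path


def copies_identical(first, second) -> bool:
    """Compare two files by content (shallow=False forces a byte comparison)."""
    return filecmp.cmp(first, second, shallow=False)


def check_output_md_copies_in_sync() -> None:
    # Paths taken from the comment above; call this from the test suite
    # with the repository root as the working directory.
    assert copies_identical("docs/output.md", "gmgc_finder/output.md")


# Self-contained demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    a, b, c = (Path(tmp) / name for name in ("a.md", "b.md", "c.md"))
    a.write_text("# Output\n")
    b.write_text("# Output\n")
    c.write_text("# Output (edited)\n")
    same = copies_identical(a, b)
    drifted = copies_identical(a, c)
```

Such a check only detects drift after it happens; it does not remove the duplication itself.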

@psj1997
Collaborator Author

I do not know why, but when installing, it cannot find output.md using 'docs/output.md' in setup.py...
Using gmgc_finder/output.md works.
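
A likely explanation, offered as an assumption rather than a diagnosis of this repository: in setuptools, the keys of `package_data` are package names and the patterns are resolved relative to that package's directory, so a `'docs'` key only has an effect if `docs` is itself an installed package. Once the file sits inside `gmgc_finder`, it can be read at runtime through the package. The sketch below shows that lookup with `pkgutil.get_data`; the helper names and the throwaway package `demo_pkg_with_docs` are hypothetical.

```python
import pkgutil
import sys
import tempfile
from pathlib import Path


def read_packaged_doc(package: str, filename: str = "output.md") -> str:
    """Return the text of a data file that lives inside an importable package."""
    data = pkgutil.get_data(package, filename)
    if data is None:
        raise FileNotFoundError(f"{filename} not found in package {package}")
    return data.decode("utf-8")


def demo() -> str:
    # Build a throwaway package (hypothetical name) to show the lookup working.
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg_with_docs"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "output.md").write_text("# Output\n")
        sys.path.insert(0, tmp)
        try:
            return read_packaged_doc("demo_pkg_with_docs")
        finally:
            # Leave the interpreter state as it was before the demo.
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg_with_docs", None)
```

With the file shipped inside the package, `read_packaged_doc("gmgc_finder")` would be the corresponding call in an installed environment.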

@luispedro
Member

See #11 for a still-not-great, but better approach. I still don't like it, but having two copies is a bad idea: it will almost always lead to errors later when someone updates one copy and not the other.
