Commit
Showing 2 changed files with 55 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
# Output | ||
|
||
Explanation of the files in the output | ||
|
||
## Prodigal output | ||
|
||
These three files are the output of Prodigal. | ||
|
||
- `prodigal_out.faa` protein sequence | ||
- `prodigal_out.fna` DNA sequence | ||
- `gene.coords.gbk` gene information in GenBank format | ||
|
||
|
||
## Hit Table (`hit_table.tsv`) | ||
|
||
The results of the queries to the GMGC. | ||
|
||
There are five columns in the file. | ||
|
||
- `query_name`: the name/id of the input genome contig | ||
- `gene_id`: the unigene with the best score in GMGC | ||
- `align_category`: there are four different classes of alignment (see below) | ||
- `gene_dna`: the DNA sequence of the best hit in GMGC | ||
- `gene_protein`: the protein sequence of the best hit in GMGC | ||
|
||
### Alignment category | ||
|
||
- `EXACT`: at least 95% nucleotide identity with at least 95% coverage. As | ||
unigenes in the GMGC represent 95% nucleotide clusterings (species-level | ||
threshold), this would mean that the query gene would have clustered with | ||
the GMGC unigene. | ||
- `SIMILAR`: at least 80% amino acid identity with at least 80% coverage. | ||
- `MATCH`: at least 50% amino acid identity with at least 50% coverage. | ||
- `NO MATCH`: no match in GMGC. | ||
|
||
|
||
## Genome bins (`genome_bin.tsv`) | ||
|
||
Genome bins (MAGs) found in the results (and a count of how many genes | ||
are contained in them). | ||
|
||
There are two columns in the file. | ||
|
||
- `genome_bin`: the name of the genome bin in GMGC | ||
- `times_gene_hit`: the number of input genes that hit this genome bin | ||
|
||
Note that while not all GMGC unigenes are contained in a genome bin, some are | ||
contained in many. Thus, the total counts will not (except by coincidence) | ||
correspond to the number of genes queried. | ||
|
||
## Summary (`summary.txt`) | ||
|
||
Human-readable summary of the results. | ||
|
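As a rough illustration of how the two tables described in the diff above might be consumed, here is a minimal Python sketch. It assumes both files are tab-separated with a header row that uses the column names listed in the document; that layout is an assumption, not something the commit confirms. The helper names (`count_categories`, `total_bin_hits`) and the toy rows are hypothetical and are not real GMGC output.

```python
import csv
import io
from collections import Counter


def count_categories(hit_table_text):
    """Count rows per align_category in a hit table given as TSV text.

    Assumes a header row containing an `align_category` column.
    """
    reader = csv.DictReader(io.StringIO(hit_table_text), delimiter="\t")
    return Counter(row["align_category"] for row in reader)


def total_bin_hits(genome_bin_text):
    """Sum the `times_gene_hit` column of a genome bin table given as TSV text."""
    reader = csv.DictReader(io.StringIO(genome_bin_text), delimiter="\t")
    return sum(int(row["times_gene_hit"]) for row in reader)


# Toy data, made up for illustration only (assumed layout, not real output).
HIT_TABLE = (
    "query_name\tgene_id\talign_category\tgene_dna\tgene_protein\n"
    "contig1_1\tunigene_A\tEXACT\tATG\tM\n"
    "contig1_2\tunigene_B\tSIMILAR\tATG\tM\n"
    "contig2_1\t\tNO MATCH\t\t\n"
)
GENOME_BIN = (
    "genome_bin\ttimes_gene_hit\n"
    "bin_X\t2\n"
    "bin_Y\t1\n"
    "bin_Z\t1\n"
)

if __name__ == "__main__":
    categories = count_categories(HIT_TABLE)
    print("genes queried:", sum(categories.values()))
    print("per category:", dict(categories))
    print("total genome bin hits:", total_bin_hits(GENOME_BIN))
```

In the toy data, three genes are queried but the genome bin counts sum to four, because one unigene is imagined to belong to more than one bin. This mirrors the note in the document that the totals need not match the number of genes queried.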
6b441c6
Making a copy of `output.md` (`docs/output.md` and `gmgc_finder/output.md`) is a bad idea. It will get out of sync very fast. Use links if you have to, or copy the file in `setup.py` or whatever, but there should be a single copy.

(If there are technical reasons why the above are a bad idea, at least add a check in the testing that the two versions of the file are identical :)
6b441c6
I do not know why, but when installing, it cannot find `output.md` when `setup.py` uses 'docs/output.md'...
Using `gmgc_finder/output.md` works.
6b441c6
See #11 for a still-not-great, but better approach. I still don't like it, but having two copies is a bad idea: it will almost always lead to errors later when someone updates one copy and not the other.