Showing 1 changed file with 62 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,62 @@ | ||
../docs/output.md | ||
# Output of GMGC-mapper | ||
|
||
Explanation of the files in the output directory | ||
|
||
## Prodigal output | ||
|
||
These three files are the output of Prodigal (produced only if GMGC-mapper was | ||
called in genome mode): | ||
|
||
- `prodigal_out.faa`: protein sequences | ||
- `prodigal_out.fna`: DNA sequences | ||
- `gene.coords.gbk`: gene information in GenBank format | ||
|
||
|
||
## Hit Table (`hit_table.tsv`) | ||
|
||
The results of the queries to the GMGC. | ||
|
||
There are five columns in the file. | ||
|
||
- `query_name`: the name/id of the input gene | ||
- `gene_id`: the unigene with the best score in the GMGC | ||
- `align_category`: there are four different classes of alignment (see below) | ||
- `gene_dna`: the DNA sequence of the best hit in GMGC | ||
- `gene_protein`: the protein sequence of the best hit in GMGC | ||
|
||
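As a rough illustration of how the five columns might be consumed, the sketch below parses a hit table with Python's standard `csv` module. It assumes the file is tab-separated and starts with a header row; the sample rows, the unigene names, and the layout of a `NO MATCH` row (empty `gene_id` and sequence fields) are made up for the example and may not match real output.

```python
import csv
import io
from collections import Counter

# Column names as listed in this document.
COLUMNS = ["query_name", "gene_id", "align_category", "gene_dna", "gene_protein"]


def read_hit_table(handle, has_header=True):
    """Yield one dict per row of a hit table, keyed by COLUMNS.

    Assumption: the table is tab-separated and (by default) has a header row.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row
    for row in reader:
        if row:  # ignore blank lines
            yield dict(zip(COLUMNS, row))


# Made-up example content; real files would be opened with open("hit_table.tsv").
example = (
    "query_name\tgene_id\talign_category\tgene_dna\tgene_protein\n"
    "gene_1\tunigene_A\tEXACT\tATGAAATAA\tMK\n"
    "gene_2\tunigene_B\tSIMILAR\tATGCCCTAA\tMP\n"
    "gene_3\t\tNO MATCH\t\t\n"
)

hits = list(read_hit_table(io.StringIO(example)))
category_counts = Counter(hit["align_category"] for hit in hits)
print(category_counts)
```

Counting the `align_category` column this way gives a quick per-class tally of the queries.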
### Alignment category | ||
|
||
- `EXACT`: at least 95% nucleotide identity with at least 95% coverage. Because | ||
unigenes in the GMGC represent clusters at 95% nucleotide identity (a | ||
species-level threshold), a query gene in this category would have clustered | ||
with the GMGC unigene. | ||
- `SIMILAR`: at least 80% amino acid identity with at least 80% coverage. | ||
- `MATCH`: at least 50% amino acid identity with at least 50% coverage. | ||
- `NO MATCH`: no match in GMGC. | ||
|
||
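The thresholds above can be read as a cascade from strictest to loosest. The function below is a minimal sketch of that reading, not GMGC-mapper's actual code: the function name and its four percentage arguments are hypothetical, and this document does not describe how identity and coverage are computed.

```python
def align_category(nt_identity, nt_coverage, aa_identity, aa_coverage):
    """Classify a best hit using the thresholds listed in this document.

    All arguments are percentages (0-100). The argument names and the
    strictest-first order of the checks are assumptions of this sketch.
    """
    if nt_identity >= 95 and nt_coverage >= 95:
        return "EXACT"
    if aa_identity >= 80 and aa_coverage >= 80:
        return "SIMILAR"
    if aa_identity >= 50 and aa_coverage >= 50:
        return "MATCH"
    return "NO MATCH"


# Made-up values for illustration.
print(align_category(97.0, 99.0, 98.0, 99.0))
print(align_category(82.0, 90.0, 85.0, 90.0))
print(align_category(60.0, 70.0, 55.0, 65.0))
print(align_category(30.0, 20.0, 25.0, 20.0))
```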
|
||
## Genome bins (`genome_bin.tsv`) | ||
|
||
Genome bins (MAGs) found in the results, together with a count of how many | ||
query genes hit each of them. | ||
|
||
There are two columns in the file. | ||
|
||
- `genome_bin`: the name of the genome bin in GMGC | ||
- `times_gene_hit`: the number of input genes that hit this bin | ||
|
||
Note that while not all GMGC unigenes are contained in a genome bin, some are | ||
contained in many. Thus, the total counts will not (except by coincidence) | ||
correspond to the number of genes queried. | ||
|
||
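A small worked example may make the counting clearer. The mapping from unigenes to genome bins below is entirely hypothetical (names and memberships are invented); it only illustrates why the `times_gene_hit` values need not add up to the number of queried genes.

```python
from collections import Counter

# Hypothetical unigene -> genome bin memberships: one unigene is in three
# bins, one is in none, and one is in a single bin.
unigene_to_bins = {
    "unigene_A": ["bin_1", "bin_2", "bin_3"],
    "unigene_B": [],
    "unigene_C": ["bin_1"],
}

# Best-hit unigene for each of three query genes (made up).
best_hits = ["unigene_A", "unigene_B", "unigene_C"]

# Each query gene adds one count to every bin containing its best hit.
times_gene_hit = Counter()
for unigene in best_hits:
    times_gene_hit.update(unigene_to_bins.get(unigene, []))

total_counts = sum(times_gene_hit.values())
print(dict(times_gene_hit))
print(total_counts, "counts for", len(best_hits), "query genes")
```

Here three query genes yield four counts in total, because one best hit belongs to three bins and another belongs to none.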
## Summary (`summary.txt` and `runlog.yaml`) | ||
|
||
The file `summary.txt` provides a human-readable summary of the results, while | ||
`runlog.yaml` is a summary of run metadata (as a YAML file, it is both machine- | ||
and human-readable). | ||
|
||
The file `summary.txt` should be reproducible: running GMGC-mapper twice on | ||
the same input should produce the same results. By design, though, | ||
`runlog.yaml` includes information, such as the time when the analysis was run, | ||
that is not reproducible. | ||
|
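One way to check this property is to compare the `summary.txt` files from two runs byte for byte. The sketch below uses Python's standard `filecmp` module; the helper name `same_summary`, the directory names, and the placeholder file contents are all invented for the demonstration and do not reflect the real file formats.

```python
import filecmp
import tempfile
from pathlib import Path


def same_summary(run_a, run_b):
    """Return True if the two output directories have identical summary.txt files."""
    return filecmp.cmp(
        Path(run_a) / "summary.txt",
        Path(run_b) / "summary.txt",
        shallow=False,  # compare file contents, not just size and mtime
    )


# Demonstration with two fake output directories (placeholder contents only).
with tempfile.TemporaryDirectory() as tmp:
    run_a = Path(tmp) / "run_a"
    run_b = Path(tmp) / "run_b"
    for run_dir, stamp in ((run_a, "first run"), (run_b, "second run")):
        run_dir.mkdir()
        (run_dir / "summary.txt").write_text("placeholder summary\n")
        (run_dir / "runlog.yaml").write_text(f"placeholder_time: {stamp}\n")

    summaries_match = same_summary(run_a, run_b)
    runlogs_match = filecmp.cmp(
        run_a / "runlog.yaml", run_b / "runlog.yaml", shallow=False
    )

print("summary.txt identical:", summaries_match)
print("runlog.yaml identical:", runlogs_match)
```

On real output, a mismatch in `summary.txt` between two runs on the same input would be worth investigating, whereas differences in `runlog.yaml` are expected.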