# Using GFF files to gather gene annotation statistics

Excelsior is the only de novo Augustus gene annotation.  The rest of the files are GeMoMa searches for matches to Excelsior genes.  That means:  
* Excelsior will be biased to have more genes than any other genome
* The GeMoMa annotations are possibly missing a lot of content that is unique to each genome
* This is most useful for consensus comparisons
* Data comes pre-built with gene families
* Excelsior should not be grouped into the analysis because its data source characteristics are fundamentally different

In [2]:
from glob import glob

glob('../gff_files_for_diploid_taxa/*')

['../gff_files_for_diploid_taxa\\FRAX01_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX03_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX04_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX05_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX06_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX07_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX08_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX09_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX10_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX11_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX12_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX13_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX14_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX15_predicted_annotation.gff',
 '../gff_files_for_diploid_taxa\\FRAX16_predicted_annotation.g

In [3]:
len('../gff_files_for_diploid_taxa\\')

30

### Line counts as a measure of gene annotations

In [13]:
import os

line_counts = {}
for filename in glob('../gff_files_for_diploid_taxa/*.gff'):
    with open(filename, 'r') as gff:
        # count only the first CDS entry ('_cds0') of each gene
        n_genes = gff.read().count('_cds0')
    # basename avoids hard-coded slice offsets and a stray path separator in the key
    name = os.path.basename(filename)
    line_counts[name] = n_genes
    print(name[:6], '\t{:,}'.format(n_genes))

FRAX01 	38,773
FRAX03 	38,722
FRAX04 	38,671
FRAX05 	38,690
FRAX06 	38,769
FRAX07 	38,721
FRAX08 	38,739
FRAX09 	38,716
FRAX10 	38,716
FRAX11 	38,658
FRAX12 	38,741
FRAX13 	38,712
FRAX14 	38,697
FRAX15 	38,757
FRAX16 	38,750
FRAX19 	38,684
FRAX20 	38,663
FRAX21 	38,680
FRAX23 	38,766
FRAX25 	38,689
FRAX26 	38,705
FRAX27 	38,654
FRAX28 	38,716
FRAX29 	38,741
FRAX30 	38,719
FRAX31 	38,639
FRAX32 	38,719
FRAX33 	38,705


Raw line counts include every exon CDS entry, so they are about 10x higher than the number of genes.  We need to count only once per group (the "_R0" entries?), as listed here:

```
FRAX03_contig_43335	GeMoMa	prediction	540	10056	0	0	0	ID=FRAEX38873_V2_000000010.1_R0	ref-gene=FRAEX38873_v2_000000010
FRAX03_contig_43335	GeMoMa	CDS	540	754	0	0	0	ID=FRAEX38873_V2_000000010.1_R0_cds0	Parent=FRAEX38873_V2_000000010.1_R0
FRAX03_contig_43335	GeMoMa	CDS	922	1008	0	0	1	ID=FRAEX38873_V2_000000010.1_R0_cds1	Parent=FRAEX38873_V2_000000010.1_R0
FRAX03_contig_43335	GeMoMa	CDS	5430	5475	0	0	1	ID=FRAEX38873_V2_000000010.1_R0_cds2	Parent=FRAEX38873_V2_000000010.1_R0
FRAX03_contig_43335	GeMoMa	CDS	6273	6349	0	0	0	ID=FRAEX38873_V2_000000010.1_R0_cds3	Parent=FRAEX38873_V2_000000010.1_R0
FRAX03_contig_43335	GeMoMa	CDS	7422	7584	0	0	1	ID=FRAEX38873_V2_000000010.1_R0_cds4	Parent=FRAEX38873_V2_000000010.1_R0
FRAX03_contig_43335	GeMoMa	CDS	8598	8666	0	0	0	ID=FRAEX38873_V2_000000010.1_R0_cds5	Parent=FRAEX38873_V2_000000010.1_R0
FRAX03_contig_43335	GeMoMa	CDS	9441	9520	0	0	0	ID=FRAEX38873_V2_000000010.1_R0_cds6	Parent=FRAEX38873_V2_000000010.1_R0
FRAX03_contig_43335	GeMoMa	CDS	10014	10056	0	0	1	ID=FRAEX38873_V2_000000010.1_R0_cds7	Parent=FRAEX38873_V2_000000010.1_R0
FRAX03_contig_71438	GeMoMa	prediction	3	2458	0	0	0	ID=FRAEX38873_V2_000000020.1_R0	ref-gene=FRAEX38873_v2_000000020
FRAX03_contig_71438	GeMoMa	CDS	3	145	0	0	0	ID=FRAEX38873_V2_000000020.1_R0_cds0	Parent=FRAEX38873_V2_000000020.1_R0
FRAX03_contig_71438	GeMoMa	CDS	656	860	0	0	1	ID=FRAEX38873_V2_000000020.1_R0_cds1	Parent=FRAEX38873_V2_000000020.1_R0
```
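One way to count once per group is to look at the feature column instead of the ID text. The sketch below is only an illustration: the function name `count_predictions_and_cds` is hypothetical, and it assumes every row is whitespace-separated with the feature type in the third column and the attributes from the ninth column onward, as in the excerpt above.

```python
from collections import Counter


def count_predictions_and_cds(gff_lines):
    """Count 'prediction' rows and tally CDS rows per Parent ID."""
    n_predictions = 0
    cds_per_parent = Counter()
    for line in gff_lines:
        fields = line.split()
        if line.startswith('#') or len(fields) < 9:
            continue  # skip comments, blank lines and malformed rows
        feature = fields[2]
        if feature == 'prediction':
            n_predictions += 1
        elif feature == 'CDS':
            # assumption: attributes are separate whitespace-delimited fields
            for attr in fields[8:]:
                if attr.startswith('Parent='):
                    cds_per_parent[attr[len('Parent='):]] += 1
    return n_predictions, cds_per_parent


# Three rows copied from the FRAX03 excerpt above
sample_lines = [
    "FRAX03_contig_71438\tGeMoMa\tprediction\t3\t2458\t0\t0\t0\t"
    "ID=FRAEX38873_V2_000000020.1_R0\tref-gene=FRAEX38873_v2_000000020",
    "FRAX03_contig_71438\tGeMoMa\tCDS\t3\t145\t0\t0\t0\t"
    "ID=FRAEX38873_V2_000000020.1_R0_cds0\tParent=FRAEX38873_V2_000000020.1_R0",
    "FRAX03_contig_71438\tGeMoMa\tCDS\t656\t860\t0\t0\t1\t"
    "ID=FRAEX38873_V2_000000020.1_R0_cds1\tParent=FRAEX38873_V2_000000020.1_R0",
]

n_predictions, cds_per_parent = count_predictions_and_cds(sample_lines)
print(n_predictions, dict(cds_per_parent))
```

Passing an open file handle instead of `sample_lines` should work the same way, since the function only iterates over lines. The per-parent tally also gives the number of CDS entries per gene, which would show whether the "10x" ratio holds.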

Is there ever a case where there's a plain "_R0" entry and then no "_cds0"?  A regular-expression-style count of the "_R0" entries came up with 38,722 for FRAX03, which agrees exactly with the "_cds0" count above.
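The question can also be checked directly rather than by comparing totals. This is a minimal sketch, not the regular expression used for the count above (which is not shown): the patterns and the function name `predictions_without_first_cds` are assumptions, and they rely on prediction IDs ending in `_R` plus digits and first CDS IDs ending in `_cds0`, as in the excerpt.

```python
import re

# Assumed ID shapes, based on the excerpt:
#   prediction row: ID=<name>_R<digits>
#   first CDS row:  ID=<name>_R<digits>_cds0
PREDICTION_ID = re.compile(r'\bID=(\S+_R\d+)(?=\s|$)')
FIRST_CDS_ID = re.compile(r'\bID=(\S+_R\d+)_cds0(?=\s|$)')


def predictions_without_first_cds(gff_text):
    """Return the prediction IDs that have no matching '_cds0' entry."""
    predictions = set(PREDICTION_ID.findall(gff_text))
    with_cds0 = set(FIRST_CDS_ID.findall(gff_text))
    return predictions - with_cds0


# Two rows copied from the FRAX03 excerpt
gff_text = (
    "FRAX03_contig_71438\tGeMoMa\tprediction\t3\t2458\t0\t0\t0\t"
    "ID=FRAEX38873_V2_000000020.1_R0\tref-gene=FRAEX38873_v2_000000020\n"
    "FRAX03_contig_71438\tGeMoMa\tCDS\t3\t145\t0\t0\t0\t"
    "ID=FRAEX38873_V2_000000020.1_R0_cds0\tParent=FRAEX38873_V2_000000020.1_R0\n"
)

missing = predictions_without_first_cds(gff_text)
print(len(missing))
```

An empty result for a whole file would mean every "_R0" entry has a "_cds0" entry, so counting `_cds0` would be a safe proxy for counting genes in that file.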