Columns within chromosomal_features.tab:

1.  Feature name (mandatory); this is the primary systematic name, if available
2.  Gene name (locus name)
3.  Aliases (multiples separated by |)
4.  Feature type
5.  Chromosome
6.  Start Coordinate
7.  Stop Coordinate
8.  Strand (W = Watson/forward, C = Crick/reverse)
9.  Primary CGDID
10. Secondary CGDID (if any)
11. Description
12. Date Created
13. Sequence Coordinate Version Date (if any)
14. Blank
15. Blank
16. Date of gene name reservation (if any)
17. Has the reserved gene name become the standard name? (Y/N)
18. Name of S. cerevisiae ortholog(s) (multiples separated by |)
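
The column list above can be mirrored in code so that fields are addressed by name rather than by position. The following is a minimal sketch, not part of the original notebook: the names `COLUMNS` and `parse_feature_line` are hypothetical, and it assumes that any missing columns are trailing ones. The example line is the `CAGL0A01650g` entry from the annotation file.

```python
# Hypothetical helper for illustration; column names are shortened
# versions of the 18 column descriptions listed above.
COLUMNS = [
    'feature', 'locus', 'aliases', 'feature_type', 'chrom', 'start', 'stop',
    'strand', 'primary_id', 'secondary_id', 'description', 'date_created',
    'coord_version_date', 'blank1', 'blank2', 'reserved_date',
    'reserved_is_standard', 'orthologs',
]


def parse_feature_line(line):
    """Return a dict mapping column names to values for one data line."""
    # Remove only the line ending so that empty trailing columns survive.
    fields = line.rstrip('\r\n').split('\t')
    # Assumption: a short line is only missing trailing columns, so pad it.
    fields += [''] * (len(COLUMNS) - len(fields))
    if len(fields) != len(COLUMNS):
        raise ValueError(f'Expected {len(COLUMNS)} fields, got {len(fields)}')
    record = dict(zip(COLUMNS, fields))
    # Multi-valued columns are separated by '|'.
    for key in ('aliases', 'orthologs'):
        record[key] = record[key].split('|') if record[key] else []
    return record


example = (
    'CAGL0A01650g\t\tCAGL-IPF3019|CAG57730.1\tORF|Uncharacterized\t'
    'ChrA_C_glabrata_CBS138\t163770\t164126\tW\tCAL0126725\t\t'
    'Putative protein; gene is upregulated in azole-resistant strain\t'
    '2010-10-20\t2010-10-20\t\t\t\tN\t'
)
record = parse_feature_line(example)
print(record['feature'], record['chrom'], record['strand'], record['aliases'])
```

Stripping only the line ending, rather than all surrounding whitespace, keeps the empty ortholog column in place, so every line yields the same number of fields.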


In [1]:
# annotation file downloaded from the Candida Genome Database (CGD):
file = '/data/genomes/yeast/C.glabrata/C_glabrata_CBS138_version_s02-m02-r03_chromosomal_feature.tab'
!head {file}

! File name: C_glabrata_CBS138_version_s02-m02-r03_chromosomal_feature.tab
! Organism: Candida glabrata CBS138
! Genome version: s02-m02-r03
! Date created: Sun Jan 20 07:11:54 2013
! Created by: The Candida Genome Database (http://www.candidagenome.org/)
! Contact Email: candida-curator AT lists DOT stanford DOT edu
! Funding: NIDCR at US NIH, grant number 1-R01-DE015873-01
!
CAGL0E06138g		CAGL-IPF6010|CAGL-CDS0046.1|CAG58875.1	ORF|Uncharacterized	ChrE_C_glabrata_CBS138	617486	611835	C	CAL0129056		Ortholog(s) have 3-oxoacyl-[acyl-carrier-protein] reductase (NADPH) activity, 3-oxoacyl-[acyl-carrier-protein] synthase activity, fatty acid synthase activity, holo-[acyl-carrier-protein] synthase activity	2010-10-20	2010-10-20				N	FAS2
CAGL0E03223g		CAGL-IPF2421|CAGL-CDS4017.1|CAG58748.1	ORF|Uncharacterized	ChrE_C_glabrata_CBS138	298783	298037	C	CAL0129058		Ortholog(s) have role in U4 snRNA 3'-end processing, exonucleolytic trimming to generate mature 3'-end of 5.8S rRNA from tricistronic 

In [2]:
# find entry for regulated gene in that file
!grep CAGL0A01650g {file}

CAGL0A01650g		CAGL-IPF3019|CAG57730.1	ORF|Uncharacterized	ChrA_C_glabrata_CBS138	163770	164126	W	CAL0126725		Putative protein; gene is upregulated in azole-resistant strain	2010-10-20	2010-10-20				N	


In [3]:
# open filehandles for reading and writing to new file
fh = open(file, 'r')
fout = open('C_glabrata.gff3', 'w')

for line in fh :
    
    # skip all header lines
    if line.startswith('!') :
        continue
        
    # split line into tab-separated fields (strip() drops trailing empty columns, so the count varies)
    parts = line.strip().split('\t')
    
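    # assign elements to variables; optional fields default to empty strings
    # so that locus and ortho never carry over from a previous line
    locus = ortho = ''
    if len(parts) == 16 :
        (feature, alias, feature_type, chrom, start, stop, strand, primary, secondary, description, date, coord, blank1, blank2, reserve, standard) = parts
    elif len(parts) == 17 :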
        (feature, locus, alias, feature_type, chrom, start, stop, strand, primary, secondary, description, date, coord, blank1, blank2, reserve, standard) = parts
    elif len(parts) == 18 :
        (feature, locus, alias, feature_type, chrom, start, stop, strand, primary, secondary, description, date, coord, blank1, blank2, reserve, standard, ortho) = parts
    else :
        print(f'Unusual number of fields: {len(parts)} in {line.strip()}')
        continue
        
    # simplify chromosome name
    chrom_parts = chrom.split('_')
    chrom = chrom_parts[0].replace('Chr', 'chr')
    
    # change C/W strands into -/+ notation
    if strand == 'C' :
        strand = '-'
        # reverse coordinates for genes on minus strand
        (start, stop) = (stop, start)
    else :
        strand = '+'

    # Exercise:
    # add description as Note and replace special characters with %-notation
    description = description.replace('%','%25')
    description = description.replace(';','%3B')
    description = description.replace('=','%3D')
    description = description.replace(',','%2C')


    out = '\t'.join([chrom, 'CandidaDB', feature_type, start, stop, '.', strand, '.', f'ID={feature};Name={locus};Note={description};ortholog={ortho}'])
    fout.write(out+'\n')
    
fh.close()
fout.close()
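
The escaping step in the loop above can also be packaged as a small function. This is a sketch under stated assumptions, not the notebook's method: `gff3_escape` and `gff3_attributes` are hypothetical names, the `&` and tab escapes are additions beyond the four replacements used above, and skipping empty values is a design choice that avoids writing `Name=` or `ortholog=` with nothing after the equals sign.

```python
# Hypothetical helpers for illustration.
# Order matters: '%' is escaped first so that the '%' characters introduced
# by the later replacements are not escaped a second time.
GFF3_ESCAPES = [
    ('%', '%25'),
    (';', '%3B'),
    ('=', '%3D'),
    ('&', '%26'),   # addition: not escaped in the loop above
    (',', '%2C'),
    ('\t', '%09'),  # addition: not escaped in the loop above
]


def gff3_escape(value):
    """Replace reserved characters in an attribute value with %-notation."""
    for char, code in GFF3_ESCAPES:
        value = value.replace(char, code)
    return value


def gff3_attributes(pairs):
    """Join (tag, value) pairs into an attribute string, skipping empty values."""
    return ';'.join(f'{tag}={gff3_escape(value)}' for tag, value in pairs if value)


attrs = gff3_attributes([
    ('ID', 'CAGL0A01650g'),
    ('Name', ''),
    ('Note', 'Putative protein; gene is upregulated in azole-resistant strain'),
    ('ortholog', ''),
])
print(attrs)
# ID=CAGL0A01650g;Note=Putative protein%3B gene is upregulated in azole-resistant strain
```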

In [4]:
!ls -lh *.gff3


-rw-r--r-- 1 boycem4 GEU3302521 1.2M Jan  6 18:11 C_glabrata.gff3


In [5]:
!wc *.gff

wc: '*.gff': No such file or directory


In [6]:
!wc /data/genomes/yeast/C.glabrata/C_glabrata_CBS138_version_s02-m02-r03_chromosomal_feature.tab

   5502  142705 1501089 /data/genomes/yeast/C.glabrata/C_glabrata_CBS138_version_s02-m02-r03_chromosomal_feature.tab
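
The line count above includes the `!` header lines, which the conversion skips. One way to check the result is to count only the data lines of the input and compare that with the number of lines in the GFF3 file. The sketch below uses a hypothetical helper, `count_lines`, and demonstrates it on a small temporary file rather than on the real data. If the eight `!` lines shown by `head` are the only header lines, 5502 - 8 = 5494 feature lines would be expected in the output, minus any lines reported as having an unusual number of fields.

```python
import os
import tempfile


def count_lines(path, skip_prefix=None):
    """Count lines in a file, optionally ignoring lines that start with skip_prefix."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if skip_prefix and line.startswith(skip_prefix):
                continue
            count += 1
    return count


# Self-contained demonstration on a temporary file with two header lines
# and two data lines (made-up content, not taken from the annotation file).
with tempfile.NamedTemporaryFile('w', suffix='.tab', delete=False) as tmp:
    tmp.write('! header line\n! another header line\nfeature1\tx\nfeature2\ty\n')
    demo_path = tmp.name

total = count_lines(demo_path)
data = count_lines(demo_path, skip_prefix='!')
os.remove(demo_path)
print(total, data)  # 4 2
```

In the notebook itself, `count_lines(file, '!')` could be compared with `count_lines('C_glabrata.gff3')`.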


In [7]:
# the conversion in In [3] creates a file called C_glabrata.gff3, so the *.gff glob in In [5] matches nothing; use *.gff3 instead