In [1]:
import numpy as np
import pandas as pd

# GRCh37

1.) Download Ensembl transcript annotations from [UCSC Table Browser](http://rohsdb.cmb.usc.edu/GBshape/cgi-bin/hgTables?hgsid=3960312_ZMTtI4bvavkuiWrNuR3OxAWB52dn&clade=mammal&org=Human&db=hg19&hgta_group=genes&hgta_track=ensGene&hgta_table=0&hgta_regionType=genome&position=chr21%3A33031597-33041570&hgta_outputType=primaryTable&hgta_outFileName=) using settings specified in the [wiki](https://github.com/keoughkath/ExcisionFinder/wiki/Get-gene-annotations) and name the resulting file ensembl_ucsc_output_grch37.tsv. Alternatively, get this file [here](http://lighthouse.ucsf.edu/public_files_no_password/excisionFinderData_public/gene_annots/).

2.) Download mappings from Ensembl gene IDs to gene symbols from the [HUGO website](https://www.genenames.org/cgi-bin/download) with boxes checked for "Approved Symbol" and "Ensembl Gene ID". Name the resulting file gene_to_id_ensembl.tsv.

Files from steps 1 and 2 will need to be in your current directory for the notebook to run correctly.
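
Before running the cells below, it can help to confirm that both input files are where the notebook expects them. The following is a minimal sketch, assuming the file names from steps 1 and 2; `missing_inputs` is a hypothetical helper written for illustration and is not part of ExcisionFinder.

```python
from pathlib import Path

# File names taken from steps 1 and 2 above.
REQUIRED_GRCH37 = ['ensembl_ucsc_output_grch37.tsv', 'gene_to_id_ensembl.tsv']

def missing_inputs(filenames, directory='.'):
    """Return the names in `filenames` that are not files in `directory`."""
    return [f for f in filenames if not (Path(directory) / f).is_file()]

missing = missing_inputs(REQUIRED_GRCH37)
if missing:
    print('Missing input files:', ', '.join(missing))
```

The same helper could be called with the GRCh38 file names if you are building that annotation set instead.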

# GRCh38

1.) Download NCBI RefSeq annotations from the UCSC Table Browser and name the resulting file ncbi_ucsc_output_grch38.tsv. Alternatively, get this file [here](http://lighthouse.ucsf.edu/public_files_no_password/excisionFinderData_public/gene_annots/).

2.) Download mappings from NCBI RefSeq gene IDs to gene symbols from the [HUGO website](https://www.genenames.org/cgi-bin/download) with boxes checked for "Approved Symbol" and "RefSeq IDs". Name the resulting file gene_to_id_refseq.tsv.

These steps are not required to use ExcisionFinder. You can simply use the pre-generated gene annotation files "gene_annots_wsize_grch37.tsv" or "gene_annots_wsize_grch38.tsv" provided [here](http://lighthouse.ucsf.edu/public_files_no_password/excisionFinderData_public/gene_annots/). The steps are documented here for reproducibility, and so that users can generate their own gene annotations to analyze other reference genomes.

In [12]:
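def get_gene_annots(ensembl, out, name_mapping=None):
    """Write a gene annotation table with an added `size` column to `out`."""
    gene_df = pd.read_csv(ensembl, sep='\t',
                          usecols=['name', 'chrom', 'txStart', 'txEnd',
                                   'cdsStart', 'cdsEnd', 'exonStarts',
                                   'exonEnds', 'name2'])
    if name_mapping:
        # Rename by name rather than by position: read_csv ignores the
        # order of the usecols list and keeps the file's column order.
        gene_df = gene_df.rename(columns={'name2': 'external_id'})
        # The mapping file is expected to have two columns: symbol, then ID.
        gene_to_id = pd.read_csv(name_mapping, sep='\t', header=0,
                                 names=['gene_name', 'external_id'])
        # Left merge keeps transcripts whose ID has no approved symbol.
        gene_df = gene_df.merge(gene_to_id, how='left', on='external_id')
    # Genomic span of each transcript (txEnd - txStart).
    gene_df['size'] = gene_df['txEnd'] - gene_df['txStart']
    gene_df.to_csv(out, sep='\t', index=False)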
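
To make the transformation concrete, here is a small sketch of the merge-and-size logic applied to in-memory tables instead of files. All rows, IDs, and symbols below are invented placeholders for illustration; only the column names come from the function above. The sketch renames `name2` by name and merges explicitly on `external_id`.

```python
import io
import pandas as pd

# Invented two-row stand-in for the UCSC table (tab-separated).
ucsc_tsv = (
    "name\tchrom\ttxStart\ttxEnd\tcdsStart\tcdsEnd\texonStarts\texonEnds\tname2\n"
    "TX_A\tchr1\t100\t600\t150\t550\t100,400,\t200,600,\tGENE_ID_A\n"
    "TX_B\tchr2\t1000\t1250\t1000\t1250\t1000,\t1250,\tGENE_ID_B\n"
)
# Invented stand-in for the symbol mapping; GENE_ID_B is deliberately absent.
mapping_tsv = "Approved symbol\tEnsembl gene ID\nSYMBOL_A\tGENE_ID_A\n"

gene_df = pd.read_csv(io.StringIO(ucsc_tsv), sep='\t')
gene_df = gene_df.rename(columns={'name2': 'external_id'})
gene_to_id = pd.read_csv(io.StringIO(mapping_tsv), sep='\t', header=0,
                         names=['gene_name', 'external_id'])

# Left merge keeps every transcript; unmapped IDs get NaN in gene_name.
gene_df = gene_df.merge(gene_to_id, how='left', on='external_id')
gene_df['size'] = gene_df['txEnd'] - gene_df['txStart']
print(gene_df[['name', 'external_id', 'gene_name', 'size']])
```

Because the merge is a left join, a transcript whose ID is missing from the mapping file is kept in the output with an empty `gene_name` rather than being dropped.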

# GRCh37

In [3]:
get_gene_annots('ensembl_ucsc_output_grch37.tsv', 'gene_annots_wsize_grch37.tsv', 'gene_to_id_ensembl.tsv')

# GRCh38


GRCh38 is not fully supported through the UCSC Genome Browser for Ensembl transcript annotations. A good alternative is to use the NCBI RefSeq annotations, which can also be downloaded from the [UCSC Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=658869521_a49RLfkl8V8eYzEFY5hvCyvezAg0&clade=mammal&org=Human&db=hg38&hgta_group=genes&hgta_track=refSeqComposite&hgta_table=ncbiRefSeq&hgta_regionType=range&position=chr1%3A11102837-11267747&hgta_outputType=primaryTable&hgta_outFileName=ncbi_ucsc_output_grch38.tsv). The NCBI RefSeq annotations contain HUGO approved symbols in the 'name2' column, so you can skip passing the name_mapping file.

In [13]:
get_gene_annots('ncbi_ucsc_output_grch38.tsv', 'gene_annots_wsize_grch38.tsv')
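
Once an output file has been written, a quick consistency check on the `size` column is possible. The following is a minimal sketch using only the standard library; `check_sizes` is a hypothetical helper for illustration, and the rows in the example are invented (the second one is deliberately wrong).

```python
import csv
import io

def check_sizes(handle):
    """Return names of rows whose size differs from txEnd - txStart."""
    reader = csv.DictReader(handle, delimiter='\t')
    return [row['name'] for row in reader
            if int(row['size']) != int(row['txEnd']) - int(row['txStart'])]

# Invented example rows; with a real file one could instead use:
#   with open('gene_annots_wsize_grch38.tsv', newline='') as fh:
#       print(check_sizes(fh))
example = io.StringIO(
    "name\ttxStart\ttxEnd\tsize\n"
    "TX_A\t100\t600\t500\n"
    "TX_B\t1000\t1250\t200\n"
)
print(check_sizes(example))  # prints ['TX_B']
```

An empty list means every row's `size` agrees with its transcript coordinates.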