## Tutorial: CTCF motif annotation with bioframe and pygenlib

In this example we use pygenlib to instantiate a human transcriptome (canonical protein-coding transcripts on chromosome 2 only), download a CTCF motif dataset with bioframe, and annotate all transcripts with those motifs. We then report all genes/transcripts with at least one overlapping CTCF motif.
This example demonstrates how the two libraries can be used together.

We demonstrate two annotation scenarios and compare their runtimes.

Required test resources
* gencode_39 GFF
* JASPAR CTCF sites

First, we load the required libraries.

In [10]:
# set path and load pygenlib
import os, pathlib, platform, sys  # sys is only needed for the optional pip command below
PYGENLIB_SRC=pathlib.Path('/Users/niko/projects/pygenlib/') 
os.chdir(PYGENLIB_SRC)
# install libraries. Recommended to run in a venv here!
#!{sys.executable} -m pip install -r requirements.txt 
display(f"Running pygenlib on python {platform.python_version()}. Using pygenlib code from {PYGENLIB_SRC}")
# load pygenlib (the alias pg is used in the annotation cell further down)
import pygenlib
import pygenlib as pg
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import ssl
import bioframe

'Running pygenlib on python 3.10.4. Using pygenlib code from /Users/niko/projects/pygenlib'

Next, we download the GENCODE 39 annotations.
NOTE that this requires bedtools, samtools and htslib (bgzip, tabix) to be installed.
The total size of the downloaded data (for all tutorials) is ~150 MB. Files are downloaded only if they do not already exist in the `notebooks/large_test_resources/` directory.

In [11]:
import traceback
from pygenlib.testdata import download_bgzip_slice
outdir=PYGENLIB_SRC / 'notebooks/large_test_resources' # update to your preferred location
large_test_resources = {
    "outdir": f"{outdir}", # update to your preferred location
    "resources": {
        # -------------- Full gencode39 annotation -------------------------------
        "full_gencode_gff": {
            "uri": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.annotation.gff3.gz",
            "filename": "gencode_39.gff3.gz",
            "recreate": False
        }   
    }
}
display(f'Downloading test data files to {outdir}')
for resname in large_test_resources['resources']:
    try:
        download_bgzip_slice(large_test_resources, resname, view_tempdir=False)
    except Exception:
        display(traceback.format_exc())
        display(f"Error creating resource {resname}. Some tests may not work...")
display("All done.")

'Downloading test data files to /Users/niko/projects/pygenlib/notebooks/large_test_resources'

Creating testdataset full_gencode_gff
Resource already exists, skipping...


'All done.'

Now, we use bioframe to read a set of CTCF binding sites from JASPAR and keep only sites on chromosome 2 whose `pval` column exceeds 500.
NOTE that this cell might fail with a timeout error (as in the output shown below) if the JASPAR data is not accessible.
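
The `pval > 500` threshold is easier to interpret on the p-value scale. The sketch below assumes, following our reading of the UCSC JASPAR track description, that the `pval` column stores -log10(p) × 100; please verify this against the track documentation for your JASPAR release. The helper name `scaled_to_pvalue` is hypothetical and not part of bioframe or JASPAR.

```python
def scaled_to_pvalue(scaled: float) -> float:
    """Convert a scaled score (ASSUMED to be -log10(p) * 100) back to a p-value."""
    return 10 ** (-scaled / 100)

# Under this assumption, filtering for pval > 500 keeps motif hits with p < 1e-5
threshold_p = scaled_to_pvalue(500)
print(f"{threshold_p:.0e}")
```

If the assumption holds, a stricter filter such as `pval > 700` would correspond to p < 1e-7.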

In [13]:
# see https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=Grch38&g=jaspar
jaspar_uri='http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/hg38/MA0139.1.tsv.gz'
ssl._create_default_https_context = ssl._create_unverified_context # to avoid invalid certificate problems
ctcf = bioframe.read_table(jaspar_uri, schema='jaspar').query("chrom=='chr2' & pval>500")

URLError: <urlopen error [Errno 60] Operation timed out>

Now, we annotate transcripts with JASPAR motifs. We demonstrate two different approaches:

- approach1
    - we instantiate a filtered pygenlib transcriptome
    - we then annotate with the JASPAR motifs loaded via bioframe
    - for annotation we use a (very flexible) custom callback method
    - finally, we build a pandas dataframe of results with one row per transcript
- approach2
    - we read the GFF with bioframe and correct the start coordinates, as bioframe expects 0-based coordinates
    - we overlap transcripts and CTCF sites with the bioframe overlap method (NOTE that we cannot use an inner join as it would change the order of the CTCF sites)
    - we create a results dataframe while filtering for the respective transcripts. This is done based on GFF attributes parsed from the 'attributes' column.
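
The coordinate correction in approach2 matters because GFF intervals are 1-based with inclusive ends, while BED-like intervals are 0-based and half-open. The following is a minimal, library-free sketch of that conversion and of the resulting overlap test; `Interval`, `from_gff` and `overlaps` are illustrative names and not part of bioframe or pygenlib.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive

def from_gff(chrom: str, start: int, end: int) -> Interval:
    """GFF is 1-based with inclusive ends, so only the start is shifted."""
    return Interval(chrom, start - 1, end)

def overlaps(a: Interval, b: Interval) -> bool:
    """Half-open intervals overlap if each one starts before the other ends."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end

tx = from_gff("chr2", 101, 200)         # GFF feature covering bases 101..200
motif_in = Interval("chr2", 199, 218)   # first base is 200, the last transcript base
motif_out = Interval("chr2", 200, 219)  # first base is 201, just past the transcript
print(overlaps(tx, motif_in), overlaps(tx, motif_out))  # True False
```

Without the `start - 1` shift, features that touch only the first base of a transcript would be missed.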

In [None]:
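# NOTE: this cell assumes that get_resource, Transcriptome, BioframeIterator, Timer, gi and plot_times
# have been imported from pygenlib and that pygenlib itself is imported as pg
from collections import Counter

# gencode annotation GFF
gencode_gff=get_resource("full_gencode_gff", conf=large_test_resources) # see download section above

times, results=Counter(),Counter()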
with Timer(times, 'hybrid_approach1') as timer:
    # Build subset of human transcriptome (canonical protein coding genes on chrom 2)
    t=Transcriptome({
        'annotation_gff': gencode_gff,
        'annotation_flavour': 'gencode',
        'transcript_filter': {
            'included_tags': ['Ensembl_canonical'],
            'included_genetypes': ['protein_coding'],
            'included_chrom': ['chr2']
        }
    })
    # custom annotation method
    def anno_ctcf(item, label='ctcf_scores'):
        """ 
            Callback method for annotating transcripts.

            item: a (loc, (anno, dfrows)) tuple with
              loc: genomic interval of the feature that is annotated
              anno: the transcriptome anno dict for this feature (so you can also access any other already existing annotations for this feature)
              dfrows: list of (loc, row) tuples containing all overlapping locations (loc) and the respective dataframe rows

            This method adds an annotation (default label: 'ctcf_scores') consisting of a list of (location, score) tuples for the motifs that overlap the annotated transcript
        """
        loc, (anno, dfrows) = item
        if label not in anno:
            anno[label]=[]
        for sloc,dfrow in dfrows:
            anno[label].append((sloc,dfrow.score)) # add ctcf motif annotations
    # annotate all transcripts using the anno_ctcf method defined above
    display("Annotating transcriptome")     
    t.annotate(iterators=BioframeIterator(ctcf), 
               fun_anno=anno_ctcf, 
               feature_types=['transcript'])
    # build dataframe of transcripts with at least one overlapping CTCF peak
    results[timer.name]= pd.DataFrame([(tx.parent.gene_name, 
                                        tx.feature_id,
                                        str(tx.location),
                                        ','.join([str(l) for l,_ in tx.ctcf_scores]),
                                        ','.join([str(s) for _,s in tx.ctcf_scores])) for tx in t.transcript.values() if len(tx.ctcf_scores)>0], 
                                      columns=['gene','tid','location','ctcf_locations','ctcf_scores'])

    
with Timer(times, 'hybrid_approach2') as timer:
    """ Here, we read the gencode annotation data via bioframe and correct the start coordinates as required.
        We then overlap with the ctcf sites and iterate over the resulting dataframe, parse the GFF attributes and
        filter for canonical, protein-coding genes. Finally, we build the result pandas dataframe.
    """
    tx = bioframe.read_table(gencode_gff, schema='gff', comment='#').query("chrom=='chr2' & feature=='transcript'")
    tx['start']=tx['start']-1 # bioframe assumes 0-based start coordinates and does not convert the 1-based GFF starts, so we correct them here
    over=bioframe.overlap(tx, ctcf, suffixes=('','_ctcf'), how='left', return_index=True) # NOTE: cannot use inner here as it will change order :/
    data={'gene':[], 'tid':[], 'location':[], 'ctcf_locations':[], 'ctcf_scores':[]}
    for loc, row in BioframeIterator(over):
        info = pg.parse_gff_attributes(row.attributes) # parse gff attributes from attributes column.
        if not info.get('gene_type', 'NA') == 'protein_coding':
            continue
        if 'Ensembl_canonical' not in info.get('tag','NA'):
            continue
        if pd.isnull(row.start_ctcf): # filter tx w/o ctcf site
            continue
        data['gene'].append(info['gene_name'])
        data['tid'].append(info['transcript_id'])
        data['location'].append(str(loc))
        ctcf_loc = gi(row.chrom_ctcf, row.start_ctcf+1, row.end_ctcf, strand=row.strand_ctcf) # convert back to a 1-based start
        data['ctcf_locations'].append(str(ctcf_loc))
        data['ctcf_scores'].append(str(int(row.score_ctcf)))
    df = pd.DataFrame.from_dict(data) # create dataframe
    df = df.groupby(['gene','tid','location'], sort=False).agg(','.join).reset_index() # collapse rows while keeping sort order
    results[timer.name]=df


# assert that we get same results
assert len(results['hybrid_approach1'].compare(results['hybrid_approach2']).index)==0

plot_times("Collecting overlapping ctcf motifs",
       times, reference_method='hybrid_approach1', show_speed=False)
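
The attribute-based filtering in approach2 can be illustrated without either library. The sketch below is a simplified stand-in for GFF3 attribute parsing and is not the implementation of `pg.parse_gff_attributes`; the helper name `parse_gff3_attributes` and the attribute string are made up for illustration.

```python
def parse_gff3_attributes(attributes: str) -> dict:
    """Split a GFF3 attribute string into a key -> value dict (hypothetical helper)."""
    info = {}
    for field in attributes.strip().strip(";").split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        info[key.strip()] = value.strip()
    return info

# made-up attribute string in GENCODE GFF3 style (tags are comma-separated)
attrs = "ID=tx1;gene_type=protein_coding;gene_name=GENE1;tag=basic,Ensembl_canonical"
info = parse_gff3_attributes(attrs)
is_canonical_pc = (
    info.get("gene_type") == "protein_coding"
    and "Ensembl_canonical" in info.get("tag", "").split(",")
)
print(info["gene_name"], is_canonical_pc)  # GENE1 True
```

Splitting the tag value on commas tests for an exact tag match, whereas the substring test used in the cell above would also accept any tag that merely contains the text `Ensembl_canonical`.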

Approach1 is faster (probably due to early filtering) and, in our opinion, more flexible.
In the callback annotation method, users have access to the genomic locations and all data attributes of the transcripts and the overlapping CTCF motifs, as well as to external data.
They are also free to choose the annotation data structures they create (here, a list of location/score tuples). Together, this enables the implementation of complex annotation scenarios.
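
As an illustration of this flexibility, here is a sketch of an alternative callback that keeps only the best-scoring motif per transcript. It assumes the same `(loc, (anno, dfrows))` item layout as `anno_ctcf` above; the name `anno_best_ctcf`, the `best_ctcf` label and the simulated input (plain strings instead of pygenlib locations, a namedtuple instead of a dataframe row) are hypothetical.

```python
from collections import namedtuple

Row = namedtuple("Row", ["score"])  # stand-in for a dataframe row

def anno_best_ctcf(item, label="best_ctcf"):
    """Store only the highest-scoring overlapping motif as a (location, score) tuple."""
    loc, (anno, dfrows) = item
    for sloc, dfrow in dfrows:
        if label not in anno or dfrow.score > anno[label][1]:
            anno[label] = (sloc, dfrow.score)

# Simulated call with two overlapping motifs
anno = {}
item = ("chr2:100-200", (anno, [("chr2:110-128", Row(812)), ("chr2:150-168", Row(905))]))
anno_best_ctcf(item)
print(anno)  # {'best_ctcf': ('chr2:150-168', 905)}
```

Under the stated assumption, such a function could be passed as `fun_anno` to `t.annotate` in place of `anno_ctcf`.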