# Annotate RegionDS

Following DMR calling (or for any other RegionDS built from DMRs or other genomic region sets), we can annotate the regions with additional genomic annotations stored in BigWig or BED format.


## Import

In [2]:
import pandas as pd
import pathlib
from ALLCools.mcds import RegionDS

## Open RegionDS

In [5]:
dmr_ds = RegionDS.open('HIP_small')
dmr_ds

Using dmr as region_dim


In [7]:
RegionDS.open('HIP_small/dms/', region_dim='dms')

## DMR Chromatin Accessibility Profile (BigWig)

For example, here we annotate the DMRs with cluster-matched chromatin accessibility profiles from a mouse hippocampus snATAC-seq dataset. Each profile is stored in BigWig format.

In [3]:
# prepare the BigWig table (comma-separated, no header): first column is the cluster name, second column is the BigWig path
bigwig_dir = '/gale/netapp/cemba3c/projects/ALLCools/HIPBulk/atac_bulk/'
bigwigs = pd.Series({
    p.name.split('.')[0].split('_')[-1]: str(p)
    for p in pathlib.Path(bigwig_dir).glob('HIP_snATAC_*.bw')
})
bigwigs.to_csv('bigwig.csv', header=False)
bigwigs

CA23    /gale/netapp/cemba3c/projects/ALLCools/HIPBulk...
CGE     /gale/netapp/cemba3c/projects/ALLCools/HIPBulk...
ASC     /gale/netapp/cemba3c/projects/ALLCools/HIPBulk...
MGE     /gale/netapp/cemba3c/projects/ALLCools/HIPBulk...
CA1     /gale/netapp/cemba3c/projects/ALLCools/HIPBulk...
ODC     /gale/netapp/cemba3c/projects/ALLCools/HIPBulk...
MGC     /gale/netapp/cemba3c/projects/ALLCools/HIPBulk...
NonN    /gale/netapp/cemba3c/projects/ALLCools/HIPBulk...
OPC     /gale/netapp/cemba3c/projects/ALLCools/HIPBulk...
DG      /gale/netapp/cemba3c/projects/ALLCools/HIPBulk...
dtype: object
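
The dictionary comprehension above derives each cluster name from the BigWig file name. The following is a minimal sketch of that parsing and of the header-less two-column CSV that `to_csv` produces. The file names are hypothetical and only follow the `HIP_snATAC_*.bw` pattern used above.

```python
import pathlib
import pandas as pd

# Hypothetical file names following the HIP_snATAC_*.bw pattern used above
example_paths = [
    pathlib.PurePosixPath('/data/atac_bulk/HIP_snATAC_CA1.bw'),
    pathlib.PurePosixPath('/data/atac_bulk/HIP_snATAC_DG.bw'),
]

# 'HIP_snATAC_CA1.bw' -> 'HIP_snATAC_CA1' -> 'CA1'
example_bigwigs = pd.Series({
    p.name.split('.')[0].split('_')[-1]: str(p)
    for p in example_paths
})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = example_bigwigs.to_csv(header=False)
print(csv_text)
```

Note that this parsing keeps only the text after the last underscore, so a cluster name that itself contains an underscore or a dot would be truncated.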

In [4]:
dmr_ds.annotate_by_bigwigs(slop=250,
                           bigwig_table='bigwig.csv',
                           dim='snATAC',
                           cpu=30)

Use chunk size 2
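
To illustrate what `slop=250` might do, the sketch below assumes the parameter extends each region by that many base pairs on both sides before the signal is summarized, in the way `bedtools slop` does. This is an assumption about the parameter, not a description of ALLCools internals. The helper `mean_signal_with_slop`, the per-base signal array, and the coordinates are all made up for illustration.

```python
import numpy as np

def mean_signal_with_slop(signal, start, end, slop):
    """Mean of a per-base signal over [start - slop, end + slop), clipped to the array bounds."""
    lo = max(start - slop, 0)
    hi = min(end + slop, len(signal))
    return float(signal[lo:hi].mean())

# Toy chromosome of 2,000 bp with an accessibility peak at positions 900-1100
toy_signal = np.zeros(2000, dtype=np.float32)
toy_signal[900:1100] = 4.0

# A hypothetical 100-bp DMR sitting inside the peak
no_slop = mean_signal_with_slop(toy_signal, 950, 1050, slop=0)
with_slop = mean_signal_with_slop(toy_signal, 950, 1050, slop=250)
print(no_slop, with_slop)
```

Under this assumption, a larger slop captures signal flanking a short DMR but also dilutes a narrow peak with background.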


## DMR Overlapping Genome Features (BED)

Next, we overlap the DMR regions with a set of BED files containing different kinds of genome features. The output dataset is a boolean matrix describing whether a DMR overlaps a given feature kind (e.g., CGI).

In [5]:
genome_feature_dir = '/home/hanliu/ref/mouse/genome_feature/'
genome_feature_beds = {
    p.name[:-4]: str(p)
    for p in pathlib.Path(genome_feature_dir).glob('*.bed')
}

# also annotate blacklist to filter DMR
genome_feature_beds['blacklist'] = '/home/hanliu/ref/blacklist/mm10-blacklist.v2.bed.gz'

beds = pd.Series(genome_feature_beds)
beds.to_csv('genome_featue_bed.csv', header=False)
beds

CGI_promoter.all                   /home/hanliu/ref/mouse/genome_feature/CGI_prom...
CGI_promoter.protein_coding        /home/hanliu/ref/mouse/genome_feature/CGI_prom...
exon.all                           /home/hanliu/ref/mouse/genome_feature/exon.all...
exon.first                         /home/hanliu/ref/mouse/genome_feature/exon.fir...
exon.protein_coding                /home/hanliu/ref/mouse/genome_feature/exon.pro...
gene.all                           /home/hanliu/ref/mouse/genome_feature/gene.all...
gene.lincRNA                       /home/hanliu/ref/mouse/genome_feature/gene.lin...
gene.protein_coding                /home/hanliu/ref/mouse/genome_feature/gene.pro...
intron.all                         /home/hanliu/ref/mouse/genome_feature/intron.a...
intron.first                       /home/hanliu/ref/mouse/genome_feature/intron.f...
intron.protein_coding              /home/hanliu/ref/mouse/genome_feature/intron.p...
Non_CGI_promoter.all               /home/hanliu/ref/mouse/genome_

In [6]:
dmr_ds.annotate_by_bed(slop=250,
                       bed_table='genome_featue_bed.csv',
                       dim='genome-features',
                       bed_sorted=False,
                       cpu=30)

Use chunk size 2
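
The boolean region-by-feature matrix can be pictured with a small pure-Python sketch. It is only an illustration of the idea, not the ALLCools implementation: the helper `overlaps_any`, the coordinates, and the feature sets are invented, and the slop is assumed to extend each DMR on both sides.

```python
import pandas as pd

def overlaps_any(region, features, slop=0):
    """True if the slopped region overlaps at least one feature on the same chromosome."""
    chrom, start, end = region
    start, end = max(start - slop, 0), end + slop
    return any(
        f_chrom == chrom and f_start < end and start < f_end
        for f_chrom, f_start, f_end in features
    )

# Made-up DMRs and feature sets (half-open, BED-style coordinates)
toy_dmrs = {
    'dmr_0': ('chr1', 1000, 1200),
    'dmr_1': ('chr1', 5000, 5100),
    'dmr_2': ('chr2', 300, 400),
}
toy_features = {
    'CGI': [('chr1', 1100, 1500)],
    'blacklist': [('chr2', 600, 900)],
}

# One row per DMR, one boolean column per feature kind
overlap = pd.DataFrame(
    {name: [overlaps_any(r, feats, slop=250) for r in toy_dmrs.values()]
     for name, feats in toy_features.items()},
    index=list(toy_dmrs),
)
print(overlap)
```

In this toy data, `dmr_2` overlaps the blacklist only because of the 250-bp extension, which shows how the slop setting can change the resulting matrix.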


## DMR Overlapping TE (BED)

Finally, we overlap the DMR regions with transposable elements (TEs) collected from the RepeatMasker annotation of the mouse genome. The output dataset is a boolean matrix describing whether a DMR overlaps a given TE kind.

In [7]:
bed_dir = '/home/hanliu/ref/mouse/ucsc/TE_Beds/'
beds = {
    p.name[:-4]: str(p)
    for p in pathlib.Path(bed_dir).glob('*.bed')
}
beds = pd.Series(beds)
beds.to_csv('te_bed.csv', header=False)
beds

DNA.DNA                /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.DNA.bed
DNA.MULE-MuDR        /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.MULE-M...
DNA.MuDR              /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.MuDR.bed
DNA.PiggyBac         /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.PiggyB...
DNA.TcMar            /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.TcMar.bed
DNA.TcMar-Mariner    /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.TcMar-...
DNA.TcMar-Pogo       /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.TcMar-...
DNA.TcMar-Tc2        /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.TcMar-...
DNA.TcMar-Tigger     /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.TcMar-...
DNA.hAT                /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.hAT.bed
DNA.hAT-Blackjack    /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.hAT-Bl...
DNA.hAT-Charlie      /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.hAT-Ch...
DNA.hAT-Tip100       /home/hanliu/ref/mouse/ucsc/TE_Beds/DNA.hAT-Ti...
LINE.CR1              /home/hanliu/ref/mouse/ucsc/TE_Beds/LINE.CR1.bed
LINE.D

In [8]:
dmr_ds.annotate_by_bed(slop=250,
                       bed_table='te_bed.csv',
                       dim='TE',
                       bed_sorted=False,
                       cpu=30)

Use chunk size 2


## After Annotation
After annotation, the RegionDS contains additional dmr-by-feature matrices that can be used in downstream analysis.

In [9]:
dmr_ds

| Array shape | Chunk shape | Type | Bytes | Chunk bytes | Tasks | Chunks |
|---|---|---|---|---|---|---|
| (131, 10) | (131, 2) | float32 | 5.24 kB | 1.05 kB | 6 | 5 |
| (131, 29) | (131, 2) | bool | 3.80 kB | 262 B | 16 | 15 |
| (131, 32) | (131, 2) | bool | 4.19 kB | 262 B | 17 | 16 |

### Selectively open
You may also select specific datasets to open for future analysis.

In [10]:
RegionDS.open('HIP_small', select_dir=['dmr', 'dmr_TE'])

Using dmr as region_dim
