## Creating a Peak Set Library from a Gene Set Library
One way to analyze ATAC data with scMKL is to associate each peak with the genes it overlaps.
This tutorial assumes that a gene set library (saved as a pickled dictionary) and a GTF file for the organism in question are already saved to disk.
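
As a minimal sketch of what such a pickled dictionary could look like, the snippet below writes and reloads a toy gene set library. The set names, the file name, and the use of a temporary directory are assumptions made for illustration; a real library such as the Hallmark collection would contain many more sets and genes.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical gene sets: keys are set names, values are collections of gene names
example_gene_sets = {
    "EXAMPLE_SET_A": {"SAMD11", "NOC2L"},
    "EXAMPLE_SET_B": {"OR4F5", "OR4F29", "OR4F16"},
}

# Write the dictionary to a pickle file in a temporary directory
out_path = Path(tempfile.mkdtemp()) / "example_groupings.pkl"
with open(out_path, "wb") as handle:
    pickle.dump(example_gene_sets, handle)

# Read it back to confirm the round trip preserves the structure
with open(out_path, "rb") as handle:
    loaded_gene_sets = pickle.load(handle)

print(sorted(loaded_gene_sets))
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.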

In this example, we will create an ATAC grouping for MCF-7 using a subset of the hg38 GTF file.

In [1]:
# Importing numpy, pandas, and re for data manipulation, and scmkl for the grouping step
import numpy as np
import pandas as pd
import re
import scmkl


# Reading in gene library as a dictionary: {gene_set_name : list | set | np.ndarray of genes}
gene_sets = np.load("data/RNA_hallmark_groupings.pkl", allow_pickle = True)

# Reading in the feature names from scATAC assay
assay_peaks = np.load("data/MCF7_ATAC_feature_names.npy", allow_pickle = True)

# Reading in GTF file for region comparison (a subset of the full file here) and naming columns
gene_annotations = pd.read_csv("data/hg38_subset_protein_coding.annotation.gtf", sep = "\t", header = None, skip_blank_lines=True, comment = "#")
gene_annotations.columns = ['chr', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attribute']

### Formatting GTF Data
`get_atac_groupings()` takes gene annotations as a pd.DataFrame with columns `['chr', 'start', 'end', 'gene_name', 'strand']` where:
- `'chr'` is the respective chromosome for the annotation
- `'start'` is the respective start position for the annotation
- `'end'` is the respective end position for the annotation
- `'gene_name'` is the gene name for the annotated region (can be parsed from the attribute column of a GTF file)
- `'strand'` is the strand for the annotation, can be `'+'` or `'-'`
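
To make the expected layout concrete, here is a small sketch that builds a two-row annotation table with those columns and runs a few sanity checks. The coordinates are copied from the `gene_annotations.head()` output in this tutorial; the checks themselves are an illustrative assumption and are not something scMKL is known to require.

```python
import pandas as pd

# Columns listed above as the expected input format
required_columns = ["chr", "start", "end", "gene_name", "strand"]

# Two example genes with hg38 coordinates taken from this tutorial's GTF subset
example_anno = pd.DataFrame(
    {
        "chr": ["chr1", "chr1"],
        "start": [923928, 944203],
        "end": [944581, 959309],
        "gene_name": ["SAMD11", "NOC2L"],
        "strand": ["+", "-"],
    }
)

# Illustrative sanity checks before passing annotations on
missing = [col for col in required_columns if col not in example_anno.columns]
assert not missing, f"missing columns: {missing}"
assert example_anno["strand"].isin(["+", "-"]).all()
assert (example_anno["start"] <= example_anno["end"]).all()

print(example_anno)
```

Extra columns, such as the raw GTF fields kept in the next cell, are left in place in this tutorial.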

In [2]:
# Removing annotations from GTF data that are not protein_coding and filtering to only gene features
gene_annotations = gene_annotations[gene_annotations['attribute'].str.contains('protein_coding')]
# .copy() so that adding a column below does not operate on a view of the original DataFrame
gene_annotations = gene_annotations[gene_annotations['feature'] == 'gene'].copy()

# Parsing attribute column for gene name and adding it to gene_annotations DataFrame
# If using gene IDs in gene_sets, set gene_annotations['gene_name'] to gene IDs instead
# Capturing everything up to the closing quote keeps names containing '-' or '.' intact
gene_annotations['gene_name'] = [re.findall(r'(?<=gene_name ")[^"]+', attr)[0] for attr in gene_annotations['attribute']]

gene_annotations.head()

| | chr | source | feature | start | end | score | strand | frame | attribute | gene_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | HAVANA | gene | 65419 | 71585 | . | + | . | gene_id "ENSG00000186092.6"; gene_type "protei... | OR4F5 |
| 19 | chr1 | HAVANA | gene | 450703 | 451697 | . | - | . | gene_id "ENSG00000284733.1"; gene_type "protei... | OR4F29 |
| 27 | chr1 | HAVANA | gene | 685679 | 686673 | . | - | . | gene_id "ENSG00000284662.1"; gene_type "protei... | OR4F16 |
| 35 | chr1 | HAVANA | gene | 923928 | 944581 | . | + | . | gene_id "ENSG00000187634.12"; gene_type "prote... | SAMD11 |
| 392 | chr1 | HAVANA | gene | 944203 | 959309 | . | - | . | gene_id "ENSG00000188976.11"; gene_type "prote... | NOC2L |
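
The attribute parsing step can also be written as a small helper, which is handy when switching between gene names and gene IDs. This is only a sketch: `parse_gtf_attribute` is a hypothetical helper, not part of scMKL, and the attribute string is an illustrative reconstruction, because the attribute values in the output above are truncated.

```python
import re

def parse_gtf_attribute(attribute, key):
    """Return the quoted value stored under `key` in a GTF attribute string, or None."""
    match = re.search(rf'{re.escape(key)} "([^"]+)"', attribute)
    return match.group(1) if match else None

# Illustrative attribute string in GENCODE style (reconstructed, not copied from the file)
attr = 'gene_id "ENSG00000188976.11"; gene_type "protein_coding"; gene_name "NOC2L";'

print(parse_gtf_attribute(attr, "gene_name"))
print(parse_gtf_attribute(attr, "gene_id"))

# Names containing a hyphen are returned whole because the pattern stops at the closing quote
print(parse_gtf_attribute('gene_name "HLA-A";', "gene_name"))
```

Returning `None` for a missing key makes it easy to spot rows without the requested field instead of raising an `IndexError`.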


### Comparing Regions
`get_atac_groupings()` searches for overlap between the gene annotations and the assay features. Then, using the genes in the annotations and the genes in `gene_sets`, it assigns assay peaks to groupings in a new grouping dictionary.

**NOTE**: This function will take a while to run on a full annotations file.
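
To illustrate the idea of assigning peaks to gene sets by overlap, here is a simplified pure-Python sketch. It is not scMKL's implementation. It assumes peak names follow a `chr-start-end` pattern and uses plain interval overlap with the gene body, and all function names and peak coordinates are hypothetical. Check the actual feature name format of your assay before adapting it.

```python
def parse_peak(peak_name):
    """Split an assumed 'chr-start-end' peak name into (chr, start, end)."""
    chrom, start, end = peak_name.rsplit("-", 2)
    return chrom, int(start), int(end)

def overlaps(start_a, end_a, start_b, end_b):
    """True when two closed intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a

def group_peaks(gene_sets, peak_names, gene_regions):
    """Assign each peak to every gene set containing a gene it overlaps."""
    grouping = {set_name: set() for set_name in gene_sets}
    for peak in peak_names:
        p_chr, p_start, p_end = parse_peak(peak)
        for set_name, genes in gene_sets.items():
            for gene in genes:
                if gene not in gene_regions:
                    continue  # gene has no annotation, so it cannot be matched
                g_chr, g_start, g_end = gene_regions[gene]
                if g_chr == p_chr and overlaps(p_start, p_end, g_start, g_end):
                    grouping[set_name].add(peak)
                    break  # one overlapping gene is enough for this set
    return grouping

# Gene coordinates from this tutorial; set names and peaks are made up
gene_regions = {"SAMD11": ("chr1", 923928, 944581), "NOC2L": ("chr1", 944203, 959309)}
toy_gene_sets = {"SET_A": {"SAMD11"}, "SET_B": {"NOC2L"}}
toy_peaks = ["chr1-923000-924500", "chr1-944300-944500", "chr1-1000000-1000500"]

toy_grouping = group_peaks(toy_gene_sets, toy_peaks, gene_regions)
print(toy_grouping)
```

The nested loops compare every peak against every gene, which is one reason a comparison of this kind can be slow on a full annotation file; interval trees or sorted sweeps are common ways to speed it up.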

In [3]:
atac_grouping = scmkl.get_atac_groupings(gene_sets = gene_sets,
                                        feature_names = assay_peaks,
                                        gene_anno = gene_annotations
                                        )

print(atac_grouping.keys())

dict_keys(['HALLMARK_TNFA_SIGNALING_VIA_NFKB', 'HALLMARK_HYPOXIA', 'HALLMARK_CHOLESTEROL_HOMEOSTASIS', 'HALLMARK_MITOTIC_SPINDLE', 'HALLMARK_WNT_BETA_CATENIN_SIGNALING', 'HALLMARK_TGF_BETA_SIGNALING', 'HALLMARK_IL6_JAK_STAT3_SIGNALING', 'HALLMARK_DNA_REPAIR', 'HALLMARK_G2M_CHECKPOINT', 'HALLMARK_APOPTOSIS', 'HALLMARK_NOTCH_SIGNALING', 'HALLMARK_ADIPOGENESIS', 'HALLMARK_ESTROGEN_RESPONSE_EARLY', 'HALLMARK_ESTROGEN_RESPONSE_LATE', 'HALLMARK_ANDROGEN_RESPONSE', 'HALLMARK_MYOGENESIS', 'HALLMARK_PROTEIN_SECRETION', 'HALLMARK_INTERFERON_ALPHA_RESPONSE', 'HALLMARK_INTERFERON_GAMMA_RESPONSE', 'HALLMARK_APICAL_JUNCTION', 'HALLMARK_APICAL_SURFACE', 'HALLMARK_HEDGEHOG_SIGNALING', 'HALLMARK_COMPLEMENT', 'HALLMARK_UNFOLDED_PROTEIN_RESPONSE', 'HALLMARK_PI3K_AKT_MTOR_SIGNALING', 'HALLMARK_MTORC1_SIGNALING', 'HALLMARK_E2F_TARGETS', 'HALLMARK_MYC_TARGETS_V1', 'HALLMARK_MYC_TARGETS_V2', 'HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION', 'HALLMARK_INFLAMMATORY_RESPONSE', 'HALLMARK_XENOBIOTIC_METABOLISM', 'HAL