## Creating RNA feature groupings to be used with scMKL
There are three ways to create a grouping dictionary, where `grouping_dictionary['Group1_name'] = [gene_1, gene_2, ..., gene_n]`. Keys are group names (`str`) and values may be a `set`, `np.ndarray`, `list`, `tuple`, or `pd.Series` of gene names:
1) Reading in a gene set library that is saved as a gmt file
2) Using GSEApy to download one of their many gene sets across several organisms
3) Using scMKL's built-in gene set parsing functions

**NOTE: Although we use gene symbols here, we recommend using gene IDs rather than gene symbols for both your grouping dictionary and scRNA feature array, because gene symbols are more ambiguous.**
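As a minimal sketch of the recommendation above, a grouping dictionary keyed on symbols can be translated to gene IDs with a lookup table. The helper `symbols_to_ids` and the two-entry `symbol_to_id` mapping are hypothetical and for illustration only; they are not part of scMKL, and in practice the mapping would come from an annotation resource that matches your data.

```python
def symbols_to_ids(grouping, symbol_to_id):
    """Translate the gene symbols in each group to gene IDs.

    Symbols missing from `symbol_to_id` are dropped (hypothetical helper).
    """
    converted = {}
    for group_name, genes in grouping.items():
        ids = [symbol_to_id[gene] for gene in genes if gene in symbol_to_id]
        # Remove duplicate IDs while preserving their original order
        converted[group_name] = list(dict.fromkeys(ids))
    return converted


# Illustrative mapping only; build a complete one from your annotation source
symbol_to_id = {"TP53": "ENSG00000141510", "GAPDH": "ENSG00000111640"}

example_grouping = {"Example_Group": ["TP53", "GAPDH", "NOT_A_GENE"]}
id_grouping = symbols_to_ids(example_grouping, symbol_to_id)
```

Whatever mapping is used, the same identifier type should be applied to the feature array so that group members can still be matched to features.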

In [1]:
import scmkl
import gseapy

# For data manipulation and saving
import numpy as np
import pickle



### 1) Reading in a gene set library that is saved as a gmt file
A GMT file is a tab-separated file in which each line describes one gene set:

gene_set_name_1   description  gene_1  gene_2  ....    gene_n

GMT files can be downloaded from https://www.gsea-msigdb.org/gsea/msigdb/.

In [2]:
with open("data/_hallmark_library.gmt", "r") as gmt_file:
    # Skipping the description column
    gmt_grouping = {line.split("\t")[0] : line.strip("\n").split("\t")[2:] 
                    for line in gmt_file}

print(gmt_grouping.keys())

dict_keys(['HALLMARK_ADIPOGENESIS', 'HALLMARK_ALLOGRAFT_REJECTION', 'HALLMARK_ANDROGEN_RESPONSE', 'HALLMARK_ANGIOGENESIS', 'HALLMARK_APICAL_JUNCTION', 'HALLMARK_APICAL_SURFACE', 'HALLMARK_APOPTOSIS', 'HALLMARK_BILE_ACID_METABOLISM', 'HALLMARK_CHOLESTEROL_HOMEOSTASIS', 'HALLMARK_COAGULATION', 'HALLMARK_COMPLEMENT', 'HALLMARK_DNA_REPAIR', 'HALLMARK_E2F_TARGETS', 'HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION', 'HALLMARK_ESTROGEN_RESPONSE_EARLY', 'HALLMARK_ESTROGEN_RESPONSE_LATE', 'HALLMARK_FATTY_ACID_METABOLISM', 'HALLMARK_G2M_CHECKPOINT', 'HALLMARK_GLYCOLYSIS', 'HALLMARK_HEDGEHOG_SIGNALING', 'HALLMARK_HEME_METABOLISM', 'HALLMARK_HYPOXIA', 'HALLMARK_IL2_STAT5_SIGNALING', 'HALLMARK_IL6_JAK_STAT3_SIGNALING', 'HALLMARK_INFLAMMATORY_RESPONSE', 'HALLMARK_INTERFERON_ALPHA_RESPONSE', 'HALLMARK_INTERFERON_GAMMA_RESPONSE', 'HALLMARK_KRAS_SIGNALING_DN', 'HALLMARK_KRAS_SIGNALING_UP', 'HALLMARK_MITOTIC_SPINDLE', 'HALLMARK_MTORC1_SIGNALING', 'HALLMARK_MYC_TARGETS_V1', 'HALLMARK_MYC_TARGETS_V2', 'HALLMAR
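The dictionary comprehension above assumes every line is well formed. A slightly more defensive variant, sketched below, skips blank lines and drops empty trailing fields. The function name `read_gmt` is hypothetical and not part of scMKL or GSEApy.

```python
def read_gmt(path):
    """Parse a GMT file into {gene_set_name: [gene_1, ..., gene_n]}."""
    grouping = {}
    with open(path, "r") as gmt_file:
        for line in gmt_file:
            fields = line.rstrip("\r\n").split("\t")
            name = fields[0]
            # fields[1] is the description column, which is skipped;
            # empty strings left by trailing tabs are dropped
            genes = [gene for gene in fields[2:] if gene]
            # Ignore blank lines and gene sets without any genes
            if not name or not genes:
                continue
            grouping[name] = genes
    return grouping
```

With this helper, `gmt_grouping = read_gmt("data/_hallmark_library.gmt")` would be expected to give the same dictionary as the comprehension for a well-formed file.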

### 2) Using GSEApy to download one of their many gene sets across several organisms
GSEApy includes gene set libraries for 'Human', 'Mouse', 'Yeast', 'Worm', 'Fly', and 'Fish'

In [3]:
# Showing the number of gene set libraries for each organism
for organism in ['Human', 'Mouse', 'Yeast', 'Worm', 'Fly', 'Fish']:
    print(f'There are {len(gseapy.get_library_name(organism = organism))} gene set libraries for {organism}')

# After choosing one of the library names from gseapy.get_library_name() for your organism, we can pull the library
# Here we will pull the MSigDB_Hallmark_2020 library for Human
gseapy_grouping = gseapy.get_library(name = 'MSigDB_Hallmark_2020', organism = 'Human')
print(gseapy_grouping.keys())

There are 218 gene set libraries for Human
There are 218 gene set libraries for Mouse
There are 21 gene set libraries for Yeast
There are 37 gene set libraries for Worm
There are 38 gene set libraries for Fly
There are 30 gene set libraries for Fish
dict_keys(['TNF-alpha Signaling via NF-kB', 'Hypoxia', 'Cholesterol Homeostasis', 'Mitotic Spindle', 'Wnt-beta Catenin Signaling', 'TGF-beta Signaling', 'IL-6/JAK/STAT3 Signaling', 'DNA Repair', 'G2-M Checkpoint', 'Apoptosis', 'Notch Signaling', 'Adipogenesis', 'Estrogen Response Early', 'Estrogen Response Late', 'Androgen Response', 'Myogenesis', 'Protein Secretion', 'Interferon Alpha Response', 'Interferon Gamma Response', 'Apical Junction', 'Apical Surface', 'Hedgehog Signaling', 'Complement', 'Unfolded Protein Response', 'PI3K/AKT/mTOR  Signaling', 'mTORC1 Signaling', 'E2F Targets', 'Myc Targets V1', 'Myc Targets V2', 'Epithelial Mesenchymal Transition', 'Inflammatory Response', 'Xenobiotic Metabolism', 'Fatty Acid Metabolism', 'Oxidati

### 3) Using scMKL's built-in gene set parsing functions

If we are interested in B cells and T cells, we can use `scmkl.find_candidates()` to find possible groupings.

In [4]:
scmkl.find_candidates('human', key_terms=[' b ', ' t '])


| | Library | No. Gene Sets | No. Key Terms Matching |
|---|---|---|---|
| 0 | Azimuth_2023 | 1241 | 39 |
| 1 | Azimuth_Cell_Types_2021 | 341 | 55 |
| 2 | Cancer_Cell_Line_Encyclopedia | 967 | 0 |
| 3 | CellMarker_2024 | 1134 | 161 |
| 4 | CellMarker_Augmented_2021 | 1096 | 115 |
| 5 | GO_Biological_Process_2025 | 5341 | 100 |
| 6 | GO_Cellular_Component_2025 | 466 | 2 |
| 7 | GO_Molecular_Function_2025 | 1174 | 1 |
| 8 | KEGG_2021_Human | 320 | 1 |
| 9 | MSigDB_Hallmark_2020 | 50 | 0 |


Because `'CellMarker_2024'` has the largest number of hits for our terms of interest, we can now pull that gene set library while filtering out groupings that contain fewer than 2 genes from our data set's features or that do not contain our terms of interest.

In [5]:
features = np.load('data/_MCF7_RNA_feature_names.npy', allow_pickle=True)

group_dict = scmkl.get_gene_groupings('CellMarker_2024', 'human', 
                                    key_terms=[' b ', ' t '], 
                                    min_overlap=2, 
                                    genes=features)

print(f'{len(group_dict.keys())} gene groupings.')
group_dict.keys()

Not filtering with `blacklist` parameter.
161 gene groupings.


dict_keys(['Activated CD4+ T Cell Blood Human', 'Activated CD4+ T Cell Peripheral Blood Human', 'Activated CD8+ T Cell Peripheral Blood Human', 'Activated T Cell Skin Human', 'Activated T Cell Undefined Human', 'Activated Memory B Cell Blood Human', 'Activated Tissue Resident Memory CD8+ T Cell Airway Human', 'CD1C+ B Dendritic Cell Blood Human', 'CD4 T Cell Lung Human', 'CD4+ T Cell Blood Human', 'CD4+ T Cell Brain Human', 'CD4+ T Cell Kidney Human', 'CD4+ T Cell Liver Human', 'CD4+ T Cell Lung Human', 'CD4+ T Cell Lymphoid Tissue Human', 'CD4+ T Cell Peripheral Blood Human', 'CD4+ T Cell Skin Human', 'CD4+ T Cell Spleen Human', 'CD4+ T Cell Stomach Human', 'CD4+ T Cell Undefined Human', 'CD4+ Central Memory Like T (Tcm-like) Cell Peripheral Blood Human', 'CD4+ Recently Activated Effector Memory Or Effector T Cell (CTL) Blood Human', 'CD4-CD28+ T Cell Peripheral Blood Human', 'CD4-CD28- T Cell Peripheral Blood Human', 'CD40LG+ T Helper Cell Bile Duct Human', 'CD8 T Cell Lung Human', '
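To make the effect of `min_overlap` concrete, the sketch below applies the same idea to a plain dictionary: a group is kept only if at least `min_overlap` of its genes appear in the feature names. This is an illustration under that assumption, not scMKL's implementation, and `filter_by_overlap` is a hypothetical name.

```python
def filter_by_overlap(grouping, features, min_overlap=2):
    """Keep groups sharing at least `min_overlap` genes with `features`."""
    feature_set = set(features)
    return {
        name: genes
        for name, genes in grouping.items()
        # Count the distinct genes in this group that are also features
        if len(feature_set.intersection(genes)) >= min_overlap
    }


toy_grouping = {
    "Group_A": ["G1", "G2", "G3"],  # two genes in the features: kept
    "Group_B": ["G1", "G9"],        # one gene in the features: dropped
}
toy_features = ["G1", "G2", "G4"]
kept = filter_by_overlap(toy_grouping, toy_features, min_overlap=2)
```

Groups with little overlap carry few of the measured features, so removing them early keeps the grouping dictionary focused on gene sets the data can actually inform.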

If we want the same gene groupings but without the 'blood' or 'stomach' groupings, we can use the `blacklist` parameter. **NOTE: Grouping names should be manually reviewed to avoid using nonsensical groupings for the classification task.**

In [6]:
group_dict = scmkl.get_gene_groupings('CellMarker_2024', 'human', 
                                    key_terms=[' b ', ' t '], 
                                    blacklist=['blood', 'stomach'],
                                    min_overlap=2, 
                                    genes=features)

print(f'{len(group_dict.keys())} gene groupings.')
group_dict.keys()

99 gene groupings.


dict_keys(['Activated T Cell Skin Human', 'Activated T Cell Undefined Human', 'Activated Tissue Resident Memory CD8+ T Cell Airway Human', 'CD4 T Cell Lung Human', 'CD4+ T Cell Brain Human', 'CD4+ T Cell Kidney Human', 'CD4+ T Cell Liver Human', 'CD4+ T Cell Lung Human', 'CD4+ T Cell Lymphoid Tissue Human', 'CD4+ T Cell Skin Human', 'CD4+ T Cell Spleen Human', 'CD4+ T Cell Undefined Human', 'CD40LG+ T Helper Cell Bile Duct Human', 'CD8 T Cell Lung Human', 'CD8+ T Cell Bile Duct Human', 'CD8+ T Cell Bone Marrow Human', 'CD8+ T Cell Brain Human', 'CD8+ T Cell Breast Human', 'CD8+ T Cell Epithelium Human', 'CD8+ T Cell Kidney Human', 'CD8+ T Cell Lung Human', 'CD8+ T Cell Lymph Human', 'CD8+ T Cell Lymphoid Tissue Human', 'CD8+ T Cell Nasal Polyp Human', 'CD8+ T Cell Skin Human', 'CD8+ T Cell Spleen Human', 'CD8+ T Cell Undefined Human', 'Central Memory CD4+ T Cell Liver Human', 'Central Memory CD4+ T Cell Spleen Human', 'Central Memory CD8+ T Cell Liver Human', 'Central Memory CD8+ T Cel
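The name-based filtering shown above can be approximated on any grouping dictionary with a case-insensitive substring check. The sketch below assumes that a group is kept when its name contains at least one key term and none of the blacklisted terms; scMKL's actual matching rules may differ, and `filter_by_name` is a hypothetical helper.

```python
def filter_by_name(grouping, key_terms=(), blacklist=()):
    """Keep groups whose names match a key term and no blacklisted term."""
    kept = {}
    for name, genes in grouping.items():
        # Pad with spaces so terms such as ' t ' can match at either end
        padded = f" {name.lower()} "
        has_term = not key_terms or any(t.lower() in padded for t in key_terms)
        is_blocked = any(b.lower() in padded for b in blacklist)
        if has_term and not is_blocked:
            kept[name] = genes
    return kept


toy_grouping = {
    "CD4+ T Cell Blood Human": ["G1", "G2"],
    "CD4+ T Cell Lung Human": ["G1", "G3"],
    "Activated Memory B Cell Blood Human": ["G4", "G5"],
    "Hepatocyte Liver Human": ["G6", "G7"],
}
filtered = filter_by_name(toy_grouping,
                          key_terms=[" b ", " t "],
                          blacklist=["blood", "stomach"])
```

Surrounding the single-letter terms with spaces, as in `' b '` and `' t '`, avoids matching every name that merely contains the letter, which is why the hepatocyte group is excluded here.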

### Saving a Library for scMKL Grouping
If you plan to do multiple train/test splits or to run with different sparsities (alpha), we recommend saving the grouping dictionary as a pickle file.
Because pickle files are byte streams, they are fast to read in, for example with `group_dict = np.load('group_dict.pkl', allow_pickle=True)`.

In [7]:
# Saving a dictionary as a pickle file

# with open('your_filename.pkl', 'wb') as output:
#     pickle.dump(group_dict, output)
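A round trip with the standard-library `pickle` module can be sketched as follows. The helper names `save_grouping` and `load_grouping` and the file name in the usage note are placeholders, not part of scMKL.

```python
import pickle


def save_grouping(group_dict, path):
    """Write a grouping dictionary to `path` as a pickle file."""
    with open(path, "wb") as output:
        pickle.dump(group_dict, output)


def load_grouping(path):
    """Read a grouping dictionary back from a pickle file."""
    with open(path, "rb") as pkl_file:
        return pickle.load(pkl_file)
```

For example, `save_grouping(group_dict, 'your_filename.pkl')` followed by `load_grouping('your_filename.pkl')` should return an equal dictionary. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.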