## Creating RNA feature groupings to be used with scMKL
There are three ways to obtain a grouping dictionary, where grouping_dictionary['Group1_name'] = [gene_1, gene_2, ..., gene_n] (type: dict[str, set | np.ndarray | list | tuple | pd.Series]):
1) Reading in a gene set library that is saved as a pickle file
2) Reading in a gene set library that is saved as a gmt file
3) Using GSEApy to download one of their many gene sets across several organisms
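
Whichever route is used, the result has the same shape. The sketch below builds a small grouping dictionary by hand to show that shape; the group names and gene symbols are hypothetical placeholders, not real gene sets, and only built-in containers are used to keep the example dependency-free.

```python
# Hypothetical group names and gene symbols, for illustration only
grouping_dictionary = {
    "Group1_name": {"GENE_A", "GENE_B", "GENE_C"},  # set
    "Group2_name": ["GENE_B", "GENE_D"],            # list
    "Group3_name": ("GENE_E",),                     # tuple
}

# Each key is a group name; each value is a collection of feature names
for group_name, genes in grouping_dictionary.items():
    print(group_name, len(genes))

# A gene may belong to more than one group
groups_with_gene_b = sorted(
    name for name, genes in grouping_dictionary.items() if "GENE_B" in genes
)
print(groups_with_gene_b)
```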

**NOTE: Although we use gene symbols here, we recommend using gene IDs rather than gene symbols for both your grouping dictionary and scRNA feature array, because gene symbols are more ambiguous.**

In [1]:
# Importing GSEApy
import gseapy

# Importing numpy and pickle for data manipulation and loading/saving
import numpy as np
import pickle

### Reading in a gene set library that is saved as a pickle file

In [2]:
pickled_grouping = np.load("/home/vangordi/projects/scMKL/scMKL/example/data/RNA_hallmark_groupings.pkl", allow_pickle = True)
print(pickled_grouping)

{'HALLMARK_TNFA_SIGNALING_VIA_NFKB': {'GEM', 'BCL6', 'ICAM1', 'AREG', 'PTGS2', 'SPSB1', 'SDC4', 'CXCL6', 'TRIB1', 'DUSP2', 'IL6ST', 'EGR2', 'PHLDA1', 'SQSTM1', 'TRIP10', 'CFLAR', 'PDLIM5', 'BTG2', 'CCL2', 'DNAJB4', 'MAFF', 'PNRC1', 'LAMB3', 'NFKB1', 'CCN1', 'CEBPB', 'IFIH1', 'DENND5A', 'IRF1', 'TSC22D1', 'ID2', 'SNN', 'IL12B', 'ZFP36', 'ICOSLG', 'IL23A', 'RIPK2', 'IL18', 'PER1', 'HBEGF', 'KLF6', 'PLEK', 'SAT1', 'CD83', 'TLR2', 'SLC2A3', 'TNFAIP3', 'CXCL2', 'INHBA', 'KLF10', 'IRS2', 'MSC', 'SERPINB2', 'TIPARP', 'KDM6B', 'IL1A', 'GFPT2', 'MCL1', 'SLC16A6', 'SLC2A6', 'F2RL1', 'BTG1', 'TAP1', 'PANX1', 'GADD45A', 'KLF2', 'CSF1', 'GADD45B', 'CEBPD', 'PPP1R15A', 'PTX3', 'NR4A1', 'STAT5A', 'CXCL3', 'IFIT2', 'BCL2A1', 'BHLHE40', 'JAG1', 'VEGFA', 'FJX1', 'RELA', 'ZC3H12A', 'LIF', 'TNFAIP6', 'IER2', 'EDN1', 'CCND1', 'BIRC3', 'BCL3', 'REL', 'SGK1', 'NFKB2', 'F3', 'DUSP4', 'TUBB2A', 'OLR1', 'CD69', 'GCH1', 'CXCL1', 'ABCA1', 'MAP2K3', 'HES1', 'TNFAIP8', 'BIRC2', 'MAP3K8', 'DRAM1', 'EIF1', 'IL7R', 'P

### Reading in a gene set library that is saved as a gmt file
A gmt file is a tab-separated file in which each line describes one gene set, in the form:

gene_set_name_1   description  gene_1  gene_2  ....    gene_n

GMT files can be downloaded from https://www.gsea-msigdb.org/gsea/msigdb/.

In [3]:
with open("data/hallmark_library.gmt", "r") as gmt_file:
    # Skipping the description column
    gmt_grouping = {line.split("\t")[0] : line.strip("\n").split("\t")[2:] for line in gmt_file}

print(gmt_grouping)

{'HALLMARK_ADIPOGENESIS': ['ABCA1', 'ABCB8', 'ACAA2', 'ACADL', 'ACADM', 'ACADS', 'ACLY', 'ACO2', 'ACOX1', 'ADCY6', 'ADIG', 'ADIPOQ', 'ADIPOR2', 'AGPAT3', 'AIFM1', 'AK2', 'ALDH2', 'ALDOA', 'ANGPT1', 'ANGPTL4', 'APLP2', 'APOE', 'ARAF', 'ARL4A', 'ATL2', 'ATP1B3', 'ATP5PO', 'BAZ2A', 'BCKDHA', 'BCL2L13', 'BCL6', 'C3', 'CAT', 'CAVIN1', 'CAVIN2', 'CCNG2', 'CD151', 'CD302', 'CD36', 'CDKN2C', 'CHCHD10', 'CHUK', 'CIDEA', 'CMBL', 'CMPK1', 'COL15A1', 'COL4A1', 'COQ3', 'COQ5', 'COQ9', 'COX6A1', 'COX7B', 'COX8A', 'CPT2', 'CRAT', 'CS', 'CYC1', 'CYP4B1', 'DBT', 'DDT', 'DECR1', 'DGAT1', 'DHCR7', 'DHRS7', 'DHRS7B', 'DLAT', 'DLD', 'DNAJB9', 'DNAJC15', 'DRAM2', 'ECH1', 'ECHS1', 'ELMOD3', 'ELOVL6', 'ENPP2', 'EPHX2', 'ESRRA', 'ESYT1', 'ETFB', 'FABP4', 'FAH', 'FZD4', 'G3BP2', 'GADD45A', 'GBE1', 'GHITM', 'GPAM', 'GPAT4', 'GPD2', 'GPHN', 'GPX3', 'GPX4', 'GRPEL1', 'HADH', 'HIBCH', 'HSPB8', 'IDH1', 'IDH3A', 'IDH3G', 'IFNGR1', 'IMMT', 'ITGA7', 'ITIH5', 'ITSN1', 'JAGN1', 'LAMA4', 'LEP', 'LIFR', 'LIPE', 'LPCAT3', '
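
Some gmt files contain blank lines or trailing tabs, which the one-line comprehension above would turn into empty keys or empty gene names. A slightly more defensive parser could look like the sketch below. The function name `parse_gmt` and the in-memory example content are hypothetical and used only for illustration; this is not part of scMKL or GSEApy.

```python
import io


def parse_gmt(handle):
    """Parse GMT-formatted lines into {gene_set_name: [genes]}."""
    grouping = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        # A usable line needs a name, a description, and at least one gene
        if len(fields) < 3 or not fields[0]:
            continue
        # fields[1] is the description column; empty fields are dropped
        grouping[fields[0]] = [gene for gene in fields[2:] if gene]
    return grouping


# Hypothetical GMT content, for illustration only
example_gmt = io.StringIO(
    "SET_ONE\tfirst description\tGENE_A\tGENE_B\t\n"
    "\n"
    "SET_TWO\tsecond description\tGENE_C\n"
)
example_grouping = parse_gmt(example_gmt)
print(example_grouping)
```

To use it on a file, pass an open file handle instead of the `io.StringIO` object, for example `with open("data/hallmark_library.gmt") as gmt_file: parse_gmt(gmt_file)`.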

### Using GSEApy to download one of their many gene sets across several organisms
GSEApy includes gene set libraries for 'Human', 'Mouse', 'Yeast', 'Worm', 'Fly', and 'Fish'.

In [4]:
# Showing the number of gene set libraries for each organism
for organism in ['Human', 'Mouse', 'Yeast', 'Worm', 'Fly', 'Fish']:
    print(f'There are {len(gseapy.get_library_name(organism = organism))} gene set libraries for {organism}')

# After choosing one of the library names from gseapy.get_library_name() for your organism, we can pull the library
# Here we will pull the MSigDB_Hallmark_2020 library for Human
gseapy_grouping = gseapy.get_library(name = 'MSigDB_Hallmark_2020', organism = 'Human')
print(gseapy_grouping)

There are 227 gene set libraries for Human
There are 227 gene set libraries for Mouse
There are 21 gene set libraries for Yeast
There are 37 gene set libraries for Worm
There are 38 gene set libraries for Fly
There are 30 gene set libraries for Fish
{'TNF-alpha Signaling via NF-kB': ['MARCKS', 'IL23A', 'NINJ1', 'TNFSF9', 'SIK1', 'ATF3', 'SERPINE1', 'MYC', 'HES1', 'CCN1', 'CCNL1', 'EGR1', 'EGR2', 'EGR3', 'JAG1', 'ABCA1', 'GADD45B', 'GADD45A', 'KLF10', 'PLK2', 'EIF1', 'EHD1', 'FOSL2', 'FOSL1', 'GPR183', 'PLPP3', 'IFIT2', 'ICAM1', 'ZC3H12A', 'IER2', 'IL12B', 'IER5', 'JUNB', 'IER3', 'STAT5A', 'DUSP5', 'EDN1', 'DUSP4', 'JUN', 'DUSP1', 'DUSP2', 'TSC22D1', 'CCL20', 'SPHK1', 'LIF', 'IL18', 'TUBB2A', 'RHOB', 'VEGFA', 'IL1A', 'PTPRE', 'TLR2', 'IL1B', 'BHLHE40', 'CLCF1', 'ID2', 'REL', 'FJX1', 'SGK1', 'BTG3', 'BTG2', 'BTG1', 'SDC4', 'LITAF', 'AREG', 'SOCS3', 'PANX1', 'RIPK2', 'NFIL3', 'SERPINB2', 'GCH1', 'IFNGR2', 'G0S2', 'FOS', 'F3', 'SERPINB8', 'SPSB1', 'FOSB', 'PER1', 'F2RL1', 'HBEGF', 'CD44', 

### Saving a Library for scMKL Grouping
If you plan to run multiple train/test splits or try different sparsities (alpha), we recommend saving the grouping dictionaries as pickle files.
Because pickle files store serialized Python objects as byte streams, they are fast to read in.

In [5]:
# Saving a dictionary as a pickle file

# with open('your_filename.pkl', 'wb') as output:
#     pickle.dump(group_dict, output)
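
As a quick check that a saved library reads back unchanged, the dictionary can be written and reloaded in one pass. The sketch below writes to a temporary directory so it leaves nothing behind; `group_dict` here is a hypothetical stand-in for your own grouping dictionary, and the file name is arbitrary.

```python
import os
import pickle
import tempfile

# Hypothetical grouping dictionary, for illustration only
group_dict = {"Group1_name": {"GENE_A", "GENE_B"}, "Group2_name": {"GENE_C"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_grouping.pkl")

    # Saving the dictionary as a pickle file
    with open(path, "wb") as output:
        pickle.dump(group_dict, output)

    # Reading it back in
    with open(path, "rb") as saved:
        reloaded = pickle.load(saved)

# The reloaded object should equal the original dictionary
print(reloaded == group_dict)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.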