## Using scMKL with single-cell mRNA and ATAC Data Simultaneously
Here we will run scMKL on a subset of the MCF-7 data (1,000 cells x 36,601 genes for RNA and 1,000 cells x 206,167 regions for ATAC) using Hallmark groupings.

### Importing Modules
Data are read in and saved using the numpy and pickle modules.

In [1]:
# Packages needed to import data
import numpy as np
import pickle
import sys
from scipy.sparse import load_npz

# This sys.path call lets us import the scMKL_src_anndata module from any directory; replace '..' with the path to the repository directory
sys.path.insert(0, '..')
import src.scMKL_src_anndata as src

# Modules for viewing results
import pandas as pd
from plotnine import *

### Reading in Data
scMKL requires 4 pieces of data per modality:
- The data matrix itself, with cells as rows and features as columns.
    - Can be either a NumPy array or a SciPy sparse array (scipy.sparse.csc_array is the recommended format).
- The sample labels in a NumPy array. To perform group lasso, these labels must be binary.
- Feature names in a NumPy array. These are the names of the features corresponding to the columns of the data matrix.
- A dictionary with grouping data. The keys are the names of the groups, and the values are the corresponding features.
    - Example: {Group1: [feature1, feature2, feature3], Group2: [feature4, feature5, feature6], ...}
    - GSEApy can also be used to access many other gene sets
        - See `getting_RNA_groupings.ipynb` and `getting_RNA_groupings.ipynb` for more information on pulling gene/peak sets
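
As a rough illustration of how these four pieces fit together, the sketch below builds toy stand-ins and runs a few sanity checks before anything is handed to scMKL. All values and names here (`X`, `group_dict`, `coverage`, the feature names) are hypothetical and are not part of the MCF-7 data or the scMKL API.

```python
import numpy as np

# Toy stand-ins for the four inputs (hypothetical values, not the MCF-7 data)
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(6, 4))                       # 6 cells x 4 features
cell_labels = np.array(['A', 'A', 'A', 'B', 'B', 'B'])  # one label per cell
feature_names = np.array(['g1', 'g2', 'g3', 'g4'])      # one name per column
group_dict = {'Group1': ['g1', 'g2'],
              'Group2': ['g3', 'g4', 'g5']}             # 'g5' is not in the data

# Rows must match the labels and columns must match the feature names
shapes_agree = X.shape == (len(cell_labels), len(feature_names))

# Group lasso needs exactly two classes
labels_are_binary = len(np.unique(cell_labels)) == 2

# Fraction of each group's features that are present in the data matrix
coverage = {name: float(np.isin(features, feature_names).mean())
            for name, features in group_dict.items()}

print(shapes_agree, labels_are_binary, coverage)
```

Checking group coverage up front is useful because a group whose features are mostly absent from the data matrix contributes little information to the model.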

In [3]:
# Reading in RNA data
RNA_group_dict = np.load('./data/RNA_hallmark_groupings.pkl', allow_pickle = True)
RNA_X = load_npz('./data/MCF7_RNA_X.npz')
RNA_feature_names = np.load('./data/MCF7_RNA_feature_names.npy', allow_pickle = True)

# Reading in ATAC data
ATAC_group_dict = np.load('./data/MCF7_ATAC_hallmark_groupings.pkl', allow_pickle = True)
ATAC_X = load_npz('./data/MCF7_ATAC_X.npz')
ATAC_feature_names = np.load('./data/MCF7_ATAC_feature_names.npy', allow_pickle = True)

# Reading in cell labels
cell_labels = np.load('./data/MCF7_cell_labels.npy', allow_pickle = True)

# D is the number of Fourier features in Z. This value was found to be optimal in previous literature. Increasing D generally improves accuracy but increases run time.
D = int(np.sqrt(len(cell_labels)) * np.log(np.log(len(cell_labels))))
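
To make the formula for `D` concrete, here is a small worked example using only the standard library. The helper name `n_fourier_features` is made up for illustration; the arithmetic is the same as in the cell above.

```python
import math

def n_fourier_features(n_cells: int) -> int:
    """Evaluate D = floor(sqrt(n) * ln(ln(n))) for n cells."""
    return int(math.sqrt(n_cells) * math.log(math.log(n_cells)))

# For the 1,000-cell subset used in this tutorial:
# sqrt(1000) ~ 31.62 and ln(ln(1000)) ~ 1.93, so D = 61
D_example = n_fourier_features(1000)
print(D_example)  # 61
```

Because `D` grows roughly with the square root of the number of cells, a much larger dataset only increases it moderately.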

### Creating an AnnData Object
scMKL takes advantage of AnnData's flexible structure to provide a straightforward workflow. For multimodal experiments, we must first create an adata object for each modality.

In [7]:
# Creating RNA adata
RNA_adata = src.Create_Adata(X = RNA_X, feature_names = RNA_feature_names, cell_labels = cell_labels, group_dict = RNA_group_dict,
                         data_type = 'counts', D = D, filter_features = True, random_state = 100)

# Estimating sigma for RNA and calculating Z matrix
RNA_adata = src.Estimate_Sigma(RNA_adata, n_features = 200)
RNA_adata = src.Calculate_Z(RNA_adata, n_features = 5000)

# Creating ATAC adata
ATAC_adata = src.Create_Adata(X = ATAC_X, feature_names = ATAC_feature_names, cell_labels = cell_labels, group_dict = ATAC_group_dict,
                         data_type = 'binary', D = D, filter_features = True, random_state = 100)

# Estimating sigma for ATAC and calculating Z matrix
ATAC_adata = src.Estimate_Sigma(ATAC_adata, n_features = 200)
ATAC_adata = src.Calculate_Z(ATAC_adata, n_features = 5000)

### Combining Modalities into a Single Object

In [8]:
# Combining adatas
combined_adata = src.Combine_Modalities(Assay_1_name = 'RNA', Assay_2_name = 'ATAC',
                       Assay_1_adata = RNA_adata, Assay_2_adata = ATAC_adata,
                       combination = 'concatenate')

### Training and Evaluating Model
Here we will train and evaluate 10 models, each with a different `alpha`.

`alpha` (or lambda) is a regularization coefficient that determines how many groupings will be used to classify the test cells in the final model. Here, we will evaluate the model using a range of alphas (`alpha_list`) to get a range of selected groups.
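
For reference, the grid of regularization coefficients used in the next cell can be previewed on its own. This sketch only reproduces the `np.linspace` call and does not train anything.

```python
import numpy as np

# Ten evenly spaced values from 2.2 down to 0.05, rounded to two decimals
alpha_list = np.round(np.linspace(2.2, 0.05, 10), 2)
print(alpha_list.tolist())
# [2.2, 1.96, 1.72, 1.48, 1.24, 1.01, 0.77, 0.53, 0.29, 0.05]
```

The list runs from the strongest to the weakest penalty, so with group lasso one would generally expect the earliest models to keep the fewest groups and later models to keep progressively more.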

In [None]:
# Setting list of regularization coefficients to generate models with
alpha_list = np.round(np.linspace(2.2,0.05,10), 2)

metric_dict = {}
selected_pathways = {}
group_norms = {}
group_names = list(combined_adata.uns['group_dict'].keys())
predicted = {}
auroc_array = np.zeros(alpha_list.shape)

# Iterating through alpha list, training/testing models, and capturing results
for i, alpha in enumerate(alpha_list):
    
    print(f'  Evaluating model. Alpha: {alpha}', flush = True)

    combined_adata = src.Train_Model(combined_adata, group_size= 2*D, alpha = alpha)
    predicted[alpha], metric_dict[alpha] = src.Predict(combined_adata, metrics = ['AUROC','F1-Score', 'Accuracy', 'Precision', 'Recall'])
    selected_pathways[alpha] = src.Find_Selected_Pathways(combined_adata)
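    # Each group owns a contiguous block of 2 * D coefficients; the slice end is exclusive, so no '- 1' is needed
    group_norms[alpha] = [np.linalg.norm(combined_adata.uns['model'].coef_[g * 2 * D: (g + 1) * 2 * D]) for g in np.arange(len(group_names))]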

results = {'Metrics' : metric_dict,
           'Selected_pathways' : selected_pathways,
           'Norms' : group_norms,
           'Predictions' : predicted,
           'Group_names' : group_names}
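
The group norms collected above can be read as a per-group importance score. The sketch below illustrates the idea on a toy coefficient vector, assuming (as the slicing in the loop suggests) that the coefficients are laid out in consecutive blocks of `2 * D` entries, one block per group. The values of `coef`, `D`, and `group_names` here are hypothetical.

```python
import numpy as np

D = 2                                  # toy value; the tutorial derives D from the cell count
group_names = ['G1', 'G2', 'G3']       # hypothetical group names
group_size = 2 * D                     # coefficients per group

# Hypothetical coefficient vector with one block of 2 * D entries per group
coef = np.array([3., 0., 0., 4.,       # G1
                 0., 0., 0., 0.,       # G2 (zeroed out by the penalty)
                 1., 1., 1., 1.])      # G3

# L2 norm of each group's block; Python slice ends are exclusive,
# so [g * group_size : (g + 1) * group_size] covers the whole block
norms = [float(np.linalg.norm(coef[g * group_size:(g + 1) * group_size]))
         for g in range(len(group_names))]

# Groups with a non-zero norm are the ones the model actually uses
selected = [name for name, norm in zip(group_names, norms) if norm > 0]

print(norms)     # [5.0, 0.0, 2.0]
print(selected)  # ['G1', 'G3']
```

A group with a norm of zero does not contribute to the prediction, which is why stronger regularization leads to fewer selected groups.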