# SCENTInEL
#### sc ElasticNet transductive and inductive ensemble learning

# LR multi-tissue cross-comparison

##### Ver:: A2_V6
##### Author(s) : Issac Goh
##### Date : 220823;YYMMDD
### Author notes
    - Current defaults scrape data from the web, so leave as default and run
    - slices model and anndata to same feature shape, scales anndata object
    - added some simple benchmarking
    - creates dynamic cutoffs for probability score (x*sd of mean) in place of more memory intensive confidence scoring
    - Does not have majority voting set on as default, but module does exist
    - Multinomial logistic regression relies on the (not always realistic) assumption of independence of irrelevant alternatives, whereas a series of binary logistic predictions does not. Collinearity is assumed to be relatively low, because otherwise it becomes difficult to differentiate between the impact of several variables
    - Feel free to feed this model latent representations which capture non-linear relationships; the model will attempt to resolve any linearly separable features. Feature engineering can be applied here.
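
The dynamic cutoff noted above (x * sd of the mean) can be sketched as follows. This is a minimal illustration under stated assumptions, not SCENTInEL's implementation: the helper name `flag_uncertain` is hypothetical, and the rule assumed here flags a cell when its top probability falls below the mean minus `dyn_std` standard deviations of the top probabilities of cells assigned to the same class.

```python
import numpy as np

def flag_uncertain(probs, dyn_std=1.96):
    """Flag cells whose winning probability is unusually low for their class.

    probs: (n_cells, n_classes) array of predicted class probabilities.
    Hypothetical helper; the cutoff rule is an assumption for illustration.
    """
    probs = np.asarray(probs, dtype=float)
    predicted = probs.argmax(axis=1)   # winning class per cell
    top_prob = probs.max(axis=1)       # probability of the winning class
    uncertain = np.zeros(len(probs), dtype=bool)
    for cls in np.unique(predicted):
        mask = predicted == cls
        # Class-specific cutoff: mean minus dyn_std standard deviations
        cutoff = top_prob[mask].mean() - dyn_std * top_prob[mask].std()
        uncertain[mask] = top_prob[mask] < cutoff
    return predicted, uncertain

probs = np.array([[0.95, 0.05],
                  [0.90, 0.10],
                  [0.92, 0.08],
                  [0.55, 0.45],
                  [0.20, 0.80]])
predicted, uncertain = flag_uncertain(probs, dyn_std=1.0)
```

Because the cutoff is computed per class, a class whose predictions are uniformly confident gets a stricter threshold than a class with widely spread probabilities.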
    
### Features to add
    - Add ability to consume AnnData zarr format for sequential learning
### Modes to run in
    - Run in training mode
    - Run in projection mode

In [8]:
import scanpy as sc
import pandas as pd
import numpy as np
import scentinel as scent
import pickle as pkl
import matplotlib.pyplot as plt  # needed for the plt.show() calls in the plotting cells below

In [13]:
models = {
'pan_fetal_wget':'https://celltypist.cog.sanger.ac.uk/models/Pan_Fetal_Suo/v2/Pan_Fetal_Human.pkl',
'YS_wget':'https://storage.googleapis.com/haniffalab/yolk-sac/YS_X_A2_V12_lvl3_ELASTICNET_YS.sav',
}

adatas_dict = {
'pan_fetal_wget':'https://cellgeni.cog.sanger.ac.uk/developmentcellatlas/fetal-immune/PAN.A01.v01.raw_count.20210429.PFI.embedding.h5ad',
'YS_wget':'https://app.cellatlas.io/yolk-sac/dataset/23/download',
'FLIV_wget':'https://app.cellatlas.io/fetal-liver/dataset/1/download'
}

# Variable assignment
train_model = False
feat_use = 'cell.labels'
adata_key = 'FLIV_wget'#'fliv_wget_test' # key for dictionary entry containing local or web path to adata/s can be either url or local 
data_merge = False # read and merge multiple adata (useful, but keep false for now)
model_key = 'pan_fetal_wget'#'test_low_dim_ipsc_ys'# key for model of choice can be either url or local 
train_x_partition = 'X' # what partition was the data trained on? To keep simple, for now only accepts 'X'
dyn_std = 1.96 # Dynamic cutoffs using std of the mean for each celltype probability; adds a column notifying the user of uncertain labels (1 ~ 68% CI, 1.96 ~ 95% CI)
freq_redist = 'cell.labels'#'cell.labels'#'False#'cell.labels'#False # False or key of column in anndata object which contains labels/clusters // not currently implemented
partial_scale = True # should data be scaled in batches?
QC_normalise = True # should data be normalised?

# training variables
penalty='elasticnet' # can be ["l1","l2","elasticnet"]
sparcity=0.5 # C penalty for degree of regularisation
thread_num = -1
l1_ratio = 0.5 # ratio between L1 and L2 regularisation, depending on penalty method

# Read in query data for projection

In [15]:
if train_model == True:
    from sklearn.preprocessing import StandardScaler
    adata =  scent.load_adatas(adatas_dict, data_merge, adata_key, QC_normalise)
    print('adata_loaded')
    import time
    t0 = time.time()
    display_cpu = scent.DisplayCPU()
    display_cpu.start()
    try:
        model_trained = scent.prep_training_data(feat_use = feat_use,
        adata_temp = adata,
        train_x_partition = train_x_partition,
        model_key = model_key + '_lr_model',
        batch_correction = 'Harmony',
        var_length = 7500,
        batch_key = 'donor',
        penalty='elasticnet', # can be ["l1","l2","elasticnet"],
        sparcity=sparcity, #If using LR without optimisation, this controls the sparsity in model
        max_iter = 1000, #Increase if experiencing max iter issues
        l1_ratio = l1_ratio, #If using elasticnet without optimisation, this controls the ratio between l1 and l2)
        partial_scale = False, #partial_scale,
        tune_hyper_params = True # Current implementation is very expensive, intentionally made rigid for now
        )
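        # 'model_name' was never defined; the output path below is derived from model_key (assumed naming)
        filename = model_key + '_lr_model.sav'
        with open(filename, 'wb') as f:
            pkl.dump(model_trained, f)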
    finally: #
        current, peak = display_cpu.stop()
        t1 = time.time()
        time_s = t1-t0
        print('training complete!')
        time.sleep(3)
        print('training time was ' + str(time_s) + ' seconds')
        print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
        print(f"starting memory usage is" +'' + str(display_cpu.starting))
        print('peak CPU % usage = '+''+ str(display_cpu.peak_cpu))
        print('peak CPU % usage/core = '+''+ str(display_cpu.peak_cpu_per_core))
    model_lr= model_trained
    adata =  scent.load_adatas(adatas_dict, data_merge, adata_key)
else:
    adata =  scent.load_adatas(adatas_dict, data_merge, adata_key,QC_normalise)
    model = scent.load_models(models,model_key)
    model_lr =  model
    
# run with usage logger
import time
t0 = time.time()
display_cpu = scent.DisplayCPU()
display_cpu.start()
try: #code here ##
    pred_out,train_x,model_lr,adata_temp = scent.reference_projection(adata, model_lr, dyn_std,partial_scale,train_x_partition)
    if freq_redist != False:
        pred_out = scent.freq_redist_68CI(adata,freq_redist)
        pred_out['orig_labels'] = adata.obs[freq_redist]
        adata.obs['consensus_clus_prediction'] = pred_out['consensus_clus_prediction']
    adata.obs['predicted'] = pred_out['predicted']
    adata_temp.obs = adata.obs
    
    # Estimate top model features for class discrimination
    feature_importance = scent.estimate_important_features(model_lr, 100)
    # Read results from the object returned above (assumed to carry these attributes)
    mat = feature_importance.euler_pow_mat
    top_loadings = feature_importance.to_n_features_long

finally: #
    current, peak = display_cpu.stop()
t1 = time.time()
time_s = t1-t0
print('projection complete!')
time.sleep(3)
print('projection time was ' + str(time_s) + ' seconds')
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
print(f"starting memory usage is" +'' + str(display_cpu.starting))
print('peak CPU % usage = '+''+ str(display_cpu.peak_cpu))
print('peak CPU % usage/core = '+''+ str(display_cpu.peak_cpu_per_core))

# regression summary
idx_map = dict(zip(  list(adata.obs[feat_use].unique()),list(range(0,len(list(adata.obs[feat_use].unique()))))))
scent.regression_results(adata.obs[feat_use].map(idx_map), adata.obs['predicted'].map(idx_map))

Loading anndata from web source
option to apply standardisation to data detected, performing basic QC filtering
Loading model from web source


https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/home/jovyan/my-conda-envs/scentinel_test/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/jovyan/my-conda-envs/scentinel_test/lib/python3.10/site-packages/scentinel/general_utlities.py", line 140, in run


TypeError: reference_projection() takes 4 positional arguments but 5 were given

    self.peak_cpu_per_core = peak_cpu_per_core
UnboundLocalError: local variable 'peak_cpu_per_core' referenced before assignment


# View by median probabilities per classification

In [None]:
scent.plot_label_probability_heatmap(pred_out)

# View by cross-tabulation of two categorical attributes

In [None]:
scent.plot_crosstab_heatmap(adata, feat_use, 'predicted')
scent.plot_crosstab_heatmap(adata, 'consensus_clus_prediction', 'predicted')

# View top predictive features per class

In [None]:
#Estimate dataset specific feature impact
for classes in ['pDC precursor_ys_HL','AEC_ys_HL']:
    scent.model_class_feature_plots(top_loadings, [str(classes)], 'e^coef')
    plt.show()

# Save predicted output

In [None]:
pred_out.to_csv('./A1_V3_sk_sk_pred_outs.csv')

# Assess feature impact on model predictions

In [None]:
# if using a low-dim model like PCA or ldVAE which has a weights layer
# top_loadings = compute_weighted_impact(varm_file = '/nfs/team205/ig7/projects/fetal_skin/3_160523_probabillistic_projection_organoid_adt_fetl/A2_V2_ldvae_models/v3_ldvae_obsm_weights.csv',top_loadings =  top_loadings, threshold=0.05)

for class_lin in top_loadings['class'].unique():
    scent.model_class_feature_plots(top_loadings, [class_lin], 'weighted_impact','e^coef',max_len= 20,title = class_lin)
    scent.analyze_and_plot(top_loadings,class_lin, max_len=20, pre_ranked=True, database='GO_Biological_Process_2021', cutoff=0.25, min_s=5)

In [None]:
list(top_loadings['class'].unique())

In [None]:
top_loadings[top_loadings['class'].isin(['Tip cell (arterial)','HSC','SPP1+ proliferating neuron proneitors'])].groupby(['class']).head(10)

In [None]:
for classes in ['Tip cell (arterial)','HSC','SPP1+ proliferating neuron proneitors']:
    scent.model_class_feature_plots(top_loadings, [str(classes)], 'e^coef')
    plt.show()

# Let's calculate an impact and specificity score for each cell

- We create a variable model impact factor by multiplying each gene's value by the model coefficient for the class
- This is the variable contribution of each feature for a class prediction given a model and data

$$X = \begin{bmatrix}(e^{coeff}_{n} \cdot g_1) & (e^{coeff}_{n} \cdot g_2) \\ (e^{coeff}_{n} \cdot g_1) & (e^{coeff}_{n} \cdot g_2) \\ (e^{coeff}_{n} \cdot g_1) & (e^{coeff}_{n} \cdot g_2) \\ \vdots & \vdots \end{bmatrix}$$

- We create a summed feature impact score for each cell by summing the per-feature impact scores, which identifies the overall impact of the model's contribution (the total model impact score)

$$Impact_{cell_x} = (e^{coeff}_{n} \cdot g_1) + (e^{coeff}_{n} \cdot g_2) + (e^{coeff}_{n} \cdot g_3) + \dots$$

- We measure the model feature effect on class decisions between organs and within organs
- We can now use these feature availability/impact metrics to compare the availability and differential impact of features between data for transductive and/or inductive runs
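
The two quantities above can be illustrated with a small NumPy sketch. The toy expression matrix and coefficient vector are made up for illustration, and the variable names are hypothetical; this is not the code behind `scent.calculate_feature_distribution`.

```python
import numpy as np

# Toy inputs (assumed shapes): 3 cells x 2 genes, plus the coefficients of one class n.
expr = np.array([[1.0, 0.0],
                 [2.0, 1.0],
                 [0.0, 3.0]])
coef_class_n = np.array([0.0, np.log(2.0)])   # so e^coef = [1, 2]

# Per-feature impact matrix X: each gene value scaled by e^coef for class n.
X = expr * np.exp(coef_class_n)

# Total model impact score per cell: the sum of its per-feature impacts.
impact_per_cell = X.sum(axis=1)
```

Each row of `X` corresponds to one cell, so `impact_per_cell` has one entry per cell and can be compared across datasets or organs.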

In [None]:
df_impact = scent.calculate_feature_distribution(adata, top_loadings, var='predicted')

# Label stability scoring for individual label performance

In [377]:
# pred_col shape should match the pred_out original labels, so some self-projection works best here
pred_col = list(pred_out.columns[pred_out.columns.isin(set(pred_out['orig_labels']))])
loss, log_losses, weights = scent.compute_label_log_losses(pred_out, 'orig_labels', pred_col)
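
For intuition, a per-label log loss can be sketched as below. This is an assumption about what a label-level log loss looks like (the mean of -log p(true label) over the cells of each label); the helper `per_label_log_loss` is hypothetical and is not the implementation of `scent.compute_label_log_losses`.

```python
import numpy as np

def per_label_log_loss(probs, true_idx, eps=1e-15):
    """Mean negative log-probability of the true label, grouped by true label.

    probs: (n_cells, n_labels) probabilities; true_idx: integer label per cell.
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)  # avoid log(0)
    true_idx = np.asarray(true_idx)
    # Probability that each cell assigns to its own true label
    p_true = probs[np.arange(len(true_idx)), true_idx]
    cell_loss = -np.log(p_true)
    return {int(k): float(cell_loss[true_idx == k].mean())
            for k in np.unique(true_idx)}

probs = np.array([[0.50, 0.50],
                  [1.00, 0.00],
                  [0.25, 0.75]])
true_idx = np.array([0, 0, 1])
losses = per_label_log_loss(probs, true_idx)
```

Labels with a higher mean loss are those the model predicts less reliably, which is why self-projection gives the cleanest comparison.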

# Label confidence scoring, weighted probabilities and label propagation

## Bayesian KNN label stability
For modelling label uncertainty given neighborhood membership and distances

#### Step 1: Generate Binary Neighborhood Membership Matrix
The first step is to generate a binary neighborhood membership matrix from the connectivity matrix. This is done with the function get_binary_neigh_matrix(connectivities), which takes a connectivity matrix as input and outputs a binary matrix indicating whether a cell is a neighbor of another cell.

The connectivity matrix represents the neighborhood relationships between cells, typically obtained from KNN analysis. In this matrix, each row and column represent a cell, and an entry indicates the 'connectivity' between the corresponding cells.

The function transforms the connectivity matrix into a binary matrix by setting all non-zero values to 1, indicating a neighborhood relationship, and all zero values remain as 0, indicating no neighborhood relationship.
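
A minimal dense sketch of this binarisation is shown below. The helper name `binarize_connectivities` is a hypothetical stand-in rather than the package's `get_binary_neigh_matrix`, and real connectivity matrices are usually sparse, which this toy example ignores.

```python
import numpy as np

def binarize_connectivities(connectivities):
    """Return a 0/1 matrix marking every non-zero connectivity as a neighbour."""
    return (np.asarray(connectivities) != 0).astype(int)

# Toy 3-cell connectivity matrix (values are illustrative only)
conn = np.array([[0.0, 0.8, 0.0],
                 [0.8, 0.0, 0.3],
                 [0.0, 0.3, 0.0]])
neigh = binarize_connectivities(conn)
```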

#### Step 2: Calculate Label Counts
Next, the function get_label_counts(neigh_matrix, labels) is used to count the number of occurrences of each label in the neighborhood of each cell. The input to this function is the binary neighborhood membership matrix and a list of labels for each cell.

The function returns a matrix in which each row corresponds to a cell, and each column corresponds to a label. Each entry is the count of cells of a particular label in the neighborhood of a given cell.
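
One way to obtain such a count matrix is to multiply the binary neighborhood matrix by a one-hot encoding of the labels. The sketch below assumes dense inputs, and the helper name `count_neighbor_labels` is hypothetical; it is not the package's `get_label_counts`.

```python
import numpy as np

def count_neighbor_labels(neigh_matrix, labels):
    """Count, for each cell, how many of its neighbours carry each label."""
    labels = np.asarray(labels)
    classes = np.unique(labels)                      # sorted unique labels
    # One-hot encode labels: rows are cells, columns are classes
    one_hot = (labels[:, None] == classes[None, :]).astype(int)
    # Row i of the product sums the one-hot rows of cell i's neighbours
    counts = np.asarray(neigh_matrix) @ one_hot
    return counts, classes

neigh = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]])
labels = ['HSC', 'HSC', 'pDC']
counts, classes = count_neighbor_labels(neigh, labels)
```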

#### Step 3: Compute Distance-Entropy Product
In the third step, the function compute_dist_entropy_product(neigh_membership, labels, dist_matrix) computes the product of the average neighborhood distance and the entropy of the label distribution in the neighborhood for each cell and each label.

The entropy of a label distribution in a neighborhood is a measure of the diversity or 'mix' of labels in that neighborhood, with higher entropy indicating a more diverse mix of labels. The average neighborhood distance for a cell is the average distance from that cell to all other cells in its neighborhood.

By multiplying the entropy with the average distance, this function captures two important aspects of the neighborhood:

- Entropy: The diversity of labels in a neighborhood. High entropy means the neighborhood is a 'melting pot' of many different labels, while low entropy indicates a neighborhood dominated by a single label.
- Distance: The spatial proximity of cells in a neighborhood. A high average distance means the cells in a neighborhood are widely dispersed, while a low average distance indicates a compact, closely-knit neighborhood.

Thus, the distance-entropy product for a cell provides a measure of the 'stability' of the cell's label, with lower values indicating a stable, consistent label and higher values indicating an unstable, inconsistent label.
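
The per-cell product can be sketched with NumPy as follows. This is a simplified illustration under assumptions (dense matrices, natural-log Shannon entropy, one value per cell); the helper name `dist_entropy_product` is hypothetical and differs from the package's `compute_dist_entropy_product`.

```python
import numpy as np

def dist_entropy_product(neigh_matrix, label_counts, dist_matrix):
    """Per-cell product of neighbourhood label entropy and mean neighbour distance."""
    neigh = np.asarray(neigh_matrix, dtype=float)
    counts = np.asarray(label_counts, dtype=float)
    dist = np.asarray(dist_matrix, dtype=float)

    # Label proportions within each neighbourhood (rows with no neighbours stay 0)
    totals = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

    # Shannon entropy, treating 0 * log(0) as 0
    logp = np.log(p, out=np.zeros_like(p), where=p > 0)
    entropy = -(p * logp).sum(axis=1)

    # Mean distance from each cell to its neighbours
    n_neigh = neigh.sum(axis=1)
    mean_dist = np.divide((neigh * dist).sum(axis=1), n_neigh,
                          out=np.zeros_like(n_neigh), where=n_neigh > 0)
    return entropy * mean_dist

neigh = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
counts = np.array([[1, 1], [1, 1], [2, 0]])          # neighbour label counts per cell
dist = np.array([[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
dep = dist_entropy_product(neigh, counts, dist)
```

In this toy case the third cell has neighbours of a single label, so its entropy and therefore its product are zero regardless of distance.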

#### Step 4: Bayesian Sampling and Weight Calculation
The final step is the compute_weights function, which uses Bayesian inference to compute a posterior distribution of the distance-entropy product for each label and calculates the weights.

In Bayesian inference, we start with a prior distribution that represents our initial belief about the parameter we're interested in, and we update this belief using observed data to get a posterior distribution.

In this case, the prior distribution is a normal distribution with mean and standard deviation equal to the mean and standard deviation of the distance-entropy product for the original labels. The observed data is the distance-entropy product for the predicted labels. A normal distribution is a pragmatic choice for the prior because the distance-entropy product is a continuous variable. Note that the product is non-negative (both entropy and distance are non-negative), so the normal prior is an approximation rather than an exact description of the variable's range.

After sampling from the posterior distribution, the weight for each label is calculated as one minus the ratio of the standard deviation of the posterior distribution to the maximum standard deviation across all labels. This means that labels with a larger standard deviation (indicating greater uncertainty about their stability) will have smaller weights, and labels with a smaller standard deviation (indicating less uncertainty) will have larger weights.

The weights are returned as a dictionary where each key-value pair corresponds to a label and its weight.
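
The final weighting rule (one minus the ratio of each label's posterior standard deviation to the maximum) can be sketched as below. The posterior samples are hard-coded toy arrays rather than the output of any sampler, and the helper name `weights_from_posterior_samples` is hypothetical; the Bayesian sampling itself is not reproduced here.

```python
import numpy as np

def weights_from_posterior_samples(samples_by_label):
    """Weight each label as 1 - sd / max_sd of its posterior samples."""
    sds = {label: float(np.std(s)) for label, s in samples_by_label.items()}
    max_sd = max(sds.values())
    if max_sd == 0:
        # No spread anywhere: treat every label as fully stable
        return {label: 1.0 for label in sds}
    return {label: 1.0 - sd / max_sd for label, sd in sds.items()}

# Toy posterior samples of the distance-entropy product per label
samples = {'HSC': np.array([1.0, 1.0, 3.0, 3.0]),   # sd = 1
           'pDC': np.array([0.0, 0.0, 4.0, 4.0])}   # sd = 2
weights = weights_from_posterior_samples(samples)
```

Note that under this rule the label with the largest standard deviation always receives a weight of 0.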

#### Step 5: Apply Weights to Probabilities
Finally, the weights are applied to the probability dataframe with the function apply_weights(prob_df, weights). The input to this function is a dataframe where each row corresponds to a cell and each column corresponds to a label, with each entry being the probability of the cell being of the label, and a dictionary of weights.

This function multiplies each column of the probability dataframe by the corresponding weight, effectively 'boosting' the probabilities of labels with larger weights and 'penalizing' the probabilities of labels with smaller weights. After applying the weights, the function normalizes the probabilities so that they sum to 1 for each cell, returning a dataframe of the same shape as the input but with the probabilities weighted and normalized.
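
A minimal pandas sketch of this weighting and renormalisation is given below. The helper name `apply_label_weights` is hypothetical and is not the package's `apply_weights`; as an assumption, labels missing from the weights dictionary are left unweighted (weight 1).

```python
import pandas as pd

def apply_label_weights(prob_df, weights):
    """Scale each label column by its weight, then renormalise rows to sum to 1."""
    # Align weights to the columns; labels without a weight keep weight 1 (assumption)
    w = pd.Series(weights, dtype=float).reindex(prob_df.columns, fill_value=1.0)
    weighted = prob_df.mul(w, axis=1)
    # Rows whose weighted probabilities sum to 0 would become NaN here
    return weighted.div(weighted.sum(axis=1), axis=0)

prob_df = pd.DataFrame([[0.5, 0.5],
                        [0.2, 0.8]],
                       columns=['HSC', 'pDC'], index=['cell_1', 'cell_2'])
weighted_df = apply_label_weights(prob_df, {'HSC': 1.0, 'pDC': 0.5})
```

The output keeps the shape, index and columns of the input, so it can be stored alongside the unweighted probabilities for comparison.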

Overall, this method provides a principled way to quantify label uncertainty and adjust the probabilities output by a logistic regression model accordingly. It combines the strengths of KNN, which can capture local structure and relationships in the data, and Bayesian inference, which provides a robust framework for dealing with uncertainty and incorporating prior knowledge. By weighting the probabilities according to the stability of the labels, this method can potentially improve the accuracy and interpretability of the logistic regression model's predictions.

In [None]:
weights = scent.compute_weights(adata,use_rep = 'neighbors', original_labels_col ='cell.labels', predicted_labels_col = 'cell.labels')
adata.obsm['pred_out'] = pred_out
adata.obsm['pred_out_weighted'] = scent.apply_weights(adata.obsm['pred_out'],weights)

# Optionally now use the updated probabilities for label propagation 

In [403]:
# Here define new labels with the updated probabilities
# Run Freq-redist or 68CI redist amongst neighborhoods or new clusters