# Associations of correction slopes across datasets

Because ion suppression is a complex, multifactorial process, it is hard to address mechanistically. Although this method uses only the area of overlap as the influencing parameter, the ion-specific slopes obtained when correcting different datasets should still carry molecular information about an ion's susceptibility to ion suppression. One indication of this would be a correlation between the slopes of ions that were acquired in multiple datasets. Since two Metabolomics and two Lipidomics datasets were analyzed, the two groups can be compared separately.

In [None]:
import os
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import re

In [None]:
source_path = '/home/mklein/FDA_project/data'
# datasets = [dir.name for dir in os.scandir(source_path) if dir.is_dir() and dir.name[0] != "." and dir.name[2] == "_"]
# 'Mx_Co_Cultured' is required for the Metabolomics comparison further below.
datasets = ['Lx_Glioblastoma', 'Mx_Seahorse', 'Lx_Pancreatic_Cancer', 'Mx_Co_Cultured']

In [None]:
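# Collect the per-ion correction summary (the .var table) of every dataset.
adatas = {}
for dset in datasets:
    adata = sc.read(os.path.join(source_path, dset, 'corrected_batch_sm_matrix.h5ad'))
    # Keep only the columns describing how each ion was corrected.
    df = adata.var[['corrected_only_using_pool', 'mean_correction_quantreg_slope', 'sum_correction_using_ion_pool']]
    adatas[dset] = df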


All slope information is loaded from the respective annotated data matrices. As the datasets usually consist of multiple wells, the reported correction slopes are mean values across the wells of a dataset. Ions that were corrected only using the reference pool are excluded (a stricter filter would be a maximum fraction of wells corrected by the reference pool, e.g. 50%).
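The stricter filter mentioned above could be sketched as follows. This is a minimal illustration on made-up data, assuming a long per-well table with a boolean `correction_using_ion_pool` column (the column name used in the per-well files loaded later in this notebook). The helper `ions_below_pool_fraction`, the dataset name `Lx_A`, the example ions and the 50% threshold are hypothetical.

```python
import pandas as pd

# Toy per-well table (made-up values): one row per dataset, well and ion.
per_well = pd.DataFrame({
    "dataset": ["Lx_A"] * 6,
    "well": ["A1", "A2", "A3"] * 2,
    "ion": ["C5H9NO4+H"] * 3 + ["C6H12O6+Na"] * 3,
    "correction_using_ion_pool": [False, False, True, True, True, False],
})

def ions_below_pool_fraction(table, max_fraction=0.5):
    """Return (dataset, ion) pairs whose fraction of pool-corrected wells is <= max_fraction."""
    # The mean of a boolean column is the fraction of True values per group.
    frac = table.groupby(["dataset", "ion"])["correction_using_ion_pool"].mean()
    return frac[frac <= max_fraction].index.tolist()

kept = ions_below_pool_fraction(per_well, max_fraction=0.5)
print(kept)
```

In this toy table the first ion is pool-corrected in one of three wells and is kept, while the second is pool-corrected in two of three wells and is dropped.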

In [None]:
df = pd.concat(adatas)
df.index.names = ['dataset', 'ion']
df.reset_index(inplace=True)
df = df[df['corrected_only_using_pool'] == False]

In [None]:
wide_df = df.pivot(index='ion', columns='dataset', values='mean_correction_quantreg_slope')
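# Split each ion label into sum formula and adduct, e.g. "C5H9NO4+H" -> "C5H9NO4" and "+H".
# Labels without an adduct do not match and are left unchanged in both columns.
wide_df["ion"] = [re.sub(r"(\w+)[+-]\w+", r"\1", i) for i in wide_df.index]
wide_df["adduct"] = [re.sub(r"(\w+)([+-]\w+)", r"\2", i) for i in wide_df.index]
wide_df.value_counts("adduct")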
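The split into sum formula and adduct can also be written with `Series.str.extract` and named groups. The following is a minimal sketch on made-up ion labels; it assumes that the adduct always starts at the first `+` or `-` in the label. Unlike a substitution, labels without an adduct get a missing value in the `adduct` column instead of the full label.

```python
import pandas as pd

# Made-up ion labels; the last one has no adduct.
ions = pd.Series(["C5H9NO4+H", "C6H12O6+Na", "C3H7NO3-H", "C16H32O2"])

# formula: everything before the first '+' or '-'; adduct: the optional remainder.
parts = ions.str.extract(r"^(?P<formula>[^+-]+)(?P<adduct>[+-].+)?$")
print(parts)
```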

In [None]:
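sns.pairplot(wide_df)
# wide_df also holds the string columns "ion" and "adduct", so restrict to numeric columns.
wide_df.corr(method="spearman", numeric_only=True)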

No ions are shared between the two Metabolomics datasets, because the ions in the coculture set are only available as sum formulas without a specific adduct, whereas all other datasets use sum formulas with specific adducts. Between the Lipidomics datasets, however, a number of ions overlap, and their slopes are positively correlated (Spearman r = 0.583).

In [None]:
# Name the FacetGrid `g` rather than `plt`, which is conventionally the matplotlib.pyplot alias.
g = sns.lmplot(wide_df[['Lx_Glioblastoma', 'Lx_Pancreatic_Cancer', 'adduct']], x='Lx_Pancreatic_Cancer', y='Lx_Glioblastoma')
g.set_axis_labels("correction slope (Pancreatic cancer)", "correction slope (Glioblastoma)")
wide_df[['Lx_Glioblastoma', 'Lx_Pancreatic_Cancer']].corr(method='spearman')
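The correlation matrix above does not show how many ions the coefficient is based on. A minimal sketch of how the number of shared ions and a p-value could be reported alongside the coefficient, using `scipy.stats.spearmanr` on made-up slopes (the column names `dataset_a` and `dataset_b` and all values are hypothetical):

```python
import pandas as pd
from scipy.stats import spearmanr

# Made-up mean slopes per ion; None marks ions missing from a dataset.
slopes = pd.DataFrame(
    {
        "dataset_a": [-0.9, -0.5, -0.7, None, -0.2],
        "dataset_b": [-0.8, -0.4, -0.3, -0.6, None],
    },
    index=["ion1", "ion2", "ion3", "ion4", "ion5"],
)

# Keep only ions with a slope in both datasets.
shared = slopes.dropna()
n_shared = len(shared)

rho, p_value = spearmanr(shared["dataset_a"], shared["dataset_b"])
print(f"n = {n_shared}, Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```

With only a handful of shared ions, as in this toy example, the p-value is not meaningful; the point is to report the sample size next to the coefficient.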

In [None]:
# One panel per adduct; the FacetGrid is named `g` to avoid shadowing the matplotlib.pyplot alias.
g = sns.lmplot(wide_df[['Lx_Glioblastoma', 'Lx_Pancreatic_Cancer', 'adduct']], x='Lx_Pancreatic_Cancer', y='Lx_Glioblastoma', hue='adduct', col="adduct", palette="cividis")
g.set_axis_labels("correction slope (Pancreatic cancer)", "correction slope (Glioblastoma)")
wide_df[['Lx_Glioblastoma', 'Lx_Pancreatic_Cancer', 'adduct']].groupby('adduct').corr(method='spearman')

To compare the metabolites annotated in the two Metabolomics datasets, the adducts have to be stripped from the ions in the Seahorse dataset. However, because only 58 metabolites are annotated in the coculture dataset, just 8 metabolites are annotated in both datasets. The corresponding slopes do not show a positive correlation.

In [None]:
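# Keep everything before the first '+' or '-' (the sum formula without its adduct).
# assign() returns a new frame, which avoids a SettingWithCopyWarning on the filtered df.
df = df.assign(ion_stripped=df['ion'].str.extract(r'^([^+-]+)', expand=False))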
df_stripped = df.groupby(['dataset', 'ion_stripped']).mean(numeric_only=True).reset_index()

In [None]:
wide_df_stripped = df_stripped.pivot(index='ion_stripped', columns='dataset', values='mean_correction_quantreg_slope')
wide_df_stripped = wide_df_stripped.reset_index()

In [None]:
sns.lmplot(wide_df_stripped[['Mx_Co_Cultured', 'Mx_Seahorse']], x='Mx_Co_Cultured', y='Mx_Seahorse')
wide_df_stripped[['Mx_Co_Cultured', 'Mx_Seahorse']].corr(method='spearman')

So far, only mean correction slopes have been compared. Because the wells of a dataset are corrected separately, it is also informative to compare the individual per-well correction slopes. To this end, the slopes of all wells and datasets are combined into one table and visualized using PCA (NaNs are replaced by 0). The PCA shows that wells within a dataset are much more similar to each other than to wells of other datasets. In addition, the Lipidomics datasets tend to lie on the left side of the plot, and changing the correction parameter `correction_proportion_threshold` has relatively little effect on the slopes.

In [None]:
# Collect the per-well correction results of every dataset.
all_adatas = {}
for dset in datasets:
    # Every non-hidden subdirectory of a dataset corresponds to one well.
    samples = [entry.name for entry in os.scandir(os.path.join(source_path, dset)) if entry.is_dir() and entry.name[0] != "."]
    dset_adata = {}
    for s in samples:
        adata = sc.read(os.path.join(source_path, dset, s, 'cells_spatiomolecular_adata_corrected.h5ad'))
        var_df = adata.var[['correction_full_pixel_avg_intensities', 'correction_n_datapoints', 'correction_n_iterations',
                'correction_quantreg_intersect', 'correction_quantreg_slope', 'correction_using_ion_pool']]
        dset_adata[s] = var_df
    all_adatas[dset] = pd.concat(dset_adata)

In [None]:
all_wells_df = pd.concat(all_adatas)
all_wells_df

In [None]:
all_wells_df.index.names = ['dataset', 'well', 'ion']
all_wells_df.reset_index(inplace=True)
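# Copy the filtered frame so that adding a column does not raise a SettingWithCopyWarning.
all_wells_df = all_wells_df[all_wells_df['correction_using_ion_pool'] == False].copy()
all_wells_df['set_well'] = all_wells_df['dataset'] + "_" + all_wells_df['well']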

In [None]:
all_wells_wide_df = all_wells_df.pivot(index='ion', columns=['dataset', 'well', 'set_well'], values='correction_quantreg_slope')
pc_df = all_wells_wide_df.T.reset_index(['dataset', 'well'])
pc_df

In [None]:
pc_df[all_wells_wide_df.index].replace(np.nan, 0)
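Because missing slopes are replaced by 0, the separation between datasets in the PCA may partly reflect which ions were corrected in which wells rather than differences in the slope values themselves. A quick check is to look at the fraction of missing values per well and at the ions that have slopes in every dataset. The following is a minimal sketch on a made-up wells-by-ions matrix; all names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy wells x ions matrix (made-up slopes); NaN marks ions without a slope in a well.
toy = pd.DataFrame(
    {
        "ionA": [-0.5, -0.6, np.nan, np.nan],
        "ionB": [-0.2, np.nan, np.nan, np.nan],
        "ionC": [np.nan, np.nan, -0.8, -0.7],
    },
    index=pd.MultiIndex.from_tuples(
        [("set1", "w1"), ("set1", "w2"), ("set2", "w1"), ("set2", "w2")],
        names=["dataset", "well"],
    ),
)

# Fraction of ions without a slope, per well and averaged per dataset.
missing_per_well = toy.isna().mean(axis=1)
missing_per_dataset = missing_per_well.groupby(level="dataset").mean()

# Ions that have at least one slope in every dataset.
present = toy.notna().groupby(level="dataset").any()
in_all = present.all(axis=0)
shared_ions = in_all[in_all].index.tolist()

print(missing_per_dataset)
print(shared_ions)
```

If few ions are shared across datasets, as in this toy matrix, restricting the PCA to the shared ions would be one way to check how much of the separation is driven by the zero-filled entries.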

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
# Rows are wells, columns are ions; missing slopes are replaced by 0.
X = pc_df[all_wells_wide_df.index].replace(np.nan, 0)
# Shortened dataset labels (first four characters); not used by the PCA itself.
y = [i[:4] for i in list(pc_df['dataset'])]
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2'], index=pc_df.index)

In [None]:
finalDf = pd.concat([principalDf, pc_df[['dataset', 'well']]], axis = 1)

In [None]:
sns.scatterplot(finalDf, x = 'PC1', y="PC2", hue='dataset', palette = "cividis")

In [None]:
pca.explained_variance_ratio_

In [None]:
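# Import under an alias so the sklearn `pca` object fitted above is not overwritten.
from pca import pca as pca_model
# Initialize
model = pca_model()
# Fit transform
out = model.fit_transform(X)

# Print the features (ions) that contribute most to each principal component
print(out['topfeat'])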

In [None]:
# Biplot with the 20 strongest-loading ions; biplot returns both the figure and the axes.
fig, ax = model.biplot(n_feat=20, legend=False)