# Sample centering

In a downstream analysis I want to apply the Slingshot trajectory inference algorithm, which operates on a dimensionality-reduced representation of the data. Several reduction methods are possible; independent component analysis (ICA) is a promising choice. ICA expects input that is centered and whitened. In this notebook I look at how best to center the samples.
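
To make "centered and whitened" concrete, here is a minimal sketch on synthetic data, using only NumPy. The matrix `X` and its mixing matrix are made up for illustration and do not come from the EhV dataset; whitening is done with an eigendecomposition of the covariance matrix (PCA whitening), which is one common choice among several.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature matrix (rows = cells, columns = features).
# The mixing matrix and the offset are arbitrary illustrative values.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(500, 3)) @ mixing + 10.0

# Centering: subtract the per-feature mean so every column has mean zero.
X_centered = X - X.mean(axis=0)

# Whitening: rotate onto the eigenvectors of the covariance matrix and
# rescale each direction to unit variance.
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_white = X_centered @ eigvecs / np.sqrt(eigvals)

# After whitening, the covariance is the identity up to rounding error.
white_cov = np.cov(X_white, rowvar=False)
print(np.round(white_cov, 6))
```

Whitening presupposes centered data, which is why the centering strategy is examined first.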

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# %load common_.py
from common import *

In [34]:
# export
from ehv import core, load as e_load, correlation
from matplotlib import cm
import matplotlib
from sklearn.preprocessing import scale, robust_scale, minmax_scale
from sklearn.decomposition import PCA

In [11]:
samples = None
# samples = pandas.read_csv("data/selected_samples.csv")
df = e_load.load_raw_ideas_dir(
    Path("/data/weizmann/EhV/high_time_res"), 
    Path("/data/weizmann/EhV/weizmann-ehv-metadata/representations/ideas_features/"), 
    "ALL", 
    Path("/data/weizmann/EhV/weizmann-ehv-metadata/cell_populations/manual_gating/"),
    samples, "Low/*.cif")
df = e_load.remove_unwanted_features(df)
df = e_load.tag_columns(df)
df = e_load.clean_column_names(df)

df = df[df["meta_label_coi"]]
df.shape

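import re
reg = r"^meta_label_(.+)$"
# Default every row to "unknown", then overwrite it with the matching psba label.
label_vec = numpy.full((df.shape[0]), fill_value="unknown", dtype=object)
for col in df.filter(regex="(?i)meta_label_.*psba.*"):
    # group(1) returns the captured label as a string; groups(1) would return a tuple.
    label_vec[df[col].values] = re.match(reg, col).group(1)

df["meta_label"] = label_vec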

## Mean centering

In [64]:
diff = df.filter(regex="(meta_timepoint|feat)").groupby(["meta_timepoint"]).mean() - df.filter(regex="feat").mean()

In [65]:
diff = diff.reset_index()

In [68]:
diff.filter(regex="feat").mean().mean()

-233.62424

In [12]:
centered_df = df.copy()

In [13]:
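def mean_center(df):
    # Subtract the per-feature mean of this group from all feature columns.
    df = df.copy()  # avoid modifying the group that groupby passes in
    cols = df.filter(regex="feat").columns
    df[cols] = df[cols] - df[cols].mean()
    return df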

centered_df = centered_df.groupby(["meta_timepoint"]).apply(mean_center)

In [16]:
centered_df.filter(regex="feat").mean().mean()

1.6232061e-05

In [19]:
(centered_df.filter(regex="feat") - centered_df.filter(regex="feat").mean()).mean().mean()

4.3586856e-06
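
The per-timepoint centering above can also be written with `groupby(...).transform("mean")`, which avoids `apply`. The sketch below runs on a small synthetic frame; the `toy` data and the column names `feat_a` and `feat_b` are hypothetical, and only `meta_timepoint` mirrors the real data. On this float64 toy data the group means come out at zero up to rounding. The residuals of order 1e-5 seen above are consistent with floating-point rounding, possibly in single precision, though that is an assumption about the feature dtype.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: three timepoints whose feature means differ strongly.
toy = pd.DataFrame({
    "meta_timepoint": np.repeat([0, 1, 2], 100),
    "feat_a": rng.normal(loc=np.repeat([0.0, 5.0, 10.0], 100), scale=1.0),
    "feat_b": rng.normal(loc=np.repeat([100.0, 50.0, 0.0], 100), scale=2.0),
})

feat_cols = toy.filter(regex="feat").columns

# transform("mean") returns, for every row, the mean of that row's timepoint,
# aligned to the original index, so the subtraction is element-wise.
group_mean = toy.groupby("meta_timepoint")[feat_cols].transform("mean")
toy_centered = toy.copy()
toy_centered[feat_cols] = toy[feat_cols] - group_mean

# Every timepoint now has (numerically) zero mean for every feature.
group_means = toy_centered.groupby("meta_timepoint")[feat_cols].mean()
print(group_means.abs().max())
```

Because every group mean is zero, the overall feature means are zero as well, which matches the observation that subtracting the global mean afterwards changes very little.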

## Median centering

In [69]:
diff = df.filter(regex="(meta_timepoint|feat)").groupby(["meta_timepoint"]).median() - df.filter(regex="feat").median()

In [70]:
diff = diff.reset_index()

In [71]:
diff.filter(regex="feat").mean().mean()

134.28098
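
By analogy with the mean centering section, per-timepoint median centering could look like the sketch below. This is not part of the analysis above: the `median_center` helper, the `toy` data, and the column `feat_a` are hypothetical, and the skewed exponential distribution is chosen only to illustrate a case where the median and the mean differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical skewed toy feature measured at two timepoints.
toy = pd.DataFrame({
    "meta_timepoint": np.repeat([0, 1], 101),
    "feat_a": rng.exponential(scale=np.repeat([1.0, 4.0], 101)),
})


def median_center(frame, group_col="meta_timepoint"):
    """Subtract the per-group median from every feature column."""
    cols = frame.filter(regex="feat").columns
    out = frame.copy()
    # transform("median") broadcasts each group's median back to its rows.
    out[cols] = frame[cols] - frame.groupby(group_col)[cols].transform("median")
    return out


toy_centered = median_center(toy)
feat_cols = toy_centered.filter(regex="feat").columns

# Group medians are zero after centering; group means generally are not,
# because the toy distribution is skewed.
group_medians = toy_centered.groupby("meta_timepoint")[feat_cols].median()
group_means = toy_centered.groupby("meta_timepoint")[feat_cols].mean()
print(group_medians)
print(group_means)
```

Median centering is less sensitive to outliers than mean centering, but it does not leave the data with zero mean, which is worth keeping in mind since ICA assumes centered input.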