Key points

1. This notebook is a tutorial on using the normalization modules of the scalr library.
2. It also compares the results with those of standard libraries such as sklearn and scanpy.
3. The scalr modules are built to handle very large datasets (hundreds of thousands of samples) under tight resource constraints, which standard libraries cannot process all at once.

# Imports

In [2]:
import sys
sys.path.append('/home/anand/bioc_repo/single_cell_classification/')

In [3]:
import numpy as np
import anndata

# scalr library normalization modules.
from _scalr.data.preprocess import standard_scale, sample_norm

# Scanpy library for sample-norm
import scanpy as sc
# Sklearn library for standard scaler object
from sklearn.preprocessing import StandardScaler

%reload_ext autoreload
%autoreload 2

# Data generation

In [7]:
# Setting seed for reproducibility.
np.random.seed(0)

In [8]:
# Anndata object is required for using pipeline normalization functions.
train_adata = anndata.AnnData(X=np.random.rand(10, 4))
train_adata

AnnData object with n_obs × n_vars = 10 × 4

In [9]:
train_adata.X

array([[0.5488135 , 0.71518937, 0.60276338, 0.54488318],
       [0.4236548 , 0.64589411, 0.43758721, 0.891773  ],
       [0.96366276, 0.38344152, 0.79172504, 0.52889492],
       [0.56804456, 0.92559664, 0.07103606, 0.0871293 ],
       [0.0202184 , 0.83261985, 0.77815675, 0.87001215],
       [0.97861834, 0.79915856, 0.46147936, 0.78052918],
       [0.11827443, 0.63992102, 0.14335329, 0.94466892],
       [0.52184832, 0.41466194, 0.26455561, 0.77423369],
       [0.45615033, 0.56843395, 0.0187898 , 0.6176355 ],
       [0.61209572, 0.616934  , 0.94374808, 0.6818203 ]])

In [10]:
# Creating test anndata object.
test_adata = anndata.AnnData(X=np.random.rand(5, 4))
test_adata

AnnData object with n_obs × n_vars = 5 × 4

In [11]:
test_adata.X

array([[0.3595079 , 0.43703195, 0.6976312 , 0.06022547],
       [0.66676672, 0.67063787, 0.21038256, 0.1289263 ],
       [0.31542835, 0.36371077, 0.57019677, 0.43860151],
       [0.98837384, 0.10204481, 0.20887676, 0.16130952],
       [0.65310833, 0.2532916 , 0.46631077, 0.24442559]])

# 1. StandardScaler

## scalr package - how to use it?

In [35]:
# Creating object for standard scaling normalization.
obj_ss = standard_scale.StandardScaler(with_mean=False)

print('\n1. `fit()` function parameters :', obj_ss.fit.__annotations__)
print('\n2. `transform()` function parameters :', obj_ss.transform.__annotations__)

INFO:absl:Applying Standard Scaler normalization on data.



1. `fit()` function parameters : {'data': typing.Union[anndata._core.anndata.AnnData, anndata.experimental.multi_files._anncollection.AnnCollection], 'sample_chunksize': <class 'int'>, 'return': None}

2. `transform()` function parameters : {'data': <class 'numpy.ndarray'>, 'return': <class 'numpy.ndarray'>}


In [36]:
# Fitting object on train data.
# `sample_chunksize` is the number of samples processed per chunk while extracting the required parameters (mean, standard deviation) from the data.
# Choose a value that fits in your memory, e.g. 2000, 3000, 5000 or 10000.
sample_chunksize = 2
obj_ss.fit(train_adata, sample_chunksize=sample_chunksize)

# Transforming the test data using above created object.
test_adata_pipeline = obj_ss.transform(test_adata.X)

INFO:absl:Calculating mean of data...
INFO:absl:Calculating standard deviation of data...
INFO:absl:Setting `train_mean` to be zero, as `with_mean` is set to False!


## sklearn package for standard scaling
- Developers can ignore this section

In [37]:
# Standard scaling using sklearn package
std_scaler = StandardScaler(with_mean=False)
std_scaler.fit(train_adata.X)
test_adata_sklearn = std_scaler.transform(test_adata.X)
test_adata_sklearn

array([[0.79644353, 1.71130322, 1.45626525, 0.16287987],
       [1.36937826, 2.43447219, 0.40712456, 0.32324491],
       [0.64350622, 1.31152151, 1.09608649, 1.09235253],
       [2.33021957, 0.42523965, 0.46401686, 0.46427603],
       [1.39074325, 0.95334427, 0.93563236, 0.63540274]])
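
For reference, sklearn's `StandardScaler` also exposes `partial_fit()`, which updates the statistics incrementally. The sketch below uses random illustrative data (an assumption, not the arrays from this notebook) to show that fitting in chunks of two samples gives the same scale as fitting on the full array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data only.
rng = np.random.default_rng(0)
X_train = rng.random((10, 4))
X_test = rng.random((5, 4))

# Fit on the whole array in one call.
full = StandardScaler(with_mean=False).fit(X_train)

# Fit incrementally, two samples at a time.
incremental = StandardScaler(with_mean=False)
for start in range(0, X_train.shape[0], 2):
    incremental.partial_fit(X_train[start:start + 2])

same_scale = np.allclose(full.scale_, incremental.scale_)
same_output = np.allclose(full.transform(X_test), incremental.transform(X_test))
print(same_scale, same_output)
```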

## Comparing scalr library results with sklearn library results

In [39]:
# Checking if the element-wise absolute error is less than 1e-15
abs(test_adata_sklearn - test_adata_pipeline) < 1e-15

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])
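
The element-wise check above prints a full boolean matrix, which becomes hard to read for large arrays. A compact alternative, sketched here on small made-up arrays, is `np.allclose` for a single boolean or `np.testing.assert_allclose` for an assertion that fails with a diagnostic message. The tolerance `atol=1e-15` mirrors the threshold used above, and `rtol=0.0` disables the relative term so that only the absolute error is checked.

```python
import numpy as np

# Made-up arrays that differ only by floating-point rounding.
a = np.array([[0.1 + 0.2, 1.0], [2.0, 3.0]])
b = np.array([[0.3, 1.0], [2.0, 3.0]])

exactly_equal = np.array_equal(a, b)                 # bitwise comparison
close_enough = np.allclose(a, b, rtol=0.0, atol=1e-15)

print(exactly_equal, close_enough)

# Raises AssertionError with details if any element exceeds the tolerance.
np.testing.assert_allclose(a, b, rtol=0.0, atol=1e-15)
```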

# 2. SampleNorm

## scalr package - how to use it?

In [40]:
# Sample norm using pipeline
obj_sample_norm = sample_norm.SampleNorm()

print('\n1. `transform()` function parameters :', obj_sample_norm.transform.__annotations__)

INFO:absl:Applying Sample-wise normalization on data.



1. `transform()` function parameters : {'data': <class 'numpy.ndarray'>, 'return': <class 'numpy.ndarray'>}


In [41]:
# Fitting is not required on train data for sample-norm.

# Transforming on test data.
test_data_sample_norm_pipeline = obj_sample_norm.transform(test_adata.X)
test_data_sample_norm_pipeline

array([[0.23128455, 0.2811586 , 0.4488116 , 0.03874524],
       [0.39766289, 0.39997167, 0.12547318, 0.07689227],
       [0.18687207, 0.21547646, 0.33780682, 0.25984466],
       [0.67668801, 0.06986476, 0.14300702, 0.11044021],
       [0.40386721, 0.15662972, 0.28835589, 0.15114718]])
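
The output above is consistent with dividing each sample (row) by its own sum so that every row adds up to 1. A minimal NumPy sketch of that idea follows. The helper `sample_norm_rows` and its `target_sum` parameter are hypothetical names and are not part of the scalr API. The input is the first two rows of `test_adata.X` as printed earlier.

```python
import numpy as np


def sample_norm_rows(X, target_sum=1.0):
    """Hypothetical helper: scale each row so that it sums to `target_sum`."""
    X = np.asarray(X, dtype=np.float64)
    row_sums = X.sum(axis=1, keepdims=True)
    # Assumes every row has a non-zero sum; all-zero rows would need special handling.
    return X / row_sums * target_sum


# First two rows of `test_adata.X` as printed earlier in this notebook.
X = np.array([
    [0.3595079, 0.43703195, 0.6976312, 0.06022547],
    [0.66676672, 0.67063787, 0.21038256, 0.1289263],
])

normed = sample_norm_rows(X)
print(normed)
print(normed.sum(axis=1))
```

Because the statistic is computed per sample, no fitting on the training data is needed, which matches the `transform()`-only usage above.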

## Scanpy package for sample-norm
- Developers can ignore this section

In [42]:
# Sample norm using scanpy package
test_data_sample_norm_sc = sc.pp.normalize_total(test_adata, target_sum=1, inplace=False)
test_data_sample_norm_sc

{'X': array([[0.23128455, 0.2811586 , 0.4488116 , 0.03874524],
        [0.39766289, 0.39997167, 0.12547318, 0.07689227],
        [0.18687207, 0.21547646, 0.33780682, 0.25984466],
        [0.67668801, 0.06986476, 0.14300702, 0.11044021],
        [0.40386721, 0.15662972, 0.28835589, 0.15114718]]),
 'norm_factor': array([1., 1., 1., 1., 1.])}

## Comparing scalr library results with scanpy library results

In [43]:
# Checking if the element-wise absolute error is less than 1e-15
abs(test_data_sample_norm_sc['X'] - test_data_sample_norm_pipeline) < 1e-15

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])