# <span style="color: steelblue;">Normalization using scaLR</span>

Key points

1. This notebook is a tutorial on using the normalization modules of the scaLR library.
2. We also compare the results against standard libraries such as sklearn and scanpy.
3. The scaLR modules are built to handle very large datasets, for example hundreds of thousands of samples, under tight resource constraints, which standard libraries cannot process in a single pass.

## <span style="color: steelblue;">Cloning scaLR</span>

In [None]:
!git clone https://github.com/infocusp/scaLR.git

## <span style="color: steelblue;">Library Installation and Import</span>

In [None]:
!pip install anndata scanpy pydeseq2

In [None]:
from copy import deepcopy
import sys
sys.path.append('scaLR')

import pandas as pd
import numpy as np
import anndata

# scalr library normalization modules.
from scalr.data.preprocess import standard_scale, sample_norm
from scalr.data_ingestion_pipeline import DataIngestionPipeline
from scalr.utils.file_utils import read_data, write_data, write_chunkwise_data

# Scanpy library for sample-norm
import scanpy as sc
# Sklearn library for standard scaler object
from sklearn.preprocessing import StandardScaler
from os import path

%reload_ext autoreload
%autoreload 2

## <span style="color: steelblue;">Downloading data</span>
- Downloading an anndata from `cellxgene` and making a subset anndata with 1000 genes for the downstream analysis.

In [None]:
# This cell takes approximately 00:00:24 (hh:mm:ss) to run.
!wget -P data https://datasets.cellxgene.cziscience.com/16acb1d0-4108-4767-9615-0b42abe09992.h5ad

In [None]:
# Reading data
adata = anndata.read_h5ad('data/16acb1d0-4108-4767-9615-0b42abe09992.h5ad')
print(f"\nThe anndata has '{adata.n_obs}' cells and '{adata.n_vars}' genes")

In [None]:
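# Inspecting expression values of the first 10 genes in the first 10 cells.
# `.toarray()` is used instead of `.A`, which is not available in every SciPy version.
adata.X[:10,:10].toarray()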

- In the current `AnnData` object, the gene expression data in `X` has already been normalized. Ideally, normalization should be applied only if the raw data is present in `X`.
- For this tutorial, we will create a new `AnnData` object using the raw gene expression values.
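
One quick way to tell raw counts from normalized values is to check that every value is a non-negative whole number. The snippet below is a minimal sketch of that heuristic on toy data. `looks_like_raw_counts` is a hypothetical helper, not part of scaLR, scanpy, or anndata, and a passing check does not guarantee that the data is truly raw.

```python
import numpy as np


def looks_like_raw_counts(values) -> bool:
    """Heuristic: raw counts are non-negative whole numbers."""
    values = np.asarray(values)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


# Toy dense arrays standing in for slices of `adata.raw.X` and `adata.X`.
raw_counts = np.array([[0, 3, 1], [2, 0, 5]], dtype=np.float32)
log_normalized = np.log1p(raw_counts)

print(looks_like_raw_counts(raw_counts))      # True
print(looks_like_raw_counts(log_normalized))  # False
```

For a sparse matrix, a dense slice such as `X[:1000].toarray()` can be passed in, so that the whole matrix never has to be densified.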

In [None]:
# Checking for raw gene expression
print(f"Raw expression data in anndata : {adata.raw is not None}")

In [None]:
adata.raw.X[:10,:10].toarray()

In [None]:
raw_adata = anndata.AnnData(X=adata.raw.X,var=adata.var,obs=adata.obs)
sc.write('/content/data/raw_adata.h5ad',raw_adata)

## <span style="color: steelblue;">Data Generation</span>

- In this section, the downloaded anndata will be split into train, validation, and test sets.
- To accomplish this, we'll call the `generate_train_val_test_split` method of scaLR's `DataIngestionPipeline`.
- The pipeline takes its parameters as a data config dictionary. For more information, please refer to the `DATA CONFIG` section in the [config.yaml](https://github.com/infocusp/scaLR/blob/main/config/config.yaml) file of scaLR.
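
To get a feel for what a split ratio such as `[7, 1, 2.5]` means, the sketch below converts it into target fractions. It assumes the ratio is normalized by its sum, which is not confirmed here for scaLR. With a group-based splitter, the realized fractions can also deviate from these targets if whole groups (for example, donors) are kept together.

```python
split_ratio = [7, 1, 2.5]  # train, val, test, as in the data config

# Assumption: each entry is divided by the total to get its target fraction.
total = sum(split_ratio)
fractions = [r / total for r in split_ratio]

for name, frac in zip(["train", "val", "test"], fractions):
    print(f"{name}: {frac:.1%}")
# train: 66.7%
# val: 9.5%
# test: 23.8%
```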


In [None]:
# Parameters of `DataIngestionPipeline`
# `full_datapath` points at the raw-count AnnData written above, as normalization should start from raw data.
data_config = {'train_val_test': {'full_datapath': '/content/data/raw_adata.h5ad',
                                  'splitter_config': {'name': 'GroupSplitter',
                                                      'params': {'split_ratio': [7, 1, 2.5],'stratify': 'donor_id'}}},
              'target': 'cell_type'}

datapath = './data'

In [None]:
data_config

In [None]:
# Splitting data
data_split = DataIngestionPipeline(data_config=data_config,
                                   dirpath = datapath)
data_split.generate_train_val_test_split()

### Verifying `train`, `val`, and `test` data

In [None]:
train_adata = read_data(path.join(datapath, 'train_val_test_split/train.h5ad'))
val_adata = read_data(path.join(datapath, 'train_val_test_split/val.h5ad'))
test_adata = read_data(path.join(datapath, 'train_val_test_split/test.h5ad'))

In [None]:
# Gene expression data for the first 10 cells and genes in `train.h5ad`.
train_adata.X[:10, :10].toarray()

In [None]:
# Gene expression data for the first 10 cells and genes in `val.h5ad`.
val_adata.X[:10, :10].toarray()

In [None]:
# Gene expression data for the first 10 cells and genes in `test.h5ad`.
test_adata.X[:10, :10].toarray()

In [None]:
# Writing train data in chunks to be used with the StandardScaler method in scaLR.
write_chunkwise_data(full_data=train_adata,
                     sample_chunksize=1000,
                     dirpath=path.join(datapath,'train'))

In [None]:
# Writing val data in chunks to be used with the StandardScaler method in scaLR.
write_chunkwise_data(full_data=val_adata,
                     sample_chunksize=1000,
                     dirpath=path.join(datapath,'val'))

In [None]:
# Writing test data in chunks to be used with the StandardScaler method in scaLR.
write_chunkwise_data(full_data=test_adata,
                     sample_chunksize=1000,
                     dirpath=path.join(datapath,'test'))
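
As a rough illustration of chunking by sample count, the sketch below lists the row ranges that a chunk size of 1000 would produce for a hypothetical dataset of 2500 cells. `chunk_bounds` is an illustrative helper, not a scaLR function, and the file layout that `write_chunkwise_data` actually produces is not shown here.

```python
def chunk_bounds(n_samples: int, chunk_size: int):
    """Yield (start, end) row ranges that cover all samples in order."""
    for start in range(0, n_samples, chunk_size):
        # The last chunk may be smaller than `chunk_size`.
        yield start, min(start + chunk_size, n_samples)


print(list(chunk_bounds(2500, 1000)))
# [(0, 1000), (1000, 2000), (2000, 2500)]
```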

## <span style="color: steelblue;">Normalization</span>
## <span style="color: steelblue;">1. StandardScaler</span>
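This method normalizes the data so that each gene has a mean of 0 and a standard deviation of 1. Standardization puts all genes on a comparable scale, which reduces bias from genes with larger ranges or higher average expression and improves the consistency of downstream analyses. Note that the cells below pass `with_mean=False`, which in sklearn's `StandardScaler` skips mean centering, so each gene is only divided by its standard deviation.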
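
The per-gene mean and standard deviation can be computed without loading all samples at once by accumulating sums chunk by chunk. The sketch below shows this idea with plain NumPy on toy data. It is an illustration under that assumption, not scaLR's implementation, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(5000, 4)).astype(np.float64)  # toy cells x genes

chunk_size = 1000
n_samples = 0
running_sum = np.zeros(X.shape[1])
running_sq_sum = np.zeros(X.shape[1])

# Pass over the data one chunk at a time, keeping only per-gene totals.
for start in range(0, X.shape[0], chunk_size):
    chunk = X[start:start + chunk_size]
    n_samples += chunk.shape[0]
    running_sum += chunk.sum(axis=0)
    running_sq_sum += (chunk ** 2).sum(axis=0)

mean = running_sum / n_samples
# Population variance, E[x^2] - (E[x])^2, i.e. the biased estimator (ddof=0).
std = np.sqrt(running_sq_sum / n_samples - mean ** 2)

# Equivalent of `with_mean=False`: scale only, no centering.
X_scaled = X / std

# The chunked statistics agree with the ones computed on the full matrix.
assert np.allclose(mean, X.mean(axis=0))
assert np.allclose(std, X.std(axis=0))
```

The sum-of-squares formula is simple but can lose precision when the mean is very large relative to the spread. A more numerically stable running update may be preferable in that case.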

### <span style="color: steelblue;">scalr package - how to use it?</span>

In [None]:
# Creating object for standard scaling normalization.
scalr_std_scaler = standard_scale.StandardScaler(with_mean=False)

print('\n1. `fit()` function parameters :', scalr_std_scaler.fit.__annotations__)
print('\n2. `transform()` function parameters :', scalr_std_scaler.transform.__annotations__)

In [None]:
# Datapath to store processed_data
processed_datapath = './processed_data_ss'

In [None]:
# Fitting object on train data.
## `sample_chunksize` sets how many samples are processed at a time while extracting the required statistics from the data.
## Pick a value that fits in your memory, for example 2k, 3k, 5k, or 10k.
sample_chunksize = 1000
scalr_std_scaler.fit(read_data(path.join(datapath, 'train')), sample_chunksize)

# Transforming the test data using the object created above & storing it at `processed_datapath`.
scalr_std_scaler.process_data(read_data(path.join(datapath, 'test')),
                              sample_chunksize,
                              path.join(processed_datapath, 'test'))

In [None]:
# Reading transformed test data
test_adata_pipeline = read_data(path.join(processed_datapath, 'test'))
test_adata_pipeline[:, :].X[:10, :10]

### <span style="color: steelblue;">sklearn package for standardscaling</span>
- Developers can ignore this section

In [None]:
# Standard scaling using sklearn package
sklearn_std_scaler = StandardScaler(with_mean=False)
sklearn_std_scaler.fit(train_adata.X[:].toarray())
test_adata_sklearn = sklearn_std_scaler.transform(test_adata.X[:].toarray())
test_adata_sklearn[:10, :10]

### <span style="color: steelblue;">Comparing scalr library results with sklearn's library results</span>

In [None]:
# Checking that the per-gene error is less than 1e-6
mean_diff = abs(scalr_std_scaler.train_mean[0] - sklearn_std_scaler.mean_).flatten()
assert sum(mean_diff < 1e-6) == train_adata.shape[1], \
    "Train data mean is not correctly calculated..."

std_diff = abs(scalr_std_scaler.train_std[0] - sklearn_std_scaler.scale_).flatten()
assert sum(std_diff < 1e-6) == train_adata.shape[1], \
    "Train data standard deviation is not correctly calculated..."

## <span style="color: steelblue;">2. SampleNorm</span>
- In scRNA-seq, each cell may have a different sequencing depth, resulting in some cells having higher total counts (or reads) than others. Normalizing each cell by its total gene count using `SampleNorm` addresses this variability, ensuring consistent expression levels across the dataset and enabling reliable cell-to-cell comparisons.

- After normalization, the gene expression values in each cell sum to one by default. To use a different total, set the `scaling_factor` parameter, as in `sample_norm.SampleNorm(scaling_factor='intended sum value')`.
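
The arithmetic behind this kind of per-cell normalization can be sketched in a few lines of NumPy. The example below uses a toy count matrix and a hypothetical `scaling_factor` variable. It illustrates the idea only and is not scaLR's `SampleNorm` code.

```python
import numpy as np

counts = np.array([[1.0, 3.0, 6.0],
                   [0.0, 2.0, 2.0]])  # toy cells x genes

scaling_factor = 1.0  # target sum of expression per cell

# Divide each cell (row) by its total count, then rescale to the target sum.
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * scaling_factor

print(normalized.sum(axis=1))  # every cell now sums to `scaling_factor`
```

Cells with a total count of zero would cause a division by zero in this sketch, so they need to be filtered out or handled separately.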

### <span style="color: steelblue;">scalr package - how to use it?</span>

In [None]:
# Sample norm using pipeline
scalr_sample_norm = sample_norm.SampleNorm()

print('\n1. `transform()` function parameters :', scalr_sample_norm.transform.__annotations__)

In [None]:
# Datapath to store processed_data
processed_datapath = './processed_data_sn'

In [None]:
# Fitting is not required on train data for sample-norm.
sample_chunksize = 1000

# Transforming the test data.
scalr_sample_norm.process_data(read_data(path.join(datapath, 'test')),
                               sample_chunksize,
                               path.join(processed_datapath, 'test'))

In [None]:
# Reading transformed test data
test_data_sample_norm_pipeline = read_data(path.join(processed_datapath, 'test'))

### <span style="color: steelblue;">Scanpy package for sample-norm</span>
- Developers can ignore this section

In [None]:
test_adata = read_data(path.join(datapath, 'test'), backed=None)
test_adata = test_adata[:, :].to_adata()
test_adata

In [None]:
# Sample norm using scanpy package
test_data_sample_norm_sc = sc.pp.normalize_total(test_adata, target_sum=1, inplace=False)
test_data_sample_norm_sc['X'][:10, :10].toarray()

### <span style="color: steelblue;">Comparing scalr library results with scanpy library results</span>

In [None]:
# Checking if error is less than 1e-15
(abs(test_data_sample_norm_sc['X'] - test_data_sample_norm_pipeline[:, :].X) < 1e-15)[:10, :10]