<left><img src="https://github.com/infocusp/scaLR/raw/sj/fullntest_samples_analysis/img/scaLR_logo.png" width="150" height="180"></left>

# <span style="color: steelblue;">Single-cell analysis using Low Resource (scaLR)</span>



**Note:**  
1. If scaLR is intended to be run on a local system, please ensure that an `ipy kernel` with Python version `3.10` is selected. Then, all the required installations can be performed as mentioned in the section below.

2. If scaLR has already been installed as mentioned in [Pre-requisites and installation scaLR](https://github.com/infocusp/scaLR), the repository cloning and requirement installation steps below can be skipped. Selecting the `ipy kernel` can be done as follows:

    - Open the terminal and run:  
     
        ```
        conda install -c anaconda ipykernel
        python -m ipykernel install --user --name=scaLR_env
        ```
    - Select `scaLR_env` as the `ipy kernel` in `scalr_pipeline.ipynb`.  
    - Finally, update the system path for scaLR, as shown in the cell just before the data download, e.g.:  
        ```
        sys.path.append('path/to/scaLR/')
        ```    
## <span style="color: steelblue;">Cloning scaLR</span>

In [None]:
!git clone https://github.com/infocusp/scaLR.git

Cloning into 'scaLR'...
remote: Enumerating objects: 3452, done.
remote: Counting objects: 100% (372/372), done.
remote: Compressing objects: 100% (181/181), done.
remote: Total 3452 (delta 243), reused 261 (delta 189), pack-reused 3080 (from 1)
Receiving objects: 100% (3452/3452), 170.03 MiB | 2.80 MiB/s, done.
Resolving deltas: 100% (2073/2073), done.


Install all requirements after cloning the repository, excluding packages that are pre-installed in Colab.

In [None]:
import sys
imported_packages = {pkg.split('.')[0] for pkg in sys.modules.keys()}
ignore_libraries = "|".join(imported_packages)

!pip install $(grep -ivE "$ignore_libraries" scaLR/requirements.txt)
!pip install memory-profiler==0.61.0

Defaulting to user installation because normal site-packages is not writeable
Collecting loky==3.4.1
  Downloading loky-3.4.1-py3-none-any.whl.metadata (6.4 kB)
Downloading loky-3.4.1-py3-none-any.whl (54 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.6/54.6 kB 1.2 MB/s eta 0:00:00
Installing collected packages: loky
Successfully installed loky-3.4.1

[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
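The `grep -ivE` filter above drops every requirement line that contains the name of any already-imported module as a substring, so short module names such as `re` or `os` can also exclude unrelated packages. As a minimal sketch (not part of scaLR), the same idea can be written in Python with an exact comparison on the package name. The helper name `split_requirements` and the `numpy` pin in the example are hypothetical. A PyPI name can also differ from its import name (for example `scikit-learn` vs `sklearn`), which this sketch does not handle.

```python
import re


def split_requirements(requirement_lines, imported_names):
    """Split requirement lines into (to_install, skipped) by exact package name."""
    # Normalize names so that 'memory-profiler' and 'memory_profiler' compare equal.
    imported = {name.lower().replace("-", "_") for name in imported_names}
    to_install, skipped = [], []
    for line in requirement_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        # The package name is everything before the first version, extras, or marker character.
        name = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0]
        key = name.lower().replace("-", "_")
        (skipped if key in imported else to_install).append(line)
    return to_install, skipped


# Hypothetical requirement lines; the numpy pin is made up for illustration.
example_requirements = ["numpy==1.26.4", "loky==3.4.1", "memory-profiler==0.61.0"]
to_install, skipped = split_requirements(example_requirements, {"numpy", "re", "os"})
print("install:", to_install)
print("skip:   ", skipped)
```

In a notebook, `imported_names` could be built from `sys.modules` exactly as in the cell above.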

In [None]:
# # Uncomment and run the following if the scaLR pipeline is to be executed locally after installation, as explained in Note 2.
# import sys
# sys.path.append('path/to/scaLR/')

## <span style="color: steelblue;">Downloading input anndata from `cellxgene`</span>
- Currently, the pipeline expects all datasets in [anndata](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html) format (`.h5ad` files only).
- The anndata object should contain cell samples as `obs` and genes as `var`.
- `adata.X`: contains normalized gene counts/expression values (Typically `log1p` normalized, data ranging from 0-10).
- `adata.obs`: contains any metadata regarding cells, including a column for `target` which will be used for classification. The index of `adata.obs` is cell_barcodes.
- `adata.var`: contains all gene_names as Index.
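To illustrate why `log1p`-transformed values typically fall in the 0-10 range, here is a small sketch using only the standard library. The count values are hypothetical, and in practice counts are usually scaled per cell before the `log1p` step; the helper name `in_expected_range` is also an assumption made for this example.

```python
import math

# Hypothetical (already scaled) counts for a handful of genes in one cell.
counts = [0, 1, 9, 99, 5000]

# log1p(x) = ln(1 + x): zero stays zero and large counts are compressed.
log1p_values = [math.log1p(count) for count in counts]

# The largest count that still maps to a value <= 10 is e^10 - 1.
max_count_within_range = math.expm1(10)


def in_expected_range(values, low=0.0, high=10.0):
    """Return True if every value lies within [low, high]."""
    return all(low <= value <= high for value in values)


print([round(value, 3) for value in log1p_values])
print(f"Counts up to about {max_count_within_range:.0f} stay within 0-10")
print(in_expected_range(log1p_values))
```

Values well above 10 in `adata.X` therefore suggest that the matrix still holds raw or otherwise untransformed counts.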

The dataset we are about to download contains two clinical conditions (COVID-19 and normal) and links variations in immune response to disease severity and outcomes over time [(Liu et al., 2021)](https://doi.org/10.1016/j.cell.2021.02.018).

In [3]:
# This cell took approximately 00:04:49 (hh:mm:ss) in the run logged below; the download time depends on network speed.
!wget -P data https://datasets.cellxgene.cziscience.com/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad

--2025-02-27 18:52:02--  https://datasets.cellxgene.cziscience.com/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad
Resolving datasets.cellxgene.cziscience.com (datasets.cellxgene.cziscience.com)... 18.239.111.15, 18.239.111.109, 18.239.111.30, ...
Connecting to datasets.cellxgene.cziscience.com (datasets.cellxgene.cziscience.com)|18.239.111.15|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 980103606 (935M) [binary/octet-stream]
Saving to: ‘data/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad’


2025-02-27 18:56:51 (3.25 MB/s) - ‘data/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad’ saved [980103606/980103606]



## <span style="color: steelblue;">Data exploration</span>

In [4]:
from IPython.display import SVG, display
import warnings
import anndata as ad
from anndata import AnnData
import numpy as np
import pandas as pd

In [5]:
adata = ad.read_h5ad("data/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad",backed='r')

In [6]:
print(f"\nThe anndata has '{adata.n_obs}' cells and '{adata.n_vars}' genes")


The anndata has '125117' cells and '30695' genes


In [7]:
# Cell metadata
adata.obs.head()

| | dsm_severity_score_group | disease_ontology_term_id | severity | tissue_ontology_term_id | timepoint | outcome | dsm_severity_score | days_since_hospitalized | age | donor_id | ... | tissue_type | cell_type | assay | disease | organism | sex | tissue | self_reported_ethnicity | development_stage | observation_joinid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAACCTGAGAAACCTA-1_1 | DSM_low | MONDO:0100096 | Moderate | UBERON:0000178 | T0 | alive | -1.950858 | 1.0 | 55.0 | HGR0000083 | ... | tissue | non-classical monocyte | 10x 5' v1 | COVID-19 | Homo sapiens | male | blood | European | 55-year-old stage | `!9L}G4hgnw` |
| AAACCTGAGGGTTTCT-1_1 | DSM_high | MONDO:0100096 | Critical | UBERON:0000178 | T0 | alive | -0.092375 | 13.0 | 40.0 | HGR0000078 | ... | tissue | classical monocyte | 10x 5' v1 | COVID-19 | Homo sapiens | female | blood | European | 40-year-old stage | `YRcUzlVyg0` |
| AAACCTGCACCTGGTG-1_1 | DSM_high | MONDO:0100096 | Critical | UBERON:0000178 | T0 | alive | 2.95435 | 1.0 | 60.0 | HGR0000098 | ... | tissue | CD16-positive, CD56-dim natural killer cell, h... | 10x 5' v1 | COVID-19 | Homo sapiens | male | blood | European | 60-year-old stage | `)*azge@M0l` |
| AAACCTGGTCCGAGTC-1_1 | DSM_high | MONDO:0100096 | Critical | UBERON:0000178 | T0 | deceased | 3.276233 | 6.0 | 76.0 | HGR0000141 | ... | tissue | classical monocyte | 10x 5' v1 | COVID-19 | Homo sapiens | male | blood | European | 76-year-old stage | `` E<FU`+QN&T `` |
| AAACCTGGTGCCTTGG-1_1 | DSM_low | MONDO:0100096 | Critical | UBERON:0000178 | T0 | alive | -0.348888 | 1.0 | 70.0 | HGR0000093 | ... | tissue | classical monocyte | 10x 5' v1 | COVID-19 | Homo sapiens | male | blood | European | 70-year-old stage | `2MZ#6SX}{g` |


In [8]:
adata.obs.cell_type.value_counts()

classical monocyte                                       78908
CD16-positive, CD56-dim natural killer cell, human       28705
non-classical monocyte                                    6160
natural killer cell                                       3825
platelet                                                  3370
CD16-negative, CD56-bright natural killer cell, human     1237
conventional dendritic cell                                991
plasmacytoid dendritic cell                                787
granulocyte                                                776
intermediate monocyte                                      358
Name: cell_type, dtype: int64
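The counts above are strongly imbalanced, which is worth keeping in mind when reading the classification report later. As a rough sketch, the class fractions and the accuracy of a trivial "always predict the most frequent class" baseline can be computed from the printed counts. The baseline is computed on the full dataset here, so the proportions in the test split may differ slightly.

```python
# Cell type counts copied from the `value_counts()` output above.
cell_type_counts = {
    "classical monocyte": 78908,
    "CD16-positive, CD56-dim natural killer cell, human": 28705,
    "non-classical monocyte": 6160,
    "natural killer cell": 3825,
    "platelet": 3370,
    "CD16-negative, CD56-bright natural killer cell, human": 1237,
    "conventional dendritic cell": 991,
    "plasmacytoid dendritic cell": 787,
    "granulocyte": 776,
    "intermediate monocyte": 358,
}

total_cells = sum(cell_type_counts.values())
fractions = {name: count / total_cells for name, count in cell_type_counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_class = max(cell_type_counts, key=cell_type_counts.get)
majority_baseline = fractions[majority_class]

for name, fraction in fractions.items():
    print(f"{name:<55} {fraction:6.2%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```

Per-class precision and recall are therefore more informative than overall accuracy for the rarer cell types.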

In [9]:
# Number of cell types
adata.obs.cell_type.unique()

['non-classical monocyte', 'classical monocyte', 'CD16-positive, CD56-dim natural killer cell, ..., 'natural killer cell', 'plasmacytoid dendritic cell', 'conventional dendritic cell', 'platelet', 'CD16-negative, CD56-bright natural killer cel..., 'granulocyte', 'intermediate monocyte']
Categories (10, object): ['granulocyte', 'platelet', 'natural killer cell', 'plasmacytoid dendritic cell', ..., 'CD16-negative, CD56-bright natural killer cel..., 'CD16-positive, CD56-dim natural killer cell, ..., 'conventional dendritic cell', 'intermediate monocyte']

In [10]:
# Number of donors
adata.obs.donor_id.unique()

['HGR0000083', 'HGR0000078', 'HGR0000098', 'HGR0000141', 'HGR0000093', ..., 'SHD3', 'HGR0000101', 'HGR0000135', 'SHD5', 'SHD6']
Length: 46
Categories (46, object): ['AA220014', 'AA220534', 'AA220907', 'HDML', ..., 'SHD4', 'SHD5', 'SHD6', 'SHD7']

In [11]:
# Number of clinical conditions
adata.obs.disease.value_counts()

COVID-19    99152
normal      25965
Name: disease, dtype: int64

In [12]:
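# Gene expression values of the first 5 cells and 10 genes.
# `.toarray()` is used instead of the `.A` shorthand, which newer SciPy releases no longer provide.
adata.X[:5,:10].toarray()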

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.99008936, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ]],
      dtype=float32)

In [13]:
# Verifying normalized values in X
# Getting the sum of gene expression values for the first 10 cells (should be floating-point values).
adata.X[:10,:].toarray().sum(axis=1)

array([2264.9421, 2374.6707, 2097.2356, 2345.2798, 2542.3647, 2362.8406,
       2241.9297, 1986.2373, 2578.1968, 2652.637 ], dtype=float32)

In [14]:
# Getting the maximum and minimum gene expression values for the first 1000 cells.
first_cells = adata.X[:1000, :].toarray()  # densify once instead of slicing twice
max_val = np.max(first_cells)
min_val = np.min(first_cells)
print(f'Max value : {max_val} | Min value : {min_val}')
# Raising a warning if the values are outside the expected 0-10 range
if max_val > 10 or min_val < 0:
    warnings.warn(f"Expression value out of range! Max: {max_val}, Min: {min_val}. Expected range is 0-10.", UserWarning)


Max value : 8.524538040161133 | Min value : 0.0


In [15]:
#Gene metadata
adata.var.head()

| | mvp.mean | mvp.dispersion | mvp.dispersion.scaled | mvp.variable | feature_is_filtered | feature_name | feature_reference | feature_biotype | feature_length | feature_type |
|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000168454 | 0.00038 | 1.168876 | 0.181734 | False | False | TXNDC2 | NCBITaxon:9606 | gene | 1703 | protein_coding |
| ENSG00000197852 | 0.035995 | 1.634179 | 0.886458 | False | False | INKA2 | NCBITaxon:9606 | gene | 1217 | protein_coding |
| ENSG00000196878 | 0.008862 | 1.617729 | 0.861545 | False | False | LAMB3 | NCBITaxon:9606 | gene | 3931 | protein_coding |
| ENSG00000256540 | 2.2e-05 | 1.660993 | 0.92707 | False | False | IQSEC3-AS1 | NCBITaxon:9606 | gene | 1065 | lncRNA |
| ENSG00000139180 | 0.0901 | 1.18472 | 0.205731 | False | False | NDUFA9 | NCBITaxon:9606 | gene | 782 | protein_coding |


### <span style="color: steelblue;">Modifying `var` index (Optional)</span>
- The `index` values of `adata.var` in this AnnData object are the `gene_ids`. To look up genes reported in the literature for a particular cell type, we need the gene symbols, which are present in the `feature_name` column. Therefore, we'll replace the index values with gene symbols.
- This will be helpful when analyzing the `GeneRecallCurve` later.
- This step can be skipped if the `reference_genes.csv` already contains gene IDs corresponding to each cell type, or if the user does not want to perform the `GeneRecallCurve` analysis.
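One thing worth checking when switching the index from gene IDs to gene symbols is whether any symbol occurs more than once, since several IDs can map to the same symbol. The sketch below uses plain Python lists and hypothetical helper names (`find_duplicates`, `make_unique`) to show the idea; it is not scaLR or anndata code, and the duplicated symbol in the example is made up. On a real `AnnData` object, `adata.var.index.is_unique` and `adata.var_names_make_unique()` serve a similar purpose.

```python
from collections import Counter


def find_duplicates(names):
    """Return the sorted list of names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


def make_unique(names, sep="-"):
    """Append a numeric suffix to repeated names so that all names are distinct."""
    taken = set(names)
    seen = Counter()
    unique = []
    for name in names:
        if seen[name] == 0:
            unique.append(name)  # first occurrence keeps its name
        else:
            suffix = seen[name]
            candidate = f"{name}{sep}{suffix}"
            while candidate in taken:  # avoid clashing with an existing name
                suffix += 1
                candidate = f"{name}{sep}{suffix}"
            taken.add(candidate)
            unique.append(candidate)
        seen[name] += 1
    return unique


# Hypothetical symbol list in which 'TXNDC2' appears twice.
symbols = ["TXNDC2", "INKA2", "LAMB3", "TXNDC2"]
print(find_duplicates(symbols))
print(make_unique(symbols))
```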


In [16]:
adata.var.set_index('feature_name',inplace=True)

In [17]:
# Now the index values are the gene symbols.
adata.var.head()

| feature_name | mvp.mean | mvp.dispersion | mvp.dispersion.scaled | mvp.variable | feature_is_filtered | feature_reference | feature_biotype | feature_length | feature_type |
|---|---|---|---|---|---|---|---|---|---|
| TXNDC2 | 0.00038 | 1.168876 | 0.181734 | False | False | NCBITaxon:9606 | gene | 1703 | protein_coding |
| INKA2 | 0.035995 | 1.634179 | 0.886458 | False | False | NCBITaxon:9606 | gene | 1217 | protein_coding |
| LAMB3 | 0.008862 | 1.617729 | 0.861545 | False | False | NCBITaxon:9606 | gene | 3931 | protein_coding |
| IQSEC3-AS1 | 2.2e-05 | 1.660993 | 0.92707 | False | False | NCBITaxon:9606 | gene | 1065 | lncRNA |
| NDUFA9 | 0.0901 | 1.18472 | 0.205731 | False | False | NCBITaxon:9606 | gene | 782 | protein_coding |


In [18]:
# Saving file for further analysis
# This cell will take approximately 00:00:47 (hh:mm:ss) to run.
adata.obs.index = adata.obs.index.astype(str)
adata.var.index = adata.var.index.astype(str)
AnnData(X=adata.X,obs=adata.obs,var=adata.var).write('data/modified_adata.h5ad',compression='gzip')

## <span style="color: steelblue;">scaLR pipeline </span>

1. The **scaLR** pipeline consists of four stages:
   - Data ingestion
   - Feature selection
   - Final model training
   - Analysis

2. The user needs to modify the configuration file (`config.yml`) available at `scaLR/config` for each stage of the pipeline according to the requirements. Simply omit or comment out the stages of the pipeline that you do not wish to run.

3. Refer to `config.yml` and its detailed configuration [README](https://github.com/infocusp/scaLR/blob/main/config/README.md) file for instructions on how to use different parameters and files.
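As a rough illustration of "omit or comment out the stages you do not wish to run", the sketch below lists which stage sections are present in a loaded config dictionary. It assumes that a stage is considered enabled when its top-level key is present and non-empty, which may not be exactly how `pipeline.py` decides. The helper `enabled_stages` and the placeholder values are hypothetical; the key names are the ones used by the modular run at the end of this tutorial.

```python
# Top-level stage keys as used by the modular run in this tutorial
# (config['data'], config['feature_selection'], config['final_training'], config['analysis']).
STAGE_KEYS = ("data", "feature_selection", "final_training", "analysis")


def enabled_stages(config):
    """Return the stage keys that are present and non-empty in a loaded config dict."""
    return [key for key in STAGE_KEYS if config.get(key)]


# Hypothetical, heavily trimmed config: the analysis stage is omitted,
# and the stage contents are placeholders rather than real parameters.
example_config = {
    "device": "cpu",
    "experiment": {"dirpath": "scalr_experiments", "exp_name": "exp_name", "exp_run": 0},
    "data": {"placeholder": True},
    "feature_selection": {"placeholder": True},
    "final_training": {"placeholder": True},
}
print(enabled_stages(example_config))
```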

### <span style="color: steelblue;">Config edits (For Cell Type Classification and Biomarker Identification)</span>

NOTE: The model parameters below are only suggestions. Feel free to tune them to improve the results.

*An example configuration file for the current dataset, incorporating the edits below, can be found at `scaLR/tutorials/pipeline/config_celltype.yaml`. Please set `device` to `cuda` or `cpu` to match your runtime.*

- **Device setup**
  - Set `device: 'cuda'` for a GPU-enabled runtime, or `device: 'cpu'` for a CPU-only runtime.
- **Experiment Config**
  - The default `exp_run` number is `0`. If it is not changed, all pipeline results for the cell type classification experiment are written to `exp_name_0`.
- **Data Config**
  - Update the `full_datapath` to `data/modified_adata.h5ad` (as we will include `GeneRecallCurve` in the downstream).
  - Specify the `num_workers` value for effective parallelization.
  - Set `target` to `cell_type`.
- **Feature Selection**
  - Specify the `num_workers` value for effective parallelization.
  - Update the model layers to `[5000, 10]`, as there are only 10 cell types in the dataset.
  - Change `epoch` to `10`.
- **Final Model Training**
  - Update the model layers to the same as for feature selection: `[5000, 10]`.
  - Change `epoch` to `100`.
- **Analysis**
  - **Downstream Analysis**
    - Uncomment the `test_samples_downstream_analysis` section.
    - Update the `reference_genes_path` to `scaLR/tutorials/pipeline/grc_reference_gene.csv`.
    - Please refer to the section below:

    ```
    analysis:

        model_checkpoint: ''

        dataloader:
            name: SimpleDataLoader
            params:
                batch_size: 15000

        gene_analysis:
            scoring_config:
                name: LinearScorer

            features_selector:
                name: ClasswisePromoters
                params:
                    k: 100
        test_samples_downstream_analysis:
            - name: GeneRecallCurve
              params:
                reference_genes_path: 'scaLR/tutorials/pipeline/grc_reference_gene.csv'
                top_K: 300
                plots_per_row: 3
                features_selector:
                    name: ClasswiseAbs
                    params: {}
            - name: Heatmap
              params: {}
            - name: RocAucCurve
              params: {}
    ```


### <span style="color: steelblue;">Config edits (For clinical condition specific biomarker identification and DGE analysis) </span>

*An example configuration file for the current dataset, incorporating the edits below, can be found at `scaLR/tutorials/pipeline/config_clinical.yaml`. Please set `device` to `cuda` or `cpu` to match your runtime.*

- **Experiment Config**
  - Change the `exp_run` number if an earlier experiment (such as the cell type classification above) already uses the same number. Since one experiment has already been run, we set it to `1`.
- **Data Config**
  - The `full_datapath` remains the same as above.
  - Change the `target` to `disease` (this column contains data for clinical conditions, `COVID-19/normal`).
- **Feature Selection**
  - Update the model layers to `[5000, 2]`, as there are only two types of clinical conditions.
  - Set `epoch` to `10`.
- **Final Model Training**
  - Update the model layers to the same as for feature selection: `[5000, 2]`.
  - Set `epoch` to `100`.
- **Analysis**
  - **Downstream Analysis**
    - Uncomment the `full_samples_downstream_analysis` section.
    - The `GeneRecallCurve` analysis is not performed in this case. It can be run if reference genes specific to `COVID-19` and `normal` are available, but many genes are possible candidates for the normal condition, so a reference set is hard to define.
    - There are two options to perform differential gene expression (DGE) analysis: `DgePseudoBulk` and `DgeLMEM`. The parameters are updated as follows. Note that `DgeLMEM` may take a bit more time, as the multiprocessing is not very efficient with only 2 CPUs in the current Colab runtime.
    - Please refer to the section below:
    ```
    analysis:

        model_checkpoint: ''

        dataloader:
            name: SimpleDataLoader
            params:
                batch_size: 15000

        gene_analysis:
            scoring_config:
                name: LinearScorer

            features_selector:
                name: ClasswisePromoters
                params:
                    k: 100
        full_samples_downstream_analysis:
            - name: Heatmap
              params:
                top_n_genes: 100
            - name: RocAucCurve
              params: {}
            - name: DgePseudoBulk
              params:
                  celltype_column: 'cell_type'
                  design_factor: 'disease'
                  factor_categories: ['COVID-19', 'normal']
                  sum_column: 'donor_id'
                  cell_subsets: ['conventional dendritic cell', 'natural killer cell']
            - name: DgeLMEM
              params:
                fixed_effect_column: 'disease'
                fixed_effect_factors: ['COVID-19', 'normal']
                group: 'donor_id'
                celltype_column: 'cell_type'
                cell_subsets: ['conventional dendritic cell']
                gene_batch_size: 1000
                coef_threshold: 0.1
    ```

### <span style="color: steelblue;">Run Pipeline </span>

In [19]:
# Possible flags using 'scaLR/pipeline.py'.
# `python3` is used here because `python` was not found on the PATH in this runtime.
!python3 scaLR/pipeline.py --help


#### Cell type classification

In [21]:
# Command to run end to end pipeline.
# This cell will take approximately 00:21:15 (hh:mm:ss) to run on a GPU.

!python3 scaLR/pipeline.py --config scaLR/tutorials/pipeline/config_celltype.yaml -l -m

2025-02-27 19:02:51,535 - ROOT - INFO : Experiment directory: `scalr_experiments/exp_name_0`
2025-02-27 19:02:51,544 - ROOT - INFO : Data Ingestion pipeline running
2025-02-27 19:02:51,544 - DataIngestion - INFO : Generating Train, Validation and Test sets
2025-02-27 19:03:35,769 - DataIngestion - INFO : Generate label mappings for all columns in metadata
2025-02-27 19:03:36,946 - ROOT - INFO : Feature Extraction pipeline running
2025-02-27 19:03:36,946 - File Utils - INFO : Data Loaded from Final datapaths
2025-02-27 19:03:37,467 - FeatureExtraction - INFO : Feature subset models training
2025-02-27 19:05:09,181 - ModelTraining - INFO : Building model training artifacts
2025-02-27 19:05:09,253 - ModelTraining - INFO : Building model training artifacts
2025-02-27 19:05:09,295 - ModelTraining - INFO : Building model training artifacts
2025-02-27 19:05:09,393 - ModelTraining - INFO : Building model training artifacts
2025-02-27 19:05:09,750 - ModelTraining - INFO : Training the model
202

#### Clinical condition specific biomarker identification and differential gene expression analysis

In [None]:
## It takes 01:16:58 (hh:mm:ss) to run on the CPU for clinical condition-specific biomarker identification.
## To reduce the runtime, please comment out the 'DgeLMEM' section under 'full_samples_downstream_analysis'.

!python3 scaLR/pipeline.py --config scaLR/tutorials/pipeline/config_clinical.yaml -l -m

Pipeline logs can be found at `scalr_experiments/exp_name_0/logs.txt` (cell type classification)

For clinical condition specific biomarker identification, the logs can be found at `scalr_experiments/exp_name_1/logs.txt`

### <span style="color: steelblue;">Results </span>
The cell type classification and biomarker discovery results are stored under the experiment name `exp_name_0`.

- The classification report can be found at `scalr_experiments/exp_name_0/analysis/classification_report.csv`.

- Top-5k biomarkers can be found at `scalr_experiments/exp_name_0/analysis/gene_analysis/top_features.json`.

- `Heatmaps` for each class (cell type) can be found at `scalr_experiments/exp_name_0/analysis/test_samples/heatmaps`.

- `Gene_recall_curve` and `roc_auc` data can be found at `scalr_experiments/exp_name_0/analysis/test_samples/`.

- `score_matrix.csv` with gene scores for all classes can be found at `scalr_experiments/exp_name_0/analysis/gene_analysis/score_matrix.csv`
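The output locations listed above can be collected in one place and checked before loading them. The following is a small convenience sketch, not part of scaLR: the helper name `result_paths` is hypothetical, and the paths are relative to the working directory, so it behaves the same on Colab (where the working directory is typically `/content`) and on a local machine.

```python
from pathlib import Path


def result_paths(exp_dir):
    """Map each result described above to its expected location under `exp_dir`."""
    analysis = Path(exp_dir) / "analysis"
    return {
        "classification_report": analysis / "classification_report.csv",
        "top_features": analysis / "gene_analysis" / "top_features.json",
        "score_matrix": analysis / "gene_analysis" / "score_matrix.csv",
        "heatmaps": analysis / "test_samples" / "heatmaps",
        "roc_auc": analysis / "test_samples" / "roc_auc.svg",
        "gene_recall_curve": analysis / "test_samples" / "gene_recall_curve.svg",
    }


# Report which outputs exist for the cell type classification experiment.
for name, location in result_paths("scalr_experiments/exp_name_0").items():
    status = "found" if location.exists() else "missing"
    print(f"{name:<22} {status:<8} {location}")
```

A "missing" entry usually means that the corresponding stage or downstream analysis was not enabled in the config, or that the run has not finished yet.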

In [None]:
#Classification report
pd.read_csv('/content/scalr_experiments/exp_name_0/analysis/classification_report.csv',index_col=0)

In [None]:
#ROC_AUC
display(SVG('/content/scalr_experiments/exp_name_0/analysis/test_samples/roc_auc.svg'))

In [None]:
# Heatmap for cell type 'classical monocyte'
display(SVG('/content/scalr_experiments/exp_name_0/analysis/test_samples/heatmaps/classical monocyte.svg'))

In [None]:
# Gene recall curve
display(SVG('scalr_experiments/exp_name_0/analysis/test_samples/gene_recall_curve.svg'))


The clinical condition-specific biomarker identification and DGE analysis were run under the experiment name `exp_name_1`. All analysis results can be viewed in the `exp_name_1` directory, as explained above for cell type classification. The difference is that `exp_name_1` contains results for only two classes, namely `COVID-19` and `normal`, along with the results of the DGE analysis.
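For readers new to pseudobulk DGE, the general idea behind aggregating by `donor_id` within a cell type can be sketched as follows. This is a conceptual illustration only, not scaLR's implementation: the function name `pseudobulk`, the toy donors, gene names, and values are all made up, and the sketch assumes that expression is summed per donor.

```python
from collections import defaultdict


def pseudobulk(cells, cell_type, group_key="donor_id"):
    """Sum per-gene expression over all cells of one cell type, separately per donor."""
    totals = defaultdict(lambda: defaultdict(float))
    for cell in cells:
        if cell["cell_type"] != cell_type:
            continue  # only aggregate the requested cell subset
        for gene, value in cell["expression"].items():
            totals[cell[group_key]][gene] += value
    return {donor: dict(genes) for donor, genes in totals.items()}


# Hypothetical toy cells; donors, genes, and values are made up for illustration.
cells = [
    {"donor_id": "D1", "cell_type": "conventional dendritic cell",
     "expression": {"GENE_A": 2.0, "GENE_B": 0.0}},
    {"donor_id": "D1", "cell_type": "conventional dendritic cell",
     "expression": {"GENE_A": 3.0, "GENE_B": 1.0}},
    {"donor_id": "D2", "cell_type": "conventional dendritic cell",
     "expression": {"GENE_A": 1.0, "GENE_B": 4.0}},
    {"donor_id": "D2", "cell_type": "natural killer cell",
     "expression": {"GENE_A": 9.0, "GENE_B": 9.0}},
]

bulk = pseudobulk(cells, "conventional dendritic cell")
print(bulk)
```

Each donor then contributes one aggregated profile per cell type, and the two conditions are compared across donors rather than across individual cells.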

In [None]:
# DgePseudoBulk results for 'conventional dendritic cell' in 'COVID-19' w.r.t. 'normal' samples
pd.read_csv('/content/scalr_experiments/exp_name_1/analysis/full_samples/pseudobulk_dge_result/pbkDGE_conventionaldendriticcell_COVID-19_vs_normal.csv')

In [None]:
# Volcano plot of `log2FoldChange` vs `-log10(pvalue)` in gene expression for
# 'conventional dendritic cell' in 'COVID-19' w.r.t. 'normal' samples.
display(SVG('/content/scalr_experiments/exp_name_1/analysis/full_samples/pseudobulk_dge_result/pbkDGE_conventionaldendriticcell_COVID-19_vs_normal.svg'))

*Note*: A `Fold Change (FC)` of 1.5 in the figure above is equivalent to a `log2 Fold Change` of approximately 0.585.
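The conversion in the note can be checked with a few lines of Python. The helper names below are illustrative only.

```python
import math


def fold_change_to_log2(fold_change):
    """Convert a linear fold change (e.g. 1.5) to log2 fold change."""
    return math.log2(fold_change)


def log2_to_fold_change(log2_fc):
    """Convert a log2 fold change back to a linear fold change."""
    return 2.0 ** log2_fc


print(round(fold_change_to_log2(1.5), 3))      # 0.585
print(round(fold_change_to_log2(1 / 1.5), 3))  # -0.585
print(round(log2_to_fold_change(1.0), 3))      # 2.0
```

A 1.5-fold decrease is symmetric on the log2 scale (about -0.585), which is why volcano plots use the log scale on the x-axis.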

## <span style="color: steelblue;">Running scaLR in modules</span>

### Imports

In [None]:
import sys
sys.path.append('scaLR/')
import os
from os import path

from scalr.data_ingestion_pipeline import DataIngestionPipeline
from scalr.eval_and_analysis_pipeline import EvalAndAnalysisPipeline
from scalr.feature_extraction_pipeline import FeatureExtractionPipeline
from scalr.model_training_pipeline import ModelTrainingPipeline
from scalr.utils import read_data
from scalr.utils import write_data

### Load Config

Running with example config files with required edits. Make sure to change the experiment name if required.

In [None]:
config = read_data('scaLR/tutorials/pipeline/config_celltype.yaml')
# config = read_data('scaLR/tutorials/pipeline/config_clinical.yaml')
config

In [None]:
dirpath = config['experiment']['dirpath']
exp_name = config['experiment']['exp_name']
exp_run = config['experiment']['exp_run']
dirpath = os.path.join(dirpath, f'{exp_name}_{exp_run}')
os.makedirs(dirpath, exist_ok=True)
device = config['device']

### Data Ingestion

In [None]:
# This cell will take approximately 00:01:23 (hh:mm:ss) to run.

data_dirpath = path.join(dirpath, 'data')
os.makedirs(data_dirpath, exist_ok=True)

# Initialize Data Ingestion object
ingest_data = DataIngestionPipeline(config['data'], data_dirpath)

# Generate Train, Validation and Test Splits for pipeline
ingest_data.generate_train_val_test_split()

# Apply pre-processing on data
# Fit on Train data, and then apply on the entire data
ingest_data.preprocess_data()

# Generate label mappings from the metadata; these are used for
# labels, etc.
ingest_data.generate_mappings()

# All the additional data generated (label mappings, data splits, etc.)
# are passed onto the config for future use in pipeline
config['data'] = ingest_data.get_updated_config()
write_data(config, path.join(dirpath, 'config.yaml'))
del ingest_data

### Feature Selection

In [None]:
# This cell will take approximately 00:19:02 (hh:mm:ss) to run.

feature_extraction_dirpath = path.join(dirpath, 'feature_extraction')
os.makedirs(feature_extraction_dirpath, exist_ok=True)

# Initialize Feature Extraction object
extract_features = FeatureExtractionPipeline(
    config['feature_selection'], feature_extraction_dirpath, device)
extract_features.load_data_and_targets_from_config(config['data'])

# Train feature subset models and get scores for each feature/genes
extract_features.feature_subsetted_model_training()
extract_features.feature_scoring()

# Extract top features by some algorithm, and write a feature-subsetted
# dataset
extract_features.top_feature_extraction()
config['data'] = extract_features.write_top_features_subset_data(
    config['data'])

# All the additional data generated (subset data splits, etc.)
# are passed onto the config for future use in pipeline
config['feature_selection'] = extract_features.get_updated_config()
write_data(config, path.join(dirpath, 'config.yaml'))
del extract_features

### Final Model Training

In [None]:
# This cell will take approximately 00:06:20 (hh:mm:ss) to run.

model_training_dirpath = path.join(dirpath, 'model')
os.makedirs(model_training_dirpath, exist_ok=True)

# Initialize Final Model Training object
model_trainer = ModelTrainingPipeline(
    config['final_training']['model'],
    config['final_training']['model_train_config'],
    model_training_dirpath, device)
model_trainer.load_data_and_targets_from_config(config['data'])

# Build the training artifacts from config, and train the model
model_trainer.build_model_training_artifacts()
model_trainer.train()

# All the additional data generated (model defaults filled, etc.)
# are passed onto the config for future use in pipeline
model_config, model_train_config = model_trainer.get_updated_config()
config['final_training']['model'] = model_config
config['final_training']['model_train_config'] = model_train_config
write_data(config, path.join(dirpath, 'config.yaml'))
del model_trainer

### Evaluation and Analysis

In [None]:
# This cell will take approximately 00:00:26 (hh:mm:ss) to run.

analysis_dirpath = path.join(dirpath, 'analysis')
os.makedirs(analysis_dirpath, exist_ok=True)

# Get path of the best trained model
config['analysis']['model_checkpoint'] = path.join(
    model_training_dirpath, 'best_model')

# Initialize Evaluation and Analysis Pipeline object
analyser = EvalAndAnalysisPipeline(config['analysis'], analysis_dirpath,
                                    device)
analyser.load_data_and_targets_from_config(config['data'])

# Perform evaluation of trained model on test data and generate
# classification report
analyser.evaluation_and_classification_report()

# Perform gene analysis based on the trained model to get
# top genes / biomarker analysis
analyser.gene_analysis()

# Perform downstream analysis on all samples / test samples
analyser.full_samples_downstream_anlaysis()
analyser.test_samples_downstream_anlaysis()

# All the additional data generated
# are passed onto the config for future use in pipeline
config['analysis'] = analyser.get_updated_config()
write_data(config, path.join(dirpath, 'config.yaml'))
del analyser

Analysis results can be viewed inside `scalr_experiments` under the `exp_name` specified in the `config.yaml`, as mentioned above.