# SimiC Pipeline - Simple Tutorial

>*Author: Irene Marín-Goñi, PhD student - ML4BM group (CIMA University of Navarra)*

This notebook demonstrates how to run the new SimiC pipeline with minimal configuration.

## Overview
This simple tutorial covers:
1. Package installation and setup
2. Basic pipeline initialization
3. Running the core SimiC regression
4. Basic results exploration

For the full pipeline, see `Tutorial_SimiCPipeline_full.ipynb`.

## Introduction
SimiC is a gene regulatory network (GRN) inference algorithm for single-cell RNA-Seq (scRNA-Seq) data. It takes as input imputed single-cell expression data, a list of driver genes, the cell labels (cell phenotypes), and the ordering of those phenotypes, and it produces one GRN per phenotype. Using the provided ordering, SimiC adds a similarity constraint when jointly inferring the GRNs, which ensures a smooth transition between the GRNs of neighboring phenotypes.

For more information check our publication:

Peng, J., Serrano, G., Traniello, I.M. et al. SimiC enables the inference of complex gene regulatory dynamics across cell phenotypes. Commun Biol 5, 351 (2022). https://doi.org/10.1038/s42003-022-03319-7


## Setup


<div class="alert alert-block alert-warning">
<b>Warning:</b> Installation instructions (GitHub, Docker, dependencies) still need to be added here.
</div>




## Pipeline steps
First, import the necessary modules and set up the path.

In [None]:
import sys
# Add scripts directory to path
sys.path.append('./scripts/')

from SimiCPipeline import SimiCPipeline

### Step 1: Initialize the Pipeline

Create a pipeline instance by specifying:
- `workdir`: Working directory path where input files are located and output files will be saved
- `run_name`: Unique identifier for this analysis run (used as prefix for output files)

In [None]:
print("Initializing SimiC pipeline")
pipeline = SimiCPipeline(
    workdir="./SimiC_results/OLD_RUNS/K25L/Tumor",
    run_name="experiment1"
)
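
Before moving on, it can help to confirm that the working directory contains the input files you expect. The sketch below uses only `pathlib` and runs in a temporary directory. The helper `missing_inputs` and the file names are hypothetical, and it assumes inputs live under `inputFiles/` inside `workdir`, as in the next step.

```python
import tempfile
from pathlib import Path

def missing_inputs(workdir, relative_paths):
    """Return the relative paths that are not files under workdir (hypothetical helper)."""
    root = Path(workdir)
    return [rel for rel in relative_paths if not (root / rel).is_file()]

# Demonstration in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    inputs = Path(tmp) / "inputFiles"
    inputs.mkdir()
    (inputs / "expression.pickle").touch()  # placeholder file name

    expected = ["inputFiles/expression.pickle", "inputFiles/tf_list.pickle"]
    missing = missing_inputs(tmp, expected)
    print("Missing input files:", missing)
```

Running a check like this before the regression avoids discovering a wrong path only after the pipeline has started.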

### Step 2: Set Input File Paths

Point the pipeline to your input files:
- `p2df`: Path to expression matrix file (genes × cells) stored as a pandas DataFrame in pickle format
- `p2assignment`: Path to cell cluster assignment file (.txt format) containing phenotype labels as integers matching expression matrix cell order
- `p2tf`: Path to transcription factor list file (pickle format) containing TF gene names to use as drivers

In [None]:
print("Setting input file paths")
pipeline.set_paths(
    p2df=pipeline.workdir / "inputFiles/all_100_1000_subset_matrix.pickle",
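    # NOTE: this path repeats the expression matrix; p2assignment should point to the .txt cluster assignment file
    p2assignment=pipeline.workdir / "inputFiles/all_100_1000_subset_matrix.pickle",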
    p2tf=pipeline.workdir / "inputFiles/all_100TF_list.pickle"
)
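
As a rough illustration of two of the input formats described above, the following sketch writes and reads back a TF list and an assignment file using only the standard library. It assumes the TF list is a plain Python list of gene names and that the assignment file holds one integer label per line; both layouts, the file names, and the gene name `GeneX` are assumptions for illustration. The expression matrix is omitted because it requires pandas.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path

def read_assignment(path):
    """Read one integer phenotype label per line (assumed layout)."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    tf_path = Path(tmp) / "tf_list.pickle"      # hypothetical file name
    labels_path = Path(tmp) / "assignment.txt"  # hypothetical file name

    # TF list: assumed to be a plain Python list of gene names.
    with open(tf_path, "wb") as handle:
        pickle.dump(["Bnc2", "GeneX"], handle)

    # Assignment file: one label per cell, in the same order as the matrix cells.
    labels_path.write_text("0\n0\n1\n1\n1\n")

    with open(tf_path, "rb") as handle:
        tf_names = pickle.load(handle)
    labels = read_assignment(labels_path)

n_cells = 5  # number of cells in the (not shown) expression matrix
assert len(labels) == n_cells, "one label per cell is required"
cells_per_label = Counter(labels)
print(tf_names, dict(cells_per_label))
```

Checking that the number of labels equals the number of cells catches the most common mismatch between the assignment file and the expression matrix.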

### Step 3: Set Parameters (Optional)

Customize the regression parameters:
- `lambda1`: L1 regularization strength controlling sparsity (higher values = sparser networks, default: `1e-1`)
- `lambda2`: L2 regularization strength controlling similarity between phenotypes (higher values = more similar networks across phenotypes, default: `1e-5`)

In [None]:
print("Setting custom parameters")
pipeline.set_parameters(
    lambda1=1e-1,
    lambda2=1e-2
)
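
To build intuition for how the two parameters trade off, the sketch below evaluates the two penalty terms on toy weights. It assumes the penalty combines an L1 term over all weights with a squared difference between the weight matrices of consecutive phenotypes, which is our reading of the description above; the pipeline's exact scaling and implementation may differ, and the function names are hypothetical.

```python
def l1_penalty(weights):
    """Sum of absolute values over all phenotype weight matrices."""
    return sum(abs(w) for matrix in weights for row in matrix for w in row)

def similarity_penalty(weights):
    """Sum of squared differences between weight matrices of consecutive phenotypes."""
    total = 0.0
    for current, following in zip(weights, weights[1:]):
        for row_a, row_b in zip(current, following):
            total += sum((a - b) ** 2 for a, b in zip(row_a, row_b))
    return total

# Toy weights: two phenotypes, each with 2 TFs x 2 targets.
weights = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
lambda1, lambda2 = 1e-1, 1e-2
penalty = lambda1 * l1_penalty(weights) + lambda2 * similarity_penalty(weights)
print(f"L1 term: {l1_penalty(weights)}, similarity term: {similarity_penalty(weights)}")
print(f"Total penalty: {penalty:.2f}")
```

Raising `lambda1` makes every nonzero weight more costly, while raising `lambda2` makes differences between neighboring phenotypes more costly.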

### Step 4: Run the Pipeline

Execute the core SimiC regression with the following options:
- `skip_filtering`: If True, skips post-regression filtering of weights (default: False)
- `calculate_raw_auc`: If True, calculates AUC scores on unfiltered weights (default: False)
- `calculate_filtered_auc`: If True, calculates AUC scores on filtered weights (default: True)

This runs:
1. Input validation
2. SimiC regression algorithm
3. Result saving

In [None]:
print("Running simple SimiC pipeline")
pipeline.run_pipeline(
    skip_filtering=True,
    calculate_raw_auc=False, 
    calculate_filtered_auc=False
)

<div class="alert alert-block alert-info">
<b>Note:</b> The following code is equivalent to the previous cell
</div>

In [None]:
import time
total_start = time.time()

pipeline.validate_inputs()
pipeline.run_simic_regression()

total_end = time.time()

pipeline.timing['total'] = total_end - total_start
pipeline._print_summary()

<div class="alert alert-block alert-success">
<b>Success!</b> Check what results are available from the pipeline run.
</div>

In [None]:
pipeline.available_results()

## How to continue?
### 1. Filter Weights

After the basic run, you can filter the regression weights to remove noise and keep only the weights that matter most for target regulation.

In [None]:
pipeline.filter_weights()

Generate summary statistics and visualizations of the learned weights.

In [None]:
pipeline.analyze_weights()

### 2. Calculate AUC Scores

Calculate TF activity scores for each cell.
- `use_filtered`: If True, uses filtered weights; if False, uses raw weights
- `num_cores`: Number of CPU cores for parallel processing (-1 uses all available cores, default: 1)

In [None]:
pipeline.calculate_auc(use_filtered=True, num_cores=-1)
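
The `-1` value follows a common convention for "use every core". The sketch below shows one way such a value could be resolved with the standard library; `resolve_num_cores` is a hypothetical helper for illustration, not the pipeline's own implementation.

```python
import os

def resolve_num_cores(num_cores):
    """Map -1 to all available cores; otherwise clamp the request to [1, available]."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    if num_cores == -1:
        return available
    return max(1, min(num_cores, available))

print(resolve_num_cores(-1), resolve_num_cores(1))
```

On shared machines, passing an explicit core count is usually more considerate than `-1`.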


Generate summary statistics for the calculated AUC scores.

In [None]:
pipeline.analyze_auc_scores()

Compute dissimilarity between different cell populations based on regulatory networks.

In [None]:
MinMax = pipeline.calculate_dissimilarity()

### 3. Print Summary

Display a comprehensive summary of the pipeline run including timing information.

In [None]:
pipeline._print_summary()

## Load and Inspect Results (examples)
### Load Filtered AUC Scores

In [None]:
auc_filtered = pipeline.load_results('auc_filtered')
print(f"Available labels: {list(auc_filtered.keys())}")
print(f"AUC matrix shape: {list(auc_filtered.values())[0].shape}")

### Extract AUC for Specific Phenotype

Get AUC scores for a specific label (e.g., label 3).
- `result_type`: Name of the AUC results to load ('auc_raw' for unfiltered or 'auc_filtered' for filtered weights)
- `label`: Integer specifying which cell phenotype/population to extract (must match labels in assignment file)

In [None]:
print("Get AUC scores for specific label...\n")
auc_3 = pipeline.subset_label_specific_auc('auc_filtered', label=3)
print(f"AUC for label 3 shape: {auc_3.shape}")
print("\nFirst 5 rows and columns:")
print(auc_3.iloc[0:5, 0:5])

### Extract TF Regulatory Network

Get the regulatory network for a specific transcription factor.
- `TF_name`: Name of the transcription factor gene (must be present in the TF list provided to the pipeline)
- `stacked`: If True, returns a pandas DataFrame with the GRN weights for all labels in separate columns; if False, returns a dict with one pandas Series per label

In [None]:
# Get network for Bnc2 across all cell populations
bnc2_network = pipeline.get_TF_network("Bnc2", stacked=True)
print(f"Bnc2 network shape: {bnc2_network.shape}")
print("\nFirst 10 rows:")
print(bnc2_network.head(10))
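
Note that `head(10)` returns the first ten rows in stored order, not the strongest targets. To rank targets, sort by absolute weight for the label of interest. The sketch below shows the idea with plain dictionaries standing in for the pandas objects; the gene names, weights, and the helper `top_targets` are hypothetical.

```python
# Hypothetical weights for one TF: target gene -> weight per label.
tf_weights = {
    "GeneA": {0: 0.05, 1: -0.40},
    "GeneB": {0: 0.30, 1: 0.10},
    "GeneC": {0: 0.00, 1: 0.02},
}

def top_targets(weights, label, n=10):
    """Return the n targets with the largest absolute weight for one label."""
    ranked = sorted(weights, key=lambda gene: abs(weights[gene][label]), reverse=True)
    return ranked[:n]

print(top_targets(tf_weights, label=1, n=2))
```

Ranking by absolute value keeps strong negative weights, which would otherwise fall to the bottom of a plain descending sort.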

## Summary

This tutorial covered:
- ✓ Basic pipeline initialization and configuration
- ✓ Running the core SimiC regression
- ✓ Post-processing with filtering and AUC calculation
- ✓ Analyzing and extracting results
- ✓ Exploring TF-target networks

For more advanced features, see `Tutorial_SimiCPipeline_full.ipynb`, which includes:
- Cross-validation
- Parameter sweeps
- Custom filtering thresholds
- Parallel AUC computation