# Important Concepts and Metrics Used for scRNA-seq Analysis

## Concepts in scRNA-seq Analysis

```{admonition}Gene
A **gene** is a sequence of DNA that contains the instructions to produce a specific molecule, typically a protein or functional RNA. In scRNA-seq, gene expression is quantified by counting the RNA molecules transcribed from genes.
```

```{admonition}Cells
**Cells** are the basic structural and functional units of all living organisms. In scRNA-seq, each cell is analyzed individually to understand its gene expression profile.
```

```{admonition}RNA
**RNA (Ribonucleic Acid)** is a nucleic acid that plays essential roles in coding, decoding, regulation, and expression of genes. scRNA-seq measures the amount of RNA in each cell to infer gene expression.
```

```{admonition}Apoptosis
**Apoptosis** is a form of programmed cell death that occurs in multicellular organisms. It is characterized by specific cellular changes such as cell shrinkage, chromatin condensation, and membrane blebbing.
```

```{admonition}Apoptotic Cell
An **apoptotic cell** is a cell undergoing apoptosis. In scRNA-seq, a high fraction of reads mapping to mitochondrial genes is often used as a proxy for apoptotic or otherwise damaged cells.
```

```{admonition}Gene Detection Rate
The **Gene Detection Rate** is the proportion of genes detected (i.e., with non-zero expression) in a given cell. It serves as a quality metric.

**Formula:**
$$
\text{Gene Detection Rate} = \frac{\text{Number of detected genes in a cell}}{\text{Total number of genes}}
$$
```
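
As a minimal sketch of the formula above, the following pure-Python function computes the gene detection rate for a single cell. The function name and the example counts are hypothetical and chosen only for illustration.

```python
def gene_detection_rate(cell_counts):
    """Fraction of genes with non-zero counts in one cell."""
    if not cell_counts:
        raise ValueError("cell_counts must contain at least one gene")
    # A gene counts as detected when at least one molecule was observed.
    detected = sum(1 for count in cell_counts if count > 0)
    return detected / len(cell_counts)


# Hypothetical counts for one cell across six genes.
example_cell = [0, 3, 0, 1, 7, 0]
rate = gene_detection_rate(example_cell)
print(rate)  # 0.5 (3 of 6 genes detected)
```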

```{admonition}Mitochondrial Cells
In the context of scRNA-seq, **mitochondrial cells** usually refer to cells with abnormally high proportions of reads mapped to mitochondrial genes, often considered stressed or dying.
```

```{admonition}Mitochondrial Proportion
The **Mitochondrial Proportion** is the percentage of total reads in a cell that map to mitochondrial genes. High values suggest cell stress or apoptosis.

**Formula:**
$$
\text{Mitochondrial Proportion} = \frac{\text{Mitochondrial gene counts}}{\text{Total gene counts}} \times 100
$$
```
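
The percentage above can be sketched in a few lines of Python. This example assumes that mitochondrial genes are identified by a gene-symbol prefix (`MT-` for human, matched case-insensitively so that mouse-style `mt-` symbols also work); the function name and the counts are hypothetical.

```python
def mitochondrial_proportion(counts_by_gene, mito_prefix="MT-"):
    """Percentage of a cell's counts that come from mitochondrial genes."""
    total = sum(counts_by_gene.values())
    if total == 0:
        return 0.0  # empty cell: no counts to attribute
    # Assumption: mitochondrial genes share a symbol prefix such as "MT-".
    mito = sum(
        count
        for gene, count in counts_by_gene.items()
        if gene.upper().startswith(mito_prefix.upper())
    )
    return 100.0 * mito / total


# Hypothetical counts for one cell.
example_cell = {"MT-CO1": 30, "MT-ND1": 20, "ACTB": 100, "GAPDH": 50}
pct_mito = mitochondrial_proportion(example_cell)
print(pct_mito)  # 25.0
```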

```{admonition}UMI Counts
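**Unique Molecular Identifiers (UMIs)** are short random sequences attached to individual RNA molecules before amplification, so that PCR duplicates can be collapsed and each original molecule is counted once. The total UMI count per cell approximates the number of transcripts captured from that cell.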
```

```{admonition}Batch
A **batch** refers to a group of cells that have been processed together under the same experimental conditions. Variability between batches is common and can introduce confounding effects.
```

```{admonition}Batch Effect
**Batch Effect** is the unwanted variation in data due to technical differences between batches (e.g., reagent lots, sequencing runs), rather than biological differences.
```

```{admonition}Normalization
**Normalization** is the process of adjusting raw gene expression counts to account for differences in sequencing depth and other technical factors across cells.
```

```{admonition}Log Normalization
**Log Normalization** transforms the normalized counts using a logarithmic function to stabilize variance across genes.

**Formula:**
$$
\text{LogNormalized}(x) = \log_2\left(\frac{x}{\text{total counts per cell}} \times \text{scale factor} + 1\right)
$$
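Here $x$ is the raw count of a gene in the cell. Typically, the scale factor is 10,000. Some tools, including Seurat's `LogNormalize` and Scanpy's `log1p`, use the natural logarithm by default rather than $\log_2$; the choice of base only rescales the values by a constant factor.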
```
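
The transformation can be sketched for a single cell as follows. The example follows the $\log_2$ form of the formula above; the function name and the counts are hypothetical.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Depth-normalize one cell's counts, then apply log2(value + 1)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0 for _ in cell_counts]  # nothing to normalize
    return [math.log2(count / total * scale_factor + 1) for count in cell_counts]


# Hypothetical raw counts for one cell (total = 8).
example_cell = [0, 1, 3, 4]
normalized = log_normalize(example_cell)
print(normalized)
```

Zero counts stay at zero after the transformation, which is the reason for adding 1 inside the logarithm.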

```{admonition}Highly Variable Genes (HVGs)
**Highly Variable Genes (HVGs)** are genes that show greater expression variability across cells than expected by chance. HVGs are useful for downstream dimensionality reduction and clustering.
```
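
One deliberately simplified way to rank genes by variability is the dispersion (variance divided by mean). The sketch below uses that heuristic only for illustration; established tools such as Seurat and Scanpy model the mean-variance relationship more carefully. The function name, gene names, and values are hypothetical.

```python
from statistics import mean, pvariance


def top_variable_genes(expr_by_gene, n_top=2):
    """Rank genes by dispersion (variance / mean) and return the top names."""
    dispersions = {}
    for gene, values in expr_by_gene.items():
        mu = mean(values)
        # Genes with zero mean carry no information and get dispersion 0.
        dispersions[gene] = pvariance(values) / mu if mu > 0 else 0.0
    ranked = sorted(dispersions, key=dispersions.get, reverse=True)
    return ranked[:n_top]


# Hypothetical expression values for three genes across four cells.
expr = {
    "GENE_A": [5, 5, 5, 5],    # constant: dispersion 0
    "GENE_B": [0, 10, 0, 10],  # strongly variable
    "GENE_C": [1, 2, 1, 2],    # mildly variable
}
hvgs = top_variable_genes(expr, n_top=2)
print(hvgs)  # ['GENE_B', 'GENE_C']
```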

```{admonition}Principal Component Analysis (PCA)
**PCA** is a linear dimensionality reduction technique that transforms the data into a set of orthogonal axes (principal components), each capturing a portion of the variance.

**Formula:**
PCA is computed by solving:
$$
\text{Cov}(X) = V \Lambda V^T
$$
Where:
- $X$: mean-centered gene expression matrix  
- $V$: eigenvectors (principal components)  
- $\Lambda$: eigenvalues (explained variances)
```
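
To make the eigendecomposition concrete, the sketch below finds the first principal component of a small matrix by power iteration on the covariance matrix. This is an illustrative pure-Python approach, not how production tools compute PCA (they typically rely on optimized SVD routines); the function name and the data are hypothetical.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (unit eigenvector, eigenvalue) of the top PC for a cells x genes matrix."""
    n, d = len(rows), len(rows[0])
    # Mean-center each gene (column).
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
         for b in range(d)]
        for a in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # provided the starting vector is not orthogonal to it.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T Cov v gives the explained variance of this component.
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    return v, eigenvalue


# Hypothetical data: the second gene is always twice the first,
# so all variance lies along the direction (1, 2).
cells = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1, var1 = first_principal_component(cells)
print(pc1, var1)
```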

## Key Metrics in scRNA-seq Analysis

```{admonition}Total Loss (ZINB Loss)

Combines NB loss, dropout loss, and KL divergence with a weighting term $\beta$.

**Formula:**

$$
\mathcal{L}_{Total} = \mathcal{L}_{NB} + \mathcal{L}_{Bern} + \beta \cdot \mathcal{L}_{KL}
$$
```

```{admonition}KL Divergence (Kullback-Leibler)

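Measures the divergence between the approximate posterior $q(z|x)$ and the prior $p(z)$, used in VAEs. For a diagonal Gaussian posterior with mean $\mu_i$ and variance $\sigma_i^2$ in each of $n$ latent dimensions and a standard normal prior, it has the closed form below.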

**Formula:**

$$
\mathcal{L}_{KL} = -\frac{1}{2} \sum_{i=1}^n \left( 1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2 \right)
$$

```
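
The closed form can be sketched directly in Python. This example assumes the encoder outputs the mean and the log-variance of each latent dimension, which is a common but not universal parameterization; the function name is hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return -0.5 * sum(
        1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


# A posterior identical to the prior has zero divergence.
kl_same = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifting one latent mean by 1 adds 0.5 to the divergence.
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
print(abs(kl_same), kl_shifted)
```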

```{admonition}Bernoulli Dropout Loss

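Models dropout events as a Bernoulli process, where $p_{0ij}$ is the predicted probability that entry $(i, j)$ of the count matrix is a dropout (observed as zero). The target is the binary indicator $b_{ij} = \mathbb{1}[x_{ij} > 0]$ of whether the entry was detected.

**Formula:**

$$
\mathcal{L}_{Bern} = - \sum_{i,j} \left[ b_{ij} \log(1 - p_{0ij}) + (1 - b_{ij}) \log(p_{0ij}) \right]
$$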

```
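
A minimal sketch of this loss is shown below. It assumes the target in the formula is read as a binary indicator of whether a count is non-zero, and it adds a small `eps` inside the logarithms for numerical safety; both choices, like the function name, are assumptions made for illustration.

```python
import math


def bernoulli_dropout_loss(counts, dropout_probs, eps=1e-8):
    """Binary cross-entropy between detection indicators and dropout probabilities."""
    loss = 0.0
    for row_x, row_p in zip(counts, dropout_probs):
        for x, p0 in zip(row_x, row_p):
            detected = 1.0 if x > 0 else 0.0  # assumption: binarize the count
            loss -= (
                detected * math.log(1 - p0 + eps)
                + (1 - detected) * math.log(p0 + eps)
            )
    return loss


# Hypothetical 1 x 2 count matrix: a zero predicted as likely dropout (0.9)
# and a non-zero count predicted as unlikely dropout (0.1).
bern_loss = bernoulli_dropout_loss([[0, 5]], [[0.9, 0.1]])
print(bern_loss)
```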

```{admonition}Negative Binomial Loss (NB)

Used to model overdispersed count data. Assumes each count $x_{ij}$ follows a negative binomial (NB) distribution with mean $\mu_{ij}$ and inverse dispersion $\theta$.

**Formula:**

$$
\mathcal{L}_{NB} = - \sum_{i,j} \left[ \log \Gamma(x_{ij} + \theta) - \log \Gamma(x_{ij} + 1) - \log \Gamma(\theta) + x_{ij} \log \left( \frac{\mu_{ij}}{\theta + \mu_{ij}} \right) + \theta \log \left( \frac{\theta}{\theta + \mu_{ij}} \right) \right]
$$
```
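
The per-entry negative log-likelihood can be sketched with `math.lgamma`. The example uses the standard NB log-probability with mean $\mu$ and inverse dispersion $\theta$, in which the count term is $x \log\left(\mu / (\theta + \mu)\right)$; the function name and parameter values are hypothetical.

```python
import math


def nb_negative_log_likelihood(x, mu, theta):
    """Negative log-probability of count x under NB(mean=mu, inverse dispersion=theta)."""
    log_lik = (
        math.lgamma(x + theta)
        - math.lgamma(x + 1)
        - math.lgamma(theta)
        + x * math.log(mu / (theta + mu))
        + theta * math.log(theta / (theta + mu))
    )
    return -log_lik


# Sanity check: the implied probabilities over all counts should sum to 1.
total_prob = sum(
    math.exp(-nb_negative_log_likelihood(x, mu=2.0, theta=3.0))
    for x in range(200)
)
print(total_prob)
```

Summing the per-entry values over all cells and genes gives the loss in the formula above.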

```{admonition}Silhouette Score

Assesses how similar a cell is to its own cluster versus other clusters.

**Formula:**

$$
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
$$

Where:
- $a(i)$ = mean distance from cell $i$ to the other cells in its own cluster
- $b(i)$ = mean distance from cell $i$ to the cells in the nearest other cluster
```
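
The per-cell score can be sketched in pure Python as follows. The example assumes Euclidean distance and assigns a score of 0 to singleton clusters, a common convention; the function name and the toy coordinates are hypothetical.

```python
import math


def silhouette_scores(points, labels):
    """Return one silhouette value s(i) per point, using Euclidean distance."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        # Mean distance from point i to the points of each cluster (excluding itself).
        mean_dist = {}
        for c in clusters:
            dists = [
                math.dist(p, q)
                for j, q in enumerate(points)
                if labels[j] == c and j != i
            ]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # singleton cluster: score set to 0 by convention
            continue
        a = mean_dist[own]
        b = min(v for c, v in mean_dist.items() if c != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Two well-separated toy clusters in 2D.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
sil = silhouette_scores(points, labels)
print(sil)
```

Values close to 1 indicate cells that sit well inside their cluster, while values near 0 or below suggest overlapping or misassigned cells.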

```{admonition} Adjusted Rand Index (ARI)

Evaluates clustering similarity between predicted and true labels, adjusted for chance.

**Formula:**

$$
ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}
$$

Where $RI$ is the Rand Index.

```
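
In practice the ARI is usually computed from the contingency table of the two labelings using pair counts, which is an equivalent form of the chance-adjusted index. The sketch below follows that pair-counting form; the function name and the label vectors are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """ARI computed from pair counts in the contingency table."""
    n = len(true_labels)
    if n < 2:
        raise ValueError("need at least two samples")
    # Pairs of samples grouped together in both labelings.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values()
    )
    together_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = together_true * together_pred / comb(n, 2)
    max_index = 0.5 * (together_true + together_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are trivial
    return (together_both - expected) / (max_index - expected)


# Identical partitions with swapped label names still score 1.
ari_perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A labeling that splits every true cluster scores below zero.
ari_bad = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(ari_perfect, ari_bad)
```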

```{admonition}Normalized Mutual Information (NMI)

Measures mutual dependence between predicted and true clusters.

**Formula:**

$$
NMI = \frac{2 \cdot I(Y; \hat{Y})}{H(Y) + H(\hat{Y})}
$$

Where:
- $Y$ are the true labels and $\hat{Y}$ the predicted labels.
- $I$ is mutual information.
- $H$ is entropy.

```
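
The formula can be sketched from label counts alone. The example below uses natural logarithms (the base cancels in the ratio) and returns 1.0 when both labelings have zero entropy, which is a convention assumed here; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same samples."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log((c / n) / ((count_a[x] / n) * (count_b[y] / n)))
        for (x, y), c in count_ab.items()
    )


def normalized_mutual_information(a, b):
    """NMI with the arithmetic-mean normalization 2 * I / (H(a) + H(b))."""
    denom = entropy(a) + entropy(b)
    if denom == 0:
        return 1.0  # assumption: two trivial labelings are treated as identical
    return 2 * mutual_information(a, b) / denom


nmi_perfect = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
nmi_independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(nmi_perfect, nmi_independent)
```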

```{admonition}Calinski-Harabasz Index

Cluster dispersion score; higher values indicate better-defined clusters.

**Formula:**

$$
CH = \frac{\mathrm{Tr}(B_k)}{k - 1} \cdot \frac{n - k}{\mathrm{Tr}(W_k)}
$$

Where:
- $\mathrm{Tr}(B_k)$ = trace of the between-cluster dispersion matrix
- $\mathrm{Tr}(W_k)$ = trace of the within-cluster dispersion matrix
- $k$ = number of clusters
- $n$ = total number of samples
```
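
The two traces reduce to sums of squared distances, which the following sketch computes directly. It assumes at least two clusters and non-zero within-cluster dispersion; the function name and the toy coordinates are hypothetical.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index from squared Euclidean distances to centroids."""
    n, d = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    overall = [sum(p[j] for p in points) / n for j in range(d)]
    between = 0.0  # Tr(B_k): size-weighted squared centroid offsets
    within = 0.0   # Tr(W_k): squared distances of points to their centroid
    for members in clusters.values():
        centroid = [sum(p[j] for p in members) / len(members) for j in range(d)]
        between += len(members) * sum(
            (centroid[j] - overall[j]) ** 2 for j in range(d)
        )
        within += sum(
            (p[j] - centroid[j]) ** 2 for p in members for j in range(d)
        )
    return (between / (k - 1)) * ((n - k) / within)


# Two well-separated toy clusters in 2D.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
ch = calinski_harabasz(points, labels)
print(ch)  # 200.0
```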

```{admonition}Pearson Correlation Coefficient

Measures linear correlation between real and reconstructed gene expressions.

**Formula:**

$$
r = \frac{\sum_i (x_i - \bar{x})(\hat{x}_i - \bar{\hat{x}})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (\hat{x}_i - \bar{\hat{x}})^2}}
$$
```
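
A minimal pure-Python sketch of the coefficient follows. It assumes neither vector is constant, since the denominator would otherwise be zero; the function name and the values are hypothetical.

```python
import math


def pearson_r(x, x_hat):
    """Pearson correlation between original and reconstructed values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_hat = sum(x_hat) / n
    cov = sum((a - mean_x) * (b - mean_hat) for a, b in zip(x, x_hat))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_hat = sum((b - mean_hat) ** 2 for b in x_hat)
    return cov / math.sqrt(ss_x * ss_hat)


# A reconstruction that is a scaled copy of the input is perfectly correlated.
r_scaled = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_reversed = pearson_r([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])
print(r_scaled, r_reversed)
```

The first case illustrates that a high correlation does not imply a small reconstruction error, because Pearson's $r$ ignores scale and offset.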

```{admonition}Reconstruction Loss (MSE)

A basic loss function that measures the average squared difference between original and reconstructed values.

**Formula:**

$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2
$$
```
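
As a short sketch of the formula, with a hypothetical function name and values:

```python
def mean_squared_error(x, x_hat):
    """Average squared difference between original and reconstructed values."""
    if len(x) != len(x_hat):
        raise ValueError("inputs must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Only the last value differs (by 2), so the error is 2**2 / 3.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(mse)
```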