# Integration with bulk RNA-seq data

Current limitations of single-cell datasets are their high cost, small sample sizes and, often, a lack of associated clinical information. Bulk RNA-seq experiments, on the other hand, are comparatively cheap, and vast amounts of experimental data have accumulated in public repositories, including large-scale datasets with thousands of samples. This motivates integrating bulk RNA-seq data with cell-type information from single-cell RNA-seq datasets.

:::{note}
*Deconvolution methods* infer cell-type proportions from bulk RNA-seq datasets based on reference signatures. Methods such as CIBERSORTx {cite}`newmanDeterminingCellType2019`, DWLS {cite}`tsoucasAccurateEstimationCelltype2019` or MuSiC {cite}`wangBulkTissueCell2019` can build such reference signatures from single-cell RNA-seq datasets. The derived cell-type fractions can then be tested for associations with clinical data. In {cite:t}`pournaraPowerAnalysisCelltype2023a`, the authors provide an overview and a comparative benchmark of such methods.
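
As a rough illustration of the deconvolution idea, the sketch below estimates cell-type fractions for one bulk sample with plain non-negative least squares, solved by projected gradient descent. This is a generic textbook formulation and not the algorithm of any of the tools cited above. The function names, the toy signature matrix and the toy bulk profile are all made up for this example.

```python
def nnls_projected_gradient(signatures, bulk, n_iter=2000):
    """Minimise 0.5 * ||S w - b||^2 subject to w >= 0 (hypothetical helper).

    signatures: list of rows, one per gene, one column per cell type.
    bulk: expression of the same genes in one bulk sample.
    """
    n_genes, n_types = len(signatures), len(signatures[0])
    # 1 / trace(S^T S) never exceeds 1 / (largest eigenvalue of S^T S),
    # so this step size keeps the iteration stable.
    step = 1.0 / sum(v * v for row in signatures for v in row)
    weights = [0.0] * n_types
    for _ in range(n_iter):
        residual = [
            sum(signatures[g][k] * weights[k] for k in range(n_types)) - bulk[g]
            for g in range(n_genes)
        ]
        gradient = [
            sum(signatures[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        weights = [max(0.0, weights[k] - step * gradient[k]) for k in range(n_types)]
    return weights


def estimate_fractions(signatures, bulk):
    """Rescale the non-negative weights so that they sum to one."""
    weights = nnls_projected_gradient(signatures, bulk)
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights


# Toy reference: 3 genes x 2 cell types (made-up numbers).
toy_signatures = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0]]
# Toy bulk sample constructed as 70% of type 0 plus 30% of type 1.
toy_bulk = [0.7 * a + 0.3 * b for a, b in toy_signatures]

fractions = estimate_fractions(toy_signatures, toy_bulk)
print([round(f, 3) for f in fractions])
```

Because the toy bulk profile is an exact mixture of the two signatures, the estimate recovers the mixing proportions. Real data are noisy, which is why the published methods add gene weighting, marker selection or other refinements.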

Alternatively, *Scissor* {cite}`sunIdentifyingPhenotypeassociatedSubpopulations2022` works independently of predefined cell populations and tests, for each cell, whether it is positively or negatively associated with phenotypic information (e.g. survival data or driver mutations) that comes with bulk RNA-seq datasets. To do so, it calculates the Pearson correlation of each cell’s gene expression profile with the bulk RNA-seq data and uses an L1-constrained linear model to explain the phenotypic information with the correlation values.

Even though Scissor has the advantage that it works independently of known cell types, it comes with several limitations:

 * it is computationally expensive (it can run for hours on a single sample)
 * it does not natively take biological replicates into account (we circumvent this by running Scissor independently on each patient)
 * it does not support modelling covariates (e.g. sex, age)
:::
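
The two steps that the note describes for Scissor (correlating each cell with each bulk sample, then fitting an L1-penalised linear model to the phenotype) can be sketched in a few lines of plain Python. This is a simplified illustration only, not the implementation of the Scissor R package: it treats the phenotype as a continuous outcome, uses a basic coordinate-descent lasso, and leaves out any additional normalisation or regularisation the actual tool may apply. All function names and all numbers are invented for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=500):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 on centred X and y."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]  # residual for beta = 0
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = [Xc[i][j] for i in range(n)]
            norm = sum(c * c for c in col) / n
            if norm == 0:
                continue
            # Covariance of column j with the residual that excludes its own effect.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam) / norm
            delta = new_beta - beta[j]
            residual = [r - c * delta for c, r in zip(col, residual)]
            beta[j] = new_beta
    return beta


def scissor_like_labels(bulk, cells, phenotype, lam):
    """Correlate cells with bulk samples, then explain the phenotype with a lasso."""
    # Rows: bulk samples, columns: cells.
    corr = [[pearson(b, c) for c in cells] for b in bulk]
    beta = lasso_coordinate_descent(corr, phenotype, lam)
    labels = [
        "positive" if b > 0 else "negative" if b < 0 else "background" for b in beta
    ]
    return corr, beta, labels


# Toy data: 4 genes, 3 cells, 6 bulk samples (made-up numbers).
toy_cells = [[5, 1, 0, 1], [4, 2, 1, 0], [0, 1, 5, 4]]
toy_bulk = [
    [9, 3, 1, 1], [8, 2, 2, 1], [7, 3, 1, 2],
    [1, 2, 8, 7], [2, 1, 9, 6], [1, 3, 7, 8],
]
toy_phenotype = [1, 1, 1, 0, 0, 0]  # e.g. mutated vs. wild type, coded 1/0

corr, beta, labels = scissor_like_labels(toy_bulk, toy_cells, toy_phenotype, lam=0.05)
print(labels)
```

Cells with a non-zero coefficient are the ones the model selects as phenotype-associated, and the sign of the coefficient gives the direction. The penalty `lam` controls how many cells are selected: the larger it is, the more cells end up as background.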

In {cite:t}`salcherHighresolutionSinglecellAtlas2022a`, we used Scissor to test associations of specific driver mutations with specific cell types. In this example, we demonstrate how to use Scissor to test the effect of EGFR mutation on cell-type composition in lung adenocarcinoma (*LUAD*).

## 1. Import the required libraries