Filtering the RNAs expression matrix by function

Felipe Vaz Peres edited this page Feb 24, 2024 · 2 revisions

Visualizing coefficient of variation of the 3 datasets

After applying the variance-stabilizing transformation (VST) as described here, we decided to focus only on the genes whose expression varies most across the analyzed conditions, that is, those with the highest coefficients of variation.

To determine this, we computed the coefficient of variation (CV) for each gene, along with its distribution by function (coding and non-coding), using this Python script. This resulted in the distributions below for the genes from the Hoang2017, Correr2020, and Perlo2022 datasets, respectively.
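As a rough illustration of the computation described above (not the linked script itself), the per-gene CV can be sketched as the standard deviation divided by the mean across samples. The toy matrix, the `function` labels, and the helper name `coefficient_of_variation` are assumptions made for this example, as is the choice of the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd


def coefficient_of_variation(expr):
    """Return the per-gene CV (std / mean) across samples (columns)."""
    mean = expr.mean(axis=1)
    std = expr.std(axis=1, ddof=1)  # sample standard deviation (assumption)
    # A gene with zero mean has an undefined CV, so report NaN for it.
    return (std / mean.replace(0, np.nan)).rename("cv")


# Hypothetical toy matrix: rows are genes, columns are samples.
expr = pd.DataFrame(
    {"s1": [10.0, 5.0, 0.0], "s2": [10.0, 15.0, 0.0], "s3": [10.0, 10.0, 0.0]},
    index=["geneA", "geneB", "geneC"],
)
# Hypothetical functional annotation for each gene.
function = pd.Series(
    ["coding", "non-coding", "non-coding"], index=expr.index, name="function"
)

cv = coefficient_of_variation(expr)

# Summarize the CV distribution separately for coding and non-coding genes.
cv_by_function = cv.groupby(function).describe()
print(cv_by_function)
```

Grouping by the annotation gives one summary row per functional class, which is the information a per-function distribution plot would display.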

Removing genes with "low" Coefficient of Variation (CV)

Using this script, we set a cutoff that keeps only the genes with the highest coefficient of variation, so that each matrix retains a manageable number of genes (roughly 200,000 to 300,000) for co-expression analysis.
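One way to explore candidate cutoffs, shown here only as a sketch and not as the procedure used in the linked script, is to count how many genes would remain at each threshold and pick one that lands in the desired range. The CV values, the candidate list, and the helper name `genes_retained` are hypothetical.

```python
import pandas as pd


def genes_retained(cv, cutoffs):
    """Count the genes whose CV is strictly above each candidate cutoff."""
    return pd.Series(
        {cutoff: int((cv > cutoff).sum()) for cutoff in cutoffs},
        name="genes_retained",
    )


# Hypothetical per-gene CV values.
cv = pd.Series(
    [0.1, 0.5, 0.7, 1.3, 2.5],
    index=["gene0", "gene1", "gene2", "gene3", "gene4"],
)

counts = genes_retained(cv, [0.6, 1.2, 2.0])
print(counts)
```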

The chosen CV cutoffs were 1.2 for Hoang2017, 2.0 for Correr2020, and 0.6 for Perlo2022.
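A minimal sketch of applying a dataset-specific cutoff to an expression matrix follows. The cutoff values are the ones stated above; the toy matrix, the helper name `filter_by_cv`, and the use of a strict `>` comparison are assumptions, since the linked scripts may differ in these details.

```python
import pandas as pd

# CV cutoffs per dataset, as stated in the text.
CV_CUTOFFS = {"Hoang2017": 1.2, "Correr2020": 2.0, "Perlo2022": 0.6}


def filter_by_cv(expr, cutoff):
    """Keep only the genes (rows) whose CV across samples exceeds the cutoff."""
    cv = expr.std(axis=1, ddof=1) / expr.mean(axis=1)
    # Genes with an undefined CV (zero mean) compare as False and are dropped.
    return expr.loc[cv > cutoff]


# Hypothetical toy matrix: rows are genes, columns are samples.
expr = pd.DataFrame(
    {"s1": [1.0, 0.0, 1.0], "s2": [1.0, 0.0, 2.0], "s3": [1.0, 9.0, 3.0]},
    index=["g1", "g2", "g3"],
)

filtered = filter_by_cv(expr, CV_CUTOFFS["Hoang2017"])
print(filtered.shape)
```

In this toy case only `g2` has a CV above 1.2, so it is the single gene that survives the filter.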

After applying this filter, we obtained a filtered matrix containing 240,323 coding and non-coding genes for the Hoang2017 dataset:

ddsColl_top_20_percent

class: DESeqDataSet 
dim: 240323 15 

Following the same steps for the Correr2020 dataset, using this script, resulted in a filtered matrix with 255,816 non-coding genes:

ddsColl_top_20_percent

class: DESeqDataSet 
dim: 255816 12 

The same was done for the Perlo2022 dataset, using this script, resulting in a filtered matrix with 304,872 non-coding genes:

ddsColl_top_20_percent

class: DESeqDataSet 
dim: 304872 63