z-scores of gene abundance #44

RubenRBakker · 2021-04-30T15:59:51Z

Hi lpantano,

I have more of a theoretical question about your package than a software bug.

I have run the function degPatterns and have been using the resulting plot data for further analysis. I am wondering what the "z-score of gene abundance" means in the plot for degPatterns. I read: "The y-axis in the figure is the results of applying scale() R function, what is similar to creating a Z-score where values are centered to the mean and scaled to the standard deviation by each gene."

However, there is one dot per gene per condition. If I have three replicates, there are not three dots per condition. Hence the question: how are the replicates combined to get a single z-score?

Thank you in advance,

Ruben

lpantano · 2021-04-30T16:57:24Z

Good question, the answer is at the beginning of the Details section. In short, it is the mean across replicates. A slightly longer description:

Before calculating the gene similarity among samples, all samples inside the same time point (time parameter) and group (col parameter) are collapsed together, and the mean value represents the gene abundance for that group.
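To make the collapsing step concrete, here is a minimal sketch in base R. It is not the degPatterns source code; the toy matrix `counts`, the `group` factor, and the sample names are made up for illustration. It only shows the idea described above: average the replicates of each condition, then scale each gene across the collapsed conditions.

```r
# Toy matrix (hypothetical data): 2 genes x 6 samples,
# 3 replicates for each of 2 conditions.
counts <- matrix(
  c(10, 12, 14, 40, 42, 44,
    100, 110, 120, 50, 55, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("s", 1:6))
)
group <- factor(c("ctrl", "ctrl", "ctrl", "treat", "treat", "treat"))

# Collapse replicates: one mean value per gene per condition.
collapsed <- sapply(levels(group), function(g) {
  rowMeans(counts[, group == g, drop = FALSE])
})

# Scale each gene (row) across conditions.
# scale() works on columns, so transpose before and after.
z <- t(scale(t(collapsed)))
```

With this layout `z` has one value per gene per condition, which is why the plot shows a single dot rather than one dot per replicate.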

Happy to explain further, thanks!

dbcraig · 2021-05-27T23:20:21Z

Related to this... when using scale() the Z-score is by column, which means each sample (or element of a time series) is normalized within the sample. However, the counts from DESeq2 are already normalized across samples using median of ratios. Won't this re-normalization with scale() possibly distort the trend across a time series? Meaning, Z-scores across samples are not necessarily comparable since they use independent means and standard deviations (i.e., by column).

I know a Z-score is necessary to compare plots between gene groups, but wouldn't it make more sense to compute a common Z-score using all DESeq expression counts (i.e., across all columns)?

An example of the problem is in the attached PDF.

Thanks, Doug

DEGpatterns_issue.pdf

lpantano · 2021-05-28T01:40:24Z

Hi Doug,

You are right. I am not using scale() to normalize within each sample, but across samples for each gene. The idea is to be able to plot together genes that have the same pattern even though some are highly expressed and some are moderately expressed; the pattern across samples is the same.
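The difference between the two directions of scaling can be sketched with a toy matrix. This is only an illustration with invented values (`expr` and its row names are hypothetical), not code taken from the package: `scale()` standardizes columns, so applying it directly normalizes within each sample, while transposing first standardizes each gene across samples.

```r
# Toy expression matrix (hypothetical): genes in rows, time points in columns.
# All three genes rise by 10 per time point but sit at different magnitudes.
expr <- rbind(
  low    = c(10, 20, 30, 40),
  medium = c(110, 120, 130, 140),
  high   = c(1010, 1020, 1030, 1040)
)
colnames(expr) <- paste0("t", 1:4)

# Column-wise: each sample is standardized on its own.
within_sample <- scale(expr)

# Row-wise: each gene is standardized across samples.
across_samples <- t(scale(t(expr)))
```

In this example the column-wise version flattens every gene's trend over time, while the row-wise version gives all three genes the same rising profile, which is the behaviour described above for the plot.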

Happy to point you to the code, and if you have seen something you think is wrong, happy to double-check. I agree with you that scaling within the sample is not a good idea in the majority of cases.

Also, the cluster calculation doesn't use the scaled values but the original values; Kendall correlation is used for that. So, by default, scaling is only used for plotting.
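As a rough sketch of what clustering on unscaled values with Kendall correlation can look like in base R: the toy data, the `1 - correlation` distance, the average linkage, and the choice of `k = 2` are all assumptions made for this example, not a description of what degPatterns does internally.

```r
# Toy matrix (hypothetical): two rising genes at very different
# magnitudes and one falling gene.
expr <- rbind(
  up_low   = c(5, 9, 20, 41),
  up_high  = c(900, 1500, 2100, 4000),
  down_mid = c(300, 220, 150, 90)
)

# Kendall correlation between genes, computed on the unscaled values.
# cor() correlates columns, so genes must be in columns here.
gene_cor <- cor(t(expr), method = "kendall")

# Assumption: convert correlation to a distance and cluster hierarchically.
gene_dist <- as.dist(1 - gene_cor)
clusters <- cutree(hclust(gene_dist, method = "average"), k = 2)
```

Because Kendall correlation is rank-based, the two rising genes end up in the same cluster despite the large difference in magnitude, and the falling gene is separated from them.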

Thanks.

RubenRBakker · 2021-05-28T07:20:50Z

Hi both,

These comments have been very useful! Thank you both for the interesting discussion.

dbcraig · 2021-05-30T00:26:19Z

Lorena,

Thank you for the prompt reply. It was very helpful.

Just to further clarify... you say the cluster calculation is on the original expression values (not scaled values), so if gene A has values of (10,20,30,40) over four time points, and gene B has expression values of (1010,1020,1030,1040), they are not likely to be clustered together since the magnitude of their values is 100x different (thus the distance measure used for clustering is large). So even though they have the same "pattern" of increase, because the magnitudes differ they would not likely be clustered together in the same plot. But because you use a Z-score for display, we would recognize similar patterns between the two different plots that contain genes A and B respectively. So each plot clusters genes with similar expression magnitude. Is this a correct interpretation?

The only way I can think of to cluster genes with similar patterns and large magnitude differences would be to convert individual gene expression values to Z-scores, but I don't think you're doing anything like that. Correct? For example, genes A and B above would have the same Z-score (even though they have very different means, they have the same standard deviation). If you were to cluster using this Z-score, then gene A and B would likely be in the same plot.
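The claim about genes A and B can be checked directly in base R. This is just a sketch using the two example vectors from this comment; the variable names are made up.

```r
gene_a <- c(10, 20, 30, 40)
gene_b <- c(1010, 1020, 1030, 1040)

# as.numeric() drops the matrix shape and attributes that scale() returns.
z_a <- as.numeric(scale(gene_a))
z_b <- as.numeric(scale(gene_b))

# Euclidean distance on the raw values is dominated by the magnitude gap.
raw_dist <- sqrt(sum((gene_a - gene_b)^2))

# On the Z-scores the two genes coincide, so the distance is zero.
z_dist <- sqrt(sum((z_a - z_b)^2))
```

Here `raw_dist` is 2000 while `z_dist` is zero, so a Euclidean clustering on Z-scores would place the two genes together, whereas the same clustering on raw values would not.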

Best, Doug

lpantano · 2021-06-01T13:11:51Z

Hmm, actually, the correlation for that example is 1, because correlation looks for the same changes independently of the total value:



> cor.test(c(10,20,30,40), c(1010,1020,1030,1040))

	Pearson's product-moment correlation

data:  c(10, 20, 30, 40) and c(1010, 1020, 1030, 1040)
t = Inf, df = 2, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 1 1
sample estimates:
cor 
  1

So this function aims to find genes that change in the same way, even if their values sit in different quantiles of expression. That difference could be a technology bias, because not all genes are sequenced equally well and no normalization works 100%. Or it could be biological, since some RNAs could be more stable than others. What this function looks for is genes that go up or down in the same conditions, so you can perform functional analysis afterwards and see whether there are common pathways, for instance.

And to the point: the scaling is performed so that when you plot those genes they all sit within roughly a -2 to 2 range, rather than one in the 1000s and another in the 10s.

If you want genes that are equally expressed and change together, then that is another question indeed.

Cheers

lpantano · 2021-06-01T13:16:20Z

(Sorry if you got a long email; I copied a lot of R code that wasn't needed. You can come to the web page to read the fixed comment.)

dbcraig · 2021-06-01T14:47:15Z

Okay, that makes more sense.
Sorry, I had assumed a Euclidean distance measure for clustering instead of the correlation that you had already mentioned.
Thanks for patiently explaining this to me.
Doug

lpantano · 2021-06-02T02:00:32Z

No worries, every analysis is different and it is good to know exactly what is happening. I always try to put all the information in the docs, but it needs more time to make it better. I have had very little time lately to reply properly, but I appreciate all questions, even if sometimes I cannot address them all thoroughly. I will leave this open so people can follow the thread, since we all put time into a good discussion. Cheers.

sum732 · 2022-11-23T16:25:39Z

Thanks @lpantano for leaving the thread open. These comments are very helpful and much appreciated!
