z-scores of gene abundance #44

Open
RubenRBakker opened this issue Apr 30, 2021 · 10 comments

RubenRBakker commented Apr 30, 2021

Hi lpantano,

I have more of a theoretical question about your package than a software bug.

I have run the function degPatterns and have been using the resulting plot data for further analysis. I am wondering what the "z-score of gene abundance" means in the plot for degPatterns. I read: "The y-axis in the figure is the results of applying scale() R function, what is similar to creating a Z-score where values are centered to the mean and scaled to the standard deviation by each gene."

However, there is one dot per gene per condition. If I have three replicates, there are not three dots per condition. Hence the question: how are the replicates combined to get a single z-score?

Thank you in advance,

Ruben

@lpantano
Owner

Good question, the answer is at the beginning of the Details section. In short, it is the mean across replicates. A slightly longer description:

Before calculating the genes similarity among samples, all samples inside the same time point (time parameter) and group (col parameter) are collapsed together, and the mean value is the representation of the group for the gene abundance. 

Happy to explain further, thanks!
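
The collapsing step described above can be sketched in base R. This is a minimal illustration and not the package's actual code: the count matrix, the sample names and the `time` factor are made up, and only one grouping variable is used.

```r
# Hypothetical normalized counts: 2 genes x 9 samples
# (3 time points, 3 replicates each). All values are made up.
counts <- matrix(
  c(10, 12, 14, 20, 22, 24, 30, 32, 34,
    1000, 1100, 1200, 2000, 2100, 2200, 3000, 3100, 3200),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("s", 1:9))
)
time <- factor(rep(c("t1", "t2", "t3"), each = 3))

# Collapse replicates: one mean per gene per time point.
group_means <- sapply(levels(time), function(g) {
  rowMeans(counts[, time == g, drop = FALSE])
})

# Scale each gene (row) across the collapsed time points.
# scale() works on columns, so transpose before and after.
z <- t(scale(t(group_means)))
print(z)
```

With three replicates per time point, each gene ends up with a single value per time point, which is why the plot shows one dot per gene per condition.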

dbcraig commented May 27, 2021

Related to this... when using scale() the Z-score is by column, which means each sample (or element of a time series) is normalized within the sample. However, the counts from DESeq2 are already normalized across samples using median of ratios. Won't this re-normalization with scale() possibly distort the trend across a time series? Meaning, Z-scores across samples are not necessarily comparable since they use independent means and standard deviations (i.e., by column).

I know a Z-score is necessary to compare plots between gene groups, but wouldn't it make more sense to compute a common Z-score using all DESeq expression counts (i.e., across all columns)?

An example of the problem is in the attached PDF.

Thanks, Doug

DEGpatterns_issue.pdf

@lpantano
Owner

Hi Doug,

You are right, I am not using scale to normalize within the sample but across samples for each gene. The idea is to be able to plot together genes that have the same pattern across samples, even if some are highly expressed and some are moderately expressed.

Happy to point to the code, or if you have seen something that you think is wrong, happy to double check. And I agree with you: I don't think scaling within the sample is a good idea in the majority of cases.

Also, the cluster calculation doesn't use the scaled values but the original values; Kendall correlation is used for that. So by default, scale is only used for plotting.

Thanks.
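
To make the two scaling directions concrete, here is a small base R sketch. The matrix and its values are hypothetical, and how degPatterns arranges its data internally is not shown in this thread, so treat the transposition as an assumption about one way to get per-gene scaling.

```r
# Hypothetical matrix: genes in rows, samples (time points) in columns.
mat <- matrix(
  c(10, 20, 30,
    3000, 2000, 1000),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3"))
)

# scale() standardizes columns, so applied directly it works
# within each sample and the trend over time is lost.
by_sample <- scale(mat)

# Transposing first makes the genes the columns, so each gene is
# standardized across samples; transpose back to restore the layout.
by_gene <- t(scale(t(mat)))

print(by_sample)
print(by_gene)
```

In `by_sample`, geneA gets the same value in every column, so its increase is invisible. In `by_gene`, geneA rises and geneB falls, which is the pattern the plot is meant to show.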

@RubenRBakker
Author

Hi both,

These comments have been very useful! Thank you both for the interesting discussion.

dbcraig commented May 30, 2021

Lorena,

Thank you for the prompt reply. It was very helpful.

Just to further clarify... you say the cluster calculation is on the original expression values (not scaled values), so if gene A has values of (10,20,30,40) over four time points and gene B has expression values of (1010,1020,1030,1040), they are not likely to be clustered together, since their magnitudes differ by roughly 100x (thus the distance measure used for clustering is large). So even though they have the same "pattern" of increase, because the magnitudes differ they would not likely be clustered together in the same plot. But because you use a Z-score for display, we would recognize similar patterns between the two different plots that contain genes A and B respectively. So, the plots cluster genes with similar expression magnitude. Is this a correct interpretation?

The only way I can think of to cluster genes with similar patterns and large magnitude differences would be to convert individual gene expression values to Z-scores, but I don't think you're doing anything like that. Correct? For example, genes A and B above would have the same Z-score (even though they have very different means, they have the same standard deviation). If you were to cluster using this Z-score, then gene A and B would likely be in the same plot.

Best, Doug
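
The claim that genes A and B share the same Z-scores can be checked with a few lines of base R, using the values from the comment above. The variable names are only illustrative.

```r
gene_a <- c(10, 20, 30, 40)
gene_b <- c(1010, 1020, 1030, 1040)

# scale() returns a one-column matrix with attributes;
# as.numeric() reduces it to a plain vector.
z_a <- as.numeric(scale(gene_a))
z_b <- as.numeric(scale(gene_b))

# Different means, same standard deviation, so the Z-scores match.
same_z <- isTRUE(all.equal(z_a, z_b))
print(same_z)
```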

@lpantano
Owner

lpantano commented Jun 1, 2021

mm, actually, the correlation in that example is 1, because correlation captures the same changes independently of the absolute values:



> cor.test(c(10,20,30,40), c(1010,1020,1030,1040))

	Pearson's product-moment correlation

data:  c(10, 20, 30, 40) and c(1010, 1020, 1030, 1040)
t = Inf, df = 2, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 1 1
sample estimates:
cor 
  1 

So this function aims to find genes that change in the same way, even if their values fall in different quantiles. The difference in magnitude could be a technology bias, because not all genes are sequenced equally well and no normalization works 100%. Or it could be biological: some RNAs could be more stable than others. What this function looks for is genes that go up or down in the same conditions, so that you can perform functional analysis afterwards and see, for instance, whether they share common pathways.

And to the point: the scaling is done so that when you plot those genes, they all fall within roughly a -2 to 2 range, rather than one gene at 1000s and another at 10s.

If you want genes that are equally expressed and change together, then that is another question indeed.

Cheers
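
Since the clustering is said earlier in the thread to use Kendall correlation, the same example can be repeated with that method and contrasted with a Euclidean distance. This is only a sketch: taking `1 - tau` as a dissimilarity is an assumption made here for illustration, not something this thread confirms about the package.

```r
gene_a <- c(10, 20, 30, 40)
gene_b <- c(1010, 1020, 1030, 1040)

# Rank-based correlation: both genes increase at every step.
tau <- cor(gene_a, gene_b, method = "kendall")

# Correlation-based dissimilarity (one common choice, assumed here).
cor_dist <- 1 - tau

# Euclidean distance on the raw values is dominated by the magnitude gap.
euc_dist <- as.numeric(dist(rbind(gene_a, gene_b)))

print(c(tau = tau, cor_dist = cor_dist, euc_dist = euc_dist))
```

Under a correlation-based measure the two genes are as close as possible, while under Euclidean distance they are far apart, so the choice of measure decides whether they can land in the same cluster.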

@lpantano
Owner

lpantano commented Jun 1, 2021

(Sorry if you got a long email; I pasted a full block of R code that wasn't needed. You can come to the web page to read the fixed comment.)

dbcraig commented Jun 1, 2021

Okay, that makes more sense.
Sorry, I had assumed a Euclidean distance measure for clustering instead of the correlation that you had already mentioned.
Thanks for patiently explaining this to me.
Doug

@lpantano
Owner

lpantano commented Jun 2, 2021

No worries, every analysis is different and it is good to know exactly what is happening. I always try to put all the information in the docs, but it needs more time to make them better. I have had very little time lately to reply properly, but I appreciate all questions, even if sometimes I cannot address them all thoroughly. I will leave this open so people can follow the thread, since we all put time into a good discussion. Cheers.

sum732 commented Nov 23, 2022

Thanks @lpantano for leaving the thread open. These comments are very helpful and much appreciated!
