MLR: Not able to reproduce estimates & p-values in R #26
Estimate mix-up fixed in 009a1ad. As for the p-values: assuming the regression in R is accurate, there are a few places where the regression_single.py p-values could deviate. The t statistics seem to match, so the problem is in the p-value calculation. The p-value is calculated using torch.distributions.studentT.StudentT(df).log_prob. The degrees of freedom also seem to match; with dummy data of the same size, I got 96 degrees of freedom, computed as sketched below.
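In sketch form, assuming the dummy data described in this thread (100 samples; methylation, age, and sex as predictors, plus an intercept):

```python
n_samples = 100     # samples in the dummy data
n_predictors = 3    # methylation + age + sex
df = n_samples - n_predictors - 1  # minus 1 for the intercept
print(df)  # 96
```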
The log_prob function is applied to the t_stats on line 217 of regression_single.py: `p_value = torch.exp(log_prob(t_stats))`. Assuming the base of log_prob is e (which is confirmed by the source code), I do not see where the deviation could come from, aside from cumulative inaccuracy in either R or tecpg. Is the tecpg p-value consistently and substantially higher than the R p-value for other gene expression and methylation loci pairs? Or is it sometimes greater than and sometimes less than the R p-value? |
Is it a one-sided test? |
I could not find that information in the torch documentation, but it appears to be two-sided. Plugging the t value from #26 (comment), 1.8765793, into this calculator with 96 degrees of freedom yields 0.031812 for a one-sided test and 0.063624 for a two-sided test; the latter is very close to the p-value of both analyses. |
It appears the p-value is for a one-sided test (see the 0.5 in the code below for the StudentT function). Can we convert the one-sided value to two-sided?
From https://pytorch.org/docs/stable/_modules/torch/distributions/studentT.html#StudentT
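The relevant excerpt of that log_prob implementation, paraphrased (treat this as a sketch of the linked source rather than an exact copy):

```python
# Paraphrase of torch.distributions.studentT.StudentT.log_prob.
# Note the 0.5 factors: they appear in the log-gamma normalization
# terms of the density.
def log_prob(self, value):
    y = (value - self.loc) / self.scale
    Z = (self.scale.log()
         + 0.5 * self.df.log()
         + 0.5 * math.log(math.pi)
         + torch.lgamma(0.5 * self.df)
         - torch.lgamma(0.5 * (self.df + 1.0)))
    return -0.5 * (self.df + 1.0) * torch.log1p(y ** 2.0 / self.df) - Z
```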
|
The current p-values from tecpg and R are quite close. From what I understand, applying the transformation in #26 (comment) to the allegedly one-sided p-value from the first comment computed by tecpg, 0.06933532, yields 1.872752. This does not seem like a valid two-sided p-value. Are you sure tecpg is outputting a one-sided value? The p-values from tecpg are close to those output by R; with the provided formula, these p-values would be scaled beyond even the valid p-value range, [0, 1]. According to the values in #26 (comment), the percent error between the tecpg t value and the R t value is 0.022413425%. The percent error between the p-values is less than 10%, which is significant but too small to be a mix-up in the "sidedness" of the test. |
I'm not confident the conversion formula I provided was correct. I do consider the R results to be the ground truth, so we will have to make adjustments to tecpg until we are comfortable that the p-values are reasonable. Adam or Ritu will need to provide us with guidance. |
We are comfortable with a small difference in the test statistic between R and tecpg (StudentT). However, the p-values are not concordant with Ritu's implementation in R to an acceptable degree. Given what we've looked at so far, the simplest explanation for the discrepancy is a two-sided versus one-sided (left or right) test. The test statistic would remain the same; only the determination of p would change. We need to settle on what p-value is being supplied by the StudentT distribution so we can decide whether we need to change how it is calculated or convert it ourselves. It's up to Ritu and Adam to make this determination. For example, here is a nice description of converting between a right- or left-tailed test and a two-tailed test: https://stats.stackexchange.com/questions/267192/doubling-or-halving-p-values-for-one-vs-two-tailed-tests
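As a worked example of that conversion using the numbers above: the right-tailed value 0.031812 doubles to 2 × 0.031812 = 0.063624 for the two-tailed test, which matches R's reported 0.0636. |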
These are the p-values in R: |
From Ritu: log_prob() returns the probability density function. We need the cumulative distribution function (CDF) to calculate the p-value; the p-value for a 2-sided test would be 2 × the CDF of the value. From Kord: Based on Ritu's comments, the reference site above, and https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/StudentT, the pseudo code may be this:
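In sketch form, consistent with the formula the thread later settles on (names here are illustrative; student_t_cdf stands in for whatever CDF implementation is chosen):

```python
# Pseudo code: two-sided p-value from a Student T CDF.
def two_sided_p(t_stat, df):
    return 2 * student_t_cdf(-abs(t_stat), df)
```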
|
Ok. I am glad you are confident about the issue. Can you clarify what the `_stdtr_computation(df, t)` function below is doing?

```python
# Excerpt from tensorflow_probability internals; helpers such as ps,
# dtype_util, special, log1psquare, and _stdtr_asymptotic_expansion
# come from the surrounding TFP modules.
def _stdtr_computation(df, t):
  """Compute cumulative distribution function of Student T distribution."""
  dtype = dtype_util.common_dtype([df, t], tf.float32)
  numpy_dtype = dtype_util.as_numpy_dtype(dtype)
  half = numpy_dtype(0.5)
  one = numpy_dtype(1.)
  df, t = [tf.convert_to_tensor(param, dtype=dtype) for param in (df, t)]
  broadcast_shape = functools.reduce(
      ps.broadcast_shape, [ps.shape(df), ps.shape(t)])
  df, t = [tf.broadcast_to(param, broadcast_shape) for param in (df, t)]

  # For moderate df and relatively small t**2, or in case of large df, we use
  # asymptotic expansion [1, 2] to compute stdtr(df, t). The condition to use
  # it was specified by experimentation for np.float32 and was taken from
  # [2, page 249] for np.float64.
  if numpy_dtype == np.float32:
    use_asymptotic_expansion = (
        (df >= 10.) & (tf.math.square(t) < (16. * df - 5.))) | (df > 30.)
  else:
    use_asymptotic_expansion = (
        (df >= 100.) & (tf.math.square(t) < (0.1 * df - 5.))) | (df > 1000.)

  result = _stdtr_asymptotic_expansion(df, t, numpy_dtype)

  # Otherwise, we evaluate stdtr(df, t) using the regularized incomplete beta
  # function [3, page 323, equation 6.14.10]:
  #   stdtr(df, t) =
  #     0.5 * betainc(0.5 * df, 0.5, df / (df + t**2)),      when t < 0
  #     1. - 0.5 * betainc(0.5 * df, 0.5, df / (df + t**2)), when t >= 0
  # To avoid rounding error when t**2 is small compared to df, we compute the
  # ratio df / (df + t**2) in the logarithmic space. If ratio > 0.99, we then
  # use the following symmetry relation:
  #   betainc(a, b, x) = 1 - betainc(b, a, 1 - x) .
  raw_ratio = t * tf.math.rsqrt(df)
  ratio = tf.math.exp(-log1psquare(raw_ratio))
  one_minus_ratio = tf.math.exp(-log1psquare(tf.math.reciprocal(raw_ratio)))

  # The maximum value for the ratio was set by experimentation.
  use_symmetry_relation = (ratio > 0.99)
  half_df = half * df
  a = tf.where(use_symmetry_relation, half, half_df)
  b = tf.where(use_symmetry_relation, half_df, half)
  x = tf.where(use_symmetry_relation, one_minus_ratio, ratio)

  y = special.betainc(a, b, x)
  result_betainc = half * tf.where(use_symmetry_relation, one - y, y)

  # Handle the case (t >= 0).
  result_betainc = tf.where(t >= 0., one - result_betainc, result_betainc)

  result = tf.where(use_asymptotic_expansion, result, result_betainc)

  # Determine if df is out of range (should return NaN output).
  result = tf.where(df <= 0., numpy_dtype(np.nan), result)
  return result
```

Is this the right CDF function? If so, I will implement it in torch within … |
Given that we have still not found a torch library to provide us a CDF for the StudentT distribution, Ritu and I discussed looking at other distributions to potentially estimate one (e.g., Chi2). We found a discussion about adding distributions to Torch, with one developer releasing extensions for univariate distributions, and a description of other distributions currently available that we may (hopefully) find useful. Adam is away on holiday and we will discuss more with him next week. |
Ok. Let me know what your final decision is. Also note that, in the end, the probability function will likely be a custom implementation in tecpg rather than a torch implementation. This is because the function can be optimized by calculating certain constants before the main calculation (constant folding). This is what was done in #24, with massive performance improvements of around 120,739% faster total execution time. Therefore, a torch implementation of the function is not necessary; an implementation in tensorflow, or even just the matrix algebra, will be enough to form a tecpg implementation, as sketched below.
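To illustrate the constant-folding point (a sketch only; make_student_t_log_pdf is a hypothetical helper, not tecpg code): the Student T log-density's normalization term depends only on the degrees of freedom, so it can be computed once and reused for every t statistic.

```python
import math
import torch

def make_student_t_log_pdf(df: float):
    """Build a Student T log-density with its constants pre-folded."""
    # The normalization constant depends only on df, so fold it once
    # instead of recomputing it for every evaluated t statistic.
    const = (math.lgamma((df + 1.0) / 2.0)
             - math.lgamma(df / 2.0)
             - 0.5 * math.log(df * math.pi))

    def log_pdf(t: torch.Tensor) -> torch.Tensor:
        return const - 0.5 * (df + 1.0) * torch.log1p(t ** 2 / df)

    return log_pdf

log_pdf = make_student_t_log_pdf(df=96.0)
density = torch.exp(log_pdf(torch.tensor([1.8765793])))
# ~0.0693: the density value, matching the mt_p that tecpg reported.
```
|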
That is good to know. Can the cdf() function from scipy work? On a related note, a cdf() function is available for many distributions in torch. Given we are using the StudentT distribution in torch, will this cdf() work? |
The scipy CDF function works, with the form scipy.stats.distributions.t.cdf(t_stat, df). Many of the distributions available in torch's distributions module have CDFs; unfortunately, the StudentT distribution does not yet have this function. There was discussion about this in 2019: pytorch/pytorch#26334. The Wikipedia page on the StudentT distribution (https://en.wikipedia.org/wiki/Student%27s_t-distribution) defines the CDF in terms of the hypergeometric function 2F1:

F(t) = 1/2 + t × Γ((ν+1)/2) × 2F1(1/2, (ν+1)/2; 3/2; −t²/ν) / (√(πν) × Γ(ν/2))

This would be easily implementable in torch aside from one issue: torch lacks the hypergeometric function. Scipy has it: https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.hyp2f1.html#scipy.special.hyp2f1. Similarly to the scipy Student's T CDF, the implementation in C is hard to access. However, the documentation helpfully includes the definition of the hypergeometric function as a power series:

2F1(a, b; c; z) = Σ_{n=0..∞} (a)_n (b)_n / (c)_n × z^n / n!

where (q)_n is the rising factorial. I do not know if this definition would yield a reasonable speed when implemented in torch. One alternative would be to convert the T tensor to a numpy array and use scipy to compute the p-values. This would require moving from CUDA memory to CPU memory, which is probably exceedingly slow.
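As a sanity check of that formula (a sketch, not tecpg code), it can be evaluated with scipy.special.hyp2f1 and compared against scipy's own Student T CDF, using the 336-degrees-of-freedom example from later in this thread:

```python
import math
from scipy.special import hyp2f1
from scipy.stats import t as student_t

def t_cdf_via_hyp2f1(t_stat: float, df: float) -> float:
    """Student T CDF via the 2F1 hypergeometric function."""
    # Gamma ratio computed in log space to avoid overflow for large df.
    const = math.exp(math.lgamma((df + 1.0) / 2.0) - math.lgamma(df / 2.0))
    const /= math.sqrt(math.pi * df)
    return 0.5 + t_stat * const * hyp2f1(0.5, (df + 1.0) / 2.0, 1.5,
                                         -t_stat ** 2 / df)

print(t_cdf_via_hyp2f1(1.877, 336))  # ~0.96931
print(student_t.cdf(1.877, 336))     # ~0.96931
```
|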
After further research: pytorch/pytorch#50345 tracks the addition of various special functions, including the hypergeometric function and the incomplete beta integral, which would also aid in writing a torch Student's T CDF via the relation quoted in the tensorflow code above: stdtr(df, t) = 0.5 × betainc(0.5 × df, 0.5, df / (df + t²)) for t < 0, and one minus that value for t ≥ 0. Scipy, among other packages, uses cephes, a mathematical C library, for many functions, including the Student's T CDF, the hypergeometric function, and the incomplete beta integral (links go to cephes functions inside scipy.special: https://github.com/scipy/scipy/tree/main/scipy/special/cephes). Developers are working on a CDF for the Student's T distribution (pytorch/pytorch#26334) and the incomplete beta function (pytorch/pytorch#58700) in torch. Neither of these is complete, from my understanding. This comment, pytorch/pytorch#58700 (comment), even mentions how the incomplete beta function would be useful in calculating the CDF. Until the Student's T CDF, the hypergeometric function, or the incomplete beta function is implemented in torch and usable with CUDA, I do not see how we can compute the CDF. Let me know if you have any ideas.
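That betainc relation can likewise be checked with scipy (a sketch; scipy.special.betainc is the regularized incomplete beta function):

```python
from scipy.special import betainc
from scipy.stats import t as student_t

def t_cdf_via_betainc(t_stat: float, df: float) -> float:
    """Student T CDF through the regularized incomplete beta function."""
    x = df / (df + t_stat ** 2)
    half_tail = 0.5 * betainc(0.5 * df, 0.5, x)
    return half_tail if t_stat < 0 else 1.0 - half_tail

print(t_cdf_via_betainc(1.877, 336))  # ~0.96931
print(student_t.cdf(1.877, 336))      # ~0.96931
```
|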
I was playing around with trying to use TensorFlow Probability (following this guide: https://towardsdatascience.com/introduction-to-tensorflow-probability-6d5871586c0e) and was able to generate a CDF from a T distribution using only tensor library functions (see below). Do I misunderstand the cdf() function being available here versus it not being available (per Liam's comments above)? That said, I have no idea how this will work with tensors (i.e., not individual scores) or why the p-value calculation from the cdf value does not appear to be valid (i.e., >1.0).
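A minimal reconstruction of the kind of call described (a sketch; tfp.distributions.StudentT does expose cdf(), and the two-sided formula shown is the one the thread settles on):

```python
import tensorflow_probability as tfp

dist = tfp.distributions.StudentT(df=96.0, loc=0.0, scale=1.0)
t_stat = 1.8765793

cdf_value = dist.cdf(t_stat)            # P(T <= t_stat), ~0.968
p_value = 2.0 * dist.cdf(-abs(t_stat))  # two-sided p-value, ~0.064
```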
|
Given the large sample size, the t-test and the z-test should be essentially the same, so we should be able to use a Gaussian distribution to estimate the p-value. |
#26 (comment): In tensorflow, the Student's T distribution has a CDF function. Overall, tensorflow has slightly more functions than pytorch, including the hypergeometric function among others; torch does not yet have such functions. Converting a CUDA torch tensor to tensorflow is a very slow operation. Alternatively, tecpg could move to using tensorflow completely, but this migration could bring up unexpected issues due to the minor differences between these packages.

#26 (comment): I was wondering about this as well, as the Student's T distribution approximates the normal distribution for many degrees of freedom. The GTP data, for reference, has 340 samples, and thus 336 degrees of freedom. Using R's t statistic for methylation from the first comment, 1.877, and 336 degrees of freedom, we obtain the following values (calculations done in scipy):

```
normal_p = 0.9697409530685338
student_p = 0.9693077874772256
percent_error = 0.04468813692662252%
```

For 50 degrees of freedom (54 samples) and the same t statistic, we have:

```
normal_p = 0.9697409530685338
student_p = 0.9668205118453522
percent_error = 0.3020665353497092%
```

Are these errors reasonable? The fewer the degrees of freedom, the higher the error this approximation will yield.

Test code:

```python
import scipy.stats

normal_cdf = scipy.stats.distributions.norm.cdf
student_cdf = scipy.stats.distributions.t.cdf

t_stat = 1.877
df = 336

normal_p = normal_cdf(t_stat)
student_p = student_cdf(t_stat, df)
percent_error = abs((normal_p - student_p) / student_p * 100)

print(f'{normal_p = }')
print(f'{student_p = }')
print(f'{percent_error = }%')
```
|
p-value = 2 × cdf( - abs(x) ) |
Nice work! Let's move forward with using the Gaussian distribution with Ritu's p-value calculation and continue to use torch. We can revisit migrating to tensorflow in a future major release. We will make a note of the difference in the p-value estimate in the documentation and include Liam's differences for n=50 and n=300 so the user can make their own decision. |
I will work on implementing the p-value from the normal/Gaussian distribution CDF. Hopefully, torch will implement the Student's T CDF or one of its constituent functions before such a future major release, to avoid the migration. Regarding #26 (comment), p-value = 2 × (−abs(cdf(x))), wouldn't this guarantee that the p-value is negative, and thus out of the [0, 1] range? |
I will start working once the above comment is addressed. Is the two-sided p-value 2 × cdf(−abs(x))? |
Sorry for the typo in the previous message; the p-value would be 2 × cdf(−abs(x)). |
Implemented in a413de4. 0.01896% error rate with 296 degrees of freedom, tested against https://www.socscistatistics.com/pvalues/tdistribution.aspx.
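For reference, a minimal sketch of this normal-approximation p-value in torch (not necessarily the exact tecpg code; torch.special.ndtr is the standard normal CDF):

```python
import torch

def normal_p_value(t_stats: torch.Tensor) -> torch.Tensor:
    """Two-sided p-values via the standard normal approximation.

    Appropriate when the degrees of freedom are large enough that
    the Student T distribution is close to the standard normal.
    """
    return 2.0 * torch.special.ndtr(-t_stats.abs())

print(normal_p_value(torch.tensor([1.8765793])))  # ~0.0605
```
|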
Used "dummy" data consisting of 300 methylation loci, 300 expression genes and 100 samples.
Ran linear regression for each methylation - gene expression pair with the following model:
expression ~ methlylation + age +sex
Seems like "estimate" being returned is from the "intercept" term, "standard error" & "t_statistics" are from the "methylation" term. P-value does not match that of "methylation" term.
Output from tecpg:
gt_site mt_site mt_est mt_err mt_t mt_p
1 ILMN_001 cg001 25.55615 10.433939 1.8765793 0.06933532
Output from R:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.5561 11.5837 2.206 0.0298 *
meth 19.5802 10.4339 1.877 0.0636 .
age 0.2658 0.2071 1.284 0.2024
sex 5.4035 5.7596 0.938 0.3505
Residual standard error: 28.67 on 96 degrees of freedom
Multiple R-squared: 0.05588, Adjusted R-squared: 0.02638
F-statistic: 1.894 on 3 and 96 DF, p-value: 0.1357