Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How exactly does the p-value calculation work? #87

Open
MikiSchikora opened this issue Dec 20, 2021 · 5 comments
Open

How exactly does the p-value calculation work? #87

MikiSchikora opened this issue Dec 20, 2021 · 5 comments

Comments

@MikiSchikora
Copy link

Good morning,

Thanks for developing this great tool. I am trying to understand more deeply how the phyC and syncronous methods work to be sure that it can be used on my data.

Reading your paper I came across this sentence (in the PhyC section): "For permutation tests to determine the significance of associations genotype transitions are randomized on the tree with probability proportional to the branch length. The number of edges where the permuted genotype mutation intersects with phenotype presence edges is recorded for each permutation;". My first question is:

Q1) Does the syncronous test also randomize genotype transitions with probability proportional to the branch length? This is not clear to me from the paper, although looking at the code I think that this is the case. Could you confirm this?

In addition, I think that I do not fully understand how exactly is this randomization done. After going over your code, I understood that, in each resample, you permute the genotype transitions across nodes without replacement, with an uneven probability that is proportional to the branch length of each node. This means that in each resample the array of permuted transitions will have the values from longer branches at the beginning (since they are picked first). I understand that you then map these resampled transitions to an array of nodes (each with a given phenotype transition), which will be used to calculate the empirical distribution of N. I have a question about this:

Q2) How exactly do you map this array of resampled transitions to the array of nodes (which I assume it is always the same)? Is it possible that you sort the array of nodes by edge length (in a descending way), so that the longer edges will more likely have resampled transitions from long edges?

I would really appreciate if you may clarify these questions. Thanks.

Miquel Àngel Schikora

@katiesaund
Copy link
Owner

katiesaund commented Jan 17, 2022

Hi Miquel,

Q1)
Yes, you are correct: the synchronous test also randomizes genotype transitions with probability proportional to the branch length. I use the same p-value function discrete_calculate_pvals for both PhyC and the synchronous test. That function calls the function discrete_permutation. My apologies that the text of the paper is unclear on this point.

Q2)
This function assigns genotype transitions with probability proportional to the branch lengths in the following block of code (index = the integer corresponding the position of the genotype vector in the list of genotype vectors).

 for (j in 1:num_perm) {
    permuted_geno_trans_mat[j, ] <-
      sample(1:number_hi_conf_edges[index],
             size = number_edges_with_geno_trans[index],
             replace = FALSE,
             prob = tr$edge.length[high_conf_edges[[index]]] /
               sum(tr$edge.length[high_conf_edges[[index]]]))
  }

So you are absolutely correct that "in each resample, you permute the genotype transitions across nodes without replacement, with an uneven probability that is proportional to the branch length of each node."

I'm not confident I understand the rest of your question but I will try to answer:

The next step in the algorithm (Line 346) is to record in a matrix which tree edges were assigned to have these randomized genotype transitions. In this matrix, redistributed_hits, each column corresponds to a tree edge and each row corresponds to a permutation. A value of 1 means the tree edge is a randomized genotype transition, 0 means no transition.

The next step is to compare redistributed_hits (the matrix of randomized genotypes) to the phenotype matrix (either phenotype presence/absence in phyc or phenotype transitions in the synchronous test). This is done in the function count_empirical_both_present on line 247. This function counts the number of times the randomized genotype is on the same edge as the phenotype (presence/absence in phyc or transitions in the synchronous test).

If I didn't answer your question, please be welcome reopen the issue to ask follow ups or rephrase. Please let me know if you have other questions or run into bugs.

All the best,
Katie

@MikiSchikora
Copy link
Author

Good morning,

Thanks for the answer, it clarifies many things that will be useful to me.

I am still curious about how permuted_geno_trans_mat is ordered. I understand that each column of permuted_geno_trans_mat corresponds to a high-confidence node in the tree. How is the order of these nodes (columns) decided?

I ask this because the first columns (nodes) will tend to have the genotype transition values (1 or 0) from longer branches (since the the sampling is without replacement and biased by branch length). I wonder if this could generate some bias in some trees with variable branch lengths, which is the case for my data. For example If the order comes from traversing the tree, the nodes firstly traversed will tend to have the genotype transition values from long branches right? Would this be a correct modelling of the null hypothesis?

I thought that one way to correct for such a bias (and maybe you are already doing it) could be to sort the columns by the edge length of the corresponding node (in a descending way), so that the longer edges will more likely have resampled genotype transitions from long edges. Is this the case? If not, would this make sense to you?

Thanks for your time,

Miquel Àngel Schikora

@katiesaund katiesaund reopened this Jan 21, 2022
@katiesaund
Copy link
Owner

Thank you for restating the question, Miquel! Your observation has made it clear to me that hogwash could calculate a far more accurate null hypothesis!

As currently written this line:
sample(1:number_hi_conf_edges[index]
does, as you pointed out, select from only the earliest branches in the tree as ordered by default in ape. This is not a desired behavior because although it subsets the tree to a correct number of edges, it does not subset to only the edges with high confidence genotype reconstructions.

I appreciate the solution you proposed. Here's the solution I am currently leaning towards:
sample(which(as.logical(high_conf_edges[[index]]))
Where index is the current genotype and high_conf_edges is a vector of 0s and 1s (0 = low confidence edge, 1 = high confidence edge). This new code would result in the sample of edges taken only from the high confidence tree edges no matter where they happened to be ordered on the tree. Does this proposed solution make sense to you? Would this be a satisfactory resolution to the bias you noticed in the current calculation of the null distribution? I'd appreciate your thoughts and concerns! I'll work on some plots that demonstrate how the new code (as I'm currently envisioning it) would fix the problem and post here.

@katiesaund
Copy link
Owner

I made a quick (and ugly) plot to demonstrate the differences between the current version of the null distribution generation and the version I proposed as the fix.

I generated some test data and ran either the current or prosed version of the sampling step to generate the null distribution. Below are plots of the null distribution generated from one test genotype. In both plots the y axis = count and x axis = the tree edge (in the ape default order). A black "x" indicates the edges with dummy low confidence genotype ancestral reconstruction and a grey caret indicates hi confidence. Orange circles are the expected values of the null distribution given the probability of choosing each edge based on relative edge length when low confidence edges are excluded.

Current version:
image

Proposed new version:
image

Code to generate these two figures is here.

@MikiSchikora
Copy link
Author

Good morning,

Thanks for this quick response. I think that your improvement is great, so that (if I understand correctly) you only re-sample the values of high-confidence nodes. However, I still wonder how this default ordering of the columns by appearance in the tree (in permuted_geno_trans_mat ) could affect the results. I understand that this is not changed in this PROPOSED NEW VERSION rigth? I think that I did not phrase my question properly, and I will illustrate it with an example.

For example, imagine the tree like the one below:

Screenshot_20220121_112504

The names of the nodes are 1-6 (ordered as they appear in the tree), and 1 or 0 indicates whether the node has a genotype transition (GT). I understand that permuted_geno_trans_mat would have columns for the nodes 2 - 6 (ordered 2,3,4,5,6), and node 1 would be discarded because it has a long branch.

In most resamples, node 2 (which picks the first resampled GT) will get a GT of 1 coming from node 4 (which has the longer branch and it has the highest probability of getting picked first). However, if you use another ordering (i.e.: changing the type of tree traversal method) the results will be very different, and the first column will mostly get the value of node 4. I understand that the point of re-sampling with a probability proportional to branch length is to prioritize the GT values of longer branches right? The problem I see with the current method is that the GT values of longer branches will mostly be assigned to the first nodes in the columns, and I maybe this can generate biased null distributions.

I thought that one way to correct for such a bias could be to sort the columns by the edge length of the corresponding node (in a descending way), so that the longer edges will more likely have resampled GTs from long edges. This would be a way that short branches get the GT values of short branches and vice-versa. Would this make sense to you?

Thanks,

Miquel Àngel Schikora

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants