
Response to review comments on testEdgeThresh.R #37

Closed · FarzanT opened this issue Mar 23, 2017 · 13 comments

Comments


FarzanT commented Mar 23, 2017

Comments are tagged


hyginn commented Mar 24, 2017

Here are the major concerns I have at this point.
(1) I think you are misunderstanding what the delta value means. Please refer back to the task page and then describe it in your own words. In that description, you need to define ord and Lmax.

(1b) Once you have worked this out, tell me specifically whether a graph of order < Lmax is invalid.

(2) Your description of a swap of bidirected edges appears wrong. Please describe clearly how bidirected edges are to be swapped.

(3) I have requested an answer to the question of whether the algorithm needs to find simply connected components or strongly connected components. Answer, and justify.

(4) You are varying Q in your tests. I need you to explain why.

(5) You are claiming that the range of delta should be between 0 and 1. Why do you think that?

(6) The considerations about disconnecting the graph may be misplaced. It is prohibitively expensive to test connectedness during edge-swapping, or to test for bridges, but it is trivial to reconnect components at the end of the permutation if any have been produced.

When you are done, reopen the issue.

hyginn closed this as completed Mar 24, 2017

FarzanT commented Mar 24, 2017

  • When considering each permutation, for each size of the largest strongly connected component ord (which increases from 2 to Lmax, since Lmax is an upper limit for the size of components to be queried), delta, the threshold for edge removal, starts from a small value. For each ord, edges that have weights below the current delta are removed, and the remaining graph is queried for the size of its largest strongly connected component. If it is larger than the current ord, keep increasing delta and query again. When the largest strongly connected component has size equal to ord, stop and record that delta. Do this for each ord from 2 to Lmax, per permutation.

    • You start from a small value of delta, and then increase it in small steps: will the size of the largest connected component increase, or decrease?

    • "For each size of the largest largest strongly connected component" ... is that your loop iterator?

    • "For each ord" is that your loop iterator?

    • It sounds more like the right thing, but not quite right yet. Write it out as a loop.

      • @hyginn @RicardoRam This is what I have for the loops I was trying to describe above. The component size should decrease as the threshold increases, since the graph is becoming less connected:
# Initialize the master list that will hold one delta vector per permuted graph
masterDeltaList <- vector("list", N)

for (i in 1:N) {
       # Permute graph with given Q
       sprintf("Creating permutation %d of %d", i, N)
       currentGraph <- .permuteGraph(EGG, Q)

       # Set vector to hold deltas, initial delta, maxComponentSize (ord),
       # and the index j into graphDeltas
       graphDeltas <- vector("numeric", Lmax - 1)
       currentDelta <- 0.01
       maxComponentSize <- 2
       j <- 1

       # For each value of maxComponentSize (ord):
       while (maxComponentSize < Lmax) {
           # Find edges that have weights lower than current delta and delete
           edgesToRemove <- igraph::E(currentGraph)[igraph::E(currentGraph)$weight < currentDelta]
           currentGraph <- igraph::delete.edges(currentGraph, edgesToRemove)

           # Find the size of the largest strongly connected component
           currentLargest <-
               max(igraph::components(currentGraph, "strong")$csize)

           # If the largest component still has size at least maxComponentSize,
           # keep incrementing δ; otherwise record δ, advance maxComponentSize
           # (ord) by 1, and reset δ
           if (currentLargest >= maxComponentSize) {
               currentDelta <- currentDelta + 0.01
           } else {
               graphDeltas[j] <- currentDelta
               j <- j + 1
               maxComponentSize <- maxComponentSize + 1
               currentDelta <- 0.01
           }
       }

       # Append this graph's deltas to the master list of deltas
       masterDeltaList[[i]] <- graphDeltas
   }
        • Rather than answer my question, you are writing code. The point I am making is that your design is wrong and your code has the process backwards.
  • A graph with order less than Lmax is not invalid, but an ord greater than the total order of the graph can never equal the size of the largest connected component, no matter how low delta is.

    • What do you mean, "is not invalid, but..."? Do you need to reject the graph or not, and what happens to your algorithm for collecting delta values?

      • @hyginn @RicardoRam I mean that the graph is still a valid graph if it follows our standards, but even with a delta threshold of 0 the largest strongly connected component can never equal such an ord, so, as the loop is set up above, it will never stop iterating...

        • because your design is wrong.
  • Since bidirected edges are to be swapped with each other, when an edge is selected it is first checked whether it is a bidirected edge (against a hashed structure). Then another bidirected edge is randomly selected from a pool of recognized bidirected edges. Only if the pair involves 4 distinct vertices, one vertex from each edge is selected, and these form a new bidirected edge. The remaining vertices also form a new bidirected edge. The new edges replace the ones in the pool of bidirected edges.

    • This description is still not correct since it implies that (t, u), (v, w) -> (u, t), (w, v) is a valid swap.

      • @hyginn @RicardoRam No, "one vertex from each edge is selected, and these form a new bidirected edge", t and u are in the same edge, so that swap will not happen.

        • OK, this is technically correct even though it includes an unnecessary step (choosing which vertex to swap). I misunderstood what you had written.
    • "checked against a hashed structure..." See, this is the problem: you are so invested in premature development decisions that you still come up with the wrong data structure, even when you are reviewing the algorithm. If you are working with a pool of bidirected edges, there is no need to check whether they are bidirected. You selected them that way when you put them in the pool. You just randomly sample() rows from a table of vertex pairs, each of which designates an edge. Algorithm: Let N be the number of bidirected edges in the pool. If N > 2: Let nSwap <- 0. while (nSwap < (Q*N)) select two bidirected edges (t, u), (v, w). If t ≠ u ≠ v ≠ w: replace (t, u), (v, w) with (v, u), (t, w) and increment nSwap.

      • @hyginn @RicardoRam No, as you have mentioned before, we are not going to swap bidirected edges first or last, but randomly. Thus I need a way to know whether the currently sampled edges are both bidirected, only one is bidirected, or neither is bidirected. I am not going to "select two bidirected edges", as this goes against your own comment, here: "Why do you swap bidirected first? You can just swap randomly from both categories. This is a stochastic swap, not iterating over the edges."
  • Based on our design, the algorithm that finds deltas will only look at the size of the largest strongly connected component, and compare that to ord. During edge swapping, it is ensured that the resulting graph is still weakly connected.

    • you are not answering the question. The question is not about "our design", but about what it should do and why. "Needs" here means: what is the objective. It does not mean "what is implied by the algorithm".

      • @hyginn @RicardoRam Based on my understanding, we are permuting the graph to find deltas that allow us to separate subcomponents in the FINDSUB step based on significance. i.e. random permutations do not result in such an arrangement of vertices in the graph. I think "weakly" connected components do not allow distinguishing randomness from significant subcomponents (well, the definition of subcomponents relies on being strongly connected, right?). Thus I think we should check the size of the strongly connected components during the "delta-finding" step.

        • The short answer is: threshold finding must use exactly the same definition of components as subnet discovery. (In fact integration analysis of the code – later – will show that it should use the same core function.)
  • I vary Q in the tests to ensure that the .permuteGraph() function can handle differing Q values (especially larger than 100 and smaller than 10): generating valid permutations with Q = 100 does not mean that the algorithm will not behave differently with Q = 1. Maybe I'm being unreasonably cautious.

    • Under what assumption could any of the tests that you are proposing have a different outcome with a different value of Q? Tell me. What is the significance of Q = 100 anyway? Why 100? Tell me that too.

      • @hyginn @RicardoRam I don't think there is anything special about Q = 100; each graph probably needs a certain number of swaps to make it "very" random (within our rules), depending on its size and arrangement. Q = 100 is based on the observation that Q * |E| swaps make a good random graph in similar experiments.

        • You are not answering my first question.
  • From my understanding, delta is the threshold of edge weights, and since edge weights are values between 0 and 1, the threshold should also be within the same range.

    • Where do these "edge weights" come from?

      • @hyginn @RicardoRam Well, I thought I shouldn't worry about that, as that is what DIFFUSE does, correct? "Edge weights are extracted from the diffusion matrix after post-multiplying with the heat-vector as a diagonal matrix, and collected as edge attributes in the EGG output graph object". I thought that this results in (normalized?) values between 0 and 1, am I wrong?

        • You are saying not to worry about that, but you are making strong (and erroneous) assumptions about the allowed range.
  • I was thinking of finding the bridges first and simply disallowing an edge swap between two bridges, as well as between a unidirectional bridge pointing towards the articulation point and another edge. I'm certain these two cases work. There is no need to find bridges at each swap.

    • No. This is just all wrong. First of all, we are swapping edges, so we need to think about 2-connected graphs, not bridges (draw out how you can disconnect a bidirected 6-cycle with a single swap, even though there are no bridges). Secondly, identifying bridges in the graph helps you nothing, since they change at every swap. Assume you have two sets of strongly connected vertices A and B with four edges connecting the sets: (A1,B1), (B2,A2), (A3,B3), (B4,A4). Swap the first pair to get (A1,A2), (B2,B1). This leaves the graph only 2-connected, and you can disconnect it with the next swap. This shows that you would need to check for pairs of 2-connecting edges after every swap; the set of such 2-connecting edges is on the order of |E|^2 anyway. Most importantly: do you even have an algorithm that will work under the circumstances of swapping bidirected edges en bloc (which actually is a 4-edge swap), i.e. one that would need to find 2-, 3-, or 4-connected subgraphs depending on what type the edges are? I am not sure this can be done in less than O(|E|^5) times the number of swaps you are performing, which makes it O(|E|^6). Ah. Yes - still polynomial. No worries.

      • @hyginn @RicardoRam I agree that your suggestion to reconnect the graph after the swaps is better, but I still cannot find a way to maintain node degree distributions this way. How can this be done?

        • Edge swapping guarantees conservation of degree. This is why we swap in the first place.
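To make the swap mechanics discussed above concrete, here is a minimal igraph sketch (the bidirected 6-cycle and the chosen edge pair are purely illustrative and not part of rete): a single bidirected swap, applied to both directions of the two edges, conserves every vertex degree yet splits the cycle into two strongly connected components.

# A bidirected 6-cycle: each undirected edge is represented by two
# opposing directed edges (1<->2, 2<->3, ..., 6<->1).
cyc <- igraph::make_graph(c(1,2, 2,1, 2,3, 3,2, 3,4, 4,3,
                            4,5, 5,4, 5,6, 6,5, 6,1, 1,6),
                          directed = TRUE)

degBefore <- igraph::degree(cyc, mode = "all")

# Swap the bidirected edges (1,2) and (4,5) into (1,5) and (4,2), i.e.
# (t,u), (v,w) -> (t,w), (v,u), applied to both directions of each edge.
# Edge IDs 1, 2, 7, 8 are 1->2, 2->1, 4->5, 5->4 in construction order.
swapped <- igraph::delete_edges(cyc, c(1, 2, 7, 8))
swapped <- igraph::add_edges(swapped, c(1,5, 5,1, 4,2, 2,4))

all(igraph::degree(swapped, mode = "all") == degBefore)  # TRUE: degrees conserved
igraph::components(cyc, mode = "strong")$no              # 1: the cycle is connected
igraph::components(swapped, mode = "strong")$no          # 2: one swap disconnected it

This is exactly the point made in the comments above: edge swapping guarantees conservation of degree, but not connectedness, which is why reconnecting components at the end is preferred over testing connectedness during swapping.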


FarzanT commented Mar 24, 2017

@hyginn I cannot reopen this issue, so I will open another one.


hyginn commented Mar 24, 2017

Commented.


FarzanT commented Mar 26, 2017

@hyginn OK, I guess I have a working algorithm; I had hoped that my questions above would have been answered by now. Anyway, I'll submit a pull request for a semi-final review.


hyginn commented Mar 26, 2017

I have added comments to your comments. We are not making enough progress in the discussion. There are fundamental flaws in your design. Do not continue further development. I will write pseudocode for you to work with.


FarzanT commented Mar 26, 2017

@hyginn Would you tell me which part of the process I have backwards? Incrementing delta (threshold), and having the size of the largest connected component decrease gradually? Which part of my design is wrong? I was trying to follow the design on the task page.


FarzanT commented Mar 26, 2017

@hyginn This is what I had in mind; the delta values decrease gradually, as seen in Leiserson's paper.
I hope this is similar to what you had in mind.

    # Create a list of deltas for each permuted graph
    masterDeltaList <- vector("list", length = N)

    for (i in 1:N) {
        # Permute graph with given Q
        cat(sprintf("Creating permutation %d of %d\n", i, N))
        currentGraph <- .permuteGraph(EGG, Q)
        cat(sprintf("\nCalculating deltas for graph %d of %d\n", i, N))

        # Set vector to hold deltas, initial delta, maxComponentSize (ord)
        graphDeltas <- vector("numeric", Lmax - 1)
        currentDelta <- 0.0005
        maxComponentSize <- 2
        j <- 1
        k <- 0
        testGraph <- currentGraph

        # For each value of maxComponentSize (ord):
        while (maxComponentSize <= Lmax) {

            # Find edges that have weights lower than current delta and delete
            edgesToRemove <- igraph::E(testGraph)[igraph::E(testGraph)$weight < currentDelta]
            testGraph <- igraph::delete.edges(testGraph, edgesToRemove)

            # Find the size of the largest strongly connected component
            currentLargest <-
                max(igraph::components(testGraph, "strong")$csize)

            # If the largest component still has size at least maxComponentSize,
            # keep incrementing δ; otherwise record δ, advance maxComponentSize
            # (ord) by 1, reset δ, and restart from the unthresholded graph
            if (currentLargest >= maxComponentSize) {
                currentDelta <- currentDelta + 0.0005
            } else {
                .pBar(k, Lmax - 1, 80)
                # Reset and increase ord
                graphDeltas[j] <- currentDelta
                maxComponentSize <- maxComponentSize + 1
                currentDelta <- 0.0005
                j <- j + 1
                k <- k + 1
                testGraph <- currentGraph
            }
        }

        # Append this graph's deltas to the master list of deltas
        masterDeltaList[[i]] <- graphDeltas
    }

    # (By how much should δ be incremented at each step?)

    # After generating N vectors of size (Lmax - 1), find the median δ for each
    # ord (maxComponentSize), i.e. Lmax - 1 median values.
    deltaOrd <- vector("numeric", Lmax - 1)
    for (i in 1:(Lmax - 1)) {
        deltaOrd[i] <- median(sapply(masterDeltaList, FUN = "[", i))
    }

    # Attach the vector as graph metadata to the original EGG and return.

    igraph::graph_attr(EGG, "delta") <- deltaOrd
    return(EGG)


hyginn commented Mar 26, 2017

Here is the algorithm in Pseudocode. Comment here if you have questions, otherwise transfer to your rete fork as required and start designing (not writing!) the required tests. Make sure you understand the code and that it implements the requirements. Bonus marks for finding significant errors, but don't get distracted by that.

# Thresh algorithm
#
# Note: "connected" in this algorithm always means "strongly connected"
#
# Input: EGG    weighted, directed graph
#        Lmax   largest order of largest connected subnetwork
#               to be considered
#        Lmin   smallest order of largest connected subnetwork (Default 2)
#        N      Number of permuted graphs to use
#        Q      Q * |E| is the number of edge-swaps to be performed
#               to consider an input graph randomized
#
# let W be the vector of edge-weights of EGG
#
# determine Lmax as the requested Lmax or the order of EGG
#     let Lmax be min(Lmax, |V| of EGG)
#
# let deltaDF be a data frame with N rows and (Lmax - Lmin + 1) columns to store
#    delta values for each permuted graph in rows, delta values for each
#    order of the largest connected component in columns. Initialize with NA
#
# Determine steps to increase delta. Many ways to do this, probably most
# efficient in this context is to use step sizes that are determined by the
# distribution of edge weights. How many steps? min(|E|, 100) seems reasonable
# i.e. if we have fewer than 100 edges, we remove one edge for each step in
# weight-sorted order:
#    Let nQuant be min(|E|, 100)
#    Let stepsDelta be the nQuant quantiles of W i.e. a vector of
#       delta values in which adjacent values bracket approximately
#       equal numbers of edge weights.
#
# Principle: we successively remove edges of the permuted graph G according
#            to the values in stepsDelta. After each removal, we determine ord,
#            the order of the largest connected component. Since ord must change
#            monotonically with removal of edges, we need to remove edges in
#            steps of delta only once to cover the range between Lmin and
#            Lmax
#
# calcDeltas(EGG, Lmin, Lmax, N, Q)
# for (i in 1:N) {
#     G <- permute(EGG, Q)
#     for (j in 1:nQuant) {
#         remove edges with W_edge < stepsDelta[j]  Note: < ensures NO edges
#                                                           are removed in the
#                                                           first iteration
#         let ord be the order of the largest connected component
#         if (ord < Lmin) break    #  i.e. break if the graph has no more
#                                  #     connected components of interest
#         if (ord < Lmax) {
#            deltaDF[i, ord] <- stepsDelta[j]  # i.e. store the current delta
#                                              # for ord in the i_th graph
#         }
#      }
#  }
#  Interpolate NA values row-wise in deltaDF (i.e. values in which a
#      step of delta has skipped an ord)
#  Let deltaOrd be the column-means of deltaDF
#  return deltaDF
#
#
# permute(G, Q)
#     let E1 be the set of unidirected edges, E2 the set of
#         bidirected edges of G:
#     if (|E1| > 4) {
#         for (i in 1:(Q*|E1|))  {
#            pick two swappable edges (t,u), (v,w) at random from E1
#            swap edges to (t,w), (v,u)
#         }
#     }
#     if (|E2| > 4) {
#         for i in 1:(Q*|E2|) {
#            pick two swappable bidirected edges (t,u), (v,w) at random from E2
#            swap both edges to (t,w), (v,u)
#         }
#     }
# return G
#
#
# pickSwappable(E)
#   (x, y) <- sample(1:|E|, 2) # initialize
#   let (t, u, v, w) be the connected vertices of E[x], E[y]
#   while(unique(t, u, v, w) < 4) {
#       (x, y) <- sample(1:|E|, 2)
#       let (t, u, v, w) be the connected vertices of E[x], E[y]
#   }
#   return(x, y)
#
#
#
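A minimal R sketch of the quantile step described above, assuming EGG is an igraph object whose edges carry a numeric weight attribute (as elsewhere in this thread); the variable names mirror the pseudocode, but the snippet is illustrative rather than part of rete:

# Quantile-based delta steps: nQuant values of delta such that adjacent
# values bracket approximately equal numbers of edge weights.
W <- igraph::E(EGG)$weight
nQuant <- min(igraph::ecount(EGG), 100)
stepsDelta <- stats::quantile(W,
                              probs = seq(0, 1, length.out = nQuant),
                              names = FALSE)

# stepsDelta[1] == min(W), so thresholding with "< stepsDelta[1]" removes
# no edges in the first iteration, as noted above.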


hyginn commented Mar 26, 2017

Pseudocode and instructions added.

hyginn closed this as completed Mar 26, 2017

FarzanT commented Mar 26, 2017

@hyginn From the design page: "The algorithm computes delta values for all subnetwork orders from 2 to Lmax." So this is where I had it wrong? I thought I had to increase ord from 2 to Lmax (or Lmin to Lmax, based on the recent change) and capture the delta values that create a largest connected component of size ord; I guess I was mistaken. And yes, that was my loop iterator.


hyginn commented Mar 26, 2017

I have updated the pseudocode.
Was:

# pickSwappable(E)
#   (E[x], E[y]) <- sample(E, 2) # initialize
#   let (t, u, v, w) be the connected vertices of E[x], E[y]
#   while(unique(t, u, v, w) < 4) {
#       (E[x], E[y]) <- sample(E, 2)
#       let (t, u, v, w) be the connected vertices of E[x], E[y]
#   }
#   return(E[x], E[y])

Corrected to:

# pickSwappable(E)
#   (x, y) <- sample(1:|E|, 2) # initialize
#   let (t, u, v, w) be the connected vertices of E[x], E[y]
#   while(unique(t, u, v, w) < 4) {
#       (x, y) <- sample(1:|E|, 2)
#       let (t, u, v, w) be the connected vertices of E[x], E[y]
#   }
#   return(x, y)


FarzanT commented Mar 26, 2017

@hyginn
Input:

ORD2      ORD3      ORD4 ORD5 ORD6    ORD7 ORD8 ORD9 ORD10    ORD11     ORD12 ORD13     ORD14
1   NA 0.7083058 0.6161198   NA   NA 0.56619   NA   NA    NA 0.533328 0.4948098    NA 0.4994613
  ORD15 ORD16     ORD17 ORD18 ORD19 ORD20
1    NA    NA 0.4238037    NA    NA    NA

Output of .interpolateNA(unlist(masterDataFrame[1,]), 0, 1)

ORD2      ORD3      ORD4      ORD5      ORD6      ORD7      ORD8      ORD9     ORD10 
0.3541529 0.7083058 0.6161198 0.5994766 0.5828333 0.5661900 0.5579745 0.5497590 0.5415435 
    ORD11     ORD12     ORD13     ORD14     ORD15     ORD16     ORD17     ORD18     ORD19 
0.5333280 0.4948098 0.4971355 0.4994613 0.4742421 0.4490229 0.4238037 0.5678528 0.7119019 
    ORD20 
0.8559509 

As shown, the values fluctuate after ORD13.
Sorry for the delay.
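.interpolateNA() is a rete-internal helper, but a generic sketch of the same idea, namely padding the vector with the two boundary values and interpolating linearly across the NAs with stats::approx() (an assumption about what the helper's second and third arguments mean), reproduces the values listed above:

# Hypothetical stand-in for .interpolateNA(); illustration only.
interpolateNASketch <- function(x, lo, hi) {
    padded <- c(lo, x, hi)                 # pad with the boundary values
    idx    <- seq_along(padded)
    known  <- !is.na(padded)
    filled <- stats::approx(x = idx[known], y = padded[known], xout = idx)$y
    filled[2:(length(padded) - 1)]         # drop the padding again
}

# interpolateNASketch(unlist(masterDataFrame[1, ]), 0, 1)

Under this reading, the bump around ORD13 comes from ORD14 being slightly larger than ORD12, and the rise after ORD17 comes from interpolating towards the upper boundary value of 1.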
