Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

propose nodes to merge non-uniformly #60

Closed
koadman opened this issue Nov 12, 2012 · 19 comments
Closed

propose nodes to merge non-uniformly #60

koadman opened this issue Nov 12, 2012 · 19 comments

Comments

@koadman
Copy link
Contributor

koadman commented Nov 12, 2012

At early generations there are a very large number O(n^2) of possible node pairs to merge and a very small number of good merge pairs O(1). This causes most of the particles produced in early generations to have low likelihood.

Rather than proposing joins uniformly we could bias the proposal toward joins that produce a high LL under the natural forest extension. In general these will be nodes with short branches in the true tree. One way to obtain a proposal distribution is to calculate all internode distances from a FastTree guide tree and use the inverse of distances to generate a sampling distribution for leaf pairs. At each merge, a pair could be sampled. If they are part of different trees in the forest, their trees could be joined at the root. If they belong to the same tree, zero or more additional proposals could be made until a valid merge is found. If a valid merge is not found, this proposal could fall back to the uniform merge.

@koadman
Copy link
Contributor Author

koadman commented Nov 13, 2012

Worked on this a bit today. As of now, using the guided merge pair proposal gives:

Iter 1 max ll -41894.4 ESS: 8.49452
Iter 2 max ll -40501.4 ESS: 3.56237
Iter 3 max ll -39115.4 ESS: 4.6236
Iter 4 max ll -37910.1 ESS: 2.69429
Iter 5 max ll -36668.2 ESS: 1
Iter 6 max ll -35306.1 ESS: 1
Iter 7 max ll -34157.2 ESS: 1.11207
Iter 8 max ll -33017 ESS: 5.60864
Iter 9 max ll -31912.5 ESS: 1.70446
Iter 10 max ll -30840.6 ESS: 4.44762
Iter 11 max ll -29781.8 ESS: 1.97459
Iter 12 max ll -28655.3 ESS: 1.00026
Iter 13 max ll -27495.1 ESS: 1.14868
Iter 14 max ll -26416.5 ESS: 1.02748
Iter 15 max ll -25527.5 ESS: 1.10675
Iter 16 max ll -24689.3 ESS: 1.85626
Iter 17 max ll -23912.5 ESS: 1.2003
Iter 18 max ll -23170.8 ESS: 3.9946
Iter 19 max ll -22328.7 ESS: 1.45438
Iter 20 max ll -21630.5 ESS: 3.80698
Iter 21 max ll -20991.3 ESS: 2.15786
Iter 22 max ll -20419.7 ESS: 5.05691
Iter 23 max ll -19891.8 ESS: 5.35881
Iter 24 max ll -19583.4 ESS: 4.08674
Iter 25 max ll -19240.8 ESS: 1.79467
Iter 26 max ll -18881.7 ESS: 4.1033
Iter 27 max ll -18718.4 ESS: 22.7285
Iter 28 max ll -18625.4 ESS: 13.8119
Iter 29 max ll -18625.4 ESS: -nan

And the uniform pair proposal is giving:

Iter 1 max ll -41894.5 ESS: 1
Iter 2 max ll -40501.9 ESS: 1.00174
Iter 3 max ll -39182.1 ESS: 1
Iter 4 max ll -37752.4 ESS: 1
Iter 5 max ll -36510.4 ESS: 1.05639
Iter 6 max ll -35375.7 ESS: 1
Iter 7 max ll -34236.1 ESS: 1.78139
Iter 8 max ll -33032.8 ESS: 1
Iter 9 max ll -31927.8 ESS: 1.07705
Iter 10 max ll -30856.2 ESS: 2.9304
Iter 11 max ll -29799.8 ESS: 1
Iter 12 max ll -28691.2 ESS: 1.99718
Iter 13 max ll -27533.4 ESS: 1.00028
Iter 14 max ll -26441.2 ESS: 2.73419
Iter 15 max ll -25556.7 ESS: 3.19573
Iter 16 max ll -24717.1 ESS: 1.15374
Iter 17 max ll -23941 ESS: 1
Iter 18 max ll -23197 ESS: 1.04459
Iter 19 max ll -22371.6 ESS: 1
Iter 20 max ll -21677.6 ESS: 2.93049
Iter 21 max ll -21051.2 ESS: 2.55918
Iter 22 max ll -20498.5 ESS: 3.27221
Iter 23 max ll -19970.6 ESS: 5.26381
Iter 24 max ll -19666.7 ESS: 1.79627
Iter 25 max ll -19327.7 ESS: 7.2991
Iter 26 max ll -18975 ESS: 3.33462
Iter 27 max ll -18810.4 ESS: 20.1285
Iter 28 max ll -18697.5 ESS: 13.7451
Iter 29 max ll -18697.5 ESS: -nan

This is with --bl-opt-steps 5 in both cases.

@matsen
Copy link
Contributor

matsen commented Nov 13, 2012

Interesting.

I do think that topology (rather than distance) guided proposals have some
advantages, because they explicitly encode the sequence of events that have
to occur to get a winning topology.

I think the current formulation in smctex is correct, but will check it
again today and let you know. Then we can talk about if implementation is
practical.

2012/11/12 Aaron Darling notifications@github.com

Worked on this a bit today. As of now, using the guided merge pair
proposal gives:

Iter 1 max ll -41894.4 ESS: 8.49452
Iter 2 max ll -40501.4 ESS: 3.56237
Iter 3 max ll -39115.4 ESS: 4.6236
Iter 4 max ll -37910.1 ESS: 2.69429
Iter 5 max ll -36668.2 ESS: 1
Iter 6 max ll -35306.1 ESS: 1
Iter 7 max ll -34157.2 ESS: 1.11207
Iter 8 max ll -33017 ESS: 5.60864
Iter 9 max ll -31912.5 ESS: 1.70446
Iter 10 max ll -30840.6 ESS: 4.44762
Iter 11 max ll -29781.8 ESS: 1.97459
Iter 12 max ll -28655.3 ESS: 1.00026
Iter 13 max ll -27495.1 ESS: 1.14868
Iter 14 max ll -26416.5 ESS: 1.02748
Iter 15 max ll -25527.5 ESS: 1.10675
Iter 16 max ll -24689.3 ESS: 1.85626
Iter 17 max ll -23912.5 ESS: 1.2003
Iter 18 max ll -23170.8 ESS: 3.9946
Iter 19 max ll -22328.7 ESS: 1.45438
Iter 20 max ll -21630.5 ESS: 3.80698
Iter 21 max ll -20991.3 ESS: 2.15786
Iter 22 max ll -20419.7 ESS: 5.05691
Iter 23 max ll -19891.8 ESS: 5.35881
Iter 24 max ll -19583.4 ESS: 4.08674
Iter 25 max ll -19240.8 ESS: 1.79467
Iter 26 max ll -18881.7 ESS: 4.1033
Iter 27 max ll -18718.4 ESS: 22.7285
Iter 28 max ll -18625.4 ESS: 13.8119
Iter 29 max ll -18625.4 ESS: -nan

And the uniform pair proposal is giving:

Iter 1 max ll -41894.5 ESS: 1
Iter 2 max ll -40501.9 ESS: 1.00174
Iter 3 max ll -39182.1 ESS: 1
Iter 4 max ll -37752.4 ESS: 1
Iter 5 max ll -36510.4 ESS: 1.05639
Iter 6 max ll -35375.7 ESS: 1
Iter 7 max ll -34236.1 ESS: 1.78139
Iter 8 max ll -33032.8 ESS: 1
Iter 9 max ll -31927.8 ESS: 1.07705
Iter 10 max ll -30856.2 ESS: 2.9304
Iter 11 max ll -29799.8 ESS: 1
Iter 12 max ll -28691.2 ESS: 1.99718
Iter 13 max ll -27533.4 ESS: 1.00028
Iter 14 max ll -26441.2 ESS: 2.73419
Iter 15 max ll -25556.7 ESS: 3.19573
Iter 16 max ll -24717.1 ESS: 1.15374
Iter 17 max ll -23941 ESS: 1
Iter 18 max ll -23197 ESS: 1.04459
Iter 19 max ll -22371.6 ESS: 1
Iter 20 max ll -21677.6 ESS: 2.93049
Iter 21 max ll -21051.2 ESS: 2.55918
Iter 22 max ll -20498.5 ESS: 3.27221
Iter 23 max ll -19970.6 ESS: 5.26381
Iter 24 max ll -19666.7 ESS: 1.79627
Iter 25 max ll -19327.7 ESS: 7.2991
Iter 26 max ll -18975 ESS: 3.33462
Iter 27 max ll -18810.4 ESS: 20.1285
Iter 28 max ll -18697.5 ESS: 13.7451
Iter 29 max ll -18697.5 ESS: -nan

This is with --bl-opt-steps 5 in both cases.


Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-10317880.

Frederick "Erick" Matsen, Assistant Member
Fred Hutchinson Cancer Research Center
http://matsen.fhcrc.org/

@koadman
Copy link
Contributor Author

koadman commented Nov 13, 2012

Well, I'm not sure the distance guide is implemented correctly just yet, and even if it is, it's not clear that its much better than uniform merging. 50LL units could be within the realm of normal run-to-run variation. One thing I'm noticing is that even though ESS is a little higher at the first few iters with the guided prop, that seems to get wiped out by iter 5. This could be that the bias distribution fails to adapt in the current implementation, or could be that distance bias is no longer useful after the first few joins, or a bit of both.

The topology bias would surely guide proposal of internal nodes and anywhere that short internal branches make choosing close pairs of nodes to merge based on distance ambiguous or misleading.

But I think for the first few iters under the natural forest extension, the topology guide only reduces our O(n^2) possible pairs problem to somewhere between O(n) and O(1) depending on topology while the distance guide may get us a little closer to O(1). Maybe the real underlying problem is the natural forest extension??

@matsen
Copy link
Contributor

matsen commented Nov 13, 2012

Hm, interesting.

As I think you already note, the thing that scares me about a distance
guide are collections of closely related triplets, quadruplets, etc, where
if we want to inject some randomness we might kick these distances into the
wrong regime. This has to do with the fact that distances are only pairwise
rather than 3-wise.

I'll also review and flesh out the smctex section on simulations on the
natural forest extension.

On Tue, Nov 13, 2012 at 8:02 AM, Aaron Darling notifications@github.comwrote:

Well, I'm not sure the distance guide is implemented correctly just yet,
and even if it is, it's not clear that its much better than uniform
merging. 50LL units could be within the realm of normal run-to-run
variation. One thing I'm noticing is that even though ESS is a little
higher at the first few iters with the guided prop, that seems to get wiped
out by iter 5. This could be that the bias distribution fails to adapt in
the current implementation, or could be that distance bias is no longer
useful after the first few joins, or a bit of both.

The topology bias would surely guide proposal of internal nodes and
anywhere that short internal branches make choosing close pairs of nodes to
merge based on distance ambiguous or misleading.

But I think for the first few iters under the natural forest extension,
the topology guide only reduces our O(n^2) possible pairs problem to
somewhere between O(n) and O(1) depending on topology while the distance
guide may get us a little closer to O(1). Maybe the real underlying problem
is the natural forest extension??


Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-10331414.

Frederick "Erick" Matsen, Assistant Member
Fred Hutchinson Cancer Research Center
http://matsen.fhcrc.org/

@koadman
Copy link
Contributor Author

koadman commented Nov 13, 2012

Do you think the distance approach is going to prevent the correct merge from being selected in such cases? Or will it just be less informative/efficient than a topology-guide?

@matsen
Copy link
Contributor

matsen commented Nov 13, 2012

Less informative/efficient.

I just made a pass through the smctex manuscript and I'm happy with the
sections on topology-guided proposals and the local->global simulations.
The former may seem like a pipe dream, but I have some ideas about how to
make it more efficient that I'm working out now. But if you have an
alternative proposal let it rip!

On Tue, Nov 13, 2012 at 9:14 AM, Aaron Darling notifications@github.comwrote:

Do you think the distance approach is going to prevent the correct merge
from being selected in such cases? Or will it just be less
informative/efficient than a topology-guide?


Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-10334164.

Frederick "Erick" Matsen, Assistant Member
Fred Hutchinson Cancer Research Center
http://matsen.fhcrc.org/

@koadman
Copy link
Contributor Author

koadman commented Dec 9, 2012

w00t, only 100 log units to go baby!

koadman@winnebago:~/git/sts$ ./_build/sts --bl-opt-steps 10 -t data/thirty.tree data/thirty.ma > bobochan
Iter 1 max ll -41894.4 ESS: 85.9111
Iter 2 max ll -40501.3 ESS: 45.0118
Iter 3 max ll -39115.4 ESS: 20.2245
 Iter 4 max ll -37751.1 ESS: 21.7675
Iter 5 max ll -36509.1 ESS: 71.5553
Iter 6 max ll -35303.8 ESS: 49.1422
Iter 7 max ll -34153.5 ESS: 4.23615
Iter 8 max ll -33013.3 ESS: 57.4791
Iter 9 max ll -31909.7 ESS: 16.37
Iter 10 max ll -30837.9 ESS: 16.1731
Iter 11 max ll -29779 ESS: 10.7576
Iter 12 max ll -28652.2 ESS: 7.66589
Iter 13 max ll -27477.3 ESS: 12.9671
Iter 14 max ll -26342.3 ESS: 10.2697
Iter 15 max ll -25455 ESS: 18.2079
Iter 16 max ll -24616.6 ESS: 7.25548
Iter 17 max ll -23840.4 ESS: 4.17219
Iter 18 max ll -23105 ESS: 12.967
Iter 19 max ll -22260.5 ESS: 6.36874
Iter 20 max ll -21545.6 ESS: 30.6595
Iter 21 max ll -20940.2 ESS: 27.5663
Iter 22 max ll -20374.5 ESS: 6.36076
Iter 23 max ll -19846.6 ESS: 2.9915
Iter 24 max ll -19536.4 ESS: 9.97693
Iter 25 max ll -19208.6 ESS: 12.6592
Iter 26 max ll -18854.1 ESS: 9.56112
Iter 27 max ll -18690.2 ESS: 37.3598
Iter 28 max ll -18562.5 ESS: 14.7043
Iter 29 max ll -18562.5 ESS: 624.238

@matsen
Copy link
Contributor

matsen commented Dec 10, 2012

... and check out those ESS's!

2012/12/9 Aaron Darling notifications@github.com

w00t, only 100 log units to go baby!

koadman@winnebago:~/git/sts$ ./_build/sts --bl-opt-steps 10 -t data/thirty.tree data/thirty.ma > bobochan
Iter 1 max ll -41894.4 ESS: 85.9111
Iter 2 max ll -40501.3 ESS: 45.0118
Iter 3 max ll -39115.4 ESS: 20.2245
Iter 4 max ll -37751.1 ESS: 21.7675
Iter 5 max ll -36509.1 ESS: 71.5553
Iter 6 max ll -35303.8 ESS: 49.1422
Iter 7 max ll -34153.5 ESS: 4.23615
Iter 8 max ll -33013.3 ESS: 57.4791
Iter 9 max ll -31909.7 ESS: 16.37
Iter 10 max ll -30837.9 ESS: 16.1731
Iter 11 max ll -29779 ESS: 10.7576
Iter 12 max ll -28652.2 ESS: 7.66589
Iter 13 max ll -27477.3 ESS: 12.9671
Iter 14 max ll -26342.3 ESS: 10.2697
Iter 15 max ll -25455 ESS: 18.2079
Iter 16 max ll -24616.6 ESS: 7.25548
Iter 17 max ll -23840.4 ESS: 4.17219
Iter 18 max ll -23105 ESS: 12.967
Iter 19 max ll -22260.5 ESS: 6.36874
Iter 20 max ll -21545.6 ESS: 30.6595
Iter 21 max ll -20940.2 ESS: 27.5663
Iter 22 max ll -20374.5 ESS: 6.36076
Iter 23 max ll -19846.6 ESS: 2.9915
Iter 24 max ll -19536.4 ESS: 9.97693
Iter 25 max ll -19208.6 ESS: 12.6592
Iter 26 max ll -18854.1 ESS: 9.56112
Iter 27 max ll -18690.2 ESS: 37.3598
Iter 28 max ll -18562.5 ESS: 14.7043
Iter 29 max ll -18562.5 ESS: 624.238


Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-11176594.

Frederick "Erick" Matsen, Assistant Member
Fred Hutchinson Cancer Research Center
http://matsen.fhcrc.org/

@koadman
Copy link
Contributor Author

koadman commented Dec 10, 2012

yeah, the ESS really did come up. i guess the proposals are almost all equally good (or equally bad) under the natural forest extension. :^)

@koadman
Copy link
Contributor Author

koadman commented Dec 10, 2012

Eked out another 40 log units with c700db8 :
Iter 29 max ll -18519.7 ESS: 398.772

@matsen
Copy link
Contributor

matsen commented Dec 11, 2012

@koadman -- AG has implemented a nice way of visualizing particle history in sts-log-graph branch. I'd be curious about what your runs look like.

@koadman
Copy link
Contributor Author

koadman commented Dec 11, 2012

good idea! trying to do it but having some trouble with the grapher:

koadman@winnebago:~/git/sts$ python ./python/generate_graph.py bobo10.log bobograph
Traceback (most recent call last):
  File "./python/generate_graph.py", line 29, in <module>
    main()
  File "./python/generate_graph.py", line 27, in main
    graph.write_png(sys.argv[2])
  File "/usr/lib/pymodules/python2.7/pydot.py", line 1602, in <lambda>
    lambda path, f=frmt, prog=self.prog : self.write(path, format=f, prog=prog))
  File "/usr/lib/pymodules/python2.7/pydot.py", line 1696, in write
    dot_fd.write(self.create(prog, format))
  File "/usr/lib/pymodules/python2.7/pydot.py", line 1740, in create
    self.write(tmp_name)
  File "/usr/lib/pymodules/python2.7/pydot.py", line 1694, in write
    dot_fd.write(self.to_string())
  File "/usr/lib/pymodules/python2.7/pydot.py", line 1452, in to_string
    graph.append( node.to_string()+'\n' )
  File "/usr/lib/pymodules/python2.7/pydot.py", line 722, in to_string
    node_attr.append( attr + '=' + quote_if_necessary(value) )
TypeError: cannot concatenate 'str' and 'int' objects

I posted the log file in question here:
http://edhar.genomecenter.ucdavis.edu/~koadman/misc/bobo10.log.bz2

This is based on a fork of master from a while back, has the log format changed?

@matsen
Copy link
Contributor

matsen commented Dec 11, 2012

Did you just grab that script?

You will need to merge in that branch in order to use it. There are some
minor changes in the format.

In fact, merging that branch into master then merging master into your
branch would seem sage.

On Tue, Dec 11, 2012 at 7:32 AM, Aaron Darling notifications@github.comwrote:

good idea! trying to do it but having some trouble with the grapher:

koadman@winnebago:~/git/sts$ python ./python/generate_graph.py bobo10.log bobograph
Traceback (most recent call last):
File "./python/generate_graph.py", line 29, in
main()
File "./python/generate_graph.py", line 27, in main
graph.write_png(sys.argv[2])
File "/usr/lib/pymodules/python2.7/pydot.py", line 1602, in
lambda path, f=frmt, prog=self.prog : self.write(path, format=f, prog=prog))
File "/usr/lib/pymodules/python2.7/pydot.py", line 1696, in write
dot_fd.write(self.create(prog, format))
File "/usr/lib/pymodules/python2.7/pydot.py", line 1740, in create
self.write(tmp_name)
File "/usr/lib/pymodules/python2.7/pydot.py", line 1694, in write
dot_fd.write(self.to_string())
File "/usr/lib/pymodules/python2.7/pydot.py", line 1452, in to_string
graph.append( node.to_string()+'\n' )
File "/usr/lib/pymodules/python2.7/pydot.py", line 722, in to_string
node_attr.append( attr + '=' + quote_if_necessary(value) )
TypeError: cannot concatenate 'str' and 'int' objects

I posted the log file in question here:
http://edhar.genomecenter.ucdavis.edu/~koadman/misc/bobo10.log.bz2

This is based on a fork of master from a while back, has the log format
changed?


Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-11247912.

Frederick "Erick" Matsen, Assistant Member
Fred Hutchinson Cancer Research Center
http://matsen.fhcrc.org/

@cmccoy
Copy link
Contributor

cmccoy commented Dec 11, 2012

I think Erick's right about needing more than the script...it works on our
setup:

http://cl.ly/image/322W47053h0Y/bobo10.png

On Tue, Dec 11, 2012 at 8:06 AM, Erick Matsen notifications@github.comwrote:

Did you just grab that script?

You will need to merge in that branch in order to use it. There are some
minor changes in the format.

In fact, merging that branch into master then merging master into your
branch would seem sage.

On Tue, Dec 11, 2012 at 7:32 AM, Aaron Darling notifications@github.comwrote:

good idea! trying to do it but having some trouble with the grapher:

koadman@winnebago:~/git/sts$ python ./python/generate_graph.py
bobo10.log bobograph
Traceback (most recent call last):
File "./python/generate_graph.py", line 29, in
main()
File "./python/generate_graph.py", line 27, in main
graph.write_png(sys.argv[2])
File "/usr/lib/pymodules/python2.7/pydot.py", line 1602, in
lambda path, f=frmt, prog=self.prog : self.write(path, format=f,
prog=prog))
File "/usr/lib/pymodules/python2.7/pydot.py", line 1696, in write
dot_fd.write(self.create(prog, format))
File "/usr/lib/pymodules/python2.7/pydot.py", line 1740, in create
self.write(tmp_name)
File "/usr/lib/pymodules/python2.7/pydot.py", line 1694, in write
dot_fd.write(self.to_string())
File "/usr/lib/pymodules/python2.7/pydot.py", line 1452, in to_string
graph.append( node.to_string()+'\n' )
File "/usr/lib/pymodules/python2.7/pydot.py", line 722, in to_string
node_attr.append( attr + '=' + quote_if_necessary(value) )
TypeError: cannot concatenate 'str' and 'int' objects

I posted the log file in question here:
http://edhar.genomecenter.ucdavis.edu/~koadman/misc/bobo10.log.bz2

This is based on a fork of master from a while back, has the log format
changed?


Reply to this email directly or view it on GitHub<
https://github.com/metatangle/sts/issues/60#issuecomment-11247912>.

Frederick "Erick" Matsen, Assistant Member
Fred Hutchinson Cancer Research Center
http://matsen.fhcrc.org/

Reply to this email directly or view it on GitHubhttps://github.com//issues/60#issuecomment-11249457.

@koadman
Copy link
Contributor Author

koadman commented Dec 11, 2012

thanks for the graph link. any ideas on how it compares to others? i'll work on the merges...

@koadman
Copy link
Contributor Author

koadman commented Dec 11, 2012

pretty sobering that all but 1 final generation particles in that graph have a common ancestor only 3 generations ago. dear great grandma: your particles need recombination!

@cmccoy
Copy link
Contributor

cmccoy commented Dec 11, 2012

Seems a lot better than master: http://cl.ly/image/3d203R1D281e/thirty.master.png

@matsen
Copy link
Contributor

matsen commented Dec 11, 2012

FYI, Aaron, http://cl.ly/image/3d203R1D281e/thirty.master.png was called
"the pushbroom" by @habnabit. We will probably use that terminology.

@koadman
Copy link
Contributor Author

koadman commented Dec 11, 2012

I love it!

@matsen matsen closed this as completed Jun 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants