Incremental indexing missing from generate-names.pl v1.11.0 algorithm #478

Closed
rdhayes opened this Issue May 19, 2014 · 15 comments

Contributor

rdhayes commented May 19, 2014

Incremental indexing has been broken since the v1.11.0 release. I'm now testing whether incremental indexing with a v1.10.12 release codebase fixes the problems we have been seeing with keyword searching.

Specifically, the code that calls Bio::JBrowse::Cmd::IndexNames->_mergeIndexEntries() when generate-names.pl is run with the --incremental flag is missing from minified release v1.11.0 and newer, as well as from the master branch. Incremental indexing was originally introduced in Bio::JBrowse::Cmd::IndexNames at lines 103-109 (in the load() method) in this commit from 6 months ago:
c04241f

Rob updated the indexing code to use a single new algorithm about a month later (December 2013). When this change was made, the call to

$self->name_store->stream_set(
    $self->make_key_value_stream( $operation_stream ),
    $self->{stats}{key_count},
    ( $self->opt('incremental')
        ? sub {
              return $self->_mergeIndexEntries( @_ );
          }
        : ()
    )
);

became

$self->name_store->stream_do(
    $operation_stream,
    sub {
        my ( $operation, $data ) = @_;
        my %fake_store = ( $operation->[0] => $data );
        $self->do_hash_operation( \%fake_store, $operation );
        return $fake_store{ $operation->[0] };
    },
    $self->{stats}{operation_stream_estimated_count},
);

There is no longer any check for $self->opt('incremental') apparent in Bio::JBrowse::Cmd::IndexNames or Bio::JBrowse::HashStore that would trigger the necessary call to $self->_mergeIndexEntries( @_ ), the step that performs the actual merge.

I will try adding the missing logic to Bio::JBrowse::HashStore->stream_do() as a start towards building a fix that reintroduces incremental indexing.
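For what it's worth, a rough sketch of what I have in mind follows. This is untested, and the argument list for _mergeIndexEntries() is an assumption carried over from the old stream_set() call above, so treat it as a starting point rather than a fix:

# Hypothetical sketch only -- inside Bio::JBrowse::Cmd::IndexNames::load().
$self->name_store->stream_do(
    $operation_stream,
    sub {
        my ( $operation, $data ) = @_;

        # When running with --incremental, fold any pre-existing index
        # entry for this key into $data before applying the new operation.
        if ( $self->opt('incremental') ) {
            $data = $self->_mergeIndexEntries( $operation->[0], $data );
        }

        my %fake_store = ( $operation->[0] => $data );
        $self->do_hash_operation( \%fake_store, $operation );
        return $fake_store{ $operation->[0] };
    },
    $self->{stats}{operation_stream_estimated_count},
);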

@rdhayes rdhayes self-assigned this May 19, 2014

@cmdcolin cmdcolin added this to the 1.11.5 milestone May 22, 2014


Contributor

cmdcolin commented Jul 25, 2014

I am still not sure of the solution to this, but I think it's important that this gets done.

The thing that I saw that is surprising to me is that the code seems like it is effectively hashing the new files incrementally, but the output is not being written to disk. Effectively, Bio::JBrowse::HashStore::Bucket's DESTROY provides the "write to disk" function for the hash buckets, and it is part of the Tie::Hash custom implementation (i.e. DESTROY is not called explicitly anywhere; it is invoked implicitly through Tie::Hash).

Also, Bio::JBrowse::HashStore's STORE marks new hash data with a dirty bit telling it to be written to disk (also via Tie::Hash). When I use the incremental algorithm, somewhere along the line the dirty bits are NOT being set, even though it seems like it builds the new hash tree... so either the data is not being hashed correctly, or not merged correctly, or something of that nature.

That's just some notes from investigating, hope we can get this fixed.
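To illustrate the mechanism (this is a toy sketch, not JBrowse's actual classes): with Tie::Hash, every assignment into the tied hash goes through STORE, which is the natural place to set a dirty flag, and DESTROY fires implicitly when the tie goes away, which is where the flush to disk would happen:

package Toy::Bucket;
# Toy stand-in for Bio::JBrowse::HashStore::Bucket; assumes nothing about
# the real implementation beyond what is described above.
use strict;
use warnings;
use Tie::Hash;
our @ISA = ('Tie::StdHash');

sub STORE {
    # Every assignment into the tied hash lands here -- the hook where a
    # dirty flag gets set.
    my ( $self, $key, $value ) = @_;
    $self->{__dirty} = 1;      # toy: metadata kept in the same hash
    $self->{$key}    = $value;
}

sub DESTROY {
    # Never called explicitly; Perl invokes it when the tied object goes
    # out of scope. A real bucket would serialize itself to disk here.
    my ($self) = @_;
    print "flushing bucket to disk\n" if delete $self->{__dirty};
}

package main;
tie my %bucket, 'Toy::Bucket';
$bucket{feature1} = 'ctgA:100..200';    # goes through STORE, sets the flag
untie %bucket;                          # triggers DESTROY -> flush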


Contributor

rdhayes commented Jul 25, 2014

Thanks for looking into this. I think I'll have time this afternoon to investigate a bit myself.

Richard D. Hayes, Ph.D.
Joint Genome Institute / Lawrence Berkeley National Lab
http://phytozome.jgi.doe.gov



Contributor

cmdcolin commented Jul 25, 2014

I found that basically you can manually set the dirty bit after the "do_operation" in HashStore.pm and it "fixes" the problem, but I'm not sure if this is appropriate (in fact I'm pretty sure it is not... the STORE method should be invoked somewhere to set the dirty bit instead of manually setting it this way).

@@ -224,13 +226,16 @@ sub stream_do {
         while ( my $rec = eval { Storable::fd_retrieve( $log_fh ) } ) {
             my ( $hex, $op ) = @$rec;
             my $bucket = $self->_getBucketFromHex( $hex );
             $bucket->{data}{$op->[0]} = $do_operation->( $op, $bucket->{data}{$op->[0]} );
+            $bucket->{dirty} = 1;

If you want to take a crack at this, that would be great! I guess the problem might be in the do_hash_operation type code in the IndexNames.pm module instead.
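If that reading is right, the schematic difference looks like this (illustrative only; the names mirror the hunk above, and %name_store stands in for a hash tied to the HashStore class):

# Through the tied interface: assignment triggers STORE, which is what
# normally sets the dirty bit as a side effect.
$name_store{$key} = $value;

# Inside stream_do(): the bucket's plain hashref is mutated directly,
# bypassing the tie, so nothing marks the bucket dirty and DESTROY later
# skips the write-back. Hence the manual flag:
$bucket->{data}{ $op->[0] } = $do_operation->( $op, $bucket->{data}{ $op->[0] } );
$bucket->{dirty} = 1;    # mark it ourselves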

Contributor

cmdcolin commented Jul 27, 2014

@rdhayes: would you be able to "performance test" the one-line change from my previous comment? It might be the best we can do to restore functionality.

Notes from my testing:

  • I tried initializing a new names hash with generate-names.pl, then adding one track with one GFF feature and reindexing it using --incremental (explicitly specifying the --tracks parameter). This causes many of the buckets to get reinitialized (3562/4096), but many of the reinitialized buckets are identical except for slight shuffling of the JSON structure. Not sure how bad this is for performance, but the incremental function does work in this case.
  • The parameter --workdir is not used, AFAIK.
  • I am not even sure the _mergeIndexEntries step is needed. The merging seems to happen "auto-magically" in the code already.

Contributor

rdhayes commented Jul 29, 2014

The fact that you see incremental updating working when the dirty bit is always on is encouraging. I was afraid this was a more fundamental algorithm problem from the changes made in 1.11.0.

I agree that the _mergeIndexEntries sub looks like code from previous implementations that wasn't intended for use going forward. The only obvious call to this sub was removed in this set of commits: https://github.com/GMOD/jbrowse/commit/77f006898561b18c4c091978cc1a5f8429744650 (see the change to src/perl5/Bio/JBrowse/Cmd/IndexNames.pm).

Currently, the --incremental flag is only checked to determine whether to delete an existing hash store, replacing the existing bucket files completely. The HashStore attribute "empty" is set as the inverse of the incremental parameter, so I can use that later on.

My impressions below should be taken at face value. I don't have much experience with perltie, which is controlling the "magic" here if I'm reading things correctly. According to perldoc, STORE is called when a tied hash key is assigned a value.

I am currently thinking the call to _getBucketFromHex() on line 226 of HashStore.pm is involved here. This in turn calls _readBucket(), which is the only code that directly interacts with potentially interesting name store files, I believe. On line 401, the "dirty" bit is set only if no file for this bucket already exists; otherwise the data is deserialized and returned. But if no new data is added to this bucket, STORE is never called, so Bio::JBrowse::HashStore::Bucket::DESTROY never acts as we want.

I think this is the source of the bug. The dirty bit is only set if there is no existing data, or when new data is set by assigning newly indexed data to a name_store hex key. We need to set dirty => 1 for existing data, too, to preserve the original existing bucket data.

I've made a small update to _readBucket() that conditionally sets dirty to 0 or 1 for existing data, a situation where it is not set at all currently.

*** HashStore.pm.orig	Mon Jul 28 17:21:17 2014
--- HashStore.pm	Mon Jul 28 17:23:41 2014
*************** sub _readBucket {
*** 405,410 ****
--- 405,411 ----
                      JSON::from_json( scalar <$in> )
                  }
              } || {}
+             , dirty => ( $self->{empty} ) ? 0 : 1
          )
        : ( data => {}, dirty => 1 )
    ));

After a few test runs with multiple tracks added individually in various orders, I'll report back.

Richard D. Hayes, Ph.D.
Joint Genome Institute / Lawrence Berkeley National Lab
http://phytozome.jgi.doe.gov



Contributor

cmdcolin commented Jul 29, 2014

I see what you mean, but I don't see the need for setting the dirty bit on buckets when loading them?


Contributor

rdhayes commented Jul 29, 2014

I admit that I might be turned around on this.

The dirty bit, the trigger to write the bucket hash to disk, is only set when:

  1. the tied hash is updated, as would happen when adding new data to a bucket in the current run
  2. the bucket has no existing file path

For runs without --incremental, HashStore->empty() is always called before any action is taken, so there are no longer any existing files to attempt to keep. For runs with --incremental, it's not clear if existing data is retained properly when there is no new data to add to a bucket, as the dirty bit currently is not set in this condition.

I haven't looked at what a malformed incremental index looks like in a while, but I believe the JSON only contained records for the last track added, while the meta.json file always contained all track names that were added.

Richard D. Hayes, Ph.D.
Joint Genome Institute / Lawrence Berkeley National Lab
http://phytozome.jgi.doe.gov



Contributor

cmdcolin commented Jul 29, 2014

> For runs with --incremental, it's not clear if existing data is retained properly when there is no new data to add to a bucket, as the dirty bit currently is not set in this condition.

I'm not sure if that concern is necessary. The code won't remove any existing name stores if --incremental is set, and if they aren't being updated then I'm not sure the dirty flag needs to be set.


Contributor

rdhayes commented Jul 29, 2014

Perhaps. I will have a better idea after looking at my test runs today.

How do you explain, then, that universally setting the dirty bit solved the issue in your testing over the weekend?

Richard D. Hayes, Ph.D.
Joint Genome Institute / Lawrence Berkeley National Lab
http://phytozome.jgi.doe.gov



Contributor

cmdcolin commented Jul 29, 2014

In my patch, I don't think it's necessarily "universally" setting the flag. I am setting it after all do_hash_operation calls, which is sort of a proxy for the bucket being updated. As I said, my patch isn't perfect (it still doesn't use the STORE method, and it possibly updates buckets unnecessarily, e.g. the 3562/4096 for a single-line GFF), but it does fix the behavior. Your added patch doesn't cause problems, but it isn't necessary to set the dirty bit on things unless they are being changed.
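One possible refinement, assuming do_hash_operation returns a new structure rather than mutating the old one in place: compare the value before and after the operation, and only mark the bucket dirty when something actually changed, which might avoid some of those 3562/4096 rewrites. Untested sketch with toy stand-ins for the stream_do() variables:

use strict;
use warnings;
use Storable qw(freeze);

$Storable::canonical = 1;    # stable key order, so equal structures freeze identically

# Toy stand-ins for the bucket/operation variables inside stream_do().
my $bucket       = { data => { gene1 => [ [ 'ctgA', 0 ] ] }, dirty => 0 };
my $op           = ['gene1'];
my $do_operation = sub { my ( $operation, $old_val ) = @_; return $old_val };    # no-op here

my $old = $bucket->{data}{ $op->[0] };
my $new = $do_operation->( $op, $old );
$bucket->{data}{ $op->[0] } = $new;

# Only mark the bucket dirty when the stored value actually changed.
$bucket->{dirty} = 1
    unless defined $old
        && defined $new
        && freeze( \$old ) eq freeze( \$new );

print $bucket->{dirty} ? "would rewrite bucket\n" : "bucket untouched\n";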


Contributor

cmdcolin commented Aug 1, 2014

Any updates? Still hashing :)?


Contributor

rdhayes commented Aug 13, 2014

This month filled up quickly. I believe I can finish testing this out by the end of this week, though.

Richard D. Hayes, Ph.D.
Joint Genome Institute / Lawrence Berkeley National Lab
http://phytozome.jgi.doe.gov



Contributor

cmdcolin commented Aug 18, 2014

Thanks for the follow-up. If we get this --incremental flag sorted out, can we also sort out the fact that --workdir is no longer used? I feel like it should either (a) be removed from the documentation, or (b) be fixed so that --workdir can be used.

@cmdcolin cmdcolin closed this in 5432aab Aug 25, 2014


Contributor

cmdcolin commented Aug 25, 2014

I closed this issue with my proposed fix, and I added a short test for generate-names.pl that exercises the incremental function by updating a partial name store.

Let me know if there are any additional concerns.


Contributor

cmdcolin commented Aug 26, 2014

There is a small typo in the test of the incremental names: 32d3c53. I'll try to fix that shortly.
