Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use compute.FilterRecord instead of bitmask and manually filtering #672

Closed
garrensmith opened this issue Jan 4, 2024 · 10 comments
Closed
Labels

Comments

@garrensmith
Copy link
Contributor

A lot of the filters use a bitmask and then manually filtering the record batch. I think this could be simplified by using the built in compute.FilterRecord to do all the work. I'm not 100% if it would improve performance or not, but I think it should give a memory improvement.

@thorfour thorfour reopened this Jan 16, 2024
@thorfour
Copy link
Contributor

Sorry, apparently fat fingered the close button.

I played around with this today

 func filter(ctx context.Context, pool memory.Allocator, filterExpr BooleanExpression, ar arrow.Record) (arrow.Record, bool, error) {
      bitmap, err := filterExpr.Eval(ar)
      if err != nil {
          return nil, true, err
      }

      if bitmap.IsEmpty() {
          return nil, true, nil
      }

      // Construct filter array
      // NOTE: this is intermediary right now. The Eval function should return a boolean array instead so we can directly pass it in.
      bldr := array.NewBooleanBuilder(pool)
      defer bldr.Release()
      for i := 0; i < int(ar.NumRows()); i++ {
          bldr.Append(bitmap.Contains(uint32(i)))
      }
      filterArr := bldr.NewArray()
      defer filterArr.Release()

      result, err := compute.FilterRecordBatch(ctx, ar, filterArr, compute.DefaultFilterOptions())
      if err != nil {
          return nil, true, err
      }
      return result, false, nil
  }

And ended up with a panic

--- FAIL: Test_DB_All (0.00s)
    db_test.go:2085:
        	Error Trace:	/Users/thor/go/src/github.com/polarsignals/frostdb/db_test.go:2085
        	Error:      	Received unexpected error:
        	            	not implemented: function 'array_take' has no kernel matching input types (dictionary<values=utf8, indices=uint32, ordered=false>, uint16)
        	Test:       	Test_DB_All

Looks like it's not implemented for all types.

@garrensmith
Copy link
Contributor Author

That is unfortunate. Should we open an issue on apache arrow for this?

@thorfour
Copy link
Contributor

Yea and link it here so we know if/when we can actually implement this.

@gernest
Copy link
Contributor

gernest commented Jan 20, 2024

Yea and link it here so we know if/when we can actually implement this.

No need for this. We can in fact implement it now. Basically, this is how filter on records works

  • build array of indices containing rows you wan't to choose
  • for each record column use array_take to select rows of interest
  • assemble result of previous step (concurrently)
  • build a new record with the result.

So, actually @thorfour solution is correct, we just need to be smart with array.Dictionary columns. We can be smart, and take and assemble ourselves when we know we have *array.Dictionary column and delegate to compute for the rest.

So, @thorfour can you please open the PR with your changes ? I will help and make sure we massage it until it works for any case we currently have, we will probably need to run benchmarks as well to make sure we don't introduce regressions.

@gernest
Copy link
Contributor

gernest commented Jan 20, 2024

I took another look, we already have arrowutils.ReorderRecord which takes care of all the case I mentioned. We can go back to @thorfour sample and build array.Int32 instead of *array.Boolean from the bitmap. Pretty much calling (*Bitmap).ToArray() will give you the indices or we can avoid allocation and iterate on the bitmap to build the indices array.

Something like

	b := array.NewInt32Builder(...)
	b.Reserve(int(bitmap.GetCardinality()))
	bitmap.Iterate(func(x uint32) bool {
		b.Append(int32(x))
		return true
	})
	indices := b.NewArray()

So steps will be

  • *Bitmap -> *array.Int32
  • call arrowutils.ReorderRecord with the result
  • Profit

@gernest
Copy link
Contributor

gernest commented Jan 21, 2024

Quick check says it is not a simple change. I'm taking this task. Will submit supplementary patches to make it possible.

gernest added a commit to gernest/frostdb that referenced this issue Jan 21, 2024
This is supplementary patch needed for polarsignals#672. I am submitting it separately to
simplify reviewing.
gernest added a commit to gernest/frostdb that referenced this issue Jan 21, 2024
The result arrow.Record has the same number of rows as the the input indices
array.

This is part of polarsignals#672
gernest added a commit to gernest/frostdb that referenced this issue Jan 21, 2024
 `r` and `indices` are externally managed resources. Owning them increases ref
 count ,since we never releases them they will leak when `r` has no dictionary field.

 There is no need to own these resources inside `Take`.

 This is part of polarsignals#672
@gernest
Copy link
Contributor

gernest commented Jan 21, 2024

😓 finally I have this working. I will wait for supplementary patches to land then I will drop the PR.

thorfour pushed a commit that referenced this issue Jan 21, 2024
This is supplementary patch needed for #672. I am submitting it separately to
simplify reviewing.
thorfour pushed a commit to gernest/frostdb that referenced this issue Jan 21, 2024
The result arrow.Record has the same number of rows as the the input indices
array.

This is part of polarsignals#672
thorfour pushed a commit that referenced this issue Jan 21, 2024
The result arrow.Record has the same number of rows as the the input indices
array.

This is part of #672
thorfour pushed a commit to gernest/frostdb that referenced this issue Jan 21, 2024
 `r` and `indices` are externally managed resources. Owning them increases ref
 count ,since we never releases them they will leak when `r` has no dictionary field.

 There is no need to own these resources inside `Take`.

 This is part of polarsignals#672
thorfour pushed a commit that referenced this issue Jan 21, 2024
`r` and `indices` are externally managed resources. Owning them increases ref
 count ,since we never releases them they will leak when `r` has no dictionary field.

 There is no need to own these resources inside `Take`.

 This is part of #672
gernest added a commit to gernest/frostdb that referenced this issue Jan 21, 2024
@gernest
Copy link
Contributor

gernest commented Jan 21, 2024

Done #697 . I wish someone with access to production like workload would do some benchmarks and give us numbers.

Copy link

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Feb 21, 2024
@asubiotto asubiotto removed the Stale label Feb 21, 2024
Copy link

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 23, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants