#### keep obj refs in buckets

There's an optimization possible with the postgres stuff
- if the bucket owns its objects, e.g. in list form
- and we want the entire bucket (because it only has 1 key, and we want that key)
- we could just grab all the objects and return them in one shot

downside: if we have an arbitrary obj id, which bucket did it come from? we dunno
need to drag obj refs along with the obj ids up to higher levels then

so when I query a bucket, the return value looks like:

```
[
  obj_ids: sorted np array
  objs: list
  all_true: bool, True if there's no mask needed because we still want every obj 
  mask: bool np array or None. 
]
```

Then we maintain a whole lot of those structures as we go up. Like we'll have a list of 'em that we wanna union() at the end of `match`, and another list for everything we wanna `exclude`. 

Problem: duplicated storage
 - We are now storing many obj references in each idx instead of just once overall. 
 - RAM cost per index is higher by 8 bytes / index / item. 
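
Rough back-of-envelope for that cost. This assumes 8-byte references (64-bit CPython); the item and index counts are made-up numbers, not measurements.

```python
# Back-of-envelope only: the item and index counts below are made up.
BYTES_PER_REF = 8  # one pointer per stored object reference on 64-bit CPython


def extra_ref_bytes(n_items: int, n_indexes: int) -> int:
    """RAM spent on holding one object reference per item in every index."""
    return BYTES_PER_REF * n_items * n_indexes


n_items, n_indexes = 10_000_000, 5
print(f"{extra_ref_bytes(n_items, n_indexes) / 1e6:.0f} MB of references")
```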

#### Full match

To handle a query like:

`find(match={}, exclude={'something': 'small'})`

we'll need to do a scan across one of the indices to get all items first. 

Optionally, we could preserve a dummy index that contains all items in one key, just for the "I need everything" query (e.g. when doing an `__iter__`).
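
A toy sketch of the dummy-key idea. `TinyIndex` and `ALL_ITEMS` are hypothetical names, not the real index class. It indexes dicts on a single field and uses list positions as obj ids.

```python
# Toy sketch: TinyIndex and ALL_ITEMS are hypothetical, for illustration only.
ALL_ITEMS = object()  # sentinel key; its bucket holds every obj id


class TinyIndex:
    """Index a list of dicts on one field; obj ids are just list positions."""

    def __init__(self, objs, field):
        self.objs = list(objs)
        self.field = field
        self.buckets = {ALL_ITEMS: set(range(len(self.objs)))}
        for obj_id, obj in enumerate(self.objs):
            self.buckets.setdefault(obj[field], set()).add(obj_id)

    def find(self, match=None, exclude=None):
        if match:
            ids = set(self.buckets.get(match[self.field], set()))
        else:
            # Empty match means "everything": one lookup, no scan over buckets.
            ids = set(self.buckets[ALL_ITEMS])
        if exclude:
            ids -= self.buckets.get(exclude[self.field], set())
        return [self.objs[i] for i in sorted(ids)]

    def __iter__(self):
        return iter(self.find())


items = [{'something': 'small'}, {'something': 'big'}, {'something': 'small'}]
idx = TinyIndex(items, 'something')
print(idx.find(match={}, exclude={'something': 'small'}))
```

The tradeoff is one extra set of ids per index, in exchange for never scanning all buckets to answer an empty `match`.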


#### multiple matches / excludes

Currently some queries are not expressible. Find me everything with `a=1 and [(b != 2) or (c != 3)]`. 

We could get one level deeper by accepting lists, like `match={a:1}, exclude=[{b: 2}, {c: 3}]`. Accepting list-of-list and so on could keep going deeper still. It makes a tree, which we can eval from the ground up.

Not sure whether we wanna go down that road. Sounds long. 

Just something to keep in mind as we update the ol' query evaluation engine to fit this new approach.

ON SECOND THOUGHT. We can index general functions. A user could just make an index on:

```
def not_these(obj):
    return obj.b != 2 or obj.c != 3
```

And then
```
hi = HashIndex(items, on=['a', not_these])
hi.find({'a': 1, not_these: True})
```

which would be more efficient anyway. So screw complexity. Yay.
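
To make the function-as-key idea concrete, here's a toy stand-in. `ToyIndex` and `get_value` are hypothetical and are not the `HashIndex` API. It assumes objects expose `a`, `b`, `c` as attributes, and relies on functions being hashable so they can be used as dict keys.

```python
from collections import defaultdict
from types import SimpleNamespace


def not_these(obj):
    return obj.b != 2 or obj.c != 3


def get_value(obj, key):
    # A key is either an attribute name or a function of the object.
    return key(obj) if callable(key) else getattr(obj, key)


class ToyIndex:
    """Hypothetical stand-in: one dict of value -> positions per key."""

    def __init__(self, objs, on):
        self.objs = list(objs)
        self.buckets = {key: defaultdict(set) for key in on}
        for pos, obj in enumerate(self.objs):
            for key in on:
                self.buckets[key][get_value(obj, key)].add(pos)

    def find(self, match):
        hits = set(range(len(self.objs)))
        for key, value in match.items():
            # Each key narrows the hits; the function was evaluated at build time.
            hits &= self.buckets[key].get(value, set())
        return [self.objs[i] for i in sorted(hits)]


items = [
    SimpleNamespace(a=1, b=2, c=3),
    SimpleNamespace(a=1, b=2, c=4),
    SimpleNamespace(a=2, b=0, c=0),
]
ti = ToyIndex(items, on=['a', not_these])
print(ti.find({'a': 1, not_these: True}))
```

The function only runs once per object when the index is built, so the query itself is just two bucket lookups and a set intersection.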

In [84]:
from dataclasses import dataclass
from bisect import bisect_left
import random
import time
import numpy as np
from sortedcontainers import SortedDict
from pympler.asizeof import asizeof
import sortednp as snp
from operator import itemgetter
from typing import List, Optional, Any

In [60]:
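from dataclasses import field

@dataclass
class SomeObjs:
    # todo: slots=true
    # sorted np array of ints
    obj_ids: np.ndarray = field(default_factory=lambda: np.array([], dtype='uint64'))
    objs: List[Any] = field(default_factory=list)  # sorted by obj_id
    all_true: bool = True  # True if there's no mask needed because we still want every obj
    mask: Optional[np.ndarray] = None  # bool np array, or None when all_true
    # Some obj_ids will be set to np.NaN during intersect operations
    # (todo: uint64 can't hold NaN, so that needs a float dtype or a sentinel value)
    has_nan_ids: bool = False


def remove_nan_ids(s: SomeObjs):
    # mutates s: drops every obj whose mask bit is False
    if s.all_true:
        return
    # compute positions before touching anything, so ids and objs stay aligned
    pos = np.where(s.mask)[0]
    s.obj_ids = s.obj_ids[s.mask]
    # itemgetter is faster than a generator expression for subselecting a list by indices
    # just needs extra handling: with 0 args it raises, with 1 arg it returns a bare item
    if len(pos) == 0:
        s.objs = []
    elif len(pos) == 1:
        s.objs = [s.objs[pos[0]]]
    else:
        s.objs = list(itemgetter(*pos)(s.objs))
    s.all_true = True
    s.mask = None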


In [57]:
# proof that itemgetter is faster, even with the cast to list
pos = list(range(10**6))
objs = list(range(10**6))
t0 = time.time()
q = list(itemgetter(*pos)(objs))
t1 = time.time()
r = [objs[i] for i in pos]
t2 = time.time()
print('itemgetter', t1-t0)
print('list comp', t2-t1)

itemgetter 0.028338909149169922
list comp 0.0499112606048584


In [66]:
def test_remove_nan_ids():
    # mask bits typically get set after an intersect / difference operation
    s = SomeObjs(obj_ids=np.array([1, 2, 3], dtype='uint64'), objs=['a', 'b', 'c'], all_true=True, mask=None)

    s.all_true = False
    # todo parameterize for 0, 1, 2, and 3 trues -- all different outcomes
    s.mask = np.array([True, False, True], dtype=bool)

    remove_nan_ids(s)
    assert s.objs == ['a', 'c']
    assert s.obj_ids.tolist() == [1, 3]

test_remove_nan_ids()

In [None]:
def sort_by_id(s: SomeObjs):
    # mutates s
    sort_order = s.obj_ids.argsort()
    s.obj_ids = s.obj_ids[sort_order]
    # objs is a plain list, so it can't be fancy-indexed with an array
    s.objs = [s.objs[i] for i in sort_order]

In [67]:
def union_all(ls: List[SomeObjs]) -> SomeObjs:
    # so we take the union'd IDs
    # then we intersect them with each SomeObjs one at a time
    # on intersect, we:
    # - grab the obj and its id, put it in the output slot for that id
    # No final sort needed in this version: union_ids comes out of the merge sorted.
    # Assumes each input has no pending mask (call remove_nan_ids on it first).
    if not ls:
        return SomeObjs()
    union_ids = snp.kway_merge(*[s.obj_ids for s in ls], assume_sorted=True, duplicates=snp.DROP)
    objs = [None] * len(union_ids)
    for s in ls:
        # indices=True also returns the matching positions in each input array
        _, (union_pos, s_pos) = snp.intersect(union_ids, s.obj_ids, indices=True)
        for u, p in zip(union_pos, s_pos):
            objs[u] = s.objs[p]
    return SomeObjs(obj_ids=union_ids, objs=objs, all_true=True, mask=None)

s = SomeObjs(obj_ids = np.array([1,2,3], dtype='uint64'), objs=['a', 'b', 'c'], all_true=True, mask=None)

In [None]:
# idea: use nans instead of a mask

In [None]:

def intersect_all(ls: List[SomeObjs]):
    pass

def difference(objs: SomeObjs, not_these: SomeObjs):
    pass
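
One possible shape for the two stubs above, sketched on bare `(ids, objs)` pairs with plain NumPy rather than `SomeObjs` and `sortednp`. `intersect_pair` and `difference_pair` are hypothetical helper names. Both assume the id arrays are sorted and contain no duplicates.

```python
import numpy as np


def intersect_pair(ids_a, objs_a, ids_b):
    """Keep the (id, obj) pairs from a whose id also appears in b."""
    common, pos_a, _ = np.intersect1d(
        ids_a, ids_b, assume_unique=True, return_indices=True
    )
    return common, [objs_a[i] for i in pos_a]


def difference_pair(ids_a, objs_a, ids_b):
    """Keep the (id, obj) pairs from a whose id does not appear in b."""
    keep = ~np.isin(ids_a, ids_b)
    return ids_a[keep], [obj for obj, k in zip(objs_a, keep) if k]


ids_a = np.array([1, 2, 3, 5], dtype='uint64')
objs_a = ['a', 'b', 'c', 'e']
ids_b = np.array([2, 5, 8], dtype='uint64')

print(intersect_pair(ids_a, objs_a, ids_b))
print(difference_pair(ids_a, objs_a, ids_b))
```

`keep` here is the same kind of bool mask the earlier notes describe, so a lazy version could store it on the structure instead of applying it right away.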

