#### keep obj refs in buckets

There's an optimization possible with the postgres stuff
- if the bucket owns its objects, e.g. in list form
- and we want the entire bucket (because it only has 1 key, and we want that key)
- we could just grab all the objects and return them in one shot

downside: if we have an arbitrary obj id, which bucket did it come from? we dunno
need to drag obj refs along with the obj ids up to higher levels then

so when I query a bucket, the return value looks like:

```
[
  obj_ids: sorted np array
  objs: list
  all_true: bool, True if there's no mask needed because we still want every obj 
  mask: bool np array or None. 
]
```
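
A minimal sketch of that structure as a dataclass. The field names come from the listing above and the names `SomeObjs` and `apply_mask` match the test cells further down, but the behavior of `apply_mask` shown here (drop masked-out entries, then reset to the all-true state) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

import numpy as np


@dataclass
class SomeObjs:
    obj_ids: np.ndarray                # sorted array of object ids
    objs: List[Any]                    # objects, parallel to obj_ids
    all_true: bool = True              # True means "no mask needed, keep everything"
    mask: Optional[np.ndarray] = None  # bool array parallel to obj_ids, or None


def apply_mask(s: SomeObjs) -> None:
    """Assumed behavior: drop masked-out entries in place, then reset the mask."""
    if s.all_true or s.mask is None:
        return
    keep = np.flatnonzero(s.mask)
    s.obj_ids = s.obj_ids[keep]
    s.objs = [s.objs[i] for i in keep]
    s.all_true = True
    s.mask = None


s = SomeObjs(np.array([1, 2, 3], dtype="uint64"), ["a", "b", "c"])
s.all_true = False
s.mask = np.array([True, False, True])
apply_mask(s)
```

Keeping `all_true` separate from `mask` means the common "whole bucket" case never has to allocate a mask at all.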

Then we maintain a whole lot of those structures as we go up. Like we'll have a list of 'em that we wanna union() at the end of `match`, and another list for everything we wanna `exclude`. 

Problem: duplicated storage
 - We are now storing the obj references in every idx instead of just once overall.
 - RAM cost goes up by 8 bytes per item per index (one 64-bit pointer each).
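
A quick way to sanity-check the per-reference cost, assuming CPython, where a list stores one pointer per element. The sizes are measured, not hard-coded, so the number depends on the build.

```python
import struct
import sys

n = 1000
# A list stores one pointer per element, so growing it by n slots
# shows the per-reference cost directly.
small = sys.getsizeof([None] * n)
large = sys.getsizeof([None] * (2 * n))
bytes_per_ref = (large - small) // n

pointer_size = struct.calcsize("P")  # size of a C pointer on this build
print(bytes_per_ref, pointer_size)
```

So with k indices over n items, the extra reference storage is roughly k * n * pointer_size bytes.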

#### Full match

To handle a query like:

`find(match={}, exclude={'something': 'small'})`

we'll need to do a scan across one of the indices to get all items first. 

Optionally, we could preserve a dummy index that contains all items in one key, just for the "I need everything" query (e.g. when doing an `__iter__`).
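
A toy sketch of the dummy-key idea. `ToyIndex` and the `ALL` sentinel are hypothetical names, items are plain dicts for brevity, and buckets are Python sets rather than the real index structures.

```python
from collections import defaultdict

ALL = object()  # hypothetical sentinel key that every item is filed under


class ToyIndex:
    """Toy stand-in for a real index: maps (field, value) -> set of object ids."""

    def __init__(self, items, on):
        self.objs = {id(o): o for o in items}
        self.buckets = defaultdict(set)
        for o in items:
            self.buckets[ALL].add(id(o))  # the dummy "everything" bucket
            for field in on:
                self.buckets[(field, o[field])].add(id(o))

    def find(self, match=None, exclude=None):
        # An empty match starts from the dummy bucket instead of scanning an index.
        hits = set(self.buckets[ALL])
        for field, value in (match or {}).items():
            hits &= self.buckets.get((field, value), set())
        for field, value in (exclude or {}).items():
            hits -= self.buckets.get((field, value), set())
        return [self.objs[i] for i in hits]

    def __iter__(self):
        return iter(self.find())


items = [{"something": "small"}, {"something": "big"}, {"something": "huge"}]
idx = ToyIndex(items, on=["something"])
kept = idx.find(match={}, exclude={"something": "small"})
```

The tradeoff is one more reference per item, the same duplicated-storage cost discussed above.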


#### multiple matches / excludes

Currently some queries are not expressible. Find me everything with `a=1 and [(b != 2) or (c != 3)]`. 

We could get one level deeper by accepting lists, like `match={a:1}, exclude=[{b: 2}, {c: 3}]`. Accepting list-of-list and so on could keep going deeper still. It makes a tree, which we can eval from the ground up.

Not sure whether we wanna go down that road. Sounds long. 

Just something to keep in mind as we update the ol' query evaluation engine to fit this new approach.

ON SECOND THOUGHT. We can index general functions. A user could just make an index on:

```
def not_these(obj):
    return obj.b != 2 or obj.c != 3
```

And then
```
hi = HashIndex(items, on=['a', not_these])
hi.find({'a': 1, not_these: True})
```

which would be more efficient anyway. So screw complexity. Yay.
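
To make the function-as-index idea concrete, here is a toy sketch. `ToyHashIndex` is a hypothetical stand-in, not the real `HashIndex`; it assumes objects expose their fields as attributes and that an index key is either an attribute name or a callable.

```python
from collections import defaultdict
from types import SimpleNamespace


def get_value(obj, field):
    """A field is either an attribute name or a function of the object."""
    return field(obj) if callable(field) else getattr(obj, field)


class ToyHashIndex:
    """Hypothetical stand-in for HashIndex; only supports exact-match find()."""

    def __init__(self, items, on):
        self.items = list(items)
        self.buckets = {field: defaultdict(set) for field in on}
        for pos, obj in enumerate(self.items):
            for field in on:
                self.buckets[field][get_value(obj, field)].add(pos)

    def find(self, match):
        hits = set(range(len(self.items)))
        for field, value in match.items():
            hits &= self.buckets[field].get(value, set())
        return [self.items[pos] for pos in sorted(hits)]


def not_these(obj):
    return obj.b != 2 or obj.c != 3


items = [
    SimpleNamespace(a=1, b=2, c=3),  # a matches, but not_these is False
    SimpleNamespace(a=1, b=2, c=4),  # a matches and not_these is True
    SimpleNamespace(a=2, b=0, c=0),  # wrong a
]
hi = ToyHashIndex(items, on=["a", not_these])
found = hi.find({"a": 1, not_these: True})
```

The function is evaluated once per object at index time, so the query itself stays a plain bucket intersection.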

In [84]:
from dataclasses import dataclass
from bisect import bisect_left
import random
import time
import numpy as np
from sortedcontainers import SortedDict
from pympler.asizeof import asizeof
import sortednp as snp
from operator import itemgetter
from typing import List, Optional, Any

In [60]:
@dataclass
class MatchedArray:
    """
    Return type of a FrozenFieldIndex. Contains objects and their ids in parallel arrays.
    Implements efficient union / intersection operations for combining results.
    """
    id_arr: np.ndarray  # sorted np array of ints
    obj_arr: np.ndarray  # array of Python objects, sorted by obj_id

    def intersection(self, other: "MatchedArray"):
        """This array becomes the intersection of self and other."""
        # indices=True also returns the positions of the common ids in each input
        self.id_arr, (self_pos, _) = snp.intersect(self.id_arr, other.id_arr, indices=True)
        self.obj_arr = self.obj_arr[self_pos]

    def union(self, other: "MatchedArray"):
        """This array becomes the union of self and other."""
        merged_id_arr, indices = snp.merge(
            self.id_arr, other.id_arr, indices=True, duplicates=snp.DROP
        )
        # indices[0] / indices[1] say where each input's elements landed in the merge
        obj_arr = np.empty_like(merged_id_arr, dtype='O')
        obj_arr[indices[0]] = self.obj_arr
        obj_arr[indices[1]] = other.obj_arr
        self.id_arr = merged_id_arr
        self.obj_arr = obj_arr

    def difference(self, other: "MatchedArray"):
        """Remove elements in other from self."""
        # TODO: use an array hit_counts instead
        # each time an idx is in the intersection, += 1 it
        # at the end, obj_arr = obj_arr[np.where(hit_counts == n_intersections)]
        matched_positions = snp.intersect(self.id_arr, other.id_arr, indices=True)[1][0]
        nonmatches = np.ones_like(self.id_arr, dtype=bool)
        nonmatches[matched_positions] = False
        self.id_arr = self.id_arr[nonmatches]
        self.obj_arr = self.obj_arr[nonmatches]

    def __len__(self):
        return len(self.id_arr)

def intersect_all(marrs: List[MatchedArray]) -> MatchedArray:
    """Find the intersection of all MatchedArrays. Mutates and returns the smallest one."""
    # start from the smallest array so every later intersection is as cheap as possible
    pos = min(range(len(marrs)), key=lambda i: len(marrs[i]))
    result = marrs[pos]
    for i, m in enumerate(marrs):
        if i == pos:
            continue
        result.intersection(m)
        if len(result) == 0:
            break
    return result

def union_all(marrs: List[MatchedArray]):
    """Find the union of all MatchedArrays"""

### are obj arrays actually slow to make? is it worth doing the lazy eval here?
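
For reference, the lazy variant would look roughly like the `hit_counts` idea from the `difference` TODO: count hits per position and slice the object array once at the end. This is a minimal sketch, assuming sorted id arrays with no duplicates. `positions_in` and `intersect_with_hit_counts` are hypothetical names, and `np.searchsorted` stands in for `snp.intersect(..., indices=True)` so the block only needs numpy.

```python
import numpy as np


def positions_in(base_ids, other_ids):
    """Positions in sorted base_ids whose id also appears in sorted other_ids."""
    if len(other_ids) == 0:
        return np.array([], dtype=np.intp)
    slots = np.searchsorted(other_ids, base_ids)
    # Out-of-range slots belong to ids larger than everything in other_ids;
    # clamp them to 0, where the equality test below is guaranteed to fail.
    slots[slots == len(other_ids)] = 0
    return np.flatnonzero(other_ids[slots] == base_ids)


def intersect_with_hit_counts(base_ids, base_objs, others):
    """Intersect base with every array in others, touching base_objs only once."""
    hit_counts = np.zeros(len(base_ids), dtype=np.int64)
    for other_ids in others:
        hit_counts[positions_in(base_ids, other_ids)] += 1
    keep = np.where(hit_counts == len(others))
    return base_ids[keep], base_objs[keep]


base_ids = np.array([1, 3, 5, 7, 9])
base_objs = np.array(["a", "b", "c", "d", "e"], dtype="O")
others = [np.array([3, 5, 9, 11]), np.array([0, 5, 9])]
ids, objs = intersect_with_hit_counts(base_ids, base_objs, others)
```

Whether this beats eager copying is the question the timing cells below try to answer.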


In [154]:
n = 10**6
lots = np.array([str(i) for i in range(n)])

n_masks = 10
masks = []
for k in range(n_masks):
    mask = np.zeros(n, dtype=bool)
    mask[::k + 1] = True  # every (k+1)-th element, same as i % (k+1) == 0
    masks.append(mask)

    

In [155]:
# just like copy them idk
t0 = time.time()
result = np.copy(lots)
for m in masks:
    result = np.copy(result[:int(0.9*len(result))])
t1 = time.time()
print(t1-t0)

0.022984027862548828


In [156]:
t0 = time.time()
result = np.copy(lots)
for m in masks:
    m1 = np.copy(m)
t1 = time.time()
print(t1-t0)

0.012212514877319336


In [145]:
masks[9]

array([ True,  True,  True, ...,  True,  True,  True])

In [134]:
a0 = np.array([5,6,7,99])
a1 = np.array([7,8,9,9])
o_arr_1 = np.array(['a', 'b', 'c','q'], dtype='O')
o_arr_2 = np.array(['c', 'd', 'e','e'], dtype='O')

merged_id_arr, indices = snp.merge(a0, a1, indices=True, duplicates=snp.DROP)
obj_arr = np.empty_like(merged_id_arr, dtype='O')
indices

(array([0, 1, 2, 5]), array([2, 3, 4, 4]))

In [135]:
obj_arr[indices[0]] = o_arr_1
obj_arr[indices[1]] = o_arr_2
obj_arr

array(['a', 'b', 'c', 'd', 'e', 'q'], dtype=object)


In [66]:
def test_apply_mask():
    # mask bits typically get set after an intersect / difference operation
    s = SomeObjs(obj_ids = np.array([1,2,3], dtype='uint64'), objs=['a', 'b', 'c'], all_true=True, mask=None)

    s.all_true = False
    # todo parameterize for 0, 1, 2, and 3 trues -- all different outcomes
    s.mask = np.array([True, False, True], dtype=bool)

    apply_mask(s)
    assert s.objs == ['a', 'c']
    
test_apply_mask()

In [None]:
def sort_by_id(s: SomeObjs):
    # mutates s
    sort_order = s.obj_ids.argsort()
    s.obj_ids = s.obj_ids[sort_order]
    s.objs = [s.objs[i] for i in sort_order]  # objs is a plain list, so no fancy indexing

In [67]:
def union_all(ls: List[SomeObjs]) -> SomeObjs:
    # so we take the union'd IDs
    # then we intersect them with each SomeObjs one at a time
    # on intersect, we grab each obj and put it at its id's position in the output
    # kway_merge returns sorted ids, so no final sort is needed
    # NOTE: assumes masks were already applied to every input (all_true=True)
    union_ids = snp.kway_merge(*[s.obj_ids for s in ls], assume_sorted=True, duplicates=snp.DROP)
    objs = [None] * len(union_ids)
    for s in ls:
        _, (union_pos, s_pos) = snp.intersect(union_ids, s.obj_ids, indices=True)
        for u, i in zip(union_pos, s_pos):
            objs[u] = s.objs[i]
    return SomeObjs(obj_ids=union_ids, objs=objs, all_true=True, mask=None)

s = SomeObjs(obj_ids = np.array([1,2,3], dtype='uint64'), objs=['a', 'b', 'c'], all_true=True, mask=None)

In [None]:
# idea: use nans instead of a mask
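
One way the NaN idea could look, as a rough sketch. It assumes the ids are stored as float64 so that NaN is representable, which is not how the cells above store them (they use `uint64`).

```python
import numpy as np

obj_ids = np.array([1, 2, 3, 4], dtype="float64")  # float so that NaN is representable
objs = np.array(["a", "b", "c", "d"], dtype="O")

# "Remove" ids 2 and 4 by overwriting them, instead of flipping bits in a separate mask.
obj_ids[[1, 3]] = np.nan

# Compact once at the end.
alive = ~np.isnan(obj_ids)
obj_ids, objs = obj_ids[alive], objs[alive]
```

Two caveats: float64 only represents integers exactly up to 2**53, and NaN entries break the sortedness that the sorted-array merge and intersect routines rely on, so compaction would have to happen before the next merge or intersect.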

In [None]:

def intersect_all(ls: List[SomeObjs]):
    pass

def difference(objs: SomeObjs, not_these: SomeObjs):
    pass

