Find objects by attribute in Python -- fast

Suppose you have a million objects in memory. And you want to find the few objects that matter to you. Fast.

Let's make a million random fruits.
```
import random

fruit_names = ['apple', 'grape', 'mango', 'banana', 'peach']
colors = ['red', 'orange', 'yellow', 'green', 'blue']

fruits = [
    {
        'fruit_num': i,
        'name': random.choice(fruit_names),
        'size': random.randint(1, 101),
        'color': random.choice(colors)
    }
    for i in range(1_000_000)
]
```

**Challenge: Find all the blue grapes. You have 50 microseconds. Go!**

The usual answers are O(n) searches: a list comprehension, filter, or a pandas DataFrame query. That's fine for a small number of items, but the cost grows linearly with the number of objects. On a million objects, those methods can take 50 milliseconds, easy. That may not sound like much, but it's a thousand times too slow to beat the challenge.

Let's run the timings and see.

List comprehension:
```
%%timeit
[f for f in fruits if f['name'] == 'grape' and f['color'] == 'blue']
```

```

```

OK, how about SQLite?

SQLite looks like a great fit for this problem. With its tree-based indexing, it will surely beat the linear methods.
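Here is a rough sketch of what an indexed SQLite lookup could look like, using only the standard-library sqlite3 module. The table name `fruits`, the index name `idx_name_color`, and the 10,000-row sample are assumptions made for illustration; this is not the setup that was timed for this article.

```python
import random
import sqlite3

random.seed(0)
fruit_names = ['apple', 'grape', 'mango', 'banana', 'peach']
colors = ['red', 'orange', 'yellow', 'green', 'blue']

# A smaller sample than the article's million, to keep the sketch quick.
rows = [
    (i, random.choice(fruit_names), random.randint(1, 101), random.choice(colors))
    for i in range(10_000)
]

# In-memory database; table and index names are illustrative assumptions.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE fruits (fruit_num INTEGER, name TEXT, size INTEGER, color TEXT)'
)
conn.executemany('INSERT INTO fruits VALUES (?, ?, ?, ?)', rows)

# Composite B-tree index over the two attributes we filter on.
conn.execute('CREATE INDEX idx_name_color ON fruits (name, color)')
conn.commit()

sql = 'SELECT fruit_num FROM fruits WHERE name = ? AND color = ?'
blue_grapes = conn.execute(sql, ('grape', 'blue')).fetchall()

# Ask SQLite how it intends to run the query.
plan = conn.execute('EXPLAIN QUERY PLAN ' + sql, ('grape', 'blue')).fetchall()

print(len(blue_grapes), 'blue grapes')
for row in plan:
    print(row)
```

EXPLAIN QUERY PLAN is a quick way to check that the query is served by the index rather than by a full table scan. Whether that is fast enough for the challenge is a separate question, and only a timing run can answer it.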



In [1]:
import random
import pandas as pd
from litebox import LiteBox
from hashbox import HashBox, FrozenHashBox
import cProfile
import pstats
from pstats import SortKey

In [2]:
fruit_names = ['apple', 'grape', 'mango', 'banana', 'peach']
colors = ['red', 'orange', 'yellow', 'green', 'blue']
shapes = ['cube', 'sphere', 'pyramid', 'dodecahedron']

fruits = [
    {
        'fruit_num': i,
        'name': random.choice(fruit_names),
        'size': random.randint(1, 101),
        'shape': random.choice(shapes),
        'color': random.choice(colors)
    }
    for i in range(1_000_000)
]

In [3]:
fhb = FrozenHashBox(
    fruits,
    ['name', 'color', 'size', 'shape']
)

In [4]:
fhb.find({'name': 'grape', 'color': 'blue', 'size': 100, 'shape': 'sphere'})

array([{'fruit_num': 990604, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 988937, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 980652, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 903416, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 894470, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 894564, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 891593, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 867905, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 864861, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 946943, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 947015, 'name': 'gr

In [5]:
%%timeit
[f for f in fruits if f['name'] == 'grape' and f['color'] == 'blue' and f['size'] == 100 and f['shape'] == 'sphere']

55.1 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [6]:
df = pd.DataFrame.from_records(fruits)

In [7]:
%%timeit
df.query('name == "grape" and color == "blue" and size == 100 and shape == "sphere"')

68.5 ms ± 674 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [8]:
lb = LiteBox(
    fruits,
    {'name': str, 'color': str, 'size': int, 'shape': str}
)

In [9]:
%%timeit
_ = lb.find('name == "grape" and color == "blue" and size == 100 and shape == "sphere"')

5.35 ms ± 591 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [10]:
hb = HashBox(
    fruits,
    ['name', 'color', 'size', 'shape']
)

In [11]:
%%timeit
_ = hb.find({'name': 'grape', 'color': 'blue', 'size': 100, 'shape': 'sphere'})

1.08 ms ± 9.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [12]:
fhb = FrozenHashBox(
    fruits,
    ['name', 'color', 'size', 'shape']
)

In [13]:
%%timeit
_ = fhb.find({'name': 'grape', 'color': 'blue', 'size': 100, 'shape': 'sphere'})

593 µs ± 5.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [14]:
cProfile.run("fhb.find({'name': 'grape', 'color': 'blue', 'size': 100, 'shape': 'sphere'})", 'fstats')

In [15]:
p = pstats.Stats('fstats')
p.sort_stats(SortKey.CUMULATIVE)
p.print_stats()

Sun Jul 31 14:45:50 2022    fstats

         36 function calls in 0.001 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.001    0.001 /home/theo/anaconda3/lib/python3.9/site-packages/hashbox/frozen/main.py:49(find)
        4    0.001    0.000    0.001    0.000 {built-in method sortednp._internal.intersect}
        1    0.000    0.000    0.000    0.000 /home/theo/anaconda3/lib/python3.9/site-packages/hashbox/frozen/main.py:157(_get_objs_by_ids)
        1    0.000    0.000    0.000    0.000 /home/theo/anaconda3/lib/python3.9/site-packages/hashbox/utils.py:22(validate_query)
        4    0.000    0.000    0.000    0.000 /home/theo/anaconda3/lib/python3.9/site-packages/hashbox/frozen/main.py:135(_match_any_of)
        4    0.000    0.000    0.000    0.

<pstats.Stats at 0x7fb4f98354c0>
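
The profile shows four calls to an intersect routine for a query on four attributes, which is consistent with looking up each attribute separately and intersecting the results. The sketch below illustrates that general idea with plain Python dicts and sets. It is not HashBox's actual implementation: `build_index` and `find` are hypothetical helper names, and sets stand in for whatever sorted-array structure the library uses.

```python
import random
from collections import defaultdict

random.seed(0)
fruit_names = ['apple', 'grape', 'mango', 'banana', 'peach']
colors = ['red', 'orange', 'yellow', 'green', 'blue']
shapes = ['cube', 'sphere', 'pyramid', 'dodecahedron']

# Smaller sample than the notebook's million, to keep the sketch quick.
fruits = [
    {
        'fruit_num': i,
        'name': random.choice(fruit_names),
        'size': random.randint(1, 101),
        'shape': random.choice(shapes),
        'color': random.choice(colors),
    }
    for i in range(20_000)
]


def build_index(objs, attrs):
    """Map each (attribute, value) pair to the set of list positions holding it."""
    index = defaultdict(set)
    for pos, obj in enumerate(objs):
        for attr in attrs:
            index[(attr, obj[attr])].add(pos)
    return index


def find(objs, index, query):
    """Return the objects that match every attribute == value pair in query."""
    # One lookup per queried attribute; unknown values yield an empty set.
    id_sets = sorted((index.get(item, set()) for item in query.items()), key=len)
    if not id_sets:
        return list(objs)
    # Intersecting from the smallest set keeps the working set small.
    matches = set.intersection(*id_sets)
    return [objs[pos] for pos in sorted(matches)]


index = build_index(fruits, ['name', 'color', 'size', 'shape'])
query = {'name': 'grape', 'color': 'blue', 'size': 100, 'shape': 'sphere'}
found = find(fruits, index, query)
print(len(found), 'matches')
```

The index costs time and memory to build up front. In exchange, each query touches only the objects that share at least one queried value, instead of scanning the whole list.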

In [16]:
len(fhb.find({'name': "grape", 'color': "blue", 'size': 100}))

413

In [17]:
len(hb.find({'name': "grape", 'color': "blue", 'size': 100}))

413

In [18]:
len(lb.find('name == "grape" and color == "blue" and size == 100'))

413