Find objects by attribute in Python -- fast

Suppose you have a million objects in memory. And you want to find the few objects that matter to you. Fast.

Let's make a million random fruits.
```
import random

fruit_names = ['apple', 'grape', 'mango', 'banana', 'peach']
colors = ['red', 'orange', 'yellow', 'green', 'blue']

fruits = [
    {
        'fruit_num': i,
        'name': random.choice(fruit_names),
        'size': random.randint(1, 101),
        'color': random.choice(colors)
    }
    for i in range(1_000_000)
]
```

**Challenge: Find all the blue grapes. You have 50 microseconds. Go!**

The usual answers are to use an O(n) search, like a list comprehension, filter, or Pandas dataframe. That's fine for a small number of items, but it bogs down the more objects you have. Those methods can take 50 milliseconds, easy. That may not sound like much, but it's a thousand times too slow to beat the challenge.
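
To make the O(n) cost concrete, here is a minimal sketch on a small sample (10,000 fruits rather than a million, with a fixed seed so it is repeatable). The `is_blue_grape` predicate and the `checks` counter are names made up for this illustration. The counter shows that a `filter`-based search has to test every object, no matter how few of them match.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

fruit_names = ['apple', 'grape', 'mango', 'banana', 'peach']
colors = ['red', 'orange', 'yellow', 'green', 'blue']

# A small sample with the same layout as the fruits above.
sample = [
    {'fruit_num': i, 'name': random.choice(fruit_names), 'color': random.choice(colors)}
    for i in range(10_000)
]

checks = {'count': 0}  # how many objects the search had to look at


def is_blue_grape(fruit):
    """Predicate for the search; also counts how often it is called."""
    checks['count'] += 1
    return fruit['name'] == 'grape' and fruit['color'] == 'blue'


# filter() walks the whole list, calling the predicate once per object.
blue_grapes = list(filter(is_blue_grape, sample))

print(len(blue_grapes), 'matches after', checks['count'], 'checks')
```

The number of checks equals the size of the list, so doubling the number of fruits doubles the work, whether or not any new blue grapes show up.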

Let's run the timings and see.

List comprehension:
```
%%timeit
[f for f in fruits if f['name'] == 'grape' and f['color'] == 'blue']
```

```

```

OK, how about SQLite?

SQLite looks like a great fit for this problem. With its tree-based indexing, it will surely beat the linear methods.
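
As a rough sketch of what that could look like with Python's built-in `sqlite3` module: the table name `fruits`, the index name `idx_name_color`, and the 10,000-row sample are all assumptions made for this example, not part of the benchmark above. The `EXPLAIN QUERY PLAN` step is there to check that SQLite really uses the index instead of scanning the table.

```python
import random
import sqlite3

random.seed(0)  # fixed seed so the sketch is repeatable

fruit_names = ['apple', 'grape', 'mango', 'banana', 'peach']
colors = ['red', 'orange', 'yellow', 'green', 'blue']

# Small sample of (fruit_num, name, color) rows; sizes are omitted for brevity.
rows = [
    (i, random.choice(fruit_names), random.choice(colors))
    for i in range(10_000)
]

# In-memory database; table and index names are hypothetical.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE fruits (fruit_num INTEGER PRIMARY KEY, name TEXT, color TEXT)'
)
conn.executemany('INSERT INTO fruits VALUES (?, ?, ?)', rows)

# A composite B-tree index on the two attributes we search by.
conn.execute('CREATE INDEX idx_name_color ON fruits (name, color)')

query = 'SELECT fruit_num FROM fruits WHERE name = ? AND color = ?'
blue_grapes = conn.execute(query, ('grape', 'blue')).fetchall()

# Ask SQLite how it plans to run the query; the last column describes the step.
plan = conn.execute('EXPLAIN QUERY PLAN ' + query, ('grape', 'blue')).fetchall()
for step in plan:
    print(step[-1])
```

Note that the query returns tuples of column values, not the original Python objects, so mapping results back to objects is an extra step on top of the lookup itself.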



In [2]:
import random
import pandas as pd
from litebox import LiteBox
from hashbox import HashBox, FrozenHashBox
import cProfile
import pstats
from pstats import SortKey

Dataset - 2 million fruits of different sizes, shapes, colors, and types.

In [3]:
fruit_names = ['apple', 'grape', 'mango', 'banana', 'peach']
colors = ['red', 'orange', 'yellow', 'green', 'blue']
shapes = ['cube', 'sphere', 'pyramid', 'dodecahedron']

fruits = [
    {
        'fruit_num': i,
        'name': random.choice(fruit_names),
        'size': random.randint(1, 101),
        'shape': random.choice(shapes),
        'color': random.choice(colors)
    }
    for i in range(2_000_000)
]

In [3]:
fhb = FrozenHashBox(
    fruits,
    ['name', 'color', 'size', 'shape']
)

In [4]:
fhb.find({'name': 'grape', 'color': 'blue', 'size': 100, 'shape': 'sphere'})

array([{'fruit_num': 11643, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 22593, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 55809, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 61252, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 67027, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 71211, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 84675, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 85854, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 86649, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 87664, 'name': 'grape', 'size': 100, 'shape': 'sphere', 'color': 'blue'},
       {'fruit_num': 89116, 'name': 'grape', 'size

In [21]:
%%timeit
_ = fhb.find({'name': 'grape', 'color': 'blue', 'size': 100, 'shape': 'sphere'})

563 µs ± 32.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [6]:
%%timeit
[f for f in fruits if f['name'] == 'grape' and f['color'] == 'blue' and f['size'] == 100 and f['shape'] == 'sphere']

53.3 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [7]:
df = pd.DataFrame.from_records(fruits)

In [8]:
%%timeit
df.query('name == "grape" and color == "blue" and size == 100 and shape == "sphere"')

70.9 ms ± 896 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [4]:
lb = LiteBox(
    fruits,
    {'name': str, 'color': str, 'size': int, 'shape': str}
)

<litebox.main.LiteBox at 0x7fe06e599400>

In [20]:
%%timeit
_ = lb.find('name == "grape" and color == "blue" and size == 100 and shape == "sphere"')

5.01 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
hb = HashBox(
    fruits,
    ['name', 'color', 'size', 'shape']
)

In [None]:
%%timeit
_ = hb.find({'name': 'grape', 'color': 'blue', 'size': 100, 'shape': 'sphere'})

In [None]:
cProfile.run("fhb.find({'name': 'grape', 'color': 'blue', 'size': 100, 'shape': 'sphere'})", 'fstats')

In [None]:
p = pstats.Stats('fstats')
p.sort_stats(SortKey.CUMULATIVE)
p.print_stats()

In [None]:
len(fhb.find({'name': "grape", 'color': "blue", 'size': 100}))

In [None]:
len(hb.find({'name': "grape", 'color': "blue", 'size': 100}))

In [None]:
len(lb.find('name == "grape" and color == "blue" and size == 100'))