We can easily store in Polars:
 - an object reference
 - the attributes

But we still need a way to remove(obj) or update(obj) quickly.
O(log n) time is okay.

To work out:
 - Store id(obj) as a column `obj_id` in the Polars dataframe.
 - Keep the dataframe sorted by obj_id. (Think about how to keep this cheap when doing many inserts, e.g. sort once at the end of an add_many call.)
 - Can we insert_at(pos) in a Polars dataframe?
 - How to bisect on a Polars series? Does that work?
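
The sorted-by-obj_id idea in the list above can be sketched in plain Python, without Polars, to show where the O(log n) comes from and where it stops. `SortedIdIndex` and its methods are hypothetical names made up for this sketch. It uses the standard-library `bisect` module on a list, not a dataframe. On the Polars side, `Series.search_sorted` looks like the closest match for the bisect question, though that is not tested here.

```python
import bisect


class SortedIdIndex:
    """Hypothetical helper: keeps (obj_id, obj) pairs sorted by obj_id."""

    def __init__(self):
        self._ids = []   # sorted list of id(obj)
        self._objs = []  # objects, in the same order as self._ids

    def add_many(self, objs):
        # Append the whole batch, then sort once, instead of doing
        # one O(n) sorted insert per object.
        pairs = list(zip(self._ids, self._objs))
        pairs.extend((id(o), o) for o in objs)
        pairs.sort(key=lambda p: p[0])
        self._ids = [p[0] for p in pairs]
        self._objs = [p[1] for p in pairs]

    def find(self, obj):
        # O(log n) bisect on the sorted ids; returns the position or -1.
        pos = bisect.bisect_left(self._ids, id(obj))
        if pos < len(self._ids) and self._ids[pos] == id(obj):
            return pos
        return -1

    def remove(self, obj):
        # The lookup is O(log n), but deleting from a list still shifts
        # every later element, so the removal as a whole is O(n).
        pos = self.find(obj)
        if pos == -1:
            raise KeyError(obj)
        del self._ids[pos]
        del self._objs[pos]
```

The lookup is the cheap part. The structural change (insert or delete at a position) is what costs O(n), and that is the part a mostly immutable dataframe makes awkward.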
 

Nope, that's not gonna work. Polars requires a different design due to its "not-very-mutable" nature.
Gotta be less clever.

- `remove(obj)` means setting the row's obj to null and its obj_id to 0. That takes O(n), and the dead rows will need to be garbage collected eventually.
- `update(obj)` means an O(n) search for the row with the matching obj_id.


In [1]:
import random
import sys
import time
import numpy as np
import polars as pl
from pympler.asizeof import asizeof

In [2]:
class Thing:
    def __init__(self):
        self.x = random.random()
        self.y = random.random()
        self.z = random.random()

N_THINGS = 10**7

things = [Thing() for _ in range(N_THINGS)]
# 2.6GB @ 10^7 items

In [3]:
t0 = time.time()
df = pl.DataFrame({
    'x': [t.x for t in things], 
    'y': [t.y for t in things], 
    't': things, 
    'obj_id': [id(t) for t in things]
}).lazy()
t1 = time.time()
print('built polars df in', t1-t0)

# 3.0GB @ 10^7 items, up from 2.6GB for the list of things alone,
# so the dataframe adds about 0.4GB, i.e. about 40MB per million items,
# compared with 100MB per million (SQLite).
# Polars is also 10x faster to build.

built polars df in 2.9653241634368896


In [4]:
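n_runs = 10
t_polars = 0
t_generator = 0
n_items = 10**0  # expected number of matching rows
# y is uniform on [0, 1), so y < thresh matches about n_items rows
thresh = n_items / len(things)
for _ in range(n_runs):
    t0 = time.time()
    # Polars: filter the object column using the numeric columns
    ls = df.select(
        pl.col("t").filter(
            (pl.col('y') < thresh) &
            (pl.col('x') < 1.0)
        )
    ).collect()["t"].to_list()
    t1 = time.time()
    # baseline: plain Python scan over the list of objects
    ls_gen = list(t for t in things if (t.x < 1 and t.y < thresh))
    t2 = time.time()
    t_polars += (t1-t0)/n_runs
    t_generator += (t2-t1)/n_runs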
print('polars', t_polars, len(ls))
print('generator', t_generator, len(ls_gen))
print('speedup', round(t_generator/t_polars, 3), 'x')

polars 0.043565535545349116 1
generator 0.8961155652999877 1
speedup 20.569 x


In [7]:
# implement remove
t_dead = random.choice(things)
# uhhh... there's no in-place remove, really.

In [23]:
row.x = 1


NameError: name 'row' is not defined

x,y,z,t,obj_id
f64,f64,f64,object,i64
0.676989,0.6292,0.738254,<__main__.Thing object at 0x7f6cc2459b80>,140105092537216


In [25]:
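df2 = pl.DataFrame({"a": [1, 2, None], "b": [4, None, 6]})

# fill_null returns a new DataFrame; the result is discarded here,
# so df2 still contains its nulls
df2.fill_null(strategy="zero")
df2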

a,b
i64,i64
1.0,4.0
2.0,
,6.0


In [26]:
df2['a'].set_at_idx((1,), None)  # does nothing
df2['a'] = df2['a'].set_at_idx((1,), None)
df2

a,b
i64,i64
1.0,4.0
2.0,
,6.0
