Feature/sql store #125

simonwoerpel · 2023-08-09T11:03:14Z

This implements a SQLStore for statements following the general Store interface as seen in MemoryStore and LevelDBStore.

There might be a few things to discuss, which is why there are several commits.

33e5456
Added SQL indexes for the prop, prop_type and schema columns. As all these values have relatively low cardinality, this should not add massive overhead.

84600b6
I am not sure if this is the right way, but during the sqlite tests the transaction was not committed without the finally statement. But this breaks mypy -.-

c4f7503
A small fix: without it, the LevelDBStore was not deleting all the statements it should.

cebd4d7
A test that checks that all the stores behave exactly the same ;) (they do.)
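To make the index commit (33e5456) concrete, here is a minimal sketch using the standard-library sqlite3 module rather than the SQLAlchemy table definition in the PR. The column list and the ix_statement_* index names are assumptions for illustration, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified statement table; the real table has more columns.
conn.execute(
    'CREATE TABLE statement (id TEXT PRIMARY KEY, entity_id TEXT, '
    'prop TEXT, prop_type TEXT, "schema" TEXT, value TEXT)'
)

# One single-column index per low-cardinality column (hypothetical names).
for column in ("prop", "prop_type", "schema"):
    conn.execute(f'CREATE INDEX ix_statement_{column} ON statement ("{column}")')

# List the explicitly created indexes, skipping SQLite's automatic PK index.
index_names = sorted(
    row[1]
    for row in conn.execute("PRAGMA index_list('statement')")
    if row[1].startswith("ix_")
)

# Ask the planner how it would run a typical filter on prop_type.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM statement WHERE prop_type = 'entity'"
).fetchall()
```

Inspecting `plan` on a realistic data volume is one way to check whether such indexes are actually picked up before paying their write cost.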

pudo

Very fucking cool, this looks awesome!

pudo · 2023-08-10T06:31:38Z

nomenklatura/db.py

+            with engine.begin() as conn:
+                yield conn
+    finally:
+        if conn is not None:


Wait, are we committing after an exception here? That feels a bit unhealthy.

Fixed it here: f968f90 and here: 2389094
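The concern about committing after an exception can be illustrated with a small sketch. It uses the standard-library sqlite3 module instead of the SQLAlchemy engine from the diff, and the ensure_tx signature shown here is an assumption, not the code from f968f90 or 2389094.

```python
import sqlite3
from contextlib import contextmanager
from typing import Generator


@contextmanager
def ensure_tx(conn: sqlite3.Connection) -> Generator[sqlite3.Connection, None, None]:
    """Commit only if the body succeeds; roll back if it raises."""
    try:
        yield conn
    except Exception:
        conn.rollback()  # discard partial writes, then re-raise
        raise
    else:
        conn.commit()  # reached only when no exception occurred


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statement (id TEXT PRIMARY KEY)")
conn.commit()

# Successful block: the insert is committed.
with ensure_tx(conn) as tx:
    tx.execute("INSERT INTO statement VALUES ('a')")

# Failing block: the insert must not survive.
try:
    with ensure_tx(conn) as tx:
        tx.execute("INSERT INTO statement VALUES ('b')")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

ids = [row[0] for row in conn.execute("SELECT id FROM statement ORDER BY id")]
```

Putting the commit in an else branch rather than in finally is what keeps a failed block from being persisted.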

pudo · 2023-08-10T06:33:06Z

nomenklatura/store/sql.py

+    data: dict[str, Any] = stmt.to_row()
+    data["target"] = as_bool(data["target"])
+    data["external"] = as_bool(data["external"])
+    data["first_seen"] = data["first_seen"] or datetime.utcnow()


I don't think we should make up dates somewhere in the middle of the pipeline. That would lead to pretty random outcomes. Maybe we make those columns nullable instead?

Been struggling with the same issue: https://github.com/opensanctions/opensanctions/blob/main/zavod/zavod/tools/load_db.py#L42-L45

Made it nullable: 2d0dd85

I don't know what side-effects this could have in other parts of your pipeline? Especially for existing statement tables... (migrations?)
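A rough sketch of the nullable approach, again with sqlite3 as a stand-in: missing dates are stored as NULL instead of being filled with the current time. The as_bool helper and pack_row function below are hypothetical simplifications, not the functions in the PR.

```python
import sqlite3
from typing import Any, Dict


def as_bool(value: Any) -> bool:
    """Hypothetical helper: interpret common truthy strings as True."""
    return str(value).strip().lower() in ("1", "t", "true", "y", "yes")


def pack_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Normalise the flags, but pass a missing date through as None."""
    data = dict(row)
    data["target"] = as_bool(data["target"])
    data["external"] = as_bool(data["external"])
    data["first_seen"] = data.get("first_seen") or None  # no invented timestamp
    return data


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statement (id TEXT PRIMARY KEY, target BOOLEAN NOT NULL, "
    "external BOOLEAN NOT NULL, first_seen TEXT NULL)"
)

packed = pack_row({"id": "s1", "target": "t", "external": "f", "first_seen": ""})
conn.execute(
    "INSERT INTO statement VALUES (:id, :target, :external, :first_seen)", packed
)
stored = conn.execute(
    "SELECT target, external, first_seen FROM statement"
).fetchone()
```

Downstream readers then have to tolerate a NULL first_seen, which is the migration question raised above.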

pudo · 2023-08-10T06:35:34Z

nomenklatura/store/sql.py

+        table = self.store.table
+        q = (
+            select(table)
+            .where(table.c.prop_type == "entity", table.c.value == id)


I think it needs to check self.store.resolver.connected(id) not just id.
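A sketch of why the lookup should use the whole connected cluster: statements may reference any of the merged source IDs, not only the one passed in. TinyResolver is a hypothetical stand-in for the real resolver, and the table layout is simplified.

```python
import sqlite3
from typing import Dict, List, Set


class TinyResolver:
    """Hypothetical stand-in: maps an ID to the set of IDs merged with it."""

    def __init__(self, clusters: List[Set[str]]) -> None:
        self._by_id: Dict[str, Set[str]] = {i: c for c in clusters for i in c}

    def connected(self, id: str) -> Set[str]:
        return self._by_id.get(id, {id})


def referencing_entities(
    conn: sqlite3.Connection, resolver: TinyResolver, id: str
) -> List[str]:
    """Find entities with an entity-typed statement pointing at any connected ID."""
    ids = sorted(resolver.connected(id))
    marks = ", ".join("?" for _ in ids)
    q = (
        "SELECT entity_id FROM statement "
        f"WHERE prop_type = 'entity' AND value IN ({marks}) ORDER BY entity_id"
    )
    return [row[0] for row in conn.execute(q, ids)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statement (entity_id TEXT, prop_type TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO statement VALUES (?, ?, ?)",
    [("own-1", "entity", "p-1"), ("own-2", "entity", "p-2"), ("own-1", "name", "x")],
)

resolver = TinyResolver([{"p-1", "p-2"}])  # p-1 and p-2 were merged
found = referencing_entities(conn, resolver, "p-1")
naive = [r[0] for r in conn.execute(
    "SELECT entity_id FROM statement WHERE prop_type = 'entity' AND value = ?", ("p-1",)
)]
```

Here the plain equality check misses own-2, which points at the other half of the merged pair.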

pudo · 2023-08-10T06:37:32Z

One extra thing we can consider is if any of the functions (pack_stmt, iterate_stmts) should live in nomenklatura.statement.db so they're re-usable outside of the store context.

simonwoerpel · 2023-08-21T09:32:37Z

One extra thing we can consider is if any of the functions (pack_stmt, iterate_stmts) should live in nomenklatura.statement.db so they're re-usable outside of the store context.

Found the serializers module a good place for it, but feel free to move it to somewhere else: f2c60cf

pudo

Looks amazing :)

pudo · 2023-08-22T08:33:01Z

nomenklatura/store/sql.py

+    def __init__(self, store: SqlStore[DS, CE]):
+        self.store: SqlStore[DS, CE] = store
+        self.batch: Optional[Set[Statement]] = None
+        self.batch_size = 0


Do we need batch_size if we have a batch array? I assume len(batch) is more precise.

3d25cb4

(this originally came from my belief that int > int is more performant than len(foo) > int but gosh, this would never be the bottleneck in the overall sql store implementation) 😂 🙈
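A small sketch of the len(batch) variant. Because the batch is a set, duplicates do not grow it, so a separate counter would overcount. The class below is a simplified, hypothetical writer that records flush sizes instead of writing to a database, and the threshold of 3 is only for illustration.

```python
from typing import List, Set, Tuple

Row = Tuple[str, str]


class BulkWriter:
    BATCH = 3  # deliberately tiny threshold for the sketch

    def __init__(self) -> None:
        self.batch: Set[Row] = set()
        self.flushed: List[int] = []  # sizes of flushed batches, for inspection

    def add(self, row: Row) -> None:
        self.batch.add(row)
        # len() reflects the de-duplicated size, unlike a manual counter.
        if len(self.batch) >= self.BATCH:
            self.flush()

    def flush(self) -> None:
        if self.batch:
            self.flushed.append(len(self.batch))  # a real writer would INSERT here
            self.batch = set()


writer = BulkWriter()
for row in [("a", "1"), ("a", "1"), ("b", "2"), ("c", "3"), ("d", "4")]:
    writer.add(row)
writer.flush()
```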

pudo · 2023-08-22T08:34:13Z

nomenklatura/store/sql.py

+
+    def get_entity(self, id: str) -> Optional[CE]:
+        table = self.store.table
+        ids = [str(i) for i in self.store.resolver.connected(id)]


Ah, interesting, so the idea here is that we're not believing the stmt.canonical_id in the table? While that works, it's a bit different from all the other store implementations we have now....

hm, no, i guess i just misunderstood your comment here: #125 (comment)

pudo · 2023-08-22T08:35:08Z

nomenklatura/store/sql.py

+        q = (
+            select(table)
+            .where(table.c.dataset.in_(self.dataset_names))
+            .order_by("entity_id")


Does this need to be canonical_id? Otherwise it'll return fragmented entities, no?

totally right. fixed here: c1dc320

But the canonical_id column was nullable in the SQL table, which it should not be, right?
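The fragmentation problem can be sketched in plain Python, assuming the iterator assembles an entity from consecutive rows that share a canonical_id. The rows and IDs below are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, List, Tuple

rows: List[Dict[str, str]] = [
    {"entity_id": "a-1", "canonical_id": "NK-1", "value": "Jane Doe"},
    {"entity_id": "b-7", "canonical_id": "NK-2", "value": "Acme Ltd"},
    {"entity_id": "c-3", "canonical_id": "NK-1", "value": "J. Doe"},
]


def assemble(rows: List[Dict[str, str]], order_key: str) -> List[Tuple[str, List[str]]]:
    """Order rows by order_key, then group consecutive rows by canonical_id."""
    ordered = sorted(rows, key=itemgetter(order_key))
    return [
        (cid, [r["value"] for r in group])
        for cid, group in groupby(ordered, key=itemgetter("canonical_id"))
    ]


fragmented = assemble(rows, "entity_id")  # NK-1 appears twice
whole = assemble(rows, "canonical_id")  # NK-1 appears once, with both values
```

Ordering by entity_id interleaves the two source entities of NK-1 with NK-2, so the consecutive grouping yields NK-1 in two pieces.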

pudo · 2023-08-22T08:45:06Z

nomenklatura/store/sql.py

+
+    def _iterate_stmts(self, q: Select) -> Generator[Statement, None, None]:
+        with self.engine.connect() as conn:
+            conn = conn.execution_options(stream_results=True)


This has quite a bit of overhead. I wonder if we should pass an option into this func that can disable it for get_entity.
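One way to picture the suggested switch is a stream flag on the iteration helper. The sketch below uses sqlite3 as an analogue and does not touch SQLAlchemy's stream_results option from the diff; iterate_rows and its parameters are hypothetical names.

```python
import sqlite3
from typing import Any, Generator, Sequence, Tuple


def iterate_rows(
    conn: sqlite3.Connection,
    sql: str,
    params: Sequence[Any] = (),
    stream: bool = True,
    chunk: int = 1000,
) -> Generator[Tuple[Any, ...], None, None]:
    """Yield rows either chunk by chunk (large scans) or from one eager fetch."""
    cursor = conn.execute(sql, params)
    if not stream:
        # Small result sets, e.g. a single entity lookup: fetch everything at once.
        yield from cursor.fetchall()
        return
    while True:
        rows = cursor.fetchmany(chunk)
        if not rows:
            break
        yield from rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statement (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO statement (value) VALUES (?)", [(f"v{i}",) for i in range(5)]
)

sql = "SELECT id, value FROM statement ORDER BY id"
streamed = list(iterate_rows(conn, sql, stream=True, chunk=2))
eager = list(iterate_rows(conn, sql, stream=False))
```

Both modes return the same rows; the flag only changes how they are pulled from the cursor, which is the trade-off between full-table iteration and get_entity.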

simonwoerpel added 5 commits August 9, 2023 12:52

Fix leveldb store.pop()

c4f7503

SQL statement table: more index

33e5456

SQL: mess around with ensure_tx contextmanager

84600b6

Implement SQLStore for statements

ea0c427

Add tests for sqlstore vs all the stores

cebd4d7

simonwoerpel requested a review from pudo August 9, 2023 11:04

pudo reviewed Aug 10, 2023


Sql: Tweak statements query lookup

20d5daa

simonwoerpel marked this pull request as draft August 16, 2023 18:01

simonwoerpel added 5 commits August 20, 2023 13:10

Tests: Ensure same number of entities after upsert

5d9dd5a

Revert ensure_tx function to original

f968f90

Make statement table dt fields nullable

2d0dd85

Move pack_stmt function to serializers helper module

f2c60cf

Make mypy happy with simplified transaction context handling

2389094

simonwoerpel marked this pull request as ready for review August 21, 2023 09:33

simonwoerpel requested a review from pudo August 21, 2023 09:33

simonwoerpel added 2 commits August 21, 2023 11:42

pack_sql_statement: ensure iso dateformat

565dc10

Sql: fix resolver connected lookup

86f288d

pudo reviewed Aug 22, 2023


simonwoerpel added 3 commits August 22, 2023 20:01

Sql: make sql iteration streamable or not

8ed0544

Sql: order_by canonical_id and make canonical_id required in table

c1dc320

Sql: drop unnecessary batch_size in bulk writer

3d25cb4

pudo merged commit 4ef0ca5 into main Aug 23, 2023
3 checks passed

pudo deleted the feature/sql-store branch August 23, 2023 10:02
