Feature/sql store #125

Merged
merged 16 commits into from Aug 23, 2023

@simonwoerpel (Contributor) commented Aug 9, 2023

This implements a SQLStore for statements following the general Store interface as seen in MemoryStore and LevelDBStore.

There might be a few things to discuss, which is why there are several commits.

33e5456
Added SQL indexes for the prop, prop_type and schema columns. As all these values have relatively low cardinality, this should not be a massive overhead.

84600b6
I am not sure if this is the right way, but during the SQLite tests the transaction was not committed without the finally statement. But this breaks mypy -.-

c4f7503
A small fix; without it the LevelDBStore was not deleting all the statements it should.

cebd4d7
A test that checks that all the stores behave exactly the same ;) (they do.)

@simonwoerpel simonwoerpel requested a review from pudo August 9, 2023 11:04
Member

@pudo pudo left a comment

Very fucking cool, this looks awesome!

    with engine.begin() as conn:
        yield conn
finally:
    if conn is not None:
Member

Wait, are we committing after an exception here? That feels a bit unhealthy.

Contributor Author

Fixed it here: f968f90 and here: 2389094
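
The concern in this thread (a `finally` block commits even when the body raised) can be illustrated with a minimal sketch. It uses the standard-library `sqlite3` module as a stand-in for the SQLAlchemy engine in the PR, and the `transaction` helper and `stmt` table are hypothetical names, not code from the linked commits:

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def transaction(conn: sqlite3.Connection) -> Iterator[sqlite3.Connection]:
    """Commit when the block succeeds, roll back when it raises."""
    try:
        yield conn
    except Exception:
        conn.rollback()  # discard partial writes from the failed block
        raise
    else:
        conn.commit()  # only reached when no exception escaped the block


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stmt (id TEXT)")
conn.commit()

with transaction(conn) as tx:
    tx.execute("INSERT INTO stmt VALUES ('a')")

try:
    with transaction(conn) as tx:
        tx.execute("INSERT INTO stmt VALUES ('b')")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

rows = [r[0] for r in conn.execute("SELECT id FROM stmt ORDER BY id")]
```

Putting the commit in an `else` branch rather than in `finally` is what keeps the failed insert out of the table.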

data: dict[str, Any] = stmt.to_row()
data["target"] = as_bool(data["target"])
data["external"] = as_bool(data["external"])
data["first_seen"] = data["first_seen"] or datetime.utcnow()
Member

I don't think we should make up dates somewhere in the middle of the pipeline. That would lead to pretty random outcomes. Maybe we make those columns nullable instead?

Been struggling with the same issue: https://github.com/opensanctions/opensanctions/blob/main/zavod/zavod/tools/load_db.py#L42-L45

Contributor Author

Made it nullable: 2d0dd85

I don't know what side-effects this could have in other parts of your pipeline? Especially for existing statement tables... (migrations?)
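
As a rough illustration of the nullable-column approach discussed here, the sketch below stores an unknown `first_seen` as NULL instead of substituting the current time. It uses the standard-library `sqlite3` module and a made-up two-column table, not the actual statement schema, and the timestamp is illustrative data only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# first_seen has no NOT NULL constraint, so "unknown" can be stored as NULL.
conn.execute("CREATE TABLE statement (id TEXT PRIMARY KEY, first_seen TEXT)")
conn.execute("INSERT INTO statement VALUES (?, ?)", ("s1", "2023-08-09T11:04:00"))
conn.execute("INSERT INTO statement VALUES (?, ?)", ("s2", None))

# Reading back: the missing value stays None rather than becoming a fabricated date.
first_seen = dict(conn.execute("SELECT id, first_seen FROM statement"))
```

Consumers then have to handle `None` explicitly, which is presumably where the migration question for existing tables comes in.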

table = self.store.table
q = (
    select(table)
    .where(table.c.prop_type == "entity", table.c.value == id)
Member

I think it needs to check self.store.resolver.connected(id) not just id.
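
A small pure-Python sketch of the difference this makes, assuming a resolver whose `connected(id)` returns every ID merged into the same cluster. `ToyResolver`, the row layout and the IDs are hypothetical stand-ins, not the nomenklatura classes:

```python
from typing import Dict, List, Set, Tuple


class ToyResolver:
    """Hypothetical stand-in: maps an ID to its cluster of merged IDs."""

    def __init__(self, clusters: List[Set[str]]) -> None:
        self._by_id: Dict[str, Set[str]] = {}
        for cluster in clusters:
            for id_ in cluster:
                self._by_id[id_] = cluster

    def connected(self, id_: str) -> Set[str]:
        # An ID nobody has merged is only connected to itself.
        return self._by_id.get(id_, {id_})


# (entity_id, prop_type, value) rows, a simplified statement table.
rows: List[Tuple[str, str, str]] = [
    ("own-1", "entity", "acme-a"),
    ("own-2", "entity", "acme-b"),
    ("own-3", "entity", "other"),
    ("own-4", "name", "acme-a"),
]

resolver = ToyResolver([{"acme-a", "acme-b"}])


def referencing(id_: str, use_resolver: bool) -> List[str]:
    """Return entity IDs whose entity-typed values point at id_."""
    targets = resolver.connected(id_) if use_resolver else {id_}
    return [e for e, prop_type, value in rows if prop_type == "entity" and value in targets]


exact_only = referencing("acme-a", use_resolver=False)
with_cluster = referencing("acme-a", use_resolver=True)
```

Matching on the bare ID misses `own-2`, which references the same merged entity under a different source ID.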

Contributor Author

@pudo (Member) commented Aug 10, 2023

One extra thing we can consider is if any of the functions (pack_stmt, iterate_stmts) should live in nomenklatura.statement.db so they're re-usable outside of the store context.

@simonwoerpel simonwoerpel marked this pull request as draft August 16, 2023 18:01
@simonwoerpel (Contributor Author) commented

> One extra thing we can consider is if any of the functions (pack_stmt, iterate_stmts) should live in nomenklatura.statement.db so they're re-usable outside of the store context.

Found the serializers module a good place for it, but feel free to move it to somewhere else: f2c60cf

@simonwoerpel simonwoerpel marked this pull request as ready for review August 21, 2023 09:33
@simonwoerpel simonwoerpel requested a review from pudo August 21, 2023 09:33
Member

@pudo pudo left a comment

Looks amazing :)

def __init__(self, store: SqlStore[DS, CE]):
    self.store: SqlStore[DS, CE] = store
    self.batch: Optional[Set[Statement]] = None
    self.batch_size = 0
Member

Do we need batch_size if we have a batch array? I assume len(batch) is more precise.

Contributor Author

3d25cb4

(this originally came from my belief that int > int is more performant than len(foo) > int but gosh, this would never be the bottleneck in the overall sql store implementation) 😂 🙈
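
One possible reading of "more precise" is that a set drops duplicates, so a separately maintained counter can drift away from the real batch size. A minimal sketch under that assumption follows; `BatchWriter`, its threshold and the string IDs are illustrative, not the writer from the PR:

```python
from typing import List, Set


class BatchWriter:
    """Toy writer: the threshold and flush target are illustrative only."""

    BATCH_SIZE = 3  # assumed threshold, not the value used in the PR

    def __init__(self) -> None:
        self.batch: Set[str] = set()
        self.flushed: List[List[str]] = []

    def add(self, stmt_id: str) -> None:
        self.batch.add(stmt_id)
        # len() reflects deduplication in the set; a separate counter would not.
        if len(self.batch) >= self.BATCH_SIZE:
            self.flush()

    def flush(self) -> None:
        if self.batch:
            self.flushed.append(sorted(self.batch))
            self.batch = set()


writer = BatchWriter()
for stmt_id in ["a", "a", "b", "a", "c", "d"]:
    writer.add(stmt_id)
writer.flush()  # write out whatever is left at the end
```

With a counter incremented on every `add`, the first flush here would have fired after the third call, when the set still held only two distinct statements.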


def get_entity(self, id: str) -> Optional[CE]:
    table = self.store.table
    ids = [str(i) for i in self.store.resolver.connected(id)]
Member

Ah, interesting, so the idea here is that we're not believing the stmt.canonical_id in the table? While that works, it's a bit different from all the other store implementations we have now...

Contributor Author

hm, no, i guess i just misunderstood your comment here: #125 (comment)

q = (
    select(table)
    .where(table.c.dataset.in_(self.dataset_names))
    .order_by("entity_id")
Member

Does this need to be canonical_id? Otherwise it'll return fragmented entities, no?

Contributor Author

Totally right. Fixed here: c1dc320

But the canonical_id column was nullable in the SQL table, which it should not be, right?
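
Why the sort column matters can be sketched in plain Python, assuming entities are assembled by grouping consecutive rows that share a canonical_id. The rows and the `assemble` helper are made up for illustration:

```python
from itertools import groupby
from typing import Dict, List

rows: List[Dict[str, str]] = [
    {"entity_id": "a1", "canonical_id": "NK-1", "value": "Acme"},
    {"entity_id": "b7", "canonical_id": "NK-2", "value": "Other"},
    {"entity_id": "c3", "canonical_id": "NK-1", "value": "ACME Inc."},
]


def assemble(order_key: str) -> List[List[str]]:
    """Sort by order_key, then group adjacent rows into entities."""
    ordered = sorted(rows, key=lambda r: r[order_key])
    # groupby only merges *adjacent* rows that share a canonical_id.
    return [
        [r["value"] for r in group]
        for _, group in groupby(ordered, key=lambda r: r["canonical_id"])
    ]


by_entity_id = assemble("entity_id")
by_canonical_id = assemble("canonical_id")
```

Ordering by entity_id splits the `NK-1` entity into two fragments, while ordering by canonical_id keeps its statements together.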


def _iterate_stmts(self, q: Select) -> Generator[Statement, None, None]:
    with self.engine.connect() as conn:
        conn = conn.execution_options(stream_results=True)
Member

This has quite a bit of overhead. I wonder if we should pass an option into this func that can disable it for get_entity.

Contributor Author

@pudo pudo merged commit 4ef0ca5 into main Aug 23, 2023
3 checks passed
@pudo pudo deleted the feature/sql-store branch August 23, 2023 10:02
2 participants