Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a command to delete orphan data #340

Closed
jpmckinney opened this issue Aug 24, 2021 · 1 comment
Closed

Add a command to delete orphan data #340

jpmckinney opened this issue Aug 24, 2021 · 1 comment
Labels
feature Relating to loading data from the web API or CLI command
Milestone

Comments

@jpmckinney
Copy link
Member

jpmckinney commented Aug 24, 2021

Old version:

def delete_orphan_data(self):
self._delete_orphan_data_data()
self._delete_orphan_data_package_data()
def _delete_orphan_data_data(self):
data_get = {}
sql_get = """
SELECT data.id
FROM data
LEFT JOIN release ON release.data_id = data.id
LEFT JOIN record ON record.data_id = data.id
LEFT JOIN compiled_release ON compiled_release.data_id = data.id
WHERE release.data_id IS NULL AND record.data_id IS NULL AND compiled_release.data_id IS NULL
LIMIT 10000;
"""
logger = logging.getLogger('ocdskingfisher.database.delete-collection')
logger.debug("Deleting data")
while True:
with self.get_engine().begin() as connection:
ids_to_delete = [str(row['id'])
for row in connection.execute(sa.text(sql_get), data_get)]
if len(ids_to_delete) == 0:
return
connection.execute(
sa.text("DELETE FROM data WHERE id IN (" + ",".join(ids_to_delete) + ")"),
{}
)
def _delete_orphan_data_package_data(self):
sql_get = """
SELECT package_data.id
FROM package_data
LEFT JOIN release ON release.package_data_id = package_data.id
LEFT JOIN record ON record.package_data_id = package_data.id
WHERE release.package_data_id IS NULL AND record.package_data_id IS NULL
LIMIT 10000;
"""
logger = logging.getLogger('ocdskingfisher.database.delete-collection')
logger.debug("Deleting package_data")
while True:
with self.get_engine().begin() as connection:
ids_to_delete = [row['id'] for row in connection.execute(sa.text(sql_get))]
if not ids_to_delete:
return
connection.execute(
sa.text("DELETE FROM package_data WHERE id IN :ids"),
ids=tuple(ids_to_delete)
)

The queries in that version are slow. I had optimized a similar-looking query in these commits: https://github.com/open-contracting/kingfisher-process/compare/9b51911..667d8be

This is not as necessary for the data registry, where new copies are regularly created. It's more relevant to the helpdesk server, where data is more temporary.

@jpmckinney jpmckinney added this to the V3 milestone Aug 24, 2021
@jpmckinney jpmckinney modified the milestones: V3, V2 (Django) Nov 26, 2021
@jpmckinney
Copy link
Member Author

jpmckinney commented Nov 26, 2021

For the data-registry, this results in data and package_data never being cleared (though as mentioned above, they would likely be repopulated with the same data).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Relating to loading data from the web API or CLI command
Projects
None yet
Development

No branches or pull requests

1 participant