Purge static files on delete #2488

Merged: 44 commits, merged on Jul 21, 2020

Changes shown from 33 of 44 commits.

Commits:
3b17367 added root filename to resource mixin (Jun 9, 2020)
0b9e7b1 delete file from resource in api (Jun 9, 2020)
00aca38 add removal in purge task (Jun 9, 2020)
cb79291 test storage deletion (Jun 10, 2020)
2cf3da4 added verifications (Jun 10, 2020)
69762f5 add user avatar deletion (Jun 10, 2020)
267cfce add migration (Jun 11, 2020)
e9d49b8 added community resource (Jun 12, 2020)
94d73b2 complete migration script (Jun 15, 2020)
c40692f add counters in migration (Jun 15, 2020)
f5860b4 add community deletion (Jun 15, 2020)
245518b add tests (Jun 16, 2020)
6571ace add rest of tests (Jun 17, 2020)
a426a07 remove useless tabs (Jun 17, 2020)
37b7ffc update changelog (Jun 17, 2020)
0f8f397 Merge branch 'master' into purgeStaticFileOnDelete (quaxsze, Jun 17, 2020)
1ce174c Update CHANGELOG.md (quaxsze, Jun 17, 2020)
298d3df fix after comments (Jun 17, 2020)
69b26bc add community task (Jun 19, 2020)
1ff91e9 add task test (Jun 19, 2020)
1c83125 fix migration (Jun 19, 2020)
e661c6c update changelog (Jun 19, 2020)
987c912 Merge branch 'master' into purgeStaticFileOnDelete (abulte, Jun 24, 2020)
e207a77 Update CHANGELOG.md (quaxsze, Jul 1, 2020)
566362c Update CHANGELOG.md (quaxsze, Jul 1, 2020)
fc72b7d Merge branch 'master' into purgeStaticFileOnDelete (quaxsze, Jul 1, 2020)
571d244 checks url instead of filetype (Jul 1, 2020)
526dee3 Merge branch 'purgeStaticFileOnDelete' of github.com:quaxsze/udata in… (Jul 1, 2020)
e41e8ea remove log (Jul 1, 2020)
c726020 update migrations (Jul 3, 2020)
8de0475 add log (Jul 6, 2020)
c69bc65 trying to make migration faster (Jul 6, 2020)
3400c5e removed deletion from migration (Jul 8, 2020)
aeb593c Update CHANGELOG.md (quaxsze, Jul 15, 2020)
6b903b5 Update CHANGELOG.md (quaxsze, Jul 15, 2020)
57aa765 Merge branch 'master' into purgeStaticFileOnDelete (quaxsze, Jul 15, 2020)
1c5bd1d add get (Jul 15, 2020)
5f7fd77 Merge branch 'master' into purgeStaticFileOnDelete (quaxsze, Jul 16, 2020)
aceb761 add no cache to queryset (Jul 17, 2020)
fbc4cae Merge branch 'purgeStaticFileOnDelete' of github.com:quaxsze/udata in… (Jul 17, 2020)
558f3fe add timeout false (Jul 17, 2020)
e64d25a trying by batch (Jul 20, 2020)
a27ccc7 add no cache to batch (Jul 20, 2020)
e740e63 save now datasets (Jul 21, 2020)
CHANGELOG.md (5 additions, 1 deletion)

@@ -2,7 +2,11 @@

## Current (in progress)

-- Nothing yet
+- :warning: Deletion workflow changes [#2488](https://github.com/opendatateam/udata/pull/2488):
+  - Deleting a resource now triggers the deletion of the corresponding static file
+  - Deleting a dataset now triggers the deletion of the corresponding resources (including community resources) and their static files
+  - Adding a celery job `purge-orphan-community-resources` to remove community resources not linked to a dataset. This should be scheduled regularly; see the scheduling sketch below.
+  - :warning: Adding a migration file to populate the resources' new `fs_filename` field and to delete orphaned resource files
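For illustration, one way to schedule the new job with celery beat. A minimal sketch, assuming a standard `CELERYBEAT_SCHEDULE`-style setting; the cadence and the settings location are assumptions, not taken from this PR:

```python
from celery.schedules import crontab

# Hypothetical schedule entry: the task name matches the @job name
# registered in udata/core/dataset/tasks.py; the nightly cadence is
# only an example.
CELERYBEAT_SCHEDULE = {
    'purge-orphan-community-resources': {
        'task': 'purge-orphan-community-resources',
        'schedule': crontab(hour=4, minute=30),
    },
}
```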

## 2.1.3 (2020-06-29)

udata/core/dataset/api.py (7 additions)

@@ -361,6 +361,10 @@ def delete(self, dataset, rid):
        '''Delete a given resource on a given dataset'''
        ResourceEditPermission(dataset).test()
        resource = self.get_resource_or_404(dataset, rid)
+        # Deletes resource's file from file storage
+        if resource.fs_filename is not None:
+            storages.resources.delete(resource.fs_filename)
+
        dataset.resources.remove(resource)
        dataset.last_modified = datetime.now()
        dataset.save()
@@ -437,6 +441,9 @@ def put(self, community):
    def delete(self, community):
        '''Delete a given community resource'''
        ResourceEditPermission(community).test()
+        # Deletes community resource's file from file storage
+        if community.fs_filename is not None:
+            storages.resources.delete(community.fs_filename)
        community.delete()
        return '', 204
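For context, a sketch of exercising these endpoints over HTTP; the `/api/1/` prefix, the community-resource path and the identifiers are assumptions here, not confirmed by this diff:

```python
import requests

API = 'https://www.data.gouv.fr/api/1'  # hypothetical instance
HEADERS = {'X-API-KEY': 'my-api-key'}   # hypothetical credentials

# Deleting a resource now also purges its file from storage
requests.delete(f'{API}/datasets/my-dataset/resources/my-resource-id/',
                headers=HEADERS)

# Same for a community resource
requests.delete(f'{API}/datasets/community_resources/my-resource-id/',
                headers=HEADERS)
```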

udata/core/dataset/models.py (1 addition)

@@ -222,6 +222,7 @@ class ResourceMixin(object):
    format = db.StringField()
    mime = db.StringField()
    filesize = db.IntField()  # `size` is a reserved keyword for mongoengine.
+    fs_filename = db.StringField()
    extras = db.ExtrasField()

    created_at = db.DateTimeField(default=datetime.now, required=True)
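A sketch of what the new field holds, with made-up values: `url` stays the public address, while `fs_filename` is the path relative to the storage root, which is what `storages.resources.delete()` expects:

```python
# Illustrative values only; the actual layout of stored filenames is
# decided by the storage backend, not by this snippet.
resource = Resource(
    title='Population export',
    url='https://static.data.gouv.fr/resource/my-data/20200721-120000/export.csv',
    fs_filename='my-data/20200721-120000/export.csv',
)
```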
udata/core/dataset/tasks.py (27 additions, 3 deletions)

@@ -17,7 +17,7 @@
                          Organization)
from udata.tasks import job

-from .models import Dataset, Resource, UPDATE_FREQUENCIES, Checksum
+from .models import Dataset, Resource, CommunityResource, UPDATE_FREQUENCIES, Checksum

log = get_task_logger(__name__)

@@ -33,7 +33,7 @@ def flatten(iterable):
@job('purge-datasets')
def purge_datasets(self):
    for dataset in Dataset.objects(deleted__ne=None):
-        log.info('Purging dataset "{0}"'.format(dataset))
+        log.info(f'Purging dataset {dataset}')
        # Remove followers
        Follow.objects(following=dataset).delete()
        # Remove issues
@@ -49,10 +49,34 @@ def purge_datasets(self):
            topic.update(datasets=datasets)
        # Remove HarvestItem references
        HarvestJob.objects(items__dataset=dataset).update(set__items__S__dataset=None)
-        # Remove
+        # Remove each dataset's resource's file
+        storage = storages.resources
+        for resource in dataset.resources:
+            if resource.fs_filename is not None:
+                storage.delete(resource.fs_filename)
+        # Remove each dataset-related community resource and its file
+        community_resources = CommunityResource.objects(dataset=dataset)
+        for community_resource in community_resources:
+            if community_resource.fs_filename is not None:
+                storage.delete(community_resource.fs_filename)
+            community_resource.delete()
+        # Remove dataset
        dataset.delete()
+
+
+@job('purge-orphan-community-resources')
+def purge_orphan_community_resources(self):
+    '''
+    Gets community resources not linked to a dataset
+    and deletes them along with their files.
+    '''
+    community_resources = CommunityResource.objects(dataset=None)
+    for community_resource in community_resources:
+        if community_resource.fs_filename is not None:
+            storages.resources.delete(community_resource.fs_filename)
+        community_resource.delete()
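A sketch of how the orphan purge can be exercised, in the spirit of the tests added by this PR; the factory names, module paths and `clean_db` fixture are assumptions, not the PR's actual test code:

```python
from udata.core.dataset.factories import CommunityResourceFactory, DatasetFactory
from udata.core.dataset.tasks import purge_orphan_community_resources
from udata.models import CommunityResource


def test_purge_orphan_community_resources(clean_db):
    # One orphan (no dataset) and one legitimately linked resource
    CommunityResourceFactory(dataset=None)
    linked = CommunityResourceFactory(dataset=DatasetFactory())

    purge_orphan_community_resources()

    # Only the linked resource survives
    assert CommunityResource.objects.count() == 1
    assert CommunityResource.objects.first().id == linked.id
```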


@job('send-frequency-reminder')
def send_frequency_reminder(self):
    # We exclude irrelevant frequencies.
udata/core/organization/tasks.py (9 additions, 1 deletion)

@@ -1,5 +1,6 @@
from udata import mail
from udata.i18n import lazy_gettext as _
+from udata.core import storages
from udata.models import Follow, Activity, Dataset
from udata.search import reindex
from udata.tasks import job, task, get_logger
@@ -14,14 +15,21 @@
@job('purge-organizations')
def purge_organizations(self):
    for organization in Organization.objects(deleted__ne=None):
-        log.info('Purging organization "{0}"'.format(organization))
+        log.info(f'Purging organization {organization}')
        # Remove followers
        Follow.objects(following=organization).delete()
        # Remove activity
        Activity.objects(related_to=organization).delete()
        Activity.objects(organization=organization).delete()
        # Store datasets for later reindexation
        d_ids = [d.id for d in Dataset.objects(organization=organization)]
+        # Remove organization's logo in all sizes
+        if organization.logo.filename is not None:
+            storage = storages.avatars
+            storage.delete(organization.logo.filename)
+            storage.delete(organization.logo.original)
+            for key, value in organization.logo.thumbnails.items():
+                storage.delete(value)
        # Remove
        organization.delete()
        # Reindex the datasets that were linked to the organization
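The same three-step cleanup (main file, original, thumbnails) reappears below for reuse images and user avatars. A hedged sketch of the pattern as a shared helper, relying only on the image-field attributes used above; `delete_image_files` is hypothetical, not part of the PR:

```python
def delete_image_files(storage, image):
    '''Delete an image field's file, its original, and all its thumbnails.'''
    if image.filename is None:
        return
    storage.delete(image.filename)
    storage.delete(image.original)
    for thumbnail in image.thumbnails.values():
        storage.delete(thumbnail)
```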
udata/core/reuse/tasks.py (9 additions, 1 deletion)

@@ -1,5 +1,6 @@
from udata import mail
from udata.i18n import lazy_gettext as _
+from udata.core import storages
from udata.models import Activity, Issue, Discussion, Follow
from udata.tasks import get_logger, job, task

@@ -11,7 +12,7 @@
@job('purge-reuses')
def purge_reuses(self):
    for reuse in Reuse.objects(deleted__ne=None):
-        log.info('Purging reuse "{0}"'.format(reuse))
+        log.info(f'Purging reuse {reuse}')
        # Remove followers
        Follow.objects(following=reuse).delete()
        # Remove issues
@@ -20,6 +21,13 @@ def purge_reuses(self):
        Discussion.objects(subject=reuse).delete()
        # Remove activity
        Activity.objects(related_to=reuse).delete()
+        # Remove reuse's image in all sizes
+        if reuse.image.filename is not None:
+            storage = storages.images
+            storage.delete(reuse.image.filename)
+            storage.delete(reuse.image.original)
+            for key, value in reuse.image.thumbnails.items():
+                storage.delete(value)
        reuse.delete()


udata/core/storages/api.py (10 additions, 6 deletions)

@@ -135,20 +135,24 @@ def handle_upload(storage, prefix=None):
        if uploaded_file:
            save_chunk(uploaded_file, args)
        else:
-            filename = combine_chunks(storage, args, prefix=prefix)
+            fs_filename = combine_chunks(storage, args, prefix=prefix)
    elif not uploaded_file:
        raise UploadError('Missing file parameter')
    else:
        # Normalize filename including extension
        filename = utils.normalize(uploaded_file.filename)
-        filename = storage.save(uploaded_file, prefix=prefix,
-                                filename=filename)
-
-    metadata = storage.metadata(filename)
+        fs_filename = storage.save(
+            uploaded_file,
+            prefix=prefix,
+            filename=filename
+        )
+
+    metadata = storage.metadata(fs_filename)
+    metadata['fs_filename'] = fs_filename
    checksum = metadata.pop('checksum')
    algo, checksum = checksum.split(':', 1)
    metadata[algo] = checksum
-    metadata['format'] = utils.extension(filename)
+    metadata['format'] = utils.extension(fs_filename)
    return metadata
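Illustratively, the dict returned by `handle_upload` now carries the storage-relative path alongside the usual fields; the exact keys and the values below are made up for the example:

```python
metadata = {
    'fs_filename': 'my-data/20200721-120000/export.csv',  # new with this PR
    'sha1': 'da39a3ee5e6b4b0d3255bfef95601890afd80709',   # from the 'sha1:...' checksum
    'format': 'csv',
    'size': 1024,
}
```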


udata/core/user/api.py (7 additions)

@@ -3,6 +3,7 @@

from udata import search
from udata.api import api, API
+from udata.core import storages
from udata.auth import admin_permission
from udata.models import CommunityResource, Dataset, Reuse, User

@@ -298,6 +299,12 @@ def delete(self, user):
        if user == current_user._get_current_object():
            api.abort(403, 'You cannot delete yourself with this API. ' +
                      'Use the "me" API instead.')
+        if user.avatar.filename is not None:
+            storage = storages.avatars
+            storage.delete(user.avatar.filename)
+            storage.delete(user.avatar.original)
+            for key, value in user.avatar.thumbnails.items():
+                storage.delete(value)
        user.mark_as_deleted()
        return '', 204

udata/migrations/2019-05-09-harvest-items-deleted-datasets.js (35 deletions)
This file was deleted.

udata/migrations/2019-07-17-delete-permitted-reuses.js (19 deletions)
This file was deleted.

udata/migrations/2019-07-23-reversed-date-range.js (28 deletions)
This file was deleted.

udata/migrations/2019-09-09-dataset-private-none-to-false.js (15 deletions)
This file was deleted.
udata/migrations/2020-06-11-add-resource-fs-filename.py (new file, 41 additions)

@@ -0,0 +1,41 @@
'''
The purpose here is to fill every resource with a fs_filename string field.
'''
import logging
from urllib.parse import urlparse

from udata.models import Dataset, CommunityResource

log = logging.getLogger(__name__)


def migrate(db):
    log.info('Processing resources.')

    datasets = Dataset.objects()
Review comment (Contributor):

Suggested change:
-    datasets = Dataset.objects()
+    datasets = Dataset.objects().no_cache()

Using this may avoid the current migration failure on demo. Use it on every query set below too.
    for dataset in datasets:
        for resource in dataset.resources:
            if resource.url.startswith('https://static.data.gouv.fr'):
                parsed = urlparse(resource.url)
                fs_name = parsed.path.strip('/resource/')
                resource.fs_filename = fs_name
                try:
                    resource.save()
                except Exception as e:
                    log.warning(e)
                    pass

    log.info('Processing community resources.')

    community_resources = CommunityResource.objects()
    for community_resource in community_resources:
        parsed = urlparse(community_resource.url)
        fs_name = parsed.path.strip('/resource/')
        community_resource.fs_filename = fs_name
        try:
            community_resource.save()
        except Exception as e:
            log.warning(e)
            pass

    log.info('Completed.')
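One caveat for anyone reusing this snippet: `str.strip('/resource/')` strips any run of the characters `/`, `r`, `e`, `s`, `o`, `u`, `c` from both ends of the string, not the literal `/resource/` prefix, so stored filenames ending in one of those characters get truncated. A prefix slice avoids this; a quick illustrative check:

```python
from urllib.parse import urlparse

url = 'https://static.data.gouv.fr/resource/my-data/20200612-120000/notes.doc'
path = urlparse(url).path  # '/resource/my-data/20200612-120000/notes.doc'

path.strip('/resource/')   # 'my-data/20200612-120000/notes.d'  (truncated!)
path[len('/resource/'):]   # 'my-data/20200612-120000/notes.doc'
```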