Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Purge static files on delete #2488

Merged
merged 44 commits into from Jul 21, 2020

Conversation

quaxsze
Copy link
Contributor

@quaxsze quaxsze commented Jun 9, 2020

No description provided.

@quaxsze quaxsze changed the title Purge static file on delete WIP: Purge static file on delete Jun 10, 2020
@quaxsze
Copy link
Contributor Author

quaxsze commented Jun 10, 2020

After testing it is possible to create resource with distant link instead of local fs file, as well as an organization or reuse without a logo image. I added try except on delete or None verification.

@quaxsze quaxsze changed the title WIP: Purge static file on delete Purge static file on delete Jun 11, 2020
@quaxsze quaxsze requested a review from abulte June 11, 2020 19:28
@quaxsze quaxsze changed the title Purge static file on delete Purge static files on delete Jun 15, 2020
Copy link
Contributor

@abulte abulte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid work 💪 A few comments below :p

There's one more case we need to handle I think: community resources files not associated to a dataset — this happens regularly because our creation process is f****. So we should:

  • either strengthen the upload process: if a dataset is not selected, delete the file (not sure we can do that, it's pretty stateless)
  • make a purge scheduled script for community resources files

CHANGELOG.md Outdated Show resolved Hide resolved
udata/core/dataset/api.py Outdated Show resolved Hide resolved
udata/core/dataset/api.py Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
udata/core/dataset/tasks.py Show resolved Hide resolved
udata/migrations/2020-06-11-add-resource-fs-filename.py Outdated Show resolved Hide resolved
udata/migrations/2020-06-11-add-resource-fs-filename.py Outdated Show resolved Hide resolved
udata/migrations/2020-06-11-add-resource-fs-filename.py Outdated Show resolved Hide resolved
udata/migrations/2020-06-11-add-resource-fs-filename.py Outdated Show resolved Hide resolved
udata/migrations/2020-06-11-add-resource-fs-filename.py Outdated Show resolved Hide resolved
quaxsze and others added 3 commits June 17, 2020 16:56
Co-authored-by: Alexandre Bulté <alexandre@bulte.net>
Copy link
Contributor

@abulte abulte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
udata/migrations/2020-06-11-add-resource-fs-filename.py Outdated Show resolved Hide resolved
resource = get_resource(key)
resource.fs_filename = fs_filename
resource.save()
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe delete the resource from the index at this point, but totally optional.

Copy link
Contributor

@abulte abulte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 what about when we replace a ressource? not sure we handle that case.

@quaxsze
Copy link
Contributor Author

quaxsze commented Jul 13, 2020

Script creatinf indexes for deletion

def main():
    app = create_app()
    init_app(app)
    with app.app_context():
        print('Processing resources and community resources.')

        datasets = Dataset.objects()
        community_resources = CommunityResource.objects()
        with open('resource_fs.txt', 'w') as f:
            for dataset in datasets:
                for resource in dataset.resources:
                    if resource.url.startswith('https://static.data.gouv.fr'):
                        fs_name = resource.url[38:]
                        f.write(f'{fs_name}\n')

            for community_resource in community_resources:
                if community_resource.url.startswith('https://static.data.gouv.fr'):
                    fs_name = community_resource.url[38:]
                    f.write(f'{fs_name}\n')

        print('Processing organizations logos and users avatars.')
        orgs = Organization.objects()
        users = User.objects()
        with open('avatar_fs.txt', 'w') as f:
            for org in orgs:
                if org.logo.filename:
                    split_filename = org.logo.filename.split('.')
                    f.write(f'{split_filename[0]}\n')

            for user in users:
                if user.avatar.filename:
                    split_filename = user.avatar.filename.split('.')
                    f.write(f'{split_filename[0]}\n')

        print('Processing reuses logos.')
        reuses = Reuse.objects()
        with open('image_fs.txt', 'w') as f:
            for reuse in reuses:
                if reuse.image.filename:
                    split_filename = reuse.image.filename.split('.')
                    f.write(f'{split_filename[0]}\n')

        print('Completed.')


if __name__ == "__main__":
    main()

Deletion bash script

#!/bin/bash

resources_path="instance/fs/resources"
avatars_path="instance/fs/avatars"
images_path="instance/fs/images"

echo "Processing resources"
resources_counter=0
for f in $(find -type f "$resources_path")
do
    foo=${f#"$resources_path/"}
    if ! grep -Fxq "$foo" resource_fs.txt
    then
        echo $f >> resources_deletion.txt
        rm $f
        (( resources_counter++ ))
    fi
done

echo "Resources completed"
echo "Resources deletion: ${resources_counter}"

echo "Processing avatars"
avatars_counter=0
for f in $(find -type f "$avatars_path")
do
    match=0
    while read p
    do
        if  [[ $f == *$p* ]]
        then
            match=1
            break
        fi
    done < avatar_fs.txt
    if [ $match -eq 0 ]
    then
        echo $f >> avatars_deletion.txt
        rm $f
        (( avatars_counter++ ))
    fi
done
echo "Avatars completed"
echo "Avatars deletion: ${avatars_counter}"

echo "Processing images"
images_counter=0
for f in $(find -type f "$images_path")
do
    match=0
    while read p
    do
        if  [[ $f == *$p* ]]
        then
            match=1
            break
        fi
    done < image_fs.txt
    if [ $match -eq 0 ]
    then
        echo $f >> images_deletion.txt
        rm $f
        (( images_counter++ ))
    fi
done
echo "Images completed"
echo "Images deletion: ${images_counter}"

@quaxsze
Copy link
Contributor Author

quaxsze commented Jul 13, 2020

Resources completed
Resources deletion: 82783
Processing avatars
Avatars completed
Avatars deletion: 17854
Processing images
Images completed
Images deletion: 7420

real    29m53.011s
user    22m29.135s
sys     6m31.937s

CHANGELOG.md Outdated Show resolved Hide resolved
def migrate(db):
log.info('Processing resources resources.')

datasets = Dataset.objects()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
datasets = Dataset.objects()
datasets = Dataset.objects().no_cache()

Using this may avoid the current migration failure on demo. Use it on every query set below too.

@quaxsze quaxsze merged commit 8bf4c98 into opendatateam:master Jul 21, 2020
@quaxsze quaxsze deleted the purgeStaticFileOnDelete branch July 21, 2020 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants