
AB-298: Add a feature to download an entire dataset in one click #267

Merged
merged 12 commits into metabrainz:master from rsh7:download_ds on Jun 29, 2021

Conversation


@rsh7 rsh7 commented Apr 6, 2018

This is a...

  • Bug Fix
  • Feature addition
  • Refactoring
  • Minor / simple change (like a typo)
  • Other

Present case:
No support for downloading the complete dataset at once.

Solution:
I've added a feature which allows downloading an entire dataset of any user in one click. It creates a zip archive of all items in the dataset, with a folder for each class and, inside each folder, a JSON file of low-level data for each recording.

@paramsingh paramsingh self-assigned this May 4, 2018

@rsh7 rsh7 commented Aug 20, 2018

Hi @alastair, @paramsingh! If you have gone through this PR you will have seen that right now I am making a new directory locally, saving the zip files (of classes and datasets as JSON) there, and deleting them after the user downloads the respective file. Should we save them somewhere else instead?


@alastair alastair left a comment

Thanks for the patch! We should think a bit more about a more efficient way of generating and returning the zip file before continuing with this.

// function to redirect to download page

$(function() {
    $("a#download_dataset").click(function() {


@alastair alastair commented Aug 23, 2018

Instead of doing this in JavaScript we should use the dataset id that we already have in the template file (dataset['id']):

return render_template(
    "datasets/view.html",
    dataset=ds,
    author=db.user.get(ds["author"]),
)

We can add the href directly to the <a> tag in datasets/view.html
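
As a rough sketch of that suggestion (assuming a hypothetical datasets.download endpoint name, used here only for illustration), the view can hand the download URL to the template and the template can put it straight into the <a> tag, with no JavaScript click handler:

from flask import render_template, url_for

import db.user  # same helper the quoted view already uses


def view_dataset(ds):
    return render_template(
        "datasets/view.html",
        dataset=ds,
        author=db.user.get(ds["author"]),
        # in datasets/view.html:
        #   <a id="download_dataset" href="{{ download_url }}">Download</a>
        download_url=url_for("datasets.download", dataset_id=ds["id"]),
    )

Alternatively, the template could call url_for with dataset['id'] directly.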

utils.path.create_path(class_directory)
for recording in data["recordings"]:
    file_name = recording + '.json'
    low_level_data = json.dumps(db.data.load_low_level(recording), indent=4, sort_keys=True)


@alastair alastair commented Aug 23, 2018

We don't need to indent these data files or sort the keys

for recording in data["recordings"]:
    file_name = recording + '.json'
    low_level_data = json.dumps(db.data.load_low_level(recording), indent=4, sort_keys=True)
    with open(class_directory + '/' + file_name, "w+") as file:


@alastair alastair commented Aug 23, 2018

If we were continuing with this method, you should use os.path.join(class_directory, file_name) to create this path.
Also, 'w' would be a better mode: if two people try to download this dataset at the same time then we might end up with some files containing two copies of the data!

file_name = recording + '.json'
low_level_data = json.dumps(db.data.load_low_level(recording), indent=4, sort_keys=True)
with open(class_directory + '/' + file_name, "w+") as file:
    file.write(low_level_data)


@alastair alastair commented Aug 23, 2018

A more compact way to do this is json.dump(data, file).
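
Pulling these suggestions together (os.path.join for the path, 'w' mode, and json.dump straight to the file handle), the loop might look roughly like this; it is only a sketch of the reviewed code, not the final implementation:

import json
import os

import db.data  # same low-level loader the quoted code already uses


def dump_class_recordings(class_directory, data):
    # One plain JSON file per recording; no indentation or sorted keys needed
    for recording in data["recordings"]:
        file_path = os.path.join(class_directory, recording + ".json")
        with open(file_path, "w") as f:
            json.dump(db.data.load_low_level(recording), f)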

containing low-level json data
"""
utils.path.create_path(zipped_dataset_dir)
dataset_dir = os.path.join(os.path.abspath(zipped_dataset_dir), ds["name"])


@alastair alastair commented Aug 23, 2018

You're right with your comment asking about creating this zip file. There are a few problems with doing it this way:

  • By using the name of the dataset for the folder name/archive name, if two people download two different datasets with the same name at the same time, they might get a mix of files from both datasets in a single archive
  • We will end up with lots of zip files containing these datasets, and we don't clean them up
  • If a dataset is big, the view that creates this zip file might time out while waiting to load all of the low-level files.

I would prefer to perform this operation by streaming a zip file directly to the user's browser. This means that we don't need to create any temporary files, and we will always be streaming data to a client, which means that we won't have a problem with timeouts.

I know of two ways to do this. One which I've used before is to use mod_zip in nginx: https://github.com/evanmiller/mod_zip
The way you use it is to create a specially formatted text file and return it to nginx, which will then create a zip-file in real-time and return it to the user: https://github.com/MTG/freesound/blob/master/utils/downloads.py#L30-L54
This will require us to compile and maintain our own version of nginx, as the extension isn't installed by default. We need to check with @zas if he's OK with us running our own custom version of nginx.

An alternative is to stream the zip file from flask. I've seen that there are a few packages which help doing this: https://github.com/allanlei/python-zipstream. I've not used this package before so I don't know how efficient it is, or how well it will work with large datasets.


@alastair alastair commented Aug 23, 2018

After talking with @zas a bit we decided that using python-zipstream is probably a better choice for us than mod_zip at the moment (it means that we can use operating system updates for any possible security updates in nginx).
We will need to monitor how long it takes to generate these zip files, especially for large datasets, as this operation will block a flask backend while creating the zip file.
We can write log messages with the dataset id and start/end time of generating the zip file so that we can analyse it in the future.
If we find that it affects performance we might have to move back to a method where we generate the zip file and then serve it, but we should consider generating the file in the background using a process worker like celery.
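
A minimal sketch of how the python-zipstream approach could look in a Flask view, including the start/end log messages mentioned above. The folder-per-class layout mirrors the quoted code, but the structure of the classes argument and the exact names are assumptions, not the final implementation:

import json
import logging
import time

import zipstream
from flask import Response

import db.data  # same low-level loader the quoted code uses

logger = logging.getLogger(__name__)


def stream_dataset_zip(ds, classes):
    """Stream a zip of a dataset to the client without creating temporary files.

    `classes` is assumed to be a list of {"name": ..., "recordings": [...]} dicts.
    """
    z = zipstream.ZipFile(mode="w", compression=zipstream.ZIP_DEFLATED)

    def one_recording(mbid):
        # evaluated lazily, only when this archive entry is streamed to the client
        yield json.dumps(db.data.load_low_level(mbid)).encode("utf-8")

    for cls in classes:
        for mbid in cls["recordings"]:
            arcname = "%s/%s/%s.json" % (ds["name"], cls["name"], mbid)
            z.write_iter(arcname, one_recording(mbid))

    def generate():
        start = time.time()
        logger.info("dataset %s: starting zip generation", ds["id"])
        for chunk in z:
            yield chunk
        logger.info("dataset %s: zip generated in %.1fs", ds["id"], time.time() - start)

    response = Response(generate(), mimetype="application/zip")
    response.headers["Content-Disposition"] = 'attachment; filename="%s.zip"' % ds["name"]
    return response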

@rsh7 rsh7 force-pushed the rsh7:download_ds branch from 917fd53 to 0b17ebd Sep 3, 2018
@rsh7 rsh7 force-pushed the rsh7:download_ds branch from 0b17ebd to 6afef46 Sep 20, 2018
@rsh7 rsh7 force-pushed the rsh7:download_ds branch from 6afef46 to 6fbd548 Feb 15, 2019

@rsh7 rsh7 commented Feb 15, 2019

Rebased to master!


@rsh7 rsh7 commented Mar 4, 2019

Hi @alastair,
I am getting back to this PR after a long time. Should we work on it using python-zipstream, or is there another way you have thought of?


@alastair alastair commented Mar 5, 2019

Let's go with python-zipstream.

@alastair alastair force-pushed the rsh7:download_ds branch from 6fbd548 to 9dab214 Aug 19, 2020

@alastair alastair commented Aug 19, 2020

I've updated this to build a zip of the dataset in memory before serving it to the client. However, I'd like to make some additional improvements before we merge it:

  • Use a streaming library to not keep the whole dataset in memory. Unfortunately python-zipstream is GPL3, so we need to double-check the license compatibility or use something else
  • Don't load all recordings for a dataset at once. Instead bundle them into chunks of 100 or 1000 so that we don't try to pass too many args to the bulk get method (see the sketch after this list)
  • Return text instead of dictionaries from the bulk get method so that we don't have to json.dump them
  • Include a readme file in the zip which contains information about the dataset and who made it
  • Include the CSV file of the dataset metadata in the zip
  • If we can't stream the zip file, or if large files take too long to download and cause a timeout, add a limit so that datasets with more than x files can't be downloaded. Maybe 10,000 or 100,000 is a good initial threshold.
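
A rough illustration of the chunking idea from the list above; bulk_get stands in for whatever bulk get method the database layer exposes (the name is assumed):

def iter_recordings_in_chunks(recordings, bulk_get, chunk_size=1000):
    """Yield (mbid, document) pairs, fetching at most chunk_size recordings per call."""
    for i in range(0, len(recordings), chunk_size):
        chunk = recordings[i:i + chunk_size]
        for mbid, document in bulk_get(chunk).items():
            yield mbid, document
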
@amCap1712 amCap1712 force-pushed the rsh7:download_ds branch from 9dab214 to 5b3e75d Jun 2, 2021
amCap1712 added 4 commits Jun 2, 2021
Datasets with a large number of recordings take a lot of time to
fetch and to generate the zip file. This can cause timeouts, and we cannot
increase timeouts on gateways as that can lead to consuming too many
resources. Hence, set an upper limit on the size of datasets that can
be downloaded.
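
A small sketch of the guard this commit describes; MAX_DATASET_DOWNLOAD_SIZE and the classes structure are assumptions for illustration only:

MAX_DATASET_DOWNLOAD_SIZE = 10000  # assumed threshold, in number of recordings


def dataset_too_large(classes):
    # True if the dataset has more recordings than we allow downloading
    return sum(len(cls["recordings"]) for cls in classes) > MAX_DATASET_DOWNLOAD_SIZE
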
@amCap1712 amCap1712 force-pushed the rsh7:download_ds branch from cd7adf0 to 5b7aab7 Jun 3, 2021
amCap1712 and others added 2 commits Jun 18, 2021
Show the "download dataset" button even if it's too large, showing a
popup that explains why it can't be used
@alastair alastair merged commit 6a20c1b into metabrainz:master Jun 29, 2021
1 of 2 checks passed