AB-298: Add a feature to download an entire dataset in one click #267

Open
rsh7 wants to merge 1 commit into master from rsh7:download_ds

Conversation

@rsh7
Contributor

rsh7 commented Apr 6, 2018

This is a...

  • [ ] Bug Fix
  • [x] Feature addition
  • [ ] Refactoring
  • [ ] Minor / simple change (like a typo)
  • [ ] Other

Present case:
There is currently no support for downloading a complete dataset at once.

Solution:
I've added a feature which allows downloading an entire dataset of any user in one click. It creates a zip archive of all items in the dataset, with a folder for each class and, inside each folder, a JSON file of low-level data for each recording.

@paramsingh paramsingh self-assigned this May 4, 2018

@rsh7

Contributor Author

rsh7 commented Aug 20, 2018

@alastair @paramsingh Hi! If you have gone through this PR, you must have seen that right now I am creating a new directory locally, saving the zip files (of classes and datasets as JSON) there, and deleting them after the user downloads the respective file. Should we store them somewhere else instead?

@alastair
Contributor

alastair left a comment

Thanks for the patch! We should think a bit more about a more efficient way of generating and returning the zip file before continuing with this.

// function to redirect to download page
$(function() {
    $("a#download_dataset").click(function() {


@alastair

alastair Aug 23, 2018

Contributor

Instead of doing this in JavaScript, we should use the dataset id that we already have in the template (dataset['id']):

return render_template(
    "datasets/view.html",
    dataset=ds,
    author=db.user.get(ds["author"]),
)

We can add the href directly to the <a> tag in datasets/view.html
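
For example, something along these lines (the blueprint and endpoint names here are hypothetical, just to illustrate the idea; the real routing in the project may differ):

from flask import Blueprint

datasets_bp = Blueprint("datasets", __name__)

@datasets_bp.route("/<uuid:dataset_id>/download")
def download(dataset_id):
    # Build the zip archive for this dataset and return it (see the discussion below).
    raise NotImplementedError

The template datasets/view.html could then link to this endpoint directly, e.g. <a id="download_dataset" href="{{ url_for('datasets.download', dataset_id=dataset['id']) }}">Download</a>, and the JavaScript redirect goes away.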

utils.path.create_path(class_directory)
for recording in data["recordings"]:
    file_name = recording + '.json'
    low_level_data = json.dumps(db.data.load_low_level(recording), indent=4, sort_keys=True)


@alastair

alastair Aug 23, 2018

Contributor

We don't need to indent these data files or sort the keys

for recording in data["recordings"]:
    file_name = recording + '.json'
    low_level_data = json.dumps(db.data.load_low_level(recording), indent=4, sort_keys=True)
    with open(class_directory + '/' + file_name, "w+") as file:


@alastair

alastair Aug 23, 2018

Contributor

If we were continuing with this method, you should use os.path.join(class_directory, file_name) to create this path.
Also, 'w' would be a better mode - if two people try to download this dataset at the same time, we might end up with some files containing two copies of the data!

file_name = recording + '.json'
low_level_data = json.dumps(db.data.load_low_level(recording), indent=4, sort_keys=True)
with open(class_directory + '/' + file_name, "w+") as file:
    file.write(low_level_data)


@alastair

alastair Aug 23, 2018

Contributor

A more compact way to do this is json.dump(data, file).
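
Taken together, the three suggestions above (os.path.join, mode 'w', and plain json.dump without indent/sort_keys) would reduce the write loop to something like this sketch; class_directory, data, db and utils are the names already used in the diff:

import json
import os

utils.path.create_path(class_directory)
for recording in data["recordings"]:
    # Join the path safely and dump compact JSON straight to the file.
    with open(os.path.join(class_directory, recording + '.json'), "w") as f:
        json.dump(db.data.load_low_level(recording), f)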

containing low-level json data
"""
utils.path.create_path(zipped_dataset_dir)
dataset_dir = os.path.join(os.path.abspath(zipped_dataset_dir), ds["name"])


@alastair

alastair Aug 23, 2018

Contributor

You're right with your comment asking about creating this zip file. There are a few problems with doing it this way:

  • By using the name of the dataset for the folder/archive name, if two people download two different datasets with the same name at the same time, they might get a mix of files from both datasets in a single archive
  • We will end up with lots of zip files containing these datasets, and we don't clean them up
  • If a dataset is big, the view that creates this zip file might time out while waiting to load all of the low-level files.

I would prefer to perform this operation by streaming a zip file directly to the user's browser. This means that we don't need to create any temporary files, and we will always be streaming data to a client, which means that we won't have a problem with timeouts.

I know of two ways to do this. One which I've used before is to use mod_zip in nginx: https://github.com/evanmiller/mod_zip
The way you use it is to return a specially formatted text file to nginx, which will then create a zip file on the fly and return it to the user: https://github.com/MTG/freesound/blob/master/utils/downloads.py#L30-L54
This will require us to compile and maintain our own version of nginx, as the extension isn't installed by default. We need to check with @zas if he's OK with us running our own custom version of nginx.

An alternative is to stream the zip file from Flask. I've seen that there are a few packages which help with this: https://github.com/allanlei/python-zipstream. I haven't used this package before, so I don't know how efficient it is or how well it will work with large datasets.


@alastair

alastair Aug 23, 2018

Contributor

After talking with @zas a bit we decided that using python-zipstream is probably a better choice for us than mod_zip at the moment (it means that we can use operating system updates for any possible security updates in nginx).
We will need to monitor how long it takes to generate these zip files, especially for large datasets, as this operation will block a Flask backend while creating the zip file.
We can write log messages with the dataset id and start/end time of generating the zip file so that we can analyse it in the future.
If we find that it affects performance, we might have to move back to a method where we generate the zip file and then serve it, but we should consider generating the file in the background using a process worker like Celery.
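
To make this concrete, here is a rough sketch of how python-zipstream could be wired into a streaming Flask response, including the start/end logging mentioned above. The dataset structure (ds["classes"] entries with "name" and "recordings") and the surrounding view code are assumptions based on the snippets in this PR, not the final implementation:

import json
import logging
import time

import zipstream  # allanlei/python-zipstream
from flask import Response

import db.data  # existing helper used in the diff above (db.data.load_low_level)

logger = logging.getLogger(__name__)


def stream_dataset_zip(ds):
    """Stream a zip archive of a dataset without writing any temporary files."""
    archive = zipstream.ZipFile(mode="w", compression=zipstream.ZIP_DEFLATED)

    def lowlevel_json(mbid):
        # Loaded lazily: this runs only when the entry is actually streamed.
        yield json.dumps(db.data.load_low_level(mbid)).encode("utf-8")

    for cls in ds["classes"]:
        for mbid in cls["recordings"]:
            # Use the dataset id (not its name) in archive paths so they stay unambiguous.
            arcname = "%s/%s/%s.json" % (ds["id"], cls["name"], mbid)
            archive.write_iter(arcname, lowlevel_json(mbid))

    def generate():
        start = time.time()
        logger.info("Started zipping dataset %s", ds["id"])
        for chunk in archive:
            yield chunk
        logger.info("Finished zipping dataset %s in %.1fs", ds["id"], time.time() - start)

    response = Response(generate(), mimetype="application/zip")
    response.headers["Content-Disposition"] = "attachment; filename=%s.zip" % ds["id"]
    return response

Because write_iter only consumes its iterable while the archive is being streamed, the low-level documents are loaded one at a time as the client downloads, which keeps memory bounded even for large datasets.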

@rsh7 rsh7 force-pushed the rsh7:download_ds branch from 917fd53 to 0b17ebd Sep 3, 2018

@rsh7 rsh7 force-pushed the rsh7:download_ds branch from 0b17ebd to 6afef46 Sep 20, 2018

@rsh7 rsh7 force-pushed the rsh7:download_ds branch from 6afef46 to 6fbd548 Feb 15, 2019

@rsh7

Contributor Author

rsh7 commented Feb 15, 2019

Rebased to master!

@rsh7

Contributor Author

rsh7 commented Mar 4, 2019

Hi @alastair,
I am getting back to this PR after a long time. Should we work on it using python-zipstream, or is there another approach you have in mind?

@alastair

Contributor

alastair commented Mar 5, 2019

Let's go with python-zipstream.
