
Portal Download - bulk #354

Closed
ssarrafan opened this issue May 5, 2021 · 18 comments · Fixed by #403 or #411

@ssarrafan

Researchers who are interested in raw data generally want to do bulk downloads (to a server/cloud) for custom analyses. Researchers who are interested in data products may want to download to a server or to their local computer for custom analysis or sending to KBase. Researchers are also interested in downloading the associated sample metadata only.

Priority - medium
Urgency - low

@ssarrafan ssarrafan created this issue from a note in NMDC May 2021 Sprint (To do) May 5, 2021
@jbeezley jbeezley added the X LARGE More than 10 days label May 5, 2021
@jbeezley commented May 5, 2021

Marking this as X-Large for now, but it could be quicker depending on the technical solution.

First step here is to decide how to implement the feature. Some of the options include:

  1. Streaming download as a zip file
  2. Email a link to a zip file that is asynchronously created
  3. Provide a shell (or Windows .bat) script to download all of the requested files

The second is the most user-friendly, stable, and scalable, but will require the most engineering effort. The first is probably a Large task, but brings with it a number of stability and scalability issues. The third is easiest (it could be done entirely client side; a rough sketch follows below), but users may have issues running a script.

If we decide to go with the second option, I would want to break this into subtasks.
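
For reference, a rough sketch of what the option-3 script could look like. It is shown in Python purely for illustration (the real deliverable might be a shell or .bat script), and the file list and URLs are hypothetical placeholders generated from the user's selection, not actual portal endpoints:

```python
# Hypothetical generated downloader: fetch each selected data object into a
# local folder structure. The paths and URLs below are placeholders.
import urllib.request
from pathlib import Path

FILES = {
    # archive path -> download URL (would be filled in from the user's selection)
    "study-1/sample-A/assembly_contigs.fna": "https://example.org/data/obj-123",
    "study-1/sample-A/functional_annotation.gff": "https://example.org/data/obj-124",
}

for rel_path, url in FILES.items():
    dest = Path(rel_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    print(f"Downloading {url} -> {dest}")
    urllib.request.urlretrieve(url, str(dest))
```

The same logic translates directly to a curl-based shell script, which avoids requiring a Python install but is harder to make robust on Windows.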

@ssarrafan (Author)

@jbeezley can we discuss option 1 at the tech sync or the Kitware sync meetings this week? @emileyfadrosh would love it if this could be done this sprint and thinks the users would prefer option 1.

@jbeezley

I think mod_zip would be a good candidate for generating the streaming responses without incurring the cost on the API server.
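
As a rough illustration of how that could work: the API returns a mod_zip manifest (one "crc size location name" line per file) and nginx assembles the zip stream itself. This sketch assumes a FastAPI endpoint and a hypothetical `get_data_objects` lookup; the route and names are illustrative, not the actual nmdc-server API.

```python
# Sketch only: the API returns a mod_zip manifest instead of file contents,
# so the zip is streamed by nginx rather than buffered by the API server.
from typing import List

from fastapi import FastAPI, Query
from fastapi.responses import PlainTextResponse

app = FastAPI()

def get_data_objects(ids: List[str]):
    # Hypothetical lookup; a real implementation would query the database.
    # Returns (size_bytes, internal_location, archive_name) per data object.
    return [(1024, f"/protected/{i}", f"{i}.fna") for i in ids]

@app.get("/api/bulk-download")
def bulk_download(ids: List[str] = Query(...)):
    # Each manifest line is "<crc-32> <size> <location> <name>"; "-" tells
    # mod_zip to skip CRC pre-computation.
    manifest = "".join(
        f"- {size} {location} {name}\n"
        for size, location, name in get_data_objects(ids)
    )
    return PlainTextResponse(
        manifest,
        headers={
            "X-Archive-Files": "zip",
            "Content-Disposition": "attachment; filename=nmdc_data.zip",
        },
    )
```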

One technical question to resolve concerns the persistence of the download cart. The simplest implementation (from the server's perspective) would keep the download cart purely in memory on the web client. This would mean that if the page is reloaded, everything in the cart would disappear. If we want a persistent cart, then we probably need to store it in the database with an associated user id. This means creating a download cart API. Either way, for the server side of the implementation, I would give this a MEDIUM-LARGE size.
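
If we do go the persistent route, the cart model could be as small as a (user id, data object id) association table. A sketch, assuming SQLAlchemy and using illustrative table/column names rather than the actual nmdc-server schema:

```python
# Sketch of a persistent download cart: one row per (user, data object)
# selection, so the cart survives page reloads and can back a cart API.
from sqlalchemy import Column, DateTime, String, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DownloadCartItem(Base):
    __tablename__ = "download_cart_item"

    user_id = Column(String, primary_key=True)         # e.g. the user's ORCID
    data_object_id = Column(String, primary_key=True)  # selected data object
    added = Column(DateTime, server_default=func.now())
```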

What might be a bigger component of this is to design and implement the UI. We should consider bringing in Faiza to create a design. This would also help clarify some of the technical requirements.

@ssarrafan (Author)

The majority of this work will be done in a future sprint. Waiting to hear back on whether it should be in June or later. The first task may be done this sprint. See Jon's email below:

From: Jonathan Beezley jonathan.beezley@kitware.com
Date: Fri, May 14, 2021 at 10:06 AM
Subject: Re: Portal issues for May
To: Setareh Sarrafan ssarrafan@lbl.gov
Cc: Jeff Baumes jeff.baumes@kitware.com, Brandon Davis brandon.davis@kitware.com, Kjiersten Fagnan kmfagnan@lbl.gov, Pajau Vangay pvangay@lbl.gov, Emiley Eloe-Fadrosh eaeloefadrosh@lbl.gov, David Hays dehays@lbl.gov

It looks like Brandon has basically finished the last two issues.  The first one is primarily my responsibility and will take a little bit of work to expose "multi-omics" related information, but I should be able to do it.
For bulk download, it could be split into pieces:

1. Create a zip streaming microservice (this is the part that I researched). It is probably a SMALL task to get it deployed. This would essentially take a list of data_object id's and return a streaming zip file.
2. Requirements gathering and UI design. This could be working with Faiza to create a wireframe of the feature. With a basic design, we would likely have most of the technical implementation details clarified as well.
3. Backend implementation. Depending on technical requirements, we might need new domain models and endpoints in the API server. This could be anywhere from XSmall to Medium.
4. Client implementation.
I can most likely finish out 1 for this sprint.  Assuming Faiza is available, 2 could set us up for finishing out the feature in June.
Jon

@jeffbaumes (Collaborator)

Yes, moving most of this to June makes sense unless other priorities push it back further.

If we have the right people for the Kitware sync Thursday, we could discuss the requirements for the front end and have @faiza-a begin UI mockups.

@pvangay commented May 26, 2021

This may already have been discussed, so feel free to ignore if so. When downloading files, the user will want to associate the data files back to the sample somehow. Will the files be packaged into folders named by sample identifiers? Or renamed? Just jotting this here to make sure something is being considered. At the moment, individual downloads require the user to manually rename files to something more meaningful. Some files are named identically across samples (scaffolds .fna), while others seem to have an identifier appended, although I'm not sure where it came from (e.g., EC TSV). Let me know if this isn't making any sense!

@jeffbaumes (Collaborator)

Yes, we would need that in bulk download. I assume we can roll up the zip file in a way that will include a folder structure for each sample, but that needs @jbeezley's input.

@jbeezley

I had in mind organizing the files something like <study>/<sample>/<omics processing>. At each level, we could include a JSON file containing the metadata from that entity if that would help.

In terms of individual downloads, it is currently downloading with the original file name. If you could propose an alternative way to derive a file name, it would be easy to change the download name via Content-Disposition headers.
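
For example (a sketch only; the study/sample/omics identifiers are assumed to be resolvable from the data object's parent records, and the naming scheme is hypothetical), a derived name could be exposed like this:

```python
# Sketch: derive a more meaningful download name and expose it via the
# Content-Disposition header instead of the original file name.
from urllib.parse import quote

def download_filename(study_id: str, sample_id: str, omics_type: str, original_name: str) -> str:
    # e.g. "study-1_sample-A_metagenome_assembly_contigs.fna"
    return f"{study_id}_{sample_id}_{omics_type}_{original_name}"

def content_disposition(filename: str) -> str:
    # RFC 6266: the filename* parameter handles non-ASCII characters safely.
    return f"attachment; filename*=UTF-8''{quote(filename)}"
```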

@pvangay commented May 27, 2021

@jbeezley I like <study>/<sample>/<omics processing> - this makes sense to me.

Good to know file names can be changed. I think it would be pretty rare for users to download single files, so this is less of a concern knowing that the bulk download will have a structure like the above. Thanks!

@emileyfadrosh

Just a quick comment: I think we need to be careful and also get input from @scanon @hubin-keio about file naming, since we need to preserve the names of files. Right now, I don't see a meaningful link to the IDs for any of the omics outputs (e.g., assembly_contigs.fna). How is this being dealt with?

@pvangay commented May 27, 2021

@emileyfadrosh I mentioned something similar above :) It looks like some of this will be preserved in Jon's proposed structure for file organization, but we should think about whether we want to propose some kind of file naming scheme for all files in NMDC. This probably needs to be done in conjunction with the conversation about how to list sample names: microbiomedata/nmdc-metadata#349 (we definitely need input from others on the path forward).

@ssarrafan (Author)

Based on the Kitware sync meeting today I will move this to the June sprint. @jbeezley and @jeffbaumes, if you would prefer that I close this and open new issues, let me know.

@ssarrafan ssarrafan removed this from To do in NMDC May 2021 Sprint May 27, 2021
@ssarrafan ssarrafan added this to To do in NMDC June 2021 Sprint via automation May 27, 2021
@ssarrafan ssarrafan modified the milestones: Sprint 2, Sprint 3 May 27, 2021
@ssarrafan (Author)

Notes from today's meeting about bulk download:

Kitware Sync
Faizah, Emiley, Kjiersten, Marisa, David, Pajau, Jeff, Jon, Set

5/27/21
Bulk Download - slides at https://docs.google.com/presentation/d/1gMo1fmlneVEU2hjSoWrrKOjowcXuFStGjihOwf9adZU/edit#slide=id.p
Feedback and discussion:

  • Concern about too much data being downloaded through the browser.
    • Jon: A download manager could be added to resolve the issue if the download fails. Sizing doesn’t matter much on the server side. If a big file fails, a curl command could help, but that could fail as well. If it fails, will we save their selections? Persistence of the collection is a question that needs to be decided in the design. A saved search query could be another option but would need to be independent. If you refresh the page, you may need a backend to capture what you’re curating.
    • Kjiersten: Users will wait 10+ days for a download to finish and keep the browser open for weeks. The ability to refresh will be important.
    • Emiley: IMG reconstructs the query; there is a workspace where you can keep filtering data. They send an email when an analysis job finishes so users don’t have to keep their browser open. Can an email be sent to them? We are not collecting emails, so this will add some complexity. Giving them a sense of how long it will take would be great.
      • Kjiersten: there should be an email associated with ORCID.
    • Jeff: Email won’t be necessary because it will be instant. Being able to restart, see progress, etc. is solved by a download manager. He suggests just using a download manager. We can estimate the size.
      • Kjiersten: we need to give an estimate of the size; users won’t know how long it will take.
  • Emiley - can we filter by file type?
    • Jeff - check a box for each of the files you want to get.
      • Emiley - Use case: a user goes in and wants only the functional annotation data. Filtered by metagenome, list all file types? So they can select only the functional annotation information for all the metagenomes they’re interested in.
      • Emiley - Use case: only interested in QC info from MAGs; it would be hard to check hundreds of boxes.
        • Jeff - it can be a local or global selection of file types… add a higher-level search option to select file types, for example.
          • Pajau - if the first sample that shows up is organic matter, they won’t know what else they can select.
      • KJ - file type means different things to different people. They did something similar at JGI for the data portal and added bins for filtering above to make downloads easier.
        • Jeff - likes this approach and thinks it would be useful to do something similar. Selection of file types could be a flat list or a categorization. We can start with something like file type and iterate to make it better over time. Start with something simple.

@faiza-a commented Jun 8, 2021

Updated UI mockups based on feedback in the sync meeting on 06/03/21.

@dehays (Contributor) commented Jun 13, 2021

@jbeezley If I understand correctly - what you really need from microbiomedata/nmdc-schema#20 isn't simply access to descriptions (which are already there, but not used, on all data objects) but a file type attribute on each data object to allow the UI discussed here to do filtering by file type. Am I missing something?

@jbeezley

No, I don't think you are missing anything.

There is definitely some confusion between "file type" and "description". It appears to me that the "description" is just free-form text that isn't validated (or, as you noted, displayed in the UI). The file type, on the other hand, is an enumerated type that we can query on.
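
To make the distinction concrete, a sketch of what an enumerated file type could look like (the member names below are illustrative, not the actual nmdc-schema terms):

```python
# Sketch: unlike a free-form description, an enumerated file type gives the
# portal a constrained value it can filter and facet on.
import enum

class FileTypeEnum(str, enum.Enum):
    ASSEMBLY_CONTIGS = "Assembly Contigs"
    FUNCTIONAL_ANNOTATION = "Functional Annotation GFF"
    MAGS_CHECKM = "MAGs CheckM Stats"
    SAMPLE_METADATA = "Sample Metadata"

# A query then reduces to a simple equality filter, e.g.
# data_objects.filter(file_type == FileTypeEnum.ASSEMBLY_CONTIGS)
```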

NMDC June 2021 Sprint automation moved this from In progress to Done Jun 22, 2021
@subdavis (Contributor)

UI still needs to be completed

@subdavis subdavis reopened this Jun 22, 2021
NMDC June 2021 Sprint automation moved this from Done to In progress Jun 22, 2021
@subdavis subdavis assigned subdavis and unassigned jbeezley Jun 22, 2021
@subdavis subdavis mentioned this issue Jun 23, 2021
NMDC June 2021 Sprint automation moved this from In progress to Done Jun 23, 2021