
Portal Download - bulk #354

Closed
ssarrafan opened this issue May 5, 2021 · 18 comments · Fixed by #403 or #411

@ssarrafan

Researchers who are interested in raw data generally want to do bulk downloads (to a server/cloud) for custom analyses. Researchers who are interested in data products may want to download to a server or to their local computer for custom analysis or sending to KBase. Researchers are also interested in downloading the associated sample metadata only.

Priority - medium
Urgency - low

@ssarrafan ssarrafan created this issue from a note in NMDC May 2021 Sprint (To do) May 5, 2021
@jbeezley jbeezley added the X LARGE More than 10 days label May 5, 2021
@jbeezley commented May 5, 2021

Marking this as X-Large for now, but it could be quicker depending on the technical solution.

First step here is to decide how to implement the feature. Some of the options include:

  1. Streaming download as a zip file
  2. Email a link to a zip file that is asynchronously created
  3. Provide a shell (or Windows .bat) script to download all of the requested files

The second is the most user-friendly, stable, and scalable, but will require the most engineering effort. The first is probably a Large task, but brings with it a number of stability and scalability issues. The third is easiest (it could be done entirely client side; a rough sketch follows below), but users may have issues running a script.

If we decide to go with the second option, I would want to break this into subtasks.
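
For reference, a rough sketch of what the option-3 script could look like. It is shown in Python purely for illustration (the real deliverable might be a shell or .bat script), and the file list and URLs are hypothetical placeholders generated from the user's selection, not actual portal endpoints:

```python
# Hypothetical generated downloader: fetch each selected data object into a
# local folder structure. The paths and URLs below are placeholders.
import urllib.request
from pathlib import Path

FILES = {
    # archive path -> download URL (would be filled in from the user's selection)
    "study-1/sample-A/assembly_contigs.fna": "https://example.org/data/obj-123",
    "study-1/sample-A/functional_annotation.gff": "https://example.org/data/obj-124",
}

for rel_path, url in FILES.items():
    dest = Path(rel_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    print(f"Downloading {url} -> {dest}")
    urllib.request.urlretrieve(url, str(dest))
```

The same logic translates directly to a curl-based shell script, which avoids requiring a Python install but is harder to make robust on Windows.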

@ssarrafan (Author)

@jbeezley can we discuss option 1 at the tech sync or the Kitware sync meetings this week? @emileyfadrosh would love it if this could be done this sprint and thinks the users would prefer option 1.

@jbeezley

I think mod_zip would be a good candidate for generating the streaming responses without incurring the cost on the API server.
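
As a rough illustration of how that could work: the API returns a mod_zip manifest (one "crc size location name" line per file) and nginx assembles the zip stream itself. This sketch assumes a FastAPI endpoint and a hypothetical `get_data_objects` lookup; the route and names are illustrative, not the actual nmdc-server API.

```python
# Sketch only: the API returns a mod_zip manifest instead of file contents,
# so the zip is streamed by nginx rather than buffered by the API server.
from typing import List

from fastapi import FastAPI, Query
from fastapi.responses import PlainTextResponse

app = FastAPI()

def get_data_objects(ids: List[str]):
    # Hypothetical lookup; a real implementation would query the database.
    # Returns (size_bytes, internal_location, archive_name) per data object.
    return [(1024, f"/protected/{i}", f"{i}.fna") for i in ids]

@app.get("/api/bulk-download")
def bulk_download(ids: List[str] = Query(...)):
    # Each manifest line is "<crc-32> <size> <location> <name>"; "-" tells
    # mod_zip to skip CRC pre-computation.
    manifest = "".join(
        f"- {size} {location} {name}\n"
        for size, location, name in get_data_objects(ids)
    )
    return PlainTextResponse(
        manifest,
        headers={
            "X-Archive-Files": "zip",
            "Content-Disposition": "attachment; filename=nmdc_data.zip",
        },
    )
```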

One technical question to resolve concerns the persistence of the download cart. The simplest implementation (from the server's perspective) would keep the download cart purely in memory on the web client. This would mean that if the page is reloaded, everything in the cart would disappear. If we want a persistent cart, then we probably need to store it in the database with an associated user id. This means creating a download cart API. Either way, for the server side of the implementation, I would give this a MEDIUM-LARGE size.
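
If we do go the persistent route, the cart model could be as small as a (user id, data object id) association table. A sketch, assuming SQLAlchemy and using illustrative table/column names rather than the actual nmdc-server schema:

```python
# Sketch of a persistent download cart: one row per (user, data object)
# selection, so the cart survives page reloads and can back a cart API.
from sqlalchemy import Column, DateTime, String, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DownloadCartItem(Base):
    __tablename__ = "download_cart_item"

    user_id = Column(String, primary_key=True)         # e.g. the user's ORCID
    data_object_id = Column(String, primary_key=True)  # selected data object
    added = Column(DateTime, server_default=func.now())
```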

What might be a bigger component of this is to design and implement the UI. We should consider bringing in Faiza to create a design. This would also help clarify some of the technical requirements.

@ssarrafan (Author)

The majority of this work will be done in a future sprint. Waiting to hear back on whether it should be in June or later. The first task may be done this sprint. See Jon's email below:

From: Jonathan Beezley jonathan.beezley@kitware.com
Date: Fri, May 14, 2021 at 10:06 AM
Subject: Re: Portal issues for May
To: Setareh Sarrafan ssarrafan@lbl.gov
Cc: Jeff Baumes jeff.baumes@kitware.com, Brandon Davis brandon.davis@kitware.com, Kjiersten Fagnan kmfagnan@lbl.gov, Pajau Vangay pvangay@lbl.gov, Emiley Eloe-Fadrosh eaeloefadrosh@lbl.gov, David Hays dehays@lbl.gov

It looks like Brandon has basically finished the last two issues.  The first one is primarily my responsibility and will take a little bit of work to expose "multi-omics" related information, but I should be able to do it.
For bulk download, it could be split into pieces:

1. Create a zip streaming microservice (this is the part that I researched). It is probably a SMALL task to get it deployed. This would essentially take a list of data_object id's and return a streaming zip file.
2. Requirements gathering and UI design. This could be working with Faiza to create a wireframe of the feature. With a basic design, we would likely have most of the technical implementation details clarified as well.
3. Backend implementation. Depending on technical requirements, we might need new domain models and endpoints in the API server. This could be anywhere from XSmall to Medium.
4. Client implementation.
I can most likely finish out 1 for this sprint.  Assuming Faiza is available, 2 could set us up for finishing out the feature in June.
Jon

@jeffbaumes (Collaborator)

Yes, moving most of this to June makes sense unless other priorities push it back further.

If we have the right people for the Kitware sync Thursday, we could discuss the requirements for the front end and have @faiza-a begin UI mockups.

@pvangay commented May 26, 2021

This may already have been discussed, so feel free to ignore if so. When downloading files, the user will want to associate the data files back to the sample somehow. Will the files be packaged into folders named by sample identifiers? Or renamed? Just jotting this here to make sure something is being considered. At the moment, individual downloads require the user to manually rename files to something more meaningful. Some files are named identically across samples (scaffolds .fna), while others seem to have an identifier appended, although I'm not sure where it came from (e.g., EC TSV). Let me know if this isn't making any sense!

@jeffbaumes (Collaborator)

Yes, we would need that in bulk download. I assume we can roll up the zip file in a way that will include a folder structure for each sample, but that needs @jbeezley's input.

@jbeezley

I had in mind organizing the files something like <study>/<sample>/<omics processing>. At each level, we could include a JSON file containing the metadata from that entity if that would help.

In terms of individual downloads, it is currently downloading with the original file name. If you could propose an alternative way to derive a file name, it would be easy to change the download name via Content-Disposition headers.
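
For example (a sketch only; the study/sample/omics identifiers are assumed to be resolvable from the data object's parent records, and the naming scheme is hypothetical), a derived name could be exposed like this:

```python
# Sketch: derive a more meaningful download name and expose it via the
# Content-Disposition header instead of the original file name.
from urllib.parse import quote

def download_filename(study_id: str, sample_id: str, omics_type: str, original_name: str) -> str:
    # e.g. "study-1_sample-A_metagenome_assembly_contigs.fna"
    return f"{study_id}_{sample_id}_{omics_type}_{original_name}"

def content_disposition(filename: str) -> str:
    # RFC 6266: the filename* parameter handles non-ASCII characters safely.
    return f"attachment; filename*=UTF-8''{quote(filename)}"
```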

@pvangay commented May 27, 2021

@jbeezley I like <study>/<sample>/<omics processing> - this makes sense to me.

Good to know file names can be changed. I think it would be pretty rare for users to download single files, so this is less of a concern knowing that the bulk download will have a structure like the above. Thanks!

@emileyfadrosh

Just a quick comment: I think we need to be careful and also get input from @scanon @hubin-keio about file naming, since we need to preserve the names of files. Right now, I don't see a meaningful link to the IDs for any of the omics outputs (e.g., assembly_contigs.fna). How is this being dealt with?

@pvangay commented May 27, 2021

@emileyfadrosh I mentioned something similar above :) It looks like some of this will be preserved in Jon's proposed structure for file organization, but we should think about whether we want to propose some kind of file naming scheme for all files in NMDC. This probably needs to be done in conjunction with the conversation about how to list sample names: microbiomedata/nmdc-metadata#349 (we definitely need input from others on the path forward).

@ssarrafan (Author)

Based on the Kitware sync meeting today I will move this to the June sprint. @jbeezley and @jeffbaumes, if you would prefer that I close this and open new issues, let me know.

@ssarrafan ssarrafan removed this from To do in NMDC May 2021 Sprint May 27, 2021
@ssarrafan ssarrafan added this to To do in NMDC June 2021 Sprint via automation May 27, 2021
@ssarrafan ssarrafan modified the milestones: Sprint 2, Sprint 3 May 27, 2021
@ssarrafan (Author)

Notes from today's meeting about bulk download:

Kitware Sync
Faizah, Emiley, Kjiersten, Marisa, David, Pajau, Jeff, Jon, Set

5/27/21
Bulk Download - slides at https://docs.google.com/presentation/d/1gMo1fmlneVEU2hjSoWrrKOjowcXuFStGjihOwf9adZU/edit#slide=id.p
Feedback and discussion:

  • Concern about too much data being downloaded through the browser.
    • Jon: A download manager could be added to resolve the issue if the download fails. Sizing doesn’t matter much on the server side. If a big file fails, a curl command could help, but that could fail as well. If it fails, will we save their selections? Persistence of the collection is a question that needs to be decided in the design. A saved search query could be another option but would need to be independent. If you refresh the page, you may need a backend to capture what you’re curating.
    • Kjiersten: Users will wait 10+ days for a download to finish and keep the browser open for weeks. The ability to refresh will be important.
    • Emiley: IMG reconstructs the query; there is a workspace where you can keep filtering data. They send an email when an analysis job finishes so users don’t have to keep their browser open. Can an email be sent to them? We are not collecting emails, so this will add some complexity. Giving them a sense of how long it will take would be great.
      • Kjiersten: there should be an email associated with ORCID.
    • Jeff: Email won’t be necessary because it will be instant. Being able to restart, see progress, etc. is solved by a download manager. He suggests just using a download manager. We can estimate the size.
      • Kjiersten: we need to give an estimate of the size; users won’t know how long it will take.
  • Emiley - can we filter by file type?
    • Jeff - check a box for each of the files you want to get.
      • Emiley - Use case: a user goes in and wants only the functional annotation data. Filtered by metagenome, list all file types? So they can select only the functional annotation information for all the metagenomes they’re interested in.
      • Emiley - Use case: only interested in QC info from MAGs; it would be hard to check hundreds of boxes.
        • Jeff - it can be a local or global selection of file types… add a higher-level search option to select file types, for example.
          • Pajau - if the first sample that shows up is organic matter, they won’t know what else they can select.
      • KJ - file type means different things to different people. They did something similar at JGI for the data portal and added bins for filtering above to make downloads easier.
        • Jeff - likes this approach and thinks it would be useful to do something similar. Selection of file types could be a flat list or a categorization. We can start with something like file type and iterate to make it better over time. Start with something simple.

@faiza-a commented Jun 8, 2021

Updated UI mockups based on feedback in the sync meeting on 06/03/21.

@dehays (Contributor) commented Jun 13, 2021

@jbeezley If I understand correctly - what you really need from microbiomedata/nmdc-schema#20 isn't simply access to descriptions (which are already there, but not used, on all data objects) but a file type attribute on each data object to allow the UI discussed here to do filtering by file type. Am I missing something?

@jbeezley

No, I don't think you are missing anything.

There is definitely some confusion between "file type" and "description". It appears to me that the "description" is just free-form text that isn't validated (or, as you noted, displayed in the UI). The file type, on the other hand, is an enumerated type that we can query on.
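
To make the distinction concrete, a sketch of what an enumerated file type could look like (the member names below are illustrative, not the actual nmdc-schema terms):

```python
# Sketch: unlike a free-form description, an enumerated file type gives the
# portal a constrained value it can filter and facet on.
import enum

class FileTypeEnum(str, enum.Enum):
    ASSEMBLY_CONTIGS = "Assembly Contigs"
    FUNCTIONAL_ANNOTATION = "Functional Annotation GFF"
    MAGS_CHECKM = "MAGs CheckM Stats"
    SAMPLE_METADATA = "Sample Metadata"

# A query then reduces to a simple equality filter, e.g.
# data_objects.filter(file_type == FileTypeEnum.ASSEMBLY_CONTIGS)
```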

NMDC June 2021 Sprint automation moved this from In progress to Done Jun 22, 2021
@subdavis (Contributor)

UI still needs to be completed

@subdavis subdavis reopened this Jun 22, 2021
NMDC June 2021 Sprint automation moved this from Done to In progress Jun 22, 2021
@subdavis subdavis assigned subdavis and unassigned jbeezley Jun 22, 2021
@subdavis subdavis mentioned this issue Jun 23, 2021
NMDC June 2021 Sprint automation moved this from In progress to Done Jun 23, 2021