
Long-term goal: rogue workshop #15

Open
nbren12 opened this issue Jul 7, 2020 · 12 comments


@nbren12 (Collaborator) commented Jul 7, 2020

@jhamman had the great idea in today's meeting of organizing an independent workshop on pangeo + ML to occur towards the end of the year.

I think this is a great opportunity to focus our thoughts into a coherent story, and recommend some potent infrastructure/know-how combinations for ML research in the geosciences.

Other workshops have mostly focused on clean ML datasets, but this workshop could focus on producing them. We said something like "Constructing ML Pipelines" would be a natural title.

My $0.02 is that the best practice will depend on the organizational/team context. For example, I sometimes dream at night about using a filesystem like GLADE, but my team doesn't have access to that kind of machine. While it would be great to emphasize a common toolkit, I think we should point out divergence points and make strong suggestions.

As a start, it would be great to gather some brief impressions about the ML pipelines this group is building. I'm including my own answers as a guide below:

  • Where is the pipeline run (HPC, cloud, or mixed)?
    Google Cloud Platform.
  • What system(s) are intermediate steps stored on? (e.g. shared filesystem, cloud storage, SQL database)
    Google Cloud Storage
  • If applicable, what is the format of the intermediate steps?
    netCDF, zarr, and pickle files.
  • What are the main "parallelization" engines of your workflow if any?
    Google Cloud Dataflow (Apache Beam), K8s jobs
  • How many chained processing steps are needed to produce the ML dataset? How are these steps orchestrated (e.g. manual, snakemake, airflow, argo, etc)?
    Fewer than 10. Orchestrated using a mix of methods, including Argo and custom scripting systems; both launch and manage pods on a K8s cluster.
  • How are software dependencies managed?
    Docker containers with Anaconda inside.
  • Size of input, output data, intermediate data?
    10s of TBs of input, 10s of GBs of final processed data. TBs of intermediate.
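
For concreteness, here is a minimal sketch of what one of those intermediate steps can look like with this stack (xarray writing a zarr store to Google Cloud Storage). The bucket path, chunk sizes, and variable handling below are made up for illustration, not taken from our actual pipeline:

```python
# Hypothetical example: persist an intermediate dataset to zarr on GCS
# so that downstream steps (e.g. Dataflow or argo workers) can read it lazily.
import gcsfs
import xarray as xr

ds = xr.open_dataset("coarsened_physics_output.nc")  # made-up upstream output

fs = gcsfs.GCSFileSystem()  # picks up default GCP credentials
store = fs.get_mapper("gs://my-bucket/pipeline/intermediate.zarr")

# Chunk along time so workers can process slices independently
ds.chunk({"time": 96}).to_zarr(store, mode="w", consolidated=True)

# A later step reads it back without downloading the whole store
ds_back = xr.open_zarr(store, consolidated=True)
```
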
@raspstephan (Collaborator)

This sounds like a great idea, particularly focusing on the data pipeline up to the ML model. One small comment: I think also catering to audiences with less high-performance infrastructure would probably be helpful to a lot of people. My workflow, and I imagine that of many other people, is currently all on a local server, far away from the pipeline @nbren12 is building. I personally would find it helpful to have sessions on deciding whether it makes sense to port my workflow to the cloud and how to get started (for cloud-noobs like me). It might also be interesting to talk about the workflow from model/observation netCDF to keras/pytorch dataloader.

Here is a rough outline of my current workflow:

  1. Download data from public servers (ERA, TIGGE, CMIP) in netCDF onto local server
  2. Regrid data (steps 1 and 2 are done via snakemake, with xesmf for regridding) --> total volume of 100s of GB to a few TB (depending on resolution)
  3. (Optional) Convert data to TFRecord with preshuffling to avoid CPU RAM limitations (reading raw netCDFs from disk is too slow!)
  4. Load netCDFs (using xarray) or TFRecords using a Keras data loader (whose complexity got totally out of hand...); a rough tf.data sketch follows below
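
To make step 4 a bit more concrete, here is a rough sketch of what reading those preshuffled TFRecords could look like with tf.data instead of a hand-rolled Keras loader. The feature names, shapes, and file pattern are placeholders, not from my actual code:

```python
# Rough sketch: read preshuffled TFRecords with tf.data.
# Feature keys/shapes and the file glob are placeholders.
import tensorflow as tf

FEATURES = {
    "inputs": tf.io.FixedLenFeature([32, 64, 10], tf.float32),   # e.g. lat x lon x channels
    "targets": tf.io.FixedLenFeature([32, 64, 1], tf.float32),
}

def parse(example_proto):
    parsed = tf.io.parse_single_example(example_proto, FEATURES)
    return parsed["inputs"], parsed["targets"]

files = tf.data.Dataset.list_files("data/train_*.tfrecord")
dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1024)          # a small buffer is enough if the records were preshuffled
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# The dataset can be passed straight to Keras: model.fit(dataset, epochs=...)
```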

@nbren12 (Collaborator, Author) commented Jul 7, 2020

Thanks for sharing! Indeed, I think this model is probably optimal for a single researcher and can easily be replicated with a single instance on the cloud. This is what I did at UW and why I said this:

My $0.02 is that the best practice will depend on the organizational/team context

However, I would disagree that automated infrastructure is only for "high-performance audiences". The single-server model scales poorly for groups larger than one, and a more automated approach to infrastructure results in more reproducible research, IMO. Here is a slide I recently made on the subject:

[attached slide: Screen Shot 2020-07-07 at 12 46 10 PM]

This becomes even more important for complicated ML pipelines, so I think it's important to communicate these trade-offs.

@raspstephan (Collaborator)

Well that's great. The workshop could be a perfect opportunity then to teach plebs like myself, who are scared of the cloud, how to scale up!

@nbren12 (Collaborator, Author) commented Jul 7, 2020

plebs like myself

pfssh...not sure how many plebs can use tfrecords...

@jbednar commented Jul 8, 2020

Here is a slide I recently made on the subject:

That looks good, though I'd split the first stage "Local laptop/VM" into two varieties: (a) "locally reproducible environment" and (b) "unreproducible environment". Those two cases differ depending on whether results were run on whatever conda or pip packages happened to be installed, which is itself the result of some complex and unknown history of installations over time (b), or whether the environment has been captured in a pinned, reproducible, and hopefully minimal way (a). I think most results come from an environment of type (b), and I think the biggest increase in reproducibility comes from going from (b) to (a), because the conda or pip dependencies are generally the most specific to data science, the most quickly changing, and the most likely to affect the results, compared to all the other libraries on the system. After going from (b) to (a), going on to Docker or Kubernetes/CI achieves further reproducibility, but it's not as big a jump as simply pinning to make an environment reproducible locally...

@nbren12 (Collaborator, Author) commented Jul 8, 2020

@jbednar I agree. Going from (b) to (a) is a quantum leap. Unfortunately, it's hard to verify whether someone else's software project is of type (a) or (b) without CI. I'm not sure if CI is something vital to ML pipeline development though. Maybe 50-50 reproducibility is close enough...

@jbednar commented Jul 8, 2020

Using CI to force an escape from Schrodinger's reproducibility! :-) In practice we too use CI to ensure that we're in case (a) and not case (b) (see examples.pyviz.org), but it's at least possible to do the same by just passing it to another colleague...

@nbren12 (Collaborator, Author) commented Jul 9, 2020

at least possible to do the same by just passing it to another colleague

Sometimes it's easier to be friends with Travis, haha!

@jsadler2 commented Sep 4, 2020

I like the idea of this workshop and I think it would be useful - especially since so much of our time is spent on the data prep steps compared to the actual ML modeling.

Here are my answers to your questions:

  • Where is the pipeline run (HPC, cloud, or mixed)?
    HPC
  • What system(s) are intermediate steps stored on? (e.g. shared filesystem, cloud storage, SQL database)
    HPC shared filesystem
  • If applicable, what is the format of the intermediate steps?
    zarr, npz
  • What are the main "parallelization" engines of your workflow if any?
    snakemake submits jobs to the Slurm scheduler
  • How many chained processing steps are needed to produce the ML dataset? How are these steps orchestrated (e.g. manual, snakemake, airflow, argo, etc)?
    Fewer than 10. Most of the orchestration is just in a Python function, but some uses snakemake (a rough sketch follows after this list).
  • How are software dependencies managed?
    conda
  • Size of input, output data, intermediate data?
    Megabytes
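
As an illustration of the "orchestration is just in a Python function" pattern mentioned above; the step functions, variable names, and paths below are hypothetical, not from the actual project:

```python
# Hypothetical sketch of chaining a few processing steps in one Python
# function, with intermediates written to a shared filesystem as zarr/npz.
from pathlib import Path

import numpy as np
import xarray as xr

WORKDIR = Path("/scratch/project/ml_pipeline")   # made-up shared-filesystem path


def build_training_data(raw_nc: Path) -> Path:
    """Run the chained preprocessing steps and return the final npz path."""
    # Step 1: clean the raw netCDF and store the intermediate as zarr
    ds = xr.open_dataset(raw_nc)
    cleaned_path = WORKDIR / "cleaned.zarr"
    ds.dropna("time").to_zarr(cleaned_path, mode="w")

    # Step 2: turn the zarr intermediate into model-ready arrays (npz)
    cleaned = xr.open_zarr(cleaned_path)
    out_path = WORKDIR / "training_data.npz"
    np.savez(out_path, x=cleaned["predictors"].values, y=cleaned["target"].values)
    return out_path
```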

@nbren12 (Collaborator, Author) commented Sep 5, 2020

Thanks for sharing, Jeff.

This is unrelated, but I've been thinking a lot lately about the concept of MLOps. IMO, the DevOps world is pretty far ahead of the scientific community when it comes to reproducibility, since a lack of reproducibility has much higher consequences in the commercial world. I wonder if any of it translates to the academic context?

Edit: corrected name

@djgagne (Collaborator) commented Oct 5, 2020

I want to follow up on a short discussion we had at today's Pangeo ML group meeting. There is still interest in the workshop, but no one has volunteered to take the lead on organizing the event, likely because of the time commitment involved. We also discussed the scope of the workshop, which could be very wide-ranging but would ideally focus on some essentials. Three questions for the group:

  1. Who has an interest in organizing this workshop? Alternatively, do you know anyone who might be interested if presented with the opportunity? They would mainly need to handle logistics, send out reminders, and ask people in this group for guidance/talks/etc. as needed.

  2. What are the most common questions people have been getting about ML infrastructure building with Python/pangeo?

  3. What are the biggest headaches/time wasters people keep running into in their pipelines?

I hope the answers to these questions help us focus priorities for the workshop. December is not going to be a realistic date at this point, but spring may be a promising time, especially if 1/2 days and virtual.

@zhonghua-zheng (Contributor)


Hi @djgagne, although I don't have much experience organizing workshops, I am interested in helping with this one! Please feel free to let me know if there is anything I can contribute to.
