Develop a Cluster Computing Framework for Dynamical Modeling #77

Closed
0u812 opened this issue Jan 15, 2017 · 25 comments
Comments

@0u812

0u812 commented Jan 15, 2017

Introduction

Data growth will be a major factor in the near future. However, most academic software in systems biology is not written with explosive growth in mind. This is unfortunate, as related fields have made great gains in scalability simply by leveraging the tools of big data, as evidenced by the great success of startups like H2O.ai.

Our group at the University of Washington is developing a Python-based framework for biological modeling. The core of this framework is a high-speed ODE/stochastic biochemical network simulator, Roadrunner, which pushes the limits of single-threaded computing. This summer, we would like to mentor a student in developing a cluster computing framework for running simulations more scalably.

Goal

The overall goal is to scale up common types of tasks in dynamical modeling. These tasks usually involve 1) loading a model (usually SBML), 2) making some perturbation to the model (changing parameter values), 3) simulating the modified model, and 4) collecting some metrics from the results. To make this project tractable for a single summer, I suggest breaking it down into smaller tasks that can be used as milestones. For the initial phase of the project, we should ideally focus on feasibility and on figuring out how to implement cluster computing in a uniform way. For example,

From here, the next step would be to construct a more general API that can handle the common analysis types in dynamical modeling. The common types of analyses that can be parallelized include parameter scans, parameter fitting, sensitivity analysis and parameter identifiability. If we can implement at least some of these during the summer, that would be great.
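As a rough illustration of the four-step task described above (load, perturb, simulate, collect a metric), here is a minimal sketch of a parameter scan. It does not use Roadrunner or Spark: `BASE_MODEL`, `simulate`, `run_task`, and `parameter_scan` are hypothetical names, the model is a toy single-reaction system, and a thread pool from the Python standard library stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a loaded model: one reaction A -> B with rate
# constant k. In practice this would be an SBML model loaded into a real
# simulator; the dict below is only an illustration.
BASE_MODEL = {"A0": 10.0, "k": 0.5}


def simulate(model, t_end=5.0, steps=5000):
    """Integrate dA/dt = -k*A with forward Euler; return (time, A) pairs."""
    dt = t_end / steps
    a = model["A0"]
    trajectory = [(0.0, a)]
    for i in range(1, steps + 1):
        a += dt * (-model["k"] * a)
        trajectory.append((i * dt, a))
    return trajectory


def run_task(k):
    """One independent task: load, perturb, simulate, collect a metric."""
    model = dict(BASE_MODEL)      # 1) "load" a fresh copy of the model
    model["k"] = k                # 2) perturb a parameter
    trajectory = simulate(model)  # 3) simulate the modified model
    final_a = trajectory[-1][1]   # 4) collect a metric (final amount of A)
    return k, final_a


def parameter_scan(k_values, workers=4):
    """Map run_task over the parameter values; tasks share no state."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, k_values))


if __name__ == "__main__":
    for k, final_a in parameter_scan([0.1, 0.5, 1.0, 2.0]):
        print(f"k={k:.1f}  A(t=5)={final_a:.4f}")
```

Because `run_task` takes only a parameter value and returns a small result, the same function could in principle be handed to a cluster framework's map operation (for example an RDD `map` in Spark) instead of the thread pool used here.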

Skills Required

Familiarity with cluster computing, such as Spark or Hadoop (though Spark is preferred due to its lower overhead), would be ideal. Experience with Python and Linux would also be helpful. Above all, we want students who are self-driven, eager to learn, and excited about research. This is a highly unexplored application of cluster computing, and would likely lead to a peer-reviewed paper if successful.

Possible Mentors

Main Contact

References

Somogyi, E. T., Bouteiller, J. M., Glazier, J. A., König, M., Medley, J. K., Swat, M. H., & Sauro, H. M. (2015). libRoadRunner: a high performance SBML simulation and analysis library. Bioinformatics, btv363.

Sauro, H. M., Choi, K., Medley, J. K., Cannistra, C., Konig, M., Smith, L., & Stocking, K. (2016). Tellurium: A Python Based Modeling and Reproducibility Platform for Systems Biology. bioRxiv, 054601.

@0u812
Author

0u812 commented Jan 17, 2017

Added the Java label because most modern cluster frameworks are Java- or Scala-based, so knowing one of these languages beforehand would be helpful.

@0u812 0u812 self-assigned this Jan 18, 2017
@matthiaskoenig

+1 I second this proposal.

Just for clarification: in a first implementation there will be no synchronization between the different distributed models/simulations, i.e. the simulation tasks are completely independent of each other? In other words, no simulation depends on another, and every single simulation is an independent task.

@108krohan

108krohan commented Feb 15, 2017

Hello Everyone,

My Masters degree (pursuing) in Biological Sciences should be of interest to a rapidly growing organisation like yours. My sound knowledge of Python, Java, C, C++, SQL matches the project description. Primary OS: Linux Ubuntu 16.04 LTS.

I'll be honest I'm new to high-performance computing. And you can expect nothing but eagerness for the research paper. You can expect S.O.L.I.D. programming principles followed rigorously because that would help the organisation in the long run.

I do have 3 questions in mind:

  1. Do I have to mail alex.pico [at] gladstone.ucsf.edu or the mentors, in order to get in touch?
  2. What steps should I take in order to be a strong candidate?
  3. Do I start from the NRNB GSoC Google Doc template?

Thank you for reading.
Hoping for a fast and positive response.

@0u812
Author

0u812 commented Feb 15, 2017

Hi Rohan,

Thanks for your interest. I will try to answer each of your questions:

  1. At this stage, you're basically getting to know the mentors and bouncing ideas off of us, so posting here is fine.
  2. I think having a solid proposal is the most important thing. You can use the Google Doc that you linked and start filling it out (use File -> Make a copy). Once you have the content basically filled in you can share it with us for feedback. Feel free to reach out to us, especially for the parts you may not be familiar with such as parameter sweeps and parameter fitting. Google has some guidelines for selecting students. In addition to those, I would also pay specific attention to:
  • Does the student's plan have enough detail and does it lead to a useful feature of the software (such as the ability to perform parameter sweeps and parameter fitting on a cluster)?
  • Is the proposed work realistic for GSoC?
  • Does the student have the skills necessary to carry out the proposed work?

I think having all of these things would lead to a high chance of the project being successful, which is good for both us and the student.

  3. That is correct. You can make a copy for your own editing (use File -> Make a copy).

Regards,
Kyle

@108krohan

Thanks for such a prompt response! As instructed, I've mailed a preliminary Document, awaiting suggestions.

Meanwhile, I've set up Tellurium and the tutorials from the Tellurium page are quite helpful. Could you please confirm if that's the right way to proceed?

This page has lots of relevant links, I just wanted to know which are the most important so I can dig more deeply for the project.

Thank you for taking the time to read and promptly reply :)

@0u812
Author

0u812 commented Feb 20, 2017

Hi Rohan,

The tutorials you linked to should be helpful. You can also find more helpful tutorials at http://tellurium.readthedocs.io/en/stable/index.html, especially the Models & Model Building section. I can't provide feedback on the document you sent because the project proposal isn't filled in yet, but I assume you are trying to learn how to use tellurium first. Can you tell me how far along you are in the process? For example, if I gave you a description of a reaction network, could you encode it and simulate it in tellurium?

@108krohan

108krohan commented Feb 20, 2017

Thanks for the tutorial link (http://tellurium.readthedocs.io/en/stable/index.html)

Sorry, I've been busy with college tests (4 tomorrow). I'm trying to slip in an hour or two for Tellurium tutorials each day though. And I'll let you know when I'm through with encoding and simulation.

Regarding Reaction Network, does it entail Antimony usage?

You are busy, please don't trouble to reply if that's correct.

@108krohan

108krohan commented Feb 26, 2017

Finished executing examples from documentation.
Where should one ideally go from here?

Okay, while going through the documentation I noticed certain things:

  1. Bioservices needs to be installed separately.
  2. Tellurium build installer for Linux? Initial setup via conda mentioned here wasn't enough. Had trouble with SED-ML and Combine examples because they need pygraphviz, and sbml2matlab.
  3. te.plotArray() used where r.plot() produced same results. Any reasons for this?

@matthiaskoenig

matthiaskoenig commented Feb 27, 2017 via email

@108krohan

108krohan commented Feb 27, 2017

Hi Matthias,

The tutorial is pretty accurate. But I'll try to go over the documentation again today and list out whichever errors, unclear information or missing information I find here.

One more thing: although pygraphviz (plus sbml2matlab, required for SED-ML and Combine) started working after some head-scratching, I wanted to confirm whether more dependencies/libraries than just these conda installs are required, because I needed more (for example: pandas, bioservices).
conda install -c sys-bio tellurium
conda install jinja2 ipython
conda install -c SBMLTeam python-libsbml
More specifically, is there no way for enabling IDE plugins and SBOL functionality via conda-install method? Or are they optional?
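A quick way to see which of the packages mentioned in this thread are still missing from an environment is to ask Python whether each one can be found. This is only a sketch: the `PACKAGES` list and the `missing_modules` helper are illustrative, and the import names (for example `libsbml` for the python-libsbml package) are assumed to match what the conda installs provide.

```python
import importlib.util

# Import names taken from this thread; adjust the list for your own setup.
PACKAGES = ["tellurium", "libsbml", "pandas", "bioservices", "pygraphviz"]


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(PACKAGES)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```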

Regards,
Rohan

@0u812
Author

0u812 commented Mar 1, 2017

Hi Rohan, how is the application coming? What questions do you have? Do you think you need more info on modeling/Tellurium/cluster computing?

@108krohan

108krohan commented Mar 3, 2017

Hi Kyle, really sorry for the late response. I figured it would be better to learn Spark before posting here or updating the application (I'll share the updated doc latest by day after tomorrow morning, EST for feedback).

Our overall goal is to scale up model 1) loading 2) perturbation 3) simulation and 4) metric generation through HPC via Spark, yes? You've already done a fantastic job of breaking our project into tasks. What kind of subtasks are you expecting? Can you meanwhile suggest names of other materials you might want me familiarised with?

Regards,
Rohan

@0u812
Author

0u812 commented Mar 3, 2017

No worries 😄
I think you've got the right idea for scaling up. Now that you've finished the tellurium tutorials, I can give you more specific examples of the types of analysis we can parallelize. It might help to talk face-to-face. Are you free next week or during the weekend to Skype?

@108krohan

Yes! Are you free between 7:30PM and 11:59PM Monday night EST? (Schedule EST/IST here)

@matthiaskoenig

matthiaskoenig commented Mar 4, 2017 via email

@ShaikAsifullah

Hi, I am a little late to this. Can others join this meeting if it is not scheduled yet? Or, if it has already happened, may I get updates please? I am also planning to contribute to it.

@0u812
Author

0u812 commented Mar 6, 2017

Hi all, for the meeting it looks like the best time for all three time zones (PST/IST/CET) is 8 am PST / 9:30 pm IST / 5 pm CET. Would it work to Skype Wednesday at that time for about an hour? If that doesn't work, I can set up a survey.

@hsauro

hsauro commented Mar 6, 2017 via email

@108krohan

Hi Kyle,
Yes! Awesome :D 👍 You'll receive an updated doc within the next 3-4 hours. Your feedback would be incredibly valuable. I noticed you had mentioned parameter sweeps and fitting in an earlier comment, and I've been trying to learn as much as I can. I'd like to be prepared when we Skype. Please tell me anything you'd like me to be completely thorough with.

Regards,
Rohan

@108krohan

108krohan commented Mar 6, 2017

Hi Matthias @matthiaskoenig,

  • Been adding feedback comments to example codes from documentation in a private repository because 'Edit on Github' link on the tutorials page doesn't seem to be working for me. (returns a 404 error)

  • You'll also find a list of all the libraries I needed as an entire noob while getting the examples to work properly on my Linux Ubuntu 16.04. This could be great for new contributors not only to our project but to tellurium as a whole! Thanks Kyle for forking and improving the sbml2matlab at sys-bio.

  • One more thing, you might notice the cmake-config links for sbml2matlab are broken in the README.md. I finally got it to work, and you can find the screenshot on the repo so you don't have to do it again.

Hope it helps!

Regards,
Rohan

@0u812
Author

0u812 commented Mar 7, 2017

It looks like we can have our first Skype meeting tomorrow at 8 am PST / 9:30 pm IST / 5 pm CET. Anyone who can make it is welcome. This meeting should be pretty informal. I just mainly want to get a sense of where the students are at and try to fill in any gaps in your knowledge of Tellurium.

My Skype user id is jkylemedley. If @108krohan and @ShaikAsifullah could please send me a contact request on Skype that would be great.

@hsauro

hsauro commented Mar 8, 2017 via email

@hsauro

hsauro commented Mar 8, 2017 via email

@matthiaskoenig

matthiaskoenig commented Mar 8, 2017 via email

@khanspers
Contributor

GSoC 2017 selected project
