Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Google Storage file access add-on for cohorts (a.k.a. minibucket) #219

Merged
merged 3 commits into from
Jul 9, 2017

Conversation

armish
Copy link
Member

@armish armish commented Jun 26, 2017

@jburos: This adds a really light-weight but functional module to the cohorts so that instead of dealing with all the NFS setups and remote machine admin work, we can directly make use of files on the bucket for relatively small tasks. I don't think this solution will scale up to BAMs, but VCFs should be fine to handle this way.

Here is a notebook that goes over the basic functionality really quick:
https://gist.github.com/armish/b4994d51775a390fb1057d80501683d3

@timodonnell @ryan-williams @iskandr: I know TensorFlow has its own shady integration that is pretty smooth on the App/GPU Engines but doesn't work on our locals and I think Scala/Spark world is enjoying the DataFlow (Apache Beam) utilities for such integration, but let me know if there is a better way of doing this.

@armish armish requested a review from jburos June 26, 2017 13:44
@coveralls
Copy link

Coverage Status

Coverage decreased (-2.5%) to 54.427% when pulling 64a26a2 on gcloud into 31545c4 on master.

Apparently the Abstract interface has just been introduced so
it breaks earlier 3.x Python releases. Falling back to good old
object for now.
@jburos
Copy link
Member

jburos commented Jun 26, 2017

Looks great @armish, thanks so much. This will be super helpful.

Only thing I noticed is that this will not work for python <= 3.6 -- the AbstractContextManager class isn't available for earlier versions. It's a bit of a shame since we're using 3.5 for reticulate compatibility.

At any rate, should we make this requirement explicit for Cohorts to avoid errors with earlier versions of python? Or, is there a way to introduce this functionality optionally? Open to suggestions.

@armish
Copy link
Member Author

armish commented Jun 26, 2017

@jburos: oh, and I forgot to mention: cohorts still doesn't know about this new addition. This is on purpose since we should better do the integration slowly and carefully and make sure the additional support doesn't bite us for the upcoming wrap-ups, especially during these days/weeks.

So, consider this as being 1/N of minor refactorings coming in and definitely a non-urgent PR to CR. Completely fine with me to keep stacking them up until Tavi comes back and the projects start running a little bit slower.

@jburos
Copy link
Member

jburos commented Jun 26, 2017

@armish nvm see that you fixed the abstractManagerClass issue. What do you think - should we start working in a develop branch for the accumulation of these minor PRs?

@coveralls
Copy link

Coverage Status

Coverage decreased (-2.5%) to 54.427% when pulling bf53812 on gcloud into 31545c4 on master.

Copy link
Member

@jburos jburos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, once you resolve the travis error on python 3.4 it's OK to merge.

@coveralls
Copy link

coveralls commented Jun 26, 2017

Coverage Status

Coverage decreased (-2.5%) to 54.427% when pulling aa0fd4e on gcloud into 31545c4 on master.

@armish armish merged commit f9d5fc7 into master Jul 9, 2017
@armish armish deleted the gcloud branch July 9, 2017 09:41
@tavinathanson
Copy link
Member

Great stuff, @armish!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants