-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python Cloud Object Storage Support #14
Conversation
Remove the version bump from this commit. I bump the version when I make a release (https://github.com/gadamc/release-python) |
I think it will be good practice going forward that you post a link here directly to the tests that demonstrate it's working. That would be the notebook link you sent to me on Slack. Remember to use the #@hidden_cell magic to hide your credentials thought |
You need to add instructions and examples to the README. Also, you need to describe the difference between 'CloudObjectStorage' and the bluemix/softlayer object storage |
I'm not a fan of the CamelCase class 'CloudObjectStorage'. At the same time, I don't like 'cloudobjectstorage' or 'cloud_object_storage'. |
@gadamc @bassel-zeidan I will review it. Some things not clear here, will comment later. |
python/ibmos2spark/osconfig.py
Outdated
# setup config | ||
prefix = "fs.s3d.service" | ||
hconf = sparkcontext._jsc.hadoopConfiguration() | ||
hconf.set(prefix + ".impl", driver) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bassel-zeidan This key suppose to be already setup in core-sites.xml. There is no need to setup it again.
@bassel-zeidan why we need this class? why not a simple function? |
@gilv @bassel-zeidan -- the reason Bassel has added this as a class is because he's following the precedent I set when I first created this project. In order to configure a connection to 'softlayer' or 'bluemix' object stores, one instantiates the appropriate class (see the README.md). I have separate classes, mostly for historical reasons, in order to support previous 'swift' and 'swift2d' protocols. But also, having these as classes that a user instantiates facilitates, slightly, using multiple different object storages simultaneously because the class will build the swift URL based on the protocol, and configuration name. So, it's meant to facilitate users experience. Perhaps there could have been a better design, but when I first built this I didn't actually expect anybody buy myself to do use it. :) |
@gadamc thanks for taking a look.
|
@gilv thanks for the feedback. regarding having it as a class, as Adam mentioned this is convention that has been followed before and i think it makes sense because normally you define your configurations once (make an instance of the class) and try to access files multiple times (using url function) and such a structure helps with that. |
In case you have no more concerns, let's get it merged. |
In the 'softlayer' and 'bluemix' classes, we append a configuration name to the "prefix". This allows a user to create and use multiple softlayer / bluemix object storage accounts. For example, if I need to read data from one account and then write to another. I think this would be especially important to be able to define multiple configurations in a parallelized process where each worker node is interacting with the COS. Right? Can we do this for COS? That is, can we change line 219? That would allow a person with multiple accounts configure access to them independently with the same code. |
raise ValueError("Invalid input: credentials.{} is required!".format(key)) | ||
|
||
# setup config | ||
prefix = "fs.s3d.service" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this 'fs.s3d.service.' + cos_id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried it. Didn't work out unfortunately
@gadamc unfortunately that is not possible. I tried to do that in different ways but it seems that it is not supported. |
@gilv do you know of a way to support connecting to multiple COSs by setting multiple configurations at the same time? I tried the same approach that has been followed in bluemix and softlayer classes for COS (appending unique OS id to the prefix of the settings e.g. |
@bassel-zeidan It was resolved actually. You can setup different service names for COS, similar to Swift. But it's not yet deployed to Spark as a Service. This commit https://github.ibm.com/cloud-platforms/stocator-s3/commit/519bf601e6d934f7c69de9739e4ea6afe6dc9c68 will part of 1.4 version that we will release in a week or so. |
Great! I think we should utilise this update in ibmos2spark implementation and support defining multiple configs. However, It might take some time until the new version of Stocator is released and until it gets deployed on Spark service. I would suggest that we introduce COS update to ibmos2spark (Python, Scala and R) and make a new release then we can support the multiple configs support in the next version. I will open a new item for that and assign it to myself. |
@bassel-zeidan I'd rather wait. If @gilv thinks its only a week or two away, I don't see the rush. |
As discussed, i am merging this PR. |
Background
This PR adds a class which gives the ability to setup the hadoop configs to connect to the cloud object storage (COS).