Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python Cloud Object Storage Support #14

Merged
merged 14 commits into from
Jul 19, 2017
Merged

Python Cloud Object Storage Support #14

merged 14 commits into from
Jul 19, 2017

Conversation

bassel-zeidan
Copy link
Collaborator

Background

This PR adds a class which gives the ability to setup the hadoop configs to connect to the cloud object storage (COS).

@gadamc
Copy link
Collaborator

gadamc commented Jul 11, 2017

Remove the version bump from this commit. I bump the version when I make a release (https://github.com/gadamc/release-python)

@gadamc
Copy link
Collaborator

gadamc commented Jul 11, 2017

I think it will be good practice going forward that you post a link here directly to the tests that demonstrate it's working. That would be the notebook link you sent to me on Slack. Remember to use the #@hidden_cell magic to hide your credentials thought

@gadamc
Copy link
Collaborator

gadamc commented Jul 11, 2017

You need to add instructions and examples to the README. Also, you need to describe the difference between 'CloudObjectStorage' and the bluemix/softlayer object storage

@gadamc
Copy link
Collaborator

gadamc commented Jul 11, 2017

I'm not a fan of the CamelCase class 'CloudObjectStorage'. At the same time, I don't like 'cloudobjectstorage' or 'cloud_object_storage'.

@gilv
Copy link

gilv commented Jul 12, 2017

@gadamc @bassel-zeidan I will review it. Some things not clear here, will comment later.

# setup config
prefix = "fs.s3d.service"
hconf = sparkcontext._jsc.hadoopConfiguration()
hconf.set(prefix + ".impl", driver)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bassel-zeidan This key suppose to be already setup in core-sites.xml. There is no need to setup it again.

@gilv
Copy link

gilv commented Jul 12, 2017

@bassel-zeidan why we need this class? why not a simple function?

@gadamc
Copy link
Collaborator

gadamc commented Jul 12, 2017

@gilv @bassel-zeidan -- the reason Bassel has added this as a class is because he's following the precedent I set when I first created this project. In order to configure a connection to 'softlayer' or 'bluemix' object stores, one instantiates the appropriate class (see the README.md).

I have separate classes, mostly for historical reasons, in order to support previous 'swift' and 'swift2d' protocols. But also, having these as classes that a user instantiates facilitates, slightly, using multiple different object storages simultaneously because the class will build the swift URL based on the protocol, and configuration name. So, it's meant to facilitate users experience. Perhaps there could have been a better design, but when I first built this I didn't actually expect anybody buy myself to do use it. :)

@bassel-zeidan
Copy link
Collaborator Author

@gadamc thanks for taking a look.

  1. version upgrade removed.
  2. regarding testing:
https://apsportal.ibm.com/analytics/notebooks/628a1960-6f48-4dc3-b57b-5666363831d9/view?access_token=4ada3a91af5a4e11c0ee47b7c0b3f18a1623ba70075d0a0ab26443ad713f33cf
  1. I updated the Readme file and add an example on how to use COS.

@bassel-zeidan
Copy link
Collaborator Author

@gilv thanks for the feedback.

regarding having it as a class, as Adam mentioned this is convention that has been followed before and i think it makes sense because normally you define your configurations once (make an instance of the class) and try to access files multiple times (using url function) and such a structure helps with that.

@bassel-zeidan
Copy link
Collaborator Author

In case you have no more concerns, let's get it merged.

@gadamc
Copy link
Collaborator

gadamc commented Jul 14, 2017

@bassel-zeidan @gilv

In the 'softlayer' and 'bluemix' classes, we append a configuration name to the "prefix".
https://github.com/ibm-watson-data-lab/ibmos2spark/blob/PythonCOSSupport/python/ibmos2spark/osconfig.py#L158

This allows a user to create and use multiple softlayer / bluemix object storage accounts. For example, if I need to read data from one account and then write to another. I think this would be especially important to be able to define multiple configurations in a parallelized process where each worker node is interacting with the COS. Right?

Can we do this for COS? That is, can we change line 219? That would allow a person with multiple accounts configure access to them independently with the same code.

raise ValueError("Invalid input: credentials.{} is required!".format(key))

# setup config
prefix = "fs.s3d.service"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this 'fs.s3d.service.' + cos_id?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried it. Didn't work out unfortunately

@bassel-zeidan
Copy link
Collaborator Author

@gadamc unfortunately that is not possible. I tried to do that in different ways but it seems that it is not supported.

@bassel-zeidan
Copy link
Collaborator Author

bassel-zeidan commented Jul 14, 2017

@gilv do you know of a way to support connecting to multiple COSs by setting multiple configurations at the same time? I tried the same approach that has been followed in bluemix and softlayer classes for COS (appending unique OS id to the prefix of the settings e.g. "fs.swift2d.service." + name). This doesn't seem to work with COS. Any ideas?

@gilv
Copy link

gilv commented Jul 15, 2017

@bassel-zeidan It was resolved actually. You can setup different service names for COS, similar to Swift. But it's not yet deployed to Spark as a Service. This commit https://github.ibm.com/cloud-platforms/stocator-s3/commit/519bf601e6d934f7c69de9739e4ea6afe6dc9c68 will part of 1.4 version that we will release in a week or so.

@bassel-zeidan
Copy link
Collaborator Author

Great! I think we should utilise this update in ibmos2spark implementation and support defining multiple configs. However, It might take some time until the new version of Stocator is released and until it gets deployed on Spark service. I would suggest that we introduce COS update to ibmos2spark (Python, Scala and R) and make a new release then we can support the multiple configs support in the next version. I will open a new item for that and assign it to myself.

@gadamc @gilv Do you agree to that?

@bassel-zeidan
Copy link
Collaborator Author

@gadamc
Copy link
Collaborator

gadamc commented Jul 17, 2017

@bassel-zeidan I'd rather wait. If @gilv thinks its only a week or two away, I don't see the rush.
I'd rather have the COS usage consistent with the other classes -- which includes supplying a 'configuration_name'. And I'd also not want to release twice. @gilv -- how confident are you in the timeline for the update to support a multiple configurations?

@bassel-zeidan
Copy link
Collaborator Author

As discussed, i am merging this PR.

@bassel-zeidan bassel-zeidan merged commit b3c0f31 into master Jul 19, 2017
@bassel-zeidan bassel-zeidan deleted the PythonCOSSupport branch July 19, 2017 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants