
Commit

Merge pull request #39 from ibm-watson-data-lab/update_readmes
Okay - going to merge this in without a review in order to move this along. I hope any documentation problems will be flushed out by the next release.
Adam Cox committed Oct 26, 2017
2 parents c0d7517 + 7450363 commit 0dbe3d3
Showing 7 changed files with 431 additions and 253 deletions.
8 changes: 1 addition & 7 deletions python/README.md
@@ -92,12 +92,6 @@ data = sc.textFile(data_url)
Alternatively, you can connect to an IBM Bluemix COS using IAM token. Set the `auth_method` to `iam_token` and
provide the appropriate values in the credentials.

If you do not provide a `configuration_name`,
a default value will be used (`service`). However, if you are reading or
writing to multiple Object Storage instances you will need to define separate `configuration_name`
values for each Object Storage instance. Otherwise, only one connection will be
configured at a time, potentially causing errors and confusion.
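The reason distinct names matter is that the configuration name is embedded in the object URL that Spark reads, so each URL selects its own set of Hadoop credentials. The `object_url` helper below is hypothetical, for illustration only; `ibmos2spark` builds these URLs for you via its `url` method:

```python
def object_url(container, object_name, configuration_name="service"):
    """Illustrative only: build a stocator-style swift2d URL.

    The configuration name in the host portion tells Spark/stocator which
    set of Hadoop credentials to use for this path.
    """
    return "swift2d://{}.{}/{}".format(container, configuration_name, object_name)

# Two Object Storage instances with distinct configuration names resolve to
# distinct URL namespaces, so both can be read in the same Spark session.
url_a = object_url("container_a", "data.csv", configuration_name="instanceA")
url_b = object_url("container_b", "data.csv", configuration_name="instanceB")
print(url_a)  # swift2d://container_a.instanceA/data.csv
```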


```python
import ibmos2spark
```
@@ -170,7 +164,7 @@ This is the service described in the **middle right** pane in the image above (and was previously referred to
as Softlayer Swift Object Storage). [Documentation is here](https://ibm-public-cos.github.io/crs-docs/)

Note below that credentials are not passed in as a dictionary, like in the other implementations.
Rather, each piece of information is supplied as a separate, required argument when instantiating
a new `softlayer` object.


163 changes: 107 additions & 56 deletions r/sparklyr/README.md
@@ -1,27 +1,43 @@
# ibmos2sparklyr

The `ibmos2sparklyr` package facilitates data read/write connections between Apache Spark clusters and the various
[IBM Object Storage services](https://console.bluemix.net/catalog/infrastructure/object-storage-group).

Using the [`stocator`](https://github.com/SparkTC/stocator) driver, which implements the `swift2d` protocol,
connects your Spark executor nodes directly to your data in object storage.
This is an optimized, high-performance method to connect Spark to your data.
![IBM Object Storage Services](fig/ibm_objectstores.png "IBM Object Storage Services")

### Object Storage Documentation

This package expects a SparkContext instantiated by sparklyr. It has been tested
to work with IBM RStudio from DSX, though it should work with other Spark
installations that utilize the [swift2d/stocator](https://github.com/SparkTC/stocator) driver.
* [Cloud Object Storage](https://www.bluemix.net/docs/services/cloud-object-storage/getting-started.html) **Not Yet Supported.**
* [Cloud Object Storage (IaaS)](https://ibm-public-cos.github.io/crs-docs/) **Not Yet Supported.**
* [Object Storage OpenStack Swift (IaaS)](https://ibm-public-cos.github.io/crs-docs/)
* [Object Storage OpenStack Swift for Bluemix](https://www.ng.bluemix.net/docs/services/ObjectStorage/index.html)



## Requirements

* Apache Spark with `stocator` library

The easiest way to install the `stocator` library with Apache Spark is to
[pass the Maven coordinates at launch](https://spark-packages.org/package/SparkTC/stocator).
Other installation options are described in the [`stocator` documentation](https://github.com/SparkTC/stocator).
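For example, a local launch might pull `stocator` from Maven Central like this (the version shown is illustrative; check the `stocator` releases for a current one):

```shell
# Launch a local Spark shell with the stocator package on the classpath;
# adjust the version to a current stocator release.
spark-shell --packages com.ibm.stocator:stocator:1.0.9
```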


## Apache Spark at IBM

The `stocator` library is pre-installed and available on

* [Apache Spark through IBM Bluemix](https://console.bluemix.net/catalog/services/apache-spark)
* [IBM Analytics Engine (Beta)](https://console.bluemix.net/catalog/services/ibm-analytics-engine)
* [IBM Data Science Experience](https://datascience.ibm.com)

## Installation

```
library(devtools)
devtools::install_url("https://github.com/ibm-cds-labs/ibmos2spark/archive/<version>.zip", subdir= "r/sparklyr/", dependencies = FALSE)
```

where `version` should be a tagged release, such as `1.0.2`.


###### WARNING

@@ -33,65 +49,100 @@ where RVERSION is the newest install of R (currently 3.3) and delete the `sparklyr` folder.
After deleting, choose File->Quit Session to refresh your R kernel. These steps will refresh your
sparklyr package to the special DSX version.


## Usage

The instructions below demonstrate how to use this package to retrieve data from the various
IBM Object Storage services.

These instructions will refer to the image at the top of this README.

### Cloud Object Storage

This is the service described on the **far left** in the image above. This service is also called IBM Bluemix Cloud Object Storage (COS) in various locations. [Documentation is here](https://www.bluemix.net/docs/services/cloud-object-storage/getting-started.html).

Not Yet Implemented.

### Cloud Object Storage (IaaS)

This is the service described in the **middle left** pane in the image above. This service is sometimes referred to
as the Softlayer IBM Cloud Object Storage service.
[Documentation is here](https://ibm-public-cos.github.io/crs-docs/).

Not Yet Implemented.

### Object Storage OpenStack Swift (IaaS)

This is the service described in the **middle right** pane in the image above (and was previously referred to
as Softlayer Swift Object Storage). [Documentation is here](https://ibm-public-cos.github.io/crs-docs/).

Note below that credentials are not passed in as a list of key-value pairs, like in the other implementations.
Rather, each piece of information is supplied as a separate, required argument when instantiating
a new `softlayer` object.

```
library(ibmos2sparklyr)
configurationname = "softlayerOScon" # can be any name you like (allows for multiple configurations)

slconfig = softlayer(sparkcontext=sc,
                     name=configurationname,
                     auth_url="https://identity.open.softlayer.com",
                     tenant = "XXXXX",
                     username = "XXXXX",
                     password = "XXXXX"
)

container = "my_container"          # name of your object store container
object = "my_data.csv"              # name of object that you want to retrieve in the container
spark_object_name = "dataFromSwift" # name to assign to the new spark object

data = sparklyr::spark_read_csv(sc, spark_object_name, slconfig$url(container, object))
```

### Object Storage OpenStack Swift for Bluemix

This is the service described in the **far right** pane in the image above.
This was previously referred to as Bluemix Swift Object Storage in this documentation. It is
referred to as ["IBM Object Storage for Bluemix" in the Bluemix documentation](https://console.bluemix.net/docs/services/ObjectStorage/os_works_public.html). It has also been referred to as
"OpenStack Swift (Cloud Foundry)".

Credentials are passed as a list of key-value pairs, and the `bluemix` object is used to configure
the connection to this Object Storage service.

If you do not provide a `configurationName`, a default value will be used (`service`). However, if
you are reading from or writing to multiple Object Storage instances, you will need to define a
separate `configurationName` for each instance. Otherwise, only one connection will be configured
at a time, potentially causing errors and confusion.

```
library(ibmos2sparklyr)
configurationname = "bluemixOScon" # can be any name you like (allows for multiple configurations)

# In DSX notebooks, the "insert to code" feature will insert this credentials list for you
creds = list(
    auth_url="https://identity.open.softlayer.com",
    region="dallas",
    project_id = "XXXXX",
    user_id="XXXXX",
    password="XXXXX")

bmconfig = bluemix(sparkcontext=sc, name=configurationname, credentials = creds)

container = "my_container"          # name of your object store container
object = "my_data.csv"              # name of object that you want to retrieve in the container
spark_object_name = "dataFromSwift" # name to assign to the new spark object

data = sparklyr::spark_read_csv(sc, spark_object_name, bmconfig$url(container, object))
```
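To make the multiple-configuration point concrete, here is a hedged sketch (assuming the `creds` list and Softlayer credential values from the examples above) of reading from two Object Storage instances in one session by giving each its own configuration name:

```
library(ibmos2sparklyr)

# One configuration name per Object Storage instance
bmconfig = bluemix(sparkcontext=sc, name="bluemixOScon", credentials = creds)
slconfig = softlayer(sparkcontext=sc, name="softlayerOScon",
                     auth_url="https://identity.open.softlayer.com",
                     tenant = "XXXXX", username = "XXXXX", password = "XXXXX")

# Distinct names yield distinct swift2d URLs, so both reads can coexist
bmdata = sparklyr::spark_read_csv(sc, "dataFromBluemix", bmconfig$url("container_a", "a.csv"))
sldata = sparklyr::spark_read_csv(sc, "dataFromSoftlayer", slconfig$url("container_b", "b.csv"))
```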



## License

Copyright 2017 IBM Cloud Data Services

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
Binary file added r/sparklyr/fig/ibm_objectstores.png
