# Using the Stocator library to connect to IBM Cloud COS and load objects as Spark DataFrames

The following examples use IAM authentication to access objects.

The Stocator jar used in this demo is built from 
https://github.com/CODAIT/stocator/tree/1.0.30-ibm-sdk

## I. Using a local Stocator jar

In [4]:
from pyspark import SparkConf

In [5]:
from pyspark.sql import SparkSession

Assuming you have downloaded the jar to
"/Users/shengyipan/.m2/repository/com/ibm/stocator/stocator/1.0.30-IBM-SDK/stocator-1.0.30-IBM-SDK.jar"

In [12]:
spark = SparkSession.builder.master("local")\
.config("spark.jars", "/Users/shengyipan/.m2/repository/com/ibm/stocator/stocator/1.0.30-IBM-SDK/stocator-1.0.30-IBM-SDK.jar")\
.config("fs.stocator.scheme.list", "cos")\
.config("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")\
.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")\
.config("fs.stocator.cos.scheme", "cos")\
.config("fs.cos.myCos.endpoint", "https://s3.ams03.objectstorage.softlayer.net")\
.config("fs.cos.myCos.iam.api.key", "YOUR_API_KEY").getOrCreate()
# Optional config (add it to the chain above, before .getOrCreate()):
# .config("fs.cos.myCos.iam.service.id", "crn:v1:bluemix:public:cloud-object-storage:global:a/xxx:abc::")\
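
The same Stocator options are repeated for every session in this notebook, so it may help to collect them in one place. The sketch below is only an illustration: `stocator_cos_options` and `apply_options` are hypothetical helper names, not part of Stocator or PySpark. The keys and values are the ones used in the cell above, and the service name (`myCos` here) is the part of the `fs.cos.<service>.*` keys that is reused later in the `cos://` URL.

```python
def stocator_cos_options(service_name, endpoint, api_key, service_id=None):
    """Return the Stocator/COS options used above as a plain dict (hypothetical helper)."""
    prefix = f"fs.cos.{service_name}"
    options = {
        "fs.stocator.scheme.list": "cos",
        "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
        "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
        "fs.stocator.cos.scheme": "cos",
        f"{prefix}.endpoint": endpoint,
        f"{prefix}.iam.api.key": api_key,
    }
    # The service id is optional in this section, so only add it when given.
    if service_id is not None:
        options[f"{prefix}.iam.service.id"] = service_id
    return options


def apply_options(builder, options):
    """Call builder.config(key, value) for every option and return the builder."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder


options = stocator_cos_options(
    "myCos", "https://s3.ams03.objectstorage.softlayer.net", "YOUR_API_KEY"
)
for key in sorted(options):
    print(key)

# With pyspark installed, the session could then be created like this:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.master("local").config("spark.jars", "/path/to/stocator.jar")
# spark = apply_options(builder, options).getOrCreate()
```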


In [13]:
df = spark.read.csv("cos://test-bucket-span001.myCos/source/year=2018/month=08/day=28/")

In [14]:
df.show()

+--------------------+
|                 _c0|
+--------------------+
|These examples gi...|
|Spark is built on...|
|      In the RDD API|
+--------------------+
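
The URL passed to `spark.read.csv` has the form `cos://<bucket>.<service>/<object path>`, where `<service>` matches the name used in the `fs.cos.<service>.*` options (`myCos` in this notebook). A minimal sketch of that composition follows; `cos_url` is a hypothetical helper written for this illustration, not a Stocator function.

```python
def cos_url(bucket, service_name, object_path=""):
    """Build a cos:// URL of the form cos://<bucket>.<service>/<object path>."""
    # Drop a leading slash so the result never contains "//" after the service name.
    return f"cos://{bucket}.{service_name}/{object_path.lstrip('/')}"


print(cos_url("test-bucket-span001", "myCos", "source/year=2018/month=08/day=28/"))
# → cos://test-bucket-span001.myCos/source/year=2018/month=08/day=28/
```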



## II. Using a remote Stocator jar in JFrog

In general, the steps are:

1. Create settings.xml and put it under $HOME/.ivy2/, or in any other location if you don't use Ivy or don't even know what Ivy is :-)
2. Run the following Python code

### Step 1:
Create an empty file named "settings.xml", then copy and paste the following contents into it.
Remember to change the username and passwd values to match your own.
```
<?xml version="1.0" encoding="UTF-8"?>
<ivy-settings>
  <settings defaultResolver="main" />
  <!--Authentication required for publishing (deployment). 'Artifactory Realm' is the realm used by Artifactory so don't change it.-->
  <credentials host="na.artifactory.swg-devops.com" realm="Artifactory Realm" username="your-user-name" passwd="your-password" />
  <resolvers>
    <chain name="main">
      <ibiblio name="public" m2compatible="true" root="https://na.artifactory.swg-devops.com:443/artifactory/txo-cedp-garage-artifacts-sbt-local" />
    </chain>
  </resolvers>
</ivy-settings>
```
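
If you would rather not type credentials into the file by hand, the file can also be generated. The following is a rough sketch under stated assumptions: `render_ivy_settings` and `write_ivy_settings` are hypothetical helpers, the environment variable names in the usage comment are made up for this example, and the XML body is copied from the settings shown above.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr

# Same settings as above; {username} and {password} are filled in as quoted attributes.
IVY_SETTINGS_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<ivy-settings>
  <settings defaultResolver="main" />
  <credentials host="na.artifactory.swg-devops.com" realm="Artifactory Realm" username={username} passwd={password} />
  <resolvers>
    <chain name="main">
      <ibiblio name="public" m2compatible="true" root="https://na.artifactory.swg-devops.com:443/artifactory/txo-cedp-garage-artifacts-sbt-local" />
    </chain>
  </resolvers>
</ivy-settings>
"""


def render_ivy_settings(username, password):
    """Return the settings.xml text with the credentials escaped for XML."""
    # quoteattr adds the surrounding quotes and escapes characters such as & and <.
    return IVY_SETTINGS_TEMPLATE.format(
        username=quoteattr(username), password=quoteattr(password)
    )


def write_ivy_settings(path, username, password):
    """Write settings.xml to `path`, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_ivy_settings(username, password), encoding="utf-8")
    # The file contains a password, so restrict it to the current user.
    path.chmod(0o600)
    return path


# Possible usage (ARTIFACTORY_USER / ARTIFACTORY_PASSWORD are assumed variable names):
# import os
# write_ivy_settings(Path.home() / ".ivy2" / "settings.xml",
#                    os.environ["ARTIFACTORY_USER"], os.environ["ARTIFACTORY_PASSWORD"])
```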

In [1]:
# Step 2:
from pyspark import SparkConf

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.master("local")\
.config("spark.jars.ivySettings", "/Users/shengyipan/.ivy2/settings.xml")\
.config("spark.jars.packages", "com.ibm.stocator:stocator:1.0.30-IBM-SDK")\
.config("fs.stocator.scheme.list", "cos")\
.config("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")\
.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")\
.config("fs.stocator.cos.scheme", "cos")\
.config("fs.cos.myCos.endpoint", "https://s3.ams03.objectstorage.softlayer.net")\
.config("fs.cos.myCos.iam.api.key", "YOUR_API_KEY").getOrCreate()
# Optional config (add it to the chain above, before .getOrCreate()):
# .config("fs.cos.myCos.iam.service.id", "crn:v1:bluemix:public:cloud-object-storage:global:a/abc:xyz::")\


In [4]:
df = spark.read.csv("cos://test-bucket.myCos/source/year=2018/month=08/day=28/")

In [5]:
df.show()

+--------------------+
|                 _c0|
+--------------------+
|These examples gi...|
|Spark is built on...|
|      In the RDD API|
+--------------------+
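
The object paths in these examples follow a `year=/month=/day=` layout. If the date varies from run to run, the path could be derived from a date object, as in this small sketch; `partition_path` and its `prefix` default are assumptions made for this illustration, based only on the paths shown above.

```python
from datetime import date


def partition_path(day, prefix="source"):
    """Return a path such as source/year=2018/month=08/day=28/ for the given date."""
    # Month and day are zero-padded to two digits, as in the paths used above.
    return f"{prefix}/year={day:%Y}/month={day:%m}/day={day:%d}/"


print(partition_path(date(2018, 8, 28)))
# → source/year=2018/month=08/day=28/
```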



## III. Using ibmos2spark package

In [2]:
import ibmos2spark

In [None]:
# Note: a service ID is REQUIRED here
credentials = {
    'endpoint': 'https://s3.ams03.objectstorage.softlayer.net',
    'api_key': 'YOUR API KEY',
    'service_id': 'crn:v1:bluemix:public:cloud-object-storage:global:a/xxx:aaa::'
}

# This can be any arbitrary string
configuration_name = 'myCOS'
cos = ibmos2spark.CloudObjectStorage(sc, credentials,
                                        configuration_name=configuration_name,
                                        cos_type='bluemix_cos',
                                        auth_method='api_key')

In [None]:
bucket_name = 'test-bucket'
object_name = 'source/year=2018/month=08/day=31/test.txt'
data_url = cos.url(object_name, bucket_name)
data = sc.textFile(data_url)

In [None]:
data.take(1)

In [None]:
# Assuming a SparkSession instance named "spark" is available,
# you can do the following to get a DataFrame:
spark.read.csv(data_url).show()
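
Since this section needs a service ID in addition to the endpoint and API key, it can be useful to check the credentials dictionary before handing it to `ibmos2spark`. The check below is a minimal sketch; `missing_credential_keys` is a hypothetical helper, and the list of required keys is taken from the example in this section rather than from the package itself.

```python
# Keys used by the credentials dictionary in this section.
REQUIRED_KEYS = ("endpoint", "api_key", "service_id")


def missing_credential_keys(credentials, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `credentials`."""
    return [key for key in required if not credentials.get(key)]


example = {
    "endpoint": "https://s3.ams03.objectstorage.softlayer.net",
    "api_key": "YOUR API KEY",
}
missing = missing_credential_keys(example)
if missing:
    print("Missing credential keys:", ", ".join(missing))
# → Missing credential keys: service_id
```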

## IV. Importing Stocator with the spark-submit command

1. Complete Step 1 from Section II above
2. Run the following spark-submit command

In [None]:
spark-submit \
--master {{ params.spark_master }} \
--deploy-mode {{ params.spark_deploy_mode }} \
--name {{ params.job_name }} \
--conf spark.jars.ivySettings=/path/to/settings.xml \
--conf spark.jars.packages="com.ibm.stocator:stocator:1.0.30-IBM-SDK,com.ibm.cedpgarage:sample-project:1.7.0-SNAPSHOT" \
--driver-java-options "-Dcos.apiKey={{ params.api_key }} -Dcos.bucketSource=cos://{{ params.input_bucket }}.{{ params.cos_service_name }}/{{ params.input_path }}{{ execution_date | date_to_path() }} -Dcos.bucketDest=cos://{{ params.output_bucket }}.{{ params.cos_service_name }}/{{ params.output_path }}{{ execution_date | date_to_path() }} -Dcos.endpoint={{ params.endpoint_url }} -Dcos.serviceId={{ params.cos_service_id }}" \
--class {{ params.job_class }} \
/path/to/fake.jar
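
The command above is a template with `{{ ... }}` placeholders. When the command is assembled from Python instead, building it as an argument list avoids shell quoting problems around `--driver-java-options`. The sketch below is one possible way to do that; `build_spark_submit` is a hypothetical helper, and every value in the example call (master, job name, class name, paths) is a placeholder chosen for illustration.

```python
import shlex


def build_spark_submit(master, deploy_mode, job_name, ivy_settings,
                       packages, java_options, job_class, jar):
    """Return the spark-submit command as a list of arguments."""
    # All -D options travel as ONE argument to --driver-java-options.
    driver_opts = " ".join(f"-D{key}={value}" for key, value in java_options.items())
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--name", job_name,
        "--conf", f"spark.jars.ivySettings={ivy_settings}",
        "--conf", f"spark.jars.packages={','.join(packages)}",
        "--driver-java-options", driver_opts,
        "--class", job_class,
        jar,
    ]


cmd = build_spark_submit(
    master="local[*]",                      # placeholder
    deploy_mode="client",                   # placeholder
    job_name="example-job",                 # placeholder
    ivy_settings="/path/to/settings.xml",
    packages=["com.ibm.stocator:stocator:1.0.30-IBM-SDK"],
    java_options={
        "cos.apiKey": "YOUR_API_KEY",
        "cos.endpoint": "https://s3.ams03.objectstorage.softlayer.net",
    },
    job_class="com.example.Main",           # placeholder class name
    jar="/path/to/fake.jar",
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))

# To actually run it (requires spark-submit on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```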