# Using the Stocator library to connect to IBM Cloud COS and load objects as Spark DataFrames

The following examples use IAM authentication to access objects.

The Stocator jar used in this demo is built from 
https://github.com/CODAIT/stocator/tree/1.0.30-ibm-sdk

## Using a local jar

In [1]:
from pyspark import SparkConf

In [2]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.master("local")\
.config("spark.jars", "/Users/shengyipan/.m2/repository/com/ibm/stocator/stocator/1.0.30-IBM-SDK/stocator-1.0.30-IBM-SDK.jar")\
.config("fs.stocator.scheme.list", "cos")\
.config("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")\
.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")\
.config("fs.stocator.cos.scheme", "cos")\
.config("fs.cos.myCos.endpoint", "https://s3.ams03.objectstorage.softlayer.net")\
.config("fs.cos.myCos.iam.api.key", "")\
.config("fs.cos.myCos.iam.service.id", "crn:v1:bluemix:public:cloud-object-storage:global:a/abc:xyz::")\
.getOrCreate()
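
The `fs.cos.myCos.*` keys in the cell above share one pattern: `myCos` is a service name chosen by the user, and the same name appears after the bucket name in the `cos://` URI. The following is a minimal sketch of a helper that collects these settings in one place. The function name `stocator_conf` and its parameters are hypothetical and are not part of Stocator or PySpark; the keys and values are the ones already used in this demo.

```python
def stocator_conf(service, endpoint, api_key, service_id):
    """Return the Stocator/COS settings used in this demo as a plain dict.

    `service` is the user-chosen service name (``myCos`` in this demo).
    Illustrative helper only; it is not part of Stocator.
    """
    prefix = f"fs.cos.{service}"
    return {
        "fs.stocator.scheme.list": "cos",
        "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
        "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
        "fs.stocator.cos.scheme": "cos",
        f"{prefix}.endpoint": endpoint,
        f"{prefix}.iam.api.key": api_key,
        f"{prefix}.iam.service.id": service_id,
    }


conf = stocator_conf(
    service="myCos",
    endpoint="https://s3.ams03.objectstorage.softlayer.net",
    api_key="",  # placeholder, left empty as in the cell above
    service_id="crn:v1:bluemix:public:cloud-object-storage:global:a/abc:xyz::",
)
for key in sorted(conf):
    print(key)
```

Each pair could then be applied with `builder.config(key, value)` in a loop before calling `getOrCreate()`, which keeps the service name consistent across all three per-service keys.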

In [6]:
df = spark.read.csv("cos://test-bucket.myCos/source/year=2018/month=08/day=28/")

In [7]:
df.show()

+--------------------+
|                 _c0|
+--------------------+
|These examples gi...|
|Spark is built on...|
|      In the RDD API|
+--------------------+
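
The path passed to `spark.read.csv` follows the pattern `cos://<bucket>.<service>/<path>`, where `<service>` is the name used in the `fs.cos.<service>.*` settings (`myCos` here). The sketch below builds and splits such URIs with plain string handling. The helper names are hypothetical, and the split assumes the service name is the part after the last dot of the authority, which may not match every bucket naming scheme.

```python
def cos_uri(bucket, service, path=""):
    """Build a cos://<bucket>.<service>/<path> URI as used in this demo."""
    return f"cos://{bucket}.{service}/{path.lstrip('/')}"


def split_cos_uri(uri):
    """Split a cos:// URI into (bucket, service, path).

    Assumption: the service name is everything after the last dot
    in the authority part of the URI.
    """
    scheme = "cos://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a cos:// URI: {uri}")
    authority, _, path = uri[len(scheme):].partition("/")
    bucket, dot, service = authority.rpartition(".")
    if not dot or not bucket or not service:
        raise ValueError(f"expected <bucket>.<service> in: {uri}")
    return bucket, service, path


uri = cos_uri("test-bucket", "myCos", "source/year=2018/month=08/day=28/")
print(uri)
print(split_cos_uri(uri))
```

Splitting the URI this way makes it easy to check that the service name in a path matches the one configured on the Spark session before a read is attempted.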



## Using a remote jar
Steps:

1. Create `settings.xml` and put it under `$HOME/.ivy2/`.
2. Point Spark at that file and at the Stocator package, as in the cells further down.

For step 1, create an empty file named `settings.xml`, then paste the following contents into it.
Remember to replace the `username` and `passwd` values with your own credentials.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<ivy-settings>
  <settings defaultResolver="main" />
  <!--Authentication required for publishing (deployment). 'Artifactory Realm' is the realm used by Artifactory so don't change it.-->
  <credentials host="na.artifactory.swg-devops.com" realm="Artifactory Realm" username="your-user-name" passwd="your-password" />
  <resolvers>
    <chain name="main">
      <ibiblio name="public" m2compatible="true" root="https://na.artifactory.swg-devops.com:443/artifactory/txo-cedp-garage-artifacts-sbt-local" />
    </chain>
  </resolvers>
</ivy-settings>
```
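
Because the file contains credentials, it can be convenient to generate it rather than edit it by hand. The following is a minimal sketch that renders the same settings with the user name and password passed in as arguments; `render_settings` and `write_settings` are hypothetical helper names, and the XML is the one shown above minus the comment.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr

SETTINGS_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<ivy-settings>
  <settings defaultResolver="main" />
  <credentials host="na.artifactory.swg-devops.com" realm="Artifactory Realm" username={username} passwd={passwd} />
  <resolvers>
    <chain name="main">
      <ibiblio name="public" m2compatible="true" root="https://na.artifactory.swg-devops.com:443/artifactory/txo-cedp-garage-artifacts-sbt-local" />
    </chain>
  </resolvers>
</ivy-settings>
"""


def render_settings(username, passwd):
    """Return the settings.xml text with the given credentials filled in."""
    # quoteattr adds the surrounding quotes and escapes XML special characters.
    return SETTINGS_TEMPLATE.format(
        username=quoteattr(username), passwd=quoteattr(passwd)
    )


def write_settings(path, username, passwd):
    """Write settings.xml to `path`, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_settings(username, passwd), encoding="utf-8")
    return path


print(render_settings("your-user-name", "your-password"))
```

A call such as `write_settings(Path.home() / ".ivy2" / "settings.xml", user, password)` would place the file where step 1 expects it; where `user` and `password` come from (for example, environment variables) is left to the reader.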

In [1]:
from pyspark import SparkConf

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.master("local")\
.config("spark.jars.ivySettings", "/Users/shengyipan/.ivy2/settings.xml")\
.config("spark.jars.packages", "com.ibm.stocator:stocator:1.0.30-IBM-SDK")\
.config("fs.stocator.scheme.list", "cos")\
.config("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")\
.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")\
.config("fs.stocator.cos.scheme", "cos")\
.config("fs.cos.myCos.endpoint", "https://s3.ams03.objectstorage.softlayer.net")\
.config("fs.cos.myCos.iam.api.key", "")\
.config("fs.cos.myCos.iam.service.id", "crn:v1:bluemix:public:cloud-object-storage:global:a/abc:xyz::")\
.getOrCreate()

In [4]:
df = spark.read.csv("cos://test-bucket.myCos/source/year=2018/month=08/day=28/")

In [5]:
df.show()

+--------------------+
|                 _c0|
+--------------------+
|These examples gi...|
|Spark is built on...|
|      In the RDD API|
+--------------------+



## Running with the spark-submit command

In [None]:
spark-submit \
--master {{ params.spark_master }} \
--deploy-mode {{ params.spark_deploy_mode }} \
--name {{ params.job_name }} \
--conf spark.jars.ivySettings=$HOME/.ivy2/settings.xml \
--conf spark.jars.packages="com.ibm.stocator:stocator:1.0.30-IBM-SDK,com.ibm.cedpgarage:sample-project:1.7.0-SNAPSHOT" \
--driver-java-options "-Dcos.apiKey={{ params.api_key }} -Dcos.bucketSource=cos://{{ params.input_bucket }}.{{ params.cos_service_name }}/{{ params.input_path }}{{ execution_date | date_to_path() }} -Dcos.bucketDest=cos://{{ params.output_bucket }}.{{ params.cos_service_name }}/{{ params.output_path }}{{ execution_date | date_to_path() }} -Dcos.endpoint={{ params.endpoint_url }} -Dcos.serviceId={{ params.cos_service_id }}" \
--class {{ params.job_class }} \
/path/to/fake.jar
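
The `{{ ... }}` placeholders in the command appear to be Jinja-style template expressions, and `date_to_path()` is a custom filter whose definition is not shown in this document. The following is a sketch of what such a filter might look like, assuming it should produce the `year=YYYY/month=MM/day=DD/` layout used in the read examples earlier. The actual filter may differ.

```python
from datetime import date, datetime


def date_to_path(value):
    """Format a date as a partition path such as year=2018/month=08/day=28/.

    Accepts a date/datetime object or an ISO-style string that starts
    with YYYY-MM-DD. Hypothetical stand-in for the filter in the template.
    """
    if isinstance(value, str):
        value = datetime.strptime(value[:10], "%Y-%m-%d")
    return f"year={value.year:04d}/month={value.month:02d}/day={value.day:02d}/"


print(date_to_path(date(2018, 8, 28)))
# year=2018/month=08/day=28/
print("cos://test-bucket.myCos/source/" + date_to_path("2018-08-28"))
# cos://test-bucket.myCos/source/year=2018/month=08/day=28/
```

With a filter like this, the rendered `-Dcos.bucketSource` value would point at the same kind of date-partitioned path that the notebook cells read directly.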