## S3 access to MinIO object storage
This notebook demonstrates how to use MinIO object storage from a Spark notebook.
In this demonstration, MinIO is installed in the same CSC Rahti namespace as Spark.

MinIO project: https://github.com/CSCfi/minio-OpenShift

Spark project: https://github.com/CSCfi/spark-openshift

In [1]:
# Set the cluster IP of your MinIO service and the access keys defined when MinIO was installed
from pyspark import SparkConf, SparkContext
conf = SparkConf()

conf.set("spark.jars", "file:/opt/jars/aws-java-sdk.jar,file:/opt/jars/hadoop-aws.jar")
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("spark.hadoop.fs.s3a.endpoint", "http://<<Rahti service ip for MinIO pod >>:9000")
conf.set("spark.hadoop.fs.s3a.access.key", "")
conf.set("spark.hadoop.fs.s3a.secret.key", "")

sc = SparkContext(conf=conf)
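
The cell above leaves the endpoint and keys to be filled in by hand. As a minimal sketch of one alternative, the settings could be built from environment variables so that the keys never end up in the notebook. The variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` and the helper `minio_s3a_settings` are hypothetical and not part of either project. The two extra S3A properties (path-style access and the SSL switch) are assumptions that often apply to MinIO deployments; drop them if your setup does not need them.

```python
import os


def minio_s3a_settings(env=None):
    """Return Spark S3A settings for MinIO as a dict.

    Reads hypothetical environment variables; adjust the names to
    whatever your deployment actually provides.
    """
    env = os.environ if env is None else env
    endpoint = env["MINIO_ENDPOINT"]  # e.g. "http://<service ip>:9000"
    return {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": env["MINIO_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": env["MINIO_SECRET_KEY"],
        # Assumption: address buckets as <endpoint>/<bucket>, not <bucket>.<endpoint>
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # Assumption: use TLS only when the endpoint URL is https
        "spark.hadoop.fs.s3a.connection.ssl.enabled":
            str(endpoint.startswith("https://")).lower(),
    }


# Usage in the notebook, before creating the SparkContext:
# for key, value in minio_s3a_settings().items():
#     conf.set(key, value)
```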

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark - MinIO reader-writer") \
    .getOrCreate()

In [3]:
# Read a local CSV file into a DataFrame (no header option is set, so the header line is loaded as a data row)
titanic = spark.read.csv("data/Titanic_data.csv")

In [8]:
# Basic Spark operations
titanic.count()

1314

In [4]:
titanic.show(10)

+---+------+----+------+--------+----------+
|_c0|   _c1| _c2|   _c3|     _c4|       _c5|
+---+------+----+------+--------+----------+
| id|PClass| Age|Gender|Survived|GenderCode|
|  1|   1st|  29|female|       1|         1|
|  2|   1st|   2|female|       0|         1|
|  3|   1st|  30|  male|       0|         0|
|  4|   1st|  25|female|       0|         1|
|  5|   1st|0.92|  male|       1|         0|
|  6|   1st|  47|  male|       1|         0|
|  7|   1st|  63|female|       1|         1|
|  8|   1st|  39|  male|       0|         0|
|  9|   1st|  58|female|       1|         1|
+---+------+----+------+--------+----------+
only showing top 10 rows
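
In the output above the columns are named `_c0` to `_c5` and the header line (`id`, `PClass`, ...) appears as the first data row, because the file was read without a header option. A minimal sketch of reading the same file with the header parsed is given below; the helper name `read_csv_with_header` is hypothetical, while `header` and `inferSchema` are standard options of `spark.read.csv`.

```python
def read_csv_with_header(spark, path):
    """Read a CSV file, using its first line as column names.

    `spark` is an existing SparkSession. `inferSchema=True` asks Spark
    to guess column types instead of reading everything as strings.
    """
    return spark.read.csv(path, header=True, inferSchema=True)


# Usage in the notebook:
# titanic = read_csv_with_header(spark, "data/Titanic_data.csv")
# titanic.printSchema()
```

With the header parsed this way, `titanic.count()` should drop by one, since the header line is no longer counted as a row.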



In [5]:
# Write the DataFrame to object storage in Parquet format
titanic.write.parquet("s3a://spark-test/titanic.parquet")
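
By default Spark refuses to write to a path that already exists, so running the cell above a second time fails. The sketch below shows one way around that, assuming it is acceptable to replace the previous output: it writes with the `overwrite` save mode and reads the data back to confirm the round trip. The helper name `roundtrip_parquet` is hypothetical.

```python
def roundtrip_parquet(spark, df, path):
    """Write `df` to `path` as Parquet, replacing existing data, then read it back.

    `spark` is an existing SparkSession and `df` a DataFrame.
    """
    df.write.mode("overwrite").parquet(path)
    return spark.read.parquet(path)


# Usage in the notebook:
# titanic_back = roundtrip_parquet(spark, titanic, "s3a://spark-test/titanic.parquet")
# titanic_back.count()
```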

In [6]:
# Read data from object storage
iris = sc.textFile("s3a://data/iris.data")

In [9]:
# Basic Spark operations
iris.count()

152

In [7]:
iris.take(10)

['sepal.length,sepal.width,petal.length,petal.width,class',
 '5.1,3.5,1.4,0.2,Iris-setosa',
 '4.9,3.0,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.3,0.2,Iris-setosa',
 '4.6,3.1,1.5,0.2,Iris-setosa',
 '5.0,3.6,1.4,0.2,Iris-setosa',
 '5.4,3.9,1.7,0.4,Iris-setosa',
 '4.6,3.4,1.4,0.3,Iris-setosa',
 '5.0,3.4,1.5,0.2,Iris-setosa',
 '4.4,2.9,1.4,0.2,Iris-setosa']
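
`sc.textFile` returns raw lines, so the header line is part of the data and is included in the count. A minimal sketch of turning these lines into typed records follows; the function name `parse_iris_line` is hypothetical, and it assumes every data line has four numeric fields followed by the class label, as in the sample above. Lines that do not fit (the header, blank lines) are mapped to `None` so they can be filtered out.

```python
def parse_iris_line(line):
    """Parse one line of iris.data into (float, float, float, float, str).

    Returns None for lines that are not data rows, such as the header
    or an empty line.
    """
    parts = line.strip().split(",")
    if len(parts) != 5:
        return None
    try:
        measurements = tuple(float(p) for p in parts[:4])
    except ValueError:
        return None  # non-numeric fields, e.g. the header line
    return measurements + (parts[4],)


# Usage in the notebook:
# rows = iris.map(parse_iris_line).filter(lambda r: r is not None)
# rows.take(3)
```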