# Spark read and write on remote Object Stores

By loading the right libraries, Spark is able to both read and write to external Object Stores in a distributed way.

SQL Server Big Data Clusters ships libraries for accessing object stores over the S3 and ADLS Gen2 protocols.

Libraries are updated with each cumulative update, so list the libraries available in your cluster before referencing them. To list the S3 protocol libraries, run the following:

```
kubectl -n <YOUR-BDC-NAMESPACE> exec sparkhead-0 -- bash -c "ls /opt/hadoop/share/hadoop/tools/lib/*aws*"
```

If your scenario requires a library that is either unavailable or version-incompatible with what ships with Big Data Clusters, you have two options:

1. Use a session-based configure cell with dynamic library loading in notebooks or jobs.
2. Copy the additional libraries to a known HDFS location on BDC and reference that location in the session configuration.

These two scenarios are described in detail in the [Manage libraries](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-install-packages?view=sql-server-ver15) and [Submit Spark jobs by using command-line tools](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job-command-line?view=sql-server-ver15) articles.

## Step 1 - Configure access to the remote storage

In this example we will access a remote S3 protocol object store.

The example uses a [MinIO](https://min.io/) object store service, but it would work with other S3 protocol providers.

Please check your S3 object store provider documentation to understand which libraries are required.

With that information at hand, configure your notebook session or job to use the right libraries, as in the example below.

In [None]:
%%configure -f \
{
    "conf": {
        "spark.driver.extraClassPath": "/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.271.jar:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.1.168513.jar",
        "spark.executor.extraClassPath": "/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.271.jar:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.1.168513.jar",
        "spark.hadoop.fs.s3a.buffer.dir": "/var/opt/yarnuser"
    }
}

In [None]:
spark
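
The driver and executor classpaths in the configuration above must list the same jars, which is easy to get wrong when editing by hand. The following is a minimal sketch, not part of the original notebook, that generates the JSON body for the `%%configure` cell from a single list of jars. The helper name `build_session_conf` is hypothetical, and the jar file names are the ones used above; they may differ in your cumulative update.

```python
import json

def build_session_conf(jars, buffer_dir):
    """Return the JSON body for a %%configure cell (hypothetical helper)."""
    # Classpath entries are colon-separated on Linux.
    classpath = ":".join(jars)
    return {
        "conf": {
            # Driver and executors get the same classpath by construction.
            "spark.driver.extraClassPath": classpath,
            "spark.executor.extraClassPath": classpath,
            "spark.hadoop.fs.s3a.buffer.dir": buffer_dir,
        }
    }

LIB_DIR = "/opt/hadoop/share/hadoop/tools/lib"
jars = [
    f"{LIB_DIR}/aws-java-sdk-bundle-1.11.271.jar",
    f"{LIB_DIR}/hadoop-aws-3.1.168513.jar",
]

session_conf = build_session_conf(jars, "/var/opt/yarnuser")
print(json.dumps(session_conf, indent=4))
```

Paste the printed JSON after `%%configure -f` in the first cell of the session.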

## Step 2 - Add in access tokens to access the remote storage dynamically

Follow your S3 provider's security documentation and adjust the following cells so that Spark connects to the endpoint correctly.

In [None]:
access_key="YOUR_ACCESS_KEY"
secret="YOUR_SECRET"

In [None]:
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret)
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "YOUR_ENDPOINT")
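# Temp dir for writes back to S3. The "spark.hadoop." prefix applies only to Spark conf keys, not to keys set directly on the Hadoop configuration.
spark._jsc.hadoopConfiguration().set("fs.s3a.buffer.dir", "/var/opt/yarnuser")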
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false")
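
The repeated `set` calls above can also be driven from a dictionary, which keeps the connection settings in one place. This is a minimal sketch under the assumption that the target object exposes a `set(key, value)` method, as the object returned by `spark._jsc.hadoopConfiguration()` does. The function names `s3a_settings` and `apply_settings` are hypothetical.

```python
def s3a_settings(access_key, secret, endpoint, use_ssl=False):
    """Collect the S3A connection properties used in this notebook (hypothetical helper)."""
    return {
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret,
        "fs.s3a.endpoint": endpoint,
        # Hadoop expects the strings "true" or "false".
        "fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }

def apply_settings(hadoop_conf, settings):
    """Apply each key/value pair to any object with a set(key, value) method."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)

# In the notebook session (requires the live `spark` object):
# apply_settings(
#     spark._jsc.hadoopConfiguration(),
#     s3a_settings(access_key, secret, "YOUR_ENDPOINT"),
# )
```

Keeping the settings in a dictionary also makes it easy to review them before they are applied, but avoid printing the dictionary in a shared notebook because it contains the secret key.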

## Spark read and write patterns

Use the following examples to cover a range of read and write scenarios to remote object stores.

### Read from external S3 and write to BDC HDFS as table

In [None]:
df = spark.read.csv("s3a://NYC-Cab/fhv_tripdata_2015-01.csv", header=True)

In [None]:
df.count()

In [None]:
df.write.format("parquet").save("/securelake/fhv_tripdata_2015-01")

In [None]:
%%sql
DROP TABLE IF EXISTS tripdata

In [None]:
%%sql
CREATE TABLE tripdata
USING parquet
LOCATION '/securelake/fhv_tripdata_2015-01'

In [None]:
%%sql
select count(*) from tripdata

In [None]:
%%sql
select * from tripdata limit 10

### Write back to S3 as parquet

In [None]:
df.write.format("parquet").save("s3a://NYC-Cab/fhv_tripdata_2015-01-3")

### Create external table on S3

This example virtualizes a folder on an external object store as a Hive table.

In [None]:
%%sql
DROP TABLE IF EXISTS tripdata_s3

In [None]:
%%sql
CREATE TABLE tripdata_s3
USING parquet
LOCATION 's3a://NYC-Cab/fhv_tripdata_2015-01-3'

In [None]:
%%sql
select count(*) from tripdata_s3

In [None]:
%%sql
select * from tripdata_s3 limit 10
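
Both table examples above follow the same drop-then-create pattern, differing only in table name and location. As a sketch, the statements can be generated in Python and passed to `spark.sql`. The helper name `table_statements` is hypothetical, and it does no escaping, so it assumes trusted table names and paths.

```python
def table_statements(name, location, fmt="parquet"):
    """Build the DROP/CREATE pair for a table over an existing folder (hypothetical helper)."""
    return [
        # IF EXISTS keeps the first run from failing when the table is absent.
        f"DROP TABLE IF EXISTS {name}",
        f"CREATE TABLE {name} USING {fmt} LOCATION '{location}'",
    ]

for stmt in table_statements("tripdata_s3", "s3a://NYC-Cab/fhv_tripdata_2015-01-3"):
    print(stmt)
    # In the notebook session (requires the live `spark` object):
    # spark.sql(stmt)
```

The same call with `"tripdata"` and `"/securelake/fhv_tripdata_2015-01"` produces the statements for the HDFS-backed table.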