## Oracle AI Data Platform v1.0

Copyright © 2025, Oracle and/or its affiliates.

Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/

# Ingest from Multi Cloud Storage

This notebook illustrates how to ingest data from multiple cloud storage systems, including:
-  Ingest from Azure ADLS - https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction
-  Ingest from AWS S3 - https://aws.amazon.com/pm/serv-s3

The integration pattern is the same for each platform: download the dependent libraries at the versions identified below and install them into your compute cluster. The cloud-specific details are in the notebook cells below for each platform and generally involve setting spark.conf values.


## Ingest from Azure ADLS

This example ingests data from ADLS and writes it into a delta table in the catalog.

### Prerequisites

1. Install Azure JAR file from
 - https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/3.3.4

2. Install dependent libraries
 - https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind/2.12.7
 - https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-core/2.12.7
 - https://mvnrepository.com/artifact/org.codehaus.jackson/jackson-mapper-asl/1.9.13

3. Restart the cluster.
4. Use your notebooks and Python tasks.
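The links above point at mvnrepository.com pages rather than at the JAR files themselves. As a minimal sketch, assuming the artifacts are fetched from Maven Central and follow its standard repository layout, the direct download URLs can be derived from the group, artifact, and version shown in each link. The helper name `maven_jar_url` is hypothetical; confirm that each URL resolves before relying on it.

```python
# Base URL of the Maven Central repository (assumption: the JARs are pulled from here).
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_jar_url(group_id, artifact_id, version):
    """Build a JAR download URL from Maven coordinates (hypothetical helper)."""
    # Maven's repository layout turns the dots in the group id into path separators.
    group_path = group_id.replace(".", "/")
    return f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


# Coordinates taken from the prerequisite links above.
AZURE_ARTIFACTS = [
    ("org.apache.hadoop", "hadoop-azure", "3.3.4"),
    ("com.fasterxml.jackson.core", "jackson-databind", "2.12.7"),
    ("com.fasterxml.jackson.core", "jackson-core", "2.12.7"),
    ("org.codehaus.jackson", "jackson-mapper-asl", "1.9.13"),
]

for coords in AZURE_ARTIFACTS:
    print(maven_jar_url(*coords))
```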

In [None]:
# Change for your details
storage_account_name="your_storage_account_name"
client_id="your_client_id"
secret="your_secret"
tenant="your_tenant"
container="your_container"
data_file="your_file_name"  # any file type works; make sure the spark.read format below matches it
target_table_name="default.default.data_from_adls"
# end of changes

spark.conf.set(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net", secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant}/oauth2/token")

df = spark \
    .read \
    .format("csv") \
    .option("header", True) \
    .load(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/{data_file}")
df.show()

df.write.mode("overwrite").format("delta").saveAsTable(target_table_name)
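The spark.conf settings in the cell above all share the same storage account host suffix. As a minimal sketch of that pattern, the keys can be built once as a plain dictionary and then applied in a loop. The helper names `adls_oauth_conf` and `abfss_uri` and the sample values are hypothetical; the keys and values mirror the cell above.

```python
def adls_oauth_conf(storage_account_name, client_id, secret, tenant):
    """Return the ABFS OAuth settings from the cell above as a {key: value} dict."""
    host = f"{storage_account_name}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant}/oauth2/token",
    }


def abfss_uri(container, storage_account_name, path):
    """Build the abfss:// URI for a file inside a container."""
    return f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/{path.lstrip('/')}"


# Placeholder values for illustration only.
conf = adls_oauth_conf("myaccount", "my-client-id", "my-secret", "my-tenant")
for key in conf:
    print(key)  # print the keys only, never the secret value
print(abfss_uri("mycontainer", "myaccount", "/data/input.csv"))
```

In the notebook, the dictionary would be applied with `for key, value in conf.items(): spark.conf.set(key, value)` before calling `spark.read`.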

## Ingest from AWS S3

This example ingests data from S3 and writes it into a delta table in the catalog.

### Prerequisites

1. Install Hadoop AWS JAR file from
 - https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.4

2. Install the AWS SDK bundle. Download it from
 - https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.12.262

The bundle is 280 MB; for now, you must **install it in the cluster by using an external volume**. Upload it to OCI Object Storage and then create an external volume.

3. Set the credentials in the Spark configuration on the cluster:
```
spark.hadoop.fs.s3a.secret.key = your_secret_key
spark.hadoop.fs.s3a.access.key = your_access_key
```

4. Restart the cluster.

5. Use your notebooks and Python tasks.
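Spark copies any cluster setting that starts with `spark.hadoop.` into the Hadoop configuration with that prefix removed, which is why the S3A connector can read `fs.s3a.access.key` and `fs.s3a.secret.key` without further setup in the notebook. The sketch below only illustrates that name mapping in plain Python; the helper `hadoop_view` and the extra `spark.executor.memory` entry are hypothetical and not part of any Spark API.

```python
SPARK_HADOOP_PREFIX = "spark.hadoop."


def hadoop_view(cluster_conf):
    """Return the Hadoop properties implied by spark.hadoop.* cluster settings."""
    return {
        key[len(SPARK_HADOOP_PREFIX):]: value
        for key, value in cluster_conf.items()
        if key.startswith(SPARK_HADOOP_PREFIX)
    }


cluster_conf = {
    "spark.hadoop.fs.s3a.secret.key": "your_secret_key",
    "spark.hadoop.fs.s3a.access.key": "your_access_key",
    "spark.executor.memory": "4g",  # hypothetical non-Hadoop setting, ignored by the mapping
}

# Print only the property names so that no credential value is echoed.
for name in sorted(hadoop_view(cluster_conf)):
    print(name)
```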

In [None]:
# Change for your details
bucket_name = 'your_bucket'
file_name = 'your_file_name'
target_table_name = 'default.default.data_from_s3'
region="us-east-1"
# end of changes

spark.conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")

df = spark.read.json(f"s3a://{bucket_name}/{file_name}")
df.show()

df.write.mode("overwrite").format("delta").saveAsTable(target_table_name)
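The cells in this notebook hard-code the reader (CSV for ADLS, JSON for S3), so the reader has to change whenever the file type changes. One possible way to keep them in step is to derive the reader format from the file extension. This is a minimal sketch: the helper `spark_format_for` and the mapping are hypothetical, and only formats built into Spark are listed.

```python
import os

# Extension -> Spark reader format. Extend this mapping for your own data.
FORMAT_BY_EXTENSION = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".orc": "orc",
    ".txt": "text",
}


def spark_format_for(file_name):
    """Return the Spark reader format implied by the file extension."""
    extension = os.path.splitext(file_name)[1].lower()
    try:
        return FORMAT_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"No reader format configured for {file_name!r}") from None


print(spark_format_for("sales/2025/orders.CSV"))  # csv
print(spark_format_for("events.json"))            # json
```

In the notebook this could be used as `spark.read.format(spark_format_for(file_name)).load(f"s3a://{bucket_name}/{file_name}")`. Format-specific options, such as `header` for CSV, would still need to be set separately.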


## AWS S3 with boto3

Install boto3 by creating a requirements.txt file that includes the boto3 package and installing it as a library in your cluster.

In [None]:
import boto3

# Change for your details
secret="your_secret"
access="your_access_key"
region="us-east-1"
bucket_name="your_bucket"
prefix=""  # an empty prefix lists every object; S3 keys do not start with '/'
# end of changes

s3 = boto3.client('s3', aws_access_key_id=access, aws_secret_access_key=secret, region_name=region)

# List the objects in the bucket under the prefix (a single call returns at most 1,000 keys)
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
for content in response.get('Contents', []):
    print(content['Key'])
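A single `list_objects_v2` call returns at most 1,000 keys, so the loop above can miss objects in larger buckets. A minimal sketch of one way to handle that uses the boto3 paginator for the same operation. The helper name `iter_keys` is hypothetical, and it takes the client as an argument so it can work with the `s3` client created above.

```python
def iter_keys(s3_client, bucket, prefix=""):
    """Yield every object key under the prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no 'Contents' entry when nothing matches the prefix.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

In the notebook, `for key in iter_keys(s3, bucket_name): print(key)` would print every key in the bucket.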
