## Reading/Writing Parquet Files From/to OCI Object Storage with Pandas

The setup for this notebook is simple. I used the `generalml_p37_cpu_v1` conda environment, which has ADS, pyarrow, pandas, snappy, and fastparquet pre-installed.

I also upgraded ADS. The notebook was originally run with ADS 2.5.8; the output shown below reports 2.8.8.

To read and write files in Object Storage, use `ocifs`:
* GitHub: https://github.com/oracle/ocifs
* Docs: https://docs.oracle.com/en-us/iaas/tools/ocifs-sdk/latest/unix-operations.html
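
Every path used in this notebook has the form `oci://<bucket>@<namespace>/<object>`. As a minimal sketch, a small helper can build these URIs in one place. The name `oci_uri` is hypothetical (it is not part of `ocifs` or ADS), and the helper only formats a string; it does not contact Object Storage.

```python
def oci_uri(bucket: str, namespace: str, object_name: str) -> str:
    """Build an oci://bucket@namespace/object URI (hypothetical helper)."""
    # Drop a leading slash so the result never contains "//" after the namespace.
    return f"oci://{bucket}@{namespace}/{object_name.lstrip('/')}"


# Example with the bucket, namespace, and object used later in this notebook.
print(oci_uri("hosted-ds-datasets", "bigdatadatasciencelarge",
              "nyc_tlc/2009/01/data.parquet"))
# prints: oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/2009/01/data.parquet
```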

In [None]:
#!pip install oracle-ads --upgrade

In [1]:
import ads 
from ads.common.auth import default_signer
import pandas as pd 
import fsspec
from ocifs import OCIFileSystem

print(ads.__version__)

# Use resource principal authentication.
ads.set_auth(auth="resource_principal")

2.8.8


In [5]:
# object storage bucket + data 
# this bucket is publicly available 
bucket = "hosted-ds-datasets"
namespace = "bigdatadatasciencelarge"

In [8]:
!oci os object list --bucket-name {bucket} --namespace {namespace}

ServiceError:
{
    "client_version": "Oracle-PythonSDK/2.110.2, Oracle-PythonCLI/3.31.1",
    "code": "BucketNotFound",
    "logging_tips": "Please run the OCI CLI command using --debug flag to find more debug information.",
    "message": "Either the bucket named 'hosted-ds-datasets' does not exist in the namespace 'bigdatadatasciencelarge' or you are not authorized to access it",
    "opc-request-id": "fra-1:ZewZrFZQO0ZbOY29ZDJyuio-6ceCTAt_fHfUvZe1ora68c4CnYchC6ST33sckLBx",
    "operation_name": "list_objects",
    "request_endpoint": "GET https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/bigdatadatasciencelarge/b/hosted-ds-datasets/o",
    "status": 404,
    "target_service": "object_storage",
    "timestamp": "2023-08-23T10:33:08.654656+00:00",
    "troubleshooting_tips": "See [https://docs.oracle.com/iaas/Content/API/References/apierrors.htm] for more information about resolving this error. If you are unable to resolve this issue, run this CLI command with --debug option and

# Single Large File

In [3]:
# ~ 440 MB, 14M rows. 
large_files = ["nyc_tlc/2009/01/data.parquet"] # NYC Taxi dataset 

In [None]:
# using the `pyarrow` engine. You can also use `fastparquet`. 
for f in large_files: 
    df = pd.read_parquet(f"oci://{bucket}@{namespace}/{f}", 
                     storage_options=default_signer(),
                     engine="pyarrow")
    print(f"file {f}")
    print(df.head())
    print(f"size {df.shape}")

# Multiple Large Files

In [None]:
from oci.auth.signers import get_resource_principals_signer

In [None]:
# Use resource principal for authentication.
fs = OCIFileSystem(signer=get_resource_principals_signer())

In [None]:
# This is a small sample of the overall NYC Taxi dataset. There are 4 files in total, each of size ~ 450MB. 
relevant_files = fs.ls(f"{bucket}@{namespace}/nyc_tlc/2009/")

In [None]:
%%time 
# this operation takes about 50 secs: 
df = pd.concat(
    (pd.read_parquet(f"oci://{f}/data.parquet", storage_options=default_signer(), engine="pyarrow")
     for f in relevant_files),
    ignore_index=True)

In [None]:
# ~ 56M rows 
df.shape
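
The concatenation step above relies on each entry returned by `fs.ls` having the form `bucket@namespace/prefix`, to which `oci://` and `/data.parquet` are added. As a rough sketch, that mapping can be pulled out into a function and checked on its own before any download starts. The name `to_parquet_uris` is hypothetical, and the entry format is an assumption inferred from how the paths are used in this notebook.

```python
from typing import Iterable, List


def to_parquet_uris(listed_paths: Iterable[str],
                    filename: str = "data.parquet") -> List[str]:
    """Map `bucket@namespace/prefix` entries to full oci:// URIs, in sorted order."""
    uris = []
    for path in sorted(listed_paths):
        # Strip a trailing slash so the URI does not end up with "//".
        uris.append(f"oci://{path.rstrip('/')}/{filename}")
    return uris


# Entries shaped like the ones assumed to come back from fs.ls(...).
example = ["my-bucket@my-namespace/nyc_tlc/2009/02/",
           "my-bucket@my-namespace/nyc_tlc/2009/01"]
for uri in to_parquet_uris(example):
    print(uri)
```

Printing the URIs first makes it easy to spot an unexpected entry in the listing before starting a multi-minute read.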

## Write Parquet Files to Object Storage

In [None]:
%%time 

# Write the dataframe as a Parquet file to Object Storage.
# Set your_bucket and your_namespace to the bucket and namespace you want to write to.
# This operation takes about 2 minutes.
your_bucket = ""
your_namespace = ""

df.to_parquet(f"oci://{your_bucket}@{your_namespace}/taxi-data.parquet", 
                     storage_options=default_signer())