In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.

!pip install -U oracle-ads

<font color=gray>Oracle Data Science service sample notebook.

Copyright (c) 2022 Oracle, Inc.  All rights reserved.
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

***
# <font color=red>How to Read Data with fsspec from Oracle Big Data Service (BDS)</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle Cloud Infrastructure Data Science Service Team </font></p>

***

## Overview:

Oracle Big Data Service (BDS) is an Oracle Cloud Infrastructure service designed for a diverse set of big data use cases and workloads. From short-lived clusters used to tackle specific tasks to long-lived clusters that manage large data lakes, Big Data Service scales to meet an organization’s requirements at a low cost and with the highest levels of security. To connect to BDS from a notebook session, see the notebook `Connect_to_the_Oracle_Big_Data_Service.ipynb`. This notebook demonstrates how to manage data in HDFS through the fsspec file system interface, and how to read and save data with pandas and PyArrow on top of fsspec.

Compatible conda pack: [PySpark 3.0 and Data Flow](https://docs.oracle.com/en-us/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.7 (version 5.0)

---

## Contents:

* <a href='#setup'>Set Up</a>
* <a href='#fsspec'>FSSpec</a>
    * <a href='#list'>Listing</a>
    * <a href='#save_f'>Saving File</a>
    * <a href='#save_folder'>Saving Folder</a>
* <a href='#pandas'>Pandas</a>
* <a href='#pyarrow'>PyArrow</a>
    * <a href='#write'>Write</a>
    * <a href='#read'>Read</a>
    * <a href='#partition'>Partitioned Dataset</a>
        * <a href='#write_p'>Write</a>
        * <a href='#read_p'>Read</a>
        



---


Datasets are provided as a convenience. Datasets are considered third-party content and are not considered materials under your agreement with Oracle.
      
---


In [None]:
import os
import ads
import fsspec
import pandas as pd
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow as pa
import numpy as np

from ads.secrets.big_data_service import BDSSecretKeeper
from ads.bds.auth import has_kerberos_ticket, krbcontext

import logging

logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.WARN)

ads.set_auth("resource_principal")

<a id='setup'></a>
# Setup

The following assumes that you have already saved your configuration with `BDSSecretKeeper` so that you can use `BDSSecretKeeper.load_secret` to load the configuration. To see how to connect without Vault or how to save the configuration with `BDSSecretKeeper`, see the `Connect_to_the_Oracle_Big_Data_Service.ipynb` notebook. In the next cell, replace the `bds_secret_id` placeholder with your secret OCID.

In [None]:
bds_secret_id = "<bds-secret-id>"

<a id='fsspec'></a>
## FSSpec

`FSSpec` provides a Pythonic interface to local, remote, and embedded file systems. This notebook shows you how to use it to work with files in HDFS.

In [None]:
if bds_secret_id != "<bds-secret-id>":
    with BDSSecretKeeper.load_secret(bds_secret_id) as cred:
        with krbcontext(principal=cred["principal"], keytab_path=cred["keytab_path"]):
            hdfs_config = {
                "protocol": "webhdfs",
                "host": cred["hdfs_host"],
                "port": cred["hdfs_port"],
                "kerberos": "True",
            }
    print(hdfs_config)
else:
    hdfs_config = None
    print(
        "The secret OCID, bds_secret_id, is not defined. Enter configuration values in the Setup section."
    )

Instantiate the file system from `hdfs_config`.

In [None]:
if hdfs_config is not None:
    fs = fsspec.filesystem(**hdfs_config)
else:
    print("hdfs config is not specified.  Provide value to complete setup")
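
The `fs` object follows the generic fsspec file system interface, so the calls used in the rest of this notebook can be rehearsed without a cluster. The following is a minimal sketch under stated assumptions: it swaps WebHDFS for fsspec's built-in in-memory file system (`"memory"`), and the `/demo/...` paths and file contents are hypothetical.

```python
import fsspec

# Stand-in for the WebHDFS file system: fsspec's built-in in-memory backend.
# It needs no host, port, or Kerberos ticket, but exposes the same methods.
demo_fs = fsspec.filesystem("memory")

# Create two small files under a hypothetical directory.
demo_fs.pipe("/demo/biketrips/a.csv", b"id,duration\n1,300\n")
demo_fs.pipe("/demo/biketrips/b.csv", b"id,duration\n2,450\n")

# detail=False returns only the paths; the default (detail=True)
# returns a list of info dictionaries instead.
listing = sorted(demo_fs.ls("/demo/biketrips", detail=False))
print(listing)

# cat returns the raw bytes of a file.
content = demo_fs.cat("/demo/biketrips/a.csv")
print(content.decode())
```

Once a real `hdfs_config` is available, the same `ls` and `cat` calls can be issued on `fs` instead of `demo_fs`.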

After the HDFS file system is instantiated, you can view and work with the files in it.

<a id='list'></a>
#### Listing
List the objects at the root path.

In [None]:
if hdfs_config is not None:
    print(fs.ls("/"))
else:
    print("hdfs config is not specified.  Provide value to complete setup")

You can also specify a path and list the objects in it.   For example, the next cell shows how to list objects in the `/data/biketrips` path. You can replace `file_path` with the path that you want to list the objects from. 

In [None]:
file_path = "/data/biketrips"
if hdfs_config is not None and file_path != "":
    print(fs.ls(file_path))
else:
    print("hdfs config is not specified.  Provide value to complete setup")

<a id='save_f'></a>
#### Saving File
You can download a file from the HDFS file system to a local directory with `get`. The `rpath` parameter specifies the path of the source data and `lpath` specifies where to save it. The next cell saves the `/data/biketrips/JC-201901-citibike-tripdata.csv` file to the local path `./JC-201901-citibike-tripdata.csv`. You can replace `hdfs_data_path` and `dest_path` with your HDFS source path and local destination path.

In [None]:
hdfs_data_path = "/data/biketrips/JC-201901-citibike-tripdata.csv"
dest_path = "./JC-201901-citibike-tripdata.csv"

In [None]:
if hdfs_config is not None:
    fs.get(rpath=hdfs_data_path, lpath=dest_path)
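
The `get` call can be tried out without a cluster. This sketch assumes the in-memory fsspec file system as a stand-in for the remote side; the path `/demo_get/tripdata.csv` and its contents are hypothetical.

```python
import os
import tempfile

import fsspec

# Stand-in for HDFS: the in-memory file system plays the "remote" side.
remote_fs = fsspec.filesystem("memory")
remote_path = "/demo_get/tripdata.csv"  # hypothetical remote file
remote_fs.pipe(remote_path, b"id,duration\n1,300\n")

# Download the remote file into a local temporary directory.
local_dir = tempfile.mkdtemp()
local_path = os.path.join(local_dir, "tripdata.csv")
remote_fs.get(rpath=remote_path, lpath=local_path)

# Confirm that the local copy matches the remote content.
with open(local_path, "rb") as f:
    downloaded = f.read()
print(downloaded)
```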

<a id='save_folder'></a>
#### Saving Folder
You can also copy a folder from the HDFS file system to a local directory. If `lpath` ends with a "/", it is assumed to be a directory and the target files are placed inside it. You can pass in a list of paths, which may contain glob patterns that are expanded. You can replace `source_directory_1`, `source_directory_2`, and `dest_directory` with your HDFS directories and local directory.

In [None]:
source_directory_1 = "/data/biketrip*/"
source_directory_2 = "/data/station*/"
dest_directory = "data/"

In [None]:
if hdfs_config is not None:
    fs.get([source_directory_1, source_directory_2], dest_directory, recursive=True)
else:
    print("hdfs config is not specified.  Provide value to complete setup")

<a id='pandas'></a>
## Pandas
You can also open the data using Pandas through fsspec.

In [None]:
if hdfs_config is not None:
    with fs.open(hdfs_data_path, "r") as f:
        df = pd.read_csv(f)
    display(df.head())
else:
    print("hdfs config is not specified.  Provide value to complete setup")
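
The open-then-read pattern works the same way on any fsspec file system. The following sketch is an illustration under assumptions, not the BDS workflow itself: it uses the in-memory file system, and the `/demo_pandas/...` paths and data are hypothetical.

```python
import fsspec
import pandas as pd

# Stand-in for HDFS: fsspec's in-memory file system.
demo_fs = fsspec.filesystem("memory")
csv_path = "/demo_pandas/trips.csv"  # hypothetical file
demo_fs.pipe(csv_path, b"id,duration\n1,300\n2,450\n")

# Open the file in text mode and hand the file object to pandas.
with demo_fs.open(csv_path, "r") as f:
    trips = pd.read_csv(f)
print(trips)

# Write the DataFrame back through the same file system.
# index=False leaves out the row index column.
out_path = "/demo_pandas/trips_copy.csv"
with demo_fs.open(out_path, "w") as f:
    trips.to_csv(f, index=False)

# Read the copy back to confirm the round trip.
with demo_fs.open(out_path, "r") as f:
    roundtrip = pd.read_csv(f)
```

Writing through `fs.open(..., "w")` is one way to save a DataFrame back to the remote file system rather than to a local path.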

Call `to_csv` to save to a local directory. You can replace `csv_dest_path` with the destination path you want to save the file to. 

In [None]:
csv_dest_path = "./tripdata_example1.csv"

In [None]:
if hdfs_config is not None:
    df.to_csv(csv_dest_path)
else:
    print("hdfs config is not specified.  Provide value to complete setup")

Because pandas integrates with fsspec, you can also pass a URL directly to pandas.

There are two ways to pass parameters to the backend file system driver. The first is to extend the URL to include the username, password, server, port, and so on, in the form `protocol://host:port/path/to/data`, and to provide any remaining parameters in `storage_options`.

In [None]:
if hdfs_config is not None:
    # hdfs_data_path already starts with "/", so no extra separator is needed after the port.
    df = pd.read_csv(
        f"webhdfs://{hdfs_config['host']}:{hdfs_config['port']}{hdfs_data_path}",
        storage_options={"kerberos": "True"},
    )
    display(df.head())
else:
    print("hdfs config is not specified.  Provide value to complete setup")
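
When the host, port, and path come from separate variables, it is easy to end up with a missing or doubled slash in the URL. The helper below is a hypothetical convenience function, not part of fsspec or ADS, and the host name and port number are placeholder values.

```python
def build_webhdfs_url(host: str, port: int, path: str) -> str:
    """Join a host, a port, and an HDFS path into a webhdfs:// URL."""
    # Strip leading slashes from the path so that exactly one "/"
    # separates the port from the path.
    return f"webhdfs://{host}:{port}/{path.lstrip('/')}"


# Placeholder host and port, used for illustration only.
url = build_webhdfs_url(
    "bds-host.example.com",
    50470,
    "/data/biketrips/JC-201901-citibike-tripdata.csv",
)
print(url)
```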

Call `to_csv` to save to the local directory. You can replace `csv_dest_path` with the destination path you want to save the file to. 

In [None]:
if hdfs_config is not None:
    df.to_csv(csv_dest_path)
else:
    print("hdfs config is not specified.  Provide value to complete setup")

The second method is more general: use a URL of the form `protocol://path/to/data` and pass a dictionary of parameters to `storage_options`.

In [None]:
if hdfs_config is not None:
    df = pd.read_csv(f"webhdfs://{hdfs_data_path}", storage_options=hdfs_config)
    display(df.head())
else:
    print("hdfs config is not specified.  Provide value to complete setup")

Again, you can call `to_csv` to save the file locally. You can replace `csv_dest_path` with the destination path you want to save the file to. 

In [None]:
csv_dest_path = "./tripdata_example3.csv"
if hdfs_config is not None:
    df.to_csv(csv_dest_path)
else:
    print("hdfs config is not specified.  Provide value to complete setup")

<a id='pyarrow'></a>
## PyArrow

Apache Arrow is a cross-language development platform for in-memory analytics. PyArrow is the Python implementation of Arrow, which contains a set of technologies that enable big data systems to store, process, and move data fast. For more details, see the [PyArrow documentation](https://arrow.apache.org/docs/python/index.html#:~:text=This%20is%20the%20documentation%20of,process%20and%20move%20data%20fast).


You can customize the connection parameters, create a file system object, and then pass it to the `filesystem` keyword.

In [None]:
if hdfs_config is not None:
    # Use a distinct name so the `ds` module alias (pyarrow.dataset) is not overwritten.
    bike_dataset = ds.dataset(hdfs_data_path, format="csv", filesystem=fs)
    display(bike_dataset.to_table().to_pandas().head())
else:
    print("hdfs config is not specified.  Provide value to complete setup")

<a id='write'></a>
### Write
Call `write_table` to write back to an HDFS file system on BDS using `fsspec`. You can replace the `parquet_file_name`, `table_path`, and `dest_hdfs_path` values with the values you want to use.

In [None]:
parquet_file_name = "biketrips1.parquet"
table_path = "/data/test/parquet"
dest_hdfs_path = f"{table_path}/{parquet_file_name}"

In [None]:
if hdfs_config is not None:
    pq.write_table(bike_dataset.to_table(), dest_hdfs_path, filesystem=fs)
else:
    print("hdfs config is not specified.  Provide value to complete setup")

<a id='read'></a>
### Read
Call `read_table` to load the Parquet data. Passing the directory `table_path` reads every Parquet file stored in it.

In [None]:
if hdfs_config is not None:
    display(pq.read_table(f"{table_path}/", filesystem=fs).to_pandas().head())
else:
    print("hdfs config is not specified.  Provide value to complete setup")

Remove the added parquet file.

In [None]:
if hdfs_config is not None:
    fs.rm(dest_hdfs_path)
else:
    print("hdfs config is not specified.  Provide value to complete setup")

<a id='partition'></a>
### Partitioned Dataset

You can write a partitioned dataset to the HDFS file system using [PyArrow](https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files). First, create a pandas DataFrame and load it into a PyArrow table.

In [None]:
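import string

idx = pd.date_range("2022-01-01 12:00:00.000", "2022-03-01 12:00:00.000", freq="min")

# Build random 8-character strings with NumPy instead of the private pandas testing helper.
alphabet = np.array(list(string.ascii_letters + string.digits))
random_chars = np.random.choice(alphabet, size=(len(idx), 8))

df = pd.DataFrame(
    {
        "numeric_col": np.random.rand(len(idx)),
        "string_col": ["".join(row) for row in random_chars],
    },
    index=idx,
)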
df["dt"] = df.index
df["dt"] = df["dt"].dt.date

table = pa.Table.from_pandas(df)

<a id='write_p'></a>
#### Write

Call `write_to_dataset` to write the partitioned data. The keyword `root_path` specifies the root directory on HDFS where the data is saved. `partition_cols` takes a list of column names by which to partition the dataset. Spark places some constraints on the types of Parquet files that it reads; the `flavor='spark'` option sets these options automatically and also sanitizes field characters that Spark SQL does not support. You can replace `root_path` with your path name.

In [None]:
root_path = f"{table_path}/partitioned"

In [None]:
if hdfs_config is not None:
    pq.write_to_dataset(
        table, root_path=root_path, partition_cols=["dt"], flavor="spark", filesystem=fs
    )

    # Check if the data is successfully written.
    print(fs.ls(root_path))
else:
    print("hdfs config is not specified.  Provide value to complete setup")

<a id='read_p'></a>
#### Read

Call `read_table` to load the partitioned data.

In [None]:
if hdfs_config is not None:
    display(pq.read_table(root_path, filesystem=fs).to_pandas().head())
else:
    print("hdfs config is not specified.  Provide value to complete setup")

<a id="ref"></a>
# References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)