# Remote data paths in the catalog

<b>How does Kedro handle remote data paths defined in the Data Catalog?</b>

Kedro relies on fsspec to read and save data from a variety of data stores, including:

- Local file systems
- Network file systems
- Cloud object stores
- Hadoop

When specifying a storage location in `filepath:`, provide a URL of the general form `protocol://path/to/data`.
If no protocol is provided, Kedro assumes the local file system (equivalent to `file://`).
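
To make the `protocol://path/to/data` rule concrete, here is a minimal standard-library sketch of how a `filepath` can be split into a protocol and a path, with `file` as the fallback. The helper name `split_protocol` is hypothetical and this is not Kedro's or fsspec's own parsing code, which may handle more cases.

```python
from urllib.parse import urlsplit


def split_protocol(filepath: str) -> tuple[str, str]:
    """Return (protocol, path), defaulting to 'file' when no protocol is given.

    Hypothetical helper for illustration only; not part of Kedro or fsspec.
    """
    if "://" not in filepath:
        # No protocol prefix: treat the value as a local (possibly relative) path.
        return "file", filepath
    parts = urlsplit(filepath)
    # For object stores the bucket name sits in the netloc component.
    return parts.scheme, parts.netloc + parts.path


for fp in [
    "data/01_raw/companies.csv",
    "file:///tmp/companies.csv",
    "s3://my-bucket-name/path/to/data",
]:
    print(split_protocol(fp))
# ('file', 'data/01_raw/companies.csv')
# ('file', '/tmp/companies.csv')
# ('s3', 'my-bucket-name/path/to/data')
```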

The following protocols are available:

- Local or Network File System: `file://` - the local file system is the default when no protocol is given; it also permits relative paths.
- Hadoop File System (HDFS): `hdfs://user@server:port/path/to/data` - Hadoop Distributed File System, for resilient, replicated files within a cluster.
- Amazon S3: `s3://my-bucket-name/path/to/data` - Amazon S3 remote binary store, often used with Amazon EC2, using the s3fs library.
- S3 Compatible Storage: `s3://my-bucket-name/path/to/data` - for example, MinIO, using the s3fs library.
- Google Cloud Storage: `gcs://` - Google Cloud Storage, typically used with Google Compute resources, using gcsfs (in development).
- Azure Blob Storage / Azure Data Lake Storage Gen2: `abfs://` - Azure Blob Storage, typically used when working in an Azure environment.
- HTTP(s): `http://` or `https://` - for reading data directly from HTTP web servers.
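
Most remote protocols need an extra Python package before fsspec can open them. The sketch below checks whether the backend for a given `filepath` is importable. The `BACKENDS` mapping and the `missing_backend` helper are hypothetical: `s3fs` and `gcsfs` come from the list above, while `adlfs` for `abfs://` is an assumption not stated in this document.

```python
from importlib.util import find_spec

# Protocol -> package that provides the fsspec backend.
# s3fs and gcsfs are named in the list above; adlfs for abfs:// is an assumption.
BACKENDS = {"s3": "s3fs", "gcs": "gcsfs", "abfs": "adlfs"}


def missing_backend(filepath: str):
    """Return the package to install for this path, or None if nothing is missing."""
    protocol = filepath.split("://", 1)[0] if "://" in filepath else "file"
    package = BACKENDS.get(protocol)
    if package is None or find_spec(package) is not None:
        # Either no extra package is listed for this protocol, or it is already installed.
        return None
    return package


for fp in ["data/01_raw/companies.csv", "s3://my-bucket-name/path/to/data"]:
    package = missing_backend(fp)
    print(fp, "->", "ok" if package is None else f"install {package}")
```

Running a check like this before `catalog.load(...)` can turn an import error deep inside a dataset into a clear message about which package to install.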


In [None]:
%%writefile catalog.yml
companies:
  type: spark.SparkDataset
  filepath: /Volumes/<user-catalog>/<schema>/<volume>/companies.csv
  file_format: csv
  load_args:
    header: True
    inferSchema: True

In [None]:
%pip install hdfs s3fs

In [None]:
# Initialize SparkSession (should be routed via databricks-connect)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print("Spark version:", spark.version)

In [None]:
# Build a DataCatalog directly from the catalog.yml in the current directory
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog

conf_loader = OmegaConfigLoader(conf_source=".")
catalog_conf = conf_loader.get("catalog")
catalog = DataCatalog.from_config(catalog_conf)

# Load CSV from Unity Catalog Volumes via SparkDataset
df = catalog.load("companies")  # Make sure catalog.yml uses spark.SparkDataset
df.show()

## Example using MinIO

MinIO is an object storage solution that provides an Amazon Web Services S3-compatible API and supports all core S3 features. MinIO is built to deploy anywhere: public or private cloud, bare-metal infrastructure, orchestrated environments, and edge infrastructure.

Using Docker on macOS:

```sh
mkdir -p ~/minio/data

docker run \
   -p 9000:9000 \
   -p 9001:9001 \
   --name minio \
   -v ~/minio/data:/data \
   -e "MINIO_ROOT_USER=<edit-ROOTNAME>" \
   -e "MINIO_ROOT_PASSWORD=<edit-CHANGEME123>" \
   quay.io/minio/minio server /data --console-address ":9001"
```
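
Before pointing Kedro at the container, it can help to confirm that the server is up. The following is a small sketch that assumes MinIO is listening on `http://localhost:9000`, as in the command above, and that it exposes its liveness endpoint at `/minio/health/live`. The function name `minio_is_live` is hypothetical.

```python
import urllib.request


def minio_is_live(endpoint: str = "http://localhost:9000", timeout: float = 2.0) -> bool:
    """Return True if the MinIO liveness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{endpoint}/minio/health/live", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors (URLError subclasses OSError).
        return False


print("MinIO reachable:", minio_is_live())
```

This only checks that the server responds; it does not validate the root user name or password.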

In [None]:
# Test to see if you have access.
# key and secret must match MINIO_ROOT_USER and MINIO_ROOT_PASSWORD from the docker command.
import s3fs

fs = s3fs.S3FileSystem(
    key="ROOTNAME",
    secret="CHANGEME123",
    client_kwargs={"endpoint_url": "http://localhost:9000"},
)

# List all buckets
print(fs.ls("/"))

In [None]:
%%writefile catalog.yml
companies:
  type: pandas.CSVDataset
  filepath: "s3://kedro-databricks/companies.csv"
  credentials: minio
  fs_args:
    anon: false

In [None]:
%%writefile credentials.yml
minio:
  key: <edit-ROOTNAME>
  secret: <edit-CHANGEME123>
  client_kwargs:
    endpoint_url: http://localhost:9000
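
The `credentials: minio` line in `catalog.yml` is a reference to the `minio` key in `credentials.yml`. The sketch below illustrates that lookup with plain dictionaries. It is a simplified illustration under that assumption, not Kedro's actual implementation, and `resolve_credentials` is a hypothetical name.

```python
import copy


def resolve_credentials(catalog_conf: dict, credentials_conf: dict) -> dict:
    """Replace each string `credentials` reference with the matching credentials dict."""
    resolved = copy.deepcopy(catalog_conf)  # leave the caller's config untouched
    for name, entry in resolved.items():
        ref = entry.get("credentials")
        if isinstance(ref, str):
            if ref not in credentials_conf:
                raise KeyError(f"Dataset '{name}' references unknown credentials '{ref}'")
            entry["credentials"] = credentials_conf[ref]
    return resolved


# The same entries as catalog.yml and credentials.yml, written as dictionaries.
catalog_conf = {
    "companies": {
        "type": "pandas.CSVDataset",
        "filepath": "s3://kedro-databricks/companies.csv",
        "credentials": "minio",
        "fs_args": {"anon": False},
    }
}
credentials_conf = {
    "minio": {
        "key": "<edit-ROOTNAME>",
        "secret": "<edit-CHANGEME123>",
        "client_kwargs": {"endpoint_url": "http://localhost:9000"},
    }
}

resolved = resolve_credentials(catalog_conf, credentials_conf)
print(resolved["companies"]["credentials"]["client_kwargs"]["endpoint_url"])
# http://localhost:9000
```

Keeping secrets in a separate file means `catalog.yml` can be committed to version control while `credentials.yml` stays local.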

In [None]:
# Build a DataCatalog from catalog.yml and credentials.yml in the current directory
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog

conf_loader = OmegaConfigLoader(conf_source=".")
catalog_conf = conf_loader.get("catalog")
credentials_conf = conf_loader.get("credentials")

catalog = DataCatalog.from_config(catalog_conf, credentials_conf)

catalog

In [None]:
df = catalog.load("companies")
df.head()

<b>References</b>

- Dataset filepath: https://docs.kedro.org/en/latest/data/data_catalog.html#dataset-filepath
- Dataset access credentials: https://docs.kedro.org/en/latest/data/data_catalog.html#dataset-access-credentials