## Dataset API 

This API makes it easy to upload data to your cluster, for example a Python program to run as a job, and to download data from the cluster to your local environment.

## Scope

This notebook showcases the following functionality:

* Upload file
* Download file
* Check for file existence
* Delete file

In [1]:
import hopsworks

## Connect to the cluster

In [2]:
# Connect to your cluster. Use this form when running in Jupyter or in a job inside the cluster.
connection = hopsworks.connection()

Connected. Call `.close()` to terminate connection gracefully.


In [3]:
# Uncomment when connecting to the cluster from an external environment.
# connection = hopsworks.connection(project='my_project', host='my_instance', port=443, api_key_value='apikey')

## Get the project

In [4]:
# Get the project object. When run inside your Hopsworks cluster, this returns the current project.
project = connection.get_project()

In [5]:
# Uncomment to get a specific project by name
# project = connection.get_project('my_project')

## Get the API

In [6]:
dataset_api = project.get_dataset_api()

## Upload a file to Hopsworks

In [7]:
# Download example file to work with
!wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz

--2022-04-12 12:57:52--  http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Resolving yann.lecun.com (yann.lecun.com)... 188.114.96.7, 188.114.97.7, 2a06:98c1:3120::7, ...
Connecting to yann.lecun.com (yann.lecun.com)|188.114.96.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9912422 (9.5M) [application/x-gzip]
Saving to: ‘train-images-idx3-ubyte.gz’


2022-04-12 12:57:53 (8.76 MB/s) - ‘train-images-idx3-ubyte.gz’ saved [9912422/9912422]



In [8]:
# Upload the file from the local environment to the cluster
uploaded_file_path = dataset_api.upload("train-images-idx3-ubyte.gz", "Resources", overwrite=True)
print("File was uploaded on path: {}".format(uploaded_file_path))

File was uploaded on path: Resources/train-images-idx3-ubyte.gz


In [9]:
# Check file existence
dataset_api.exists(uploaded_file_path)

True
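
The existence check can also guard an upload, so that a large file is not sent again when it is already on the cluster. The following is a minimal sketch, not part of the Hopsworks library: `upload_if_missing` is a hypothetical helper that relies only on the `exists` and `upload` calls used above, and it assumes the remote path is the target directory joined with the file name, as in the path printed by the upload cell.

```python
import posixpath
from pathlib import Path


def upload_if_missing(dataset_api, local_path, remote_dir):
    """Upload local_path into remote_dir unless a file with that name is already there.

    dataset_api is any object exposing exists(path) and upload(local_path, remote_dir),
    as used in the cells above. Returns a (remote_path, was_uploaded) tuple.
    """
    # Assumption: the remote path is "<remote_dir>/<file name>", matching the
    # "Resources/train-images-idx3-ubyte.gz" path printed by the upload cell.
    remote_path = posixpath.join(remote_dir, Path(local_path).name)
    if dataset_api.exists(remote_path):
        # Already on the cluster, so skip the transfer.
        return remote_path, False
    return dataset_api.upload(local_path, remote_dir), True
```

With the objects from this notebook, a call would look like `upload_if_missing(dataset_api, "train-images-idx3-ubyte.gz", "Resources")`. Note that this only checks the name, not the content, so a changed local file would still be skipped.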

## Download a file from Hopsworks

In [10]:
# Download the file from the cluster to the local environment
dataset_api.download(uploaded_file_path, overwrite=True)

'/srv/hops/staging/private_dirs/ca53d36092d3792645bc527eb455d405542efd398f2f62c130c1310416c1aa1e/train-images-idx3-ubyte.gz'
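
Since `download` returns the local path of the downloaded copy, one way to confirm the round trip left the data intact is to compare checksums of the original and the downloaded file. This is a sketch using only the Python standard library; `sha256_of` and `same_content` are hypothetical helper names, not part of the Dataset API.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never held fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True when both files have identical content."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, keep the return value of the download call in a variable such as `downloaded_path` and then evaluate `same_content("train-images-idx3-ubyte.gz", downloaded_path)`.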

In [11]:
# Remove the uploaded file or folder
dataset_api.remove(uploaded_file_path)

In [12]:
# File is now removed
dataset_api.exists(uploaded_file_path)

False
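
The upload, use, and remove steps shown in this notebook can be bundled so that cleanup happens even when the code in between fails. The following is a minimal sketch under that assumption: `temporary_upload` is a hypothetical context manager, not part of the Hopsworks library, built only from the `upload`, `exists`, and `remove` calls used above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_upload(dataset_api, local_path, remote_dir):
    """Upload a file, yield its remote path, and remove it again on exit.

    dataset_api is any object exposing upload, exists, and remove as used above.
    """
    remote_path = dataset_api.upload(local_path, remote_dir, overwrite=True)
    try:
        yield remote_path
    finally:
        # Clean up even if the body raised; skip if the file is already gone.
        if dataset_api.exists(remote_path):
            dataset_api.remove(remote_path)
```

Usage would look like `with temporary_upload(dataset_api, "train-images-idx3-ubyte.gz", "Resources") as path: ...`, after which `dataset_api.exists(path)` is expected to return `False`.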