PyFilesystem extension for Azure Datalake Storage Gen. 1.
PyFilesystem is a filesystem abstraction for Python that provides the same API whatever the storage backend (hard drive, cloud services, archive files, ...).
Azure Datalake Store is a cloud storage service provided by Microsoft, dedicated to big-data, Hadoop-like operations.
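For instance (a minimal sketch using two backends bundled with FS2; fs.datalake adds a Datalake backend to this same API):

```python
from fs import open_fs

# The same FS2 calls work unchanged on any backend: here an in-memory
# filesystem and a temporary directory, later a Datalake store.
for fs_url in ("mem://", "temp://"):
    with open_fs(fs_url) as some_fs:
        some_fs.writetext("hello.txt", "Hello, backend!")
        print(some_fs.listdir("/"))
```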
Warning
About Datalake store generation
fs.datalake does not (yet) support Azure Datalake Store Gen. 2 backends as long as the underlying
azure-datalake-store Python library doesn't.
Warning
Alpha version
This software should not be used for production purposes unless you have tested it heavily with your application and
a sandbox Datalake store. The underlying azure-datalake-store Python package is itself not considered stable
according to Microsoft contributors.
Install Python >= 3.6.
pip install fs.datalake

from fs.datalake import DatalakeFS
storage_name = "mystorage" # Created through the Azure dashboard
tenant_id = "xxx" # Provided by your Azure dashboard
username = "myself" # Your Azure dashboard credentials
password = "my-secret"
datalake_storage = DatalakeFS(storage_name, tenant_id=tenant_id, username=username, password=password)
# Play with your storage using the usual FS API
print(datalake_storage.listdir("."))Note
About authentication methods
The above example shows an authentication with a tenant_id, a username and a password.
This is one authentication method among others that you might prefer. And I have to admit that I didn't
understand everything in the Azure Datalake security policies.
Just remember that all keyword parameters you provide to DatalakeFS are transferred as-is to the
azure.datalake.store.lib.auth function that provides the session authentication token.
https://github.com/Azure/azure-data-lake-store-python/blob/master/azure/datalake/store/lib.py
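As an illustration (a sketch, not part of this package's public API), the username / password example above boils down to something like:

```python
from azure.datalake.store import lib

# Roughly what DatalakeFS does under the hood: forward its keyword
# arguments to lib.auth() to obtain the session token.
token = lib.auth(tenant_id=tenant_id, username=username, password=password)
```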
For example, depending on your security settings in the Datalake dashboard, you may prefer using a
client_id / client_secret pair. This works too:
storage_name = "mystorage" # Created through the Azure dashboard
tenant_id = "xxx" # Provided by your Azure dashboard
client_id = "yyyyy..." # Your Azure dashboard credentials
client_secret = "zzzzz..."
datalake_storage = DatalakeFS(storage_name, tenant_id=tenant_id, client_id=client_id,
client_secret=client_secret)

As with most FS2 backends, you may create a connection using the open_fs factory.
https://docs.pyfilesystem.org/en/latest/reference/opener.html#fs.opener.registry.Registry.open_fs
Example:
from fs import open_fs
if authenticate_with_username_password:
    url = f"datalake://{username}:{password}@{storage_name}?tenant_id={tenant_id}"
elif authenticate_with_client_id:
    url = f"datalake://{storage_name}?tenant_id={tenant_id}&client_id={client_id}&client_secret={client_secret}"
else:
    # Please read the azure-datalake-store doc for other authentication options.
    ...

datalake_storage = open_fs(url)
# Play with your storage using the usual FS API
print(datalake_storage.listdir("."))

Warning
You may need to URL-quote your username, password and other parameters if they contain special
characters like "/", "=", spaces and some others.
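For instance, with the standard library (see also the complete recipe further below):

```python
from urllib.parse import quote

# safe="" makes quote() escape "/" as well
password = quote("my/secret=pass word", safe="")
# -> "my%2Fsecret%3Dpass%20word"
```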
As you can read in the doc, open_fs may take additional parameters after the URL. Note that with the
datalake://... URLs, writable, create and default_protocol are ignored, though you may provide the cwd
keyword parameter.
Example:
datalake_storage = open_fs(url, cwd="some/directory")

Please use a dedicated virtualenv to maintain this package, but I should not need to say that.
Grab the source from the SCM repository, then cd to the root:
$ pip install -e .[testing]

Running the tests requires an Azure account and a Datalake Gen. 1 storage, whose credentials must be provided through environment variables, namely:
- DL_TENANT_ID
- DL_USERNAME
- DL_PASSWORD
- DL_CLIENT_ID
- DL_CLIENT_SECRET
- DL_STORE_NAME
Their respective contents should be obvious if you have read the documentation above.
You may provide these environment variables with a .env file in this project or a parent directory. This file will
be loaded at the beginning of any test session.
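For instance, a minimal .env file could look like this (all values are placeholders):

```
DL_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
DL_USERNAME=myself
DL_PASSWORD=my-secret
DL_CLIENT_ID=yyyyy...
DL_CLIENT_SECRET=zzzzz...
DL_STORE_NAME=mystorage
```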
$ python setup.py test
$ python run_tests.py

Copyright 2019 Gilles Lenfant
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- PyFilesystem documentation
- Azure Datalake Storage
- azure-datalake-store Python lib
- Azure Datalake Store data operations with Python: https://docs.microsoft.com/azure/data-lake-store/data-lake-store-data-operations-python
https://github.com/glenfant/fs.datalake
https://github.com/glenfant/fs.datalake/issues
As written above, some data provided in the datalake://... URLs must be quoted, otherwise it is corrupted when
building the DatalakeFS connector. Here is a simple recipe for building a clean quoted URL:
from urllib.parse import urlunparse, urlencode, quote

query = {
    # Provide all required parameters
    "tenant_id": tenant_id,
    "client_id": client_id,
    "client_secret": client_secret,
}
query = urlencode(query)
if username and password:
    # safe="" makes quote() escape "/" too
    store_name = f"{quote(username, safe='')}:{quote(password, safe='')}@{store_name}"
parts = ("datalake", store_name, "", "", query, "")
datalake_url = urlunparse(parts)

The first alpha release will support Python 3.6 and later. Older Python versions won't be supported unless contributed as PRs that don't break the tests with later versions.
As Python 2.7 support is planned to be dropped by FS2, I won't add a complicated Python 2.x compatibility layer, and won't accept PRs for Python 2.7 support.
Authentication against Azure services provides a token with a one-hour lifetime. This is not a major issue for CLI applications, but it could be one for long-running processes.
So I must find a way to refresh that token automatically (and find what exception, if any, is raised by the lower-level lib when querying the server with an outdated token).
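A possible shape for this, as a sketch only (the exception to catch and the attribute names are hypothetical, not actual fs.datalake internals):

```python
import functools
from azure.datalake.store import lib

def with_fresh_token(method):
    """Retry a DatalakeFS method once with a refreshed token (sketch)."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            return method(self, *args, **kwargs)
        except Exception:  # TODO: narrow to the actual "expired token" exception
            # Hypothetical attribute: the auth kwargs kept at construction time
            self.token = lib.auth(**self.auth_kwargs)
            return method(self, *args, **kwargs)
    return wrapper
```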
I am looking for documentation about the various limitations of Datalake and their consequences on this software:
- What is the encoding of file / directory names?
- Are there forbidden characters in file / directory names?
- What is the size limit of file / directory names?
- Is there a limit on directory nesting levels?
There are lots of crypto options on Datalake storage. I have to admit that I am somewhat stuck in that domain, and didn't provide specific features to play with encrypted Datalake stores. Any help in that field is welcome.