In [1]:
import os
import sys
from uuid import uuid4
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell

ENDPOINT_URL = "http://0.0.0.0:9000"
os.environ["ENDPOINT_URL"] = ENDPOINT_URL
os.environ["AWS_ACCESS_KEY_ID"] = "minio"
os.environ["AWS_SECRET_ACCESS_KEY"] = "miniominio"

InteractiveShell.ast_node_interactivity = "all"

### Initial file structure

In [2]:
!tree

.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    └── data.txt

2 directories, 3 files


#### Start up s3 mock

In [3]:
!docker-compose up -d

[+] Running 1/0
 ⠿ Container learnathon-minio-1  Running                                   0.0s

---
# DATA-TOOLZ TUTORIAL

---

This notebook demonstrates the capabilities of `data-toolz`, an open-source Python package for filesystem I/O with convenient `pandas` wrapper features.

## Table of Contents:

* [What is data-toolz](#What-is-data-toolz?)
  * [Why use it](#Why-use-it)
  * [Building blocks](#Building-blocks)
* [Installation](#Installation)
* [Feature overview](#Feature-overview)
  * [FileSystem](#FileSystem-(datatoolz.filesystem.FileSystem))
    * [Write and read](#Write-and-read)
    * [Basic operations](#Basic-operations)
    * [More examples](#More-examples)
    * [Exercise](#Exercise)
    * [AWS role-based access](#AWS-role-based-access)
  * [DataIO](#DataIO-(datatoolz.io.DataIO))
    * [Basic writing and reading parquet](#Basic-writing-and-reading-parquet)
    * [Advanced writing](#Advanced-writing)
    * [Other file types](#Other-file-types)
  * [JsonLogger](#JsonLogger-(datatoolz.logging.JsonLogger))
* [Notes](#Notes)

---
# What is [data-toolz](https://pypi.org/project/data-toolz/)?

`data-toolz` is an open-source Python package providing convenient access to I/O operations for both the local filesystem and cloud storage (currently AWS S3 is supported), as well as a layer for accessing data-like objects (`parquet`, `jsonlines`, `dsv`).

## Why use it

The rationale behind creating this package was to standardize common and recurring I/O operations and minimize boilerplate code.

Most data processes involve the following steps:
* reading input
* processing
* writing output

The goal of `data-toolz` is to simplify the "read" and "write" steps by providing a common interface for various file systems or file-system-like services.

---
Let's look at a simple example of reading a file, processing it and storing the results locally

In [4]:
with open("example-bucket/data.txt") as file:
    data = [int(item) for item in file]

processed = list((item, f"hello {item} from local") for item in data)

with open("example-bucket/processed-local.txt", mode="wt") as file:
    for item in processed:
        size = file.write(f"{item}\n")
        
!cat example-bucket/processed-local.txt && rm example-bucket/processed-local.txt

(1, 'hello 1 from local')
(2, 'hello 2 from local')
(3, 'hello 3 from local')
(4, 'hello 4 from local')
(5, 'hello 5 from local')


And now the same operation in a cloud environment

In [5]:
import boto3

s3_client = boto3.client("s3", endpoint_url=ENDPOINT_URL)

obj = s3_client.get_object(Bucket="example-bucket", Key="data.txt")
data = [int(item) for item in obj["Body"].read().decode("utf-8").split()]

processed = list((item, f"hello {item} from s3") for item in data)

body = "".join(f"{item}\n" for item in processed).encode("utf-8")
response = s3_client.put_object(Bucket="example-bucket", Key="processed-s3.txt", Body=body)

!aws s3 --endpoint-url=$ENDPOINT_URL cp s3://example-bucket/processed-s3.txt  - | head
!aws s3 --endpoint-url=$ENDPOINT_URL ls s3://example-bucket
!aws s3 --endpoint-url=$ENDPOINT_URL rm s3://example-bucket/processed-s3.txt

(1, 'hello 1 from s3')
(2, 'hello 2 from s3')
(3, 'hello 3 from s3')
(4, 'hello 4 from s3')
(5, 'hello 5 from s3')
2022-04-04 21:52:15         10 data.txt
2022-04-05 21:32:06        115 processed-s3.txt
delete: s3://example-bucket/processed-s3.txt


---
As you can see, the code for the two storage types looks very different, even though the operations performed are nearly identical.

There are a few ways to address this issue:

1. Replicate (cloud) production environment
  * unix vs windows
  * networking and connection issues
  * changing storage type
  * reflects production environment


2. Write own storage interface
  * flexible but needs maintenance
  * risk of code coupling


3. Use external dependency
  * less boilerplate
  * typical risks of using external packages
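
To make option 2 concrete, here is a minimal sketch of a hand-written storage interface. All names (`Storage`, `LocalStorage`, `InMemoryStorage`, `process`) are hypothetical and are not part of `data-toolz`; the in-memory backend only stands in for a remote object store so that the example runs without any services.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Storage(ABC):
    """Hypothetical minimal storage interface (not part of data-toolz)."""

    @abstractmethod
    def read_text(self, path: str) -> str: ...

    @abstractmethod
    def write_text(self, path: str, text: str) -> None: ...


class LocalStorage(Storage):
    """Stores objects as files below a root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def read_text(self, path: str) -> str:
        return (self.root / path).read_text(encoding="utf-8")

    def write_text(self, path: str, text: str) -> None:
        target = self.root / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")


class InMemoryStorage(Storage):
    """Stand-in for a remote object store; keeps objects in a dict."""

    def __init__(self) -> None:
        self.objects = {}

    def read_text(self, path: str) -> str:
        return self.objects[path]

    def write_text(self, path: str, text: str) -> None:
        self.objects[path] = text


def process(storage: Storage, source: str, target: str) -> None:
    """Read, process and write using only the shared interface."""
    numbers = [int(item) for item in storage.read_text(source).split()]
    storage.write_text(target, "".join(f"hello {number}\n" for number in numbers))


with tempfile.TemporaryDirectory() as tmp:
    results = []
    for backend in [LocalStorage(tmp), InMemoryStorage()]:
        backend.write_text("example-bucket/data.txt", "1\n2\n3\n")
        process(backend, "example-bucket/data.txt", "example-bucket/processed.txt")
        results.append(backend.read_text("example-bucket/processed.txt"))
```

The processing function never learns which backend it talks to, which is the flexibility this option buys. The cost is that every new operation (listing, deleting, copying) has to be added and tested for each backend.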

---
Let's see how `data-toolz` can help with the above task

In [6]:
from datatoolz.filesystem import FileSystem

fs = FileSystem()  # equivalent to `FileSystem("local")`

with fs.open("example-bucket/data.txt") as file:
    data = [int(item) for item in file]

processed = list((item, f"hello {item} from local datatoolz") for item in data)

with fs.open("example-bucket/processed-local-dt.txt", mode="wt") as file:
    for item in processed:
        size = file.write(f"{item}\n")

!cat example-bucket/processed-local-dt.txt && rm example-bucket/processed-local-dt.txt

(1, 'hello 1 from local datatoolz')
(2, 'hello 2 from local datatoolz')
(3, 'hello 3 from local datatoolz')
(4, 'hello 4 from local datatoolz')
(5, 'hello 5 from local datatoolz')


In [7]:
from datatoolz.filesystem import FileSystem

fs = FileSystem("s3", endpoint_url=ENDPOINT_URL)

with fs.open("example-bucket/data.txt") as file:
    data = [int(item) for item in file]

processed = list((item, f"hello {item} from s3 datatoolz") for item in data)

with fs.open("example-bucket/processed-s3-dt.txt", mode="wt") as file:
    for item in processed:
        size = file.write(f"{item}\n")

!aws s3 --endpoint-url=$ENDPOINT_URL cp s3://example-bucket/processed-s3-dt.txt  - | head
!aws s3 --endpoint-url=$ENDPOINT_URL ls s3://example-bucket
!aws s3 --endpoint-url=$ENDPOINT_URL rm s3://example-bucket/processed-s3-dt.txt

(1, 'hello 1 from s3 datatoolz')
(2, 'hello 2 from s3 datatoolz')
(3, 'hello 3 from s3 datatoolz')
(4, 'hello 4 from s3 datatoolz')
(5, 'hello 5 from s3 datatoolz')
2022-04-04 21:52:15         10 data.txt
2022-04-05 21:32:14        165 processed-s3-dt.txt
delete: s3://example-bucket/processed-s3-dt.txt


`data-toolz` lets you abstract the read and write operations and use the same code for both local development and cloud deployment.

---
## Building blocks

The source code can be found on GitHub: https://github.com/grzegorzme/data-toolz.
The package is a wrapper around [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) and [s3fs](https://s3fs.readthedocs.io/en/latest/)

* `datatoolz.filesystem.FileSystem` - main entrypoint for accessing file system layer based on `fsspec.AbstractFileSystem`


* `datatoolz.io.DataIO` - class for handling I/O operations on datasets. Accessed via two main methods

  * `read(path, ...)`

  * `write(dataframe, path, ...)`
  

* `datatoolz.logging.JsonLogger` - utility class used for structured logging

---

# Installation

`data-toolz` is indexed on [PyPI](https://pypi.org/project/data-toolz/) and the latest version can be installed via `pip`

```bash
pip install data-toolz
```

---
# Feature overview

---

## FileSystem (`datatoolz.filesystem.FileSystem`)

This is the main entry point for accessing the file system layer, based on `fsspec.AbstractFileSystem`. It can be used to perform common file system operations like:
* opening/writing files
* listing/deleting files/folders
* and a few more depending on the underlying implementation


Initialisation:

```python
from datatoolz.filesystem import FileSystem

fs = FileSystem()                                       # simple instance pointing to local file system
fs = FileSystem("local")                                # same as above
fs = FileSystem("s3")                                   # pointer to s3 service
fs = FileSystem("s3", assumed_role="arn:aws:iam::123456789012:role/my-role")  # s3 with custom access role
fs = FileSystem("s3", endpoint_url="https://s3.amazonaws.com")  # custom endpoint url (including scheme) passed to the service client
```

---

### Write and read
* `open` - open file in text/binary read/write mode

In [8]:
from datatoolz.filesystem import FileSystem

fs_name = "local"

fs = FileSystem(fs_name, endpoint_url=ENDPOINT_URL)

with fs.open("example-bucket/example.txt", mode="wt") as file:
    size = file.write(f"Hello {fs.name}!")

with fs.open("example-bucket/example.txt", mode="rt") as file:
    print(file.read())
    
!tree && rm example-bucket/example.txt

Hello local!
.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    ├── data.txt
    └── example.txt

2 directories, 4 files


---
### Exercise 1
How would you write and read a binary object (think `pickle`)?

In [9]:
# write code here
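
One possible approach is sketched below with the built-in `open`, so that it runs without any storage backend. It assumes that `fs.open` accepts the same binary mode strings (`"wb"` and `"rb"`), which is not shown in this notebook; if it does, swapping `open` for `fs.open` should be the only change needed.

```python
import os
import pickle
import tempfile

payload = {"numbers": [1, 2, 3], "label": "example"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl")

    # "wb" opens the file for binary writing
    with open(path, mode="wb") as file:
        pickle.dump(payload, file)

    # "rb" opens the file for binary reading
    with open(path, mode="rb") as file:
        restored = pickle.load(file)
```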


---

### Basic operations

* `ls` - list contents of folder

* `mkdir` - create folder

* `rm` - remove file/folder


In [10]:
from datatoolz.filesystem import FileSystem

fs_name = "s3"
fs = FileSystem(fs_name, endpoint_url=ENDPOINT_URL)

PATH = f"example-bucket/new-directory-{str(uuid4())[:4]}"

!tree

.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    └── data.txt

2 directories, 3 files


---
#### Create a new directory

In [11]:
fs.mkdir(PATH)

!tree

.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    └── data.txt

2 directories, 3 files


---
#### Write a couple of files

In [12]:
for i in range(3):
    with fs.open(f"{PATH}/{i}.txt", mode="wt") as file:
        size = file.write(f"Hello {i}")

!tree

.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    ├── data.txt
    └── new-directory-5d2e
        ├── 0.txt
        ├── 1.txt
        └── 2.txt

3 directories, 6 files


---
#### List contents

In [13]:
fs.ls("example-bucket")

[{'Key': 'example-bucket/data.txt',
  'LastModified': datetime.datetime(2022, 4, 4, 19, 52, 15, 650000, tzinfo=tzutc()),
  'ETag': '"00000000000000000000000000000000-1"',
  'Size': 10,
  'StorageClass': 'STANDARD',
  'Owner': {'DisplayName': 'minio',
   'ID': '02d6176db174dc93cb1b899f7c6078f08654445fe8cf1b6ce98d8855f66bdbf4'},
  'type': 'file',
  'size': 10,
  'name': 'example-bucket/data.txt'},
 {'Key': 'example-bucket/new-directory-5d2e',
  'Size': 0,
  'StorageClass': 'DIRECTORY',
  'type': 'directory',
  'size': 0,
  'name': 'example-bucket/new-directory-5d2e'}]

---
#### Remove file

In [14]:
fs.rm(f"{PATH}/1.txt")

!tree

.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    ├── data.txt
    └── new-directory-5d2e
        ├── 0.txt
        └── 2.txt

3 directories, 5 files


---
#### Remove whole folder

In [15]:
fs.rm(PATH, recursive=True)

!tree

.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    └── data.txt

2 directories, 3 files


---

### More examples

* `exists` - check if object exists

* `copy` / `cp` - copy object between two locations in the file system

* `move` / `mv` - move object between two locations in the file system

* `walk` - walk the directory tree (see standard library method `os.walk`)

* `isdir` - check if path is a directory

* `disk_usage` / `du` - get disk usage of a path


As the `FileSystem` object inherits from its base class, many more methods are available (some may be restricted to the `local` or `s3` type)

In [16]:
fs_name = "s3"
fs = FileSystem(fs_name, endpoint_url=ENDPOINT_URL)

---
#### Check if file exists

In [17]:
for file_name in ["example-bucket/data.txt", "example-bucket/non-existent.txt"]:
    print(f"{file_name} exists: {fs.exists(file_name)}")
    
!tree

example-bucket/data.txt exists: True
example-bucket/non-existent.txt exists: False
.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    └── data.txt

2 directories, 3 files


---
#### Copy file

In [18]:
fs.cp("example-bucket/data.txt", "example-bucket/non-existent.txt")

!tree

.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    ├── data.txt
    └── non-existent.txt

2 directories, 4 files


---
#### Move file

In [19]:
fs.mv("example-bucket/non-existent.txt", "example-bucket/new.txt")

!tree

.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    ├── data.txt
    └── new.txt

2 directories, 4 files


---
#### Cleanup

In [20]:
fs.rm("example-bucket/new.txt")

!tree

.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    └── data.txt

2 directories, 3 files


---
#### Check if path is a directory

In [21]:
for path in ["example-bucket/example.txt", "example-bucket"]:
    print(f"{path} is a directory: {fs.isdir(path)}")

!tree

example-bucket/example.txt is a directory: False
example-bucket is a directory: True
.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    └── data.txt

2 directories, 3 files


---
#### Walk path structure

In [22]:
for r, d, f in fs.walk("example-bucket"):
    print(f"root = {r}\ndirs = {d}\nfiles = {f}\n")
    
!tree

root = example-bucket
dirs = []
files = ['data.txt']

.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    └── data.txt

2 directories, 3 files
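
For comparison, the standard-library `os.walk` yields the same kind of `(root, dirs, files)` triples. The sketch below builds a small throwaway directory tree (the `nested` folder and `more.txt` file are made up for illustration) and walks it top-down.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build example-bucket/data.txt and example-bucket/nested/more.txt
    os.makedirs(os.path.join(tmp, "example-bucket", "nested"))
    for relative in ["example-bucket/data.txt", "example-bucket/nested/more.txt"]:
        with open(os.path.join(tmp, *relative.split("/")), mode="wt") as file:
            file.write("1\n")

    walked = []
    for root, dirs, files in os.walk(os.path.join(tmp, "example-bucket")):
        # Store paths relative to the temporary directory, with "/" separators
        relative_root = os.path.relpath(root, tmp).replace(os.sep, "/")
        walked.append((relative_root, sorted(dirs), sorted(files)))

for root, dirs, files in walked:
    print(f"root = {root}\ndirs = {dirs}\nfiles = {files}\n")
```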


---
#### Check disk usage

In [23]:
print(f"File disk usage: {fs.du('example-bucket/data.txt')}")
print(f"Folder disk usage: {fs.du('example-bucket')}\n")

!du -h example-bucket/data.txt

File disk usage: 10
Folder disk usage: 10

4.0K	example-bucket/data.txt


---
### Exercise 2
1. How would you copy/move files between two `s3` buckets?
2. What if the buckets are accessed via different roles?

In [24]:
# copy file `s3://example-bucket/data.txt` to `s3://another-bucket/new-data.txt`


---
### AWS role-based access

`FileSystem` can access an `s3` bucket via a provided role. This is useful when your current role does not have direct access to a bucket but is allowed to assume a role which does.

In [25]:
# # if run on mocked setup this will fail as access to `sts` service is required
# fs = FileSystem(
#     "s3", 
#     endpoint_url=ENDPOINT_URL,
#     assumed_role="arn:aws:iam::123456789012:role/my-role"
# )

Alternatively, if you need to jump through an "assume chain", that is also possible:

In [26]:
# # if run on mocked setup this will fail as access to `sts` service is required
# fs = FileSystem(
#     "s3", 
#     endpoint_url=ENDPOINT_URL,
#     assumed_role=["arn:aws:iam::123456789012:role/role-1", "arn:aws:iam::123456789012:role/role-2"]
# )

Note that the roles in an assume chain are assumed in succession, so each role in the provided list must be assumable by the previous one.

The assumed credentials are refreshed automatically if your application runs for longer than one hour (the AWS default for the assume-role action).

## DataIO (datatoolz.io.DataIO)
`DataIO` is a wrapper class for reading data files into a `pandas.DataFrame` and writing a `pandas.DataFrame` back to data files.

It exposes two main methods
  * `read(path, ...)`
  
  * `write(dataframe, path, ...)`

Initialisation:
```python
from datatoolz.io import DataIO
from datatoolz.filesystem import FileSystem

dio = DataIO()                     # basic instance pointing to local file system


fs = FileSystem(...)
dio = DataIO(filesystem=fs)        # instance pointing to a predefined file system (local/s3)


def my_partition_transformer(prefix, partitions, values, suffix):
    return "string"
dio = DataIO(
    partition_transformer=my_partition_transformer
)                                  # instance with a custom `partition_transformer` callable

```

---
### Basic writing and reading parquet

* `write(..., filetype="parquet")` - the format is specified via the `filetype` argument; `"parquet"` is the default if `filetype` is omitted

* `read(..., filetype="parquet")` - same as above

In [27]:
df = pd.DataFrame({"col1": [1, 2, 1], "col2": ["a", "b", "c"]})

from datatoolz.io import DataIO
dio = DataIO()

In [28]:
dio.write(df, "my-parquet-file")

df_read = dio.read("my-parquet-file")
df_read

# note the created folder/prefix - default settings do not result in idempotent operations!
!tree && rm -rf my-parquet-file

Unnamed: 0,col1,col2
0,1,a
1,2,b
2,1,c


.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
├── example-bucket
│   └── data.txt
└── my-parquet-file
    └── 1649187138447104000-7b62420d-d595-440b-ab7c-702e4cf920f0

3 directories, 4 files


---
### Advanced writing
* `write(..., suffix=...)` - specify a custom output suffix

* `write(..., partition_by=...)` - specifies output partitioning

In [29]:
df = pd.DataFrame({"col1": [1, 2, 1], "col2": ["a", "b", "c"]})
df

from datatoolz.io import DataIO
dio = DataIO()

Unnamed: 0,col1,col2
0,1,a
1,2,b
2,1,c


---

#### With `suffix=""` the dataframe will be written under `path` as a single file

In [30]:
dio.write(df, path="my-parquet-file", suffix="")
dio.read("my-parquet-file")

!tree && rm -rf my-parquet-file

Unnamed: 0,col1,col2
0,1,a
1,2,b
2,1,c


.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
├── example-bucket
│   └── data.txt
└── my-parquet-file

2 directories, 4 files


---
#### With `suffix="string"` the dataframe will be written under `path/suffix` as a single file

In [31]:
dio.write(df, path="my-parquet-file", suffix="my-file")
dio.read("my-parquet-file")

!tree && rm -rf my-parquet-file

Unnamed: 0,col1,col2
0,1,a
1,2,b
2,1,c


.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
├── example-bucket
│   └── data.txt
└── my-parquet-file
    └── my-file

3 directories, 4 files


---
#### With `suffix=["string1", "string2", ...]` the dataframe will be written in `path` under multiple files listed in `suffix` (uniform split)

In [32]:
dio.write(df, path="my-parquet-file", suffix=["my-file-1", "my-file-2"])
dio.read("my-parquet-file")

# read only one file
dio.read("my-parquet-file/my-file-1")

!tree && rm -rf my-parquet-file

Unnamed: 0,col1,col2
0,1,a
1,2,b
2,1,c


Unnamed: 0,col1,col2
0,1,a
1,2,b


.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
├── example-bucket
│   └── data.txt
└── my-parquet-file
    ├── my-file-1
    └── my-file-2

3 directories, 5 files
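
The row counts above (two rows in `my-file-1`, one in `my-file-2`) are what a near-uniform consecutive split produces. The sketch below reproduces that layout in plain Python; `split_evenly` is a hypothetical helper written for illustration, and how `data-toolz` performs the split internally is not shown in this notebook.

```python
def split_evenly(rows, parts):
    """Split `rows` into `parts` consecutive chunks whose sizes differ by at most one."""
    base, extra = divmod(len(rows), parts)
    chunks, start = [], 0
    for index in range(parts):
        # The first `extra` chunks receive one additional row
        size = base + (1 if index < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


rows = [(1, "a"), (2, "b"), (1, "c")]
suffixes = ["my-file-1", "my-file-2"]

# Map each suffix to the rows that would end up in that file
layout = dict(zip(suffixes, split_evenly(rows, len(suffixes))))
```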


---
#### With `partition_by=["field"]` the output in `path` will be additionally split by partition value

In [33]:
dio.write(df, path="my-parquet-file", partition_by=["col1"])
dio.read("my-parquet-file")

# read a single partition
dio.read("my-parquet-file/col1=1")

!tree && rm -rf my-parquet-file

Unnamed: 0,col1,col2
0,1,a
1,1,c
2,2,b


Unnamed: 0,col1,col2
0,1,a
1,1,c


.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
├── example-bucket
│   └── data.txt
└── my-parquet-file
    ├── col1=1
    │   └── 1649187139107421000-f28157d2-c1c0-4451-8777-37cccfca482e
    └── col1=2
        └── 1649187139108683000-b5adb86a-91b8-4d72-a26f-9f7c2b40177d

5 directories, 5 files


---
#### The output can be partitioned by multiple fields `partition_by=["field1", "field2", ...]`

In [34]:
dio.write(df, path="my-parquet-file", partition_by=["col1", "col2"])
dio.read("my-parquet-file")

!tree && rm -rf my-parquet-file

Unnamed: 0,col1,col2
0,1,a
1,1,c
2,2,b


.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
├── example-bucket
│   └── data.txt
└── my-parquet-file
    ├── col1=1
    │   ├── col2=a
    │   │   └── 1649187139279990000-9bf19b58-f19f-4285-935d-17452d127ba1
    │   └── col2=c
    │       └── 1649187139281053000-72a94260-9bf6-4823-ad90-592ee927c97d
    └── col1=2
        └── col2=b
            └── 1649187139281873000-9658b50b-693c-440f-9a18-9fa2b4cc3c94

8 directories, 6 files


---
#### You can override the default `partition_transformer`, which changes how the output path is built, e.g.
* default: `my-parquet-file/col1=1/1649112618853172000-25105f49-24dd-443b-b6bf-a5ca8f7a18d9`
* custom: `my-parquet-file/1/fixed-name`
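
To make the four arguments concrete, here is a sketch of a transformer that builds `name=value` paths like the default layout shown above. It assumes that `partitions` holds the partition column names and `values` the matching values for one partition; `hive_style_transformer` is a hypothetical name and this is not the package's own implementation.

```python
def hive_style_transformer(prefix, partitions, values, suffix):
    """Build `prefix/name1=value1/name2=value2/suffix` from the partition info."""
    pairs = [f"{name}={value}" for name, value in zip(partitions, values)]
    return "/".join([prefix, *pairs, suffix])


path = hive_style_transformer("my-parquet-file", ["col1", "col2"], [1, "a"], "part-0")
print(path)
```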

In [35]:
def custom_partition_transformer(prefix, partitions, values, suffix):
    partition_part = "/".join(map(str, values))
    return f"{prefix}/{partition_part}/fixed-name"

dio = DataIO(partition_transformer=custom_partition_transformer)

dio.write(df, path="my-parquet-file", partition_by=["col1"])
dio.read("my-parquet-file")

!tree && rm -rf my-parquet-file

Unnamed: 0,col1,col2
0,1,a
1,1,c
2,2,b


.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
├── example-bucket
│   └── data.txt
└── my-parquet-file
    ├── 1
    │   └── fixed-name
    └── 2
        └── fixed-name

5 directories, 5 files


---
#### Additionally the partition columns can be dropped from output to reduce redundancy

__NOTE__: this process is NOT REVERSIBLE by default!!!

In [36]:
dio = DataIO()

dio.write(df, path="my-parquet-file", partition_by=["col1"], drop_partitions=True)
dio.read("my-parquet-file/col1=1")

!tree && rm -rf my-parquet-file

Unnamed: 0,col2
0,a
1,c


.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
├── example-bucket
│   └── data.txt
└── my-parquet-file
    ├── col1=1
    │   └── 1649187139609740000-4be580c0-82c4-41bb-87ea-4e6f30cb99f2
    └── col1=2
        └── 1649187139611098000-abfd8ceb-650e-45e9-8848-55fe1a7a306b

5 directories, 5 files


---
### Other file types

Besides `parquet`, the following types are handled:

* `write(..., filetype="jsonlines")`

* `write(..., filetype="dsv")`

* `write(..., filetype="dsv", **pandas_kwargs)`


Both types support compression via `gzip=True`.

`pandas_kwargs` is passed to `pandas.DataFrame.to_csv` method.

In [37]:
df = pd.DataFrame({"col1": [1, 2, 1], "col2": ["a", "b", "c"]})
df

from datatoolz.io import DataIO
dio = DataIO()

Unnamed: 0,col1,col2
0,1,a
1,2,b
2,1,c


---
#### Write gzip-ed jsonlines

In [38]:
dio.write(df, path="my-data.json.gz", filetype="jsonlines", gzip=True, suffix="")
dio.read("my-data.json.gz", filetype="jsonlines", gzip=True)

!tree && gunzip my-data.json.gz && cat my-data.json && rm my-data.json

Unnamed: 0,col1,col2
0,1,a
1,2,b
2,1,c


.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
├── example-bucket
│   └── data.txt
└── my-data.json.gz

2 directories, 4 files
{"col1":1,"col2":"a"}
{"col1":2,"col2":"b"}
{"col1":1,"col2":"c"}
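
For readers unfamiliar with the format: a gzip-ed jsonlines file is simply a gzip-compressed text file with one JSON object per line. The sketch below writes and reads such a file with the standard library only, independent of `DataIO`.

```python
import gzip
import json
import os
import tempfile

records = [{"col1": 1, "col2": "a"}, {"col1": 2, "col2": "b"}, {"col1": 1, "col2": "c"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my-data.json.gz")

    # Write one JSON document per line into a gzip-compressed text stream
    with gzip.open(path, mode="wt", encoding="utf-8") as file:
        for record in records:
            file.write(json.dumps(record) + "\n")

    # Read it back line by line
    with gzip.open(path, mode="rt", encoding="utf-8") as file:
        restored = [json.loads(line) for line in file]
```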


---
#### Write tab-separated (default) file

In [39]:
dio.write(df, path="my-data.txt", filetype="dsv", suffix="")
dio.read("my-data.txt", filetype="dsv")

!tree && cat my-data.txt && rm my-data.txt

Unnamed: 0,col1,col2
0,1,a
1,2,b
2,1,c


.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
├── example-bucket
│   └── data.txt
└── my-data.txt

2 directories, 4 files
col1	col2
1	a
2	b
1	c


---
#### Write |-separated file without header

In [40]:
dio.write(df, path="my-data.txt", filetype="dsv", sep="|", header=None, suffix="")
dio.read("my-data.txt", filetype="dsv", sep="|", header=None)

!tree && cat my-data.txt && rm my-data.txt

Unnamed: 0,0,1
0,1,a
1,2,b
2,1,c


.
├── another-bucket
├── data-toolz-demo.ipynb
├── docker-compose.yaml
├── example-bucket
│   └── data.txt
└── my-data.txt

2 directories, 4 files
1|a
2|b
1|c


### Exercise 3
You have a large dataset which doesn't fit into memory.
The dataset was stored in multiple small partitions.

Your goal is to aggregate it (compute `max` of `field`)

In [41]:
df = pd.DataFrame({"field": np.random.random(1000)})

dio = DataIO()
dio.write(df, "big-data", suffix=(f"file_{i}" for i in range(10)))

!tree

# write code here


!rm -rf big-data

.
├── another-bucket
├── big-data
│   ├── file_0
│   ├── file_1
│   ├── file_2
│   ├── file_3
│   ├── file_4
│   ├── file_5
│   ├── file_6
│   ├── file_7
│   ├── file_8
│   └── file_9
├── data-toolz-demo.ipynb
├── docker-compose.yaml
└── example-bucket
    └── data.txt

3 directories, 13 files
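
The idea behind one possible solution is to keep only a running maximum and load a single partition at a time. The sketch below shows that pattern with plain text files as a stand-in for the parquet partitions, so it runs with the standard library alone; the file layout mirrors the `file_0` ... `file_9` names above.

```python
import os
import random
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Write ten small partitions, one number per line (stand-in for the parquet files)
    values = [random.random() for _ in range(1000)]
    for i in range(10):
        with open(os.path.join(tmp, f"file_{i}"), mode="wt") as file:
            file.writelines(f"{value!r}\n" for value in values[i * 100:(i + 1) * 100])

    # Only one partition is held in memory at any time
    running_max = float("-inf")
    for name in sorted(os.listdir(tmp)):
        with open(os.path.join(tmp, name)) as file:
            partition = [float(line) for line in file]
        running_max = max(running_max, max(partition))
```

With `DataIO`, the same loop would presumably read each `big-data/file_{i}` on its own via `dio.read` (as done for a single file in the `suffix` example earlier) and update the running maximum with `df["field"].max()`.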


---
## JsonLogger (datatoolz.logging.JsonLogger)

This is a simple structured-logging class. You can use it to publish application logs as JSON encoded strings.

Initialisation
```python
from datatoolz.logging import JsonLogger

logger = JsonLogger(name="my-app", env="dev")
```

__Disclaimer__: this module is currently not being developed further and may be removed altogether in the future, as there are better, more specialized packages with the same functionality.
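
If you want structured logs without this module, the same idea can be sketched with the standard library alone. `JsonFormatter` and the `extra_fields` attribute below are hypothetical names chosen for this example; the output format only loosely resembles what `JsonLogger` emits.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON-encoded line."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "timestamp": self.formatTime(record),
            "message": record.getMessage(),
        }
        # `extra_fields` is set via the `extra=` argument of the logging call
        extra = getattr(record, "extra_fields", None)
        if extra:
            payload["extra"] = extra
        return json.dumps(payload)


stream = io.StringIO()  # capture output here; use sys.stdout in a real application
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("my-app-sketch")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

log.info("Wubba lubba dub dub", extra={"extra_fields": {"purpose": "pass butter"}})
entry = json.loads(stream.getvalue())
```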

In [42]:
from datatoolz.logging import JsonLogger

logger = JsonLogger(name="my-app", env="dev")

---
#### Log a simple message

In [43]:
logger.info("Wubba lubba dub dub")

{"logger": {"application": "my-app", "environment": "dev"}, "level": "info", "timestamp": "2022-04-05 19:32:20.497457", "message": "Wubba lubba dub dub"}


---
#### Log `extra` arguments

In [44]:
logger.info("Wubba lubba dub dub", purpose="pass butter")

{"logger": {"application": "my-app", "environment": "dev"}, "level": "info", "timestamp": "2022-04-05 19:32:20.504213", "message": "Wubba lubba dub dub", "extra": {"purpose": "pass butter"}}


In [45]:
logger.info("Wubba lubba dub dub", object={"name": "Jerry"})

{"logger": {"application": "my-app", "environment": "dev"}, "level": "info", "timestamp": "2022-04-05 19:32:20.510440", "message": "Wubba lubba dub dub", "extra": {"object": {"name": "Jerry"}}}


---
#### Use as decorator

In [46]:
@logger.decorate("Computing the sum")
def my_sum(x, y):
    return x + y

my_sum(1, 2)

{"logger": {"application": "my-app", "environment": "dev"}, "level": "info", "timestamp": "2022-04-05 19:32:20.516421", "message": "Computing the sum", "extra": {"function": "my_sum", "memory": {"current": 0, "peak": 0}, "duration": 4.4800000011946395e-06}}


3

---
#### Provide callback `extra` attributes

In [47]:
@logger.decorate(
    "Computing the sum", 
    python=sys.executable, 
    is_even=lambda x: x % 2 == 0, 
    is_negative=lambda x: x < 0
)
def my_sum(x, y):
    return x + y

my_sum(3, -7)

{"logger": {"application": "my-app", "environment": "dev"}, "level": "info", "timestamp": "2022-04-05 19:32:20.529520", "message": "Computing the sum", "extra": {"function": "my_sum", "memory": {"current": 29814, "peak": 86938}, "duration": 9.930000004487738e-07, "python": "/Library/Frameworks/Python.framework/Versions/3.8/bin/python3.8", "is_even": true, "is_negative": true}}


-4
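
Judging by the output above, callable `extra` attributes appear to be applied to the function's return value (`-4` is even and negative), while other values are logged as they are. The sketch below imitates that behaviour with the standard library; it is an illustration of the pattern, not the package's implementation, and `decorate` and `entries` are names made up for this example.

```python
import functools
import json
import time
import tracemalloc

entries = []  # collected log entries, kept for inspection


def decorate(message, **extras):
    """Log `message` with timing, memory and extra attributes after each call."""

    def wrapper(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            tracemalloc.start()
            started = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            finally:
                duration = time.perf_counter() - started
                current, peak = tracemalloc.get_traced_memory()
                tracemalloc.stop()
            extra = {
                "function": func.__name__,
                "memory": {"current": current, "peak": peak},
                "duration": duration,
            }
            # Callables are evaluated on the result; other values are logged as-is
            for key, value in extras.items():
                extra[key] = value(result) if callable(value) else value
            entries.append({"message": message, "extra": extra})
            print(json.dumps(entries[-1]))
            return result

        return inner

    return wrapper


@decorate("Computing the sum", is_even=lambda x: x % 2 == 0, is_negative=lambda x: x < 0)
def my_sum(x, y):
    return x + y


total = my_sum(3, -7)
```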

---
# Notes

1. `s3` (or any other cloud storage) is not a file system!
  * a file system wrapper is convenient but could result in a sub-optimal implementation


2. There are no "pre-checks" - the package will try to execute any requested command e.g. "list all objects in a bucket"


3. Copy between buckets - possible but not always optimal (see point 1.)

---
### If you like these features and would like to see more or if you find a bug
* __contributors are welcome__
* drop by https://github.com/grzegorzme/data-toolz and create a new issue