In [1]:
from tape import Ensemble, ColumnMapper
import dask.dataframe as dd
import dask

# Testing Dask Divisions
This notebook builds several example dataframes that represent possible TAPE datasets and examines how Dask's divisions behave for each.

In [2]:
dask.__version__

'2023.10.0'

## Case 1: Unique Integer Indices

Case 1 Findings:
* No unexpected issues with a unique integer index
* Divisions being known in one dataframe but not the other is problematic: Dask does not fall back to ignoring divisions and instead raises an error about the missing divisions in the other dataframe
* Having different divisions between frames is fine, but the number of partitions in the result of any operation will increase. For TAPE, this means that operations between Object and Source may inflate Object's number of partitions if Source is not repartitioned
* The `sort` flag sorts the data and populates divisions. The `sorted` flag populates divisions from the per-partition min and max values, and fails if the dataset is not actually sorted.

In [4]:
# Dataframe setup
rows = {
        "id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df1 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=True, sort=False)
print(df1.divisions)
df1

(1, 4, 7, 9)


Dask DataFrame Structure:
                  time     flux
npartitions=3                  
1              float64  float64
4                  ...      ...
7                  ...      ...
9                  ...      ...


In [5]:
# Verify Access

print(df1.loc[2]) # Retrieve a row
print(df1.loc[2:4]) #Retrieve a range of rows across partition boundaries

Dask DataFrame Structure:
                  time     flux
npartitions=1                  
2              float64  float64
2                  ...      ...
Dask Name: loc, 4 graph layers
Dask DataFrame Structure:
                  time     flux
npartitions=2                  
2              float64  float64
4                  ...      ...
4                  ...      ...
Dask Name: loc, 4 graph layers
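
The partition counts in the two `loc` results above (`npartitions=1` and `npartitions=2`) are consistent with a lookup that uses the divisions to decide which partitions can contain a key. The following is a minimal pure-Python sketch of that idea, not Dask's actual code; `partition_of` and `partitions_for_slice` are hypothetical helper names introduced for illustration.

```python
from bisect import bisect_right


def partition_of(divisions, key):
    """Return the index of the partition expected to hold `key`."""
    n_partitions = len(divisions) - 1
    i = bisect_right(divisions, key) - 1
    # Clamp so a key equal to the last division lands in the final partition.
    return min(n_partitions - 1, max(0, i))


def partitions_for_slice(divisions, start, stop):
    """Return the partition indices touched by the inclusive range start..stop."""
    first = partition_of(divisions, start)
    last = partition_of(divisions, stop)
    return list(range(first, last + 1))


divisions = (1, 4, 7, 9)  # divisions of df1 above
print(partition_of(divisions, 2))             # 0
print(partitions_for_slice(divisions, 2, 4))  # [0, 1]
```

Under this assumption a single key is only ever looked up in one partition, which is why correct divisions matter for `loc`.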


In [6]:
# Assign a new column from a series with equal partitions

new_row = {
        "id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "err": [1.0, 2.0, 1.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        }

err = dd.from_dict(new_row, npartitions=3).set_index("id", sorted=True, sort=False)["err"]

print(err.divisions)
df1.assign(err=err)

(1, 4, 7, 9)


Dask DataFrame Structure:
                  time     flux      err
npartitions=3                           
1              float64  float64  float64
4                  ...      ...      ...
7                  ...      ...      ...
9                  ...      ...      ...


The following cell errors, which is expected. Although the series and dataframe could be aligned row by row, Dask cannot align their partitions without divisions information for the series.

In [7]:
# Assign a new column from a series with unknown divisions
# Errors as expected

new_row = {
        "id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "err": [1.0, 2.0, 1.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        }

err = dd.from_dict(new_row, npartitions=3).set_index("id", sorted=False, sort=False)["err"]

print(err.divisions)
df1.assign(err=err)

(None, None, None, None)


ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

### Case 1.1: Operation with unknown divisions

When divisions are unknown in both frames, the operation succeeds, and the result has no divisions set.

In [8]:
# Dataframe setup
rows = {
        "id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df1_1 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=False, sort=False)
print(df1_1.divisions)

# Assign a new column from a series with unknown divisions
# Succeeds because divisions are unknown in both frames

new_row = {
        "id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "err": [1.0, 2.0, 1.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        }

err = dd.from_dict(new_row, npartitions=3).set_index("id", sorted=False, sort=False)["err"]

print(err.divisions)
df1_1.assign(err=err)

(None, None, None, None)
(None, None, None, None)


Dask DataFrame Structure:
                  time     flux      err
npartitions=3                           
               float64  float64  float64
                   ...      ...      ...
                   ...      ...      ...
                   ...      ...      ...
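
The outcomes seen so far, together with the unequal-divisions case in Case 1.3, can be summarized as a small decision rule. This is a sketch of the behavior observed in this notebook, not a description of Dask's internals; `alignment_outcome` and its return labels are hypothetical names.

```python
def alignment_outcome(left_divisions, right_divisions):
    """Classify what happens when two frames are combined, based on divisions."""
    if left_divisions == right_divisions:
        # Identical divisions (all known, or all unknown with equal length):
        # partitions are paired up one-to-one.
        return "partition-wise"
    if None in left_divisions or None in right_divisions:
        # Divisions differ and at least one side is unknown: cannot align.
        return "error"
    # Both known but different: partitions are realigned.
    return "realign"


known = (1, 4, 7, 9)
unknown = (None, None, None, None)
print(alignment_outcome(known, known))         # partition-wise
print(alignment_outcome(known, unknown))       # error
print(alignment_outcome(unknown, unknown))     # partition-wise
print(alignment_outcome((1, 3, 6, 9), known))  # realign
```

Note that the partition-wise path with unknown divisions relies on the rows of both frames lining up by partition; the sketch does not verify that.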


### Case 1.2: Sorting Instead of Sorted

The `sorted` flag tells Dask that the dataset is already sorted, allowing it to compute divisions from the minimum and maximum values of each partition. Alternatively, we can set `sort` to have Dask sort the dataset itself. This sorts the index and computes divisions, and here it finds slightly different boundaries as a result.

In [9]:
# Dataframe setup
rows = {
        "id": [1, 9, 3, 4, 5, 6, 7, 8, 2], #Ids have been swapped
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df1_2 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=False, sort=True)
print(df1_2.divisions)

# Assign a new column from a series that was also sorted by Dask
# Succeeds because both frames have the same known divisions

new_row = {
        "id": [1, 9, 3, 4, 5, 6, 7, 8, 2],
        "err": [1.0, 2.0, 1.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        }

err = dd.from_dict(new_row, npartitions=3).set_index("id", sorted=False, sort=True)["err"]

print(err.divisions)
res = df1_2.assign(err=err)
res

(1, 3, 6, 9)
(1, 3, 6, 9)


Dask DataFrame Structure:
                  time     flux      err
npartitions=3                           
1              float64  float64  float64
3                  ...      ...      ...
6                  ...      ...      ...
9                  ...      ...      ...


### Case 1.3: Having Unequal Divisions

Divisions will often differ between two dataframes. Operations across them still work, but the result can have more partitions than either input.

In [10]:
# Dataframe setup
rows = {
        "id": [1, 9, 3, 4, 5, 6, 7, 8, 2], #Ids have been swapped
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df1_3 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=False, sort=True)
print(df1_3.divisions)

# Assign a new column from a series with different known divisions
# Succeeds, but the result has more partitions

new_row = {
        "id": [1, 2, 3, 4, 5, 6, 7, 8, 9], #Ids have not been swapped
        "err": [1.0, 2.0, 1.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        }

err = dd.from_dict(new_row, npartitions=3).set_index("id", sorted=True, sort=False)["err"]

print(err.divisions)
res = df1_3.assign(err=err)
print(res.divisions) # Increased number of partitions
res

(1, 3, 6, 9)
(1, 4, 7, 9)
(1, 3, 4, 6, 7, 9)


Dask DataFrame Structure:
                  time     flux      err
npartitions=5                           
1              float64  float64  float64
3                  ...      ...      ...
...                ...      ...      ...
7                  ...      ...      ...
9                  ...      ...      ...
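
The jump from 3 to 5 partitions above is what one would expect if the aligned divisions are the sorted union of both sets of boundaries. A minimal pure-Python sketch under that assumption (`aligned_divisions` is a hypothetical helper, not a Dask function):

```python
def aligned_divisions(*division_sets):
    """Merge several division tuples into one sorted tuple of unique boundaries."""
    return tuple(sorted(set().union(*division_sets)))


merged = aligned_divisions((1, 3, 6, 9), (1, 4, 7, 9))
print(merged)           # (1, 3, 4, 6, 7, 9)
print(len(merged) - 1)  # 5 partitions
```

In the worst case every boundary is distinct, so the result approaches the sum of the two partition counts. That is the growth to watch for when Object and Source have unrelated divisions.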


### Case 1.4: Can we game `sorted`?

When `sorted` is set, Dask calculates divisions from the min and max within each partition. Can this produce unintended behavior? The example below suggests it raises an error instead. The state of the `sort` flag does not appear to affect this.

In [11]:
# Dataframe setup
rows = {
        "id": [1, 9, 3, 4, 5, 6, 7, 8, 2], #Ids have been swapped
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df1_4 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=True, sort=True)
print(df1_4.divisions)

ValueError: ('Partitions are not sorted ascending by id', 'In your dataset the (min, max, len) values of id for each partition are : [(1, 9, 3), (4, 6, 3), (2, 8, 3)]')
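
The error message reports only per-partition `(min, max, len)` values, which suggests the check works from those summaries. The sketch below reproduces the reported tuples and applies one plausible rule: the minimums and the maximums must each be in ascending order. This is an assumption inferred from the message, not a statement about Dask's implementation; the helper names and the even chunking are also assumptions.

```python
def partition_stats(values, npartitions):
    """Split `values` into equal consecutive chunks and summarize each one."""
    size = -(-len(values) // npartitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    return [(min(c), max(c), len(c)) for c in chunks]


def passes_min_max_check(stats):
    """True if per-partition minimums and maximums are each ascending."""
    mins = [s[0] for s in stats]
    maxes = [s[1] for s in stats]
    return mins == sorted(mins) and maxes == sorted(maxes)


stats = partition_stats([1, 9, 3, 4, 5, 6, 7, 8, 2], npartitions=3)
print(stats)                        # [(1, 9, 3), (4, 6, 3), (2, 8, 3)]
print(passes_min_max_check(stats))  # False
```

A rule like this cannot see the order of values inside a partition. Applied to the ids used in Case 2.3 (`[1, 2, 3, 2, 1, 4, 3, 2, 4]`), it returns `True`, matching the lack of an error there.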

## Case 2: Non-unique Integer Indices

The most relevant TAPE case.

Findings:
* Surprisingly, Dask appears to make sure that duplicate IDs are co-partitioned. Both when sorting and when the data is already sorted, it takes the index values into account when drawing partition boundaries
* There's a bug/issue where `sorted=True` does not fail when non-unique integer indices are not sorted, unlike the failure seen in Case 1.4. It will even draw nonsensical division bounds. We should be careful about users setting `sorted=True` on datasets that aren't actually sorted; this is perhaps an argument for checking sortedness within TAPE first and erroring out.
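
A strict sortedness check of the kind suggested in the last finding could look like the following. It is a pure-Python sketch over lists of index values, one list per partition; `is_globally_sorted` is a hypothetical helper and not part of TAPE or Dask.

```python
def is_globally_sorted(partitions):
    """True if index values never decrease within or across partitions."""
    previous = None
    for part in partitions:
        for value in part:
            if previous is not None and value < previous:
                return False
            previous = value
    return True


# Ids from Case 2.3, chunked into three partitions: not sorted.
print(is_globally_sorted([[1, 2, 3], [2, 1, 4], [3, 2, 4]]))  # False
# A sorted, non-unique index: passes.
print(is_globally_sorted([[1, 1, 2], [2, 2, 2], [3, 3, 4]]))  # True
```

On a real Dask dataframe the same idea would presumably be split into a per-partition monotonicity test (pandas exposes `Index.is_monotonic_increasing`) and a comparison of adjacent partition boundaries, at the cost of one extra pass over the index.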

### Case 2.1: No spillage of Ids across partitions

This is interesting: Dask appears to assign partitions in a way that prevents ID spillage, even to the extent of leaving one partition empty.

In [13]:
# Dataframe setup
rows = {
        "id": [1, 1, 1, 2, 2, 3, 3, 4, 4],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df2_1 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=True, sort=False)
print(df2_1.divisions)
for i in range(3):
    print(df2_1.partitions[i].compute())

(1, 2, 3, 4)
    time  flux
id            
1   10.1   1.0
1   10.2   2.0
1   10.2   5.0
    time  flux
id            
2   11.1   3.0
2   11.2   1.0
    time  flux
id            
3   11.3   2.0
3   11.4   3.0
4   15.0   4.0
4   15.1   5.0


In [14]:
print(df2_1.loc[2].compute()) # Retrieve a row
print(df2_1.loc[2:4].compute()) #Retrieve a range of rows across partition boundaries

    time  flux
id            
2   11.1   3.0
2   11.2   1.0
    time  flux
id            
2   11.1   3.0
2   11.2   1.0
3   11.3   2.0
3   11.4   3.0
4   15.0   4.0
4   15.1   5.0


In [16]:
# Dask really tries to avoid spillage of IDs, this creates an empty first partition
rows = {
        "id": [1, 1, 1, 1, 2, 2, 3, 4, 4],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df2_1 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=True, sort=False)
print(df2_1.divisions)
for i in range(3):
    print(df2_1.partitions[i].compute())

(1, 1, 3, 4)
Empty DataFrame
Columns: [time, flux]
Index: []
    time  flux
id            
1   10.1   1.0
1   10.2   2.0
1   10.2   5.0
1   11.1   3.0
2   11.2   1.0
2   11.3   2.0
    time  flux
id            
3   11.4   3.0
4   15.0   4.0
4   15.1   5.0
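
Both outputs in this case are consistent with a simple rule: take each partition's minimum as a division (plus the final maximum), then move any rows at the end of a partition that equal the next partition's minimum into that next partition. The sketch below is inferred from the outputs above and is not taken from Dask's source; `divisions_and_partitions` is a hypothetical name, and it assumes sorted, non-empty input partitions.

```python
def divisions_and_partitions(partitions):
    """Derive divisions and move boundary duplicates into the later partition."""
    parts = [list(p) for p in partitions]
    mins = [min(p) for p in parts]
    divisions = tuple(mins) + (max(parts[-1]),)
    for i in range(1, len(parts)):
        boundary = mins[i]
        # Rows equal to the next partition's minimum are shifted forward.
        moved = [v for v in parts[i - 1] if v == boundary]
        parts[i - 1] = [v for v in parts[i - 1] if v != boundary]
        parts[i] = moved + parts[i]
    return divisions, parts


print(divisions_and_partitions([[1, 1, 1], [2, 2, 3], [3, 4, 4]]))
# ((1, 2, 3, 4), [[1, 1, 1], [2, 2], [3, 3, 4, 4]])
print(divisions_and_partitions([[1, 1, 1], [1, 2, 2], [3, 4, 4]]))
# ((1, 1, 3, 4), [[], [1, 1, 1, 1, 2, 2], [3, 4, 4]])
```

The second call reproduces the empty first partition and the repeated division value `(1, 1, 3, 4)` seen above.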


### Case 2.2: Does `sort` do anything surprising?

Seems to sort as expected and avoids spillage of IDs.

In [18]:
# Dataframe setup
rows = {
        "id": [1, 2, 3, 2, 1, 4, 3, 2, 4],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df2_2= dd.from_dict(rows, npartitions=3).set_index("id", sorted=False, sort=True)
print(df2_2.divisions)
for i in range(3):
    print(df2_2.partitions[i].compute())

(1, 2, 3, 4)
    time  flux
id            
1   11.2   1.0
1   10.1   1.0
    time  flux
id            
2   10.2   2.0
2   15.0   4.0
2   11.1   3.0
    time  flux
id            
3   11.4   3.0
3   10.2   5.0
4   11.3   2.0
4   15.1   5.0


### Case 2.3: A weird failure case?

When `sorted` is True, this produces divisions but does not actually sort the data to match them. This causes downstream operations like `loc` to return incomplete results. The problem appears only when the dataset is not actually sorted.

In [23]:
# Dataframe setup
rows = {
        "id": [1, 2, 3, 2, 1, 4, 3, 2, 4],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df2_3 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=True, sort=False)
print(df2_3.divisions)
for i in range(3):
    print(df2_3.partitions[i].compute())

(1, 1, 2, 4)
    time  flux
id            
1   10.1   1.0
2   10.2   2.0
3   10.2   5.0
    time  flux
id            
2   11.1   3.0
1   11.2   1.0
4   11.3   2.0
    time  flux
id            
3   11.4   3.0
2   15.0   4.0
4   15.1   5.0


In [20]:
print(df2_3.loc[2].compute()) # Retrieve a row
print(df2_3.loc[2:4].compute()) #Retrieve a range of rows across partition boundaries

    time  flux
id            
2   15.0   4.0
    time  flux
id            
2   15.0   4.0
4   15.1   5.0


In [22]:
# Dataframe setup
rows = {
        "id": [1, 1, 2, 2, 2, 2, 3, 3, 4],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df2_3 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=True, sort=True)
print(df2_3.divisions)
for i in range(3):
    print(df2_3.partitions[i].compute())

(1, 2, 3, 4)
    time  flux
id            
1   10.1   1.0
1   10.2   2.0
    time  flux
id            
2   10.2   5.0
2   11.1   3.0
2   11.2   1.0
2   11.3   2.0
    time  flux
id            
3   11.4   3.0
3   15.0   4.0
4   15.1   5.0


## Case 3: Non-Integer Indices

### Case 3.1: Unique Non-Integer Indices

Actually seems to work well with strings! It's able to sort them and generate divisions.

In [24]:
# Dataframe setup
rows = {
        "id": ["a", "b1", "b2", "d", "e", "r", "g", "h", "i"],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df3_1 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=False, sort=True)
print(df3_1.divisions)
for i in range(3):
    print(df3_1.partitions[i].compute())

('a', 'b2', 'g', 'r')
    time  flux
id            
a   10.1   1.0
b1  10.2   2.0
    time  flux
id            
b2  10.2   5.0
d   11.1   3.0
e   11.2   1.0
    time  flux
id            
g   11.4   3.0
h   15.0   4.0
i   15.1   5.0
r   11.3   2.0


In [25]:
print(df3_1.loc["a"].compute()) # Retrieve a row
print(df3_1.loc["b":"d"].compute()) #Retrieve a range of rows across partition boundaries

    time  flux
id            
a   10.1   1.0
    time  flux
id            
b1  10.2   2.0
b2  10.2   5.0
d   11.1   3.0


### Case 3.2: Non-Unique Non-Integer Indices

Again, this appears to work well. It even appropriately errors out when attempting to use `sorted` with a non-sorted index.

In [26]:
# Works well when sorted
rows = {
        "id": ["a", "a", "b1", "b1", "b2", "b2", "g", "g", "g"],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df3_2 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=True, sort=False)
print(df3_2.divisions)
for i in range(3):
    print(df3_2.partitions[i].compute())

('a', 'b1', 'g', 'g')
    time  flux
id            
a   10.1   1.0
a   10.2   2.0
    time  flux
id            
b1  10.2   5.0
b1  11.1   3.0
b2  11.2   1.0
b2  11.3   2.0
    time  flux
id            
g   11.4   3.0
g   15.0   4.0
g   15.1   5.0


In [27]:
# Unsorted input with sort=True: Dask sorts the index and computes divisions
rows = {
        "id": ["a", "a", "b1", "b1", "b2", "b2", "a", "a", "g"],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df3_2 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=False, sort=True)
print(df3_2.divisions)
for i in range(3):
    print(df3_2.partitions[i].compute())

('a', 'b1', 'b2', 'g')
    time  flux
id            
a   11.4   3.0
a   15.0   4.0
a   10.1   1.0
a   10.2   2.0
    time  flux
id            
b1  10.2   5.0
b1  11.1   3.0
    time  flux
id            
b2  11.2   1.0
b2  11.3   2.0
g   15.1   5.0


In [28]:
# Correctly produces the sorted on unsorted error
rows = {
        "id": ["a", "a", "b1", "b1", "b2", "b2", "a", "a", "g"],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df3_2 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=True, sort=False)
print(df3_2.divisions)
for i in range(3):
    print(df3_2.partitions[i].compute())

ValueError: ('Partitions are not sorted ascending by id', "In your dataset the (min, max, len) values of id for each partition are : [('a', 'b1', 3), ('b1', 'b2', 3), ('a', 'g', 3)]")

## Case 4: Non-unique indices interaction with parquet row-groups

Dask reads parquet partitions more directly. Do divisions maintain the same flexibility when populated from parquet files?

Findings:
* Parquet dictates the partitioning, to the point where a non-unique index can spill across partitions. Using `calculate_divisions` within the `read_parquet` call seems unwise given this.
* This case is the first instance of actual spillage, and Dask doesn't seem to handle it well, given that `loc` missed some occurrences of a given ID.
* By having Dask not retrieve the index from parquet and instead setting it afterwards, we seem to get the flexible behavior found above.

In [29]:
# Dataframe setup
rows = {
        "id": [1, 1, 1, 2, 2, 3, 3, 4, 4],
        "time": [10.1, 10.2, 10.2, 11.1, 11.2, 11.3, 11.4, 15.0, 15.1],
        "flux": [1.0, 2.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        }

df4_1 = dd.from_dict(rows, npartitions=3).set_index("id", sorted=False, sort=False)
print(df4_1.divisions)

df4_1.to_parquet("divdata/")

(None, None, None, None)


### Case 4.1: Reading with read_parquet's `calculate_divisions`

This creates ID spillage. Operations like loc don't look in both partitions, so Dask seems unsuited to natively handle these spills.

In [30]:
df4_1 = dd.read_parquet("divdata/*.parquet", index="id", calculate_divisions=True)

print(df4_1.divisions)
for i in range(3):
    print(df4_1.partitions[i].compute())

(1, 2, 3, 4)
    time  flux
id            
1   10.1   1.0
1   10.2   2.0
1   10.2   5.0
    time  flux
id            
2   11.1   3.0
2   11.2   1.0
3   11.3   2.0
    time  flux
id            
3   11.4   3.0
4   15.0   4.0
4   15.1   5.0


In [31]:
print(df4_1.loc[3].compute()) # Retrieve a row
print(df4_1.loc[2:4].compute()) #Retrieve a range of rows across partition boundaries

    time  flux
id            
3   11.4   3.0
    time  flux
id            
2   11.1   3.0
2   11.2   1.0
3   11.3   2.0
3   11.4   3.0
4   15.0   4.0
4   15.1   5.0
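
Spillage like the id 3 case above can be detected by recording which partitions each id appears in. A minimal pure-Python sketch follows, with each partition represented as a list of index values; `spilled_ids` is a hypothetical helper introduced for illustration.

```python
from collections import defaultdict


def spilled_ids(partitions):
    """Map each id found in more than one partition to its partition numbers."""
    seen = defaultdict(set)
    for number, ids in enumerate(partitions):
        for value in ids:
            seen[value].add(number)
    return {value: sorted(parts) for value, parts in seen.items() if len(parts) > 1}


# Partition contents as read back with calculate_divisions=True.
print(spilled_ids([[1, 1, 1], [2, 2, 3], [3, 4, 4]]))  # {3: [1, 2]}
# Partition contents from Case 2.1, where nothing spills.
print(spilled_ids([[1, 1, 1], [2, 2], [3, 3, 4, 4]]))  # {}
```

A non-empty result means a single-partition lookup such as `loc[3]` can miss rows.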


### Case 4.2: Can we get around this?

This is possibly just a bug in `read_parquet`. By setting the index after the read, we recover the behavior seen when loading from dictionaries. The second cell achieves this by setting `index=False`. This seems more elegant than a `reset_index` call, but it is unclear whether it carries any less of a performance penalty.

In [32]:
# If the index is present in the parquet files, set_index will be a no-op

df4_2 = dd.read_parquet("divdata/*.parquet").set_index("id", sort=True)

print(df4_2.divisions)
for i in range(3):
    print(df4_2.partitions[i].compute())

(None, None, None, None)
    time  flux
id            
1   10.1   1.0
1   10.2   2.0
1   10.2   5.0
    time  flux
id            
2   11.1   3.0
2   11.2   1.0
3   11.3   2.0
    time  flux
id            
3   11.4   3.0
4   15.0   4.0
4   15.1   5.0




In [33]:
# By turning the index off, this allows for set_index to produce the expected divisions 

df4_2 = dd.read_parquet("divdata/*.parquet", index=False).set_index("id", sorted=True)

print(df4_2.divisions)
for i in range(3):
    print(df4_2.partitions[i].compute())

(1, 2, 3, 4)
    time  flux
id            
1   10.1   1.0
1   10.2   2.0
1   10.2   5.0
    time  flux
id            
2   11.1   3.0
2   11.2   1.0
    time  flux
id            
3   11.3   2.0
3   11.4   3.0
4   15.0   4.0
4   15.1   5.0


In [34]:
print(df4_2.loc[3].compute()) # Retrieve a row
print(df4_2.loc[2:4].compute()) #Retrieve a range of rows across partition boundaries

    time  flux
id            
3   11.3   2.0
3   11.4   3.0
    time  flux
id            
2   11.1   3.0
2   11.2   1.0
3   11.3   2.0
3   11.4   3.0
4   15.0   4.0
4   15.1   5.0
