Fractional split bug on duplicated dataframes indices #10

xandaau · 2023-01-11T18:24:32Z

The fractional split feature of Splitter returns an undesired result when splitting a pandas dataframe with duplicated indices and no id_column argument is passed.

The following examples illustrate the bug.

Let's create a dataframe with duplicated indices:

import numpy as np
import pandas as pd

# Create separate dfs
df_1 = pd.DataFrame(np.random.normal(size=(5_000, )),
                   columns=["metric_val"])
df_1['frame'] = 1

df_2 = pd.DataFrame(np.random.normal(size=(5_000, )),
                   columns=["metric_val"])
df_2['frame'] = 2

# Concat and shuffle
dataframe = pd.concat([df_1, df_2]).sample(frac=1)
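As a quick sanity check that the index really is duplicated, something like the following can be run with pandas alone. It is a minimal sketch: the seeded generator and `random_state` are additions for reproducibility and are not part of the original report.

```python
import numpy as np
import pandas as pd

# Rebuild the same kind of dataframe as above (seeded here only for reproducibility).
rng = np.random.default_rng(0)
df_1 = pd.DataFrame({"metric_val": rng.normal(size=5_000), "frame": 1})
df_2 = pd.DataFrame({"metric_val": rng.normal(size=5_000), "frame": 2})
dataframe = pd.concat([df_1, df_2]).sample(frac=1, random_state=0)

# Each label 0..4999 occurs once per source dataframe, so the index is not unique.
index_is_unique = dataframe.index.is_unique
n_rows_sharing_label = int(dataframe.index.duplicated(keep=False).sum())
n_distinct_labels = dataframe.index.nunique()

print(index_is_unique, n_rows_sharing_label, n_distinct_labels)
# False 10000 5000
```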

Now perform a fractional split on it:

from ambrosia.splitter import Splitter

# Create `Splitter` instance and make split based on dataframe index (no `id_column` provided)
splitter = Splitter()
factor = 0.5

result_1 = splitter.run(dataframe=dataframe, 
                        method='hash', 
                        part_of_table=factor,
                        salt='bug')
result_1.group.value_counts()

# Output:
# A    15000
# B    10000
# Name: group, dtype: int64

So some objects are duplicated by the split and appear in the groups several times.
In total the groups contain 25,000 rows, while the original dataframe has only 10,000.

This behaviour does not occur if we split the dataframe on a column that contains the duplicated ids.

# Create column from dataframe indices and split on it

dataframe = dataframe.reset_index().rename(columns={'index': 'id_column'})

result_2 = splitter.run(dataframe=dataframe, 
                        id_column='id_column',
                        method='hash', 
                        part_of_table=factor,
                        salt='bug')

result_2.group.value_counts()

# Output:
# A    5000
# B    5000
# Name: group, dtype: int64

But if we look deeper, there is another unusual behaviour:

# Count how many objects in group A come from each original dataframe

result_2[result_2.group == 'A'].frame.value_counts()

# Output:
# 1    2500
# 2    2500
# Name: frame, dtype: int64

Objects from the two original dataframes appear in group A in exactly equal numbers, which in general is not desired.
This should be inspected further.
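One possible reading of the equal counts, sketched here under assumptions: if the group is a deterministic function of the salt and the id, every row sharing an id lands in the same group, so each original dataframe contributes the same number of rows to group A. `hash_group` below is a hypothetical stand-in written for illustration. It is not Ambrosia's hashing code, and it does not reproduce the exact 5000/5000 split.

```python
import hashlib

import pandas as pd


def hash_group(obj_id, salt="bug", share_a=0.5):
    """Hypothetical salted-hash assignment; NOT the Ambrosia implementation."""
    digest = hashlib.md5(f"{salt}{obj_id}".encode()).hexdigest()
    # Scale the 128-bit digest to [0, 1] and compare with the share of group A.
    return "A" if int(digest, 16) / 16**32 < share_a else "B"


# Two "frames" that reuse the same ids 0..4999, as in the report.
df = pd.DataFrame({
    "id_column": list(range(5_000)) * 2,
    "frame": [1] * 5_000 + [2] * 5_000,
})
df["group"] = df["id_column"].map(hash_group)

# Rows with the same id always get the same group ...
groups_per_id = df.groupby("id_column")["group"].nunique()
# ... so both frames contribute equally to group A.
counts_in_a = df[df["group"] == "A"]["frame"].value_counts()
print(counts_in_a)
```

If this is what happens inside the hash split, the equal counts are a direct consequence of the duplicated ids rather than a separate bug, but that remains to be confirmed against the actual code.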

The bug was not checked on the Spark implementation of the same methods, but care should be taken there as well.

Finally, I want to add that duplicated values in the id column are undesirable in the vast majority of splitting tasks.
It would be nice to add a duplicated-id check to Splitter and warn the user via the logger.
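A minimal sketch of what such a check could look like, assuming only pandas and the standard `logging` module. The function name `count_duplicated_ids` and the logger name are hypothetical and do not come from Ambrosia.

```python
import logging

import pandas as pd

logger = logging.getLogger("splitter_id_check")  # hypothetical logger name


def count_duplicated_ids(dataframe, id_column=None):
    """Return the number of rows whose id is shared with another row.

    Uses the dataframe index when `id_column` is None, and logs a warning
    if any duplicated ids are found.
    """
    ids = dataframe.index if id_column is None else dataframe[id_column]
    n_duplicated = int(ids.duplicated(keep=False).sum())
    if n_duplicated > 0:
        logger.warning(
            "%d of %d rows share an id with another row; split sizes may be wrong.",
            n_duplicated,
            len(dataframe),
        )
    return n_duplicated


# Small example: label 0 appears twice in the index.
example = pd.DataFrame({"metric_val": [0.1, 0.2, 0.3]}, index=[0, 1, 0])
n_dup_index = count_duplicated_ids(example)
n_dup_column = count_duplicated_ids(example.reset_index(), id_column="index")
n_dup_clean = count_duplicated_ids(example.reset_index(drop=True))
print(n_dup_index, n_dup_column, n_dup_clean)
# 2 2 0
```

Whether the check should only warn or raise an error is a design choice; a warning keeps existing pipelines running, while an error prevents silently wrong group sizes.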


xandaau · 2023-02-02T21:19:08Z

Splitter now requires all objects to have unique ids.
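On the caller's side, ids can be made unique before splitting. The sketch below shows two pandas-only options on a toy dataframe; the column names `row_in_frame` and `id_column` are illustrative, and how a given Ambrosia version reacts to duplicated ids is not shown here.

```python
import pandas as pd

df_1 = pd.DataFrame({"metric_val": [0.1, 0.2], "frame": 1})
df_2 = pd.DataFrame({"metric_val": [0.3, 0.4], "frame": 2})

# Option 1: drop the old labels and let pandas build a fresh 0..n-1 index.
unique_index_df = pd.concat([df_1, df_2], ignore_index=True)

# Option 2: keep the origin of each row in a composite id such as "1_0", "2_0".
combined = (
    pd.concat([df_1, df_2])
    .reset_index()
    .rename(columns={"index": "row_in_frame"})
)
combined["id_column"] = (
    combined["frame"].astype(str) + "_" + combined["row_in_frame"].astype(str)
)

print(unique_index_df.index.is_unique, combined["id_column"].is_unique)
# True True
```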

xandaau added the bug label Jan 11, 2023

xandaau mentioned this issue Feb 1, 2023

Fractional split duplication bug #24

Merged

xandaau self-assigned this Feb 10, 2023

xandaau closed this as completed Feb 15, 2023
