Fractional split bug on duplicated dataframes indices #10

Closed
xandaau opened this issue Jan 11, 2023 · 1 comment

xandaau commented Jan 11, 2023

The fractional split feature of Splitter returns an incorrect result when it splits a pandas dataframe with duplicated indices and no argument is passed for id_column.

The following examples illustrate the bug.

Let's create a dataframe with duplicated indices:

import numpy as np
import pandas as pd

# Create separate dfs
df_1 = pd.DataFrame(np.random.normal(size=(5_000, )),
                   columns=["metric_val"])
df_1['frame'] = 1

df_2 = pd.DataFrame(np.random.normal(size=(5_000, )),
                   columns=["metric_val"])
df_2['frame'] = 2

# Concat and shuffle
dataframe = pd.concat([df_1, df_2]).sample(frac=1)

Now perform a fractional split on it:

from ambrosia.splitter import Splitter

# Create `Splitter` instance and make split based on dataframe index (no `id_column` provided)
splitter = Splitter()
factor = 0.5

result_1 = splitter.run(dataframe=dataframe, 
                        method='hash', 
                        part_of_table=factor,
                        salt='bug')
result_1.group.value_counts()

# Output:
# A    15000
# B    10000
# Name: group, dtype: int64

So some objects are duplicated by the split and now appear in the groups several times.
Together the groups contain 25,000 rows, while the original dataframe has only 10,000.
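
The row inflation is consistent with how pandas handles label-based selection on a non-unique index: asking `.loc` for a label returns every row that carries it. The sketch below reproduces that effect with pandas alone. Whether Splitter selects rows this way internally is an assumption and has not been checked against its code.

```python
import pandas as pd

# Two small frames whose default indices (0, 1, 2) collide after concatenation.
df_1 = pd.DataFrame({"frame": [1, 1, 1]})
df_2 = pd.DataFrame({"frame": [2, 2, 2]})
dataframe = pd.concat([df_1, df_2])

# Request two index labels; each label matches one row from every frame.
requested_labels = [0, 1]
selected = dataframe.loc[requested_labels]

n_requested = len(requested_labels)
n_selected = len(selected)  # twice the number of requested labels
```

Two requested labels yield four rows here, which is the same kind of multiplication seen in the group sizes above.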


This behaviour does not occur if we split the dataframe on a column that holds the same duplicated ids.

# Create column from dataframe indices and split on it

dataframe = dataframe.reset_index().rename(columns={'index': 'id_column'})

result_2 = splitter.run(dataframe=dataframe, 
                        id_column='id_column',
                        method='hash', 
                        part_of_table=factor,
                        salt='bug')

result_2.group.value_counts()

# Output:
# A    5000
# B    5000
# Name: group, dtype: int64

But if we look deeper, there is another unusual behaviour:

# Count how many objects in group A come from each original dataframe

result_2[result_2.group == 'A'].frame.value_counts()

# Output:
# 1    2500
# 2    2500
# Name: frame, dtype: int64

Objects from the two original dataframes appear in group A in equal numbers, which is generally not desired.
This should be investigated further.
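
The equal counts follow from the data itself: after concatenation each id occurs once in each source dataframe, so any assignment that depends only on the id sends both copies to the same group. A minimal sketch, assuming an illustrative md5-based assignment (`assign_group` is a hypothetical helper, not ambrosia's actual hashing):

```python
import hashlib

import pandas as pd


def assign_group(object_id, salt: str = "bug") -> str:
    """Hypothetical id-level assignment: the same id and salt always give the same group."""
    digest = hashlib.md5(f"{object_id}{salt}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


# Ids 0..999 appear once in frame 1 and once in frame 2.
df_1 = pd.DataFrame({"id_column": range(1000), "frame": 1})
df_2 = pd.DataFrame({"id_column": range(1000), "frame": 2})
dataframe = pd.concat([df_1, df_2], ignore_index=True)

dataframe["group"] = dataframe["id_column"].map(assign_group)

# Rows per (group, frame): within each group the two frames always tie.
counts = dataframe.groupby(["group", "frame"]).size().unstack("frame")
```

Because both rows with a given id land in one group, each group draws the same number of rows from frame 1 and frame 2, whatever the hash function is.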


The bug has not been checked on the Spark implementation of the same methods, but it should be examined there as well.

Finally, duplicate ids in the id column are undesirable in the vast majority of splitting tasks.
It would be nice to add a duplicated-id check to Splitter and warn the user via the logger.
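
A check of this kind could look roughly like the sketch below. The function name `warn_on_duplicated_ids`, its signature, and the warning text are all hypothetical and only illustrate the idea; they are not part of ambrosia.

```python
import logging
from typing import Optional

import pandas as pd

logger = logging.getLogger(__name__)


def warn_on_duplicated_ids(dataframe: pd.DataFrame, id_column: Optional[str] = None) -> int:
    """Hypothetical helper: warn about repeated ids and return how many rows are affected."""
    # Fall back to the dataframe index when no id column is given.
    ids = dataframe.index if id_column is None else dataframe[id_column]
    # keep=False marks every row whose id occurs more than once.
    n_duplicated = int(ids.duplicated(keep=False).sum())
    if n_duplicated > 0:
        source = "index" if id_column is None else f"column '{id_column}'"
        logger.warning("%d rows share an id with another row in the %s", n_duplicated, source)
    return n_duplicated


# Concatenating two frames with default indices produces duplicated index labels.
dataframe = pd.concat([
    pd.DataFrame({"metric_val": [0.1, 0.2]}),
    pd.DataFrame({"metric_val": [0.3, 0.4]}),
])
n_affected = warn_on_duplicated_ids(dataframe)
```

Returning the count alongside the warning lets a caller decide whether to continue or raise an error.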

@xandaau xandaau added the bug Something isn't working label Jan 11, 2023

xandaau commented Feb 2, 2023

Splitter now requires all objects to have unique ids.
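
Under that requirement, a dataframe built by concatenation needs fresh ids before it is split. A minimal pandas-only sketch, assuming the rows really are distinct objects and the original index labels carry no meaning:

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(np.random.normal(size=(5,)), columns=["metric_val"])
df_1["frame"] = 1
df_2 = pd.DataFrame(np.random.normal(size=(5,)), columns=["metric_val"])
df_2["frame"] = 2

# ignore_index=True discards the colliding labels and numbers the rows 0..n-1.
dataframe = pd.concat([df_1, df_2], ignore_index=True).sample(frac=1)

# An already concatenated frame can be repaired with reset_index(drop=True).
repaired = pd.concat([df_1, df_2]).reset_index(drop=True)
```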

@xandaau xandaau self-assigned this Feb 10, 2023
@xandaau xandaau closed this as completed Feb 15, 2023