The fractional split feature of `Splitter` returns an undesired result when splitting a `pandas` dataframe with duplicated indices without passing an `id_column` argument.
The following examples illustrate the bug.
Let's create a dataframe with duplicated indices:
import numpy as np
import pandas as pd

# Create separate dfs
df_1 = pd.DataFrame(np.random.normal(size=(5_000, )),
                    columns=["metric_val"])
df_1['frame'] = 1
df_2 = pd.DataFrame(np.random.normal(size=(5_000, )),
                    columns=["metric_val"])
df_2['frame'] = 2

# Concat and shuffle
dataframe = pd.concat([df_1, df_2]).sample(frac=1)
Now perform a fractional split on it:
from ambrosia.splitter import Splitter

# Create `Splitter` instance and make split based on dataframe index (no `id_column` provided)
splitter = Splitter()
factor = 0.5
result_1 = splitter.run(dataframe=dataframe,
                        method='hash',
                        part_of_table=factor,
                        salt='bug')
result_1.group.value_counts()

# Output:
# A    15000
# B    10000
# Name: group, dtype: int64
So some of the objects are duplicated by the split and now appear in the groups several times.
Together the groups contain 25,000 rows, while the original dataframe has only 10,000.
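To illustrate how rows can multiply when an index is not unique, here is a minimal pandas-only sketch. It does not reproduce the internals of `Splitter`, which are not shown in this report; the `groups` series and its labels are made up for the example. It only shows that an index-aligned join on duplicated labels is many-to-many.

```python
import pandas as pd

# Six rows whose index labels 0, 1, 2 each occur twice.
dataframe = pd.DataFrame({"frame": [1, 1, 1, 2, 2, 2]}, index=[0, 1, 2, 0, 1, 2])

# Hypothetical per-row group labels carried on the same non-unique index.
groups = pd.Series(["A", "B", "A", "A", "B", "A"], index=dataframe.index, name="group")

# Joining on the index matches every left row with every right row
# that has the same label, so each label yields 2 x 2 = 4 rows.
joined = dataframe.join(groups)

print(len(dataframe), len(joined))
```

If something similar happens inside the split, that would explain why the groups end up larger than the input.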
This behaviour does not occur if we split the dataframe on a column that holds the duplicated ids.
# Create column from dataframe indices and split on it
dataframe = dataframe.reset_index().rename(columns={'index': 'id_column'})
result_2 = splitter.run(dataframe=dataframe,
                        id_column='id_column',
                        method='hash',
                        part_of_table=factor,
                        salt='bug')
result_2.group.value_counts()

# Output:
# A    5000
# B    5000
# Name: group, dtype: int64
But if we look deeper, there is another unusual behaviour:
# Count how many objects in group A come from each original dataframe
result_2[result_2.group == 'A'].frame.value_counts()

# Output:
# A    2500
# B    2500
# Name: frame, dtype: int64
Objects from the two original dataframes appear in the group in equal numbers, which in general is not desired.
This should be inspected further.
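As a starting point for that inspection, here is a small pandas-only sketch of what happens when group assignment depends only on the id value. The parity rule below is a stand-in chosen for the example, not the hash used by `Splitter`. Because every id occurs once in each source dataframe, any split that is a function of the id alone places both copies in the same group, so the per-frame counts inside a group come out equal.

```python
import pandas as pd

# Two small frames sharing the same ids, mirroring the setup above.
df_1 = pd.DataFrame({"id_column": range(6), "frame": 1})
df_2 = pd.DataFrame({"id_column": range(6), "frame": 2})
dataframe = pd.concat([df_1, df_2], ignore_index=True)

# Stand-in for a hash split: the group depends only on the id value.
dataframe["group"] = dataframe["id_column"].map(lambda i: "A" if i % 2 == 0 else "B")

# Rows per (group, frame) pair.
counts = pd.crosstab(dataframe["group"], dataframe["frame"])
print(counts)
```

Under this assumption the equal counts follow directly from the duplicated ids rather than from a separate defect.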
The bug was not checked on the `Spark` implementation of the same methods, but care should be taken with it as well.
Finally, I want to add that duplicated values in the id column are undesirable in the vast majority of splitting tasks.
It would be nice to add a duplicated id check to `Splitter` and warn the user via the logger.
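A rough sketch of what such a check could look like, using only pandas and the standard `logging` module. The function name `count_duplicated_ids`, its signature, and the message text are hypothetical and not part of the existing `Splitter` API.

```python
import logging
from typing import Optional

import pandas as pd

logger = logging.getLogger(__name__)


def count_duplicated_ids(dataframe: pd.DataFrame, id_column: Optional[str] = None) -> int:
    """Hypothetical helper: count rows whose id repeats an earlier one and warn if any exist."""
    # Fall back to the dataframe index when no id column is given.
    ids = dataframe.index if id_column is None else dataframe[id_column]
    n_duplicated = int(ids.duplicated().sum())
    if n_duplicated > 0:
        source = "the dataframe index" if id_column is None else f"column '{id_column}'"
        logger.warning("Found %d duplicated ids in %s; split groups may be distorted.",
                       n_duplicated, source)
    return n_duplicated


example = pd.DataFrame({"metric_val": [0.1, 0.2, 0.3], "id_column": [7, 7, 8]},
                       index=[0, 1, 0])
print(count_duplicated_ids(example))               # checks the index
print(count_duplicated_ids(example, "id_column"))  # checks the id column
```

Calling such a helper at the start of the split would let the user decide whether to deduplicate or reset the index before proceeding.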