assertDataFrameEquals() Failure due to Different Row Order #220

toddleo · 2018-01-03T08:07:55Z

Hi,

I searched but couldn't tell whether this issue has been raised before.

It's common for two DataFrames to contain the same rows in a different order. As a result, my test case sometimes fails and occasionally succeeds. Is there a better way to compare DataFrames?

spova · 2018-02-07T13:04:14Z

Hi,
Shouldn't you sort them before comparison? Like this:

assertDataFrameEquals(expected.orderBy(col("ID").asc), output.orderBy(col("ID").asc))

toddleo · 2018-02-08T02:29:47Z

Hi @spova ,

Imagine I have more than one column in my DataFrame, let's say 5 (ID, name, gender, etc.). You can indeed sort by ID, but the remaining columns stay unsorted, so rows that share an ID can still come back in a different order and the DataFrames are not comparable. You could explicitly sort by every column at once, but that is non-trivial and inelegant code to write.
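For reference, sorting on every column does not have to be spelled out column by column. The following is a minimal sketch, assuming Spark's Scala DataFrame API. The object and method names (DataFrameOrdering, sortByAllColumns) are hypothetical and not part of spark-testing-base. It also assumes every column has an orderable type (map-typed columns cannot be sorted) and that column names contain no dots.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of spark-testing-base.
object DataFrameOrdering {
  // Sort by every column, left to right, so that two DataFrames holding
  // the same rows end up in the same row order.
  // Assumes all column types are orderable and names contain no dots.
  def sortByAllColumns(df: DataFrame): DataFrame =
    df.orderBy(df.columns.map(col): _*)
}
```

With such a helper, the earlier suggestion could be written as assertDataFrameEquals(DataFrameOrdering.sortByAllColumns(expected), DataFrameOrdering.sortByAllColumns(output)), provided both DataFrames list the same columns in the same order.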


noleto · 2018-11-28T11:46:29Z

Hi folks,
I'm experiencing the same problem when comparing DataFrames with different numbers of partitions, i.e. the row order isn't strictly the same but all rows are present. I took a look at @smadarasmi's fix and it seems like good behaviour for testing purposes (sort by all columns before comparing DataFrames).

@smadarasmi any plans to PR your fix into the main repo?
Thanks,

zbstof · 2019-02-01T00:37:54Z

Hello. Is this going to be implemented?
This should be built-in functionality, imho, akin to containsInAnyOrder from Java's Hamcrest library
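To make the "any order" semantics concrete: the comparison amounts to treating both sides as multisets, where order is ignored but duplicates still count. Below is a minimal sketch in plain Scala, operating on rows that have already been collected into memory. The names AnyOrder, rowCounts and sameRowsInAnyOrder are hypothetical, and this is not how spark-testing-base or Hamcrest implement it.

```scala
// Hypothetical illustration, not taken from any library.
object AnyOrder {
  // Count how many times each row occurs, so duplicates are not collapsed.
  def rowCounts[A](rows: Seq[A]): Map[A, Int] =
    rows.groupBy(identity).map { case (row, occurrences) => row -> occurrences.size }

  // True when both sequences hold the same rows the same number of times,
  // regardless of order.
  def sameRowsInAnyOrder[A](expected: Seq[A], actual: Seq[A]): Boolean =
    rowCounts(expected) == rowCounts(actual)
}
```

This assumes the rows fit in driver memory and have a meaningful equals and hashCode, as tuples and case classes do.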

smadarasmi · 2019-02-03T11:27:11Z

@noleto I don't have write access to this repo.

nsutcliffe · 2019-02-19T09:53:44Z

+1 for this

nsutcliffe · 2019-02-19T09:55:42Z

@smadarasmi you'll need to resolve the build errors on your PR #228


smadarasmi · 2019-02-19T10:26:34Z

@nsutcliffe The build fails on this step: Initializing download: http://www-us.apache.org/dist/spark/spark-2.2.2/spark-2.2.2-bin-hadoop2.7.tgz

It returns 404, not found.

Don't think it is related to my change.

holdenk · 2019-02-19T10:29:23Z

Oh yeah, that's a good point. Spark changed its release packaging, so the older versions are no longer available for download from the normal mirrors, and I haven't had a chance to update the Travis file to point to the new version yet.

nsutcliffe · 2019-02-19T10:35:41Z

@smadarasmi in the meantime, you could check that the build is fine by updating .travis.yml: find the link to spark-2.2.2 (line 25) and replace it with:
http://archive.apache.org/dist/spark/spark-2.2.2/spark-2.2.2-bin-hadoop2.7.tgz
@holdenk would it be ok to merge like that?

holdenk · 2019-02-19T10:45:04Z

If you wouldn't mind updating the Travis file in your PR, CI would be able to run and the PR could be merged as normal. Otherwise I can do a quick PR to do that.

nsutcliffe · 2019-02-19T10:46:20Z

Do you want it on 2.2.3 or 2.2.2?

holdenk · 2019-02-19T14:12:07Z

Let's do 2.2.3; it should remain on the mirrors for longer.

pablogomez93 · 2019-04-15T20:13:31Z

Hi :D! Is there any news about this proposal? I think it's a huge feature; it adds a lot of value to the library for a very small change.

smadarasmi · 2019-04-16T10:04:22Z

@holdenk Does it look okay for merging?

minnieshi · 2019-07-11T07:47:09Z

Can this also work for Spark 2.2?
Currently I can't really compare two DataFrames for equality, as their rows are in a different order.

smadarasmi pushed a commit to smadarasmi/spark-testing-base that referenced this issue Feb 25, 2018

Fixes holdenk#220 : Add dataframe comparison without order

a7cd8d0

smadarasmi pushed a commit to smadarasmi/spark-testing-base that referenced this issue Feb 25, 2018

Fixes holdenk#220 : Add dataframe comparison without order

154ba1e

smadarasmi pushed a commit to smadarasmi/spark-testing-base that referenced this issue Oct 21, 2018

Fixes holdenk#220: modify method assertDataFrameNoOrderEquals to handle duplicates in dataframe

9b0f24c

smadarasmi pushed a commit to smadarasmi/spark-testing-base that referenced this issue Feb 19, 2019

Fixes holdenk#220 : Add dataframe comparison without order

b4c92e5

smadarasmi pushed a commit to smadarasmi/spark-testing-base that referenced this issue Feb 19, 2019

Fixes holdenk#220: modify method assertDataFrameNoOrderEquals to handle duplicates in dataframe

44e3b4d

holdenk closed this as completed in d72d84f Mar 5, 2020
