assertDataFrameEquals() Failure due to Different Row Order #220

Closed
toddleo opened this issue Jan 3, 2018 · 16 comments

Comments

@toddleo

toddleo commented Jan 3, 2018

Hi,

After some searching, I'm not sure whether this issue has been raised before.

It's common to have two DataFrames that contain the same rows in a different order. Because of that, my test case sometimes fails and occasionally succeeds. Is there a better way to compare DataFrames?

@spova

spova commented Feb 7, 2018

Hi,
Shouldn't you sort them before comparison? Like this:

assertDataFrameEquals(expected.orderBy(col("ID").asc), output.orderBy(col("ID").asc))

@toddleo
Author

toddleo commented Feb 8, 2018

Hi @spova ,

Imagine I have more than one column in my DataFrame, say 5 (ID, name, gender, etc.). You can indeed sort by ID, but rows that share an ID can still come back in any order, so the DataFrames are still not comparable. You could explicitly sort by every column at once, but that is non-trivial and inelegant code to write.
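For reference, sorting by every column does not have to mean listing each one by hand. The following is only a minimal sketch, not the fix from the commits referenced in this thread: `SortAllColumns` and `sortedByAllColumns` are hypothetical names, and it assumes every column has a sortable type (map columns, for example, cannot be ordered).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object SortAllColumns {
  // Hypothetical helper: order rows by every column, left to right,
  // so that two frames holding the same rows end up in the same order.
  // Names are backtick-quoted so a dot in a column name is not read
  // as a nested-field access.
  def sortedByAllColumns(df: DataFrame): DataFrame =
    df.orderBy(df.columns.map(name => col(s"`$name`")): _*)
}
```

With that in place, the comparison suggested above could be written as `assertDataFrameEquals(SortAllColumns.sortedByAllColumns(expected), SortAllColumns.sortedByAllColumns(output))`, provided both frames list their columns in the same order.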

smadarasmi pushed a commit to smadarasmi/spark-testing-base that referenced this issue Feb 25, 2018
smadarasmi pushed a commit to smadarasmi/spark-testing-base that referenced this issue Feb 25, 2018
smadarasmi pushed a commit to smadarasmi/spark-testing-base that referenced this issue Oct 21, 2018
@noleto

noleto commented Nov 28, 2018

Hi folks,
I'm experiencing the same problem when comparing DataFrames with different numbers of partitions, i.e. the row order isn't strictly the same but all rows are present. I took a look at @smadarasmi's fix, and it seems like good behaviour for testing purposes (sort by all columns before comparing the DataFrames).

@smadarasmi any plans to PR your fix into the main repo?
Thanks,

@zbstof

zbstof commented Feb 1, 2019

Hello. Is this going to be implemented?
This should be built-in functionality, imho, akin to containsInAnyOrder from Java's Hamcrest library
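As an illustration of what a `containsInAnyOrder`-style check could look like for DataFrames, here is a rough sketch that compares the two frames as multisets of rows instead of sorting them. It is not the library's API: `AnyOrder`, `sameRowsInAnyOrder`, and the `__row_count` column name are made up for this example, and it assumes both frames have the same columns in the same order, that every column type can be grouped on, and that no input column is already named `__row_count`. It sticks to `groupBy` and `except`, which should also be available on Spark 2.2.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit}

object AnyOrder {
  // Collapse duplicate rows into one row plus its multiplicity.
  // "__row_count" is a hypothetical name chosen for this sketch.
  private def withRowCounts(df: DataFrame): DataFrame =
    df.groupBy(df.columns.map(name => col(s"`$name`")): _*)
      .agg(count(lit(1)).as("__row_count"))

  // True when both frames hold the same rows the same number of times,
  // regardless of row order or partitioning.
  def sameRowsInAnyOrder(left: DataFrame, right: DataFrame): Boolean = {
    if (!left.columns.sameElements(right.columns)) {
      false
    } else {
      val l = withRowCounts(left)
      val r = withRowCounts(right)
      // except is a set difference; the counted rows are already distinct,
      // so empty differences in both directions mean equal multisets.
      l.except(r).count() == 0L && r.except(l).count() == 0L
    }
  }
}
```

In a test this could be used as `assert(AnyOrder.sameRowsInAnyOrder(expected, output))`. Unlike a sort-then-compare approach, a failure here only says that the frames differ, not which rows differ.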

@smadarasmi
Contributor

smadarasmi commented Feb 3, 2019

@noleto I don't have write access to this repo.

@nsutcliffe

+1 for this

@nsutcliffe

@smadarasmi you'll need to resolve the build errors on your PR #228

smadarasmi pushed a commit to smadarasmi/spark-testing-base that referenced this issue Feb 19, 2019
smadarasmi pushed a commit to smadarasmi/spark-testing-base that referenced this issue Feb 19, 2019
@smadarasmi
Contributor

@nsutcliffe The build fails on this step: Initializing download: http://www-us.apache.org/dist/spark/spark-2.2.2/spark-2.2.2-bin-hadoop2.7.tgz

It returns 404, not found.

Don't think it is related to my change.

@holdenk
Owner

holdenk commented Feb 19, 2019

Oh yeah, that's a good point. Spark changed its release packaging, so the older versions are no longer available for download from the normal mirrors, and I haven't had a chance to update the travis file to point to the new version yet.

@nsutcliffe

@smadarasmi in the meantime you could check that the build is fine by updating .travis.yml: find the link to spark-2.2.2 (line 25) and replace it with:
http://archive.apache.org/dist/spark/spark-2.2.2/spark-2.2.2-bin-hadoop2.7.tgz
@holdenk would it be ok to merge like that?

@holdenk
Owner

holdenk commented Feb 19, 2019

If you wouldn't mind updating the travis file in your PR, CI would be able to run and we could merge as normal. Otherwise I can do a quick PR to do that.

@nsutcliffe

Do you want it on 2.2.3 or 2.2.2?

@holdenk
Owner

holdenk commented Feb 19, 2019

Let's do 2.2.3; it should remain on the mirrors for longer.

@pablogomez93

pablogomez93 commented Apr 15, 2019

Hi :D! Is there any news about this proposal? I think it's a huge feature; it adds a lot of value to the library for a very tiny change.

@smadarasmi
Contributor

@holdenk Does it look okay for merging?

@minnieshi

Can this also work for Spark 2.2?
Currently I can't really compare two DataFrames for equality because their rows are in a different order.

holdenk closed this as completed in d72d84f Mar 5, 2020