DataFrame schema comparison fails with assertSmallDataFrameEquality method, even though the schemas are the same #63

Open
prkorp opened this issue Feb 7, 2019 · 1 comment

Comments

prkorp commented Feb 7, 2019

I am trying to write some test cases to validate the data between a .parquet file in S3 and a target Hive table. I have loaded the .parquet data into one DataFrame and the Hive table data into another. When I compare the schemas of the two DataFrames using assertSmallDataFrameEquality, the comparison fails even though the schemas are the same. I'm not sure why it is failing. Any suggestions would be helpful.

Owner

MrPowers commented Feb 7, 2019

Thanks for opening this issue @prkorp.

The spark-fast-tests library defines the assertSmallDataFrameEquality method, which checks whether both the schema and the data in two DataFrames are equal. In your case, the schemas might be the same, but the data might be different.

This project, spark-daria, contains a validateSchema method that's defined here to make sure two schemas are the same.

If you only want to confirm schema equality, then validateSchema will probably be more useful. You can always print out the schemas of both DataFrames with the printSchema method and manually compare the differences. Hopefully this helps!
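As a rough illustration of the manual comparison, the sketch below lists field-level differences between two schemas using only Spark's `StructType` and `StructField` types. The `SchemaDiff` object and its `diff` method are hypothetical helpers written for this example; they are not part of spark-fast-tests or spark-daria. The sketch assumes top-level column names are unique and compares names, data types, nullability, and column order.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not part of spark-fast-tests or spark-daria).
object SchemaDiff {
  // Returns one message per difference; an empty result means the schemas match.
  def diff(actual: StructType, expected: StructType): Seq[String] = {
    val actualByName = actual.fields.map(f => f.name -> f).toMap
    val expectedByName = expected.fields.map(f => f.name -> f).toMap

    // Columns present on one side only.
    val missing = expected.fieldNames.toSeq
      .filterNot(actualByName.contains)
      .map(n => s"missing column: $n")
    val extra = actual.fieldNames.toSeq
      .filterNot(expectedByName.contains)
      .map(n => s"unexpected column: $n")

    // Columns present on both sides whose type or nullability differs.
    val changed = expected.fields.toSeq.flatMap { e =>
      actualByName.get(e.name).toSeq.flatMap { a =>
        val typeDiff =
          if (a.dataType != e.dataType)
            Seq(s"${e.name}: type ${a.dataType.simpleString} vs ${e.dataType.simpleString}")
          else Nil
        val nullDiff =
          if (a.nullable != e.nullable)
            Seq(s"${e.name}: nullable ${a.nullable} vs ${e.nullable}")
          else Nil
        typeDiff ++ nullDiff
      }
    }

    // Same set of columns, but in a different order.
    val order =
      if (missing.isEmpty && extra.isEmpty &&
          actual.fieldNames.toSeq != expected.fieldNames.toSeq)
        Seq("column order differs")
      else Nil

    missing ++ extra ++ changed ++ order
  }
}
```

With two DataFrames in scope (here called `parquetDF` and `hiveDF`, names assumed for the example), `SchemaDiff.diff(parquetDF.schema, hiveDF.schema).foreach(println)` would print each mismatch. If nothing is printed, the schemas match at this level and the failure is more likely caused by the data.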
