cleanCovid19Data not handling Province/State names properly #43

Open
vinceecws opened this issue Aug 20, 2021 · 0 comments
Labels
bug Something isn't working

Comments

@vinceecws
Collaborator

tools.Tools.cleanCovid19Data() sets the majority of values in the Province/State column of covid_19_data.csv to NULL.

Refer to the code snippet:

val dfP = covid_df.withColumn("region",
  when(col("region").rlike("Diamond Princess"), "Diamond Princess")
    .when(col("region").rlike("Grand Princess"), "Grand Princess")
    .when(col("region") === col("region") || col("region") === "None" || col("region").rlike("Unknown"), null)
    .otherwise(trim(col("region"))))

On line 148, the condition in .when(col("region")===col("region") ... , null) compares the column with itself, so it evaluates to true for every non-null value. As a result, every value that is not already matched by an earlier when branch is set to NULL, and the .otherwise(trim(...)) branch never produces a non-null result. The only values matched earlier are those containing "Diamond Princess" or "Grand Princess".
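To make the self-comparison behavior concrete, here is a minimal sketch that evaluates the same predicate on a tiny made-up DataFrame. The object name, the helper `flagSelfMatches`, the sample values, and the local SparkSession settings are all assumptions for illustration and are not part of tools.Tools.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical demo object, not part of the project.
object SelfComparisonSketch {
  // Adds a column holding the result of comparing "region" with itself,
  // i.e. the first term of the condition on line 148.
  def flagSelfMatches(df: DataFrame): DataFrame =
    df.withColumn("matches", col("region") === col("region"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("self-comparison-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up values standing in for the Province/State column.
    val sample = Seq(Some("Hubei"), Some("Ontario"), None).toDF("region")

    flagSelfMatches(sample).show()

    spark.stop()
  }
}
```

Under SQL equality semantics, the `matches` column should be true for the two non-null rows and NULL for the null row, which is why the third when branch captures every remaining non-null value.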

I also confirmed this visually by inspecting the output of a .show() call.
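One possible fix, sketched below, simply drops the self-comparison so that only the placeholder values "None" and "Unknown" become NULL. This assumes the self-comparison was unintended; if it was meant to compare against a different column, that column should be used instead. The object name `RegionCleanupSketch`, the helper `cleanRegion`, and the sample rows are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, trim, when}

// Hypothetical sketch, not the project's actual cleanCovid19Data().
object RegionCleanupSketch {
  def cleanRegion(df: DataFrame): DataFrame =
    df.withColumn(
      "region",
      when(col("region").rlike("Diamond Princess"), "Diamond Princess")
        .when(col("region").rlike("Grand Princess"), "Grand Princess")
        // Assumption: only placeholder values should be nulled out.
        .when(col("region") === "None" || col("region").rlike("Unknown"),
          lit(null).cast("string"))
        // Everything else keeps its value, with surrounding whitespace removed.
        .otherwise(trim(col("region")))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("region-cleanup-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows; the real data comes from covid_19_data.csv.
    val sample = Seq(" Hubei ", "None", "Unknown", "Grand Princess Cruise Ship")
      .toDF("region")

    cleanRegion(sample).show(false)

    spark.stop()
  }
}
```

With this version, ordinary names such as " Hubei " reach the otherwise branch and are trimmed rather than nulled.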

@vinceecws vinceecws added the bug Something isn't working label Aug 20, 2021

2 participants