
connect_hub() behaves differently for invalid model-output files in parquet format (vs .csv files) #33

Closed
bsweger opened this issue May 15, 2024 · 3 comments

bsweger (Contributor) commented May 15, 2024

I came across this discrepancy when testing S3-based FluSight parquet model-output files and was able to reproduce the behavior on my local machine.

The discrepancy involves how hubData handles connections to model-output data when there's a model-output file with invalid data.

Not sure whether:

  • the behavior is expected, or
  • it's a hubData issue (i.e., the intent is to extend the graceful error handling to non-csv files), or
  • we should attempt to correct these dates somewhere upstream.

Also, the tasks.json for this hub specifies reference_date and target_end_date in YYYY-MM-DD format, so I'm not sure how much work we should do to accommodate this example of an invalid file.

Happy to move the issue elsewhere if necessary!


Background
The FluSight hub contains one model-output file with invalid date formats (reference_date and target_end_date are 3/2/2024 instead of 2024-03-02).
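
For reference, here's a minimal sketch (reusing the offending file's path from the warning below) that reads every column as character, so parsing can't fail, and greps for dates that don't match the YYYY-MM-DD format tasks.json requires:

# read the invalid file with all-character columns and flag malformed dates
library(readr)
f <- '/Users/rsweger/code/FluSight-forecast-hub/model-output/FluSight-baseline_cat/2024-03-02-FluSight-baseline_cat.csv'
df <- read_csv(f, col_types = cols(.default = col_character()))
bad <- df[!grepl('^\\d{4}-\\d{2}-\\d{2}$', df$reference_date), ]
unique(bad[, c('reference_date', 'target_end_date')])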

.csv example
hubData handles the situation very gracefully when used against the .csv model-output files on my local machine: it warns about the invalid file, keeps going, and still produces a tibble.

> local_hub_path <- '/Users/rsweger/code/FluSight-forecast-hub/'
> local_hub_con <- connect_hub(local_hub_path)
Warning message:
!  The following potentially invalid model output file not opened successfully.
/Users/rsweger/code/FluSight-forecast-hub/model-output/FluSight-baseline_cat/2024-03-02-FluSight-baseline_cat.csv 
> local_hub_con

── <hub_connection/FileSystemDataset> ──

• hub_name: "US CDC FluSight"
• hub_path: /Users/rsweger/code/FluSight-forecast-hub/
• file_format: "csv(1127/1128)"
• file_system: "LocalFileSystem"
• model_output_dir: "/Users/rsweger/code/FluSight-forecast-hub/model-output"
• config_admin: hub-config/admin.json
• config_tasks: hub-config/tasks.json

── Connection schema 
hub_connection with 1127 csv files
reference_date: date32[day]
target: string
horizon: int32
target_end_date: date32[day]
location: string
output_type: string
output_type_id: string
value: double
model_id: string
> local_hub_con %>%
+   collect()
# A tibble: 5,811,548 × 9
   reference_date target          horizon target_end_date location output_type output_type_id value model_id             
   <date>         <chr>             <int> <date>          <chr>    <chr>       <chr>          <dbl> <chr>                
 1 2023-10-14     wk inc flu hosp      -1 2023-10-07      06       quantile    0.01            40.0 CADPH-FluCAT_Ensemble
 2 2023-10-14     wk inc flu hosp      -1 2023-10-07      06       quantile    0.025           40.9 CADPH-FluCAT_Ensemble
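
As a sanity check, the skipped file can be identified by diffing the files arrow opened against the csv files on disk (a sketch poking at the underlying FileSystemDataset, not a documented hubData API):

# compare arrow's opened files with what's actually in model-output
opened  <- local_hub_con$files
on_disk <- list.files(file.path(local_hub_path, 'model-output'),
                      pattern = '\\.csv$', recursive = TRUE, full.names = TRUE)
setdiff(normalizePath(on_disk), normalizePath(opened))
# expected: the 2024-03-02-FluSight-baseline_cat.csv file from the warning above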

parquet example
When using hubData to connect to the model-output files we converted to parquet format in the cloud, there's no warning on connect, but trying to collect the tibble produces an error.

> cloud_hub_path <- s3_bucket('bsweger-flusight-forecast/')
> cloud_hub_con <- connect_hub(cloud_hub_path)
> cloud_hub_con

── <hub_connection/FileSystemDataset> ──

• hub_name: "US CDC FluSight"
• hub_path: bsweger-flusight-forecast/
• file_format: "parquet(1128/1128)"
• file_system: "S3FileSystem"
• model_output_dir: "model-output/"
• config_admin: hub-config/admin.json
• config_tasks: hub-config/tasks.json

── Connection schema 
hub_connection with 1128 Parquet files
reference_date: date32[day]
target: string
horizon: int32
location: string
target_end_date: date32[day]
output_type: string
output_type_id: string
value: double
model_id: string
> cloud_hub_con %>%
+   collect()
Error in `compute.Dataset()`:
! Invalid: Failed to parse string: '3/2/2024' as a scalar of type date32[day]

try local connection using the same converted parquet files
To rule out a difference between connecting locally and connecting to S3, I copied the converted parquet files to my local machine and connected again. The result was the same error as with S3.

> transformed_path <- '/Users/rsweger/code/FluSight-forecast-hub/'
> transformed_con <- connect_hub(transformed_path)
> transformed_con

── <hub_connection/FileSystemDataset> ──

• hub_name: "US CDC FluSight"
• hub_path: /Users/rsweger/code/FluSight-forecast-hub/
• file_format: "parquet(1128/1128)"
• file_system: "LocalFileSystem"
• model_output_dir: "/Users/rsweger/code/FluSight-forecast-hub/transformed-model-output"
• config_admin: hub-config/admin.json
• config_tasks: hub-config/tasks.json

── Connection schema 
hub_connection with 1128 Parquet files
reference_date: date32[day]
target: string
horizon: int32
location: string
target_end_date: date32[day]
output_type: string
output_type_id: string
value: double
model_id: string
> transformed_con %>%
+   collect()
Error in `compute.Dataset()`:
! Invalid: Failed to parse string: '3/2/2024' as a scalar of type date32[day]
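
For what it's worth, the offending parquet file can be tracked down by inspecting each file's embedded schema; a sketch, assuming the converted files live under transformed-model-output as shown above:

# flag any parquet file whose reference_date column isn't stored as date32
library(arrow)
pq <- list.files(file.path(transformed_path, 'transformed-model-output'),
                 pattern = '\\.parquet$', recursive = TRUE, full.names = TRUE)
is_bad <- function(f) {
  field <- open_dataset(f)$schema$GetFieldByName('reference_date')
  field$type$ToString() != 'date32[day]'
}
Filter(is_bad, pq)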
annakrystalli (Contributor) commented:

Good spot!

Unfortunately, this seems to be something emanating from arrow itself: it handles errors in files differently, and at different stages, for different file formats.

With csvs, @LucieContamin had noticed that when a file had a parsing issue, arrow was just ignoring it completely when opening the dataset, with no notification whatsoever (hubverse-org/hubUtils#124), so we implemented a check to detect such a situation and issue a warning message.

With parquet files, arrow seems to still open the file (hence it's not being picked up by the check that works on csvs) but throws an error down the line when trying to collect.
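
Here's a minimal toy reproduction of that parquet behaviour (a sketch with made-up file contents; exact behaviour may vary across arrow versions):

# one file stores reference_date as date32, the other as a string
library(arrow)
library(dplyr)
d <- tempfile(); dir.create(d)
write_parquet(data.frame(reference_date = as.Date('2024-03-02'), value = 1),
              file.path(d, 'good.parquet'))
write_parquet(data.frame(reference_date = '3/2/2024', value = 2),
              file.path(d, 'bad.parquet'))
con <- open_dataset(d, schema = schema(reference_date = date32(),
                                       value = float64()))
con           # opens without warning and reports 2 parquet files
collect(con)  # errors: Failed to parse string: '3/2/2024' as ... date32[day]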

While it would definitely be nice to have consistent behaviour for users, I'm not sure how straightforward it will be to implement given the lower-level difference in behaviour; it will need some exploring. At least an error is being thrown at some point (whereas for csvs the data was just excluded without notice), but it's certainly not as helpful as the csv warning.

That's also why the validation stage is such an important part of the hubs. Because arrow can be a bit brittle and its error messaging is not super helpful yet, it's much more efficient to ensure data integrity on the way in than to try to do too much at runtime when accessing the data.
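
(For anyone landing here later: the upstream guard is hubValidations; something along these lines, assuming the current validate_submission() interface, would have flagged the bad dates at submission time.)

# hypothetical invocation: hub_path is the hub root, file_path is relative
# to the model-output directory
library(hubValidations)
validate_submission(
  hub_path  = '/path/to/FluSight-forecast-hub',
  file_path = 'FluSight-baseline_cat/2024-03-02-FluSight-baseline_cat.csv'
)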

LucieContamin commented:

I ran into that error with parquet files recently too, because sometimes we also have columns in an unexpected format; it happens with both hubData and arrow. I don't think there is an easy solution, but I might be wrong.

bsweger (Contributor, Author) commented May 17, 2024

Ah, I didn't realize that the reason hubData has that nice check for the .csvs is that arrow was silently excluding them!

@annakrystalli I think it's completely reasonable not to invest a lot of time doing downstream checks and workarounds to accommodate hub data that doesn't pass the upstream validations. Especially since, as you say, arrow throws an error in this case (i.e., users won't unknowingly get incorrect information).

If we were going to invest time, I'd argue that it would be better spent upstream, helping people submit correct data to begin with.

I'm happy to close this one as a "wontfix"

bsweger closed this as completed May 17, 2024