S3 dataset for associationsByDatatypeIndirect does not correspond with FTP file #2661

Closed
d0choa opened this issue Jul 7, 2022 · 3 comments
Labels
Backend Relates to Open Targets backend team Bug Something isn't working Data Relates to Open Targets data team Platform Issues related to Open Targets Platform

Comments

@d0choa
Contributor

d0choa commented Jul 7, 2022

AWS pulls our data from FTP to serve it as a public dataset at https://registry.opendata.aws/opentargets/

A user on the community site has reported that, for 22.06, the data in

s3://aws-roda-hcls-datalake/opentargets_latest/associationbydatatypeindirect/

does not correspond with the data in:

ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.06/output/etl/parquet/associationByDatatypeIndirect

The data on our FTP seems to have the right format, but the data on AWS does not. Although this is beyond our codebase, it would be good to resolve it so we can ensure the data in S3 is correct.

I believe all code to pull our data should be publicly available here:
https://github.com/aws-samples/data-lake-as-code/
From what I can see, there are several branches that might be relevant.

@d0choa
Contributor Author

d0choa commented Jul 7, 2022

@DSuveges

DSuveges commented Jul 8, 2022

That's really interesting. I think they are making a mistake here:

"aws s3 sync opentargets/sourceExports/open-targets-data-releases/latest/ s3://{{openTargetsSourceFileTargetBucketLocation}}/opentargets/sourceExports/latest/"

The AWS documentation says that sync doesn't delete files by default; it only copies new files and overwrites existing ones. Because the partitioned datasets have a unique hash in the file names, files from multiple releases are copied side by side without overwriting each other. So the user's observation was correct: there might be multiple entries with the same disease id in the disease dataset, coming from multiple releases. Unless they do something about it, from September there will be a fifth identical identifier.
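The accumulation can be reproduced locally. This is a minimal sketch that uses temporary directories in place of the FTP release folders and the S3 bucket. The `additive_sync` helper and the file names are hypothetical; the helper only mimics a copy-without-delete sync and is not the AWS CLI.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def additive_sync(source: Path, destination: Path) -> None:
    """Copy every file from source into destination and never delete anything.

    Hypothetical stand-in for a sync that runs without a delete option.
    """
    destination.mkdir(parents=True, exist_ok=True)
    for path in source.iterdir():
        if path.is_file():
            shutil.copy2(path, destination / path.name)


def write_release(folder: Path, file_names: List[str]) -> None:
    """Create placeholder files that stand in for one release's part files."""
    folder.mkdir(parents=True, exist_ok=True)
    for name in file_names:
        (folder / name).write_text("placeholder")


def simulate_additive() -> List[str]:
    """Sync two releases into the same destination and list what is left there."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        bucket = root / "bucket"
        # Hypothetical names: same partition index, different per-release hash.
        write_release(root / "release_a", ["part-00000-aaaa.snappy.parquet"])
        write_release(root / "release_b", ["part-00000-bbbb.snappy.parquet"])
        additive_sync(root / "release_a", bucket)
        additive_sync(root / "release_b", bucket)
        return sorted(p.name for p in bucket.iterdir())


if __name__ == "__main__":
    print(simulate_additive())
```

Both part files survive in the destination, so anything that reads the whole folder as one dataset sees the rows of both releases.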

Demonstrating the mistake:

-rw-r--r--  1 dsuveges  EBI\Domain Users   326K 14 Dec  2021 part-00199-ad8db45e-239a-4036-88a1-012033909e5a-c000.snappy.parquet
-rw-r--r--  1 dsuveges  EBI\Domain Users   335K  6 Jul 08:20 part-00199-bd041124-4c12-491a-80e1-2c06a071d275-c000.snappy.parquet

This is the same partition from December and from July. When reading the dataset, the data therefore appears to be duplicated. I am not sure if anyone reads the issues on the AWS repo, but I could open a ticket there. Nah, they have 4k repositories, and unanswered issues from 2020.
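A quick way to spot this in a listing is to group the part files by partition index and check whether more than one hash shows up for the same index. The sketch below assumes the file names follow the `part-<index>-<hash>-c<n>.snappy.parquet` shape seen in the two files above; the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed naming shape, inferred from the two file names in the listing above.
PART_FILE = re.compile(r"^part-(\d+)-([0-9a-f-]{36})-c\d+\.snappy\.parquet$")


def hashes_per_partition(file_names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each partition index to the sorted hashes found for it."""
    found = defaultdict(set)
    for name in file_names:
        match = PART_FILE.match(name)
        if match:
            index, file_hash = match.groups()
            found[index].add(file_hash)
    return {index: sorted(hashes) for index, hashes in found.items()}


def mixed_partitions(file_names: Iterable[str]) -> Dict[str, List[str]]:
    """Keep only the partitions that carry files from more than one write."""
    grouped = hashes_per_partition(file_names)
    return {index: hashes for index, hashes in grouped.items() if len(hashes) > 1}


listing = [
    "part-00199-ad8db45e-239a-4036-88a1-012033909e5a-c000.snappy.parquet",
    "part-00199-bd041124-4c12-491a-80e1-2c06a071d275-c000.snappy.parquet",
]

if __name__ == "__main__":
    print(mixed_partitions(listing))
```

A non-empty result means the folder mixes files from more than one write, which is what the December and July files show for partition 00199.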

@mbdebian
Contributor

I've reviewed their repo and, as far as I can see, they start with a clean local folder (as you can see here), but when syncing the download to the Open Targets source bucket they always write to the same destination. As pointed out by @DSuveges, the data is therefore aggregated rather than mirrored, because no --delete parameter is used in the synchronization.
I have submitted a PR, aws-samples/data-lake-as-code#25, explaining the issue and implementing the operation as a mirroring process instead of an aggregation.
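To illustrate what mirroring means here, the sketch below removes destination files that are missing from the source before copying, which is the effect the `--delete` option has for `aws s3 sync`. It is a local illustration only, not the code from the PR; `mirror_sync` and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def mirror_sync(source: Path, destination: Path) -> None:
    """Make destination hold exactly the files found in source.

    Hypothetical stand-in for a sync that also deletes extraneous files.
    """
    destination.mkdir(parents=True, exist_ok=True)
    source_names = {p.name for p in source.iterdir() if p.is_file()}
    # Drop files left over from earlier runs that the source no longer has.
    for existing in destination.iterdir():
        if existing.is_file() and existing.name not in source_names:
            existing.unlink()
    for name in source_names:
        shutil.copy2(source / name, destination / name)


def write_release(folder: Path, file_names: List[str]) -> None:
    """Create placeholder files that stand in for one release's part files."""
    folder.mkdir(parents=True, exist_ok=True)
    for name in file_names:
        (folder / name).write_text("placeholder")


def simulate_mirror() -> List[str]:
    """Mirror two releases in turn and list what the destination ends up with."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        bucket = root / "bucket"
        # Hypothetical names: same partition index, different per-release hash.
        write_release(root / "release_a", ["part-00000-aaaa.snappy.parquet"])
        write_release(root / "release_b", ["part-00000-bbbb.snappy.parquet"])
        mirror_sync(root / "release_a", bucket)
        mirror_sync(root / "release_b", bucket)
        return sorted(p.name for p in bucket.iterdir())


if __name__ == "__main__":
    print(simulate_mirror())
```

After the second run only the newer release's file remains, so the destination matches the latest source instead of accumulating older part files.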
